Google Open Buildings

Google's Open Buildings is an open access dataset containing the geometry of buildings across most of Africa, South Asia and South-East Asia. This version of the dataset is transformed to be partitioned by admin 1 boundaries and available in cloud-native geospatial formats (PMTiles, GeoParquet). Note that https://source.coop/repositories/vida/google-microsoft-osm-open-buildings/description is likely a better source for this data. There is no commitment to keep this dataset up to date with the latest releases. The code used to do the processing can be found at https://github.com/opengeos/open-buildings

Product Details

Visibility: Public
Owner: Chris Holmes
Created: 26 Jun 2024
Last Updated: 3 Apr 2025

Product Contents

root

fgb-s2

geoparquet-by-country

geoparquet-s2-more-columns

collection.json

google-open-buildings.pmtiles

README.md

README

Google Open Buildings (Cloud-Native Geo distribution)

This dataset is a copy of Google Open Buildings, offering the data in more GIS-friendly and cloud-native geospatial formats. The original dataset is distributed as gzipped CSV files and is available for download at sites.research.google.com/open-buildings/.

Most of the following information about the data is copied from that site; it is the canonical source data and remains the source of truth. This dataset just makes the data more accessible through more formats. The description of the original dataset is as follows:

The dataset contains 1.8 billion building detections, across an inference area of 58M km2 within Africa, South Asia, South-East Asia, Latin America and the Caribbean.

The current dataset is in its 3rd version (v3). V1 covered Africa, in v2 we expanded to South and South-East Asia and in the current version v3 detections from Latin America and the Caribbean are also included.

For each building in this dataset we include the polygon describing its footprint on the ground, a confidence score indicating how sure we are that this is a building, and a Plus Code corresponding to the centre of the building. There is no information about the type of building, its street address, or any details other than its geometry.

More explanation is in the FAQ.

Modifications to the original dataset.

A small percentage of the original data was actually multi-polygons, though the intent seemed to be to represent every building as its own feature. So while processing the dataset into more accessible formats, any multi-polygon encountered was split into polygons, and a new area_in_meters attribute was calculated for each. The confidence attribute was kept unchanged on every resulting polygon.
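The split described above can be sketched in plain Python on GeoJSON-style features. This is an illustration only, not the processing code that was actually used: the function names are hypothetical, and the area calculation (a shoelace formula over a local equirectangular projection with a mean Earth radius) is an assumption, since the method used to compute area_in_meters is not stated here.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius; an assumption of this sketch


def ring_area_m2(ring):
    """Approximate area in square metres of a closed (lon, lat) ring."""
    lat0 = math.radians(sum(lat for _, lat in ring) / len(ring))
    # Project to local metres around the ring's mean latitude.
    pts = [
        (EARTH_RADIUS_M * math.radians(lon) * math.cos(lat0),
         EARTH_RADIUS_M * math.radians(lat))
        for lon, lat in ring
    ]
    # Shoelace formula; relies on the ring being closed (first == last).
    twice_area = sum(
        x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(pts, pts[1:])
    )
    return abs(twice_area) / 2.0


def polygon_area_m2(rings):
    """Exterior ring area minus the area of any holes."""
    exterior, *holes = rings
    return ring_area_m2(exterior) - sum(ring_area_m2(h) for h in holes)


def split_multipolygon(feature):
    """Return one Polygon feature per part, each with its own area."""
    geom = feature["geometry"]
    if geom["type"] != "MultiPolygon":
        return [feature]
    parts = []
    for rings in geom["coordinates"]:
        props = dict(feature["properties"])  # confidence is copied unchanged
        props["area_in_meters"] = polygon_area_m2(rings)
        parts.append({
            "type": "Feature",
            "geometry": {"type": "Polygon", "coordinates": rings},
            "properties": props,
        })
    return parts
```

For building-sized polygons the local projection error is small, but a production pipeline would more likely use a geodesic area function from a GIS library.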

The latitude and longitude columns were removed from the original dataset, since those can easily be calculated from the polygons or the full_plus_code attribute.
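If you need those columns back, a planar centroid of the footprint's exterior ring is one way to recover them. The sketch below assumes the removed columns held the polygon centroid and that a planar centroid in lon/lat is close enough for building-sized shapes; the function name is hypothetical.

```python
def polygon_centroid(ring):
    """Planar centroid (lon, lat) of a closed exterior ring."""
    x0, y0 = ring[0]
    # Shift to the first vertex to limit floating-point cancellation.
    pts = [(x - x0, y - y0) for x, y in ring]
    twice_area = cx = cy = 0.0
    for (x1, y1), (x2, y2) in zip(pts, pts[1:]):
        cross = x1 * y2 - x2 * y1
        twice_area += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    if twice_area == 0.0:
        # Degenerate ring: fall back to the mean of its distinct vertices.
        n = max(len(ring) - 1, 1)
        return (sum(x for x, _ in ring[:n]) / n,
                sum(y for _, y in ring[:n]) / n)
    return x0 + cx / (3.0 * twice_area), y0 + cy / (3.0 * twice_area)
```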

In further processing, two new attributes were added: country_iso, the country each building is in, obtained by joining the Overture Maps country dataset, and quadkey, the level 12 quadkey the building falls in. The dataset was then partitioned by country_iso, and any country whose data exceeded ~2 gigabytes was split up further by quadkey.
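For readers unfamiliar with quadkeys, here is a minimal sketch of how a level 12 quadkey can be derived from a point, following the standard Web Mercator tile scheme. Which point of each building was used (for example its centroid) is not stated above, so treat the input point as an assumption; the function names are hypothetical.

```python
import math


def latlon_to_tile(lat, lon, zoom):
    """Web Mercator tile (x, y) containing a lat/lon point at a zoom level."""
    n = 1 << zoom
    lat = max(min(lat, 85.05112878), -85.05112878)  # Web Mercator latitude limit
    sin_lat = math.sin(math.radians(lat))
    x = (lon + 180.0) / 360.0
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)

    def clamp(v):
        return min(max(int(v * n), 0), n - 1)

    return clamp(x), clamp(y)


def tile_to_quadkey(x, y, zoom):
    """Interleave the bits of x and y into a base-4 string, one digit per level."""
    digits = []
    for i in range(zoom, 0, -1):
        mask = 1 << (i - 1)
        digit = (1 if x & mask else 0) + (2 if y & mask else 0)
        digits.append(str(digit))
    return "".join(digits)


def quadkey(lat, lon, zoom=12):
    """Quadkey of the tile containing the point; level 12 by default."""
    x, y = latlon_to_tile(lat, lon, zoom)
    return tile_to_quadkey(x, y, zoom)
```

A level 12 quadkey is always 12 characters long, and every building inside the same level 12 tile shares the same key, which is what makes it usable as a secondary split for large countries.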

Access the dataset

A PMTiles version of the dataset is available for visualization at google-open-buildings.pmtiles. You can see it at the PMTiles Viewer (the visualization is not great, working on a better one).

The geoparquet s2 folder contains versions of the data that reflect the original gzipped CSV files, each representing an S2 cell. These have only the first set of modifications (splitting multi-polygons and removing the latitude and longitude columns), plus the added country_iso and quadkey columns.

The geoparquet-by-country directory adds the country_iso and quadkey partitioning, using a 'hive partition' layout.
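In a hive-style layout the partition values are encoded in directory names as key=value, so a tool can decide which files to read from the path alone. The sketch below shows that idea in plain Python; the example paths and file names are hypothetical and are not taken from the actual bucket listing.

```python
from pathlib import PurePosixPath


def parse_hive_path(path):
    """Extract key=value partition directories from a hive-style path."""
    partitions = {}
    for part in PurePosixPath(path).parent.parts:
        key, sep, value = part.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


def select_files(paths, **wanted):
    """Keep only files whose partition values match every requested key."""
    return [
        p for p in paths
        if all(parse_hive_path(p).get(k) == v for k, v in wanted.items())
    ]


# Hypothetical paths, for illustration only.
example_paths = [
    "geoparquet-by-country/country_iso=EG/part-0.parquet",
    "geoparquet-by-country/country_iso=KE/part-0.parquet",
]
egypt_only = select_files(example_paths, country_iso="EG")
```

Partition-aware readers apply the same filtering internally, which is why a query restricted to one country does not need to open files for any other.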

For instructions on how to get full countries as GeoParquet or other formats see this tutorial on DuckDB and GeoParquet with Google Open Buildings.

TODO: get STAC going and put a STAC browser here; hopefully it will view PMTiles too.

Download Dataset

You can use the browse links to navigate through the repository and download the data you'd like.

You can also access the S3 bucket directly with S3 tools (aws cli, boto3, etc.). Its S3 URI is s3://us-west-2.opendata.source.coop/google-research-open-buildings/
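As a rough sketch of programmatic access with boto3, the snippet below splits the URI into bucket and prefix and lists a few keys. It assumes boto3 is installed, that the bucket allows anonymous (unsigned) reads, and that the region is us-west-2 as the bucket name suggests; none of these are confirmed here, and the function names are hypothetical.

```python
from urllib.parse import urlparse

DATASET_URI = "s3://us-west-2.opendata.source.coop/google-research-open-buildings/"


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_keys(uri=DATASET_URI, max_keys=20):
    """List a few object keys under the prefix (requires network access)."""
    # Imported lazily so split_s3_uri works without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    bucket, prefix = split_s3_uri(uri)
    s3 = boto3.client(
        "s3",
        region_name="us-west-2",  # assumption inferred from the bucket name
        config=Config(signature_version=UNSIGNED),  # assumes anonymous reads
    )
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=max_keys)
    return [obj["Key"] for obj in resp.get("Contents", [])]
```

Calling list_keys() returns at most max_keys keys; for a full listing you would page through the results with the continuation token that list_objects_v2 returns.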

Dataset Licenses

The data is shared under the Creative Commons Attribution (CC BY 4.0) license and the Open Data Commons Open Database License (ODbL) v1.0 license. As the user, you can pick which of the two licenses you prefer and use the data under the terms of that license.

Authors

Wojciech Sirko
Sergii Kashubin
Marvin Ritter
Abigail Annkah
Yasser Salah Eddine Bouchareb
Yann Dauphin
Daniel Keysers
Maxim Neumann
Moustapha Cisse
John Quinn

Additional Processing

Chris Holmes

Citation & DOI

The paper explaining the dataset is at https://arxiv.org/abs/2107.12283, and the DOI is: https://doi.org/10.48550/arXiv.2107.12283

V2 notes

The data processing for v2 of Google Open Buildings was a bit different. The original text follows:

In further processing, two new attributes were added, both from joining the geoBoundaries dataset, a great attribution-only dataset for admin level 0, 1 and 2 boundaries. The first is country, which uses the ISO country code, and the second is admin_1, which is the level 1 administrative boundary, usually a state or province. For the most part these just used the shapeName and shapeGroup fields from an ST_Within spatial join, but Angola had an empty shapeName, so adm1_name was used instead. An additional 5 million buildings were not within any boundary, usually because the boundaries did not quite capture the coastline (though occasionally it looked like the segmentation was capturing ships instead of buildings). For those, the nearest boundary was used, found with the <-> operator in PostGIS. This uses the centroid of the polygons, so it likely picked the wrong boundary in some cases, but that should be relatively rare.

Note there is also now an id field. This was inadvertent and has no meaning; it was only there to help processing in PostGIS. The intent was to remove it during final processing, but that was forgotten.

The v2/geoparquet-admin-1 directory includes the additional country and admin_1 processing, and then uses those as the mechanism to split up the files. This leads to a somewhat better 'balance' between the files, with the largest file (IDN-East_Java) coming in at 2.12 gigabytes, compared to 4.9 gigabytes for the largest S2 one (2e7_buildings). These have been partitioned by the country attribute and split into one file per admin_1, so each folder contains all the level 1 admin boundaries for the country. This is done as a parquet 'partition', with each folder named something like country=EGY, so that partition-aware tools can take advantage of the scheme and read only the necessary files. Thus far only limited testing has been done with these, and the individual files aren't named like most examples, but it seems to work.

Note that these seem to have slightly too many features; some extras were introduced in processing. We'll work to clean those up soon.