Google-Microsoft Open Buildings - combined by VIDA
Overview
This dataset merges Google's V3 Open Buildings and Microsoft's latest Building Footprints. It contains 2,579,035,323 footprints and is divided into 185 partitions. Each footprint is labelled with its respective source, either Google or Microsoft. It can be accessed in cloud-native geospatial formats such as GeoParquet, FlatGeobuf and PMTiles.
See it in action
You can use Observable to get a quick overview of the dataset, or go to VIDA to see it in action.
Original datasets
Google
The original Google V3 Open Buildings dataset can be downloaded from this link as gzipped CSV files. Here are some key details about the original dataset:
The dataset contains 1.8 billion building detections, across an inference area of 58M km² within Africa, South Asia, South-East Asia, Latin America and the Caribbean.
Each building in the dataset has a polygon defining its footprint on the ground, a confidence score indicating how certain we are that this is a building, and a Plus Code corresponding to the centre of the building. There is no information about the type of building, its street address, or any details other than its geometry.
For more comprehensive information, please visit the description page. You can also check out the FAQ section for additional information.
Microsoft
The latest version of Microsoft's building footprints can be downloaded from Microsoft Planetary Computer as gzipped partitioned files.
The Microsoft Global Open Buildings dataset was generated through Bing Maps, which detected a total of 1.24 billion buildings. These buildings were identified using imagery from Bing Maps, encompassing data collected between 2014 and 2023, including images from Maxar, Airbus, and IGN France.
For more detailed information, please visit the GitHub page.
Data Formats
The data is available in the following formats:
- GeoParquet 1.1.0
  - By country - single file
  - By country - S2 partitioned
- FlatGeobuf
  - By country - single file
  - By country - S2 partitioned
- PMTiles
  - Global - single layer
  - Global - layer per country based on the 3-letter ISO code
  - By country
Partitioning
This extensive dataset is organized into 185 root partitions. Each partition typically corresponds to a country's administrative boundary, as defined by the Comprehensive Global Administrative Zones (CGAZ) at the ADM0 level, which can be accessed here. There is also a sub-partition available, based on the S2 grid.
By country
Both FlatGeobuf and GeoParquet files are partitioned by country boundary, following the ADM0 level of the CGAZ geoboundary definition, so building footprints are separated by country within each format. Files are named after the country's 3-letter ISO code:
/geoparquet/by_country/country_iso={ISO}/{ISO}.parquet
Note: There is a partition labeled country_iso=None, which represents a MULTIPOLYGON containing geoboundaries (POLYGONS) that have not been explicitly defined or named by CGAZ. These geoboundaries are still captured by CGAZ at the ADM0 level, but they lack specific names and are therefore labelled null. As a result, building footprints located within these geoboundaries are included in the partition labeled country_iso=None. For instance, the area between Sudan and South Sudan includes a piece of land known as "Abyei", which remains unclaimed due to recurring conflicts and therefore lacks an assigned name.
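As a rough illustration of the naming convention above, the following Python sketch builds the relative path of a country's single-file GeoParquet. The helper name country_parquet_path is hypothetical, and the file name used for the country_iso=None partition is an assumption that simply follows the same template; check the actual file listing before relying on it.

```python
from typing import Optional


def country_parquet_path(country_iso: Optional[str]) -> str:
    """Return the relative GeoParquet path for one country partition.

    Pass None to address the partition of unnamed CGAZ geoboundaries.
    """
    if country_iso is None:
        # Assumption: the unnamed partition follows the same template,
        # with the literal label "None" in both path components.
        label = "None"
    else:
        label = country_iso.strip().upper()
        if len(label) != 3 or not label.isalpha():
            raise ValueError(f"expected a 3-letter ISO code, got {country_iso!r}")
    return f"/geoparquet/by_country/country_iso={label}/{label}.parquet"


if __name__ == "__main__":
    print(country_parquet_path("ken"))
    print(country_parquet_path(None))
```

The returned path is relative; it still has to be joined with whichever bucket or base URL you download the dataset from.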
By country + S2 grid
To enhance performance, particularly with GeoParquet files, we've introduced an S2 sub-partitioning strategy. Each ISO partition is further divided using an S2 grid ID, ensuring a cap of 20 million building footprints per grid ID. This S2 grid partitioning is exclusive to GeoParquet files.
/geoparquet/by_country_s2/country_iso={ISO}/{S2_GRID_ID}.parquet
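S2 libraries usually expose cell IDs as unsigned 64-bit integers, while the changelog notes that the S2 grid naming uses signed 64-bit integers since version 1.1. The Python sketch below converts between the two representations and builds the relative path of an S2 sub-partition. It assumes the signed value is the two's-complement reinterpretation of the same 64 bits, and the function names are hypothetical.

```python
def to_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as a signed one."""
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit in an unsigned 64-bit integer")
    return value - 2**64 if value >= 2**63 else value


def to_unsigned_64(value: int) -> int:
    """Reinterpret a signed 64-bit integer as an unsigned one."""
    if not -(2**63) <= value < 2**63:
        raise ValueError("value does not fit in a signed 64-bit integer")
    # Python's modulo always returns a non-negative result here.
    return value % 2**64


def s2_parquet_path(country_iso: str, s2_id_unsigned: int) -> str:
    """Return the relative path of one S2 sub-partition.

    Assumption: the file name is the signed form of the S2 grid ID.
    """
    signed_id = to_signed_64(s2_id_unsigned)
    return f"/geoparquet/by_country_s2/country_iso={country_iso}/{signed_id}.parquet"
```

IDs with the top bit set come out negative in the signed form, so a leading minus sign in a file name is expected under this assumption.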
Schema
Each row describes a single building footprint, with the following columns:
- boundary_id (INTEGER): A unique ID linking the CGAZ level 0 boundary ISO to an integer, created for partitioning the datasets within BigQuery.
- confidence (FLOAT): A metric denoting the model's confidence about the accuracy of the building footprint. Microsoft-sourced footprints set this column to null since the original dataset doesn't feature this attribute.
- bf_source (STRING): Indicates the footprint's origin - Google or Microsoft.
- area_in_meters (FLOAT): Represents the polygon's area in square meters.
- s2_id (INT): Exclusive to the S2 partitioning scheme, it represents the S2 grid ID.
- country_iso (STRING): 3-letter ISO code of the country the footprint belongs to.
- geohash (STRING): Geohash for the geometry at a precision level of 8.
- bbox (STRUCT): Struct containing xmin, ymin, xmax, ymax values for the bounding box of the geometry.
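To show how the bbox and confidence columns might be used together, here is a minimal Python sketch that filters rows by a query window and an optional confidence threshold. Rows are modelled as plain dictionaries with made-up values, the lowercase bf_source strings are an assumption about how the values are stored, and the function names are hypothetical. Because Microsoft-sourced footprints have a null confidence, the sketch lets the caller decide whether to keep them.

```python
from typing import Iterable, Iterator, Optional, Tuple

Window = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def bbox_intersects(bbox: dict, window: Window) -> bool:
    """True when the row's bbox struct overlaps the query window."""
    xmin, ymin, xmax, ymax = window
    return (bbox["xmin"] <= xmax and bbox["xmax"] >= xmin
            and bbox["ymin"] <= ymax and bbox["ymax"] >= ymin)


def select_footprints(rows: Iterable[dict], window: Window,
                      min_confidence: Optional[float] = None,
                      keep_unscored: bool = True) -> Iterator[dict]:
    """Yield rows inside the window that pass the confidence threshold."""
    for row in rows:
        if not bbox_intersects(row["bbox"], window):
            continue
        confidence = row["confidence"]
        if min_confidence is not None:
            if confidence is None:
                # Null confidence (Microsoft rows): keep or drop by choice.
                if not keep_unscored:
                    continue
            elif confidence < min_confidence:
                continue
        yield row


# Made-up sample rows, for illustration only.
SAMPLE_ROWS = [
    {"bf_source": "google", "confidence": 0.90,
     "bbox": {"xmin": 36.80, "ymin": -1.30, "xmax": 36.81, "ymax": -1.29}},
    {"bf_source": "google", "confidence": 0.60,
     "bbox": {"xmin": 36.82, "ymin": -1.31, "xmax": 36.83, "ymax": -1.30}},
    {"bf_source": "microsoft", "confidence": None,
     "bbox": {"xmin": 36.75, "ymin": -1.25, "xmax": 36.76, "ymax": -1.24}},
    {"bf_source": "google", "confidence": 0.95,
     "bbox": {"xmin": 40.00, "ymin": 2.00, "xmax": 40.01, "ymax": 2.01}},
]
```

The same comparison on the four bbox fields can be expressed as a filter in most Parquet readers, which lets them skip data outside the window before decoding any geometries.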
Data Processing
We invite you to read our blog post for more detailed information on our dataset merging approach, which includes insights into the optimization techniques we investigated and the query performance on BigQuery. In this section, we provide a high-level summary of the merging process, highlighting its crucial aspects.
We imported both datasets into BigQuery for further processing. From the Google dataset, we excluded columns like full_plus_code, latitude, and longitude. For the Microsoft dataset we did not drop any columns.
We then matched each building footprint with a boundary ID, determined by the intersection of its centroid with the country geoboundaries in the CGAZ ADM0 dataset. Footprints whose centroids didn't overlap with any country geoboundary were mapped to the nearest geoboundary based on their centroid's position.
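The matching rule described above can be sketched in plain Python. This is a toy illustration on planar coordinates with simple, hole-free polygons, not the BigQuery processing that was actually used; all function names are hypothetical, and "nearest" is taken here to mean the smallest distance from the centroid to a boundary's edge, which is an assumption.

```python
from math import hypot
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Ring = List[Point]


def centroid(ring: Ring) -> Point:
    """Area-weighted centroid of a simple polygon (shoelace formula)."""
    area2 = cx = cy = 0.0
    for i in range(len(ring)):
        (x0, y0), (x1, y1) = ring[i], ring[(i + 1) % len(ring)]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # area2 is twice the signed area; zero-area rings are not handled.
    return (cx / (3 * area2), cy / (3 * area2))


def contains(ring: Ring, p: Point) -> bool:
    """Ray-casting point-in-polygon test."""
    x, y = p
    inside = False
    for i in range(len(ring)):
        (x0, y0), (x1, y1) = ring[i], ring[(i + 1) % len(ring)]
        if (y0 > y) != (y1 > y):
            if x < x0 + (y - y0) * (x1 - x0) / (y1 - y0):
                inside = not inside
    return inside


def distance_to_ring(ring: Ring, p: Point) -> float:
    """Smallest distance from p to any edge of the ring."""
    best = float("inf")
    for i in range(len(ring)):
        (ax, ay), (bx, by) = ring[i], ring[(i + 1) % len(ring)]
        dx, dy = bx - ax, by - ay
        length2 = dx * dx + dy * dy
        t = 0.0 if length2 == 0 else ((p[0] - ax) * dx + (p[1] - ay) * dy) / length2
        t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
        best = min(best, hypot(p[0] - (ax + t * dx), p[1] - (ay + t * dy)))
    return best


def assign_boundary(footprint: Ring, boundaries: Dict[int, Ring]) -> int:
    """Return the boundary ID containing the centroid, else the nearest one."""
    c = centroid(footprint)
    for boundary_id, ring in boundaries.items():
        if contains(ring, c):
            return boundary_id
    return min(boundaries, key=lambda b: distance_to_ring(boundaries[b], c))


# Two toy "countries" with a gap between them.
BOUNDARIES = {1: [(0, 0), (10, 0), (10, 10), (0, 10)],
              2: [(20, 0), (30, 0), (30, 10), (20, 10)]}
```

For example, a footprint lying in the gap between the two toy boundaries is assigned to whichever boundary edge is closer to its centroid. Real country boundaries are multipolygons on a sphere, so a production version would rely on a geospatial library or on the database's geography functions.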
Contact details
If you'd like more information about the dataset or the processing steps, feel free to write an email to maarten@vida.place.
Changelog
Current version: 2.0
Version 2.0 - 2024-09-04
- Add 32,784,238 building footprints for various regions by updating to the latest Microsoft GlobalMLBuildingFootprints as of 2024-05-28.
- Update to GeoParquet schema version 1.1.0.
- Include bbox struct for easy filtering.
- Introduce spatial ordering by geohash for FlatGeobuf and GeoParquet files.
- Add PMTiles files per country.
- Add PMTiles file with a layer per country ISO code.
Version 1.1 - 2023-10-02
- Added 11,631,283 building footprints for Morocco from the Google Earthquake dataset
- Added 24,532 building footprints for Libya from the Google Derna Flooding dataset
- Building footprints are added to the GeoParquet, FlatGeobuf and PMTiles archives.
- Fixed the missing GeoParquet version bug.
- Refactored S2 grid naming strategy from unsigned 64-bit integers to signed 64-bit integers.
Version 1 - 2023-08-29
Dataset Licenses
The data is shared under the Creative Commons Attribution (CC BY-4.0) license and the Open Data Commons Open Database License (ODbL) v1.0 license. As the user, you can pick which of the two licenses you prefer and use the data under the terms of that license.