Introduction

Geospatial data consist of multi-dimensional raster or point datasets, frequently generated in domains such as meteorology, oceanography, climate science, and earth observation. Such data are naturally represented as multi-dimensional dense or sparse arrays with associated metadata. For instance, variables within raster datasets are 2D dense arrays, temporal stacks of 2D raster data are 3D dense arrays, and point cloud data (e.g., LiDAR) are 3D sparse arrays. The same model extends to N-dimensional stacks of geospatial data, such as temporal multispectral data.
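As a rough illustration of the array shapes described above, the following Python sketch uses only built-in types. The variable names, sample values, and the `intensity` attribute are invented for illustration; a real pipeline would use an array library rather than nested lists.

```python
# Dense 2D raster: every cell (row, col) holds a value.
raster = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
]

# Dense 3D temporal stack: one 2D raster per time step, indexed (time, row, col).
stack = [raster, [[v + 1.0 for v in row] for row in raster]]

# Sparse 3D point cloud: only occupied (x, y, z) coordinates are stored,
# each with its attribute values (here a hypothetical "intensity").
points = {
    (12.5, 40.1, 3.2): {"intensity": 17},
    (12.7, 40.3, 2.9): {"intensity": 42},
}


def shape(nested):
    """Return the shape of a rectangular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


raster_shape = shape(raster)  # two rows, three columns
stack_shape = shape(stack)    # two time steps of the same raster shape
```

The dense structures allocate a value for every cell, while the sparse structure grows only with the number of points actually recorded.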

The Data Management Problem

An issue with existing approaches to storing and processing geospatial data is that they all generate collections of files, with each file corresponding to a certain geographic region and carrying its own metadata. Particularly in cloud object stores such as AWS S3, where all files/objects are immutable, updating a collection typically means generating and adding new files. The problem lies in managing those numerous files and their metadata: you typically need a separate catalog service to discover which file or files contain the relevant data for a given spatiotemporal slice.

For example, Cloud Optimized GeoTIFF (COG) is a specification that determines the layout of a TIFF file for efficient access from S3. A COG stored on S3 is essentially a read-only blob that cannot be updated or appended to. Managing a collection of COGs, such as a temporal collection, therefore requires a catalog service. In addition, as the resolution or the area that a COG covers increases, indexing becomes important: the COG header is an ordered list of tiles and does not use spatial indexing to identify the offsets of the tiles intersecting an area of interest.
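To make the catalog requirement concrete, here is a minimal sketch of the lookup a catalog service has to perform for a file-based collection. Everything in it is hypothetical: the `CatalogEntry` type, the `find_files` function, and the file paths are invented for illustration and do not correspond to any particular catalog product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One file in the collection, with the region and date it covers."""
    path: str
    min_x: float
    min_y: float
    max_x: float
    max_y: float
    date: str  # ISO 8601 date, so string comparison matches date order


def find_files(catalog, min_x, min_y, max_x, max_y, start, end):
    """Return paths of files whose bounding box and date intersect the query."""
    hits = []
    for e in catalog:
        overlaps_space = (e.min_x <= max_x and e.max_x >= min_x
                          and e.min_y <= max_y and e.max_y >= min_y)
        in_time = start <= e.date <= end
        if overlaps_space and in_time:
            hits.append(e.path)
    return hits


# Hypothetical collection: two regions on one date, one region on a later date.
catalog = [
    CatalogEntry("tiles/a_2020-01-01.tif", 0, 0, 10, 10, "2020-01-01"),
    CatalogEntry("tiles/b_2020-01-01.tif", 10, 0, 20, 10, "2020-01-01"),
    CatalogEntry("tiles/a_2020-02-01.tif", 0, 0, 10, 10, "2020-02-01"),
]

# Which files cover the box (2, 2)-(8, 8) during January 2020?
matches = find_files(catalog, 2, 2, 8, 8, "2020-01-01", "2020-01-31")
```

Every update to the collection adds entries that this lookup must track, which is the bookkeeping burden the section describes.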

Why TileDB?

TileDB is an excellent fit for geospatial applications, as it stores the data in multi-dimensional dense or sparse arrays and unifies all spatial and temporal information in a single, intuitive, and efficiently sliceable structure. It removes the file-management pains described above while offering cloud-optimized read and write performance. TileDB is an open-source, embeddable library written in C/C++ that eliminates the need for an additional catalog service, supports updates natively, and employs fast spatial indexing (such as R-trees over its tiles).
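The benefit of indexing tiles by their bounding rectangles can be sketched in a few lines of Python. This is a flat, single-level simplification of the R-tree idea, written only to illustrate tile pruning; it is not TileDB's implementation, and the function names and the `capacity` parameter are assumptions made for this example.

```python
def mbr(points):
    """Minimum bounding rectangle (min_x, min_y, max_x, max_y) of 2D points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def intersects(a, b):
    """True if rectangles a and b overlap (touching edges count as overlap)."""
    return a[0] <= b[2] and a[2] >= b[0] and a[1] <= b[3] and a[3] >= b[1]


def build_tiles(points, capacity):
    """Sort points and pack them into tiles of at most `capacity` points each."""
    ordered = sorted(points)
    tiles = [ordered[i:i + capacity] for i in range(0, len(ordered), capacity)]
    return [(mbr(t), t) for t in tiles]


def query(tiles, box):
    """Return points inside `box`, scanning only tiles whose MBR intersects it."""
    scanned = 0
    result = []
    for rect, pts in tiles:
        if not intersects(rect, box):
            continue  # the whole tile is skipped without reading its points
        scanned += 1
        result.extend(p for p in pts
                      if box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3])
    return result, scanned


points = [(1, 1), (2, 3), (8, 8), (9, 7), (15, 2), (16, 4)]
tiles = build_tiles(points, capacity=2)
found, tiles_scanned = query(tiles, (0, 0, 5, 5))
```

In this example only one of the three tiles intersects the query box, so the other two are never read. A real R-tree arranges such rectangles hierarchically so that the pruning step itself avoids a linear scan.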

TileDB currently supports the following dataset types:

  • Point cloud: These are 3D points of the form X, Y, Z, <attribute-fields>, which TileDB stores in a 3D sparse array. TileDB integrates with the popular PDAL library to support point cloud data ingestion into TileDB (e.g., from LAZ files) and access/computation.

  • Raster: These are 2D gridded image data, where each pixel may store any number of values. TileDB integrates with the popular GDAL and Rasterio libraries to support ingesting raster and vector data from a variety of formats into TileDB dense arrays, and performing advanced spatial processing.

  • SAR: Synthetic Aperture Radar (SAR) is used in remote sensing to create finely detailed representations of the earth and to model changes over time. A SAR measurement is a complex value representing both the amplitude and the phase of the radar response. TileDB stores SAR data, as well as temporal stacks of SAR data, in 2D or 3D dense arrays.
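As a small illustration of the complex SAR measurement mentioned in the list above, the following sketch separates a hypothetical 2x2 patch of complex samples into amplitude and phase using Python's standard `cmath` module. The sample values are invented for illustration.

```python
import cmath

# Hypothetical 2x2 patch of complex SAR samples (real + imaginary parts).
patch = [
    [complex(3.0, 4.0), complex(0.0, 2.0)],
    [complex(-1.0, 0.0), complex(1.0, 1.0)],
]

# Amplitude is the magnitude of each complex sample.
amplitude = [[abs(z) for z in row] for row in patch]

# Phase is the angle of each sample in radians, in the range [-pi, pi].
phase = [[cmath.phase(z) for z in row] for row in patch]
```

Because each cell holds one complex value, a single SAR scene maps to a 2D dense array and a time series of scenes to a 3D dense array.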

Storing geospatial data in TileDB allows you to take advantage of all TileDB benefits, such as cloud-optimized access, compression, parallel IO, and integration with the data science ecosystem (e.g., parallel computing via Dask or Spark, or even SQL queries on your data).
