Geospatial

Cloud analysis-ready data, with data management solved at last

Geospatial data consist of multi-dimensional raster or point datasets, frequently generated in domains such as meteorology, oceanography, climate science and earth observation. Such data are naturally represented as multi-dimensional dense or sparse arrays with associated metadata. For instance, variables within raster datasets are 2D dense arrays, temporal stacks of 2D raster data are 3D dense arrays, and point cloud data (LiDAR) are 3D sparse arrays. The same model extends to an N-dimensional stack of geospatial data, as in the case of temporal multispectral data.

LiDAR and raster examples
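To make the dense versus sparse distinction concrete, here is a minimal sketch in plain Python. It is an illustration only: the values, coordinates, the `intensity` attribute, and the helper names (`dense_shape`, `slice_stack`, `points_in_box`) are all hypothetical and are not part of any geospatial library.

```python
# Dense 2D raster: every cell holds a value (a 3x4 grid of made-up elevations).
raster = [
    [10, 11, 12, 13],
    [11, 12, 13, 14],
    [12, 13, 14, 15],
]

# Dense 3D temporal stack: one 2D raster per timestamp, indexed [t][row][col].
stack = [raster, [[v + 1 for v in row] for row in raster]]

# Sparse 3D point cloud: only occupied (x, y, z) coordinates are stored,
# each mapped to its attributes (a hypothetical return intensity).
points = {
    (512.3, 204.1, 35.7): {"intensity": 87},
    (512.9, 204.4, 36.2): {"intensity": 91},
    (530.0, 199.8, 12.4): {"intensity": 40},
}


def dense_shape(array):
    """Return the shape of a rectangular nested-list array."""
    shape = []
    while isinstance(array, list):
        shape.append(len(array))
        array = array[0]
    return tuple(shape)


def slice_stack(stack, t, rows, cols):
    """Read a spatial window (half-open row/col ranges) at one timestamp."""
    return [row[cols[0]:cols[1]] for row in stack[t][rows[0]:rows[1]]]


def points_in_box(points, lo, hi):
    """Return the stored coordinates that fall inside an axis-aligned 3D box."""
    return sorted(
        p for p in points
        if all(lo[i] <= p[i] <= hi[i] for i in range(3))
    )


window = slice_stack(stack, t=1, rows=(0, 2), cols=(1, 3))
nearby = points_in_box(points, lo=(510, 200, 30), hi=(515, 210, 40))
```

The dense arrays are addressed by integer index along each dimension, while the sparse array is addressed by the real-valued coordinates of the cells that actually exist.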

The Data Management Problem

An issue with existing approaches to storing and processing geospatial data is that they all generate collections of files, with each file corresponding to a certain geographic region and carrying its own metadata. Particularly in cloud object stores like AWS S3, where all files/objects are immutable, updating a collection typically means generating and adding new files. The problem lies in managing those numerous files and their metadata, as you typically need a separate catalog service to discover which file or files contain the relevant data for a given spatiotemporal slice.

Managing files and bands corresponding to different areas or timestamps becomes messy

For example, Cloud-Optimized GeoTIFF (COG) is a specification that determines the layout of a TIFF file for efficient access from S3. A COG stored on S3 is essentially a read-only blob that cannot be updated or appended to. Managing a collection of COGs, as in the case of a temporal collection, requires a catalog service. In addition, as the resolution or the area that the COG covers increases, the issue of indexing becomes important. The COG header is an ordered list of tiles and does not use spatial indexing to identify the offset of the tiles intersecting with an area of interest.
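The catalog lookup described in this section can be sketched roughly as follows. This is a simplified stand-in for a catalog service, not the API of any real one: the `CatalogEntry` type, the object keys, and the bounding boxes are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CatalogEntry:
    path: str        # hypothetical object key in the bucket
    bbox: tuple      # (min_x, min_y, max_x, max_y)
    acquired: date   # acquisition date of the scene


def intersects(a, b):
    """True if two (min_x, min_y, max_x, max_y) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def find_files(catalog, bbox, start, end):
    """Return the files needed to answer one spatiotemporal query."""
    return [
        entry.path
        for entry in catalog
        if intersects(entry.bbox, bbox) and start <= entry.acquired <= end
    ]


catalog = [
    CatalogEntry("scenes/tile_a_2021-01-01.tif", (0, 0, 10, 10), date(2021, 1, 1)),
    CatalogEntry("scenes/tile_b_2021-01-01.tif", (10, 0, 20, 10), date(2021, 1, 1)),
    CatalogEntry("scenes/tile_a_2021-02-01.tif", (0, 0, 10, 10), date(2021, 2, 1)),
]

# Objects are immutable, so new data arrives as a new file plus a new
# catalog entry rather than as an in-place update of an existing file.
catalog.append(
    CatalogEntry("scenes/tile_b_2021-02-01.tif", (10, 0, 20, 10), date(2021, 2, 1))
)

needed = find_files(catalog, (8, 2, 12, 4), date(2021, 1, 1), date(2021, 1, 31))
```

Even this small query straddles two files, and every new acquisition adds both a file and a catalog entry that must be kept consistent with each other.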

Why TileDB?

TileDB is an excellent fit for geospatial applications, as it stores the data in multi-dimensional dense or sparse arrays, and unifies all spatial and temporal information in a single, intuitive, and efficiently sliceable structure. TileDB removes the data management pains described above, while offering excellent cloud-optimized writing/reading performance. It is an open-source, embeddable library written in C / C++ that does not require an additional catalog service. It supports updates natively, and employs fast spatial indexing (such as R-Trees for its tiles).
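As a rough intuition for why spatial indexing over tiles helps, the sketch below packs points into fixed-capacity tiles, records each tile's minimum bounding rectangle (MBR), and skips tiles whose MBR misses the query box. It is a one-level simplification of the R-tree idea written in plain Python, not TileDB's actual implementation; the function names, the tile capacity, and the sample points are assumptions made for illustration.

```python
def intersects(a, b):
    """True if two (min_x, min_y, max_x, max_y) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def build_tiles(points, capacity):
    """Sort 2D points and pack them into tiles of at most `capacity` points,
    storing each tile together with its minimum bounding rectangle."""
    ordered = sorted(points)
    tiles = []
    for start in range(0, len(ordered), capacity):
        chunk = ordered[start:start + capacity]
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        tiles.append(((min(xs), min(ys), max(xs), max(ys)), chunk))
    return tiles


def query(tiles, box):
    """Return (matching points, number of tiles actually scanned)."""
    hits, scanned = [], 0
    for mbr, chunk in tiles:
        if not intersects(mbr, box):
            continue  # prune: no point in this tile can match
        scanned += 1
        hits.extend(
            p for p in chunk
            if box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]
        )
    return hits, scanned


points = [(1, 1), (2, 3), (3, 2), (10, 10), (11, 12), (12, 11),
          (20, 1), (21, 2), (22, 3)]
tiles = build_tiles(points, capacity=3)
hits, scanned = query(tiles, (9, 9, 11, 12))
```

In this toy dataset the query touches one of three tiles. A real R-tree applies the same pruning hierarchically, so the number of bounding boxes examined grows slowly with the size of the dataset.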

TileDB has been integrated into foundational geospatial libraries such as PDAL (sparse point cloud) and GDAL (raster), and you can find all the details in the TileDB Developer Geospatial docs. By storing your geospatial data in TileDB, you also get all the other TileDB benefits, such as direct array data access via various APIs (C, C++, Python, R, Java, Go), embeddable SQL with MariaDB, scalable analysis with popular parallel computing tools like Spark and Dask, and encryption at rest. Finally, you can securely share your data with other users and perform serverless computations on the cloud with TileDB Cloud.