The TileDB GDAL driver enables efficient, indexed access to whole and partial tiles of geospatial raster data. The Python community typically uses the Rasterio library to access GDAL drivers. Here we show how to ingest large raster data into TileDB in parallel using GDAL, Rasterio, xarray, and Dask.
To highlight how TileDB works with large dense arrays we will use a dataset from the Sentinel-2 mission. You can either register and download a sample yourself from the Copernicus hub or use the requester-pays bucket from the AWS Open Data program; the latter is preferable if you wish to run your code on AWS using public data. The directory size of the Sentinel-2 image we are using is 788 MB.
You can run the Rasterio code as follows:
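The exact installation commands are not reproduced here; one way to set up a suitable environment (a sketch assuming the conda-forge packages for the tools used in this tutorial) is:

```bash
conda install -c conda-forge rasterio dask xarray tiledb-py
```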
To verify that the Sentinel-2 dataset can be read by our installation of Rasterio, run the following (changing the filename):
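A minimal verification script might look like the following; the product path is hypothetical, and GDAL's SENTINEL2 driver opens the product through its MTD_MSIL1C.xml metadata file:

```python
import rasterio

# Hypothetical path: point this at your downloaded product's metadata file.
with rasterio.open("<product>.SAFE/MTD_MSIL1C.xml") as src:
    # Each subdataset corresponds to one resolution/CRS combination.
    for name in src.subdatasets:
        print(name)
```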
Rasterio should successfully print the metadata of the subdatasets within the Sentinel-2 dataset.
To ingest the Sentinel-2 data into TileDB with Rasterio, run:
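A minimal sketch of such an ingestion, assuming your GDAL build includes the TileDB driver and using a hypothetical subdataset URI taken from the previous step:

```python
import rasterio

# Hypothetical subdataset URI from src.subdatasets in the previous step.
SRC = "SENTINEL2_L1C:<product>.SAFE/MTD_MSIL1C.xml:10m:EPSG_32633"
DST = "s2_10m_array"  # output TileDB array (a local path or S3 URI)

with rasterio.open(SRC) as src:
    profile = src.profile
    profile.update(driver="TileDB")  # route the write through GDAL's TileDB driver
    with rasterio.open(DST, "w", **profile) as dst:
        dst.write(src.read())
```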
A faster way to ingest large raster data into TileDB and take advantage of TileDB's parallel writes is by using Dask. Below we provide a detailed example of how to do this:
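The following is a sketch of one possible approach: it creates a dense 2D TileDB array matching the raster's shape directly with TileDB-Py, then uses Dask tasks to read source windows with Rasterio and write them to disjoint regions of the array in parallel. The input path, array name, and tile size are assumptions:

```python
import dask
import numpy as np
import rasterio
from rasterio.windows import Window
import tiledb

SRC = "sentinel2_band.jp2"   # hypothetical input raster
DST = "s2_tiledb_array"      # hypothetical output TileDB array
TILE = 1024                  # edge length of each write chunk

# Read the raster's shape and data type.
with rasterio.open(SRC) as src:
    height, width = src.height, src.width
    dtype = np.dtype(src.dtypes[0])

# Create a dense 2D array whose domain matches the raster.
dom = tiledb.Domain(
    tiledb.Dim(name="y", domain=(0, height - 1), tile=min(TILE, height), dtype=np.uint64),
    tiledb.Dim(name="x", domain=(0, width - 1), tile=min(TILE, width), dtype=np.uint64),
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=False,
    attrs=[tiledb.Attr(name="band", dtype=dtype)],
)
tiledb.Array.create(DST, schema)

@dask.delayed
def ingest_window(row, col):
    # Each task reads one window from the source and writes the matching
    # slice; TileDB supports concurrent writes to disjoint regions.
    h = min(TILE, height - row)
    w = min(TILE, width - col)
    with rasterio.open(SRC) as src:
        block = src.read(1, window=Window(col, row, w, h))
    with tiledb.open(DST, "w") as arr:
        arr[row:row + h, col:col + w] = block

tasks = [
    ingest_window(r, c)
    for r in range(0, height, TILE)
    for c in range(0, width, TILE)
]
dask.compute(*tasks)
```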
You can run PDAL and TileDB code in Python using conda environment packages.
If the Python bindings are not required, the pdal conda package can be used instead.
Both packages provide up-to-date PDAL and TileDB libraries for your conda environment.
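For example (package names as published on conda-forge):

```bash
# Python bindings plus the PDAL and TileDB libraries:
conda install -c conda-forge python-pdal
# PDAL command-line tools only:
conda install -c conda-forge pdal
```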
First, create a TileDB config file tiledb.config, where you can set any TileDB configuration parameter (e.g., AWS keys if you would like to write to a TileDB array on S3). Make sure you also add the following, as TileDB does not currently handle duplicate points (this will change in a future version).
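A minimal tiledb.config might therefore look like the following. TileDB config files contain one "parameter value" pair per line; we assume the deduplication parameter in question is sm.dedup_coords, and S3 keys (e.g., vfs.s3.aws_access_key_id) can be added as extra lines:

```
sm.dedup_coords true
```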
Then create a PDAL pipeline that translates some LAS data to a TileDB array by storing the following in a file called pipeline.json:
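A minimal pipeline might look like this; the input filename is hypothetical, while array_name and config_file are documented options of writers.tiledb:

```json
{
  "pipeline": [
    {
      "type": "readers.las",
      "filename": "sample.las"
    },
    {
      "type": "writers.tiledb",
      "array_name": "sample_array",
      "config_file": "tiledb.config"
    }
  ]
}
```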
See the PDAL documentation for information on the available options of the TileDB PDAL writer. You can execute the pipeline with PDAL, which will carry out the ingestion, as follows:
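For example:

```bash
pdal pipeline pipeline.json
```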
We now have points and attributes stored in an array called sample_array. This write uses the streaming mode of PDAL.
You can view this sample_array directly from TileDB as follows (we demonstrate using TileDB's Python API, but any other API would work as well):
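A sketch of reading the array back with TileDB-Py; the dimension names X/Y/Z and the Intensity attribute are assumptions based on standard LAS dimensions:

```python
import tiledb

with tiledb.open("sample_array") as arr:
    print(arr.schema)             # dimensions and attributes
    print(arr.nonempty_domain())  # bounding box of the stored points

    # Read all points and one attribute over the non-empty domain;
    # multi_index bounds are inclusive on both ends.
    (x0, x1), (y0, y1), _ = arr.nonempty_domain()
    region = arr.query(attrs=["Intensity"]).multi_index[x0:x1, y0:y1]
    print(len(region["X"]), "points read")
```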
PDAL is single-threaded, but coupled with TileDB's parallel write support it becomes a powerful tool for ingesting enormous amounts of point cloud data into TileDB. The PDAL driver supports appending to an existing dataset, and we use this with Dask to perform a parallel update.
We demonstrate parallel ingestion with the code below. Make sure to remove or move the sample_array created in the previous example.
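A sketch of one way to do this with python-pdal and Dask, assuming a hypothetical list of input LAS files: the first write creates the array, and subsequent writes set the documented append option of writers.tiledb:

```python
import json
import dask
import pdal  # python-pdal bindings

FILES = ["tile1.las", "tile2.las", "tile3.las"]  # hypothetical inputs
ARRAY = "sample_array"

def make_pipeline(filename, append):
    # Build a reader -> writers.tiledb pipeline for one input file.
    return json.dumps({
        "pipeline": [
            {"type": "readers.las", "filename": filename},
            {"type": "writers.tiledb",
             "array_name": ARRAY,
             "config_file": "tiledb.config",
             "append": append},
        ]
    })

# The first file creates the array; the rest append to it in parallel.
pdal.Pipeline(make_pipeline(FILES[0], False)).execute()

@dask.delayed
def ingest(filename):
    pdal.Pipeline(make_pipeline(filename, True)).execute()

dask.compute(*[ingest(f) for f in FILES[1:]])
```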
Although the TileDB driver is parallel (i.e., it uses multiple threads for decompression and I/O), PDAL itself is single-threaded, so some tasks may benefit from additional acceleration. Take, for instance, the following PDAL command, which counts the number of points in the dataset using the TileDB driver.
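The original command is not reproduced here; one possible equivalent is a small pipeline (saved as count.json) that reads the array through readers.tiledb and computes statistics with filters.stats:

```json
{
  "pipeline": [
    { "type": "readers.tiledb", "array_name": "sample_array",
      "config_file": "tiledb.config" },
    { "type": "filters.stats" }
  ]
}
```

```bash
time pdal pipeline count.json --metadata count_meta.json
# The point count appears in the filters.stats entry of count_meta.json.
```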
We can write a simple script in Python with Dask and direct access to TileDB to perform the same operation completely in parallel:
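A sketch under the assumption that the array's first dimension is X: partition the non-empty X extent into disjoint ranges and count the coordinates in each range in parallel:

```python
import dask
import numpy as np
import tiledb

ARRAY = "sample_array"  # array produced by the PDAL pipeline above
N_CHUNKS = 8            # number of parallel partitions

# Find the non-empty extent of the first (X) dimension.
with tiledb.open(ARRAY) as arr:
    x_min, x_max = arr.nonempty_domain()[0]

edges = np.linspace(x_min, x_max, N_CHUNKS + 1)

@dask.delayed
def count_points(lo, hi):
    # multi_index bounds are inclusive, so callers shift the lower bound
    # of every partition after the first to avoid double counting.
    with tiledb.open(ARRAY) as arr:
        result = arr.query(attrs=[]).multi_index[lo:hi]
        return len(result["X"])

counts = [
    count_points(edges[i] if i == 0 else np.nextafter(edges[i], np.inf),
                 edges[i + 1])
    for i in range(N_CHUNKS)
]
print(sum(dask.compute(*counts)))
```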
In both cases we get the answer 31530863 (for a 750 MB compressed array). We timed the single-threaded PDAL command with time on an m5a.2xlarge machine on AWS.
The above Python script using Dask is significantly faster.
TileDB supports a number of geospatial operations in SQL through the MariaDB integration. Please see the MariaDB documentation for more information: MariaDB Usage - Geospatial.
GDAL is a translator library for raster and vector geospatial data formats, and a TileDB raster driver has been supported since GDAL 3.0. You can run the GDAL code as follows:
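For example, using a conda-forge build of GDAL 3.0 or later (we assume the build enables the TileDB driver):

```bash
conda install -c conda-forge 'gdal>=3.0'
```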
To confirm that the TileDB driver is available, run:
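```bash
gdalinfo --formats | grep -i tiledb
# The output should include a TileDB raster entry.
```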
You can list the supported options of the TileDB driver with the command below.
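```bash
gdalinfo --format TileDB
# Prints the driver metadata, including its creation options.
```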
Download this simple GeoTIFF image. Simply run:
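For example (the input filename is hypothetical):

```bash
gdal_translate -of TileDB input.tif <array-name>
```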
This will create a new TileDB array called <array-name> and ingest the GeoTIFF image as a TileDB 2D dense array with a simple attribute that stores the grayscale value of each pixel. Note that the array name can also be an S3 URI. In that case, you would need to create an aws.config file and add your S3 keys in the following way:
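TileDB config files contain one "parameter value" pair per line; a minimal aws.config using TileDB's S3 parameters might look like:

```
vfs.s3.aws_access_key_id <your-access-key-id>
vfs.s3.aws_secret_access_key <your-secret-access-key>
```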
Then, you can ingest directly into a TileDB array on S3 as follows:
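For example, passing the config file through the driver's TILEDB_CONFIG creation option (bucket and names are placeholders):

```bash
gdal_translate -of TileDB -co TILEDB_CONFIG=aws.config input.tif s3://<bucket>/<array-name>
```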
After ingesting the array, you can get its info with:
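```bash
gdalinfo <array-name>
# For an array on S3, pass the config file as an open option:
gdalinfo -oo TILEDB_CONFIG=aws.config s3://<bucket>/<array-name>
```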
Finally, you can check if the values in the TileDB array are the same as in the original GeoTIFF file:
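One way to check this (a sketch using Rasterio and NumPy; the paths are hypothetical):

```python
import numpy as np
import rasterio

# Hypothetical paths: the original GeoTIFF and the ingested TileDB array.
with rasterio.open("input.tif") as tif, rasterio.open("<array-name>") as arr:
    assert np.array_equal(tif.read(), arr.read()), "pixel values differ"
print("The TileDB array matches the original GeoTIFF.")
```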
If everything worked correctly, the values in the TileDB array will be identical to those in the original GeoTIFF file.
MapServer is an open source platform for publishing spatial data and interactive mapping applications to the web. MapServer allows you to render data from TileDB arrays and combine other sources and formats such as GeoJSON to render cartographic quality maps.
You can run the MapServer and TileDB examples as follows:
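For example, MapServer is available from conda-forge:

```bash
conda install -c conda-forge mapserver
```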
We will use the following MapServer mapfile:
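The original mapfile is not reproduced here; a minimal sketch (the extent, bucket, and paths are placeholders) that renders a TileDB array through CONNECTIONOPTIONS might look like:

```
MAP
  NAME "tiledb-demo"
  SIZE 800 600
  EXTENT -180 -90 180 90   # replace with the extent of your imagery
  IMAGETYPE PNG

  LAYER
    NAME "imagery"
    TYPE RASTER
    STATUS ON
    DATA "s3://<bucket>/<array-name>"
    # CONNECTIONOPTIONS passes GDAL open options (MapServer 8.0+).
    CONNECTIONOPTIONS
      "TILEDB_CONFIG" "/secure/path/aws.config"
    END
  END
END
```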
We will also use sample data from https://coast.noaa.gov/dataviewer/#/, in this case the 2017 NOAA NGS Ortho-rectified Oblique Imagery of the East Coast.
The use of CONNECTIONOPTIONS requires the MapServer 8.0 release or later.
The following GDAL commands are used to produce a single TileDB array from multiple sources.
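For example (the filenames are hypothetical): build a virtual mosaic of the inputs, then translate it once into a single TileDB array:

```bash
# Mosaic the source images into a virtual raster, then write it out
# as a single TileDB array.
gdalbuildvrt mosaic.vrt tile_*.tif
gdal_translate -of TileDB -co TILEDB_CONFIG=aws.config mosaic.vrt s3://<bucket>/<array-name>
```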
As in our TileDB and GDAL tutorials, we store the S3 credentials in an aws.config file. Note that this aws.config file should be stored in a location that is accessible by your web server but is not public.
To test rendering a map with MapServer, we use the shp2img command:
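For example (the mapfile and output names are hypothetical; note that shp2img was renamed map2img in MapServer 8.0):

```bash
shp2img -m tiledb.map -o test.png
```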
We have tested this mapfile on an AWS m5a.2xlarge instance and successfully created a map from a query to a TileDB array stored on S3.