The TileDB GDAL driver enables efficient, indexed access to whole and partial tiles of geospatial raster data. The Python community typically uses the Rasterio library to access GDAL drivers. Here we show how to ingest large raster data into TileDB in parallel using GDAL, Rasterio, xarray, and Dask.
To highlight how TileDB works with large dense arrays we will use a dataset from the Sentinel-2 mission. You can either register and download a sample yourself from the Copernicus hub or use the requester-pays bucket from the AWS Open Data program; the latter is preferable if you wish to run your code on AWS using public data. The directory size of the Sentinel-2 image we are using is 788 MB.
You can run the Rasterio code as follows:
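The exact installation commands are not reproduced here; one way to set up a suitable environment (a sketch assuming the conda-forge packages for the tools used in this tutorial) is:

```bash
conda install -c conda-forge rasterio dask xarray tiledb-py
```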
To verify that the Sentinel-2 dataset can be read by our installation of Rasterio, run the following (changing the filename):
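A minimal verification script might look like the following; the product path is hypothetical, and GDAL's SENTINEL2 driver opens the product through its MTD_MSIL1C.xml metadata file:

```python
import rasterio

# Hypothetical path: point this at your downloaded product's metadata file.
with rasterio.open("<product>.SAFE/MTD_MSIL1C.xml") as src:
    # Each subdataset corresponds to one resolution/CRS combination.
    for name in src.subdatasets:
        print(name)
```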
Rasterio should successfully print the metadata of the subdatasets within the Sentinel-2 dataset.
To ingest the Sentinel-2 data into TileDB with Rasterio, run:
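A minimal sketch of such an ingestion, assuming your GDAL build includes the TileDB driver and using a hypothetical subdataset URI taken from the previous step:

```python
import rasterio

# Hypothetical subdataset URI from src.subdatasets in the previous step.
SRC = "SENTINEL2_L1C:<product>.SAFE/MTD_MSIL1C.xml:10m:EPSG_32633"
DST = "s2_10m_array"  # output TileDB array (a local path or S3 URI)

with rasterio.open(SRC) as src:
    profile = src.profile
    profile.update(driver="TileDB")  # route the write through GDAL's TileDB driver
    with rasterio.open(DST, "w", **profile) as dst:
        dst.write(src.read())
```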
A faster way to ingest large raster data into TileDB and take advantage of TileDB's parallel writes is by using Dask. Below we provide a detailed example of how to do this:
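The following is a sketch of one possible approach: it creates a dense 2D TileDB array matching the raster's shape directly with TileDB-Py, then uses Dask tasks to read source windows with Rasterio and write them to disjoint regions of the array in parallel. The input path, array name, and tile size are assumptions:

```python
import dask
import numpy as np
import rasterio
from rasterio.windows import Window
import tiledb

SRC = "sentinel2_band.jp2"   # hypothetical input raster
DST = "s2_tiledb_array"      # hypothetical output TileDB array
TILE = 1024                  # edge length of each write chunk

# Read the raster's shape and data type.
with rasterio.open(SRC) as src:
    height, width = src.height, src.width
    dtype = np.dtype(src.dtypes[0])

# Create a dense 2D array whose domain matches the raster.
dom = tiledb.Domain(
    tiledb.Dim(name="y", domain=(0, height - 1), tile=min(TILE, height), dtype=np.uint64),
    tiledb.Dim(name="x", domain=(0, width - 1), tile=min(TILE, width), dtype=np.uint64),
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=False,
    attrs=[tiledb.Attr(name="band", dtype=dtype)],
)
tiledb.Array.create(DST, schema)

@dask.delayed
def ingest_window(row, col):
    # Each task reads one window from the source and writes the matching
    # slice; TileDB supports concurrent writes to disjoint regions.
    h = min(TILE, height - row)
    w = min(TILE, width - col)
    with rasterio.open(SRC) as src:
        block = src.read(1, window=Window(col, row, w, h))
    with tiledb.open(DST, "w") as arr:
        arr[row:row + h, col:col + w] = block

tasks = [
    ingest_window(r, c)
    for r in range(0, height, TILE)
    for c in range(0, width, TILE)
]
dask.compute(*tasks)
```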
You can run PDAL and TileDB code in Python using conda environment packages.
If the Python bindings are not required, the pdal conda package can be used instead.
Both packages provide up-to-date PDAL and TileDB libraries for your conda environment.
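For example (package names as published on conda-forge):

```bash
# Python bindings plus the PDAL and TileDB libraries:
conda install -c conda-forge python-pdal
# PDAL command-line tools only:
conda install -c conda-forge pdal
```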
First, create a TileDB config file tiledb.config, where you can set any TileDB configuration parameter (e.g., AWS keys if you would like to write to a TileDB array on S3). Make sure you also add the following, as TileDB does not currently handle duplicate points (this will change in a future version).
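A minimal tiledb.config might therefore look like the following. TileDB config files contain one "parameter value" pair per line; we assume the deduplication parameter in question is sm.dedup_coords, and S3 keys (e.g., vfs.s3.aws_access_key_id) can be added as extra lines:

```
sm.dedup_coords true
```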
Then create a PDAL pipeline that translates some LAS data to a TileDB array by storing the following in a file called pipeline.json:
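A minimal pipeline might look like this; the input filename is hypothetical, while array_name and config_file are documented options of writers.tiledb:

```json
{
  "pipeline": [
    {
      "type": "readers.las",
      "filename": "sample.las"
    },
    {
      "type": "writers.tiledb",
      "array_name": "sample_array",
      "config_file": "tiledb.config"
    }
  ]
}
```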
See the PDAL documentation for information on the available options of the TileDB PDAL writer. You can execute the pipeline with PDAL, which will carry out the ingestion, as follows:
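For example:

```bash
pdal pipeline pipeline.json
```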
We now have points and attributes stored in an array called sample_array. This write uses the streaming mode of PDAL.
You can view this sample_array directly from TileDB as follows (we demonstrate using TileDB's Python API, but any other API would work as well):
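A sketch of reading the array back with TileDB-Py; the dimension names X/Y/Z and the Intensity attribute are assumptions based on standard LAS dimensions:

```python
import tiledb

with tiledb.open("sample_array") as arr:
    print(arr.schema)             # dimensions and attributes
    print(arr.nonempty_domain())  # bounding box of the stored points

    # Read all points and one attribute over the non-empty domain;
    # multi_index bounds are inclusive on both ends.
    (x0, x1), (y0, y1), _ = arr.nonempty_domain()
    region = arr.query(attrs=["Intensity"]).multi_index[x0:x1, y0:y1]
    print(len(region["X"]), "points read")
```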
PDAL is single-threaded, but coupled with TileDB's parallel write support it becomes a powerful tool for ingesting enormous amounts of point cloud data into TileDB. The PDAL driver supports appending to an existing dataset, and we use this with Dask to perform a parallel update.
We demonstrate parallel ingestion with the code below. Make sure to remove or move the sample_array created in the previous example.
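A sketch of one way to do this with python-pdal and Dask, assuming a hypothetical list of input LAS files: the first write creates the array, and subsequent writes set the documented append option of writers.tiledb:

```python
import json
import dask
import pdal  # python-pdal bindings

FILES = ["tile1.las", "tile2.las", "tile3.las"]  # hypothetical inputs
ARRAY = "sample_array"

def make_pipeline(filename, append):
    # Build a reader -> writers.tiledb pipeline for one input file.
    return json.dumps({
        "pipeline": [
            {"type": "readers.las", "filename": filename},
            {"type": "writers.tiledb",
             "array_name": ARRAY,
             "config_file": "tiledb.config",
             "append": append},
        ]
    })

# The first file creates the array; the rest append to it in parallel.
pdal.Pipeline(make_pipeline(FILES[0], False)).execute()

@dask.delayed
def ingest(filename):
    pdal.Pipeline(make_pipeline(filename, True)).execute()

dask.compute(*[ingest(f) for f in FILES[1:]])
```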
Although the TileDB driver is parallel (i.e., it uses multiple threads for decompression and I/O), PDAL itself is single-threaded, so some tasks may benefit from additional acceleration. Take, for instance, the following PDAL command, which counts the number of points in the dataset using the TileDB driver.
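The original command is not reproduced here; one possible equivalent is a small pipeline (saved as count.json) that reads the array through readers.tiledb and computes statistics with filters.stats:

```json
{
  "pipeline": [
    { "type": "readers.tiledb", "array_name": "sample_array",
      "config_file": "tiledb.config" },
    { "type": "filters.stats" }
  ]
}
```

```bash
time pdal pipeline count.json --metadata count_meta.json
# The point count appears in the filters.stats entry of count_meta.json.
```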
We can write a simple script in Python with Dask and direct access to TileDB to perform the same operation completely in parallel:
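A sketch under the assumption that the array's first dimension is X: partition the non-empty X extent into disjoint ranges and count the coordinates in each range in parallel:

```python
import dask
import numpy as np
import tiledb

ARRAY = "sample_array"  # array produced by the PDAL pipeline above
N_CHUNKS = 8            # number of parallel partitions

# Find the non-empty extent of the first (X) dimension.
with tiledb.open(ARRAY) as arr:
    x_min, x_max = arr.nonempty_domain()[0]

edges = np.linspace(x_min, x_max, N_CHUNKS + 1)

@dask.delayed
def count_points(lo, hi):
    # multi_index bounds are inclusive, so callers shift the lower bound
    # of every partition after the first to avoid double counting.
    with tiledb.open(ARRAY) as arr:
        result = arr.query(attrs=[]).multi_index[lo:hi]
        return len(result["X"])

counts = [
    count_points(edges[i] if i == 0 else np.nextafter(edges[i], np.inf),
                 edges[i + 1])
    for i in range(N_CHUNKS)
]
print(sum(dask.compute(*counts)))
```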
In both cases we get the answer 31530863 (for a 750 MB compressed array). We timed the single-threaded PDAL command with time on an m5a.2xlarge machine on AWS.
The above Python script using Dask is significantly faster.
TileDB supports a number of geospatial operations in SQL through the MariaDB integration. Please see the MariaDB documentation for more information: MariaDB Usage - Geospatial.
GDAL is a translator library for raster and vector geospatial data formats, and a TileDB raster driver has been supported since GDAL 3.0. You can run the GDAL code as follows:
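For example, using a conda-forge build of GDAL 3.0 or later (we assume the build enables the TileDB driver):

```bash
conda install -c conda-forge 'gdal>=3.0'
```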
To confirm that the TileDB driver is available, run:
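```bash
gdalinfo --formats | grep -i tiledb
# The output should include a TileDB raster entry.
```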
You can list the supported options of the TileDB driver with the command below.
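```bash
gdalinfo --format TileDB
# Prints the driver metadata, including its creation options.
```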
Download this simple GeoTIFF image. Simply run:
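For example (the input filename is hypothetical):

```bash
gdal_translate -of TileDB input.tif <array-name>
```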
This will create a new TileDB array called <array-name> and ingest the GeoTIFF image as a TileDB 2D dense array with a simple attribute that stores the grayscale value of each pixel. Note that the array name can also be an S3 URI. In that case, you would need to create an aws.config file and add your S3 keys in the following way:
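TileDB config files contain one "parameter value" pair per line; a minimal aws.config using TileDB's S3 parameters might look like:

```
vfs.s3.aws_access_key_id <your-access-key-id>
vfs.s3.aws_secret_access_key <your-secret-access-key>
```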
Then, you can ingest directly into a TileDB array on S3 as follows:
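For example, passing the config file through the driver's TILEDB_CONFIG creation option (bucket and names are placeholders):

```bash
gdal_translate -of TileDB -co TILEDB_CONFIG=aws.config input.tif s3://<bucket>/<array-name>
```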
After ingesting the array, you can get its info with:
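```bash
gdalinfo <array-name>
# For an array on S3, pass the config file as an open option:
gdalinfo -oo TILEDB_CONFIG=aws.config s3://<bucket>/<array-name>
```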
Finally, you can check if the values in the TileDB array are the same as in the original GeoTIFF file:
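One way to check this (a sketch using Rasterio and NumPy; the paths are hypothetical):

```python
import numpy as np
import rasterio

# Hypothetical paths: the original GeoTIFF and the ingested TileDB array.
with rasterio.open("input.tif") as tif, rasterio.open("<array-name>") as arr:
    assert np.array_equal(tif.read(), arr.read()), "pixel values differ"
print("The TileDB array matches the original GeoTIFF.")
```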
If everything worked correctly, the values in the TileDB array will be identical to those in the original GeoTIFF file.
MapServer is an open source platform for publishing spatial data and interactive mapping applications to the web. MapServer allows you to render data from TileDB arrays and combine other sources and formats such as GeoJSON to render cartographic quality maps.
You can run the MapServer and TileDB examples as follows:
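For example, MapServer is available from conda-forge:

```bash
conda install -c conda-forge mapserver
```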
We will use the following MapServer mapfile:
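The original mapfile is not reproduced here; a minimal sketch (the extent, bucket, and paths are placeholders) that renders a TileDB array through CONNECTIONOPTIONS might look like:

```
MAP
  NAME "tiledb-demo"
  SIZE 800 600
  EXTENT -180 -90 180 90   # replace with the extent of your imagery
  IMAGETYPE PNG

  LAYER
    NAME "imagery"
    TYPE RASTER
    STATUS ON
    DATA "s3://<bucket>/<array-name>"
    # CONNECTIONOPTIONS passes GDAL open options (MapServer 8.0+).
    CONNECTIONOPTIONS
      "TILEDB_CONFIG" "/secure/path/aws.config"
    END
  END
END
```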
We will also use sample data from https://coast.noaa.gov/dataviewer/#/, in this case the 2017 NOAA NGS Ortho-rectified Oblique Imagery of the East Coast.
The use of CONNECTIONOPTIONS requires the MapServer 8.0 release or later.
The following GDAL commands are used to produce a single TileDB array from multiple sources.
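For example (the filenames are hypothetical): build a virtual mosaic of the inputs, then translate it once into a single TileDB array:

```bash
# Mosaic the source images into a virtual raster, then write it out
# as a single TileDB array.
gdalbuildvrt mosaic.vrt tile_*.tif
gdal_translate -of TileDB -co TILEDB_CONFIG=aws.config mosaic.vrt s3://<bucket>/<array-name>
```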
As in our TileDB and GDAL tutorials, we store the S3 credentials in an aws.config file. Note that this aws.config file should be stored in a location that is accessible by your web server but is not public.
To test rendering a map with MapServer, we use the shp2img command:
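For example (the mapfile and output names are hypothetical; note that shp2img was renamed map2img in MapServer 8.0):

```bash
shp2img -m tiledb.map -o test.png
```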
We have tested this mapfile on an AWS m5a.2xlarge instance and successfully created a map from a query to a TileDB array stored on S3.