Rasterio
The TileDB GDAL driver enables indexed and efficient access to integral and partial tiles of geospatial data. The Python community typically use the Rasterio library to access GDAL drivers. Here we show how to ingest large raster data into TileDB in parallel using GDAL, Rasterio, xarray and Dask.
To highlight how TileDB works with large dense arrays we will use a dataset from the Sentinel-2 mission. You can either register and download a sample yourself from the Copernicus hub or use the requestor pays bucket from the AWS open data program, the latter being preferable if you wish to run your code on AWS using public data. The directory size of the Sentinel-2 image we are using is 788 MB.

Installation

You can run the Rasterio code as follows:
1
docker run -it --rm -u 0 -v /local/path:/data tiledb/tiledb-geospatial /bin/bash
Copied!
To verify that the Sentinel-2 dataset can be read by our installation of Rasterio,run the following (changing the filename):
1
rio info --subdatasets S2A_MSIL1C_20190829T163901_N0208_R126_T16TDM_20190829T201831.SAFE/MTD_MSIL1C.xml
Copied!
Rasterio should successfully print out the metadata about the subdatasets within the the Sentinel-2 dataset as follows:
1
:
2
:
3
SENTINEL2_L1C:S2A_MSIL1C_20190829T163901_N0208_R126_T16TDM_20190829T201831.SAFE/MTD_MSIL1C.xml:10m:EPSG_32616
4
SENTINEL2_L1C:S2A_MSIL1C_20190829T163901_N0208_R126_T16TDM_20190829T201831.SAFE/MTD_MSIL1C.xml:20m:EPSG_32616
5
SENTINEL2_L1C:S2A_MSIL1C_20190829T163901_N0208_R126_T16TDM_20190829T201831.SAFE/MTD_MSIL1C.xml:60m:EPSG_32616
6
SENTINEL2_L1C:S2A_MSIL1C_20190829T163901_N0208_R126_T16TDM_20190829T201831.SAFE/MTD_MSIL1C.xml:TCI:EPSG_32616
Copied!

Ingestion

To ingest the Sentinel-2 data into TileDB with Rasterio, run:
1
rio convert -f TileDB --co BLOCKXSIZE=1024 --co BLOCKYSIZE=1024 SENTINEL2_L1C:S2A_MSIL1C_20190829T163901_N0208_R126_T16TDM_20190829T201831.SAFE/MTD_MSIL1C.xml:10m:EPSG_32616 single.tiledb
Copied!
A faster way to ingest large raster data into TileDB and take advantage of TileDB's parallel writes is by using Dask. Below we provide a detailed example on how to do this:
Python
1
import math
2
import os
3
import sys
4
5
import dask
6
from dask.diagnostics import ProgressBar
7
import numpy as np
8
9
import rasterio
10
from rasterio.transform import Affine
11
import tiledb
12
import xarray as xr
13
import xml.etree.ElementTree as ET
14
15
16
pbar = ProgressBar()
17
pbar.register()
18
19
def translate(_input, output):
20
with rasterio.open(_input) as src:
21
profile = src.profile
22
trans = Affine.to_gdal(src.transform)
23
dt = np.dtype(src.dtypes[0]) # read first band data type
24
25
# no support for to_rasterio in xarray so we are going to write the metadata file # noqa
26
profile['driver'] = 'TileDB'
27
tile_x_size = 1024
28
tile_y_size = 1024
29
w = profile['width']
30
h = profile['height']
31
nBlocksX = math.ceil(w / (tile_x_size * 1.0))
32
nBlocksY = math.ceil(h / (tile_y_size * 1.0))
33
34
arr = xr.open_rasterio(_input,
35
chunks={'x': tile_x_size, 'y': tile_y_size})
36
37
# # create output TileDB dataset
38
dom = tiledb.Domain(
39
tiledb.Dim(name='BANDS', domain=(0, profile['count'] - 1),
40
tile=1),
41
tiledb.Dim(name='Y', domain=(0, (nBlocksY * tile_y_size) - 1),
42
tile=tile_y_size, dtype=np.uint64),
43
tiledb.Dim(name='X', domain=(0, (nBlocksX * tile_x_size) - 1),
44
tile=tile_x_size, dtype=np.uint64))
45
46
schema = tiledb.ArraySchema(domain=dom, sparse=False,
47
attrs=[tiledb.Attr(name='TDB_VALUES',
48
dtype=dt)])
49
50
tiledb.DenseArray.create(output, schema)
51
52
with dask.config.set(scheduler='processes'):
53
with tiledb.DenseArray(output, 'w') as arr_output:
54
arr.data.to_tiledb(arr_output)
55
56
# add minimal metadata to read the file
57
vfs = tiledb.VFS()
58
meta = f"{output}/{os.path.basename(output)}.tdb.aux.xml"
59
try:
60
f = vfs.open(meta, 'w')
61
root = ET.Element('PAMDataset')
62
geo = ET.SubElement(root, 'GeoTransform')
63
geo.text = ', '.join(map(str, trans))
64
meta = ET.SubElement(root, 'Metadata')
65
meta.set('domain', 'IMAGE_STRUCTURE')
66
xsize = ET.SubElement(meta, 'MDI')
67
xsize.set('key', 'X_SIZE')
68
xsize.text = str(w)
69
ysize = ET.SubElement(meta, 'MDI')
70
ysize.set('key', 'Y_SIZE')
71
ysize.text = str(h)
72
dtype = ET.SubElement(meta, 'MDI')
73
dtype.set('key', 'DATA_TYPE')
74
dtype.text = profile['dtype']
75
vfs.write(f, ET.tostring(root))
76
finally:
77
vfs.close(f)
78
79
80
if __name__ == "__main__":
81
translate(sys.argv[1], sys.argv[2])
Copied!
Last modified 13d ago
Copy link