The TileDB GDAL driver enables efficient, indexed access to both whole and partial tiles of geospatial data. The Python community typically uses the Rasterio library to access GDAL drivers. Here we show how to ingest large raster data into TileDB in parallel using GDAL, Rasterio, xarray, and Dask.
To highlight how TileDB works with large dense arrays, we will use a dataset from the Sentinel-2 mission. You can either register and download a sample yourself from the Copernicus Open Access Hub or use the requester-pays bucket from the AWS Open Data program; the latter is preferable if you wish to run your code on AWS against public data. The directory size of the Sentinel-2 image we are using is 788 MB.
To ingest the Sentinel-2 data into TileDB with Rasterio, run:
rio convert -f TileDB --co BLOCKXSIZE=1024 --co BLOCKYSIZE=1024 SENTINEL2_L1C:S2A_MSIL1C_20190829T163901_N0208_R126_T16TDM_20190829T201831.SAFE/MTD_MSIL1C.xml:10m:EPSG_32616 single.tiledb
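To see what the BLOCKXSIZE/BLOCKYSIZE creation options imply, here is a small sketch of the tiling arithmetic. The 10980 × 10980 size is the standard dimension of a Sentinel-2 10 m band; adjust for your own data:

```python
import math

width = height = 10980  # pixels in a Sentinel-2 10 m band
tile_x = tile_y = 1024  # BLOCKXSIZE / BLOCKYSIZE creation options

# Number of tiles the driver writes along each axis (partial edge tiles round up)
tiles_x = math.ceil(width / tile_x)
tiles_y = math.ceil(height / tile_y)
print(tiles_x, tiles_y, tiles_x * tiles_y)  # 11 11 121
```

Each of these 121 tiles per band can then be read back independently, which is what makes windowed access to the array efficient.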
A faster way to ingest large raster data into TileDB, taking advantage of TileDB's parallel writes, is to use Dask. Below is a detailed example of how to do this:
import math
import os
import sys
import xml.etree.ElementTree as ET

import dask
from dask.diagnostics import ProgressBar
import numpy as np
import rasterio
from rasterio.transform import Affine
import tiledb
import xarray as xr

pbar = ProgressBar()
pbar.register()


def translate(_input, output):
    with rasterio.open(_input) as src:
        profile = src.profile
        trans = Affine.to_gdal(src.transform)
        dt = np.dtype(src.dtypes[0])  # read first band data type

        # no support for to_rasterio in xarray so we are going to write the metadata file
        profile['driver'] = 'TileDB'
        tile_x_size = 1024
        tile_y_size = 1024
        w = profile['width']
        h = profile['height']
        nBlocksX = math.ceil(w / (tile_x_size * 1.0))
        nBlocksY = math.ceil(h / (tile_y_size * 1.0))

        arr = xr.open_rasterio(_input, chunks={'x': tile_x_size, 'y': tile_y_size})

        # create output TileDB dataset
        dom = tiledb.Domain(
            tiledb.Dim(name='BANDS', domain=(0, profile['count'] - 1), tile=1),
            tiledb.Dim(name='Y', domain=(0, (nBlocksY * tile_y_size) - 1),
                       tile=tile_y_size, dtype=np.uint64),
            tiledb.Dim(name='X', domain=(0, (nBlocksX * tile_x_size) - 1),
                       tile=tile_x_size, dtype=np.uint64))
        schema = tiledb.ArraySchema(
            domain=dom, sparse=False,
            attrs=[tiledb.Attr(name='TDB_VALUES', dtype=dt)])
        tiledb.DenseArray.create(output, schema)

        with dask.config.set(scheduler='processes'):
            with tiledb.DenseArray(output, 'w') as arr_output:
                arr.data.to_tiledb(arr_output)

        # add minimal metadata to read the file
        vfs = tiledb.VFS()
        meta = f"{output}/{os.path.basename(output)}.tdb.aux.xml"
        try:
            f = vfs.open(meta, 'w')
            root = ET.Element('PAMDataset')
            geo = ET.SubElement(root, 'GeoTransform')
            geo.text = ', '.join(map(str, trans))
            meta = ET.SubElement(root, 'Metadata')
            meta.set('domain', 'IMAGE_STRUCTURE')
            xsize = ET.SubElement(meta, 'MDI')
            xsize.set('key', 'X_SIZE')
            xsize.text = str(w)
            ysize = ET.SubElement(meta, 'MDI')
            ysize.set('key', 'Y_SIZE')
            ysize.text = str(h)
            dtype = ET.SubElement(meta, 'MDI')
            dtype.set('key', 'DATA_TYPE')
            dtype.text = profile['dtype']
            vfs.write(f, ET.tostring(root))
        finally:
            vfs.close(f)


if __name__ == "__main__":
    translate(sys.argv[1], sys.argv[2])
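Because xarray's TileDB write path does not carry georeferencing, the script above writes a GDAL PAM sidecar (the .tdb.aux.xml file) recording the geotransform and the true raster size. A minimal sketch of reading such a sidecar back with the standard library — the XML content below mirrors what translate() writes, with illustrative geotransform values:

```python
import xml.etree.ElementTree as ET

# A sidecar like the one translate() writes (geotransform values are illustrative)
xml = """<PAMDataset>
  <GeoTransform>399960.0, 10.0, 0.0, 4500000.0, 0.0, -10.0</GeoTransform>
  <Metadata domain="IMAGE_STRUCTURE">
    <MDI key="X_SIZE">10980</MDI>
    <MDI key="Y_SIZE">10980</MDI>
    <MDI key="DATA_TYPE">uint16</MDI>
  </Metadata>
</PAMDataset>"""

root = ET.fromstring(xml)
gt = [float(v) for v in root.findtext('GeoTransform').split(',')]
size = {m.get('key'): m.text for m in root.iter('MDI')}
print(gt[1], size['X_SIZE'], size['DATA_TYPE'])  # 10.0 10980 uint16
```

The X_SIZE/Y_SIZE entries matter because the TileDB domain is padded up to a whole number of tiles; a reader needs them to clip the array back to the actual image extent.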