Architecture
TileDB Open Source is a powerful open-source storage engine that implements the array data model and enforces an efficient on-disk data format.
TileDB is written in C++, but it is designed with extreme interoperability in mind. Specifically, TileDB exposes numerous APIs (such as C, C#, Python, R, Go and Java) and integrates with Apache Arrow. In addition, TileDB uses these APIs and the Arrow integration to further integrate with machine learning tools (such as TensorFlow, Keras and PyTorch), distributed computing frameworks (such as Spark and Dask), SQL engines (such as MariaDB, Presto and Trino), and even domain-specific tools (such as Hail, PDAL and GDAL). We expect the TileDB APIs and integrations to continue to grow.
TileDB is also optimized for a variety of storage backends. It works on any POSIX filesystem (e.g., Lustre) and on HDFS, but extra care has been taken in both the data format and the engine to make it ideal for object stores, such as AWS S3, Azure Blob Storage, Google Cloud Storage and MinIO. Finally, TileDB supports a special RAM backend that stores and accesses arrays exclusively in memory, for latency-sensitive applications.
Beyond efficient array storage and access, TileDB offers the following important features:
Versioning and time traveling via immutable writes (along with consolidation options)
Filter condition push-downs, moving compute closer to the data for improved performance
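To make the idea of a filter condition push-down concrete, here is a minimal conceptual sketch in plain Python. It does not use the TileDB API: `ToyEngine`, `read`, `cond` and `cells_returned` are hypothetical names invented for this illustration. It only shows why evaluating the condition inside the engine reduces the amount of data handed back to the caller.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Cell = Tuple[int, float]  # (coordinate, attribute value)


class ToyEngine:
    """Hypothetical stand-in for a storage engine; NOT the TileDB API."""

    def __init__(self, cells: Iterable[Cell]) -> None:
        self._cells: List[Cell] = list(cells)
        # Counts how many cells crossed the engine boundary to the caller.
        self.cells_returned = 0

    def read(self, cond: Optional[Callable[[float], bool]] = None) -> List[Cell]:
        """Return all cells, or only those whose value satisfies `cond`."""
        if cond is None:
            result = list(self._cells)
        else:
            # "Pushed-down" filter: evaluated during the scan, inside the engine.
            result = [cell for cell in self._cells if cond(cell[1])]
        self.cells_returned += len(result)
        return result


def is_large(value: float) -> bool:
    return value >= 990.0


cells = [(i, float(i)) for i in range(1000)]

# Without push-down: the engine returns everything, the caller filters.
engine_a = ToyEngine(cells)
client_side = [cell for cell in engine_a.read() if is_large(cell[1])]

# With push-down: the engine applies the condition and returns only matches.
engine_b = ToyEngine(cells)
pushed_down = engine_b.read(cond=is_large)

print(client_side == pushed_down)                        # True
print(engine_a.cells_returned, engine_b.cells_returned)  # 1000 10
```

Both paths produce the same result, but the pushed-down read hands back 10 cells instead of 1000. In a real deployment the saving would show up as less data copied, decoded or sent over the network; the exact mechanics inside TileDB are not modeled here.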
We are working on pushing more and more computational operations down to the storage engine (such as aggregates, group-bys, linear algebra operations and more).
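The versioning and time-traveling feature listed above can be pictured with a small toy model. The sketch below is plain Python and is not the TileDB API or its on-disk format: `ToyVersionedArray`, `write`, `read`, `consolidate` and the integer timestamps are all assumptions made for illustration. Each write is stored as a separate immutable batch, and a read replays the batches up to a chosen timestamp.

```python
from typing import Dict, List, Optional, Tuple

# One immutable write: (timestamp, {cell index: value}).
Fragment = Tuple[int, Dict[int, int]]


class ToyVersionedArray:
    """Hypothetical model of immutable, timestamped writes; NOT the TileDB API."""

    def __init__(self) -> None:
        self._fragments: List[Fragment] = []

    def write(self, timestamp: int, cells: Dict[int, int]) -> None:
        # Append-only: earlier fragments are never modified in place.
        self._fragments.append((timestamp, dict(cells)))

    def read(self, at: Optional[int] = None) -> Dict[int, int]:
        """Return the array as of timestamp `at` (latest state if None)."""
        view: Dict[int, int] = {}
        # Replay fragments in timestamp order; later writes win per cell.
        for ts, cells in sorted(self._fragments, key=lambda frag: frag[0]):
            if at is None or ts <= at:
                view.update(cells)
        return view

    def consolidate(self) -> None:
        """Merge all fragments into one, stamped with the latest timestamp.

        Simplification: this toy discards the merged fragments, so earlier
        versions are no longer readable afterwards.
        """
        if not self._fragments:
            return
        latest = max(ts for ts, _ in self._fragments)
        self._fragments = [(latest, self.read())]


arr = ToyVersionedArray()
arr.write(1, {0: 1, 1: 2, 2: 3})
arr.write(2, {0: 10})

print(arr.read(at=1))  # {0: 1, 1: 2, 2: 3}
print(arr.read())      # {0: 10, 1: 2, 2: 3}

arr.consolidate()
print(arr.read())      # {0: 10, 1: 2, 2: 3}
print(arr.read(at=1))  # {}
```

Because writes are never overwritten, reading "as of" an earlier timestamp only requires ignoring the later batches. The `consolidate` step shows the trade-off in this toy model only: fewer batches to replay, at the cost of the older versions. The real engine's consolidation behavior and options are not modeled here.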