Unlike traditional databases, where clients access data via SQL and JDBC/ODBC connectors, TileDB allows for direct, fast, and scalable access with familiar tools. It also transparently pushes computations from higher-level applications down to storage, optimizing overall performance.
TileDB is an embeddable library built in C and C++ that comes with a variety of APIs and integrations:
APIs: C, C++, Python (numpy, pandas), R, Java, Go
SQL engines: PrestoDB, MariaDB
Parallel frameworks: Dask, Spark
Geospatial: PDAL, GDAL, Rasterio
TileDB takes particular care to make the various APIs and integrations efficient. More specifically, it minimizes the amount of data that must be copied or converted from one format to another on the path from the storage backend (e.g., AWS S3) to the memory space of the end application that consumes it (e.g., Python numpy), and vice versa. This is in contrast to traditional databases, where the user must submit a SQL query, retrieve the results via ODBC/JDBC connectors, and convert the data to a format that the end tool understands.
TileDB's bindings allow the user to continue working with familiar tooling in the Data Science ecosystem, such as Python numpy/pandas, Spark, Dask, GDAL, SQL engines, etc., without being forced to learn a new language, use unfamiliar APIs, or sacrifice performance.
TileDB's scalability stems from its thread-/process-safe reads and writes that do not require any synchronization or locking, and which can be carried out by numerous independent machines in an embarrassingly parallel manner. This is unlike most traditional distributed databases that require the client process to communicate with a single master node (typically in SQL via ODBC/JDBC) to send and retrieve data.
The figure below demonstrates TileDB's distributed architecture for the Developer and Cloud offerings. In TileDB Developer, the distributed engine of your choice (e.g., Spark, Dask, PrestoDB) uses the embeddable TileDB library to perform parallel IO to a scalable storage backend (e.g., AWS S3). In TileDB Cloud, all IO requests go through an elastic REST service (so that access policies can be properly enforced), but each request is serviced by a different node that directly communicates with the caller process.
One of the exciting features we are working on is identifying common operations in all the Data Science tools and pushing them down to the TileDB engine itself, completely transparently to the user. For example, when you submit a query in PrestoDB, Spark or MariaDB, the TileDB connector detects slicing on dimensions and projections on attributes, and pushes them down to its engine.
We are planning to do the same with filters, joins, and group-by queries. TileDB can implement such operations in C++ with numerous optimizations, such as vectorization, multi-threaded parallelism, and hardware acceleration. Moreover, less data may need to be communicated between the storage engine and the higher-level applications, improving performance and minimizing egress costs in TileDB Cloud. Pushing compute to a common storage layer allows all tools integrating with TileDB to inherit these performance benefits without having to rebuild such optimizations from scratch.
We are also working on the more challenging case of complex distributed computation, such as SQL joins/group-by queries and linear algebra operations. We are exploring ways to modify the execution plan so that it scales out with intelligent partitioning based on TileDB data statistics.