The TileDB model and format pave the way for efficient data management, but they are not enough on their own. A storage engine is needed to implement the features they enable and to deliver performance through parallelism, careful engineering around how the format is used, and efficient interoperability with higher-level compute layers.
To this end, we built a C++ library with the following goals in mind:
  • Fast multi-threaded writes (from multiple input layouts)
  • Fast multi-threaded reads (into multiple output layouts)
  • Atomicity, concurrency and (eventual) consistency of interleaved reads and writes, following a multiple writer / multiple reader model without locking or coordination
  • Time traveling
  • Consolidation and vacuuming
  • Numerous efficient APIs and integrations with a growing set of SQL engines and data science tools, adopting zero-copy techniques wherever possible
  • A modular and extensible design to support a growing set of storage backends (local disk, S3, GCS, HDFS, and more)
  • Backwards compatibility as the TileDB format gets improved with new versions
  • An embedded and, therefore, serverless core C++ library
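Several of the goals above (multiple writers without locking, time traveling, consolidation and vacuuming) fit together naturally if every write is stored as an immutable, timestamped batch. The Python sketch below is a toy model under that assumption. `ToyArray`, `Fragment` and their methods are hypothetical names chosen for illustration; they are not the TileDB API and say nothing about how the C++ library is implemented.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True, eq=False)
class Fragment:
    """One immutable batch of written cells (hypothetical, for illustration)."""
    timestamp: int
    cells: tuple  # ((coordinate, value), ...) pairs


@dataclass
class ToyArray:
    """Toy append-only array: a write only ever adds a new fragment."""
    fragments: list = field(default_factory=list)

    def write(self, timestamp, cells):
        # Existing fragments are never modified, so a writer in this model
        # does not have to coordinate with readers or other writers.
        self.fragments.append(Fragment(timestamp, tuple(cells.items())))

    def read(self, at=None):
        # Time traveling: ignore every fragment written after `at`.
        visible = [f for f in self.fragments if at is None or f.timestamp <= at]
        result = {}
        for fragment in sorted(visible, key=lambda f: f.timestamp):
            result.update(fragment.cells)  # later writes win per coordinate
        return result

    def consolidate(self):
        # Merge all fragments into one new fragment and return the old ones,
        # which stay readable until they are vacuumed.
        if len(self.fragments) < 2:
            return []
        old = list(self.fragments)
        newest = max(f.timestamp for f in old)
        self.fragments.append(Fragment(newest, tuple(self.read().items())))
        return old

    def vacuum(self, obsolete):
        # Drop fragments that a previous consolidation made redundant.
        self.fragments = [f for f in self.fragments if f not in obsolete]


array = ToyArray()
array.write(1, {0: 10, 1: 20})
array.write(2, {1: 21, 2: 30})
latest = array.read()          # sees both writes
as_of_1 = array.read(at=1)     # sees only the first write
array.vacuum(array.consolidate())
```

In this toy, vacuuming trades history for fewer fragments: once the original fragments are gone, a read at timestamp 1 can no longer reconstruct the earlier state.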
The figure below shows the TileDB Embedded architecture. The open-source core library handles all data storage and access, and exposes a C and a C++ API. The remaining APIs are built on top of these two. The integrations with SQL engines and other tools also use these APIs, and we zero-copy wherever possible: we write the sliced results directly into the memory buffers exposed by the higher-level application, which avoids extra data copies and improves performance. We recently published our integration with Apache Arrow, so expect the set of TileDB integrations to grow.
We intentionally designed the library to be embedded and, therefore, serverless. This lets you use it through any of the APIs without spinning up a cluster, and also integrate it with distributed systems (such as Spark, Dask and Presto) when you need extra scalability. TileDB Embedded, along with its APIs and integrations, is open-source.
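The idea of writing sliced results directly into a buffer owned by the calling application can be illustrated with a small, standard-library-only Python sketch. It is a conceptual toy, not the TileDB API: `read_slice_into`, `stored` and `app_buffer` are hypothetical names, and a one-dimensional in-memory array of 64-bit integers stands in for the stored data.

```python
from array import array


def read_slice_into(stored, start, stop, out):
    """Write stored[start:stop] straight into the caller-owned view `out`.

    `stored` and `out` are memoryviews with the same item format. Slicing a
    memoryview creates no intermediate copy, so the data moves exactly once,
    from storage into the caller's memory.
    """
    count = stop - start
    if len(out) < count:
        raise ValueError("output buffer too small")
    out[:count] = stored[start:stop]
    return count


# "Storage": ten 64-bit integers, exposed as a typed view.
stored = memoryview(array("q", range(10)))

# The application allocates its own buffer and exposes it to the reader.
item_size = array("q").itemsize
app_buffer = bytearray(4 * item_size)
out = memoryview(app_buffer).cast("q")

written = read_slice_into(stored, 2, 6, out)
values = out.tolist()  # the application reads results from its own memory
```

The design point is that the caller, not the reader, owns the destination memory, so the result never passes through a temporary object that would have to be copied again.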