The Solution

The need for a universal data engine

The challenges we discussed in The Problem section call for a bold, groundbreaking and holistic approach. We are the first to observe that data management is actually not very different across application domains, however different the data and terminology may look. Therefore, we decided to invent one data model, one data format and one storage engine to provide a foundational data management solution for all verticals. On top of those, we built extra management layers for access control and logging, in order to support data sharing and collaboration at planet scale. Our efforts led to the creation of a powerful serverless platform for designing and executing any distributed algorithm, which offers ease of use, performance, and low cost. We explain these contributions below.

A universal data format and storage engine

Regardless of the application domain and data types, the following data storage and access requirements are common:

  • Data compression and, thus, data chunking for efficient selective data retrieval

  • Support for multiple storage backends and extendibility to new ones

  • Parallel IO

  • Parallelism for (de)compression and other data filtering

  • Minimization of IO requests, which further requires:

    • A lightweight protocol between the client and the storage backend

    • Collocation of data that are frequently fetched together in the files

  • An embeddable storage library for easy use by higher level applications

  • Data versioning and time traveling

  • Atomicity, concurrency and consistency for read and write operations

  • A growing set of efficient language APIs for flexible data access

  • A growing set of integrations with SQL engines and other data science tools

It does not really matter whether the data is an image, a cohort of genomic variants, 3D LiDAR points, dataframes, or flat binary objects. Anyone who attempts to build a performant storage engine that can be used from diverse programming languages and tools will have to consider the above bullet points.
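To make the first requirement more concrete, here is a minimal sketch of why compression goes hand in hand with chunking. It is not TileDB Embedded's implementation or API; the names `TILE`, `write_tiles` and `read_slice`, the tile extent of 4 and the choice of zlib are all assumptions made for illustration. Each tile is compressed independently, so a slice only has to decompress the tiles it overlaps.

```python
import zlib
from array import array

TILE = 4  # assumed tile extent along both dimensions (illustrative)


def write_tiles(matrix):
    """Split a 2D integer matrix into TILE x TILE tiles and compress each tile.

    Assumes both matrix dimensions are multiples of TILE.
    """
    tiles = {}
    for tr in range(0, len(matrix), TILE):
        for tc in range(0, len(matrix[0]), TILE):
            cells = array("q", (matrix[r][c]
                                for r in range(tr, tr + TILE)
                                for c in range(tc, tc + TILE)))
            tiles[(tr // TILE, tc // TILE)] = zlib.compress(cells.tobytes())
    return tiles


def read_slice(tiles, r0, r1, c0, c1):
    """Read rows [r0, r1) and columns [c0, c1), touching only overlapping tiles."""
    out = [[None] * (c1 - c0) for _ in range(r1 - r0)]
    touched = 0
    for ti in range(r0 // TILE, (r1 - 1) // TILE + 1):
        for tj in range(c0 // TILE, (c1 - 1) // TILE + 1):
            cells = array("q")
            cells.frombytes(zlib.decompress(tiles[(ti, tj)]))
            touched += 1
            # Copy the part of this tile that falls inside the requested box.
            for r in range(max(r0, ti * TILE), min(r1, (ti + 1) * TILE)):
                for c in range(max(c0, tj * TILE), min(c1, (tj + 1) * TILE)):
                    out[r - r0][c - c0] = cells[(r - ti * TILE) * TILE + (c - tj * TILE)]
    return out, touched


matrix = [[r * 8 + c for c in range(8)] for r in range(8)]
tiles = write_tiles(matrix)                     # 4 compressed tiles
values, touched = read_slice(tiles, 1, 3, 5, 7)
```

In this example the requested 2x2 box lies entirely inside one tile, so only one of the four compressed tiles is decompressed. Without chunking, the whole matrix would have to be decompressed to answer the same query.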

Another important observation is that there indeed exists a single data structure that can model any data type: (dense or sparse) multi-dimensional arrays. Arrays are more flexible and diverse than dataframes and can help in designing and implementing highly performant storage engines. Multi-dimensional (ND) slicing is arguably the most frequently used operator in any computational workload, e.g., for fast initial data filtering or for implementing out-of-core (to scale beyond RAM) or distributed (to scale beyond a single machine) algorithms.
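As a rough illustration of this observation, the sketch below models two very different data types as arrays and slices both with one box per dimension. The helper names `slice_dense` and `slice_sparse` and the sample data are hypothetical and are not part of any TileDB API. A dense array stores a value for every cell, whereas a sparse array stores only the non-empty cells together with their coordinates.

```python
def slice_dense(grid, ranges):
    """Slice a dense array stored as nested lists, one range per dimension."""
    first, rest = ranges[0], ranges[1:]
    picked = grid[first.start:first.stop]
    return [slice_dense(sub, rest) for sub in picked] if rest else picked


def slice_sparse(cells, ranges):
    """Return the non-empty cells whose coordinates fall inside the ND box."""
    return {coords: value for coords, value in cells.items()
            if all(x in r for x, r in zip(coords, ranges))}


# A small grayscale image as a dense 2D array (rows x columns).
image = [[10 * r + c for c in range(4)] for r in range(3)]
crop = slice_dense(image, [range(1, 3), range(0, 2)])

# LiDAR-like points as a sparse 3D array: (x, y, z) -> intensity.
points = {(1, 2, 0): 0.5, (4, 4, 1): 0.9, (7, 1, 3): 0.1}
nearby = slice_sparse(points, [range(0, 5), range(0, 5), range(0, 2)])
```

The same slicing operator, a list of ranges with one entry per dimension, serves both cases. Only the physical representation of the cells differs.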

How we innovate:

  • We introduce a novel universal data model that can represent any data as dense or sparse multi-dimensional arrays

  • We design a novel open-spec data format around arrays that enables building a powerful universal storage engine.

  • We implement the first universal storage engine around the array format, called TileDB Embedded. It is built in C++ and open-sourced under the MIT License, and it comes with numerous efficient language APIs. It supports several storage backends and it is extendible to future ones.

  • We integrate TileDB Embedded into a growing set of SQL engines and data science tools, all open-source.

The benefits:

  • You can store any data type in a single format processed by a single powerful engine.

  • You can store your data on any backend.

  • You can efficiently access your data via numerous languages and processing tools.

  • You can experience extreme parallelism and performance.

  • You can enjoy fast updates, data versioning and time traveling at the storage level.
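One way to picture versioning and time traveling at the storage level is with append-only, timestamped writes. The sketch below is an assumption made for illustration and is not a description of the TileDB format or API; the class `VersionedArray` and its methods are hypothetical. Every write is kept as an immutable batch, and a read at a given timestamp replays only the batches written up to that time.

```python
class VersionedArray:
    """Toy array whose writes are immutable, timestamped batches of cells."""

    def __init__(self):
        self._fragments = []  # append-only list of (timestamp, {coords: value})

    def write(self, timestamp, cells):
        # Updates never modify earlier batches; they only append a new one.
        self._fragments.append((timestamp, dict(cells)))

    def read(self, at=None):
        """Return the array as of timestamp `at` (latest state if None)."""
        view = {}
        for ts, cells in sorted(self._fragments, key=lambda f: f[0]):
            if at is None or ts <= at:
                view.update(cells)  # later writes override earlier ones
        return view


arr = VersionedArray()
arr.write(1, {(0, 0): "a", (0, 1): "b"})
arr.write(2, {(0, 1): "B"})

as_of_1 = arr.read(at=1)
latest = arr.read()
```

Because earlier batches are never rewritten, an update is just an append, and any past state can be reconstructed by choosing the timestamp to read at.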

Planet-scale sharing

Through our interactions with users and customers across a variety of verticals, it has become evident that people struggle to securely share data and code with others, even beyond their organizations. This can be for collaboration and scientific reproducibility, or because companies wish to monetize their proprietary data and work with third parties. Either way, data sharing today is starting to cross organizational boundaries and become planet-scale: individuals and organizations seek the ability to easily share data and code with literally anyone else in the world, while enforcing access policies and logging every single action.

But how would you typically share data today?

  • Store data in a database. Databases offer advanced access control and logging features, but (1) you inherit all the problems described in previous sections and (2) you are limited to the organization that runs the database and to the database cluster resources you have allocated.

  • Store files on a cloud object store. You can always store your data on a cloud object store and use its features to manage data access. However, an object store is designed to store files and, therefore, it offers file semantics when it comes to access control and logging. If your application requires the creation of numerous files and fine-grained access policies (at the byte-range level), then sharing selected data with others becomes extremely cumbersome. In addition, cloud object stores were not designed to support user policies on the order of millions for each object and, therefore, cannot be considered planet-scale solutions.

The fact that we built a powerful storage engine on a universal data format allows us to address the above problems and provide planet-scale sharing in a holistic way by building a unified platform on top of TileDB Embedded.

How we innovate:

  • We built a novel platform called TileDB Cloud for planet-scale sharing.

  • TileDB Cloud is totally serverless and has a pay-as-you-go model.

  • TileDB Cloud uses TileDB Embedded for data storage and access.

  • TileDB Cloud operates solely using the open-spec array format.

The benefits:

  • You can easily share with anyone on the planet or discover new data.

  • No need to spin up clusters for sharing your data or accessing others' data.

  • No idle compute cost.

  • TileDB Cloud scales to any number of users accessing the same array concurrently.

  • TileDB Cloud logs every access and allows you to easily audit the logs.

  • You own your data; TileDB Cloud only securely enforces the access policies.

  • There is no vendor lock-in: your data remains in TileDB's open-spec array format and is accessible via the open-source TileDB Embedded outside TileDB Cloud.
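The combination of access policies and logging described in this section can be sketched as follows. This is a hypothetical toy model and not TileDB Cloud's API or architecture; the class `SharedArray`, its fields and the action names "read" and "write" are assumptions made for illustration. The point is that policies attach to an array rather than to the files behind it, and that every access attempt, allowed or denied, is recorded.

```python
from dataclasses import dataclass, field


@dataclass
class SharedArray:
    """Toy model of an array with per-user grants and an audit log."""
    owner: str
    grants: dict = field(default_factory=dict)     # user -> set of allowed actions
    audit_log: list = field(default_factory=list)  # (user, action, outcome)

    def share(self, user, *actions):
        # The owner grants one or more actions to another user.
        self.grants.setdefault(user, set()).update(actions)

    def access(self, user, action):
        # Check the policy and log the attempt regardless of the outcome.
        allowed = user == self.owner or action in self.grants.get(user, set())
        self.audit_log.append((user, action, "allowed" if allowed else "denied"))
        return allowed


shared = SharedArray(owner="alice")
shared.share("bob", "read")

bob_read = shared.access("bob", "read")
bob_write = shared.access("bob", "write")
```

Here "bob" can read but not write, and both attempts end up in `shared.audit_log` for later auditing.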

Serverless compute

We feel quite confident that we are on the right track with data storage and access control. But what about computations? Through interactions with users and personal experiences, we made some further observations:

  • Data scientists really like Jupyter notebooks, as they provide a great way to run and share code, while combining comprehensive documentation in text form along with the code. Moreover, it is faster and cheaper to have a hosted notebook on the cloud close to the data one wishes to access (e.g., on EC2 instances in the same region as the S3 bucket storing the data). Manually spinning up such hosted notebooks is rather cumbersome.

  • There are certain computations, such as SQL queries or arbitrary user-defined functions (UDFs) in a variety of languages, that the users would wish to perform on cloud-stored or shared data, in a totally serverless manner and avoiding transferring and copying data across multiple hops. This is because it is easier (no need to spin up cloud instances and clusters), cheaper (no idle compute costs) and faster (compute is sent to the data).

  • Any complex out-of-core or distributed algorithm working on large quantities of data can be expressed as a directed acyclic graph (DAG) of simple tasks, where each task slices and operates on a small block of data. For example, Dask relies on this task execution model.
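The task-graph model in the last observation can be sketched with the Python standard library alone. This is a minimal local illustration under stated assumptions, not the TileDB Cloud or Dask API: the function `run_dag` and the task names are hypothetical, and threads stand in for serverless workers. Each leaf task operates on one small block of the data, and a final task combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # Python 3.9+


def run_dag(tasks, deps):
    """Run a task DAG.

    tasks: name -> callable that receives its dependencies' results in order.
    deps:  name -> list of task names it depends on (leaf tasks may be omitted).
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    results = {}
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            # All tasks whose dependencies are finished can run concurrently.
            futures = {
                name: pool.submit(tasks[name], *(results[d] for d in deps.get(name, ())))
                for name in sorter.get_ready()
            }
            for name, future in futures.items():
                results[name] = future.result()
                sorter.done(name)
    return results


# Sum 1..100 block by block: four leaf tasks plus one combining task.
data = list(range(1, 101))
blocks = [data[i:i + 25] for i in range(0, 100, 25)]
tasks = {f"sum_{i}": (lambda b=b: sum(b)) for i, b in enumerate(blocks)}
tasks["total"] = lambda *partials: sum(partials)
deps = {"total": [f"sum_{i}" for i in range(len(blocks))]}

results = run_dag(tasks, deps)
```

No single task ever needs the whole dataset, which is what allows the same graph to scale beyond RAM or across machines once the workers and the data blocks are distributed.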

In order to support planet-scale sharing as explained in the previous section, we were required to build a powerful, scalable and entirely serverless infrastructure for TileDB Cloud. This infrastructure enabled us to provide further generic serverless compute capabilities.

How we innovate:

  • TileDB Cloud allows you to spin up Jupyter notebooks in your browser.

  • TileDB Cloud allows you to share notebooks with other users.

  • TileDB Cloud allows you to register, share and execute any UDF in a serverless manner.

    • TileDB Cloud currently supports only Python UDFs, but it is architected to be language agnostic. Support for other languages (e.g., R) is coming soon.

  • TileDB Cloud allows you to define arbitrary task DAGs and execute them in a serverless manner.
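The idea of registering a UDF and sending the compute to the data can be pictured with the toy registry below. It is a hypothetical sketch and not the TileDB Cloud API; the class `UdfRegistry`, its methods and the sample array are assumptions made for illustration. The function runs where the array lives, on the requested slice only, and just the result travels back to the caller.

```python
class UdfRegistry:
    """Toy stand-in for a service that holds arrays and registered UDFs."""

    def __init__(self, arrays):
        self._arrays = arrays  # name -> 1D data held "next to" the compute
        self._udfs = {}

    def register(self, name, func):
        self._udfs[name] = func

    def apply(self, udf_name, array_name, rows):
        # Slice first, then run the UDF on the slice; only the result is returned.
        block = self._arrays[array_name][rows.start:rows.stop]
        return self._udfs[udf_name](block)


registry = UdfRegistry({"temperatures": [12.0, 15.5, 9.0, 21.5]})
registry.register("mean", lambda block: sum(block) / len(block))

result = registry.apply("mean", "temperatures", range(1, 3))
```

The caller receives a single number instead of the underlying data, which avoids copying the array across multiple hops.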

The benefits:

  • Great ease of use for exploring and analyzing anyone's data.

  • Extreme scalability as you can spin up any number of tasks and arbitrarily large DAGs.

  • No engineering hassles for spinning up clusters.

  • Lower TCO by avoiding cluster provisioning and eliminating idle compute cost.

  • You can build any complex distributed algorithm using the TileDB Cloud serverless infrastructure, and share it with any other user in the world.

The era of pluggable compute

From monolithic databases, data management evolved to compute engines with pluggable storage, and now we take it one step further with TileDB: we introduce the concept of universal data engines with pluggable compute. The main benefit stems from the fact that data management is very similar across all applications. This includes data modeling, storage, slicing, access control, logging and interoperability. These features can be built once and shared with every higher-level application that needs to perform more specialized computations. Abstracting data management and pushing common primitives down to a common storage engine eliminates obstacles and saves an enormous amount of time for data scientists and analysts who wish to focus on the science instead of the engineering. It also helps developers build brilliant new computational engines and algorithms on top of the universal data engine.

We look forward to hearing your feedback about the TileDB vision and software ecosystem. The ultimate goal of the TileDB project (and the TileDB, Inc. company) is to accelerate science and technology. And you can contribute to that goal!