Designing a universal data model
These are exciting times for anyone working on data problems, as the data industry is as hot and as hyped as ever. Numerous databases, data warehouses, data lakes, lakehouses, feature stores, metadata stores, file managers, etc. have been hitting the market in the past few years. At TileDB we are trying to answer a simple question: instead of building a new data system every time our data needs change, can we build a single database that can store, govern, and process all data — tables, images, video, genomics, LiDAR, features, metadata, flat files and any other data type that may pop up in the future?
This question was born from a simple observation: all data systems, whatever their flavor, share significant similarities, including laying data out on the storage medium of choice and fetching it for processing based on certain query workloads. Therefore, to answer the above question, we had to ask a slightly different one: is there a data model that can efficiently capture the data of all applications? If such a universal data model exists, it can serve as the foundation for building a universal database with all the subsystems common to databases (query planner, executor, authenticator, transaction manager, APIs, etc.). We discovered that such a model does exist, and it is based on multi-dimensional arrays.
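To make this claim concrete, below is a minimal sketch of how the same array primitives can capture both a dense image and a sparse table, written against the TileDB-Py API (the array URIs, dimension names, attribute names, domains and tile sizes are all illustrative choices for this example):

```python
import numpy as np
import tiledb

# Dense 2D array modeling an 8x8 grayscale image.
# Cells are grouped into 4x4 tiles, the atomic unit of I/O and compression.
image_schema = tiledb.ArraySchema(
    domain=tiledb.Domain(
        tiledb.Dim(name="y", domain=(0, 7), tile=4, dtype=np.int32),
        tiledb.Dim(name="x", domain=(0, 7), tile=4, dtype=np.int32),
    ),
    sparse=False,
    attrs=[tiledb.Attr(name="intensity", dtype=np.uint8)],
)
tiledb.Array.create("image_array", image_schema)
with tiledb.open("image_array", mode="w") as A:
    A[:] = np.random.randint(0, 256, size=(8, 8), dtype=np.uint8)

# Sparse 2D array modeling a table with two indexed columns;
# only the non-empty cells (i.e., the table's rows) are stored.
ratings_schema = tiledb.ArraySchema(
    domain=tiledb.Domain(
        tiledb.Dim(name="user_id", domain=(0, 1_000_000), tile=10_000, dtype=np.int64),
        tiledb.Dim(name="item_id", domain=(0, 1_000_000), tile=10_000, dtype=np.int64),
    ),
    sparse=True,
    attrs=[tiledb.Attr(name="rating", dtype=np.float64)],
)
tiledb.Array.create("ratings_array", ratings_schema)
with tiledb.open("ratings_array", mode="w") as A:
    # Two cells at coordinates (3, 42) and (17, 99)
    A[[3, 17], [42, 99]] = np.array([4.5, 3.0])

# The same slicing API serves both use cases:
with tiledb.open("image_array") as A:
    patch = A[0:4, 0:4]["intensity"]  # an image patch
with tiledb.open("ratings_array") as A:
    scan = A[0:100, :]["rating"]      # a range scan over user_id
```

In both cases the dimensions index the data, the attributes store the values, and the tiles define the unit of storage and compression; this shared structure is what lets a single engine serve workloads that would otherwise require separate purpose-built systems.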
Before elaborating on why arrays are universal, by describing the data model and its use cases, we need to answer yet another question: why should you care about a universal data model and a universal database? Here are a few important reasons:
Data diversity. You may think that it’s all about tabular data, for which a traditional data warehouse (or data lake, or lakehouse) can do the trick, but in reality organizations possess a ton of other very valuable data, such as images, video, audio, genomics, point clouds and flat files. And they wish to perform a variety of operations on these data collections, from analytics to data science and machine learning.
Vendor optimization. To manage their diverse data, organizations resort to buying numerous different data systems, e.g., a data warehouse, plus an ML platform, plus a metadata store, plus a file manager. That costs money and time: money because some of the vendors have overlapping functionality that you pay for twice (e.g., authentication and access control), and time because teams have to learn to operate numerous different systems, and to wrangle data whenever gaining insight requires combining disparate data sources.
Holistic governance. Even if organizations are happy with their numerous vendors, each data system has its own access controls and logging capabilities. Therefore, if an organization needs to enforce centralized governance over all its data, it has to build that layer in-house, which costs yet more money and time.
Even if you are already convinced of the importance of a universal database, one last vital remark is in order. A universal database is unusable if it does not offer excellent performance for every data type it serves. In other words, a universal database must perform as efficiently as the purpose-built systems it replaces, otherwise there will be justified skepticism about adopting it. This is where the difficulty of universal databases lies, and why no one had built such a system before TileDB.
In these docs you will learn that multi-dimensional arrays are the right bet not only for their universality, but also for their performance. We describe many of the critical decisions we made at TileDB when designing the array data model and an efficient on-disk format, as well as when developing a powerful storage engine to support them.
For further reading on why we chose arrays as first-class citizens in TileDB, see our blog post Why Arrays as a Universal Data Model.