Welcome to TileDB!

The database for data scientists

TileDB is a data management solution built to help Data Science teams make faster discoveries. It gives them a more natural way to store, analyze and share large sets of diverse data, so that they can stop wasting time working around performance limitations, inadequate data storage formats, and unfamiliar tooling. TileDB, Inc. was founded in February 2017 to further develop and maintain the TileDB project, which was originally created in a collaboration between Intel Labs and MIT.

Here you will find all the documentation about TileDB and its products. You can jump to Quickstart if you'd like to start using TileDB right away. Please feel free to contact us with any issues.

Why TileDB?

A New Powerful Data Format

We took a bottom-up approach to Data Science with TileDB, starting with storage. TileDB introduces the only format and storage engine (open-source under the MIT License) that handles both dense and sparse multi-dimensional arrays. It supports efficient writes and reads of array data on multiple storage backends, including cloud object stores like AWS S3. One of its important features is support for rapid, highly parallel, lock-free batch updates, which are architected around immutable objects to work particularly well on the cloud. All update logic, along with functionality such as time traveling, is built into the storage engine itself. TileDB accommodates all Data Science applications with a single format and a unified, intuitive API.
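To make the idea of updates built around immutable objects more concrete, here is a minimal sketch in plain Python. It is a toy model and not the TileDB storage engine or its API: the names `Fragment`, `ToyArray`, `write` and `read` are hypothetical, and we assume that each batch write is stored as a separate, timestamped, never-modified object. Under that assumption, a write only appends, and reading "as of" an earlier time simply ignores newer batches.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, int]


@dataclass(frozen=True)
class Fragment:
    """One immutable batch of cell writes, stamped with its write time."""
    timestamp: int
    cells: Tuple[Tuple[Coord, float], ...]


@dataclass
class ToyArray:
    """Hypothetical stand-in for an array stored as a list of fragments."""
    fragments: List[Fragment] = field(default_factory=list)

    def write(self, timestamp: int, cells: Dict[Coord, float]) -> None:
        # A write never modifies earlier fragments; it only appends a new one.
        self.fragments.append(Fragment(timestamp, tuple(cells.items())))

    def read(self, at: Optional[int] = None) -> Dict[Coord, float]:
        # Replay fragments in timestamp order so later writes win,
        # skipping any fragment newer than the requested time.
        view: Dict[Coord, float] = {}
        for frag in sorted(self.fragments, key=lambda f: f.timestamp):
            if at is not None and frag.timestamp > at:
                continue
            view.update(frag.cells)
        return view


arr = ToyArray()
arr.write(1, {(0, 0): 1.0, (0, 1): 2.0})  # initial batch
arr.write(2, {(0, 1): 20.0})              # later batch overwrites one cell

latest = arr.read()        # sees the overwrite from timestamp 2
as_of_1 = arr.read(at=1)   # "time travels" to the state after timestamp 1
```

Because existing fragments are never rewritten in this model, concurrent writers would not need to lock each other out; they only add new objects, which is the property that suits object stores well.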

Ecosystem Integration

TileDB is an embeddable C++ library that ships with efficient APIs in C, C++, Python, R, Java and Go, and enables direct access to the data (rather than the typically slow ODBC/JDBC access). It is also integrated with Spark, Dask, PrestoDB, MariaDB, Arrow and popular geospatial libraries like PDAL, GDAL and Rasterio. TileDB goes one step further: while it allows you to compute natively with your favorite tools, it pushes down as much compute as possible to storage, such as filter conditions from the SQL engines and dataframe computations from Dask and Spark. This leads to a performance boost, due to fast processing in C++ and the minimization of data copying across the software stack. By storing your data in TileDB, you can take advantage of the entire Data Science ecosystem and move away from old monolithic and domain-specific solutions.
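The benefit of pushing filter conditions down to storage can be illustrated with a small, self-contained sketch. This is plain Python and not a TileDB integration: `ToyStorage`, `scan` and `wanted` are hypothetical names, and the two-column row layout is made up for illustration. The sketch only counts how many rows cross the storage boundary with and without push-down.

```python
from typing import Callable, List, Optional, Tuple

Row = Tuple[int, float]  # (id, value); a made-up schema for illustration


class ToyStorage:
    """Hypothetical storage layer that counts the rows it hands to callers."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = list(rows)
        self.rows_returned = 0

    def scan(self, predicate: Optional[Callable[[Row], bool]] = None) -> List[Row]:
        # When a predicate is supplied, it is evaluated inside the storage
        # layer, so only matching rows are copied out to the caller.
        out = [r for r in self._rows if predicate is None or predicate(r)]
        self.rows_returned += len(out)
        return out


def wanted(row: Row) -> bool:
    # Example filter condition: keep the rows with the ten largest ids.
    return row[0] >= 990


data = [(i, i * 0.5) for i in range(1000)]

# Without push-down: every row leaves storage, then the caller filters.
plain = ToyStorage(data)
filtered_late = [r for r in plain.scan() if wanted(r)]

# With push-down: the filter travels to storage; only matches come back.
pushed = ToyStorage(data)
filtered_early = pushed.scan(wanted)

print(plain.rows_returned, pushed.rows_returned)  # prints "1000 10"
```

Both paths produce the same result, but the pushed-down scan hands 10 rows to the caller instead of 1,000, which is the kind of copy reduction described above.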

Sharing Made Easy

With TileDB Cloud, you can manage your arrays on the cloud and easily share them inside your organization or with other users globally, while monitoring all activity. The key is to push access control and logging to storage, so that all higher-level tools can inherit them. It is the array abstraction that makes it truly easy and intuitive for you to share any kind of data (dataframes, geospatial, genomics, time series, etc.). Slicing arrays works natively, just as with the open-source storage engine, via fast REST and zero-copying wherever possible. TileDB Cloud is serverless, scalable and elastic, and comes with a pay-as-you-go pricing model. If you wish to run TileDB Cloud under your full control in your own private cluster, you can enjoy all the features of the cloud service, with LDAP and SAML support, by using TileDB Enterprise.

Everything Serverless

With TileDB Cloud, you can perform array slicing and run SQL and Python UDFs on TileDB data stored on AWS S3, all serverless and from your laptop. Everything works elastically and in a pay-as-you-go fashion. There is no need to spin up or tear down machines or build complicated packages. Just install the TileDB client, sign up, and go. We are hard at work adding more serverless functionality, such as deploying sophisticated and diverse workflows in multiple programming languages. Stay tuned!

Products

We are committed to technological innovation that can be contributed to the open-source community and used to solve pain points in the cloud and in the enterprise. TileDB consists of three major offerings:

Developer

  • Fast C++ array engine

  • Efficient language bindings (C, C++, Python, R, Java, Go APIs)

  • Integration with Spark and Dask

  • Embedded SQL via MariaDB

  • MariaDB and PrestoDB data connectors

  • Integration with PDAL, GDAL and Rasterio

  • A genomics module for large-scale population genetics analysis (gVCF data)

  • All open-source under the MIT License

  • See the Developer Docs

Cloud

  • Array sharing with other users on the cloud, inside or outside your organization

  • Serverless SQL and Python user-defined functions (UDFs)

  • No deployment hassle, just install the TileDB client and go

  • Monitor all activity through logs, always defined with array semantics

  • Pay-as-you-go pricing model on usage and data retrieved

  • See the Cloud docs

Enterprise

  • Use TileDB Cloud in your own private cluster under your total control

  • Authenticate the users in your organization via LDAP and SAML

  • Enjoy support from the TileDB team

Use Cases

Any dataset in the Data Science space can be modeled as a multi-dimensional array, which is the natural data representation that most high-level computational tools use today (NumPy, Pandas, Spark, etc.). We demonstrate the versatility of TileDB and its storage format with three use cases:

Genomics

  • Excellent for population genetic studies

  • Store huge collections of gVCF data in a TileDB 2D sparse array

  • Save 40% in space, enjoy parallel IO, and process cost efficiently on the cloud

  • Solve the N+1 problem with rapid updates and linear scalability

  • Interface in C/C++ or Python, and scale with Spark and Dask

Geospatial

  • Store satellite images, LiDAR, weather data and more as dense or sparse arrays

  • Use familiar geospatial tooling like PDAL, GDAL and Rasterio

  • Enjoy cloud-optimized storage and parallel IO

  • Create multi-dimensional cubes with arbitrary attributes and metadata

Dataframes

  • Store dataframes as sparse multi-dimensional arrays

  • Push all update and time traveling logic to the TileDB storage engine

  • Take advantage of multi-dimensional slicing for rapid OLAP queries

  • Integrate with Pandas and scale slicing and dicing with Dask and Spark

  • Process fast SQL queries with Spark, PrestoDB and MariaDB
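As a rough illustration of how a dataframe maps onto a sparse multi-dimensional array, the sketch below models a small table in plain Python. It does not use TileDB: the column names (`day`, `store_id`, `sales`), the sample rows and the helper `slice_2d` are all hypothetical. Two columns act as dimensions, the remaining column is the attribute stored in each cell, and an OLAP-style query becomes a range slice over both dimensions.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int]

# Hypothetical dataframe rows: (day, store_id, sales)
rows = [
    (1, 10, 5.0),
    (1, 12, 7.5),
    (2, 10, 6.0),
    (3, 11, 2.5),
    (3, 12, 9.0),
]

# Sparse 2D array: only non-empty cells are stored, keyed by (day, store_id).
sparse: Dict[Coord, float] = {(day, store): sales for day, store, sales in rows}


def slice_2d(
    array: Dict[Coord, float],
    day_range: Tuple[int, int],
    store_range: Tuple[int, int],
) -> Dict[Coord, float]:
    """Return the non-empty cells inside an inclusive 2D range."""
    (d0, d1), (s0, s1) = day_range, store_range
    return {
        coords: value
        for coords, value in array.items()
        if d0 <= coords[0] <= d1 and s0 <= coords[1] <= s1
    }


# "Sales for days 1-2 in stores 10-11" as a multi-dimensional slice.
result = slice_2d(sparse, (1, 2), (10, 11))
total = sum(result.values())
```

This toy version scans every cell, which is only acceptable for a handful of rows; a real storage engine would index the cells so that a slice touches only the relevant part of the array.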

All the software that serves these use cases is open-source and can be used for free with TileDB Developer for storage and analysis. You can also use it with access control, logging and serverless functionality through TileDB Cloud or TileDB Enterprise.

TileDB's applicability is not limited to these three use cases. We look forward to hearing from you about your use case, and we are eager to help you address your storage, compute and management needs with TileDB.