TileDB is a data management solution built to help Data Science teams make faster discoveries. It gives them a more natural way to store, analyze and share large sets of diverse data, so that they can stop wasting time working around performance limitations, inadequate data storage formats and unfamiliar tooling. TileDB, Inc. was founded in February 2017 to further develop and maintain the TileDB project, which was originally created in a collaboration between Intel Labs and MIT.
We took a bottom-up approach to Data Science with TileDB, starting with storage. TileDB introduces the only format and storage engine (open-source under the MIT License) that handles both dense and sparse multi-dimensional arrays. It supports efficient writes and reads of array data on multiple storage backends, including cloud object stores like AWS S3. One of its most important features is support for rapid, highly parallel, lock-free batch updates, which are architected around immutable objects so that they work particularly well on the cloud. All update logic, as well as functionality like time traveling, is built into the storage engine itself. TileDB accommodates all Data Science applications with a single format and a unified, intuitive API.
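To make the idea of updates built around immutable objects more concrete, here is a minimal conceptual sketch in plain Python. It is not the TileDB API or its on-disk format: the names `Fragment`, `ToyArray`, `write` and `read` are hypothetical, and the model assumes that every batch write becomes a new timestamped, never-modified object, so that a read "as of" some time simply ignores later batches.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, int]  # (row, column) coordinates of a cell


@dataclass(frozen=True)
class Fragment:
    """One immutable batch of writes, stamped with its commit time (hypothetical)."""
    timestamp: int
    cells: Tuple[Tuple[Coord, float], ...]


class ToyArray:
    """Toy sparse 2D array whose state is the sum of its immutable fragments."""

    def __init__(self) -> None:
        self._fragments: List[Fragment] = []

    def write(self, timestamp: int, cells: Dict[Coord, float]) -> None:
        # A write never touches earlier fragments; it only appends a new one,
        # which is why concurrent writers would not need to lock each other out.
        self._fragments.append(Fragment(timestamp, tuple(cells.items())))

    def read(self, at: Optional[int] = None) -> Dict[Coord, float]:
        # Replay fragments in timestamp order; for each cell the latest write wins.
        # Passing `at` stops the replay early, which gives a "time travel" view.
        result: Dict[Coord, float] = {}
        for frag in sorted(self._fragments, key=lambda f: f.timestamp):
            if at is not None and frag.timestamp > at:
                break
            result.update(frag.cells)
        return result


arr = ToyArray()
arr.write(1, {(0, 0): 1.0, (0, 1): 2.0})
arr.write(2, {(0, 1): 20.0, (5, 7): 3.0})  # overwrites one cell, adds another

latest = arr.read()        # view that includes both batches
as_of_t1 = arr.read(at=1)  # view as of timestamp 1, before the second batch
```

In this toy model the second write changes what `read()` returns without altering the first fragment, which is the property that makes append-only object stores a comfortable fit.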
TileDB is an embeddable C++ library that ships with efficient APIs in C, C++, Python, R, Java and Go, and enables direct access to the data (instead of the typically slow ODBC/JDBC access). It is also integrated with Spark, Dask, PrestoDB, MariaDB, Arrow and popular geospatial libraries like PDAL, GDAL and Rasterio. TileDB goes one step further: while it allows you to compute natively with the popular tools you already use, it also pushes down as much compute as possible to storage, such as filter conditions from the SQL engines and dataframe computations from Dask and Spark. This leads to a performance boost, due to fast processing in C++ and the minimization of data copying across the software stack. By storing your data in TileDB, you can take advantage of the entire Data Science ecosystem and move away from old monolithic and domain-specific solutions.
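The benefit of pushing filter conditions down to storage can be illustrated with a small, self-contained Python sketch. This is a toy model rather than TileDB code: `ToyStore`, `scan` and `rows_returned` are hypothetical names, and the counter simply stands in for the amount of data copied out of the storage layer.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, float]


class ToyStore:
    """Stand-in for a storage layer that can optionally evaluate a filter itself."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows
        self.rows_returned = 0  # how many rows were copied out of "storage"

    def scan(self, predicate: Optional[Callable[[Row], bool]] = None) -> List[Row]:
        # With a predicate, only matching rows are copied to the caller.
        out = [dict(r) for r in self._rows if predicate is None or predicate(r)]
        self.rows_returned += len(out)
        return out


rows = [{"id": float(i), "price": float(i * 10)} for i in range(100)]

# Without push-down: every row is copied out, then filtered by the caller.
store_a = ToyStore(rows)
client_side = [r for r in store_a.scan() if r["price"] >= 950]

# With push-down: the filter runs inside the store, so only matches are copied.
store_b = ToyStore(rows)
pushed_down = store_b.scan(lambda r: r["price"] >= 950)
```

Both approaches return the same five rows, but the first copies all 100 rows out of the store while the second copies only the five matches.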
With TileDB Cloud, you can manage your arrays on the cloud and easily share them inside your organization or with other users globally, while monitoring all activity. The key is to push access control and logging down to storage, so that all higher-level tools can inherit them. It is the array abstraction that makes it easy and intuitive to share any kind of data (dataframes, geospatial, genomics, time series, etc.). Slicing arrays works natively, just as it does with the open-source storage engine, via fast REST and zero-copying wherever possible. TileDB Cloud is serverless, scalable and elastic, and comes with a pay-as-you-go pricing model. If you wish to run TileDB Cloud under your full control in your own private cluster, you can get all the features of the cloud service, plus LDAP and SAML support, by using TileDB Enterprise.
With TileDB Cloud you can perform array slicing, SQL and Python UDFs on TileDB data stored on AWS S3, all serverless and from your laptop. Everything works elastically and in a pay-as-you-go fashion. There is no need to spin up or tear down machines or to build complicated packages. Just install the TileDB client, sign up, and go. We are hard at work adding more serverless functionality, such as deploying sophisticated and diverse workflows in multiple programming languages. Stay tuned!
We are committed to technological innovation that can be contributed to the open-source community and used to solve pain points in the cloud and in the enterprise. TileDB consists of three major offerings:
TileDB Developer
Fast C++ array engine
Efficient language bindings (C, C++, Python, R, Java, Go APIs)
Integration with Spark and Dask
Embedded SQL via MariaDB
MariaDB and PrestoDB data connectors
Integration with PDAL, GDAL and Rasterio
A genomics module for large-scale population genetics analysis (gVCF data)
All open-source under the MIT License
See the Developer Docs
TileDB Cloud
Array sharing with other users on the cloud, inside or outside your organization
Serverless SQL and Python user-defined functions (UDFs)
No deployment hassle, just install the TileDB client and go
Monitor all activity through logs, always defined with array semantics
Pay-as-you-go pricing model on usage and data retrieved
See the Cloud docs
TileDB Enterprise
Use TileDB Cloud in your own private cluster under your total control
Authenticate the users in your organization via LDAP and SAML
Enjoy support from the TileDB team
Any dataset in the Data Science space can be modeled as a multi-dimensional array, which is the natural data representation that most high-level computational tools use today (numpy, Pandas, Spark, etc.). We demonstrate the versatility of TileDB and its storage format with three use cases:
Genomics
Excellent for population genetic studies
Store huge collections of gVCF data in a TileDB 2D sparse array
Save 40% in space, enjoy parallel IO, and process data cost-efficiently on the cloud
Solve the N+1 problem with rapid updates and linear scalability
Interface in C/C++ or Python, and scale with Spark and Dask
Geospatial
Store satellite images, LiDAR, weather data and more as dense or sparse arrays
Use familiar geospatial tooling like PDAL, GDAL and Rasterio
Enjoy cloud-optimized storage and parallel IO
Create multi-dimensional cubes with arbitrary attributes and metadata
Dataframes
Store dataframes as sparse multi-dimensional arrays
Push all update and time traveling logic to the TileDB storage engine
Take advantage of multi-dimensional slicing for rapid OLAP queries
Integrate with Pandas and scale slicing and dicing with Dask and Spark
Process fast SQL queries with Spark, PrestoDB and MariaDB
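As a rough illustration of how a dataframe can be viewed as a sparse multi-dimensional array and then sliced, consider the plain-Python sketch below. It does not use the TileDB API. The columns `day` and `store_id` are hypothetical dimensions, `sales` is a hypothetical attribute, and the helper `slice_2d` assumes inclusive ranges on both dimensions.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int]  # (day, store_id)

# Hypothetical dataframe rows: (day, store_id, sales)
rows = [
    (1, 10, 100.0),
    (1, 12, 50.0),
    (2, 10, 75.0),
    (3, 11, 20.0),
    (3, 12, 60.0),
]

# Sparse array view: two columns act as dimensions, and only the cells that
# actually hold data are stored, keyed by their coordinates.
sparse: Dict[Coord, float] = {(day, store): sales for day, store, sales in rows}


def slice_2d(
    cells: Dict[Coord, float],
    day_range: Tuple[int, int],
    store_range: Tuple[int, int],
) -> Dict[Coord, float]:
    """Return the cells inside an inclusive rectangle over (day, store_id)."""
    d_lo, d_hi = day_range
    s_lo, s_hi = store_range
    return {
        coord: value
        for coord, value in cells.items()
        if d_lo <= coord[0] <= d_hi and s_lo <= coord[1] <= s_hi
    }


# OLAP-style question: total sales for days 1 to 2 across stores 10 to 11.
subset = slice_2d(sparse, (1, 2), (10, 11))
total = sum(subset.values())
```

A real storage engine would use an index over the dimensions instead of scanning every cell as this sketch does. The point here is only the data model, in which a range on each dimension selects a rectangle of cells.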
All the software that serves these use cases is open-source and can be used for free with TileDB Developer for storage and analysis. You can also use it with access control, logging and serverless functionality through TileDB Cloud or TileDB Enterprise.
TileDB's applicability is not limited to these three use cases. We look forward to hearing from you about your use case, and we are eager to help you address your storage, compute and management needs with TileDB.