Who is TileDB for?

TileDB Embedded

Tables and key-values

Tabular data can be efficiently modeled as (dense or sparse) arrays and stored in TileDB Embedded. That gives you the following benefits:

  • Fast row (1D dense modeling) or multi-column (ND sparse modeling) slicing

  • A "columnar" format for more performant compression and column access (similar to Parquet)

  • A variety of compressor options

  • SQL access via our integration with MariaDB, PrestoDB or Spark

  • Direct data access (bypassing a SQL engine for performance) via C, C++, Python, R, Java, Go

  • Cloud-optimized storage on AWS S3, GCS and Azure Blob Storage

  • Arbitrary metadata attached to tables

  • Data versioning and time traveling

  • Atomicity (a read or write cannot corrupt an array)

  • Lock-free concurrency (multi-reader / multi-writer model)

  • Eventual consistency on cloud object stores

Key-value stores can be modeled as sparse arrays with string dimension domains. Managing key-value stores with TileDB Embedded offers you all the benefits outlined above for tables.

Machine Learning

Machine learning applications deal with very diverse data and tooling. Practitioners spend most of their time managing data and converting it from one format to another. This OpenML blog post explains the data challenges of the machine learning community very nicely. TileDB Embedded is an ideal solution for machine learning and addresses all the stated challenges, offering native sparse data support, a unified way to store diverse data (e.g., tables, images, video), interoperability with a growing tooling ecosystem, performance on cloud object stores, time traveling and more.

For more details, read our blog post on why TileDB is the data engine for Machine Learning.

LiDAR

One of the biggest challenges we identified with LiDAR data is that scientists typically end up with myriad LAZ files that are difficult to manage, especially on cloud object stores. Here are a few benefits of storing LiDAR data in TileDB Embedded:

  • Store a collection of LAZ files in a single 3D sparse array, in a single projection

  • Unify the metadata of the entire LAZ file collection inside that single 3D array

  • Enjoy rapid slicing due to TileDB's R-tree indexing

  • Enjoy extreme parallelism through a lock-free multi-reader / multi-writer model

  • Cloud-optimized storage on AWS S3, GCS and Azure Blob Storage

  • Ingest and access LiDAR data using our integration with PDAL

  • Access data directly via APIs or use our integrations with Spark, Dask and SQL engines

AIS

Currently, AIS data is stored either in CSV (text) files or HDF5 files (as dense arrays). Both formats are very inefficient for AIS data, which is inherently sparse (2D points in space, or 3D if we add time as well). TileDB Embedded is ideal for managing AIS data at scale:

  • Store large AIS data as 2D (only positions) or 3D (positions and time) sparse arrays

  • Reduce storage costs via TileDB's "columnar" format and compression

  • Enjoy rapid slicing due to TileDB's R-tree indexing

  • Enjoy extreme parallelism through a lock-free multi-reader / multi-writer model

  • Cloud-optimized storage on AWS S3, GCS and Azure Blob Storage

  • Access data directly via numerous APIs or use our integrations with Spark, Dask and SQL engines

Weather and SAR

Weather data is traditionally stored in dense array formats like NetCDF, which is currently built on HDF5, a format not architected to work well on cloud object stores. The community has expressed the need for cloud-ready weather data (see this UK Met Office Informatics Lab blog post). TileDB is ideal for this type of data (see recent efforts by the UK Met Office to store numerous NetCDF files in TileDB). In summary, TileDB Embedded offers the following benefits for weather data:

  • Model numerous NetCDF files as a single dense, labeled TileDB array in a lossless way

  • Reduce storage costs via compression

  • Enjoy a cloud-optimized data format with rapid, lock-free, concurrent writing and reading

  • Access data directly via numerous APIs or use our integrations with Spark, Dask and GDAL

SAR technology is a game changer for satellite imaging and Earth observation. TileDB is an excellent choice for managing large temporal collections of SAR images at scale and in a cloud-optimized way, addressing the challenges of managing numerous COG files and their associated metadata in the cloud. You can store SAR data as 2D or 3D dense arrays and enjoy all the benefits of our storage engine, such as rapid slicing, compression, parallelism, numerous APIs and integrations. You can read about our efforts with Capella Space on SAR data in this blog post.

Genomics

Population genomic analysis requires working on huge collections of variant data, which typically come in the VCF format. To mitigate issues around I/O locality and thus performance, it is typical to merge VCF files into a single combined (or population) VCF file. That fixes the locality issue but introduces another important challenge: the new file cannot be updated with new samples; it essentially needs to be recreated from scratch. This is especially painful given the volume of genomics data and the fact that organizations wish to store this data on cheap cloud object stores, where all objects are immutable. This problem is often called "the N+1 problem". TileDB Embedded is an excellent solution for storing VCF data, with the following benefits:

  • Store VCF data as sparse arrays (see TileDB-VCF)

  • Solve the N+1 problem utilizing TileDB's rapid updates

  • Reduce your storage overhead by about 50% compared to the raw single-sample VCF dataset

  • Enjoy extreme scalability with TileDB's lock-free multi-writer / multi-reader model

  • Access your data via numerous APIs and integrations with Spark, Dask and SQL engines

  • Enjoy data versioning and time traveling capabilities

TileDB Cloud

Embrace serverless

Have you ever felt that it is a great hassle to spin up clusters on the cloud? Do you often forget to shut down compute instances and, therefore, get charged unnecessarily? Do you sometimes feel that you do not know how many machines to spin up to get the highest performance at the lowest cost? Do you find it cumbersome to deploy distributed systems just to enjoy scalable compute? Do you feel that you pay unnecessarily for idle compute when a poorly parallelized algorithm runs on a large cluster? If the answer to any of these questions is affirmative, then you are going to love the serverless TileDB Cloud capabilities. Specifically, with TileDB Cloud, you can:

  • Store your data in your own cloud buckets and pay only for the serverless compute you use

  • Execute serverless SQL queries

  • Execute serverless user-defined functions (UDFs)

    • TileDB Cloud currently supports only Python UDFs, but it is architected to be language-agnostic

    • UDF support for R and other languages is coming soon

  • Create arbitrary UDF DAGs and deploy them in a serverless manner

  • Scale your distributed compute with ease

  • Eliminate cluster management and charges for idle compute
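
To make the UDF idea concrete: a UDF is just an ordinary Python function. Below it runs locally; the commented-out lines sketch how the same function would be dispatched serverlessly with the tiledb-cloud client (the login token is a placeholder, and the exact calls are assumptions based on that package):

```python
# A UDF is plain Python: no cluster, no deployment artifacts.
def quantize(values, step):
    """Round each value down to the nearest multiple of `step`."""
    return [v - (v % step) for v in values]

# Local execution:
result = quantize([3, 7, 12], 5)
print(result)  # [0, 5, 10]

# Serverless execution on TileDB Cloud (requires an account and login):
# import tiledb.cloud
# tiledb.cloud.login(token="<your-api-token>")
# result = tiledb.cloud.udf.exec(quantize, [3, 7, 12], 5)
```

Either way the function body is identical; only the dispatch changes.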

Share and discover datasets

TileDB Cloud allows you to share data with anyone, either within an organization or with any other individual user on the planet, always securely enforcing the access policies you define. TileDB Cloud does not host any data; you continue to own and host your data in your own cloud buckets. TileDB Cloud gives you an easy way to manage access to your data for thousands of users. The most important novelty here is that the shared data is analysis-ready: you can access anyone's data without having to download or copy it, and without bearing the hosting fees. You also do not need to convert the data into another format, since it is already efficiently accessible via the numerous TileDB APIs and tool integrations. You can either slice or run any UDF on this data in a serverless manner within TileDB Cloud, or use your own compute cluster with Spark, Dask or PrestoDB. Either way, you enjoy TileDB Cloud's planet-scale access control.

In addition, you can use TileDB Cloud to discover public data to reproduce results or use in your own work. Over the coming months we will be adding several large, popular datasets from a variety of domains (e.g., genomics, LiDAR, weather), all stored in the open-source TileDB format. Similarly, you can curate, host and publish your own datasets on TileDB Cloud, making them easily and inexpensively accessible to other users. Our goal is to build a very dynamic community around shared data.

Explore data with Jupyter notebooks

Have you ever wished to explore public datasets quickly, easily and inexpensively? Jupyter notebooks provide a great way to interactively explore data, and TileDB Cloud makes it super easy to use JupyterLab embedded in its UI, accessible from your browser. You can sign up, sign in, and start exploring TileDB datasets in seconds, incurring very low charges only for the time you use a notebook. TileDB Cloud currently provides multiple system configurations to fit your resource needs, along with multiple images with popular pre-installed data science packages. We are always open to creating more diverse configurations and adapting to your package requirements.

Build and share your own distributed algorithms

Have you ever wanted to build innovative distributed algorithms for scalable SQL query execution, linear algebra or machine learning, but were blocked by the fact that you needed to build an entire distributed infrastructure before working on the actual algorithm? Then TileDB Cloud provides what you have been waiting for: a full-fledged serverless infrastructure that allows you to easily build sophisticated task DAGs with arbitrary UDFs and implement any distributed algorithm, from a SQL GROUP BY, to PCA, to k-means clustering, to anything you can imagine. And of course, you can register and share your algorithms with anyone on the planet. Over the next few months we will be releasing our own distributed algorithms for everyone to use, but we are hoping to see your own brilliance and creativity through your future contributions to the platform.
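
A distributed algorithm on this infrastructure boils down to composing UDFs into a DAG. Below is a local sketch of a two-stage distributed mean: a fan-out stage of partial sums over chunks, then a fan-in combine step. The commented lines sketch how the same functions could be wired up on TileDB Cloud (the Delayed API names are assumptions based on the tiledb-cloud client):

```python
# Stage 1 (fan-out): each task sums one chunk of the data.
def partial_sum(chunk):
    return (sum(chunk), len(chunk))

# Stage 2 (fan-in): combine the partial results into the global mean.
def combine(partials):
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
partials = [partial_sum(c) for c in chunks]  # fan-out stage, run locally here
mean = combine(partials)                     # fan-in stage
print(mean)  # 5.0

# The same DAG on TileDB Cloud, each task running serverlessly (sketch):
# from tiledb.cloud.compute import Delayed
# tasks = [Delayed(partial_sum)(c) for c in chunks]
# result = Delayed(combine)(tasks).compute()
```

The algorithm logic lives entirely in the two plain functions; the platform only changes where the tasks execute.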