Since a dataframe is either a dense or a sparse array, you can create and write to a dataframe in the same manner as described in Create Arrays and Write Arrays. However, in this section we provide auxiliary ways to create and write the arrays that model your dataframes.
All data in TileDB is modeled either as a dense or a sparse array. This of course includes dataframes. In other words, dataframes are arrays and, as such, all functionality described in Arrays applies to dataframes as well. However, we have implemented special functionality for creating, ingesting and reading dataframes to make usage more natural to users familiar with dataframes. For example, we provide SQL support via MariaDB, integration with Pandas and Arrow Tables in Python, and more.
TileDB currently offers CSV functionality only in Python. TileDB-Py supports function from_csv, which can perform three tasks:

- schema_only
- ingest (default)
- append
In this section, we will cover the schema_only mode, which allows solely the creation of the array schema for a dataframe stored in a CSV file, without ingesting any rows / cells into it.
You can create a dataframe as a 1D dense array from the data stored in a CSV file as follows:
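For example (a minimal sketch; "my_dense_df" and "data.csv" below are placeholder names):

```python
import tiledb

# Create only the array schema from the CSV header and inferred types.
tiledb.from_csv("my_dense_df", "data.csv", mode="schema_only")
```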
You can see the schema as follows:
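For instance (assuming the array created above):

```python
import tiledb

with tiledb.open("my_dense_df") as A:
    print(A.schema)
```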
Modeling dataframes as dense arrays is useful when no particular column is the target of query conditions in most workloads. The underlying 1D dense array enables rapid slicing of any set of contiguous rows.
You can set the tile extent of the 1D array upon creation as follows:
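A sketch, assuming the tile keyword argument sets the extent of the single row dimension:

```python
import tiledb

tiledb.from_csv("my_dense_df", "data.csv", mode="schema_only", tile=100_000)
```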
You can create a dataframe as an ND sparse array by setting sparse=True and selecting any subset of columns as dimensions in the index_dims parameter.
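For instance, a sketch using two hypothetical columns col1 and col2 as dimensions:

```python
import tiledb

tiledb.from_csv("my_sparse_df", "data.csv",
                mode="schema_only", sparse=True, index_dims=["col1", "col2"])
```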
The resulting array can have any number of dimensions. You can see the schema as follows:
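For instance (assuming the sparse array created above):

```python
import tiledb

with tiledb.open("my_sparse_df") as A:
    print(A.schema)
```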
You can set other parameters upon creation, such as cell_order and tile_order, the tile capacity, and whether the array allows_duplicates. For example:
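A sketch with placeholder values (capacity and allows_duplicates apply to sparse arrays):

```python
import tiledb

tiledb.from_csv("my_sparse_df", "data.csv",
                mode="schema_only", sparse=True, index_dims=["col1"],
                cell_order="row-major",
                tile_order="row-major",
                capacity=100_000,
                allows_duplicates=True)
```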
If you wish to parse and store certain columns as datetime types, you can specify it using parse_dates as follows (applicable to both dense and sparse arrays):
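A sketch, with purchase_date as a hypothetical datetime column (parse_dates is forwarded to pandas.read_csv):

```python
import tiledb

tiledb.from_csv("my_dense_df", "data.csv",
                mode="schema_only", parse_dates=["purchase_date"])
```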
You can add filters (e.g., compression) for dimensions and attributes using dim_filters and attr_filters, and you can also add filters for the offsets of variable-length attributes using filter_offsets (applicable to both dense and sparse arrays).
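A sketch with hypothetical column names, assuming dim_filters and attr_filters accept a dict mapping column names to filters:

```python
import tiledb

tiledb.from_csv("my_sparse_df", "data.csv",
                mode="schema_only", sparse=True, index_dims=["id"],
                dim_filters={"id": tiledb.FilterList([tiledb.ZstdFilter(level=5)])},
                attr_filters={"name": tiledb.GzipFilter(level=9)})
```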
Function from_csv uses pandas.read_csv under the covers to parse the CSV file and infer the data types of the columns. You can bypass pandas type inference and force the types of the columns using column_types (applicable to both dense and sparse dataframes):
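For instance (price and quantity are hypothetical columns):

```python
import numpy as np
import tiledb

tiledb.from_csv("my_dense_df", "data.csv",
                mode="schema_only",
                column_types={"price": np.float32, "quantity": np.int64})
```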
You can set an attribute as nullable by casting to a pandas datatype that can handle NA values. Applicable data types include pandas.StringDtype(), pandas.Int8Dtype(), pandas.Int16Dtype(), pandas.Int32Dtype(), pandas.Int64Dtype(), pandas.UInt8Dtype(), pandas.UInt16Dtype(), pandas.UInt32Dtype(), pandas.UInt64Dtype() and pandas.BooleanDtype().
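For instance, a sketch using the dtype argument (forwarded to pandas.read_csv) to make a hypothetical column col1 a nullable string attribute:

```python
import pandas as pd
import tiledb

tiledb.from_csv("my_dense_df", "data.csv",
                mode="schema_only", dtype={"col1": pd.StringDtype()})
```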
You can create a dataframe from a CSV file stored in cloud storage (e.g., AWS S3), without the need to download the file on your local machine. It is as simple as providing the cloud URI for the CSV file:
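For instance (placeholder bucket and file names):

```python
import tiledb

tiledb.from_csv("my_dense_df", "s3://my-bucket/data.csv", mode="schema_only")
```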
For this to work, you will need to store your cloud credentials in the appropriate environment variables, for example (for AWS):
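For instance, from within Python (a shell export works equally well; the values are placeholders):

```python
import os

os.environ["AWS_ACCESS_KEY_ID"] = "<access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-access-key>"
```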
Alternatively, you can create a VFS object, properly configured with the cloud credentials, open the CSV file through it and ingest the parsed dataframe with from_pandas, as follows (for AWS):
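A sketch (placeholder credentials and URIs); the CSV is opened through the VFS, parsed with pandas and ingested with from_pandas:

```python
import pandas as pd
import tiledb

config = tiledb.Config({
    "vfs.s3.aws_access_key_id": "<access-key-id>",
    "vfs.s3.aws_secret_access_key": "<secret-access-key>",
})
vfs = tiledb.VFS(config=config)

# Open the CSV through the VFS, parse it with pandas, then ingest the dataframe.
with vfs.open("s3://my-bucket/data.csv") as f:
    df = pd.read_csv(f)
tiledb.from_pandas("my_dense_df", df)
```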
Similarly, you can ingest into an array on cloud storage as well, creating a context configured with the cloud credentials and passing it to from_csv. An example for AWS is shown below:
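A sketch, assuming from_csv accepts a ctx keyword argument (placeholder credentials and URIs):

```python
import tiledb

ctx = tiledb.Ctx(tiledb.Config({
    "vfs.s3.aws_access_key_id": "<access-key-id>",
    "vfs.s3.aws_secret_access_key": "<secret-access-key>",
}))

# Write the array directly to S3 using the configured context.
tiledb.from_csv("s3://my-bucket/my_dense_df", "data.csv", ctx=ctx)
```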
The above can be combined if both the input CSV and output array reside on cloud storage. An example for AWS is shown below:
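A sketch combining the two (placeholders throughout):

```python
import tiledb

ctx = tiledb.Ctx(tiledb.Config({
    "vfs.s3.aws_access_key_id": "<access-key-id>",
    "vfs.s3.aws_secret_access_key": "<secret-access-key>",
}))

# Both the input CSV and the output array live on S3.
tiledb.from_csv("s3://my-bucket/my_dense_df", "s3://my-bucket/data.csv", ctx=ctx)
```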
Function from_csv does not support encryption. Contact us if this is important to you and we will escalate its development.
Essentially, dtype above creates a pandas dataframe that sets the col1 datatype to a string type that handles missing values, which TileDB picks up in order to define col1 as a nullable attribute.
A dataframe is a specialization of an array (see Use Cases). As such, any TileDB API works natively for writing to and reading from a dataframe modeled as an array. However, Pandas is a popular in-memory dataframe offering for Python and, therefore, TileDB provides specially optimized functionality for reading directly from an array into a Pandas dataframe. This How To guide describes this functionality.
Sections Create Dataframes and Write Dataframes describe how to ingest a dataframe into a 1D dense or an ND sparse array. This section covers how to read from the ingested dataframes directly into either Pandas dataframes or Arrow Tables. Throughout the section, we use the term dense dataframe for a dataframe modeled as a 1D dense array, and sparse dataframe for a dataframe modeled as an ND sparse array.
Since the dataframe is an array, you can read the underlying schema in the same manner as for arrays as follows:
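For instance ("my_dense_df" is a placeholder URI):

```python
import tiledb

with tiledb.open("my_dense_df") as A:
    print(A.schema)
```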
Suppose you have ingested a CSV file into a 1D dense array.
To find out how many rows were ingested, you can take a look at the array non-empty domain:
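For instance:

```python
import tiledb

with tiledb.open("my_dense_df") as A:
    # For a 1D dense dataframe, this returns ((first_row, last_row),).
    print(A.nonempty_domain())
```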
To read data from an array into a Pandas dataframe, you can use the df operator:
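For instance:

```python
import tiledb

with tiledb.open("my_dense_df") as A:
    df = A.df[:]   # read the entire dataframe
```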
For dense arrays, this operator allows you to efficiently slice any subset of rows:
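For instance (slice bounds in df follow TileDB semantics and are inclusive):

```python
import tiledb

with tiledb.open("my_dense_df") as A:
    df = A.df[10:19]   # rows 10 through 19
```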
TileDB is a columnar format and, therefore, allows you to efficiently subselect on columns / attributes as follows:
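For instance (price and quantity are hypothetical attribute names):

```python
import tiledb

with tiledb.open("my_dense_df") as A:
    # Only the listed attributes are read from storage.
    df = A.query(attrs=["price", "quantity"]).df[0:99]
```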
Suppose you have ingested a CSV file into a 2D sparse array.
This array allows for efficient slicing on the two dimensions as follows:
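For instance (hypothetical integer dimensions; both bounds are inclusive):

```python
import tiledb

with tiledb.open("my_sparse_df") as A:
    df = A.df[1:100, 500:600]
```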
You can prevent the Pandas dataframe from materializing the index columns (which will boost read performance) as follows:
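A sketch, assuming the index_col argument of query:

```python
import tiledb

with tiledb.open("my_sparse_df") as A:
    # Do not build a pandas index from the dimension columns.
    df = A.query(index_col=False).df[:]
```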
You can check the non-empty domain on the two dimensions as follows:
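For instance:

```python
import tiledb

with tiledb.open("my_sparse_df") as A:
    # One (min, max) pair per dimension.
    print(A.nonempty_domain())
```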
Being a columnar format, TileDB allows you to efficiently subselect on attributes and dimensions as follows:
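For instance (col1 is a hypothetical dimension and price a hypothetical attribute):

```python
import tiledb

with tiledb.open("my_sparse_df") as A:
    df = A.query(dims=["col1"], attrs=["price"]).df[:]
```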
If you are using Apache Arrow, TileDB can return dataframe results directly as Arrow Tables with zero-copy as follows:
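A sketch, assuming PyArrow is installed:

```python
import tiledb

with tiledb.open("my_sparse_df") as A:
    tbl = A.query(return_arrow=True).df[:]   # a pyarrow.Table
```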
TileDB supports SQL via its integration with MariaDB. A simple example is shown below, but for more details read section Embedded SQL.
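A minimal sketch, assuming the TileDB-MariaDB Python integration (the tiledb.sql package) is installed:

```python
import pandas as pd
import tiledb.sql

# Query an array (placeholder name) through the embedded MariaDB engine.
db = tiledb.sql.connect()
df = pd.read_sql("SELECT * FROM `my_sparse_df` LIMIT 5", con=db)
```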
Section Create Dataframes describes how to create a dataframe as a 1D dense or an ND sparse array. This section covers how to:

- ingest a single CSV file and create the underlying array schema in a single command
- append one or more CSV files into an existing array / dataframe

Throughout the section, we use the term dense dataframe for a dataframe modeled as a 1D dense array, and sparse dataframe for a dataframe modeled as an ND sparse array.
You can ingest a CSV file into a dense dataframe as follows:
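For instance (placeholder names; mode="ingest" could be omitted):

```python
import tiledb

tiledb.from_csv("my_dense_df", "data.csv", mode="ingest")
```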
The resulting array in this case is always 1D. Note that in this case mode="ingest" is the default value in from_csv. You can see the schema as follows:
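For instance:

```python
import tiledb

with tiledb.open("my_dense_df") as A:
    print(A.schema)
```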
One thing to notice early on is the dimension domain, which will be equal to [0, rows_num), where rows_num is the number of rows in the CSV file you are ingesting. This can be relaxed as described later.
You can set the tile extent upon ingestion as follows:
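A sketch, assuming the tile keyword argument:

```python
import tiledb

tiledb.from_csv("my_dense_df", "data.csv", tile=100_000)
```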
To ingest a CSV file into a sparse array, all you need to do is set sparse=True and select any subset of columns as dimensions in the index_dims parameter.
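For instance, using two hypothetical columns as dimensions:

```python
import tiledb

tiledb.from_csv("my_sparse_df", "data.csv",
                sparse=True, index_dims=["col1", "col2"])
```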
The resulting array can have any number of dimensions. You can see the schema as follows:
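For instance:

```python
import tiledb

with tiledb.open("my_sparse_df") as A:
    print(A.schema)
```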
By default, the domain of each dimension is calculated as a tight bound of the coordinate values of the ingested data. We show later that this can be tweaked.
You can set other parameters upon ingestion, such as cell_order and tile_order, the tile capacity, and whether the array allows_duplicates. For example:
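A sketch with placeholder values:

```python
import tiledb

tiledb.from_csv("my_sparse_df", "data.csv",
                sparse=True, index_dims=["col1"],
                cell_order="row-major", tile_order="row-major",
                capacity=100_000, allows_duplicates=True)
```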
If you intend to add more data in the future into the dataframe you are ingesting (for both the dense and sparse case), you will need to set the dimension domains to their full domains (otherwise TileDB may complain that you are attempting to write outside of the dimension domain bounds). This can be done by simply setting full_domain=True:
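For instance:

```python
import tiledb

# Leave room for future appends by using the full dimension domain.
tiledb.from_csv("my_sparse_df", "data.csv",
                sparse=True, index_dims=["col1"], full_domain=True)
```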
If you wish to parse and store certain columns as datetime types, you can specify it using parse_dates as follows (applicable to both dense and sparse dataframes):
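For instance (purchase_date is a hypothetical column):

```python
import tiledb

tiledb.from_csv("my_dense_df", "data.csv", parse_dates=["purchase_date"])
```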
In case there are missing values in the dataset for some columns, from_csv allows you to specify fill values to replace those missing values. You can do this on a per-attribute basis with fillna as follows (applicable to both dense and sparse dataframes):
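A sketch, assuming fillna takes a dict mapping column names to fill values (hypothetical columns):

```python
import tiledb

tiledb.from_csv("my_dense_df", "data.csv",
                fillna={"price": 0.0, "quantity": 0})
```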
You can add filters (e.g., compression) for dimensions and attributes using dim_filters and attr_filters (applicable to both dense and sparse arrays).
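For instance (hypothetical column names, assuming per-column dicts):

```python
import tiledb

tiledb.from_csv("my_sparse_df", "data.csv",
                sparse=True, index_dims=["id"],
                dim_filters={"id": tiledb.ZstdFilter(level=5)},
                attr_filters={"name": tiledb.GzipFilter(level=9)})
```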
Function from_csv uses pandas.read_csv under the covers to parse the CSV file and infer the data types of the columns. You can bypass pandas type inference and force the types of the columns using column_types (applicable to both dense and sparse dataframes):
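For instance (hypothetical columns):

```python
import numpy as np
import tiledb

tiledb.from_csv("my_dense_df", "data.csv",
                column_types={"price": np.float32, "quantity": np.int64})
```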
In case the CSV file does not fit in main memory, or if you wish to lower the memory consumption of your CSV file ingestion, you can use parameter chunksize to control the number of rows you'd like to ingest in a batch fashion. Each batch will create a new fragment in the array (applicable to both dense and sparse dataframes).
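For instance:

```python
import tiledb

# Ingest one million rows at a time; each batch becomes a separate fragment.
tiledb.from_csv("my_dense_df", "data.csv", chunksize=1_000_000)
```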
If there are 7.6 million records in the CSV file, the above will create 8 fragments, since chunksize is set to a million rows.
This function call sets the dimension to its full domain, since the data is fetched into memory in chunks and, therefore, it is not possible to know the tight domain of the entire CSV file in advance.
For the dense case, you will need to instruct TileDB where in the dense dimension each write will occur, i.e., you need to provide the start index of each write. This can be done by setting row_start_idx as follows:
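For instance, a sketch assuming mode="append" for writes into an existing array (the scenario is detailed right after):

```python
import tiledb

# data1.csv has 1000 rows, data2.csv has 2000 rows.
tiledb.from_csv("my_dense_df", "data1.csv", mode="append", row_start_idx=0)
tiledb.from_csv("my_dense_df", "data2.csv", mode="append", row_start_idx=1000)
```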
For example, if you wish to ingest two CSV files after creating an empty dataframe, where the first CSV file has 1000 rows and the second 2000 rows, then you can set row_start_idx=0 for the first file (default) and row_start_idx=1000 for the second file. That will create two fragments in the array, one in domain slice [0, 999] and the other in [1000, 2999].
You can ingest multiple CSV files by providing a list of URIs as follows:
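For instance (placeholder file names):

```python
import tiledb

tiledb.from_csv("my_dense_df",
                ["data1.csv", "data2.csv", "data3.csv"],
                chunksize=1_000_000)
```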
Note that it is important to set the chunksize parameter as explained in the previous subsection, which will also set the dimension to its full domain.
Function tiledb.from_csv accepts all the parameters of pandas.read_csv, along with the TileDB-specific parameters explained in this section.
Suppose you have already created the array (e.g., using mode="schema_only" as explained in Create Dataframes), or have already ingested a CSV setting full_domain=True as explained above. To append the data of a CSV file into this array (provided that the data complies with the created schema), you simply need to run from_csv setting mode="append" (applicable to both dense and sparse dataframes):
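For instance:

```python
import tiledb

# The array already exists; the CSV columns must match its schema.
tiledb.from_csv("my_sparse_df", "more_data.csv", mode="append")
```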
This functionality is identical to what is described earlier for CSV files and arrays on cloud storage, noting that you will need to make proper use of mode="ingest" or mode="append" as explained above, instead of mode="schema_only".
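For instance, a sketch appending a CSV on S3 into an array on S3 (placeholders throughout):

```python
import tiledb

ctx = tiledb.Ctx(tiledb.Config({
    "vfs.s3.aws_access_key_id": "<access-key-id>",
    "vfs.s3.aws_secret_access_key": "<secret-access-key>",
}))

tiledb.from_csv("s3://my-bucket/my_dense_df", "s3://my-bucket/data.csv",
                mode="append", ctx=ctx)
```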