Write Dataframe From CSV

TileDB currently offers a CSV ingestion function only in Python. Function tiledb.from_csv accepts all the parameters of pandas.read_csv along with the TileDB specific parameters that are explained in this section.

Section Creating Dataframes describe how to create a dataframe as a 1D dense or a ND sparse array. This section covers how to:

  1. ingest a single CSV file and create the underlying array schema in a single command

  2. append one or more CSV files into an existing array / dataframe

Throughout the section, we call dense dataframe a dataframe modeled as a 1D dense array, and sparse dataframe a dataframe modeled as a ND sparse array.

Ingest a Single CSV File

Create and Write to a Dense Dataframe

You can ingest a CSV file into a dense dataframe as follows:

tiledb.from_csv("my_array", "data.csv")

The resulting array in this case is always 1D. Note that in this case mode="ingest" is the default value in from_csv. You can see the schema as follows:

A = tiledb.open("my_array", mode="r")
A.schema 

One thing to notice early is the dimension domain, which will be equal to [0, rows_num), where rows_num is the number of rows in the CSV file you are ingesting. This can be relaxed as described later.

You can set the tile extent upon ingestion as follows:

tiledb.from_csv("my_array", "data.csv", tile=1000000)

Create and Write to a Sparse Dataframe

To ingest a CSV file into a sparse array, all you need to do is set sparse=True and select any subset of columns as dimensions in the index_dims parameter.

tiledb.from_csv("my_array", "data.csv", sparse=True, index_dims=['col3'])

The resulting array can have any number of dimensions. You can see the schema as follows:

A = tiledb.open("taxi_array", mode="r")
A.schema

By default, the domain of each dimension is calculated as a tight bound of the coordinate values of the ingested data. We show later that this can be tweaked.

You can set other parameters upon ingestion, such as cell_order and tile_order, the tile capacity and whether the array allows_duplicates. For example:

tiledb.from_csv("my_array", 
                "data.csv",
                sparse=True, 
                index_dims=['col3'],
                cell_order='hilbert'
                capacity=100000,
                allows_duplicates=False)

If you intend to add more data in the future into the dataframe you are ingesting (for both the dense and sparse case), you will need to set the dimension domains to their full domains (otherwise TileDB may complain that you are attempting to write outside of the dimension domain bounds). This can be done by simply setting full_domain=True:

tiledb.from_csv("my_array", 
                "data.csv",
                sparse=True,           # Applicable to dense too
                index_dims=['col3'],
                cell_order='hilbert'
                capacity=100000,
                full_domain=True)

Parse Dates

If you wish to parse and store certain columns as datetime types, you can specify it using parse_dates as follows (applicable to both dense and sparse dataframes):

tiledb.from_csv("my_array", 
                "data.csv", 
                capacity=100000, 
                sparse=True, 
                index_dims=['col3'], 
                parse_dates=['col1', 'col2'])

Add Fill Values For Nulls

In the case there are missing values in the dataset for some columns, from_csv allows you to specify fill values to replace those missing values. You can do this on a per attribute basis with fillna as follows (applicable to both dense and sparse dataframes):

tiledb.from_csv("my_array", 
                "data.csv", 
                capacity=100000, 
                sparse=True, 
                index_col=['col3'], 
                parse_dates=['col1', 'col2'],
                fillna={'col4': ''})

Add Filters

You can add filters (e.g., compression) for dimensions and attributes using dim_filters and attr_filters (applicable to both dense and sparse arrays).

tiledb.from_csv("my_array", 
                "data.csv", 
                capacity=100000, 
                sparse=True, 
                index_dims=['col3'], 
                dim_filters=tiledb.FilterList([tiledb.ZstdFilter(level=-1)]),           # all dims will get this filter
                attr_filters={'col1': tiledb.FilterList([tiledb.GzipFilter(level=-1)])} # you can set a different filter per attribute (applies also to dimensions)                

Function from_csv does not support encryption. Let us know if this is important to you and we will escalate its development.

Set Column Types

Command from_csv uses pandas.read_csv under the covers to parse the CSV file and infer the data types of the columns. You can bypass pandas type inference and force the types of the columns using column_types(applicable to both dense and sparse dataframes):

tiledb.from_csv("my_array", 
                "data.csv", 
                capacity=100000, 
                sparse=True, 
                index_dims=['col3'], 
                column_types={"col1" : np.int32, "col2": np.float64})

Set Nullable Attributes

You can set an attribute as nullable by casting to a pandas datatype that can handle NA values. Applicable data types include: pandas.StringDType(), pandas.Int8DType(), pandas.Int16DType(), pandas.Int32DType(), pandas.Int64DType(), pandas.UInt8DType(), pandas.UInt16DType(), pandas.UInt32DType(), pandas.UInt64DType(), pandas.BooleanDType() .

tiledb.from_csv("my_array", 
                "data.csv", 
                capacity=100000, 
                sparse=True, 
                index_dims=['col3'], 
                dtype={"col1" : pandas.StringDType())

Essentially dtype above creates a pandas dataframe setting the col1 datatype to a string type that handles missing values, which TileDB picks up and defines col1 as a nullable attribute. See how pandas handles missing values for more information.

Ingest Large CSV Files

In case the CSV file does not fit in main memory or if you wish to lower the memory consumption of your CSV file ingestion, you can use parameter chunksize to control the number of rows you'd like to ingest in a batch fashion. Each batch will create a new fragment in the array (applicable to both dense and sparse dataframes).

tiledb.from_csv("my_array", "data.csv", chunksize=1000000)

If there are 7.6 million records in the CSV file, the above will create 8 fragments since chunksize is set to a million rows.

This function call sets the dimension to its full domain, since the data is fetched into memory in chunks and, therefore, it is not possible to know the tight domain of the entire CSV file in advance.

Append CSV Files

Suppose you have already created the array, e.g., using mode='schema_only' as explained in Create Dataframe From CSV or have already ingested a CSV as explained in Ingest a Single CSV (setting full_domain=True). To append the data of a CSV file into this array (provided that the data complies with the created schema), you simply need to run from_csv setting mode=append (applicable to both dense and sparse dataframes):

tiledb.from_csv("my_array", "data.csv", mode='append')

For the dense case, you will need to instruct TileDB where in the dense dimension each write will occur, i.e., you need to provide the start index of each write. This can be done by setting row_start_idx as follows:

tiledb.from_csv("my_array", 
                "data.csv",
                mode='append',
                row_start_idx=1000000)

For example, if you wish to ingest two CSV files after creating an empty dataframe, where the first CSV file has 1000 rows and the second 2000 rows, then you can set row_start_idx=0 for the first file (default) and row_start_idx=1000 for the second file. That will create two fragments in the array, one in domain slice [0,999] and the other in [1000, 2999].

Ingest Multiple CSV Files

You can ingest multiple CSV files by providing a list of URIs as follows:

tiledb.from_csv("my_array", 
                ["data1.csv", "data2.csv"], 
                chunksize=1000000)

Note that it is important to set the chunksize parameter as explained in the previous subsection, which will also set the dimension to its full domain.

Ingest From / To Cloud Storage

This functionality is identical to what is described in Create From / To Cloud Storage, noting that you will need to make proper use of mode=ingest or mode=append as explained above, instead of mode=schema_only.

Last updated