TileDB currently offers a CSV ingestion function only in Python. Function tiledb.from_csv accepts all the parameters of pandas.read_csv, along with the TileDB-specific parameters that are explained in this section.
Section Creating Dataframes describes how to create a dataframe as a 1D dense or an ND sparse array. This section covers how to:
- ingest a single CSV file and create the underlying array schema in a single command
- append one or more CSV files into an existing array / dataframe
Throughout the section, we use the term dense dataframe for a dataframe modeled as a 1D dense array, and sparse dataframe for a dataframe modeled as an ND sparse array.
You can ingest a CSV file into a dense dataframe as follows:
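(A minimal sketch; the array URI my_dense_array and the file data.csv are placeholder names.)

```python
import tiledb

# Ingest the CSV into a new 1D dense array at URI "my_dense_array"
tiledb.from_csv("my_dense_array", "data.csv")
```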
The resulting array in this case is always 1D. Note that in this case mode="ingest" is the default value in from_csv. You can see the schema as follows:
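(A sketch reusing the placeholder URI from above.)

```python
import tiledb

# Load and print the array schema
print(tiledb.ArraySchema.load("my_dense_array"))
```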
One thing to notice early is the dimension domain, which will be equal to [0, rows_num), where rows_num is the number of rows in the CSV file you are ingesting. This can be relaxed as described later.
You can set the tile extent upon ingestion as follows:
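(A sketch assuming the tile keyword argument; names are placeholders as before.)

```python
import tiledb

# Use a space tile extent of 100 rows on the single dimension
tiledb.from_csv("my_dense_array", "data.csv", tile=100)
```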
To ingest a CSV file into a sparse array, all you need to do is set sparse=True and select any subset of columns as dimensions in the index_dims parameter.
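For example (a sketch assuming the CSV contains columns col1 and col2; all names are placeholders):

```python
import tiledb

# Ingest into a 2D sparse array, using col1 and col2 as the dimensions
tiledb.from_csv(
    "my_sparse_array",
    "data.csv",
    sparse=True,
    index_dims=["col1", "col2"],
)
```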
The resulting array can have any number of dimensions. You can see the schema as follows:
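(Again with the placeholder URI from above.)

```python
import tiledb

print(tiledb.ArraySchema.load("my_sparse_array"))
```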
By default, the domain of each dimension is calculated as a tight bound of the coordinate values of the ingested data. We show later that this can be tweaked.
You can set other parameters upon ingestion, such as the cell_order and tile_order, the tile capacity, and whether the array allows_duplicates. For example:
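(A sketch with arbitrary illustrative values, not recommendations.)

```python
import tiledb

tiledb.from_csv(
    "my_sparse_array",
    "data.csv",
    sparse=True,
    index_dims=["col1", "col2"],
    cell_order="col-major",
    tile_order="row-major",
    capacity=100000,
    allows_duplicates=True,
)
```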
If you intend to add more data in the future to the dataframe you are ingesting (in both the dense and sparse case), you will need to set the dimension domains to their full domains; otherwise, TileDB may complain that you are attempting to write outside of the dimension domain bounds. This can be done by simply setting full_domain=True:
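(A sketch with the placeholder names used earlier.)

```python
import tiledb

# Create the dimension with the full domain of its data type,
# so that future appends are not rejected as out of bounds
tiledb.from_csv("my_dense_array", "data.csv", full_domain=True)
```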
If you wish to parse and store certain columns as datetime types, you can specify them using parse_dates as follows (applicable to both dense and sparse dataframes):
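(A sketch assuming a hypothetical column date_col that holds date strings.)

```python
import tiledb

# Parse date_col during ingestion and store it as a datetime attribute
tiledb.from_csv("my_array", "data.csv", parse_dates=["date_col"])
```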
If there are missing values in the dataset for some columns, from_csv allows you to specify fill values to replace them. You can do this on a per-attribute basis with fillna as follows (applicable to both dense and sparse dataframes):
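(A sketch assuming hypothetical columns col1, numeric, and col2, string.)

```python
import tiledb

# Replace missing values with 0 in col1 and the empty string in col2
tiledb.from_csv("my_array", "data.csv", fillna={"col1": 0, "col2": ""})
```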
You can add filters (e.g., compression) for dimensions and attributes using dim_filters and attr_filters (applicable to both dense and sparse arrays).
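For example (a sketch with arbitrary filter choices on placeholder columns):

```python
import tiledb

tiledb.from_csv(
    "my_sparse_array",
    "data.csv",
    sparse=True,
    index_dims=["col1"],
    # Gzip-compress the col1 dimension, zstd-compress the col2 attribute
    dim_filters={"col1": tiledb.FilterList([tiledb.GzipFilter(level=5)])},
    attr_filters={"col2": tiledb.FilterList([tiledb.ZstdFilter(level=7)])},
)
```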
Function from_csv does not support encryption. Let us know if this is important to you and we will escalate its development.
Function from_csv uses pandas.read_csv under the covers to parse the CSV file and infer the data types of its columns. You can bypass the pandas type inference and force the types of the columns using column_types (applicable to both dense and sparse dataframes):
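(A sketch with placeholder columns; the chosen types are arbitrary.)

```python
import numpy as np
import tiledb

# Bypass pandas type inference: force col1 to float64 and
# col2 to a (variable-length) string
tiledb.from_csv(
    "my_array",
    "data.csv",
    column_types={"col1": np.float64, "col2": str},
)
```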
You can set an attribute as nullable by casting it to a pandas datatype that can handle NA values. Applicable data types include pandas.StringDtype(), pandas.Int8Dtype(), pandas.Int16Dtype(), pandas.Int32Dtype(), pandas.Int64Dtype(), pandas.UInt8Dtype(), pandas.UInt16Dtype(), pandas.UInt32Dtype(), pandas.UInt64Dtype() and pandas.BooleanDtype().
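For example, using the dtype parameter that from_csv forwards to pandas.read_csv (col1 is a placeholder column name):

```python
import pandas as pd
import tiledb

# Cast col1 to pandas' nullable string dtype; TileDB then defines
# the corresponding attribute as nullable
tiledb.from_csv("my_array", "data.csv", dtype={"col1": pd.StringDtype()})
```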
Essentially, dtype above creates a pandas dataframe setting the col1 datatype to a string type that handles missing values, which TileDB picks up and defines col1 as a nullable attribute. See how pandas handles missing values for more information.
If the CSV file does not fit in main memory, or if you wish to lower the memory consumption of the ingestion, you can use parameter chunksize to control the number of rows ingested in each batch. Each batch will create a new fragment in the array (applicable to both dense and sparse dataframes).
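(A sketch with the placeholder names used earlier.)

```python
import tiledb

# Ingest one million rows at a time; each batch becomes a new fragment
tiledb.from_csv("my_array", "data.csv", chunksize=1000000)
```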
If there are 7.6 million records in the CSV file, the above will create 8 fragments, since chunksize is set to one million rows.
This function call sets the dimension to its full domain, since the data is fetched into memory in chunks and, therefore, it is not possible to know the tight domain of the entire CSV file in advance.
Suppose you have already created the array, e.g., using mode="schema_only" as explained in Create Dataframe From CSV, or have already ingested a CSV as explained in Ingest a Single CSV (setting full_domain=True). To append the data of a CSV file to this array (provided that the data complies with the created schema), you simply need to run from_csv setting mode="append" (applicable to both dense and sparse dataframes):
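(A sketch with the placeholder names used earlier.)

```python
import tiledb

# Append the CSV contents to the existing array
tiledb.from_csv("my_array", "data.csv", mode="append")
```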
For the dense case, you will need to instruct TileDB where along the dense dimension each write will occur, i.e., you need to provide the start index of each write. This can be done by setting row_start_idx as follows:
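(A sketch appending a hypothetical second file data2.csv after 1000 previously ingested rows.)

```python
import tiledb

# The appended rows are written starting at index 1000 of the dense dimension
tiledb.from_csv(
    "my_dense_array",
    "data2.csv",
    mode="append",
    row_start_idx=1000,
)
```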
For example, if you wish to ingest two CSV files after creating an empty dataframe, where the first CSV file has 1000 rows and the second 2000 rows, then you can set row_start_idx=0 for the first file (the default) and row_start_idx=1000 for the second. That will create two fragments in the array, one in domain slice [0, 999] and the other in [1000, 2999].
You can ingest multiple CSV files by providing a list of URIs as follows:
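(The three file names below are placeholders.)

```python
import tiledb

# Ingest multiple CSV files in a single call, in batches of one million rows
tiledb.from_csv(
    "my_array",
    ["data1.csv", "data2.csv", "data3.csv"],
    chunksize=1000000,
)
```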
Note that it is important to set the chunksize parameter as explained in the previous subsection, which will also set the dimension to its full domain.
This functionality is identical to what is described in Create From / To Cloud Storage, noting that you will need to make proper use of mode="ingest" or mode="append" as explained above, instead of mode="schema_only".