CSV Ingestion
TileDB currently offers a CSV ingestion function only in Python. Function tiledb.from_csv accepts all the parameters of pandas.read_csv, along with the TileDB-specific parameters explained in this section.
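Because the extra keyword arguments are forwarded to pandas.read_csv, you can pass standard pandas parsing options directly. Here is a minimal sketch, assuming a hypothetical tab-separated file data.tsv:
import tiledb

# The file name and delimiter are illustrative; sep is a plain
# pandas.read_csv option that tiledb.from_csv passes through.
tiledb.from_csv("my_array", "data.tsv", sep="\t")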

Ingest Into A Dense Array

You can ingest a CSV file into a dense array as follows:
tiledb.from_csv("my_array", "data.csv")
The resulting array in this case is always 1D. You can see the schema as follows:
A = tiledb.open("my_array", mode="r")
A.schema
You can set the tile extent upon ingestion as follows:
tiledb.from_csv("my_array", "data.csv", tile=1000000)
You can read the data back into a Pandas dataframe as follows (see Reading Into Dataframes for more details):
A = tiledb.open("my_array", mode="r")
A.df[:]

Ingest Into A Sparse Array

To ingest a CSV file into a sparse array, all you need to do is set sparse=True and select any subset of columns as dimensions in the index_col parameter.
tiledb.from_csv("my_array", "data.csv", sparse=True, index_col=['col3'])
The resulting array can have any number of dimensions. You can see the schema as follows:
A = tiledb.open("my_array", mode="r")
A.schema
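For example, selecting two columns as dimensions produces a 2D sparse array. A minimal sketch (the column names are illustrative):
import tiledb

# Two index columns yield a 2D sparse array, one dimension per column.
tiledb.from_csv("my_array_2d",
                "data.csv",
                sparse=True,
                index_col=['col3', 'col4'])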
You can set other parameters upon ingestion, such as cell_order and tile_order, the tile capacity, and whether the array allows_duplicates. For example:
tiledb.from_csv("my_array",
                "data.csv",
                sparse=True,
                index_col=['col3'],
                cell_order='hilbert',
                capacity=100000)
As with dense arrays, you can read the data into a Pandas dataframe:
A = tiledb.open("my_array", mode="r")
A.df[:]
By default, the domain of each dimension is calculated as a tight bound on the coordinate values of the ingested data. It is often desirable to set the dimensions to their full domains instead, such as when you incrementally ingest more CSV data into the array (see Ingest Into An Existing Array, Ingest Large CSV Files and Ingest Multiple CSV Files). This can be done by simply setting full_domain=True:
tiledb.from_csv("my_array",
                "data.csv",
                sparse=True,
                index_col=['col3'],
                cell_order='hilbert',
                capacity=100000,
                full_domain=True)

Parsing Dates

If you wish to parse and store certain columns as datetime types, you can specify them using parse_dates as follows:
tiledb.from_csv("my_array",
                "data.csv",
                capacity=100000,
                sparse=True,
                index_col=['col3'],
                parse_dates=['col1', 'col2'])

Adding Fill Values For Nulls

If there are missing values in some columns of the dataset, from_csv allows you to specify fill values to replace them. You can do this on a per-attribute basis with fillna as follows:
tiledb.from_csv("my_array",
                "data.csv",
                capacity=100000,
                sparse=True,
                index_col=['col3'],
                parse_dates=['col1', 'col2'],
                fillna={'col4': ''})

Adding Filters

You can add filters (e.g., compression) for dimensions and attributes using dim_filters and attr_filters. If you wish to set the same filter for all dimensions (e.g., Zstd) and the same filter for all attributes (e.g., Gzip), you can do it as follows:
tiledb.from_csv("my_array",
                "data.csv",
                capacity=100000,
                sparse=True,
                index_col=['col3'],
                dim_filters=tiledb.FilterList([tiledb.ZstdFilter(level=-1)]),
                attr_filters=tiledb.FilterList([tiledb.GzipFilter(level=-1)]))
If you wish to apply filters selectively to a subset of dimensions or attributes, pass a dictionary to dim_filters or attr_filters; e.g., attr_filters={'col1': tiledb.FilterList([tiledb.GzipFilter(level=-1)])} sets the Gzip filter only on the col1 attribute.
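For instance, the following sketch compresses only the col1 attribute with Gzip, leaving the remaining attributes at their defaults:
import tiledb

# Only col1 gets an explicit filter list; other attributes keep their defaults.
tiledb.from_csv("my_array",
                "data.csv",
                sparse=True,
                index_col=['col3'],
                attr_filters={'col1': tiledb.FilterList([tiledb.GzipFilter(level=-1)])})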
Function from_csv does not support encryption. Let us know if this is important to you and we will escalate its development.

Create Array Schema From A CSV File

It is possible to create only the array schema from a CSV file, i.e., without ingesting the actual data into the array. This is useful if you'd like to ingest CSV files into an existing array, either because you wish to ingest files in parallel (e.g., via multiple processes on a cloud object store), or because you may not have all the CSV files available at the time of ingestion (e.g., the data may be coming in successive batches).
To create an array from a CSV file without ingesting the data, you can set mode='schema_only' as in the example shown below:
tiledb.from_csv("my_array", "data.csv", mode='schema_only')
Note that in this case TileDB sets full_domain=True by default, since you typically use this mode when you wish to ingest more than one file into the array and, therefore, it is safer to set the dimensions to their full domains. This applies to the dense case as well.

Ingest Into An Existing Array

Suppose you have already created the array, e.g., using mode='schema_only' as explained in the subsection above. To ingest the data of a CSV file into this array (provided that the data complies with the created schema), you can simply run:
tiledb.from_csv("my_array",
                "data.csv",
                mode='append',
                fillna={'col1': ''})
In the dense case, if you wish to ingest multiple CSV files in parallel, you need to instruct TileDB where along the dense dimension each write should occur, i.e., you need to provide the start index of each write. This can be done by setting row_start_idx as follows:
tiledb.from_csv("my_array",
                "data.csv",
                mode='append',
                row_start_idx=1000000,
                fillna={'col1': ''})
For example, if you wish to ingest two CSV files after creating the array, where the first file has 1000 rows and the second 2000 rows, you can set row_start_idx=0 for the first file (the default) and row_start_idx=1000 for the second. That will create two fragments in the array, one covering domain slice [0, 999] and the other [1000, 2999], as sketched below.
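A minimal sketch of this two-file scenario (the file names are illustrative):
import tiledb

# data1.csv holds 1000 rows, data2.csv holds 2000 rows.
tiledb.from_csv("my_array", "data1.csv", mode='append')                      # writes rows [0, 999]
tiledb.from_csv("my_array", "data2.csv", mode='append', row_start_idx=1000)  # writes rows [1000, 2999]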

Ingest Large CSV Files

If the CSV file does not fit in main memory, or if you wish to lower the memory consumption of the ingestion, you can use the chunksize parameter to control the number of rows ingested per batch. Each batch will create a new fragment in the array.
tiledb.from_csv("my_array", "data.csv", chunksize=1000000)
If there are 7.6 million records in the CSV file, the above will create 8 fragments since chunksize is set to a million rows.
This function call sets the dimension to its full domain, since the data is fetched into memory in chunks and, therefore, it is not possible to know the tight domain of the entire CSV file in advance.
Finally, note that chunksize is applicable to sparse arrays as well, used in a similar fashion.
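For instance, a minimal sketch of chunked sparse ingestion, reusing the column names from the earlier examples:
import tiledb

# Each batch of up to 1,000,000 rows is written as a separate fragment.
tiledb.from_csv("my_array",
                "data.csv",
                sparse=True,
                index_col=['col3'],
                chunksize=1000000)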

Ingest Multiple CSV Files

You can ingest multiple CSV files by providing a list of URIs as follows:
tiledb.from_csv("my_array",
                ["data1.csv", "data2.csv"],
                chunksize=1000000)
Note that it is important to set the chunksize parameter as explained in the previous subsection, which will also set the dimension to its full domain.

Ingest From / To Cloud Storage

You can ingest CSV files directly from cloud storage (e.g., AWS S3), without the need to download them to your local machine. It is as simple as providing the cloud URI for the CSV file:
tiledb.from_csv("my_array", "s3://my_bucket/data.csv")
Similarly, you can ingest into an array on cloud storage as well:
tiledb.from_csv("s3://my_bucket/my_array", "s3://my_bucket2/data.csv")