Since a dataframe is either a dense or a sparse array, you can create and write to a dataframe in the same manner as described in Create Arrays and Write Arrays. However, in this section we provide auxiliary ways to create and write the arrays that model your dataframes.
All data in TileDB is modeled either as a dense or a sparse array. This of course includes dataframes. In other words, dataframes are arrays and, as such, all functionality described in Arrays applies to dataframes as well. However, we have implemented special functionality for creating, ingesting and reading dataframes to make usage more natural to users familiar with dataframes. For example, we provide SQL support via MariaDB, integration with Pandas and Arrow Tables in Python, and more.
TileDB currently offers CSV functionality only in Python. TileDB-Py supports function from_csv, which can perform three tasks:

- schema_only
- ingest (default)
- append
In this section, we will cover the schema_only mode, which allows solely the creation of the array schema for a dataframe stored in a CSV file, without ingesting any rows / cells into it.
You can create a dataframe as a 1D dense array from the data stored in a CSV file as follows:
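For example (a minimal sketch; "my_dense_df" and "data.csv" below are placeholder names):

```python
import tiledb

# Create only the array schema from the CSV header and inferred types.
tiledb.from_csv("my_dense_df", "data.csv", mode="schema_only")
```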
You can see the schema as follows:
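For instance (assuming the array created above):

```python
import tiledb

with tiledb.open("my_dense_df") as A:
    print(A.schema)
```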
Modeling dataframes as dense arrays is useful when no particular column is the target of query conditions in most workloads. The underlying 1D dense array enables rapid slicing of any set of contiguous rows.
You can set the tile extent of the 1D array upon creation as follows:
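A sketch, assuming the tile keyword argument sets the extent of the single row dimension:

```python
import tiledb

tiledb.from_csv("my_dense_df", "data.csv", mode="schema_only", tile=100_000)
```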
You can create a dataframe as an ND sparse array by setting sparse=True and selecting any subset of columns as dimensions in the index_dims parameter.
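For instance, a sketch using two hypothetical columns col1 and col2 as dimensions:

```python
import tiledb

tiledb.from_csv("my_sparse_df", "data.csv",
                mode="schema_only", sparse=True, index_dims=["col1", "col2"])
```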
The resulting array can have any number of dimensions. You can see the schema as follows:
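For instance (assuming the sparse array created above):

```python
import tiledb

with tiledb.open("my_sparse_df") as A:
    print(A.schema)
```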
You can set other parameters upon creation, such as cell_order and tile_order, the tile capacity, and whether the array allows_duplicates. For example:
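A sketch with placeholder values (capacity and allows_duplicates apply to sparse arrays):

```python
import tiledb

tiledb.from_csv("my_sparse_df", "data.csv",
                mode="schema_only", sparse=True, index_dims=["col1"],
                cell_order="row-major",
                tile_order="row-major",
                capacity=100_000,
                allows_duplicates=True)
```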
If you wish to parse and store certain columns as datetime types, you can specify it using parse_dates as follows (applicable to both dense and sparse arrays):
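A sketch, with purchase_date as a hypothetical datetime column (parse_dates is forwarded to pandas.read_csv):

```python
import tiledb

tiledb.from_csv("my_dense_df", "data.csv",
                mode="schema_only", parse_dates=["purchase_date"])
```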
You can add filters (e.g., compression) for dimensions and attributes using dim_filters and attr_filters, and you can also add filters for the offsets of variable-length attributes using filter_offsets (applicable to both dense and sparse arrays).
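A sketch with hypothetical column names, assuming dim_filters and attr_filters accept a dict mapping column names to filters:

```python
import tiledb

tiledb.from_csv("my_sparse_df", "data.csv",
                mode="schema_only", sparse=True, index_dims=["id"],
                dim_filters={"id": tiledb.FilterList([tiledb.ZstdFilter(level=5)])},
                attr_filters={"name": tiledb.GzipFilter(level=9)})
```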
Function from_csv uses pandas.read_csv under the covers to parse the CSV file and infer the data types of the columns. You can bypass pandas type inference and force the types of the columns using column_types (applicable to both dense and sparse dataframes):
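For instance (price and quantity are hypothetical columns):

```python
import numpy as np
import tiledb

tiledb.from_csv("my_dense_df", "data.csv",
                mode="schema_only",
                column_types={"price": np.float32, "quantity": np.int64})
```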
You can set an attribute as nullable by casting to a pandas datatype that can handle NA values. Applicable data types include pandas.StringDtype(), pandas.Int8Dtype(), pandas.Int16Dtype(), pandas.Int32Dtype(), pandas.Int64Dtype(), pandas.UInt8Dtype(), pandas.UInt16Dtype(), pandas.UInt32Dtype(), pandas.UInt64Dtype() and pandas.BooleanDtype().
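For instance, a sketch using the dtype argument (forwarded to pandas.read_csv) to make a hypothetical column col1 a nullable string attribute:

```python
import pandas as pd
import tiledb

tiledb.from_csv("my_dense_df", "data.csv",
                mode="schema_only", dtype={"col1": pd.StringDtype()})
```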
You can create a dataframe from a CSV file stored in cloud storage (e.g., AWS S3), without the need to download the file on your local machine. It is as simple as providing the cloud URI for the CSV file:
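For instance (placeholder bucket and file names):

```python
import tiledb

tiledb.from_csv("my_dense_df", "s3://my-bucket/data.csv", mode="schema_only")
```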
For this to work, you will need to store your cloud credentials in the appropriate environment variables, for example (for AWS):
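For instance, from within Python (a shell export works equally well; the values are placeholders):

```python
import os

os.environ["AWS_ACCESS_KEY_ID"] = "<access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-access-key>"
```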
Alternatively, you can create a VFS object, properly configured with the cloud credentials, open the CSV file through it and ingest the parsed dataframe with from_pandas, as follows (for AWS):
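A sketch (placeholder credentials and URIs); the CSV is opened through the VFS, parsed with pandas and ingested with from_pandas:

```python
import pandas as pd
import tiledb

config = tiledb.Config({
    "vfs.s3.aws_access_key_id": "<access-key-id>",
    "vfs.s3.aws_secret_access_key": "<secret-access-key>",
})
vfs = tiledb.VFS(config=config)

# Open the CSV through the VFS, parse it with pandas, then ingest the dataframe.
with vfs.open("s3://my-bucket/data.csv") as f:
    df = pd.read_csv(f)
tiledb.from_pandas("my_dense_df", df)
```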
Similarly, you can ingest into an array on cloud storage as well, creating a context configured with the cloud credentials and passing it to from_csv. An example for AWS is shown below:
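A sketch, assuming from_csv accepts a ctx keyword argument (placeholder credentials and URIs):

```python
import tiledb

ctx = tiledb.Ctx(tiledb.Config({
    "vfs.s3.aws_access_key_id": "<access-key-id>",
    "vfs.s3.aws_secret_access_key": "<secret-access-key>",
}))

# Write the array directly to S3 using the configured context.
tiledb.from_csv("s3://my-bucket/my_dense_df", "data.csv", ctx=ctx)
```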
The above can be combined if both the input CSV and output array reside on cloud storage. An example for AWS is shown below:
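A sketch combining the two (placeholders throughout):

```python
import tiledb

ctx = tiledb.Ctx(tiledb.Config({
    "vfs.s3.aws_access_key_id": "<access-key-id>",
    "vfs.s3.aws_secret_access_key": "<secret-access-key>",
}))

# Both the input CSV and the output array live on S3.
tiledb.from_csv("s3://my-bucket/my_dense_df", "s3://my-bucket/data.csv", ctx=ctx)
```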
Function from_csv does not support encryption. Contact us if this is important to you and we will escalate its development.
Essentially, dtype above creates a pandas dataframe that sets the col1 datatype to a string type that handles missing values, which TileDB picks up in order to define col1 as a nullable attribute.
A dataframe is a specialization of an array (see Use Cases). As such, any TileDB API works natively for writing to and reading from a dataframe modeled as an array. However, Pandas is a popular in-memory dataframe offering for Python and, therefore, TileDB provides specially optimized functionality for reading directly from an array into a Pandas dataframe. This How To guide describes this functionality.
Sections Create Dataframes and Write Dataframes describe how to ingest a dataframe into a 1D dense or an ND sparse array. This section covers how to read from the ingested dataframes directly into either Pandas dataframes or Arrow Tables. Throughout the section, we use the term dense dataframe for a dataframe modeled as a 1D dense array, and sparse dataframe for a dataframe modeled as an ND sparse array.
Since the dataframe is an array, you can read the underlying schema in the same manner as for arrays as follows:
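For instance ("my_dense_df" is a placeholder URI):

```python
import tiledb

with tiledb.open("my_dense_df") as A:
    print(A.schema)
```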
Suppose you have ingested a CSV file into a 1D dense array.
To find out how many rows were ingested, you can take a look at the array non-empty domain:
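For instance:

```python
import tiledb

with tiledb.open("my_dense_df") as A:
    # For a 1D dense dataframe, this returns ((first_row, last_row),).
    print(A.nonempty_domain())
```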
To read data from an array into a Pandas dataframe, you can use the df operator:
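For instance:

```python
import tiledb

with tiledb.open("my_dense_df") as A:
    df = A.df[:]   # read the entire dataframe
```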
For dense arrays, this operator allows you to efficiently slice any subset of rows:
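For instance (slice bounds in df follow TileDB semantics and are inclusive):

```python
import tiledb

with tiledb.open("my_dense_df") as A:
    df = A.df[10:19]   # rows 10 through 19
```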
TileDB is a columnar format and, therefore, allows you to efficiently subselect on columns / attributes as follows:
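For instance (price and quantity are hypothetical attribute names):

```python
import tiledb

with tiledb.open("my_dense_df") as A:
    # Only the listed attributes are read from storage.
    df = A.query(attrs=["price", "quantity"]).df[0:99]
```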
Suppose you have ingested a CSV file into a 2D sparse array.
This array allows for efficient slicing on the two dimensions as follows:
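For instance (hypothetical integer dimensions; both bounds are inclusive):

```python
import tiledb

with tiledb.open("my_sparse_df") as A:
    df = A.df[1:100, 500:600]
```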
You can prevent the Pandas dataframe from materializing the index columns (which will boost read performance) as follows:
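A sketch, assuming the index_col argument of query:

```python
import tiledb

with tiledb.open("my_sparse_df") as A:
    # Do not build a pandas index from the dimension columns.
    df = A.query(index_col=False).df[:]
```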
You can check the non-empty domain on the two dimensions as follows:
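For instance:

```python
import tiledb

with tiledb.open("my_sparse_df") as A:
    # One (min, max) pair per dimension.
    print(A.nonempty_domain())
```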
Being a columnar format, TileDB allows you to efficiently subselect on attributes and dimensions as follows:
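For instance (col1 is a hypothetical dimension and price a hypothetical attribute):

```python
import tiledb

with tiledb.open("my_sparse_df") as A:
    df = A.query(dims=["col1"], attrs=["price"]).df[:]
```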
If you are using Apache Arrow, TileDB can return dataframe results directly as Arrow Tables with zero-copy as follows:
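A sketch, assuming PyArrow is installed:

```python
import tiledb

with tiledb.open("my_sparse_df") as A:
    tbl = A.query(return_arrow=True).df[:]   # a pyarrow.Table
```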
TileDB supports SQL via its integration with MariaDB. A simple example is shown below, but for more details read section Embedded SQL.
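A minimal sketch, assuming the TileDB-MariaDB Python integration (the tiledb.sql package) is installed:

```python
import pandas as pd
import tiledb.sql

# Query an array (placeholder name) through the embedded MariaDB engine.
db = tiledb.sql.connect()
df = pd.read_sql("SELECT * FROM `my_sparse_df` LIMIT 5", con=db)
```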
Section Create Dataframes describes how to create a dataframe as a 1D dense or an ND sparse array. This section covers how to:

- ingest a single CSV file and create the underlying array schema in a single command
- append one or more CSV files into an existing array / dataframe

Throughout the section, we use the term dense dataframe for a dataframe modeled as a 1D dense array, and sparse dataframe for a dataframe modeled as an ND sparse array.
You can ingest a CSV file into a dense dataframe as follows:
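For instance (placeholder names; mode="ingest" could be omitted):

```python
import tiledb

tiledb.from_csv("my_dense_df", "data.csv", mode="ingest")
```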
The resulting array in this case is always 1D. Note that in this case mode="ingest" is the default value in from_csv. You can see the schema as follows:
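For instance:

```python
import tiledb

with tiledb.open("my_dense_df") as A:
    print(A.schema)
```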
One thing to notice early on is the dimension domain, which will be equal to [0, rows_num), where rows_num is the number of rows in the CSV file you are ingesting. This can be relaxed as described later.
You can set the tile extent upon ingestion as follows:
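A sketch, assuming the tile keyword argument:

```python
import tiledb

tiledb.from_csv("my_dense_df", "data.csv", tile=100_000)
```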
To ingest a CSV file into a sparse array, all you need to do is set sparse=True and select any subset of columns as dimensions in the index_dims parameter.
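For instance, using two hypothetical columns as dimensions:

```python
import tiledb

tiledb.from_csv("my_sparse_df", "data.csv",
                sparse=True, index_dims=["col1", "col2"])
```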
The resulting array can have any number of dimensions. You can see the schema as follows:
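For instance:

```python
import tiledb

with tiledb.open("my_sparse_df") as A:
    print(A.schema)
```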
By default, the domain of each dimension is calculated as a tight bound of the coordinate values of the ingested data. We show later that this can be tweaked.
You can set other parameters upon ingestion, such as cell_order and tile_order, the tile capacity, and whether the array allows_duplicates. For example:
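A sketch with placeholder values:

```python
import tiledb

tiledb.from_csv("my_sparse_df", "data.csv",
                sparse=True, index_dims=["col1"],
                cell_order="row-major", tile_order="row-major",
                capacity=100_000, allows_duplicates=True)
```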
If you intend to add more data in the future into the dataframe you are ingesting (for both the dense and sparse case), you will need to set the dimension domains to their full domains (otherwise TileDB may complain that you are attempting to write outside of the dimension domain bounds). This can be done by simply setting full_domain=True:
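For instance:

```python
import tiledb

# Leave room for future appends by using the full dimension domain.
tiledb.from_csv("my_sparse_df", "data.csv",
                sparse=True, index_dims=["col1"], full_domain=True)
```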
If you wish to parse and store certain columns as datetime types, you can specify it using parse_dates as follows (applicable to both dense and sparse dataframes):
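For instance (purchase_date is a hypothetical column):

```python
import tiledb

tiledb.from_csv("my_dense_df", "data.csv", parse_dates=["purchase_date"])
```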
In case there are missing values in the dataset for some columns, from_csv allows you to specify fill values to replace those missing values. You can do this on a per-attribute basis with fillna as follows (applicable to both dense and sparse dataframes):
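A sketch, assuming fillna takes a dict mapping column names to fill values (hypothetical columns):

```python
import tiledb

tiledb.from_csv("my_dense_df", "data.csv",
                fillna={"price": 0.0, "quantity": 0})
```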
You can add filters (e.g., compression) for dimensions and attributes using dim_filters and attr_filters (applicable to both dense and sparse arrays).
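For instance (hypothetical column names, assuming per-column dicts):

```python
import tiledb

tiledb.from_csv("my_sparse_df", "data.csv",
                sparse=True, index_dims=["id"],
                dim_filters={"id": tiledb.ZstdFilter(level=5)},
                attr_filters={"name": tiledb.GzipFilter(level=9)})
```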
Function from_csv uses pandas.read_csv under the covers to parse the CSV file and infer the data types of the columns. You can bypass pandas type inference and force the types of the columns using column_types (applicable to both dense and sparse dataframes):
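For instance (hypothetical columns):

```python
import numpy as np
import tiledb

tiledb.from_csv("my_dense_df", "data.csv",
                column_types={"price": np.float32, "quantity": np.int64})
```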
In case the CSV file does not fit in main memory, or if you wish to lower the memory consumption of your CSV file ingestion, you can use parameter chunksize to control the number of rows you'd like to ingest in a batch fashion. Each batch will create a new fragment in the array (applicable to both dense and sparse dataframes).
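For instance:

```python
import tiledb

# Ingest one million rows at a time; each batch becomes a separate fragment.
tiledb.from_csv("my_dense_df", "data.csv", chunksize=1_000_000)
```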
If there are 7.6 million records in the CSV file, the above will create 8 fragments, since chunksize is set to a million rows.
This function call sets the dimension to its full domain, since the data is fetched into memory in chunks and, therefore, it is not possible to know the tight domain of the entire CSV file in advance.
For the dense case, you will need to instruct TileDB where in the dense dimension each write will occur, i.e., you need to provide the start index of each write. This can be done by setting row_start_idx as follows:
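For instance, a sketch assuming mode="append" for writes into an existing array (the scenario is detailed right after):

```python
import tiledb

# data1.csv has 1000 rows, data2.csv has 2000 rows.
tiledb.from_csv("my_dense_df", "data1.csv", mode="append", row_start_idx=0)
tiledb.from_csv("my_dense_df", "data2.csv", mode="append", row_start_idx=1000)
```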
For example, if you wish to ingest two CSV files after creating an empty dataframe, where the first CSV file has 1000 rows and the second 2000 rows, then you can set row_start_idx=0 for the first file (default) and row_start_idx=1000 for the second file. That will create two fragments in the array, one in domain slice [0, 999] and the other in [1000, 2999].
You can ingest multiple CSV files by providing a list of URIs as follows:
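For instance (placeholder file names):

```python
import tiledb

tiledb.from_csv("my_dense_df",
                ["data1.csv", "data2.csv", "data3.csv"],
                chunksize=1_000_000)
```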
Note that it is important to set the chunksize parameter as explained in the previous subsection, which will also set the dimension to its full domain.
Function tiledb.from_csv accepts all the parameters of pandas.read_csv, along with the TileDB-specific parameters explained in this section.
Suppose you have already created the array (e.g., using mode="schema_only" as explained in Create Dataframes), or have already ingested a CSV setting full_domain=True as explained above. To append the data of a CSV file into this array (provided that the data complies with the created schema), you simply need to run from_csv setting mode="append" (applicable to both dense and sparse dataframes):
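For instance:

```python
import tiledb

# The array already exists; the CSV columns must match its schema.
tiledb.from_csv("my_sparse_df", "more_data.csv", mode="append")
```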
This functionality is identical to what is described earlier for CSV files and arrays on cloud storage, noting that you will need to make proper use of mode="ingest" or mode="append" as explained above, instead of mode="schema_only".
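For instance, a sketch appending a CSV on S3 into an array on S3 (placeholders throughout):

```python
import tiledb

ctx = tiledb.Ctx(tiledb.Config({
    "vfs.s3.aws_access_key_id": "<access-key-id>",
    "vfs.s3.aws_secret_access_key": "<secret-access-key>",
}))

tiledb.from_csv("s3://my-bucket/my_dense_df", "s3://my-bucket/data.csv",
                mode="append", ctx=ctx)
```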