Consolidation

The presence of numerous fragments may impact the TileDB read performance. This is because many fragments would lead to fragment metadata being loaded to main memory from numerous different files in storage. Moreover, the locality of the result cells of a subarray query may be destroyed in case those cells appear in multiple different fragment files, instead of concentrated byte regions within the same fragment files.

To mitigate this problem, TileDB has a consolidation feature, which allows you to merge

  1. Lightweight fragment metadata footers into a single file.

  2. A subset of fragments into a single fragment.

  3. A subset of array metadata files into a single one.

Consolidation is thread-/process-safe and can be done in the background while you continue reading from the array without being blocked. Moreover, consolidation does not hinder the ability to do time traveling at a fine granularity, as it does not delete fragments that participated in consolidation (and, therefore, they are still queryable). The user is responsible for vacuuming fragments, fragment metadata and array metadata that got consolidated to save space, at the cost of not being able to time travel across the old (finer) fragments.

Fragment metadata consolidation

Each fragment metadata file (located in a fragment folder) contains some lightweight information in its footer. This is mostly the non-empty domain and offsets for other metadata included in other parts of the file. If there are numerous fragments, reading the array may be slow on cloud object stores due to the numerous REST requests to fetch the fragment metadata footers. TileDB offers a consolidation process (with mode fragment_meta), which merges the fragment metadata footers of a subset of fragments into a single file that has suffix .meta, stored in the array folder. This file is named similarly to fragments, i.e., it carries a timestamp range that helps with time traveling. It also contains all the URIs of the fragments whose metadata footers are consolidated in that file. Upon reading an array, only this file is efficiently fetched from the backend, since it is typically very small in size (even for hundreds of thousands of fragments).

Fragment consolidation

If mode fragments is passed to the consolidation function, then the fragment consolidation algorithms is executed, which is explained in detail below.

Basic concepts

There are two important points to stress regarding fragment consolidation:

  1. Consolidating fragments where at least one fragment is dense produces a dense fragment.

  2. Consolidating fragments where all fragments are sparse produces a sparse fragment.

The figure below shows consolidation of two fragments, one dense and one sparse. Note that this can occur only in dense arrays, since sparse arrays can have only sparse fragments. The array in the figure has a 2x2 space tiling. Recall that a dense fragment consists of a dense hyper-rectangle and that it stores only integral tiles. Due to the sparse cell in the second fragment that is located in the lower left space tile, the dense hyper-rectangle of the produced consolidated dense fragment must cover all four space tiles. Therefore, TileDB must fill the empty cells in this hyper-rectangle with empty values, illustrated in grey color in the figure below.

Consolidating both dense and sparse fragments

Consolidating only sparse fragments is simpler and applies to both dense and sparse arrays. The figure below illustrates consolidation of two sparse fragments, where the resulting consolidated fragment is also sparse and there is no injection of empty values.

Consolidating only sparse fragments

Recall that each fragment is associated with its creation timestamp upon writing. A consolidated fragment instead is associated with the timestamp range that includes the timestamps of the fragments that produced it (see Consolidated Fragments). This is particularly important for time traveling, since opening an array at a timestamp will consider all the consolidated fragments whose end timestamp is at or before the query timestamp. In other words, although consolidation generally leads to better performance, it affects the granularity of time traveling.

Preprocessing

Before the consolidation algorithm begins, TileDB applies a simple optimization in a pre-processing step, which may lead to great performance benefits depending on the “shape” of the existing fragments. Specifically, TileDB identifies dense fragments whose non-empty domain completely covers older adjacent (dense or sparse) fragments, and directly deletes the old fragment directories without performing any actual consolidation.

This clean-up process is illustrated with an example in the figure below. Suppose the first fragment is dense and covers the entire array, i.e., [1,4], [1,4], the second is dense and covers [1,2], [1,2], the third is sparse as shown in the figure, and the fourth one is dense covering [1,2], [1,4]. Observe that, if those four fragments were to be consolidated, the cells of the second and third fragment would be completely overwritten from the cells of the fourth fragment. Therefore, the existence of those two fragments would make no difference to the consolidation result. Deleting them altogether before the consolidation algorithm commences will result in boosting the algorithm performance (since fewer cells will be read and checked for overwrites).

Fragment clean-up in a preprocessing stage

Algorithm

The consolidation algorithm is performed in steps. In each step, a subset of adjacent (in the timeline) fragments is selected for consolidation. The algorithm proceeds until a determined number of steps were executed, or until the algorithm specifies that no further fragments are to be consolidated. The choice of the next fragment subset for consolidation is based on certain rules and user-defined parameters, explained below. The number of steps is also configurable, controlled by sm.consolidation.steps.

Let us focus on a single step, during which the algorithm must select and consolidate a subset of fragments based on certain criteria:

  • The first criterion is if a subset of fragments is “consolidatable”, i.e., eligible for consolidation in a way that does not violate correctness. Any subset consisting of solely sparse fragments is always consolidatable. However, if a fragment subset contains one or more dense fragments, TileDB performs an important check; if the union of the non-empty domains of the fragments (which is equal to the non-empty domain of the resulting consolidated fragment) overlaps with any fragment created prior to this subset, then the subset is marked as non-consolidatable. Recall that the fragment that results from consolidating a subset of fragments containing at least one dense fragment is always a dense fragment. Therefore, empty regions in the non-emtpy domain of the consolidated fragment will be filled with special values. Those values may erroneously overwrite older valid cell values. Such a scenario is illustrated in the figure below. The second and third fragments are not consolidatable, since their non-empty domain contains empty regions that overlap with the first (older) fragment. Consequently, consolidating the second and third fragment results in a logical view that is not identical to the one before consolidation, violating correctness. This criterion detects and prevents such cases.

The second and third fragments are not consolidatable
  • The second criterion is the comparative fragment size. Ideally, we must consolidate fragments of approximately equal size. Otherwise, we may end up in a situation where, for example, a 100GB fragment gets consolidated with a 1MB one, which would unnecessarily waste consolidation time. This is controlled by parameter sm.consolidation.step_size_ratio; if the size ratio of two adjacent fragments is larger than this parameter, then no fragment subset that contains those two fragments will be considered for consolidation.

  • The third criterion is the fragment amplification factor, applicable to the case where the fragment subset to be consolidated contains at least one dense fragment. If the non-empty domain of the resulting fragment has too many empty cells, its size may become considerably larger than the sum of sizes of the original fragments to be consolidated. This is because the consolidated fragment is dense and inserts special fill values for all empty cells in its non-empty domain (see figure below). The amplification factor is the ratio between the consolidated fragment size and the sum of sizes of the original fragments. This is controlled by sm.consolidation.amplification, which should not be exceed for a fragment subset to be eligible for consolidation. The default value 1.0 means that the fragments will be consolidated if there is no amplification at all, i.e., if the size of the resulting consolidated fragment is smaller than or equal to the sum of sizes of the original fragments. As an example, this happens when the non-empty domain of the consolidated fragment does not contain any empty cells.

Consolidation leading to huge injection of fill values
  • The fourth criterion is the collective fragment size. Among all eligible fragment subsets for consolidation, we must first select to consolidate the ones that have the smallest sum of fragment sizes. This will quickly reduce the number of fragments (hence boosting read performance), without resorting to costly consolidation of larger fragments.

  • The final criterion is the number of fragments to consolidate in each step. This is controlled by sm.consolidation.step_min_frags and sm.consolidation.step_max_frags; the algorithm will select the subset of fragments (complying with all the above criteria) that has the maximum cardinality smaller than or equal to sm.consolidation.step_max_frags and larger than or equal to sm.consolidation.step_min_frags. If no fragment subset is eligible with cardinality at least sm.consolidation.step_min_frags, then the consolidation algorithm terminates.

The algorithm is based on dynamic programming and runs in time O(max_frags * total_frags), where total_frags is the total number of fragments considered in a given step, and max_frags is equal to the sm.consolidation.step_max_frags config parameter.

When computing the union of the non-empty domains of the fragments to be consolidated, in case there is at least one dense fragment, the union is always expanded to coincide with the space tile extents. This affects criterion 1 (since the expanded domain union may now overlap with some older fragments) and 2 (since the expanded union may amplify resulting consolidated fragment size).

Array metadata consolidation

Similar to array fragments, array metadata can also be consolidated (with mode array_meta). Since the array metadata is typically small and can fit in main-memory, consolidating them is rather simple. TileDB simply reads all the array metadata (from all the existing array metadata fragments) in main memory, creates an up-to-date view of the metadata, and then flushes them to a new array metadata file that carries in its name the timestamp range determined by the first timestamp of the first array metadata and the second timestamp of the last array metadata files that got consolidated.

Vacuuming

Vacuuming applies to consolidated fragments, consolidated array metadata and consolidated fragment metadata as follows:

  • Fragments: During consolidation, a .vac file is produced with all the fragment URIs that participated in consolidation. When the vacuuming function is called with mode "fragments", all the fragment folders whose URI is in the .vac file get deleted.

  • Array metadata: During consolidation, a .vac file is produced with all the array metadata URIs that participated in consolidation. When the vacuuming function is called with mode "array_meta", all the array metadata files whose URI is in the .vac file get deleted.

  • Fragment metadata: Vacuuming simply deletes all .meta files except for the last one.