Consolidation

Basic Concepts

The presence of numerous fragments may degrade TileDB's read performance. This is because the fragment metadata must be loaded into main memory from numerous different files in storage. Moreover, the locality of the result cells of a subarray query may be destroyed if those cells appear in multiple different fragment files, instead of being concentrated in contiguous byte regions within the same fragment files.

To mitigate this problem, TileDB has a consolidation feature, which allows you to merge the existing fragments into a single one. Consolidation is thread-/process-safe and can be done in the background while you continue reading from the array without being blocked.
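For example, using the TileDB Python API, consolidation is a single call. This is a minimal sketch; the URI "my_array" is a placeholder for an existing array:

```python
import tiledb

# Consolidate the array's fragments into a single fragment.
# Readers may continue querying the array concurrently.
tiledb.consolidate("my_array")
```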

Currently, consolidation process-safety is not guaranteed on S3. This is due to S3’s eventual consistency model, which does not allow us to exclusively “lock” an array while consolidation takes place (TileDB uses file locking, which works well on strongly consistent filesystems). We are working on a solution that will appear in a future release. Until then, make sure to avoid reading the array while it is being consolidated.

There are two important points to stress regarding consolidation:

  1. Consolidating fragments where at least one fragment is dense produces a dense fragment.

  2. Consolidating fragments where all fragments are sparse produces a sparse fragment.

The figure below shows consolidation of two fragments, one dense and one sparse. Note that this can occur only in dense arrays, since sparse arrays can have only sparse fragments. The array in the figure has a 2x2 space tiling. Recall that a dense fragment consists of a dense hyper-rectangle and that it stores only integral tiles. Due to the sparse cell in the second fragment that is located in the lower left space tile, the dense hyper-rectangle of the produced consolidated dense fragment must cover all four space tiles. Therefore, TileDB must fill the empty cells in this hyper-rectangle with empty values, illustrated in grey color in the figure below.

Consolidating both dense and sparse fragments

Consolidating only sparse fragments is simpler and applies to both dense and sparse arrays. The figure below illustrates the consolidation of two sparse fragments, where the resulting consolidated fragment is also sparse and no empty values are injected.

Consolidating only sparse fragments

Recall that each fragment is associated with its creation timestamp upon writing. A consolidated fragment, instead, is associated with the timestamp range that covers the timestamps of the fragments that produced it (see Physical Storage for more details). This is particularly important for time traveling, since opening an array at a timestamp will consider all the consolidated fragments whose start timestamp is at or before the query timestamp. In other words, although consolidation generally leads to better performance, it coarsens the granularity of time traveling.
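For example, with the Python API you can open an array at a past timestamp. This is a sketch; the millisecond timestamp value below is arbitrary:

```python
import tiledb

# Open the array for reading as of a given timestamp (in milliseconds
# since the Unix epoch). Consolidated fragments whose start timestamp
# is at or before this point are included in the logical view.
with tiledb.open("my_array", mode="r", timestamp=1577836800000) as A:
    data = A[:]  # query the historical view of the array
```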

Consolidation Internals

Preprocessing

Before the consolidation algorithm begins, TileDB applies a simple optimization in a preprocessing step, which can yield substantial performance benefits depending on the “shape” of the existing fragments. Specifically, TileDB identifies dense fragments whose non-empty domain completely covers older adjacent (dense or sparse) fragments, and directly deletes those older fragment directories without performing any actual consolidation.

This clean-up process is illustrated with an example in the figure below. Suppose the first fragment is dense and covers the entire array, i.e., [1,4], [1,4], the second is dense and covers [1,2], [1,2], the third is sparse as shown in the figure, and the fourth is dense, covering [1,2], [1,4]. Observe that, if these four fragments were consolidated, the cells of the second and third fragments would be completely overwritten by the cells of the fourth fragment. Therefore, the existence of those two fragments makes no difference to the consolidation result. Deleting them before the consolidation algorithm commences boosts the algorithm's performance, since fewer cells will be read and checked for overwrites.
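To make the rule concrete, the following is a hypothetical sketch of the coverage check (illustrative only, not TileDB's internal code), where a non-empty domain is represented as one (low, high) range per dimension:

```python
# Hypothetical illustration of the coverage test; a non-empty domain is
# a list of (lo, hi) inclusive ranges, one per dimension.
def covers(newer, older):
    """Return True if `newer` fully contains `older` in every dimension."""
    return all(n_lo <= o_lo and o_hi <= n_hi
               for (n_lo, n_hi), (o_lo, o_hi) in zip(newer, older))

# The fourth fragment [1,2], [1,4] covers the second fragment [1,2], [1,2],
# so the second fragment's directory can be deleted before consolidation.
assert covers([(1, 2), (1, 4)], [(1, 2), (1, 2)])
```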

Fragment clean-up in a preprocessing stage

Algorithm

The consolidation algorithm is performed in steps. In each step, a subset of adjacent (in the timeline) fragments is selected for consolidation. The algorithm proceeds until a configured number of steps has been executed, or until it determines that no further fragments should be consolidated. The choice of the next fragment subset for consolidation is based on certain rules and user-defined parameters, explained below. The number of steps is controlled by the sm.consolidation.steps config parameter.
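The step count can be set through the standard config mechanism; a minimal sketch with the Python API (the value 5 is arbitrary):

```python
import tiledb

# Limit the consolidation algorithm to at most 5 steps.
config = tiledb.Config({"sm.consolidation.steps": "5"})
tiledb.consolidate("my_array", config=config)
```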

Let us focus on a single step, during which the algorithm must select and consolidate a subset of fragments based on certain criteria:

  • The first criterion is whether a subset of fragments is “consolidatable”, i.e., eligible for consolidation without violating correctness. Any subset consisting solely of sparse fragments is always consolidatable. However, if a fragment subset contains one or more dense fragments, TileDB performs an important check: if the union of the non-empty domains of the fragments (which is equal to the non-empty domain of the resulting consolidated fragment) overlaps with any fragment created prior to this subset, then the subset is marked as non-consolidatable. Recall that the fragment resulting from consolidating a subset that contains at least one dense fragment is always dense. Therefore, empty regions in the non-empty domain of the consolidated fragment will be filled with special values, and those values may erroneously overwrite older valid cell values. Such a scenario is illustrated in the figure below: the second and third fragments are not consolidatable, since their non-empty domain contains empty regions that overlap with the first (older) fragment. Consolidating the second and third fragments would therefore produce a logical view that is not identical to the one before consolidation, violating correctness. This criterion detects and prevents such cases.

The second and third fragments are not consolidatable

  • The second criterion is the comparative fragment size. Ideally, fragments of approximately equal size should be consolidated together. Otherwise, we may end up in a situation where, for example, a 100GB fragment gets consolidated with a 1MB one, which would unnecessarily waste consolidation time. This is controlled by the sm.consolidation.step_size_ratio parameter; if the size ratio of two adjacent fragments is smaller than this parameter, then no fragment subset that contains those two fragments will be considered for consolidation.

  • The third criterion is the fragment amplification factor, applicable when the fragment subset to be consolidated contains at least one dense fragment. If the non-empty domain of the resulting fragment has too many empty cells, its size may become considerably larger than the sum of sizes of the original fragments, because the consolidated fragment is dense and inserts special fill values for all empty cells in its non-empty domain (see the figure below). The amplification factor is the ratio between the consolidated fragment size and the sum of sizes of the original fragments. It is bounded by sm.consolidation.amplification, which must not be exceeded for a fragment subset to be eligible for consolidation. The default value 1.0 means that the fragments will be consolidated only if there is no amplification at all, i.e., if the size of the resulting consolidated fragment is smaller than or equal to the sum of sizes of the original fragments. This happens, for example, when the non-empty domain of the consolidated fragment does not contain any empty cells.

Consolidation leading to huge injection of fill values

  • The fourth criterion is the collective fragment size. Among all fragment subsets eligible for consolidation, the algorithm first selects those with the smallest sum of fragment sizes. This quickly reduces the number of fragments (hence boosting read performance) without resorting to the costly consolidation of larger fragments.

  • The final criterion is the number of fragments to consolidate in each step. This is controlled by sm.consolidation.step_min_frags and sm.consolidation.step_max_frags; the algorithm selects the fragment subset (complying with all the above criteria) with the maximum cardinality that is smaller than or equal to sm.consolidation.step_max_frags and larger than or equal to sm.consolidation.step_min_frags. If no eligible fragment subset has cardinality of at least sm.consolidation.step_min_frags, the consolidation algorithm terminates. A configuration sketch covering these parameters is shown after this list.
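
The tunable criteria above map directly to config parameters. Below is a hedged configuration sketch with the Python API; the values shown are illustrative, not recommendations:

```python
import tiledb

config = tiledb.Config({
    "sm.consolidation.step_min_frags": "2",     # criterion 5: min fragments per step
    "sm.consolidation.step_max_frags": "8",     # criterion 5: max fragments per step
    "sm.consolidation.step_size_ratio": "0.5",  # criterion 2: adjacent fragments must
                                                # have a size ratio of at least 0.5
    "sm.consolidation.amplification": "1.0",    # criterion 3: bound on
                                                # consolidated_size / sum(original_sizes)
})
tiledb.consolidate("my_array", config=config)
```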

The algorithm is based on dynamic programming and runs in time O(max_frags * total_frags), where total_frags is the total number of fragments considered in a given step, and max_frags is equal to the sm.consolidation.step_max_frags config parameter.

When computing the union of the non-empty domains of the fragments to be consolidated, if there is at least one dense fragment, the union is always expanded to coincide with the space tile boundaries. This affects criteria 1 (since the expanded domain union may now overlap with some older fragments) and 3 (since the expanded union may amplify the size of the resulting consolidated fragment).
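As a hypothetical illustration of this expansion (not TileDB's internal code), each dimension range of the union can be snapped outward to the nearest space tile boundaries:

```python
# Hypothetical sketch: snap an inclusive range [lo, hi] outward so that it
# aligns with space tile boundaries, given the dimension's domain start
# and tile extent.
def expand_to_tiles(lo, hi, dim_start, tile_extent):
    lo_tile = (lo - dim_start) // tile_extent
    hi_tile = (hi - dim_start) // tile_extent
    return (dim_start + lo_tile * tile_extent,
            dim_start + (hi_tile + 1) * tile_extent - 1)

# With a dimension domain starting at 1 and tile extent 2 (as in the 2x2
# tiling examples above), the range [2, 3] expands to the full tiles [1, 4].
assert expand_to_tiles(2, 3, 1, 2) == (1, 4)
```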

Array Metadata Consolidation

Similar to array fragments, array metadata fragments (see Array Metadata) can also be consolidated. Since the array metadata is typically small and fits in main memory, consolidating it is rather simple. TileDB reads all the array metadata (from all the existing array metadata fragments) into main memory, creates an up-to-date view of the metadata, and then flushes it to a new file (i.e., a new array metadata fragment).
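
With the Python API, array metadata consolidation can be triggered by setting the consolidation mode (a sketch, assuming a TileDB version whose sm.consolidation.mode parameter supports the "array_meta" value):

```python
import tiledb

# Consolidate only the array metadata fragments, leaving the
# data fragments untouched.
config = tiledb.Config({"sm.consolidation.mode": "array_meta"})
tiledb.consolidate("my_array", config=config)
```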