PDAL

Installation

You can run the PDAL and TileDB code in this tutorial inside the tiledb/tiledb-geospatial Docker image as follows:
docker run -it --rm -u 0 -v /local/path:/data -w /data tiledb/tiledb-geospatial /bin/bash
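Inside the container, you can optionally confirm that the PDAL and TileDB Python bindings used throughout this tutorial are available. This is only a sanity-check sketch; the exact version strings will differ depending on the image you pulled:
Python
import pdal
import tiledb

# Print the versions of the Python bindings shipped in the image.
print("PDAL:", pdal.__version__)
print("TileDB-Py:", tiledb.__version__)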

Ingesting LAS Files

First, create a TileDB config file tiledb.config where you can set any TileDB configuration parameter (e.g., AWS keys if you would like to write to a TileDB array on S3). Make sure you also add the following, as currently TileDB does not handle duplicate points (this will change in a future version).
sm.dedup_coords true
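For example, if you plan to write to an array on S3, your complete tiledb.config file might look like the following sketch. The key and region values are placeholders; any TileDB configuration parameter can be set this way, one key/value pair per line:
sm.dedup_coords true
vfs.s3.region us-east-1
vfs.s3.aws_access_key_id <your-access-key-id>
vfs.s3.aws_secret_access_key <your-secret-access-key>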
Then create a PDAL pipeline that translates some LAS data to a TileDB array by storing the following in a file called pipeline.json:
[
    {
        "type": "readers.las",
        "filename": "lrf_ws_epsg6341_nad832011utmz12_navd88_epoch2010_264000_4909000.laz"
    },
    {
        "type": "writers.tiledb",
        "config_file": "tiledb.config",
        "compression": "zstd",
        "compression_level": 75,
        "chunk_size": 50000,
        "array_name": "sample_array"
    }
]
You can execute the pipeline with PDAL, which carries out the ingestion, as follows:
pdal pipeline -i pipeline.json
We now have points and attributes stored in an array called sample_array. This write uses the streaming mode of PDAL.
You can view this sample_array directly from TileDB as follows (we demonstrate using TileDB's Python API, but any other API would work as well):
Python
import numpy as np
import pptk
import tiledb

ctx = tiledb.Ctx()

# Open the array and read from it.
with tiledb.SparseArray('sample_array', ctx=ctx, mode='r') as arr:
    # Get the non-empty domain
    arr.nonempty_domain()
    # Note that the attributes have different types
    arr.dump()
    data = arr[:]

data

coords = np.array([np.asarray(list(t)) for t in data['coords']])
v = pptk.viewer(coords, coords[:, 2])
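If you do not want to load the entire array, you can slice a spatial subset with domain_index, which the parallel read example later in this tutorial also uses. The following is a minimal sketch; the bounding-box coordinates are hypothetical and should be taken from the non-empty domain reported above:
Python
import numpy as np
import tiledb

# Read only the points that fall inside a (hypothetical) bounding box.
with tiledb.SparseArray('sample_array', mode='r') as arr:
    subset = arr.domain_index[264000.0:264500.0, 4909000.0:4909500.0, 0.0:3000.0]

coords = np.array([np.asarray(list(t)) for t in subset['coords']])
print(coords.shape)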

Parallel Writes

PDAL is single-threaded, but coupled with TileDB's parallel write support it becomes a powerful tool for ingesting enormous amounts of point cloud data into TileDB. The PDAL driver supports appending to an existing dataset, and we use this together with Dask to perform a parallel update.
We demonstrate parallel ingestion with the code below. Make sure to remove or move the sample_array created in the previous example.
Python
import glob

from dask.distributed import Client
import pdal


def update(json):
    pipeline = pdal.Pipeline(json)
    pipeline.loglevel = 8  # really noisy
    return pipeline.execute()


json = """
[
    {
        "type": "readers.las",
        "filename": "%s"
    },
    {
        "type": "writers.tiledb",
        "config_file": "tiledb.config",
        "chunk_size": 50000,
        "array_name": "sample_array",
        "append": %s
    }
]
"""

client = Client(threads_per_worker=6, n_workers=1)
point_clouds = glob.glob('*.laz')
bAppend = False
futures = []

for f in point_clouds:
    manifest = json % (f, str(bAppend).lower())
    future = client.submit(update, manifest)
    if not bAppend:
        # block for schema creation and initial load
        future.result()
    else:
        futures.append(future)

    bAppend = True

# wait for the remaining appends to complete
client.gather(futures)
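After the parallel ingestion finishes, a quick way to verify the result is to open the array with TileDB-Py and inspect its non-empty domain. This is a minimal sketch, assuming all of the appends above have completed:
Python
import tiledb

# Open the appended array and report the extent of the ingested points.
with tiledb.SparseArray('sample_array', mode='r') as arr:
    print(arr.nonempty_domain())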

Parallel Reads

Although the TileDB driver is parallel (i.e., it uses multiple threads for decompression and I/O), PDAL itself is single-threaded, so some tasks can benefit from additional parallelism. Take, for instance, the following PDAL command that counts the number of points in the dataset using the TileDB driver.
pdal info --driver readers.tiledb --readers.tiledb.array_name=sample_array -i sample_array
We can write a simple script in Python with Dask and direct access to TileDB to perform the same operation completely in parallel:
Python
from dask.distributed import Client
import math
import numpy as np
import tiledb


def count(array_name, x1, x2, y1, y2, z1, z2):
    with tiledb.SparseArray(array_name, 'r') as arr:
        return arr.domain_index[x1:x2, y1:y2, z1:z2]['coords'].shape[0]


if __name__ == "__main__":
    client = Client(threads_per_worker=6, n_workers=1)
    array_name = 'sample_array'

    tile_div = 6
    jobs = []

    with tiledb.SparseArray(array_name, 'r') as arr:
        xs, ys, zs = arr.nonempty_domain()

    tile_x = math.ceil((xs[1] - xs[0]) / tile_div)
    x1 = xs[0]

    for i in range(tile_div):
        x2 = min(xs[0] + ((i + 1) * tile_x), xs[1])

        if x1 > x2:
            continue

        f = client.submit(count, array_name, x1, x2, *ys, *zs)
        jobs.append(f)

        x1 = np.nextafter(x2, x2 + 1)

    results = client.gather(jobs)
    total = sum(results)
    print(f"Total points: {total}")
In both cases we get the answer of 31,530,863 points (for a 750 MB compressed array). With single-threaded PDAL, the output of the time command on an m5a.2xlarge machine on AWS is the following:
real    1m15.267s
user    1m46.927s
sys     0m2.410s
The above Python script using Dask is significantly faster:
real    0m9.902s
user    0m1.465s
sys     0m0.216s