We have created two handy scripts for setting up TileDB-Spark and Apache Arrow on an EMR cluster. Arrow is optional but will increase performance if you use PySpark or SparkR.
EMR requires that the bootstrap scripts be copied to an S3 bucket. You can sync the scripts from our TileDB-Spark repo to S3 as follows:
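A minimal sketch of the sync step, assuming the AWS CLI is installed and configured, and assuming the bootstrap scripts live in an emr_bootstrap directory at the root of the repository (inferred from the S3 paths used on this page). The bucket and prefix are placeholders, and the helper function name is hypothetical.

```shell
# Placeholder destination; replace with your own bucket and prefix.
S3_DEST="s3://my_bucket/path/emr_bootstrap"

# Clone the repo and copy the bootstrap scripts to S3.
# Assumption: the scripts sit in emr_bootstrap/ at the repository root.
sync_bootstrap_scripts() {
  git clone https://github.com/TileDB-Inc/TileDB-Spark.git
  aws s3 sync TileDB-Spark/emr_bootstrap "$1"
}

# Nothing is executed until the function is called:
#   sync_bootstrap_scripts "$S3_DEST"
```

Wrapping the two commands in a function lets you review the destination before anything is uploaded.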
Create the EMR cluster as follows:
From the AWS EMR console, click on "Create Cluster".
Click on the "Go to advanced options" link.
In Step 1, make sure Spark is selected.
In Step 3, click on "Bootstrap Actions", then select a custom action, and click on "Configure and add". For the "Script location", point to where you uploaded the bootstrap scripts, e.g., s3://my_bucket/path/emr_bootstrap/install-tiledb-spark.sh.
Continue to create the cluster. It typically takes 10-20 minutes for the cluster to be ready.
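If you prefer the command line to the console, the same steps can be approximated with the AWS CLI. This is a sketch under stated assumptions, not a tested recipe: the cluster name, release label, instance type, and instance count are illustrative placeholders, and the function name is hypothetical.

```shell
# Illustrative placeholders; adjust to your account and workload.
CLUSTER_NAME="tiledb-spark"      # hypothetical cluster name
RELEASE_LABEL="emr-6.15.0"       # assumption: any EMR release that ships Spark
BOOTSTRAP_SCRIPT="s3://my_bucket/path/emr_bootstrap/install-tiledb-spark.sh"

# Create a Spark cluster that runs the TileDB-Spark bootstrap script.
create_tiledb_cluster() {
  aws emr create-cluster \
    --name "$CLUSTER_NAME" \
    --release-label "$RELEASE_LABEL" \
    --applications Name=Spark \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --bootstrap-actions "Path=$BOOTSTRAP_SCRIPT,Name=install-tiledb-spark"
}

# Nothing is executed until the function is called:
#   create_tiledb_cluster
```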
To install Apache Arrow as well, follow the same procedure as above, but in Step 3 add one more bootstrap action that points to our CRAN packages script, e.g., s3://my_bucket/path/emr_bootstrap/install-cran-packages.sh. Under "Optional arguments", you must add --packages arrow (along with any other CRAN packages of your choice).
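On the command line, the two bootstrap actions could be expressed in the AWS CLI shorthand syntax roughly as follows. The action names are hypothetical labels, and the S3 paths are the placeholders used above.

```shell
# One shorthand spec per bootstrap action; the Name values are hypothetical labels.
TILEDB_ACTION="Path=s3://my_bucket/path/emr_bootstrap/install-tiledb-spark.sh,Name=install-tiledb-spark"
# Args carries the "Optional arguments" from the console: --packages arrow
CRAN_ACTION="Path=s3://my_bucket/path/emr_bootstrap/install-cran-packages.sh,Name=install-cran-packages,Args=[--packages,arrow]"

# Print the option as it would be appended to `aws emr create-cluster`.
printf '%s\n' "--bootstrap-actions $TILEDB_ACTION $CRAN_ACTION"
```

Keep each spec quoted when passing it to the CLI so the shell does not interpret the square brackets.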
TileDB-Spark provides a metric source to collect timing and input metric details. This can be helpful in tracking performance of TileDB and the TileDB-Spark driver.
In Step 1 of the EMR launch cluster console, there is a section "Edit software settings". Paste the following JSON config, which will enable the Spark metrics source from TileDB:
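A sketch of such a configuration, written to a local file so it can be pasted into the console or reused. It uses EMR's spark-metrics classification. The metrics source class name org.apache.spark.metrics.TileDBMetricsSource is an assumption and should be checked against the TileDB-Spark release you install; the file name is hypothetical.

```shell
# Write the metrics configuration to a local file (hypothetical file name).
cat > tiledb-spark-metrics.json <<'EOF'
[
  {
    "Classification": "spark-metrics",
    "Properties": {
      "*.sink.jmx.class": "org.apache.spark.metrics.sink.JmxSink",
      "*.source.jvm.class": "org.apache.spark.metrics.source.JvmSource",
      "driver.source.io.tiledb.spark.class": "org.apache.spark.metrics.TileDBMetricsSource",
      "executor.source.io.tiledb.spark.class": "org.apache.spark.metrics.TileDBMetricsSource"
    }
  }
]
EOF
```

Paste the file contents into "Edit software settings". If you create the cluster from the AWS CLI instead, the same file can be passed with --configurations file://tiledb-spark-metrics.json.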