Syncing to Unity Catalog

This document walks through the steps to register a OneTable-synced Delta table in Unity Catalog on Databricks.

Prerequisites

  1. Source table(s) (Hudi/Iceberg) already written to an external storage location such as S3, GCS, or ADLS. If you don't have a source table in S3/GCS/ADLS, you can follow the steps in this tutorial to set one up.
  2. Set up a connection to the external storage location from Databricks.
    • Follow the steps outlined here for Amazon S3
    • Follow the steps outlined here for Google Cloud Storage
    • Follow the steps outlined here for Azure Data Lake Storage Gen2 and Blob Storage.
  3. Create a Unity Catalog metastore in Databricks as outlined here.
  4. Create an external location in Databricks as outlined here.
  5. Clone the OneTable repository and create the utilities-0.1.0-SNAPSHOT-bundled.jar by following the steps on the Installation page.

Steps

Running sync

Create a file named my_config.yaml in the cloned OneTable directory:

```yaml
sourceFormat: HUDI|ICEBERG # choose only one
targetFormats:
  - DELTA
datasets:
  - tableBasePath: s3://path/to/source/data
    tableName: table_name
    partitionSpec: partitionpath:VALUE # only needed for the HUDI sourceFormat
```
  1. Replace s3://path/to/source/data with gs://path/to/source/data if your source table is in GCS, or with abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data> if it is in ADLS.
  2. Also fill in appropriate values for the sourceFormat and tableName fields.
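For reference, a filled-in config for a Hudi source table might look like the following. The bucket name, table name, and partition column here are placeholders, not values from your environment:

```yaml
sourceFormat: HUDI
targetFormats:
  - DELTA
datasets:
  - tableBasePath: s3://my-bucket/hudi/people   # hypothetical path
    tableName: people
    partitionSpec: city:VALUE                   # only needed for Hudi sources
```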

From your terminal, under the cloned OneTable directory, run the sync process using the command below.

```shell
java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml
```

At this point, if you check your bucket path, you will see a _delta_log directory containing 00000000000000000000.json, the commit log that lets query engines interpret the source table as a Delta table.
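To get a feel for what the sync produced, you can inspect that first commit file: a Delta log is a sequence of JSON-lines files, each line holding one action such as commitInfo, metaData, or add. The sketch below uses a synthetic, simplified commit file rather than your real bucket, but shows how those actions can be read with plain Python:

```python
import json

# A miniature stand-in for _delta_log/00000000000000000000.json.
# A real commit written by the sync has the same overall shape:
# one JSON-encoded action per line.
commit_lines = [
    '{"commitInfo": {"operation": "WRITE"}}',
    '{"metaData": {"id": "1234", "format": {"provider": "parquet"}}}',
    '{"add": {"path": "part-00000.parquet", "size": 1024, "dataChange": true}}',
]

actions = [json.loads(line) for line in commit_lines]
for action in actions:
    # Each action dict has a single top-level key naming the action type.
    print(next(iter(action)))
```

Running this prints the action types in order (commitInfo, metaData, add); query engines replay these actions to reconstruct the table's schema and file list.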

Register the target table in Unity Catalog

In your Databricks workspace, under SQL editor, run the following queries.

```sql
CREATE CATALOG onetable;
CREATE SCHEMA onetable.synced_delta_schema;
CREATE TABLE onetable.synced_delta_schema.<table_name>
USING DELTA
LOCATION 's3://path/to/source/data';
```
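Optionally, to confirm that Unity Catalog picked up the expected location and table format, you can inspect the table's metadata with a standard Databricks SQL statement (<table_name> is the same placeholder as above):

```sql
DESCRIBE TABLE EXTENDED onetable.synced_delta_schema.<table_name>;
```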

Replace s3://path/to/source/data with gs://path/to/source/data if your source table is in GCS, or with abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data> if it is in ADLS.

Validating the results

You can now see the created Delta table in Unity Catalog under Catalog as <table_name> in the synced_delta_schema schema, and you can query it in the SQL editor:

```sql
SELECT * FROM onetable.synced_delta_schema.<table_name>;
```

Conclusion

In this guide, we saw how to:

  1. sync a source table to create metadata for the desired target table formats using OneTable
  2. catalog the data in Delta format in Unity Catalog on Databricks
  3. query the Delta table using Databricks SQL editor