Syncing to Unity Catalog
This document walks through the steps to register a OneTable-synced Delta table in Unity Catalog on Databricks.
Pre-requisites
- Source table(s) (Hudi/Iceberg) already written to external storage locations like S3/GCS/ADLS. If you don’t have a source table written in S3/GCS/ADLS, you can follow the steps in this tutorial to set it up.
- Set up a connection to external storage locations from Databricks.
- Create a Unity Catalog metastore in Databricks as outlined here.
- Create an external location in Databricks as outlined here.
- Clone the OneTable repository and create the utilities-0.1.0-SNAPSHOT-bundled.jar by following the steps on the Installation page.
Steps
Running sync
Create my_config.yaml in the cloned OneTable directory.
sourceFormat: HUDI|ICEBERG # choose only one
targetFormats:
  - DELTA
datasets:
  -
    tableBasePath: s3://path/to/source/data
    tableName: table_name
    partitionSpec: partitionpath:VALUE # you only need to specify partitionSpec for HUDI sourceFormat
- Replace s3://path/to/source/data with gs://path/to/source/data if your source table is in GCS, or with abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data> if it is in ADLS.
- Replace the sourceFormat and tableName fields with appropriate values.
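The substitutions above can also be scripted. The following is a minimal Python sketch, not part of OneTable, that renders the same config text for a chosen source format and base path. The function name render_config and its parameters are hypothetical, and it assumes a single dataset synced to DELTA only.

```python
from typing import Optional


def render_config(source_format: str, table_base_path: str, table_name: str,
                  partition_spec: Optional[str] = None) -> str:
    """Return the text of my_config.yaml for one dataset synced to DELTA."""
    if source_format not in ("HUDI", "ICEBERG"):
        raise ValueError("sourceFormat must be HUDI or ICEBERG")
    lines = [
        f"sourceFormat: {source_format}",
        "targetFormats:",
        "  - DELTA",
        "datasets:",
        "  -",
        f"    tableBasePath: {table_base_path}",
        f"    tableName: {table_name}",
    ]
    # partitionSpec is only needed for a HUDI source, per the comment in the config.
    if source_format == "HUDI" and partition_spec:
        lines.append(f"    partitionSpec: {partition_spec}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values copied from the guide; substitute your own.
    with open("my_config.yaml", "w") as f:
        f.write(render_config("HUDI", "s3://path/to/source/data",
                              "table_name", "partitionpath:VALUE"))
```

Generating the file this way keeps the GCS and ADLS variants down to a change in one argument, for example render_config("ICEBERG", "gs://path/to/source/data", "table_name").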
From your terminal, in the cloned OneTable directory, run the sync process with the following command.
java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml
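If the sync is launched from a script rather than typed by hand, a thin wrapper can fail early when the jar or config file is missing. This is a sketch under the assumption that it runs from the cloned OneTable directory. The helper names build_sync_command and run_sync are hypothetical, while the jar path and the --datasetConfig flag are taken from the command above.

```python
import subprocess
from pathlib import Path

# Jar location as used in the command above, relative to the cloned repository.
JAR = "utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar"


def build_sync_command(config_path, jar_path=JAR):
    """Return the argument list for the sync command."""
    return ["java", "-jar", jar_path, "--datasetConfig", config_path]


def run_sync(config_path, jar_path=JAR):
    """Run the sync, failing early if the jar or the config file is missing."""
    for required in (jar_path, config_path):
        if not Path(required).is_file():
            raise FileNotFoundError(
                f"{required} not found; run from the cloned OneTable directory")
    # check=True raises CalledProcessError if the sync exits with a non-zero status.
    subprocess.run(build_sync_command(config_path, jar_path), check=True)
```

Calling run_sync("my_config.yaml") is then equivalent to the command above, with a clearer error when a prerequisite file is absent.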
At this point, if you check your bucket path, you will see a _delta_log directory containing 00000000000000000000.json, the commit log that lets query engines interpret the source table as a Delta table.
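A Delta commit file is newline-delimited JSON, with one action per line keyed by the action name (for example protocol, metaData, add, or commitInfo). As a quick sanity check, the sketch below counts the actions in a commit file that has been copied to local disk. The function name summarize_commit is hypothetical, and the exact set of actions the sync writes is not specified in this guide.

```python
import json
from collections import Counter


def summarize_commit(text: str) -> Counter:
    """Count Delta log actions by type in the contents of one commit file."""
    counts = Counter()
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        action = json.loads(line)
        # Each line is a JSON object whose top-level key names the action.
        counts.update(action.keys())
    return counts


if __name__ == "__main__":
    # Assumes the first commit file was downloaded into the current directory.
    with open("00000000000000000000.json") as f:
        print(summarize_commit(f.read()))
```

A commit with a metaData action and one add action per data file suggests the table metadata was written as expected.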
Register the target table in Unity Catalog
In your Databricks workspace, under SQL editor, run the following queries.
CREATE CATALOG onetable;

CREATE SCHEMA onetable.synced_delta_schema;

CREATE TABLE onetable.synced_delta_schema.<table_name>
USING DELTA
LOCATION 's3://path/to/source/data';
Replace s3://path/to/source/data with gs://path/to/source/data if your source table is in GCS, or with abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data> if it is in ADLS.
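When the same registration is repeated for several tables or storage providers, the statements can be generated instead of edited by hand. The sketch below only builds the SQL text shown above and does not connect to Databricks. The function name register_statements is hypothetical, and the default catalog and schema names are the ones used in this guide.

```python
def register_statements(table_name, location,
                        catalog="onetable", schema="synced_delta_schema"):
    """Return the three registration statements for one synced Delta table."""
    # Only the URI schemes mentioned in this guide are accepted.
    if not location.startswith(("s3://", "gs://", "abfss://")):
        raise ValueError("location must be an s3://, gs://, or abfss:// URI")
    if "'" in location:
        raise ValueError("location must not contain a single quote")
    return [
        f"CREATE CATALOG {catalog};",
        f"CREATE SCHEMA {catalog}.{schema};",
        f"CREATE TABLE {catalog}.{schema}.{table_name} "
        f"USING DELTA LOCATION '{location}';",
    ]


if __name__ == "__main__":
    # Placeholder table name and path; substitute your own.
    for stmt in register_statements("table_name", "gs://path/to/source/data"):
        print(stmt)
```

The printed statements can be pasted into the SQL editor. The location passed here should match the tableBasePath used in my_config.yaml.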
Validating the results
You can now see the created Delta table in Unity Catalog under Catalog, listed as <table_name> under synced_delta_schema. You can also query the table in the SQL editor:
SELECT * FROM onetable.synced_delta_schema.<table_name>;
Conclusion
In this guide we saw how to:
- sync a source table to create metadata for the desired target table formats using OneTable
- catalog the data in Delta format in Unity Catalog on Databricks
- query the Delta table using Databricks SQL editor
