Syncing to Unity Catalog
This document walks through the steps to register a OneTable-synced Delta table in Unity Catalog on Databricks.
Pre-requisites
- Source table(s) (Hudi/Iceberg) already written to external storage locations like S3/GCS/ADLS. If you don’t have a source table written in S3/GCS/ADLS, you can follow the steps in this tutorial to set it up.
- Set up a connection to external storage locations from Databricks.
- Create a Unity Catalog metastore in Databricks as outlined here.
- Create an external location in Databricks as outlined here.
- Clone the OneTable repository and create the utilities-0.1.0-SNAPSHOT-bundled.jar by following the steps on the Installation page.
Steps
Running sync
Create my_config.yaml in the cloned OneTable directory.
sourceFormat: HUDI|ICEBERG # choose only one
targetFormats:
  - DELTA
datasets:
  -
    tableBasePath: s3://path/to/source/data
    tableName: table_name
    partitionSpec: partitionpath:VALUE # you only need to specify partitionSpec for HUDI sourceFormat
- Replace s3://path/to/source/data with gs://path/to/source/data if your source table is in GCS, or with abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data> if your source table is in ADLS.
- Replace the sourceFormat and tableName fields with appropriate values.
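The substitutions above can also be scripted. The following is a minimal sketch, not part of OneTable: build_dataset_config is a hypothetical helper that renders the same my_config.yaml layout shown in this guide and includes partitionSpec only when the source format is HUDI.

```python
from typing import Optional


def build_dataset_config(source_format: str, table_base_path: str,
                         table_name: str,
                         partition_spec: Optional[str] = None) -> str:
    """Render my_config.yaml contents for a single dataset (hypothetical helper)."""
    source_format = source_format.upper()
    if source_format not in ("HUDI", "ICEBERG"):
        raise ValueError("sourceFormat must be HUDI or ICEBERG")
    lines = [
        f"sourceFormat: {source_format}",
        "targetFormats:",
        "  - DELTA",
        "datasets:",
        "  -",
        f"    tableBasePath: {table_base_path}",
        f"    tableName: {table_name}",
    ]
    # The guide only requires partitionSpec for a HUDI source.
    if source_format == "HUDI" and partition_spec:
        lines.append(f"    partitionSpec: {partition_spec}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values copied from the guide; replace them with your own.
    print(build_dataset_config("HUDI", "s3://path/to/source/data",
                               "table_name", "partitionpath:VALUE"))
```

Writing the returned string to my_config.yaml should give a file equivalent to the one created by hand above.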
From your terminal, in the cloned OneTable directory, run the sync process using the command below.
java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml
At this point, if you check your bucket path, you will see a _delta_log directory containing 00000000000000000000.json, which holds the log entries that help query engines interpret the source table as a Delta table.
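To get a feel for what that JSON file holds, here is a small sketch that counts the action types in a Delta commit file. It relies on the fact that a Delta commit file is newline-delimited JSON in which each line is one action keyed by its type (for example protocol, metaData, or add). The helper name summarize_delta_commit and the sample lines are illustrative assumptions; the guide does not specify exactly which actions the sync writes.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_delta_commit(lines: Iterable[str]) -> Counter:
    """Count the action types found in one Delta commit file.

    Each non-empty line is expected to be a JSON object whose top-level
    key names the action (for example "metaData" or "add").
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        action = json.loads(line)
        for action_type in action:
            counts[action_type] += 1
    return counts


if __name__ == "__main__":
    # Made-up, heavily trimmed sample lines, not real sync output.
    sample = [
        '{"commitInfo": {"operation": "WRITE"}}',
        '{"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}',
        '{"metaData": {"id": "example-id", "format": {"provider": "parquet"}}}',
        '{"add": {"path": "part-0000.parquet", "size": 1024, "dataChange": true}}',
    ]
    print(dict(summarize_delta_commit(sample)))
```

After downloading 00000000000000000000.json locally, you could pass an open file handle to the function, since iterating over a text file yields its lines.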
Register the target table in Unity Catalog
In your Databricks workspace, under SQL editor, run the following queries.
CREATE CATALOG onetable;
CREATE SCHEMA onetable.synced_delta_schema;
CREATE TABLE onetable.synced_delta_schema.<table_name>
USING DELTA
LOCATION 's3://path/to/source/data';
Replace s3://path/to/source/data with gs://path/to/source/data if your source table is in GCS, or with abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data> if your source table is in ADLS.
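If you register several tables, the statements can be generated rather than edited by hand. This is a rough sketch under stated assumptions: registration_sql is a hypothetical helper, and its default catalog and schema names are simply the ones used in this guide. It only builds the SQL text; you would still run the statements in the SQL editor.

```python
from typing import List


def registration_sql(table_name: str, location: str,
                     catalog: str = "onetable",
                     schema: str = "synced_delta_schema") -> List[str]:
    """Build the three statements from this guide for one table (hypothetical helper)."""
    # Keep the sketch simple: refuse locations that would break the quoted literal.
    if "'" in location:
        raise ValueError("location must not contain a single quote")
    return [
        f"CREATE CATALOG {catalog};",
        f"CREATE SCHEMA {catalog}.{schema};",
        f"CREATE TABLE {catalog}.{schema}.{table_name}\n"
        f"USING DELTA\n"
        f"LOCATION '{location}';",
    ]


if __name__ == "__main__":
    # Placeholder table name and GCS path; replace them with your own.
    for statement in registration_sql("table_name", "gs://path/to/source/data"):
        print(statement)
```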
Validating the results
You can now see the created Delta table in Unity Catalog under Catalog, listed as <table_name> under synced_delta_schema, and you can also query the table in the SQL editor:
SELECT * FROM onetable.synced_delta_schema.<table_name>;
Conclusion
In this guide, we saw how to:
- sync a source table to create metadata for the desired target table formats using OneTable
- catalog the data in Delta format in Unity Catalog on Databricks
- query the Delta table using the Databricks SQL editor