Syncing to Glue Data Catalog
This document walks through the steps to register a OneTable synced table in Glue Data Catalog on AWS.
Pre-requisites
- Source table(s) (Hudi/Delta/Iceberg) already written to Amazon S3. If you don’t have the source table written in S3 already, you can follow the steps in this tutorial to set it up
- Set up access to interact with AWS APIs from the command line. If you haven’t installed AWSCLIv2, you can do so by following the steps outlined in the AWS docs, and then set up access credentials by following the steps here
- Clone the OneTable repository and create the utilities-0.1.0-SNAPSHOT-bundled.jar by following the steps on the Installation page
Steps
Running sync
Create my_config.yaml in the cloned OneTable directory.
Hudi
sourceFormat: DELTA|ICEBERG # choose only one
targetFormats:
  - HUDI
datasets:
  -
    tableBasePath: s3://path/to/source/data
    tableName: table_name
Delta
sourceFormat: HUDI|ICEBERG # choose only one
targetFormats:
  - DELTA
datasets:
  -
    tableBasePath: s3://path/to/source/data
    tableName: table_name
    partitionSpec: partitionpath:VALUE # you only need to specify partitionSpec for HUDI sourceFormat
Iceberg
sourceFormat: HUDI|DELTA # choose only one
targetFormats:
  - ICEBERG
datasets:
  -
    tableBasePath: s3://path/to/source/data
    tableName: table_name
    partitionSpec: partitionpath:VALUE # you only need to specify partitionSpec for HUDI sourceFormat
Replace the sourceFormat, tableBasePath and tableName fields with appropriate values.
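As a minimal sketch, the config file can also be generated from the shell. The function name write_onetable_config and its argument order are hypothetical and not part of OneTable; the field names are the ones shown in the configs above, and the optional sixth argument is only meant for HUDI sources.

```shell
# Hypothetical helper (not part of OneTable): writes a dataset config file.
# Args: source format, target format, table base path, table name,
#       output file (default my_config.yaml), optional partitionSpec.
write_onetable_config() {
  src_format=$1
  target_format=$2
  base_path=$3
  table_name=$4
  out_file=${5:-my_config.yaml}
  partition_spec=${6:-}

  cat > "$out_file" <<EOF
sourceFormat: ${src_format}
targetFormats:
  - ${target_format}
datasets:
  -
    tableBasePath: ${base_path}
    tableName: ${table_name}
EOF

  # partitionSpec is only needed when the source format is HUDI.
  if [ -n "$partition_spec" ]; then
    printf '    partitionSpec: %s\n' "$partition_spec" >> "$out_file"
  fi
}

# Example (placeholder path and table name): Delta source, Iceberg target.
# write_onetable_config DELTA ICEBERG s3://path/to/source/data table_name
```

The example call is commented out so that pasting the block only defines the function; uncomment it and substitute your own values to write my_config.yaml.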
From your terminal, under the cloned OneTable directory, run the sync process using the command below.
java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml
At this point, if you check your bucket path, you will see the .hoodie, _delta_log or metadata directory (depending on the target format). Its metadata files contain the information that helps query engines interpret the data as the target table.
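One way to check for the metadata directory from the terminal is sketched below. The helper metadata_dir is a hypothetical name introduced here; it only maps a target format to the directory name mentioned above.

```shell
# Hypothetical helper: maps a target format to its metadata directory name.
metadata_dir() {
  case "$1" in
    HUDI)    printf '%s\n' '.hoodie' ;;
    DELTA)   printf '%s\n' '_delta_log' ;;
    ICEBERG) printf '%s\n' 'metadata' ;;
    *)       echo "unknown target format: $1" >&2; return 1 ;;
  esac
}

# Usage (needs AWS credentials; the path is a placeholder):
# aws s3 ls "s3://path/to/source/data/$(metadata_dir ICEBERG)/"
```

If the listing is empty, the sync has most likely not written metadata for that target format yet.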
Register the target table in Glue Data Catalog
From your terminal, create a Glue database.
aws glue create-database --database-input "{\"Name\":\"onetable_synced_db\"}"
From your terminal, create a Glue crawler. Replace <yourAccountId>, <yourRoleName> and <path/to/source/data> with appropriate values.
export accountId=<yourAccountId>
export roleName=<yourRoleName>
export s3DataPath=s3://<path/to/source/data>
Hudi
aws glue create-crawler --name onetable_crawler --role arn:aws:iam::${accountId}:role/service-role/${roleName} --database onetable_synced_db --targets "{\"HudiTargets\":[{\"Paths\":[\"${s3DataPath}\"]}]}"
Delta
aws glue create-crawler --name onetable_crawler --role arn:aws:iam::${accountId}:role/service-role/${roleName} --database onetable_synced_db --targets "{\"DeltaTargets\":[{\"Paths\":[\"${s3DataPath}\"]}]}"
Iceberg
aws glue create-crawler --name onetable_crawler --role arn:aws:iam::${accountId}:role/service-role/${roleName} --database onetable_synced_db --targets "{\"IcebergTargets\":[{\"Paths\":[\"${s3DataPath}\"]}]}"
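The three commands above differ only in the key of the --targets payload. A minimal sketch that builds the payload from the target format follows; crawler_targets is a hypothetical helper name, and it does not JSON-escape the path, so it assumes a plain S3 path without quotes or backslashes.

```shell
# Hypothetical helper: prints the --targets JSON for a given target format.
# Args: target format (HUDI|DELTA|ICEBERG), S3 path of the table.
crawler_targets() {
  case "$1" in
    HUDI)    key=HudiTargets ;;
    DELTA)   key=DeltaTargets ;;
    ICEBERG) key=IcebergTargets ;;
    *)       echo "unknown target format: $1" >&2; return 1 ;;
  esac
  # No JSON escaping is applied to the path (assumption: plain S3 path).
  printf '{"%s":[{"Paths":["%s"]}]}\n' "$key" "$2"
}

# Usage with the variables exported earlier (not run here):
# aws glue create-crawler --name onetable_crawler \
#   --role "arn:aws:iam::${accountId}:role/service-role/${roleName}" \
#   --database onetable_synced_db \
#   --targets "$(crawler_targets ICEBERG "${s3DataPath}")"
```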
From your terminal, run the Glue crawler.
aws glue start-crawler --name onetable_crawler
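start-crawler returns as soon as the crawl is started, so you may want to wait for it to finish before querying. The sketch below polls the crawler state with aws glue get-crawler until it is back to READY. The function name wait_for_crawler, the polling interval and the attempt limit are assumptions chosen for illustration.

```shell
# Hypothetical helper: polls a Glue crawler until its state is READY.
# Args: crawler name, seconds between polls (default 15),
#       maximum number of polls (default 40).
wait_for_crawler() {
  crawler_name=$1
  interval=${2:-15}
  max_attempts=${3:-40}
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    # Crawler.State is READY once the crawler is idle again.
    state=$(aws glue get-crawler --name "$crawler_name" \
      --query 'Crawler.State' --output text) || return 1
    if [ "$state" = "READY" ]; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "crawler $crawler_name still not READY after $max_attempts checks" >&2
  return 1
}

# Usage (not run here):
# wait_for_crawler onetable_crawler && \
#   aws glue get-crawler --name onetable_crawler \
#     --query 'Crawler.LastCrawl.Status' --output text
```

READY only means the crawler is idle; the LastCrawl status in the usage lines shows whether the last run actually succeeded.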
Once the crawler succeeds, you’ll be able to query the target table from Athena, EMR and/or Redshift query engines.
Validating the results
After the crawler runs successfully, you can inspect the catalogued tables in Glue and also query the table in Amazon Athena, as shown below:
SELECT * FROM onetable_synced_db.<table_name>;
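To inspect the catalogued table without opening the console, one option is aws glue get-table. The sketch below prints the S3 location Glue recorded for the table; catalog_table_location is a hypothetical helper name, and the table name in the usage line is a placeholder.

```shell
# Hypothetical helper: prints the storage location Glue recorded for a table.
# Args: database name, table name.
catalog_table_location() {
  aws glue get-table --database-name "$1" --name "$2" \
    --query 'Table.StorageDescriptor.Location' --output text
}

# Usage (not run here; replace table_name with your table):
# catalog_table_location onetable_synced_db table_name
```

A non-zero exit status usually means the table is not in the database, for example because the crawler has not finished or did not succeed.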
Conclusion
In this guide we saw how to:
- sync a source table to create metadata for the desired target table formats using OneTable
- catalog the data in the target table format in Glue Data Catalog
- query the target table using Amazon Athena