This page explains how to use Hudi on Microsoft Azure.

Disclaimer

This page is maintained by the Hudi community. If the information here is inaccurate, or if you have additional information to add, please feel free to create a JIRA ticket. Contributions are highly appreciated.

Supported Storage Systems

Two Azure storage systems support Hudi:

  • Azure Blob Storage
  • Azure Data Lake Storage Gen 2

Verified Combinations of Spark and Storage Systems

HDInsight Spark 2.4 on Azure Data Lake Storage Gen 2

This combination works out of the box; no extra configuration is needed.

Databricks Spark 2.4 on Azure Data Lake Storage Gen 2

  • Import the Hudi jar into the Databricks workspace.

  • Mount the file system with dbutils:

    ```scala
    dbutils.fs.mount(
      source = "abfss://xxx@xxx.dfs.core.windows.net",
      mountPoint = "/mountpoint",
      extraConfigs = configs)
    ```
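The `configs` map passed to `extraConfigs` is not defined on this page. As one illustration, a service-principal (OAuth) configuration for ADLS Gen 2 might look like the sketch below; the application ID, secret, and tenant placeholders are assumptions for this sketch, not values from this page:

```scala
// Hypothetical OAuth (service principal) settings for mounting ADLS Gen 2.
// Replace the placeholders with your own Azure AD application values.
val configs = Map(
  "fs.azure.account.auth.type" -> "OAuth",
  "fs.azure.account.oauth.provider.type" ->
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id" -> "<application-id>",
  "fs.azure.account.oauth2.client.secret" -> "<application-secret>",
  "fs.azure.account.oauth2.client.endpoint" ->
    "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
```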
  • When writing a Hudi dataset, use the abfss URL (note that an abfss URL takes the form `container@storage-account`):

    ```scala
    inputDF.write
      .format("org.apache.hudi")
      .options(opts)
      .mode(SaveMode.Append)
      .save("abfss://<<container>>@<<storage-account>>.dfs.core.windows.net/hudi-tables/customer")
    ```
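The `opts` map used in the write above is not defined on this page. A minimal sketch of typical Hudi write options is shown below; the table name and field names are illustrative assumptions, not values from this page:

```scala
// Illustrative Hudi write options; adjust the table name and
// record key / precombine / partition fields to your schema.
val opts = Map(
  "hoodie.table.name" -> "customer",
  "hoodie.datasource.write.recordkey.field" -> "customer_id",
  "hoodie.datasource.write.precombine.field" -> "updated_at",
  "hoodie.datasource.write.partitionpath.field" -> "country")
```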
  • When reading a Hudi dataset, use the mount point:

    ```scala
    spark.read
      .format("org.apache.hudi")
      .load("/mountpoint/hudi-tables/customer")
    ```