In this page, we explain how to get your Hudi Spark job to store data in AWS S3.

AWS configs

There are two configurations required for Hudi-S3 compatibility:

  • Adding AWS Credentials for Hudi
  • Adding required jars to the classpath

AWS Credentials

The simplest way to use Hudi with S3 is to configure your SparkSession or SparkContext with S3 credentials. Hudi will automatically pick them up and talk to S3.
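
For example, here is a minimal sketch (the key, secret, and application jar names are placeholders) that passes the credentials to the Spark session at launch time via spark.hadoop.* properties, which Spark copies into the Hadoop configuration that Hudi reads:

  # Placeholders: AWS_KEY, AWS_SECRET and your-hudi-job.jar.
  # spark.hadoop.* properties are copied into the underlying Hadoop configuration.
  spark-submit \
    --conf spark.hadoop.fs.s3a.access.key=$AWS_KEY \
    --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET \
    your-hudi-job.jar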

Alternatively, add the required configs to your core-site.xml, from where Hudi can fetch them. Replace fs.defaultFS with your S3 bucket name, and Hudi should be able to read from and write to the bucket.

  <property>
    <name>fs.defaultFS</name>
    <value>s3://ysharma</value>
  </property>
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>AWS_KEY</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>AWS_SECRET</value>
  </property>
  <property>
    <name>fs.s3a.awsAccessKeyId</name>
    <value>AWS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.awsSecretAccessKey</name>
    <value>AWS_SECRET</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://IP-Address:Port</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.s3a.signing-algorithm</name>
    <value>S3SignerType</value>
  </property>

Utilities such as hudi-cli or the Hudi Streamer tool can pick up S3 credentials via environment variables prefixed with HOODIE_ENV_. For example, below is a bash snippet to set up such variables and then have the CLI work on datasets stored in S3.

  export HOODIE_ENV_fs_DOT_s3a_DOT_access_DOT_key=$accessKey
  export HOODIE_ENV_fs_DOT_s3a_DOT_secret_DOT_key=$secretKey
  export HOODIE_ENV_fs_DOT_s3_DOT_awsAccessKeyId=$accessKey
  export HOODIE_ENV_fs_DOT_s3_DOT_awsSecretAccessKey=$secretKey

AWS Libs

AWS Hadoop libraries to add to the classpath (one way to include them is shown after the list):

  • com.amazonaws:aws-java-sdk:1.10.34
  • org.apache.hadoop:hadoop-aws:2.7.3
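
One way to get these onto the classpath, shown here as a sketch with a placeholder application jar, is to pass the Maven coordinates listed above to spark-submit with --packages (adding them via --jars or bundling them into the application jar works equally well):

  # Placeholder: your-hudi-job.jar. Versions are the ones listed above.
  spark-submit \
    --packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.10.34 \
    your-hudi-job.jar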

AWS Glue Data Catalog libraries are needed if the AWS Glue Data Catalog is used:

  • com.amazonaws.glue:aws-glue-datacatalog-hive2-client:1.11.0
  • com.amazonaws:aws-java-sdk-glue:1.11.475

AWS S3 Versioned Bucket

With versioned buckets, every object deletion creates a Delete Marker. As Hudi cleans up files using the Cleaner utility, the number of Delete Markers increases over time. It is important to configure the lifecycle rule correctly to clean up these Delete Markers, as the LIST operation can choke if the number of Delete Markers reaches 1000. We recommend cleaning up Delete Markers after 1 day in the lifecycle rule.
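
As an illustrative sketch (the bucket name and rule ID are placeholders), such a lifecycle rule can be applied with the AWS CLI; this rule expires noncurrent object versions after 1 day and removes Delete Markers that no longer have any noncurrent versions behind them:

  # Placeholders: your-bucket and the rule ID.
  aws s3api put-bucket-lifecycle-configuration \
    --bucket your-bucket \
    --lifecycle-configuration '{
      "Rules": [{
        "ID": "hudi-delete-marker-cleanup",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},
        "NoncurrentVersionExpiration": {"NoncurrentDays": 1},
        "Expiration": {"ExpiredObjectDeleteMarker": true}
      }]
    }'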