This page covers the different ways of configuring your job to write/read Hudi tables. At a high level, you can control behaviour at few levels.

  • Hudi Table Config: Basic Hudi Table configuration parameters.
  • Environment Config: Hudi supports passing configurations via a configuration file hudi-default.conf in which each line consists of a key and a value separated by whitespace or = sign. For example:
    1. hoodie.datasource.hive_sync.mode jdbc
    2. hoodie.datasource.hive_sync.jdbcurl jdbc:hive2://localhost:10000
    3. hoodie.datasource.hive_sync.support_timestamp false
    It helps to have a central configuration file for your common cross job configurations/tunings, so all the jobs on your cluster can utilize it. It also works with Spark SQL DML/DDL, and helps avoid having to pass configs inside the SQL statements.

By default, Hudi would load the configuration file under /etc/hudi/conf directory. You can specify a different configuration directory location by setting the HUDI_CONF_DIR environment variable.

  • Spark Datasource Configs: These configs control the Hudi Spark Datasource, providing ability to define keys/partitioning, pick out the write operation, specify how to merge records or choosing query type to read.
  • Flink Sql Configs: These configs control the Hudi Flink SQL source/sink connectors, providing ability to define record keys, pick out the write operation, specify how to merge records, enable/disable asynchronous compaction or choosing query type to read.
  • Write Client Configs: Internally, the Hudi datasource uses a RDD based HoodieWriteClient API to actually perform writes to storage. These configs provide deep control over lower level aspects like file sizing, compression, parallelism, compaction, write schema, cleaning etc. Although Hudi provides sane defaults, from time-time these configs may need to be tweaked to optimize for specific workloads.
  • Reader Configs: Please fill in the description for Config Group Name: Reader Configs
  • Metastore and Catalog Sync Configs: Configurations used by the Hudi to sync metadata to external metastores and catalogs.
  • Metrics Configs: These set of configs are used to enable monitoring and reporting of key Hudi stats and metrics.
  • Record Payload Config: This is the lowest level of customization offered by Hudi. Record payloads define how to produce new values to upsert based on incoming new record and stored old record. Hudi provides default implementations such as OverwriteWithLatestAvroPayload which simply update table with the latest/last-written record. This can be overridden to a custom class extending HoodieRecordPayload class, on both datasource and WriteClient levels.
  • Kafka Connect Configs: These set of configs are used for Kafka Connect Sink Connector for writing Hudi Tables
  • Amazon Web Services Configs: Configurations specific to Amazon Web Services.
  • Hudi Streamer Configs: These set of configs are used for Hudi Streamer utility which provides the way to ingest from different sources such as DFS or Kafka.

In the tables below (N/A) means there is no default value set

Externalized Config File

Instead of directly passing configuration settings to every Hudi job, you can also centrally set them in a configuration file hudi-default.conf. By default, Hudi would load the configuration file under /etc/hudi/conf directory. You can specify a different configuration directory location by setting the HUDI_CONF_DIR environment variable. This can be useful for uniformly enforcing repeated configs (like Hive sync or write/index tuning), across your entire data lake.

Hudi Table Config

Basic Hudi Table configuration parameters.

Hudi Table Basic Configs

Configurations of the Hudi Table like type of ingestion, storage formats, hive table name etc. Configurations are loaded from hoodie.properties, these properties are usually set during initializing a path as hoodie base path and never changes during the lifetime of a hoodie table.

Basic Configs

Config Name Default Description
hoodie.bootstrap.base.path (N/A) Base path of the dataset that needs to be bootstrapped as a Hudi table
Config Param: BOOTSTRAP_BASE_PATH
hoodie.database.name (N/A) Database name that will be used for incremental query.If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database
Config Param: DATABASE_NAME
hoodie.table.checksum (N/A) Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading table config.
Config Param: TABLE_CHECKSUM
Since Version: 0.11.0
hoodie.table.create.schema (N/A) Schema used when creating the table, for the first time.
Config Param: CREATE_SCHEMA
hoodie.table.keygenerator.class (N/A) Key Generator class property for the hoodie table
Config Param: KEY_GENERATOR_CLASS_NAME
hoodie.table.metadata.partitions (N/A) Comma-separated list of metadata partitions that have been completely built and in-sync with data table. These partitions are ready for use by the readers
Config Param: TABLE_METADATA_PARTITIONS
Since Version: 0.11.0
hoodie.table.metadata.partitions.inflight (N/A) Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers.
Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT
Since Version: 0.11.0
hoodie.table.name (N/A) Table name that will be used for registering with Hive. Needs to be same across runs.
Config Param: NAME
hoodie.table.partition.fields (N/A) Fields used to partition the table. Concatenated values of these fields are used as the partition path, by invoking toString()
Config Param: PARTITION_FIELDS
hoodie.table.precombine.field (N/A) Field used in preCombining before actual write. By default, when two records have the same key value, the largest value for the precombine field determined by Object.compareTo(..), is picked.
Config Param: PRECOMBINE_FIELD
hoodie.table.recordkey.fields (N/A) Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.
Config Param: RECORDKEY_FIELDS
hoodie.table.secondary.indexes.metadata (N/A) The metadata of secondary indexes
Config Param: SECONDARY_INDEXES_METADATA
Since Version: 0.13.0
hoodie.timeline.layout.version (N/A) Version of timeline used, by the table.
Config Param: TIMELINE_LAYOUT_VERSION
hoodie.archivelog.folder archived path under the meta folder, to store archived timeline instants at.
Config Param: ARCHIVELOG_FOLDER
hoodie.bootstrap.index.class org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex Implementation to use, for mapping base files to bootstrap base file, that contain actual data.
Config Param: BOOTSTRAP_INDEX_CLASS_NAME
hoodie.bootstrap.index.enable true Whether or not, this is a bootstrapped table, with bootstrap base data and an mapping index defined, default true.
Config Param: BOOTSTRAP_INDEX_ENABLE
hoodie.compaction.payload.class org.apache.hudi.common.model.OverwriteWithLatestAvroPayload Payload class to use for performing compactions, i.e merge delta logs with current base file and then produce a new base file.
Config Param: PAYLOAD_CLASS_NAME
hoodie.compaction.record.merger.strategy eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which has the same merger strategy id
Config Param: RECORD_MERGER_STRATEGY
Since Version: 0.13.0
hoodie.datasource.write.hive_style_partitioning false Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)
Config Param: HIVE_STYLE_PARTITIONING_ENABLE
hoodie.partition.metafile.use.base.format false If true, partition metafiles are saved in the same format as base-files for this dataset (e.g. Parquet / ORC). If false (default) partition metafiles are saved as properties files.
Config Param: PARTITION_METAFILE_USE_BASE_FORMAT
hoodie.populate.meta.fields true When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing
Config Param: POPULATE_META_FIELDS
hoodie.table.base.file.format PARQUET Base file format to store all the base file data.
Config Param: BASE_FILE_FORMAT
hoodie.table.cdc.enabled false When enable, persist the change data if necessary, and can be queried as a CDC query mode.
Config Param: CDC_ENABLED
Since Version: 0.13.0
hoodie.table.cdc.supplemental.logging.mode DATA_BEFORE_AFTER org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode: Change log capture supplemental logging mode. The supplemental log is used for accelerating the generation of change log details. OP_KEY_ONLY: Only keeping record keys in the supplemental logs, so the reader needs to figure out the update before image and after image. DATA_BEFORE: Keeping the before images in the supplemental logs, so the reader needs to figure out the update after images. DATA_BEFORE_AFTER(default): Keeping the before and after images in the supplemental logs, so the reader can generate the details directly from the logs.
Config Param: CDC_SUPPLEMENTAL_LOGGING_MODE
Since Version: 0.13.0
hoodie.table.log.file.format HOODIE_LOG Log format used for the delta logs.
Config Param: LOG_FILE_FORMAT
hoodie.table.timeline.timezone LOCAL User can set hoodie commit timeline timezone, such as utc, local and so on. local is default
Config Param: TIMELINE_TIMEZONE
hoodie.table.type COPY_ON_WRITE The table type for the underlying data, for this write. This can’t change between writes.
Config Param: TYPE
hoodie.table.version ZERO Version of table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes.
Config Param: VERSION

Advanced Configs

Config Name Default Description
hoodie.datasource.write.drop.partition.columns false When set to true, will not write the partition columns into hudi. By default, false.
Config Param: DROP_PARTITION_COLUMNS
hoodie.datasource.write.partitionpath.urlencode false Should we url encode the partition path value, before creating the folder structure.
Config Param: URL_ENCODE_PARTITIONING

Spark Datasource Configs

These configs control the Hudi Spark Datasource, providing ability to define keys/partitioning, pick out the write operation, specify how to merge records or choosing query type to read.

Read Options

Options useful for reading tables via read.format.option(...)

Basic Configs

Config Name Default Description
hoodie.datasource.read.begin.instanttime (N/A) Required when hoodie.datasource.query.type is set to incremental. Represents the instant time to start incrementally pulling data from. The instanttime here need not necessarily correspond to an instant on the timeline. New data written with an instant_time > BEGIN_INSTANTTIME are fetched out. For e.g: ‘20170901080000’ will get all new data written after Sep 1, 2017 08:00AM. Note that if hoodie.read.timeline.holes.resolution.policy set to USE_TRANSITION_TIME, will use instant’s stateTransitionTime to perform comparison.
Config Param: BEGIN_INSTANTTIME
hoodie.datasource.read.end.instanttime (N/A) Used when hoodie.datasource.query.type is set to incremental. Represents the instant time to limit incrementally fetched data to. When not specified latest commit time from timeline is assumed by default. When specified, new data written with an instant_time <= END_INSTANTTIME are fetched out. Point in time type queries make more sense with begin and end instant times specified. Note that if hoodie.read.timeline.holes.resolution.policy set to USE_TRANSITION_TIME, will use instant’s stateTransitionTime to perform comparison.
Config Param: END_INSTANTTIME
hoodie.datasource.query.type snapshot Whether data needs to be read, in incremental mode (new data since an instantTime) (or) read_optimized mode (obtain latest view, based on base files) (or) snapshot mode (obtain latest view, by merging base and (if any) log files)
Config Param: QUERY_TYPE
hoodie.datasource.write.precombine.field ts Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)
Config Param: READ_PRE_COMBINE_FIELD

Advanced Configs

Config Name Default Description
as.of.instant (N/A) The query instant for time travel. Without specified this option, we query the latest snapshot.
Config Param: TIME_TRAVEL_AS_OF_INSTANT
hoodie.datasource.read.paths (N/A) Comma separated list of file paths to read within a Hudi table.
Config Param: READ_PATHS
hoodie.datasource.merge.type payload_combine For Snapshot query on merge on read table, control whether we invoke the record payload implementation to merge (payload_combine) or skip merging altogetherskip_merge
Config Param: REALTIME_MERGE
hoodie.datasource.query.incremental.format latest_state This config is used alone with the ‘incremental’ query type.When set to ‘latest_state’, it returns the latest records’ values.When set to ‘cdc’, it returns the cdc data.
Config Param: INCREMENTAL_FORMAT
Since Version: 0.13.0
hoodie.datasource.read.extract.partition.values.from.path false When set to true, values for partition columns (partition values) will be extracted from physical partition path (default Spark behavior). When set to false partition values will be read from the data file (in Hudi partition columns are persisted by default). This config is a fallback allowing to preserve existing behavior, and should not be used otherwise.
Config Param: EXTRACT_PARTITION_VALUES_FROM_PARTITION_PATH
Since Version: 0.11.0
hoodie.datasource.read.file.index.listing.mode lazy Overrides Hudi’s file-index implementation’s file listing mode: when set to ‘eager’, file-index will list all partition paths and corresponding file slices w/in them eagerly, during initialization, prior to partition-pruning kicking in, meaning that all partitions will be listed including ones that might be subsequently pruned out; when set to ‘lazy’, partitions and file-slices w/in them will be listed lazily (ie when they actually accessed, instead of when file-index is initialized) allowing partition pruning to occur before that, only listing partitions that has already been pruned. Please note that, this config is provided purely to allow to fallback to behavior existing prior to 0.13.0 release, and will be deprecated soon after.
Config Param: FILE_INDEX_LISTING_MODE_OVERRIDE
Since Version: 0.13.0
hoodie.datasource.read.file.index.listing.partition-path-prefix.analysis.enabled true Controls whether partition-path prefix analysis is enabled w/in the file-index, allowing to avoid necessity to recursively list deep folder structures of partitioned tables w/ multiple partition columns, by carefully analyzing provided partition-column predicates and deducing corresponding partition-path prefix from them (if possible).
Config Param: FILE_INDEX_LISTING_PARTITION_PATH_PREFIX_ANALYSIS_ENABLED
Since Version: 0.13.0
hoodie.datasource.read.incr.fallback.fulltablescan.enable false When doing an incremental query whether we should fall back to full table scans if file does not exist.
Config Param: INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN
hoodie.datasource.read.incr.filters For use-cases like DeltaStreamer which reads from Hoodie Incremental table and applies opaque map functions, filters appearing late in the sequence of transformations cannot be automatically pushed down. This option allows setting filters directly on Hoodie Source.
Config Param: PUSH_DOWN_INCR_FILTERS
hoodie.datasource.read.incr.path.glob For the use-cases like users only want to incremental pull from certain partitions instead of the full table. This option allows using glob pattern to directly filter on path.
Config Param: INCR_PATH_GLOB
hoodie.datasource.read.schema.use.end.instanttime false Uses end instant schema when incrementally fetched data to. Default: users latest instant schema.
Config Param: INCREMENTAL_READ_SCHEMA_USE_END_INSTANTTIME
hoodie.datasource.read.use.new.parquet.file.format false Read using the new Hudi parquet file format. The new Hudi parquet file format is introduced as an experimental feature in 0.14.0. Currently, the new Hudi parquet file format only applies to bootstrap and MOR queries. Schema evolution is also not supported by the new file format.
Config Param: USE_NEW_HUDI_PARQUET_FILE_FORMAT
Since Version: 0.14.0
hoodie.datasource.streaming.startOffset earliest Start offset to pull data from hoodie streaming source. allow earliest, latest, and specified start instant time
Config Param: START_OFFSET
Since Version: 0.13.0
hoodie.enable.data.skipping false Enables data-skipping allowing queries to leverage indexes to reduce the search space by skipping over files
Config Param: ENABLE_DATA_SKIPPING
Since Version: 0.10.0
hoodie.file.index.enable true Enables use of the spark file index implementation for Hudi, that speeds up listing of large tables.
Config Param: ENABLE_HOODIE_FILE_INDEX
hoodie.read.timeline.holes.resolution.policy FAIL When doing incremental queries, there could be hollow commits (requested or inflight commits that are not the latest) that are produced by concurrent writers and could lead to potential data loss. This config allows users to have different ways of handling this situation. The valid values are [FAIL, BLOCK, USE_TRANSITION_TIME]: Use FAIL to throw an exception when hollow commit is detected. This is helpful when hollow commits are not expected. Use BLOCK to block processing commits from going beyond the hollow ones. This fits the case where waiting for hollow commits to finish is acceptable. Use USE_TRANSITION_TIME (experimental) to query commits in range by state transition time (completion time), instead of commit time (start time). Using this mode will result in begin.instanttime and end.instanttime using stateTransitionTime instead of the instant’s commit time.
Config Param: INCREMENTAL_READ_HANDLE_HOLLOW_COMMIT
Since Version: 0.14.0
hoodie.schema.on.read.enable false Enables support for Schema Evolution feature
Config Param: SCHEMA_EVOLUTION_ENABLED

Write Options

You can pass down any of the WriteClient level configs directly using options() or option(k,v) methods.

  1. inputDF.write()
  2. .format("org.apache.hudi")
  3. .options(clientOpts) // any of the Hudi client opts can be passed in as well
  4. .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
  5. .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
  6. .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
  7. .option(HoodieWriteConfig.TABLE_NAME, tableName)
  8. .mode(SaveMode.Append)
  9. .save(basePath);

Options useful for writing tables via write.format.option(...)

Basic Configs

Config Name Default Description
hoodie.datasource.hive_sync.mode (N/A) Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.
Config Param: HIVE_SYNC_MODE
hoodie.datasource.write.partitionpath.field (N/A) Partition path field. Value to be used at the partitionPath component of HoodieKey. Actual value obtained by invoking .toString()
Config Param: PARTITIONPATH_FIELD
hoodie.datasource.write.recordkey.field (N/A) Record key field. Value to be used as the recordKey component of HoodieKey. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: a.b.c
Config Param: RECORDKEY_FIELD
hoodie.clustering.async.enabled false Enable running of clustering service, asynchronously as inserts happen on the table.
Config Param: ASYNC_CLUSTERING_ENABLE
Since Version: 0.7.0
hoodie.clustering.inline false Turn on inline clustering - clustering will be run after each write operation is complete
Config Param: INLINE_CLUSTERING_ENABLE
Since Version: 0.7.0
hoodie.datasource.hive_sync.enable false When set to true, register/sync the table to Apache Hive metastore.
Config Param: HIVE_SYNC_ENABLED
hoodie.datasource.hive_sync.jdbcurl jdbc:hive2://localhost:10000 Hive metastore url
Config Param: HIVE_URL
hoodie.datasource.hive_sync.metastore.uris thrift://localhost:9083 Hive metastore url
Config Param: METASTORE_URIS
hoodie.datasource.meta.sync.enable false Enable Syncing the Hudi Table with an external meta store or data catalog.
Config Param: META_SYNC_ENABLED
hoodie.datasource.write.hive_style_partitioning false Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)
Config Param: HIVE_STYLE_PARTITIONING
hoodie.datasource.write.operation upsert Whether to do upsert, insert or bulk_insert for the write operation. Use bulk_insert to load new data into a table, and there on use upsert/insert. bulk insert uses a disk based write path to scale to load large inputs without need to cache it.
Config Param: OPERATION
hoodie.datasource.write.precombine.field ts Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)
Config Param: PRECOMBINE_FIELD
hoodie.datasource.write.table.type COPY_ON_WRITE The table type for the underlying data, for this write. This can’t change between writes.
Config Param: TABLE_TYPE

Advanced Configs

Config Name Default Description
hoodie.datasource.hive_sync.serde_properties (N/A) Serde properties to hive table.
Config Param: HIVE_TABLE_SERDE_PROPERTIES
hoodie.datasource.hive_sync.table_properties (N/A) Additional properties to store with table.
Config Param: HIVE_TABLE_PROPERTIES
hoodie.datasource.overwrite.mode (N/A) Controls whether overwrite use dynamic or static mode, if not configured, respect spark.sql.sources.partitionOverwriteMode
Config Param: OVERWRITE_MODE
Since Version: 0.14.0
hoodie.datasource.write.partitions.to.delete (N/A) Comma separated list of partitions to delete. Allows use of wildcard *
Config Param: PARTITIONS_TO_DELETE
hoodie.datasource.write.table.name (N/A) Table name for the datasource write. Also used to register the table into meta stores.
Config Param: TABLE_NAME
hoodie.datasource.compaction.async.enable true Controls whether async compaction should be turned on for MOR table writing.
Config Param: ASYNC_COMPACT_ENABLE
hoodie.datasource.hive_sync.assume_date_partitioning false Assume partitioning is yyyy/MM/dd
Config Param: HIVE_ASSUME_DATE_PARTITION
hoodie.datasource.hive_sync.auto_create_database true Auto create hive database if does not exists
Config Param: HIVE_AUTO_CREATE_DATABASE
hoodie.datasource.hive_sync.base_file_format PARQUET Base file format for the sync.
Config Param: HIVE_BASE_FILE_FORMAT
hoodie.datasource.hive_sync.batch_num 1000 The number of partitions one batch when synchronous partitions to hive.
Config Param: HIVE_BATCH_SYNC_PARTITION_NUM
hoodie.datasource.hive_sync.bucket_sync false Whether sync hive metastore bucket specification when using bucket index.The specification is ‘CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS’
Config Param: HIVE_SYNC_BUCKET_SYNC
hoodie.datasource.hive_sync.create_managed_table false Whether to sync the table as managed table.
Config Param: HIVE_CREATE_MANAGED_TABLE
hoodie.datasource.hive_sync.database default The name of the destination database that we should sync the hudi table to.
Config Param: HIVE_DATABASE
hoodie.datasource.hive_sync.ignore_exceptions false Ignore exceptions when syncing with Hive.
Config Param: HIVE_IGNORE_EXCEPTIONS
hoodie.datasource.hive_sync.partition_extractor_class org.apache.hudi.hive.MultiPartKeysValueExtractor Class which implements PartitionValueExtractor to extract the partition values, default ‘org.apache.hudi.hive.MultiPartKeysValueExtractor’.
Config Param: HIVE_PARTITION_EXTRACTOR_CLASS
hoodie.datasource.hive_sync.partition_fields Field in the table to use for determining hive partition columns.
Config Param: HIVE_PARTITION_FIELDS
hoodie.datasource.hive_sync.password hive hive password to use
Config Param: HIVE_PASS
hoodie.datasource.hive_sync.skip_ro_suffix false Skip the _ro suffix for Read optimized table, when registering
Config Param: HIVE_SKIP_RO_SUFFIX_FOR_READ_OPTIMIZED_TABLE
hoodie.datasource.hive_sync.support_timestamp false ‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. Disabled by default for backward compatibility. NOTE: On Spark entrypoints, this is defaulted to TRUE
Config Param: HIVE_SUPPORT_TIMESTAMP_TYPE
hoodie.datasource.hive_sync.sync_as_datasource true
Config Param: HIVE_SYNC_AS_DATA_SOURCE_TABLE
hoodie.datasource.hive_sync.sync_comment false Whether to sync the table column comments while syncing the table.
Config Param: HIVE_SYNC_COMMENT
hoodie.datasource.hive_sync.table unknown The name of the destination table that we should sync the hudi table to.
Config Param: HIVE_TABLE
hoodie.datasource.hive_sync.use_jdbc true Use JDBC when hive synchronization is enabled
Config Param: HIVE_USE_JDBC
hoodie.datasource.hive_sync.use_pre_apache_input_format false Flag to choose InputFormat under com.uber.hoodie package instead of org.apache.hudi package. Use this when you are in the process of migrating from com.uber.hoodie to org.apache.hudi. Stop using this after you migrated the table definition to org.apache.hudi input format
Config Param: HIVE_USE_PRE_APACHE_INPUT_FORMAT
hoodie.datasource.hive_sync.username hive hive user name to use
Config Param: HIVE_USER
hoodie.datasource.insert.dup.policy none Note This is only applicable to Spark SQL writing.<br />When operation type is set to “insert”, users can optionally enforce a dedup policy. This policy will be employed when records being ingested already exists in storage. Default policy is none and no action will be taken. Another option is to choose “drop”, on which matching records from incoming will be dropped and the rest will be ingested. Third option is “fail” which will fail the write operation when same records are re-ingested. In other words, a given record as deduced by the key generation policy can be ingested only once to the target table of interest.
Config Param: INSERT_DUP_POLICY
Since Version: 0.14.0
hoodie.datasource.meta_sync.condition.sync false If true, only sync on conditions like schema change or partition change.
Config Param: HIVE_CONDITIONAL_SYNC
hoodie.datasource.write.commitmeta.key.prefix _ Option keys beginning with this prefix, are automatically added to the commit/deltacommit metadata. This is useful to store checkpointing information, in a consistent way with the hudi timeline
Config Param: COMMIT_METADATA_KEYPREFIX
hoodie.datasource.write.drop.partition.columns false When set to true, will not write the partition columns into hudi. By default, false.
Config Param: DROP_PARTITION_COLUMNS
hoodie.datasource.write.insert.drop.duplicates false If set to true, records from the incoming dataframe will not overwrite existing records with the same key during the write operation. <br /> Note Just for Insert operation in Spark SQL writing since 0.14.0, users can switch to the config hoodie.datasource.insert.dup.policy instead for a simplified duplicate handling experience. The new config will be incorporated into all other writing flows and this config will be fully deprecated in future releases.
Config Param: INSERT_DROP_DUPS
hoodie.datasource.write.keygenerator.class org.apache.hudi.keygen.SimpleKeyGenerator Key generator class, that implements org.apache.hudi.keygen.KeyGenerator
Config Param: KEYGENERATOR_CLASS_NAME
hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled false When set to true, consistent value will be generated for a logical timestamp type column, like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so as not to break the pipeline that deploy either fully row-writer path or non row-writer path. For example, if it is kept disabled then record key of timestamp type with value 2016-12-29 09:54:00 will be written as timestamp 2016-12-29 09:54:00.0 in row-writer path, while it will be written as long value 1483023240000000 in non row-writer path. If enabled, then the timestamp value will be written in both the cases.
Config Param: KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED
Since Version: 0.10.1
hoodie.datasource.write.partitionpath.urlencode false Should we url encode the partition path value, before creating the folder structure.
Config Param: URL_ENCODE_PARTITIONING
hoodie.datasource.write.payload.class org.apache.hudi.common.model.OverwriteWithLatestAvroPayload Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL in-effective
Config Param: PAYLOAD_CLASS_NAME
hoodie.datasource.write.reconcile.schema false This config controls how writer’s schema will be selected based on the incoming batch’s schema as well as existing table’s one. When schema reconciliation is DISABLED, incoming batch’s schema will be picked as a writer-schema (therefore updating table’s schema). When schema reconciliation is ENABLED, writer-schema will be picked such that table’s schema (after txn) is either kept the same or extended, meaning that we’ll always prefer the schema that either adds new columns or stays the same. This enables us, to always extend the table’s schema during evolution and never lose the data (when, for ex, existing column is being dropped in a new batch)
Config Param: RECONCILE_SCHEMA
hoodie.datasource.write.record.merger.impls org.apache.hudi.common.model.HoodieAvroRecordMerger List of HoodieMerger implementations constituting Hudi’s merging strategy — based on the engine used. These merger impls will filter by hoodie.datasource.write.record.merger.strategy Hudi will pick most efficient implementation to perform merging/combining of the records (during update, reading MOR table, etc)
Config Param: RECORD_MERGER_IMPLS
Since Version: 0.13.0
hoodie.datasource.write.record.merger.strategy eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which has the same merger strategy id
Config Param: RECORD_MERGER_STRATEGY
Since Version: 0.13.0
hoodie.datasource.write.row.writer.enable true When set to true, will perform write operations directly using the spark native Row representation, avoiding any additional conversion costs.
Config Param: ENABLE_ROW_WRITER
hoodie.datasource.write.streaming.checkpoint.identifier default_single_writer A stream identifier used for HUDI to fetch the right checkpoint(batch id to be more specific) corresponding this writer. Please note that keep the identifier an unique value for different writer if under multi-writer scenario. If the value is not set, will only keep the checkpoint info in the memory. This could introduce the potential issue that the job is restart(batch id is lost) while spark checkpoint write fails, causing spark will retry and rewrite the data.
Config Param: STREAMING_CHECKPOINT_IDENTIFIER
Since Version: 0.13.0
hoodie.datasource.write.streaming.disable.compaction false By default for MOR table, async compaction is enabled with spark streaming sink. By setting this config to true, we can disable it and the expectation is that, users will schedule and execute compaction in a different process/job altogether. Some users may wish to run it separately to manage resources across table services and regular ingestion pipeline and so this could be preferred on such cases.
Config Param: STREAMING_DISABLE_COMPACTION
Since Version: 0.14.0
hoodie.datasource.write.streaming.ignore.failed.batch false Config to indicate whether to ignore any non exception error (e.g. writestatus error) within a streaming microbatch. Turning this on, could hide the write status errors while the spark checkpoint moves ahead.So, would recommend users to use this with caution.
Config Param: STREAMING_IGNORE_FAILED_BATCH
hoodie.datasource.write.streaming.retry.count 3 Config to indicate how many times streaming job should retry for a failed micro batch.
Config Param: STREAMING_RETRY_CNT
hoodie.datasource.write.streaming.retry.interval.ms 2000 Config to indicate how long (by millisecond) before a retry should issued for failed microbatch
Config Param: STREAMING_RETRY_INTERVAL_MS
hoodie.meta.sync.client.tool.class org.apache.hudi.hive.HiveSyncTool Sync tool class name used to sync to metastore. Defaults to Hive.
Config Param: META_SYNC_CLIENT_TOOL_CLASS_NAME
hoodie.spark.sql.insert.into.operation insert Sql write operation to use with INSERT_INTO spark sql command. This comes with 3 possible values, bulk_insert, insert and upsert. bulk_insert is generally meant for initial loads and is known to be performant compared to insert. But bulk_insert may not do small file management. If you prefer hudi to automatically manage small files, then you can go with “insert”. There is no precombine (if there are duplicates within the same batch being ingested, same dups will be ingested) with bulk_insert and insert and there is no index look up as well. If you may use INSERT_INTO for mutable dataset, then you may have to set this config value to “upsert”. With upsert, you will get both precombine and updates to existing records on storage is also honored. If not, you may see duplicates.
Config Param: SPARK_SQL_INSERT_INTO_OPERATION
Since Version: 0.14.0
hoodie.spark.sql.optimized.writes.enable true Controls whether spark sql prepped update, delete, and merge are enabled.
Config Param: SPARK_SQL_OPTIMIZED_WRITES
Since Version: 0.14.0
hoodie.sql.bulk.insert.enable false When set to true, the sql insert statement will use bulk insert. This config is deprecated as of 0.14.0. Please use hoodie.spark.sql.insert.into.operation instead.
Config Param: SQL_ENABLE_BULK_INSERT
hoodie.sql.insert.mode upsert Insert mode when insert data to pk-table. The optional modes are: upsert, strict and non-strict.For upsert mode, insert statement do the upsert operation for the pk-table which will update the duplicate record.For strict mode, insert statement will keep the primary key uniqueness constraint which do not allow duplicate record.While for non-strict mode, hudi just do the insert operation for the pk-table. This config is deprecated as of 0.14.0. Please use hoodie.spark.sql.insert.into.operation and hoodie.datasource.insert.dup.policy as you see fit.
Config Param: SQL_INSERT_MODE
hoodie.streamer.source.kafka.value.deserializer.class io.confluent.kafka.serializers.KafkaAvroDeserializer This class is used by kafka client to deserialize the records
Config Param: KAFKA_AVRO_VALUE_DESERIALIZER_CLASS
Since Version: 0.9.0
hoodie.write.set.null.for.missing.columns false When a nullable column is missing from incoming batch during a write operation, the write operation will fail schema compatibility check. Set this option to true will make the missing column be filled with null values to successfully complete the write operation.
Config Param: SET_NULL_FOR_MISSING_COLUMNS
Since Version: 0.14.1

PreCommit Validator Configurations

The following set of configurations help validate new data before commits.

Advanced Configs

Config Name Default Description
hoodie.precommit.validators Comma separated list of class names that can be invoked to validate commit
Config Param: VALIDATOR_CLASS_NAMES
hoodie.precommit.validators.equality.sql.queries Spark SQL queries to run on table before committing new data to validate state before and after commit. Multiple queries separated by ‘;’ delimiter are supported. Example: “select count(*) from \<TABLE_NAME\> Note \<TABLE_NAME\> is replaced by table state before and after commit.
Config Param: EQUALITY_SQL_QUERIES
hoodie.precommit.validators.inequality.sql.queries Spark SQL queries to run on table before committing new data to validate state before and after commit.Multiple queries separated by ‘;’ delimiter are supported.Example query: ‘select count(*) from \<TABLE_NAME\> where col=null’Note \<TABLE_NAME\> variable is expected to be present in query.
Config Param: INEQUALITY_SQL_QUERIES
hoodie.precommit.validators.single.value.sql.queries Spark SQL queries to run on table before committing new data to validate state after commit.Multiple queries separated by ‘;’ delimiter are supported.Expected result is included as part of query separated by ‘#’. Example query: ‘query1#result1:query2#result2’Note \<TABLE_NAME\> variable is expected to be present in query.
Config Param: SINGLE_VALUE_SQL_QUERIES

These configs control the Hudi Flink SQL source/sink connectors, providing ability to define record keys, pick out the write operation, specify how to merge records, enable/disable asynchronous compaction or choosing query type to read.

Flink jobs using the SQL can be configured through the options in WITH clause. The actual datasource level configs are listed below.

Basic Configs

Config Name Default Description
hoodie.database.name (N/A) Database name to register to Hive metastore
Config Param: DATABASE_NAME
hoodie.table.name (N/A) Table name to register to Hive metastore
Config Param: TABLE_NAME
path (N/A) Base path for the target hoodie table. The path would be created if it does not exist, otherwise a Hoodie table expects to be initialized successfully
Config Param: PATH
read.end-commit (N/A) End commit instant for reading, the commit time format should be ‘yyyyMMddHHmmss’
Config Param: READ_END_COMMIT
read.start-commit (N/A) Start commit instant for reading, the commit time format should be ‘yyyyMMddHHmmss’, by default reading from the latest instant for streaming read
Config Param: READ_START_COMMIT
archive.max_commits 50 Max number of commits to keep before archiving older commits into a sequential log, default 50
Config Param: ARCHIVE_MAX_COMMITS
archive.min_commits 40 Min number of commits to keep before archiving older commits into a sequential log, default 40
Config Param: ARCHIVE_MIN_COMMITS
cdc.enabled false When enable, persist the change data if necessary, and can be queried as a CDC query mode
Config Param: CDC_ENABLED
cdc.supplemental.logging.mode DATA_BEFORE_AFTER Setting ‘op_key_only’ persists the ‘op’ and the record key only, setting ‘data_before’ persists the additional ‘before’ image, and setting ‘data_before_after’ persists the additional ‘before’ and ‘after’ images.
Config Param: SUPPLEMENTAL_LOGGING_MODE
changelog.enabled false Whether to keep all the intermediate changes, we try to keep all the changes of a record when enabled: 1). The sink accept the UPDATE_BEFORE message; 2). The source try to emit every changes of a record. The semantics is best effort because the compaction job would finally merge all changes of a record into one. default false to have UPSERT semantics
Config Param: CHANGELOG_ENABLED
clean.async.enabled true Whether to cleanup the old commits immediately on new commits, enabled by default
Config Param: CLEAN_ASYNC_ENABLED
clean.retain_commits 30 Number of commits to retain. So data will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much you can incrementally pull on this table, default 30
Config Param: CLEAN_RETAIN_COMMITS
clustering.async.enabled false Async Clustering, default false
Config Param: CLUSTERING_ASYNC_ENABLED
clustering.plan.strategy.small.file.limit 600 Files smaller than the size specified here are candidates for clustering, default 600 MB
Config Param: CLUSTERING_PLAN_STRATEGY_SMALL_FILE_LIMIT
clustering.plan.strategy.target.file.max.bytes 1073741824 Each group can produce ‘N’ (CLUSTERING_MAX_GROUP_SIZE/CLUSTERING_TARGET_FILE_SIZE) output file groups, default 1 GB
Config Param: CLUSTERING_PLAN_STRATEGY_TARGET_FILE_MAX_BYTES
compaction.async.enabled true Async Compaction, enabled by default for MOR
Config Param: COMPACTION_ASYNC_ENABLED
compaction.delta_commits 5 Max delta commits needed to trigger compaction, default 5 commits
Config Param: COMPACTION_DELTA_COMMITS
hive_sync.enabled false Asynchronously sync Hive meta to HMS, default false
Config Param: HIVE_SYNC_ENABLED
hive_sync.jdbc_url jdbc:hive2://localhost:10000 Jdbc URL for hive sync, default ‘jdbc:hive2://localhost:10000’
Config Param: HIVE_SYNC_JDBC_URL
hive_sync.metastore.uris Metastore uris for hive sync, default ‘’
Config Param: HIVE_SYNC_METASTORE_URIS
hive_sync.mode HMS Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql, default ‘hms’
Config Param: HIVE_SYNC_MODE
hoodie.datasource.query.type snapshot Decides how data files need to be read, in 1) Snapshot mode (obtain latest view, based on row & columnar data); 2) incremental mode (new data since an instantTime); 3) Read Optimized mode (obtain latest view, based on columnar data) .Default: snapshot
Config Param: QUERY_TYPE
hoodie.datasource.write.hive_style_partitioning false Whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)
Config Param: HIVE_STYLE_PARTITIONING
hoodie.datasource.write.partitionpath.field Partition path field. Value to be used at the partitionPath component of HoodieKey. Actual value obtained by invoking .toString(), default ‘’
Config Param: PARTITION_PATH_FIELD
hoodie.datasource.write.recordkey.field uuid Record key field. Value to be used as the recordKey component of HoodieKey. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: a.b.c
Config Param: RECORD_KEY_FIELD
index.type FLINK_STATE Index type of Flink write job, default is using state backed index.
Config Param: INDEX_TYPE
metadata.compaction.delta_commits 10 Max delta commits for metadata table to trigger compaction, default 10
Config Param: METADATA_COMPACTION_DELTA_COMMITS
metadata.enabled false Enable the internal metadata table which serves table metadata like level file listings, default disabled
Config Param: METADATA_ENABLED
precombine.field ts Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)
Config Param: PRECOMBINE_FIELD
read.streaming.enabled false Whether to read as streaming source, default false
Config Param: READ_AS_STREAMING
table.type COPY_ON_WRITE Type of table to write. COPY_ON_WRITE (or) MERGE_ON_READ
Config Param: TABLE_TYPE
write.operation upsert The write operation, that this write should do
Config Param: OPERATION
write.parquet.max.file.size 120 Target size for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance.
Config Param: WRITE_PARQUET_MAX_FILE_SIZE

Advanced Configs

Config Name Default Description
clustering.tasks (N/A) Parallelism of tasks that do actual clustering, default same as the write task parallelism
Config Param: CLUSTERING_TASKS
compaction.tasks (N/A) Parallelism of tasks that do actual compaction, default same as the write task parallelism
Config Param: COMPACTION_TASKS
hive_sync.conf.dir (N/A) The hive configuration directory, where the hive-site.xml lies in, the file should be put on the client machine
Config Param: HIVE_SYNC_CONF_DIR
hive_sync.serde_properties (N/A) Serde properties to hive table, the data format is k1=v1 k2=v2
Config Param: HIVE_SYNC_TABLE_SERDE_PROPERTIES
hive_sync.table_properties (N/A) Additional properties to store with table, the data format is k1=v1 k2=v2
Config Param: HIVE_SYNC_TABLE_PROPERTIES
hoodie.datasource.write.keygenerator.class (N/A) Key generator class, that implements will extract the key out of incoming record
Config Param: KEYGEN_CLASS_NAME
read.tasks (N/A) Parallelism of tasks that do actual read, default is the parallelism of the execution environment
Config Param: READ_TASKS
source.avro-schema (N/A) Source avro schema string, the parsed schema is used for deserialization
Config Param: SOURCE_AVRO_SCHEMA
source.avro-schema.path (N/A) Source avro schema file path, the parsed schema is used for deserialization
Config Param: SOURCE_AVRO_SCHEMA_PATH
write.bucket_assign.tasks (N/A) Parallelism of tasks that do bucket assign, default same as the write task parallelism
Config Param: BUCKET_ASSIGN_TASKS
write.index_bootstrap.tasks (N/A) Parallelism of tasks that do index bootstrap, default same as the write task parallelism
Config Param: INDEX_BOOTSTRAP_TASKS
write.partition.format (N/A) Partition path format, only valid when ‘write.datetime.partitioning’ is true, default is: 1) ‘yyyyMMddHH’ for timestamp(3) WITHOUT TIME ZONE, LONG, FLOAT, DOUBLE, DECIMAL; 2) ‘yyyyMMdd’ for DATE and INT.
Config Param: PARTITION_FORMAT
write.tasks (N/A) Parallelism of tasks that do actual write, default is the parallelism of the execution environment
Config Param: WRITE_TASKS
clean.policy KEEP_LATEST_COMMITS Clean policy to manage the Hudi table. Available option: KEEP_LATEST_COMMITS, KEEP_LATEST_FILE_VERSIONS, KEEP_LATEST_BY_HOURS.Default is KEEP_LATEST_COMMITS.
Config Param: CLEAN_POLICY
clean.retain_file_versions 5 Number of file versions to retain. default 5
Config Param: CLEAN_RETAIN_FILE_VERSIONS
clean.retain_hours 24 Number of hours for which commits need to be retained. This config provides a more flexible option ascompared to number of commits retained for cleaning service. Setting this property ensures all the files, but the latest in a file group, corresponding to commits with commit times older than the configured number of hours to be retained are cleaned.
Config Param: CLEAN_RETAIN_HOURS
clustering.delta_commits 4 Max delta commits needed to trigger clustering, default 4 commits
Config Param: CLUSTERING_DELTA_COMMITS
clustering.plan.partition.filter.mode NONE Partition filter mode used in the creation of clustering plan. Available values are - NONE: do not filter table partition and thus the clustering plan will include all partitions that have clustering candidate.RECENT_DAYS: keep a continuous range of partitions, worked together with configs ‘clustering.plan.strategy.daybased.lookback.partitions’ and ‘clustering.plan.strategy.daybased.skipfromlatest.partitions.SELECTED_PARTITIONS: keep partitions that are in the specified range [‘clustering.plan.strategy.cluster.begin.partition’, ‘clustering.plan.strategy.cluster.end.partition’].DAY_ROLLING: clustering partitions on a rolling basis by the hour to avoid clustering all partitions each time, which strategy sorts the partitions asc and chooses the partition of which index is divided by 24 and the remainder is equal to the current hour.
Config Param: CLUSTERING_PLAN_PARTITION_FILTER_MODE_NAME
clustering.plan.strategy.class org.apache.hudi.client.clustering.plan.strategy.FlinkSizeBasedClusteringPlanStrategy Config to provide a strategy class (subclass of ClusteringPlanStrategy) to create clustering plan i.e select what file groups are being clustered. Default strategy, looks at the last N (determined by clustering.plan.strategy.daybased.lookback.partitions) day based partitions picks the small file slices within those partitions.
Config Param: CLUSTERING_PLAN_STRATEGY_CLASS
clustering.plan.strategy.cluster.begin.partition Begin partition used to filter partition (inclusive)
Config Param: CLUSTERING_PLAN_STRATEGY_CLUSTER_BEGIN_PARTITION
clustering.plan.strategy.cluster.end.partition End partition used to filter partition (inclusive)
Config Param: CLUSTERING_PLAN_STRATEGY_CLUSTER_END_PARTITION
clustering.plan.strategy.daybased.lookback.partitions 2 Number of partitions to list to create ClusteringPlan, default is 2
Config Param: CLUSTERING_TARGET_PARTITIONS
clustering.plan.strategy.daybased.skipfromlatest.partitions 0 Number of partitions to skip from latest when choosing partitions to create ClusteringPlan
Config Param: CLUSTERING_PLAN_STRATEGY_SKIP_PARTITIONS_FROM_LATEST
clustering.plan.strategy.max.num.groups 30 Maximum number of groups to create as part of ClusteringPlan. Increasing groups will increase parallelism, default is 30
Config Param: CLUSTERING_MAX_NUM_GROUPS
clustering.plan.strategy.partition.regex.pattern Filter clustering partitions that matched regex pattern
Config Param: CLUSTERING_PLAN_STRATEGY_PARTITION_REGEX_PATTERN
clustering.plan.strategy.partition.selected Partitions to run clustering
Config Param: CLUSTERING_PLAN_STRATEGY_PARTITION_SELECTED
clustering.plan.strategy.sort.columns Columns to sort the data by when clustering
Config Param: CLUSTERING_SORT_COLUMNS
clustering.schedule.enabled false Schedule the cluster plan, default false
Config Param: CLUSTERING_SCHEDULE_ENABLED
compaction.delta_seconds 3600 Max delta seconds time needed to trigger compaction, default 1 hour
Config Param: COMPACTION_DELTA_SECONDS
compaction.max_memory 100 Max memory in MB for compaction spillable map, default 100MB
Config Param: COMPACTION_MAX_MEMORY
compaction.schedule.enabled true Schedule the compaction plan, enabled by default for MOR
Config Param: COMPACTION_SCHEDULE_ENABLED
compaction.target_io 512000 Target IO in MB for per compaction (both read and write), default 500 GB
Config Param: COMPACTION_TARGET_IO
compaction.timeout.seconds 1200 Max timeout time in seconds for online compaction to rollback, default 20 minutes
Config Param: COMPACTION_TIMEOUT_SECONDS
compaction.trigger.strategy num_commits Strategy to trigger compaction, options are ‘num_commits’: trigger compaction when there are at least N delta commits after last completed compaction; ‘num_commits_after_last_request’: trigger compaction when there are at least N delta commits after last completed/requested compaction; ‘time_elapsed’: trigger compaction when time elapsed > N seconds since last compaction; ‘num_and_time’: trigger compaction when both NUM_COMMITS and TIME_ELAPSED are satisfied; ‘num_or_time’: trigger compaction when NUM_COMMITS or TIME_ELAPSED is satisfied. Default is ‘num_commits’
Config Param: COMPACTION_TRIGGER_STRATEGY
hive_sync.assume_date_partitioning false Assume partitioning is yyyy/mm/dd, default false
Config Param: HIVE_SYNC_ASSUME_DATE_PARTITION
hive_sync.auto_create_db true Auto create hive database if it does not exists, default true
Config Param: HIVE_SYNC_AUTO_CREATE_DB
hive_sync.db default Database name for hive sync, default ‘default’
Config Param: HIVE_SYNC_DB
hive_sync.file_format PARQUET File format for hive sync, default ‘PARQUET’
Config Param: HIVE_SYNC_FILE_FORMAT
hive_sync.ignore_exceptions false Ignore exceptions during hive synchronization, default false
Config Param: HIVE_SYNC_IGNORE_EXCEPTIONS
hive_sync.partition_extractor_class org.apache.hudi.hive.MultiPartKeysValueExtractor Tool to extract the partition value from HDFS path, default ‘MultiPartKeysValueExtractor’
Config Param: HIVE_SYNC_PARTITION_EXTRACTOR_CLASS_NAME
hive_sync.partition_fields Partition fields for hive sync, default ‘’
Config Param: HIVE_SYNC_PARTITION_FIELDS
hive_sync.password hive Password for hive sync, default ‘hive’
Config Param: HIVE_SYNC_PASSWORD
hive_sync.skip_ro_suffix false Skip the _ro suffix for Read optimized table when registering, default false
Config Param: HIVE_SYNC_SKIP_RO_SUFFIX
hive_sync.support_timestamp true INT64 with original type TIMESTAMP_MICROS is converted to hive timestamp type. Disabled by default for backward compatibility.
Config Param: HIVE_SYNC_SUPPORT_TIMESTAMP
hive_sync.table unknown Table name for hive sync, default ‘unknown’
Config Param: HIVE_SYNC_TABLE
hive_sync.table.strategy ALL Hive table synchronization strategy. Available option: RO, RT, ALL.
Config Param: HIVE_SYNC_TABLE_STRATEGY
hive_sync.use_jdbc true Use JDBC when hive synchronization is enabled, default true
Config Param: HIVE_SYNC_USE_JDBC
hive_sync.username hive Username for hive sync, default ‘hive’
Config Param: HIVE_SYNC_USERNAME
hoodie.bucket.index.hash.field Index key field. Value to be used as hashing to find the bucket ID. Should be a subset of or equal to the recordKey fields. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: a.b.c
Config Param: INDEX_KEY_FIELD
hoodie.bucket.index.num.buckets 4 Hudi bucket number per partition. Only affected if using Hudi bucket index.
Config Param: BUCKET_INDEX_NUM_BUCKETS
hoodie.datasource.merge.type payload_combine For Snapshot query on merge on read table. Use this key to define how the payloads are merged, in 1) skip_merge: read the base file records plus the log file records; 2) payload_combine: read the base file records first, for each record in base file, checks whether the key is in the log file records(combines the two records with same key for base and log file records), then read the left log file records
Config Param: MERGE_TYPE
hoodie.datasource.write.keygenerator.type SIMPLE Key generator type, that implements will extract the key out of incoming record. Note This is being actively worked on. Please use hoodie.datasource.write.keygenerator.class instead.
Config Param: KEYGEN_TYPE
hoodie.datasource.write.partitionpath.urlencode false Whether to encode the partition path url, default false
Config Param: URL_ENCODE_PARTITIONING
hoodie.index.bucket.engine SIMPLE Type of bucket index engine. Available options: [SIMPLE CONSISTENT_HASHING]
Config Param: BUCKET_INDEX_ENGINE_TYPE
index.bootstrap.enabled false Whether to bootstrap the index state from existing hoodie table, default false
Config Param: INDEX_BOOTSTRAP_ENABLED
index.global.enabled true Whether to update index for the old partition path if same key record with different partition path came in, default true
Config Param: INDEX_GLOBAL_ENABLED
index.partition.regex .* Whether to load partitions in state if partition path matching, default *
Config Param: INDEX_PARTITION_REGEX
index.state.ttl 0.0 Index state ttl in days, default stores the index permanently
Config Param: INDEX_STATE_TTL
partition.default_name HIVE_DEFAULT_PARTITION The default partition name in case the dynamic partition column value is null/empty string
Config Param: PARTITION_DEFAULT_NAME
payload.class org.apache.hudi.common.model.EventTimeAvroPayload Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. This will render any value set for the option in-effective
Config Param: PAYLOAD_CLASS_NAME
read.data.skipping.enabled false Enables data-skipping allowing queries to leverage indexes to reduce the search space byskipping over files
Config Param: READ_DATA_SKIPPING_ENABLED
read.streaming.check-interval 60 Check interval for streaming read of SECOND, default 1 minute
Config Param: READ_STREAMING_CHECK_INTERVAL
read.streaming.skip_clustering true Whether to skip clustering instants to avoid reading base files of clustering operations for streaming read to improve read performance.
Config Param: READ_STREAMING_SKIP_CLUSTERING
read.streaming.skip_compaction true Whether to skip compaction instants and avoid reading compacted base files for streaming read to improve read performance. This option can be used to avoid reading duplicates when changelog mode is enabled, it is a solution to keep data integrity
Config Param: READ_STREAMING_SKIP_COMPACT
read.utc-timezone true Use UTC timezone or local timezone to the conversion between epoch time and LocalDateTime. Hive 0.x/1.x/2.x use local timezone. But Hive 3.x use UTC timezone, by default true
Config Param: READ_UTC_TIMEZONE
record.merger.impls org.apache.hudi.common.model.HoodieAvroRecordMerger List of HoodieMerger implementations constituting Hudi’s merging strategy — based on the engine used. These merger impls will filter by record.merger.strategy. Hudi will pick most efficient implementation to perform merging/combining of the records (during update, reading MOR table, etc)
Config Param: RECORD_MERGER_IMPLS
record.merger.strategy eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in record.merger.impls which has the same merger strategy id
Config Param: RECORD_MERGER_STRATEGY
write.batch.size 256.0 Batch buffer size in MB to flush data into the underneath filesystem, default 256MB
Config Param: WRITE_BATCH_SIZE
write.bulk_insert.shuffle_input true Whether to shuffle the inputs by specific fields for bulk insert tasks, default true
Config Param: WRITE_BULK_INSERT_SHUFFLE_INPUT
write.bulk_insert.sort_input true Whether to sort the inputs by specific fields for bulk insert tasks, default true
Config Param: WRITE_BULK_INSERT_SORT_INPUT
write.bulk_insert.sort_input.by_record_key false Whether to sort the inputs by record keys for bulk insert tasks, default false
Config Param: WRITE_BULK_INSERT_SORT_INPUT_BY_RECORD_KEY
write.client.id Unique identifier used to distinguish different writer pipelines for concurrent mode
Config Param: WRITE_CLIENT_ID
write.commit.ack.timeout -1 Timeout limit for a writer task after it finishes a checkpoint and waits for the instant commit success, only for internal use
Config Param: WRITE_COMMIT_ACK_TIMEOUT
write.ignore.failed false Flag to indicate whether to ignore any non exception error (e.g. writestatus error). within a checkpoint batch. By default false. Turning this on, could hide the write status errors while the flink checkpoint moves ahead. So, would recommend users to use this with caution.
Config Param: IGNORE_FAILED
write.insert.cluster false Whether to merge small files for insert mode, if true, the write throughput will decrease because the read/write of existing small file, only valid for COW table, default false
Config Param: INSERT_CLUSTER
write.log.max.size 1024 Maximum size allowed in MB for a log file before it is rolled over to the next version, default 1GB
Config Param: WRITE_LOG_MAX_SIZE
write.log_block.size 128 Max log block size in MB for log file, default 128MB
Config Param: WRITE_LOG_BLOCK_SIZE
write.merge.max_memory 100 Max memory in MB for merge, default 100MB
Config Param: WRITE_MERGE_MAX_MEMORY
write.parquet.block.size 120 Parquet RowGroup size. It’s recommended to make this large enough that scan costs can be amortized by packing enough column values into a single row group.
Config Param: WRITE_PARQUET_BLOCK_SIZE
write.parquet.page.size 1 Parquet page size. Page is the unit of read within a parquet file. Within a block, pages are compressed separately.
Config Param: WRITE_PARQUET_PAGE_SIZE
write.partition.overwrite.mode STATIC When INSERT OVERWRITE a partitioned data source table, we currently support 2 modes: static and dynamic. Static mode deletes all the partitions that match the partition specification(e.g. PARTITION(a=1,b)) in the INSERT statement, before overwriting. Dynamic mode doesn’t delete partitions ahead, and only overwrite those partitions that have data written into it at runtime. By default we use static mode to keep the same behavior of previous version.
Config Param: WRITE_PARTITION_OVERWRITE_MODE
write.precombine false Flag to indicate whether to drop duplicates before insert/upsert. By default these cases will accept duplicates, to gain extra performance: 1) insert operation; 2) upsert for MOR table, the MOR table deduplicate on reading
Config Param: PRE_COMBINE
write.rate.limit 0 Write record rate limit per second to prevent traffic jitter and improve stability, default 0 (no limit)
Config Param: WRITE_RATE_LIMIT
write.retry.interval.ms 2000 Flag to indicate how long (by millisecond) before a retry should issued for failed checkpoint batch. By default 2000 and it will be doubled by every retry
Config Param: RETRY_INTERVAL_MS
write.retry.times 3 Flag to indicate how many times streaming job should retry for a failed checkpoint batch. By default 3
Config Param: RETRY_TIMES
write.sort.memory 128 Sort memory in MB, default 128MB
Config Param: WRITE_SORT_MEMORY
write.task.max.size 1024.0 Maximum memory in MB for a write task, when the threshold hits, it flushes the max size data bucket to avoid OOM, default 1GB
Config Param: WRITE_TASK_MAX_SIZE
write.utc-timezone true Use UTC timezone or local timezone to the conversion between epoch time and LocalDateTime. Default value is utc timezone for forward compatibility.
Config Param: WRITE_UTC_TIMEZONE

Write Client Configs

Internally, the Hudi datasource uses a RDD based HoodieWriteClient API to actually perform writes to storage. These configs provide deep control over lower level aspects like file sizing, compression, parallelism, compaction, write schema, cleaning etc. Although Hudi provides sane defaults, from time-time these configs may need to be tweaked to optimize for specific workloads.

Common Configurations

The following set of configurations are common across Hudi.

Basic Configs

Config Name Default Description
hoodie.base.path (N/A) Base path on lake storage, under which all the table data is stored. Always prefix it explicitly with the storage scheme (e.g hdfs://, s3:// etc). Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs etc in .hoodie directory under this base path directory.
Config Param: BASE_PATH

Advanced Configs

Config Name Default Description
as.of.instant (N/A) The query instant for time travel. Without specified this option, we query the latest snapshot.
Config Param: TIMESTAMP_AS_OF
hoodie.memory.compaction.max.size (N/A) Maximum amount of memory used in bytes for compaction operations in bytes , before spilling to local storage.
Config Param: MAX_MEMORY_FOR_COMPACTION
hoodie.common.diskmap.compression.enabled true Turn on compression for BITCASK disk map used by the External Spillable Map
Config Param: DISK_MAP_BITCASK_COMPRESSION_ENABLED
hoodie.common.spillable.diskmap.type BITCASK When handling input data that cannot be held in memory, to merge with a file on storage, a spillable diskmap is employed. By default, we use a persistent hashmap based loosely on bitcask, that offers O(1) inserts, lookups. Change this to ROCKS_DB to prefer using rocksDB, for handling the spill.
Config Param: SPILLABLE_DISK_MAP_TYPE
hoodie.datasource.write.reconcile.schema false This config controls how writer’s schema will be selected based on the incoming batch’s schema as well as existing table’s one. When schema reconciliation is DISABLED, incoming batch’s schema will be picked as a writer-schema (therefore updating table’s schema). When schema reconciliation is ENABLED, writer-schema will be picked such that table’s schema (after txn) is either kept the same or extended, meaning that we’ll always prefer the schema that either adds new columns or stays the same. This enables us, to always extend the table’s schema during evolution and never lose the data (when, for ex, existing column is being dropped in a new batch)
Config Param: RECONCILE_SCHEMA
hoodie.fs.atomic_creation.support This config is used to specify the file system which supports atomic file creation . atomic means that an operation either succeeds and has an effect or has fails and has no effect; now this feature is used by FileSystemLockProvider to guaranteeing that only one writer can create the lock file at a time. since some FS does not support atomic file creation (eg: S3), we decide the FileSystemLockProvider only support HDFS,local FS and View FS as default. if you want to use FileSystemLockProvider with other FS, you can set this config with the FS scheme, eg: fs1,fs2
Config Param: HOODIE_FS_ATOMIC_CREATION_SUPPORT
Since Version: 0.14.0
hoodie.memory.dfs.buffer.max.size 16777216 Property to control the max memory in bytes for dfs input stream buffer size
Config Param: MAX_DFS_STREAM_BUFFER_SIZE
hoodie.read.timeline.holes.resolution.policy FAIL When doing incremental queries, there could be hollow commits (requested or inflight commits that are not the latest) that are produced by concurrent writers and could lead to potential data loss. This config allows users to have different ways of handling this situation. The valid values are [FAIL, BLOCK, USE_TRANSITION_TIME]: Use FAIL to throw an exception when hollow commit is detected. This is helpful when hollow commits are not expected. Use BLOCK to block processing commits from going beyond the hollow ones. This fits the case where waiting for hollow commits to finish is acceptable. Use USE_TRANSITION_TIME (experimental) to query commits in range by state transition time (completion time), instead of commit time (start time). Using this mode will result in begin.instanttime and end.instanttime using stateTransitionTime instead of the instant’s commit time.
Config Param: INCREMENTAL_READ_HANDLE_HOLLOW_COMMIT
Since Version: 0.14.0
hoodie.schema.on.read.enable false Enables support for Schema Evolution feature
Config Param: SCHEMA_EVOLUTION_ENABLE
hoodie.write.set.null.for.missing.columns false When a nullable column is missing from incoming batch during a write operation, the write operation will fail schema compatibility check. Set this option to true will make the missing column be filled with null values to successfully complete the write operation.
Config Param: SET_NULL_FOR_MISSING_COLUMNS
Since Version: 0.14.1

Metadata Configs

Configurations used by the Hudi Metadata Table. This table maintains the metadata about a given Hudi table (e.g file listings) to avoid overhead of accessing cloud storage, during queries.

Basic Configs

Config Name Default Description
hoodie.metadata.enable true Enable the internal metadata table which serves table metadata like level file listings
Config Param: ENABLE
Since Version: 0.7.0
hoodie.metadata.index.bloom.filter.enable false Enable indexing bloom filters of user data files under metadata table. When enabled, metadata table will have a partition to store the bloom filter index and will be used during the index lookups.
Config Param: ENABLE_METADATA_INDEX_BLOOM_FILTER
Since Version: 0.11.0
hoodie.metadata.index.column.stats.enable false Enable indexing column ranges of user data files under metadata table key lookups. When enabled, metadata table will have a partition to store the column ranges and will be used for pruning files during the index lookups.
Config Param: ENABLE_METADATA_INDEX_COLUMN_STATS
Since Version: 0.11.0

Advanced Configs

Config Name Default Description
hoodie.metadata.index.bloom.filter.column.list (N/A) Comma-separated list of columns for which bloom filter index will be built. If not set, only record key will be indexed.
Config Param: BLOOM_FILTER_INDEX_FOR_COLUMNS
Since Version: 0.11.0
hoodie.metadata.index.column.stats.column.list (N/A) Comma-separated list of columns for which column stats index will be built. If not set, all columns will be indexed
Config Param: COLUMN_STATS_INDEX_FOR_COLUMNS
Since Version: 0.11.0
hoodie.metadata.index.column.stats.processing.mode.override (N/A) By default Column Stats Index is automatically determining whether it should be read and processed either’in-memory’ (w/in executing process) or using Spark (on a cluster), based on some factors like the size of the Index and how many columns are read. This config allows to override this behavior.
Config Param: COLUMN_STATS_INDEX_PROCESSING_MODE_OVERRIDE
Since Version: 0.12.0
_hoodie.metadata.ignore.spurious.deletes true There are cases when extra files are requested to be deleted from metadata table which are never added before. This config determines how to handle such spurious deletes
Config Param: IGNORE_SPURIOUS_DELETES
Since Version: 0.10.0
hoodie.assume.date.partitioning false Should HoodieWriteClient assume the data is partitioned by dates, i.e three levels from base path. This is a stop-gap to support tables created by versions < 0.3.1. Will be removed eventually
Config Param: ASSUME_DATE_PARTITIONING
Since Version: 0.3.0
hoodie.file.listing.parallelism 200 Parallelism to use, when listing the table on lake storage.
Config Param: FILE_LISTING_PARALLELISM_VALUE
Since Version: 0.7.0
hoodie.metadata.auto.initialize true Initializes the metadata table by reading from the file system when the table is first created. Enabled by default. Warning: This should only be disabled when manually constructing the metadata table outside of typical Hudi writer flows.
Config Param: AUTO_INITIALIZE
Since Version: 0.14.0
hoodie.metadata.compact.max.delta.commits 10 Controls how often the metadata table is compacted.
Config Param: COMPACT_NUM_DELTA_COMMITS
Since Version: 0.7.0
hoodie.metadata.dir.filter.regex Directories matching this regex, will be filtered out when initializing metadata table from lake storage for the first time.
Config Param: DIR_FILTER_REGEX
Since Version: 0.7.0
hoodie.metadata.index.async false Enable asynchronous indexing of metadata table.
Config Param: ASYNC_INDEX_ENABLE
Since Version: 0.11.0
hoodie.metadata.index.bloom.filter.file.group.count 4 Metadata bloom filter index partition file group count. This controls the size of the base and log files and read parallelism in the bloom filter index partition. The recommendation is to size the file group count such that the base files are under 1GB.
Config Param: METADATA_INDEX_BLOOM_FILTER_FILE_GROUP_COUNT
Since Version: 0.11.0
hoodie.metadata.index.bloom.filter.parallelism 200 Parallelism to use for generating bloom filter index in metadata table.
Config Param: BLOOM_FILTER_INDEX_PARALLELISM
Since Version: 0.11.0
hoodie.metadata.index.check.timeout.seconds 900 After the async indexer has finished indexing upto the base instant, it will ensure that all inflight writers reliably write index updates as well. If this timeout expires, then the indexer will abort itself safely.
Config Param: METADATA_INDEX_CHECK_TIMEOUT_SECONDS
Since Version: 0.11.0
hoodie.metadata.index.column.stats.file.group.count 2 Metadata column stats partition file group count. This controls the size of the base and log files and read parallelism in the column stats index partition. The recommendation is to size the file group count such that the base files are under 1GB.
Config Param: METADATA_INDEX_COLUMN_STATS_FILE_GROUP_COUNT
Since Version: 0.11.0
hoodie.metadata.index.column.stats.inMemory.projection.threshold 100000 When reading Column Stats Index, if the size of the expected resulting projection is below the in-memory threshold (counted by the # of rows), it will be attempted to be loaded “in-memory” (ie not using the execution engine like Spark, Flink, etc). If the value is above the threshold execution engine will be used to compose the projection.
Config Param: COLUMN_STATS_INDEX_IN_MEMORY_PROJECTION_THRESHOLD
Since Version: 0.12.0
hoodie.metadata.index.column.stats.parallelism 200 Parallelism to use, when generating column stats index.
Config Param: COLUMN_STATS_INDEX_PARALLELISM
Since Version: 0.11.0
hoodie.metadata.log.compaction.blocks.threshold 5 Controls the criteria to log compacted files groups in metadata table.
Config Param: LOG_COMPACT_BLOCKS_THRESHOLD
Since Version: 0.14.0
hoodie.metadata.log.compaction.enable false This configs enables logcompaction for the metadata table.
Config Param: ENABLE_LOG_COMPACTION_ON_METADATA_TABLE
Since Version: 0.14.0
hoodie.metadata.max.deltacommits.when_pending 1000 When there is a pending instant in data table, this config limits the allowed number of deltacommits in metadata table to prevent the metadata table’s timeline from growing unboundedly as compaction won’t be triggered due to the pending data table instant.
Config Param: METADATA_MAX_NUM_DELTACOMMITS_WHEN_PENDING
Since Version: 0.14.0
hoodie.metadata.max.init.parallelism 100000 Maximum parallelism to use when initializing Record Index.
Config Param: RECORD_INDEX_MAX_PARALLELISM
Since Version: 0.14.0
hoodie.metadata.max.logfile.size 2147483648 Maximum size in bytes of a single log file. Larger log files can contain larger log blocks thereby reducing the number of blocks to search for keys
Config Param: MAX_LOG_FILE_SIZE_BYTES_PROP
Since Version: 0.14.0
hoodie.metadata.max.reader.buffer.size 10485760 Max memory to use for the reader buffer while merging log blocks
Config Param: MAX_READER_BUFFER_SIZE_PROP
Since Version: 0.14.0
hoodie.metadata.max.reader.memory 1073741824 Max memory to use for the reader to read from metadata
Config Param: MAX_READER_MEMORY_PROP
Since Version: 0.14.0
hoodie.metadata.metrics.enable false Enable publishing of metrics around metadata table.
Config Param: METRICS_ENABLE
Since Version: 0.7.0
hoodie.metadata.optimized.log.blocks.scan.enable false Optimized log blocks scanner that addresses all the multi-writer use-cases while appending to log files. It also differentiates original blocks written by ingestion writers and compacted blocks written by log compaction.
Config Param: ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN
Since Version: 0.13.0
hoodie.metadata.record.index.enable false Create the HUDI Record Index within the Metadata Table
Config Param: RECORD_INDEX_ENABLE_PROP
Since Version: 0.14.0
hoodie.metadata.record.index.growth.factor 2.0 The current number of records are multiplied by this number when estimating the number of file groups to create automatically. This helps account for growth in the number of records in the dataset.
Config Param: RECORD_INDEX_GROWTH_FACTOR_PROP
Since Version: 0.14.0
hoodie.metadata.record.index.max.filegroup.count 10000 Maximum number of file groups to use for Record Index.
Config Param: RECORD_INDEX_MAX_FILE_GROUP_COUNT_PROP
Since Version: 0.14.0
hoodie.metadata.record.index.max.filegroup.size 1073741824 Maximum size in bytes of a single file group. Large file group takes longer to compact.
Config Param: RECORD_INDEX_MAX_FILE_GROUP_SIZE_BYTES_PROP
Since Version: 0.14.0
hoodie.metadata.record.index.min.filegroup.count 10 Minimum number of file groups to use for Record Index.
Config Param: RECORD_INDEX_MIN_FILE_GROUP_COUNT_PROP
Since Version: 0.14.0
hoodie.metadata.spillable.map.path Path on local storage to use, when keys read from metadata are held in a spillable map.
Config Param: SPILLABLE_MAP_DIR_PROP
Since Version: 0.14.0

Metaserver Configs

Configurations used by the Hudi Metaserver.

Advanced Configs

Config Name Default Description
hoodie.database.name (N/A) Database name that will be used for incremental query.If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database
Config Param: DATABASE_NAME
Since Version: 0.13.0
hoodie.table.name (N/A) Table name that will be used for registering with Hive. Needs to be same across runs.
Config Param: TABLE_NAME
Since Version: 0.13.0
hoodie.metaserver.connect.retries 3 Number of retries while opening a connection to metaserver
Config Param: METASERVER_CONNECTION_RETRIES
Since Version: 0.13.0
hoodie.metaserver.connect.retry.delay 1 Number of seconds for the client to wait between consecutive connection attempts
Config Param: METASERVER_CONNECTION_RETRY_DELAY
Since Version: 0.13.0
hoodie.metaserver.enabled false Enable Hudi metaserver for storing Hudi tables’ metadata.
Config Param: METASERVER_ENABLE
Since Version: 0.13.0
hoodie.metaserver.uris thrift://localhost:9090 Metaserver server uris
Config Param: METASERVER_URLS
Since Version: 0.13.0

Storage Configs

Configurations that control aspects around writing, sizing, reading base and log files.

Basic Configs

Config Name Default Description
hoodie.parquet.compression.codec gzip Compression Codec for parquet files
Config Param: PARQUET_COMPRESSION_CODEC_NAME
hoodie.parquet.max.file.size 125829120 Target size in bytes for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance.
Config Param: PARQUET_MAX_FILE_SIZE

Advanced Configs

Config Name Default Description
hoodie.logfile.data.block.format (N/A) Format of the data block within delta logs. Following formats are currently supported “avro”, “hfile”, “parquet”
Config Param: LOGFILE_DATA_BLOCK_FORMAT
hoodie.avro.write.support.class org.apache.hudi.avro.HoodieAvroWriteSupport Provided write support class should extend HoodieAvroWriteSupport class and it is loaded at runtime. This is only required when trying to override the existing write context.
Config Param: HOODIE_AVRO_WRITE_SUPPORT_CLASS
Since Version: 0.14.0
hoodie.bloom.index.filter.dynamic.max.entries 100000 The threshold for the maximum number of keys to record in a dynamic Bloom filter row. Only applies if filter type is BloomFilterTypeCode.DYNAMIC_V0.
Config Param: BLOOM_FILTER_DYNAMIC_MAX_ENTRIES
hoodie.bloom.index.filter.type DYNAMIC_V0 org.apache.hudi.common.bloom.BloomFilterTypeCode: Filter type used by Bloom filter. SIMPLE: Bloom filter that is based on the configured size. DYNAMIC_V0(default): Bloom filter that is auto sized based on number of keys.
Config Param: BLOOM_FILTER_TYPE
hoodie.hfile.block.size 1048576 Lower values increase the size in bytes of metadata tracked within HFile, but can offer potentially faster lookup times.
Config Param: HFILE_BLOCK_SIZE
hoodie.hfile.compression.algorithm GZ Compression codec to use for hfile base files.
Config Param: HFILE_COMPRESSION_ALGORITHM_NAME
hoodie.hfile.max.file.size 125829120 Target file size in bytes for HFile base files.
Config Param: HFILE_MAX_FILE_SIZE
hoodie.index.bloom.fpp 0.000000001 Only applies if index type is BLOOM. Error rate allowed given the number of entries. This is used to calculate how many bits should be assigned for the bloom filter and the number of hash functions. This is usually set very low (default: 0.000000001), we like to tradeoff disk space for lower false positives. If the number of entries added to bloom filter exceeds the configured value (hoodie.index.bloom.num_entries), then this fpp may not be honored.
Config Param: BLOOM_FILTER_FPP_VALUE
hoodie.index.bloom.num_entries 60000 Only applies if index type is BLOOM. This is the number of entries to be stored in the bloom filter. The rationale for the default: Assume the maxParquetFileSize is 128MB and averageRecordSize is 1kb and hence we approx a total of 130K records in a file. The default (60000) is roughly half of this approximation. Warning: Setting this very low, will generate a lot of false positives and index lookup will have to scan a lot more files than it has to and setting this to a very high number will increase the size every base file linearly (roughly 4KB for every 50000 entries). This config is also used with DYNAMIC bloom filter which determines the initial size for the bloom.
Config Param: BLOOM_FILTER_NUM_ENTRIES_VALUE
hoodie.io.factory.class org.apache.hudi.io.hadoop.HoodieHadoopIOFactory The fully-qualified class name of the factory class to return readers and writers of files used by Hudi. The provided class should implement org.apache.hudi.io.storage.HoodieIOFactory.
Config Param: HOODIE_IO_FACTORY_CLASS
Since Version: 0.15.0
hoodie.logfile.data.block.max.size 268435456 LogFile Data block max size in bytes. This is the maximum size allowed for a single data block to be appended to a log file. This helps to make sure the data appended to the log file is broken up into sizable blocks to prevent from OOM errors. This size should be greater than the JVM memory.
Config Param: LOGFILE_DATA_BLOCK_MAX_SIZE
hoodie.logfile.max.size 1073741824 LogFile max size in bytes. This is the maximum size allowed for a log file before it is rolled over to the next version.
Config Param: LOGFILE_MAX_SIZE
hoodie.logfile.to.parquet.compression.ratio 0.35 Expected additional compression as records move from log files to parquet. Used for merge_on_read table to send inserts into log files & control the size of compacted parquet file.
Config Param: LOGFILE_TO_PARQUET_COMPRESSION_RATIO_FRACTION
hoodie.orc.block.size 125829120 ORC block size, recommended to be aligned with the target file size.
Config Param: ORC_BLOCK_SIZE
hoodie.orc.compression.codec ZLIB Compression codec to use for ORC base files.
Config Param: ORC_COMPRESSION_CODEC_NAME
hoodie.orc.max.file.size 125829120 Target file size in bytes for ORC base files.
Config Param: ORC_FILE_MAX_SIZE
hoodie.orc.stripe.size 67108864 Size of the memory buffer in bytes for writing
Config Param: ORC_STRIPE_SIZE
hoodie.parquet.block.size 125829120 Parquet RowGroup size in bytes. It’s recommended to make this large enough that scan costs can be amortized by packing enough column values into a single row group.
Config Param: PARQUET_BLOCK_SIZE
hoodie.parquet.bloom.filter.enabled true Control whether to write bloom filter or not. Default true. We can set to false in non bloom index cases for CPU resource saving.
Config Param: PARQUET_WITH_BLOOM_FILTER_ENABLED
Since Version: 0.15.0
hoodie.parquet.compression.ratio 0.1 Expected compression of parquet data used by Hudi, when it tries to size new parquet files. Increase this value, if bulk_insert is producing smaller than expected sized files
Config Param: PARQUET_COMPRESSION_RATIO_FRACTION
hoodie.parquet.dictionary.enabled true Whether to use dictionary encoding
Config Param: PARQUET_DICTIONARY_ENABLED
hoodie.parquet.field_id.write.enabled true Would only be effective with Spark 3.3+. Sets spark.sql.parquet.fieldId.write.enabled. If enabled, Spark will write out parquet native field ids that are stored inside StructField’s metadata as parquet.field.id to parquet files.
Config Param: PARQUET_FIELD_ID_WRITE_ENABLED
Since Version: 0.12.0
hoodie.parquet.outputtimestamptype TIMESTAMP_MICROS Sets spark.sql.parquet.outputTimestampType. Parquet timestamp type to use when Spark writes data to Parquet files.
Config Param: PARQUET_OUTPUT_TIMESTAMP_TYPE
hoodie.parquet.page.size 1048576 Parquet page size in bytes. Page is the unit of read within a parquet file. Within a block, pages are compressed separately.
Config Param: PARQUET_PAGE_SIZE
hoodie.parquet.spark.row.write.support.class org.apache.hudi.io.storage.row.HoodieRowParquetWriteSupport Provided write support class should extend HoodieRowParquetWriteSupport class and it is loaded at runtime. This is only required when trying to override the existing write context when hoodie.datasource.write.row.writer.enable=true.
Config Param: HOODIE_PARQUET_SPARK_ROW_WRITE_SUPPORT_CLASS
Since Version: 0.15.0
hoodie.parquet.writelegacyformat.enabled false Sets spark.sql.parquet.writeLegacyFormat. If true, data will be written in a way of Spark 1.4 and earlier. For example, decimal values will be written in Parquet’s fixed-length byte array format which other systems such as Apache Hive and Apache Impala use. If false, the newer format in Parquet will be used. For example, decimals will be written in int-based format.
Config Param: PARQUET_WRITE_LEGACY_FORMAT_ENABLED
hoodie.storage.class org.apache.hudi.storage.hadoop.HoodieHadoopStorage The fully-qualified class name of the HoodieStorage implementation class to instantiate. The provided class should implement org.apache.hudi.storage.HoodieStorage
Config Param: HOODIE_STORAGE_CLASS
Since Version: 0.15.0

Consistency Guard Configurations

The consistency guard related config options, to help talk to eventually consistent object storage.(Tip: S3 is NOT eventually consistent anymore!)

Advanced Configs

Config Name Default Description
_hoodie.optimistic.consistency.guard.enable false Enable consistency guard, which optimistically assumes consistency is achieved after a certain time period.
Config Param: OPTIMISTIC_CONSISTENCY_GUARD_ENABLE
Since Version: 0.6.0
hoodie.consistency.check.enabled false Enabled to handle S3 eventual consistency issue. This property is no longer required since S3 is now strongly consistent. Will be removed in the future releases.
Config Param: ENABLE
Since Version: 0.5.0
Deprecated since: 0.7.0
hoodie.consistency.check.initial_interval_ms 400 Amount of time (in ms) to wait, before checking for consistency after an operation on storage.
Config Param: INITIAL_CHECK_INTERVAL_MS
Since Version: 0.5.0
Deprecated since: 0.7.0
hoodie.consistency.check.max_checks 6 Maximum number of consistency checks to perform, with exponential backoff.
Config Param: MAX_CHECKS
Since Version: 0.5.0
Deprecated since: 0.7.0
hoodie.consistency.check.max_interval_ms 20000 Maximum amount of time (in ms), to wait for consistency checking.
Config Param: MAX_CHECK_INTERVAL_MS
Since Version: 0.5.0
Deprecated since: 0.7.0
hoodie.optimistic.consistency.guard.sleep_time_ms 500 Amount of time (in ms), to wait after which we assume storage is consistent.
Config Param: OPTIMISTIC_CONSISTENCY_GUARD_SLEEP_TIME_MS
Since Version: 0.6.0

FileSystem Guard Configurations

The filesystem retry related config options, to help deal with runtime exception like list/get/put/delete performance issues.

Advanced Configs

Config Name Default Description
hoodie.filesystem.operation.retry.enable false Enabled to handle list/get/delete etc file system performance issue.
Config Param: FILESYSTEM_RETRY_ENABLE
Since Version: 0.11.0
hoodie.filesystem.operation.retry.exceptions The class name of the Exception that needs to be retried, separated by commas. Default is empty which means retry all the IOException and RuntimeException from FileSystem
Config Param: RETRY_EXCEPTIONS
Since Version: 0.11.0
hoodie.filesystem.operation.retry.initial_interval_ms 100 Amount of time (in ms) to wait, before retry to do operations on storage.
Config Param: INITIAL_RETRY_INTERVAL_MS
Since Version: 0.11.0
hoodie.filesystem.operation.retry.max_interval_ms 2000 Maximum amount of time (in ms), to wait for next retry.
Config Param: MAX_RETRY_INTERVAL_MS
Since Version: 0.11.0
hoodie.filesystem.operation.retry.max_numbers 4 Maximum number of retry actions to perform, with exponential backoff.
Config Param: MAX_RETRY_NUMBERS
Since Version: 0.11.0

File System View Storage Configurations

Configurations that control how file metadata is stored by Hudi, for transaction processing and queries.

Advanced Configs

Config Name Default Description
hoodie.filesystem.remote.backup.view.enable true Config to control whether backup needs to be configured if clients were not able to reach timeline service.
Config Param: REMOTE_BACKUP_VIEW_ENABLE
hoodie.filesystem.view.incr.timeline.sync.enable false Controls whether or not, the file system view is incrementally updated as new actions are performed on the timeline.
Config Param: INCREMENTAL_TIMELINE_SYNC_ENABLE
hoodie.filesystem.view.remote.host localhost We expect this to be rarely hand configured.
Config Param: REMOTE_HOST_NAME
hoodie.filesystem.view.remote.port 26754 Port to serve file system view queries, when remote. We expect this to be rarely hand configured.
Config Param: REMOTE_PORT_NUM
hoodie.filesystem.view.remote.retry.enable false Whether to enable API request retry for remote file system view.
Config Param: REMOTE_RETRY_ENABLE
Since Version: 0.12.1
hoodie.filesystem.view.remote.retry.exceptions The class name of the Exception that needs to be retried, separated by commas. Default is empty which means retry all the IOException and RuntimeException from Remote Request.
Config Param: RETRY_EXCEPTIONS
Since Version: 0.12.1
hoodie.filesystem.view.remote.retry.initial_interval_ms 100 Amount of time (in ms) to wait, before retry to do operations on storage.
Config Param: REMOTE_INITIAL_RETRY_INTERVAL_MS
Since Version: 0.12.1
hoodie.filesystem.view.remote.retry.max_interval_ms 2000 Maximum amount of time (in ms), to wait for next retry.
Config Param: REMOTE_MAX_RETRY_INTERVAL_MS
Since Version: 0.12.1
hoodie.filesystem.view.remote.retry.max_numbers 3 Maximum number of retry for API requests against a remote file system view. e.g timeline server.
Config Param: REMOTE_MAX_RETRY_NUMBERS
Since Version: 0.12.1
hoodie.filesystem.view.remote.timeout.secs 300 Timeout in seconds, to wait for API requests against a remote file system view. e.g timeline server.
Config Param: REMOTE_TIMEOUT_SECS
hoodie.filesystem.view.rocksdb.base.path /tmp/hoodie_timeline_rocksdb Path on local storage to use, when storing file system view in embedded kv store/rocksdb.
Config Param: ROCKSDB_BASE_PATH
hoodie.filesystem.view.secondary.type MEMORY Specifies the secondary form of storage for file system view, if the primary (e.g timeline server) is unavailable.
Config Param: SECONDARY_VIEW_TYPE
hoodie.filesystem.view.spillable.bootstrap.base.file.mem.fraction 0.05 Fraction of the file system view memory, to be used for holding mapping to bootstrap base files.
Config Param: BOOTSTRAP_BASE_FILE_MEM_FRACTION
hoodie.filesystem.view.spillable.clustering.mem.fraction 0.01 Fraction of the file system view memory, to be used for holding clustering related metadata.
Config Param: SPILLABLE_CLUSTERING_MEM_FRACTION
hoodie.filesystem.view.spillable.compaction.mem.fraction 0.8 Fraction of the file system view memory, to be used for holding compaction related metadata.
Config Param: SPILLABLE_COMPACTION_MEM_FRACTION
hoodie.filesystem.view.spillable.dir /tmp/ Path on local storage to use, when file system view is held in a spillable map.
Config Param: SPILLABLE_DIR
hoodie.filesystem.view.spillable.log.compaction.mem.fraction 0.8 Fraction of the file system view memory, to be used for holding log compaction related metadata.
Config Param: SPILLABLE_LOG_COMPACTION_MEM_FRACTION
Since Version: 0.13.0
hoodie.filesystem.view.spillable.mem 104857600 Amount of memory to be used in bytes for holding file system view, before spilling to disk.
Config Param: SPILLABLE_MEMORY
hoodie.filesystem.view.spillable.replaced.mem.fraction 0.01 Fraction of the file system view memory, to be used for holding replace commit related metadata.
Config Param: SPILLABLE_REPLACED_MEM_FRACTION
hoodie.filesystem.view.type MEMORY File system view provides APIs for viewing the files on the underlying lake storage, as file groups and file slices. This config controls how such a view is held. Options include MEMORY,SPILLABLE_DISK,EMBEDDED_KV_STORE,REMOTE_ONLY,REMOTE_FIRST which provide different trade offs for memory usage and API request performance.
Config Param: VIEW_TYPE

Archival Configs

Configurations that control archival.

Basic Configs

Config Name Default Description
hoodie.keep.max.commits 30 Archiving service moves older entries from timeline into an archived log after each write, to keep the metadata overhead constant, even as the table size grows. This config controls the maximum number of instants to retain in the active timeline.
Config Param: MAX_COMMITS_TO_KEEP
hoodie.keep.min.commits 20 Similar to hoodie.keep.max.commits, but controls the minimum number of instants to retain in the active timeline.
Config Param: MIN_COMMITS_TO_KEEP

Advanced Configs

Config Name Default Description
hoodie.archive.async false Only applies when hoodie.archive.automatic is turned on. When turned on runs archiver async with writing, which can speed up overall write performance.
Config Param: ASYNC_ARCHIVE
Since Version: 0.11.0
hoodie.archive.automatic true When enabled, the archival table service is invoked immediately after each commit, to archive commits if we cross a maximum value of commits. It’s recommended to enable this, to ensure number of active commits is bounded.
Config Param: AUTO_ARCHIVE
hoodie.archive.beyond.savepoint false If enabled, archival will proceed beyond savepoint, skipping savepoint commits. If disabled, archival will stop at the earliest savepoint commit.
Config Param: ARCHIVE_BEYOND_SAVEPOINT
Since Version: 0.12.0
hoodie.archive.delete.parallelism 100 When performing archival operation, Hudi needs to delete the files of the archived instants in the active timeline in .hoodie folder. The file deletion also happens after merging small archived files into larger ones if enabled. This config limits the Spark parallelism for deleting files in both cases, i.e., parallelism of deleting files does not go above the configured value and the parallelism is the number of files to delete if smaller than the configured value. If you see that the file deletion in archival operation is slow because of the limited parallelism, you can increase this to tune the performance.
Config Param: DELETE_ARCHIVED_INSTANT_PARALLELISM_VALUE
hoodie.archive.merge.enable false When enable, hoodie will auto merge several small archive files into larger one. It’s useful when storage scheme doesn’t support append operation.
Config Param: ARCHIVE_MERGE_ENABLE
hoodie.archive.merge.files.batch.size 10 The number of small archive files to be merged at once.
Config Param: ARCHIVE_MERGE_FILES_BATCH_SIZE
hoodie.archive.merge.small.file.limit.bytes 20971520 This config sets the archive file size limit below which an archive file becomes a candidate to be selected as such a small file.
Config Param: ARCHIVE_MERGE_SMALL_FILE_LIMIT_BYTES
hoodie.commits.archival.batch 10 Archiving of instants is batched in best-effort manner, to pack more instants into a single archive log. This config controls such archival batch size.
Config Param: COMMITS_ARCHIVAL_BATCH_SIZE

Bootstrap Configs

Configurations that control how you want to bootstrap your existing tables for the first time into hudi. The bootstrap operation can flexibly avoid copying data over before you can use Hudi and support running the existing writers and new hudi writers in parallel, to validate the migration.

Basic Configs

Config Name Default Description
hoodie.bootstrap.base.path (N/A) Base path of the dataset that needs to be bootstrapped as a Hudi table
Config Param: BASE_PATH
Since Version: 0.6.0

Advanced Configs

Config Name Default Description
hoodie.bootstrap.data.queries.only false Improves query performance, but queries cannot use hudi metadata fields
Config Param: DATA_QUERIES_ONLY
Since Version: 0.14.0
hoodie.bootstrap.full.input.provider org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider Class to use for reading the bootstrap dataset partitions/files, for Bootstrap mode FULL_RECORD
Config Param: FULL_BOOTSTRAP_INPUT_PROVIDER_CLASS_NAME
Since Version: 0.6.0
hoodie.bootstrap.index.class org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex Implementation to use, for mapping a skeleton base file to a bootstrap base file.
Config Param: INDEX_CLASS_NAME
Since Version: 0.6.0
hoodie.bootstrap.mode.selector org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector Selects the mode in which each file/partition in the bootstrapped dataset gets bootstrapped
Config Param: MODE_SELECTOR_CLASS_NAME
Since Version: 0.6.0
hoodie.bootstrap.mode.selector.regex .* Matches each bootstrap dataset partition against this regex and applies the mode below to it.
Config Param: PARTITION_SELECTOR_REGEX_PATTERN
Since Version: 0.6.0
hoodie.bootstrap.mode.selector.regex.mode METADATA_ONLY org.apache.hudi.client.bootstrap.BootstrapMode: Bootstrap mode for importing an existing table into Hudi FULL_RECORD: In this mode, the full record data is copied into hudi and metadata columns are added. A full record bootstrap is functionally equivalent to a bulk-insert. After a full record bootstrap, Hudi will function properly even if the original table is modified or deleted. METADATA_ONLY(default): In this mode, the full record data is not copied into Hudi therefore it avoids full cost of rewriting the dataset. Instead, ‘skeleton’ files containing just the corresponding metadata columns are added to the Hudi table. Hudi relies on the data in the original table and will face data-loss or corruption if files in the original table location are deleted or modified.
Config Param: PARTITION_SELECTOR_REGEX_MODE
Since Version: 0.6.0
hoodie.bootstrap.parallelism 1500 For metadata-only bootstrap, Hudi parallelizes the operation so that each table partition is handled by one Spark task. This config limits the number of parallelism. We pick the configured parallelism if the number of table partitions is larger than this configured value. The parallelism is assigned to the number of table partitions if it is smaller than the configured value. For full-record bootstrap, i.e., BULK_INSERT operation of the records, this configured value is passed as the BULK_INSERT shuffle parallelism (hoodie.bulkinsert.shuffle.parallelism), determining the BULK_INSERT write behavior. If you see that the bootstrap is slow due to the limited parallelism, you can increase this.
Config Param: PARALLELISM_VALUE
Since Version: 0.6.0
hoodie.bootstrap.partitionpath.translator.class org.apache.hudi.client.bootstrap.translator.IdentityBootstrapPartitionPathTranslator Translates the partition paths from the bootstrapped data into how is laid out as a Hudi table.
Config Param: PARTITION_PATH_TRANSLATOR_CLASS_NAME
Since Version: 0.6.0

Clean Configs

Cleaning (reclamation of older/unused file groups/slices).

Basic Configs

Config Name Default Description
hoodie.clean.async false Only applies when hoodie.clean.automatic is turned on. When turned on runs cleaner async with writing, which can speed up overall write performance.
Config Param: ASYNC_CLEAN
hoodie.cleaner.commits.retained 10 When KEEP_LATEST_COMMITS cleaning policy is used, the number of commits to retain, without cleaning. This will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries.
Config Param: CLEANER_COMMITS_RETAINED

Advanced Configs

Config Name Default Description
hoodie.clean.allow.multiple false Allows scheduling/executing multiple cleans by enabling this config. If users prefer to strictly ensure clean requests should be mutually exclusive, .i.e. a 2nd clean will not be scheduled if another clean is not yet completed to avoid repeat cleaning of same files, they might want to disable this config.
Config Param: ALLOW_MULTIPLE_CLEANS
Since Version: 0.11.0
Deprecated since: 0.15.0
hoodie.clean.automatic true When enabled, the cleaner table service is invoked immediately after each commit, to delete older file slices. It’s recommended to enable this, to ensure metadata and data storage growth is bounded.
Config Param: AUTO_CLEAN
hoodie.clean.max.commits 1 Number of commits after the last clean operation, before scheduling of a new clean is attempted.
Config Param: CLEAN_MAX_COMMITS
hoodie.clean.trigger.strategy NUM_COMMITS org.apache.hudi.table.action.clean.CleaningTriggerStrategy: Controls when cleaning is scheduled. NUM_COMMITS(default): Trigger the cleaning service every N commits, determined by hoodie.clean.max.commits.
Config Param: CLEAN_TRIGGER_STRATEGY
hoodie.cleaner.delete.bootstrap.base.file false When set to true, cleaner also deletes the bootstrap base file when it’s skeleton base file is cleaned. Turn this to true, if you want to ensure the bootstrap dataset storage is reclaimed over time, as the table receives updates/deletes. Another reason to turn this on, would be to ensure data residing in bootstrap base files are also physically deleted, to comply with data privacy enforcement processes.
Config Param: CLEANER_BOOTSTRAP_BASE_FILE_ENABLE
hoodie.cleaner.fileversions.retained 3 When KEEP_LATEST_FILE_VERSIONS cleaning policy is used, the minimum number of file slices to retain in each file group, during cleaning.
Config Param: CLEANER_FILE_VERSIONS_RETAINED
hoodie.cleaner.hours.retained 24 When KEEP_LATEST_BY_HOURS cleaning policy is used, the number of hours for which commits need to be retained. This config provides a more flexible option as compared to number of commits retained for cleaning service. Setting this property ensures all the files, but the latest in a file group, corresponding to commits with commit times older than the configured number of hours to be retained are cleaned.
Config Param: CLEANER_HOURS_RETAINED
hoodie.cleaner.incremental.mode true When enabled, the plans for each cleaner service run is computed incrementally off the events in the timeline, since the last cleaner run. This is much more efficient than obtaining listings for the full table for each planning (even with a metadata table).
Config Param: CLEANER_INCREMENTAL_MODE_ENABLE
hoodie.cleaner.parallelism 200 This config controls the behavior of both the cleaning plan and cleaning execution. Deriving the cleaning plan is parallelized at the table partition level, i.e., each table partition is processed by one Spark task to figure out the files to clean. The cleaner picks the configured parallelism if the number of table partitions is larger than this configured value. The parallelism is assigned to the number of table partitions if it is smaller than the configured value. The clean execution, i.e., the file deletion, is parallelized at file level, which is the unit of Spark task distribution. Similarly, the actual parallelism cannot exceed the configured value if the number of files is larger. If cleaning plan or execution is slow due to limited parallelism, you can increase this to tune the performance..
Config Param: CLEANER_PARALLELISM_VALUE
hoodie.cleaner.policy KEEP_LATEST_COMMITS org.apache.hudi.common.model.HoodieCleaningPolicy: Cleaning policy to be used. The cleaner service deletes older file slices files to re-claim space. Long running query plans may often refer to older file slices and will break if those are cleaned, before the query has had a chance to run. So, it is good to make sure that the data is retained for more than the maximum query execution time. By default, the cleaning policy is determined based on one of the following configs explicitly set by the user (at most one of them can be set; otherwise, KEEP_LATEST_COMMITS cleaning policy is used). KEEP_LATEST_FILE_VERSIONS: keeps the last N versions of the file slices written; used when “hoodie.cleaner.fileversions.retained” is explicitly set only. KEEP_LATEST_COMMITS(default): keeps the file slices written by the last N commits; used when “hoodie.cleaner.commits.retained” is explicitly set only. KEEP_LATEST_BY_HOURS: keeps the file slices written in the last N hours based on the commit time; used when “hoodie.cleaner.hours.retained” is explicitly set only.
Config Param: CLEANER_POLICY
hoodie.cleaner.policy.failed.writes EAGER org.apache.hudi.common.model.HoodieFailedWritesCleaningPolicy: Policy that controls how to clean up failed writes. Hudi will delete any files written by failed writes to re-claim space. EAGER(default): Clean failed writes inline after every write operation. LAZY: Clean failed writes lazily after heartbeat timeout when the cleaning service runs. This policy is required when multi-writers are enabled. NEVER: Never clean failed writes.
Config Param: FAILED_WRITES_CLEANER_POLICY

Clustering Configs

Configurations that control the clustering table service in hudi, which optimizes the storage layout for better query performance by sorting and sizing data files.

Basic Configs

Config Name Default Description
hoodie.clustering.async.enabled false Enable running of clustering service, asynchronously as inserts happen on the table.
Config Param: ASYNC_CLUSTERING_ENABLE
Since Version: 0.7.0
hoodie.clustering.inline false Turn on inline clustering - clustering will be run after each write operation is complete
Config Param: INLINE_CLUSTERING
Since Version: 0.7.0
hoodie.clustering.plan.strategy.small.file.limit 314572800 Files smaller than the size in bytes specified here are candidates for clustering
Config Param: PLAN_STRATEGY_SMALL_FILE_LIMIT
Since Version: 0.7.0
hoodie.clustering.plan.strategy.target.file.max.bytes 1073741824 Each group can produce ‘N’ (CLUSTERING_MAX_GROUP_SIZE/CLUSTERING_TARGET_FILE_SIZE) output file groups
Config Param: PLAN_STRATEGY_TARGET_FILE_MAX_BYTES
Since Version: 0.7.0

Advanced Configs

Config Name Default Description
hoodie.clustering.plan.strategy.cluster.begin.partition (N/A) Begin partition used to filter partition (inclusive), only effective when the filter mode ‘hoodie.clustering.plan.partition.filter.mode’ is SELECTED_PARTITIONS
Config Param: PARTITION_FILTER_BEGIN_PARTITION
Since Version: 0.11.0
hoodie.clustering.plan.strategy.cluster.end.partition (N/A) End partition used to filter partition (inclusive), only effective when the filter mode ‘hoodie.clustering.plan.partition.filter.mode’ is SELECTED_PARTITIONS
Config Param: PARTITION_FILTER_END_PARTITION
Since Version: 0.11.0
hoodie.clustering.plan.strategy.partition.regex.pattern (N/A) Filter clustering partitions that matched regex pattern
Config Param: PARTITION_REGEX_PATTERN
Since Version: 0.11.0
hoodie.clustering.plan.strategy.partition.selected (N/A) Partitions to run clustering
Config Param: PARTITION_SELECTED
Since Version: 0.11.0
hoodie.clustering.plan.strategy.sort.columns (N/A) Columns to sort the data by when clustering
Config Param: PLAN_STRATEGY_SORT_COLUMNS
Since Version: 0.7.0
hoodie.clustering.async.max.commits 4 Config to control frequency of async clustering
Config Param: ASYNC_CLUSTERING_MAX_COMMITS
Since Version: 0.9.0
hoodie.clustering.execution.strategy.class org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy Config to provide a strategy class (subclass of RunClusteringStrategy) to define how the clustering plan is executed. By default, we sort the file groups in th plan by the specified columns, while meeting the configured target file sizes.
Config Param: EXECUTION_STRATEGY_CLASS_NAME
Since Version: 0.7.0
hoodie.clustering.inline.max.commits 4 Config to control frequency of clustering planning
Config Param: INLINE_CLUSTERING_MAX_COMMITS
Since Version: 0.7.0
hoodie.clustering.max.parallelism 15 Maximum number of parallelism jobs submitted in clustering operation. If the resource is sufficient(Like Spark engine has enough idle executors), increasing this value will let the clustering job run faster, while it will give additional pressure to the execution engines to manage more concurrent running jobs.
Config Param: CLUSTERING_MAX_PARALLELISM
Since Version: 0.14.0
hoodie.clustering.plan.partition.filter.mode NONE org.apache.hudi.table.action.cluster.ClusteringPlanPartitionFilterMode: Partition filter mode used in the creation of clustering plan. NONE(default): Do not filter partitions. The clustering plan will include all partitions that have clustering candidates. RECENT_DAYS: This filter assumes that your data is partitioned by date. The clustering plan will only include partitions from K days ago to N days ago, where K >= N. K is determined by hoodie.clustering.plan.strategy.daybased.lookback.partitions and N is determined by hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions. SELECTED_PARTITIONS: The clustering plan will include only partition paths with names that sort within the inclusive range [hoodie.clustering.plan.strategy.cluster.begin.partition, hoodie.clustering.plan.strategy.cluster.end.partition]. DAY_ROLLING: To determine the partitions in the clustering plan, the eligible partitions will be sorted in ascending order. Each partition will have an index i in that list. The clustering plan will only contain partitions such that i mod 24 = H, where H is the current hour of the day (from 0 to 23).
Config Param: PLAN_PARTITION_FILTER_MODE_NAME
Since Version: 0.11.0
hoodie.clustering.plan.strategy.class org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy Config to provide a strategy class (subclass of ClusteringPlanStrategy) to create clustering plan i.e select what file groups are being clustered. Default strategy, looks at the clustering small file size limit (determined by hoodie.clustering.plan.strategy.small.file.limit) to pick the small file slices within partitions for clustering.
Config Param: PLAN_STRATEGY_CLASS_NAME
Since Version: 0.7.0
hoodie.clustering.plan.strategy.daybased.lookback.partitions 2 Number of partitions to list to create ClusteringPlan
Config Param: DAYBASED_LOOKBACK_PARTITIONS
Since Version: 0.7.0
hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions 0 Number of partitions to skip from latest when choosing partitions to create ClusteringPlan
Config Param: PLAN_STRATEGY_SKIP_PARTITIONS_FROM_LATEST
Since Version: 0.9.0
hoodie.clustering.plan.strategy.max.bytes.per.group 2147483648 Each clustering operation can create multiple output file groups. Total amount of data processed by clustering operation is defined by below two properties (CLUSTERING_MAX_BYTES_PER_GROUP * CLUSTERING_MAX_NUM_GROUPS). Max amount of data to be included in one group
Config Param: PLAN_STRATEGY_MAX_BYTES_PER_OUTPUT_FILEGROUP
Since Version: 0.7.0
hoodie.clustering.plan.strategy.max.num.groups 30 Maximum number of groups to create as part of ClusteringPlan. Increasing groups will increase parallelism
Config Param: PLAN_STRATEGY_MAX_GROUPS
Since Version: 0.7.0
hoodie.clustering.plan.strategy.single.group.clustering.enabled true Whether to generate clustering plan when there is only one file group involved, by default true
Config Param: PLAN_STRATEGY_SINGLE_GROUP_CLUSTERING_ENABLED
Since Version: 0.14.0
hoodie.clustering.rollback.pending.replacecommit.on.conflict false If updates are allowed to file groups pending clustering, then set this config to rollback failed or pending clustering instants. Pending clustering will be rolled back ONLY IF there is conflict between incoming upsert and filegroup to be clustered. Please exercise caution while setting this config, especially when clustering is done very frequently. This could lead to race condition in rare scenarios, for example, when the clustering completes after instants are fetched but before rollback completed.
Config Param: ROLLBACK_PENDING_CLUSTERING_ON_CONFLICT
Since Version: 0.10.0
hoodie.clustering.schedule.inline false When set to true, clustering service will be attempted for inline scheduling after each write. Users have to ensure they have a separate job to run async clustering(execution) for the one scheduled by this writer. Users can choose to set both hoodie.clustering.inline and hoodie.clustering.schedule.inline to false and have both scheduling and execution triggered by any async process, on which case hoodie.clustering.async.enabled is expected to be set to true. But if hoodie.clustering.inline is set to false, and hoodie.clustering.schedule.inline is set to true, regular writers will schedule clustering inline, but users are expected to trigger async job for execution. If hoodie.clustering.inline is set to true, regular writers will do both scheduling and execution inline for clustering
Config Param: SCHEDULE_INLINE_CLUSTERING
hoodie.clustering.updates.strategy org.apache.hudi.client.clustering.update.strategy.SparkRejectUpdateStrategy Determines how to handle updates, deletes to file groups that are under clustering. Default strategy just rejects the update
Config Param: UPDATES_STRATEGY
Since Version: 0.7.0
hoodie.layout.optimize.build.curve.sample.size 200000 Determines target sample size used by the Boundary-based Interleaved Index method of building space-filling curve. Larger sample size entails better layout optimization outcomes, at the expense of higher memory footprint.
Config Param: LAYOUT_OPTIMIZE_BUILD_CURVE_SAMPLE_SIZE
Since Version: 0.10.0
hoodie.layout.optimize.curve.build.method DIRECT org.apache.hudi.config.HoodieClusteringConfig$SpatialCurveCompositionStrategyType: This configuration only has effect if hoodie.layout.optimize.strategy is set to either “z-order” or “hilbert” (i.e. leveraging space-filling curves). This configuration controls the type of a strategy to use for building the space-filling curves, tackling specifically how the Strings are ordered based on the curve. Since we truncate the String to 8 bytes for ordering, there are two issues: (1) it can lead to poor aggregation effect, (2) the truncation of String longer than 8 bytes loses the precision, if the Strings are different but the 8-byte prefix is the same. The boundary-based interleaved index method (“SAMPLE”) has better generalization, solving the two problems above, but is slower than direct method (“DIRECT”). User should benchmark the write and query performance before tweaking this in production, if this is actually a problem. Please refer to RFC-28 for more details. DIRECT(default): This strategy builds the spatial curve in full, filling in all of the individual points corresponding to each individual record, which requires less compute. SAMPLE: This strategy leverages boundary-base interleaved index method (described in more details in Amazon DynamoDB blog https://aws.amazon.com/cn/blogs/database/tag/z-order/) and produces a better layout compared to DIRECT strategy. It requires more compute and is slower.
Config Param: LAYOUT_OPTIMIZE_SPATIAL_CURVE_BUILD_METHOD
Since Version: 0.10.0
hoodie.layout.optimize.data.skipping.enable true Enable data skipping by collecting statistics once layout optimization is complete.
Config Param: LAYOUT_OPTIMIZE_DATA_SKIPPING_ENABLE
Since Version: 0.10.0
Deprecated since: 0.11.0
hoodie.layout.optimize.enable false This setting has no effect. Please refer to clustering configuration, as well as LAYOUT_OPTIMIZE_STRATEGY config to enable advanced record layout optimization strategies
Config Param: LAYOUT_OPTIMIZE_ENABLE
Since Version: 0.10.0
Deprecated since: 0.11.0
hoodie.layout.optimize.strategy LINEAR org.apache.hudi.config.HoodieClusteringConfig$LayoutOptimizationStrategy: Determines ordering strategy for records layout optimization. LINEAR(default): Orders records lexicographically ZORDER: Orders records along Z-order spatial-curve. HILBERT: Orders records along Hilbert’s spatial-curve.
Config Param: LAYOUT_OPTIMIZE_STRATEGY
Since Version: 0.10.0

Compaction Configs

Configurations that control compaction (merging of log files onto a new base files).

Basic Configs

Config Name Default Description
hoodie.compact.inline false When set to true, compaction service is triggered after each write. While being simpler operationally, this adds extra latency on the write path.
Config Param: INLINE_COMPACT
hoodie.compact.inline.max.delta.commits 5 Number of delta commits after the last compaction, before scheduling of a new compaction is attempted. This config takes effect only for the compaction triggering strategy based on the number of commits, i.e., NUM_COMMITS, NUM_COMMITS_AFTER_LAST_REQUEST, NUM_AND_TIME, and NUM_OR_TIME.
Config Param: INLINE_COMPACT_NUM_DELTA_COMMITS

Advanced Configs

Config Name Default Description
hoodie.compact.inline.max.delta.seconds 3600 Number of elapsed seconds after the last compaction, before scheduling a new one. This config takes effect only for the compaction triggering strategy based on the elapsed time, i.e., TIME_ELAPSED, NUM_AND_TIME, and NUM_OR_TIME.
Config Param: INLINE_COMPACT_TIME_DELTA_SECONDS
hoodie.compact.inline.trigger.strategy NUM_COMMITS org.apache.hudi.table.action.compact.CompactionTriggerStrategy: Controls when compaction is scheduled. NUM_COMMITS(default): triggers compaction when there are at least N delta commits after last completed compaction. NUM_COMMITS_AFTER_LAST_REQUEST: triggers compaction when there are at least N delta commits after last completed or requested compaction. TIME_ELAPSED: triggers compaction after N seconds since last compaction. NUM_AND_TIME: triggers compaction when both there are at least N delta commits and N seconds elapsed (both must be satisfied) after last completed compaction. NUM_OR_TIME: triggers compaction when both there are at least N delta commits or N seconds elapsed (either condition is satisfied) after last completed compaction.
Config Param: INLINE_COMPACT_TRIGGER_STRATEGY
hoodie.compact.schedule.inline false When set to true, compaction service will be attempted for inline scheduling after each write. Users have to ensure they have a separate job to run async compaction(execution) for the one scheduled by this writer. Users can choose to set both hoodie.compact.inline and hoodie.compact.schedule.inline to false and have both scheduling and execution triggered by any async process. But if hoodie.compact.inline is set to false, and hoodie.compact.schedule.inline is set to true, regular writers will schedule compaction inline, but users are expected to trigger async job for execution. If hoodie.compact.inline is set to true, regular writers will do both scheduling and execution inline for compaction
Config Param: SCHEDULE_INLINE_COMPACT
hoodie.compaction.daybased.target.partitions 10 Used by org.apache.hudi.io.compact.strategy.DayBasedCompactionStrategy to denote the number of latest partitions to compact during a compaction run.
Config Param: TARGET_PARTITIONS_PER_DAYBASED_COMPACTION
hoodie.compaction.lazy.block.read true When merging the delta log files, this config helps to choose whether the log blocks should be read lazily or not. Choose true to use lazy block reading (low memory usage, but incurs seeks to each block header) or false for immediate block read (higher memory usage)
Config Param: COMPACTION_LAZY_BLOCK_READ_ENABLE
hoodie.compaction.logfile.num.threshold 0 Only if the log file num is greater than the threshold, the file group will be compacted.
Config Param: COMPACTION_LOG_FILE_NUM_THRESHOLD
Since Version: 0.13.0
hoodie.compaction.logfile.size.threshold 0 Only if the log file size is greater than the threshold in bytes, the file group will be compacted.
Config Param: COMPACTION_LOG_FILE_SIZE_THRESHOLD
hoodie.compaction.reverse.log.read false HoodieLogFormatReader reads a logfile in the forward direction starting from pos=0 to pos=file_length. If this config is set to true, the reader reads the logfile in reverse direction, from pos=file_length to pos=0
Config Param: COMPACTION_REVERSE_LOG_READ_ENABLE
hoodie.compaction.strategy org.apache.hudi.table.action.compact.strategy.LogFileSizeBasedCompactionStrategy Compaction strategy decides which file groups are picked up for compaction during each compaction run. By default. Hudi picks the log file with most accumulated unmerged data
Config Param: COMPACTION_STRATEGY
hoodie.compaction.target.io 512000 Amount of MBs to spend during compaction run for the LogFileSizeBasedCompactionStrategy. This value helps bound ingestion latency while compaction is run inline mode.
Config Param: TARGET_IO_PER_COMPACTION_IN_MB
hoodie.copyonwrite.insert.auto.split true Config to control whether we control insert split sizes automatically based on average record sizes. It’s recommended to keep this turned on, since hand tuning is otherwise extremely cumbersome.
Config Param: COPY_ON_WRITE_AUTO_SPLIT_INSERTS
hoodie.copyonwrite.insert.split.size 500000 Number of inserts assigned for each partition/bucket for writing. We based the default on writing out 100MB files, with at least 1kb records (100K records per file), and over provision to 500K. As long as auto-tuning of splits is turned on, this only affects the first write, where there is no history to learn record sizes from.
Config Param: COPY_ON_WRITE_INSERT_SPLIT_SIZE
hoodie.copyonwrite.record.size.estimate 1024 The average record size. If not explicitly specified, hudi will compute the record size estimate compute dynamically based on commit metadata. This is critical in computing the insert parallelism and bin-packing inserts into small files.
Config Param: COPY_ON_WRITE_RECORD_SIZE_ESTIMATE
hoodie.log.compaction.blocks.threshold 5 Log compaction can be scheduled if the no. of log blocks crosses this threshold value. This is effective only when log compaction is enabled via hoodie.log.compaction.inline
Config Param: LOG_COMPACTION_BLOCKS_THRESHOLD
Since Version: 0.13.0
hoodie.log.compaction.enable false By enabling log compaction through this config, log compaction will also get enabled for the metadata table.
Config Param: ENABLE_LOG_COMPACTION
Since Version: 0.14.0
hoodie.log.compaction.inline false When set to true, logcompaction service is triggered after each write. While being simpler operationally, this adds extra latency on the write path.
Config Param: INLINE_LOG_COMPACT
Since Version: 0.13.0
hoodie.optimized.log.blocks.scan.enable false New optimized scan for log blocks that handles all multi-writer use-cases while appending to log files. It also differentiates original blocks written by ingestion writers and compacted blocks written log compaction.
Config Param: ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN
Since Version: 0.13.0
hoodie.parquet.small.file.limit 104857600 During upsert operation, we opportunistically expand existing small files on storage, instead of writing new files, to keep number of files to an optimum. This config sets the file size limit below which a file on storage becomes a candidate to be selected as such a small file. By default, treat any file <= 100MB as a small file. Also note that if this set <= 0, will not try to get small files and directly write new files
Config Param: PARQUET_SMALL_FILE_LIMIT
hoodie.record.size.estimation.threshold 1.0 We use the previous commits’ metadata to calculate the estimated record size and use it to bin pack records into partitions. If the previous commit is too small to make an accurate estimation, Hudi will search commits in the reverse order, until we find a commit that has totalBytesWritten larger than (PARQUET_SMALL_FILE_LIMIT_BYTES * this_threshold)
Config Param: RECORD_SIZE_ESTIMATION_THRESHOLD

Error table Configs

Configurations that are required for Error table configs

Basic Configs

Config Name Default Description
hoodie.errortable.base.path (N/A) Base path for error table under which all error records would be stored.
Config Param: ERROR_TABLE_BASE_PATH
hoodie.errortable.target.table.name (N/A) Table name to be used for the error table
Config Param: ERROR_TARGET_TABLE
hoodie.errortable.write.class (N/A) Class which handles the error table writes. This config is used to configure a custom implementation for Error Table Writer. Specify the full class name of the custom error table writer as a value for this config
Config Param: ERROR_TABLE_WRITE_CLASS
hoodie.errortable.enable false Config to enable error table. If the config is enabled, all the records with processing error in DeltaStreamer are transferred to error table.
Config Param: ERROR_TABLE_ENABLED
hoodie.errortable.insert.shuffle.parallelism 200 Config to set insert shuffle parallelism. The config is similar to hoodie.insert.shuffle.parallelism config but applies to the error table.
Config Param: ERROR_TABLE_INSERT_PARALLELISM_VALUE
hoodie.errortable.upsert.shuffle.parallelism 200 Config to set upsert shuffle parallelism. The config is similar to hoodie.upsert.shuffle.parallelism config but applies to the error table.
Config Param: ERROR_TABLE_UPSERT_PARALLELISM_VALUE
hoodie.errortable.validate.recordcreation.enable true Records that fail to be created due to keygeneration failure or other issues will be sent to the Error Table
Config Param: ERROR_ENABLE_VALIDATE_RECORD_CREATION
Since Version: 0.15.0
hoodie.errortable.validate.targetschema.enable false Records with schema mismatch with Target Schema are sent to Error Table.
Config Param: ERROR_ENABLE_VALIDATE_TARGET_SCHEMA
hoodie.errortable.write.failure.strategy ROLLBACK_COMMIT The config specifies the failure strategy if error table write fails. Use one of - [ROLLBACK_COMMIT (Rollback the corresponding base table write commit for which the error events were triggered) , LOG_ERROR (Error is logged but the base table write succeeds) ]
Config Param: ERROR_TABLE_WRITE_FAILURE_STRATEGY

Layout Configs

Configurations that control storage layout and data distribution, which defines how the files are organized within a table.

Advanced Configs

Config Name Default Description
hoodie.storage.layout.partitioner.class (N/A) Partitioner class, it is used to distribute data in a specific way.
Config Param: LAYOUT_PARTITIONER_CLASS_NAME
hoodie.storage.layout.type DEFAULT org.apache.hudi.table.storage.HoodieStorageLayout$LayoutType: Determines how the files are organized within a table. DEFAULT(default): Each file group contains records of a certain set of keys, without particular grouping criteria. BUCKET: Each file group contains records of a set of keys which map to a certain range of hash values, so that using the hash function can easily identify the file group a record belongs to, based on the record key.
Config Param: LAYOUT_TYPE

Memory Configurations

Controls memory usage for compaction and merges, performed internally by Hudi.

Advanced Configs

Config Name Default Description
hoodie.memory.compaction.max.size (N/A) Maximum amount of memory used in bytes for compaction operations in bytes , before spilling to local storage.
Config Param: MAX_MEMORY_FOR_COMPACTION
hoodie.memory.spillable.map.path (N/A) Default file path for spillable map
Config Param: SPILLABLE_MAP_BASE_PATH
hoodie.memory.compaction.fraction 0.6 HoodieCompactedLogScanner reads logblocks, converts records to HoodieRecords and then merges these log blocks and records. At any point, the number of entries in a log block can be less than or equal to the number of entries in the corresponding parquet file. This can lead to OOM in the Scanner. Hence, a spillable map helps alleviate the memory pressure. Use this config to set the max allowable inMemory footprint of the spillable map
Config Param: MAX_MEMORY_FRACTION_FOR_COMPACTION
hoodie.memory.dfs.buffer.max.size 16777216 Property to control the max memory in bytes for dfs input stream buffer size
Config Param: MAX_DFS_STREAM_BUFFER_SIZE
hoodie.memory.merge.fraction 0.6 This fraction is multiplied with the user memory fraction (1 - spark.memory.fraction) to get a final fraction of heap space to use during merge
Config Param: MAX_MEMORY_FRACTION_FOR_MERGE
hoodie.memory.merge.max.size 1073741824 Maximum amount of memory used in bytes for merge operations, before spilling to local storage.
Config Param: MAX_MEMORY_FOR_MERGE
hoodie.memory.writestatus.failure.fraction 0.1 Property to control how what fraction of the failed record, exceptions we report back to driver. Default is 10%. If set to 100%, with lot of failures, this can cause memory pressure, cause OOMs and mask actual data errors.
Config Param: WRITESTATUS_FAILURE_FRACTION

Write Configurations

Configurations that control write behavior on Hudi tables. These can be directly passed down from even higher level frameworks (e.g Spark datasources, Flink sink) and utilities (e.g Hudi Streamer).

Basic Configs

Config Name Default Description
hoodie.base.path (N/A) Base path on lake storage, under which all the table data is stored. Always prefix it explicitly with the storage scheme (e.g hdfs://, s3:// etc). Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs etc in .hoodie directory under this base path directory.
Config Param: BASE_PATH
hoodie.table.name (N/A) Table name that will be used for registering with metastores like HMS. Needs to be same across runs.
Config Param: TBL_NAME
hoodie.datasource.write.precombine.field ts Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)
Config Param: PRECOMBINE_FIELD_NAME
hoodie.write.concurrency.mode SINGLE_WRITER org.apache.hudi.common.model.WriteConcurrencyMode: Concurrency modes for write operations. SINGLE_WRITER(default): Only one active writer to the table. Maximizes throughput. OPTIMISTIC_CONCURRENCY_CONTROL: Multiple writers can operate on the table with lazy conflict resolution using locks. This means that only one writer succeeds if multiple writers write to the same file group.
Config Param: WRITE_CONCURRENCY_MODE

Advanced Configs

Config Name Default Description
hoodie.avro.schema (N/A) Schema string representing the current write schema of the table. Hudi passes this to implementations of HoodieRecordPayload to convert incoming records to avro. This is also used as the write schema evolving records during an update.
Config Param: AVRO_SCHEMA_STRING
hoodie.bulkinsert.user.defined.partitioner.class (N/A) If specified, this class will be used to re-partition records before they are bulk inserted. This can be used to sort, pack, cluster data optimally for common query patterns. For now we support a build-in user defined bulkinsert partitioner org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner which can does sorting based on specified column values set by hoodie.bulkinsert.user.defined.partitioner.sort.columns
Config Param: BULKINSERT_USER_DEFINED_PARTITIONER_CLASS_NAME
hoodie.bulkinsert.user.defined.partitioner.sort.columns (N/A) Columns to sort the data by when use org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner as user defined partitioner during bulk_insert. For example ‘column1,column2’
Config Param: BULKINSERT_USER_DEFINED_PARTITIONER_SORT_COLUMNS
hoodie.datasource.write.keygenerator.class (N/A) Key generator class, that implements org.apache.hudi.keygen.KeyGenerator extract a key out of incoming records.
Config Param: KEYGENERATOR_CLASS_NAME
hoodie.internal.schema (N/A) Schema string representing the latest schema of the table. Hudi passes this to implementations of evolution of schema
Config Param: INTERNAL_SCHEMA_STRING
hoodie.write.schema (N/A) Config allowing to override writer’s schema. This might be necessary in cases when writer’s schema derived from the incoming dataset might actually be different from the schema we actually want to use when writing. This, for ex, could be the case for’partial-update’ use-cases (like MERGE INTO Spark SQL statement for ex) where only a projection of the incoming dataset might be used to update the records in the existing table, prompting us to override the writer’s schema
Config Param: WRITE_SCHEMA_OVERRIDE
_.hoodie.allow.multi.write.on.same.instant false
Config Param: ALLOW_MULTI_WRITE_ON_SAME_INSTANT_ENABLE
hoodie.allow.empty.commit true Whether to allow generation of empty commits, even if no data was written in the commit. It’s useful in cases where extra metadata needs to be published regardless e.g tracking source offsets when ingesting data
Config Param: ALLOW_EMPTY_COMMIT
hoodie.allow.operation.metadata.field false Whether to include ‘_hoodie_operation’ in the metadata fields. Once enabled, all the changes of a record are persisted to the delta log directly without merge
Config Param: ALLOW_OPERATION_METADATA_FIELD
Since Version: 0.9.0
hoodie.auto.adjust.lock.configs false Auto adjust lock configurations when metadata table is enabled and for async table services.
Config Param: AUTO_ADJUST_LOCK_CONFIGS
Since Version: 0.11.0
hoodie.auto.commit true Controls whether a write operation should auto commit. This can be turned off to perform inspection of the uncommitted write before deciding to commit.
Config Param: AUTO_COMMIT_ENABLE
hoodie.avro.schema.external.transformation false When enabled, records in older schema are rewritten into newer schema during upsert,delete and background compaction,clustering operations.
Config Param: AVRO_EXTERNAL_SCHEMA_TRANSFORMATION_ENABLE
hoodie.avro.schema.validate false Validate the schema used for the write against the latest schema, for backwards compatibility.
Config Param: AVRO_SCHEMA_VALIDATE_ENABLE
hoodie.bulkinsert.shuffle.parallelism 0 For large initial imports using bulk_insert operation, controls the parallelism to use for sort modes or custom partitioning done before writing records to the table. Before 0.13.0 release, if users do not configure it, Hudi would use 200 as the default shuffle parallelism. From 0.13.0 onwards Hudi by default automatically uses the parallelism deduced by Spark based on the source data or the parallelism based on the logical plan for row writer. If the shuffle parallelism is explicitly configured by the user, the user-configured parallelism is used in defining the actual parallelism. If you observe small files from the bulk insert operation, we suggest configuring this shuffle parallelism explicitly, so that the parallelism is around total_input_data_size/120MB.
Config Param: BULKINSERT_PARALLELISM_VALUE
hoodie.bulkinsert.sort.mode NONE org.apache.hudi.execution.bulkinsert.BulkInsertSortMode: Modes for sorting records during bulk insert. NONE(default): No sorting. Fastest and matches spark.write.parquet() in number of files and overhead. GLOBAL_SORT: This ensures best file sizes, with lowest memory overhead at cost of sorting. PARTITION_SORT: Strikes a balance by only sorting within a Spark RDD partition, still keeping the memory overhead of writing low. File sizing is not as good as GLOBAL_SORT. PARTITION_PATH_REPARTITION: This ensures that the data for a single physical partition in the table is written by the same Spark executor. This should only be used when input data is evenly distributed across different partition paths. If data is skewed (most records are intended for a handful of partition paths among all) then this can cause an imbalance among Spark executors. PARTITION_PATH_REPARTITION_AND_SORT: This ensures that the data for a single physical partition in the table is written by the same Spark executor. This should only be used when input data is evenly distributed across different partition paths. Compared to PARTITION_PATH_REPARTITION, this sort mode does an additional step of sorting the records based on the partition path within a single Spark partition, given that data for multiple physical partitions can be sent to the same Spark partition and executor. If data is skewed (most records are intended for a handful of partition paths among all) then this can cause an imbalance among Spark executors.
Config Param: BULK_INSERT_SORT_MODE
hoodie.client.heartbeat.interval_in_ms 60000 Writers perform heartbeats to indicate liveness. Controls how often (in ms), such heartbeats are registered to lake storage.
Config Param: CLIENT_HEARTBEAT_INTERVAL_IN_MS
hoodie.client.heartbeat.tolerable.misses 2 Number of heartbeat misses, before a writer is deemed not alive and all pending writes are aborted.
Config Param: CLIENT_HEARTBEAT_NUM_TOLERABLE_MISSES
hoodie.client.init.callback.classes Fully-qualified class names of the Hudi client init callbacks to run at the initialization of the Hudi client. The class names are separated by ,. The class must be a subclass of org.apache.hudi.callback.HoodieClientInitCallback.By default, no Hudi client init callback is executed.
Config Param: CLIENT_INIT_CALLBACK_CLASS_NAMES
Since Version: 0.14.0
hoodie.combine.before.delete true During delete operations, controls whether we should combine deletes (and potentially also upserts) before writing to storage.
Config Param: COMBINE_BEFORE_DELETE
hoodie.combine.before.insert false When inserted records share same key, controls whether they should be first combined (i.e de-duplicated) before writing to storage.
Config Param: COMBINE_BEFORE_INSERT
hoodie.combine.before.upsert true When upserted records share same key, controls whether they should be first combined (i.e de-duplicated) before writing to storage. This should be turned off only if you are absolutely certain that there are no duplicates incoming, otherwise it can lead to duplicate keys and violate the uniqueness guarantees.
Config Param: COMBINE_BEFORE_UPSERT
hoodie.consistency.check.initial_interval_ms 2000 Initial time between successive attempts to ensure written data’s metadata is consistent on storage. Grows with exponential backoff after the initial value.
Config Param: INITIAL_CONSISTENCY_CHECK_INTERVAL_MS
hoodie.consistency.check.max_checks 7 Maximum number of checks, for consistency of written data.
Config Param: MAX_CONSISTENCY_CHECKS
hoodie.consistency.check.max_interval_ms 300000 Max time to wait between successive attempts at performing consistency checks
Config Param: MAX_CONSISTENCY_CHECK_INTERVAL_MS
hoodie.datasource.write.keygenerator.type SIMPLE Note This is being actively worked on. Please use hoodie.datasource.write.keygenerator.class instead. org.apache.hudi.keygen.constant.KeyGeneratorType: Key generator type, indicating the key generator class to use, that implements org.apache.hudi.keygen.KeyGenerator. SIMPLE(default): Simple key generator, which takes names of fields to be used for recordKey and partitionPath as configs. COMPLEX: Complex key generator, which takes names of fields to be used for recordKey and partitionPath as configs. TIMESTAMP: Timestamp-based key generator, that relies on timestamps for partitioning field. Still picks record key by name. CUSTOM: This is a generic implementation type of KeyGenerator where users can configure record key as a single field or a combination of fields. Similarly partition path can be configured to have multiple fields or only one field. This KeyGenerator expects value for prop “hoodie.datasource.write.partitionpath.field” in a specific format. For example: properties.put(“hoodie.datasource.write.partitionpath.field”, “field1:PartitionKeyType1,field2:PartitionKeyType2”). NON_PARTITION: Simple Key generator for non-partitioned tables. GLOBAL_DELETE: Key generator for deletes using global indices.
Config Param: KEYGENERATOR_TYPE
hoodie.datasource.write.payload.class org.apache.hudi.common.model.OverwriteWithLatestAvroPayload Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL in-effective
Config Param: WRITE_PAYLOAD_CLASS_NAME
hoodie.datasource.write.record.merger.impls org.apache.hudi.common.model.HoodieAvroRecordMerger List of HoodieMerger implementations constituting Hudi’s merging strategy — based on the engine used. These merger impls will filter by hoodie.datasource.write.record.merger.strategy Hudi will pick most efficient implementation to perform merging/combining of the records (during update, reading MOR table, etc)
Config Param: RECORD_MERGER_IMPLS
Since Version: 0.13.0
hoodie.datasource.write.record.merger.strategy eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which has the same merger strategy id
Config Param: RECORD_MERGER_STRATEGY
Since Version: 0.13.0
hoodie.datasource.write.schema.allow.auto.evolution.column.drop false Controls whether table’s schema is allowed to automatically evolve when incoming batch’s schema can have any of the columns dropped. By default, Hudi will not allow this kind of (auto) schema evolution. Set this config to true to allow table’s schema to be updated automatically when columns are dropped from the new incoming batch.
Config Param: SCHEMA_ALLOW_AUTO_EVOLUTION_COLUMN_DROP
Since Version: 0.13.0
hoodie.delete.shuffle.parallelism 0 Parallelism used for delete operation. Delete operations also performs shuffles, similar to upsert operation. Before 0.13.0 release, if users do not configure it, Hudi would use 200 as the default shuffle parallelism. From 0.13.0 onwards Hudi by default automatically uses the parallelism deduced by Spark based on the source data. If the shuffle parallelism is explicitly configured by the user, the user-configured parallelism is used in defining the actual parallelism.
Config Param: DELETE_PARALLELISM_VALUE
hoodie.embed.timeline.server true When true, spins up an instance of the timeline server (meta server that serves cached file listings, statistics),running on each writer’s driver process, accepting requests during the write from executors.
Config Param: EMBEDDED_TIMELINE_SERVER_ENABLE
hoodie.embed.timeline.server.async false Controls whether or not, the requests to the timeline server are processed in asynchronous fashion, potentially improving throughput.
Config Param: EMBEDDED_TIMELINE_SERVER_USE_ASYNC_ENABLE
hoodie.embed.timeline.server.gzip true Controls whether gzip compression is used, for large responses from the timeline server, to improve latency.
Config Param: EMBEDDED_TIMELINE_SERVER_COMPRESS_ENABLE
hoodie.embed.timeline.server.port 0 Port at which the timeline server listens for requests. When running embedded in each writer, it picks a free port and communicates to all the executors. This should rarely be changed.
Config Param: EMBEDDED_TIMELINE_SERVER_PORT_NUM
hoodie.embed.timeline.server.reuse.enabled false Controls whether the timeline server instance should be cached and reused across the tablesto avoid startup costs and server overhead. This should only be used if you are running multiple writers in the same JVM.
Config Param: EMBEDDED_TIMELINE_SERVER_REUSE_ENABLED
hoodie.embed.timeline.server.threads -1 Number of threads to serve requests in the timeline server. By default, auto configured based on the number of underlying cores.
Config Param: EMBEDDED_TIMELINE_NUM_SERVER_THREADS
hoodie.fail.on.timeline.archiving true Timeline archiving removes older instants from the timeline, after each write operation, to minimize metadata overhead. Controls whether or not, the write should be failed as well, if such archiving fails.
Config Param: FAIL_ON_TIMELINE_ARCHIVING_ENABLE
hoodie.fail.writes.on.inline.table.service.exception true Table services such as compaction and clustering can fail and prevent syncing to the metaclient. Set this to true to fail writes when table services fail
Config Param: FAIL_ON_INLINE_TABLE_SERVICE_EXCEPTION
Since Version: 0.13.0
hoodie.fileid.prefix.provider.class org.apache.hudi.table.RandomFileIdPrefixProvider File Id Prefix provider class, that implements org.apache.hudi.fileid.FileIdPrefixProvider
Config Param: FILEID_PREFIX_PROVIDER_CLASS
Since Version: 0.10.0
hoodie.finalize.write.parallelism 200 Parallelism for the write finalization internal operation, which involves removing any partially written files from lake storage, before committing the write. Reduce this value, if the high number of tasks incur delays for smaller tables or low latency writes.
Config Param: FINALIZE_WRITE_PARALLELISM_VALUE
hoodie.insert.shuffle.parallelism 0 Parallelism for inserting records into the table. Inserts can shuffle data before writing to tune file sizes and optimize the storage layout. Before 0.13.0 release, if users do not configure it, Hudi would use 200 as the default shuffle parallelism. From 0.13.0 onwards Hudi by default automatically uses the parallelism deduced by Spark based on the source data. If the shuffle parallelism is explicitly configured by the user, the user-configured parallelism is used in defining the actual parallelism. If you observe small files from the insert operation, we suggest configuring this shuffle parallelism explicitly, so that the parallelism is around total_input_data_size/120MB.
Config Param: INSERT_PARALLELISM_VALUE
hoodie.markers.delete.parallelism 100 Determines the parallelism for deleting marker files, which are used to track all files (valid or invalid/partial) written during a write operation. Increase this value if delays are observed, with large batch writes.
Config Param: MARKERS_DELETE_PARALLELISM_VALUE
hoodie.markers.timeline_server_based.batch.interval_ms 50 The batch interval in milliseconds for marker creation batch processing
Config Param: MARKERS_TIMELINE_SERVER_BASED_BATCH_INTERVAL_MS
Since Version: 0.9.0
hoodie.markers.timeline_server_based.batch.num_threads 20 Number of threads to use for batch processing marker creation requests at the timeline server
Config Param: MARKERS_TIMELINE_SERVER_BASED_BATCH_NUM_THREADS
Since Version: 0.9.0
hoodie.merge.allow.duplicate.on.inserts true When enabled, we allow duplicate keys even if inserts are routed to merge with an existing file (for ensuring file sizing). This is only relevant for insert operation, since upsert, delete operations will ensure unique key constraints are maintained.
Config Param: MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE
hoodie.merge.data.validation.enabled false When enabled, data validation checks are performed during merges to ensure expected number of records after merge operation.
Config Param: MERGE_DATA_VALIDATION_CHECK_ENABLE
hoodie.merge.small.file.group.candidates.limit 1 Limits number of file groups, whose base file satisfies small-file limit, to consider for appending records during upsert operation. Only applicable to MOR tables
Config Param: MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT
hoodie.release.resource.on.completion.enable true Control to enable release all persist rdds when the spark job finish.
Config Param: RELEASE_RESOURCE_ENABLE
Since Version: 0.11.0
hoodie.rollback.instant.backup.dir .rollback_backup Path where instants being rolled back are copied. If not absolute path then a directory relative to .hoodie folder is created.
Config Param: ROLLBACK_INSTANT_BACKUP_DIRECTORY
hoodie.rollback.instant.backup.enabled false Backup instants removed during rollback and restore (useful for debugging)
Config Param: ROLLBACK_INSTANT_BACKUP_ENABLED
hoodie.rollback.parallelism 100 This config controls the parallelism for rollback of commits. Rollbacks perform deletion of files or logging delete blocks to file groups on storage in parallel. The configure value limits the parallelism so that the number of Spark tasks do not exceed the value. If rollback is slow due to the limited parallelism, you can increase this to tune the performance.
Config Param: ROLLBACK_PARALLELISM_VALUE
hoodie.rollback.using.markers true Enables a more efficient mechanism for rollbacks based on the marker files generated during the writes. Turned on by default.
Config Param: ROLLBACK_USING_MARKERS_ENABLE
hoodie.schema.cache.enable false cache query internalSchemas in driver/executor side
Config Param: ENABLE_INTERNAL_SCHEMA_CACHE
hoodie.sensitive.config.keys ssl,tls,sasl,auth,credentials Comma separated list of filters for sensitive config keys. Hudi Streamer will not print any configuration which contains the configured filter. For example with a configured filter ssl, value for config ssl.trustore.location would be masked.
Config Param: SENSITIVE_CONFIG_KEYS_FILTER
Since Version: 0.14.0
hoodie.skip.default.partition.validation false When table is upgraded from pre 0.12 to 0.12, we check for “default” partition and fail if found one. Users are expected to rewrite the data in those partitions. Enabling this config will bypass this validation
Config Param: SKIP_DEFAULT_PARTITION_VALIDATION
Since Version: 0.12.0
hoodie.table.base.file.format PARQUET File format to store all the base file data. org.apache.hudi.common.model.HoodieFileFormat: Hoodie file formats. PARQUET(default): Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. HFILE: (internal config) File format for metadata table. A file of sorted key/value pairs. Both keys and values are byte arrays. ORC: The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.
Config Param: BASE_FILE_FORMAT
hoodie.table.services.enabled true Master control to disable all table services including archive, clean, compact, cluster, etc.
Config Param: TABLE_SERVICES_ENABLED
Since Version: 0.11.0
hoodie.timeline.layout.version 1 Controls the layout of the timeline. Version 0 relied on renames, Version 1 (default) models the timeline as an immutable log relying only on atomic writes for object storage.
Config Param: TIMELINE_LAYOUT_VERSION_NUM
Since Version: 0.5.1
hoodie.upsert.shuffle.parallelism 0 Parallelism to use for upsert operation on the table. Upserts can shuffle data to perform index lookups, file sizing, bin packing records optimally into file groups. Before 0.13.0 release, if users do not configure it, Hudi would use 200 as the default shuffle parallelism. From 0.13.0 onwards Hudi by default automatically uses the parallelism deduced by Spark based on the source data. If the shuffle parallelism is explicitly configured by the user, the user-configured parallelism is used in defining the actual parallelism. If you observe small files from the upsert operation, we suggest configuring this shuffle parallelism explicitly, so that the parallelism is around total_input_data_size/120MB.
Config Param: UPSERT_PARALLELISM_VALUE
hoodie.write.buffer.limit.bytes 4194304 Size of in-memory buffer used for parallelizing network reads and lake storage writes.
Config Param: WRITE_BUFFER_LIMIT_BYTES_VALUE
hoodie.write.buffer.record.cache.limit 131072 Maximum queue size of in-memory buffer for parallelizing network reads and lake storage writes.
Config Param: WRITE_BUFFER_RECORD_CACHE_LIMIT
Since Version: 0.15.0
hoodie.write.buffer.record.sampling.rate 64 Sampling rate of in-memory buffer used to estimate object size. Higher value lead to lower CPU usage.
Config Param: WRITE_BUFFER_RECORD_SAMPLING_RATE
Since Version: 0.15.0
hoodie.write.concurrency.async.conflict.detector.initial_delay_ms 0 Used for timeline-server-based markers with AsyncTimelineServerBasedDetectionStrategy. The time in milliseconds to delay the first execution of async marker-based conflict detection.
Config Param: ASYNC_CONFLICT_DETECTOR_INITIAL_DELAY_MS
Since Version: 0.13.0
hoodie.write.concurrency.async.conflict.detector.period_ms 30000 Used for timeline-server-based markers with AsyncTimelineServerBasedDetectionStrategy. The period in milliseconds between successive executions of async marker-based conflict detection.
Config Param: ASYNC_CONFLICT_DETECTOR_PERIOD_MS
Since Version: 0.13.0
hoodie.write.concurrency.early.conflict.check.commit.conflict false Whether to enable commit conflict checking or not during early conflict detection.
Config Param: EARLY_CONFLICT_DETECTION_CHECK_COMMIT_CONFLICT
Since Version: 0.13.0
hoodie.write.concurrency.early.conflict.detection.enable false Whether to enable early conflict detection based on markers. It eagerly detects writing conflict before create markers and fails fast if a conflict is detected, to release cluster compute resources as soon as possible.
Config Param: EARLY_CONFLICT_DETECTION_ENABLE
Since Version: 0.13.0
hoodie.write.concurrency.early.conflict.detection.strategy The class name of the early conflict detection strategy to use. This should be a subclass of org.apache.hudi.common.conflict.detection.EarlyConflictDetectionStrategy.
Config Param: EARLY_CONFLICT_DETECTION_STRATEGY_CLASS_NAME
Since Version: 0.13.0
hoodie.write.executor.disruptor.buffer.limit.bytes 1024 The size of the Disruptor Executor ring buffer, must be power of 2
Config Param: WRITE_EXECUTOR_DISRUPTOR_BUFFER_LIMIT_BYTES
Since Version: 0.13.0
hoodie.write.executor.disruptor.wait.strategy BLOCKING_WAIT org.apache.hudi.common.util.queue.DisruptorWaitStrategyType: Strategy employed for making Disruptor Executor wait on a cursor. BLOCKING_WAIT(default): The slowest of the available wait strategies. However, it is the most conservative with the respect to CPU usage and will give the most consistent behaviour across the widest variety of deployment options. SLEEPING_WAIT: Like the BLOCKING_WAIT strategy, it attempts to be conservative with CPU usage by using a simple busy wait loop. The difference is that the SLEEPING_WAIT strategy uses a call to LockSupport.parkNanos(1) in the middle of the loop. On a typical Linux system this will pause the thread for around 60µs. YIELDING_WAIT: The YIELDING_WAIT strategy is one of two wait strategy that can be used in low-latency systems. It is designed for cases where there is an opportunity to burn CPU cycles with the goal of improving latency. The YIELDING_WAIT strategy will busy spin, waiting for the sequence to increment to the appropriate value. Inside the body of the loop Thread#yield() will be called allowing other queued threads to run. This is the recommended wait strategy when you need very high performance, and the number of EventHandler threads is lower than the total number of logical cores, such as when hyper-threading is enabled. BUSY_SPIN_WAIT: The BUSY_SPIN_WAIT strategy is the highest performing wait strategy. Like the YIELDING_WAIT strategy, it can be used in low-latency systems, but puts the highest constraints on the deployment environment.
Config Param: WRITE_EXECUTOR_DISRUPTOR_WAIT_STRATEGY
Since Version: 0.13.0
hoodie.write.executor.type SIMPLE org.apache.hudi.common.util.queue.ExecutorType: Types of executor that implements org.apache.hudi.common.util.queue.HoodieExecutor. The executor orchestrates concurrent producers and consumers communicating through a message queue. BOUNDED_IN_MEMORY: Executor which orchestrates concurrent producers and consumers communicating through a bounded in-memory message queue using LinkedBlockingQueue. This queue will use extra lock to balance producers and consumers. DISRUPTOR: Executor which orchestrates concurrent producers and consumers communicating through disruptor as a lock free message queue to gain better writing performance. Although DisruptorExecutor is still an experimental feature. SIMPLE(default): Executor with no inner message queue and no inner lock. Consuming and writing records from iterator directly. The advantage is that there is no need for additional memory and cpu resources due to lock or multithreading. The disadvantage is that the executor is a single-write-single-read model, cannot support functions such as speed limit and can not de-couple the network read (shuffle read) and network write (writing objects/files to storage) anymore.
Config Param: WRITE_EXECUTOR_TYPE
Since Version: 0.13.0
hoodie.write.markers.type TIMELINE_SERVER_BASED org.apache.hudi.common.table.marker.MarkerType: Marker type indicating how markers are stored in the file system, used for identifying the files written and cleaning up files not committed which should be deleted. DIRECT: Individual marker file corresponding to each data file is directly created by the writer. TIMELINE_SERVER_BASED(default): Marker operations are all handled at the timeline service which serves as a proxy. New marker entries are batch processed and stored in a limited number of underlying files for efficiency. If HDFS is used or timeline server is disabled, DIRECT markers are used as fallback even if this is configured. This configuration does not take effect for Spark structured streaming; DIRECT markers are always used.
Config Param: MARKERS_TYPE
Since Version: 0.9.0
hoodie.write.num.retries.on.conflict.failures 0 Maximum number of times to retry a batch on conflict failure.
Config Param: NUM_RETRIES_ON_CONFLICT_FAILURES
Since Version: 0.14.0
hoodie.write.status.storage.level MEMORY_AND_DISK_SER Write status objects hold metadata about a write (stats, errors), that is not yet committed to storage. This controls the how that information is cached for inspection by clients. We rarely expect this to be changed.
Config Param: WRITE_STATUS_STORAGE_LEVEL_VALUE
hoodie.write.tagged.record.storage.level MEMORY_AND_DISK_SER Determine what level of persistence is used to cache write RDDs. Refer to org.apache.spark.storage.StorageLevel for different values
Config Param: TAGGED_RECORD_STORAGE_LEVEL_VALUE
hoodie.writestatus.class org.apache.hudi.client.WriteStatus Subclass of org.apache.hudi.client.WriteStatus to be used to collect information about a write. Can be overridden to collection additional metrics/statistics about the data if needed.
Config Param: WRITE_STATUS_CLASS_NAME

Commit Callback Configs

Configurations controlling callback behavior into HTTP endpoints, to push notifications on commits on hudi tables.

Write commit callback configs

Advanced Configs

Config Name Default Description
hoodie.write.commit.callback.http.custom.headers (N/A) Http callback custom headers. Format: HeaderName1:HeaderValue1;HeaderName2:HeaderValue2
Config Param: CALLBACK_HTTP_CUSTOM_HEADERS
Since Version: 0.15.0
hoodie.write.commit.callback.http.url (N/A) Callback host to be sent along with callback messages
Config Param: CALLBACK_HTTP_URL
Since Version: 0.6.0
hoodie.write.commit.callback.class org.apache.hudi.callback.impl.HoodieWriteCommitHttpCallback Full path of callback class and must be a subclass of HoodieWriteCommitCallback class, org.apache.hudi.callback.impl.HoodieWriteCommitHttpCallback by default
Config Param: CALLBACK_CLASS_NAME
Since Version: 0.6.0
hoodie.write.commit.callback.http.api.key hudi_write_commit_http_callback Http callback API key. hudi_write_commit_http_callback by default
Config Param: CALLBACK_HTTP_API_KEY_VALUE
Since Version: 0.6.0
hoodie.write.commit.callback.http.timeout.seconds 30 Callback timeout in seconds.
Config Param: CALLBACK_HTTP_TIMEOUT_IN_SECONDS
Since Version: 0.6.0
hoodie.write.commit.callback.on false Turn commit callback on/off. off by default.
Config Param: TURN_CALLBACK_ON
Since Version: 0.6.0

Write commit Kafka callback configs

Controls notifications sent to Kafka, on events happening to a hudi table.

Advanced Configs

Config Name Default Description
hoodie.write.commit.callback.kafka.bootstrap.servers (N/A) Bootstrap servers of kafka cluster, to be used for publishing commit metadata.
Config Param: BOOTSTRAP_SERVERS
Since Version: 0.7.0
hoodie.write.commit.callback.kafka.partition (N/A) It may be desirable to serialize all changes into a single Kafka partition for providing strict ordering. By default, Kafka messages are keyed by table name, which guarantees ordering at the table level, but not globally (or when new partitions are added)
Config Param: PARTITION
Since Version: 0.7.0
hoodie.write.commit.callback.kafka.topic (N/A) Kafka topic name to publish timeline activity into.
Config Param: TOPIC
Since Version: 0.7.0
hoodie.write.commit.callback.kafka.acks all kafka acks level, all by default to ensure strong durability.
Config Param: ACKS
Since Version: 0.7.0
hoodie.write.commit.callback.kafka.retries 3 Times to retry the produce. 3 by default
Config Param: RETRIES
Since Version: 0.7.0

Write commit pulsar callback configs

Controls notifications sent to pulsar, on events happening to a hudi table.

Advanced Configs

Config Name Default Description
hoodie.write.commit.callback.pulsar.broker.service.url (N/A) Server’s url of pulsar cluster, to be used for publishing commit metadata.
Config Param: BROKER_SERVICE_URL
Since Version: 0.11.0
hoodie.write.commit.callback.pulsar.topic (N/A) pulsar topic name to publish timeline activity into.
Config Param: TOPIC
Since Version: 0.11.0
hoodie.write.commit.callback.pulsar.connection-timeout 10s Duration of waiting for a connection to a broker to be established.
Config Param: CONNECTION_TIMEOUT
Since Version: 0.11.0
hoodie.write.commit.callback.pulsar.keepalive-interval 30s Duration of keeping alive interval for each client broker connection.
Config Param: KEEPALIVE_INTERVAL
Since Version: 0.11.0
hoodie.write.commit.callback.pulsar.operation-timeout 30s Duration of waiting for completing an operation.
Config Param: OPERATION_TIMEOUT
Since Version: 0.11.0
hoodie.write.commit.callback.pulsar.producer.block-if-queue-full true When the queue is full, the method is blocked instead of an exception is thrown.
Config Param: PRODUCER_BLOCK_QUEUE_FULL
Since Version: 0.11.0
hoodie.write.commit.callback.pulsar.producer.pending-queue-size 1000 The maximum size of a queue holding pending messages.
Config Param: PRODUCER_PENDING_QUEUE_SIZE
Since Version: 0.11.0
hoodie.write.commit.callback.pulsar.producer.pending-total-size 50000 The maximum number of pending messages across partitions.
Config Param: PRODUCER_PENDING_SIZE
Since Version: 0.11.0
hoodie.write.commit.callback.pulsar.producer.route-mode RoundRobinPartition Message routing logic for producers on partitioned topics.
Config Param: PRODUCER_ROUTE_MODE
Since Version: 0.11.0
hoodie.write.commit.callback.pulsar.producer.send-timeout 30s The timeout in each sending to pulsar.
Config Param: PRODUCER_SEND_TIMEOUT
Since Version: 0.11.0
hoodie.write.commit.callback.pulsar.request-timeout 60s Duration of waiting for completing a request.
Config Param: REQUEST_TIMEOUT
Since Version: 0.11.0

Lock Configs

Configurations that control locking mechanisms required for concurrency control between writers to a Hudi table. Concurrency between Hudi’s own table services are auto managed internally.

Common Lock Configurations

Basic Configs

Config Name Default Description
hoodie.write.lock.heartbeat_interval_ms 60000 Heartbeat interval in ms, to send a heartbeat to indicate that hive client holding locks.
Config Param: LOCK_HEARTBEAT_INTERVAL_MS
Since Version: 0.15.0

Advanced Configs

Config Name Default Description
hoodie.write.lock.filesystem.path (N/A) For DFS based lock providers, path to store the locks under. use Table’s meta path as default
Config Param: FILESYSTEM_LOCK_PATH
Since Version: 0.8.0
hoodie.write.lock.hivemetastore.database (N/A) For Hive based lock provider, the Hive database to acquire lock against
Config Param: HIVE_DATABASE_NAME
Since Version: 0.8.0
hoodie.write.lock.hivemetastore.table (N/A) For Hive based lock provider, the Hive table to acquire lock against
Config Param: HIVE_TABLE_NAME
Since Version: 0.8.0
hoodie.write.lock.hivemetastore.uris (N/A) For Hive based lock provider, the Hive metastore URI to acquire locks against.
Config Param: HIVE_METASTORE_URI
Since Version: 0.8.0
hoodie.write.lock.zookeeper.base_path (N/A) The base path on Zookeeper under which to create lock related ZNodes. This should be same for all concurrent writers to the same table
Config Param: ZK_BASE_PATH
Since Version: 0.8.0
hoodie.write.lock.zookeeper.port (N/A) Zookeeper port to connect to.
Config Param: ZK_PORT
Since Version: 0.8.0
hoodie.write.lock.zookeeper.url (N/A) Zookeeper URL to connect to.
Config Param: ZK_CONNECT_URL
Since Version: 0.8.0
hoodie.write.lock.client.num_retries 50 Maximum number of times to retry to acquire lock additionally from the lock manager.
Config Param: LOCK_ACQUIRE_CLIENT_NUM_RETRIES
Since Version: 0.8.0
hoodie.write.lock.client.wait_time_ms_between_retry 5000 Amount of time to wait between retries on the lock provider by the lock manager
Config Param: LOCK_ACQUIRE_CLIENT_RETRY_WAIT_TIME_IN_MILLIS
Since Version: 0.8.0
hoodie.write.lock.conflict.resolution.strategy org.apache.hudi.client.transaction.SimpleConcurrentFileWritesConflictResolutionStrategy Lock provider class name, this should be subclass of org.apache.hudi.client.transaction.ConflictResolutionStrategy
Config Param: WRITE_CONFLICT_RESOLUTION_STRATEGY_CLASS_NAME
Since Version: 0.8.0
hoodie.write.lock.filesystem.expire 0 For DFS based lock providers, expire time in minutes, must be a non-negative number, default means no expire
Config Param: FILESYSTEM_LOCK_EXPIRE
Since Version: 0.12.0
hoodie.write.lock.max_wait_time_ms_between_retry 16000 Maximum amount of time to wait between retries by lock provider client. This bounds the maximum delay from the exponential backoff. Currently used by ZK based lock provider only.
Config Param: LOCK_ACQUIRE_RETRY_MAX_WAIT_TIME_IN_MILLIS
Since Version: 0.8.0
hoodie.write.lock.num_retries 15 Maximum number of times to retry lock acquire, at each lock provider
Config Param: LOCK_ACQUIRE_NUM_RETRIES
Since Version: 0.8.0
hoodie.write.lock.provider org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider Lock provider class name, user can provide their own implementation of LockProvider which should be subclass of org.apache.hudi.common.lock.LockProvider
Config Param: LOCK_PROVIDER_CLASS_NAME
Since Version: 0.8.0
hoodie.write.lock.wait_time_ms 60000 Timeout in ms, to wait on an individual lock acquire() call, at the lock provider.
Config Param: LOCK_ACQUIRE_WAIT_TIMEOUT_MS
Since Version: 0.8.0
hoodie.write.lock.wait_time_ms_between_retry 1000 Initial amount of time to wait between retries to acquire locks, subsequent retries will exponentially backoff.
Config Param: LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS
Since Version: 0.8.0
hoodie.write.lock.zookeeper.connection_timeout_ms 15000 Timeout in ms, to wait for establishing connection with Zookeeper.
Config Param: ZK_CONNECTION_TIMEOUT_MS
Since Version: 0.8.0
hoodie.write.lock.zookeeper.lock_key Key name under base_path at which to create a ZNode and acquire lock. Final path on zk will look like base_path/lock_key. If this parameter is not set, we would set it as the table name
Config Param: ZK_LOCK_KEY
Since Version: 0.8.0
hoodie.write.lock.zookeeper.session_timeout_ms 60000 Timeout in ms, to wait after losing connection to ZooKeeper, before the session is expired
Config Param: ZK_SESSION_TIMEOUT_MS
Since Version: 0.8.0

DynamoDB based Locks Configurations

Configs that control DynamoDB based locking mechanisms required for concurrency control between writers to a Hudi table. Concurrency between Hudi’s own table services are auto managed internally.

Advanced Configs

Config Name Default Description
hoodie.write.lock.dynamodb.endpoint_url (N/A) For DynamoDB based lock provider, the url endpoint used for Amazon DynamoDB service. Useful for development with a local dynamodb instance.
Config Param: DYNAMODB_ENDPOINT_URL
Since Version: 0.10.1
hoodie.write.lock.dynamodb.billing_mode PAY_PER_REQUEST For DynamoDB based lock provider, by default it is PAY_PER_REQUEST mode. Alternative is PROVISIONED.
Config Param: DYNAMODB_LOCK_BILLING_MODE
Since Version: 0.10.0
hoodie.write.lock.dynamodb.partition_key For DynamoDB based lock provider, the partition key for the DynamoDB lock table. Each Hudi dataset should has it’s unique key so concurrent writers could refer to the same partition key. By default we use the Hudi table name specified to be the partition key
Config Param: DYNAMODB_LOCK_PARTITION_KEY
Since Version: 0.10.0
hoodie.write.lock.dynamodb.read_capacity 20 For DynamoDB based lock provider, read capacity units when using PROVISIONED billing mode
Config Param: DYNAMODB_LOCK_READ_CAPACITY
Since Version: 0.10.0
hoodie.write.lock.dynamodb.region us-east-1 For DynamoDB based lock provider, the region used in endpoint for Amazon DynamoDB service. Would try to first get it from AWS_REGION environment variable. If not find, by default use us-east-1
Config Param: DYNAMODB_LOCK_REGION
Since Version: 0.10.0
hoodie.write.lock.dynamodb.table hudi_locks For DynamoDB based lock provider, the name of the DynamoDB table acting as lock table
Config Param: DYNAMODB_LOCK_TABLE_NAME
Since Version: 0.10.0
hoodie.write.lock.dynamodb.table_creation_timeout 120000 For DynamoDB based lock provider, the maximum number of milliseconds to wait for creating DynamoDB table
Config Param: DYNAMODB_LOCK_TABLE_CREATION_TIMEOUT
Since Version: 0.10.0
hoodie.write.lock.dynamodb.write_capacity 10 For DynamoDB based lock provider, write capacity units when using PROVISIONED billing mode
Config Param: DYNAMODB_LOCK_WRITE_CAPACITY
Since Version: 0.10.0
hoodie.write.lock.wait_time_ms 60000 Lock Acquire Wait Timeout in milliseconds
Config Param: LOCK_ACQUIRE_WAIT_TIMEOUT_MS_PROP_KEY
Since Version: 0.10.0

Key Generator Configs

Hudi maintains keys (record key + partition path) for uniquely identifying a particular record. These configs allow developers to setup the Key generator class that extracts these out of incoming records.

Key Generator Options

Basic Configs

Config Name Default Description
hoodie.datasource.write.partitionpath.field (N/A) Partition path field. Value to be used at the partitionPath component of HoodieKey. Actual value obtained by invoking .toString()
Config Param: PARTITIONPATH_FIELD_NAME
hoodie.datasource.write.recordkey.field (N/A) Record key field. Value to be used as the recordKey component of HoodieKey. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: a.b.c
Config Param: RECORDKEY_FIELD_NAME
hoodie.datasource.write.hive_style_partitioning false Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)
Config Param: HIVE_STYLE_PARTITIONING_ENABLE

Advanced Configs

Config Name Default Description
hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled false When set to true, consistent value will be generated for a logical timestamp type column, like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so as not to break the pipeline that deploy either fully row-writer path or non row-writer path. For example, if it is kept disabled then record key of timestamp type with value 2016-12-29 09:54:00 will be written as timestamp 2016-12-29 09:54:00.0 in row-writer path, while it will be written as long value 1483023240000000 in non row-writer path. If enabled, then the timestamp value will be written in both the cases.
Config Param: KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED
Since Version: 0.10.1
hoodie.datasource.write.partitionpath.urlencode false Should we url encode the partition path value, before creating the folder structure.
Config Param: URL_ENCODE_PARTITIONING

Timestamp-based key generator configs

Configs used for TimestampBasedKeyGenerator which relies on timestamps for the partition field. The field values are interpreted as timestamps and not just converted to string while generating partition path value for records. Record key is same as before where it is chosen by field name.

Advanced Configs

Config Name Default Description
hoodie.keygen.timebased.timestamp.type (N/A) Timestamp type of the field, which should be one of the timestamp types supported: UNIX_TIMESTAMP, DATE_STRING, MIXED, EPOCHMILLISECONDS, SCALAR.
Config Param: TIMESTAMP_TYPE_FIELD
hoodie.keygen.datetime.parser.class org.apache.hudi.keygen.parser.HoodieDateTimeParser Date time parser class name.
Config Param: DATE_TIME_PARSER
hoodie.keygen.timebased.input.dateformat Input date format such as yyyy-MM-dd'T'HH:mm:ss.SSSZ.
Config Param: TIMESTAMP_INPUT_DATE_FORMAT
hoodie.keygen.timebased.input.dateformat.list.delimiter.regex , The delimiter for allowed input date format list, usually ,.
Config Param: TIMESTAMP_INPUT_DATE_FORMAT_LIST_DELIMITER_REGEX
hoodie.keygen.timebased.input.timezone UTC Timezone of the input timestamp, such as UTC.
Config Param: TIMESTAMP_INPUT_TIMEZONE_FORMAT
hoodie.keygen.timebased.output.dateformat Output date format such as yyyy-MM-dd'T'HH:mm:ss.SSSZ.
Config Param: TIMESTAMP_OUTPUT_DATE_FORMAT
hoodie.keygen.timebased.output.timezone UTC Timezone of the output timestamp, such as UTC.
Config Param: TIMESTAMP_OUTPUT_TIMEZONE_FORMAT
hoodie.keygen.timebased.timestamp.scalar.time.unit SECONDS When timestamp type SCALAR is used, this specifies the time unit, with allowed unit specified by TimeUnit enums (NANOSECONDS, MICROSECONDS, MILLISECONDS, SECONDS, MINUTES, HOURS, DAYS).
Config Param: INPUT_TIME_UNIT
hoodie.keygen.timebased.timezone UTC Timezone of both input and output timestamp if they are the same, such as UTC. Please use hoodie.keygen.timebased.input.timezone and hoodie.keygen.timebased.output.timezone instead if the input and output timezones are different.
Config Param: TIMESTAMP_TIMEZONE_FORMAT

Index Configs

Configurations that control indexing behavior, which tags incoming records as either inserts or updates to older records.

Common Index Configs

Basic Configs

Config Name Default Description
hoodie.index.type (N/A) org.apache.hudi.index.HoodieIndex$IndexType: Determines how input records are indexed, i.e., looked up based on the key for the location in the existing table. Default is SIMPLE on Spark engine, and INMEMORY on Flink and Java engines. HBASE: uses an external managed Apache HBase table to store record key to location mapping. HBase index is a global index, enforcing key uniqueness across all partitions in the table. INMEMORY: Uses in-memory hashmap in Spark and Java engine and Flink in-memory state in Flink for indexing. BLOOM: Employs bloom filters built out of the record keys, optionally also pruning candidate files using record key ranges. Key uniqueness is enforced inside partitions. GLOBAL_BLOOM: Employs bloom filters built out of the record keys, optionally also pruning candidate files using record key ranges. Key uniqueness is enforced across all partitions in the table. SIMPLE: Performs a lean join of the incoming update/delete records against keys extracted from the table on storage.Key uniqueness is enforced inside partitions. GLOBAL_SIMPLE: Performs a lean join of the incoming update/delete records against keys extracted from the table on storage.Key uniqueness is enforced across all partitions in the table. BUCKET: locates the file group containing the record fast by using bucket hashing, particularly beneficial in large scale. Use hoodie.index.bucket.engine to choose bucket engine type, i.e., how buckets are generated. FLINK_STATE: Internal Config for indexing based on Flink state. RECORD_INDEX: Index which saves the record key to location mappings in the HUDI Metadata Table. Record index is a global index, enforcing key uniqueness across all partitions in the table. Supports sharding to achieve very high scale.
Config Param: INDEX_TYPE

Advanced Configs

Config Name Default Description
hoodie.bucket.index.hash.field (N/A) Index key. It is used to index the record and find its file group. If not set, use record key field as default
Config Param: BUCKET_INDEX_HASH_FIELD
hoodie.bucket.index.max.num.buckets (N/A) Only applies if bucket index engine is consistent hashing. Determine the upper bound of the number of buckets in the hudi table. Bucket resizing cannot be done higher than this max limit.
Config Param: BUCKET_INDEX_MAX_NUM_BUCKETS
Since Version: 0.13.0
hoodie.bucket.index.min.num.buckets (N/A) Only applies if bucket index engine is consistent hashing. Determine the lower bound of the number of buckets in the hudi table. Bucket resizing cannot be done lower than this min limit.
Config Param: BUCKET_INDEX_MIN_NUM_BUCKETS
Since Version: 0.13.0
hoodie.bloom.index.bucketized.checking true Only applies if index type is BLOOM. When true, bucketized bloom filtering is enabled. This reduces skew seen in sort based bloom index lookup
Config Param: BLOOM_INDEX_BUCKETIZED_CHECKING
hoodie.bloom.index.input.storage.level MEMORY_AND_DISK_SER Only applies when #bloomIndexUseCaching is set. Determine what level of persistence is used to cache input RDDs. Refer to org.apache.spark.storage.StorageLevel for different values
Config Param: BLOOM_INDEX_INPUT_STORAGE_LEVEL_VALUE
hoodie.bloom.index.keys.per.bucket 10000000 Only applies if bloomIndexBucketizedChecking is enabled and index type is bloom. This configuration controls the “bucket” size which tracks the number of record-key checks made against a single file and is the unit of work allocated to each partition performing bloom filter lookup. A higher value would amortize the fixed cost of reading a bloom filter to memory.
Config Param: BLOOM_INDEX_KEYS_PER_BUCKET
hoodie.bloom.index.parallelism 0 Only applies if index type is BLOOM. This is the amount of parallelism for index lookup, which involves a shuffle. By default, this is auto computed based on input workload characteristics. If the parallelism is explicitly configured by the user, the user-configured value is used in defining the actual parallelism. If the indexing stage is slow due to the limited parallelism, you can increase this to tune the performance.
Config Param: BLOOM_INDEX_PARALLELISM
hoodie.bloom.index.prune.by.ranges true Only applies if index type is BLOOM. When true, range information from files to leveraged speed up index lookups. Particularly helpful, if the key has a monotonously increasing prefix, such as timestamp. If the record key is completely random, it is better to turn this off, since range pruning will only add extra overhead to the index lookup.
Config Param: BLOOM_INDEX_PRUNE_BY_RANGES
hoodie.bloom.index.update.partition.path true Only applies if index type is GLOBAL_BLOOM. When set to true, an update including the partition path of a record that already exists will result in inserting the incoming record into the new partition and deleting the original record in the old partition. When set to false, the original record will only be updated in the old partition
Config Param: BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE
hoodie.bloom.index.use.caching true Only applies if index type is BLOOM.When true, the input RDD will cached to speed up index lookup by reducing IO for computing parallelism or affected partitions
Config Param: BLOOM_INDEX_USE_CACHING
hoodie.bloom.index.use.metadata false Only applies if index type is BLOOM.When true, the index lookup uses bloom filters and column stats from metadata table when available to speed up the process.
Config Param: BLOOM_INDEX_USE_METADATA
Since Version: 0.11.0
hoodie.bloom.index.use.treebased.filter true Only applies if index type is BLOOM. When true, interval tree based file pruning optimization is enabled. This mode speeds-up file-pruning based on key ranges when compared with the brute-force mode
Config Param: BLOOM_INDEX_TREE_BASED_FILTER
hoodie.bucket.index.merge.threshold 0.2 Control if buckets should be merged when using consistent hashing bucket indexSpecifically, if a file slice size is smaller than hoodie.xxxx.max.file.size * threshold, then it will be consideredas a merge candidate.
Config Param: BUCKET_MERGE_THRESHOLD
Since Version: 0.13.0
hoodie.bucket.index.num.buckets 256 Only applies if index type is BUCKET. Determine the number of buckets in the hudi table, and each partition is divided to N buckets.
Config Param: BUCKET_INDEX_NUM_BUCKETS
hoodie.bucket.index.split.threshold 2.0 Control if the bucket should be split when using consistent hashing bucket index.Specifically, if a file slice size reaches hoodie.xxxx.max.file.size * threshold, then split will be carried out.
Config Param: BUCKET_SPLIT_THRESHOLD
Since Version: 0.13.0
hoodie.global.index.reconcile.parallelism 60 Only applies if index type is GLOBAL_BLOOM or GLOBAL_SIMPLE. This controls the parallelism for deduplication during indexing where more than 1 record could be tagged due to partition update.
Config Param: GLOBAL_INDEX_RECONCILE_PARALLELISM
hoodie.global.simple.index.parallelism 100 Only applies if index type is GLOBAL_SIMPLE. This limits the parallelism of fetching records from the base files of all table partitions. The index picks the configured parallelism if the number of base files is larger than this configured value; otherwise, the number of base files is used as the parallelism. If the indexing stage is slow due to the limited parallelism, you can increase this to tune the performance.
Config Param: GLOBAL_SIMPLE_INDEX_PARALLELISM
hoodie.index.bucket.engine SIMPLE org.apache.hudi.index.HoodieIndex$BucketIndexEngineType: Determines the type of bucketing or hashing to use when hoodie.index.type is set to BUCKET. SIMPLE(default): Uses a fixed number of buckets for file groups which cannot shrink or expand. This works for both COW and MOR tables. CONSISTENT_HASHING: Supports dynamic number of buckets with bucket resizing to properly size each bucket. This solves potential data skew problem where one bucket can be significantly larger than others in SIMPLE engine type. This only works with MOR tables.
Config Param: BUCKET_INDEX_ENGINE_TYPE
Since Version: 0.11.0
hoodie.index.class Full path of user-defined index class and must be a subclass of HoodieIndex class. It will take precedence over the hoodie.index.type configuration if specified
Config Param: INDEX_CLASS_NAME
hoodie.record.index.input.storage.level MEMORY_AND_DISK_SER Only applies when #recordIndexUseCaching is set. Determine what level of persistence is used to cache input RDDs. Refer to org.apache.spark.storage.StorageLevel for different values
Config Param: RECORD_INDEX_INPUT_STORAGE_LEVEL_VALUE
Since Version: 0.14.0
hoodie.record.index.update.partition.path false Similar to Key: ‘hoodie.bloom.index.update.partition.path’ , default: true , isAdvanced: true , description: Only applies if index type is GLOBAL_BLOOM. When set to true, an update including the partition path of a record that already exists will result in inserting the incoming record into the new partition and deleting the original record in the old partition. When set to false, the original record will only be updated in the old partition since version: version is not defined deprecated after: version is not defined), but for record index.
Config Param: RECORD_INDEX_UPDATE_PARTITION_PATH_ENABLE
Since Version: 0.14.0
hoodie.record.index.use.caching true Only applies if index type is RECORD_INDEX.When true, the input RDD will be cached to speed up index lookup by reducing IO for computing parallelism or affected partitions
Config Param: RECORD_INDEX_USE_CACHING
Since Version: 0.14.0
hoodie.simple.index.input.storage.level MEMORY_AND_DISK_SER Only applies when #simpleIndexUseCaching is set. Determine what level of persistence is used to cache input RDDs. Refer to org.apache.spark.storage.StorageLevel for different values
Config Param: SIMPLE_INDEX_INPUT_STORAGE_LEVEL_VALUE
hoodie.simple.index.parallelism 0 Only applies if index type is SIMPLE. This limits the parallelism of fetching records from the base files of affected partitions. By default, this is auto computed based on input workload characteristics. If the parallelism is explicitly configured by the user, the user-configured value is used in defining the actual parallelism. If the indexing stage is slow due to the limited parallelism, you can increase this to tune the performance.
Config Param: SIMPLE_INDEX_PARALLELISM
hoodie.simple.index.update.partition.path true Similar to Key: ‘hoodie.bloom.index.update.partition.path’ , default: true , isAdvanced: true , description: Only applies if index type is GLOBAL_BLOOM. When set to true, an update including the partition path of a record that already exists will result in inserting the incoming record into the new partition and deleting the original record in the old partition. When set to false, the original record will only be updated in the old partition since version: version is not defined deprecated after: version is not defined), but for simple index.
Config Param: SIMPLE_INDEX_UPDATE_PARTITION_PATH_ENABLE
hoodie.simple.index.use.caching true Only applies if index type is SIMPLE. When true, the incoming writes will cached to speed up index lookup by reducing IO for computing parallelism or affected partitions
Config Param: SIMPLE_INDEX_USE_CACHING

HBase Index Configs

Configurations that control indexing behavior (when HBase based indexing is enabled), which tags incoming records as either inserts or updates to older records.

Advanced Configs

Config Name Default Description
hoodie.index.hbase.kerberos.user.keytab (N/A) File name of the kerberos keytab file for connecting to the hbase cluster.
Config Param: KERBEROS_USER_KEYTAB
hoodie.index.hbase.kerberos.user.principal (N/A) The kerberos principal name for connecting to the hbase cluster.
Config Param: KERBEROS_USER_PRINCIPAL
hoodie.index.hbase.master.kerberos.principal (N/A) The value of hbase.master.kerberos.principal in hbase cluster.
Config Param: MASTER_PRINCIPAL
hoodie.index.hbase.max.qps.fraction (N/A) Maximum for HBASE_QPS_FRACTION_PROP to stabilize skewed write workloads
Config Param: MAX_QPS_FRACTION
hoodie.index.hbase.min.qps.fraction (N/A) Minimum for HBASE_QPS_FRACTION_PROP to stabilize skewed write workloads
Config Param: MIN_QPS_FRACTION
hoodie.index.hbase.regionserver.kerberos.principal (N/A) The value of hbase.regionserver.kerberos.principal in hbase cluster.
Config Param: REGIONSERVER_PRINCIPAL
hoodie.index.hbase.sleep.ms.for.get.batch (N/A)
Config Param: SLEEP_MS_FOR_GET_BATCH
hoodie.index.hbase.sleep.ms.for.put.batch (N/A)
Config Param: SLEEP_MS_FOR_PUT_BATCH
hoodie.index.hbase.table (N/A) Only applies if index type is HBASE. HBase Table name to use as the index. Hudi stores the row_key and [partition_path, fileID, commitTime] mapping in the table
Config Param: TABLENAME
hoodie.index.hbase.zknode.path (N/A) Only applies if index type is HBASE. This is the root znode that will contain all the znodes created/used by HBase
Config Param: ZK_NODE_PATH
hoodie.index.hbase.zkport (N/A) Only applies if index type is HBASE. HBase ZK Quorum port to connect to
Config Param: ZKPORT
hoodie.index.hbase.zkquorum (N/A) Only applies if index type is HBASE. HBase ZK Quorum url to connect to
Config Param: ZKQUORUM
hoodie.hbase.index.update.partition.path false Only applies if index type is HBASE. When an already existing record is upserted to a new partition compared to whats in storage, this config when set, will delete old record in old partition and will insert it as new record in new partition.
Config Param: UPDATE_PARTITION_PATH_ENABLE
hoodie.index.hbase.bucket.number 8 Only applicable when using RebalancedSparkHoodieHBaseIndex, same as hbase regions count can get the best performance
Config Param: BUCKET_NUMBER
hoodie.index.hbase.desired_puts_time_in_secs 600
Config Param: DESIRED_PUTS_TIME_IN_SECONDS
hoodie.index.hbase.dynamic_qps false Property to decide if HBASE_QPS_FRACTION_PROP is dynamically calculated based on write volume.
Config Param: COMPUTE_QPS_DYNAMICALLY
hoodie.index.hbase.get.batch.size 100 Controls the batch size for performing gets against HBase. Batching improves throughput, by saving round trips.
Config Param: GET_BATCH_SIZE
hoodie.index.hbase.max.qps.per.region.server 1000 Property to set maximum QPS allowed per Region Server. This should be same across various jobs. This is intended to limit the aggregate QPS generated across various jobs to an Hbase Region Server. It is recommended to set this value based on global indexing throughput needs and most importantly, how much the HBase installation in use is able to tolerate without Region Servers going down.
Config Param: MAX_QPS_PER_REGION_SERVER
hoodie.index.hbase.put.batch.size 100 Controls the batch size for performing puts against HBase. Batching improves throughput, by saving round trips.
Config Param: PUT_BATCH_SIZE
hoodie.index.hbase.put.batch.size.autocompute false Property to set to enable auto computation of put batch size
Config Param: PUT_BATCH_SIZE_AUTO_COMPUTE
hoodie.index.hbase.qps.allocator.class org.apache.hudi.index.hbase.DefaultHBaseQPSResourceAllocator Property to set which implementation of HBase QPS resource allocator to be used, whichcontrols the batching rate dynamically.
Config Param: QPS_ALLOCATOR_CLASS_NAME
hoodie.index.hbase.qps.fraction 0.5 Property to set the fraction of the global share of QPS that should be allocated to this job. Let’s say there are 3 jobs which have input size in terms of number of rows required for HbaseIndexing as x, 2x, 3x respectively. Then this fraction for the jobs would be (0.17) 1/6, 0.33 (2/6) and 0.5 (3/6) respectively. Default is 50%, which means a total of 2 jobs can run using HbaseIndex without overwhelming Region Servers.
Config Param: QPS_FRACTION
hoodie.index.hbase.rollback.sync false When set to true, the rollback method will delete the last failed task index. The default value is false. Because deleting the index will add extra load on the Hbase cluster for each rollback
Config Param: ROLLBACK_SYNC_ENABLE
hoodie.index.hbase.security.authentication simple Property to decide if the hbase cluster secure authentication is enabled or not. Possible values are ‘simple’ (no authentication), and ‘kerberos’.
Config Param: SECURITY_AUTHENTICATION
hoodie.index.hbase.zk.connection_timeout_ms 15000 Timeout to use for establishing connection with zookeeper, from HBase client.
Config Param: ZK_CONNECTION_TIMEOUT_MS
hoodie.index.hbase.zk.session_timeout_ms 60000 Session timeout value to use for Zookeeper failure detection, for the HBase client.Lower this value, if you want to fail faster.
Config Param: ZK_SESSION_TIMEOUT_MS
hoodie.index.hbase.zkpath.qps_root /QPS_ROOT chroot in zookeeper, to use for all qps allocation co-ordination.
Config Param: ZKPATH_QPS_ROOT

Metastore and Catalog Sync Configs

Configurations used by the Hudi to sync metadata to external metastores and catalogs.

Common Metadata Sync Configs

Basic Configs

Config Name Default Description
hoodie.datasource.meta.sync.enable false Enable Syncing the Hudi Table with an external meta store or data catalog.
Config Param: META_SYNC_ENABLED

Advanced Configs

Config Name Default Description
hoodie.datasource.hive_sync.assume_date_partitioning false Assume partitioning is yyyy/MM/dd
Config Param: META_SYNC_ASSUME_DATE_PARTITION
hoodie.datasource.hive_sync.base_file_format PARQUET Base file format for the sync.
Config Param: META_SYNC_BASE_FILE_FORMAT
hoodie.datasource.hive_sync.database default The name of the destination database that we should sync the hudi table to.
Config Param: META_SYNC_DATABASE_NAME
hoodie.datasource.hive_sync.partition_extractor_class org.apache.hudi.hive.MultiPartKeysValueExtractor Class which implements PartitionValueExtractor to extract the partition values, default ‘org.apache.hudi.hive.MultiPartKeysValueExtractor’.
Config Param: META_SYNC_PARTITION_EXTRACTOR_CLASS
hoodie.datasource.hive_sync.partition_fields Field in the table to use for determining hive partition columns.
Config Param: META_SYNC_PARTITION_FIELDS
hoodie.datasource.hive_sync.table unknown The name of the destination table that we should sync the hudi table to.
Config Param: META_SYNC_TABLE_NAME
hoodie.datasource.meta.sync.base.path Base path of the hoodie table to sync
Config Param: META_SYNC_BASE_PATH
hoodie.datasource.meta_sync.condition.sync false If true, only sync on conditions like schema change or partition change.
Config Param: META_SYNC_CONDITIONAL_SYNC
hoodie.meta.sync.decode_partition false If true, meta sync will url-decode the partition path, as it is deemed as url-encoded. Default to false.
Config Param: META_SYNC_DECODE_PARTITION
hoodie.meta.sync.incremental true Whether to incrementally sync the partitions to the metastore, i.e., only added, changed, and deleted partitions based on the commit metadata. If set to false, the meta sync executes a full partition sync operation when partitions are lost.
Config Param: META_SYNC_INCREMENTAL
Since Version: 0.14.0
hoodie.meta.sync.metadata_file_listing false Enable the internal metadata table for file listing for syncing with metastores
Config Param: META_SYNC_USE_FILE_LISTING_FROM_METADATA
hoodie.meta.sync.sync_snapshot_with_table_name true sync meta info to origin table if enable
Config Param: META_SYNC_SNAPSHOT_WITH_TABLE_NAME
Since Version: 0.14.0
hoodie.meta_sync.spark.version The spark version used when syncing with a metastore.
Config Param: META_SYNC_SPARK_VERSION

Glue catalog sync based client Configurations

Configs that control Glue catalog sync based client.

Basic Configs

Config Name Default Description
hoodie.datasource.meta.sync.glue.partition_index_fields Specify the partitions fields to index on aws glue. Separate the fields by semicolon. By default, when the feature is enabled, all the partition will be indexed. You can create up to three indexes, separate them by comma. Eg: col1;col2;col3,col2,col3
Config Param: META_SYNC_PARTITION_INDEX_FIELDS
Since Version: 0.15.0
hoodie.datasource.meta.sync.glue.partition_index_fields.enable false Enable aws glue partition index feature, to speedup partition based query pattern
Config Param: META_SYNC_PARTITION_INDEX_FIELDS_ENABLE
Since Version: 0.15.0

Advanced Configs

Config Name Default Description
hoodie.datasource.meta.sync.glue.all_partitions_read_parallelism 1 Parallelism for listing all partitions(first time sync). Should be in interval [1, 10].
Config Param: ALL_PARTITIONS_READ_PARALLELISM
Since Version: 0.15.0
hoodie.datasource.meta.sync.glue.changed_partitions_read_parallelism 1 Parallelism for listing changed partitions(second and subsequent syncs).
Config Param: CHANGED_PARTITIONS_READ_PARALLELISM
Since Version: 0.15.0
hoodie.datasource.meta.sync.glue.metadata_file_listing false Makes athena use the metadata table to list partitions and files. Currently it won’t benefit from other features such stats indexes
Config Param: GLUE_METADATA_FILE_LISTING
Since Version: 0.14.0
hoodie.datasource.meta.sync.glue.partition_change_parallelism 1 Parallelism for change operations - such as create/update/delete.
Config Param: PARTITION_CHANGE_PARALLELISM
Since Version: 0.15.0
hoodie.datasource.meta.sync.glue.skip_table_archive true Glue catalog sync based client will skip archiving the table version if this config is set to true
Config Param: GLUE_SKIP_TABLE_ARCHIVE
Since Version: 0.14.0

BigQuery Sync Configs

Configurations used by the Hudi to sync metadata to Google BigQuery.

Basic Configs

Config Name Default Description
hoodie.datasource.meta.sync.enable false Enable Syncing the Hudi Table with an external meta store or data catalog.
Config Param: META_SYNC_ENABLED

Advanced Configs

Config Name Default Description
hoodie.gcp.bigquery.sync.big_lake_connection_id (N/A) The Big Lake connection ID to use
Config Param: BIGQUERY_SYNC_BIG_LAKE_CONNECTION_ID
Since Version: 0.14.1
hoodie.gcp.bigquery.sync.dataset_location (N/A) Location of the target dataset in BigQuery
Config Param: BIGQUERY_SYNC_DATASET_LOCATION
hoodie.gcp.bigquery.sync.project_id (N/A) Name of the target project in BigQuery
Config Param: BIGQUERY_SYNC_PROJECT_ID
hoodie.gcp.bigquery.sync.source_uri (N/A) Name of the source uri gcs path of the table
Config Param: BIGQUERY_SYNC_SOURCE_URI
hoodie.gcp.bigquery.sync.source_uri_prefix (N/A) Name of the source uri gcs path prefix of the table
Config Param: BIGQUERY_SYNC_SOURCE_URI_PREFIX
hoodie.datasource.hive_sync.assume_date_partitioning false Assume partitioning is yyyy/MM/dd
Config Param: META_SYNC_ASSUME_DATE_PARTITION
hoodie.datasource.hive_sync.base_file_format PARQUET Base file format for the sync.
Config Param: META_SYNC_BASE_FILE_FORMAT
hoodie.datasource.hive_sync.database default The name of the destination database that we should sync the hudi table to.
Config Param: META_SYNC_DATABASE_NAME
hoodie.datasource.hive_sync.partition_extractor_class org.apache.hudi.hive.MultiPartKeysValueExtractor Class which implements PartitionValueExtractor to extract the partition values, default ‘org.apache.hudi.hive.MultiPartKeysValueExtractor’.
Config Param: META_SYNC_PARTITION_EXTRACTOR_CLASS
hoodie.datasource.hive_sync.partition_fields Field in the table to use for determining hive partition columns.
Config Param: META_SYNC_PARTITION_FIELDS
hoodie.datasource.hive_sync.table unknown The name of the destination table that we should sync the hudi table to.
Config Param: META_SYNC_TABLE_NAME
hoodie.datasource.meta.sync.base.path Base path of the hoodie table to sync
Config Param: META_SYNC_BASE_PATH
hoodie.datasource.meta_sync.condition.sync false If true, only sync on conditions like schema change or partition change.
Config Param: META_SYNC_CONDITIONAL_SYNC
hoodie.gcp.bigquery.sync.assume_date_partitioning false Assume standard yyyy/mm/dd partitioning, this exists to support backward compatibility. If you use hoodie 0.3.x, do not set this parameter
Config Param: BIGQUERY_SYNC_ASSUME_DATE_PARTITIONING
hoodie.gcp.bigquery.sync.base_path Base path of the hoodie table to sync
Config Param: BIGQUERY_SYNC_SYNC_BASE_PATH
hoodie.gcp.bigquery.sync.dataset_name Name of the target dataset in BigQuery
Config Param: BIGQUERY_SYNC_DATASET_NAME
hoodie.gcp.bigquery.sync.partition_fields Comma-delimited partition fields. Default to non-partitioned.
Config Param: BIGQUERY_SYNC_PARTITION_FIELDS
hoodie.gcp.bigquery.sync.require_partition_filter false If true, configure table to require a partition filter to be specified when querying the table
Config Param: BIGQUERY_SYNC_REQUIRE_PARTITION_FILTER
Since Version: 0.14.1
hoodie.gcp.bigquery.sync.table_name Name of the target table in BigQuery
Config Param: BIGQUERY_SYNC_TABLE_NAME
hoodie.gcp.bigquery.sync.use_bq_manifest_file false If true, generate a manifest file with data file absolute paths and use BigQuery manifest file support to directly create one external table over the Hudi table. If false (default), generate a manifest file with data file names and create two external tables and one view in BigQuery. Query the view for the same results as querying the Hudi table
Config Param: BIGQUERY_SYNC_USE_BQ_MANIFEST_FILE
Since Version: 0.14.0
hoodie.gcp.bigquery.sync.use_file_listing_from_metadata false Fetch file listing from Hudi’s metadata
Config Param: BIGQUERY_SYNC_USE_FILE_LISTING_FROM_METADATA
hoodie.meta.sync.decode_partition false If true, meta sync will url-decode the partition path, as it is deemed as url-encoded. Default to false.
Config Param: META_SYNC_DECODE_PARTITION
hoodie.meta.sync.incremental true Whether to incrementally sync the partitions to the metastore, i.e., only added, changed, and deleted partitions based on the commit metadata. If set to false, the meta sync executes a full partition sync operation when partitions are lost.
Config Param: META_SYNC_INCREMENTAL
Since Version: 0.14.0
hoodie.meta.sync.metadata_file_listing false Enable the internal metadata table for file listing for syncing with metastores
Config Param: META_SYNC_USE_FILE_LISTING_FROM_METADATA
hoodie.meta.sync.sync_snapshot_with_table_name true sync meta info to origin table if enable
Config Param: META_SYNC_SNAPSHOT_WITH_TABLE_NAME
Since Version: 0.14.0
hoodie.meta_sync.spark.version The spark version used when syncing with a metastore.
Config Param: META_SYNC_SPARK_VERSION

Hive Sync Configs

Configurations used by the Hudi to sync metadata to Hive Metastore.

Basic Configs

Config Name Default Description
hoodie.datasource.hive_sync.mode (N/A) Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.
Config Param: HIVE_SYNC_MODE
hoodie.datasource.hive_sync.enable false When set to true, register/sync the table to Apache Hive metastore.
Config Param: HIVE_SYNC_ENABLED
hoodie.datasource.hive_sync.jdbcurl jdbc:hive2://localhost:10000 Hive metastore url
Config Param: HIVE_URL
hoodie.datasource.hive_sync.metastore.uris thrift://localhost:9083 Hive metastore url
Config Param: METASTORE_URIS
hoodie.datasource.meta.sync.enable false Enable Syncing the Hudi Table with an external meta store or data catalog.
Config Param: META_SYNC_ENABLED

Advanced Configs

Config Name Default Description
hoodie.datasource.hive_sync.serde_properties (N/A) Serde properties to hive table.
Config Param: HIVE_TABLE_SERDE_PROPERTIES
hoodie.datasource.hive_sync.table_properties (N/A) Additional properties to store with table.
Config Param: HIVE_TABLE_PROPERTIES
hoodie.datasource.hive_sync.assume_date_partitioning false Assume partitioning is yyyy/MM/dd
Config Param: META_SYNC_ASSUME_DATE_PARTITION
hoodie.datasource.hive_sync.auto_create_database true Auto create hive database if does not exists
Config Param: HIVE_AUTO_CREATE_DATABASE
hoodie.datasource.hive_sync.base_file_format PARQUET Base file format for the sync.
Config Param: META_SYNC_BASE_FILE_FORMAT
hoodie.datasource.hive_sync.batch_num 1000 The number of partitions one batch when synchronous partitions to hive.
Config Param: HIVE_BATCH_SYNC_PARTITION_NUM
hoodie.datasource.hive_sync.bucket_sync false Whether sync hive metastore bucket specification when using bucket index.The specification is ‘CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS’
Config Param: HIVE_SYNC_BUCKET_SYNC
hoodie.datasource.hive_sync.bucket_sync_spec The hive metastore bucket specification when using bucket index.The specification is ‘CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS’
Config Param: HIVE_SYNC_BUCKET_SYNC_SPEC
hoodie.datasource.hive_sync.create_managed_table false Whether to sync the table as managed table.
Config Param: HIVE_CREATE_MANAGED_TABLE
hoodie.datasource.hive_sync.database default The name of the destination database that we should sync the hudi table to.
Config Param: META_SYNC_DATABASE_NAME
hoodie.datasource.hive_sync.filter_pushdown_enabled false Whether to enable push down partitions by filter
Config Param: HIVE_SYNC_FILTER_PUSHDOWN_ENABLED
hoodie.datasource.hive_sync.filter_pushdown_max_size 1000 Max size limit to push down partition filters, if the estimate push down filters exceed this size, will directly try to fetch all partitions between the min/max.In case of glue metastore, this value should be reduced because it has a filter length limit.
Config Param: HIVE_SYNC_FILTER_PUSHDOWN_MAX_SIZE
hoodie.datasource.hive_sync.ignore_exceptions false Ignore exceptions when syncing with Hive.
Config Param: HIVE_IGNORE_EXCEPTIONS
hoodie.datasource.hive_sync.omit_metadata_fields false Whether to omit the hoodie metadata fields in the target table.
Config Param: HIVE_SYNC_OMIT_METADATA_FIELDS
Since Version: 0.13.0
hoodie.datasource.hive_sync.partition_extractor_class org.apache.hudi.hive.MultiPartKeysValueExtractor Class which implements PartitionValueExtractor to extract the partition values, default ‘org.apache.hudi.hive.MultiPartKeysValueExtractor’.
Config Param: META_SYNC_PARTITION_EXTRACTOR_CLASS
hoodie.datasource.hive_sync.partition_fields Field in the table to use for determining hive partition columns.
Config Param: META_SYNC_PARTITION_FIELDS
hoodie.datasource.hive_sync.password hive hive password to use
Config Param: HIVE_PASS
hoodie.datasource.hive_sync.schema_string_length_thresh 4000
Config Param: HIVE_SYNC_SCHEMA_STRING_LENGTH_THRESHOLD
hoodie.datasource.hive_sync.skip_ro_suffix false Skip the _ro suffix for Read optimized table, when registering
Config Param: HIVE_SKIP_RO_SUFFIX_FOR_READ_OPTIMIZED_TABLE
hoodie.datasource.hive_sync.support_timestamp false ‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. Disabled by default for backward compatibility. NOTE: On Spark entrypoints, this is defaulted to TRUE
Config Param: HIVE_SUPPORT_TIMESTAMP_TYPE
hoodie.datasource.hive_sync.sync_as_datasource true
Config Param: HIVE_SYNC_AS_DATA_SOURCE_TABLE
hoodie.datasource.hive_sync.sync_comment false Whether to sync the table column comments while syncing the table.
Config Param: HIVE_SYNC_COMMENT
hoodie.datasource.hive_sync.table unknown The name of the destination table that we should sync the hudi table to.
Config Param: META_SYNC_TABLE_NAME
hoodie.datasource.hive_sync.table.strategy ALL Hive table synchronization strategy. Available option: RO, RT, ALL.
Config Param: HIVE_SYNC_TABLE_STRATEGY
Since Version: 0.13.0
hoodie.datasource.hive_sync.use_jdbc true Use JDBC when hive synchronization is enabled
Config Param: HIVE_USE_JDBC
hoodie.datasource.hive_sync.use_pre_apache_input_format false Flag to choose InputFormat under com.uber.hoodie package instead of org.apache.hudi package. Use this when you are in the process of migrating from com.uber.hoodie to org.apache.hudi. Stop using this after you migrated the table definition to org.apache.hudi input format
Config Param: HIVE_USE_PRE_APACHE_INPUT_FORMAT
hoodie.datasource.hive_sync.username hive hive user name to use
Config Param: HIVE_USER
hoodie.datasource.meta.sync.base.path Base path of the hoodie table to sync
Config Param: META_SYNC_BASE_PATH
hoodie.datasource.meta_sync.condition.sync false If true, only sync on conditions like schema change or partition change.
Config Param: META_SYNC_CONDITIONAL_SYNC
hoodie.meta.sync.decode_partition false If true, meta sync will url-decode the partition path, as it is deemed as url-encoded. Default to false.
Config Param: META_SYNC_DECODE_PARTITION
hoodie.meta.sync.incremental true Whether to incrementally sync the partitions to the metastore, i.e., only added, changed, and deleted partitions based on the commit metadata. If set to false, the meta sync executes a full partition sync operation when partitions are lost.
Config Param: META_SYNC_INCREMENTAL
Since Version: 0.14.0
hoodie.meta.sync.metadata_file_listing false Enable the internal metadata table for file listing for syncing with metastores
Config Param: META_SYNC_USE_FILE_LISTING_FROM_METADATA
hoodie.meta.sync.sync_snapshot_with_table_name true sync meta info to origin table if enable
Config Param: META_SYNC_SNAPSHOT_WITH_TABLE_NAME
Since Version: 0.14.0
hoodie.meta_sync.spark.version The spark version used when syncing with a metastore.
Config Param: META_SYNC_SPARK_VERSION

Global Hive Sync Configs

Global replication configurations used by the Hudi to sync metadata to Hive Metastore.

Basic Configs

Config Name Default Description
hoodie.datasource.hive_sync.mode (N/A) Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.
Config Param: HIVE_SYNC_MODE
hoodie.datasource.hive_sync.enable false When set to true, register/sync the table to Apache Hive metastore.
Config Param: HIVE_SYNC_ENABLED
hoodie.datasource.hive_sync.jdbcurl jdbc:hive2://localhost:10000 Hive metastore url
Config Param: HIVE_URL
hoodie.datasource.hive_sync.metastore.uris thrift://localhost:9083 Hive metastore url
Config Param: METASTORE_URIS
hoodie.datasource.meta.sync.enable false Enable Syncing the Hudi Table with an external meta store or data catalog.
Config Param: META_SYNC_ENABLED

Advanced Configs

Config Name Default Description
hoodie.datasource.hive_sync.serde_properties (N/A) Serde properties to hive table.
Config Param: HIVE_TABLE_SERDE_PROPERTIES
hoodie.datasource.hive_sync.table_properties (N/A) Additional properties to store with table.
Config Param: HIVE_TABLE_PROPERTIES
hoodie.meta_sync.global.replicate.timestamp (N/A)
Config Param: META_SYNC_GLOBAL_REPLICATE_TIMESTAMP
hoodie.datasource.hive_sync.assume_date_partitioning false Assume partitioning is yyyy/MM/dd
Config Param: META_SYNC_ASSUME_DATE_PARTITION
hoodie.datasource.hive_sync.auto_create_database true Auto create hive database if does not exists
Config Param: HIVE_AUTO_CREATE_DATABASE
hoodie.datasource.hive_sync.base_file_format PARQUET Base file format for the sync.
Config Param: META_SYNC_BASE_FILE_FORMAT
hoodie.datasource.hive_sync.batch_num 1000 The number of partitions one batch when synchronous partitions to hive.
Config Param: HIVE_BATCH_SYNC_PARTITION_NUM
hoodie.datasource.hive_sync.bucket_sync false Whether sync hive metastore bucket specification when using bucket index.The specification is ‘CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS’
Config Param: HIVE_SYNC_BUCKET_SYNC
hoodie.datasource.hive_sync.bucket_sync_spec The hive metastore bucket specification when using bucket index.The specification is ‘CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS’
Config Param: HIVE_SYNC_BUCKET_SYNC_SPEC
hoodie.datasource.hive_sync.create_managed_table false Whether to sync the table as managed table.
Config Param: HIVE_CREATE_MANAGED_TABLE
hoodie.datasource.hive_sync.database default The name of the destination database that we should sync the hudi table to.
Config Param: META_SYNC_DATABASE_NAME
hoodie.datasource.hive_sync.filter_pushdown_enabled false Whether to enable push down partitions by filter
Config Param: HIVE_SYNC_FILTER_PUSHDOWN_ENABLED
hoodie.datasource.hive_sync.filter_pushdown_max_size 1000 Max size limit to push down partition filters, if the estimate push down filters exceed this size, will directly try to fetch all partitions between the min/max.In case of glue metastore, this value should be reduced because it has a filter length limit.
Config Param: HIVE_SYNC_FILTER_PUSHDOWN_MAX_SIZE
hoodie.datasource.hive_sync.ignore_exceptions false Ignore exceptions when syncing with Hive.
Config Param: HIVE_IGNORE_EXCEPTIONS
hoodie.datasource.hive_sync.omit_metadata_fields false Whether to omit the hoodie metadata fields in the target table.
Config Param: HIVE_SYNC_OMIT_METADATA_FIELDS
Since Version: 0.13.0
hoodie.datasource.hive_sync.partition_extractor_class org.apache.hudi.hive.MultiPartKeysValueExtractor Class which implements PartitionValueExtractor to extract the partition values, default ‘org.apache.hudi.hive.MultiPartKeysValueExtractor’.
Config Param: META_SYNC_PARTITION_EXTRACTOR_CLASS
hoodie.datasource.hive_sync.partition_fields Field in the table to use for determining hive partition columns.
Config Param: META_SYNC_PARTITION_FIELDS
hoodie.datasource.hive_sync.password hive hive password to use
Config Param: HIVE_PASS
hoodie.datasource.hive_sync.schema_string_length_thresh 4000
Config Param: HIVE_SYNC_SCHEMA_STRING_LENGTH_THRESHOLD
hoodie.datasource.hive_sync.skip_ro_suffix false Skip the _ro suffix for Read optimized table, when registering
Config Param: HIVE_SKIP_RO_SUFFIX_FOR_READ_OPTIMIZED_TABLE
hoodie.datasource.hive_sync.support_timestamp false ‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. Disabled by default for backward compatibility. NOTE: On Spark entrypoints, this is defaulted to TRUE
Config Param: HIVE_SUPPORT_TIMESTAMP_TYPE
hoodie.datasource.hive_sync.sync_as_datasource true
Config Param: HIVE_SYNC_AS_DATA_SOURCE_TABLE
hoodie.datasource.hive_sync.sync_comment false Whether to sync the table column comments while syncing the table.
Config Param: HIVE_SYNC_COMMENT
hoodie.datasource.hive_sync.table unknown The name of the destination table that we should sync the hudi table to.
Config Param: META_SYNC_TABLE_NAME
hoodie.datasource.hive_sync.table.strategy ALL Hive table synchronization strategy. Available option: RO, RT, ALL.
Config Param: HIVE_SYNC_TABLE_STRATEGY
Since Version: 0.13.0
hoodie.datasource.hive_sync.use_jdbc true Use JDBC when hive synchronization is enabled
Config Param: HIVE_USE_JDBC
hoodie.datasource.hive_sync.use_pre_apache_input_format false Flag to choose InputFormat under com.uber.hoodie package instead of org.apache.hudi package. Use this when you are in the process of migrating from com.uber.hoodie to org.apache.hudi. Stop using this after you migrated the table definition to org.apache.hudi input format
Config Param: HIVE_USE_PRE_APACHE_INPUT_FORMAT
hoodie.datasource.hive_sync.username hive hive user name to use
Config Param: HIVE_USER
hoodie.datasource.meta.sync.base.path Base path of the hoodie table to sync
Config Param: META_SYNC_BASE_PATH
hoodie.datasource.meta_sync.condition.sync false If true, only sync on conditions like schema change or partition change.
Config Param: META_SYNC_CONDITIONAL_SYNC
hoodie.meta.sync.decode_partition false If true, meta sync will url-decode the partition path, as it is deemed as url-encoded. Default to false.
Config Param: META_SYNC_DECODE_PARTITION
hoodie.meta.sync.incremental true Whether to incrementally sync the partitions to the metastore, i.e., only added, changed, and deleted partitions based on the commit metadata. If set to false, the meta sync executes a full partition sync operation when partitions are lost.
Config Param: META_SYNC_INCREMENTAL
Since Version: 0.14.0
hoodie.meta.sync.metadata_file_listing false Enable the internal metadata table for file listing for syncing with metastores
Config Param: META_SYNC_USE_FILE_LISTING_FROM_METADATA
hoodie.meta.sync.sync_snapshot_with_table_name true sync meta info to origin table if enable
Config Param: META_SYNC_SNAPSHOT_WITH_TABLE_NAME
Since Version: 0.14.0
hoodie.meta_sync.spark.version The spark version used when syncing with a metastore.
Config Param: META_SYNC_SPARK_VERSION

DataHub Sync Configs

Configurations used by the Hudi to sync metadata to DataHub.

Basic Configs

Config Name Default Description
hoodie.datasource.meta.sync.enable false Enable Syncing the Hudi Table with an external meta store or data catalog.
Config Param: META_SYNC_ENABLED

Advanced Configs

Config Name Default Description
hoodie.meta.sync.datahub.emitter.server (N/A) Server URL of the DataHub instance.
Config Param: META_SYNC_DATAHUB_EMITTER_SERVER
hoodie.meta.sync.datahub.emitter.supplier.class (N/A) Pluggable class to supply a DataHub REST emitter to connect to the DataHub instance. This overwrites other emitter configs.
Config Param: META_SYNC_DATAHUB_EMITTER_SUPPLIER_CLASS
hoodie.meta.sync.datahub.emitter.token (N/A) Auth token to connect to the DataHub instance.
Config Param: META_SYNC_DATAHUB_EMITTER_TOKEN
hoodie.datasource.hive_sync.assume_date_partitioning false Assume partitioning is yyyy/MM/dd
Config Param: META_SYNC_ASSUME_DATE_PARTITION
hoodie.datasource.hive_sync.base_file_format PARQUET Base file format for the sync.
Config Param: META_SYNC_BASE_FILE_FORMAT
hoodie.datasource.hive_sync.database default The name of the destination database that we should sync the hudi table to.
Config Param: META_SYNC_DATABASE_NAME
hoodie.datasource.hive_sync.partition_extractor_class org.apache.hudi.hive.MultiPartKeysValueExtractor Class which implements PartitionValueExtractor to extract the partition values, default ‘org.apache.hudi.hive.MultiPartKeysValueExtractor’.
Config Param: META_SYNC_PARTITION_EXTRACTOR_CLASS
hoodie.datasource.hive_sync.partition_fields Field in the table to use for determining hive partition columns.
Config Param: META_SYNC_PARTITION_FIELDS
hoodie.datasource.hive_sync.table unknown The name of the destination table that we should sync the hudi table to.
Config Param: META_SYNC_TABLE_NAME
hoodie.datasource.meta.sync.base.path Base path of the hoodie table to sync
Config Param: META_SYNC_BASE_PATH
hoodie.datasource.meta_sync.condition.sync false If true, only sync on conditions like schema change or partition change.
Config Param: META_SYNC_CONDITIONAL_SYNC
hoodie.meta.sync.datahub.dataplatform.name hudi String used to represent Hudi when creating its corresponding DataPlatform entity within Datahub
Config Param: META_SYNC_DATAHUB_DATAPLATFORM_NAME
hoodie.meta.sync.datahub.dataset.env DEV Environment to use when pushing entities to Datahub
Config Param: META_SYNC_DATAHUB_DATASET_ENV
hoodie.meta.sync.datahub.dataset.identifier.class org.apache.hudi.sync.datahub.config.HoodieDataHubDatasetIdentifier Pluggable class to help provide info to identify a DataHub Dataset.
Config Param: META_SYNC_DATAHUB_DATASET_IDENTIFIER_CLASS
hoodie.meta.sync.decode_partition false If true, meta sync will url-decode the partition path, as it is deemed as url-encoded. Default to false.
Config Param: META_SYNC_DECODE_PARTITION
hoodie.meta.sync.incremental true Whether to incrementally sync the partitions to the metastore, i.e., only added, changed, and deleted partitions based on the commit metadata. If set to false, the meta sync executes a full partition sync operation when partitions are lost.
Config Param: META_SYNC_INCREMENTAL
Since Version: 0.14.0
hoodie.meta.sync.metadata_file_listing false Enable the internal metadata table for file listing for syncing with metastores
Config Param: META_SYNC_USE_FILE_LISTING_FROM_METADATA
hoodie.meta.sync.sync_snapshot_with_table_name true sync meta info to origin table if enable
Config Param: META_SYNC_SNAPSHOT_WITH_TABLE_NAME
Since Version: 0.14.0
hoodie.meta_sync.spark.version The spark version used when syncing with a metastore.
Config Param: META_SYNC_SPARK_VERSION

Metrics Configs

These set of configs are used to enable monitoring and reporting of key Hudi stats and metrics.

Metrics Configurations for Amazon CloudWatch

Enables reporting on Hudi metrics using Amazon CloudWatch. Hudi publishes metrics on every commit, clean, rollback etc.

Advanced Configs

Config Name Default Description
hoodie.metrics.cloudwatch.maxDatumsPerRequest 20 Max number of Datums per request
Config Param: MAX_DATUMS_PER_REQUEST
Since Version: 0.10.0
hoodie.metrics.cloudwatch.metric.prefix Metric prefix of reporter
Config Param: METRIC_PREFIX
Since Version: 0.10.0
hoodie.metrics.cloudwatch.namespace Hudi Namespace of reporter
Config Param: METRIC_NAMESPACE
Since Version: 0.10.0
hoodie.metrics.cloudwatch.report.period.seconds 60 Reporting interval in seconds
Config Param: REPORT_PERIOD_SECONDS
Since Version: 0.10.0

Metrics Configurations

Enables reporting on Hudi metrics. Hudi publishes metrics on every commit, clean, rollback etc. The following sections list the supported reporters.

Basic Configs

Config Name Default Description
hoodie.metrics.on false Turn on/off metrics reporting. off by default.
Config Param: TURN_METRICS_ON
Since Version: 0.5.0
hoodie.metrics.reporter.type GRAPHITE Type of metrics reporter.
Config Param: METRICS_REPORTER_TYPE_VALUE
Since Version: 0.5.0
hoodie.metricscompaction.log.blocks.on false Turn on/off metrics reporting for log blocks with compaction commit. off by default.
Config Param: TURN_METRICS_COMPACTION_LOG_BLOCKS_ON
Since Version: 0.14.0

Advanced Configs

Config Name Default Description
hoodie.metrics.executor.enable (N/A)
Config Param: EXECUTOR_METRICS_ENABLE
Since Version: 0.7.0
hoodie.metrics.configs.properties Comma separated list of config file paths for metric exporter configs
Config Param: METRICS_REPORTER_FILE_BASED_CONFIGS_PATH
Since Version: 0.14.0
hoodie.metrics.lock.enable false Enable metrics for locking infra. Useful when operating in multiwriter mode
Config Param: LOCK_METRICS_ENABLE
Since Version: 0.13.0
hoodie.metrics.reporter.class
Config Param: METRICS_REPORTER_CLASS_NAME
Since Version: 0.6.0
hoodie.metrics.reporter.metricsname.prefix The prefix given to the metrics names.
Config Param: METRICS_REPORTER_PREFIX
Since Version: 0.11.0

Metrics Configurations for Datadog reporter

Enables reporting on Hudi metrics using the Datadog reporter type. Hudi publishes metrics on every commit, clean, rollback etc.

Advanced Configs

Config Name Default Description
hoodie.metrics.datadog.api.key (N/A) Datadog API key
Config Param: API_KEY
Since Version: 0.6.0
hoodie.metrics.datadog.api.key.supplier (N/A) Datadog API key supplier to supply the API key at runtime. This will take effect if hoodie.metrics.datadog.api.key is not set.
Config Param: API_KEY_SUPPLIER
Since Version: 0.6.0
hoodie.metrics.datadog.api.site (N/A) Datadog API site: EU or US
Config Param: API_SITE_VALUE
Since Version: 0.6.0
hoodie.metrics.datadog.metric.host (N/A) Datadog metric host to be sent along with metrics data.
Config Param: METRIC_HOST_NAME
Since Version: 0.6.0
hoodie.metrics.datadog.metric.prefix (N/A) Datadog metric prefix to be prepended to each metric name with a dot as delimiter. For example, if it is set to foo, foo. will be prepended.
Config Param: METRIC_PREFIX_VALUE
Since Version: 0.6.0
hoodie.metrics.datadog.metric.tags (N/A) Datadog metric tags (comma-delimited) to be sent along with metrics data.
Config Param: METRIC_TAG_VALUES
Since Version: 0.6.0
hoodie.metrics.datadog.api.key.skip.validation false Before sending metrics via Datadog API, whether to skip validating Datadog API key or not. Default to false.
Config Param: API_KEY_SKIP_VALIDATION
Since Version: 0.6.0
hoodie.metrics.datadog.api.timeout.seconds 3 Datadog API timeout in seconds. Default to 3.
Config Param: API_TIMEOUT_IN_SECONDS
Since Version: 0.6.0
hoodie.metrics.datadog.report.period.seconds 30 Datadog reporting period in seconds. Default to 30.
Config Param: REPORT_PERIOD_IN_SECONDS
Since Version: 0.6.0

Metrics Configurations for Graphite

Enables reporting on Hudi metrics using Graphite. Hudi publishes metrics on every commit, clean, rollback etc.

Advanced Configs

Config Name Default Description
hoodie.metrics.graphite.metric.prefix (N/A) Standard prefix applied to all metrics. This helps to add datacenter, environment information for e.g
Config Param: GRAPHITE_METRIC_PREFIX_VALUE
Since Version: 0.5.1
hoodie.metrics.graphite.host localhost Graphite host to connect to.
Config Param: GRAPHITE_SERVER_HOST_NAME
Since Version: 0.5.0
hoodie.metrics.graphite.port 4756 Graphite port to connect to.
Config Param: GRAPHITE_SERVER_PORT_NUM
Since Version: 0.5.0
hoodie.metrics.graphite.report.period.seconds 30 Graphite reporting period in seconds. Default to 30.
Config Param: GRAPHITE_REPORT_PERIOD_IN_SECONDS
Since Version: 0.10.0

Metrics Configurations for Jmx

Enables reporting on Hudi metrics using Jmx. Hudi publishes metrics on every commit, clean, rollback etc.

Advanced Configs

Config Name Default Description
hoodie.metrics.jmx.host localhost Jmx host to connect to
Config Param: JMX_HOST_NAME
Since Version: 0.5.1
hoodie.metrics.jmx.port 9889 Jmx port to connect to
Config Param: JMX_PORT_NUM
Since Version: 0.5.1

Metrics Configurations for M3

Enables reporting on Hudi metrics using M3. Hudi publishes metrics on every commit, clean, rollback etc.

Basic Configs

Config Name Default Description
hoodie.metrics.m3.env production M3 tag to label the environment (defaults to ‘production’), applied to all metrics.
Config Param: M3_ENV
Since Version: 0.15.0
hoodie.metrics.m3.host localhost M3 host to connect to.
Config Param: M3_SERVER_HOST_NAME
Since Version: 0.15.0
hoodie.metrics.m3.port 9052 M3 port to connect to.
Config Param: M3_SERVER_PORT_NUM
Since Version: 0.15.0
hoodie.metrics.m3.service hoodie M3 tag to label the service name (defaults to ‘hoodie’), applied to all metrics.
Config Param: M3_SERVICE
Since Version: 0.15.0
hoodie.metrics.m3.tags Optional M3 tags applied to all metrics.
Config Param: M3_TAGS
Since Version: 0.15.0

Metrics Configurations for Prometheus

Enables reporting on Hudi metrics using Prometheus. Hudi publishes metrics on every commit, clean, rollback etc.

Advanced Configs

Config Name Default Description
hoodie.metrics.prometheus.port 9090 Port for prometheus server.
Config Param: PROMETHEUS_PORT_NUM
Since Version: 0.6.0
hoodie.metrics.pushgateway.delete.on.shutdown true Delete the pushgateway info or not when job shutdown, true by default.
Config Param: PUSHGATEWAY_DELETE_ON_SHUTDOWN_ENABLE
Since Version: 0.6.0
hoodie.metrics.pushgateway.host localhost Hostname of the prometheus push gateway.
Config Param: PUSHGATEWAY_HOST_NAME
Since Version: 0.6.0
hoodie.metrics.pushgateway.job.name Name of the push gateway job.
Config Param: PUSHGATEWAY_JOBNAME
Since Version: 0.6.0
hoodie.metrics.pushgateway.port 9091 Port for the push gateway.
Config Param: PUSHGATEWAY_PORT_NUM
Since Version: 0.6.0
hoodie.metrics.pushgateway.random.job.name.suffix true Whether the pushgateway name need a random suffix , default true.
Config Param: PUSHGATEWAY_RANDOM_JOBNAME_SUFFIX
Since Version: 0.6.0
hoodie.metrics.pushgateway.report.labels Label for the metrics emitted to the Pushgateway. Labels can be specified with key:value pairs separated by commas
Config Param: PUSHGATEWAY_LABELS
Since Version: 0.14.0
hoodie.metrics.pushgateway.report.period.seconds 30 Reporting interval in seconds.
Config Param: PUSHGATEWAY_REPORT_PERIOD_IN_SECONDS
Since Version: 0.6.0

Record Payload Config

This is the lowest level of customization offered by Hudi. Record payloads define how to produce new values to upsert based on incoming new record and stored old record. Hudi provides default implementations such as OverwriteWithLatestAvroPayload which simply update table with the latest/last-written record. This can be overridden to a custom class extending HoodieRecordPayload class, on both datasource and WriteClient levels.

Payload Configurations

Payload related configs, that can be leveraged to control merges based on specific business fields in the data.

Advanced Configs

Config Name Default Description
hoodie.compaction.payload.class org.apache.hudi.common.model.OverwriteWithLatestAvroPayload This needs to be same as class used during insert/upserts. Just like writing, compaction also uses the record payload class to merge records in the log against each other, merge again with the base file and produce the final record to be written after compaction.
Config Param: PAYLOAD_CLASS_NAME
hoodie.payload.event.time.field ts Table column/field name to derive timestamp associated with the records. This canbe useful for e.g, determining the freshness of the table.
Config Param: EVENT_TIME_FIELD
hoodie.payload.ordering.field ts Table column/field name to order records that have the same key, before merging and writing to storage.
Config Param: ORDERING_FIELD

Kafka Connect Configs

These set of configs are used for Kafka Connect Sink Connector for writing Hudi Tables

Kafka Sink Connect Configurations

Configurations for Kafka Connect Sink Connector for Hudi.

Basic Configs

Config Name Default Description
bootstrap.servers localhost:9092 The bootstrap servers for the Kafka Cluster.
Config Param: KAFKA_BOOTSTRAP_SERVERS

Advanced Configs

Config Name Default Description
hadoop.conf.dir (N/A) The Hadoop configuration directory.
Config Param: HADOOP_CONF_DIR
hadoop.home (N/A) The Hadoop home directory.
Config Param: HADOOP_HOME
hoodie.kafka.allow.commit.on.errors true Commit even when some records failed to be written
Config Param: ALLOW_COMMIT_ON_ERRORS
hoodie.kafka.commit.interval.secs 60 The interval at which Hudi will commit the records written to the files, making them consumable on the read-side.
Config Param: COMMIT_INTERVAL_SECS
hoodie.kafka.compaction.async.enable true Controls whether async compaction should be turned on for MOR table writing.
Config Param: ASYNC_COMPACT_ENABLE
hoodie.kafka.control.topic hudi-control-topic Kafka topic name used by the Hudi Sink Connector for sending and receiving control messages. Not used for data records.
Config Param: CONTROL_TOPIC_NAME
hoodie.kafka.coordinator.write.timeout.secs 300 The timeout after sending an END_COMMIT until when the coordinator will wait for the write statuses from all the partitionsto ignore the current commit and start a new commit.
Config Param: COORDINATOR_WRITE_TIMEOUT_SECS
hoodie.meta.sync.classes org.apache.hudi.hive.HiveSyncTool Meta sync client tool, using comma to separate multi tools
Config Param: META_SYNC_CLASSES
hoodie.meta.sync.enable false Enable Meta Sync such as Hive
Config Param: META_SYNC_ENABLE
hoodie.schemaprovider.class org.apache.hudi.schema.FilebasedSchemaProvider subclass of org.apache.hudi.schema.SchemaProvider to attach schemas to input & target table data, built in options: org.apache.hudi.schema.FilebasedSchemaProvider.
Config Param: SCHEMA_PROVIDER_CLASS

Amazon Web Services Configs

Configurations specific to Amazon Web Services.

Amazon Web Services Configs

Amazon Web Services configurations to access resources like Amazon DynamoDB (for locks), Amazon CloudWatch (metrics).

Advanced Configs

Config Name Default Description
hoodie.aws.access.key (N/A) AWS access key id
Config Param: AWS_ACCESS_KEY
Since Version: 0.10.0
hoodie.aws.glue.endpoint (N/A) Aws glue endpoint
Config Param: AWS_GLUE_ENDPOINT
Since Version: 0.15.0
hoodie.aws.glue.region (N/A) Aws glue endpoint
Config Param: AWS_GLUE_REGION
Since Version: 0.15.0
hoodie.aws.role.arn (N/A) AWS Role ARN to assume
Config Param: AWS_ASSUME_ROLE_ARN
Since Version: 0.15.0
hoodie.aws.role.external.id (N/A) External ID use when assuming the AWS Role
Config Param: AWS_ASSUME_ROLE_EXTERNAL_ID
Since Version: 0.15.0
hoodie.aws.secret.key (N/A) AWS secret key
Config Param: AWS_SECRET_KEY
Since Version: 0.10.0
hoodie.aws.session.token (N/A) AWS session token
Config Param: AWS_SESSION_TOKEN
Since Version: 0.10.0
hoodie.aws.role.session.name hoodie Session name to use when assuming the AWS Role
Config Param: AWS_ASSUME_ROLE_SESSION_NAME
Since Version: 0.15.0

Hudi Streamer Configs

These set of configs are used for Hudi Streamer utility which provides the way to ingest from different sources such as DFS or Kafka.

Hudi Streamer Configs

Basic Configs

Config Name Default Description
hoodie.streamer.source.kafka.topic (N/A) Kafka topic name. The config is specific to HoodieMultiTableStreamer
Config Param: KAFKA_TOPIC

Advanced Configs

Config Name Default Description
hoodie.streamer.checkpoint.provider.path (N/A) The path for providing the checkpoints.
Config Param: CHECKPOINT_PROVIDER_PATH
hoodie.streamer.ingestion.tablesToBeIngested (N/A) Comma separated names of tables to be ingested in the format <database>.<table>, for example db1.table1,db1.table2
Config Param: TABLES_TO_BE_INGESTED
hoodie.streamer.transformer.class (N/A) Names of transformer classes to apply. The config is specific to HoodieMultiTableStreamer.
Config Param: TRANSFORMER_CLASS
Since Version: 0.14.0
hoodie.streamer.checkpoint.force.skip false Config to force to skip saving checkpoint in the commit metadata.It is typically used in one-time backfill scenarios, where checkpoints are not to be persisted.
Config Param: CHECKPOINT_FORCE_SKIP
hoodie.streamer.ingestion.targetBasePath The path to which a particular table is ingested. The config is specific to HoodieMultiTableStreamer and overrides path determined using option --base-path-prefix for a table. This config is ignored for a single table streamer
Config Param: TARGET_BASE_PATH
hoodie.streamer.row.throw.explicit.exceptions false When enabled, the dataframe generated from reading source data is wrapped with an exception handler to explicitly surface exceptions.
Config Param: ROW_THROW_EXPLICIT_EXCEPTIONS
Since Version: 0.15.0
hoodie.streamer.sample.writes.enabled false Set this to true to sample from the first batch of records and write to the auxiliary path, before writing to the table.The sampled records are used to calculate the average record size. The relevant write client will have hoodie.copyonwrite.record.size.estimate being overwritten by the calculated result.
Config Param: SAMPLE_WRITES_ENABLED
Since Version: 0.14.0
hoodie.streamer.sample.writes.size 5000 Number of records to sample from the first write. To improve the estimation’s accuracy, for smaller or more compressable record size, set the sample size bigger. For bigger or less compressable record size, set smaller.
Config Param: SAMPLE_WRITES_SIZE
Since Version: 0.14.0
hoodie.streamer.source.kafka.append.offsets false When enabled, appends kafka offset info like source offset(_hoodie_kafka_source_offset), partition (_hoodie_kafka_source_partition) and timestamp (_hoodie_kafka_source_timestamp) to the records. By default its disabled and no kafka offsets are added
Config Param: KAFKA_APPEND_OFFSETS
hoodie.streamer.source.sanitize.invalid.char.mask __ Defines the character sequence that replaces invalid characters in schema field names if hoodie.streamer.source.sanitize.invalid.schema.field.names is enabled.
Config Param: SCHEMA_FIELD_NAME_INVALID_CHAR_MASK
hoodie.streamer.source.sanitize.invalid.schema.field.names false Sanitizes names of invalid schema fields both in the data read from source and also in the schema Replaces invalid characters with hoodie.streamer.source.sanitize.invalid.char.mask. Invalid characters are by goes by avro naming convention (https://avro.apache.org/docs/current/spec.html#names).
Config Param: SANITIZE_SCHEMA_FIELD_NAMES

Hudi Streamer SQL Transformer Configs

Configurations controlling the behavior of SQL transformer in Hudi Streamer.

Basic Configs

Config Name Default Description
hoodie.streamer.transformer.sql (N/A) SQL Query to be executed during write
Config Param: TRANSFORMER_SQL
hoodie.streamer.transformer.sql.file (N/A) File with a SQL script to be executed during write
Config Param: TRANSFORMER_SQL_FILE

Hudi Streamer Source Configs

Configurations controlling the behavior of reading source data.

Cloud Source Configs

Configs that are common during ingestion across different cloud stores

Advanced Configs

Config Name Default Description
hoodie.streamer.source.cloud.data.datasource.options (N/A) A JSON string passed to the Spark DataFrameReader while loading the dataset. Example: hoodie.streamer.gcp.spark.datasource.options={“header”:”true”,”encoding”:”UTF-8”}
Config Param: SPARK_DATASOURCE_OPTIONS
hoodie.streamer.source.cloud.data.ignore.relpath.prefix (N/A) Ignore objects in the bucket whose relative path starts this prefix
Config Param: IGNORE_RELATIVE_PATH_PREFIX
hoodie.streamer.source.cloud.data.ignore.relpath.substring (N/A) Ignore objects in the bucket whose relative path contains this substring
Config Param: IGNORE_RELATIVE_PATH_SUBSTR
hoodie.streamer.source.cloud.data.partition.fields.from.path (N/A) A comma delimited list of path-based partition fields in the source file structure.
Config Param: PATH_BASED_PARTITION_FIELDS
Since Version: 0.14.0
hoodie.streamer.source.cloud.data.partition.max.size (N/A) specify this value in bytes, to coalesce partitions of source dataset not greater than specified limit
Config Param: SOURCE_MAX_BYTES_PER_PARTITION
Since Version: 0.14.1
hoodie.streamer.source.cloud.data.select.file.extension (N/A) Only match files with this extension. By default, this is the same as hoodie.streamer.source.hoodieincr.file.format
Config Param: CLOUD_DATAFILE_EXTENSION
hoodie.streamer.source.cloud.data.select.relpath.prefix (N/A) Only selects objects in the bucket whose relative path starts with this prefix
Config Param: SELECT_RELATIVE_PATH_PREFIX
hoodie.streamer.source.cloud.data.check.file.exists false If true, checks whether file exists before attempting to pull it
Config Param: ENABLE_EXISTS_CHECK
hoodie.streamer.source.cloud.data.datafile.format parquet Format of the data file. By default, this will be the same as hoodie.streamer.source.hoodieincr.file.format
Config Param: DATAFILE_FORMAT
hoodie.streamer.source.cloud.data.reader.comma.separated.path.format false Boolean value for specifying path format in load args of spark.read.format(“..”).load(“a.xml,b.xml,c.xml”), set true if path format needs to be comma separated string value, if false it’s passed as array of strings like spark.read.format(“..”).load(new String[]{a.xml,b.xml,c.xml})
Config Param: SPARK_DATASOURCE_READER_COMMA_SEPARATED_PATH_FORMAT
Since Version: 0.14.1
hoodie.streamer.source.cloud.meta.ack true Whether to acknowledge Metadata messages during Cloud Ingestion or not. This is useful during dev and testing. In Prod this should always be true. In case of Cloud Pubsub, not acknowledging means Pubsub will keep redelivering the same messages.
Config Param: ACK_MESSAGES
hoodie.streamer.source.cloud.meta.batch.size 10 Number of metadata messages to pull in one API call to the cloud events queue. Multiple API calls with this batch size are sent to cloud events queue, until we consume hoodie.streamer.source.cloud.meta.max.num.messages.per.syncfrom the queue or hoodie.streamer.source.cloud.meta.max.fetch.time.per.sync.ms amount of time has passed or queue is empty.
Config Param: BATCH_SIZE_CONF
hoodie.streamer.source.cloud.meta.max.fetch.time.per.sync.secs 60 Max time in secs to consume hoodie.streamer.source.cloud.meta.max.num.messages.per.sync messages from cloud queue. Cloud event queues like SQS, PubSub can return empty responses even when messages are available the queue, this config ensures we don’t wait forever to consume MAX_MESSAGES_CONF messages, but time out and move on further.
Config Param: MAX_FETCH_TIME_PER_SYNC_SECS
Since Version: 0.14.1
hoodie.streamer.source.cloud.meta.max.num.messages.per.sync 1000 Maximum number of messages to consume per sync round. Multiple rounds of hoodie.streamer.source.cloud.meta.batch.size could be invoked to reach max messages as configured by this config
Config Param: MAX_NUM_MESSAGES_PER_SYNC
Since Version: 0.14.1

DFS Path Selector Configs

Configurations controlling the behavior of path selector for DFS source in Hudi Streamer.

Basic Configs

Config Name Default Description
hoodie.streamer.source.dfs.root (N/A) Root path of the source on DFS
Config Param: ROOT_INPUT_PATH

Advanced Configs

Config Name Default Description
hoodie.streamer.source.input.selector (N/A) Source input selector
Config Param: SOURCE_INPUT_SELECTOR

Date Partition Path Selector Configs

Configurations controlling the behavior of date partition path selector for DFS source in Hudi Streamer.

Advanced Configs

Config Name Default Description
hoodie.streamer.source.dfs.datepartitioned.selector.currentdate (N/A) Current date.
Config Param: CURRENT_DATE
hoodie.streamer.source.dfs.datepartitioned.date.format yyyy-MM-dd Date format.
Config Param: DATE_FORMAT
hoodie.streamer.source.dfs.datepartitioned.selector.depth 0 Depth of the files to scan. 0 implies no (date) partition.
Config Param: DATE_PARTITION_DEPTH
hoodie.streamer.source.dfs.datepartitioned.selector.lookback.days 2 The maximum look-back days for scanning.
Config Param: LOOKBACK_DAYS
hoodie.streamer.source.dfs.datepartitioned.selector.parallelism 20 Parallelism for listing partitions.
Config Param: PARTITIONS_LIST_PARALLELISM

GCS Events Source Configs

Configurations controlling the behavior of GCS Events Source in Hudi Streamer.

Advanced Configs

Config Name Default Description
hoodie.streamer.source.gcs.project.id (N/A) The GCP Project Id where the Pubsub Subscription to ingest from resides. Needed to connect to the Pubsub subscription
Config Param: GOOGLE_PROJECT_ID
hoodie.streamer.source.gcs.subscription.id (N/A) The GCP Pubsub subscription id for the GCS Notifications. Needed to connect to the Pubsub subscription
Config Param: PUBSUB_SUBSCRIPTION_ID

Hive Incremental Pulling Source Configs

Configurations controlling the behavior of incremental pulling from a Hive table as a source in Hudi Streamer.

Advanced Configs

Config Name Default Description
hoodie.streamer.source.incrpull.root (N/A) The root path of Hive incremental pulling source.
Config Param: ROOT_INPUT_PATH

Hudi Incremental Source Configs

Configurations controlling the behavior of incremental pulling from a Hudi table as a source in Hudi Streamer.

Basic Configs

Config Name Default Description
hoodie.streamer.source.hoodieincr.path (N/A) Base-path for the source Hudi table
Config Param: HOODIE_SRC_BASE_PATH

Advanced Configs

Config Name Default Description
hoodie.streamer.source.hoodieincr.data.datasource.options (N/A) A comma-separated list of Hudi options that can be passed to the spark dataframe reader of a hudi table, eg: hoodie.metadata.enable=true,hoodie.enable.data.skipping=true. Used only for incremental source.
Config Param: HOODIE_INCREMENTAL_SPARK_DATASOURCE_OPTIONS
Since Version: 0.15.0
hoodie.streamer.source.hoodieincr.missing.checkpoint.strategy (N/A) Allows Hudi Streamer to decide the instant to consume from when checkpoint is not set. Possible values: [READ_LATEST (Read from latest commit in hoodie source table), READ_UPTO_LATEST_COMMIT (Read everything upto latest commit)]
Config Param: MISSING_CHECKPOINT_STRATEGY
hoodie.streamer.source.hoodieincr.partition.extractor.class (N/A) PartitionValueExtractor class to extract partition fields from _hoodie_partition_path
Config Param: HOODIE_SRC_PARTITION_EXTRACTORCLASS
hoodie.streamer.source.hoodieincr.partition.fields (N/A) Specifies partition fields that needs to be added to source table after parsing _hoodie_partition_path.
Config Param: HOODIE_SRC_PARTITION_FIELDS
hoodie.streamer.source.hoodieincr.drop.all.meta.fields.from.source false Drops all meta fields from the source hudi table while ingesting into sink hudi table.
Config Param: HOODIE_DROP_ALL_META_FIELDS_FROM_SOURCE
hoodie.streamer.source.hoodieincr.file.format parquet This config is passed to the reader while loading dataset. Default value is parquet.
Config Param: SOURCE_FILE_FORMAT
hoodie.streamer.source.hoodieincr.num_instants 5 Max number of instants whose changes can be incrementally fetched
Config Param: NUM_INSTANTS_PER_FETCH
hoodie.streamer.source.hoodieincr.read_latest_on_missing_ckpt false If true, allows Hudi Streamer to incrementally fetch from latest committed instant when checkpoint is not provided. This config is deprecated. Please refer to hoodie.streamer.source.hoodieincr.missing.checkpoint.strategy
Config Param: READ_LATEST_INSTANT_ON_MISSING_CKPT

JDBC Source Configs

Configurations controlling the behavior of JDBC source in Hudi Streamer.

Advanced Configs

Config Name Default Description
hoodie.streamer.jdbc.driver.class (N/A) Driver class used for JDBC connection
Config Param: DRIVER_CLASS
hoodie.streamer.jdbc.extra.options. (N/A) Used to set any extra options the user specifies for jdbc
Config Param: EXTRA_OPTIONS
hoodie.streamer.jdbc.incr.fallback.to.full.fetch (N/A) If set true, makes incremental fetch to fallback to full fetch in case of any error
Config Param: FALLBACK_TO_FULL_FETCH
hoodie.streamer.jdbc.incr.pull (N/A) Will the JDBC source do an incremental pull?
Config Param: IS_INCREMENTAL
hoodie.streamer.jdbc.password (N/A) Password used for JDBC connection
Config Param: PASSWORD
hoodie.streamer.jdbc.password.file (N/A) Base-path for the JDBC password file.
Config Param: PASSWORD_FILE
hoodie.streamer.jdbc.storage.level (N/A) Used to control the persistence level. Default value: MEMORY_AND_DISK_SER
Config Param: STORAGE_LEVEL
hoodie.streamer.jdbc.table.incr.column.name (N/A) If run in incremental mode, this field is to pull new data incrementally
Config Param: INCREMENTAL_COLUMN
hoodie.streamer.jdbc.table.name (N/A) RDBMS table to pull
Config Param: RDBMS_TABLE_NAME
hoodie.streamer.jdbc.url (N/A) JDBC url for the Hoodie datasource.
Config Param: URL
hoodie.streamer.jdbc.user (N/A) Username used for JDBC connection
Config Param: USER

Json Kafka Post Processor Configs

Configurations controlling the post processor of Json Kafka Source in Hudi Streamer.

Advanced Configs

Config Name Default Description
hoodie.streamer.source.json.kafka.post.processor.maxwell.database.regex (N/A) Database name regex
Config Param: DATABASE_NAME_REGEX
hoodie.streamer.source.json.kafka.post.processor.maxwell.table.regex (N/A) Table name regex
Config Param: TABLE_NAME_REGEX
hoodie.streamer.source.json.kafka.processor.class (N/A) Json kafka source post processor class name, post process data after consuming fromsource and before giving it to Hudi Streamer.
Config Param: JSON_KAFKA_PROCESSOR_CLASS
hoodie.streamer.source.json.kafka.post.processor.maxwell.precombine.field.format yyyy-MM-dd HH:mm:ss When the preCombine filed is in DATE_STRING format, use should tell hoodiewhat format it is. ‘yyyy-MM-dd HH:mm:ss’ by default
Config Param: PRECOMBINE_FIELD_FORMAT
hoodie.streamer.source.json.kafka.post.processor.maxwell.precombine.field.type DATE_STRING Data type of the preCombine field. could be NON_TIMESTAMP, DATE_STRING,UNIX_TIMESTAMP or EPOCHMILLISECONDS. DATE_STRING by default
Config Param: PRECOMBINE_FIELD_TYPE

Kafka Source Configs

Configurations controlling the behavior of Kafka source in Hudi Streamer.

Basic Configs

Config Name Default Description
hoodie.streamer.source.kafka.topic (N/A) Kafka topic name.
Config Param: KAFKA_TOPIC_NAME
hoodie.streamer.source.kafka.proto.value.deserializer.class org.apache.kafka.common.serialization.ByteArrayDeserializer Kafka Proto Payload Deserializer Class
Config Param: KAFKA_PROTO_VALUE_DESERIALIZER_CLASS
Since Version: 0.15.0

Advanced Configs

Config Name Default Description
hoodie.streamer.source.kafka.value.deserializer.schema (N/A) Schema to deserialize the records.
Config Param: KAFKA_VALUE_DESERIALIZER_SCHEMA
auto.offset.reset LATEST Kafka consumer strategy for reading data.
Config Param: KAFKA_AUTO_OFFSET_RESET
hoodie.streamer.kafka.source.maxEvents 5000000 Maximum number of records obtained in each batch.
Config Param: MAX_EVENTS_FROM_KAFKA_SOURCE
hoodie.streamer.source.kafka.checkpoint.type string Kafka checkpoint type.
Config Param: KAFKA_CHECKPOINT_TYPE
hoodie.streamer.source.kafka.enable.commit.offset false Automatically submits offset to kafka.
Config Param: ENABLE_KAFKA_COMMIT_OFFSET
hoodie.streamer.source.kafka.enable.failOnDataLoss false Fail when checkpoint goes out of bounds instead of seeking to earliest offsets.
Config Param: ENABLE_FAIL_ON_DATA_LOSS
hoodie.streamer.source.kafka.fetch_partition.time.out 300000 Time out for fetching partitions. 5min by default
Config Param: KAFKA_FETCH_PARTITION_TIME_OUT
hoodie.streamer.source.kafka.minPartitions 0 Desired minimum number of partitions to read from Kafka. By default, Hudi has a 1-1 mapping of topicPartitions to Hudi partitions consuming from Kafka. If set this option to a value greater than topicPartitions, Hudi will divvy up large Kafka partitions to smaller pieces. Please note that this configuration is like a hint: the number of input tasks will be approximately minPartitions. It can be less or more depending on rounding errors or Kafka partitions that didn’t receive any new data.
Config Param: KAFKA_SOURCE_MIN_PARTITIONS
Since Version: 0.14.0
hoodie.streamer.source.kafka.value.deserializer.class io.confluent.kafka.serializers.KafkaAvroDeserializer This class is used by kafka client to deserialize the records.
Config Param: KAFKA_AVRO_VALUE_DESERIALIZER_CLASS
Since Version: 0.9.0

Parquet DFS Source Configs

Configurations controlling the behavior of Parquet DFS source in Hudi Streamer.

Advanced Configs

Config Name Default Description
hoodie.streamer.source.parquet.dfs.merge_schema.enable false Merge schema across parquet files within a single write
Config Param: PARQUET_DFS_MERGE_SCHEMA
Since Version: 0.15.0

Pulsar Source Configs

Configurations controlling the behavior of Pulsar source in Hudi Streamer.

Basic Configs

Config Name Default Description
hoodie.streamer.source.pulsar.topic (N/A) Name of the target Pulsar topic to source data from
Config Param: PULSAR_SOURCE_TOPIC_NAME
hoodie.streamer.source.pulsar.endpoint.admin.url http://localhost:8080 URL of the target Pulsar endpoint (of the form ‘pulsar://host:port’
Config Param: PULSAR_SOURCE_ADMIN_ENDPOINT_URL
hoodie.streamer.source.pulsar.endpoint.service.url pulsar://localhost:6650 URL of the target Pulsar endpoint (of the form ‘pulsar://host:port’
Config Param: PULSAR_SOURCE_SERVICE_ENDPOINT_URL

Advanced Configs

Config Name Default Description
hoodie.streamer.source.pulsar.maxRecords 5000000 Max number of records obtained in a single each batch
Config Param: PULSAR_SOURCE_MAX_RECORDS_PER_BATCH_THRESHOLD
hoodie.streamer.source.pulsar.offset.autoResetStrategy LATEST Policy determining how offsets shall be automatically reset in case there’s no checkpoint information present
Config Param: PULSAR_SOURCE_OFFSET_AUTO_RESET_STRATEGY

S3 Event-based Hudi Incremental Source Configs

Configurations controlling the behavior of incremental pulling from S3 events meta information from Hudi table as a source in Hudi Streamer.

Advanced Configs

Config Name Default Description
hoodie.streamer.source.s3incr.ignore.key.prefix (N/A) Control whether to ignore the s3 objects starting with this prefix
Config Param: S3_IGNORE_KEY_PREFIX
hoodie.streamer.source.s3incr.ignore.key.substring (N/A) Control whether to ignore the s3 objects with this substring
Config Param: S3_IGNORE_KEY_SUBSTRING
hoodie.streamer.source.s3incr.key.prefix (N/A) Control whether to filter the s3 objects starting with this prefix
Config Param: S3_KEY_PREFIX
hoodie.streamer.source.s3incr.spark.datasource.options (N/A) Json string, passed to the reader while loading dataset. Example Hudi Streamer conf —hoodie-conf hoodie.streamer.source.s3incr.spark.datasource.options={“header”:”true”,”encoding”:”UTF-8”}
Config Param: SPARK_DATASOURCE_OPTIONS
hoodie.streamer.source.s3incr.check.file.exists false Control whether we do existence check for files before consuming them
Config Param: S3_INCR_ENABLE_EXISTS_CHECK
hoodie.streamer.source.s3incr.fs.prefix s3 The file system prefix.
Config Param: S3_FS_PREFIX

S3 Source Configs

Configurations controlling the behavior of S3 source in Hudi Streamer.

Basic Configs

Config Name Default Description
hoodie.streamer.s3.source.queue.url (N/A) Queue url for cloud object events
Config Param: S3_SOURCE_QUEUE_URL

Advanced Configs

Config Name Default Description
hoodie.streamer.s3.source.queue.region (N/A) Case-sensitive region name of the cloud provider for the queue. For example, “us-east-1”.
Config Param: S3_SOURCE_QUEUE_REGION
hoodie.streamer.s3.source.queue.fs s3 File system corresponding to queue. For example, for AWS SQS it is s3/s3a.
Config Param: S3_SOURCE_QUEUE_FS
hoodie.streamer.s3.source.queue.long.poll.wait 20 Long poll wait time in seconds, If set as 0 then client will fetch on short poll basis.
Config Param: S3_QUEUE_LONG_POLL_WAIT
hoodie.streamer.s3.source.queue.max.messages.per.batch 5 Max messages for each batch of Hudi Streamer run. Source will process these maximum number of message at a time.
Config Param: S3_SOURCE_QUEUE_MAX_MESSAGES_PER_BATCH
hoodie.streamer.s3.source.queue.max.messages.per.request 10 Max messages for each request
Config Param: S3_SOURCE_QUEUE_MAX_MESSAGES_PER_REQUEST
hoodie.streamer.s3.source.queue.visibility.timeout 30 Visibility timeout for messages in queue. After we consume the message, queue will move the consumed messages to in-flight state, these messages can’t be consumed again by source for this timeout period.
Config Param: S3_SOURCE_QUEUE_VISIBILITY_TIMEOUT

File-based SQL Source Configs

Configurations controlling the behavior of File-based SQL Source in Hudi Streamer.

Basic Configs

Config Name Default Description
hoodie.streamer.source.sql.file (N/A) SQL file path containing the SQL query to read source data.
Config Param: SOURCE_SQL_FILE
Since Version: 0.14.0

Advanced Configs

Config Name Default Description
hoodie.streamer.source.sql.checkpoint.emit false Whether to emit the current epoch as the streamer checkpoint.
Config Param: EMIT_EPOCH_CHECKPOINT
Since Version: 0.14.0

SQL Source Configs

Configurations controlling the behavior of SQL source in Hudi Streamer.

Basic Configs

Config Name Default Description
hoodie.streamer.source.sql.sql.query (N/A) SQL query for fetching source data.
Config Param: SOURCE_SQL

Hudi Streamer Schema Provider Configs

Configurations that control the schema provider for Hudi Streamer.

Hudi Streamer Schema Provider Configs

Basic Configs

Config Name Default Description
hoodie.streamer.schemaprovider.registry.targetUrl (N/A) The schema of the target you are writing to e.g. https://foo:bar@schemaregistry.org
Config Param: TARGET_SCHEMA_REGISTRY_URL
hoodie.streamer.schemaprovider.registry.url (N/A) The schema of the source you are reading from e.g. https://foo:bar@schemaregistry.org
Config Param: SRC_SCHEMA_REGISTRY_URL

Advanced Configs

Config Name Default Description
hoodie.streamer.schemaprovider.registry.baseUrl (N/A) The base URL of the schema registry.
Config Param: SCHEMA_REGISTRY_BASE_URL
hoodie.streamer.schemaprovider.registry.schemaconverter (N/A) The class name of the custom schema converter to use.
Config Param: SCHEMA_CONVERTER
hoodie.streamer.schemaprovider.registry.sourceUrlSuffix (N/A) The source URL suffix.
Config Param: SCHEMA_REGISTRY_SOURCE_URL_SUFFIX
hoodie.streamer.schemaprovider.registry.targetUrlSuffix (N/A) The target URL suffix.
Config Param: SCHEMA_REGISTRY_TARGET_URL_SUFFIX
hoodie.streamer.schemaprovider.registry.urlSuffix (N/A) The suffix of the URL for the schema registry.
Config Param: SCHEMA_REGISTRY_URL_SUFFIX
hoodie.streamer.schemaprovider.spark_avro_post_processor.enable true Whether to enable Spark Avro post processor.
Config Param: SPARK_AVRO_POST_PROCESSOR_ENABLE

File-based Schema Provider Configs

Configurations for file-based schema provider.

Basic Configs

Config Name Default Description
hoodie.streamer.schemaprovider.source.schema.file (N/A) The schema of the source you are reading from
Config Param: SOURCE_SCHEMA_FILE
hoodie.streamer.schemaprovider.target.schema.file (N/A) The schema of the target you are writing to
Config Param: TARGET_SCHEMA_FILE

Hive Schema Provider Configs

Configurations for Hive schema provider.

Advanced Configs

Config Name Default Description
hoodie.streamer.schemaprovider.source.schema.hive.table (N/A) Hive table from where source schema can be fetched
Config Param: SOURCE_SCHEMA_TABLE
hoodie.streamer.schemaprovider.target.schema.hive.table (N/A) Hive table from where target schema can be fetched
Config Param: TARGET_SCHEMA_TABLE
hoodie.streamer.schemaprovider.source.schema.hive.database default Hive database from where source schema can be fetched
Config Param: SOURCE_SCHEMA_DATABASE
hoodie.streamer.schemaprovider.target.schema.hive.database default Hive database from where target schema can be fetched
Config Param: TARGET_SCHEMA_DATABASE

JDBC-based Schema Provider Configs

Configurations for JDBC-based schema provider.

Advanced Configs

Config Name Default Description
hoodie.streamer.schemaprovider.source.schema.jdbc.connection.url (N/A) The JDBC URL to connect to. The source-specific connection properties may be specified in the URL. e.g., jdbc:postgresql://localhost/test?user=fred&password=secret
Config Param: SOURCE_SCHEMA_JDBC_CONNECTION_URL
hoodie.streamer.schemaprovider.source.schema.jdbc.dbtable (N/A) The table with the schema to reference e.g. test_database.test1_table or test1_table
Config Param: SOURCE_SCHEMA_JDBC_DBTABLE
hoodie.streamer.schemaprovider.source.schema.jdbc.driver.type (N/A) The class name of the JDBC driver to use to connect to this URL. e.g. org.h2.Driver
Config Param: SOURCE_SCHEMA_JDBC_DRIVER_TYPE
hoodie.streamer.schemaprovider.source.schema.jdbc.nullable (N/A) If true, all the columns are nullable.
Config Param: SOURCE_SCHEMA_JDBC_NULLABLE
hoodie.streamer.schemaprovider.source.schema.jdbc.password (N/A) Password for the connection e.g. secret
Config Param: SOURCE_SCHEMA_JDBC_PASSWORD
hoodie.streamer.schemaprovider.source.schema.jdbc.timeout (N/A) The number of seconds the driver will wait for a Statement object to execute. Zero means there is no limit. In the write path, this option depends on how JDBC drivers implement the API setQueryTimeout, e.g., the h2 JDBC driver checks the timeout of each query instead of an entire JDBC batch. It defaults to 0.
Config Param: SOURCE_SCHEMA_JDBC_TIMEOUT
hoodie.streamer.schemaprovider.source.schema.jdbc.username (N/A) Username for the connection e.g. fred
Config Param: SOURCE_SCHEMA_JDBC_USERNAME

JDBC-based Schema Provider Configs

Configurations for Proto schema provider.

Advanced Configs

Config Name Default Description
hoodie.streamer.schemaprovider.proto.class.name (N/A) The Protobuf Message class used as the source for the schema.
Config Param: PROTO_SCHEMA_CLASS_NAME
Since Version: 0.13.0
hoodie.streamer.schemaprovider.proto.flatten.wrappers false When set to true wrapped primitives like Int64Value are translated to a record with a single ‘value’ field. By default, the value is false and the wrapped primitives are treated as a nullable value
Config Param: PROTO_SCHEMA_WRAPPED_PRIMITIVES_AS_RECORDS
Since Version: 0.13.0
hoodie.streamer.schemaprovider.proto.max.recursion.depth 5 The max depth to unravel the Proto schema when translating into an Avro schema. Setting this depth allows the user to convert a schema that is recursive in proto into something that can be represented in their lake format like Parquet. After a given class has been seen N times within a single branch, the schema provider will create a record with a byte array to hold the remaining proto data and a string to hold the message descriptor’s name for context.
Config Param: PROTO_SCHEMA_MAX_RECURSION_DEPTH
Since Version: 0.13.0
hoodie.streamer.schemaprovider.proto.timestamps.as.records false When set to true Timestamp fields are translated to a record with a seconds and nanos field. By default, the value is false and the timestamp is converted to a long with the timestamp-micros logical type
Config Param: PROTO_SCHEMA_TIMESTAMPS_AS_RECORDS
Since Version: 0.13.0

Schema Post Processor Config Configs

Configurations for Schema Post Processor

Advanced Configs

Config Name Default Description
hoodie.streamer.schemaprovider.schema_post_processor (N/A) The class name of the schema post processor.
Config Param: SCHEMA_POST_PROCESSOR
hoodie.streamer.schemaprovider.schema_post_processor.add.column.default (N/A) New column’s default value
Config Param: SCHEMA_POST_PROCESSOR_ADD_COLUMN_DEFAULT_PROP
hoodie.streamer.schemaprovider.schema_post_processor.add.column.doc (N/A) Docs about new column
Config Param: SCHEMA_POST_PROCESSOR_ADD_COLUMN_DOC_PROP
hoodie.streamer.schemaprovider.schema_post_processor.add.column.name (N/A) New column’s name
Config Param: SCHEMA_POST_PROCESSOR_ADD_COLUMN_NAME_PROP
hoodie.streamer.schemaprovider.schema_post_processor.add.column.type (N/A) New column’s type
Config Param: SCHEMA_POST_PROCESSOR_ADD_COLUMN_TYPE_PROP
hoodie.streamer.schemaprovider.schema_post_processor.delete.columns (N/A) Columns to delete in the schema post processor.
Config Param: DELETE_COLUMN_POST_PROCESSOR_COLUMN
hoodie.streamer.schemaprovider.schema_post_processor.add.column.nullable true New column’s nullable
Config Param: SCHEMA_POST_PROCESSOR_ADD_COLUMN_NULLABLE_PROP