Every record in Hudi is uniquely identified by a primary key, which is the pair of the record key and the partition path the record belongs to. Using primary keys, Hudi can a) impose a partition-level uniqueness integrity constraint and b) enable fast updates and deletes on records. Choose the partitioning scheme carefully, as it can be a determining factor for ingestion and query latency. Some use cases, such as log ingestion, have no naturally present record key. For these use cases, users no longer have to specify the record key field explicitly: from Hudi version 0.14.0, Hudi can automatically generate record keys that are efficient for compute, storage and read, and that meet the uniqueness requirements of the primary key.
In general, Hudi supports both partitioned and global indexes. For a dataset with a partitioned index (the most commonly used), each record is uniquely identified by the pair of record key and partition path. For a dataset with a global index, each record is uniquely identified by the record key alone, so there are no duplicate record keys across partitions.
Key Generators
Hudi provides several key generators out of the box that users can use based on their need, while having a pluggable implementation for users to implement and use their own KeyGenerator. This page goes over all different types of key generators that are readily available to use.
Here is the interface for KeyGenerator in Hudi for your reference.
Before diving into different types of key generators, let’s go over some of the common configs relevant to key generators.
Config Name | Default | Description |
---|---|---|
hoodie.datasource.write.recordkey.field | N/A (Optional) | Record key field. Value to be used as the recordKey component of HoodieKey. Config Param: RECORDKEY_FIELD_NAME |
hoodie.datasource.write.partitionpath.field | N/A (Optional) | Partition path field. Value to be used as the partitionPath component of HoodieKey. This needs to be specified if a partitioned table is desired. The actual value is obtained by invoking .toString(). Config Param: PARTITIONPATH_FIELD_NAME |
hoodie.datasource.write.keygenerator.class | N/A (Optional) | Key generator class that implements org.apache.hudi.keygen.KeyGenerator to extract a key out of incoming records. Config Param: KEYGENERATOR_CLASS_NAME |
hoodie.datasource.write.hive_style_partitioning | false (Optional) | Flag to indicate whether to use Hive-style partitioning. If set to true, the names of partition folders follow the <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only the partition values). Config Param: HIVE_STYLE_PARTITIONING_ENABLE |
hoodie.datasource.write.partitionpath.urlencode | false (Optional) | Whether to URL-encode the partition path value before creating the folder structure. Config Param: URL_ENCODE_PARTITIONING |
For all advanced configs refer here.
Let’s go over the different key generators available in Hudi.
SimpleKeyGenerator
The record key refers to one field (a column in the dataframe) by name, and the partition path refers to one field (a single column in the dataframe) by name. This is one of the most commonly used key generators. Values are taken as is from the dataframe and converted to strings.
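As a minimal sketch, the options below show how SimpleKeyGenerator might be selected from a PySpark writer. The column names `uuid` and `region` are hypothetical, and the class name assumes the generator lives in the same org.apache.hudi.keygen package as the KeyGenerator interface listed in the config table.

```python
# Hypothetical column names; replace them with columns from your own dataframe.
simple_keygen_options = {
    # Exactly one field for the record key.
    "hoodie.datasource.write.recordkey.field": "uuid",
    # Exactly one field for the partition path.
    "hoodie.datasource.write.partitionpath.field": "region",
    # Assumed fully qualified class name.
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.SimpleKeyGenerator",
}

# Typical use with a Spark dataframe `df` (not executed here, and other
# write options such as the table name are still needed):
# df.write.format("hudi").options(**simple_keygen_options).mode("append").save(base_path)
```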
ComplexKeyGenerator
Both the record key and the partition path comprise one or more fields by name (a combination of multiple fields). Fields are expected to be comma separated in the config value. For example: "hoodie.datasource.write.recordkey.field" : “col1,col4”
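To make the comma-separated convention concrete, here is a small illustrative sketch in plain Python. It is not Hudi's implementation: the helper names and the sample record are hypothetical, the `field:value` record key layout follows the format this page describes for multi-field record keys, and joining partition values with `/` is an assumption.

```python
def complex_record_key(record, key_fields):
    """Join each key field and its value as field:value, comma separated."""
    return ",".join(f"{field}:{record[field]}" for field in key_fields)


def complex_partition_path(record, partition_fields, hive_style=False):
    """Join partition values with '/', optionally as field=value folders."""
    parts = []
    for field in partition_fields:
        value = str(record[field])
        parts.append(f"{field}={value}" if hive_style else value)
    return "/".join(parts)


# Hypothetical record and config values.
record = {"col1": 7, "col4": "abc", "country": "IN", "city": "BLR"}
key_fields = "col1,col4".split(",")
partition_fields = "country,city".split(",")

record_key = complex_record_key(record, key_fields)
partition_path = complex_partition_path(record, partition_fields)
hive_partition_path = complex_partition_path(record, partition_fields, hive_style=True)
```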
NonpartitionedKeyGenerator
If your Hudi dataset is not partitioned, you can use NonpartitionedKeyGenerator, which returns an empty partition for all records. In other words, all records go to the same partition (which is the empty string “”).
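A minimal sketch of writer options for a non-partitioned table, assuming a hypothetical `uuid` record key column and assuming the class sits in the org.apache.hudi.keygen package. Leaving the partition path field empty here is an illustrative choice, not a documented requirement.

```python
# Hypothetical record key column; the partition path is intentionally empty.
nonpartitioned_options = {
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "",
    # Assumed fully qualified class name.
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
}
```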
CustomKeyGenerator
This is a generic implementation of KeyGenerator that lets users leverage the benefits of SimpleKeyGenerator, ComplexKeyGenerator and TimestampBasedKeyGenerator all at the same time. The record key and the partition path can each be configured as a single field or a combination of fields.
hoodie.datasource.write.recordkey.field
hoodie.datasource.write.partitionpath.field
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
This key generator is particularly useful if you want to define complex partition paths involving both regular fields and timestamp-based fields. It expects the value of "hoodie.datasource.write.partitionpath.field" in a specific format: “field1:PartitionKeyType1,field2:PartitionKeyType2…”
The complete partition path is created as
<value for field1 basis PartitionKeyType1>/<value for field2 basis PartitionKeyType2>
and so on. Each partition key type can be either SIMPLE or TIMESTAMP.
Example config value: “field_3:simple,field_5:timestamp”
The record key config value is either a single field name, as with SimpleKeyGenerator, or a comma-separated list of field names, as with ComplexKeyGenerator. Example:
hoodie.datasource.write.recordkey.field=field1,field2
This creates the record key in the format field1:value1,field2:value2 and so on. For simple record keys, specify only one field.
The CustomKeyGenerator class defines an enum, PartitionKeyType, for configuring partition paths. It can take two possible values: SIMPLE and TIMESTAMP. For partitioned tables, the value of the hoodie.datasource.write.partitionpath.field property must be provided in the format field1:PartitionKeyType1,field2:PartitionKeyType2 and so on. For example, to create a partition path from two fields, country and date, where the latter holds timestamp-based values that need to be rendered in a given format, you can specify the following:
hoodie.datasource.write.partitionpath.field=country:SIMPLE,date:TIMESTAMP
This creates the partition path in the format <country_name>/<date> or country=<country_name>/date=<date>, depending on whether Hive-style partitioning is enabled.
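The `field:PartitionKeyType` format can be illustrated with a short Python sketch. This is a simplified model of the behavior described above, not the CustomKeyGenerator source: the function names, the sample record, and the `epoch_millis_to_day` formatter are all hypothetical, and the timestamp formatting that Hudi drives through the timebased configs is reduced here to a plain callable.

```python
from datetime import datetime, timezone


def parse_partition_spec(spec):
    """Split 'f1:TYPE1,f2:TYPE2' into (field, TYPE) pairs and validate the types."""
    pairs = []
    for item in spec.split(","):
        field, key_type = item.strip().split(":")
        key_type = key_type.upper()
        if key_type not in ("SIMPLE", "TIMESTAMP"):
            raise ValueError(f"Unsupported partition key type: {key_type}")
        pairs.append((field, key_type))
    return pairs


def custom_partition_path(record, spec, format_timestamp, hive_style=False):
    """Build the partition path one field at a time, joined with '/'."""
    parts = []
    for field, key_type in parse_partition_spec(spec):
        raw = record[field]
        value = format_timestamp(raw) if key_type == "TIMESTAMP" else str(raw)
        parts.append(f"{field}={value}" if hive_style else value)
    return "/".join(parts)


def epoch_millis_to_day(millis):
    """Hypothetical formatter: epoch milliseconds to a UTC yyyy-MM-dd string."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc).strftime("%Y-%m-%d")


record = {"country": "IN", "date": 1578283932000}
spec = "country:SIMPLE,date:TIMESTAMP"
plain = custom_partition_path(record, spec, epoch_millis_to_day)
hive = custom_partition_path(record, spec, epoch_millis_to_day, hive_style=True)
```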
Bring your own implementation
You can implement your own custom key generator by extending the public API class here:
TimestampBasedKeyGenerator
This key generator relies on timestamps for the partition field. While generating the partition path value for a record, field values are interpreted as timestamps rather than simply converted to strings. The record key is handled as before: it is chosen by field name. Users need to set a few more configs to use this key generator.
Configs to be set:
Config Name | Default | Description |
---|---|---|
hoodie.keygen.timebased.timestamp.type | N/A (Required) | Required only when the key generator is TimestampBasedKeyGenerator. One of the supported timestamp types (UNIX_TIMESTAMP, DATE_STRING, MIXED, EPOCHMILLISECONDS, SCALAR) |
hoodie.keygen.timebased.output.dateformat | “” (Optional) | Output date format such as yyyy-MM-dd'T'HH:mm:ss.SSSZ |
hoodie.keygen.timebased.timezone | “UTC” (Optional) | Timezone of both input and output timestamp if they are the same, such as UTC. Please use hoodie.keygen.timebased.input.timezone and hoodie.keygen.timebased.output.timezone instead if the input and output timezones are different. |
hoodie.keygen.timebased.input.dateformat | “” (Optional) | Input date format such as yyyy-MM-dd'T'HH:mm:ss.SSSZ |
Let’s go over some example values for TimestampBasedKeyGenerator.
Timestamp is GMT
Config Name | Value |
---|---|
hoodie.streamer.keygen.timebased.timestamp.type | “EPOCHMILLISECONDS” |
hoodie.streamer.keygen.timebased.output.dateformat | “yyyy-MM-dd hh” |
hoodie.streamer.keygen.timebased.timezone | “GMT+8:00” |
Input Field value: “1578283932000L”
Partition path generated from key generator: “2020-01-06 12”
If input field value is null for some rows.
Partition path generated from key generator: “1970-01-01 08”
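The arithmetic behind this example can be reproduced with a short Python sketch. It only mirrors the conversion for illustration and is not Hudi code: the function name is hypothetical, Python's `%I` stands in for the 12-hour `hh` pattern, and treating a null input as epoch 0 is an assumption inferred from the null-row output shown above.

```python
from datetime import datetime, timedelta, timezone


def epoch_millis_partition(value_ms, utc_offset_hours=8):
    """Format epoch milliseconds as 'yyyy-MM-dd hh' in a fixed-offset timezone."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    # Assumption: a null input behaves like epoch 0.
    millis = 0 if value_ms is None else value_ms
    moment = datetime.fromtimestamp(millis / 1000, tz=tz)
    # %I is the zero-padded 12-hour clock, matching the 'hh' pattern.
    return moment.strftime("%Y-%m-%d %I")


from_value = epoch_millis_partition(1578283932000)
from_null = epoch_millis_partition(None)
```

With the GMT+8:00 offset, 1578283932000 ms falls at 12:12:12 on 2020-01-06, which yields the partition path shown in the example.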
Timestamp is DATE_STRING
Config Name | Value |
---|---|
hoodie.streamer.keygen.timebased.timestamp.type | “DATE_STRING” |
hoodie.streamer.keygen.timebased.output.dateformat | “yyyy-MM-dd hh” |
hoodie.streamer.keygen.timebased.timezone | “GMT+8:00” |
hoodie.streamer.keygen.timebased.input.dateformat | “yyyy-MM-dd hh:mm:ss” |
Input field value: “2020-01-06 12:12:12”
Partition path generated from key generator: “2020-01-06 12”
If input field value is null for some rows.
Partition path generated from key generator: “1970-01-01 12:00:00”
Scalar examples
Config Name | Value |
---|---|
hoodie.streamer.keygen.timebased.timestamp.type | “SCALAR” |
hoodie.streamer.keygen.timebased.output.dateformat | “yyyy-MM-dd hh” |
hoodie.streamer.keygen.timebased.timezone | “GMT” |
hoodie.streamer.keygen.timebased.timestamp.scalar.time.unit | “days” |
Input field value: “20000L”
Partition path generated from key generator: “2024-10-04 12”
If input field value is null.
Partition path generated from key generator: “1970-01-02 12”
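For the non-null input, the scalar conversion can be checked with a small Python sketch. The function name is hypothetical and this is an illustration of the arithmetic, not Hudi's implementation: the value is read as a count of the configured unit since the Unix epoch and then formatted in GMT with a 12-hour clock.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def scalar_partition(value, unit="days"):
    """Interpret value as a count of `unit` since the epoch, format as 'yyyy-MM-dd hh'."""
    # timedelta accepts keyword units such as days, hours, minutes, seconds.
    moment = EPOCH + timedelta(**{unit: value})
    # Midnight renders as '12' on a 12-hour clock.
    return moment.strftime("%Y-%m-%d %I")


scalar_path = scalar_partition(20000, unit="days")
```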
ISO8601WithMsZ with Single Input format
Config Name | Value |
---|---|
hoodie.streamer.keygen.timebased.timestamp.type | “DATE_STRING” |
hoodie.streamer.keygen.timebased.input.dateformat | “yyyy-MM-dd'T'HH:mm:ss.SSSZ” |
hoodie.streamer.keygen.timebased.input.dateformat.list.delimiter.regex | “” |
hoodie.streamer.keygen.timebased.input.timezone | “” |
hoodie.streamer.keygen.timebased.output.dateformat | “yyyyMMddHH” |
hoodie.streamer.keygen.timebased.output.timezone | “GMT” |
Input field value: “2020-04-01T13:01:33.428Z”
Partition path generated from key generator: “2020040113”
ISO8601WithMsZ with Multiple Input formats
Config Name | Value |
---|---|
hoodie.streamer.keygen.timebased.timestamp.type | “DATE_STRING” |
hoodie.streamer.keygen.timebased.input.dateformat | “yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ” |
hoodie.streamer.keygen.timebased.input.dateformat.list.delimiter.regex | “” |
hoodie.streamer.keygen.timebased.input.timezone | “” |
hoodie.streamer.keygen.timebased.output.dateformat | “yyyyMMddHH” |
hoodie.streamer.keygen.timebased.output.timezone | “UTC” |
Input field value: “2020-04-01T13:01:33.428Z”
Partition path generated from key generator: “2020040113”
ISO8601NoMs with offset using multiple input formats
Config Name | Value |
---|---|
hoodie.streamer.keygen.timebased.timestamp.type | “DATE_STRING” |
hoodie.streamer.keygen.timebased.input.dateformat | “yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ” |
hoodie.streamer.keygen.timebased.input.dateformat.list.delimiter.regex | “” |
hoodie.streamer.keygen.timebased.input.timezone | “” |
hoodie.streamer.keygen.timebased.output.dateformat | “yyyyMMddHH” |
hoodie.streamer.keygen.timebased.output.timezone | “UTC” |
Input field value: “2020-04-01T13:01:33-05:00”
Partition path generated from key generator: “2020040118”
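The multiple-input-format behavior can be approximated in Python by trying each format in order and converting the first successful parse to the output timezone. This is a sketch under assumptions, not Hudi's parser: the function name is hypothetical, and the `strptime` patterns are Python analogues of the two configured input formats.

```python
from datetime import datetime, timezone

# Python analogues of yyyy-MM-dd'T'HH:mm:ssZ and yyyy-MM-dd'T'HH:mm:ss.SSSZ.
INPUT_FORMATS = ["%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%dT%H:%M:%S.%f%z"]


def date_string_partition(value, input_formats=INPUT_FORMATS, output_format="%Y%m%d%H"):
    """Parse with the first matching input format, then render in UTC."""
    for fmt in input_formats:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue  # Try the next configured format.
        return parsed.astimezone(timezone.utc).strftime(output_format)
    raise ValueError(f"No input format matched: {value!r}")


with_offset = date_string_partition("2020-04-01T13:01:33-05:00")
with_millis = date_string_partition("2020-04-01T13:01:33.428Z")
```

The -05:00 offset shifts 13:01 local time to 18:01 UTC, which is why the generated hour is 18 rather than 13.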
Input as short date string and expect date in date format
Config Name | Value |
---|---|
hoodie.streamer.keygen.timebased.timestamp.type | “DATE_STRING” |
hoodie.streamer.keygen.timebased.input.dateformat | “yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ,yyyyMMdd” |
hoodie.streamer.keygen.timebased.input.dateformat.list.delimiter.regex | “” |
hoodie.streamer.keygen.timebased.input.timezone | “UTC” |
hoodie.streamer.keygen.timebased.output.dateformat | “MM/dd/yyyy” |
hoodie.streamer.keygen.timebased.output.timezone | “UTC” |
Input field value: “20200401”
Partition path generated from key generator: “04/01/2020”