Multi-level configuration management

Amoro provides configurations that can be set at the Catalog, Table, and Engine levels. Engine-level configurations take the highest priority, followed by Table-level, and finally Catalog-level: a setting at a higher-priority level overrides the same setting at a lower one.

  • Catalog: Generally, we recommend that users set default values for tables through the Catalog properties, such as the Self-optimizing configurations.
  • Table: We also recommend that users specify customized configurations when creating a table; these can later be modified through Alter Table operations (see the sketch after this list).
  • Engine: If tuning is required within a specific engine, consider configuring it at the engine level; refer to the Spark and Flink documentation.
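
To make the Table level concrete, here is a minimal Spark SQL sketch; the catalog, database, table, and column names are hypothetical, and the property keys come from the reference tables below.

```sql
-- Table level: set a customized configuration at creation time.
CREATE TABLE my_catalog.db.events (
    id BIGINT,
    ts TIMESTAMP
) TBLPROPERTIES (
    'self-optimizing.group' = 'events_group'
);

-- Table level: modify the configuration later through Alter Table.
ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
    'self-optimizing.quota' = '0.5'
);
```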

Self-optimizing configurations

Self-optimizing configurations are applicable to both Iceberg Format and Mixed streaming Format.

| Key | Default | Description |
| --- | --- | --- |
| self-optimizing.enabled | true | Enables Self-optimizing |
| self-optimizing.group | default | Optimizer group for Self-optimizing |
| self-optimizing.quota | 0.1 | Quota for Self-optimizing, indicating the CPU resources the table can take up |
| self-optimizing.execute.num-retries | 5 | Number of retries after a Self-optimizing failure |
| self-optimizing.target-size | 134217728 (128 MB) | Target file size for Self-optimizing |
| self-optimizing.max-file-count | 10000 | Maximum number of files processed in one Self-optimizing process |
| self-optimizing.max-task-size-bytes | 134217728 (128 MB) | Maximum size in bytes of a single task when splitting tasks |
| self-optimizing.fragment-ratio | 8 | Ratio that determines the fragment file size threshold: self-optimizing.target-size divided by this ratio gives the actual fragment file size |
| self-optimizing.minor.trigger.file-count | 12 | Minimum number of fragment files to trigger minor optimizing |
| self-optimizing.minor.trigger.interval | 3600000 (1 hour) | Time interval in milliseconds to trigger minor optimizing |
| self-optimizing.major.trigger.duplicate-ratio | 0.1 | Duplicate-data ratio of segment files that triggers major optimizing |
| self-optimizing.full.trigger.interval | -1 (disabled) | Time interval in milliseconds to trigger full optimizing |
| self-optimizing.full.rewrite-all-files | true | Whether full optimizing rewrites all files or skips files that do not need optimizing |
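
For instance, the fragment threshold follows from two of these keys: self-optimizing.target-size divided by self-optimizing.fragment-ratio. The sketch below (hypothetical table name, illustrative values) doubles the target size, which also doubles the fragment threshold to 268435456 / 8 = 32 MB.

```sql
-- Files smaller than target-size / fragment-ratio are treated as fragments
-- and become candidates for minor optimizing.
ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
    'self-optimizing.target-size' = '268435456',        -- 256 MB target files
    'self-optimizing.fragment-ratio' = '8',             -- threshold: 256 MB / 8 = 32 MB
    'self-optimizing.minor.trigger.file-count' = '12'   -- minor optimizing after 12 fragments
);
```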

Data-cleaning configurations

Data-cleaning configurations are applicable to both Iceberg Format and Mixed streaming Format.

| Key | Default | Description |
| --- | --- | --- |
| table-expire.enabled | true | Enables periodic table expiration |
| change.data.ttl.minutes | 10080 (7 days) | Time to live in minutes for ChangeStore data |
| snapshot.base.keep.minutes | 720 (12 hours) | Table expiration keeps the BaseStore snapshots from within this time in minutes |
| clean-orphan-file.enabled | false | Enables periodic cleaning of orphan files |
| clean-orphan-file.min-existing-time-minutes | 2880 (2 days) | Orphan-file cleaning keeps files modified within this time in minutes |
| clean-dangling-delete-files.enabled | true | Whether to enable cleaning of dangling delete files |
| data-expire.enabled | false | Whether to enable data expiration |
| data-expire.level | partition | Level of data expiration: partition or file |
| data-expire.field | NULL | Field used to determine data expiration; supports timestamp/timestamptz/long fields and string fields in date format |
| data-expire.datetime-string-pattern | yyyy-MM-dd | Pattern used to parse string datetime values |
| data-expire.datetime-number-format | TIMESTAMP_MS | Timestamp unit for long fields: TIMESTAMP_MS or TIMESTAMP_S |
| data-expire.retention-time | NULL | Retention period for data expiration; for example, 1d retains data for 1 day. Other supported units include h (hour), min (minute), s (second), and ms (millisecond) |
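
As a hedged example of wiring these keys together, the sketch below (table and column names are hypothetical) expires partitions whose op_time values are older than 7 days:

```sql
ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
    'data-expire.enabled' = 'true',
    'data-expire.field' = 'op_time',       -- timestamp column driving expiration
    'data-expire.retention-time' = '7d',   -- keep 7 days of data
    'data-expire.level' = 'partition'      -- drop whole partitions, not single files
);
```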

Mixed Format configurations

If you are using the Iceberg Format, please refer to the Iceberg configurations; the following configurations are only applicable to the Mixed Format.

Reading configurations

| Key | Default | Description |
| --- | --- | --- |
| read.split.open-file-cost | 4194304 (4 MB) | Estimated cost to open a file |
| read.split.planning-lookback | 10 | Number of bins to consider when combining input splits |
| read.split.target-size | 134217728 (128 MB) | Target size when combining data input splits |
| read.split.delete-ratio | 0.05 | When the ratio of delete files is below this threshold, the read task is split into more tasks to improve query speed |
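
As an illustrative tuning sketch (placeholder table name and values), a scan-heavy workload could combine input splits into larger read tasks:

```sql
ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
    'read.split.target-size' = '268435456',   -- combine input splits up to 256 MB
    'read.split.planning-lookback' = '10'     -- keep the default bin-packing lookback
);
```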

Writing configurations

| Key | Default | Description |
| --- | --- | --- |
| base.write.format | parquet | File format of the BaseStore, applicable to KeyedTable |
| change.write.format | parquet | File format of the ChangeStore, applicable to KeyedTable |
| write.format.default | parquet | Default file format of the table, applicable to UnkeyedTable |
| base.file-index.hash-bucket | 4 | Initial number of buckets for BaseStore auto-bucket |
| change.file-index.hash-bucket | 4 | Initial number of buckets for ChangeStore auto-bucket |
| write.target-file-size-bytes | 134217728 (128 MB) | Target file size when writing |
| write.upsert.enabled | false | Enables upsert mode; multiple insert records with the same primary key are merged if enabled |
| write.distribution-mode | hash | Shuffle rule for writing; UnkeyedTable can choose between none and hash, while KeyedTable can only choose hash |
| write.distribution.hash-mode | auto | Auto-bucket mode, supporting primary-key, partition-key, primary-partition-key, and auto |
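
For example, a table that deduplicates on write could be configured as below; this is a sketch with a hypothetical table name, and it assumes the table was created with a primary key.

```sql
ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
    'write.upsert.enabled' = 'true',               -- merge inserts sharing a primary key
    'write.distribution-mode' = 'hash',            -- the only choice for KeyedTable
    'write.distribution.hash-mode' = 'primary-key' -- shuffle by primary key
);
```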

LogStore configurations

| Key | Default | Description |
| --- | --- | --- |
| log-store.enabled | false | Enables LogStore |
| log-store.type | kafka | Type of LogStore: 'kafka' or 'pulsar' |
| log-store.address | NULL | Address of the LogStore, required when LogStore is enabled. For Kafka, this is the Kafka bootstrap servers; for Pulsar, this is the Pulsar service URL, such as 'pulsar://localhost:6650' |
| log-store.topic | NULL | Topic of the LogStore, required when LogStore is enabled |
| properties.pulsar.admin.adminUrl | NULL | HTTP URL of the Pulsar admin, such as 'http://my-broker.example.com:8080'; only required when log-store.type=pulsar |
| properties.XXX | NULL | Other configurations of the LogStore |

For Kafka, all configurations supported by the Kafka Consumer/Producer can be set by prefixing them with properties., for example 'properties.batch.size'='16384'. Refer to the Kafka Consumer Configurations and Kafka Producer Configurations for more details.

For Pulsar, all configurations supported by Pulsar can be set by prefixing them with properties., for example 'properties.pulsar.client.requestTimeoutMs'='60000'. Refer to the Flink-Pulsar-Connector for more details.
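
Putting this together, a Kafka-backed LogStore might be enabled as in the sketch below; the bootstrap servers, topic, and table name are placeholders, and properties.batch.size is passed through to the Kafka producer as described above.

```sql
ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
    'log-store.enabled' = 'true',
    'log-store.type' = 'kafka',
    'log-store.address' = 'broker1:9092,broker2:9092', -- Kafka bootstrap servers
    'log-store.topic' = 'events_log',
    'properties.batch.size' = '16384'                  -- forwarded to the Kafka producer
);
```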

Watermark configurations

| Key | Default | Description |
| --- | --- | --- |
| table.event-time-field | _ingest_time | Event-time field for calculating the watermark; the default _ingest_time calculates with the time the data was written |
| table.watermark-allowed-lateness-second | 0 | Allowed lateness in seconds when calculating the watermark |
| table.event-time-field.datetime-string-format | yyyy-MM-dd HH:mm:ss | Format of the event time when it is a string |
| table.event-time-field.datetime-number-format | TIMESTAMP_MS | Format of the event time when it is numeric: TIMESTAMP_MS (timestamp in milliseconds) or TIMESTAMP_S (timestamp in seconds) |
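
A hedged sketch (hypothetical table and column names) that derives the watermark from a string event-time column instead of the default ingest time, allowing 30 seconds of lateness:

```sql
ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
    'table.event-time-field' = 'event_time',
    'table.event-time-field.datetime-string-format' = 'yyyy-MM-dd HH:mm:ss',
    'table.watermark-allowed-lateness-second' = '30'
);
```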

Mixed-Hive format configurations

| Key | Default | Description |
| --- | --- | --- |
| base.hive.auto-sync-schema-change | true | Whether to synchronize schema changes of the Hive table from HMS |
| base.hive.auto-sync-data-write | false | Whether to synchronize data changes of the Hive table from HMS; this should be true when data is written to the Hive table directly |
| base.hive.consistent-write.enabled | true | To avoid writing dirty data, files written to the Hive directory are hidden files first and renamed to visible files upon commit |
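
For example, if engines outside Amoro also write to the underlying Hive table, data synchronization can be switched on; the table name below is hypothetical.

```sql
ALTER TABLE my_catalog.db.hive_backed SET TBLPROPERTIES (
    'base.hive.auto-sync-data-write' = 'true'  -- pick up data changes made through HMS
);
```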