Multi-level configuration management
Amoro provides configurations that can be set at the Catalog, Table, and Engine levels. Engine-level configuration takes priority, followed by Table-level configuration, and finally Catalog-level configuration.
- Catalog: We generally recommend setting default values for tables through the Catalog properties, for example the Self-optimizing configurations.
- Table: We also recommend specifying customized configurations in Create Table; they can be modified later through Alter Table operations.
- Engine: If tuning is required in an engine, configure it at the engine level; refer to Spark and Flink.
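The priority order described above can be illustrated with a minimal sketch. This is not Amoro's implementation: the function `resolve_property` and the three dictionaries are hypothetical names used only to show how an Engine-level value would override a Table-level value, which in turn would override a Catalog-level default.

```python
def resolve_property(key, catalog_props, table_props, engine_props, default=None):
    """Return the effective value for key.

    Lookup order follows the documented priority:
    Engine first, then Table, then Catalog.
    """
    for level in (engine_props, table_props, catalog_props):
        if key in level:
            return level[key]
    return default


# Hypothetical property maps for one table (illustrative values only).
catalog_props = {"self-optimizing.enabled": "true", "self-optimizing.group": "default"}
table_props = {"self-optimizing.group": "etl"}
engine_props = {}

# The Table-level value overrides the Catalog-level default.
effective_group = resolve_property(
    "self-optimizing.group", catalog_props, table_props, engine_props
)
# No Table or Engine override exists, so the Catalog-level value applies.
effective_enabled = resolve_property(
    "self-optimizing.enabled", catalog_props, table_props, engine_props
)
print(effective_group, effective_enabled)
```

With these inputs, `self-optimizing.group` resolves to the Table-level value and `self-optimizing.enabled` falls back to the Catalog-level value.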
Self-optimizing configurations
Self-optimizing configurations are applicable to both Iceberg Format and Mixed streaming Format.
Key | Default | Description |
---|---|---|
self-optimizing.enabled | true | Enables Self-optimizing |
self-optimizing.group | default | Optimizer group for Self-optimizing |
self-optimizing.quota | 0.1 | Quota for Self-optimizing, indicating the CPU resource the table can take up |
self-optimizing.execute.num-retries | 5 | Number of retries after failure of Self-optimizing |
self-optimizing.target-size | 134217728(128MB) | Target size for Self-optimizing |
self-optimizing.max-file-count | 10000 | Maximum number of files processed by a Self-optimizing process |
self-optimizing.max-task-size-bytes | 134217728(128MB) | Maximum file size in bytes in a single task, used for splitting tasks |
self-optimizing.fragment-ratio | 8 | The fragment file size threshold. Dividing self-optimizing.target-size by this ratio gives the actual fragment file size |
self-optimizing.minor.trigger.file-count | 12 | The minimum number of fragment files to trigger minor optimizing |
self-optimizing.minor.trigger.interval | 3600000(1 hour) | The time interval in milliseconds to trigger minor optimizing |
self-optimizing.major.trigger.duplicate-ratio | 0.1 | The ratio of duplicate data of segment files to trigger major optimizing |
self-optimizing.full.trigger.interval | -1(closed) | The time interval in milliseconds to trigger full optimizing |
self-optimizing.full.rewrite-all-files | true | Whether full optimizing rewrites all files or skips files that do not need to be optimized |
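As a worked example of the self-optimizing.fragment-ratio description, the sketch below divides the default self-optimizing.target-size by the default ratio. The function name `fragment_size_threshold` is hypothetical, and integer division is an assumption made for illustration.

```python
TARGET_SIZE_BYTES = 134217728  # default self-optimizing.target-size (128MB)
FRAGMENT_RATIO = 8             # default self-optimizing.fragment-ratio


def fragment_size_threshold(target_size=TARGET_SIZE_BYTES, fragment_ratio=FRAGMENT_RATIO):
    """Fragment file size threshold in bytes: target size divided by the ratio."""
    return target_size // fragment_ratio


threshold = fragment_size_threshold()
print(threshold)                    # 16777216
print(threshold // (1024 * 1024))   # 16
```

With the defaults, the fragment file size threshold works out to 16777216 bytes (16MB).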
Data-cleaning configurations
Data-cleaning configurations are applicable to both Iceberg Format and Mixed streaming Format.
Key | Default | Description |
---|---|---|
table-expire.enabled | true | Enables periodic table expiration |
change.data.ttl.minutes | 10080(7 days) | Time to live in minutes for data of ChangeStore |
snapshot.base.keep.minutes | 720(12 hours) | Table-Expiration keeps the latest snapshots of BaseStore within a specified time in minutes |
clean-orphan-file.enabled | false | Enables periodic cleaning of orphan files |
clean-orphan-file.min-existing-time-minutes | 2880(2 days) | Cleaning orphan files keeps the files modified within a specified time in minutes |
clean-dangling-delete-files.enabled | true | Whether to enable cleaning of dangling delete files |
data-expire.enabled | false | Whether to enable data expiration |
data-expire.level | partition | Level of data expiration, either partition or file |
data-expire.field | NULL | Field used to determine data expiration. Supported field types are timestamp, timestamptz, long, and string in a date format |
data-expire.datetime-string-pattern | yyyy-MM-dd | Pattern used for matching string datetime |
data-expire.datetime-number-format | TIMESTAMP_MS | Timestamp unit for a long field, either TIMESTAMP_MS or TIMESTAMP_S |
data-expire.retention-time | NULL | Retention period for data expiration. For example, 1d means retaining data for 1 day. Other supported units include h (hour), min (minute), s (second), ms (millisecond), etc. |
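The retention-time format can be made concrete with a small parser. This is a sketch under assumptions, not Amoro's parser: `parse_retention` is a hypothetical helper, it accepts only the units named in the table (d, h, min, s, ms), and it assumes a non-negative integer followed directly by the unit.

```python
import re
from datetime import timedelta

# Units listed in the data-expire.retention-time description.
_UNITS = {
    "d": timedelta(days=1),
    "h": timedelta(hours=1),
    "min": timedelta(minutes=1),
    "s": timedelta(seconds=1),
    "ms": timedelta(milliseconds=1),
}


def parse_retention(value):
    """Convert a string such as '1d' or '30min' into a timedelta."""
    match = re.fullmatch(r"\s*(\d+)\s*(d|h|min|s|ms)\s*", value)
    if match is None:
        raise ValueError(f"unsupported retention time: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return amount * _UNITS[unit]


one_day = parse_retention("1d")
half_hour = parse_retention("30min")
print(one_day, half_hour)
```

Here '1d' maps to a one-day period, matching the example in the table, and '30min' maps to thirty minutes.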
Mixed Format configurations
If you are using Iceberg Format, please refer to Iceberg configurations. The following configurations apply only to Mixed Format.
Reading configurations
Key | Default | Description |
---|---|---|
read.split.open-file-cost | 4194304(4MB) | The estimated cost to open a file |
read.split.planning-lookback | 10 | Number of bins to consider when combining input splits |
read.split.target-size | 134217728(128MB) | Target size when combining data input splits |
read.split.delete-ratio | 0.05 | When the ratio of delete files is below this threshold, the read task will be split into more tasks to improve query speed |
Writing configurations
Key | Default | Description |
---|---|---|
base.write.format | parquet | File format for the table for BaseStore, applicable to KeyedTable |
change.write.format | parquet | File format for the table for ChangeStore, applicable to KeyedTable |
write.format.default | parquet | Default file format for the table, applicable to UnkeyedTable |
base.file-index.hash-bucket | 4 | Initial number of buckets for BaseStore auto-bucket |
change.file-index.hash-bucket | 4 | Initial number of buckets for ChangeStore auto-bucket |
write.target-file-size-bytes | 134217728(128MB) | Target size when writing |
write.upsert.enabled | false | Enables upsert mode. When enabled, multiple inserted rows with the same primary key are merged |
write.distribution-mode | hash | Shuffle rules for writing. UnkeyedTable can choose between none and hash, while KeyedTable can only choose hash |
write.distribution.hash-mode | auto | Auto-bucket mode, which supports primary-key, partition-key, primary-partition-key, and auto |
LogStore configurations
Key | Default | Description |
---|---|---|
log-store.enabled | false | Enables LogStore |
log-store.type | kafka | Type of LogStore, which supports ‘kafka’ and ‘pulsar’ |
log-store.address | NULL | Address of LogStore, required when LogStore enabled. For Kafka, this is the Kafka bootstrap servers. For Pulsar, this is the Pulsar Service URL, such as ‘pulsar://localhost:6650’ |
log-store.topic | NULL | Topic of LogStore, required when LogStore enabled |
properties.pulsar.admin.adminUrl | NULL | HTTP URL of Pulsar admin, such as ‘http://my-broker.example.com:8080’. Only required when log-store.type=pulsar |
properties.XXX | NULL | Other configurations of LogStore. For Kafka, all configurations supported by the Kafka Consumer/Producer can be set by adding the prefix properties. to them, for example 'properties.batch.size'='16384'; refer to Kafka Consumer Configurations and Kafka Producer Configurations for more details. For Pulsar, all configurations supported by Pulsar can be set by adding the prefix properties. to them, for example 'properties.pulsar.client.requestTimeoutMs'='60000'; refer to Flink-Pulsar-Connector for more details |
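The `properties.` prefix convention can be sketched as a small helper that assembles the LogStore table properties. The helper `logstore_properties`, the broker address, and the topic name are hypothetical values chosen for illustration; only the property keys come from the table above.

```python
def logstore_properties(address, topic, client_settings=None, store_type="kafka"):
    """Build a table-property map that enables LogStore.

    Each client setting is passed through with the 'properties.' prefix,
    as described for properties.XXX.
    """
    props = {
        "log-store.enabled": "true",
        "log-store.type": store_type,
        "log-store.address": address,
        "log-store.topic": topic,
    }
    for key, value in (client_settings or {}).items():
        props[f"properties.{key}"] = str(value)
    return props


# Hypothetical Kafka address and topic; batch.size is the example from the table.
props = logstore_properties(
    address="localhost:9092",
    topic="example_topic",
    client_settings={"batch.size": 16384},
)
for key, value in props.items():
    print(f"'{key}'='{value}'")
```

The resulting map contains 'properties.batch.size'='16384' alongside the required log-store.address and log-store.topic entries.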
Watermark configurations
Key | Default | Description |
---|---|---|
table.event-time-field | _ingest_time | The event time field for calculating the watermark. The default _ingest_time indicates calculating with the time when the data was written |
table.watermark-allowed-lateness-second | 0 | The allowed lateness time in seconds when calculating watermark |
table.event-time-field.datetime-string-format | yyyy-MM-dd HH:mm:ss | The format of event time when it is in string format |
table.event-time-field.datetime-number-format | TIMESTAMP_MS | The format of event time when it is in numeric format, which supports TIMESTAMP_MS (timestamp in milliseconds) and TIMESTAMP_S (timestamp in seconds) |
Mixed-Hive format configurations
Key | Default | Description |
---|---|---|
base.hive.auto-sync-schema-change | true | Whether to synchronize schema changes of the Hive table from HMS |
base.hive.auto-sync-data-write | false | Whether to synchronize data changes of the Hive table from HMS; this should be true when writing to Hive |
base.hive.consistent-write.enabled | true | To avoid writing dirty data, the files written to the Hive directory will be hidden files and renamed to visible files upon commit. |