Catalog Configuration
A catalog is created and named by executing the following query (replace `<catalog_name>` with your catalog name and `<config_key>`=`<config_value>` with the catalog implementation config):

```sql
CREATE CATALOG <catalog_name> WITH (
  'type'='iceberg',
  '<config_key>'='<config_value>'
);
```
The following properties can be set globally and are not limited to a specific catalog implementation:
| Property | Required | Values | Description |
|---|---|---|---|
| type | ✔️ | iceberg | Must be iceberg. |
| catalog-type | | hive, hadoop, rest, glue, jdbc or nessie | The underlying Iceberg catalog implementation: HiveCatalog, HadoopCatalog, RESTCatalog, GlueCatalog, JdbcCatalog, NessieCatalog, or left unset if using a custom catalog implementation via catalog-impl. |
| catalog-impl | | | The fully-qualified class name of a custom catalog implementation. Must be set if catalog-type is unset. |
| property-version | | | Version number to describe the property version. This property can be used for backwards compatibility in case the property format changes. The current property version is 1. |
| cache-enabled | | true or false | Whether to enable the catalog cache. The default value is true. |
| cache.expiration-interval-ms | | | How long catalog entries are locally cached, in milliseconds. Negative values like -1 disable expiration; 0 is not allowed. The default value is -1. |
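
As a sketch of how these global properties combine with a catalog implementation, the statement below registers a custom catalog and tunes the cache; the catalog name, class name, and values are illustrative placeholders only:

```sql
-- Illustrative only: 'com.my.custom.CatalogImpl' stands in for a real custom
-- catalog class; the cache settings use the global properties described above.
CREATE CATALOG my_custom_catalog WITH (
  'type'='iceberg',
  'catalog-impl'='com.my.custom.CatalogImpl',
  'property-version'='1',
  'cache-enabled'='true',
  'cache.expiration-interval-ms'='60000'
);
```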
The following properties can be set if using the Hive catalog:
| Property | Required | Values | Description |
|---|---|---|---|
| uri | ✔️ | | The Hive metastore's thrift URI. |
| clients | | | The Hive metastore client pool size. The default value is 2. |
| warehouse | | | The Hive warehouse location. Users should specify this path if they neither set hive-conf-dir to a location containing a hive-site.xml configuration file nor add a correct hive-site.xml to the classpath. |
| hive-conf-dir | | | Path to a directory containing a hive-site.xml configuration file which will be used to provide custom Hive configuration values. The value of hive.metastore.warehouse.dir from <hive-conf-dir>/hive-site.xml (or from the hive-site.xml on the classpath) will be overwritten with the warehouse value if both hive-conf-dir and warehouse are set when creating the Iceberg catalog. |
| hadoop-conf-dir | | | Path to a directory containing core-site.xml and hdfs-site.xml configuration files which will be used to provide custom Hadoop configuration values. |
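
A minimal sketch of a Hive-backed catalog; the metastore URI, client pool size, and warehouse path are example values to adapt to your environment:

```sql
-- Example values: point 'uri' at your Hive metastore and 'warehouse' at your storage path.
CREATE CATALOG hive_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hive',
  'uri'='thrift://localhost:9083',
  'clients'='5',
  'property-version'='1',
  'warehouse'='hdfs://nn:8020/warehouse/path'
);
```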
The following properties can be set if using the Hadoop catalog:
| Property | Required | Values | Description |
|---|---|---|---|
| warehouse | ✔️ | | The HDFS directory to store metadata files and data files. |
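
A sketch of a Hadoop catalog pointing at an example warehouse directory; the path is a placeholder:

```sql
-- The warehouse path is an example value; it should point at your HDFS (or compatible) directory.
CREATE CATALOG hadoop_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hadoop',
  'warehouse'='hdfs://nn:8020/warehouse/path',
  'property-version'='1'
);
```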
The following properties can be set if using the REST catalog:
| Property | Required | Values | Description |
|---|---|---|---|
| uri | ✔️ | | The URL to the REST Catalog. |
| credential | | | A credential to exchange for a token in the OAuth2 client credentials flow. |
| token | | | A token which will be used to interact with the server. |
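
A sketch of a REST catalog, assuming a REST catalog server reachable at an example URL; the credential is optional and shown only to illustrate where it goes:

```sql
-- The URI and credential are placeholders for a real REST catalog endpoint.
CREATE CATALOG rest_catalog WITH (
  'type'='iceberg',
  'catalog-type'='rest',
  'uri'='https://localhost/',
  'credential'='<client_id>:<client_secret>'
);
```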
Runtime configuration
Read options
Flink read options are passed when configuring the Flink IcebergSource:
```java
IcebergSource.forRowData()
    .tableLoader(TableLoader.fromCatalog(...))
    .assignerFactory(new SimpleSplitAssignerFactory())
    .streaming(true)
    .streamingStartingStrategy(StreamingStartingStrategy.INCREMENTAL_FROM_SNAPSHOT_ID)
    .startSnapshotId(3821550127947089987L)
    .monitorInterval(Duration.ofMillis(10L)) // or .set("monitor-interval", "10s") or .set(FlinkReadOptions.MONITOR_INTERVAL, "10s")
    .build()
```
For Flink SQL, read options can be passed in via SQL hints like this:
```sql
SELECT * FROM tableName /*+ OPTIONS('monitor-interval'='10s') */
...
```
Options can be passed in via the Flink configuration, which will be applied to the current session. Note that not all options support this mode.

```java
env.getConfig()
    .getConfiguration()
    .set(FlinkReadOptions.SPLIT_FILE_OPEN_COST_OPTION, 1000L);
...
```
The read option has the highest priority, followed by the Flink configuration and then the table property.
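
For instance, assuming connector.iceberg.monitor-interval has been set to 60s in the Flink configuration, the hint below still takes effect because read options have the higher priority; the table name is a placeholder:

```sql
-- The hint value (10s) overrides connector.iceberg.monitor-interval from the Flink configuration.
SELECT * FROM sample_table /*+ OPTIONS('monitor-interval'='10s') */;
```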
| Read option | Flink configuration | Table property | Default | Description |
|---|---|---|---|---|
| snapshot-id | N/A | N/A | null | For time travel in batch mode. Read data from the specified snapshot-id. |
| case-sensitive | connector.iceberg.case-sensitive | N/A | false | If true, match column name in a case sensitive way. |
| as-of-timestamp | N/A | N/A | null | For time travel in batch mode. Read data from the most recent snapshot as of the given time in milliseconds. |
| starting-strategy | connector.iceberg.starting-strategy | N/A | INCREMENTAL_FROM_LATEST_SNAPSHOT | Starting strategy for streaming execution; applies only to the FLIP-27 IcebergSource.<br>TABLE_SCAN_THEN_INCREMENTAL: do a regular table scan, then switch to incremental mode; the incremental mode starts from the current snapshot, exclusive.<br>INCREMENTAL_FROM_LATEST_SNAPSHOT: start incremental mode from the latest snapshot, inclusive; if the table is empty, all future append snapshots should be discovered.<br>INCREMENTAL_FROM_EARLIEST_SNAPSHOT: start incremental mode from the earliest snapshot, inclusive; if the table is empty, all future append snapshots should be discovered.<br>INCREMENTAL_FROM_SNAPSHOT_ID: start incremental mode from a snapshot with a specific id, inclusive.<br>INCREMENTAL_FROM_SNAPSHOT_TIMESTAMP: start incremental mode from a snapshot with a specific timestamp, inclusive; if the timestamp is between two snapshots, it should start from the snapshot after the timestamp. |
| start-snapshot-timestamp | N/A | N/A | null | Start to read data from the most recent snapshot as of the given time in milliseconds. |
| start-snapshot-id | N/A | N/A | null | Start to read data from the specified snapshot-id. |
| end-snapshot-id | N/A | N/A | The latest snapshot id | Specifies the end snapshot. |
| branch | N/A | N/A | main | Specifies the branch to read from in batch mode |
| tag | N/A | N/A | null | Specifies the tag to read from in batch mode |
| start-tag | N/A | N/A | null | Specifies the starting tag to read from for incremental reads |
| end-tag | N/A | N/A | null | Specifies the ending tag to read from for incremental reads |
| split-size | connector.iceberg.split-size | read.split.target-size | 128 MB | Target size when combining input splits. |
| split-lookback | connector.iceberg.split-lookback | read.split.planning-lookback | 10 | Number of bins to consider when combining input splits. |
| split-file-open-cost | connector.iceberg.split-file-open-cost | read.split.open-file-cost | 4 MB | The estimated cost to open a file, used as a minimum weight when combining splits. |
| streaming | connector.iceberg.streaming | N/A | false | Sets whether the current task runs in streaming or batch mode. |
| monitor-interval | connector.iceberg.monitor-interval | N/A | 60s | Monitor interval to discover splits from new snapshots. Applicable only for streaming read. |
| include-column-stats | connector.iceberg.include-column-stats | N/A | false | Create a new scan from this that loads the column stats with each data file. Column stats include: value count, null value count, lower bounds, and upper bounds. |
| max-planning-snapshot-count | connector.iceberg.max-planning-snapshot-count | N/A | Integer.MAX_VALUE | Max number of snapshots limited per split enumeration. Applicable only to streaming read. |
| limit | connector.iceberg.limit | N/A | -1 | Limits the number of output rows; -1 means no limit. |
| max-allowed-planning-failures | connector.iceberg.max-allowed-planning-failures | N/A | 3 | Max allowed consecutive failures for scan planning before failing the job. Set to -1 to never fail the job on scan planning failures. |
| watermark-column | connector.iceberg.watermark-column | N/A | null | Specifies the watermark column to use for watermark generation. If this option is present, the splitAssignerFactory will be overridden with OrderedSplitAssignerFactory. |
| watermark-column-time-unit | connector.iceberg.watermark-column-time-unit | N/A | TimeUnit.MICROSECONDS | Specifies the watermark time unit to use for watermark generation. The possible values are DAYS, HOURS, MINUTES, SECONDS, MILLISECONDS, MICROSECONDS, NANOSECONDS. |
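
As an illustration of how several of these options are used together in SQL, the hints below run a batch time-travel read and a streaming read; the table name and snapshot id are placeholders:

```sql
-- Batch time travel: read the table as of a specific snapshot (placeholder id).
SELECT * FROM sample_table /*+ OPTIONS('snapshot-id'='3821550127947089987') */;

-- Streaming read: poll for new snapshots every 30 seconds.
SELECT * FROM sample_table /*+ OPTIONS('streaming'='true', 'monitor-interval'='30s') */;
```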
Write options
Flink write options are passed when configuring the FlinkSink, like this:
```java
FlinkSink.Builder builder = FlinkSink.forRow(dataStream, SimpleDataUtil.FLINK_SCHEMA)
    .table(table)
    .tableLoader(tableLoader)
    .set("write-format", "orc")
    .set(FlinkWriteOptions.OVERWRITE_MODE, "true");
```
For Flink SQL, write options can be passed in via SQL hints like this:
```sql
INSERT INTO tableName /*+ OPTIONS('upsert-enabled'='true') */
...
```
| Flink option | Default | Description |
|---|---|---|
| write-format | Table write.format.default | File format to use for this write operation; parquet, avro, or orc |
| target-file-size-bytes | As per table property | Overrides this table’s write.target-file-size-bytes |
| upsert-enabled | Table write.upsert.enabled | Overrides this table’s write.upsert.enabled |
| overwrite-enabled | false | Overwrite the table's data; overwrite mode shouldn't be enabled when configuring an UPSERT data stream. |
| distribution-mode | Table write.distribution-mode | Overrides this table’s write.distribution-mode |
| compression-codec | Table write.(fileformat).compression-codec | Overrides this table’s compression codec for this write |
| compression-level | Table write.(fileformat).compression-level | Overrides this table’s compression level for Parquet and Avro tables for this write |
| compression-strategy | Table write.orc.compression-strategy | Overrides this table’s compression strategy for ORC tables for this write |
| write-parallelism | Upstream operator parallelism | Overrides the writer parallelism |
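
As a sketch combining several of these write options in SQL, the statement below picks a file format, compression codec, and target file size for a single insert; the table names and values are placeholders:

```sql
-- Write Parquet files with zstd compression and a 256 MB target file size for this insert only.
INSERT INTO sample_table /*+ OPTIONS('write-format'='parquet',
                                     'compression-codec'='zstd',
                                     'target-file-size-bytes'='268435456') */
SELECT * FROM source_table;
```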
