Java API Quickstart

Create a table

Tables are created using either a Catalog or an implementation of the Tables interface.
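For example, HadoopTables is an implementation of the Tables interface that tracks tables by file system location rather than by catalog name. A minimal sketch, assuming the schema and partition spec defined later on this page and a placeholder warehouse path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.hadoop.HadoopTables;

    // HadoopTables locates table metadata directly under the given path
    HadoopTables tables = new HadoopTables(new Configuration());
    Table table = tables.create(schema, spec, "hdfs://host:8020/warehouse_path/logging/logs");

    // or to load an existing table from its location
    // Table table = tables.load("hdfs://host:8020/warehouse_path/logging/logs");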

Using a Hive catalog

The Hive catalog connects to a Hive metastore to keep track of Iceberg tables. You can initialize a Hive catalog with a name and some properties (see Catalog properties).

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.iceberg.hive.HiveCatalog;

    HiveCatalog catalog = new HiveCatalog();
    catalog.setConf(spark.sparkContext().hadoopConfiguration());  // Optionally use Spark's Hadoop configuration

    Map<String, String> properties = new HashMap<String, String>();
    properties.put("warehouse", "...");
    properties.put("uri", "...");

    catalog.initialize("hive", properties);

HiveCatalog implements the Catalog interface, which defines methods for working with tables, like createTable, loadTable, renameTable, and dropTable. To create a table, pass a TableIdentifier and a Schema along with other initial metadata:

    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.TableIdentifier;

    TableIdentifier name = TableIdentifier.of("logging", "logs");
    Table table = catalog.createTable(name, schema, spec);

    // or to load an existing table, use the following line
    // Table table = catalog.loadTable(name);
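The other Catalog methods mentioned above follow the same pattern. A minimal sketch of renameTable and dropTable, using example identifiers:

    // Rename logging.logs to logging.logs_renamed, then drop it
    catalog.renameTable(name, TableIdentifier.of("logging", "logs_renamed"));
    catalog.dropTable(TableIdentifier.of("logging", "logs_renamed"));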

The table’s schema and partition spec are created below.

Using a Hadoop catalog

A Hadoop catalog doesn’t need to connect to a Hive MetaStore, but can only be used with HDFS or similar file systems that support atomic rename. Concurrent writes with a Hadoop catalog are not safe with a local FS or S3. To create a Hadoop catalog:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.hadoop.HadoopCatalog;

    Configuration conf = new Configuration();
    String warehousePath = "hdfs://host:8020/warehouse_path";
    HadoopCatalog catalog = new HadoopCatalog(conf, warehousePath);

Like the Hive catalog, HadoopCatalog implements Catalog, so it also has methods for working with tables, like createTable, loadTable, and dropTable.

This example creates a table with a Hadoop catalog:

    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.TableIdentifier;

    TableIdentifier name = TableIdentifier.of("logging", "logs");
    Table table = catalog.createTable(name, schema, spec);

    // or to load an existing table, use the following line
    // Table table = catalog.loadTable(name);

The table’s schema and partition spec are created below.

Tables in Spark

Spark can work with tables by name when using a HiveCatalog.

    // spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
    // spark.sql.catalog.hive_prod.type = hive
    spark.table("hive_prod.logging.logs");

Spark can also load tables created by a HadoopCatalog by path.

    spark.read.format("iceberg").load("hdfs://host:8020/warehouse_path/logging/logs");

Schemas

Create a schema

This example creates a schema for a logs table:

    import org.apache.iceberg.Schema;
    import org.apache.iceberg.types.Types;

    Schema schema = new Schema(
        Types.NestedField.required(1, "level", Types.StringType.get()),
        Types.NestedField.required(2, "event_time", Types.TimestampType.withZone()),
        Types.NestedField.required(3, "message", Types.StringType.get()),
        Types.NestedField.optional(4, "call_stack", Types.ListType.ofRequired(5, Types.StringType.get()))
    );

When using the Iceberg API directly, type IDs are required. Conversions from other schema formats, like Spark, Avro, and Parquet, will automatically assign new IDs.

When a table is created, all IDs in the schema are re-assigned to ensure uniqueness.
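Because of this re-assignment, code that needs field IDs after creation should read the schema back from the created table rather than reusing the original Schema object. A minimal sketch, reusing the catalog and identifier from the create-table examples above:

    Table table = catalog.createTable(name, schema, spec);

    // field IDs in table.schema() may differ from the IDs assigned manually above
    Schema createdSchema = table.schema();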

Convert a schema from Avro

To create an Iceberg schema from an existing Avro schema, use converters in AvroSchemaUtil:

    import org.apache.avro.Schema;
    import org.apache.avro.Schema.Parser;
    import org.apache.iceberg.avro.AvroSchemaUtil;

    Schema avroSchema = new Parser().parse("{\"type\": \"record\", ... }");
    org.apache.iceberg.Schema icebergSchema = AvroSchemaUtil.toIceberg(avroSchema);

Convert a schema from Spark

To create an Iceberg schema from an existing table, use converters in SparkSchemaUtil:

    import org.apache.iceberg.Schema;
    import org.apache.iceberg.spark.SparkSchemaUtil;

    Schema schema = SparkSchemaUtil.schemaForTable(sparkSession, tableName);

Partitioning

Create a partition spec

Partition specs describe how Iceberg should group records into data files. Partition specs are created for a table’s schema using a builder.

This example creates a partition spec for the logs table that partitions records by the hour of the log event’s timestamp and by log level:

    import org.apache.iceberg.PartitionSpec;

    PartitionSpec spec = PartitionSpec.builderFor(schema)
        .hour("event_time")
        .identity("level")
        .build();

For more information on the different partition transforms that Iceberg offers, see the partitioning documentation.
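As an illustration of other built-in transforms, the following sketch builds an alternative spec for the same logs schema; the transform choices and parameters here are arbitrary examples, not recommendations:

    import org.apache.iceberg.PartitionSpec;

    PartitionSpec altSpec = PartitionSpec.builderFor(schema)
        .day("event_time")        // partition by day instead of hour
        .bucket("level", 16)      // hash level into 16 buckets
        .truncate("message", 10)  // truncate message to its first 10 characters
        .build();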

Branching and Tagging

Creating branches and tags

New branches and tags can be created via the Java library’s ManageSnapshots API.

    /* Create a branch test-branch which is retained for 1 week, and the latest 2 snapshots
       on test-branch will always be retained. Snapshots on test-branch which are created
       within the last hour will also be retained. */
    String branch = "test-branch";
    table.manageSnapshots()
        .createBranch(branch, 3)
        .setMinSnapshotsToKeep(branch, 2)
        .setMaxSnapshotAgeMs(branch, 3600000)
        .setMaxRefAgeMs(branch, 604800000)
        .commit();

    // Create a tag historical-tag at snapshot 10 which is retained for a day
    String tag = "historical-tag";
    table.manageSnapshots()
        .createTag(tag, 10)
        .setMaxRefAgeMs(tag, 86400000)
        .commit();

Committing to branches

Writing to a branch is done by specifying toBranch on the operation. For the full list of operations, refer to UpdateOperations.

    // Append FILE_A to branch test-branch
    String branch = "test-branch";

    table.newAppend()
        .appendFile(FILE_A)
        .toBranch(branch)
        .commit();

    // Perform row level updates on "test-branch"
    table.newRowDelta()
        .addRows(DATA_FILE)
        .addDeletes(DELETES)
        .toBranch(branch)
        .commit();

    // Perform a rewrite operation replacing SMALL_FILE_1 and SMALL_FILE_2 on "test-branch" with compactedFile
    table.newRewrite()
        .rewriteFiles(ImmutableSet.of(SMALL_FILE_1, SMALL_FILE_2), ImmutableSet.of(compactedFile))
        .toBranch(branch)
        .commit();

Reading from branches and tags

Reading from a branch or tag can be done as usual via the Table Scan API, by passing the branch or tag name to the useRef API. When a branch is passed in, the snapshot that's used is the head of the branch. Note that currently reading from a branch and specifying an asOfSnapshotId in the scan is not supported.

    // Read from the head snapshot of test-branch
    TableScan branchRead = table.newScan().useRef("test-branch");

    // Read from the snapshot referenced by audit-tag
    TableScan tagRead = table.newScan().useRef("audit-tag");

Replacing and fast forwarding branches and tags

The snapshots which existing branches and tags point to can be updated via the replace APIs. The fast forward operation is similar to git fast-forwarding. Fast forward can be used to advance a target branch to the head of a source branch or a tag when the target branch is an ancestor of the source. For both fast forward and replace, retention properties of the target branch are maintained by default.

  1. // Update "test-branch" to point to snapshot 4
  2. table.manageSnapshots()
  3. .replaceBranch(branch, 4)
  4. .commit()
  5. String tag = "audit-tag";
  6. // Replace "audit-tag" to point to snapshot 3 and update its retention
  7. table.manageSnapshots()
  8. .replaceBranch(tag, 4)
  9. .setMaxRefAgeMs(1000)
  10. .commit()
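The fast forward operation itself is exposed through fastForwardBranch on the same API in recent Iceberg versions. A minimal sketch that advances test-branch to the head of main, assuming main exists and test-branch is an ancestor of its head:

    // Fast forward "test-branch" to the head of "main"
    table.manageSnapshots()
        .fastForwardBranch("test-branch", "main")
        .commit();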

Updating retention properties

Retention properties for branches and tags can be updated as well. Use the setMaxRefAgeMs for updating the retention property of the branch or tag itself. Branch snapshot retention properties can be updated via the setMinSnapshotsToKeep and setMaxSnapshotAgeMs APIs.

    String branch = "test-branch";

    // Update retention properties for test-branch
    table.manageSnapshots()
        .setMinSnapshotsToKeep(branch, 10)
        .setMaxSnapshotAgeMs(branch, 7200000)
        .setMaxRefAgeMs(branch, 604800000)
        .commit();

    // Update retention properties for test-tag
    table.manageSnapshots()
        .setMaxRefAgeMs("test-tag", 604800000)
        .commit();

Removing branches and tags

Branches and tags can be removed via the removeBranch and removeTag APIs, respectively.

    // Remove test-branch
    table.manageSnapshots()
        .removeBranch("test-branch")
        .commit();

    // Remove test-tag
    table.manageSnapshots()
        .removeTag("test-tag")
        .commit();