Iceberg format

Iceberg format can be accessed using the Iceberg Connector provided by Trino. please refer to the documentation at Iceberg Trino user manual for more information.

Paimon format

Paimon format can be accessed using the Paimon Connector provided by Trino. please refer to the documentation at Paimon Trino user manual for more information.

Mixed format

Install

  • Create the {trino_home}/plugin/amoro directory in the Trino installation package, and extract the contents of the amoro-trino package trino-amoro-xx-SNAPSHOT.tar.gz to the {trino_home}/plugin/amoro directory.
  • Configure the Catalog configuration file for Amoro in the {trino_home}/etc/catalog directory, for example:
    1. connector.name=arctic
    2. arctic.url=thrift://{ip}:{port}/{catalogName}
  • Configure the JVM configuration file for Trino in the {trino_home}/etc directory named jvm.config :
    1. --add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED

Support SQL statement

Query Table

By adopting the Merge-On-Read approach to read Mixed Format, the latest data of the table can be read, for example:

  1. SELECT * FROM "{TABLE_NAME}"

Query BaseStore of Table

Directly querying the BaseStore in a table with a primary key is supported. The BaseStore stores the stock data of the table, which is usually generated by batch job or optimization. The queried data is static, and the query efficiency is very high, but the timeliness is not good. The syntax is as follows:

  1. SELECT * FROM "{TABLE_NAME}#BASE"

Query ChangeStore of Table

Directly querying the ChangeStore in a table with a primary key is supported. The ChangeStore stores the stream and change data of the table, which is usually written in real time by stream job. The change records of the table can be queried through the ChangeStore, and the expiry time of the data in the ChangeStore determines how long ago the change records can be queried.

  1. SELECT * FROM "{TABLE_NAME}#CHANGE"

Three additional columns will be included in the query result, which are:

  • _transaction_id: The transaction ID allocated by AMS when the data is written. In batch mode, it is allocated for each SQL execution, and in stream mode, it is allocated for each checkpoint.
  • _file_offset:Indicates the order in which the data was written in the same batch of _transaction_id.
  • _change_action:Indicates the type of data, which can be either INSERT or DELETE.

Trino and Amoro Type Mapping:

Amoro type Trino type
BOOLEAN BOOLEAN
INT INTEGER
LONG BIGINT
FLOAT REAL
DOUBLE DOUBLE
DECIMAL(p,s) DECIMAL(p,s)
DATE DATE
TIME TIME(6)
TIMESTAMP TIMESTAMP(6)
TIMESTAMPTZ TIMESTAMP(6) WITH TIME ZONE
STRING VARCHAR
UUID UUID
BINARY VARBINARY
STRUCT(...) ROW(...)
LIST(e) ARRAY(e)
MAP(k,v) MAP(k,v)

Trino uses proxy user to access Hadoop cluster.

By default, when Trino queries Amoro, it uses the Hadoop user configured in the catalog creation to access the Hadoop cluster. To use Trino’s user to access the Hadoop cluster, you need enable Hadoop impersonation by adding the arctic.hdfs.impersonation.enabled=true parameter in the Amoro catalog configuration file located in the {trino_home}/etc/catalog directory, as follows.

  1. connector.name=arctic
  2. arctic.url=thrift://{ip}:{port}/{catalogName}
  3. arctic.hdfs.impersonation.enabled=true

arctic.hdfs.impersonation.enabled default false

To use Hadoop impersonation, you need to enable the proxy feature for the Hadoop user configured in the catalog creation in the Hadoop cluster beforehand, and make sure that it can proxy the Trino querying user. Please refer to Hadoop Proxy User for more information.