Deploy SeaTunnel Engine

1. Download

SeaTunnel Engine is the default engine of SeaTunnel. The installation package of SeaTunnel already contains all the contents of SeaTunnel Engine.

2. Configure SEATUNNEL_HOME

You can configure SEATUNNEL_HOME by adding the file /etc/profile.d/seatunnel.sh with the following content:

# Replace /path/to/seatunnel with your SeaTunnel install path
export SEATUNNEL_HOME=/path/to/seatunnel
export PATH=$PATH:$SEATUNNEL_HOME/bin
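
After sourcing the profile script, it can help to confirm that the variable points at a real installation. The following is a minimal sketch, not part of SeaTunnel: check_seatunnel_home is a hypothetical helper name, and the only assumption about the install layout is that it contains the bin/seatunnel-cluster.sh script used later in this guide.

```shell
# Hypothetical helper (not shipped with SeaTunnel): sanity-check SEATUNNEL_HOME.
check_seatunnel_home() {
    if [ -z "${SEATUNNEL_HOME:-}" ]; then
        echo "SEATUNNEL_HOME is not set" >&2
        return 1
    fi
    # seatunnel-cluster.sh is the server start script referenced in this guide.
    if [ ! -f "$SEATUNNEL_HOME/bin/seatunnel-cluster.sh" ]; then
        echo "no bin/seatunnel-cluster.sh under $SEATUNNEL_HOME" >&2
        return 1
    fi
    echo "SEATUNNEL_HOME looks valid: $SEATUNNEL_HOME"
}
```

For example, run `. /etc/profile.d/seatunnel.sh` and then `check_seatunnel_home` in the same shell.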

3. Configure SeaTunnel Engine JVM options

SeaTunnel Engine supports two ways to set JVM options.

Add JVM options to $SEATUNNEL_HOME/bin/seatunnel-cluster.sh: edit the file and add JAVA_OPTS="-Xms2G -Xmx2G" as the first line.

Add JVM options when starting SeaTunnel Engine, for example: seatunnel-cluster.sh -DJvmOption="-Xms2G -Xmx2G"

4. Configure SeaTunnel Engine

SeaTunnel Engine provides many features, which are configured in seatunnel.yaml.

4.1 Backup count

SeaTunnel Engine implements cluster management based on Hazelcast IMDG. The cluster's state data (job running state, resource state) is stored in a Hazelcast IMap. Data saved in an IMap is distributed across all nodes of the cluster. Hazelcast partitions the data stored in the IMap, and the number of backups can be specified for each partition. Therefore, SeaTunnel Engine can achieve cluster HA without relying on other services (for example, ZooKeeper).

The backup count defines the number of synchronous backups. For example, if it is set to 1, the backup of a partition is placed on one other member; if it is set to 2, it is placed on two other members.

We suggest setting backup-count to max(1, min(5, N/2)), where N is the number of cluster nodes.

seatunnel:
    engine:
        backup-count: 1
        # other config
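
The suggested value can be computed from the node count. The sketch below assumes the suggestion is meant to clamp N/2 (integer division) to the range 1 to 5; suggest_backup_count is a hypothetical helper name, not a SeaTunnel command.

```shell
# Hypothetical helper: suggest a backup-count for a cluster of N nodes,
# assuming the rule is max(1, min(5, N/2)) with integer division.
suggest_backup_count() {
    nodes=$1
    count=$((nodes / 2))
    if [ "$count" -gt 5 ]; then count=5; fi
    if [ "$count" -lt 1 ]; then count=1; fi
    echo "$count"
}

suggest_backup_count 3    # prints 1
suggest_backup_count 8    # prints 4
suggest_backup_count 20   # prints 5
```

The lower bound keeps at least one backup even on very small clusters, and the upper bound stops the number of copies from growing with cluster size.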

4.2 Slot service

The number of Slots determines the number of TaskGroups the cluster node can run in parallel. SeaTunnel Engine is a data synchronization engine and most jobs are IO intensive.

Dynamic slots are recommended.

seatunnel:
    engine:
        slot-service:
            dynamic-slot: true
        # other config

4.3 Checkpoint Manager

Like Flink, SeaTunnel Engine supports the Chandy–Lamport algorithm, so it can synchronize data without loss or duplication.

interval

The interval between two checkpoints, in milliseconds. If the checkpoint.interval parameter is configured in the env block of the job config file, it overrides the value set here.

timeout

The timeout of a checkpoint. If a checkpoint cannot complete within the timeout period, a checkpoint failure is triggered and the job is restored.

Example

seatunnel:
    engine:
        backup-count: 1
        print-execution-info-interval: 10
        slot-service:
            dynamic-slot: true
        checkpoint:
            interval: 300000
            timeout: 10000

checkpoint storage

For details on checkpoint storage, see the checkpoint storage documentation.

4.4 Historical Job expiration Config

Information about each completed job, such as its status, counters, and error logs, is stored in an IMap object. As the number of jobs that have run grows, memory usage grows and can eventually overflow. To avoid this, adjust the history-job-expire-minutes parameter, which is specified in minutes. The default value is 1440 minutes, that is, one day.

Example

seatunnel:
  engine:
    history-job-expire-minutes: 1440

4.5 ClassLoader Cache Mode

This option mainly addresses the resource leak caused by constantly creating and attempting to destroy classloaders. If you encounter exceptions related to metaspace overflow, try enabling it. To reduce how often classloaders are created, when this option is enabled SeaTunnel does not release a job's classloader when the job completes, so subsequent jobs can reuse it. The option is therefore most effective when the running jobs use only a few types of Source/Sink connectors. The default value is false.

Example

seatunnel:
  engine:
    classloader-cache-mode: true

5. Configure SeaTunnel Engine Server

All SeaTunnel Engine Server configuration is in the hazelcast.yaml file.

5.1 cluster-name

SeaTunnel Engine nodes use the cluster name to determine whether another node belongs to the same cluster. If two nodes have different cluster names, SeaTunnel Engine rejects the service request.

5.2 Network

Based on Hazelcast, a SeaTunnel Engine cluster is a network of cluster members that run SeaTunnel Engine Server. Cluster members automatically join together to form a cluster. This automatic joining relies on the discovery mechanisms that cluster members use to find each other.

Please note that, after a cluster is formed, communication between cluster members is always via TCP/IP, regardless of the discovery mechanism used.

SeaTunnel Engine uses the following discovery mechanisms.

TCP

You can configure SeaTunnel Engine to be a full TCP/IP cluster. See the Discovering Members by TCP section for configuration details.

An example hazelcast.yaml looks like this:

hazelcast:
  cluster-name: seatunnel
  network:
    join:
      tcp-ip:
        enabled: true
        member-list:
          - hostname1
    port:
      auto-increment: false
      port: 5801
  properties:
    hazelcast.logging.type: log4j2

TCP is the recommended discovery mechanism for a standalone SeaTunnel Engine cluster.

Hazelcast also provides other service discovery methods. For details, refer to the Hazelcast network documentation.

5.3 Map

MapStores connect to an external data store only when they are configured on a map. This section explains how to configure a map with a MapStore. For details, refer to the Hazelcast map documentation.

type

The type of IMap persistence. Currently only hdfs is supported.

namespace

Distinguishes the data storage locations of different businesses, similar to an OSS bucket name.

clusterName

This parameter is primarily used for cluster isolation. It distinguishes different clusters, such as cluster1 and cluster2, and can also be used to distinguish different businesses.

fs.defaultFS

Files are read and written through the HDFS API, so using this storage requires an HDFS configuration.

If you use HDFS, you can configure it like this:

map:
    engine*:
       map-store:
         enabled: true
         initial-mode: EAGER
         factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
         properties:
           type: hdfs
           namespace: /tmp/seatunnel/imap
           clusterName: seatunnel-cluster
           storage.type: hdfs
           fs.defaultFS: hdfs://localhost:9000
           # If you need an hdfs-site config, you can configure it like this:
           hdfs_site_path: /path/to/your/hdfs_site_path

If there is no HDFS and your cluster has only one node, you can configure it to use local files like this:

map:
    engine*:
       map-store:
         enabled: true
         initial-mode: EAGER
         factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
         properties:
           type: hdfs
           namespace: /tmp/seatunnel/imap
           clusterName: seatunnel-cluster
           storage.type: hdfs
           fs.defaultFS: file:///

If you use OSS, you can configure it like this:

map:
    engine*:
       map-store:
         enabled: true
         initial-mode: EAGER
         factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
         properties:
           type: hdfs
           namespace: /tmp/seatunnel/imap
           clusterName: seatunnel-cluster
           storage.type: oss
           block.size: block size(bytes)
           oss.bucket: oss://bucket name/
           fs.oss.accessKeyId: OSS access key id
           fs.oss.accessKeySecret: OSS access key secret
           fs.oss.endpoint: OSS endpoint
           fs.oss.credentials.provider: org.apache.hadoop.fs.aliyun.oss.AliyunCredentialsProvider

6. Configure SeaTunnel Engine Client

All SeaTunnel Engine Client configuration is in hazelcast-client.yaml.

6.1 cluster-name

The Client must have the same cluster-name as the SeaTunnel Engine. Otherwise, SeaTunnel Engine will reject the client request.

6.2 Network

cluster-members

Add the addresses of all SeaTunnel Engine Server nodes here.

hazelcast-client:
  cluster-name: seatunnel
  properties:
      hazelcast.logging.type: log4j2
  network:
    cluster-members:
      - hostname1:5801

7. Start SeaTunnel Engine Server Node

The server node can be started as a daemon with -d.

mkdir -p $SEATUNNEL_HOME/logs
./bin/seatunnel-cluster.sh -d

Logs are written to $SEATUNNEL_HOME/logs/seatunnel-engine-server.log
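
To confirm that the node came up, the log file can be inspected after startup. A minimal sketch follows; show_server_log is a hypothetical helper name, and it relies only on the log path given above and the standard tail command.

```shell
# Hypothetical helper: print the last lines of the server log (default 50).
show_server_log() {
    if [ -z "${SEATUNNEL_HOME:-}" ]; then
        echo "SEATUNNEL_HOME is not set" >&2
        return 1
    fi
    log_file="$SEATUNNEL_HOME/logs/seatunnel-engine-server.log"
    if [ ! -f "$log_file" ]; then
        echo "log file not found: $log_file" >&2
        return 1
    fi
    # The optional first argument is the number of lines to show.
    tail -n "${1:-50}" "$log_file"
}
```

For example, `show_server_log 100` prints the last 100 lines, and a non-zero exit status indicates that the log file has not been created yet.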

8. Install SeaTunnel Engine Client

You only need to copy the $SEATUNNEL_HOME directory from a SeaTunnel Engine node to the Client node and configure SEATUNNEL_HOME there in the same way as on the SeaTunnel Engine Server node.
