Checkpoint Storage

Introduction

Checkpointing is a fault-tolerance and recovery mechanism: it ensures that a running program can recover on its own even if it unexpectedly encounters an exception.

Checkpoint Storage

Checkpoint storage is the mechanism used to persist checkpoint data.

SeaTunnel Engine supports the following checkpoint storage types:

  • HDFS (OSS, S3, HDFS, LocalFile)
  • LocalFile (native); deprecated: use HDFS (LocalFile) instead.

We use the microkernel design pattern to separate the checkpoint storage module from the engine. This allows users to implement their own checkpoint storage modules.

checkpoint-storage-api is the checkpoint storage API module; it defines the interfaces of the checkpoint storage module.

If you want to implement your own checkpoint storage module, implement the CheckpointStorage interface and provide the corresponding CheckpointStorageFactory implementation.

Checkpoint Storage Configuration

The configuration of the seatunnel-server module is in the seatunnel.yaml file.

  seatunnel:
    engine:
      checkpoint:
        storage:
          type: hdfs # plugin name of checkpoint storage; we support hdfs (S3, local, hdfs). localfile (native local file) is the default, but this plugin is deprecated
          # plugin configuration
          plugin-config:
            namespace: # checkpoint storage parent path, the default value is /seatunnel/checkpoint/
            K1: V1 # other plugin configuration
            K2: V2 # other plugin configuration

Notice: namespace must end with "/".
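
For example, a minimal snippet that satisfies this rule, assuming you keep the default parent path:

  seatunnel:
    engine:
      checkpoint:
        storage:
          type: hdfs
          plugin-config:
            # note the trailing "/"
            namespace: /seatunnel/checkpoint/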

OSS

Aliyun OSS is based on hdfs-file, so you can refer to the Hadoop OSS documentation to configure OSS.

When interacting with OSS buckets, the OSS client needs credentials. The client supports multiple authentication mechanisms, and you can configure which mechanisms to use and their order of use. Custom implementations of org.apache.hadoop.fs.aliyun.oss.AliyunCredentialsProvider may also be used. If you use AliyunCredentialsProvider (the credentials can be obtained from Aliyun Access Key Management), they consist of an access key and a secret key. You can configure it like this:

  seatunnel:
    engine:
      checkpoint:
        interval: 6000
        timeout: 7000
        storage:
          type: hdfs
          max-retained: 3
          plugin-config:
            storage.type: oss
            oss.bucket: your-bucket
            fs.oss.accessKeyId: your-access-key
            fs.oss.accessKeySecret: your-secret-key
            fs.oss.endpoint: endpoint address
            fs.oss.credentials.provider: org.apache.hadoop.fs.aliyun.oss.AliyunCredentialsProvider

For additional reading on the Hadoop Credential Provider API see: Credential Provider API.

For Aliyun OSS Credential Provider implementations, see: Auth Credential Providers.

S3

S3 is based on hdfs-file, so you can refer to the Hadoop S3 documentation to configure S3.

Except when interacting with public S3 buckets, the S3A client needs credentials to interact with buckets. The client supports multiple authentication mechanisms, and you can configure which mechanisms to use and their order of use. Custom implementations of com.amazonaws.auth.AWSCredentialsProvider may also be used. If you use SimpleAWSCredentialsProvider (the credentials can be obtained from the Amazon Security Token Service), they consist of an access key and a secret key. You can configure it like this:

  seatunnel:
    engine:
      checkpoint:
        interval: 6000
        timeout: 7000
        storage:
          type: hdfs
          max-retained: 3
          plugin-config:
            storage.type: s3
            s3.bucket: your-bucket
            fs.s3a.access.key: your-access-key
            fs.s3a.secret.key: your-secret-key
            fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider

If you use InstanceProfileCredentialsProvider, which supports instance profile credentials when running in an EC2 VM, see iam-roles-for-amazon-ec2. You can configure it like this:

  seatunnel:
    engine:
      checkpoint:
        interval: 6000
        timeout: 7000
        storage:
          type: hdfs
          max-retained: 3
          plugin-config:
            storage.type: s3
            s3.bucket: your-bucket
            fs.s3a.endpoint: your-endpoint
            fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.InstanceProfileCredentialsProvider

If you want to use MinIO, which supports the S3 protocol, as checkpoint storage, configure it this way:

  seatunnel:
    engine:
      checkpoint:
        interval: 10000
        timeout: 60000
        storage:
          type: hdfs
          max-retained: 3
          plugin-config:
            storage.type: s3
            fs.s3a.access.key: xxxxxxxxx # Access Key of MinIO
            fs.s3a.secret.key: xxxxxxxxxxxxxxxxxxxxx # Secret Key of MinIO
            fs.s3a.endpoint: http://127.0.0.1:9000 # MinIO HTTP service access address
            s3.bucket: s3a://test # test is the bucket name used to store the checkpoint files
            fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
            # important: the user of this key needs write permission for the bucket, otherwise a 403 exception will be returned

For additional reading on the Hadoop Credential Provider API see: Credential Provider API.

HDFS

If you use HDFS, you can configure it like this:

  seatunnel:
    engine:
      checkpoint:
        storage:
          type: hdfs
          max-retained: 3
          plugin-config:
            storage.type: hdfs
            fs.defaultFS: hdfs://localhost:9000
            # if you use kerberos, you can configure it like this:
            kerberosPrincipal: your-kerberos-principal
            kerberosKeytabFilePath: your-kerberos-keytab
            # if you need hdfs-site config, you can configure it like this:
            hdfs_site_path: /path/to/your/hdfs_site_path

If HDFS is in HA mode, you can configure it like this:

  seatunnel:
    engine:
      checkpoint:
        storage:
          type: hdfs
          max-retained: 3
          plugin-config:
            storage.type: hdfs
            fs.defaultFS: hdfs://usdp-bing
            seatunnel.hadoop.dfs.nameservices: usdp-bing
            seatunnel.hadoop.dfs.ha.namenodes.usdp-bing: nn1,nn2
            seatunnel.hadoop.dfs.namenode.rpc-address.usdp-bing.nn1: usdp-bing-nn1:8020
            seatunnel.hadoop.dfs.namenode.rpc-address.usdp-bing.nn2: usdp-bing-nn2:8020
            seatunnel.hadoop.dfs.client.failover.proxy.provider.usdp-bing: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

If HDFS has other configurations in hdfs-site.xml or core-site.xml, set them by using the seatunnel.hadoop. prefix.
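
For example, a minimal sketch assuming you want to pass the standard HDFS property dfs.replication (any other hdfs-site.xml or core-site.xml key can be mapped the same way):

  seatunnel:
    engine:
      checkpoint:
        storage:
          type: hdfs
          max-retained: 3
          plugin-config:
            storage.type: hdfs
            fs.defaultFS: hdfs://localhost:9000
            # dfs.replication comes from hdfs-site.xml and is passed through with the seatunnel.hadoop. prefix
            seatunnel.hadoop.dfs.replication: 3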

LocalFile

  seatunnel:
    engine:
      checkpoint:
        interval: 6000
        timeout: 7000
        storage:
          type: hdfs
          max-retained: 3
          plugin-config:
            storage.type: hdfs
            fs.defaultFS: file:/// # Ensure that the directory has write permission

Enable cache

When storage.type is hdfs, the cache is disabled by default. If you want to enable it, set disable.cache: false.

  seatunnel:
    engine:
      checkpoint:
        interval: 6000
        timeout: 7000
        storage:
          type: hdfs
          max-retained: 3
          plugin-config:
            storage.type: hdfs
            disable.cache: false
            fs.defaultFS: hdfs:///

or

  seatunnel:
    engine:
      checkpoint:
        interval: 6000
        timeout: 7000
        storage:
          type: hdfs
          max-retained: 3
          plugin-config:
            storage.type: hdfs
            disable.cache: false
            fs.defaultFS: file:///