Intro to config file

In SeaTunnel, the most important thing is the config file. It is where users describe their data synchronization requirements and get the most out of SeaTunnel. The rest of this page explains how to write it.

The main format of the config file is HOCON; for more details on this format, refer to HOCON-GUIDE. The JSON format is also supported, but the name of a JSON config file must end with .json.

Example

Before you read on, you can find config file examples here and in the config directory of the distribution package.

Config file structure

The Config file will be similar to the one below.

hocon

env {
  job.mode = "BATCH"
}
source {
  FakeSource {
    result_table_name = "fake"
    row.num = 100
    schema = {
      fields {
        name = "string"
        age = "int"
        card = "int"
      }
    }
  }
}
transform {
  Filter {
    source_table_name = "fake"
    result_table_name = "fake1"
    fields = [name, card]
  }
}
sink {
  Clickhouse {
    host = "clickhouse:8123"
    database = "default"
    table = "seatunnel_console"
    fields = ["name", "card"]
    username = "default"
    password = ""
    source_table_name = "fake1"
  }
}

multi-line support

HOCON supports multi-line strings, which let you include extended passages of text without escaping newline characters or worrying about special formatting. To write one, enclose the text in triple quotes ("""). For example:

var = """
Apache SeaTunnel is a
next-generation high-performance,
distributed, massive data integration tool.
"""
sql = """ select * from "table" """
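A typical use for triple quotes is a long SQL statement. The fragment below is a minimal sketch, assuming the sql transform and the "fake" table from the examples on this page; the result name "fake_adults" and the query itself are hypothetical and only illustrate the syntax.

```hocon
transform {
  sql {
    source_table_name = "fake"
    # Hypothetical result name, used only in this sketch
    result_table_name = "fake_adults"
    # Newlines and indentation inside the triple quotes are kept verbatim
    query = """
      select name, age
      from fake
      where age > 18
    """
  }
}
```

Because the string is kept exactly as written, the query can be laid out for readability without any escaping.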

json


{
  "env": {
    "job.mode": "batch"
  },
  "source": [
    {
      "plugin_name": "FakeSource",
      "result_table_name": "fake",
      "row.num": 100,
      "schema": {
        "fields": {
          "name": "string",
          "age": "int",
          "card": "int"
        }
      }
    }
  ],
  "transform": [
    {
      "plugin_name": "Filter",
      "source_table_name": "fake",
      "result_table_name": "fake1",
      "fields": ["name", "card"]
    }
  ],
  "sink": [
    {
      "plugin_name": "Clickhouse",
      "host": "clickhouse:8123",
      "database": "default",
      "table": "seatunnel_console",
      "fields": ["name", "card"],
      "username": "default",
      "password": "",
      "source_table_name": "fake1"
    }
  ]
}

As you can see, the Config file contains several sections: env, source, transform, sink. Different modules have different functions. After you understand these modules, you will understand how SeaTunnel works.

env

env holds optional engine parameters. Whichever engine you run on (Spark or Flink), its optional parameters go here.

Note that the parameters are separated by engine, while the common parameters are configured as before. For the specific configuration rules of the Flink and Spark engine parameters, refer to JobEnvConfig.
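As a rough sketch, an env block that mixes common parameters with an engine-specific one might look like the following. job.mode, job.name and parallelism appear elsewhere on this page; checkpoint.interval and the spark.-prefixed key are assumptions used for illustration, so confirm the exact names against JobEnvConfig for your version.

```hocon
env {
  # Common parameters, shared by all engines
  job.name = "fake_to_clickhouse"
  job.mode = "BATCH"
  parallelism = 2
  # Assumed: checkpoint interval in milliseconds
  checkpoint.interval = 10000
  # Assumed: engine-specific option, only meaningful when running on Spark
  spark.executor.memory = "1g"
}
```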

source

source defines where SeaTunnel fetches data from; the fetched data is then used in the next step. Multiple sources can be defined at the same time. For the currently supported sources, see Source of SeaTunnel. Each source has its own specific parameters that define how to fetch data, and SeaTunnel also extracts the parameters that every source uses, such as result_table_name, which names the data generated by the current source so that other modules can refer to it later.

transform

Once we have the data source, we may need to process the data further, which is what the transform module is for. Note the word ‘may’: the transform section can also be left out entirely, so that data goes directly from source to sink, as below.

env {
  job.mode = "BATCH"
}
source {
  FakeSource {
    result_table_name = "fake"
    row.num = 100
    schema = {
      fields {
        name = "string"
        age = "int"
        card = "int"
      }
    }
  }
}
sink {
  Clickhouse {
    host = "clickhouse:8123"
    database = "default"
    table = "seatunnel_console"
    fields = ["name", "age", "card"]
    username = "default"
    password = ""
    source_table_name = "fake"
  }
}

Like source, each transform has its own specific parameters. For the currently supported transforms, see Transform V2 of SeaTunnel.

sink

Our purpose with SeaTunnel is to synchronize data from one place to another, so it is critical to define how and where data is written. The sink module provided by SeaTunnel lets you complete this operation quickly and efficiently. Sinks and sources are very similar; the difference is that a source reads data and a sink writes it. Check out our supported sinks.

Other

When multiple sources and multiple sinks are defined, which data does each sink read, and which data does each transform read? Two key configurations answer this: result_table_name and source_table_name. Each source module is configured with a result_table_name that names the data it produces, and transform and sink modules use source_table_name to refer to that name, indicating which data they want to read and process. A transform, as an intermediate processing module, can use both result_table_name and source_table_name at the same time. Not every module in the example configs above sets these two parameters, because SeaTunnel has a default convention: if they are not configured, the data generated by the last module of the previous node is used. This is much more convenient when there is only one source.
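The wiring described above can be sketched as a job with two sources and two sinks. This is an illustration under assumptions, not a tested job: the table names (users, orders, user_names) and the order_id and amount fields are hypothetical, and repeating the same plugin block inside one section relies on SeaTunnel's own config handling, since plain HOCON would merge duplicate keys.

```hocon
env {
  job.mode = "BATCH"
}
source {
  # First source publishes its rows under the name "users"
  FakeSource {
    result_table_name = "users"
    row.num = 100
    schema = {
      fields {
        name = "string"
        age = "int"
      }
    }
  }
  # Second source publishes its rows under the name "orders"
  FakeSource {
    result_table_name = "orders"
    row.num = 50
    schema = {
      fields {
        order_id = "int"
        amount = "int"
      }
    }
  }
}
transform {
  # Reads "users" and publishes the filtered rows as "user_names"
  Filter {
    source_table_name = "users"
    result_table_name = "user_names"
    fields = [name]
  }
}
sink {
  # Reads the transform output
  Console {
    source_table_name = "user_names"
  }
  # Reads the second source directly, bypassing the transform
  Console {
    source_table_name = "orders"
  }
}
```

With a single source and a single sink, all of the table-name lines could be dropped and the default convention would connect the modules.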

Config variable substitution

In a config file, you can define variables and have them replaced at run time. This is only supported for HOCON format files.

env {
  job.mode = "BATCH"
  job.name = ${jobName}
  parallelism = 2
}
source {
  FakeSource {
    result_table_name = ${resName}
    row.num = ${rowNum}
    string.template = ${strTemplate}
    int.template = [20, 21]
    schema = {
      fields {
        name = ${nameType}
        age = "int"
      }
    }
  }
}
transform {
    sql {
      source_table_name = "fake"
      result_table_name = "sql"
      query = "select * from "${resName}" where name = '"${nameVal}"' "
    }
}
sink {
  Console {
     source_table_name = "sql"
     username = ${username}
     password = ${password}
     blankSpace = ${blankSpace}
  }
}

In the above config, we define some variables, like ${rowNum}, ${resName}. We can replace those parameters with this shell command:

./bin/seatunnel.sh -c <this_config_file> \
-i jobName='st var job' \
-i resName=fake \
-i rowNum=10 \
-i strTemplate=['abc','d~f','h i'] \
-i nameType=string \
-i nameVal=abc \
-i username=seatunnel=2.3.1 \
-i password='$a^b%c.d~e0*9(' \
-i blankSpace='2023-12-26 11:30:00' \
-e local

Then the final submitted config is:

env {
  job.mode = "BATCH"
  job.name = "st var job"
  parallelism = 2
}
source {
  FakeSource {
    result_table_name = "fake"
    row.num = 10
    string.template = ["abc","d~f","h i"]
    int.template = [20, 21]
    schema = {
      fields {
        name = string
        age = "int"
      }
    }
  }
}
transform {
    sql {
      source_table_name = "fake"
      result_table_name = "sql"
      query = "select * from fake where name = 'abc' "
    }
}
sink {
  Console {
     source_table_name = "sql"
     username = "seatunnel=2.3.1"
     password = "$a^b%c.d~e0*9("
     blankSpace = "2023-12-26 11:30:00"
  }
}

Some Notes:

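Quote the value with ' if it contains a space or a special character (such as ( ).
If the variable to be replaced sits inside a string quoted with " or ', as resName and nameVal do in the query above, you need to close the string with " before the variable and reopen it with " after the variable.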
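The second note is easier to see side by side. The lines below are taken from the config above; the commented-out last line is a hypothetical counter-example, based on the standard HOCON rule that substitutions are not expanded inside quoted strings.

```hocon
# Bare placeholder: the whole value comes from the -i argument
result_table_name = ${resName}

# Placeholder inside a quoted string: close the string with ",
# write the placeholder, then reopen the string with "
query = "select * from "${resName}" where name = '"${nameVal}"' "

# Counter-example (hypothetical): in standard HOCON, ${resName} here
# stays literal text because it sits inside the quotes
# query = "select * from ${resName}"
```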

What’s More

If you want to know more about this configuration format, see HOCON.