Some background services receive data through HTTP requests; in this scenario, Apache Flink can deliver its results by issuing HTTP requests. Apache Flink does not currently provide an official connector for writing data over HTTP. Apache StreamPark encapsulates HttpSink, based on asynchttpclient, to write data asynchronously in real time.

HttpSink writes are not transactional; writing data to the target service provides AT_LEAST_ONCE semantics. Records that still fail after the configured number of retries are written to an external component (Apache Kafka, MySQL, HDFS, Apache HBase), from which the data can be restored manually to reach eventual consistency.

HTTP asynchronous write

Asynchronous writing uses asynchttpclient as the HTTP client, so you need to add the asynchttpclient dependency first:

<dependency>
    <groupId>org.asynchttpclient</groupId>
    <artifactId>async-http-client</artifactId>
    <optional>true</optional>
</dependency>
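
To show what asynchronous writing with asynchttpclient looks like on its own, here is a minimal, self-contained sketch. It is not part of StreamPark; the endpoint URL and request body are placeholders.

Scala

import org.asynchttpclient.Dsl.asyncHttpClient

object AsyncHttpClientDemo extends App {
  // Reusable asynchronous HTTP client from async-http-client
  val client = asyncHttpClient()

  // The request is sent without blocking the calling thread;
  // the CompletableFuture completes when the response arrives.
  client.preparePost("http://127.0.0.1:8080/receive") // placeholder endpoint
    .setBody("""{"userId": 1, "siteId": 2}""")        // placeholder payload
    .execute()
    .toCompletableFuture
    .whenComplete { (response, error) =>
      if (error != null) error.printStackTrace()
      else println(s"status = ${response.getStatusCode}")
    }

  // Give the demo request time to finish, then release the client
  // (a real sink would close it in its close() hook).
  Thread.sleep(3000)
  client.close()
}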

Write with Apache StreamPark™

HTTP methods supported by asynchronous write

HttpSink supports the HTTP methods get, post, patch, put, delete, options, and trace, each exposed through a method of the same name on HttpSink. The details are as follows:

Scala

class HttpSink(@(transient@param) ctx: StreamingContext,
               header: Map[String, String] = Map.empty[String, String],
               parallelism: Int = 0,
               name: String = null,
               uid: String = null) extends Sink {

  def get(stream: DataStream[String]): DataStreamSink[String] = sink(stream, HttpGet.METHOD_NAME)
  def post(stream: DataStream[String]): DataStreamSink[String] = sink(stream, HttpPost.METHOD_NAME)
  def patch(stream: DataStream[String]): DataStreamSink[String] = sink(stream, HttpPatch.METHOD_NAME)
  def put(stream: DataStream[String]): DataStreamSink[String] = sink(stream, HttpPut.METHOD_NAME)
  def delete(stream: DataStream[String]): DataStreamSink[String] = sink(stream, HttpDelete.METHOD_NAME)
  def options(stream: DataStream[String]): DataStreamSink[String] = sink(stream, HttpOptions.METHOD_NAME)
  def trace(stream: DataStream[String]): DataStreamSink[String] = sink(stream, HttpTrace.METHOD_NAME)

  private[this] def sink(stream: DataStream[String], method: String): DataStreamSink[String] = {
    val params = ctx.parameter.toMap
      .filter(_._1.startsWith(HTTP_SINK_PREFIX))
      .map(x => x._1.drop(HTTP_SINK_PREFIX.length + 1) -> x._2)
    val sinkFun = new HttpSinkFunction(params, header, method)
    val sink = stream.addSink(sinkFun)
    afterSink(sink, parallelism, name, uid)
  }
}

Configuration list for HTTP asynchronous write

http.sink:
  threshold:
    numWriters: 3
    queueCapacity: 10000 # Maximum capacity of the cache queue; estimate a suitable size yourself based on the size of a single record. If the value is too large, the upstream source arrives too fast, and the downstream writes cannot keep up, an OOM may occur.
    timeout: 100         # Timeout for sending an HTTP request
    retries: 3           # Maximum number of retries when sending fails
    successCode: 200     # Status code that indicates a successful send
  failover:
    table: record
    storage: mysql       # kafka, hbase, hdfs
    jdbc:
      jdbcUrl: jdbc:mysql://localhost:3306/test
      username: root
      password: 123456
    kafka:
      topic: bigdata
      bootstrap.servers: localhost:9091,localhost:9092,localhost:9093
    hbase:
      zookeeper.quorum: localhost
      zookeeper.property.clientPort: 2181
    hdfs:
      namenode: hdfs://localhost:8020 # namenode rpc address and port, e.g. hdfs://hadoop:8020, hdfs://hadoop:9000
      user: benjobs                   # user
      path: /http/failover            # save path
      format: yyyy-MM-dd
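
Every key under the http.sink prefix is read from the job configuration and passed to HttpSinkFunction with the prefix stripped, as in the sink method shown above. Below is a minimal sketch of that filtering, using a hypothetical properties map (the keys and values are examples, not a real job configuration).

Scala

object HttpSinkConfigDemo extends App {
  val HTTP_SINK_PREFIX = "http.sink"

  // Hypothetical flattened job configuration
  val allProps = Map(
    "http.sink.threshold.timeout" -> "100",
    "http.sink.threshold.retries" -> "3",
    "http.sink.failover.storage"  -> "mysql",
    "some.other.key"              -> "ignored"
  )

  // Keep only http.sink.* keys and drop the prefix plus the trailing dot,
  // mirroring the params expression in HttpSink.sink(...)
  val params = allProps
    .filter { case (k, _) => k.startsWith(HTTP_SINK_PREFIX) }
    .map { case (k, v) => k.drop(HTTP_SINK_PREFIX.length + 1) -> v }

  params.foreach { case (k, v) => println(s"$k = $v") }
  // prints: threshold.timeout = 100, threshold.retries = 3, failover.storage = mysql
}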

Writing data asynchronously over HTTP

The sample program is written in Scala:

Scala

import org.apache.streampark.flink.core.scala.FlinkStreaming
import org.apache.streampark.flink.core.scala.sink.HttpSink
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.DataStream

object HttpSinkApp extends FlinkStreaming {

  override def handle(): Unit = {
    val source = context.addSource(new TestSource)
    val value: DataStream[String] = source.map(x => s"http://127.0.0.1:8080?userId=${x.userId}&siteId=${x.siteId}")
    HttpSink().post(value).setParallelism(1)
  }
}

Since HTTP can only write one record per request, latency is relatively high and this sink is not suitable for writing large amounts of data; set a reasonable threshold to improve performance. When an asynchronous write fails, the record is added back to the cache queue, so data belonging to the same window may be written in two batches; thorough testing is recommended for scenarios with strict real-time requirements. Once a record exceeds the maximum number of retries, it is backed up to the external component, and the connection to that component is initialized at that moment, so make sure the failover component is available.
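
For illustration only, the retry-then-failover control flow described above can be sketched as follows. The names and the simulated failure are hypothetical and do not reflect StreamPark's internal classes.

Scala

import scala.util.{Failure, Success, Try}

object RetryThenFailoverDemo extends App {
  val maxRetries = 3

  // Simulate an HTTP send that always fails (placeholder for the real request)
  def sendHttp(record: String): Try[Unit] =
    Failure(new RuntimeException("target service unavailable"))

  // Placeholder for writing the record to the failover storage (Kafka, MySQL, HDFS, HBase)
  def failover(record: String): Unit =
    println(s"backing up to failover storage: $record")

  // Retry up to maxRetries times, then hand the record to the failover writer
  def write(record: String, attempt: Int = 1): Unit =
    sendHttp(record) match {
      case Success(_)                          => println(s"sent: $record")
      case Failure(_) if attempt < maxRetries  => write(record, attempt + 1)
      case Failure(e) =>
        println(s"giving up after $maxRetries attempts: ${e.getMessage}")
        failover(record)
    }

  write("""{"userId": 1}""")
}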

Other configuration

All other configurations must comply with the StreamPark configuration. For the specific configurable items and the role of each parameter, please refer to Project configuration.