The share level of Kyuubi engines describes the relationship between sessions and engines. It determines whether a new session can share an existing backend engine with other sessions or not. Sessions correspond to the JDBC/ODBC/Thrift connections that end-users create from clients, while engines are standalone applications with the full capabilities of Spark SQL or Flink SQL (under development), running on single-node machines or clusters.

The share level of Kyuubi engines works the same whether in HA or single-node mode. In other words, an engine can be shared cluster-wide by all Kyuubi server peers if its share level allows.

Why do we need this feature?

Apache Spark is a unified engine for large-scale data analytics. Using Spark to process data is like driving an all-wheel-drive, high-horsepower supercar. However,

  • Cars are limited by their 0-60 times. Similarly, all Spark applications have to warm up before going full speed.
  • Cars have a fixed number of seats and must not be overloaded. Due to the master-slave architecture of Spark and the resources configured ahead of time, the overall workload of a single application is predictable.
  • Cars have various shapes to meet our needs.

With this feature, Kyuubi gives you a more flexible way to handle different big data workloads.

The currently supported share levels

The currently supported share levels are:

| Share Level | Syntax | Scenario | Isolation Degree | Shareability |
|-------------|--------|----------|------------------|--------------|
| CONNECTION | One engine per session | Large-scale ETL, Ad hoc | High | Low |
| USER | One engine per user | Ad hoc, Small-scale ETL | Medium | Medium |
| GROUP | One engine per primary group | Ad hoc, Small-scale ETL | Low | High |
| SERVER | One engine per cluster | Admin | Highest if secured, lowest if unsecured | Admin only, if secured |
  • A higher isolation degree gives better stability for an engine and the query executions running on it.
  • Higher shareability means we are more likely to reuse an engine that is already at full speed.

CONNECTION

Figure 1: CONNECTION share level

Each session with the CONNECTION share level has a standalone engine of its own, unreachable by anyone else. Within the session, a user or client can send multiple operation requests, including metadata calls and queries, to the corresponding engine.

Although it is still an interactive mode, this model also lends itself well to batch processing jobs.

When the session is closed, the corresponding engine is shut down at the same time.
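As an illustration, here is a minimal JDBC sketch. The host, port, and user are placeholders, and it assumes the Hive JDBC driver is on the classpath; Kyuubi configurations are appended after ;# in the connection URL.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ConnectionShareLevelExample {
  public static void main(String[] args) throws Exception {
    // Placeholder host/port; Kyuubi configurations follow ";#" in the URL.
    String url = "jdbc:hive2://localhost:10009/default"
        + ";#kyuubi.engine.share.level=CONNECTION";
    try (Connection conn = DriverManager.getConnection(url, "tom", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT 1")) {
      while (rs.next()) {
        System.out.println(rs.getInt(1));
      }
    } // closing this session shuts down its dedicated engine as well
  }
}
```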

USER (Default)

Figure 2: USER share level

All sessions with the USER share level use the same engine if and only if the session user is the same.

Those sessions share the same engine, including objects belonging to the one and only SparkContext instance: Classes/ClassLoaders, SparkConf, the Driver/Executors, the Hive Metastore client, etc. However, each session can still have its own SparkSession instance with separate session state, including temporary views, SQL configurations, UDFs, etc. Setting kyuubi.engine.single.spark.session to true makes the SparkSession instance a singleton shared across sessions.

When a session is closed, the corresponding engine is not shut down. Even after all its sessions are closed, the engine still lives for a time-to-live (TTL) period. This TTL allows new sessions to be established quickly without waiting for an engine to start.
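The per-session state isolation can be seen in a short sketch like the following (same placeholder host, port, and user as above): both connections are served by one engine, but a temporary view created in the first session is invisible to the second unless kyuubi.engine.single.spark.session is enabled.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class UserShareLevelExample {
  public static void main(String[] args) throws Exception {
    // USER is the default share level, so no extra configuration is needed.
    String url = "jdbc:hive2://localhost:10009/default";
    try (Connection c1 = DriverManager.getConnection(url, "tom", "");
         Connection c2 = DriverManager.getConnection(url, "tom", "")) {
      try (Statement s1 = c1.createStatement()) {
        // Lives only in the first session's SparkSession state.
        s1.execute("CREATE TEMPORARY VIEW v AS SELECT 1 AS id");
      }
      try (Statement s2 = c2.createStatement()) {
        // Same engine, different SparkSession: this fails unless
        // kyuubi.engine.single.spark.session=true is set on the engine.
        s2.executeQuery("SELECT * FROM v");
      }
    }
  }
}
```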

GROUP

Figure 3: GROUP share level

An engine is shared by all sessions created by users belonging to the same primary group. The engine is launched with the group name as the effective username, so here the group name acts as a kind of special user that can access a team's compute resources and data. Kyuubi follows the Hadoop GroupsMapping to resolve a user's primary group. If the primary group is not found, it falls back to the USER share level.

The SparkContext, SparkSession, and TTL mechanisms work the same way as in the USER share level.

Tips for authorization in GROUP share level:

Both the session user and the primary group name (used as the sparkUser, i.e. the execute user) are accessible at the engine side. By default, the sparkUser is used to check YARN/HDFS ACLs. If you want fine-grained access control based on the session user, you need to obtain it from SparkContext.getLocalProperty("kyuubi.session.user") and send it to a security service, such as Apache Ranger, as sketched below.
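A minimal engine-side sketch, assuming the code runs on the thread that executes the operation (for example, in a custom authorization extension; the helper class here is hypothetical):

```java
import org.apache.spark.sql.SparkSession;

public class SessionUserProbe {
  // Hypothetical helper for engine-side code: the effective (execute)
  // user is the group name, while the real session user is exposed as a
  // SparkContext local property set by Kyuubi.
  public static String currentSessionUser(SparkSession spark) {
    // Local properties are thread-local, so call this on the thread
    // that executes the operation.
    return spark.sparkContext().getLocalProperty("kyuubi.session.user");
  }
}
```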

SERVER

Figure 4: SERVER share level

This model is essentially a Spark Thrift Server with high availability.

Subdomain

For the USER, GROUP, or SERVER share levels, you can further use kyuubi.engine.share.level.subdomain to isolate engines. That is, you can create multiple engines for a single user, group, or server (cluster). For example, at the USER share level, you can use kyuubi.engine.share.level.subdomain=sd1 and kyuubi.engine.share.level.subdomain=sd2 to create two standalone engines for user Tom.

The kyuubi.engine.share.level.subdomain shall be configured in the JDBC connection URL to tell the Kyuubi server which engine you want to use.
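For example, a sketch of two connection URLs (placeholder host and port) that route user Tom to two standalone engines:

```java
public class SubdomainUrls {
  // Each subdomain maps to its own engine under the base share level.
  static final String SD1_URL =
      "jdbc:hive2://localhost:10009/default;#kyuubi.engine.share.level.subdomain=sd1";
  static final String SD2_URL =
      "jdbc:hive2://localhost:10009/default;#kyuubi.engine.share.level.subdomain=sd2";
}
```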

Hybrid

All supported share levels can be used together in a single Kyuubi server or cluster.

  • kyuubi.engine.share.level (alias: kyuubi.session.engine.share.level)
    • Default: USER
    • Candidates: USER, CONNECTION, GROUP, SERVER
    • Meaning: The base level that determines how an engine is created, cached, and shared with sessions.
    • Usage: It can be set in both the server configuration file and the connection URL; the latter has higher priority.
  • kyuubi.session.engine.idle.timeout
    • Default: PT30M (30 min)
    • Candidates: any valid duration
    • Meaning: Time to live after the engine becomes idle.
    • Usage: It can be set in both the server configuration file and the connection URL; the latter has higher priority.
  • kyuubi.engine.share.level.subdomain (alias: kyuubi.engine.share.level.sub.domain)
    • Default: (none)
    • Candidates: a valid ZooKeeper child node name
    • Meaning: Adds a subdomain under the base level for further engine isolation.
    • Usage: It can be set in both the server configuration file and the connection URL; the latter has higher priority.
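To illustrate the precedence rule, here is a sketch of a connection URL (placeholder host and port) that overrides whatever defaults the server configuration file carries, combining all three configurations above for one session:

```java
public class HybridOverrideExample {
  // The connection URL takes priority over server-side defaults: this
  // session asks for a GROUP-level engine in its own "etl" subdomain
  // that is reclaimed after 10 idle minutes.
  static final String URL = "jdbc:hive2://localhost:10009/default"
      + ";#kyuubi.engine.share.level=GROUP"
      + ";kyuubi.engine.share.level.subdomain=etl"
      + ";kyuubi.session.engine.idle.timeout=PT10M";
}
```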

Conclusion

With this feature, end-users are able to leverage engines in different ways to handle their different workloads, such as large-scale ETL jobs and interactive ad hoc queries.