PySpark is the Python interface for Apache Spark. Kyuubi can be used as a JDBC source in PySpark.

Requirements

PySpark works with Python 3.7 and above.

Install PySpark with Spark SQL and optional pandas support on Spark using PyPI as follows:

  pip install pyspark 'pyspark[sql]' 'pyspark[pandas_on_spark]'

For installation using Conda or manually downloading, please refer to PySpark installation.

Preparation

Prepare JDBC driver

Prepare the JDBC driver jar file. The supported Hive-compatible JDBC drivers are listed below:

Driver                     Driver Class Name                        Remarks
Kyuubi Hive Driver (doc)   org.apache.kyuubi.jdbc.KyuubiHiveDriver  Compile the driver from the master branch, as KYUUBI #3484, which the Spark JDBC source requires, is not yet included in any released version.
Hive Driver (doc)          org.apache.hive.jdbc.HiveDriver

Refer to the docs of the chosen driver and prepare the JDBC driver jar file.
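
Once a PySpark session is started with the driver jar included (see the sections below), one way to confirm the jar is actually visible to the JVM is to load the driver class through the Py4J gateway. This is a minimal sketch that relies on the internal _jvm attribute, which is not a public API:

  # A minimal sketch, not a public API: load the driver class via the Py4J
  # gateway; this raises an error if the jar is missing from the classpath.
  spark.sparkContext._jvm.java.lang.Class.forName("org.apache.hive.jdbc.HiveDriver")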

Prepare JDBC Hive Dialect extension

Hive Dialect support is required by Spark to wrap SQL statements correctly before sending them to the JDBC driver. Kyuubi provides a JDBC dialect extension with auto-registered Hive Dialect support for Spark. Follow the instructions in Hive Dialect Support to prepare the plugin jar file kyuubi-extension-spark-jdbc-dialect_-*.jar.

Including the jars of the JDBC driver and Hive Dialect extension

Choose one of the following ways to include the jar files in Spark; a quick sanity check on the resulting session is shown after the list.

  • Put the jar files of the JDBC driver and the Hive Dialect extension into the $SPARK_HOME/jars directory to make them visible to the classpath of PySpark, and add spark.sql.extensions = org.apache.spark.sql.dialect.KyuubiSparkJdbcDialectExtension to $SPARK_HOME/conf/spark-defaults.conf.

  • With Spark’s startup shell, include the JDBC driver when submitting the application with --packages, and the Hive Dialect plugin with --jars:

  $SPARK_HOME/bin/pyspark --py-files PY_FILES \
    --packages org.apache.hive:hive-jdbc:x.y.z \
    --jars /path/kyuubi-extension-spark-jdbc-dialect_-*.jar

  • Set the jars and config with the SparkSession builder:
  from pyspark.sql import SparkSession

  spark = SparkSession.builder \
      .config("spark.jars", "/path/hive-jdbc-x.y.z.jar,/path/kyuubi-extension-spark-jdbc-dialect_-*.jar") \
      .config("spark.sql.extensions", "org.apache.spark.sql.dialect.KyuubiSparkJdbcDialectExtension") \
      .getOrCreate()
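
Whichever option you choose, reading the settings back from the running session is a quick sanity check. A minimal sketch, assuming the session object spark from above:

  # Confirm the running session picked up the dialect extension and extra jars;
  # the second argument is a fallback default when the key is unset.
  print(spark.conf.get("spark.sql.extensions", "not set"))
  print(spark.conf.get("spark.jars", "not set"))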

Usage

For further information about PySpark JDBC usage and options, please refer to Spark’s JDBC To Other Databases.

Using as JDBC Datasource programmatically

  # Loading data from Kyuubi via HiveDriver as JDBC datasource
  jdbcDF = spark.read \
      .format("jdbc") \
      .options(driver="org.apache.hive.jdbc.HiveDriver",
               url="jdbc:hive2://kyuubi_server_ip:port",
               user="user",
               password="password",
               query="select * from testdb.src_table") \
      .load()
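
Spark's standard JDBC options apply here as well. As a sketch (not Kyuubi-specific), the read can be parallelized with the partitioning options, assuming the source table has a numeric column id whose values roughly span the given bounds; note that partitionColumn requires dbtable instead of query:

  # A minimal sketch, assuming testdb.src_table has a numeric column `id`:
  # Spark issues numPartitions parallel queries, splitting the `id` range
  # between lowerBound and upperBound across partitions.
  jdbcDF = spark.read \
      .format("jdbc") \
      .options(driver="org.apache.hive.jdbc.HiveDriver",
               url="jdbc:hive2://kyuubi_server_ip:port",
               user="user",
               password="password",
               dbtable="testdb.src_table",
               partitionColumn="id",
               lowerBound="0",
               upperBound="1000",
               numPartitions="4") \
      .load()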

Using as JDBC Datasource table with SQL

Since Spark 3.2.0, CREATE DATASOURCE TABLE is supported, so a JDBC source table can be created with SQL.

  # create JDBC Datasource table with DDL
  spark.sql("""CREATE TABLE kyuubi_table USING JDBC
  OPTIONS (
      driver='org.apache.hive.jdbc.HiveDriver',
      url='jdbc:hive2://kyuubi_server_ip:port',
      user='user',
      password='password',
      dbtable='testdb.some_table'
  )""")

  # read data to dataframe
  jdbcDF = spark.sql("SELECT * FROM kyuubi_table")

  # write data from an existing dataframe `df` in overwrite mode;
  # DataFrameWriterV2.overwrite takes a condition column, lit(True) replaces all rows
  from pyspark.sql.functions import lit
  df.writeTo("kyuubi_table").overwrite(lit(True))

  # write data from query
  spark.sql("INSERT INTO kyuubi_table SELECT * FROM some_table")
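
If a persistent catalog table is not needed, Spark also supports a session-scoped temporary view over a JDBC source. A minimal sketch with the same connection options:

  # A session-scoped temporary view over the same JDBC source;
  # it disappears when the session ends.
  spark.sql("""CREATE TEMPORARY VIEW kyuubi_view
  USING JDBC
  OPTIONS (
      driver='org.apache.hive.jdbc.HiveDriver',
      url='jdbc:hive2://kyuubi_server_ip:port',
      user='user',
      password='password',
      dbtable='testdb.some_table'
  )""")
  spark.sql("SELECT * FROM kyuubi_view").show()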

Use PySpark with Pandas

Since PySpark 3.2.0, PySpark supports the pandas API on Spark, which allows you to scale your pandas workload out.

A pandas-on-Spark DataFrame and a Spark DataFrame are virtually interchangeable. For more instructions, see From/to pandas and PySpark DataFrames.

  import pyspark.pandas as ps

  psdf = ps.range(10)
  sdf = psdf.to_spark().filter("id > 5")
  sdf.show()
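
To tie this back to the JDBC examples above, here is a minimal sketch of moving between the DataFrame flavors, assuming jdbcDF is the DataFrame read from Kyuubi earlier:

  import pyspark.pandas as ps

  # Spark DataFrame -> pandas-on-Spark DataFrame (stays distributed)
  psdf = ps.DataFrame(jdbcDF)
  # pandas-on-Spark -> local pandas DataFrame (collects all rows to the driver)
  pdf = psdf.to_pandas()
  # local pandas -> back to a distributed Spark DataFrame
  sdf = ps.from_pandas(pdf).to_spark()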