PyHive is a collection of Python DB-API and SQLAlchemy interfaces for Hive. PyHive can connect with the Kyuubi server serving in thrift protocol as HiveServer2.

Requirements

PyHive works with Python 2.7 / Python 3. Install PyHive via pip for the Hive interface.

  1. pip install 'pyhive[hive]'

Usage

Use the Kyuubi server’s host and thrift protocol port to connect.

For further information about usages and features, e.g. DB-API async fetching, using in SQLAlchemy, please refer to project homepage.

DB-API

  1. from pyhive import hive
  2. cursor = hive.connect(host=kyuubi_host,port=10009).cursor()
  3. cursor.execute('SELECT * FROM my_awesome_data LIMIT 10')
  4. print(cursor.fetchone())
  5. print(cursor.fetchall())

Use PyHive with Pandas

PyHive provides a handy way to establish a SQLAlchemy compatible connection and works with Pandas dataframe for executing SQL and reading data via pandas.read_sql.

  1. from pyhive import hive
  2. import pandas as pd
  3. # open connection
  4. conn = hive.Connection(host=kyuubi_host,port=10009)
  5. # query the table to a new dataframe
  6. dataframe = pd.read_sql("SELECT id, name FROM test.example_table", conn)

Authentication

If password is provided for connection, make sure the auth param set to either CUSTOM or LDAP.

  1. # open connection
  2. conn = hive.Connection(host=kyuubi_host, port=10009,
  3. username='user', password='password', auth='CUSTOM')

Use Kerberos to connect to Kyuubi.

kerberos_service_name must be the name of the service that started the Kyuubi server, usually the prefix of the first slash of kyuubi.kinit.principal.

Note that PyHive does not support passing in principal, it splices in part of principal with kerberos_service_name and kyuubi_host.

  1. # open connection
  2. conn = hive.Connection(host=kyuubi_host, port=10009, auth="KERBEROS", kerberos_service_name="kyuubi")

If you encounter the following errors, you need to install related packages.

  1. thrift.transport.TTransport.TTransportException: Could not start SASL: b'Error in sasl_client_start (-4) SASL(-4): no mechanism available: No worthy mechs found'
  1. yum install -y cyrus-sasl-plain cyrus-sasl-devel cyrus-sasl-gssapi cyrus-sasl-md5

Note that PyHive does not support the connection method based on zookeeper HA, you can connect to zookeeper to get the service address via Kazoo.

Code reference https://stackoverflow.com/a/73326589

  1. from pyhive import hive
  2. import random
  3. from kazoo.client import KazooClient
  4. zk = KazooClient(hosts='kyuubi1.xx.com:2181,kyuubi2.xx.com:2181,kyuubi3.xx.com:2181', read_only=True)
  5. zk.start()
  6. servers = [kyuubi_server.split(';')[0].split('=')[1].split(':')
  7. for kyuubi_server
  8. in zk.get_children(path='kyuubi')]
  9. kyuubi_host, kyuubi_port = random.choice(servers)
  10. zk.stop()
  11. print(kyuubi_host, kyuubi_port)
  12. conn = hive.Connection(host=kyuubi_host, port=kyuubi_port, auth="KERBEROS", kerberos_service_name="kyuubi")