PyHive is a collection of Python DB-API and SQLAlchemy interfaces for Hive. PyHive can connect with the Kyuubi server serving in thrift protocol as HiveServer2.
Requirements
PyHive works with Python 2.7 / Python 3. Install PyHive via pip for the Hive interface.
pip install 'pyhive[hive]'
Usage
Use the Kyuubi server’s host and thrift protocol port to connect.
For further information about usages and features, e.g. DB-API async fetching, using in SQLAlchemy, please refer to project homepage.
DB-API
from pyhive import hive
cursor = hive.connect(host=kyuubi_host,port=10009).cursor()
cursor.execute('SELECT * FROM my_awesome_data LIMIT 10')
print(cursor.fetchone())
print(cursor.fetchall())
Use PyHive with Pandas
PyHive provides a handy way to establish a SQLAlchemy compatible connection and works with Pandas dataframe for executing SQL and reading data via pandas.read_sql
.
from pyhive import hive
import pandas as pd
# open connection
conn = hive.Connection(host=kyuubi_host,port=10009)
# query the table to a new dataframe
dataframe = pd.read_sql("SELECT id, name FROM test.example_table", conn)
Authentication
If password is provided for connection, make sure the auth
param set to either CUSTOM
or LDAP
.
# open connection
conn = hive.Connection(host=kyuubi_host, port=10009,
username='user', password='password', auth='CUSTOM')
Use Kerberos to connect to Kyuubi.
kerberos_service_name
must be the name of the service that started the Kyuubi server, usually the prefix of the first slash of kyuubi.kinit.principal
.
Note that PyHive does not support passing in principal
, it splices in part of principal
with kerberos_service_name
and kyuubi_host
.
# open connection
conn = hive.Connection(host=kyuubi_host, port=10009, auth="KERBEROS", kerberos_service_name="kyuubi")
If you encounter the following errors, you need to install related packages.
thrift.transport.TTransport.TTransportException: Could not start SASL: b'Error in sasl_client_start (-4) SASL(-4): no mechanism available: No worthy mechs found'
yum install -y cyrus-sasl-plain cyrus-sasl-devel cyrus-sasl-gssapi cyrus-sasl-md5
Note that PyHive does not support the connection method based on zookeeper HA, you can connect to zookeeper to get the service address via Kazoo.
Code reference https://stackoverflow.com/a/73326589
from pyhive import hive
import random
from kazoo.client import KazooClient
zk = KazooClient(hosts='kyuubi1.xx.com:2181,kyuubi2.xx.com:2181,kyuubi3.xx.com:2181', read_only=True)
zk.start()
servers = [kyuubi_server.split(';')[0].split('=')[1].split(':')
for kyuubi_server
in zk.get_children(path='kyuubi')]
kyuubi_host, kyuubi_port = random.choice(servers)
zk.stop()
print(kyuubi_host, kyuubi_port)
conn = hive.Connection(host=kyuubi_host, port=kyuubi_port, auth="KERBEROS", kerberos_service_name="kyuubi")