PyHive is a collection of Python DB-API and SQLAlchemy interfaces for Hive. PyHive can connect with the Kyuubi server serving in thrift protocol as HiveServer2.
Requirements
PyHive works with Python 2.7 / Python 3. Install PyHive via pip for the Hive interface.
pip install 'pyhive[hive]'
Usage
Use the Kyuubi server’s host and thrift protocol port to connect.
For further information about usages and features, e.g. DB-API async fetching, using in SQLAlchemy, please refer to project homepage.
DB-API
from pyhive import hive
cursor = hive.connect(host=kyuubi_host,port=10009).cursor()
cursor.execute('SELECT * FROM my_awesome_data LIMIT 10')
print(cursor.fetchone())
print(cursor.fetchall())
Use PyHive with Pandas
PyHive provides a handy way to establish a SQLAlchemy compatible connection and works with Pandas dataframe for executing SQL and reading data via pandas.read_sql.
from pyhive import hive
import pandas as pd
# open connection
conn = hive.Connection(host=kyuubi_host,port=10009)
# query the table to a new dataframe
dataframe = pd.read_sql("SELECT id, name FROM test.example_table", conn)
Authentication
If password is provided for connection, make sure the auth param set to either CUSTOM or LDAP.
# open connection
conn = hive.Connection(host=kyuubi_host, port=10009,
username='user', password='password', auth='CUSTOM')
Use Kerberos to connect to Kyuubi.
kerberos_service_name must be the name of the service that started the Kyuubi server, usually the prefix of the first slash of kyuubi.kinit.principal.
Note that PyHive does not support passing in principal, it splices in part of principal with kerberos_service_name and kyuubi_host.
# open connection
conn = hive.Connection(host=kyuubi_host, port=10009, auth="KERBEROS", kerberos_service_name="kyuubi")
If you encounter the following errors, you need to install related packages.
thrift.transport.TTransport.TTransportException: Could not start SASL: b'Error in sasl_client_start (-4) SASL(-4): no mechanism available: No worthy mechs found'
yum install -y cyrus-sasl-plain cyrus-sasl-devel cyrus-sasl-gssapi cyrus-sasl-md5
Note that PyHive does not support the connection method based on zookeeper HA, you can connect to zookeeper to get the service address via Kazoo.
Code reference https://stackoverflow.com/a/73326589
from pyhive import hive
import random
from kazoo.client import KazooClient
zk = KazooClient(hosts='kyuubi1.xx.com:2181,kyuubi2.xx.com:2181,kyuubi3.xx.com:2181', read_only=True)
zk.start()
servers = [kyuubi_server.split(';')[0].split('=')[1].split(':')
for kyuubi_server
in zk.get_children(path='kyuubi')]
kyuubi_host, kyuubi_port = random.choice(servers)
zk.stop()
print(kyuubi_host, kyuubi_port)
conn = hive.Connection(host=kyuubi_host, port=kyuubi_port, auth="KERBEROS", kerberos_service_name="kyuubi")
 我的书签
 我的书签
                                 添加书签
 添加书签 移除书签
 移除书签