Integration with Hive Metastore

In this section, you will learn how to configure Kyuubi to interact with Hive Metastore.

  • A common Hive metastore server can be configured on the Kyuubi server side
  • Individual Hive metastore servers can be configured by end users

Requirements

The goal is to let each Spark application use a copy of the Hive configuration to start its own Hive metastore client, which then talks to the Hive metastore server.

Default Behavior

By default, Kyuubi launches each Spark SQL engine against a dummy embedded metastore backed by Apache Derby. Every application gets its own metastore, and its metadata can only be seen by one user at a time, for example:

    bin/beeline -u 'jdbc:hive2://localhost:10009/' -n kentyao
    Connecting to jdbc:hive2://localhost:10009/
    Connected to: Spark SQL (version 1.0.0-SNAPSHOT)
    Driver: Hive JDBC (version 2.3.7)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    Beeline version 2.3.7 by Apache Hive
    0: jdbc:hive2://localhost:10009/> show databases;
    2020-11-16 23:50:50.388 INFO operation.ExecuteStatement:
    Spark application name: kyuubi_kentyao_spark_2020-11-16T15:50:08.968Z
    application ID: local-1605541809797
    application web UI: http://192.168.1.14:60165
    master: local[*]
    deploy mode: client
    version: 3.0.1
    Start time: 2020-11-16T15:50:09.123Z
    User: kentyao
    2020-11-16 23:50:50.404 INFO metastore.HiveMetaStore: 2: get_databases: *
    2020-11-16 23:50:50.404 INFO HiveMetaStore.audit: ugi=kentyao ip=unknown-ip-addr cmd=get_databases: *
    2020-11-16 23:50:50.423 INFO operation.ExecuteStatement: Processing kentyao's query[8453e657-c1c4-4391-8406-ab4747a66c45]: RUNNING_STATE -> FINISHED_STATE, statement: show databases, time taken: 0.035 seconds
    +------------+
    | namespace  |
    +------------+
    | default    |
    +------------+
    1 row selected (0.122 seconds)
    0: jdbc:hive2://localhost:10009/> show tables;
    2020-11-16 23:50:52.957 INFO operation.ExecuteStatement:
    Spark application name: kyuubi_kentyao_spark_2020-11-16T15:50:08.968Z
    application ID: local-1605541809797
    application web UI: http://192.168.1.14:60165
    master: local[*]
    deploy mode: client
    version: 3.0.1
    Start time: 2020-11-16T15:50:09.123Z
    User: kentyao
    2020-11-16 23:50:52.968 INFO metastore.HiveMetaStore: 2: get_database: default
    2020-11-16 23:50:52.968 INFO HiveMetaStore.audit: ugi=kentyao ip=unknown-ip-addr cmd=get_database: default
    2020-11-16 23:50:52.970 INFO metastore.HiveMetaStore: 2: get_database: default
    2020-11-16 23:50:52.970 INFO HiveMetaStore.audit: ugi=kentyao ip=unknown-ip-addr cmd=get_database: default
    2020-11-16 23:50:52.972 INFO metastore.HiveMetaStore: 2: get_tables: db=default pat=*
    2020-11-16 23:50:52.972 INFO HiveMetaStore.audit: ugi=kentyao ip=unknown-ip-addr cmd=get_tables: db=default pat=*
    2020-11-16 23:50:52.986 INFO operation.ExecuteStatement: Processing kentyao's query[ff902582-ba29-433b-b70a-c25ead1353a8]: RUNNING_STATE -> FINISHED_STATE, statement: show tables, time taken: 0.03 seconds
    +-----------+------------+--------------+
    | database  | tableName  | isTemporary  |
    +-----------+------------+--------------+
    +-----------+------------+--------------+
    No rows selected (0.04 seconds)

Use this mode for experimental purposes only.

In a real production environment, we always have a shared standalone metadata store that manages the metadata of persistent relational entities, e.g. databases, tables, columns, and partitions, for fast access. Usually, Hive metastore is the de facto choice.

The following are the basic settings a Hive metastore client needs to communicate with a remote Hive metastore server.

Whether to use remote metastore database mode or remote metastore server mode depends on the server-side configuration.

Remote Metastore Database

| Name | Value | Meaning |
|---|---|---|
| javax.jdo.option.ConnectionURL | jdbc:mysql://<hostname>/<databaseName>?createDatabaseIfNotExist=true | metadata is stored in a MySQL server |
| javax.jdo.option.ConnectionDriverName | com.mysql.jdbc.Driver | MySQL JDBC driver class |
| javax.jdo.option.ConnectionUserName | <username> | user name for connecting to MySQL server |
| javax.jdo.option.ConnectionPassword | <password> | password for connecting to MySQL server |

Remote Metastore Server

| Name | Value | Meaning |
|---|---|---|
| hive.metastore.uris | thrift://<host>:<port>,thrift://<host1>:<port1> | host and port for the Thrift metastore server |
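To make the shape of the `hive.metastore.uris` value concrete, here is a minimal Python sketch that splits a comma-separated value into host and port pairs. The helper name `parse_metastore_uris` and the example hosts are hypothetical; the sketch only illustrates the format of the value and is not how the Hive client parses it internally.

```python
from urllib.parse import urlparse


def parse_metastore_uris(value):
    """Split a comma-separated hive.metastore.uris value into (host, port) pairs."""
    endpoints = []
    for uri in value.split(","):
        parsed = urlparse(uri.strip())
        # Each entry is expected to look like thrift://<host>:<port>.
        if parsed.scheme != "thrift" or parsed.hostname is None or parsed.port is None:
            raise ValueError(f"not a thrift://<host>:<port> URI: {uri!r}")
        endpoints.append((parsed.hostname, parsed.port))
    return endpoints


# Hypothetical hosts, for illustration only.
print(parse_metastore_uris("thrift://ms1.example.com:9083,thrift://ms2.example.com:9083"))
```

A check like this can catch a malformed entry before it is handed to an engine.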

Activate Configurations

Via kyuubi-defaults.conf

In $KYUUBI_HOME/conf/kyuubi-defaults.conf, you can set Hive primitive configurations, e.g. hive.metastore.uris, as well as their Spark derivatives, which are prefixed with spark.hive. or spark.hadoop., e.g. spark.hive.metastore.uris or spark.hadoop.hive.metastore.uris. All of them will be loaded as Hive primitives by the Hive client inside the Spark application.

Kyuubi takes these configurations as system-wide defaults for all applications it launches.
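The relationship between a Spark derivative and its Hive primitive can be sketched as a simple prefix rewrite. The function name `normalize_key` is hypothetical, and the code only mirrors the naming rule described above; it is not taken from Kyuubi or Spark.

```python
def normalize_key(key):
    """Map a Spark-prefixed key to the Hive primitive it stands for."""
    if key.startswith("spark.hadoop."):
        # spark.hadoop.hive.metastore.uris -> hive.metastore.uris
        return key[len("spark.hadoop."):]
    if key.startswith("spark.hive."):
        # spark.hive.metastore.uris -> hive.metastore.uris
        return key[len("spark."):]
    # Anything else, including a plain Hive primitive, is left unchanged.
    return key


for key in ("hive.metastore.uris",
            "spark.hive.metastore.uris",
            "spark.hadoop.hive.metastore.uris"):
    print(key, "->", normalize_key(key))
```

All three spellings in the loop resolve to the same primitive, so any one of them can be used in kyuubi-defaults.conf.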

Via hive-site.xml

Place your copy of hive-site.xml into $SPARK_HOME/conf. Every Spark application will then automatically load this config file from its classpath.

Configurations set this way have lower priority than those in $KYUUBI_HOME/conf/kyuubi-defaults.conf.
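If you do not yet have a hive-site.xml, a minimal one can be generated from a handful of properties. The sketch below uses Python's standard library to render the usual `<configuration>`/`<property>` layout. The helper name `render_hive_site` and the host and port are placeholders, and writing the result into $SPARK_HOME/conf is left to you.

```python
import xml.etree.ElementTree as ET


def render_hive_site(properties):
    """Render a dict of Hive properties as hive-site.xml text."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


# Placeholder address; replace with your metastore server's host and port.
print(render_hive_site({"hive.metastore.uris": "thrift://localhost:9083"}))
```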

Via JDBC Connection URL

We can pass Hive primitives or Spark derivatives directly in the JDBC connection URL, e.g.

    jdbc:hive2://localhost:10009/;#hive.metastore.uris=thrift://localhost:9083

This will override the defaults in $SPARK_HOME/conf/hive-site.xml and $KYUUBI_HOME/conf/kyuubi-defaults.conf for each user account.
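The override order described above can be pictured as a layered merge in which later sources win. This is only a mental model written in Python, assuming all keys have already been reduced to the same Hive primitive name; the function name `resolve` and the host names are hypothetical.

```python
def resolve(hive_site, kyuubi_defaults, jdbc_url_conf):
    """Merge the three sources; each later source overrides the earlier ones."""
    merged = {}
    for layer in (hive_site, kyuubi_defaults, jdbc_url_conf):
        merged.update(layer)
    return merged


# Hypothetical hosts, for illustration only.
effective = resolve(
    hive_site={"hive.metastore.uris": "thrift://site-host:9083"},
    kyuubi_defaults={"hive.metastore.uris": "thrift://default-host:9083"},
    jdbc_url_conf={"hive.metastore.uris": "thrift://user-host:9083"},
)
print(effective["hive.metastore.uris"])
```

In this example the value passed in the JDBC URL is the one that ends up in `effective`.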

With this feature, end users can connect to different Hive metastore server instances. Similarly, this works for other services such as HDFS and YARN.
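As a small sketch of how such URLs could be assembled per user, the helper below appends configuration pairs after `#`, following the URL form shown earlier in this section. The name `build_jdbc_url` and the metastore hosts are hypothetical, multiple pairs are joined with `;` as in the Hive JDBC URL convention, and values are not escaped.

```python
def build_jdbc_url(host, port, confs):
    """Build a jdbc:hive2 URL that carries configuration pairs after '#'."""
    pairs = ";".join(f"{key}={value}" for key, value in confs.items())
    return f"jdbc:hive2://{host}:{port}/;#{pairs}"


# Two users pointing at two different (hypothetical) metastore servers.
url_a = build_jdbc_url("localhost", 10009,
                       {"hive.metastore.uris": "thrift://metastore-a:9083"})
url_b = build_jdbc_url("localhost", 10009,
                       {"hive.metastore.uris": "thrift://metastore-b:9083"})
print(url_a)
print(url_b)
```

Each URL can then be passed to beeline with `-u`, quoted so the shell does not interpret `;` or `#`.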

Limitation: As most Hive configurations are final and unmodifiable in Spark at runtime, this only takes effect when a Spark application is instantiated and is ignored when an existing application is reused. Keep this in mind.
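The limitation can be illustrated with a toy model. It assumes, for simplicity, one engine per user, and it is not Kyuubi code; the class name `ToyEngineCache` and the hosts are hypothetical. The configuration of the first connection is fixed into the engine, and a later connection that reuses the engine has its configuration ignored.

```python
class ToyEngineCache:
    """Toy model: one engine per user, configured only when first created."""

    def __init__(self):
        self._engines = {}

    def connect(self, user, conf):
        if user not in self._engines:
            # First connection: the engine starts with this configuration.
            self._engines[user] = dict(conf)
        # Later connections reuse the engine; their conf is ignored.
        return self._engines[user]


cache = ToyEngineCache()
first = cache.connect("kentyao", {"hive.metastore.uris": "thrift://metastore-a:9083"})
second = cache.connect("kentyao", {"hive.metastore.uris": "thrift://metastore-b:9083"})
print(second["hive.metastore.uris"])
```

In the model, the second connection still sees the metastore address chosen by the first one.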

!!!THIS WORKS ONLY ONCE!!!

Via SET syntax

Most Hive configurations are final and unmodifiable in Spark at runtime, so changing them with SET syntax does not take effect. Keep this in mind.

!!!THIS WON’T WORK!!!

Version Compatibility

If backward compatibility is guaranteed by Hive versioning, we can always use a lower-version Hive metastore client to communicate with a higher-version Hive metastore server.

For example, Spark 3.0 was released with a built-in Hive client (2.3.7), so, ideally, the server version should be >= 2.3.x.
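The rule of thumb above can be written as a small check on the major and minor version numbers. The function names are hypothetical, and the check only encodes the ">= 2.3.x" guideline; real compatibility depends on the Thrift API exposed by the server.

```python
def major_minor(version):
    """Return (major, minor) from a version string such as '2.3.7'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def server_is_recent_enough(client_version, server_version):
    """True if the server's major.minor is at least the client's."""
    return major_minor(server_version) >= major_minor(client_version)


# Built-in client of Spark 3.0 (2.3.7) against a few server versions.
for server in ("1.2.1", "2.3.0", "3.1.2"):
    print(server, server_is_recent_enough("2.3.7", server))
```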

If you have a legacy Hive metastore server that cannot be easily upgraded, you may run into an error like the following with the default settings:

    Caused by: org.apache.thrift.TApplicationException: Invalid method name: 'get_table_req'
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table_req(ThriftHiveMetastore.java:1567)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table_req(ThriftHiveMetastore.java:1554)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1350)
        at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.getTable(SessionHiveMetaStoreClient.java:127)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173)
        at com.sun.proxy.$Proxy37.getTable(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2336)
        at com.sun.proxy.$Proxy37.getTable(Unknown Source)
        at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1274)
        ... 93 more

To prevent this problem, we can follow the approach described in Spark's "Interacting with Different Versions of Hive Metastore" documentation.
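That Spark feature is driven by the `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars` settings. As a hedged sketch, the Python snippet below renders the two settings as whitespace-separated property lines of the kind used in a Spark properties file. The helper name `render_properties` is hypothetical, and the values `1.2.1` and `maven` are example values only; check the Spark documentation for the versions and jar sources your Spark release supports.

```python
def render_properties(conf):
    """Render a dict as 'key value' lines, one property per line."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


# Example values only; pick the version that matches your metastore server.
legacy_metastore_conf = {
    "spark.sql.hive.metastore.version": "1.2.1",
    "spark.sql.hive.metastore.jars": "maven",
}
print(render_properties(legacy_metastore_conf))
```

Because these are Spark settings that must be in place when the application starts, they are subject to the same limitation as the other metastore-related configurations.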

Further Readings