## Introduction
Gobblin is capable of writing data to ORC files by leveraging Hive's SerDe library; the integration is provided natively through the `HiveSerDeWrapper` class.
This document briefly explains how Gobblin integrates with Hive's SerDe library and shows an example of writing ORC files.
## Hive SerDe Integration
Hive's SerDe library defines the interface Hive uses for serialization and deserialization of data. It ships with out-of-the-box SerDes for Avro, ORC, Parquet, CSV, and JSON, and users are also free to define custom SerDes.
Gobblin integrates with the Hive SerDes in a few different places. Here is a list of integration points that are relevant for this document:

- `HiveSerDeWrapper` - a wrapper around Hive's SerDe library that provides some nice utilities and structure that the rest of Gobblin can interface with
- `HiveSerDeConverter` - takes a `Writable` object in a specific format and converts it to the `Writable` of another format (e.g. from `AvroGenericRecordWritable` to `OrcSerdeRow`)
- `HiveWritableHdfsDataWriter` - writes a `Writable` object to a specific file; typically this writes the output of a `HiveSerDeConverter`
## Writing to an ORC File
An end-to-end example of writing to an ORC file is provided in the configuration found here. This `.pull` file is almost identical to the Wikipedia example discussed in the Getting Started Guide; the only difference is that the output is written in ORC instead of Avro. The configuration file mentioned above can be used directly as a template for writing data to ORC files. Below is a detailed explanation of the configuration options that need to be changed, and why they need to be changed.
- `converter.classes` requires two additional converters: `gobblin.converter.avro.AvroRecordToAvroWritableConverter` and `gobblin.converter.serde.HiveSerDeConverter`
    - The output of the first converter (the `WikipediaConverter`) returns Avro `GenericRecord`s
    - These records must be converted to `Writable` objects in order for the Hive SerDe to process them, which is where the `AvroRecordToAvroWritableConverter` comes in
    - The `HiveSerDeConverter` does the actual heavy lifting of converting the Avro records to ORC records
- In order to configure the `HiveSerDeConverter`, the following properties need to be added:
    - `serde.deserializer.type=AVRO` says that the records being fed into the converter are Avro records
        - `avro.schema.literal` or `avro.schema.url` must be set when using this deserializer so that the Hive SerDe knows what Avro schema to use when converting the records
    - `serde.serializer.type=ORC` says that the records returned by the converter should be ORC records
- `writer.builder.class` should be set to `gobblin.writer.HiveWritableHdfsDataWriterBuilder`
    - This writer class will take the output of the `HiveSerDeConverter` and write the actual ORC records to an ORC file
- `writer.output.format` should be set to `ORC`; this ensures the files produced end with the `.orc` file extension
- `fork.record.queue.capacity` should be set to `1`
    - This ensures no caching of records is done before they get passed to the writer; this is necessary because the `OrcSerde` caches the object it uses to serialize records, and it does not allow copying of ORC records
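Putting the options above together, the relevant portion of the `.pull` file might look like the following sketch. The fully qualified `WikipediaConverter` class name and the `avro.schema.literal` value are illustrative placeholders; the real job would use the job's own converter and the actual schema of its records:

```properties
# Converter chain: the job-specific converter first, then the two ORC-related converters
# (gobblin.example.wikipedia.WikipediaConverter is an assumed class name for illustration)
converter.classes=gobblin.example.wikipedia.WikipediaConverter,gobblin.converter.avro.AvroRecordToAvroWritableConverter,gobblin.converter.serde.HiveSerDeConverter

# Tell the HiveSerDeConverter what format comes in (Avro) and what should go out (ORC)
serde.deserializer.type=AVRO
serde.serializer.type=ORC

# The Avro deserializer needs the schema of the incoming records;
# avro.schema.url can be used instead to point at a schema file
# (the schema below is a placeholder)
avro.schema.literal={"type":"record","name":"Example","fields":[{"name":"content","type":"string"}]}

# Writer setup: write the converted Writable records out as .orc files
writer.builder.class=gobblin.writer.HiveWritableHdfsDataWriterBuilder
writer.output.format=ORC

# Required so records are not cached before reaching the writer
fork.record.queue.capacity=1
```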
The example job can be run the same way the regular Wikipedia job is run, except the output will be in the ORC format.
## Data Flow
For the Wikipedia to ORC example, data flows in the following manner:
- Data is extracted from Wikipedia via the `WikipediaExtractor`, which also converts each Wikipedia entry into a `JsonElement`
- The `WikipediaConverter` then converts the Wikipedia JSON entry into an Avro `GenericRecord`
- The `AvroRecordToAvroWritableConverter` converts the Avro `GenericRecord` to an `AvroGenericRecordWritable`
- The `HiveSerDeConverter` converts the `AvroGenericRecordWritable` to an `OrcSerdeRow`
- The `HiveWritableHdfsDataWriter` uses the `OrcOutputFormat` to write the `OrcSerdeRow` to an `OrcFile`
## Extending Gobblin's SerDe Integration
While this tutorial only discusses Avro to ORC conversion, it should be relatively straightforward to use the approach described in this document to convert CSV, JSON, etc. data into ORC.
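As a rough sketch of what such an extension could look like, only the deserializer side of the configuration would change, while the ORC serializer and writer settings stay the same. The deserializer value below is an assumption for illustration: it presumes the type can name a Hive SerDe class (here Hive's HCatalog JSON SerDe) and that this class is on the job's classpath; verify against your Gobblin and Hive versions before relying on it:

```properties
# Sketch only: assumed to resolve to a JSON-capable Hive SerDe on the classpath
serde.deserializer.type=org.apache.hive.hcatalog.data.JsonSerDe

# The ORC side of the conversion stays exactly as in the Avro example
serde.serializer.type=ORC
writer.builder.class=gobblin.writer.HiveWritableHdfsDataWriterBuilder
writer.output.format=ORC
fork.record.queue.capacity=1
```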
