Introduction
Gobblin can write data to ORC files by leveraging Hive’s SerDe library, with which it integrates natively through the HiveSerDeWrapper class.
This document will briefly explain how Gobblin integrates with Hive’s SerDe library, and show an example of writing ORC files.
Hive SerDe Integration
Hive’s SerDe library defines the interface Hive uses for serialization and deserialization of data. The library ships with out-of-the-box SerDes for Avro, ORC, Parquet, CSV, and JSON, and users are free to define custom SerDes.
Gobblin integrates with Hive’s SerDes in a few different places. The integration points relevant to this document are:
- HiveSerDeWrapper: a wrapper around Hive’s SerDe library that provides utilities and structure that the rest of Gobblin can interact with
- HiveSerDeConverter: takes a Writable object in a specific format and converts it to the Writable of another format (e.g. from AvroGenericRecordWritable to OrcSerdeRow)
- HiveWritableHdfsDataWriter: writes a Writable object to a specific file; typically this writes the output of a HiveSerDeConverter
Writing to an ORC File
An end-to-end example of writing to an ORC file is provided in the configuration found here. This .pull file is almost identical to the Wikipedia example discussed in the Getting Started Guide; the only difference is that the output is written in ORC instead of Avro. The configuration file can be used directly as a template for writing data to ORC files. The list below explains which configuration options need to be changed, and why.
- converter.classes requires two additional converters: gobblin.converter.avro.AvroRecordToAvroWritableConverter and gobblin.converter.serde.HiveSerDeConverter
  - The first converter (the WikipediaConverter) outputs Avro GenericRecords
  - These records must be converted to Writable objects in order for the Hive SerDe to process them, which is where the AvroRecordToAvroWritableConverter comes in
  - The HiveSerDeConverter does the actual heavy lifting of converting the Avro records to ORC records
- In order to configure the HiveSerDeConverter, the following properties need to be added:
  - serde.deserializer.type=AVRO says that the records being fed into the converter are Avro records
    - avro.schema.literal or avro.schema.url must be set when using this deserializer so that the Hive SerDe knows what Avro schema to use when converting the record
  - serde.serializer.type=ORC says that the records returned by the converter should be ORC records
- writer.builder.class should be set to gobblin.writer.HiveWritableHdfsDataWriterBuilder
  - This writer class will take the output of the HiveSerDeConverter and write the actual ORC records to an ORC file
- writer.output.format should be set to ORC; this ensures the files produced end with the .orc file extension
- fork.record.queue.capacity should be set to 1
  - This ensures no caching of records is done before they get passed to the writer; this is necessary because the OrcSerde caches the object it uses to serialize records, and it does not allow copying of ORC records
The example job can be run the same way the regular Wikipedia job is run, except the output will be in the ORC format.
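As a rough sketch, the property changes described in this section might be collected in a .pull file as shown below. The fully qualified name of the WikipediaConverter (gobblin.example.wikipedia.WikipediaConverter) and the schema path are assumptions made for illustration and are not taken from this document; the remaining keys and values are the ones listed above.

```properties
# Converter chain: Wikipedia JSON -> Avro GenericRecord -> Avro Writable -> ORC row.
# The package name of WikipediaConverter is an assumption.
converter.classes=gobblin.example.wikipedia.WikipediaConverter,gobblin.converter.avro.AvroRecordToAvroWritableConverter,gobblin.converter.serde.HiveSerDeConverter

# Records entering the HiveSerDeConverter are Avro; records leaving it are ORC.
serde.deserializer.type=AVRO
serde.serializer.type=ORC

# The AVRO deserializer needs avro.schema.literal or avro.schema.url.
# The path below is a hypothetical placeholder.
avro.schema.url=/path/to/wikipedia.avsc

# Writer that consumes the Writable output of the HiveSerDeConverter.
writer.builder.class=gobblin.writer.HiveWritableHdfsDataWriterBuilder
writer.output.format=ORC

# Avoid queueing records between converter and writer (see the OrcSerde note above).
fork.record.queue.capacity=1
```

Setting avro.schema.literal to the schema's JSON text, instead of pointing avro.schema.url at a file, is the alternative named in the list above.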
Data Flow
For the Wikipedia to ORC example, data flows in the following manner:
- It is extracted from Wikipedia via the WikipediaExtractor, which also converts each Wikipedia entry into a JsonElement
- The WikipediaConverter then converts the Wikipedia JSON entry into an Avro GenericRecord
- The AvroRecordToAvroWritableConverter converts the Avro GenericRecord to an AvroGenericRecordWritable
- The HiveSerDeConverter converts the AvroGenericRecordWritable to an OrcSerdeRow
- The HiveWritableHdfsDataWriter uses the OrcOutputFormat to write the OrcSerdeRow to an OrcFile
Extending Gobblin’s SerDe Integration
While this tutorial only discusses Avro to ORC conversion, it should be relatively straightforward to use the same approach to convert CSV, JSON, and other data formats into ORC.