Description
An extension to FsDataWriter that writes in Parquet format in the form of either Avro, Protobuf or ParquetGroup. This implementation allows users to specify the CodecFactory to use through the configuration property writer.codec.type. By default, the snappy codec is used. See Developer Notes to make sure you are using the right Gobblin jar.
Usage
writer.builder.class=org.apache.gobblin.writer.ParquetDataWriterBuilderwriter.destination.type=HDFSwriter.output.format=PARQUET
Example Pipeline Configuration
example-parquet.pullcontains an example of generating test data and writing to Parquet files.
Configuration
| Key | Description | Default Value | Required |
|---|---|---|---|
| writer.parquet.page.size | The page size threshold. | 1048576 | No |
| writer.parquet.dictionary.page.size | The block size threshold for the dictionary pages. | 134217728 | No |
| writer.parquet.dictionary | To turn dictionary encoding on. Parquet has a dictionary encoding for data with a small number of unique values ( < 10^5 ) that aids in significant compression and boosts processing speed. | true | No |
| writer.parquet.validate | To turn on validation using the schema. This validation is done by ParquetWriter not by Gobblin. |
false | No |
| writer.parquet.version | Version of parquet writer to use. Available versions are v1 and v2. | v1 | No |
| writer.parquet.format | In-memory format of the record being written to Parquet. Options are AVRO, PROTOBUF and GROUP |
GROUP | No |
Developer Notes
Gobblin provides integration with two different versions of Parquet through its modules. Use the appropriate jar based on the Parquet library you use in your code.
| Jar | Dependency | Gobblin Release |
|---|---|---|
gobblin-parquet |
com.twitter:parquet-hadoop-bundle |
>= 0.12.0 |
gobblin-parquet-apache |
org.apache.parquet:parquet-hadoop |
>= 0.15.0 |
If you want to look at the code, check out:
| Module | File |
|---|---|
| gobblin-parquet | ParquetHdfsDataWriter |
| gobblin-parquet | ParquetDataWriterBuilder |
| gobblin-parquet-apache | ParquetHdfsDataWriter |
| gobblin-parquet-apache | ParquetDataWriterBuilder |
