开源项目推荐
Gobblin Commands & Execution Modes
The Gobblin distribution comes with a script ./bin/gobblin
for all commands and services.
Here is the usage:
Usage:
gobblin.sh cli <cli-command> <params>
gobblin.sh service <execution-mode> <start|stop|status>
Use "gobblin <cli|service> --help" for more information. (Gobblin Version: 0.15.0)
For Gobblin CLI commands, run following:
Usage:
gobblin.sh cli <cli-command> <params>
options:
cli-commands:
passwordManager Encrypt or decrypt strings for the password manager.
decrypt Decryption utilities
run Run a Gobblin application.
config Query the config library
jobs Command line job info and operations
stateMigration Command line tools for migrating state store
job-state-to-json To convert Job state to JSON
cleaner Data retention utility
keystore Examine JCE Keystore files
watermarks Inspect streaming watermarks
job-store-schema-manager Database job history store schema manager
--conf-dir <gobblin-conf-dir-path> Gobblin config path. default is '$GOBBLIN_HOME/conf/<exe-mode-name>'.
--log4j-conf <path-of-log4j-file> default is '<gobblin-conf-dir-path>/<execution-mode>/log4j.properties'.
--jvmopts <jvm or gc options> String containing JVM flags to include, in addition to "-Xmx1g -Xms512m".
--jars <csv list of extra jars> Column-separated list of extra jars to put on the CLASSPATH.
--enable-gc-logs enables gc logs & dumps.
--show-classpath prints gobblin runtime classpath.
--help Display this help.
--verbose Display full command used to start the process.
Gobblin Version: 0.15.0
Argument details:
--conf-dir
: specifies the path to directory containing gobblin system configuration files, likeapplication.conf
orreference.conf
,log4j.properties
andquartz.properties
.--log4j-conf
: specify the path of log4j config file to override the one in config directory (default is<conf>/<gobblin-mode>/log4j.properties
. Gobblin uses SLF4J and the slf4j-log4j12 binding for logging.--jvmopts
: to specify any JVM parameters, default is-Xmx1g -Xms512m
.--enable-gc-logs
: adds GC options to JVM parameters:-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$GOBBLIN_LOGS/ -Xloggc:$GOBBLIN_LOGS/gobblin-$GOBBLIN_MODE-gc.log
--show-classpath
: It prints the full value of the classpath that gobblin uses.- all other arguments are self-explanatory.
Gobblin Commands
Gobblin provides following CLI commands:
Available commands:
job-state-to-json To convert Job state to JSON
jobs Command line job info and operations
passwordManager Encrypt or decrypt strings for the password manager.
run Run a Gobblin application.
decrypt Decryption utilities
job-store-schema-manager Database job history store schema manager
stateMigration Command line tools for migrating state store
keystore Examine JCE Keystore files
config Query the config library
watermarks Inspect streaming watermarks
cleaner Data retention utility
Details on how to use run
command:
Gobblin ingestion applications can be accessed through the following command:
gobblin cli run [listQuickApps] [<quick-app>] -jobName <jobName> [OPTIONS]
For usage run ./bin/gobblin cli run
.
gobblin cli run
uses Embedded Gobblin and subclasses to run Gobblin ingestion jobs, giving CLI access to most functionality that could be achieved using EmbeddedGobblin
. For example, the following command will run a Hello World job (it will print “Hello World 1 !” somewhere in the logs).
gobblin cli run -jobName helloWorld -setTemplate resource:///templates/hello-world.template
Obviously, it is daunting to have to know the path to templates and exactly which configurations to set. The alternative is to use a quick app. Running:
gobblin cli run listQuickApps
will provide with a list of available quick apps. To run a quick app:
gobblin cli run <quick-app-name>
Quick apps may require additional arguments. For the usage of a particular app, run bin/gobblin cli run <quick-app-name> -h
.
The Distcp Quick App
For example, consider the quick app distcp:
$ gobblin cli run distcp -h
usage: gobblin cli run distcp [OPTIONS] <source> <target>
-delete Delete files in target that don't exist
on source.
-deleteEmptyParentDirectories If deleting files on target, also delete
newly empty parent directories.
-distributeJar <arg>
-h,--help
-l Uses log to print out erros in the base CLI code.
-mrMode
-setConfiguration <arg>
-setJobTimeout <arg>
-setLaunchTimeout <arg>
-setShutdownTimeout <arg>
-simulate
-update Specifies files should be updated if they're different in the source.
-useStateStore <arg>
This provides usage for the app distcp, as well as listing all available options. Distcp could then be run:
gobblin cli run distcp file:///source/path file:///target/path
The OneShot Quick App
The Gobblin cli also ships with a generic job runner, the oneShot quick app. You can use it to run a single job using a standard config file. This is very useful during development, testing and also makes it easy to integrate with schedulers that just need to fire off a command line job. The oneShot app allows you to run a job in standalone mode or in map-reduce mode.
$ gobblin cli run oneShot -baseConf <base-config-file> -appConf <path-to-job-conf-file>
# The Base Config file is an optional parameter and contains defaults for your mode of
# execution (e.g. standalone modes would typically use
# gobblin-dist/conf/standalone/application.conf and
# mapreduce mode would typically use gobblin-dist/conf/mapreduce/application.conf)
#
# The Job Config file is your regular .pull or .conf file and is a required parameter.
# You should use a fully qualified URI to your pull file. Otherwise Gobblin will pick the
# default FS configured in the environment, which may not be what you want.
# e.g file:///gobblin-conf/my-job/wikipedia.pull or hdfs:///gobblin-conf/my-job/kafka-hdfs.pull
The oneShot app comes with certain hardcoded defaults (that it inherits from EmbeddedGobblin here), that you may not be expecting. Make sure you understand what they do and override them in your baseConf or appConf files if needed.
Notable differences at the time of this writing include:
- state.store.enabled = false (set this to true in your appConfig or baseConfig if you want state storage for repeated oneshot runs)
- data.publisher.appendExtractToFinalDir = false (set this to true in your appConfig or baseConfig if you want to see the extract name appended to the job output directory)
The oneShot app allows for specifying the log4j file of your job execution which can be very helpful while debugging pesky failures. You can launch the job in MR-Mode by using the -mrMode switch.
- oneShot execution of standalone with a log4j file.
$ gobblin cli run oneShot -baseConf /app/gobblin-dist/conf/standalone/application.conf -appConf file:///app/kafkaConfDir/kafka-simple-hdfs.pull --log4j-conf /app/gobblin-dist/conf/standalone/log4j.properties
- oneShot execution of map-reduce job with a log4j file
$ gobblin cli run oneShot -mrMode -baseConf /app/gobblin-dist/conf/standalone/application.conf -appConf file:///app/kafkaConfDir/kafka-simple-hdfs.pull --log4j-conf /app/gobblin-dist/conf/standalone/log4j.properties
Developing quick apps for the CLI
It is very easy to convert a subclass of EmbeddedGobblin
into a quick application for Gobblin CLI. All that is needed is to implement a EmbeddedGobblinCliFactory
which knows how instantiate the EmbeddedGobblin
from a CommandLine
object and annotate it with the Alias
annotation. There are two utility classes that make this very easy:
PublicMethodsGobblinCliFactory
: this class will automatically infer CLI options from the public methods of a subclass ofEmbeddedGobblin
. All the developer has to do is implement the methodconstructEmbeddedGobblin(CommandLine)
that calls the appropriate constructor of the desiredEmbeddedGobblin
subclass with parameters extracted from the CLI. Additionally, it is a good idea to overridegetUsageString()
with the appropriate usage string. For an example, seegobblin.runtime.embedded.EmbeddedGobblinDistcp.CliFactory
.ConstructorAndPublicMethodsGobblinCliFactory
: this class does everythingPublicMethodsGobblinCliFactory
does, but it additionally automatically infers how to construct theEmbeddedGobblin
object from a constructor annotated withEmbeddedGobblinCliSupport
. For an example, seegobblin.runtime.embedded.EmbeddedGobblin.CliFactory
.
Implementing new Gobblin commands
To implement a new Gobblin command to list and execute using ./bin/gobblin
, implement the class gobblin.runtime.cli.CliApplication
, and annotate it with the Alias
annotation. The Gobblin CLI will automatically find the command, and users can invoke it by the Alias value.
Gobblin Service Execution Modes ( as Daemon )
For more info on Gobblin service execution modes, run bin/gobblin service --help
:
Usage:
gobblin.sh service <execution-mode> <start|stop|status>
Argument Options:
<execution-mode> standalone, cluster-master, cluster-worker, aws,
yarn, mapreduce, service-manager.
--conf-dir <gobblin-conf-dir-path> Gobblin config path. default is '$GOBBLIN_HOME/conf/<exe-mode-name>'.
--log4j-conf <path-of-log4j-file> default is '<gobblin-conf-dir-path>/<execution-mode>/log4j.properties'. --jvmopts <jvm or gc options> String containing JVM flags to include, in addition to "-Xmx1g -Xms512m".
--jars <csv list of extra jars> Column-separated list of extra jars to put on the CLASSPATH.
--enable-gc-logs enables gc logs & dumps.
--show-classpath prints gobblin runtime classpath.
--cluster-name Name of the cluster to be used by helix & other services. ( default: gobblin_cluster).
--jt <resource manager URL> Only for mapreduce mode: Job submission URL, if not set, taken from ${HADOOP_HOME}/conf.
--fs <file system URL> Only for mapreduce mode: Target file system, if not set, taken from ${HADOOP_HOME}/conf.
--help Display this help.
--verbose Display full command used to start the process.
Gobblin Version: 0.15.0
Standalone: This mode starts all Gobblin services in single JVM on a single node. This mode is useful for development and light weight usage:
gobblin service standalone start
For more details and architecture on each execution mode, refer Standalone-Deployment
Mapreduce:
This mode is dependent on Hadoop (both MapReduce and HDFS) running locally or remote cluster. Before launching any Gobblin jobs on Hadoop MapReduce, check the Gobblin system configuration file located at
conf/mapreduce/application.properties
for propertyfs.uri
, which defines the file system URI used. The default value ishdfs://localhost:8020
, which points to the local HDFS on the default port 8020. Change it to the right value depending on your Hadoop/HDFS setup. For example, if you have HDFS setup somwhere on port 9000, then set the property as follows:fs.uri=hdfs://<namenode host name>:9000/
--jt
: resource manager URL--fs
: file system type value forfs.uri
This mode will have the minimum set of Gobblin jars, selected using
libs/gobblin-<module_name>-$GOBBLIN_VERSION.jar
, which is passed as-libjar
to hadoop command while running the job. These same set of jars also gets added to the HadoopDistributedCache
for use in the mappers. If a job has additional jars needed for task executions (in the mappers), those jars can also be included by using the--jars
option or the following job configuration property in the job configuration file:job.jars=<comma-separated list of jars the job depends on>
if
HADOOP_HOME
is set in the environment, Gobblin will add result ofhadoop classpath
prior to defaultGOBBLIN_CLASSPATH
to give them precedence while runningbin/gobblin
.All job data and persisted job/task states will be written to the specified file system. Before launching any jobs, make sure the environment variable
HADOOP_HOME
is set so that it can access hadoop binaries under{HADOOP_HOME}/bin
and also working directory should be set with configuration{gobblin.cluster.work.dir}
. Note that the Gobblin working directory will be created on the file system specified above.An important side effect of this is that (depending on the application) non-fully-qualified paths (like
/my/path
) will default to local file system ifHADOOP_HOME
is not set, while they will default to HDFS if the variable is set. When referring to local paths, it is always a good idea to use the fully qualified path (e.g.file:///my/path
).
Cluster Mode (master & worker) This is a cluster mode consist of master and worker process.
gobblin service cluster-master start
gobblin service cluster-worker start
AWS This mode starts Gobblin on AWS cloud cluster.
gobblin service aws start
YARN This mode starts Gobblin on YARN cluster.
gobblin service yarn start
Gobblin System Configurations
Following values can be overridden by setting it in gobblin-env.sh
GOBBLIN_LOGS
: by default the logs are written to $GOBBLIN_HOME/logs
, it can be overridden by setting GOBBLIN_LOGS
\
GOBBLIN_VERSION
: by default gobblin version is set by the build process, it can be overridden by setting GOBBLIN_VERSION
\
All Gobblin system configurations details can be found here: Configuration Properties Glossary.