Gobblin Commands & Execution Modes
The Gobblin distribution comes with a script ./bin/gobblin for all commands and services.
Here is the usage:
    Usage:
    gobblin.sh  cli     <cli-command>    <params>
    gobblin.sh  service <execution-mode> <start|stop|status>

    Use "gobblin <cli|service> --help" for more information.         (Gobblin Version: 0.15.0)
For Gobblin CLI commands, run bin/gobblin cli --help:
    Usage:
    gobblin.sh  cli     <cli-command>    <params>

    options:
        cli-commands:
            passwordManager             Encrypt or decrypt strings for the password manager.
            decrypt                     Decryption utilities
            run                         Run a Gobblin application.
            config                      Query the config library
            jobs                        Command line job info and operations
            stateMigration              Command line tools for migrating state store
            job-state-to-json           To convert Job state to JSON
            cleaner                     Data retention utility
            keystore                    Examine JCE Keystore files
            watermarks                  Inspect streaming watermarks
            job-store-schema-manager    Database job history store schema manager

        --conf-dir <gobblin-conf-dir-path> Gobblin config path. default is '$GOBBLIN_HOME/conf/<exe-mode-name>'.
        --log4j-conf <path-of-log4j-file>  default is '<gobblin-conf-dir-path>/<execution-mode>/log4j.properties'.
        --jvmopts <jvm or gc options>      String containing JVM flags to include, in addition to "-Xmx1g -Xms512m".
        --jars <csv list of extra jars>    Column-separated list of extra jars to put on the CLASSPATH.
        --enable-gc-logs                   enables gc logs & dumps.
        --show-classpath                   prints gobblin runtime classpath.
        --help                             Display this help.
        --verbose                          Display full command used to start the process.
                                           Gobblin Version: 0.15.0
Argument details:
- --conf-dir: specifies the path to the directory containing Gobblin system configuration files, such as application.conf or reference.conf, log4j.properties and quartz.properties.
- --log4j-conf: specifies the path of a log4j config file that overrides the one in the config directory (default is <conf>/<gobblin-mode>/log4j.properties). Gobblin uses SLF4J and the slf4j-log4j12 binding for logging.
- --jvmopts: specifies any JVM parameters; the default is -Xmx1g -Xms512m.
- --enable-gc-logs: adds the following GC options to the JVM parameters: -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$GOBBLIN_LOGS/ -Xloggc:$GOBBLIN_LOGS/gobblin-$GOBBLIN_MODE-gc.log
- --show-classpath: prints the full value of the classpath that Gobblin uses.
- All other arguments are self-explanatory.
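To find the file that --enable-gc-logs writes to, the -Xloggc pattern above can be expanded by hand. The helper below is only an illustrative sketch, not code from the gobblin script: the function name gc_log_path is hypothetical, and the fallback to the current directory when GOBBLIN_HOME is unset is an assumption made so the snippet runs anywhere.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the GC log file for a given execution mode,
# following the pattern $GOBBLIN_LOGS/gobblin-$GOBBLIN_MODE-gc.log.
gc_log_path() {
  local mode="$1"
  # Documented default: logs go to $GOBBLIN_HOME/logs unless GOBBLIN_LOGS is set.
  # Falling back to "." when GOBBLIN_HOME is unset is an assumption of this sketch.
  local logs="${GOBBLIN_LOGS:-${GOBBLIN_HOME:-.}/logs}"
  printf '%s/gobblin-%s-gc.log\n' "$logs" "$mode"
}

gc_log_path standalone
```

With GOBBLIN_LOGS unset and GOBBLIN_HOME=/opt/gobblin (a hypothetical install path), this resolves to /opt/gobblin/logs/gobblin-standalone-gc.log.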
Gobblin Commands
Gobblin provides the following CLI commands:
    Available commands:
        job-state-to-json           To convert Job state to JSON
        jobs                        Command line job info and operations
        passwordManager             Encrypt or decrypt strings for the password manager.
        run                         Run a Gobblin application.
        decrypt                     Decryption utilities
        job-store-schema-manager    Database job history store schema manager
        stateMigration              Command line tools for migrating state store
        keystore                    Examine JCE Keystore files
        config                      Query the config library
        watermarks                  Inspect streaming watermarks
        cleaner                     Data retention utility
Details on how to use the run command:
Gobblin ingestion applications can be accessed through the following command:
gobblin cli run [listQuickApps] [<quick-app>] -jobName <jobName> [OPTIONS]
For usage, run ./bin/gobblin cli run.
gobblin cli run uses EmbeddedGobblin and its subclasses to run Gobblin ingestion jobs, giving CLI access to most functionality that could be achieved using EmbeddedGobblin. For example, the following command will run a Hello World job (it will print “Hello World 1 !” somewhere in the logs).
gobblin cli run -jobName helloWorld -setTemplate resource:///templates/hello-world.template
Knowing the path to the templates and exactly which configurations to set can be daunting. The alternative is to use a quick app. Running:
gobblin cli run listQuickApps
will provide a list of available quick apps. To run a quick app:
gobblin cli run <quick-app-name>
Quick apps may require additional arguments. For the usage of a particular app, run bin/gobblin cli run <quick-app-name> -h.
The Distcp Quick App
For example, consider the quick app distcp:
    $ gobblin cli run distcp -h
    usage: gobblin cli run distcp [OPTIONS] <source> <target>
     -delete                         Delete files in target that don't exist on source.
     -deleteEmptyParentDirectories   If deleting files on target, also delete newly empty parent directories.
     -distributeJar <arg>
     -h,--help
     -l                              Uses log to print out erros in the base CLI code.
     -mrMode
     -setConfiguration <arg>
     -setJobTimeout <arg>
     -setLaunchTimeout <arg>
     -setShutdownTimeout <arg>
     -simulate
     -update                         Specifies files should be updated if they're different in the source.
     -useStateStore <arg>
This provides usage for the app distcp, as well as listing all available options. Distcp could then be run:
gobblin cli run distcp file:///source/path file:///target/path
The OneShot Quick App
The Gobblin CLI also ships with a generic job runner, the oneShot quick app. You can use it to run a single job using a standard config file. This is very useful during development and testing, and it makes it easy to integrate with schedulers that just need to fire off a command-line job. The oneShot app allows you to run a job in standalone mode or in map-reduce mode.
    $ gobblin cli run oneShot -baseConf <base-config-file> -appConf <path-to-job-conf-file>

    # The Base Config file is an optional parameter and contains defaults for your mode of
    # execution (e.g. standalone modes would typically use
    # gobblin-dist/conf/standalone/application.conf and
    # mapreduce mode would typically use gobblin-dist/conf/mapreduce/application.conf)
    #
    # The Job Config file is your regular .pull or .conf file and is a required parameter.
    # You should use a fully qualified URI to your pull file. Otherwise Gobblin will pick the
    # default FS configured in the environment, which may not be what you want.
    # e.g file:///gobblin-conf/my-job/wikipedia.pull or hdfs:///gobblin-conf/my-job/kafka-hdfs.pull
The oneShot app comes with certain hardcoded defaults, inherited from EmbeddedGobblin, that you may not be expecting. Make sure you understand what they do, and override them in your baseConf or appConf files if needed.
Notable differences at the time of this writing include:
- state.store.enabled = false (set this to true in your appConf or baseConf file if you want state storage for repeated oneShot runs)
- data.publisher.appendExtractToFinalDir = false (set this to true in your appConf or baseConf file if you want to see the extract name appended to the job output directory)
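As a minimal sketch of overriding these two defaults, the snippet below appends them to a job config file. The file name my-job.pull and its temporary location are hypothetical; the property names and values are the ones listed above.

```shell
#!/usr/bin/env bash

# Hypothetical job config file, i.e. the file you would pass to -appConf.
APP_CONF="${APP_CONF:-$(mktemp -d)/my-job.pull}"

# Append the two overrides described above to the job config.
cat >> "$APP_CONF" <<'EOF'
# Keep job state between repeated oneShot runs.
state.store.enabled=true
# Append the extract name to the job output directory.
data.publisher.appendExtractToFinalDir=true
EOF

echo "Wrote overrides to $APP_CONF"
```

The same two lines can go in the baseConf file instead if the overrides should apply to every job run with that base configuration.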
The oneShot app allows for specifying the log4j file of your job execution which can be very helpful while debugging pesky failures. You can launch the job in MR-Mode by using the -mrMode switch.
- oneShot execution in standalone mode with a log4j file:
$ gobblin cli run oneShot -baseConf /app/gobblin-dist/conf/standalone/application.conf -appConf file:///app/kafkaConfDir/kafka-simple-hdfs.pull --log4j-conf /app/gobblin-dist/conf/standalone/log4j.properties
- oneShot execution of a map-reduce job with a log4j file:
$ gobblin cli run oneShot -mrMode -baseConf /app/gobblin-dist/conf/standalone/application.conf -appConf file:///app/kafkaConfDir/kafka-simple-hdfs.pull --log4j-conf /app/gobblin-dist/conf/standalone/log4j.properties
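When a scheduler fires off oneShot jobs, it can help to reject a job config that is not a fully qualified URI before Gobblin is launched. The wrapper below is a sketch under that assumption: the function names is_qualified_uri and oneshot_command are hypothetical, the scheme check is deliberately loose, and the script only prints the command line rather than executing it.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed only if the argument looks like scheme://...
# (a loose check, e.g. file:///a.pull or hdfs:///a.pull).
is_qualified_uri() {
  case "$1" in
    [a-zA-Z]*://*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the oneShot command line, or fail if the job config is not a qualified URI.
oneshot_command() {
  local base_conf="$1" app_conf="$2"
  if ! is_qualified_uri "$app_conf"; then
    echo "appConf must be a fully qualified URI: $app_conf" >&2
    return 1
  fi
  printf 'gobblin cli run oneShot -baseConf %s -appConf %s\n' "$base_conf" "$app_conf"
}

oneshot_command /app/gobblin-dist/conf/standalone/application.conf \
  file:///app/kafkaConfDir/kafka-simple-hdfs.pull
```

Printing the command instead of running it keeps the sketch safe to try on a machine without Gobblin; a real wrapper would execute the printed command.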
Developing quick apps for the CLI
It is easy to convert a subclass of EmbeddedGobblin into a quick application for the Gobblin CLI. All that is needed is to implement an EmbeddedGobblinCliFactory that knows how to instantiate the EmbeddedGobblin from a CommandLine object, and to annotate it with the Alias annotation. Two utility classes make this straightforward:
- PublicMethodsGobblinCliFactory: this class automatically infers CLI options from the public methods of a subclass of EmbeddedGobblin. All the developer has to do is implement the method constructEmbeddedGobblin(CommandLine), which calls the appropriate constructor of the desired EmbeddedGobblin subclass with parameters extracted from the CLI. Additionally, it is a good idea to override getUsageString() with the appropriate usage string. For an example, see gobblin.runtime.embedded.EmbeddedGobblinDistcp.CliFactory.
- ConstructorAndPublicMethodsGobblinCliFactory: this class does everything PublicMethodsGobblinCliFactory does, and additionally infers how to construct the EmbeddedGobblin object from a constructor annotated with EmbeddedGobblinCliSupport. For an example, see gobblin.runtime.embedded.EmbeddedGobblin.CliFactory.
Implementing new Gobblin commands
To add a new Gobblin command that ./bin/gobblin can list and execute, implement the class gobblin.runtime.cli.CliApplication and annotate it with the Alias annotation. The Gobblin CLI will automatically find the command, and users can invoke it by the Alias value.
Gobblin Service Execution Modes (as Daemon)
For more info on Gobblin service execution modes, run bin/gobblin service --help:
    Usage:
    gobblin.sh  service <execution-mode> <start|stop|status>

    Argument Options:
        <execution-mode>                   standalone, cluster-master, cluster-worker, aws,
                                           yarn, mapreduce, service-manager.

        --conf-dir <gobblin-conf-dir-path> Gobblin config path. default is '$GOBBLIN_HOME/conf/<exe-mode-name>'.
        --log4j-conf <path-of-log4j-file>  default is '<gobblin-conf-dir-path>/<execution-mode>/log4j.properties'.
        --jvmopts <jvm or gc options>      String containing JVM flags to include, in addition to "-Xmx1g -Xms512m".
        --jars <csv list of extra jars>    Column-separated list of extra jars to put on the CLASSPATH.
        --enable-gc-logs                   enables gc logs & dumps.
        --show-classpath                   prints gobblin runtime classpath.
        --cluster-name                     Name of the cluster to be used by helix & other services. ( default: gobblin_cluster).
        --jt <resource manager URL>        Only for mapreduce mode: Job submission URL, if not set, taken from ${HADOOP_HOME}/conf.
        --fs <file system URL>             Only for mapreduce mode: Target file system, if not set, taken from ${HADOOP_HOME}/conf.
        --help                             Display this help.
        --verbose                          Display full command used to start the process.
                                           Gobblin Version: 0.15.0
Standalone: This mode starts all Gobblin services in a single JVM on a single node. It is useful for development and lightweight usage:
gobblin service standalone start
For more details and architecture on each execution mode, refer to Standalone-Deployment.
Mapreduce:
This mode depends on Hadoop (both MapReduce and HDFS) running locally or on a remote cluster. Before launching any Gobblin jobs on Hadoop MapReduce, check the Gobblin system configuration file located at conf/mapreduce/application.properties for the property fs.uri, which defines the file system URI used. The default value is hdfs://localhost:8020, which points to the local HDFS on the default port 8020. Change it to the right value for your Hadoop/HDFS setup. For example, if your HDFS is set up on port 9000, set the property as follows:

    fs.uri=hdfs://<namenode host name>:9000/

The following options apply to this mode:
- --jt: resource manager URL
- --fs: file system type value for fs.uri

This mode uses the minimum set of Gobblin jars, selected using libs/gobblin-<module_name>-$GOBBLIN_VERSION.jar, which are passed as -libjars to the hadoop command when running the job. The same set of jars is also added to the Hadoop DistributedCache for use in the mappers. If a job needs additional jars for task execution (in the mappers), include them with the --jars option or with the following property in the job configuration file:

    job.jars=<comma-separated list of jars the job depends on>

If HADOOP_HOME is set in the environment, Gobblin adds the result of hadoop classpath ahead of the default GOBBLIN_CLASSPATH, so that those entries take precedence while running bin/gobblin.

All job data and persisted job/task states are written to the specified file system. Before launching any jobs, make sure the environment variable HADOOP_HOME is set so that Gobblin can access the Hadoop binaries under {HADOOP_HOME}/bin, and set the working directory with the configuration {gobblin.cluster.work.dir}. Note that the Gobblin working directory will be created on the file system specified above.

An important side effect of this is that (depending on the application) non-fully-qualified paths (like /my/path) will default to the local file system if HADOOP_HOME is not set, while they will default to HDFS if the variable is set. When referring to local paths, it is always a good idea to use the fully qualified path (e.g. file:///my/path).
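Because the meaning of an unqualified path changes depending on whether HADOOP_HOME is set, a script can qualify local paths itself before handing them to Gobblin. The function below is a small sketch of that idea; the name qualify_local_path is hypothetical and is not part of Gobblin.

```shell
#!/usr/bin/env bash

# Hypothetical helper: make a local path explicit as a file:// URI.
# Values that already carry a scheme (hdfs://, file://, ...) pass through unchanged.
qualify_local_path() {
  case "$1" in
    *://*) printf '%s\n' "$1" ;;                  # already qualified
    /*)    printf 'file://%s\n' "$1" ;;           # absolute local path
    *)     printf 'file://%s/%s\n' "$PWD" "$1" ;; # relative to the current directory
  esac
}

qualify_local_path /my/path          # prints file:///my/path
qualify_local_path hdfs:///my/path   # prints hdfs:///my/path
```

Applied to the arguments of a command such as gobblin cli run distcp, this keeps the source and target on the intended file system regardless of the environment.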
Cluster Mode (master & worker): This cluster mode consists of a master process and worker processes.
gobblin service cluster-master start
gobblin service cluster-worker start
AWS: This mode starts Gobblin on an AWS cloud cluster.
gobblin service aws start
YARN: This mode starts Gobblin on a YARN cluster.
gobblin service yarn start
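All of the start commands in this section share the shape gobblin service <execution-mode> <start|stop|status>. The sketch below validates a mode and action against the values listed in the service usage text before building the command. The function names are hypothetical, and the script only prints the command line instead of running it.

```shell
#!/usr/bin/env bash

# Execution modes and actions as listed by the service usage text.
GOBBLIN_MODES="standalone cluster-master cluster-worker aws yarn mapreduce service-manager"
GOBBLIN_ACTIONS="start stop status"

# Hypothetical helper: succeed only if $1 is one of the space-separated words in $2.
contains_word() {
  local word
  for word in $2; do
    [ "$word" = "$1" ] && return 0
  done
  return 1
}

# Print the service command for a mode/action pair, rejecting unknown values.
service_command() {
  local mode="$1" action="$2"
  contains_word "$mode" "$GOBBLIN_MODES" || { echo "unknown execution mode: $mode" >&2; return 1; }
  contains_word "$action" "$GOBBLIN_ACTIONS" || { echo "unknown action: $action" >&2; return 1; }
  printf 'gobblin service %s %s\n' "$mode" "$action"
}

service_command standalone status   # prints: gobblin service standalone status
```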
Gobblin System Configurations
The following values can be overridden by setting them in gobblin-env.sh:
GOBBLIN_LOGS: by default, logs are written to $GOBBLIN_HOME/logs; this can be overridden by setting GOBBLIN_LOGS.
GOBBLIN_VERSION: by default, the Gobblin version is set by the build process; this can be overridden by setting GOBBLIN_VERSION.
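A gobblin-env.sh that overrides both values might look like the sketch below. The log directory is a hypothetical path, and the version string simply repeats the one shown in the usage output above.

```shell
# gobblin-env.sh (sketch)

# Hypothetical log directory; replace it with a path that exists on your host.
export GOBBLIN_LOGS="/var/log/gobblin"

# Override the version only if the build-provided value is not the one you want.
export GOBBLIN_VERSION="0.15.0"
```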
All Gobblin system configurations details can be found here: Configuration Properties Glossary.
