Gobblin Commands & Execution Modes

The Gobblin distribution comes with a script ./bin/gobblin for all commands and services. Here is the usage:

    Usage:
    gobblin.sh cli <cli-command> <params>
    gobblin.sh service <execution-mode> <start|stop|status>
    Use "gobblin <cli|service> --help" for more information. (Gobblin Version: 0.15.0)

The usage for Gobblin CLI commands is as follows:

    Usage:
    gobblin.sh cli <cli-command> <params>

    options:
        cli-commands:
            passwordManager             Encrypt or decrypt strings for the password manager.
            decrypt                     Decryption utilities
            run                         Run a Gobblin application.
            config                      Query the config library
            jobs                        Command line job info and operations
            stateMigration              Command line tools for migrating state store
            job-state-to-json           To convert Job state to JSON
            cleaner                     Data retention utility
            keystore                    Examine JCE Keystore files
            watermarks                  Inspect streaming watermarks
            job-store-schema-manager    Database job history store schema manager

        --conf-dir <gobblin-conf-dir-path>    Gobblin config path. default is '$GOBBLIN_HOME/conf/<exe-mode-name>'.
        --log4j-conf <path-of-log4j-file>     default is '<gobblin-conf-dir-path>/<execution-mode>/log4j.properties'.
        --jvmopts <jvm or gc options>         String containing JVM flags to include, in addition to "-Xmx1g -Xms512m".
        --jars <csv list of extra jars>       Column-separated list of extra jars to put on the CLASSPATH.
        --enable-gc-logs                      enables gc logs & dumps.
        --show-classpath                      prints gobblin runtime classpath.
        --help                                Display this help.
        --verbose                             Display full command used to start the process.
        Gobblin Version: 0.15.0

Argument details:

  • --conf-dir: specifies the path to the directory containing Gobblin system configuration files, such as application.conf or reference.conf, log4j.properties, and quartz.properties.
  • --log4j-conf: specifies the path of the log4j config file, overriding the one in the config directory (default is <conf>/<gobblin-mode>/log4j.properties). Gobblin uses SLF4J and the slf4j-log4j12 binding for logging.
  • --jvmopts: specifies additional JVM parameters; the default is -Xmx1g -Xms512m.
  • --enable-gc-logs: adds GC options to JVM parameters: -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$GOBBLIN_LOGS/ -Xloggc:$GOBBLIN_LOGS/gobblin-$GOBBLIN_MODE-gc.log
  • --show-classpath: prints the full value of the classpath that Gobblin uses.
  • All other arguments are self-explanatory.
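
As a rough illustration of how these options combine, the sketch below only prints a command line rather than running it. It assumes that the "csv list" expected by --jars is comma-separated and that global options may follow the CLI command, as in the oneShot examples in this document; the jar paths and the JVM flag are placeholders.

```shell
# Join the given jar paths with commas (assumption: --jars takes a comma-separated list).
join_jars() (
  IFS=,
  printf '%s\n' "$*"
)

# Print (rather than execute) a CLI invocation that adds a JVM flag,
# enables GC logging, and puts extra jars on the classpath.
cli_cmd_with_opts() {
  printf 'bin/gobblin cli run listQuickApps --jvmopts "%s" --enable-gc-logs --jars %s --verbose\n' \
    "-Duser.timezone=UTC" "$(join_jars "$@")"
}

cli_cmd_with_opts /opt/extra/a.jar /opt/extra/b.jar
```

Printing the command first makes it easy to review the flags before running it; --verbose serves a similar purpose at launch time by displaying the full command used to start the process.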

Gobblin Commands

Gobblin provides the following CLI commands:

    Available commands:
        job-state-to-json           To convert Job state to JSON
        jobs                        Command line job info and operations
        passwordManager             Encrypt or decrypt strings for the password manager.
        run                         Run a Gobblin application.
        decrypt                     Decryption utilities
        job-store-schema-manager    Database job history store schema manager
        stateMigration              Command line tools for migrating state store
        keystore                    Examine JCE Keystore files
        config                      Query the config library
        watermarks                  Inspect streaming watermarks
        cleaner                     Data retention utility

Details on how to use the run command:

Gobblin ingestion applications can be accessed through the following command:

    gobblin cli run [listQuickApps] [<quick-app>] -jobName <jobName> [OPTIONS]

For usage run ./bin/gobblin cli run.

gobblin cli run uses EmbeddedGobblin and its subclasses to run Gobblin ingestion jobs, giving CLI access to most of the functionality available through EmbeddedGobblin. For example, the following command runs a Hello World job (it prints “Hello World 1 !” somewhere in the logs).

    gobblin cli run -jobName helloWorld -setTemplate resource:///templates/hello-world.template

Having to know the path to templates and exactly which configurations to set is daunting. The alternative is to use a quick app. Running:

    gobblin cli run listQuickApps

will print a list of available quick apps. To run a quick app:

    gobblin cli run <quick-app-name>

Quick apps may require additional arguments. For the usage of a particular app, run bin/gobblin cli run <quick-app-name> -h.

The Distcp Quick App

For example, consider the quick app distcp:

    $ gobblin cli run distcp -h
    usage: gobblin cli run distcp [OPTIONS] <source> <target>
     -delete                         Delete files in target that don't exist
                                     on source.
     -deleteEmptyParentDirectories   If deleting files on target, also delete
                                     newly empty parent directories.
     -distributeJar <arg>
     -h,--help
     -l                              Uses log to print out erros in the base CLI code.
     -mrMode
     -setConfiguration <arg>
     -setJobTimeout <arg>
     -setLaunchTimeout <arg>
     -setShutdownTimeout <arg>
     -simulate
     -update                         Specifies files should be updated if they're different in the source.
     -useStateStore <arg>

This shows the usage for the distcp app and lists all available options. Distcp can then be run as follows:

    gobblin cli run distcp file:///source/path file:///target/path
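
The sketch below shows one way to assemble a distcp invocation with the -update and -delete options from the help output, while making local paths fully qualified (file://), as this document recommends for local paths. The helper names are hypothetical, the host name is a placeholder, and the command is only printed, not executed.

```shell
# Turn a local path into a file:// URI; leave existing URIs untouched.
qualify_local() {
  case "$1" in
    *://*) printf '%s\n' "$1" ;;
    /*)    printf 'file://%s\n' "$1" ;;
    *)     printf 'file://%s/%s\n' "$(pwd)" "$1" ;;
  esac
}

# Print (rather than execute) a distcp command that updates changed files
# and deletes files on the target that no longer exist on the source.
distcp_mirror_cmd() {
  printf 'gobblin cli run distcp -update -delete %s %s\n' \
    "$(qualify_local "$1")" "$(qualify_local "$2")"
}

distcp_mirror_cmd /source/path hdfs://namenode:9000/target/path
```

Because -delete removes files on the target, reviewing the printed command before running it is a cheap safeguard.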

The OneShot Quick App

The Gobblin CLI also ships with a generic job runner, the oneShot quick app. You can use it to run a single job using a standard config file. This is very useful during development and testing, and it also makes it easy to integrate with schedulers that just need to fire off a command-line job. The oneShot app allows you to run a job in standalone mode or in map-reduce mode.

    $ gobblin cli run oneShot -baseConf <base-config-file> -appConf <path-to-job-conf-file>
    # The Base Config file is an optional parameter and contains defaults for your mode of
    # execution (e.g. standalone modes would typically use
    # gobblin-dist/conf/standalone/application.conf and
    # mapreduce mode would typically use gobblin-dist/conf/mapreduce/application.conf)
    #
    # The Job Config file is your regular .pull or .conf file and is a required parameter.
    # You should use a fully qualified URI to your pull file. Otherwise Gobblin will pick the
    # default FS configured in the environment, which may not be what you want.
    # e.g. file:///gobblin-conf/my-job/wikipedia.pull or hdfs:///gobblin-conf/my-job/kafka-hdfs.pull

The oneShot app comes with certain hardcoded defaults (inherited from EmbeddedGobblin) that you may not be expecting. Make sure you understand what they do, and override them in your baseConf or appConf files if needed.

Notable differences at the time of this writing include:

  • state.store.enabled = false (set this to true in your appConf or baseConf if you want state storage for repeated oneShot runs)
  • data.publisher.appendExtractToFinalDir = false (set this to true in your appConf or baseConf if you want to see the extract name appended to the job output directory)
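
A minimal sketch of overriding both defaults: the function below emits the two properties so they can be appended to a job config file. The function name is hypothetical, and the key=value syntax is an assumption about the config file in use.

```shell
# Emit overrides for the two oneShot defaults listed above.
# Assumption: the target config file accepts key=value lines.
oneshot_overrides() {
  cat <<'EOF'
state.store.enabled=true
data.publisher.appendExtractToFinalDir=true
EOF
}

oneshot_overrides
```

For example, `oneshot_overrides >> my-job.pull` would append them to a job file named my-job.pull (a placeholder name).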

The oneShot app allows for specifying the log4j file of your job execution which can be very helpful while debugging pesky failures. You can launch the job in MR-Mode by using the -mrMode switch.

  • oneShot execution of standalone with a log4j file:

        $ gobblin cli run oneShot -baseConf /app/gobblin-dist/conf/standalone/application.conf -appConf file:///app/kafkaConfDir/kafka-simple-hdfs.pull --log4j-conf /app/gobblin-dist/conf/standalone/log4j.properties

  • oneShot execution of a map-reduce job with a log4j file:

        $ gobblin cli run oneShot -mrMode -baseConf /app/gobblin-dist/conf/standalone/application.conf -appConf file:///app/kafkaConfDir/kafka-simple-hdfs.pull --log4j-conf /app/gobblin-dist/conf/standalone/log4j.properties
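
For scheduler integration, a thin wrapper along the following lines may be enough. It is a sketch under stated assumptions: GOBBLIN_BIN is a hypothetical environment variable for the launcher path (defaulting to bin/gobblin), and the wrapper simply returns whatever exit status the launcher reports. It rejects job config paths that are not fully qualified URIs, following the advice above.

```shell
# Run a single job through the oneShot quick app.
# GOBBLIN_BIN is a hypothetical override for the launcher path.
run_oneshot() {
  if [ "$#" -lt 1 ]; then
    echo "usage: run_oneshot <appConf-uri> [extra options]" >&2
    return 2
  fi
  app_conf="$1"
  shift
  case "$app_conf" in
    *://*) ;;  # already a fully qualified URI such as file:/// or hdfs:///
    *)
      echo "appConf should be a fully qualified URI, got: $app_conf" >&2
      return 2
      ;;
  esac
  # Any remaining arguments (for example -mrMode) are passed through unchanged.
  "${GOBBLIN_BIN:-bin/gobblin}" cli run oneShot -appConf "$app_conf" "$@"
}

# Example (not executed here):
#   run_oneshot file:///app/kafkaConfDir/kafka-simple-hdfs.pull -mrMode
```

A scheduler entry would then source this function and call run_oneshot with the job's URI.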

Developing quick apps for the CLI

It is very easy to convert a subclass of EmbeddedGobblin into a quick application for the Gobblin CLI. All that is needed is to implement an EmbeddedGobblinCliFactory that knows how to instantiate the EmbeddedGobblin from a CommandLine object, and to annotate it with the Alias annotation. There are two utility classes that make this very easy:

  • PublicMethodsGobblinCliFactory: this class will automatically infer CLI options from the public methods of a subclass of EmbeddedGobblin. All the developer has to do is implement the method constructEmbeddedGobblin(CommandLine) that calls the appropriate constructor of the desired EmbeddedGobblin subclass with parameters extracted from the CLI. Additionally, it is a good idea to override getUsageString() with the appropriate usage string. For an example, see gobblin.runtime.embedded.EmbeddedGobblinDistcp.CliFactory.
  • ConstructorAndPublicMethodsGobblinCliFactory: this class does everything PublicMethodsGobblinCliFactory does, but it additionally automatically infers how to construct the EmbeddedGobblin object from a constructor annotated with EmbeddedGobblinCliSupport. For an example, see gobblin.runtime.embedded.EmbeddedGobblin.CliFactory.

Implementing new Gobblin commands

To add a new Gobblin command that ./bin/gobblin can list and execute, implement gobblin.runtime.cli.CliApplication and annotate the implementation with the Alias annotation. The Gobblin CLI will automatically find the command, and users can invoke it by the Alias value.

Gobblin Service Execution Modes (as Daemon)

For more info on Gobblin service execution modes, run bin/gobblin service --help:

    Usage:
    gobblin.sh service <execution-mode> <start|stop|status>

    Argument Options:
        <execution-mode>                      standalone, cluster-master, cluster-worker, aws,
                                              yarn, mapreduce, service-manager.
        --conf-dir <gobblin-conf-dir-path>    Gobblin config path. default is '$GOBBLIN_HOME/conf/<exe-mode-name>'.
        --log4j-conf <path-of-log4j-file>     default is '<gobblin-conf-dir-path>/<execution-mode>/log4j.properties'.
        --jvmopts <jvm or gc options>         String containing JVM flags to include, in addition to "-Xmx1g -Xms512m".
        --jars <csv list of extra jars>       Column-separated list of extra jars to put on the CLASSPATH.
        --enable-gc-logs                      enables gc logs & dumps.
        --show-classpath                      prints gobblin runtime classpath.
        --cluster-name                        Name of the cluster to be used by helix & other services. ( default: gobblin_cluster).
        --jt <resource manager URL>           Only for mapreduce mode: Job submission URL, if not set, taken from ${HADOOP_HOME}/conf.
        --fs <file system URL>                Only for mapreduce mode: Target file system, if not set, taken from ${HADOOP_HOME}/conf.
        --help                                Display this help.
        --verbose                             Display full command used to start the process.
        Gobblin Version: 0.15.0

  1. Standalone: This mode starts all Gobblin services in a single JVM on a single node. This mode is useful for development and lightweight usage:

        gobblin service standalone start

    For more details on this execution mode and its architecture, refer to Standalone-Deployment.

  2. Mapreduce:

    This mode depends on Hadoop (both MapReduce and HDFS) running locally or on a remote cluster. Before launching any Gobblin jobs on Hadoop MapReduce, check the Gobblin system configuration file located at conf/mapreduce/application.properties for the property fs.uri, which defines the file system URI used. The default value is hdfs://localhost:8020, which points to the local HDFS on the default port 8020. Change it to the right value for your Hadoop/HDFS setup. For example, if you have HDFS set up somewhere on port 9000, set the property as follows: fs.uri=hdfs://<namenode host name>:9000/

    • --jt: resource manager URL
    • --fs: file system URL, used as the value for fs.uri

    This mode uses the minimum set of Gobblin jars, selected using libs/gobblin-<module_name>-$GOBBLIN_VERSION.jar, which is passed as -libjars to the hadoop command when running the job. The same set of jars is also added to the Hadoop DistributedCache for use in the mappers. If a job needs additional jars for task execution (in the mappers), those jars can be included by using the --jars option or the following property in the job configuration file:

        job.jars=<comma-separated list of jars the job depends on>

    If HADOOP_HOME is set in the environment, Gobblin adds the result of hadoop classpath before the default GOBBLIN_CLASSPATH, giving the Hadoop jars precedence when running bin/gobblin.

    All job data and persisted job/task states will be written to the specified file system. Before launching any jobs, make sure the environment variable HADOOP_HOME is set so that Gobblin can access the Hadoop binaries under ${HADOOP_HOME}/bin, and that the working directory is set with the configuration property gobblin.cluster.work.dir. Note that the Gobblin working directory will be created on the file system specified above.

    An important side effect of this is that (depending on the application) non-fully-qualified paths (like /my/path) will default to the local file system if HADOOP_HOME is not set, while they will default to HDFS if the variable is set. When referring to local paths, it is always a good idea to use the fully qualified path (e.g. file:///my/path).

  3. Cluster Mode (master & worker): This cluster mode consists of master and worker processes.

        gobblin service cluster-master start
        gobblin service cluster-worker start

  4. AWS: This mode starts Gobblin on an AWS cloud cluster.

        gobblin service aws start

  5. YARN: This mode starts Gobblin on a YARN cluster.

        gobblin service yarn start
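
The service commands above share one shape: an execution mode followed by start, stop, or status. The sketch below validates a mode against the list in the help output and prints the three lifecycle commands for it. The function names are hypothetical, and nothing is executed.

```shell
# Succeed only for the execution modes listed in the service help output.
is_service_mode() {
  case "$1" in
    standalone|cluster-master|cluster-worker|aws|yarn|mapreduce|service-manager) return 0 ;;
    *) return 1 ;;
  esac
}

# Print (rather than execute) the start, status, and stop commands for a mode.
service_lifecycle() {
  is_service_mode "$1" || { echo "unknown execution mode: $1" >&2; return 1; }
  for action in start status stop; do
    printf 'gobblin service %s %s\n' "$1" "$action"
  done
}

service_lifecycle standalone
```

Checking the mode up front catches typos such as "cluster_master" before any process is launched.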

Gobblin System Configurations

The following values can be overridden by setting them in gobblin-env.sh:

  • GOBBLIN_LOGS: by default, logs are written to $GOBBLIN_HOME/logs; this can be overridden by setting GOBBLIN_LOGS.
  • GOBBLIN_VERSION: by default, the Gobblin version is set by the build process; this can be overridden by setting GOBBLIN_VERSION.

Details of all Gobblin system configurations can be found in the Configuration Properties Glossary.