Resource Center Configuration Details

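  • The Resource Center is typically used for uploading files and UDF functions, and for managing task groups.
  • The Resource Center can connect to a distributed file storage system such as Hadoop (2.6+) or a MinIO cluster, or to remote object storage such as AWS S3, Alibaba Cloud OSS, or Huawei Cloud OBS.
  • The Resource Center can also connect directly to the local file system. In standalone mode you do not need an external storage system such as Hadoop or S3, so you can try it out against the local file system.
  • In addition, for cluster deployments, you can mount S3 locally with S3FS-FUSE or mount OSS locally with JINDO-FUSE, and then point the Resource Center at the local file system to operate on files held in remote object storage.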
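As a rough sketch of the FUSE approach in the last bullet, the fragment below points the Resource Center at a locally mounted bucket. The mount point /mnt/s3 is a hypothetical path, not one defined by DolphinScheduler, and the fragment assumes the bucket is already mounted at the same path on every api-server and worker-server host.

```properties
# Assumption: the remote bucket is already mounted at /mnt/s3 (hypothetical path)
# on every host that runs api-server or worker-server.
resource.storage.type=LOCAL
# Base directory inside the mounted bucket; the deployment user needs read and write access.
resource.storage.upload.base.path=/mnt/s3/dolphinscheduler
```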

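Connecting to the Local File System

Configure the common.properties file

The DolphinScheduler Resource Center uses the local file system by default, with no additional configuration required. If you need to change the defaults, make sure you complete all of the changes below.

  • If you deploy DolphinScheduler in cluster or pseudo-cluster mode, configure both of the following files: api-server/conf/common.properties and worker-server/conf/common.properties.
  • If you deploy DolphinScheduler in standalone mode, you only need to configure standalone-server/conf/common.properties.

You may need to make the following change:

  • Set resource.storage.upload.base.path to a local storage path, and make sure the user that deploys DolphinScheduler has read and write permissions on it, for example: resource.storage.upload.base.path=/tmp/dolphinscheduler. If the path does not exist, the directory is created automatically.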

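Note

  1. LOCAL mode does not support distributed reads and writes, so uploaded resources can only be used on a single machine unless a shared file mount point is used.
  2. If you do not want to use the default value as the base path of the Resource Center, change the value of resource.storage.upload.base.path.
  3. Setting resource.storage.type=LOCAL is equivalent to setting two configuration items: resource.storage.type=HDFS and resource.hdfs.fs.defaultFS=file:///. The separate resource.storage.type=LOCAL value exists for user convenience and allows the local Resource Center to be enabled by default.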
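A minimal sketch of the local-storage setup described above. Both values are the examples already used in this section; replace the path with any directory the deployment user can read and write.

```properties
# Store Resource Center files on the local file system (the default).
resource.storage.type=LOCAL
# Base directory for uploaded resources; created automatically if it does not exist.
resource.storage.upload.base.path=/tmp/dolphinscheduler
```

In cluster or pseudo-cluster mode the same values would go into both api-server/conf/common.properties and worker-server/conf/common.properties.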

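Connecting to AWS S3

To upload resources to S3 through the Resource Center, configure both of the following files: api-server/conf/common.properties and worker-server/conf/common.properties. For example:

Configure the following fields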

  ......
  resource.storage.type=S3
  ......
  resource.aws.access.key.id=aws_access_key_id
  # The AWS secret access key. Required if resource.storage.type=S3 or if you use EMR-Task.
  resource.aws.secret.access.key=aws_secret_access_key
  # The AWS Region to use. Required if resource.storage.type=S3 or if you use EMR-Task.
  resource.aws.region=us-west-2
  # The name of the bucket. You need to create it yourself; otherwise the system cannot start. All buckets in Amazon S3 share a single namespace, so make sure the bucket has a unique name.
  resource.aws.s3.bucket.name=dolphinscheduler
  # Set this parameter when using a private cloud S3. If you use a public cloud S3, you only need to set resource.aws.region, or set this to a public cloud endpoint such as S3.cn-north-1.amazonaws.com.cn
  resource.aws.s3.endpoint=
  ......

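Connecting to Distributed or Remote Object Storage

When you create or upload files through the Resource Center, all files and resources are stored on a distributed file system such as HDFS or on remote object storage such as S3. The following configuration is therefore required:

Configure the common.properties file

Since version 3.0.0-alpha, to upload resources to HDFS or S3 through the Resource Center, configure both of the following files: api-server/conf/common.properties and worker-server/conf/common.properties. For example: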

  #
  # Licensed to the Apache Software Foundation (ASF) under one or more
  # contributor license agreements. See the NOTICE file distributed with
  # this work for additional information regarding copyright ownership.
  # The ASF licenses this file to You under the Apache License, Version 2.0
  # (the "License"); you may not use this file except in compliance with
  # the License. You may obtain a copy of the License at
  #
  # http://www.apache.org/licenses/LICENSE-2.0
  #
  # Unless required by applicable law or agreed to in writing, software
  # distributed under the License is distributed on an "AS IS" BASIS,
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  # See the License for the specific language governing permissions and
  # limitations under the License.
  #
  # user data local directory path, please make sure the directory exists and has read and write permissions
  data.basedir.path=/tmp/dolphinscheduler
  # resource storage type: LOCAL, HDFS, S3, OSS, GCS, ABS, OBS
  resource.storage.type=LOCAL
  # resource storage path on HDFS/S3/OSS; resource files are stored under this path (self-configured),
  # please make sure the directory exists on hdfs and has read and write permissions. "/dolphinscheduler" is recommended
  resource.storage.upload.base.path=/tmp/dolphinscheduler
  # The AWS access key. Required if resource.storage.type=S3 or if you use EMR-Task
  resource.aws.access.key.id=minioadmin
  # The AWS secret access key. Required if resource.storage.type=S3 or if you use EMR-Task
  resource.aws.secret.access.key=minioadmin
  # The AWS Region to use. Required if resource.storage.type=S3 or if you use EMR-Task
  resource.aws.region=cn-north-1
  # The name of the bucket. You need to create it yourself; otherwise the system cannot start. All buckets in Amazon S3 share a single namespace, so make sure the bucket has a unique name.
  resource.aws.s3.bucket.name=dolphinscheduler
  # Set this parameter when using a private cloud S3. If you use a public cloud S3, you only need to set resource.aws.region, or set this to a public cloud endpoint such as S3.cn-north-1.amazonaws.com.cn
  resource.aws.s3.endpoint=http://localhost:9000
  # alibaba cloud access key id, required if you set resource.storage.type=OSS
  resource.alibaba.cloud.access.key.id=<your-access-key-id>
  # alibaba cloud access key secret, required if you set resource.storage.type=OSS
  resource.alibaba.cloud.access.key.secret=<your-access-key-secret>
  # alibaba cloud region, required if you set resource.storage.type=OSS
  resource.alibaba.cloud.region=cn-hangzhou
  # oss bucket name, required if you set resource.storage.type=OSS
  resource.alibaba.cloud.oss.bucket.name=dolphinscheduler
  # oss bucket endpoint, required if you set resource.storage.type=OSS
  resource.alibaba.cloud.oss.endpoint=https://oss-cn-hangzhou.aliyuncs.com
  # huawei cloud access key id, required if you set resource.storage.type=OBS
  resource.huawei.cloud.access.key.id=<your-access-key-id>
  # huawei cloud access key secret, required if you set resource.storage.type=OBS
  resource.huawei.cloud.access.key.secret=<your-access-key-secret>
  # obs bucket name, required if you set resource.storage.type=OBS
  resource.huawei.cloud.obs.bucket.name=dolphinscheduler
  # obs bucket endpoint, required if you set resource.storage.type=OBS
  resource.huawei.cloud.obs.endpoint=obs.cn-southwest-2.huaweicloud.com
  # if resource.storage.type=HDFS, the user must have the permission to create directories under the HDFS root path
  resource.hdfs.root.user=root
  # if resource.storage.type=S3, the value is like: s3a://dolphinscheduler;
  # if resource.storage.type=HDFS and namenode HA is enabled, you need to copy core-site.xml and hdfs-site.xml to the conf dir
  resource.hdfs.fs.defaultFS=hdfs://localhost:8020
  # whether to start up kerberos
  hadoop.security.authentication.startup.state=false
  # java.security.krb5.conf path
  java.security.krb5.conf.path=/opt/krb5.conf
  # login user from keytab username
  login.user.keytab.username=hdfs-mycluster@ESZ.COM
  # login user from keytab path
  login.user.keytab.path=/opt/hdfs.headless.keytab
  # kerberos expire time, the unit is hour
  kerberos.expire.time=2
  # resource view suffixes
  #resource.view.suffixs=txt,log,sh,bat,conf,cfg,py,java,sql,xml,hql,properties,json,yml,yaml,ini,js
  # resourcemanager port, the default value is 8088 if not specified
  resource.manager.httpaddress.port=8088
  # if resourcemanager HA is enabled, please set the HA IPs; if resourcemanager is single, keep this value empty
  yarn.resourcemanager.ha.rm.ids=192.168.xx.xx,192.168.xx.xx
  # if resourcemanager HA is enabled or resourcemanager is not used, please keep the default value;
  # if resourcemanager is single, you only need to replace localhost with the actual resourcemanager hostname
  yarn.application.status.address=http://localhost:%s/ws/v1/cluster/apps/%s
  # job history status url when application number threshold is reached (default 10000, maybe it was set to 1000)
  yarn.job.history.status.address=http://localhost:19888/ws/v1/history/mapreduce/jobs/%s
  # datasource encryption enable
  datasource.encryption.enable=false
  # datasource encryption salt
  datasource.encryption.salt=!@#$%^&*
  # data quality jar directory path; data quality jars are auto-discovered from this directory. Keep it empty if you have not changed anything in
  # data-quality, and dolphinscheduler will discover the jar itself. Change it only if you want to use your own data-quality jar and it is not in the worker-server
  # libs directory (but make sure your jar name starts with `dolphinscheduler-data-quality`).
  data-quality.jar.dir=
  #data-quality.error.output.path=/tmp/data-quality-error-data
  # Whether hive SQL is executed in the same session
  support.hive.oneSession=false
  # use sudo or not; if set to true, the executing user is the tenant user and the deploy user needs sudo permissions;
  # if set to false, the executing user is the deploy user and doesn't need sudo permissions
  sudo.enable=true
  # network interface preferred like eth0, default: empty
  #dolphin.scheduler.network.interface.preferred=
  # network IP gets priority, default: inner outer
  #dolphin.scheduler.network.priority.strategy=default
  # system env path
  #dolphinscheduler.env.path=env/dolphinscheduler_env.sh
  # development state
  development.state=false
  # rpc port
  alert.rpc.port=50052
  # way to collect applicationId: log (original regex match), aop
  appId.collect=log

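Note

  • If you only configure api-server/conf/common.properties, resource upload is enabled, but that alone is not enough for normal use. To run the uploaded files in a workflow, you must also configure worker-server/conf/common.properties.
  • If you use the resource upload feature, the deployment user chosen during installation must have the corresponding permissions.
  • If the NameNode of your Hadoop cluster is configured for HA, you need to enable HDFS-type resource upload and also copy core-site.xml and hdfs-site.xml from the Hadoop cluster to worker-server/conf and api-server/conf. Skip this step if NameNode HA is not enabled.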
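For the HDFS case, the relevant subset of common.properties might look like the sketch below. The keys are the ones shown in the full file above; the nameservice ID mycluster is a hypothetical placeholder and would have to match the dfs.nameservices value in your own hdfs-site.xml. Without NameNode HA, a single-NameNode address such as hdfs://localhost:8020 would be used instead.

```properties
# Store Resource Center files on HDFS.
resource.storage.type=HDFS
# Base directory on HDFS; "/dolphinscheduler" is the recommended value.
resource.storage.upload.base.path=/dolphinscheduler
# This user must be allowed to create directories under the HDFS root path.
resource.hdfs.root.user=root
# Assumption: NameNode HA with a nameservice called "mycluster" (hypothetical).
# core-site.xml and hdfs-site.xml must also be copied into api-server/conf and worker-server/conf.
resource.hdfs.fs.defaultFS=hdfs://mycluster
```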