Kubernetes Tools Spark Block Cleaner
Requirements
You should have a basic understanding of Kubernetes and Spark on Kubernetes before using spark-block-cleaner.
Scenes
When you run Spark on Kubernetes in Client mode and do not use an `emptyDir` volume for the Spark `local-dir`, executor pods may be deleted without cleaning up all of their block files, which can eventually fill the node's disk.
Therefore, we use Spark Block Cleaner to clear the block files accumulated by Spark.
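For reference, executors typically end up in this situation when their local-dir is backed by a hostPath volume on the node, configured with Spark confs like the following fragment (the path is illustrative):

```bash
# Fragment of a spark-submit command: back the executor local-dir with a
# hostPath volume on the node (illustrative path). Block files written under
# /spark/shuffle1 stay on the node after the executor pod is removed.
--conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path=/spark/shuffle1
--conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path=/spark/shuffle1
```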
Principle
When deploying Spark Block Cleaner, we configure volumes for the target folders, and Spark Block Cleaner discovers those folders through the `CACHE_DIRS` parameter.
Spark Block Cleaner cleans the configured folders in a fixed loop (the interval is set by `SCHEDULE_INTERVAL`). It selects sub-folders whose names start with `blockmgr` or `spark` for deletion, matching the naming pattern Spark uses when it creates them.
Before deleting a file, Spark Block Cleaner checks whether it was modified recently, i.e. whether its last modification falls within the window configured by `FILE_EXPIRED_TIME`. Only files older than that interval are deleted.
After cleaning, Spark Block Cleaner checks the disk utilization. If the remaining free space is below the value set by `FREE_SPACE_THRESHOLD`, it triggers a deep clean, whose file expiration time is controlled by `DEEP_CLEAN_FILE_EXPIRED_TIME`.
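For illustration, the loop described above can be sketched in shell as follows. This is not the actual implementation, only a simplified model that reuses the same environment variable names:

```bash
#!/usr/bin/env bash
# Simplified sketch of the cleaning loop described above.
# NOT the shipped implementation, only an illustration of its logic.
CACHE_DIRS=${CACHE_DIRS:-/data/data1,/data/data2}
FILE_EXPIRED_TIME=${FILE_EXPIRED_TIME:-604800}                       # seconds
DEEP_CLEAN_FILE_EXPIRED_TIME=${DEEP_CLEAN_FILE_EXPIRED_TIME:-432000} # seconds
FREE_SPACE_THRESHOLD=${FREE_SPACE_THRESHOLD:-60}                     # percent
SCHEDULE_INTERVAL=${SCHEDULE_INTERVAL:-3600}                         # seconds

clean() {  # $1 = cache dir, $2 = expiration time in seconds
  # Only descend into folders Spark created (blockmgr* / spark*) and delete
  # files that have not been modified for more than $2 seconds.
  find "$1" -maxdepth 1 -type d \( -name 'blockmgr*' -o -name 'spark*' \) \
    -exec find {} -type f -mmin +"$(( $2 / 60 ))" -delete \;
}

while true; do
  for dir in ${CACHE_DIRS//,/ }; do
    clean "$dir" "$FILE_EXPIRED_TIME"
    # Free space (percent) left on the filesystem backing this dir (GNU df).
    free=$(( 100 - $(df --output=pcent "$dir" | tail -1 | tr -d ' %') ))
    if [ "$free" -lt "$FREE_SPACE_THRESHOLD" ]; then
      clean "$dir" "$DEEP_CLEAN_FILE_EXPIRED_TIME"                   # deep clean
    fi
  done
  sleep "$SCHEDULE_INTERVAL"
done
```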
Usage
Before you start using Spark Block Cleaner, you should build its docker image.
Build Block Cleaner Docker Image
In the `KYUUBI_HOME` directory, you can use the following command to build the docker image:

```bash
docker build ./tools/spark-block-cleaner/kubernetes/docker
```
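If the Kubernetes nodes pull images from a registry, you will typically also want to tag and push the image; for example (the registry and tag below are placeholders):

```bash
docker build -t <your-registry>/spark-block-cleaner:<tag> ./tools/spark-block-cleaner/kubernetes/docker
docker push <your-registry>/spark-block-cleaner:<tag>
```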
Modify spark-block-cleaner.yml
You need to modify `${KYUUBI_HOME}/tools/spark-block-cleaner/kubernetes/spark-block-cleaner.yml` to fit your current environment.
In Kyuubi tools, we recommend deploying it as a `DaemonSet`, and we offer a default yaml file written in the DaemonSet way.
Base file structure:

```yaml
apiVersion
kind
metadata
  name
  namespace
spec
  selector
  template
    metadata
    spec
      containers
        - image
        - volumeMounts
        - env
      volumes
```
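For orientation, a minimal DaemonSet following this structure could look like the sketch below. The name, namespace, labels, and image are placeholders; use the values from the shipped `spark-block-cleaner.yml` rather than these.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: spark-block-cleaner      # placeholder
  namespace: kyuubi              # placeholder
spec:
  selector:
    matchLabels:
      name: spark-block-cleaner
  template:
    metadata:
      labels:
        name: spark-block-cleaner
    spec:
      containers:
        - name: spark-block-cleaner
          image: <your-registry>/spark-block-cleaner:<tag>   # placeholder
          volumeMounts:
            - name: block-files-dir-1
              mountPath: /data/data1
          env:
            - name: CACHE_DIRS
              value: /data/data1
      volumes:
        - name: block-files-dir-1
          hostPath:
            path: /spark/shuffle1
```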
You can tune the behavior of Spark Block Cleaner through the parameters in the container `env` part of `spark-block-cleaner.yml`.
```yaml
env:
  - name: CACHE_DIRS
    value: /data/data1,/data/data2
  - name: FILE_EXPIRED_TIME
    value: "604800"
  - name: DEEP_CLEAN_FILE_EXPIRED_TIME
    value: "432000"
  - name: FREE_SPACE_THRESHOLD
    value: "60"
  - name: SCHEDULE_INTERVAL
    value: "3600"
```
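With the defaults above, a regular cleaning pass removes block files older than 604800 seconds (7 days), a deep clean lowers the expiration to 432000 seconds (5 days), a pass runs every 3600 seconds (1 hour), and a deep clean is triggered when free space drops below 60%.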
Most importantly, configure `volumeMounts` and `volumes` to correspond to the Spark local-dirs.
For example, if Spark uses `/spark/shuffle1` as a local-dir, you can configure it like this:
```yaml
volumes:
  - name: block-files-dir-1
    hostPath:
      path: /spark/shuffle1
volumeMounts:
  - name: block-files-dir-1
    mountPath: /data/data1
env:
  - name: CACHE_DIRS
    value: /data/data1
```
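If Spark uses more than one local-dir on the node, add one volume/volumeMount pair per directory and list every mount path in `CACHE_DIRS`. For example, with `/spark/shuffle1` and a second, illustrative `/spark/shuffle2`:

```yaml
volumes:
  - name: block-files-dir-1
    hostPath:
      path: /spark/shuffle1
  - name: block-files-dir-2
    hostPath:
      path: /spark/shuffle2
volumeMounts:
  - name: block-files-dir-1
    mountPath: /data/data1
  - name: block-files-dir-2
    mountPath: /data/data2
env:
  - name: CACHE_DIRS
    value: /data/data1,/data/data2
```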
Start daemonSet
After you finish the modifications above, you can start the DaemonSet with the following command:

```bash
kubectl apply -f ${KYUUBI_HOME}/tools/spark-block-cleaner/kubernetes/spark-block-cleaner.yml
```
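To verify that the cleaner is running, you can check the DaemonSet and inspect a pod's logs; the namespace and pod name below are placeholders that depend on your `spark-block-cleaner.yml`:

```bash
kubectl -n <namespace> get daemonset
kubectl -n <namespace> get pods -o wide
kubectl -n <namespace> logs <spark-block-cleaner-pod-name> --tail=100
```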
Related parameters
Name | Default | Unit | Meaning |
---|---|---|---|
CACHE_DIRS | /data/data1,/data/data2 | - | The target directories (container paths) whose block files will be cleaned. |
FILE_EXPIRED_TIME | 604800 | seconds | The cleaner deletes block files whose last modification time is older than this. |
DEEP_CLEAN_FILE_EXPIRED_TIME | 432000 | seconds | During a deep clean, the cleaner deletes block files whose last modification time is older than this. |
FREE_SPACE_THRESHOLD | 60 | % | After the first clean, if the free space is lower than this threshold, a deep clean is triggered. |
SCHEDULE_INTERVAL | 3600 | seconds | How long the cleaner sleeps between cleaning passes. |