For better performance of a Spark application, it is important to understand resource allocation and the Spark tuning process.
This article helps you understand how to calculate the number of executors, the executor memory, and the number of cores per executor required for better performance of your application.
Below is a sample spark-submit command:
./bin/spark-submit --class <class_name> --master <yarn/local/spark://host:port> --deploy-mode <cluster/client> --executor-memory ??? --num-executors ??? --executor-cores ??? <application_jar_path>
For static allocation, we are now going to calculate:
- Number of executors (--num-executors)
- Executor memory (--executor-memory)
- Number of cores per executor (--executor-cores)
Before going into the detailed calculation, we should have a basic knowledge of:
- What is an executor?
- What is a cluster manager?
- The basic Hadoop ecosystem
- What is YARN?
- What is a core?
- What is a task?
- What is a partition?
1. What is an executor?
An executor is a distributed agent responsible for executing tasks. Executors are managed by the executor backend. An executor is a JVM process that runs on a worker node; it runs tasks and keeps data in memory or on disk storage across them. Each application has its own executors. The number of executors can be specified inside the SparkConf or via the --num-executors command-line flag.
For more details please go through http://www.mycloudplace.com/what-is-spark-executor/
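As a quick illustration (the class name com.example.MyApp and the executor count are placeholder values, not from the article), the same executor count can be requested either with the command-line flag or with the equivalent spark.executor.instances property:
./bin/spark-submit --class com.example.MyApp --master yarn --deploy-mode cluster \
  --num-executors 4 <application_jar_path>
# Equivalent, using the configuration property instead of the flag:
./bin/spark-submit --class com.example.MyApp --master yarn --deploy-mode cluster \
  --conf spark.executor.instances=4 <application_jar_path>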
2. What is a cluster manager?
A cluster manager is used to acquire the resources on the cluster that are required by a Spark application. The different cluster managers are:
- Spark Standalone
- Apache Mesos
- Hadoop YARN
Deciding which cluster manager to use is also not an easy job.
3. The Hadoop ecosystem
It is important to understand the basics of the Hadoop ecosystem.
4. What is YARN?
Apache Hadoop YARN (Yet Another Resource Negotiator) is the resource management and job scheduling component of the Hadoop distributed processing framework. YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster and for scheduling tasks to be executed on different cluster nodes.
5. What is a core?
A CPU core is a processing unit within a CPU. In the past, every processor had just one core that could focus on one task at a time. Today, CPUs have from 2 to 18 or more cores, each of which can work on a different task.
One core can work on one task while another core works on a different task, so the more cores a CPU has, the more tasks it can handle at the same time.
6. What is a partition?
A partition is a small chunk of a large distributed data set. A partition in Spark is an atomic chunk of data (a logical division of the data) stored on a node in the cluster. Partitions are the basic units of parallelism in Apache Spark, and RDDs in Apache Spark are collections of partitions.
7. What is a task?
A task is a unit of work that can be run on a partition of a distributed dataset and gets executed on a single executor.
Always remember that the number of executors and their memory configuration play an important role in Spark job tuning. There are two ways to configure the executors and their memory:
- Dynamic Allocation of resources
- Static Allocation of resources
Dynamic Allocation of resources
For dynamic allocation of executors we just need to configure the following properties (a sample spark-submit command is shown after the property descriptions below):
- spark.dynamicAllocation.enabled (true/false)
- spark.dynamicAllocation.initialExecutors
- spark.dynamicAllocation.minExecutors
- spark.dynamicAllocation.maxExecutors
- spark.dynamicAllocation.schedulerBacklogTimeout
- spark.dynamicAllocation.executorIdleTimeout
1. spark.dynamicAllocation.enabled
The values are true/false. When this is set to true, we do not need to specify a fixed number of executors; Spark requests and releases them as needed.
2. spark.dynamicAllocation.initialExecutors
This is the initial number of executors to start with.
3. spark.dynamicAllocation.minExecutors
The minimum number of executors for the Spark application.
4. spark.dynamicAllocation.maxExecutors
The maximum number of executors allowed for the Spark application.
5. spark.dynamicAllocation.schedulerBacklogTimeout
If dynamic allocation is enabled and there have been pending tasks backlogged for more than this duration, new executors will be requested.
6. spark.dynamicAllocation.executorIdleTimeout
If dynamic allocation is enabled and an executor has been idle for more than this duration, the executor will be removed.
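Putting these properties together, a minimal sketch of a spark-submit command with dynamic allocation enabled might look like the following (the executor counts and timeout values are only illustrative; on YARN the external shuffle service, spark.shuffle.service.enabled=true, usually also has to be enabled for dynamic allocation to work):
./bin/spark-submit --class <class_name> --master yarn --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=2 \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.dynamicAllocation.schedulerBacklogTimeout=1s \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  <application_jar_path>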
Static Allocation of resources
To understand the static allocation of executors, let us take a scenario. Consider a cluster with the following configuration:
Number of nodes: 10
Number of cores per node: 16
Memory (RAM) per node: 64 GB
We have to calculate the values of these 3 parameters:
--executor-memory ??? --num-executors ??? --executor-cores ???
There are 3 approaches to calculating these parameter values:
- Tiny executors approach
- Fat executors approach
- Balanced approach (mix of fat and tiny) (recommended)
1. Tiny executors approach
In this approach, we assign one executor per core.
--num-executors = total number of cores in the cluster
= 10 * 16
= 160
--executor-cores = 1
--executor-memory = memory per node / number of executors per node
= 64 GB / 16
= 4 GB
RESULT: This approach is not good, as we are not taking advantage of running multiple tasks in the same JVM, and we also need to leave some memory and cores for the other daemons, which this approach does not do.
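Expressed as spark-submit flags, the tiny-executor configuration calculated above would be:
./bin/spark-submit --class <class_name> --master yarn --deploy-mode cluster --num-executors 160 --executor-cores 1 --executor-memory 4g <application_jar_path>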
2. Fat executors approach
In this approach we assign one executor per node, and its memory is all of the node's RAM.
--num-executors = one executor per node
= 10
--executor-cores = 16 (assign all cores to 1 executor)
--executor-memory = memory per node / number of executors per node
= 64 GB / 1
= 64 GB
RESULT: This approach is also not good, as we are not allocating any memory for the other daemons.
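The corresponding flags for the fat-executor configuration would be:
./bin/spark-submit --class <class_name> --master yarn --deploy-mode cluster --num-executors 10 --executor-cores 16 --executor-memory 64g <application_jar_path>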
3. Balanced approach (mix of fat and tiny) (recommended)
After research and a lot of R&D work, it has been found that the optimal value is 5 cores per executor.
Now we need to leave space (memory and cores) for the other daemons for better execution.
Leave 1 core per node for the Hadoop/YARN daemons, so cores available per node = 16 - 1 = 15
Total number of cores in the cluster = 15 * 10 => 150
--num-executors = (total cores / number of cores per executor)
= 150 / 5 => 30
= 30 - 1 => 29 (leaving 1 executor for the YARN ApplicationMaster)
--executor-cores = 5 (as we have already fixed)
Number of executors per node = cores available per node / cores per executor = 15 / 5 => 3
--executor-memory = memory per node / number of executors per node
= 64 GB / 3
= 21 GB (approx.)
= 21 GB - (7% of 21 GB) (leave space for memory overhead)
= 21 - 1.5 => 19 GB (approx.)
RESULT: This is the recommended approach: we leave memory and a core for the other daemons on each node, we account for the executor memory overhead, and we reserve one executor for the ApplicationMaster. In this approach we take advantage of parallel computing while still utilizing all of the resources.
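Putting the balanced-approach numbers back into the original command template (class name and jar path left as placeholders), the resulting spark-submit would look roughly like:
./bin/spark-submit --class <class_name> --master yarn --deploy-mode cluster \
  --num-executors 29 \
  --executor-cores 5 \
  --executor-memory 19g \
  <application_jar_path>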
It's a good read; it helped me understand the basics of the executor logic.
How did you calculate --executor-cores to be 3? You mentioned "5 cores per executor is optimal".
Could you elaborate on "executor-cores = 30 / 10 => 3"?
Thanks
Thanks Ryan. For the mixed approach we do not calculate --executor-cores, as we have already fixed it at 5. If we take executor-cores = 3 then we are not utilizing all of the available cores.
Good explanation, and now I understand dynamic and static allocation.
Thank you Pravallika
I think in the balanced approach the value taken for the number of executors per node is 5, but 5 is the number of cores per executor; the number of executors per node should be 3.
Correct me if I am wrong.
Thanks
Nice blog on calculation of executor memory.