Profiling Spark Applications with Java Flight Recorder

Java Flight Recorder for Spark on Yarn

In a previous post we discussed how to enable Java Flight Recorder for Java applications. Here we show how to enable it for Spark applications running on Yarn. As an example, we use “RDDRelation”, a Spark SQL application that ships with the Spark distribution. We run “RDDRelation” on Spark 2.4.4 and Yarn 2.6.5.

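Two configurations are necessary, one for the driver program and one for the executors, as shown in the code snippet below. Each recording stops after 600 seconds and is saved in the working directory of the corresponding driver or executor. In client mode the driver runs on the submitting machine, so “events.jfr” is written to the directory from which spark-submit was launched. Each executor runs in a Yarn container, and by default Yarn creates the container working directories under the following location (the default value of “${yarn.nodemanager.local-dirs}”): “/tmp/hadoop-${user}/nm-local-dir/usercache/${user}/appcache/”.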

#!/bin/bash                                                                                                                                                                 

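# Edit the following parameter to match your Spark installation.
PATH_TO_SPARK_EXAMPLE_JAR=${SPARK_HOME}/examples/jars/spark-examples_2.11-2.4.4.jar

# Remove the result folder (if already created).
# "-rmr" is deprecated; "-rm -r -f" also succeeds when the folder is missing.
hdfs dfs -rm -r -f pair.parquet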

spark-submit \
    --class org.apache.spark.examples.sql.RDDRelation \
    --master yarn \
    --deploy-mode client \
    --conf "spark.driver.extraJavaOptions=-XX:+UnlockCommercialFeatures -XX:+FlightRecorder -XX:StartFlightRecording=duration=600s,filename=events.jfr" \
    --conf "spark.executor.extraJavaOptions=-XX:+UnlockCommercialFeatures -XX:+FlightRecorder -XX:StartFlightRecording=duration=600s,filename=executor.jfr" \
    --conf "spark.yarn.preserve.staging.files=true" \
    "$PATH_TO_SPARK_EXAMPLE_JAR"
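
The driver and executor options in the script differ only in the recording file name, so they could be generated by a small helper instead of being written out twice. The following is a minimal sketch of that idea; “jfr_opts” is a hypothetical function name that is not part of Spark or the JDK, and the flags are simply the ones used in the script above.

```shell
#!/bin/bash

# Hypothetical helper (not part of Spark or the JDK): builds the JVM options
# that start a Flight Recorder session.
#   $1 = recording duration (e.g., 600s)
#   $2 = name of the recording file
jfr_opts() {
    local duration="$1"
    local filename="$2"
    printf '%s\n' "-XX:+UnlockCommercialFeatures -XX:+FlightRecorder -XX:StartFlightRecording=duration=${duration},filename=${filename}"
}

# Usage with spark-submit:
#   --conf "spark.driver.extraJavaOptions=$(jfr_opts 600s events.jfr)"
#   --conf "spark.executor.extraJavaOptions=$(jfr_opts 600s executor.jfr)"
```

Keeping the duration and flags in one place makes it harder for the driver and executor settings to drift apart when you change the recording length.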

When log aggregation is enabled (“${yarn.log-aggregation-enable}” is set to true), the node manager aggregates all the container logs into a single file and uploads it to the distributed file system at the following location: “${yarn.nodemanager.remote-app-log-dir}/${user.name}/logs/”. Soon after, the node manager deletes the container logs from the local user logs directory. For this reason, you should set the “${yarn.nodemanager.delete.debug-delay-sec}” property to a reasonable value (e.g., 600 seconds). This setting keeps the application's local files and logs on the node for the specified duration after the application finishes, allowing you to retrieve the recording file (i.e., “executor.jfr”) for further performance analysis.
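
Once the deletion delay is in place, the recordings still have to be located on each node, because every container writes its own “executor.jfr”. A minimal sketch of how this might be done is shown below; “list_jfr_recordings” is a hypothetical helper name, and the default path assumes the default Yarn local directory mentioned earlier, so adjust it if your cluster uses a different location.

```shell
#!/bin/bash

# Hypothetical helper: lists Flight Recorder files below a Yarn local directory.
#   $1 = base directory to search (optional); defaults to the default
#        appcache location, which is an assumption about your cluster setup.
list_jfr_recordings() {
    local base="${1:-/tmp/hadoop-${USER}/nm-local-dir/usercache/${USER}/appcache}"
    # Container working directories live below the base directory,
    # so search recursively for files ending in ".jfr".
    find "$base" -type f -name '*.jfr' 2>/dev/null | sort
}

# Usage on a node manager host:
#   list_jfr_recordings
#   list_jfr_recordings /path/to/other/nm-local-dir/usercache/${USER}/appcache
```

Run the helper on each node that hosted an executor, then copy the listed files to your workstation before the deletion delay expires.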
