Spark Resources Metrics plugin
is an Apache Spark plugin that registers metrics onto the Apache Spark metrics system, that will sink values collected from operational system's resources, aiming to cover metrics that the Spark metrics system does not provide, like the Ganglia monitoring system metrics.
Latest Version | Build | Coverage |
---|---|---|
Releases | Scala 2.12 | Scala 2.13 |
---|---|---|
Maven Central | ||
Sonatype Nexus | ||
Snapshot |
The Spark Resources Metrics plugin
is intended to be used together with the native Spark metrics system (click for details). In order to properly show the metric values collected by this plugin, the Spark metrics system has to be set to report metrics on the plugin's supported Spark components, which currently are:
driver
executor
This is done in the sink configuration properties of the Spark metrics system. Usually to cover the supported components, you may set the component
configuration detail to "all components" (*
) in the property names. e.g.: spark.metrics.conf.*.sink.graphite.class
Choose your preferred method to include Spark Resources Metrics plugin
package in your Spark environment:
β οΈ Attention: switch to the desired Scala version (2.12 or 2.13) when using the links or JARs names.
1. Adding a new property to the Spark configuration file (click to expand)
You may choose to edit the spark-default.conf
file, adding the following property and respective value, which will download the package through Maven Central:
spark.jars.packages io.github.dutrevis:spark-resources-metrics-plugin_2.12:0.1
2. Using a local package manager to download and copy (click to expand)
You may opt to use a local package manager like Maven to download and copy the package and its dependencies into your local Spark JARs folder:
mvn dependency:get -Dartifact="io.github.dutrevis:spark-resources-metrics-plugin_2.12:0.1"
mvn dependency:copy -Dartifact="io.github.dutrevis:spark-resources-metrics-plugin_2.12:0.1" -DoutputDirectory="$SPARK_HOME/jars"
3. Adding a flag to your CLI Spark call (click to expand)
You may choose to add one of these flags with property-value pairs to your CLI Spark call:
-
External download - choose one (Internet access required):
--packages io.github.dutrevis:spark-resources-metrics-plugin_2.12:0.1
--jars https://s01.oss.sonatype.org/content/repositories/releases/io/github/dutrevis/spark-resources-metrics-plugin_2.12/0.1/spark-resources-metrics-plugin_2.12-0.1.jar
-
Local deployments - choose one (prior JAR download required):
--conf spark.driver.extraClassPath=/path/to/spark-resources-metrics-plugin_2.12-0.1.jar
--jars /path/to/spark-resources-metrics-plugin_2.12-0.1.jar
π‘ Tip: the
--jars
CLI option can be used in YARN backend to make the plugin JAR available to both executors and cluster-mode drivers.
The plugin is composed of classes that, once activated in Apache Spark, register a group of metrics related to their distinct resources into the native Spark metrics system.
After the package is installed, these classes may be activated by being declared in the spark.plugins
property.
1. You may add the property and its value to the "spark-default.conf" file (click to expand)
spark.plugins io.github.dutrevis.CPUMetrics,io.github.dutrevis.MemoryMetrics
2. You may add a flag (with its property-value pair) to your CLI Spark call (click to expand)
--conf "spark.plugins"="io.github.dutrevis.CPUMetrics,io.github.dutrevis.MemoryMetrics"
3. You may add a config to the Spark Context created with PySpark or Scala (click to expand)
.config("spark.plugins", "io.github.dutrevis.CPUMetrics,io.github.dutrevis.MemoryMetrics")
π Note: as seen on Spark docs, properties set programmatically on the Spark Context take highest precedence, then flags passed through CLI calls like
spark-submit
orspark-shell
, then options in thespark-defaults.conf
file.
Once configured and registered onto the Spark metrics system, sinked metrics will be found in the following naming format:
{SparkComponent}.plugin.io.github.dutrevis.{ClassName}.{Metric}
Each class registers a group of metrics collected. For details of each metric, consult its specific collect
method inside its class.
The current available plugin classes are meant to be used when Spark is running in clusters with standalone, Mesos or YARN resource managers.
MemoryMetrics: collects memory resource metrics from a unix-based operating system. Memory metrics are obtained from the numbers of each line of the /proc/meminfo
file, available at the proc pseudo-filesystem of unix-based operating systems. The file has statistics about memory usage on the system, arranged in lines consisted of a parameter name, followed by a colon, the value of the parameter, and an option unit of measurement.
π Notes:
- While the
/proc/meminfo
file shows kilobytes (kB; 1 kB equals 1000 B), its unit is actually kibibytes (KiB; 1 KiB equals 1024 B). This imprecision is known, but is not corrected due to legacy concerns.- Many fields have been present since at least Linux 2.6.0, but most of the other fields are available at specific Linux versions (as noted in each method docstring) or are displayed only if the kernel was configured with specific options. If these fields are not found, the metrics won't be registered onto Dropwizard's metric system.
Metrics registered:
TotalMemory
FreeMemory
UsedMemory
SharedMemory
BufferMemory
TotalSwapMemory
FreeSwapMemory
CachedSwapMemory
UsedSwapMemory
CPUMetrics: collects CPU resource metrics from a unix-based operating system. CPU metrics are obtained from the numbers of the first line of the /proc/stat
file, available at the proc pseudo-filesystem of unix-based operating systems. These numbers identify the amount of time the CPU has spent performing different kinds of work, arranged in columns at the following order: "cpu_user", "cpu_nice", "cpu_system", "cpu_idle", "cpu_iowait", "cpu_irq" and "cpu_softirq".
π Notes:
- All of the numbers retrieved are aggregates since the system first booted.
- Time units are in USER_HZ or Jiffies (typically hundredths of a second).
- Values for "cpu_steal", "cpu_guest" and "cpu_guest_nice", available at spectific Linux versions, are not parsed from the file.
Metrics registered:
UserCPU
NiceCPU
SystemCPU
IdleCPU
WaitCPU
Say you will use the Graphite sink to report metrics and you wish to monitor the memory usage of your Spark cluster, then you want to instrument your Spark deployment using Spark Resources Metrics plugin
's class MemoryMetrics
.
First, you have to decide how to install Spark Resources Metrics plugin
.
For environments with customizable configurations, like container images, you may consider adding an installation step with Maven or adding a custom Spark configuration file, such as the example below:
Spark configuration file example
spark.jars.packages io.github.dutrevis:spark-resources-metrics-plugin_2.12:0.1
spark.plugins.defaultList io.github.dutrevis.MemoryMetrics
spark.metrics.conf.*.sink.graphite.class org.apache.spark.metrics.sink.GraphiteSink
spark.metrics.conf.*.sink.graphite.host your-graphite-host.com
spark.metrics.conf.*.sink.graphite.port 2003
spark.metrics.conf.*.sink.graphite.period 5
spark.metrics.conf.*.sink.graphite.unit seconds
For environments with static configurations, like ever-standing or shared clusters, consider adding the properties into the Spark CLI call or into the Spark job itself, such as shown in the examples below:
CLI example
bin/spark-shell --master yarn \
--packages io.github.dutrevis:spark-resources-metrics-plugin_2.12:0.1 \
--conf "spark.plugins"="io.github.dutrevis.MemoryMetrics" \
--conf "spark.metrics.conf.*.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink" \
--conf "spark.metrics.conf.*.sink.graphite.host"="your-graphite-host.com" \
--conf "spark.metrics.conf.*.sink.graphite.port"=2003 \
--conf "spark.metrics.conf.*.sink.graphite.period"=5 \
--conf "spark.metrics.conf.*.sink.graphite.unit"=seconds
PySpark example
from pyspark.sql import SparkSession
spark = (SparkSession.builder.appName("Instrumented app").master("yarn")
.config("spark.jars.packages", "io.github.dutrevis:spark-resources-metrics-plugin_2.12:0.1")
.config("spark.plugins", "io.github.dutrevis.MemoryMetrics")
.config("spark.metrics.conf.*.sink.graphite.class", "org.apache.spark.metrics.sink.GraphiteSink")
.config("spark.metrics.conf.*.sink.graphite.host", "your-graphite-host.com")
.config("spark.metrics.conf.*.sink.graphite.port", 2003)
.config("spark.metrics.conf.*.sink.graphite.period", 5)
.config("spark.metrics.conf.*.sink.graphite.unit", "seconds")
.getOrCreate()
)
Then run your Spark job and check the output of a Graphite plaintext protocol reader or exporter to find the metrics.
The plugin was developed using the technologies below, but feel free to try other solutions when contributing!
Developed to: | |
---|---|
Language: | |
CI: | |
Public distribution: |
The source code compiles and runs in these Scala versions:
Compiles | Runs | |
---|---|---|
Scala 2.12 | β | β |
Scala 2.13 | β | β |
Scala 3 | β | β |
classDiagram
direction RL
class ProcFileMetricCollector{
#String procFilePath
+getMetricValue(String procFileContent,String originalMetricName) Any
+defaultPathGetter(String s1, String s2) Path
+getProcFileContent(String path_getter, String file_reader) String
}
class MeminfoMetricCollector{
#String procFilePath
+getMetricValue(String procFileContent,String originalMetricName) Long
}
MeminfoMetricCollector --|> ProcFileMetricCollector
class StatMetricCollector{
#String procFilePath
+getMetricValue(String procFileContent,String originalMetricName) Long
}
StatMetricCollector --|> ProcFileMetricCollector
class MemoryMetrics{
+Map metricMapping
+registerMetric(MetricRegistry metricRegistry, String metricName, Metric metricInstance ) Unit
}
MemoryMetrics --|> MeminfoMetricCollector
class CPUMetrics{
+Map metricMapping
+registerMetric(MetricRegistry metricRegistry, String metricName, Metric metricInstance ) Unit
}
CPUMetrics --|> StatMetricCollector
Spark Resources Metrics plugin
is inspired by cerndb/SparkPlugins and LucaCanali/sparkMeasure.
Conceptually, Spark Resources Metrics plugin
is very similar to SparkPlugins, but:
- It aims to collect metrics that other less supported metric systems used to cover, like Ganglia;
- It is modular for each resource, being more complete and flexible to use;
- It is more easily extensible, with its development considering design patterns and unit tests.
- π Use within Databricks clusters to cover Ganglia's outage!
- π Use with CERN's Spark-Dashboard
- βοΈ Star
Spark Resources Metrics Plugin
on GitHub βοΈ Follow me (@dutrevis) on Medium- π Follow me (@dutrevis) on GitHub