Graph diameter approximation on Spark
graphx-diameter is a Spark package that allows to approximate the
diameter of (weighted) graphs, that is the longest shortest path.
The algorithm implemented here is derived from the ones described in the following papers
If you use this software in your research, please cite the aforementioned papers.
NOTE: the implementation contained in this package is not the
one used to perform the experiments described in the aforementioned
papers. That implementation is available under the GPL license
here, and was developed on
graphx-diameter, instead, provides an equivalent
implementation compatible with
The diameter of a graph can be obtained by computing all pairs shortest paths. However, computing APSP is impractical on large graphs, due to the excessive space and time requirements.
To compute an approximation to the diameter using only linear space, one can resort to a simple Single Source Shortest Path (SSSP) computation, that approximates the diameter within a factor of two. The drawback of this approach is that it requires a number of rounds linear in the diameter itself: on a platform such as Spark, where for efficiency we seek to minimize the number of rounds, this is undesirable.
A popular approach to approximate the diameter (and some centrality measures) are algorithms based on probabilistic counters, like HyperANF [Boldi, Rosa, Vigna - WWW11]. These algorithms are able to attain a tight estimate of the diameter. However, in a distributed computing setting like Spark, the running time linear in the diameter and the superlinear space requirements limit the applicability of this approach.
Therefore, we developed the algorithm implemented in this library with two goals:
- performing a number of rounds sublinear in the diameter
- using space linear in the size of the graph
The algorithm returns an approximation of the diameter in the form of an upper bound, with a provable polylogarithmic bound. In practice, the approximation factor is usually a small constant.
Further details on the algorithm, its efficiency, and the approximation factor are given in the aforementioned papers:
For the unweighted case: "Space and Time Efficient Parallel Graph Decomposition, Clustering, and Diameter Approximation";
For the weighted case: "A Practical Parallel Algorithm for Diameter Approximation of Massive Weighted Graphs".
graphx-diameter is cross compiled for both Scala 2.10 and 2.11. You
can include it in your project in several ways.
spark-shell, pyspark, or spark-submit
_2.xx appended to the package name must match the Scala
version that is run by the
$ $SPARK_HOME/bin/spark-shell --packages it.unipd.dei:graphx-diameter_2.10:0.1.0
If you use the
plugin then add the following line to your
spDependencies += "Cecca/graphx-diameter:0.1.0"
Otherwise, you can add
graphx-diameter as a normal dependency
libraryDependencies += "it.unipd.dei" %% "graphx-diameter" % "0.1.0"
Again, be sure that the
_2.xx suffix matches the Scala version you
already use in your project.
<dependencies> <!-- list of dependencies --> <dependency> <groupId>it.unipd.dei</groupId> <artifactId>graphx-diameter_2.11</artifactId> <version>0.1.0</version> </dependency> </dependencies>
The library works on graphs with
Double edge weights assigned to
edges. The package
it.unipd.dei.graphx.diameter defines a type
Distance like the following
type Distance = Double
The algorithm takes two parameters, namely
target: this is the size of the quotient graph that will be built by the underlying clustering algorithm. It depends on the size of the local memory of the machines. The last step of the algorithm computes the diameter of a graph of size
target. Higher values of
targetcan result is shorter running times, whereas smaller ones require less memory. Usually
target == 4000provides a good compromise, and this is the default.
delta: this parameter, representing a distance, controls the number of nodes and edges that can be active in each step of the algorithm. Intuitively, higher values will result in fewer but slower rounds; smaller values will perform more shorter rounds. In any case, this parameter is taken as a hint by the algorithm, that then auto-tunes itself. A good initial guess is (empirically) the average edge weight, which is the default.
For more details on these two parameters, we refer to the companion papers.
org.apache.spark.graphx.Graph[V, Distance] object, you can
get an approximation to its diameter as follows, using implicit
// import implicit conversions import it.unipd.dei.graphx.diameter.DiameterApproximation._ val g = // ... build the graph object ... // Compute the approximation using the default parameters g.diameterApprox() // Specify the target size for the underlying clustering algorithm g.diameterApprox(target=5000) // Control the number of active nodes/edges in each step g.diameterApprox(delta=0.5) // Both parameters can be specified simultaneously g.diameterApprox(target=5000, delta=0.5)
If you prefer to avoid implicit conversions, you can explicitly invoke
DiameterApproximation.run, as follows
import it.unipd.dei.graphx.diameter.DiameterApproximation val g = // ... build the graph object ... // Compute the approximation using the default parameters DiameterApproximation.run(g) // Specify the target size for the underlying clustering algorithm DiameterApproximation.run(g, target=5000) // Control the number of active nodes/edges in each step DiameterApproximation.run(g, delta=0.5) // Both parameters can be specified simultaneously DiameterApproximation.run(g, target=5000, delta=0.5)