Graph diameter approximation on Spark
graphxdiameter
is a Spark package that allows to approximate the diameter of (weighted) graphs, that is the longest shortest path.
The algorithm implemented here is derived from the ones described in the following papers

Space and Time Efficient Parallel Graph Decomposition, Clustering, and Diameter Approximation
Matteo Ceccarello, Andrea Pietracaprina, Geppino Pucci, and Eli Upfal
SPAA 2015 · bibtex file 
A Practical Parallel Algorithm for Diameter Approximation of Massive Weighted Graphs
Matteo Ceccarello, Andrea Pietracaprina, Geppino Pucci, and Eli Upfal
Arxiv preprint · bibtex file
If you use this software in your research, please cite the aforementioned papers.
NOTE: the implementation contained in this package is not the one used to perform the experiments described in the aforementioned papers. That implementation is available under the GPL license here, and was developed on plain Spark. graphxdiameter
, instead, provides an equivalent implementation compatible with graphx
.
Motivation
The diameter of a graph can be obtained by computing all pairs shortest paths. However, computing APSP is impractical on large graphs, due to the excessive space and time requirements.
To compute an approximation to the diameter using only linear space, one can resort to a simple Single Source Shortest Path (SSSP) computation, that approximates the diameter within a factor of two. The drawback of this approach is that it requires a number of rounds linear in the diameter itself: on a platform such as Spark, where for efficiency we seek to minimize the number of rounds, this is undesirable.
A popular approach to approximate the diameter (and some centrality measures) are algorithms based on probabilistic counters, like HyperANF [Boldi, Rosa, Vigna  WWW11]. These algorithms are able to attain a tight estimate of the diameter. However, in a distributed computing setting like Spark, the running time linear in the diameter and the superlinear space requirements limit the applicability of this approach.
Therefore, we developed the algorithm implemented in this library with two goals:
 performing a number of rounds sublinear in the diameter
 using space linear in the size of the graph
The algorithm returns an approximation of the diameter in the form of an upper bound, with a provable polylogarithmic bound. In practice, the approximation factor is usually a small constant.
Further details on the algorithm, its efficiency, and the approximation factor are given in the aforementioned papers:

For the unweighted case: "Space and Time Efficient Parallel Graph Decomposition, Clustering, and Diameter Approximation";

For the weighted case: "A Practical Parallel Algorithm for Diameter Approximation of Massive Weighted Graphs".
Linking
graphxdiameter
is cross compiled for both Scala 2.10 and 2.11. You can include it in your project in several ways.
sparkshell, pyspark, or sparksubmit
The suffix _2.xx
appended to the package name must match the Scala version that is run by the sparkshell
command.
$ $SPARK_HOME/bin/sparkshell packages it.unipd.dei:graphxdiameter_2.10:0.1.0
sbt
If you use the sbtsparkpackage plugin then add the following line to your build.sbt
spDependencies += "Cecca/graphxdiameter:0.1.0"
Otherwise, you can add graphxdiameter
as a normal dependency
libraryDependencies += "it.unipd.dei" %% "graphxdiameter" % "0.1.0"
Maven
Again, be sure that the _2.xx
suffix matches the Scala version you already use in your project.
<dependencies>
<! list of dependencies >
<dependency>
<groupId>it.unipd.dei</groupId>
<artifactId>graphxdiameter_2.11</artifactId>
<version>0.1.0</version>
</dependency>
</dependencies>
Usage
The library works on graphs with Double
edge weights assigned to edges. The package it.unipd.dei.graphx.diameter
defines a type Distance
like the following
type Distance = Double
The algorithm takes two parameters, namely

target
: this is the size of the quotient graph that will be built by the underlying clustering algorithm. It depends on the size of the local memory of the machines. The last step of the algorithm computes the diameter of a graph of sizetarget
. Higher values oftarget
can result is shorter running times, whereas smaller ones require less memory. Usuallytarget == 4000
provides a good compromise, and this is the default. 
delta
: this parameter, representing a distance, controls the number of nodes and edges that can be active in each step of the algorithm. Intuitively, higher values will result in fewer but slower rounds; smaller values will perform more shorter rounds. In any case, this parameter is taken as a hint by the algorithm, that then autotunes itself. A good initial guess is (empirically) the average edge weight, which is the default.
For more details on these two parameters, we refer to the companion papers.
Given a org.apache.spark.graphx.Graph[V, Distance]
object, you can get an approximation to its diameter as follows, using implicit conversions
// import implicit conversions
import it.unipd.dei.graphx.diameter.DiameterApproximation._
val g = // ... build the graph object ...
// Compute the approximation using the default parameters
g.diameterApprox()
// Specify the target size for the underlying clustering algorithm
g.diameterApprox(target=5000)
// Control the number of active nodes/edges in each step
g.diameterApprox(delta=0.5)
// Both parameters can be specified simultaneously
g.diameterApprox(target=5000, delta=0.5)
If you prefer to avoid implicit conversions, you can explicitly invoke DiameterApproximation.run
, as follows
import it.unipd.dei.graphx.diameter.DiameterApproximation
val g = // ... build the graph object ...
// Compute the approximation using the default parameters
DiameterApproximation.run(g)
// Specify the target size for the underlying clustering algorithm
DiameterApproximation.run(g, target=5000)
// Control the number of active nodes/edges in each step
DiameterApproximation.run(g, delta=0.5)
// Both parameters can be specified simultaneously
DiameterApproximation.run(g, target=5000, delta=0.5)