from the hot shores of the black forest
speed is a compile-time-only macro library that provides functionality similar to Scala collections' views, but with an efficient implementation that inlines the complete processing pipeline into (mostly) a single while loop at the use site.
As an example, this expression:

```scala
array.speedy.map(x => x * x).foldLeft(0)(_ + _)
```

will be converted into code roughly equivalent to this:

```scala
var acc = 0
var i = 0
while (i < array.length) {
val x = array(i)
acc = acc + x * x
i += 1
}
acc
```

To use speed, add these settings to your sbt build:

```scala
resolvers += Opts.resolver.sonatypeReleases

// only needed at compile-time
libraryDependencies += "net.virtual-void" %% "speed" % "15" % "provided"
```
There are two principal ways of using speed. Either you explicitly request optimization by
"converting" a collection into a speedy one using `.speedy`, or you import the automatic
conversions, in which case supported collections are optimized automatically (automatic conversion
is currently only supported for `Range` and `Array`).
Similar to Java 8 Streams, a speed expression must start with a "generator" like a `Range` or
an `Array`, then apply some non-terminal processing steps, and finally end with a terminal operation
that produces the result. Intermediate results are usually not materialized into a collection and
therefore cannot be stored in a variable. Instead, use `to[Coll]` as a terminal step to create a result collection if needed.
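For example, a complete pipeline could look like this (a small sketch; the concrete range and operations are chosen just for illustration):

```scala
import speed._

// generator (Range) -> non-terminal steps (filter, map) -> terminal step (to[Vector])
val result = (1 to 100).speedy
  .filter(_ % 2 == 0)
  .map(x => x * x)
  .to[Vector]
```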
Add an import of speed to the source file:

```scala
import speed._
```

and then convert a collection to its speedy variant by adding `.speedy` (similar to how you would add `.view`):

```scala
array.speedy.map(_ + 2).sum // this will run fast
```

Ranges and arrays can also be optimized automatically by importing `speed.auto._` at the top of a
source file:
```scala
import speed.auto._

array.map(_ + 2).sum // this will also run fast
```

Semantics are similar to what you would expect from Scala collections' views. However, when in doubt, speed takes the liberty to diverge from the known behavior to allow more aggressive optimizations.
The basic optimization approach is to aggressively inline the whole processing pipeline into a single while loop.
Apart from that, speed aggressively tries to make use of static information in several ways (see the sketch after this list):

- special-cased range loops for the common cases (step = +/- 1)
- static optimizations over the operation structure to avoid work where possible
- a final constant-folding pass that tries to gather static information
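To illustrate the special-cased range loop, here is a hand-written sketch (not the exact macro output) of how a filtered sum over an inclusive range with a statically known step of 1 could be inlined:

```scala
// (1 to 1000).speedy.filter(_ % 3 == 0).sum, conceptually inlined
// for the special case step = +1; a sketch only, not the generated code
var acc = 0
var i = 1
while (i <= 1000) {
  if (i % 3 == 0) acc += i
  i += 1
}
acc
```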
Note: function arguments to `map`, `filter`, etc. are assumed to be pure and free of side-effects,
so that operations can be reordered for optimization. You cannot rely on side-effects in these functions,
as they may not be executed at all in speed-optimized code if the result of the function call isn't needed
for the final result.
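For instance, in a hypothetical pipeline like the following, the `println` may never run, because the mapped values are not needed to compute the size:

```scala
import speed._

// don't rely on this println: speed may drop the mapping entirely,
// since the mapped values don't influence the size
val n = Array(1, 2, 3).speedy.map { x => println(x); x * x }.size
```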
Supported collection types: `Range`, `Array`, `IndexedSeq`, and `List`.
speed currently supports a subset of interesting and important collection operations from the Scala collections.
Non-terminal operations (see the `NonTerminalOps` type):
- flatMap
- map
- filter
- withFilter
- reverse
- take
- (more to come)
Terminal operations (see the `TerminalOps` type):
- foldLeft
- foreach
- reduce
- sum
- min
- max
- count
- size
- mkString
- to[Coll[_]]
- forall
- (more to come)
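Several of these operations can be combined in one pipeline; here is a small illustrative sketch (the input array is made up):

```scala
import speed._

val array = Array(1, -2, 3, -4, 5)

// non-terminal steps (map, filter) ending in terminal steps (mkString, foldLeft)
val squares = (1 to 10).speedy.map(x => x * x).filter(_ % 2 == 0).mkString(", ")
val total   = array.speedy.filter(_ > 0).foldLeft(0)(_ + _)
```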
speed already employs some optimizations on the operations structure, like these (see the example after this list):

- try to push selection of elements down to the generator (like `take` or `reverse`)
- implement array operations in terms of range operations
- generate three different kinds of basic loops for ranges depending on static information
  about `start`, `end`, `step`, and `isInclusive`, if available
- for-comprehension pattern extraction support (avoid tuple generation)
- replace well-known `Numeric` and `Ordering` instances by primitive operations
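As an illustration of the pattern-extraction point (which exact cases speed handles is an assumption here): in a for-comprehension with a midstream definition, the plain desugaring builds a tuple per element just to carry the bound values along, which is the kind of overhead this optimization targets:

```scala
import speed.auto._

// plain desugaring builds an (i, sq) tuple for every element before
// filtering and yielding; pattern extraction can conceptually avoid that
val total = (for {
  i <- 1 to 100
  sq = i * i
  if sq % 3 == 0
} yield sq).sum
```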
In the words of the master of enrichment:
- Wrap an expression with `speed.impl.Debug.show` to show the generated code
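For example (assuming `Debug.show` simply wraps the expression to be inspected, as the line above describes):

```scala
import speed._

// prints the code generated for the wrapped expression at compile time
speed.impl.Debug.show {
  (1 to 1000).speedy.map(x => x * x).sum
}
```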
It is known that speed still optimizes away side-effects in user code too aggressively in some cases. If you encounter such an issue, please report it on the issue tracker.
Runtime (smaller is better; the reference is an equivalent while loop, "Scala" uses the same expression with views):
| Description | While | Speedy | Scala |
|---|---|---|---|
| foreach counting | 100 % | 102.89 % | 430.06 % |
| nested counting | 100 % | 98.63 % | 389.60 % |
| filtered summing | 100 % | 99.40 % | 861.21 % |
| mapped summing | 100 % | 100.15 % | 3985.92 % |
| array foreach counting | 100 % | 99.24 % | 450.01 % |
| array summing | 100 % | 100.23 % | 625.30 % |
| array filtered summing | 100 % | 103.45 % | 1137.95 % |
| array mapped summing | 100 % | 101.55 % | 2101.77 % |
| size of filtered ref array | 100 % | 98.30 % | 929.55 % |
| array map - filter - to | 100 % | 95.01 % | 291.28 % |
| list foreach counting | 100 % | 100.07 % | 100.87 % |
| list summing | 100 % | 104.40 % | 202.86 % |
| list filtered summing | 100 % | 104.42 % | 627.96 % |
| list mapped summing | 100 % | 100.08 % | 769.58 % |
| nested list summing | 100 % | 99.52 % | 1133.47 % |
(Don't be fooled: values < 100 % are most likely not significant, but I'm including them here anyway just for the giggles.)
(Don't be fooled, part 2: I obviously selected benchmarks in favor of speed; if you'd like to see other comparisons, please open a PR.)
Benchmark results can be regenerated by running the PerformanceSpecs in the code base. The benchmarked expressions are listed below.
foreach counting:

```scala
var counter = 0
for (i <- 1 to 1000) counter += i * i
counter
```

nested counting:

```scala
var counter = 0
for {
  i <- 1 to 100
  j <- 1 to 100
} counter += i * j
counter
```

filtered summing:

```scala
(1 to 1000).filter(_ % 3 == 0).sum
```

mapped summing:

```scala
(1 to 1000).map(i => i * i).sum
```

array foreach counting:

```scala
var counter = 0
for (x <- array) counter += x * x
counter
```

array summing:

```scala
array.sum
```

array filtered summing:

```scala
array.filter(_ % 3 == 0).sum
```

array mapped summing:

```scala
array.map(x => x * x).sum
```

size of filtered ref array:

```scala
class Ref(var num: Int = 0)
val N = 1000
val refs = (0 until N).map(i => new Ref(i)).toArray
refs
  .filter(_.num % 5 == 0)
  .filter(_.num % 7 == 0)
  .size
```

array map - filter - to:

```scala
array.map(x => x * x).filter(_ % 3 == 0).to[List]
```

list foreach counting:

```scala
var counter = 0
for (x <- list) counter += x * x
counter
```

list summing:

```scala
list.sum
```

list filtered summing:

```scala
list.filter(_ % 3 == 0).sum
```

list mapped summing:

```scala
list.map(x => x * x).sum
```

nested list summing:

```scala
(for (x <- list1; y <- list2) yield x * y).sum
```