Flip is Fast, Lightweight pure-functional library for Information theory and Probability distribution. Flip aims to extract and process statistical features of the input data stream in a short time using only small memory. It has the following features:
- Estimate and summarize probability distributions with high speed and high accuracy using only limited memory for both stationary and non-stationary data streams
- Combine several probability distributions by using probability monad
- Generate random variables from many predefined and estimated probability disributions
- Measure similarity between two probability distribution using various (generalized) statistical distances (e.g., Kullback–Leibler divergence)
Flip is published to Maven Central and built for Scala 2.12, so you can add the following to your build.sbt
:
libraryDependencies += "com.xxxnell" %% "flip" % "0.0.4"
Sketch
is the probablistic data structure that quickly measures the probalility density for the real number random variable data stream with limited memory without prior knowledge. Simply put, Sketch
is a special histogram in which the width of each bin is adaptively adjusted to the input data stream, unlike conventional histograms, which require the user to specify the width and start/end point of the bin. It follows the change of probability distribution, and adapts to the sudden/incremental concept drift. Also, more than two Sketch
can be combined in monadic way. This is what we call the probability monad in functional programming. Sketch
is a better alternative to kernel density estimation and histogram in most cases.
Here is an example of how Sketch
estimates the density using the dataset sampled from the standard normal distribution.
import flip.implicits._
// get 100 random variables from standard normal distribution
val underlying = NumericDist.normal(0.0, 1.0)
val (_, samples) = underlying.samples(100)
// update samples to sketch
val sketch0 = Sketch.empty[Double]
val sketch1 = samples.foldLeft(sketch0) {
case (sketch, sample) => sketch.update(sample)
}
// analyze sketch
println(
s"Estimated Pr(0.0 ≤ x ≤ 1.0): ${sketch1.probability(0.0, 1.0)}, " +
s"Expected Pr(0.0 ≤ x ≤ 1.0): ${underlying.probability(0.0, 1.0)}\n" +
s"Estimated median: ${sketch1.median}, expected median: 0.0 \n" +
s"Sample from sketch: ${sketch1.sample._2} \n" +
s"KL-divergence: ${KLD(underlying, sketch1)}"
)
Here is an experiment result for a bimodal probabability density function consisting of two standard normal distributions centered at -2 and 2.
In this figure, the dashed orange line is the expected underlying probability distribution, and the blue bar is the probability distribution that Sketch
estimates. Sketch
assumes an initial bin with a uniform width, and estimates the first optimal bin at the update count of 50. Then Sketch
estimates new bins every 100 data updates, for example, 50, 150, 250, and so on.
Sketch
also adapts to any types of concept drift successfully. Here is an experiment result under the situation where the distribution that Sketch
is supposed to estimate is incrementally changing over time. The underlying distribution starts to change when the update count come to 300 and moves by +0.01 per one update count. Sketch
is good at predicting this moving distribution, although there is some lag. Also this lag can be reduced by adjusting the sensitivity to new data.
In all of these experiments, I did not provide any prior knowledge to predict the underlying distirbution accurately. It works precisely with the default configuration. For more example, see the experiment documentation. If you want to learn how to use Sketch
in a real world, see the code for these experiments.
Contributions are always welcome. Any kind of contribution, such as writing a unit test, documentation, bug fix, or implementing the density estimation algorithm of Sketch
in another language, is helpful. If you need some help, please contact me via email or twitter.
The master
branch of this repository contains the latest stable release of Flip. In general, pull requests should be submitted from a separate feature
branch starting from the develop
branch.
Fo more detail, see the contributing documentation.
All code of Flip is available to you under the MIT license.
Copyright the maintainers.