sirthias / spliff   0.8.0

Mozilla Public License 2.0 GitHub

Efficient diffing in Scala

Scala versions: 3.x 2.13
Scala.js versions: 1.x

spliff

spliff is a small, zero-dependency Scala library providing efficient implementations of the diff algorithm and supporting logic presented by Eugene W. Myers in his 1986 paper "An O(ND) Difference Algorithm and Its Variations".

Myers' algorithm is the default diffing logic for many popular tools (like git diff) because it performs well (time- and memory-wise) and tends to produce diffing results that humans would consider "good".

Many improvements and refinements have been proposed over Myers' original algorithm since 1986.
The implementations provided by spliff are based on the work by Robert Elder.

spliff sports these features:

  • no dependencies except for the Scala standard library
  • fast (minimal support structures, light on allocations, cheap integer-logic)
  • optional detection of moves and replaces
  • stacksafe (i.e. heap-based rather than stack-based recursion)
  • operates on any data type (not only String)
  • customizable, type class based equality logic

spliff is available for Scala 2.13, Scala 3 and Scala.js.

Installation

The spliff artifacts live on Maven Central and can be tied into your SBT project like this:

libraryDependencies ++= Seq(
  "io.bullet" %% "spliff" % "0.8.0"
)

Usage

Since information about the differences between two sequences of arbitrary objects are useful in a very broad range of application contexts spliff provides several distinct "views" onto the diff data.
Simply pick the one(s) that are most suited to the task you are trying to solve:

1. Basic Operations

On the most basic level spliff uses Myers' algorithm to describe the difference between two sequences base and target as a number of Diff.Op.Delete and Diff.Op.Insert operations. Instances of these types can be regarded as mere "decorators" on the underlying base and target sequences as they don't hold any data elements themselves. They merely describe in terms of index ranges, which data elements must be deleted or inserted in order to transform base into target.
The least common super type of Diff.Op.Delete and Diff.Op.Insert is the Diff.Op.DelIns trait.

Example:

import io.bullet.spliff.Diff

// create a 'diff' between two `IndexedSeq[T]`
val diff = Diff(
  "the base sequence",
  "the target sequence"
)

// all delete operations required to get from `base` to `target`
val deletes: Seq[Diff.Op.Delete] = diff.deletes
deletes ==> ArraySeq(
  Delete(4, 1),
  Delete(6, 1)
)

// all insert operations required to get from `base` to `target`
val inserts: Seq[Diff.Op.Insert] = diff.inserts
inserts ==> ArraySeq(
  Insert(5, 4, 1),
  Insert(7, 6, 2),
  Insert(8, 9, 1)
)

// all deletes and inserts combined
val delIns: Seq[Diff.Op.DelIns] = diff.delInsOps
val delInsSorted: Seq[Diff.Op.DelIns] = diff.delInsOpsSorted // same but sorted by index

2. Higher-Level Operations

On the next higher-level spliff can refine the basic Diff.Op.DelIns operations by identifying Diff.Op.Move and, optionally, Diff.Op.Replace operations. While this raises the semantic level it doesn't change the fundamental "decorator-only" character of the diff result. Without access to both underlying sequences (base and target) this representation of the diff is likely of limited value only.

Example:

import io.bullet.spliff.Diff

// create a 'diff' between two `IndexedSeq[T]`
val diff = Diff(
  "the base sequence",
  "the sequence base !"
)

// the diff result as a list of 'delete', 'insert' and 'move' operations
val delInsMov: Seq[Diff.Op.DelInsMov] = diff.delInsMovOps
val delInsMovSorted: Seq[Diff.Op.DelInsMov] = diff.delInsMovOpsSorted // same but sorted by index

delInsMov ==> ArraySeq(
  Move(2, 16, 5),
  Insert(17, 17, 2)
)

// the diff result as a list of 'delete', 'insert', 'move' and 'replace' operations
val allOps: Seq[Diff.Op] = diff.allOps // already sorted by index

3. Patch

Building upon the Diff.Op.DelInsMov operations spliff can also represent the diff as a Diff.Patch[T]. A patch holds a compact representation of all data required to reconstruct the target sequence, given only the base. In addition to the information about which data are to be deleted from the base and/or moved to other positions a patch must therefore contain the actual data elements that are to be inserted.

Example:

import io.bullet.spliff.Diff

// create a 'diff' between two `IndexedSeq[T]`
val diff = Diff(
  "the base sequence",
  "the target sequence"
)

// create a batch
val patch: Diff.Patch[Char] = diff.patch

patch ==> Patch(
  baseSize = 17,
  targetSize = 19,
  steps = ArraySeq(
    Delete(4,1),
    Delete(6,1),
    Insert(5, ArraySeq('t')),
    Insert(7, ArraySeq('r', 'g')),
    Insert(8, ArraySeq('t'))
  )
)

// apply the patch to the base sequence
val newTarget = patch.apply("the base sequence")

// yields the original target sequence
newTarget ==> Right("the target sequence")

4. Chunks

In addition to the above spliff can represent the diff as a sequence of Diff.Chunk[T] instances, which partitions the base and target into a list of segments, each of which holds the respective data elements as well as meta data about where the chunk comes from (only the base, only the target, both sequences or distinct to each sequence). For some use cases this diff representation is more suitable and directly usable than the more basic operations or patches.

Example:

import io.bullet.spliff.Diff

// create a 'diff' between two `IndexedSeq[T]`
val diff = Diff(
  "the base sequence",
  "the target sequence"
)

// the diff represented as a sequence of "chunks"
val chunks: Seq[Diff.Chunk[Char]] = diff.chunks

chunks ==> Seq(
  Diff.Chunk.InBoth("the "),
  Diff.Chunk.Distinct("base", "target"),
  Diff.Chunk.InBoth(" sequence")
)

5. Bimap

Sometimes you need a (bidirectional) mapping between the indices of base and the indices of target. spliff makes this readily available.

Example:

import io.bullet.spliff.Diff

// create a 'diff' between two `IndexedSeq[T]`
val diff = Diff(
  "the base sequence",
  "the target sequence"
)

// a bidirectional mapping between each individual base and target index, where possible
val bimap: Diff.Bimap = diff.bimap

bimap.baseToTargetIndex(10) ==> Some(12)
bimap.targetToBaseIndex(12) ==> Some(10)

6. Longest Common Subsequence and Min Edit Distance

Finally spliff offers these two supporting functions:

import io.bullet.spliff.Diff

val base = "the base sequence"
val target = "the target sequence"

// determines the longest subsequence of elements that is present in both sequences
val lcs: Seq[Char] = Diff.longestCommonSubsequence(base, target)
lcs.mkString ==> "the ae sequence"

// determines the minimum number of edits required to transform `base` into `target`,
// whereby one "edit" corresponds to deleting or inserting one single element
val distance: Int = Diff.minEditDistance(base, target)
distance ==> 6

For further information and more detailed API documentation just look at the source code, which is hopefully not that hard to read. (Even though the algorithmic core parts themselves are certainly quite dense.)

Why "spliff"?

The name spliff is a portmanteau of the words "split" and "difference" alluding to the core principle of Myers' algorithm, which divides the problem of finding a suitable diff into two parts, that are then solved separately and recursively.

There is, of course, no relationship with other, potentially overloaded meanings of the word "spliff".

License

spliff is released under the MPL 2.0, which is a simple and modern weak copyleft license.

Here is the gist of the terms that are likely most important to you (disclaimer: the following points are not legally binding, only the license text itself is):

If you'd like to use spliff as a library in your own applications:

  • spliff is safe for use in closed-source applications. The MPL share-alike terms do not apply to applications built on top of or with the help of spliff.

  • You do not need a commercial license. The MPL applies to spliff's own source code, not your applications.

If you'd like to contribute to spliff:

  • You do not have to transfer any copyright.

  • You do not have to sign a CLA.

  • You can be sure that your contribution will always remain available in open-source form and will not become a closed-source commercial product (even though it might be used by such products!)

For more background info on the license please also see the official MPL 2.0 FAQ.