A Scala DSL for Stan.

ScalaStan depends on Scala, SBT, and CmdStan.

The `CMDSTAN_HOME`

environment variable should be set to the location of the CmdStan installation.
Alternatively, ScalaStan will work if `PATH`

contains `stanc`

within a valid CmdStan installation.

To use ScalaStan, add the following to your build:

`libraryDependencies += "io.gihub.cibotech" %% "scalastan" % "<version>"`

`com.cibo.scalastan`

contains the ScalaStan DSL (most importantly, the`StanModel`

trait).`com.cibo.scalastan.analysis`

contains analyses for ScalaStan models.`com.cibo.scalastan.ast`

contains the ScalaStan abstract syntax tree.`com.cibo.scalastan.data`

contains parsers for various data sources (R, for example).`com.cibo.scalastan.models`

contains reusable ScalaStan models.`com.cibo.scalastan.run`

contains the classes to handle compiling and running a model.`com.cibo.scalastan.transform`

contains ScalaStan transformations (optimizations, etc.).- Examples can be found in the
`com.cibo.scalastan.examples`

package in the integration test (`it`

) source directory. Run an example using the command`sbt it:run`

and choosing from the available examples.

The ScalaStan DSL is accessed by extending the `com.cibo.scalastan.StanModel`

trait.
Here's a simple example of linear regression:

```
import com.cibo.scalastan.StanModel
object MyApp extends App {
object MyModel extends StanModel {
val n = data(int(lower = 0))
val x = data(vector(n))
val y = data(vector(n))
val b = parameter(real())
val m = parameter(real())
val sigma = parameter(real(lower = 0.0))
sigma ~ stan.cauchy(0, 1)
y ~ stan.normal(m * x + b, sigma)
}
val xs: Seq[Double] = ???
val ys: Seq[Double] = ???
val results = MyModel
.withData(MyModel.x, xs)
.withData(MyModel.y, ys)
.run(chains = 5)
results.summary(System.out)
}
```

For comparison, the equivalent Stan program (without the data setup/output) is:

```
data {
int<lower=0> n;
vector[n] x;
vector[n] y;
}
parameters {
real b;
real m;
real<lower = 0> sigma;
}
model {
sigma ~ cauchy(0, 1);
y ~ normal(m * x + b, sigma);
}
```

Data declarations define the inputs to the model.
These go in the `data`

section in Stan:

```
data {
real<lower=0> x;
}
```

Using ScalaStan, this is encoded as:

`val x = data(real(lower = 0))`

The `lower`

and `upper`

bounds on values are optional in ScalaStan, just as in regular Stan. ScalaStan supports
the following data types:

`int([lower], [upper])`

// An integer`real([lower], [upper])`

// A real/double`vector(length, [lower], [upper])`

// A (column) vector of reals`rowVector(length, [lower], [upper])`

// A row vector of reals`matrix(rows, cols, [lower], [upper])`

// A matrix of reals`categorical()`

// A categorical type (this is a ScalaStan extension to handle categorical data that turns into an`int`

).

Arrays can be created for any data type with multiple dimensions by calling `apply`

on the type.
For example, to create a 2-dimensional array of `int`

:

`int()(j, k)`

Parameter declarations define the parameters to be inferred. In Stan, these go in the `parameters`

section:

```
parameters {
real y;
}
```

Using ScalaStan, this is encoded as:

`val y = parameter(real())`

The types for parameters are the same as those for data, however, Stan does not allow integral parameters.

Local values can be declared using the `local`

function, which behaves like the `data`

and `parameter`

functions,
but is only available within the `StanModel`

DSL, for example:

`val z = local(real())`

The `StanModel`

trait supports a DSL to describe the model, which can be expressed directly in the body
of the subclass.

Most arithmetic operators behave as one would expect (and identically to Stan), for example `+`

, `-`

, `*`

, `/`

,
and `%`

.

The logical operators `<`

, `<=`

, `>`

, `>=`

operate as expected, however, note that test for equality
is `===`

(three `=`

s instead of `==`

) and test for inequality is `=/=`

(instead of `!=`

).

Stan supports element-wise multiplication and division, which are `*:*`

and `/:/`

in ScalaStan.

The power operator is `^`

.

The transpose operator is `.t`

(for example, `x.t`

to transpose `x`

).

The assignment operator is `:=`

(instead of `=`

).

For loops use the standard Scala syntax, but using the `range`

function as a generator. For example:

```
for (i <- range(1, n)) {
// ...
}
```

This will cause `i`

to take on the values `1`

through `n`

inclusive.
Note that arrays are indexed starting from one in Stan.

The `range`

function can also be used for array slicing, for example:

`x(range(1, 5)) := y(range(2, 6))`

While loops are declared using the `loop`

function to avoid conflicting with the Scala `while`

statement.
For example:

```
loop(x < 10) {
// ...
}
```

Stan supports breaking out of loops using the `break`

keyword and starting the next iteration using
the `continue`

keyword. Both of these are available in ScalaStan (since Scala does not support
break or continue, these functions are named the same as in Stan).

Conditions take the form `when`

and `otherwise`

to avoid conflicting with the Scala `if`

and `else`

statements.
For example:

```
when(x > 1) {
// ...
} otherwise {
// ...
}
```

The `otherwise`

section is optional. Additional `when`

statements can be added to provide an "else if"
structure, but note the required `.`

:

```
when(x > 1) {
// ...
}.when(x < 0) {
// ...
} otherwise {
// ...
}
```

Conditionals in ScalaStan/Stan are not expressions, so there is a ternary operator to
support conditionals as part of an expression. The ternary operator is supported as an overload
of `when`

:

`y := when(x > 0, 1, 2)`

This will evaluate the condition `x > 0`

and return `1`

if true and `2`

if false.

The built-in Stan distributions are available in ScalaStan from the `stan`

object.

To indicate that a value is sampled from a distribution, the `~`

operator is used, for example:

`y ~ stan.normal(0.0, 1.0)`

This is roughly equivalent to the following:

`target += stan.normal(0.0, 1.0).lpdf(y)`

The built-in Stan functions are available in ScalaStan from the `stan`

object.

Transformed versions of data and parameters can be declared via the `transformed data`

and `transformed parameters`

sections in Stan. In ScalaStan, this is accomplished by extending the `TransformedData`

or `TransformedParameter`

classes from within a `StanModel`

. These classes take a parameter that is the type of the transformed value.
The body of the class provides a DSL to encode the transformation using the same constructs as the model.
Inside this DSL, the value is accessed using `result`

.

Here is a simple data transform to add `1`

to all elements of an array:

```
val xPlusOne = new TransformedParameter(vector(n)) {
result := x + 1
}
```

The transformed parameter can now be referenced instead of the actual parameter in the model. Note that a transformed parameter can be referenced in the output, however, if the transformed parameter is not used within the model, ScalaStan will not emit it. In this case, one should use a generated quantity instead.

User-defined functions are created by extending the `Function`

class from within a `StanModel`

.
This class takes an optional parameter to determine the return type (it is assumed to return `void`

if not specified). The body of the class provides a DSL for implementing the function. Inputs to
the function are specified using the `input`

function, which works much like the `local`

function,
but creates an input parameter instead of a local variable.
The `output`

function sets the return value and returns from the function.

Here is an example to add `1`

to all elements of an array:

```
val myFunc = new Function(vector(n)) {
val x = input(n)
output(x + 1)
}
```

Generated quantities provide a means of deriving quantities from parameters, data, and random number generation.
In ScalaStan, such quantities are created by extending the `GeneratedQuantity`

class within a `StanModel`

. This class
works like the data and parameter transform blocks, but is contained in the model. In addition to parameters and
data, a generated quantity can use random numbers drawn from a distribution.

Here is an example to draw a random number:

```
val rand = new GeneratedQuantity(real()) {
result := stan.normal(0.0, 1.0).rng
}
```

The values for data declarations are specified using the `withData`

method on the model (technically, this method
is on the `CompiledModel`

class, but there is an implicit conversion to compile the model). So, for example,
given a data declaration for an integer called `x`

and a model called `model`

, one obtains an updated model
with the data filled in for `x`

via:

`model.withData(x, data)`

Note that `data`

in the above must be the appropriate Scala type for the model. For example, an `int()`

in
ScalaStan must be assigned an `Int`

and a `real()`

must be assigned a `Double`

. Vectors and higher
dimensional objects are represented `Seq`

in Scala, which each dimension adding another `Seq`

.
Note that Scala is indexed from zero whereas Stan (and ScalaStan) is indexed from one.

It is also possible to assign data from a `DataSource`

. The following data sources are available:

`CsvDataSource`

: A data source for parsing CSV.`RDataSource`

: A data source for parsing R-formatted data.`TextDataSource`

: A data source for parsing a text file with an array of values.

For example, given a file (`data.R`

) with R-formatted
data, we can assign `x`

from the value named `X`

in the R data file using the `apply`

method:

```
val source = com.cibo.scalastan.data.RDataSource.fromFile("data.R")
model.withData(source(x, "X"))
```

Using data sources, it is also possible to load the data in Scala format for manipulation using
the `read`

method:

```
val source = com.cibo.scalastan.data.CsvDataSource.fromFile("data.csv")
val data: Seq[Double] = source.read(x, "X")
```

Before the model can be run, all inputs must be assigned. To assign different inputs, the `reset`

method
must be called. Using the `reset`

method will clear all assigned inputs and initial values, allowing the
model to be run multiple times with different inputs. The `reset`

method, like the `withData`

method, returns an updated copy of the model.

Initial values can be set for parameter declarations using the `withInitialValue`

method on the model (like
`withData`

, this is actually on the `CompiledModel`

class). Here is an example to set the initial value
for `b`

:

`model.withInitialValue(b, value)`

It is also possible to set the initial value to a range [-x, x] for all parameters using:

`model.withInitialValue(x)`

Once all inputs are assigned, the model can be run using the `run`

method. This method takes the following
optional parameters:

- chains: An integer specifying the number of chains to run in parallel. This defaults to 4.
- seed: An integer specifying the first random number seed to use or
`-1`

to use system time to select one. With multiple chains, the chain index is added to the seed to ensure each chain gets a different, but deterministic seed. - cache: A boolean specifying whether results should be cached. This defaults to
`true`

. When the results are cached, re-running the model with the same data set and seed as before will cause the results from the previous run to be returned rather than re-executing the model. - method: The method to use (
`RunMethod.Sample()`

,`RunMethod.Optimize()`

,`RunMethod.Variational()`

, or`RunMethod.Diagnose()`

). This defaults to`RunMethod.Sample()`

. Note that each`RunMethod`

is a case class that can accept additional parameters.

The return value of the `run`

method is a results object that can be used to extract values from the run.

The first run of a model will take a long time since the Stan code will need to be built. However,
the generated Stan executable is cached and reused, so additional runs will be fast. The SHA1 hash of the
generated code from ScalaStan is used to determine if the code has changed and needs to be rebuilt. The
cached models and results are stored in `$HOME/.scalastan`

. Old models and results are removed in a
least-recently used fashion.

The `run`

method on `CompiledModel`

returns a `StanResults`

object, which has several methods to extract
samples and statistics on the results of the run. These methods take a parameter (or transformed
parameter) as an argument.

For example, to get the mean of the samples for parameter `y`

:

`val m = results.mean(model.y)`

This will return the mean with the same shape as the input parameter. Note that the Scala data structure is indexed from zero (instead of one as it is in Stan).

For convenience, there is also a `get`

method on `StanResults`

to access the input data.

The results object also has a `summary`

method to output a summary of results like the `stansummary`

program:

`results.summary(System.out)`

It is possible to use ScalaStan to generate regular Stan code using the `emit`

function on the model.
For example:

```
object MyModel extends StanModel {
// ...
}
val writer = new java.io.PrintWriter("myModel.stan")
MyModel.emit(writer)
writer.close()
```

It is also possible to use ScalaStan to run existing Stan models. To use an existing model, the data
and parameters must be set up in ScalaStan, then the `loadFromFile`

(or `loadFromString`

) method on
`Model`

can be used to load the model. Once loaded, the model can be used just as a model implemented
directly in ScalaStan. For example:

```
object MyModel extends StanModel {
val n = data(int(lower = 0))
val x = data(vector(n))
val mu = parameter(real())
val sigma = parameter(real(lower = 0))
loadFromFile("normal.stan")
}
// ...
```