## 15.3 Arithmetic Precision

The choice of significand provides \(\log_{10} 2^{53} \approx 15.95\)
decimal (base 10) digits of *arithmetic precision*. This is just the
precision of the floating-point representation. After several
operations are chained together, the realized arithmetic precision is
often much lower.

### 15.3.1 Rounding and probabilities

In practice, the finite amount of arithmetic precision leads to
*rounding*, whereby a number is represented by the closest
floating-point number. For example, with only 16 decimal digits of
accuracy,

`1 + 1e-20 == 1`

The closest floating point number to \(1 + 10^{-20}\) turns out to be \(1\) itself. By contrast,

`0 + 1e-20 == 1e-20`

This highlights the fact that precision depends on scale. Even though
`1 + 1e-20 == 1`, we have `1e-20 + 1e-20 == 2e-20`, as expected.
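These identities are easy to check directly; the following sketch uses Python, whose `float` is the same IEEE 754 double-precision type.

```python
# Precision depends on scale: 1e-20 vanishes next to 1, but not next to 0.
print(1 + 1e-20 == 1)          # True: 1e-20 is below 1's half-ulp
print(0 + 1e-20 == 1e-20)      # True: representable on its own scale
print(1e-20 + 1e-20 == 2e-20)  # True: exact at this scale
```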

Rounding also manifests itself in a lack of *associativity*. In
particular, it does *not* usually hold for three floating point numbers
\(a, b, c\) that \((a + b) + c = a + (b + c)\).
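A concrete instance, sketched in Python (whose `float` is an IEEE 754 double):

```python
# Associativity fails: the grouping of additions changes the result.
a, b, c = 1.0, 1e-16, 1e-16
left = (a + b) + c    # each 1e-16 rounds away against 1.0
right = a + (b + c)   # 2e-16 is large enough to survive
print(left == right)  # False
```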

In statistical applications, problems often manifest in situations where users expect the usual rules of real-valued arithmetic to hold. Suppose we have a lower triangular matrix \(L\) with strictly positive diagonal, so that it is the Cholesky factor of a positive-definite matrix \(L \, L^{\top}\). In practice, rounding and loss of precision may render the result \(L \, L^{\top}\) neither symmetric nor positive definite.

In practice, care must be taken to defend against rounding. For example, symmetry may be restored by adding \(L \, L^{\top}\) to its transpose and dividing by two, or by copying the lower triangular portion into the upper portion. Positive definiteness may be maintained by adding a small quantity to the diagonal.
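A hypothetical sketch of both defenses in Python with NumPy (the matrix and the jitter size `1e-10` are illustrative choices, not Stan internals):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
L = np.linalg.cholesky(A @ A.T + 4 * np.eye(4))  # some Cholesky factor
S = L @ L.T                            # may not be exactly symmetric

S_sym = 0.5 * (S + S.T)                # defense 1: average with the transpose
S_pd = S_sym + 1e-10 * np.eye(4)       # defense 2: small diagonal jitter
np.linalg.cholesky(S_pd)               # succeeds: symmetric positive definite
```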

### 15.3.2 Machine precision and the asymmetry of 0 and 1

The smallest number greater than zero is roughly \(0 + 10^{-323}\). The largest number less than one is roughly \(1 - 10^{-15.95}\). The asymmetry is apparent when considering the representation of that largest number smaller than one: the exponent is of no help, and the number is represented as the binary equivalent of \(0.9999999999999999\).

For this reason, the *machine precision* is said to be roughly
\(10^{-15.95}\). This constant is available as `machine_precision()` in
Stan.
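Python exposes a closely related constant, `sys.float_info.epsilon` (\(2^{-52}\), one unit in the last place at 1.0):

```python
import sys

eps = sys.float_info.epsilon     # 2^-52, roughly 2.22e-16
print(1.0 + eps == 1.0)          # False: eps is the smallest visible increment
print(1.0 + eps / 2 == 1.0)      # True: half an ulp rounds away (to even)
```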

### 15.3.3 Complementary and epsilon functions

Special operations are available to mitigate rounding problems when
numbers get close to one. For example, consider the operation
`log(1 + x)` for positive `x`. When `x` is small (less than
\(10^{-16}\) for double-precision floating point), the sum in the
argument will round to 1 and the result will round to zero. To allow
more granularity, programming languages provide a library function
directly implementing \(f(x) = \log (1 + x)\). In Stan (as in C++),
this operation is written as `log1p(x)`. Because `x` itself may be
close to zero, the function `log1p(x)` can take the logarithm of
values very close to one, the results of which are close to zero.
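The difference is easy to see in Python, which provides the same `log1p` in its `math` module:

```python
import math

x = 1e-20
print(math.log(1 + x))  # 0.0: 1 + x rounds to 1 before the log is taken
print(math.log1p(x))    # ~1e-20: full precision preserved
```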

A similar situation arises with the complementary cumulative distribution function (CCDF), defined by \(F^{\complement}_Y(y) = 1 - F_Y(y)\), where \(F_Y\) is the cumulative distribution function (CDF) for the random variable \(Y\). Working with the CCDF allows values very close to one to be represented in complementary form.
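The same pattern appears with other complementary functions; for example, Python's `math.erfc(y)` computes \(1 - \operatorname{erf}(y)\) without rounding to zero:

```python
import math

y = 6.0
print(1 - math.erf(y))  # 0.0: erf(6) rounds to exactly 1
print(math.erfc(y))     # ~2.15e-17: the complementary form keeps precision
```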

### 15.3.4 Catastrophic cancellation

Another downside to floating point representations is that subtraction of two numbers close to each other results in a loss of precision that depends on how close they are. This is easy to see in practice. Consider \[\begin{align*} 1&.23456789012345 \\ - 1&.23456789012344 \\ = 0&.00000000000001 \end{align*}\] We start with fifteen decimal places of accuracy in the arguments and are left with a single decimal place of accuracy in the result.
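Running the subtraction confirms the loss (Python sketch; `float` is a double):

```python
a = 1.23456789012345
b = 1.23456789012344
diff = a - b
print(diff)           # roughly 1e-14, but the trailing digits are noise
print(diff == 1e-14)  # False: only about one significant digit survives
```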

Catastrophic cancellation arises in statistical computations whenever
we calculate variance for a distribution with small standard
deviations relative to its location. When calculating summary
statistics, Stan uses *Welford’s algorithm* for computing variances.
This avoids catastrophic cancellation and may also be carried out in a
single pass.
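A minimal Python sketch of Welford's algorithm (Stan's actual implementation is in C++; this just illustrates the recurrence):

```python
def welford_variance(xs):
    """One-pass sample variance without catastrophic cancellation."""
    mean, m2, n = 0.0, 0.0, 0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n           # running mean update
        m2 += delta * (x - mean)    # note: uses the *updated* mean
    return m2 / (n - 1)

# Large location, tiny spread: a naive sum-of-squares formula would
# lose all precision here; Welford's recurrence does not.
print(welford_variance([1e9, 1e9 + 1, 1e9 + 2]))  # 1.0
```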

### 15.3.5 Overflow

Even though `1e200` may be represented as a double precision floating
point value, there is no finite value large enough to represent
`1e200 * 1e200`. The result of `1e200 * 1e200` is said to *overflow*.
The IEEE 754 standard requires the result to be positive infinity.

Overflow is rarely a problem in statistical computations. If it is, it’s possible to work on the log scale, just as for underflow as described below.
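In Python (same IEEE 754 doubles):

```python
import math

x = 1e200
print(x * x)              # inf: the product overflows
print(math.isinf(x * x))  # True, as IEEE 754 requires
print(2 * math.log(x))    # ~921.03: the log scale stays comfortably finite
```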

### 15.3.6 Underflow and the log scale

When there is no number small enough to represent a result, it is said
to *underflow*. For instance, `1e-200` may be represented, but
`1e-200 * 1e-200` underflows so that the result is zero.
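Again in Python:

```python
x = 1e-200
print(x * x)         # 0.0: the true value 1e-400 underflows to zero
print(x * x == 0.0)  # True
```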

Underflow is a ubiquitous problem in likelihood calculations. For example, if \(p(y_n \mid \theta) < 0.1\), then \[ p(y \mid \theta) = \prod_{n=1}^N p(y_n \mid \theta) \] will underflow as soon as \(N > 350\) or so.

To deal with underflow, work on the log scale. Even though \(p(y \mid \theta)\) can’t be represented, there is no problem representing \[ \begin{array}{rcl} \log p(y \mid \theta) & = & \log \prod_{n=1}^N p(y_n \mid \theta) \\[4pt] & = & \sum_{n = 1}^N \log p(y_n \mid \theta) \end{array} \]

This is why all of Stan’s probability functions operate on the log scale.
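A toy sketch in Python (the constant likelihood \(0.05\) per observation is hypothetical):

```python
import math

N, p = 400, 0.05           # 400 observations, each contributing 0.05
direct = 1.0
for _ in range(N):
    direct *= p            # the running product underflows
print(direct)              # 0.0: the likelihood is lost entirely

log_lik = N * math.log(p)  # the same computation on the log scale
print(log_lik)             # about -1198.3: perfectly representable
```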

### 15.3.7 Adding on the log scale

If we work on the log scale, multiplication is converted to addition, \[ \log (a \times b) = \log a + \log b. \] Thus we can just start on the log scale and stay there through multiplication. But what about addition? If we have \(\log a\) and \(\log b\), how do we get \(\log (a + b)\)? Working out the algebra, \[ \log (a + b) = \log (\exp(\log a) + \exp(\log b)) \]

The log of a sum of exponentials is so common that it has its own name, \[ \texttt{log}\mathtt{\_}\texttt{sum}\mathtt{\_}\texttt{exp}(u, v) = \log (\exp(u) + \exp(v)), \] so that \[ \log (a + b) = \texttt{log}\mathtt{\_}\texttt{sum}\mathtt{\_}\texttt{exp}(\log a, \log b). \]

Although it appears this might overflow as soon as exponentiation is introduced, evaluation does not proceed by evaluating the terms as written. Instead, with a little algebra, the terms are rearranged into a stable form, \[ \texttt{log}\mathtt{\_}\texttt{sum}\mathtt{\_}\texttt{exp}(u, v) = \max(u, v) + \log\big( \exp(u - \max(u, v)) + \exp(v - \max(u, v)) \big). \]

Because the terms inside the exponentiations are \(u - \max(u, v)\) and \(v - \max(u, v)\), one will be zero, and the other will be negative. Because the operation is symmetric, it may be assumed without loss of generality that \(u \geq v\), so that \[ \texttt{log}\mathtt{\_}\texttt{sum}\mathtt{\_}\texttt{exp}(u, v) = u + \log\big(1 + \exp(v - u)\big). \]
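A direct Python transcription of the stable form:

```python
import math

def log_sum_exp(u, v):
    """Stable log(exp(u) + exp(v)) via the max-shift rearrangement."""
    m = max(u, v)
    return m + math.log(math.exp(u - m) + math.exp(v - m))

# The naive form would need exp(1000), which overflows; the shifted
# form only ever exponentiates values <= 0.
print(log_sum_exp(1000.0, 999.0))  # ~1000.3133
```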

The inner term may itself be evaluated using `log1p`, but there is
only limited gain, because \(\exp(v - u)\) is only near zero when \(u\) is
much larger than \(v\), meaning the result is likely to round to \(u\)
anyway.

To conclude, when evaluating \(\log (a + b)\) given \(\log a\) and \(\log b\), and assuming \(\log a > \log b\), return

\[ \log (a + b) = \log a + \texttt{log1p}\big(\exp(\log b - \log a)\big). \]
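Putting it together as a Python sketch, with the inner term evaluated via `log1p` as above:

```python
import math

def log_add_exp(log_a, log_b):
    """log(a + b) given log a and log b, in either order."""
    if log_a < log_b:
        log_a, log_b = log_b, log_a    # ensure log_a >= log_b
    return log_a + math.log1p(math.exp(log_b - log_a))

print(log_add_exp(math.log(2.0), math.log(3.0)))  # ~1.609, i.e. log(5)
```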