Compression

Lossy compression is a widely used data reduction technique in scientific computing. By accepting a controlled loss of precision, compressors such as SZ and ZFP can achieve significant reductions in data volume while preserving the features that matter for downstream analysis. DTLMod models the performance impact of compression on in situ workflows without actually compressing any data: it simulates the computational cost of compression and decompression and adjusts the volume of data transported through the DTL according to a compression ratio.

How compression works in DTLMod

Unlike decimation, compression does not change the shape of a variable. A \(1000 \times 1000\) array remains a \(1000 \times 1000\) array after compression. What changes is the byte-size of the variable: the number of bytes transported through the DTL is divided by the compression ratio. This reflects the fact that real-world lossy compressors produce a bitstream that is smaller than the original data but still represents all the elements of the array.

Compression is a publisher-side only operation. Applying compression on the subscriber side is not meaningful because compression aims to reduce the volume of data that needs to be transported—which requires intervention before the data leaves the publisher.

Compressor profiles

DTLMod provides three ways to determine the compression ratio for a variable:

Fixed ratio. The simplest option: you directly specify the desired compression ratio. This is useful when you already know, from experiments or from the literature, the compression ratio achieved by a particular compressor on data similar to yours. The ratio must be at least 1.0 (a ratio of 1 means no size reduction).

SZ profile. This profile is inspired by the SZ lossy compressor, a prediction-based algorithm. SZ achieves high compression ratios on smooth scientific data because it can accurately predict neighboring values and only store the (small) prediction errors. The compression ratio is derived from two user-specified parameters:

accuracy (or error bound): the maximum acceptable pointwise error. Tighter accuracy requirements reduce the compression ratio because more bits are needed to represent the prediction residuals.
data smoothness: a value between 0 and 1 that characterizes how regular the data is. Smooth data (e.g., temperature fields) yields higher compression ratios because predictions are more accurate. Noisy or turbulent data yields lower ratios.

The model computes the ratio as:

\[r = \max\!\Big(1,\;\alpha \cdot \left(-\log_{10} \varepsilon\right)^{\beta} \cdot (0.5 + \sigma)\Big)\]

where \(\varepsilon\) is the accuracy, \(\sigma\) is the data smoothness, and \(\alpha = 3.0\), \(\beta = 0.8\) are empirical parameters fitted from published benchmarks on scientific datasets.

ZFP profile. This profile is inspired by the ZFP compressor, a transform-based algorithm. ZFP organizes data into small blocks, applies a near-orthogonal transform, and encodes the resulting coefficients with a fixed number of bits per value. The compression ratio depends primarily on the requested accuracy:

\[\text{rate} = \max(1,\;-\log_2 \varepsilon + 1) \quad;\quad r = \frac{64}{\text{rate}}\]

where the rate represents the number of bits per double-precision value after compression. Higher accuracy requirements increase the rate and therefore decrease the compression ratio.

Compression and decompression costs

Two independent cost parameters control the simulated computational overhead of compression:

compression cost per element: the number of floating-point operations incurred per array element when compressing the data on the publisher side.
decompression cost per element: the number of floating-point operations incurred per array element when decompressing the data on the subscriber side, after it has been received.

Both parameters default to 1.0. The total compression cost for a variable is computed as:

\[C_{\text{compress}} = c_{\text{comp}} \times \frac{N_{\text{local}}}{\text{element\_size}}\]

\[C_{\text{decompress}} = c_{\text{decomp}} \times \frac{N_{\text{local}}}{\text{element\_size}}\]

where \(N_{\text{local}}\) is the local size of the variable in bytes and \(\text{element\_size}\) is the size of one array element. The compression cost is incurred by the publisher right before putting the variable into the DTL, and the decompression cost is incurred by the subscriber right after receiving it.

Per-transaction variability

In practice, the compression ratio achieved on a given variable varies from one time step to the next as the data evolves. DTLMod models this variability through an optional ratio variability parameter that introduces a bounded, deterministic perturbation around the nominal compression ratio at each transaction.

For a given transaction \(t\), the effective ratio is computed as:

\[r_{\text{eff}}(t) = \max\!\Big(1,\; r \cdot \big(1 + \delta \cdot (2 h(t) - 1)\big)\Big)\]

where \(r\) is the nominal compression ratio, \(\delta\) is the ratio variability, and \(h(t)\) is a deterministic value in \([0, 1]\) derived from a hash of the variable name and the transaction id. This ensures that simulations are fully reproducible: the same transaction always produces the same effective ratio.

The effective ratio is used by Engine::put() to determine the actual transfer size for each transaction. The ReductionMethod::get_reduced_variable_global_size() and ReductionMethod::get_reduced_variable_local_size() query methods also accept an optional transaction_id argument so that users can inspect the effective size for any transaction before running the simulation.

Re-parameterization

As with decimation, compression parameters can be updated between transactions. You can change the compression ratio, switch compressor profiles, adjust accuracy or smoothness, or modify the cost parameters for a variable that is already being compressed. Only the parameters that are explicitly provided in the update are modified; the others retain their previous values. This supports the simulation of adaptive compression strategies that adjust their settings in response to changes in the data.