Structured Linear CDEs

SLiCE variants compared: dense, diagonal, DPLR, sparse, Walsh–Hadamard, block-diagonal
SLiCE structures and their origins

Introduction

This post is an introduction to Structured Linear Controlled Differential Equations (SLiCEs): a framework for sequence models that balance expressivity with efficient parallel-in-time computation by using structured, input-dependent state-transition matrices. This framework was introduced in our NeurIPS 2025 spotlight paper. We start by outlining how SLiCEs connect to linear recurrent neural networks and linear neural controlled differential equations, before motivating why expressivity matters. We then introduce SLiCEs and compare their properties, finishing with some empirical results.

Linear Recurrent Neural Networks

Given observations from a time series $\{x_i\}_{i=0}^n$, an RNN has the generic form

$$ h_{i+1} = g_\theta(h_i, x_i). $$

Linear RNNs further restrict to

$$ h_{i+1} = W_\theta h_i + B_\theta x_i = h_i + A_\theta h_i + B_\theta x_i, $$

where $W_\theta = I + A_\theta \in \mathbb{R}^{d_h \times d_h}$ and $B_\theta \in \mathbb{R}^{d_h \times d_x}$. Just as residual networks can be seen as an Euler discretisation of a Neural ODE, the linear RNN is an Euler discretisation of

$$ \frac{\mathrm{d}h_s}{\mathrm{d}s} = A_\theta h_s + B_\theta X_s, $$

where $X_s$ is an interpolation of the input sequence $\{x_i\}_{i=0}^n$. Due to the similarity to classical state-space models, a popular class of linear RNNs are known as structured state-space models (SSMs). The simple recurrence allows for parallelisation via associative scans or convolution. However, linear RNNs are not very expressive, which motivates making the state-transition matrix depend on the input:

$$ \frac{\mathrm{d}h_s}{\mathrm{d}s} = A_\theta(X_s) h_s + B_\theta X_s. $$

For example, Mamba takes this approach. Although no longer time invariant, these models can still be computed in parallel using associative scans. A natural question is: How expressive are input-dependent linear RNNs?

Quick Tangent on Names

These models are often called (input-dependent) linear RNNs, with many variants detailed in Table 2 of this paper. Since the update has multiplicative interactions between $X_s$ and $h_s$, they are not linear in $(h_s, X_s)$. A more descriptive term would be linear multiplicative RNNs or linear 2-RNNs, both of which are nonlinear RNNs where the matrix multiplying the hidden state depends linearly on the input. See here and here for examples. These names would emphasise that “linear” refers to the absence of pointwise nonlinearities on the hidden state update.

Linear Neural Controlled Differential Equations

Linear neural controlled differential equations (LNCDEs) are continuous-time models where the state-transition depends linearly on the increments of the input:

$$ \mathrm{d}h_s = \sum_{j=1}^{d_\omega} A^j_\theta\, h_s\, \mathrm{d}\omega^{X, j}_s + B_\theta\, \mathrm{d}\xi^{X}_s. $$

Taking $\mathrm{d}\omega^{X,j}_s = X^j_s\, \mathrm{d}s$ and $\mathrm{d}\xi^X_s = X_s\, \mathrm{d}s$, we recover the input-dependent linear RNN while restricting $A_\theta$ to be linear in $X_s$. Thus, LNCDEs are the continuous-time analogue of linear 2-RNNs. As shown in our NeurIPS 2024 paper, linear dependence in the state transition is sufficient for maximal expressivity (universality), which answers how expressive generic input-dependent linear RNNs are. However, S4D and Mamba are special cases of LNCDEs with diagonal $A^j_\theta$; this restriction is computationally efficient but limits expressivity.

A Simple Parity Example

Consider a stream of bits $x_0,x_1,x_2,\dots\in\{0,1\}$ and the parity label

$$ p_n = S_n \bmod 2 \in \{0,1\}, \quad S_n = \sum_{k=0}^{n-1} x_k. $$

Whenever $x_i=1$ the label flips; if $x_i=0$ the label stays the same. Take

$$ \frac{\mathrm{d}h_s}{\mathrm{d}s} = A_\theta X_s h_s, $$

with $h_s\in\mathbb{R}^2$, piecewise constant $X_s=x_i$ for $s\in[i,i+1)$, and diagonal $A_\theta$. Then

$$ h_{n+1} = \exp\left(\begin{bmatrix} a_1 & 0 \\ 0 & a_2\end{bmatrix} x_n\right) h_n, \qquad h_n^{i}=h_0^{i} e^{a_i S_n},\ i=1,2. $$

With a linear readout $r=(r_1,r_2)^\top$ and monotone $\phi$:

$$ \hat p_n = \phi\Big(r^\top h_n\Big) = \phi\Big(c_1 e^{a_1 S_n} + c_2 e^{a_2 S_n}\Big) = \phi\big(f(S_n)\big), \quad c_i=r_i h_0^{i}. $$

Since $f(S)$ has at most one turning point and $\phi$ is monotone, $\hat p_n$ crosses any fixed threshold at most twice, while $p_n$ flips every time $S_n$ increases by one. Therefore, a diagonal $A$ cannot realise parity on arbitrarily long input.

Now take $A = \begin{pmatrix} 0 & \pi \\ -\pi & 0 \end{pmatrix}$. Then

$$ \exp(A x_{n}) = \begin{pmatrix}1 & 0 \\ 0 & 1\end{pmatrix} \text{ if } x_{n}=0, \quad \exp(A x_{n}) = \begin{pmatrix}-1 & 0 \\ 0 & -1\end{pmatrix} \text{ if } x_{n}=1, $$

so $h_{n+1} = (-1)^{x_{n}} h_n$ and hence $h_n = (-1)^{S_n} h_0$. With $r=(1,0)^\top$, $h_0^{(1)}=1$, and $\phi(s) = \big(1-\operatorname{sign}(s)\big)/2$,

$$ \hat p_n = \frac{1-\operatorname{sign}\big((-1)^{S_n}\big)}{2} = S_n \bmod 2 = p_n. $$

Therefore, a dense matrix can solve parity exactly with $d_h=2$.

However, moving from diagonal to dense matrices is impractical for large $d_h$: parameter count and compute grow from $\mathcal{O}(d_\omega d_h)$ to $\mathcal{O}(d_\omega d_h^2)$. This motivates structured alternatives that balance expressivity and efficiency, leading us to SLiCEs.

Structured Linear Neural Controlled Differential Equations

SLiCEs replace the dense matrices of an LNCDE with structured variants that preserve expressivity while being computationally cheaper. Our paper proposes:

All four of these choices are maximally expressive (universal), as shown for DPLR-SLiCE in this paper, and shown for BD, sparse, and WH-SLiCE in our paper.

Comparison of SLiCEs

The opening figure visualises six possible structures: dense LNCDE (DE-LNCDE), diagonal SLiCE (D-SLiCE), diagonal-plus-low-rank (DPLR-SLiCE), sparse (S-SLiCE), Walsh–Hadamard (WH-SLiCE), and block-diagonal (BD-SLiCE). The table below compares parameter count, computational cost, and whether each is maximally expressive (Max. Exp.). Here $d_h$ is hidden size, $n$ sequence length, $b_j$ BD block sizes, $r$ DPLR rank, $\epsilon$ sparse exponent, and we take $d_\omega=d_h$ for simplicity. Parallel cost is measured as $\mathcal{O}(\text{scan depth, cost per composition})$ when using a parallel associative scan.

Complexity and expressivity of SLiCE variants
Model Parameters Recurrent Cost Parallel Cost Max. Exp.
DE-LNCDE$\mathcal{O}(d_h^3)$$\mathcal{O}(n d_h^3)$$\mathcal{O}(\log n, d_h^3)$Yes
D-SLiCE$\mathcal{O}(d_h^2)$$\mathcal{O}(n d_h^2)$$\mathcal{O}(\log n, d_h^2)$No
DPLR-SLiCE$\mathcal{O}(r d_h^2)$$\mathcal{O}(n r d_h^2)$$\mathcal{O}(\log n, d_h^3)$Yes
S-SLiCE$\mathcal{O}(d_h^{2+\epsilon})$$\mathcal{O}(n d_h^{2+\epsilon})$$\mathcal{O}(\log n, d_h^3)$Yes
WH-SLiCE$\mathcal{O}(d_h^2)$$\mathcal{O}(n d_h^2)$$\mathcal{O}(\log n, d_h^3)$Yes
BD-SLiCE$\mathcal{O}\!\left(d_h \sum_j b_j^2\right)$$\mathcal{O}\!\left(n d_h \sum_j b_j^2\right)$$\mathcal{O}\!\left(\log n, d_h \sum_j b_j^2\right)$Yes

Parallel Computation

Block-diagonal matrices are closed under multiplication, so the parallel associative scan preserves their structure. For DPLR, sparse, and WH, compositions destroy structure, yielding an effective $\mathcal{O}(d_h^3)$ cost per composition. Another important consideration is that for large hidden sizes, scans can incur high I/O costs, which reduces their practical benefit. DeltaNet mitigates this using a chunk-wise algorithm tailored for diagonal-plus-low-rank matrices; see this blog post. Because chunk-wise methods also apply to diagonal matrices, BD-SLiCE with a mostly diagonal structure ($b_i=1$ for $i=1,\ldots,k-1$) plus a small dense block ($b_k=b$) is an attractive option when $d_h$ is large. This variant can leverage chunk-wise updates for the diagonal part and scans for the dense part. We refer to this structure as diagonal–dense SLiCE (D-DE-SLiCE).

Empirical Results

Two highlights. First, DPLR, BD, and D-DE achieve strong length generalisation on regular language tasks from the Formal Language Benchmark, outperforming all parallel-in-time baselines we considered.

Formal Language Benchmark: DPLR, BD, and D-DE show strong length generalisation
Length generalisation on regular-language tasks

Second, replacing the nonlinear vector field of a Log-NCDE with a block-diagonal linear field (BD-SLiCE) gives the same average test accuracy with a $20\times$ reduction in average time per training step over six datasets from the UEA Multivariate Time Series Classification Archive.

Training time vs accuracy: BD-SLiCE approximately 20× faster than Log-NCDE with similar accuracy
Test accuracy against time per training step over six datasets from the UEA-MTSCA. Circle area is proportional to GPU memory usage.

Conclusion

SLiCEs provide a framework for understanding structured, input-dependent state-transition matrices that are both maximally expressive and parallel-in-time. If you want to keep learning more about them, further details, proofs of theoretical claims, and thorough experiments can be found in our paper. If you want to start using them, we have released open-source implementations in both PyTorch and JAX.

As for what is next, we are currently working on efficient implementations so that we can scale SLiCEs to the billion-parameter regime. We are also interested in exploring other structured matrices that fit within the SLiCE framework, so please reach out if you have any ideas!

Thanks for reading!

Back to top