20x Slowdown From ReLU
2025-06-05
Last week, I was working on my master's thesis and decided to try simplifying my approach. I was inspired by this paper on replacing all these new fancy activation functions with regular old rectified linear units. So, I went ahead and replaced the GLU and gated Tanh activations I was using with ReLUs and ReGLUs, respectively.
Before making the change, I ran some benchmarks:
```
benchmark          runs   total time   time/run (avg ± σ)     (min ... max)              p75        p99        p995
--------------------------------------------------------------------------------------------------------------------
sigmoid            333    2.392s       7.185ms ± 1.107ms      (5.731ms ... 11.132ms)     8.095ms    9.384ms    10.731ms
relu               1268   1.645s       1.297ms ± 138.268us    (1.218ms ... 2.547ms)      1.298ms    2.103ms    2.284ms
svfnn film 16x4    8      1.864s       233.111ms ± 1.107ms    (230.472ms ... 233.83ms)   233.823ms  233.83ms   233.83ms
```
and after:
```
benchmark          runs   total time   time/run (avg ± σ)     (min ... max)              p75        p99        p995
--------------------------------------------------------------------------------------------------------------------
sigmoid            322    1.906s       5.921ms ± 218.743us    (5.659ms ... 7.23ms)       5.992ms    7.159ms    7.221ms
relu               1587   1.926s       1.213ms ± 85.031us     (1.14ms ... 2.165ms)       1.212ms    1.622ms    1.88ms
svfnn film 16x4    1      4.084s       4.084s ± 0ns           (4.084s ... 4.084s)        4.084s     4.084s     4.084s
```
So, a 20x slowdown from switching to ReLU? I was thoroughly puzzled, because the numbers didn't follow any coherent logic: based on the microbenchmarks, ReLU should be roughly five times faster than sigmoid. How could swapping in a cheaper activation cause such a massive slowdown? The key to the puzzle lay in the architecture of my neural network itself.
The models that I've been building for my thesis utilize digital state variable filters (DSVF) and linear layers with activation to model nonlinear audio effects. The DSVF has nice properties that make it ideal for this kind of task. It has an infinite impulse response, meaning it can model arbitrarily long temporal dependencies. This capability is also, however, the reason why I eventually ran into this problem of losing an order of magnitude of performance when I decided to gate the features using ReLU.
The problem disappeared if I either switched the activation back to sigmoid or removed the filter bank entirely. Since I was playing around with Zig's generic type system and doing some profound OOP experimentation (TBH), my first thought was that this was probably some weird behavior where the code just unrolls so hard that it cripples any cache locality in the process.
Here's a snippet of the part where the funky stuff was happening:
```zig
inline for (0..num_blocks) |i| {
    x = self.filter_banks[i].forward(x);
    x += self.condition_layers[i].forward(self.condition);
    x = self.mlps[i].forward(x);
}
```
Another thing I noticed was that if I only used the first block to process all `num_blocks` layers, I didn't get any slowdown. So, I figured it was time to don the Hawaiian shirt, buckle up, start doing some real man programming, and forget my beautiful abstractions.
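Concretely, forgetting the abstractions meant flattening the per-block parameter structs into contiguous arrays, so that a pass over the whole filter bank touches dense cache lines. Roughly this kind of layout change, shown here as a generic array-of-structs vs. struct-of-arrays sketch in C (names are mine, not the thesis code):

```c
#define NUM_FILTERS 16

// Before: array-of-structs -- each filter's parameters are
// interleaved, so reading one parameter across the whole bank
// strides through memory.
struct FilterParams { float g, coef, m_bp, m_lp, m_hp; };
struct FilterParams bank_aos[NUM_FILTERS];

// After: struct-of-arrays -- each parameter is contiguous across
// the bank, so a vectorized filter pass reads dense cache lines.
struct FilterBankSoA {
    float g[NUM_FILTERS], coef[NUM_FILTERS], m_bp[NUM_FILTERS],
          m_lp[NUM_FILTERS], m_hp[NUM_FILTERS];
};
```

Both layouts hold the same data; only the access pattern changes.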
After working really hard for a couple of hours to reorganize the parameters for the main input/output layers and filter banks to possibly get some cache locality, I fired up the benchmark again:
```
benchmark          runs   total time   time/run (avg ± σ)     (min ... max)              p75        p99        p995
--------------------------------------------------------------------------------------------------------------------
sigmoid            350    1.957s       5.594ms ± 168.653us    (5.294ms ... 6.565ms)      5.63ms     6.17ms     6.321ms
relu               1712   1.963s       1.146ms ± 46.451us     (1.074ms ... 1.494ms)      1.151ms    1.391ms    1.417ms
svfnn film 16x4    1      2.01s        2.01s ± 0ns            (2.01s ... 2.01s)          2.01s      2.01s      2.01s
```
"Wow, it really must be the cache locality!" was my initial thought. I then fired up the plugin in Reaper, but to (none of) my surprise, I got no sound. "Ah, must be just a minor bug in the code"—and it turned out I was accidentally reading the filter gain coefficients in the wrong order.
So, I fixed the order and reran the benchmarks:
```
benchmark          runs   total time   time/run (avg ± σ)     (min ... max)              p75        p99        p995
--------------------------------------------------------------------------------------------------------------------
...
svfnn film 16x4    11     1.627s       147.993ms ± 18.374ms   (135.26ms ... 190.418ms)   154.519ms  190.418ms  190.418ms
...
```
"What? How could the order of the parameters have such a huge effect?"
So, I reran the benchmarks:
```
benchmark          runs   total time   time/run (avg ± σ)     (min ... max)              p75        p99        p995
--------------------------------------------------------------------------------------------------------------------
svfnn film 16x4    2      1.829s       914.768ms ± 3.936ms    (911.984ms ... 917.552ms)  917.552ms  917.552ms  917.552ms
```
Then it hit me.
State.
The filters have state. A ReLU gate can drive a filter's input to exactly zero, which causes the state to decay toward zero. If you know anything about DSP, 32-bit floating point numbers, and decaying to zero, this should instantly trigger some primal DSP-guru senses: on its way down, the state passes through the denormal (subnormal) range of f32, below roughly 1.2e-38, and most CPUs handle denormal arithmetic in slow microcode, costing tens of times more cycles per operation. A sigmoid gate, by contrast, never outputs an exact zero, so the state never fully decays and the slow path is never hit.
```diff
 // svf_bank.zig
 pub fn forward(self: *Self, input: [num_filters]f32) [num_filters]f32 {
+    const denormal_protection: T = @splat(1e-15);
     const two: T = @splat(2.0);
     const x: T = input;
     const y_bp = (g * (x - self.h2) + self.h1) * coef;
-    self.h1 = @mulAdd(T, two, y_bp, -self.h1);
+    self.h1 = @mulAdd(T, two, y_bp, -self.h1) + denormal_protection;
     const y_lp = @mulAdd(T, g, y_bp, self.h2);
-    self.h2 = @mulAdd(T, two, y_lp, -self.h2);
+    self.h2 = @mulAdd(T, two, y_lp, -self.h2) + denormal_protection;
     const y_hp = x - y_lp - two_r * y_bp;
     return m_bp * y_bp + m_lp * y_lp + m_hp * y_hp;
 }
```
With the denormal protection in place, we finally get a speedup compared to the original performance:
```
benchmark          runs   total time   time/run (avg ± σ)     (min ... max)              p75        p99        p995
--------------------------------------------------------------------------------------------------------------------
sigmoid            325    2.387s       7.347ms ± 693us        (6.167ms ... 10.675ms)     7.608ms    10.249ms   10.661ms
relu               1540   1.939s       1.259ms ± 47.298us     (1.181ms ... 1.733ms)      1.297ms    1.352ms    1.405ms
svfnn film 16x4    12     1.882s       156.901ms ± 1.743ms    (155.209ms ... 162.055ms)  157.262ms  162.055ms  162.055ms
```