20x Slowdown From ReLU
2025-06-05
Last week, I was working on my master's thesis and decided to try simplifying my approach. I was inspired by this paper on replacing all these new fancy activation functions with regular old rectified linear units. So, I went ahead and replaced the GLU and gated Tanh activations I was using with ReLUs and ReGLUs, respectively.
Before making the change, I ran some benchmarks:
```
benchmark          runs   total time   time/run (avg ± σ)     (min ... max)              p75        p99        p995
--------------------------------------------------------------------------------------------------------------------
sigmoid            333    2.392s       7.185ms ± 1.107ms      (5.731ms ... 11.132ms)     8.095ms    9.384ms    10.731ms
relu               1268   1.645s       1.297ms ± 138.268us    (1.218ms ... 2.547ms)      1.298ms    2.103ms    2.284ms
svfnn film 16x4    8      1.864s       233.111ms ± 1.107ms    (230.472ms ... 233.83ms)   233.823ms  233.83ms   233.83ms
```
and after:
```
benchmark          runs   total time   time/run (avg ± σ)     (min ... max)              p75        p99        p995
--------------------------------------------------------------------------------------------------------------------
sigmoid            322    1.906s       5.921ms ± 218.743us    (5.659ms ... 7.23ms)       5.992ms    7.159ms    7.221ms
relu               1587   1.926s       1.213ms ± 85.031us     (1.14ms ... 2.165ms)       1.212ms    1.622ms    1.88ms
svfnn film 16x4    1      4.084s       4.084s ± 0ns           (4.084s ... 4.084s)        4.084s     4.084s     4.084s
```
So, a 20x slowdown from switching to ReLU? I was thoroughly puzzled, because the numbers didn't follow any coherent logic: based on the microbenchmarks, ReLU should be roughly five times faster than sigmoid. How could swapping in a cheaper activation cause such a massive slowdown? The key to the puzzle lay in the architecture of my neural network itself.
The models that I've been building for my thesis utilize digital state variable filters (DSVF) and linear layers with activation to model nonlinear audio effects. The DSVF has nice properties that make it ideal for this kind of task. It has an infinite impulse response, meaning it can model arbitrarily long temporal dependencies. This capability is also, however, the reason why I eventually ran into this problem of losing an order of magnitude of performance when I decided to gate the features using ReLU.
The problem disappeared if I either switched the activation back to sigmoid or removed the filter bank entirely. Since I was playing around with Zig's generic type system and doing some profound OOP experimentation (TBH), my first thought was that this was probably some weird behavior where the code just unrolls so hard that it cripples any cache locality in the process.
Here's a snippet of the part where the funky stuff was happening:
```zig
inline for (0..num_blocks) |i| {
    x = self.filter_banks[i].forward(x);
    x += self.condition_layers[i].forward(self.condition);
    x = self.mlps[i].forward(x);
}
```
Another thing I noticed was that if I only used the first block to process all `num_blocks` layers, I didn't get any slowdown. So, I figured it was time to don the Hawaiian shirt, buckle up, start doing some real man programming, and forget my beautiful abstractions.
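Concretely, forgetting the abstractions meant flattening the per-block parameter structs into contiguous arrays, so that a pass over the whole filter bank touches dense cache lines. Roughly this kind of layout change, shown here as a generic array-of-structs vs. struct-of-arrays sketch in C (names are mine, not the thesis code):

```c
#define NUM_FILTERS 16

// Before: array-of-structs -- each filter's parameters are
// interleaved, so reading one parameter across the whole bank
// strides through memory.
struct FilterParams { float g, coef, m_bp, m_lp, m_hp; };
struct FilterParams bank_aos[NUM_FILTERS];

// After: struct-of-arrays -- each parameter is contiguous across
// the bank, so a vectorized filter pass reads dense cache lines.
struct FilterBankSoA {
    float g[NUM_FILTERS], coef[NUM_FILTERS], m_bp[NUM_FILTERS],
          m_lp[NUM_FILTERS], m_hp[NUM_FILTERS];
};
```

Both layouts hold the same data; only the access pattern changes.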
After working really hard for a couple of hours to reorganize the parameters for the main input/output layers and filter banks to possibly get some cache locality, I fired up the benchmark again:
```
benchmark          runs   total time   time/run (avg ± σ)     (min ... max)              p75        p99        p995
--------------------------------------------------------------------------------------------------------------------
sigmoid            350    1.957s       5.594ms ± 168.653us    (5.294ms ... 6.565ms)      5.63ms     6.17ms     6.321ms
relu               1712   1.963s       1.146ms ± 46.451us     (1.074ms ... 1.494ms)      1.151ms    1.391ms    1.417ms
svfnn film 16x4    1      2.01s        2.01s ± 0ns            (2.01s ... 2.01s)          2.01s      2.01s      2.01s
```
"Wow, it really must be the cache locality!" was my initial thought. I then fired up the plugin in Reaper, but to (none of) my surprise, I got no sound. "Ah, must be just a minor bug in the code"—and it turned out I was accidentally reading the filter gain coefficients in the wrong order.
So, I fixed the order and reran the benchmarks:
```
benchmark          runs   total time   time/run (avg ± σ)     (min ... max)              p75        p99        p995
--------------------------------------------------------------------------------------------------------------------
...
svfnn film 16x4    11     1.627s       147.993ms ± 18.374ms   (135.26ms ... 190.418ms)   154.519ms  190.418ms  190.418ms
...
```
"What? How could the order of the parameters have such a huge effect?"
So, I reran the benchmarks:
```
benchmark          runs   total time   time/run (avg ± σ)     (min ... max)              p75        p99        p995
--------------------------------------------------------------------------------------------------------------------
svfnn film 16x4    2      1.829s       914.768ms ± 3.936ms    (911.984ms ... 917.552ms)  917.552ms  917.552ms  917.552ms
```
Then it hit me.
State.
The filters have state. A ReLU gate can drive a filter's input to exactly zero, which causes the state to decay toward zero. If you know anything about DSP, 32-bit floating point numbers, and decaying to zero, this should instantly trigger some primal DSP-guru senses: on its way down, the state passes through the denormal (subnormal) range of f32, below roughly 1.2e-38, and most CPUs handle denormal arithmetic in slow microcode, costing tens of times more cycles per operation. A sigmoid gate, by contrast, never outputs an exact zero, so the state never fully decays and the slow path is never hit.
```diff
 // svf_bank.zig
 pub fn forward(self: *Self, input: [num_filters]f32) [num_filters]f32 {
+    const denormal_protection: T = @splat(1e-15);
     const two: T = @splat(2.0);
     const x: T = input;
     const y_bp = (g * (x - self.h2) + self.h1) * coef;
-    self.h1 = @mulAdd(T, two, y_bp, -self.h1);
+    self.h1 = @mulAdd(T, two, y_bp, -self.h1) + denormal_protection;
     const y_lp = @mulAdd(T, g, y_bp, self.h2);
-    self.h2 = @mulAdd(T, two, y_lp, -self.h2);
+    self.h2 = @mulAdd(T, two, y_lp, -self.h2) + denormal_protection;
     const y_hp = x - y_lp - two_r * y_bp;
     return m_bp * y_bp + m_lp * y_lp + m_hp * y_hp;
 }
```
With the denormal protection in place, we finally get a speedup compared to the original performance:
```
benchmark          runs   total time   time/run (avg ± σ)     (min ... max)              p75        p99        p995
--------------------------------------------------------------------------------------------------------------------
sigmoid            325    2.387s       7.347ms ± 693us        (6.167ms ... 10.675ms)     7.608ms    10.249ms   10.661ms
relu               1540   1.939s       1.259ms ± 47.298us     (1.181ms ... 1.733ms)      1.297ms    1.352ms    1.405ms
svfnn film 16x4    12     1.882s       156.901ms ± 1.743ms    (155.209ms ... 162.055ms)  157.262ms  162.055ms  162.055ms
```