Invariance is the easy part. You can always take an average or a max over a group and call it a day.
The hard part is being invariant without throwing away the signal.
This article is about a specific fix: the Selective G-Bispectrum, introduced in the NeurIPS 2024 paper The Selective G-Bispectrum and its Inversion: Applications to G-Invariant Networks (in this repo under papers/bispectrum/).
The problem: “invariant” can mean “information-destroying”
Group-equivariant CNNs (G-CNNs) are great at carrying symmetries through layers. But at some point (classification, retrieval, etc.) you usually want G-invariance: rotate the input, same answer.
The default invariant layer is G-pooling:
- Avg G-pooling: $\frac{1}{|G|}\sum_{g \in G} f(g)$
- Max G-pooling: $\max_{g \in G} f(g)$
Both are invariant, and both are cheap. But they can erase structure so aggressively that two inputs that are not related by a group action can map to the same pooled value.
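To see the collision concretely, here is a tiny numpy sketch (my own toy example, not from the paper) with features over the cyclic group C_4 acting by circular shifts:

```python
import numpy as np

# Two "feature maps" over the cyclic group C_4 (one value per group element).
f1 = np.array([3.0, 1.0, 2.0, 0.0])
f2 = np.array([3.0, 2.0, 1.0, 0.0])

# f2 is not a cyclic shift of f1 ...
shifts_of_f1 = [np.roll(f1, s) for s in range(4)]
assert not any(np.array_equal(f2, s) for s in shifts_of_f1)

# ... yet both G-pooling flavors collapse them to the same value.
print(f1.mean(), f2.mean())  # 1.5 1.5
print(f1.max(), f2.max())    # 3.0 3.0
```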
Max pooling is a “bag of features” approach. It knows a feature exists somewhere, but forgets where it is relative to other features. It can’t distinguish a face from a Picasso painting where the eye is on the chin.
This also explains why max pooling sometimes works: if you have enough filters, a giant bag-of-features can approximate structure indirectly. But it’s a crude contract: you buy invariance by destroying geometry, and then you buy geometry back with width.
The money shot: parameter efficiency (why engineers should care)
The paper’s strongest practical argument isn’t just “invariance.” It’s parameter efficiency.
Because max pooling throws away relative structure, you often need to compensate with a wider G-convolution (more filters). The Selective G-Bispectrum layer is structure-preserving, so it stays strong under tight filter budgets.
In the paper’s MNIST experiments, the effect is dramatic: with only a handful of filters, the Selective G-Bispectrum stays around ~95% accuracy while max pooling can drop to ~60%, and max pooling only catches up once you throw enough filters at it.
The alternative: complete G-invariants (TC / bispectrum)
There’s a stronger notion than “invariant” that matters in practice:
A complete G-invariant is one where (generically) equal outputs imply equal inputs up to group action.
In the paper’s framing, the G-Triple Correlation (G-TC) and the full G-Bispectrum have this property under mild “genericity” assumptions on the Fourier coefficients.
The “genericity” assumption is basically “no accidental singularities / zeros,” i.e. corner cases of measure zero; small perturbations usually fix it.
This is exactly what you want from a pooling-like primitive if you care about robustness: it removes only the nuisance symmetry, not the content.
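For cyclic groups this is easy to poke at: the G-Fourier transform is the ordinary DFT and the G-bispectrum reduces to the classical bispectrum $B(k_1,k_2)=F(k_1)F(k_2)\overline{F(k_1+k_2)}$. A quick numpy sketch (again a toy illustration, not the paper's code), reusing the two signals that pooling confused:

```python
import numpy as np

def cyclic_bispectrum(f):
    """Full bispectrum of a signal over the cyclic group C_N (classical bispectrum)."""
    F = np.fft.fft(f)
    N = len(f)
    k = np.arange(N)
    # B[k1, k2] = F[k1] * F[k2] * conj(F[(k1 + k2) mod N])
    return F[:, None] * F[None, :] * np.conj(F[(k[:, None] + k[None, :]) % N])

f1 = np.array([3.0, 1.0, 2.0, 0.0])
f2 = np.array([3.0, 2.0, 1.0, 0.0])

# Invariant: shifting f1 leaves its bispectrum unchanged (up to float error).
assert np.allclose(cyclic_bispectrum(f1), cyclic_bispectrum(np.roll(f1, 2)))

# Discriminative: unlike pooling, it distinguishes f1 from f2.
assert not np.allclose(cyclic_bispectrum(f1), cyclic_bispectrum(f2))
```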
The blocker: cost explodes with $|G|$
Completeness has historically been expensive:
- G-TC: $O(|G|^2)$ outputs and $O(|G|^3)$ compute (per feature map, roughly).
- Full G-bispectrum: $O(|G|^2)$ outputs and $O(|G|^2)$ compute.
- Pooling: $O(1)$ outputs and $O(|G|)$ compute.
So the practical choice was often: “cheap and lossy” vs “complete and prohibitive.”
The core idea: bispectrum redundancy → selective bispectrum
The paper’s main observation is that the full G-bispectrum contains redundancy. The non-magic intuition is generating sets: just as a group can be generated by a small set of elements, the spectral information can be recovered from a small chain of bispectral “links.” You don’t need every pair of irreps $(\rho_i, \rho_j)$; you need enough pairs to traverse the graph of representations induced by tensor products (Kronecker decompositions) and reconstruct everything up to the inherent group-action ambiguity.
That subset is the Selective G-Bispectrum:
- Space (linear): $O(|G|)$ coefficients (instead of $O(|G|^2)$).
- Time: near-linear, $O(|G| \log |G|)$, if a fast Fourier transform exists for the group; otherwise quadratic, $O(|G|^2)$ (still far better than the cubic G-TC).
- Completeness: proved for the big practical families (finite commutative groups, dihedral groups, octahedral / full octahedral groups).
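To make the “chain of links” concrete in the simplest setting: for the cyclic group $C_N$, irreps are frequencies, $\rho_i \otimes \rho_j = \rho_{i+j \bmod N}$, and the pairs $(1, k)$ already chain each Fourier coefficient to the next. A minimal numpy sketch of that chain (my own illustration, not the paper's code; the completeness guarantee assumes non-vanishing Fourier coefficients):

```python
import numpy as np

def selective_bispectrum_cyclic(f):
    """Selective chain for C_N: B[k] = F[1] * F[k] * conj(F[k+1]).
    N coefficients instead of the N*N grid of the full bispectrum."""
    F = np.fft.fft(f)
    N = len(f)
    return np.array([F[1] * F[k] * np.conj(F[(k + 1) % N]) for k in range(N)])

f1 = np.array([3.0, 1.0, 2.0, 0.0])
f2 = np.array([3.0, 2.0, 1.0, 0.0])   # not a cyclic shift of f1

# Still invariant to the group action...
assert np.allclose(selective_bispectrum_cyclic(f1),
                   selective_bispectrum_cyclic(np.roll(f1, 3)))
# ...still separates the two signals that pooling confused...
assert not np.allclose(selective_bispectrum_cyclic(f1),
                       selective_bispectrum_cyclic(f2))
# ...with 4 coefficients instead of 16.
print(selective_bispectrum_cyclic(f1).shape)   # (4,)
```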
The hidden cost of completeness (you can’t just import this)
If you are used to PyTorch’s nn.MaxPool2d, you are spoiled. Max pooling is stateless and agnostic: it doesn’t care about the mathematical properties of the data it crushes. It just looks for the highest number and moves on.
The Selective G-Bispectrum is not a drop-in, stateless replacement. It is structure-aware, and that structure requires a strict setup cost.
The mechanism: routing tables for symmetry
Unlike standard pooling (spatial pixels), the G-bispectrum operates in the spectral domain.
At a high level the layer has a preamble:
- G-Fourier transform: the feature map $f$ is mapped to Fourier blocks indexed by irreps $\rho$, i.e. $\hat{f}(\rho) = \sum_{g \in G} f(g)\,\rho(g)$.
- Tensor contraction: those blocks are combined using the bispectrum formula, which includes Clebsch–Gordan / Kronecker structure telling you how frequency modes mix under tensor products.
The “selective” part is where the paper gets clever: it consults a Kronecker table (think: a routing map over irreps) and computes only a sparse chain of interactions—enough to traverse the representation graph and preserve completeness—rather than filling the entire grid.
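Here is a toy version of that routing table for $C_6$, where every tensor product of irreps collapses to a single irrep (for non-abelian groups each entry is a list of irreps plus a Clebsch–Gordan change of basis; the dict layout below is my own sketch, not the paper’s data structure):

```python
# Toy "Kronecker routing table" for the cyclic group C_6.
N = 6
irreps = list(range(N))                       # C_N irreps are labelled by frequencies
kron_table = {(i, j): [(i + j) % N] for i in irreps for j in irreps}

# Full bispectrum: visit every cell of the grid -> O(N^2) interactions.
full_pairs = [(i, j) for i in irreps for j in irreps]

# Selective bispectrum: a sparse chain that still reaches every irrep.
selective_pairs = [(1, k) for k in irreps]
reached = {0, 1}                              # start from the trivial irrep and frequency 1
for i, j in selective_pairs:
    reached.update(kron_table[(i, j)])        # follow the routing table
assert reached == set(irreps)                 # every frequency is linked into the chain

print(len(full_pairs), "->", len(selective_pairs))   # 36 -> 6
```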
The implementation gap (the precomputed constants)
Because this layer depends on group structure, you aren’t just passing a tensor; you’re passing it through a gauntlet of precomputed representation-theoretic constants:
- The hard way: manually compute irreps, Clebsch–Gordan matrices, Kronecker decompositions for your group, cache them, and write the sparse contractions efficiently.
- The smart way: lean on geometric deep learning libraries (the paper’s code builds on escnn) to generate irreps / transformation structure and to manage the bookkeeping.
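Concretely, the precomputed state might look something like the container below. This is a hypothetical sketch: the names `GroupSpectralCache` and `g_fourier` are mine, and a library like escnn is what you would typically use to generate the irrep matrices that populate it.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class GroupSpectralCache:
    """Hypothetical container for the representation-theoretic constants you would
    precompute once per group and reuse across forward passes."""
    # irrep label -> stacked matrices rho(g), shape (|G|, d_rho, d_rho)
    irreps: dict = field(default_factory=dict)
    # (label_i, label_j) -> list of irrep labels appearing in rho_i (x) rho_j
    kron_table: dict = field(default_factory=dict)
    # (label_i, label_j) -> Clebsch-Gordan change-of-basis matrix
    clebsch_gordan: dict = field(default_factory=dict)
    # the sparse set of (i, j) pairs used by the selective layer
    selective_pairs: list = field(default_factory=list)

    def g_fourier(self, f):
        """G-Fourier transform: one block per irrep, f_hat(rho) = sum_g f(g) rho(g)."""
        return {label: np.tensordot(f, mats, axes=(0, 0))
                for label, mats in self.irreps.items()}
```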
The FFT caveat (software maturity, not theory)
The theoretical “fast” scaling ($O(|G| \log |G|)$) relies on the existence of a fast Fourier transform for your group.
For groups where we rely on classical DFTs (including typical dihedral implementations in practice), compute scales quadratically: $O(|G|^2)$. That’s still significantly faster than the cubic G-TC, but it’s not “instant” until dihedral FFT plumbing actually exists.
A small table of the tradeoffs
| Invariant layer | Time (per feature map, rough) | Output size | Complete G-invariant |
|---|---|---|---|
| Avg/Max G-pooling | $O(\lvert G\rvert)$ | $O(1)$ | No |
| G-TC | $O(\lvert G\rvert^3)$ | $O(\lvert G\rvert^2)$ | Yes |
| Full G-bispectrum | $O(\lvert G\rvert^2)$ | $O(\lvert G\rvert^2)$ | Yes |
| Selective G-Bispectrum | $O(\lvert G\rvert \log \lvert G\rvert)$ with FFT, else $O(\lvert G\rvert^2)$ | $O(\lvert G\rvert)$ | Yes |
What it looks like as a layer
Architecturally, it’s “swap the invariant module” after a G-convolution: same G-conv backbone, different bottleneck.
Implementation note (the catch)
This is not nn.MaxPool2d. Computing G-bispectral coefficients involves the group Fourier transform and Clebsch–Gordan / Kronecker structure (e.g. precomputed Kronecker tables, Clebsch–Gordan matrices). In practice you lean on libraries like escnn, or you precompute the representation-theoretic pieces once for your group and reuse them.
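For the cyclic-group case, where the G-Fourier transform is just `torch.fft.fft` over the group axis, a minimal PyTorch sketch of the swap looks like this. The module name, the `(batch, channels, |G|)` layout, and the real/imag concatenation are my assumptions; this is a stand-in for, not a reproduction of, the paper’s general layer.

```python
import torch
import torch.nn as nn

class SelectiveBispectrumCn(nn.Module):
    """Invariant bottleneck for features over a cyclic group C_N.

    Input:  (batch, channels, N) -- one value per group element, e.g. after a G-conv.
    Output: (batch, channels, 2N) -- real and imaginary parts of the N selective
            bispectral coefficients B[k] = F[1] * F[k] * conj(F[k+1]).
    """
    def forward(self, x):
        F = torch.fft.fft(x, dim=-1)                      # G-Fourier transform for C_N
        F_next = torch.roll(F, shifts=-1, dims=-1)        # F[(k+1) mod N]
        B = F[..., 1:2] * F * torch.conj(F_next)          # the selective chain
        return torch.cat([B.real, B.imag], dim=-1)

# Swapping the bottleneck: same backbone, different invariant map.
x = torch.randn(8, 16, 12)                                # (batch, channels, |G| = 12)
max_pooled = x.max(dim=-1).values                         # lossy G-pooling baseline
invariant  = SelectiveBispectrumCn()(x)                   # structure-preserving bottleneck
x_rolled = torch.roll(x, shifts=5, dims=-1)               # act with a group element
assert torch.allclose(SelectiveBispectrumCn()(x_rolled), invariant, atol=1e-4)
```

A general-group version replaces the FFT with the G-Fourier transform over precomputed irrep matrices and routes the products through the Kronecker table, which is exactly the bookkeeping escnn helps with.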
Empirics: the regime where it matters
The experiments (MNIST / EMNIST under discrete rotation and rotation+reflection groups) are pretty consistent with the theory:
- Speed: with FFTs (cyclic groups), selective bispectrum scales close to linearly in $|G|$, while G-TC scales cubically.
- Accuracy: selective bispectrum is typically much better than max/avg pooling when you don’t have a huge filter budget.
- Tradeoff: G-TC can still win slightly on accuracy (even though both are complete in theory), plausibly because redundancy makes the downstream MLP’s job easier at finite width.
Takeaway (the “why this matters” sentence)
Max pooling buys invariance at the cost of signal destruction, forcing you to compensate with wider networks. The Selective G-Bispectrum offers a new contract: complete signal preservation at linear space cost. It lets models be invariant and parameter-efficient, making it the first credible “complete” default for realistic G-CNN pipelines.