On the Power of Attention

The following is a synthesis written by ChatGPT.

Attention as Selective Focus

The central problem is that a processing system (whether a brain’s conscious workspace or a machine learning model) has finite capacity. It cannot load all available information at once without collapsing distinctions. The bottleneck forces selection: some items must be emphasized, others ignored. Attention is the mechanism that makes this possible.

Aggregation Without Attention

If there are many candidate values $\{v_1, \dots, v_n\}$, we can always combine them by averaging or other blind aggregation. But this is structurally limiting:

A global average collapses everything into a single point, erasing the individuality of inputs.
A biased or pointed average (slight weighting toward one perspective) still fuses everything, just less evenly.
More elaborate tricks, like arranging values into grids and averaging along rows, preserve some structure (anchoring each average on one item) but remain diffuse: every row is still diluted by contributions from all others.

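All these schemes are unconditional: the mixing weights are fixed in advance and do not depend on context. Under a uniform average over $n$ candidates, each item contributes only $1/n$ of the result, so the output becomes a kind of mush that grows worse as the number of candidates increases.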
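The dilution is easy to see numerically. The following is a minimal sketch, not taken from the original post: the function names (`uniform_average`, `biased_average`, `one_hot`) and the `bias` parameter are illustrative assumptions. It feeds one-hot vectors through a blind average so that the output coordinates are exactly the mixing weights.

```python
def uniform_average(values):
    """Blind aggregation: every vector gets the same weight 1/n."""
    n = len(values)
    dim = len(values[0])
    return [sum(v[d] for v in values) / n for d in range(dim)]


def biased_average(values, favored, bias=0.5):
    """'Pointed' average: one index gets extra weight, but the weights
    are still fixed in advance and ignore context (hypothetical helper)."""
    n = len(values)
    weights = [(1.0 - bias) / n] * n
    weights[favored] += bias
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def one_hot(i, n):
    return [1.0 if j == i else 0.0 for j in range(n)]


# With one-hot inputs, each output coordinate equals the weight given to
# the corresponding item, so we can read off how much of each item survives.
for n in (4, 16, 64):
    items = [one_hot(i, n) for i in range(n)]
    mixed = uniform_average(items)
    print(n, max(mixed))  # the largest share any single item keeps is 1/n
```

Even the biased variant only moves a fixed amount of weight to a fixed position; nothing about the inputs or the current state changes which item is favored.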

The Dynamic Nature of Attention

Attention differs fundamentally because it is context-sensitive and dynamic:

A query represents what the system needs at that moment, a kind of internal “pointer” of intent.
Each candidate carries a key that characterizes it.
The query–key similarity produces scores, which determine how strongly each candidate’s value contributes.

This makes attention a form of dynamic routing or content-addressable indexing: the pathway of information is determined on the fly, conditioned by the current state. The same item may be ignored in one context and central in another.
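The query, key, and value roles described above can be sketched in a few lines. The post only says "query–key similarity", so the specific choices here are assumptions: dot-product similarity, scaling by the square root of the dimension, and softmax normalization (the form commonly used in Transformers). The names `dot`, `softmax`, and `attend` and the toy keys and values are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Return (weights, output) for one query over key/value pairs.

    Assumes scaled dot-product scoring; other similarity functions
    would fit the same pattern.
    """
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Toy data: three candidates, each with a 2-d key and a one-hot value.
keys = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

w_a, out_a = attend([4.0, 0.0], keys, values)  # query aligned with key 0
w_b, out_b = attend([0.0, 4.0], keys, values)  # query aligned with key 1
print([round(w, 3) for w in w_a])
print([round(w, 3) for w in w_b])
```

The keys and values never change between the two calls; only the query does. Candidate 1 receives almost no weight under the first query and almost all of it under the second, which is the sense in which the routing is decided on the fly.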

Hard vs. Soft Attention

Hard attention: pick exactly one candidate (argmax). Sharp but brittle: the wrong choice discards everything else, and the selection step is non-differentiable.
Soft attention: keep many candidates, but weight each by its relevance (typically via a softmax over the query–key scores). This produces a weighted superposition rather than a blind average.
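A small numerical comparison makes the brittleness concrete. This is a sketch under assumptions: the scores are given directly rather than computed from queries and keys, the values are scalars for readability, and the `temperature` parameter is an addition not discussed in the post, included only to show that soft attention approaches the hard choice as the softmax is sharpened.

```python
import math


def softmax(scores, temperature=1.0):
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_attention(scores, values):
    """Argmax selection: return the single best-scoring value."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return values[best]


def soft_attention(scores, values, temperature=1.0):
    """Softmax-weighted sum of all values."""
    weights = softmax(scores, temperature)
    return sum(w * v for w, v in zip(weights, values))


values = [10.0, 20.0, 30.0]
scores_a = [2.0, 1.9, -3.0]  # candidate 0 narrowly ahead
scores_b = [1.9, 2.0, -3.0]  # candidate 1 narrowly ahead

# A tiny change in scores flips the hard output completely...
print(hard_attention(scores_a, values), hard_attention(scores_b, values))
# ...while the soft output moves only slightly.
print(soft_attention(scores_a, values), soft_attention(scores_b, values))
# A low temperature sharpens the softmax toward the argmax choice.
print(soft_attention(scores_a, values, temperature=0.01))
```

The two near-tied candidates share the weight under soft attention, and the clearly irrelevant third candidate is suppressed without being selected against by a hard threshold.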

The trick is that representations must be compatible with superposition:

Tight encodings (e.g. bit-strings combined by XOR) collapse into noise.
Distributed, high-dimensional encodings (embeddings, images, holographic vectors) can be combined while still preserving structure.
This allows “mashing together a few” without immediate loss, enabling the workspace to hold a small coalition of items at once.
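The contrast between the two kinds of encoding can be checked with a toy experiment. Everything specific here is an assumption chosen for illustration: random ±1 vectors of dimension 1024 stand in for "distributed, high-dimensional encodings", plain summation stands in for superposition, and random 1024-bit integers combined by XOR stand in for "tight encodings". The names (`codebook`, `bundle`, `similarity`, `bit_agreement`) are hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
DIM = 1024
N_ITEMS = 20
MEMBERS = [2, 7, 11]  # the small coalition we superpose


def random_sign_vector(dim):
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]


def similarity(a, b):
    """Normalized dot product: about 1 for a match, about 0 otherwise."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


# Distributed encoding: superpose by adding the vectors.
codebook = [random_sign_vector(DIM) for _ in range(N_ITEMS)]
bundle = [sum(codebook[i][d] for i in MEMBERS) for d in range(DIM)]
sims = [similarity(bundle, item) for item in codebook]
ranked = sorted(range(N_ITEMS), key=lambda i: sims[i], reverse=True)
recovered = sorted(ranked[:len(MEMBERS)])
print("recovered from sum:", recovered)


def bit_agreement(a, b, n_bits):
    """Fraction of bit positions where two n_bits-wide integers agree."""
    return 1.0 - bin(a ^ b).count("1") / n_bits


# Tight encoding: combine random bit-strings by XOR.
bit_codes = [rng.getrandbits(DIM) for _ in range(N_ITEMS)]
xor_bundle = 0
for i in MEMBERS:
    xor_bundle ^= bit_codes[i]
xor_scores = [bit_agreement(xor_bundle, c, DIM) for c in bit_codes]
print("XOR agreement, members:", [round(xor_scores[i], 2) for i in MEMBERS])
print("XOR agreement, a non-member:", round(xor_scores[0], 2))
```

In the summed vector, the members stand out because the other components act as small, roughly zero-mean noise in each similarity. In the XOR result, members and non-members agree with the bundle at about chance level, so the combination no longer indicates what went into it.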

Why Attention is Essential

Without attention, aggregation produces undifferentiated summaries. With attention, the system achieves selectivity:

Not everything at once (overload).
Not just one thing (no integration).
But a few relevant things, weighted by their current significance.

This selective integration is what allows a central processor to combine information from different sources in a meaningful, context-dependent way.

In short: attention is not just averaging with bells on. It is a conditional, query-driven operator that dynamically routes information, enabling finite-memory systems to integrate the right subset of inputs without collapsing into mush.
