Using group theory to explore the space of positional encodings for attention
Positional encodings are essential for attention mechanisms to understand sequence order, but their design space is more constrained than it appears. By applying group theory and requiring properties like linearity, translation invariance, and continuity, the set of valid encodings reduces to one-parameter matrix groups. This analysis shows that all practical encodings in use today, like RoPE, fall into a few mathematically permissible families. A previously unexplored, exotic class of encodings emerges as theoretically valid, though likely impractical.
- Positional encodings modify attention mechanisms to account for the order of elements in sequences by incorporating relative or absolute position information.
- Desirable properties such as linearity, translation invariance, and continuity constrain valid positional encodings to one-parameter matrix groups of the form $A(t) = \exp(tX)$.
- All widely used positional encodings, including RoPE, fall within the mathematically derived families, suggesting current practices are already aligned with theoretical limits.
- A novel, complex-valued class of encodings arises from non-diagonalizable generator matrices, but these are likely impractical for real-world models.
- The analysis implies that discovering fundamentally new effective positional encodings is unlikely, as the space of sensible options is highly restricted by mathematical structure.
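The $A(t) = \exp(tX)$ structure named above can be checked numerically. The sketch below (my own toy example, assuming a skew-symmetric $2\times 2$ generator, for which the matrix exponential has a closed rotation form) verifies the defining one-parameter group law $A(s)A(t) = A(s+t)$:

```python
import numpy as np

def A(t, omega=0.5):
    """exp(t X) for the skew-symmetric generator X = [[0, -omega], [omega, 0]].
    For this X the matrix exponential is exactly a rotation by omega * t."""
    c, s = np.cos(omega * t), np.sin(omega * t)
    return np.array([[c, -s], [s, c]])

s, t = 1.3, 2.7
# One-parameter group law: composing encodings adds their time parameters.
assert np.allclose(A(s) @ A(t), A(s + t))
# A(0) is the identity, and A(-t) inverts A(t).
assert np.allclose(A(0.0), np.eye(2))
assert np.allclose(A(t) @ A(-t), np.eye(2))
```

The group law is what makes such a matrix family usable as a positional encoding: applying the encoding for time $s$ and then for time $t$ is the same as applying it once for time $s + t$.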
Full article excerpt
Attention is a computational primitive at the core of modern language models, allowing internal representations to reference and influence each other. It’s how these models handle sequential data in the first place. Yet, naively implemented, attention doesn’t have any notion of position. In the core attention computation, you calculate the dot product between a given “query” and a “key,” which tells you how much to attend to the value at that key—and this dot product says nothing about where in the sequence the queries and keys are.

Since in most settings that information matters a great deal, you generally want to somehow perturb your dot-product calculation so that it depends on the positions (usually just the relative positions) of the query and key. The so-called “positional encoding” that you use represents an important modeling decision, because it governs how the model views the passage of time. The most popular positional encoding, called RoPE, rotates components of key and query vectors by an angle that depends on the position in the sequence, like the hands of a clock. This works well in practice, but it’s far from the only option.

On a recent Friday afternoon I wondered, well, what are all the options? I’m an ML researcher at Jane Street, and when we’re working with sequential models we’ve often debated whether we’re using the best positional encoding. This raises the question: what positional encodings are even possible? As I started investigating, I discovered to my surprise that the space is quite constrained. The reason is that there are a few key properties that any desirable positional encoding should have, and once you formalize those, you’re left with a very particular mathematical structure. (Spoiler: a one-parameter group.)
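The clock-style rotation described above can be sketched in a few lines. This is a minimal two-dimensional illustration of the RoPE idea (full RoPE applies a different rotation frequency to each 2-d pair of coordinates; the frequency `omega` here is an arbitrary choice of mine): rotating the query and key by angles proportional to their positions makes the attention score depend only on the relative offset between them.

```python
import numpy as np

def rotate(v, pos, omega=0.1):
    """Rotate a 2-d vector by an angle proportional to its position,
    like the hands of a clock."""
    theta = omega * pos
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ v

q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

# Score between a query at position m and a key at position n.
score_a = rotate(q, 10) @ rotate(k, 4)     # relative offset m - n = 6
score_b = rotate(q, 107) @ rotate(k, 101)  # also relative offset 6
assert np.isclose(score_a, score_b)  # the score depends only on m - n
```

This relative-position property falls out because the rotations compose: the query's rotation and the key's rotation combine into a single rotation by an angle proportional to $m - n$.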
In exploring that structure, I was able to show that there are only a few families of valid positional encoding—the demonstration of that fact forms the bulk of this blog post—and actually all of the sensible ones are already being used in real systems. It was a reassuring finding, because it means that we don’t need to rack our brains to come up with some perfect positional encoding, as we are probably already using it. Even so, in doing this analysis I did discover a strange, unlikely-to-work-out-in-practice, but technically legal class of positional encodings that seems to be totally unexplored.

Formalizing a positional encoding

Let’s suppose we have a time series of queries $q(t): \mathbb{R} \to \mathbb{R}^d$, and a time series of keys $k(t): \mathbb{R} \to \mathbb{R}^d$. We write $q$ and $k$ as functions of time so that we can accommodate continuous or irregularly sampled inputs, but we could just as well restrict ourselves to only integer times if we prefer. “Time” here is a stand-in for any increasing quantity that we might care to use to measure progression through a sequence; it could be a sequence index, or literal elapsed time in a time series modeling problem, or even some kind of learned notion of time, as used in the Mamba family of models.

We’re only concerned with the time-dependent aspects of the positional encoding, so we’ll allow ourselves to apply certain time-independent operations, most notably changes of basis, to $q$ and $k$ without really considering the positional encoding to have changed. (When $q$ and $k$ are produced by linear layers in neural networks, the basis is arbitrary anyway.) In the absence of positional encodings, attention would require us to compute inner products like…
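The "technically legal but strange" class mentioned above is tied to non-diagonalizable generators. As a toy illustration (my own construction with a nilpotent $2\times 2$ Jordan block, not necessarily the exact class the article derives), the sketch below shows that $\exp(tX)$ for such a generator still satisfies the one-parameter group law, yet behaves very differently from a rotation: it is a shear whose entries grow without bound.

```python
import numpy as np

# A non-diagonalizable generator: X is a nilpotent Jordan block, so
# X @ X = 0 and the matrix exponential truncates: exp(tX) = I + t X.
X = np.array([[0.0, 1.0],
              [0.0, 0.0]])

def A(t):
    return np.eye(2) + t * X  # exact, since higher powers of X vanish

s, t = 0.8, 2.1
# The one-parameter group law still holds for this family...
assert np.allclose(A(s) @ A(t), A(s + t))
# ...but unlike a rotation, A(t) is a shear that does not preserve norms,
# which hints at why such encodings are unlikely to be practical.
v = np.array([1.0, 1.0])
assert not np.isclose(np.linalg.norm(A(5.0) @ v), np.linalg.norm(v))
```

The contrast with the rotational (RoPE-like) family is the point: both satisfy the group structure the desirable properties force, but only the bounded, norm-preserving families look usable in real models.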
This excerpt is published under fair use for community discussion. Read the full article at Jane Street Blog.