
Using group theory to explore the space of positional encodings for attention

#positional encoding · #attention mechanisms · #group theory · #machine learning · #mathematical constraints
TL;DR (AI summary)

Positional encodings are essential for attention mechanisms to understand sequence order, but their design space is more constrained than it appears. By applying group theory and requiring properties like linearity, translation invariance, and continuity, the set of valid encodings reduces to one-parameter matrix groups. This analysis shows that all practical encodings in use today—like RoPE—fall into a few mathematically permissible families. A previously unrecognized but exotic class of encodings emerges as theoretically valid, though likely impractical.

Original article: Jane Street Blog

Full article excerpt:

Attention is a computational primitive at the core of modern language models, allowing internal representations to reference and influence each other. It’s how these models handle sequential data in the first place. Yet, naively implemented, attention doesn’t have any notion of position. In the core attention computation, you calculate the dot product between a given “query” and a “key,” which tells you how much to attend to the value at that key, and this dot product says nothing about where in the sequence the queries and keys are. Since in most settings that information matters a great deal, you generally want to somehow perturb your dot-product calculation so that it depends on the positions (usually just the relative positions) of the query and key.

The so-called “positional encoding” that you use represents an important modeling decision, because it governs how the model views the passage of time. The most popular positional encoding, called RoPE, rotates components of key and query vectors by an angle that depends on the position in the sequence, like the hands of a clock. This works well in practice, but it’s far from the only option.

On a recent Friday afternoon I wondered, well, what are all the options? I’m an ML researcher at Jane Street, and when we’re working with sequential models we’ve often debated whether we’re using the best positional encoding. This raises the question: what positional encodings are even possible? As I started investigating, I discovered to my surprise that the space is quite constrained. The reason is that there are a few key properties that any desirable positional encoding should have, and once you formalize those, you’re left with a very particular mathematical structure. (Spoiler: a one-parameter group.)

In exploring that structure, I was able to show that there are only a few families of valid positional encodings (the demonstration of that fact forms the bulk of this blog post), and actually all of the sensible ones are already being used in real systems. It was a reassuring finding, because it means that we don’t need to rack our brains to come up with some perfect positional encoding, as we are probably already using it. Even so, in doing this analysis I did discover a strange, unlikely-to-work-out-in-practice, but technically legal class of positional encodings that seems to be totally unexplored.

Formalizing a positional encoding

Let’s suppose we have a time series of queries q(t): \mathbb{R} \to \mathbb{R}^d, and a time series of keys k(t): \mathbb{R} \to \mathbb{R}^d. We write q and k as functions of time so that we can accommodate continuous or irregularly sampled inputs, but we could just as well restrict ourselves to only integer times if we prefer. “Time” here is a stand-in for any increasing quantity that we might care to use to measure progression through a sequence; it could be a sequence index, or literal elapsed time in a time series modeling problem, or even some kind of learned notion of time, as used in the Mamba family of models.

We’re only concerned with the time-dependent aspects of the positional encoding, so we’ll allow ourselves to apply certain time-independent operations, most notably changes of basis, to q and k without really considering the positional encoding to have changed. (When q and k are produced by linear layers in neural networks, the basis is arbitrary anyway.) In the absence of positional encodings, attention would require us to compute inner products like…
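Since the excerpt cuts off just before the derivation, here is a minimal NumPy sketch of the RoPE mechanism described above. It is not taken from the original post; the function name rope_rotate, the pairing of even/odd components, and the base-10000 frequency schedule are illustrative assumptions. The point it demonstrates is the one the post relies on: rotating query and key component pairs by angles proportional to their times makes the attention dot product depend only on the relative offset between those times.

```python
import numpy as np

def rope_rotate(x, t, base=10000.0):
    """Rotate consecutive (even, odd) component pairs of x by angles
    proportional to time t, one angular frequency per pair (RoPE-style)."""
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE pairs up components, so d must be even"
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = t * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin  # 2x2 rotation of each pair
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# The score between a rotated query at time t_q and a rotated key at time t_k
# depends only on the offset t_q - t_k, not on the absolute times.
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
s1 = rope_rotate(q, 5.0) @ rope_rotate(k, 2.0)      # offset of 3
s2 = rope_rotate(q, 105.0) @ rope_rotate(k, 102.0)  # same offset, shifted by 100
print(np.isclose(s1, s2))  # True: the score is translation invariant
```

The check passes because each pair's rotations compose as R(θ_q)ᵀ R(θ_k) = R(θ_k − θ_q), which is the kind of translation-invariance property the post goes on to formalize when it narrows the design space to one-parameter groups.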

This excerpt is published under fair use for community discussion. Read the full article at Jane Street Blog.
