Ball's Blog

Introduction to Rotary Positional Embedding


What is Positional Embedding?

Let’s start with a simple sentence: “Big apple and small apple.” This sentence contains the word “apple” twice. Although the word is the same, the two occurrences mean different things: the first “apple” is the big one, and the second “apple” is the small one. To tell them apart, the model needs to know where each word sits in the sentence (the first “apple” is next to “big”, while the second “apple” is next to “small”).

Positional Embedding example sentence

Previous approaches to positional embedding (Additive)

In the original Transformer model, positional embedding is implemented as an additive method. Given a token embedding $x_t \in \mathbb{R}^d$, the positional embedding $PE(t) \in \mathbb{R}^d$ is added to it:

$$x_t + PE(t)$$

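However, this additive method has a limitation: it does not encode the relative position between tokens explicitly. After adding the positional embedding and applying self-attention, the score between two tokens mixes content terms with absolute-position terms, so the offset between the two tokens does not appear directly and has to be learned indirectly.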
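To make the additive scheme concrete, here is a minimal sketch. It assumes the sinusoidal encoding of the original Transformer for $PE(t)$; the `apple` embedding values and the helper names `sinusoidal_pe` and `add_position` are made up for illustration.

```python
import math

def sinusoidal_pe(t, d):
    """Sinusoidal positional encoding PE(t) as a list of length d (d must be even)."""
    pe = []
    for i in range(d // 2):
        freq = 10000 ** (-2 * i / d)
        pe.append(math.sin(t * freq))  # even index
        pe.append(math.cos(t * freq))  # odd index
    return pe

def add_position(x_t, t):
    """Additive scheme: return x_t + PE(t)."""
    pe = sinusoidal_pe(t, len(x_t))
    return [a + b for a, b in zip(x_t, pe)]

# Hypothetical 4-dimensional embedding of the word "apple".
apple = [0.5, -1.0, 0.25, 2.0]

# "Big apple and small apple": "apple" appears at positions 1 and 4 (0-based).
first_apple = add_position(apple, 1)
second_apple = add_position(apple, 4)
```

The same word now maps to two different vectors, but each vector only carries its own absolute position.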

Rotary Positional Embedding (Multiplicative)

In the paper RoFormer: Enhanced Transformer with Rotary Position Embedding, the authors propose a new method for positional embedding called Rotary Positional Embedding (RoPE). Instead of adding a vector, it multiplies the token embedding by a rotation matrix.

Terminology

$d$: Dimension of hidden state

$x_t, q_t, k_t$: Token embedding, Query, Key for the $t$-th token. These are all in $\mathbb{R}^d$ unless specified.

$x_t', q_t', k_t'$: Rotated token embedding, Query, Key for the $t$-th token. These are all in $\mathbb{R}^d$ unless specified.

Complex Space

Representation of 2D vector in complex space

For any 2D vector $q_t = (q_{t0}, q_{t1}) \in \mathbb{R}^2$, we can represent it in the complex plane as:

$$q_t = |q| e^{i \alpha}$$

where $|q|$ is the magnitude of the vector and $\alpha$ is the angle it makes with the positive x-axis.

This follows from Euler’s formula: $e^{i \theta} = \cos(\theta) + i \sin(\theta)$
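As a quick illustration, the polar form can be checked with Python's standard `cmath` module. The vector `(3, 4)` is an arbitrary example value.

```python
import cmath

# The 2D vector (3, 4) written as the complex number 3 + 4i.
q = complex(3.0, 4.0)

# polar() returns the magnitude |q| and the angle alpha (in radians).
magnitude, alpha = cmath.polar(q)

# rect() rebuilds |q| * e^{i * alpha}, which should give back q.
rebuilt = cmath.rect(magnitude, alpha)
```

Here `magnitude` is the length of the vector and `rebuilt` matches `q` up to floating-point error.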

Complex plane

Rotation in complex space

In the complex plane, we can perform a rotation by multiplying a complex number with another complex number that represents the rotation. For example, if we want to rotate a vector by an angle $\theta$, we can multiply it with $e^{i \theta}$:

$$q_t' = e^{i \theta} q_t = |q| e^{i (\alpha + \theta)}$$

Rotation in complex plane
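The rotation can be sketched in a few lines with the standard library. The helper name `rotate` and the example angle are illustrative choices.

```python
import cmath
import math

def rotate(q, theta):
    """Rotate the complex number q by theta radians via multiplication with e^{i*theta}."""
    return cmath.exp(1j * theta) * q

q = complex(1.0, 0.0)            # unit vector along the positive x-axis
q_rot = rotate(q, math.pi / 2)   # a quarter turn, approximately 0 + 1i
```

The magnitude is unchanged by the multiplication; only the angle moves by `theta`.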

Details of Rotary Positional Embedding (with hidden_dim=2)

Let’s consider a simple case where the hidden dimension is 2. Given a token embedding at position $t$, $x_t = (x_{t0}, x_{t1}) \in \mathbb{R}^2$, the rotary positional embedding is applied as a rotation in the complex plane.

First, let’s write the token embedding in the complex plane. For simplicity, we use the same magnitude $|x|$ for every token (in general it would be $|x_t|$):

$$x_t = (x_{t0}, x_{t1}) = (|x| \cos(\alpha_t), |x| \sin(\alpha_t)) = |x| e^{i \alpha_t}$$

The rotation is defined as follows:

$$\begin{aligned} x_m' &= e^{i \theta m} x_m = |x| e^{i (\alpha_m + \theta m)} \\ x_n' &= e^{i \theta n} x_n = |x| e^{i (\alpha_n + \theta n)} \end{aligned}$$

$x_m$ rotates by $\theta m$ and $x_n$ rotates by $\theta n$.

By doing so, the inner product between $x_m'$ and $x_n'$ captures the relative position between $m$ and $n$.

The inner product in complex space is defined as follows: $x_m' \cdot x_n' = \mathrm{Real}(x_m' \overline{x_n'})$

$$\begin{aligned} x_m' \cdot x_n' &= (|x| e^{i (\alpha_m + \theta m)}) \cdot (|x| e^{i (\alpha_n + \theta n)}) \\ &= \mathrm{Real}(|x| e^{i (\alpha_m + \theta m)} |x| e^{-i (\alpha_n + \theta n)}) \\ &= \mathrm{Real}(|x|^2 e^{i (\alpha_m - \alpha_n + \theta m - \theta n)}) \\ &= |x|^2 \cos(\alpha_m - \alpha_n + \theta (m-n)) \quad \text{(Eq 1.)} \end{aligned}$$

As you can see, the inner product between $x_m'$ and $x_n'$ depends on the positions $m$ and $n$ only through the term $\theta (m-n)$, that is, through their relative position. This allows the model to capture the relative positional information effectively.
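The relative-position property is easy to check numerically. The sketch below is only an illustration: the value of `THETA`, the two embeddings, and the helper names `rope_2d` and `inner` are arbitrary choices, not values from the paper.

```python
import cmath

THETA = 0.3  # hypothetical rotation frequency

def rope_2d(x, pos, theta=THETA):
    """Rotate the 2D embedding x (as a complex number) by theta * pos."""
    return cmath.exp(1j * theta * pos) * x

def inner(a, b):
    """Inner product of two 2D vectors written as complex numbers: Real(a * conj(b))."""
    return (a * b.conjugate()).real

x_a = complex(1.0, 2.0)    # arbitrary example embeddings
x_b = complex(-0.5, 1.5)

score_near = inner(rope_2d(x_a, 5), rope_2d(x_b, 2))        # m - n = 3
score_far = inner(rope_2d(x_a, 103), rope_2d(x_b, 100))     # m - n = 3, shifted by 98
score_other = inner(rope_2d(x_a, 5), rope_2d(x_b, 4))       # m - n = 1
```

`score_near` and `score_far` agree up to floating-point error because both pairs are 3 positions apart, while `score_other` differs because the offset is different.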

In real implementations, we don’t work with complex numbers explicitly. Instead, we use a 2D rotation matrix to achieve the same effect. In this post, I will not go into the details of the 2D rotation matrix. If you are interested, check here
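For readers who want to see the equivalence, here is a small sketch comparing the two views. It uses the standard 2D rotation matrix with rows $(\cos\phi, -\sin\phi)$ and $(\sin\phi, \cos\phi)$; the function names and the example numbers are illustrative.

```python
import cmath
import math

def rotate_matrix(x0, x1, angle):
    """Apply the 2D rotation matrix [[cos, -sin], [sin, cos]] to the vector (x0, x1)."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * x0 - s * x1, s * x0 + c * x1)

def rotate_complex(x0, x1, angle):
    """Same rotation, done by multiplying x0 + i*x1 with e^{i*angle}."""
    z = cmath.exp(1j * angle) * complex(x0, x1)
    return (z.real, z.imag)

theta, pos = 0.3, 7  # hypothetical frequency and token position
by_matrix = rotate_matrix(1.0, 2.0, theta * pos)
by_complex = rotate_complex(1.0, 2.0, theta * pos)
```

Both results are the same up to floating-point error, so an implementation is free to use whichever form is more convenient.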

Expanding the RoPE into hidden_dim=$d$

In practice, the hidden dimension is usually much larger than 2. In this case, we split the dimensions into pairs and apply the same kind of rotation to each pair, with a separate rotation frequency per pair. For example, if the hidden dimension is 4, we rotate the first two dimensions and the last two dimensions separately:

$$\begin{aligned} x_m &= (x_{m0}, x_{m1}, x_{m2}, x_{m3}) = ((x_{m0}, x_{m1}), (x_{m2}, x_{m3})) \\ x_n &= (x_{n0}, x_{n1}, x_{n2}, x_{n3}) = ((x_{n0}, x_{n1}), (x_{n2}, x_{n3})) \end{aligned}$$

Assume $\theta_i = 10000^{-2(i-1)/d}$.
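As a worked example, plugging $d = 4$ into this formula gives one frequency per pair, that is, $i = 1$ for the first pair and $i = 2$ for the second:

$$\begin{aligned} \theta_1 &= 10000^{-2(1-1)/4} = 10000^{0} = 1 \\ \theta_2 &= 10000^{-2(2-1)/4} = 10000^{-1/2} = 0.01 \end{aligned}$$

So the first pair rotates by 1 radian per position, while the second pair rotates 100 times more slowly.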

Applying rotation would be as follows:

  1. Rotate the first two dimensions with $\theta_1$:
$$\begin{aligned} (x_{m0}', x_{m1}') &= e^{i \theta_1 m} (x_{m0}, x_{m1}) \\ (x_{n0}', x_{n1}') &= e^{i \theta_1 n} (x_{n0}, x_{n1}) \end{aligned}$$
  2. Rotate the last two dimensions with $\theta_2$:
$$\begin{aligned} (x_{m2}', x_{m3}') &= e^{i \theta_2 m} (x_{m2}, x_{m3}) \\ (x_{n2}', x_{n3}') &= e^{i \theta_2 n} (x_{n2}, x_{n3}) \end{aligned}$$
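The two steps above generalize to any even hidden dimension. The following is a minimal pure-Python sketch, not a reference implementation: the function names `rope` and `dot` and the example vectors are made up, and it pairs adjacent dimensions as in this post (some implementations pair dimensions differently).

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each adjacent pair (x[2j], x[2j+1]) by the angle theta_{j+1} * pos."""
    d = len(x)
    assert d % 2 == 0, "hidden dimension must be even"
    out = []
    for j in range(d // 2):              # j = i - 1 in the post's 1-based notation
        theta = base ** (-2 * j / d)     # theta_i = 10000^{-2(i-1)/d}
        angle = theta * pos
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = x[2 * j], x[2 * j + 1]
        out += [c * x0 - s * x1, s * x0 + c * x1]  # 2D rotation of the pair
    return out

def dot(a, b):
    """Plain inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))

# Arbitrary example vectors with hidden_dim = 4.
x_m = [1.0, 2.0, 3.0, 4.0]
x_n = [0.5, -1.0, 2.0, 0.0]

near = dot(rope(x_m, 5), rope(x_n, 2))       # m - n = 3
far = dot(rope(x_m, 105), rope(x_n, 102))    # m - n = 3, shifted by 100
```

As in the 2D case, `near` and `far` agree up to floating-point error, because every pair contributes a term that depends only on $m - n$.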

You may wonder why we use a different $\theta$ for each pair of dimensions. This is because we want to capture positional information at different frequencies. Recall (Eq 1.)