Number Theoretic Transform - A Gentle Introduction: Part I

by Steven Yue. Posted on Mar 22, 2023

“A Monet painting of convolution” by DALL-E.

Polynomial Multiplications as Convolutions

Polynomials are everywhere. In fact, in Computer Science, they are really just fancy lists of numbers that have a particular way of doing arithmetic.

For example, adding two polynomials is as simple as just summing up their coefficients. Multiplying a polynomial by a constant is just multiplying every coefficient by that constant.

However, when we want to multiply polynomials, things get slightly more complicated.

Multiplying Polynomials

Suppose we have two degree-$(d-1)$ polynomials, $P(x), Q(x)$:

$$P(x) = \sum_{i=0}^{d-1} p_i x^i$$ $$Q(x) = \sum_{i=0}^{d-1} q_i x^i$$

The classical way of performing polynomial multiplications is to expand out the summation, and multiply everything in $P(x)$ with everything in $Q(x)$ to form a degree $2d-2$ polynomial:

$$ \begin{align*} (P \cdot Q)(x) &= \left(\sum_{i=0}^{d-1} p_i x^i \right)\left(\sum_{i=0}^{d-1} q_i x^i \right)\\ &= \sum_{i=0}^{2d-2} c_i x^i \\ \end{align*} $$

where $c_k$ is the sum of product of coefficients in $P(x), Q(x)$ such that $c_k = \sum_{i+j=k} p_i q_j$. This summation is commonly known as the convolution of the coefficients of $P, Q$.

Now we can code up a simple implementation for the naive polynomial multiplication. Here we represent polynomials as lists of their coefficients, with the first element corresponding to the lowest degree coefficient (constant), and the last element to the highest degree.

def mul_poly_naive(a, b):
  tmp = [0] * (len(a) + len(b) - 1) # the product of two polynomials cannot exceed the sum of their degree
  
  # schoolbook multiplication
  for i in range(len(a)):
    # perform a_i * b
    for j in range(len(b)):
      tmp[i + j] += a[i] * b[j]
  
  return tmp

a = [1, 2, 3, 4]
b = [1, 3, 5, 7]
print(mul_poly_naive(a, b))

[1, 5, 14, 30, 41, 41, 28]

It works!

However, we are still far from finished. Next, we will introduce a few interesting tweaks/constraints to the polynomials to make things work a little bit more conveniently.

Additional Polynomial Structure

In the field of cryptography, we typically work with finite fields, and the most common one is simply $\mathbb{Z}_q$, i.e., integers modulo a prime number $q$. We define the polynomials $P, Q$ described above with coefficients $p_i, q_i \in \mathbb{Z}_q$ as belonging to the set $\mathbb{Z}_q [x]$.

The interesting thing about working in a finite field is that the polynomial coefficients “wrap around” when being multiplied. So regardless of how much polynomial arithmetic we perform, the coefficients of the polynomial can still be bounded in a fixed range, which is very attractive from an implementation perspective.

def mul_poly_naive_q(a, b, q):
  tmp = [0] * (len(a) + len(b) - 1) # the product of two polynomials cannot exceed the sum of their degree
  
  # schoolbook multiplication
  for i in range(len(a)):
    # perform a_i * b
    for j in range(len(b)):
      tmp[i + j] = (tmp[i + j] + a[i] * b[j]) % q
  
  return tmp

print(mul_poly_naive_q(a, b, 17))

[1, 5, 14, 13, 7, 7, 11]

Putting a $\text{Ring}^\text{TM}$ on it

However, one can quickly notice an item that is “not so positive” in the current setting - which is that the degree of the polynomial will only grow larger and larger as we perform more multiplications. That means we need to have longer arrays to store coefficients and more complex convolutions every time a multiplication is performed. It would be great if the degree of the polynomials could “wrap around” just like the coefficients.

As it turns out, there’s another algebraic structure that’s exactly for that - Polynomial Quotient Rings. Just like in the base field $\mathbb{Z}_q$ where we would take modulo $q$ after every arithmetic operation, we can also take modulo some polynomial $\phi(x)$ after every polynomial operation. The resulting polynomial’s degree would never be larger or equal to the degree of $\phi(x)$. We call such structure $\mathbb{Z}_q [x] / (\phi(x))$.

In the context of lattice-based Cryptography, $\phi(x)$ is chosen to be a polynomial looking either like $x^d - 1$ or $x^d + 1$. This kind of polynomial is usually referred to as a cyclotomic polynomial. Next we will examine how the two choices of the cyclotomic polynomial influence the “wrap around” behavior of the resulting ring.

Polynomial multiplication over a polynomial ring

Before we look into the cyclotomic polynomial $\phi(x)$, let’s see how to manually perform a polynomial multiplication over a polynomial ring.

The recipe is actually very simple: we first perform the normal polynomial multiplication, resulting in a polynomial with high degree; then, we perform a reduction operation where we take the high degree polynomial modulo the polynomial modulus, and obtain a low degree result.

Let’s look into a particular example:

$$ \begin{align*} P(x) &= x^3 + x + 1 \\ Q(x) &= x^2 - x \\ \phi(x) &= x^4 + 1 \\ P(x) \cdot Q(x) \text{ mod } \phi(x) &= (x^3 + x + 1) \cdot (x^2 - x) \text{ mod } (x^4 + 1) \\ &= x^5 - x^4 + x^3 - x \text{ mod } (x^4 + 1) \\ &= x^5 - x^4 + x^3 - x - x \cdot (x^4 + 1) + 1 \cdot (x^4+1) \\ &= x^3 - 2x + 1 \end{align*} $$

As we see from the example above, the multiplication part was straightforward cross-coefficient products. The reduction part was iteratively removing copies of $x^4+1$ from the high degree expression until the resulting degree was smaller than the modulus.

Cyclic Convolution (CC)

Now, let’s take a look at the case when $\phi(x) = x^d - 1$. If we take any polynomial modulo $x^d - 1$, it’s equivalent to removing multiples of $x^d - 1$ from the polynomial until the resulting polynomial has degree lower than $d$.

For example, for a term like $c_d x^d$, after the reduction, it becomes $c_d$, as it removes $c_d$ copies of $x^d - 1$, and the negative one term caused $c_d$ to shift from the degree-$d$ coefficient to the degree-0 coefficient. Similarly, the term $c_{d+1} x^{d+1}$ will be reduced to $c_{d+1} x$, due to the negative one term in the modulus. We can gradually see that any terms that exceed degree-$d-1$ will essentially “wrap around” and become smaller degree. With such behavior, the ring formed by $\mathbb{Z}_q [x] / (x^d - 1)$ has a very interesting property called cyclic convolution.

To illustrate this property, let’s compute the convolution of two degree-$(d-1)$ polynomials $P, Q$ again. Upon obtaining the resulting degree-$(2d-2)$ polynomial $\sum_{i=0}^{2d-2} c_i x^i$, we take modulo $x^d - 1$, which essentially means summing the degree-$i$ for coefficients $i>d-1$ with the degree-$(i-d)$ coefficients:

$$ \sum_{i=0}^{d-1} c_i’ x^i = \sum_{i=0}^{d-1} (c_i + c_{i+d-1}) x^i $$

As we can see, it is almost as if the terms higher than the modulus degree $d$ end up showing up at the other end of the table and stacking on top of the lower coefficients, hence the name cyclic.

def mul_poly_naive_q_cc(a, b, q, d):
  tmp = [0] * (d * 2 - 1) # intermediate polynomial has degree 2d-2
  
  # schoolbook multiplication
  for i in range(len(a)):
    # perform a_i * b
    for j in range(len(b)):
      tmp[i + j] = (tmp[i + j] + a[i] * b[j]) % q
  
  # take polynomial modulo x^d - 1
  for i in range(d, len(tmp)):
    tmp[i - d] = (tmp[i - d] + tmp[i]) % q
    tmp[i] = 0

  return tmp[:d]

print(mul_poly_naive_q_cc(a, b, 17, 4))

[8, 12, 8, 13]

Negative Wrapped Convolution (NWC)

Conversely, the other modulus $\phi(x) = x^d + 1$ results in what is called negative wrapped convolution, or sometimes negacyclic convolution. Obviously, because the sign of the one term flipped, now when removing copies of the modulus from a polynomial, what wraps around will be the negation of the original high degree term coefficient that got reduced. Therefore, the overall expression in the NWC case is:

$$ \sum_{i=0}^{d-1} c_i’ x^i = \sum_{i=0}^{d-1} (c_i - c_{i+d-1}) x^i $$

def mul_poly_naive_q_nwc(a, b, q, d):
  tmp = [0] * (d * 2 - 1) # intermediate polynomial has degree 2d-2
  
  # schoolbook multiplication
  for i in range(len(a)):
    # perform a_i * b
    for j in range(len(b)):
      tmp[i + j] = (tmp[i + j] + a[i] * b[j]) % q
  
  # take polynomial modulo x^d + 1
  for i in range(d, len(tmp)):
    tmp[i - d] = (tmp[i - d] - tmp[i]) % q
    tmp[i] = 0

  return tmp[:d]

print(mul_poly_naive_q_nwc(a, b, 17, 4))

[11, 15, 3, 13]

From the test results, we can see that both implementations (CC, NWC) lead to a way of multiplying polynomials while preserving the underlying field size and order.

Performance

Throughout the post, we have been working with the naive approach of multiplying polynomials, i.e., convolution of two dimension-$d$ lists.

Although it works, we can immediately notice a performance issue with convolution-based approaches - it requires iterating over all pairs of coefficients which results in runtime complexity of $O(d^2)$. In the example of Kyber where the polynomials have degree of 256, doing a naive multiplication between elements in Kyber would require as many as $256^2 = 65536$ coefficient multiplications! It is obvious that a quadratic cost like this would not scale.

Evaluation Form

Fortunately enough, polynomials are objects that can not only be expressed in their normal (coefficient) form like the summation above, but also in evaluation form.

Consider the same polynomials $P, Q$ as above. Instead of describing the coefficients $p_i, q_i$, we can evaluate the polynomials at distinct points $x_i$:

$$P(x_0), P(x_1), \dots, P(x_d)$$ $$Q(x_0), Q(x_1), \dots, Q(x_d)$$

As long as we have $d+1$ distinct evaluations of a polynomial, one can efficiently recover the polynomial’s coefficient form through a process called Lagrange Interpolation. A common choice is to pick $x_i$ to just be numbers from 0 to $d$, such that we have $P(0), P(1), \dots, P(d)$.

Now, coming back to the topic of polynomial multiplication, we see that, in the evaluation form, multiplication is strictly component-wise:

$$(P \cdot Q)(x_i) = P(x_i) \cdot Q(x_i)$$

This means, the quadratic cost when it comes to polynomial multiplication can be thus reduced to $O(d)$ linear cost (one multiplication per evaluation point)!

Next, we implement a fast polynomial multiplication in the evaluation form.

def eval_poly(p, d=-1):
  deg = len(p) - 1
  if d != -1:
    deg = d
  points = [0] * (deg + 1)

  for i in range(len(points)):
    # evaluate P(i) using Horner's method
    res = p[-1]
    for c in p[-2::-1]:
      res *= i
      res += c
    points[i] = res

  return points

import numpy as np
from scipy.interpolate import lagrange

def interp_poly(p):
  # here we use SciPy's Lagrange Interpolation implementation to recover the polynomial.
  coefs = lagrange(range(len(p)), p).coef[::-1]
  return np.rint(coefs).astype(int).tolist()

def mul_poly_eval_attempt(a, b):
  # first, convert poly A, B to eval form
  max_deg = len(a) + len(b) - 2
  ea = eval_poly(a, d=max_deg)
  eb = eval_poly(b, d=max_deg)
  res = [0] * (max_deg + 1)

  # perform component wise mul
  for i in range(max_deg + 1):
    res[i] = ea[i] * eb[i]
  
  # convert back to coef form
  return interp_poly(res)

print(eval_poly([0, 1, 4]))
print(mul_poly_eval_attempt(a, b))
print(mul_poly_naive(a, b) == mul_poly_eval_attempt(a, b))

[0, 5, 18]
[1, 5, 14, 30, 41, 41, 28]
True

🎉🎉🎉 Tada! We can now perform multiplication in the evaluation form, which indeed only requires linear $O(d)$ cost.

But what about Rings?

Once we have the evaluation form implementation for the naive convolution, the next natural question is how do we deal with cyclic/negative wrapped convolutions.

There is a super simple approach we can follow: we perform the naive multiplication in evaluation form, and receive a degree $2d-2$ polynomial. Then we reduce this polynomial by our quotient polynomial $x^d \pm 1$ and get the right result.

In fact, this approach is already good enough because the reduction at the end is $O(d)$ - just a linear scan in the resulting list and perform some “wrap around” field additions or subtractions.

def mul_poly_q_cc_eval_attempt(a, b, q, d):
  # first, convert poly A, B to eval form
  max_deg = d * 2 - 2
  ea = eval_poly(a, d=max_deg)
  eb = eval_poly(b, d=max_deg)

  # reduce by q

  res = [0] * (max_deg + 1)

  # perform component wise mul
  for i in range(max_deg + 1):
    res[i] = ea[i] * eb[i]
  
  # convert back to coef form and reduce coefficients to Z_q
  res_interp = [x % q for x in interp_poly(res)]

  # reduce by x^d - 1
  for i in range(d, len(res_interp)):
    res_interp[i - d] = (res_interp[i - d] + res_interp[i]) % q
    res_interp[i] = 0

  return res_interp[:d]

def mul_poly_q_nwc_eval_attempt(a, b, q, d):
  # first, convert poly A, B to eval form
  max_deg = d * 2 - 2
  ea = eval_poly(a, d=max_deg)
  eb = eval_poly(b, d=max_deg)

  # reduce by q

  res = [0] * (max_deg + 1)

  # perform component wise mul
  for i in range(max_deg + 1):
    res[i] = ea[i] * eb[i]
  
  # convert back to coef form and reduce coefficients to Z_q
  res_interp = [x % q for x in interp_poly(res)]

  # reduce by x^d _ 1
  for i in range(d, len(res_interp)):
    res_interp[i - d] = (res_interp[i - d] - res_interp[i]) % q
    res_interp[i] = 0

  return res_interp[:d]

print(mul_poly_naive_q_cc(a, b, 17, 4) == mul_poly_q_cc_eval_attempt(a, b, 17, 4))
print(mul_poly_naive_q_nwc(a, b, 17, 4) == mul_poly_q_nwc_eval_attempt(a, b, 17, 4))

True
True

There is no free lunch

The approaches we have described above have a few obvious catches:

2x more evaluation points: Since the final polynomial has degree $2d-2$ as we multiply two degree-$(d-1)$ polynomials, we need twice as many ($2d-1$, to be exact) evaluation points in order to be able to fully recover it back to its coefficient form. This means, we will have to first evaluate $P, Q$ at $2d-1$ points instead of $d$ to begin with.
Cost of converting to/from coefficient form: The cost we saved in multiplication must show up somehow somewhere else. Indeed, the cost goes into how we effectively evaluate polynomials (coefficient form -> evaluation form), and interpolate polynomials (evaluation form -> coefficient form).

The first catch is relatively fine - it is still linear compared to the degree of the polynomials, which is much better than the quadratic cost of the naive approach.

The second catch is where things get slightly more problematic, as evaluating a polynomial at a single point is already at least $O(d)$ even when we apply some of the commonly used optimizations. (One of the optimizations is called Horner’s method, which allows evaluating polynomials in $O(d)$ time.) Besides, in this post the entire interpolation is done via the black-box lagrange method provided by SciPy, and we have no idea what complexity it has! It turns out that this is the make-or-break moment - if we can find out a way to efficiently transition between the two forms, we can solve the problem of efficient polynomial multiplication.

One easy hack to relieve the cost is to keep polynomials in the evaluation form throughout the computation - in fact, maybe we should never go back to coefficient form unless we have to. This can ensure that we have a minimum number of conversions between the two forms. But this is not a topic we want to go into yet…

The systematic and efficient way to perform this conversion is a procedure called the Number Theoretic Transform, or NTT for short. Up until this point, we have already paved the foundation for understanding the reason why we need NTT and what NTT can help us with. In our next post, we will dive right into NTT and see how it helps us with the aforementioned problem.