Introduction
RSA relies heavily on modular exponentiation. Given a modulus (with and prime), encryption and decryption are:
The core operation is therefore computing efficiently.
Quick RSA key-generation recap
- Choose two large primes and .
- Compute .
- Compute Euler's totient: .
- Choose such that and .
- Compute .
Security note: this article uses textbook RSA for learning. Real-world RSA requires secure padding (e.g. OAEP) and constant-time implementations.
Baseline: binary modular exponentiation
The standard optimization over naive exponentiation is square-and-multiply:
def exp_mod(x: int, exp: int, mod: int) -> int:
x %= mod
if x == 0:
return 0
result = 1
while exp > 0:
if exp & 1:
result = (result * x) % mod
x = (x * x) % mod
exp >>= 1
return result
This is already much faster than computing first and reducing at the end.
Why Montgomery multiplication?
Modular multiplication repeatedly performs reductions modulo . Montgomery's idea is to move values into a different representation where reductions avoid expensive generic division and instead use operations based on a power-of-two radix.
Choose a radix:
For odd RSA moduli , choosing as a power of two automatically gives .
Montgomery domain
A value is represented in Montgomery form as:
Multiplying two Montgomery values gives:

To stay in Montgomery form we need to divide one factor of , i.e. compute:

So Montgomery multiplication is "multiply, then apply Montgomery reduction".
REDC: Montgomery reduction
Montgomery reduction (often written REDC) computes:
assuming .
Precompute:
Then:
- if , return , else return
Because :
- becomes
T & (R - 1) - division by becomes
>> k
Pseudocode:
def redc(T: int, N: int, N_prime: int, k: int) -> int:
mask = R - 1
m = ((T & mask) * N_prime) & mask
t = (T + m * N) >> k
if t >= N:
t -= N
return t
The full explanation for how works can be found here
Implementation details
To use Montgomery multiplication efficiently, we precompute constants once:
Python implementation:
def mont_reduce(T, N=N, nprime=nprime, k=k, mask=mask):
m = ((T & mask) * nprime) & mask
t = (T + m * N) >> k
if t >= N:
t -= N
return t
def mont_mul(a, b):
return mont_reduce(a * b)
def to_mont(x):
# x * R mod N, implemented as MonPro(x, R^2)
return mont_mul(x % N, R2)
def from_mont(x):
# x * R^{-1} mod N
return mont_reduce(x)
def exp_mont(x, exp):
x %= N
if x == 0:
return 0
acc = to_mont(1)
base = to_mont(x)
while exp:
if exp & 1:
acc = mont_mul(acc, base)
base = mont_mul(base, base)
exp >>= 1
return from_mont(acc)
Benchmarking notes
In pure Python, this version can be slower than expected because interpreter overhead may dominate the arithmetic gains.
For fair benchmarks we need to measure in lower-level languages (C/C++/Rust) where Montgomery arithmetic usually shines
You can find the full implementation and benchmark scripts in this repository.