## Saturday, October 6, 2012

### SHA-3 has landed

Exactly 12 years after announcing Rijndael as AES, NIST has done it again. Keccak has been hailed as the SHA-3 winner. Congratulations to the Keccak team, who have designed an amazingly elegant hash, and in particular to Joan Daemen, who has now won 2 NIST competitions!

SHA-3 originated amidst fears that SHA-2, the current NIST hashing standard, was wounded after the breakthrough Wang attacks on MD4, MD5, and SHA-1. Furthermore, SHA-2, like its predecessors, suffered from the common flaws of the Merkle-Damgård (MD) construction, among which are length-extension attacks, faster-than-theoretical preimages and multicollisions. All SHA-3 finalists fixed these flaws, as a submission requirement.

The winner, Keccak, strays quite drastically from the SHA-2 design. Keccak uses the sponge construction, which relies on a fixed-width permutation $$f$$ as the main building block, unlike the more common compression function from the MD designs. The sponge works in two phases: absorption and squeezing. Here's a picture:

 The sponge construction

During the absorption phase, $$r$$ bits from the message are injected into the state, after which the whole state ($$r+c$$ bits) is permuted using $$f$$. Once we're done with all the message bits (plus padding), we simply output the first l bits of the state as the final hash, applying the permutation repeatedly if necessary.

The strength of the sponge construction comes from the indifferentiability of f from a random permutation. Keccak is not perfect in this regard: an observation by Duan and Lai distinguishes the full Keccak permutation from random in expected $$2^{1579}$$ time, down from $$2^{1599}$$  — not a real attack.

Performance was taken seriously in this competition. Keccak is a curious selection in that it is rather unbalanced towards hardware speed, while it is somewhat lacking in the software speed department. eBASH shows that on high-end processors, Keccak is slower than SHA-2 by up to 2x. Meanwhile, Keccak wipes the floor in hardware speed and die area.

For certain applications, this particular performance profile can give the advantage to an attacker. Consider the scenario of password storage, where the defender switches from PBKDF2-HMAC-SHA-2 to PBKDF2-HMAC-SHA-3. This reduces the cost of an attacker while increasing the cost of the defender, for the same number of PBKDF2 iterations. That said, this is a rather fringe case, and the defender would be better served by switching to a memory-hard key-derivation function if possible.

SHA-3 comes with a completely different design which not only provides diversity, but also fixes some of the standing issues with the aging  MD construction. It is a very elegant design, very efficient in hardware, but that sadly suffers from low software performance. Perhaps chip makers will give us a hand there, or perhaps better ways to compute Keccak will arise in the meantime.

## Friday, March 30, 2012

### Invisible threats and RSA

There has been much fuss regarding the recent Lenstra et al's survey on X.509 certificates (see also Nadia Heninger's independent, but similar, results), where it was shown (among other things) that over 20000 out of 6 million RSA keys out there have no security, since their public keys are easily factored.

This is a visible threat: given a set of public keys, we can tell whether they are broken. Today I want to talk about invisible threats, i.e., threats that are undetectable to anyone but their perpetrator. We often know them as "backdoors."

Young and Yung formalized the concept of such backdoors as "Secretly Embedded Trapdoors with Universal Protection", or SETUPs. A SETUP is a modification of a cryptosystem that behaves very similarly, but its output is easily breakable by an attacker (and no one else!) A SETUP must also be (polynomially) indistinguishable from a valid output of said cryptosystem.

#### Small roots of polynomials

What does smooth integer finding, error correction, and code-breaking have in common? They can all be represented as finding "small" roots of some integer polynomial (modulo n) and, in turn, they all come down to a lattice reduction problem.

Coppersmith, in his 1996 paper, showed how to use lattice reduction as a tool to attack RSA. In particular, he showed how given half of the bits of a public key's ($$N = pq$$) factor (obtained via, say, a side-channel attack), you can find the full factor, and thus break the key. You do this by modelling the problem as a polynomial: given $$p'$$, an approximation of $$p$$ within a reasonable distance of $$p$$, e.g., $$<2^{256}$$, find small roots of

$f(x) = x + p' \pmod{p}.$

Since we do not actually know $$p$$, we use the next best thing, $$N$$, a multiple thereof. The algorithm itself works by "lifting" the polynomial onto the integers, and using the well-known LLL (co-authored by A. K. Lenstra, incidentally) approach to solving said integer polynomial. LLL is fast in the "polynomial-time"-sense: given a basis of dimension $$d$$ with $$n$$-dimensional vectors with norm at most $$B$$, LLL finds a short (but not the shortest; that problem, SVP, is NP-hard) basis in $$d^{3+o(1)}n(d + \log B) \log B$$ bit operations, or roughly $$O(d^5 \log B)$$. This means bases with some hundreds of vectors of (say) 1024-bit integers will not be fast to reduce in practice, and may require hours or days to reduce.

Luckily, the Coppersmith method doesn't require very large bases (usually dimension is under 100, for 1024-bit RSA; we'll get to that later), and is often pretty fast in practice. Now, let's see the actual backdoor in practice.

### An RSA backdoor

A very simple RSA SETUP was given by Crépeau and Slakmon, relying on Coppersmith's method to factor vulnerable keys. What it does is simply to embed approximately half of the bits of one of the factors of the RSA key, in the key itself. Here's some SAGE code to do this:

def rsa_bd_keygen(k):
"""Generate backdoored RSA key with k bits. Tested with k = 1024"""
slack = 32 # k//4 + slack bits stored
beta = 0.5 # p = n^beta
e = 65537
p = random_prime(2^(k//2), lbound=ceil(sqrt(2)*2^(k//2-1)), proof=False)
q_ = random_prime(2^(k//2), lbound=ceil(sqrt(2)*2^(k//2-1)), proof=False)
n_ = p*q_

tau = (n_ // 2^(7*k//8)) % 2^(k//8)
mu  = (p // 2^(k//4 - slack)) % 2^(k//4 + slack)
lam = n_ % 2^(5*k//8 - slack)

n = tau*2^(7*k//8) + mu*2^(5*k//8 - slack) + lam
q = n//p; q += (1^^q&1)

while gcd(e, q-1) > 1 or not q.is_prime():
m = ZZ.random_element(2^(k//8 - slack - 10)) & -2
q = q^^
n = p*

d = e.inverse_mod(n)
return (p,q,n,e,d)

The above code returns an RSA modulus in which the middle $$k/4 + \text{slack}$$ bits belong to the factor $$p$$, and the remaining bits are randomly generated. This is not yet a SETUP: the moduli $$n$$ are distinguishable from random, by using Coppersmith's method to obtain $$p$$ from its upper bits. To turn this into a SETUP, one has to hide the bits, by using, e.g., encryption on the partial $$p$$ information. In case our generator is a blackbox, this can be regular symmetric encryption; when the users have access to the key generation procedure, it is best to use public key methods to hide it - Young and Yung have made a living out of coming up with such methods.

To break the key generated above, we can too use SAGE:
def rsa_bd_break(n, k, eps=0.5/8):
tau = (n // 2^(7*k//8)) % 2^(k//8)
mu = (n // 2^(5*k//8 - slack)) % 2^(k//4 + slack)
print "Solving..."
P.<x> = PolynomialRing(Zmod(n))
f = x + p_*2^(k//4 - slack)
# set_verbose(2)
while True:
roots = f.small_roots(X=2^(k//4 - slack), beta=0.5, epsilon=eps)
if len(roots) > 0:
print hex(ZZ(roots[0]))
return ZZ(roots[0])
print "Failed with eps = " + str(eps)
if eps/2 < epsthresh:
print "Quitting..."
break
print "Retrying with eps = " + str(eps/2)
eps /= 2
return None

As long as we have more than half the bits of $$p$$ in $$p'$$, we can find the full $$p$$ in polynomial time. This does not mean it can't be slow. I have added the slack bits to increase the efficiency of the root finding, requiring the reduction of a smaller basis¹. The eps argument is used to control the size of the lattice to reduce --- the smaller it is, the larger the lattice basis will be (and the better the chances to finding a solution).

Here's an example execution:

Generating RSA key... Done.
Original p = 11183581128574189163983793036002083746501559137516652528752284765046234959066928998151315453810766744898339995298871471424016984721754805133501061552181043
Solving...
Failed with eps = 0.0625000000000000
Retrying with eps = 0.0312500000000000
Recovered p = 11183581128574189163983793036002083746501559137516652528752284765046234959066928998151315453810766744898339995298871471424016984721754805133501061552181043

SAGE factors RSA-1024 bit keys, with 272 bits of $$p$$ embedded, very quickly, usually under a second.

#### Summary

The grandfather of this sort of method is, of course, Ken Thompson's legendary talk "Reflections on Trusting Trust". The moral there was that you can't really trust code you did not fully create yourself²; in our case, it is similar: you cannot trust unreviewed, proprietary cryptographic code that behaves like a blackbox. It could be doing anything!

Backdoors like this one, be them in hardware or software, can be avoided by systematically auditing and reviewing code. If you're a vendor, you can increase your perceived trustworthiness by opening up critical routines such as key generation. As a client, you should demand that such code be thoroughly reviewed either by you, or appropriate third-parties. In today's uncertain times, when intrusions come from increasingly powerful adversaries, one can never be too careful.

[1] Lattice basis reduction, while polynomial-time, can be quite slow in practice, with complexity ranging from cubic to quintic.
[2] And even then, you may shoot yourself in the foot.

## Tuesday, March 20, 2012

### More fun with C++ metaprogramming

Since you're here, you have most certainly seen my last C++-related post, wherein I implemented compile-time AES encryption using C++ templates.

Today, I bring you CubeHash. Now, you might be wondering of what use is a compile-time implementation of CubeHash. Template metaprogramming is used here for a very specific task: to compute the initial state.

The CubeHash specification dictates that the initial 1024-bit (or 32-word) state is computed from the parameters h (output hash bit length), b (the number of bytes to mix with the state each round), and r (number of rounds) as follows:

• Place the 32-bit value h/8 in state[0];
• Place the 32-bit value b in state[1];
• Place the 32-bit value r in state[2];
• Fill state[3] through state[31] with 0;
• Apply the CubeHash round function 10r times.
To compute the initial CubeHash state, then, we have to be able to compute the full round function. And that's exactly what we did. The following C++ header file implements all possible variants of CubeHash easily, since all that changes between them are the h, b and r parameters. The implementation itself is based on the reference simple implementation, available from SUPERCOP. Switching to a faster SSE2 implementation is trivial.

Here it is: CubeHash. Have fun.