Tuesday, March 5, 2013

Codegate YUT Preliminary 2013: bin500

Last weekend, we had the chance to play at the Codegate YUT 2013 preliminary match. The final challenge was in the binary category, and presented us with the following window:

This is a Win32 application that asks for 6 lotto characters (from the set 0-9a-z) as input, which are processed once you click the "BUY" button.

The first hurdle to clear is the anti-debugging and anti-disassembly. There is one thread, running at 0x00401C80, that continuously checks the main function of the application for changes, such as breakpoints and patches. We disable this thread. There are also some sequences of code sprinkled around that confuse disassemblers:

.text:00402607    push    33h
.text:00402609    call    $+5
.text:0040260E    add     dword ptr [esp], 5
.text:00402612    retf
.text:00402613    call    $+5
.text:00402618    mov     dword ptr [esp+4], 23h
.text:00402620    add     dword ptr [esp], 0Dh
.text:00402624    retf

This is a known trick to jump between x86 and x86_64 mode under WOW64, and back again. Since no code is actually running in 64-bit mode, we can just safely nop out all those sequences and be done with it.

We are now ready to take a look at the main function, starting at 0x004046C0. The function first checks the lotto sequence for a hardcoded correct value, by checking the MD5 of each individual character. It is easy to figure out which character sequence results in the desired lotto: 58p1gt.

Once we get past this check, we see another one. This one matches the lower 8 hex characters of a modified SHA-1 of each lotto character against the hardcoded string 26c113b16b376f27f86192b1b6c0273aeae2066af4eb6dbb. We simply patch over this check, since the SHA-1 variant used here calls GetTickCount in its compression function; clearly it's meant to be random and to distract us.

We finally reach the important part. The 6 lotto letters are xored with the string (at 0x0058D6AC) {0x15, 0x59, 0x1c, 0x48, 0x49, 0x6c}, and a 24-byte array is formed:
unsigned char key[24] =
{
    '5', '8', 'p', '1', 'g', 't',
    '5'^0x15, '8'^0x59, 'p'^0x1c, '1'^0x48, 'g'^0x49, 't'^0x6c,
    '5', '8', 'p', '1', 'g', 't',
    'f', '3', 'v', '1', '4', '\0'
};
This array is then used as the key to decrypt the hardcoded 24-byte block at 0x005B2814. The cipher used here is 192-bit block Rijndael. Not AES, Rijndael — AES is only defined for 128-bit blocks.
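For concreteness, here is how that key layout can be reproduced in Python (my own sketch, not code lifted from the binary):

```python
# XOR pad found at 0x0058D6AC and the hardcoded tail; the key is
# guess || (guess XOR pad) || guess || tail, 24 bytes total.
XOR_PAD = bytes([0x15, 0x59, 0x1c, 0x48, 0x49, 0x6c])
SUFFIX = b"f3v14\x00"

def build_key(lotto):
    """Assemble the 24-byte Rijndael key from a 6-byte lotto guess."""
    assert len(lotto) == 6
    xored = bytes(c ^ p for c, p in zip(lotto, XOR_PAD))
    return lotto + xored + lotto + SUFFIX

print(build_key(b"58p1gt").hex())
```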

The winning key 58p1gt does not give a meaningful decryption of the hardcoded block. What are we to do? Bruteforce! 6 characters from the allowed 0-9a-z 36-character alphabet means we only have to go through at most 36^6 ≈ 2^31 combinations.
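The skeleton of such a bruteforcer can be sketched as follows; the Rijndael-192 decryption step is omitted here, leaving only the candidate enumeration and a plausible printable-plaintext filter:

```python
import string
from itertools import product

ALPHABET = string.digits + string.ascii_lowercase  # the 0-9a-z dictionary

def candidates():
    # enumerate all 36^6 ~ 2^31 lotto guesses as 6-byte strings
    for combo in product(ALPHABET, repeat=6):
        yield "".join(combo).encode()

def looks_like_flag(plaintext):
    # a decryption under the right key should be printable ASCII
    return all(0x20 <= c < 0x7f for c in plaintext.rstrip(b"\x00"))

# for each guess: build the 24-byte key as described above, decrypt the
# block at 0x005B2814 with 192-bit-block Rijndael, and keep guesses
# whose plaintext passes looks_like_flag
```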

We quickly find that the correct lotto key to correctly decrypt the block is 4z9na3, and the flag is "L0tto_1s_noT_bAd_9ame!"

The bruteforcer used to get at the solution is available here.

Thursday, January 17, 2013

Cloud hashing

Hash functions are everywhere. Getting this webpage to your computer has required at least a handful of hash calculations. We use them for identification, authentication, or even data structures, such as hash tables or Bloom filters. A particular strain of hash functions, cryptographic hash functions, tends to have more narrow uses: authentication (e.g., as part of HMAC), pseudorandom number generation, password storage (e.g., as part of HMAC in PBKDF2), and so on.

Until a few years ago, everyone was happy: MD5 ruled the land, while SHA-1 sometimes made an appearance. Disaster struck in 2004, when Xiaoyun Wang successfully found collisions in MD5, and seriously damaged SHA-1's reputation. Today, the landscape is different: MD5 and SHA-1 are no longer recommended, and SHA-2 and SHA-3 are the appointed replacements.

Yet, SHA-2 and SHA-3 do not fulfill every hash function requirement adequately. In particular, they are fairly slow in software, running at between 10 and 20 cycles per byte on modern x86 processors. In comparison, MD5 and SHA-1 run at between 4 and 6 cycles per byte. Applications that need to hash massive amounts of data will notice this gap. Therefore, such applications will often make the conscious choice to use SHA-1 or MD5, because a potentially insecure product is often better than an unresponsive product.

There are other options to consider when fast hashing is required and strict adherence to standards is not. Several SHA-3 finalists were very fast in software, such as BLAKE or Skein. Both are in the same order of magnitude of speed as MD5, and have withstood the exhaustive level of scrutiny needed to become SHA-3 finalists.

The recently announced BLAKE2 function takes advantage of the analysis already done on BLAKE, and strips the design of any unnecessary computational overhead. This results in a function that is expected to be as secure as SHA-3 in every way, but actually faster than MD5. Furthermore, BLAKE2 adds support for tree (parallel) hashing, which is a godsend for big-data applications where sequential hashing would simply be too slow, no matter how optimized the design.
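As an aside, BLAKE2 is easy to experiment with: current versions of Python ship BLAKE2b in hashlib, including the keyed mode that can stand in for HMAC:

```python
import hashlib

# plain hashing, with a caller-chosen digest size (1 to 64 bytes)
digest = hashlib.blake2b(b"lots of data", digest_size=32).hexdigest()
print(digest)

# keyed hashing: BLAKE2's built-in replacement for HMAC-style MACs
mac = hashlib.blake2b(b"lots of data", digest_size=32, key=b"secret").hexdigest()
print(mac)
```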

So what are the applications of fast (and/or parallel) hashing?

File blacklisting/whitelisting is an obvious one; matching files to known hash sets is only dependent on disk and hashing speed. Fast-enough hashing can also benefit advanced filesystems. A remarkable case is ZFS, where a strong hash is supported (SHA-256), but disabled by default for performance reasons. SHA-256 is also required for other advanced features, such as deduplication.
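The hashing side of such a matcher is simple; here is a minimal sketch that streams each file in fixed-size chunks, so that disk and hash speed are indeed the only limits (the known-hash set at the end is hypothetical):

```python
import hashlib

def file_digest(path, algo="blake2b", chunk_size=1 << 20):
    # stream the file in 1 MiB chunks: memory use stays constant
    h = hashlib.new(algo)
    with open(path, "rb") as fp:
        while True:
            block = fp.read(chunk_size)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

# matching against a known-hash set is then just a set lookup:
# flagged = file_digest(some_path) in known_bad_hashes
```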

Ultimately, pretty much every current hash use case does benefit from a faster function: faster equals less power consumed, and less overhead for the actual application. One exception is password storage, but you should not be using a plain hash function for that anyway!

Saturday, October 6, 2012

SHA-3 has landed

Exactly 12 years after announcing Rijndael as AES, NIST has done it again. Keccak has been hailed as the SHA-3 winner. Congratulations to the Keccak team, who have designed an amazingly elegant hash, and in particular to Joan Daemen, who has now won 2 NIST competitions!

SHA-3 originated amidst fears that SHA-2, the current NIST hashing standard, was wounded after the breakthrough Wang attacks on MD4, MD5, and SHA-1. Furthermore, SHA-2, like its predecessors, suffered from the common flaws of the Merkle-Damgård (MD) construction, among which are length-extension attacks, multicollisions, and second preimages faster than the generic bound. All SHA-3 finalists fixed these flaws, as a submission requirement.

The winner, Keccak, strays quite drastically from the SHA-2 design. Keccak uses the sponge construction, which relies on a fixed-width permutation $$f$$ as the main building block, unlike the more common compression function from the MD designs. The sponge works in two phases: absorption and squeezing. Here's a picture:

 The sponge construction

During the absorption phase, $$r$$ bits from the message are injected into the state, after which the whole state ($$r+c$$ bits) is permuted using $$f$$. Once we're done with all the message bits (plus padding), we simply output the first $$\ell$$ bits of the state as the final hash, applying the permutation repeatedly if necessary.
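To make the two phases concrete, here is a toy sponge in Python. Everything in it is illustrative: the permutation f is a made-up, non-cryptographic placeholder, and the rate/capacity values are far too small for real use (Keccak's state is 1600 bits):

```python
R, C = 8, 8  # toy rate and capacity, in bytes

def f(state):
    # placeholder permutation over R + C bytes: add each byte to its
    # neighbor, then rotate; invertible, but NOT cryptographic
    s = list(state)
    for i in range(len(s)):
        s[i] = (s[i] + s[(i + 1) % len(s)]) & 0xFF
    return bytes(s[1:] + s[:1])

def sponge_hash(msg, outlen):
    # byte-aligned pad10*1: append 0x01, zero bytes, then 0x80
    msg = msg + b"\x01" + b"\x00" * ((-len(msg) - 2) % R) + b"\x80"
    state = bytes(R + C)
    # absorbing: XOR each rate-sized block into the state, then permute
    for i in range(0, len(msg), R):
        block = msg[i:i + R]
        state = bytes(a ^ b for a, b in zip(state[:R], block)) + state[R:]
        state = f(state)
    # squeezing: emit R bytes at a time, permuting in between
    out = b""
    while len(out) < outlen:
        out += state[:R]
        state = f(state)
    return out[:outlen]

print(sponge_hash(b"hello sponge", 32).hex())
```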

The strength of the sponge construction comes from the indifferentiability of $$f$$ from a random permutation. Keccak is not perfect in this regard: an observation by Duan and Lai distinguishes the full Keccak permutation from random in expected $$2^{1579}$$ time, down from the generic $$2^{1599}$$. This is not a real attack.

Performance was taken seriously in this competition. Keccak is a curious selection in that it is rather unbalanced towards hardware speed, while it is somewhat lacking in the software speed department. eBASH shows that on high-end processors, Keccak is slower than SHA-2 by up to 2x. Meanwhile, Keccak wipes the floor in hardware speed and die area.

For certain applications, this particular performance profile can give the advantage to an attacker. Consider the scenario of password storage, where the defender switches from PBKDF2-HMAC-SHA-2 to PBKDF2-HMAC-SHA-3. This reduces the cost of an attacker while increasing the cost of the defender, for the same number of PBKDF2 iterations: the attacker evaluates SHA-3 on fast, cheap hardware, while the defender computes it in slower software. That said, this is a rather fringe case, and the defender would be better served by switching to a memory-hard key-derivation function if possible.

SHA-3 comes with a completely different design which not only provides diversity, but also fixes some of the standing issues with the aging MD construction. It is a very elegant design, very efficient in hardware, but one that sadly suffers from low software performance. Perhaps chip makers will give us a hand there, or perhaps better ways to compute Keccak will arise in the meantime.

Friday, March 30, 2012

Invisible threats and RSA

There has been much fuss regarding the recent survey by Lenstra et al. on X.509 certificates (see also Nadia Heninger's independent, but similar, results), where it was shown (among other things) that over 20000 out of 6 million RSA keys out there offer no security, since their public moduli are easily factored.

This is a visible threat: given a set of public keys, we can tell whether they are broken. Today I want to talk about invisible threats, i.e., threats that are undetectable to anyone but their perpetrator. We often know them as "backdoors."

Young and Yung formalized the concept of such backdoors as "Secretly Embedded Trapdoors with Universal Protection", or SETUPs. A SETUP is a modification of a cryptosystem that behaves very similarly, but its output is easily breakable by an attacker (and no one else!). A SETUP must also be (polynomially) indistinguishable from a valid output of said cryptosystem.

Small roots of polynomials

What do smooth-integer finding, error correction, and code-breaking have in common? They can all be represented as finding "small" roots of some integer polynomial (modulo n) and, in turn, they all come down to a lattice reduction problem.

Coppersmith, in his 1996 paper, showed how to use lattice reduction as a tool to attack RSA. In particular, he showed how given half of the bits of a public key's ($$N = pq$$) factor (obtained via, say, a side-channel attack), you can find the full factor, and thus break the key. You do this by modelling the problem as a polynomial: given $$p'$$, an approximation of $$p$$ within a reasonable distance of $$p$$, e.g., $$<2^{256}$$, find small roots of

$$f(x) = x + p' \pmod{p}.$$

Since we do not actually know $$p$$, we use the next best thing, $$N$$, a multiple thereof. The algorithm itself works by "lifting" the polynomial onto the integers, and using the well-known LLL (co-authored by A. K. Lenstra, incidentally) approach to solving said integer polynomial. LLL is fast in the "polynomial-time"-sense: given a basis of dimension $$d$$ with $$n$$-dimensional vectors with norm at most $$B$$, LLL finds a short (but not the shortest; that problem, SVP, is NP-hard) basis in $$d^{3+o(1)}n(d + \log B) \log B$$ bit operations, or roughly $$O(d^5 \log B)$$. This means bases with some hundreds of vectors of (say) 1024-bit integers will not be fast to reduce in practice, and may require hours or days to reduce.

Luckily, the Coppersmith method doesn't require very large bases (usually dimension is under 100, for 1024-bit RSA; we'll get to that later), and is often pretty fast in practice. Now, let's see the actual backdoor in practice.

An RSA backdoor

A very simple RSA SETUP was given by Crépeau and Slakmon, relying on Coppersmith's method to factor vulnerable keys. It simply embeds approximately half of the bits of one of the factors of the RSA key in the key itself. Here's some SAGE code to do this:

def rsa_bd_keygen(k):
    """Generate backdoored RSA key with k bits. Tested with k = 1024"""
    slack = 32 # k//4 + slack bits stored
    e = 65537
    p  = random_prime(2^(k//2), lbound=ceil(sqrt(2)*2^(k//2-1)), proof=False)
    q_ = random_prime(2^(k//2), lbound=ceil(sqrt(2)*2^(k//2-1)), proof=False)
    n_ = p*q_

    tau = (n_ // 2^(7*k//8)) % 2^(k//8)
    mu  = (p  // 2^(k//4 - slack)) % 2^(k//4 + slack)
    lam = n_ % 2^(5*k//8 - slack)

    # n embeds mu, the top k//4 + slack bits of p, in its middle bits
    n = tau*2^(7*k//8) + mu*2^(5*k//8 - slack) + lam
    q = n//p
    q += 1^^(q&1) # force q odd

    while gcd(e, q-1) > 1 or not q.is_prime():
        m = ZZ.random_element(2^(k//8 - slack - 10)) & -2 # even mask keeps q odd
        q = q^^m
    n = p*q

    d = e.inverse_mod((p-1)*(q-1))
    return (p, q, n, e, d)

The above code returns an RSA modulus in which the middle $$k/4 + \text{slack}$$ bits belong to the factor $$p$$, and the remaining bits are randomly generated. This is not yet a SETUP: the moduli $$n$$ are distinguishable from random, by using Coppersmith's method to obtain $$p$$ from its upper bits. To turn this into a SETUP, one has to hide the bits, by using, e.g., encryption on the partial $$p$$ information. In case our generator is a blackbox, this can be regular symmetric encryption; when the users have access to the key generation procedure, it is best to use public key methods to hide it - Young and Yung have made a living out of coming up with such methods.

To break the key generated above, we can again use SAGE:

def rsa_bd_break(n, k, eps=0.5/8, epsthresh=0.005):
    slack = 32 # must match rsa_bd_keygen
    mu = (n // 2^(5*k//8 - slack)) % 2^(k//4 + slack)
    print("Solving...")
    P.<x> = PolynomialRing(Zmod(n))
    f = x + mu*2^(k//4 - slack)
    # set_verbose(2)
    while True:
        roots = f.small_roots(X=2^(k//4 - slack), beta=0.5, epsilon=eps)
        if len(roots) > 0:
            p = ZZ(roots[0]) + mu*2^(k//4 - slack)
            print("Recovered p = " + str(p))
            return p
        print("Failed with eps = " + str(eps))
        if eps/2 < epsthresh:
            print("Quitting...")
            break
        print("Retrying with eps = " + str(eps/2))
        eps /= 2
    return None

As long as we have more than half of the bits of $$p$$ in $$p'$$, we can find the full $$p$$ in polynomial time. This does not mean it can't be slow. I have added the slack bits to increase the efficiency of the root finding, requiring the reduction of a smaller basis¹. The eps argument controls the size of the lattice to reduce: the smaller it is, the larger the lattice basis will be (and the better the chances of finding a solution).

Here's an example execution:

Generating RSA key... Done.
Original p = 11183581128574189163983793036002083746501559137516652528752284765046234959066928998151315453810766744898339995298871471424016984721754805133501061552181043
Solving...
Failed with eps = 0.0625000000000000
Retrying with eps = 0.0312500000000000
Recovered p = 11183581128574189163983793036002083746501559137516652528752284765046234959066928998151315453810766744898339995298871471424016984721754805133501061552181043

SAGE factors RSA-1024 bit keys, with 272 bits of $$p$$ embedded, very quickly, usually under a second.

Summary

The grandfather of this sort of method is, of course, Ken Thompson's legendary talk "Reflections on Trusting Trust". The moral there was that you can't really trust code you did not fully create yourself²; in our case, it is similar: you cannot trust unreviewed, proprietary cryptographic code that behaves like a blackbox. It could be doing anything!

Backdoors like this one, be they in hardware or software, can be avoided by systematically auditing and reviewing code. If you're a vendor, you can increase your perceived trustworthiness by opening up critical routines such as key generation. As a client, you should demand that such code be thoroughly reviewed, either by you or by appropriate third parties. In today's uncertain times, when intrusions come from increasingly powerful adversaries, one can never be too careful.

[1] Lattice basis reduction, while polynomial-time, can be quite slow in practice, with complexity ranging from cubic to quintic.
[2] And even then, you may shoot yourself in the foot.

Tuesday, March 20, 2012

More fun with C++ metaprogramming

Since you're here, you have most certainly seen my last C++-related post, wherein I implemented compile-time AES encryption using C++ templates.

Today, I bring you CubeHash. Now, you might be wondering what use a compile-time implementation of CubeHash could possibly be. Template metaprogramming is used here for a very specific task: to compute the initial state.

The CubeHash specification dictates that the initial 1024-bit (or 32-word) state is computed from the parameters h (output hash bit length), b (the number of bytes to mix with the state each round), and r (number of rounds) as follows:

• Place the 32-bit value h/8 in state[0];
• Place the 32-bit value b in state[1];
• Place the 32-bit value r in state[2];
• Fill state[3] through state[31] with 0;
• Apply the CubeHash round function 10r times.
To compute the initial CubeHash state, then, we have to be able to compute the full round function. And that's exactly what we did. The following C++ header file implements all possible variants of CubeHash easily, since all that changes between them are the h, b and r parameters. The implementation itself is based on the reference simple implementation, available from SUPERCOP. Switching to a faster SSE2 implementation is trivial.
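To illustrate what that computation involves, here is a run-time Python sketch of the initialization, with the round function transcribed from the reference simple implementation (my own illustration, not the C++ template code):

```python
MASK = 0xFFFFFFFF

def rotl(w, n):
    # 32-bit left rotation
    return ((w << n) | (w >> (32 - n))) & MASK

def cubehash_round(x):
    # one CubeHash round over the 32-word state x, modified in place
    for i in range(16):
        x[i + 16] = (x[i + 16] + x[i]) & MASK          # add x[i] into x[i+16]
    x[:16] = [rotl(x[i ^ 8], 7) for i in range(16)]    # rotate by 7, swap on bit 3
    for i in range(16):
        x[i] ^= x[i + 16]                              # xor x[i+16] into x[i]
    x[16:] = [x[16 + (i ^ 2)] for i in range(16)]      # swap on index bit 1
    for i in range(16):
        x[i + 16] = (x[i + 16] + x[i]) & MASK
    x[:16] = [rotl(x[i ^ 4], 11) for i in range(16)]   # rotate by 11, swap on bit 2
    for i in range(16):
        x[i] ^= x[i + 16]
    x[16:] = [x[16 + (i ^ 1)] for i in range(16)]      # swap on index bit 0
    return x

def cubehash_initial_state(h, b, r):
    # the five specification steps: seed three words, zero the rest,
    # then apply the round function 10*r times
    x = [h // 8, b, r] + [0] * 29
    for _ in range(10 * r):
        cubehash_round(x)
    return x

iv = cubehash_initial_state(512, 32, 16)  # CubeHash16/32-512
```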

Here it is: CubeHash. Have fun.

Thursday, September 8, 2011

DEFCON 19 CTF FOR THE WIN!

We would like to congratulate all the members of Kryptos Logic participating in the Defcon 19 CTF with the European Nop Sled Team. This is our second year competing side by side with our great friends across the globe from Portugal to Denmark to the United States, and Kryptos Logic was honored to be a part of the fun and hard fought competition.

In t-shirt-swapping tradition, Hates Irony (VedaGodz and Co.), we hope to see you back next year in full force. Next year, drinks are on us.

Routards, as usual, put on a fantastic showing that should not go unrecognized by any team, and we hope that you do return next year. Your first-blood breakthroughs did not go unnoticed.

Plaid Parliament of Pwning, great work on your first time at the Defcon CTF.

To all other teams, good luck next year.

A very special note to DDTEK for organizing the "Olympics" of Capture the Flags. We think there might be the slightest bit of stress involved in running a competition where the best CTF players in the world scrutinize every line of code and gameplay. Fantastic game, a good curve ball with the IPv6, and everyone worked hard until the end in both the qualifiers and finals. Job well done.

Check our blog soon for some solutions to the qualifiers/finals.

Thanks,

Team Kryptos Logic

Thursday, August 18, 2011

The sky is not falling

CRYPTO 2011's rump session yesterday was shaken by the announcement of a break of the full AES. What, again? Yes - this time AES was broken without relying on the related-key assumption. Is this any reason to panic? No.

According to the paper, the computational complexity of the attack (on AES-128) is 2^126.18. The attack requires, however, 2^88 plaintext/ciphertext pairs. What this means is that a machine of size 2^88 (the pairs must be stored somewhere) will succeed at breaking AES in 2^126 steps.

Now consider another scenario: a machine of size 2^88, with 2^88 small AES processors (admittedly, an AES circuit is a constant factor larger than a circuit to store 256 bits; the point still stands). This machine will succeed in bruteforcing an AES key in 2^(128-88) = 2^40 steps, much faster than the BKR attack. Perhaps now would be a good time to bring up Daniel Bernstein's paper "Understanding brute force," which makes the same point, but takes it even further, with a description of a parallel architecture for cost-efficient bruteforcing.

This is not, of course, to discredit the impressive research done by the BKR paper authors. This is important research, and allows us to have better insight (and attacks) into the cipher used to protect all kinds of communication around us. The authors themselves acknowledge that their results do not affect the security of AES in practice:
As our attacks are of high computational complexity, they do not threaten the practical use of AES in any way.
So the sky is not really falling. A theoretical break has been done, and it might even lead to a practical attack in the future, but not today.