[cryptography] RDRAND and Is it possible to protect against malicious hw accelerators?

Marsh Ray marsh at extendedsubset.com
Sat Jun 18 20:16:39 EDT 2011


On 06/18/2011 03:08 PM, slinky wrote:
>
> Likewise, for accelerated component functions the hardware will know
> what is a key and what is input data - again, it needs this
> information in order to operate. Contrast this to a general purpose
> processor which can't really deduce what is a key and what isn't
> while processing code that happens to be AES.

Why not?

As Peter Gutmann just said "They really have waaaay too much die space
to spare don't they?"

Intel bought McAfee a while ago. Speaking informally with some chip 
people (not necessarily from Intel, and definitely not under any 
confidentiality), there's active research around the idea of building 
instruction-stream validation in support of antivirus directly into the 
processor. Recognizing an intentionally obfuscated virus seems no easier 
than recognizing AES.

The Intel hardware RNG ("DRBG") is an example of how not to do it. It 
has weird timings and magic numbers:
> There are two approaches to structuring RdRand invocations such that
> DRBG reseeding can be guaranteed: Iteratively execute RdRand beyond
> the DRBG upper bound by executing more than 1022 64-bit RdRands, and
> Iteratively execute 32 RdRand invocations with a 10us wait period
> per iteration. The latter approach has the effect of forcing a
> reseeding event since the DRBG aggressively reseeds during idle
> periods.

But on any kind of networked or multitasking system, "idle periods" happen 
only with the consent of the attacker.
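
For concreteness, the second of those approaches would look roughly like 
this. It's only a sketch, assuming a compiler that exposes the 
_rdrand64_step() intrinsic from <immintrin.h>; the 10 us figure and the 
32 iterations come straight from the text quoted above, and as noted, 
nothing here stops an attacker on another core from keeping the DRBG 
busy during those "idle" waits:

#include <immintrin.h>
#include <unistd.h>

/* Force a DRBG reseed per Intel's second suggestion: 32 RDRAND
 * invocations spaced 10 us apart, then read the value we actually want.
 * Returns 1 on success, 0 if RDRAND ever reports "no value available"
 * (carry flag clear). */
static int force_reseed_then_read(unsigned long long *out)
{
    unsigned long long scratch;
    int i;

    for (i = 0; i < 32; i++) {
        if (!_rdrand64_step(&scratch))   /* CF clear: underflow */
            return 0;
        usleep(10);                      /* the 10 us "idle period" */
    }
    return _rdrand64_step(out);
}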

> Within the context of virtualization, the DRNG's stateless design
> and atomic instruction access means that RdRand can be used freely
> by multiple VMs without the need for hypervisor intervention or
> resource management.

Hmm, are we sure none of this carries across any security boundaries?

In another place they say:
> After invoking the RdRand instruction, the caller must examine the
> Carry Flag (CF) to determine whether a random value was available at
> the time the RdRand instruction was executed.

So it's not stateless after all because it keeps a FIFO of numbers and 
emits all zeroes when it "runs out".

To me, this only makes sense if one of the following is true:
* the entropy pool is so small that it's in danger of being brute-forced,
* the pool's contents can be read out somehow, or
* the extraction process (NIST SP 800-90 CTR_DRBG with AES) may not be 
strongly one-way.

But I know there are other opinions :-). More than likely though, 
they're doing this to follow "best practices".

At the very least this is going to disclose to an attacker on another 
core how many random numbers you're consuming. Random-number consumption 
can often be driven over the network, so he sends varying rates of SSL or 
IPsec handshake requests while watching how fast the pool depletes. This 
lets him determine whether he's running on the same processor as your 
crypto thread. It may also create a covert channel for exfiltration. Of 
course, there are other shared resources that might already offer an 
easier way to do this.

Still, we can predict how this story will turn out because we have 
several examples of what happens in practice when RNGs decide that they 
have "run out" of numbers: the client code continues running with 
whatever garbage it gets because it's not a situation that the software 
developer ever encountered in his debugger, or one which a QA team ever 
noticed in the lab. At best, it will continue along an untested code path.

The thing _least_ likely to happen is for the operation actually needing 
the CSRNs to fail, because that would be a conspicuous bug which would 
have to be "fixed" somehow.

So the Intel "DRNG" has observable shared internal state and is shared 
among multiple cores. Even worse, *an attacker running on one core has 
the ability to cause the RDRAND instruction to write zeroes to the 
destination register*!

Note that the carry flag isn't accessible from C. The RDRAND instruction 
isn't either, but there will be inline assembler snippets floating 
around any day now.

Just to pick on Peter Gutmann:
> How would you encode, for example, 'RdRand eax'?
> I'd like to get the encoded form to implement it as '__asm _emit 0x0F __asm
> _emit 0xC7 __asm _emit <something>' (in the case of MSVC).

Note that he's not asking about how to check the carry flag too. I'm 
sure he of all people wouldn't forget this, but not so for your typical 
developer.

It's possible to check the carry flag from inline asm:
http://stackoverflow.com/questions/3139772/check-if-carry-flag-is-set

So if you were a C programmer who didn't know x64 assembler *maybe* you 
could find the right advice in that thread and get the carry flag out 
reliably. But how would you test it? How would QA test it?
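
For what it's worth, here's roughly what getting it right might look 
like with GCC-style inline asm. If I'm reading the opcode tables right, 
'rdrand eax' encodes as 0F C7 F0, and the REX.W-prefixed form below is 
the 64-bit version; the retry bound and the hard failure on exhaustion 
are arbitrary policy choices of mine, and obviously none of this has 
been tested against real silicon:

#include <stdint.h>

/* RDRAND RAX, with the success/failure indication captured from the
 * carry flag -- which is the only error signal the instruction gives.
 * The .byte sequence is the raw encoding (REX.W 0F C7 F0); drop the
 * 0x48 prefix for the 32-bit "rdrand eax" form. */
static int rdrand64_retry(uint64_t *out)
{
    int tries;

    for (tries = 0; tries < 10; tries++) {
        uint64_t val;
        unsigned char ok;

        __asm__ volatile(".byte 0x48, 0x0f, 0xc7, 0xf0\n\t"  /* rdrand %rax */
                         "setc %1"
                         : "=a"(val), "=qm"(ok)
                         :
                         : "cc");
        if (ok) {
            *out = val;
            return 1;
        }
    }
    return 0;   /* caller must treat this as an error, not as random data */
}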

It's certainly possible to get everything right, but:
* it requires significant skill and effort on the developer's part
* there's zero observable benefit in the normal case
* proper error handling reduces perceived reliability
* the failure mode is silent and leaves the system vulnerable
* it's difficult to repro on the developer's box (masked by the debugger)
* it's difficult to repro on the QA box
* the behavior may change with new processor revisions
* it's relatively easy for the attacker to trigger under the right conditions
This is a recipe for fail.

For comparison, look at the RDTSC instruction. Here's a nice helpful 
page (I've referred to it myself) with code to read the processor 
timestamp counter:
http://www.mcs.anl.gov/~kazutomo/rdtsc.html
Microsoft's compiler even provides a compiler intrinsic for it:
http://msdn.microsoft.com/en-us/library/twchhe95.aspx

Note that neither of these two sources provides any code to determine 
whether the processor supports this instruction or whether the values it 
returns are valid! In fact, the numbers it returns can be really wacky on 
multi-CPU architectures. Intel and AMD behave differently, even across 
chip models. You should probably set CPU affinity on any thread that's 
going to use it and be prepared to handle a certain number of wacky 
measurements.
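
For what it's worth, those precautions might look something like this on 
Linux. It's a sketch assuming GCC's __rdtsc() intrinsic from 
<x86intrin.h> and sched_setaffinity(); MSVC users would use the __rdtsc 
intrinsic linked above instead, and sanity-checking the deltas is still 
left to the caller:

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <x86intrin.h>

/* Pin the calling thread to a single CPU before reading the TSC, so at
 * least successive readings come from the same core. */
static uint64_t pinned_rdtsc(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);  /* pid 0 = this thread */
    return __rdtsc();
}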

> Now, put on your tinfoil beanie and suppose the hw accelerator is a
> Mallory. Suppose there is some kind of a built-in weakness/backdoor,
> for instance as a persistent memory inside the chip, which stores
> the last N keys.

Years ago, somebody ran "strings" on a Windows system DLL and found the 
string "NSAKEY". http://en.wikipedia.org/wiki/NSAKEY There was a lot of 
speculation and outright accusations that this was a backdoor to Windows.

Ah, those were innocent times. :-)

Since then we've learned more about exploiting ordinary software bugs 
and there have been *thousands* of remote holes discovered in Microsoft 
products:
> http://web.nvd.nist.gov/view/vuln/search-results?adv_search=true&cves=on&cve_id=&query=microsoft&cwe_id=&pub_date_start_month=-1&pub_date_start_year=-1&pub_date_end_month=-1&pub_date_end_year=-1&mod_date_start_month=-1&mod_date_start_year=-1&mod_date_end_month=-1&mod_date_end_year=-1&cvss_sev_base=&cvss_av=NETWORK&cvss_ac=&cvss_au=&cvss_c=&cvss_i=&cvss_a=

The idea that the NSA (or any skilled attacker) would have needed 
Microsoft's help to execute code remotely on Windows NT is now simply 
laughable. The same goes for other commonly-used OSes.

Microsoft's code quality is vastly improved and now far ahead of most of 
the rest of the industry. But we know there are still hundreds of 
"trusted" root CAs, many from governments, that will silently install 
themselves into Windows at the request of any website. Some of these 
even have code signing capabilities.

I guess the point here is that unless you're talking about hardware 
constructed specifically for a high-security environment, sneaking 
special key detection logic into the microarchitecture seems like the 
world's most overcomplicated way of pwning a commodity box.

> Having physical access to the machine would yield
> the keys (thus subverting e.g. any disk encryption). And even more
> paranoidly, a proper instruction sequence could blurt the key cache
> out for convenient remote access by malware crafted by the People
> Who Know The Secrets.

It wouldn't need to be anything that obvious. Just a circuit
configuration that leaks enough secret stuff via a side channel (timing,
power, RF, etc.) that it could be captured.

Things like data caches and branch prediction buffers have been shown to
do this as a natural consequence of how they operate. I think when an
incomprehensibly complex black-box system has things like subtle info
leaks happening by accident it's a good sign that intentional behaviors
could easily be hidden.

> My questions: 1. How can one ensure this blackbox device really
> isn't a Mallory?

Mallory lives on the communications link, which is neither Alice's nor
Bob's CPU by definition.

But it could certainly be malicious and there's probably no way to prove
that it isn't, particularly on a chip as dense as a modern x64 CPU.

Being in your chips is the ultimate in physical access to the hardware,
so you have no choice but to use only hardware that you "trust". But
there may be ways to split up the data such that the attacker would need 
to have previously compromised multiple, disparate systems.

> 2. Are there techniques, such as encrypting a lot of useless junk
> before/after the real deal to flush out the real key, as a way to
> reduce the impact of untrusted hardware, while still being able to
> use the hw-accelerated capabilities?   And if you know of any good
> papers around this subject, feel free to
>  mention them :)

Encrypting a lot of stuff with the same key would probably just make the
side channel easier to read. Switching keys on every block might help, 
but only if the attacker didn't expect and account for that when he 
designed the backdoor.

- Marsh


