[cryptography] A small public key encryption program
solar at openwall.com
Tue Aug 7 22:18:42 EDT 2012
Thanks for your response.
On Tue, Aug 07, 2012 at 09:59:17PM +0300, Elias Yarrkov wrote:
> On Tue, Aug 7, 2012 at 4:34 AM, Solar Designer <solar at openwall.com> wrote:
> > Here's an attack:
> > This puts an upper bound on the cost, regardless of the (possibly
> > configurable) cost of the expensive component. It also lets an attacker
> > avoid the memory cost altogether (but pay for this with the processing
> > cost of the algorithm above).
> Indeed, and the function is specified with the intention of reaching this
> bound. This why I claim the construction's AT cost as being "similar to 2**32
> iterations of SHA-1" on the webpage. Notice that the upper bound in fact is
> 2**32, and not 2**31. You did about (2**32 / 2) * input_space hash function
> calls. Compare to a hash that is iterated 2**32 times: you would need to do
> about 2**32 * (input_space / 2) calls. No difference.
> > Overall, I think that having a narrower-pipe expensive component in a
> > KDF might be OK (especially if there are multiple invocations of such
> > components per a full KDF computation), but 32 bits for the only
> > expensive component is likely too narrow for some otherwise reasonable
> > uses (cost settings).
> It is true that 32 bits is quite limiting. This structure works fine with
> 64-bit words as well, but I want to avoid uint64_t in this program.
Sorry for stating the obvious, but you could use two uint32_t's or make
two calls to zen32() with different inputs and process both outputs.
I understand that you're trying to keep the code very simple, though.
> On Tue, Aug 7, 2012 at 2:36 AM, Solar Designer <solar at openwall.com> wrote:
> > Also, I am curious how you derive your area*time cost estimate, and
> Here are some back-of-the-envelope numbers. SHA-1 in ASIC is 8k GEs and hashes
> a short input in 85 cycles. 2**32 iterations of this would be about 2**51.4
> GEs*cycles, plus unknown overhead.
> One bit of memory is 1.5 GEs in eDRAM. zen32 uses 2**23 * 32 bits of memory.
> Let's be pessimistic and assume that the attacker has zero memory latency. A
> straightforward implementation would take about 2**23 * 4 cycles. This comes
> down to about 2**53.6 GEs*cycles, plus unknown overhead.
> There are obvious ways to reduce the latter cost slightly, at least if we stick
> to the "zero memory latency" story, but it probably cannot be much cheaper than
> the iterated SHA-1.
This sounds about right, thanks. For a SHA-1 implementation in ASIC to
compare against, you could want to search for a fully-pipelined one.
My guess is that it'd be (a few times?) smaller than 8k*85 GEs.
> > whether you considered other metrics of cost (those more relevant for
> > non-ASIC attacks). Your zen32() does use RAM, but it only uses a tiny
> > fraction of the CPU's logic (little more than a 32-bit ALU - no SIMD),
> > which is suboptimal against attacks with CPUs. It looks like
> CPUs and other general-purpose hardware are only useful against passphrases
> that are too weak for this purpose in the first place.
> Granted, some users will use stupid passphrases despite being told how to do it
> securely, but being harder on the CPU is, to me, not a goal in itself if the
> attacker isn't constrained to general-purpose hardware. I try to go for
> cryptography that not even the best-equipped attackers can break.
That's reasonable, but this is not the only threat model for KDFs in
general. Under some reasonable threat models, the defender's goal may
be to reduce the percentage of passphrases that will get cracked (per
unit of time, during their lifespan, etc.) by a variety of attackers.
Attacks with CPUs are relevant because CPUs are readily available in
large quantities, including e.g. in botnets. Ditto for GPUs.
Besides, by not using the defender's CPU and memory bus more fully, you
make attacks with specialized hardware easier too.
> > SIMD-enhanced attacks won't even require scatter/gather addressing
> > (which, by the way, might become widely available, such as with Intel
> > MIC) since your array indices are not data-dependent. Also, such
> > attacks would use the available memory bandwidth more fully. When you
> > make a 32-bit memory access, an entire cache line is fetched anyway, but
> > your second loop only uses the 32-bit value. With SIMD or/and similar
> > grouping of memory accesses for multiple keys, the entire cache line
> > would be actually made use of.
> I considered data-dependent array lookups in the manner of ISAAC. It's a good
> idea in some circumstances, and definitely forces one to face the issue of
> latency. I decided against it because CPUs leak information via side channels
> about those lookups.
Note that I was not actually advocating data-dependent lookups. I'm
also not sure if forcing the attacker to face the issue of latency would
be desirable (the attacker would be able to avoid it by interleaving
accesses for different candidate passphrases, although this increases
the minimum memory needs accordingly).
Rather, I was/am advocating taking advantage of the available CPU
execution units and the wider memory bus and the cache line wide
prefetches in the KDF design and thus in possible defender's
implementations of the KDF. You don't have to do it in all
implementations (some may be optimized for simplicity instead), but I
think it's an advantage if the design allows for this.
For example, even starting to use two adjacent uint32_t's as mentioned
above would be a slight improvement in this aspect, and would make
attacks with attack-optimized implementations (including ASIC) a bit
more costly (per defender's time*memory). Perhaps you could expand this
to cache line wide accesses (with proper alignment).
> Consider this, for example: attacker gathers a small amount of noisy
> information about memory lookups near the beginning of this type of random
> access function. He then runs brute force on the full construction; he can
> compare the known right memory lookup pattern to the one he ends up doing
> near the beginning of the random access function, detect gross mismatches and
> skip to the next candidate passphrase without computing the entire function.
I agree with you on this.
On a related note, the same applies, to an even greater extent, to KDFs
with data-dependent branches, such as SunMD5. Data-dependent branches
are being proposed as an anti-GPU measure lately, which I acknowledge,
but oppose. I think it's a last resort in case we would have failed
with other anti-GPU approaches (assuming that we do choose to design a
KDF to be GPU-unfriendly), but I think those other approaches would be
sufficient. Besides, not every kind of data-dependent branching is that
much of a problem for GPUs.
> > In general, your algorithm appears to be memory bandwidth bound, whereas
> > scrypt avoids that (by including some non-trivial computation) - it is
> > memory-hard, but not memory bandwidth bound.
> > With your algorithm, an attacker with the same amount of RAM, but with
> > higher memory bandwidth for random 32-bit accesses (such as with many
> > independent 32-bit buses), would have an advantage. This is probably
> > not what you wanted.
> Specialized hardware is indeed advantageous. One generally wants to increase
> attack cost per user inconvenience in deriving a key from a passphrase; using
> longer blocks of memory may indeed help slightly. It would also make the
> program more complex.
We're in agreement on this.
> In short, I mostly care about the asymptotically best brute force attacks,
> which work by special-purpose hardware.
> I'm intending to write more formally about this structure when I have the time.
More information about the cryptography