[cryptography] Oddity in common bcrypt implementation

Nico Williams nico at cryptonector.com
Tue Jun 28 11:25:10 EDT 2011


On Tue, Jun 28, 2011 at 9:56 AM, Marsh Ray <marsh at extendedsubset.com> wrote:
> Humans tend to not define text very precisely and computers don't work with
> it directly anyway, they only work with encoded representations of text as
> character data. Even a simple accented character in a word or name can be
> represented in several different ways.

I'll grant this for argument's sake, though you should know that every
language has its equivalent of English grammar and spelling fascists.
Just sayin'.

Another interesting data point is how changeable language rules are
over time.  In Spanish the rule used to be to drop accents on
capital letters (I suspect, though I really ought to check, that
the rationale was that typewriters had trouble handling accents on
capital letters, though perhaps there were typesetting issues as
well).  You can imagine the pain this would cause us today if we had
to stick to that rule for Spanish but not, say, French!

> Many devs (particularly Unixers :-) in the US, AU, and NZ have gotten away
> with the "7 bit ASCII" assumption for a long time, but most of the rest of
> the world has to deal with locales, code pages, and multi-byte encodings.
> This seemed to allow older IETF protocol specs to often get away without a
> rigorous treatment of the character data encoding issues. (I suspect one
> factor in the lead of the English-speaking world in the development of 20th
> century computers and protocols is because we could get by with one of the
> smallest character sets.)

ASCII is (was) a multi-byte codeset.  á could be written by
typing 'a', backspace, then a single quote.  A great many Latin-1
characters could be typed that way, and indeed, the compose sequences
of many internationalized input methods reflect this.
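
The modern parallel in Unicode is the precomposed form vs. a base
letter plus combining accent; here's a small illustration (Python is
purely my choice for the example):

    import unicodedata

    precomposed = "\u00e1"    # á as one code point (LATIN SMALL LETTER A WITH ACUTE)
    combining   = "a\u0301"   # 'a' followed by COMBINING ACUTE ACCENT

    print(precomposed == combining)             # False: different code point sequences
    print(unicodedata.normalize("NFC", precomposed)
          == unicodedata.normalize("NFC", combining))   # True once both are in NFC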

> Let's say you're writing a piece of code like:
> if (username == "root")
> {
>        // avoid doing something insecure with root privs
> }
> The logic of this example is probably broken in important ways but the point
> remains: sometimes we need to compare usernames for equality in contexts
> that have security implications. You can only claim "bytes are bytes" up
> until the point that the customer says they have a directory server which
> compares usernames "case insensitively".

What's wrong with that code?  Depends on the programming language, no?
One could imagine a programming language where the string equality
comparison operator does normalization- and codeset-insensitive
comparisons.

But even in typical programming languages, "root" is just ASCII, not
confusable with, say, "röot".  *Some* things are easy enough,
fortunately.  But yes, string comparison is not trivial.
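
For illustration, a normalization- and case-insensitive comparison
might look something like this (a Python sketch; the helper name is
mine, not any particular library's API):

    import unicodedata

    def unicode_equal(a, b):
        # Compare strings ignoring normalization form and case.
        return (unicodedata.normalize("NFC", a).casefold()
                == unicodedata.normalize("NFC", b).casefold())

    print(unicode_equal("root", "root"))         # True
    print(unicode_equal("röot", "ro\u0308ot"))   # True: precomposed vs. combining ö
    print(unicode_equal("root", "röot"))         # False: still distinguishable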

BTW, we have a pretty good handle on these issues in the IETF
nowadays.  Not all problems are solved, of course, but mostly that's a
result of dealing with legacy.

> For most things "verbatim binary" is the right choice. However, a password
> or pass phrase is specifically character data which is the result of a user
> input method.

Passwords are the hardest strings to deal with, because if you're
doing anything like a PBKDF, then you must normalize the password
first, which requires very strict rules for doing so.  Worse, it means
that *clients* can't be spared a local, fairly full-featured I18N
library (whereas for non-passwords it's always possible to say that
the server has to apply any complex normalization rules, ...).
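
To be clear about the order of operations I mean, here's a sketch
(Python; PBKDF2 and NFC are stand-ins for whatever PBKDF and
normalization profile the protocol actually mandates; SASLprep, for
instance, uses NFKC):

    import hashlib
    import unicodedata

    def derive_key(password, salt, iterations=100000):
        # 1. Normalize first, so both ends derive from identical characters.
        normalized = unicodedata.normalize("NFC", password)
        # 2. Encode with one agreed-upon encoding (UTF-8 here).
        pw_bytes = normalized.encode("utf-8")
        # 3. Only then run the PBKDF.
        return hashlib.pbkdf2_hmac("sha256", pw_bytes, salt, iterations)

If the client skips step 1 and the server (or another client)
doesn't, the derived keys won't match even though the user typed the
"same" password.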

>> If you want to do crypto, just do crypto on the bits/bytes. If you
>> really have to, you can tag the intended format for forensic purposes
>> and sign your intent. But don't meddle with your given bits.
>> Canonicalization/normalization is simply too hard to do right or even to
>> analyse to have much place in protocol design.

If you don't normalize somewhere (in the right places, whatever they
might be for your app) you won't interop.  Simple as that.  Hopefully
you'll just not interop in rare corner cases, but you can't guarantee
it.  I've written a fair bit about this at my old Sun blog
(blogs.sun.com/nico, now also hosted at cryptonector.com).

Fortunately Unicode normalization has now been implemented several
times in open source, so this is hardly that big a deal.  (I can
point to at least four different open source implementations, some
with very friendly licenses, one even optimized for normalization- and
case-insensitive comparison, and there are probably several more besides.)

> Consider RADIUS.
>
> The first RFC http://tools.ietf.org/html/rfc2058#section-5.2
> says nothing about the encoding of the character data of the password field,
> it just treats it as a series of octets. So what do you do when implementing
> RADIUS on an OS that gives user input to your application with UTF-16LE
> encoding? If you "don't meddle with your given bits" and just pass them on
> to the protocol layer, they are almost guaranteed to be non-interoperable.

But doesn't the AAA server get the password in the clear?  If so, the
server can make it right.  It's protocols that use PBKDFs on clients
that get into trouble (think of DIGEST-MD5, SCRAM, Kerberos, any
ZKPPs, ...).
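
For the quoted RADIUS case, all that's really needed on the sending
side is to transcode to one agreed encoding before the octets hit the
wire.  An illustrative sketch (Python; UTF-8 as the assumed wire
encoding):

    # The OS hands the application UTF-16LE character data, but the
    # User-Password attribute is just octets, so pick one encoding
    # (UTF-8 being the obvious choice today) before building the packet.
    raw_input = "pâssw0rd".encode("utf-16-le")   # what the input stack hands you
    password = raw_input.decode("utf-16-le")     # back to abstract characters
    wire_octets = password.encode("utf-8")       # what goes into the attribute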

> Consequently, we can hardly blame users for not using special characters in
> their passwords.

The most immediate problem for many users w.r.t. non-ASCII in
passwords is not the likelihood of interop problems but the
heterogeneity of input methods and input method selection in login
screens, password fields in apps and browsers, and so on, together
with the fact that they can't see the password they are typing to
confirm that the input method is working correctly.  Oddly enough,
mobiles are ahead of other systems here in that they show the user the
*last/current* character of any password they are entering.

I would gladly use non-ASCII in my passwords but for the input method issues.

Nico
--


