[cryptography] Oddity in common bcrypt implementation

Marsh Ray marsh at extendedsubset.com
Tue Jun 28 10:56:49 EDT 2011

On 06/27/2011 06:30 PM, Sampo Syreeni wrote:
> On 2011-06-20, Marsh Ray wrote:
>> I once looked up the Unicode algorithm for some basic "case
>> insensitive" string comparison... 40 pages!
> Isn't that precisely why e.g. Peter Gutmann once wrote against the
> canonicalization (in the Unicode context, "normalization") that ISO
> derived crypto protocols do, in favour of the "bytes are bytes" approach
> that PGP/GPG takes?

Yes, but in most real systems the strings are going to get handled 
somewhere. It's really a question of whether or not your protocol 
specification defines the format it expects.

Humans tend not to define text very precisely, and computers don't work 
with it directly anyway; they only work with encoded representations of 
text as character data. Even a simple accented character in a word or 
name can be represented in several different ways.
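To make that concrete, here's a minimal sketch (Python used purely for 
illustration): the character "é" can be carried as one precomposed code 
point (NFC form) or as "e" plus a combining accent (NFD form). The two 
render identically but compare unequal byte-for-byte.

```python
import unicodedata

nfc = "\u00e9"    # 'é' as the single precomposed code point U+00E9
nfd = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(nfc == nfd)                                # False: different code points
print(unicodedata.normalize("NFC", nfd) == nfc)  # True once normalized
print(nfc.encode("utf-8"))                       # b'\xc3\xa9'
print(nfd.encode("utf-8"))                       # b'e\xcc\x81'
```

So even within a single encoding (UTF-8 here), the "same" text can 
produce different octets unless someone normalizes it first.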

Many devs (particularly Unixers :-) in the US, AU, and NZ have gotten 
away with the "7 bit ASCII" assumption for a long time, but most of the 
rest of the world has to deal with locales, code pages, and multi-byte 
encodings. This seems to be how older IETF protocol specs often got 
away without a rigorous treatment of character data encoding issues. 
(I suspect one factor in the English-speaking world's lead in the 
development of 20th century computers and protocols is that we could 
get by with one of the smallest character sets.)

Let's say you're writing a piece of code like:
if (username == "root")
	// avoid doing something insecure with root privs
The logic of this example is probably broken in important ways but the 
point remains: sometimes we need to compare usernames for equality in 
contexts that have security implications. You can only claim "bytes are 
bytes" up until the point that the customer says they have a directory 
server which compares usernames "case insensitively".
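The hazard can be sketched like this (Python purely for illustration; 
casefold() is just one assumed interpretation of what a directory 
server's "case insensitive" match might mean):

```python
def is_root_naive(username: str) -> bool:
    # Byte-for-byte comparison, as in the snippet above
    return username == "root"

def is_root_folded(username: str) -> bool:
    # Unicode-aware full case folding (stronger than lower())
    return username.casefold() == "root"

print(is_root_naive("Root"))   # False: the naive check misses the match
print(is_root_folded("Root"))  # True: the directory server would say these collide
```

If the authentication backend treats "Root" and "root" as the same 
account but your privilege check only matches the exact bytes "root", 
the check is bypassable.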

For most things "verbatim binary" is the right choice. However, a 
password or pass phrase is specifically character data which is the 
result of a user input method.

> If you want to do crypto, just do crypto on the bits/bytes. If you
> really have to, you can tag the intended format for forensic purposes
> and sign your intent. But don't meddle with your given bits.
> Canonicalization/normalization is simply too hard to do right or even to
> analyse to have much place in protocol design.

Consider RADIUS.

The first RFC http://tools.ietf.org/html/rfc2058#section-5.2
says nothing about the encoding of the character data of the password 
field, it just treats it as a series of octets. So what do you do when 
implementing RADIUS on an OS that gives user input to your application 
with UTF-16LE encoding? If you "don't meddle with your given bits" and 
just pass them on to the protocol layer, they are almost guaranteed to 
be non-interoperable.
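A quick sketch of the interop hazard (Python for illustration; the 
password value is a made-up example): the same character string yields 
entirely different octet sequences depending on which encoding the OS 
handed your application.

```python
pw = "p\u00e4ssword"  # hypothetical password containing 'ä'

utf16le = pw.encode("utf-16-le")  # what a UTF-16LE OS API might hand you
utf8 = pw.encode("utf-8")         # what most RADIUS peers will expect

print(utf16le)           # two octets per character, with embedded zero bytes
print(utf8)              # b'p\xc3\xa4ssword'
print(utf16le == utf8)   # False: a peer expecting one rejects the other
```

"Don't meddle with your given bits" only works if both ends were given 
the same bits to begin with.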

Later RFCs http://tools.ietf.org/html/rfc2865
added in most places "It is recommended that the message contain 
UTF-8 encoded 10646 characters." I think this is a really practical 
middle ground. Interestingly, it doesn't say this for the password 
field, likely because the authors figured it would break some existing 
underspecified behavior.

So exactly which characters are allowed in passwords and how are they to 
be represented for interoperable RADIUS implementations? I have no idea, 
and I help maintain one!

Consequently, we can hardly blame users for not using special characters 
in their passwords.

- Marsh
