[cryptography] ZFS dedup? hashes (Re: [zfs] SHA-3 winner announced)

Eugen Leitl eugen at leitl.org
Thu Oct 4 05:47:08 EDT 2012

----- Forwarded message from Jim Klimov <jimklimov at cos.ru> -----

From: Jim Klimov <jimklimov at cos.ru>
Date: Thu, 04 Oct 2012 13:44:21 +0400
To: zfs at lists.illumos.org
CC: Eugen Leitl <eugen at leitl.org>
Subject: Re: ZFS dedup? hashes (Re: [cryptography] [zfs] SHA-3 winner announced)
Reply-To: jimklimov at cos.ru
Organization: JSC COS/HT
User-Agent: Mozilla/5.0 (Windows NT 5.2; WOW64; rv:15.0) Gecko/20120907 Thunderbird/15.0.1

2012-10-03 18:52, Eugen Leitl wrote:
> I infer from your comments that you are focusing on the ZFS use of a hash
> for dedup?  (The forward did not include the full context).  A forged
> collision for dedup can translate into a DoS (deletion) so 2nd pre-image
> collision resistance would still be important.

This subject was discussed a few months ago on zfs-discuss,
I believe the thread history may be here:


Regarding dedup-collision attacks, the situation is as follows.
ZFS dedup uses the checksum of a low-level block of ZFS data
(after compression, and after encryption in the case of
Solaris 11). The final on-disk blocks, whatever their contents,
are checksummed as part of ZFS integrity verification (to
detect bitrot), and the stronger of these checksums can serve
as keys into the deduplication table (DDT) if dedup is enabled
for the dataset. The DDT is consulted on write: prepare the
final block contents, compute the checksum, look it up in the
DDT, then either increment the existing entry's counter or
create a new entry with counter=1. The DDT is shared by many
datasets on the pool, so accounting of used/free space becomes
"interesting", but users have few if any ways to know whether
their data was deduped (one might infer it from changes in
used/free space, but can never be sure that one's own recently
written file was involved).
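
The write path described above can be sketched roughly as follows.
This is a toy model only, not the real ZFS code or data structures:
ToyPool, DDTEntry and write_block are hypothetical names invented
for illustration, and a Python dict stands in for the on-disk DDT.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class DDTEntry:
    block: bytes       # stands in for the on-disk location of the stored block
    refcount: int = 1  # a new entry starts with counter = 1


class ToyPool:
    """Toy model of a deduplicated pool (hypothetical, not real ZFS)."""

    def __init__(self):
        self.ddt = {}            # checksum -> DDTEntry, shared by all datasets
        self.blocks_written = 0  # physical block writes actually performed

    def write_block(self, final_block: bytes) -> str:
        # final_block is the block as it would hit the disk,
        # i.e. after compression (and encryption, if any).
        key = hashlib.sha256(final_block).hexdigest()
        entry = self.ddt.get(key)
        if entry is not None:
            entry.refcount += 1  # duplicate: only bump the counter
        else:
            self.ddt[key] = DDTEntry(final_block)
            self.blocks_written += 1
        return key


pool = ToyPool()
k1 = pool.write_block(b"A" * 4096)
k2 = pool.write_block(b"A" * 4096)  # same contents, e.g. from another dataset
k3 = pool.write_block(b"B" * 4096)
print(pool.blocks_written, pool.ddt[k1].refcount, pool.ddt[k3].refcount)
```

Note that the writer only gets a key back; nothing in this
interface tells it whether the block was deduped.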

The block is several sectors in size, currently ranging from
512 bytes to 128 KB. To craft an attack on dedup you would have to:
1) Know what data will be written by the victim - exactly,
   including raw data, compression algo, encryption, etc.;
2) Create a block with forged data which has the same checksum
   (as used by this block's metadata on disk - currently sha256,
   maybe more as a result of Saso's work);
3) Be the very first writer into this pool that creates a block
   with this hash and enters it into the DDT.

In reality, any co-user of space on the deduped pool might
attempt this. It is impractical, however: you need such intimate
access to the victim's source data and system setup details
that you might just as well be the storage admin, who can simply
corrupt or overwrite the victim's userdata block with whatever
trash. Also, as far as dedup goes, simply enabling verification
(dedup=verify) makes ZFS compare the on-disk block with the one
it is about to save (given that they have the same checksum,
maybe size, and one is already in the DDT); if the two don't
match, ZFS should just write the new block non-deduped. The
attack would then at most waste space on the storage: if the
victim's data is indeed dedupable, many identical copies are
ultimately saved, while the forged block merely sits there and
occupies the DDT entry.
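
To illustrate the effect of verification, here is a minimal sketch.
It is not ZFS code: weak_checksum, find_collision and write_block
are hypothetical helpers, and the checksum is deliberately truncated
to a single byte of SHA-256 so that a "forged" colliding block can
be found in a few hundred tries (no such shortcut is known for full
sha256).

```python
import hashlib


def weak_checksum(block: bytes) -> bytes:
    # Deliberately truncated to 1 byte so a collision is trivial to find.
    return hashlib.sha256(block).digest()[:1]


def find_collision(target: bytes) -> bytes:
    # Brute-force a different block with the same (weak) checksum.
    want = weak_checksum(target)
    i = 0
    while True:
        candidate = b"forged-%d" % i
        if candidate != target and weak_checksum(candidate) == want:
            return candidate
        i += 1


def write_block(ddt: dict, extra: list, block: bytes, verify: bool) -> bytes:
    """Return the contents a later read of this block would yield."""
    key = weak_checksum(block)
    stored = ddt.get(key)
    if stored is None:
        ddt[key] = block     # first writer owns the DDT entry
        return block
    if verify and stored != block:
        extra.append(block)  # same checksum, different bytes: write non-deduped
        return block
    return stored            # deduped against whatever is already stored


victim = b"victim's payroll block"
forged = find_collision(victim)

for verify in (False, True):
    ddt, extra = {}, []
    write_block(ddt, extra, forged, verify)             # attacker writes first
    readback = write_block(ddt, extra, victim, verify)  # victim writes second
    print(verify, readback == victim, len(extra))
```

Without verification the victim reads back the forged block; with
it, the victim's data survives and the only cost is one extra block
written next to the forged one.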

> Incidentally a somewhat related problem with dedup (probably more in cloud
> storage than local dedup of storage) is that the dedup function itself can
> lead to the "confirmation" or even "decryption" of documents with
> sufficiently low entropy as the attacker can induce you to "store" or
> directly query the dedup service looking for all possible documents.  eg say
> a form letter where the only blanks to fill in are the name (known
> suspected) and a figure (<1,000,000 possible values).

What sort of attack do you suggest? That a storage user (attacker)
pre-creates a million files of this form with filled-in data?

Having no access to ZFS low-level internals and metadata, the
end-user has no reliable way of knowing that a particular file
got deduped. (And it is not files but their component blocks,
to be exact.) And if an admin does that, he might just as well
read the victim's file directly (on a non-encrypted pool).

Or did I misunderstand your point?
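
For concreteness, the guessing step of the quoted scenario might
look like the sketch below. It assumes the attacker has somehow
obtained the checksum of the victim's block (or an equivalent yes/no
dedup oracle), which is exactly the part an ordinary ZFS user lacks.
TEMPLATE, block_checksum and recover_amount are made-up names, and
compression and block boundaries are ignored.

```python
import hashlib

# Hypothetical form letter: only the name and the figure vary.
TEMPLATE = "Dear {name}, your bonus this year is {amount} USD.\n"


def block_checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def recover_amount(observed: str, name: str, limit: int = 1_000_000):
    # Enumerate every possible figure and compare checksums.
    for amount in range(limit):
        candidate = TEMPLATE.format(name=name, amount=amount).encode()
        if block_checksum(candidate) == observed:
            return amount
    return None


secret = TEMPLATE.format(name="Alice", amount=73_250).encode()
observed = block_checksum(secret)  # assumed to have leaked to the attacker
print(recover_amount(observed, "Alice"))
```

The search space is small enough that the loop finishes quickly;
the hard part is getting hold of "observed" in the first place.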

> Also if there is encryption there are privacy and security leaks arising
> from doing dedup based on plaintext.
> And if you are doing dedup on ciphertext (or the data is not encrypted), you
> could follow David's suggestion of HMAC-SHA1 or the various AES-MACs.  In
> fact I would suggest for encrypted data, you really NEED to base dedup on
> MACs and NOT hashes or you leak and risk bruteforce "decryption" of
> plaintext by hash brute-forcing the non-encrypted dedup tokens.

I am not enough of a cipher expert to even decipher this part well ;)
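
As far as I can follow it, the distinction drawn in the quoted
paragraph can be sketched like this: a plain hash can be recomputed
by anyone who guesses the plaintext, while a MAC also needs a secret
key. This is only an illustration under assumptions: pool_key is a
hypothetical per-pool secret, hash_token and mac_token are made-up
helpers, and HMAC-SHA256 is used here in place of the HMAC-SHA1
mentioned in the quote.

```python
import hashlib
import hmac
import secrets


def hash_token(block: bytes) -> str:
    # Unkeyed dedup token: anyone can recompute it from a guessed plaintext.
    return hashlib.sha256(block).hexdigest()


def mac_token(key: bytes, block: bytes) -> str:
    # Keyed dedup token: recomputing it requires the secret key.
    return hmac.new(key, block, hashlib.sha256).hexdigest()


pool_key = secrets.token_bytes(32)      # hypothetical secret held by the pool
attacker_key = secrets.token_bytes(32)  # the attacker can only guess a key

block = b"amount=73250"
guesses = [b"amount=%d" % n for n in range(100_000)]

leaked_hash = hash_token(block)
leaked_mac = mac_token(pool_key, block)

hash_hit = any(hash_token(g) == leaked_hash for g in guesses)
mac_hit = any(mac_token(attacker_key, g) == leaked_mac for g in guesses)
print(hash_hit, mac_hit)
```

Identical blocks still produce identical tokens under the same key,
so dedup keeps working inside the pool.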

//Jim Klimov

----- End forwarded message -----
Eugen* Leitl http://leitl.org
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
