[cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.

Nico Williams nico at cryptonector.com
Sat May 21 17:22:27 EDT 2011

On Sat, May 21, 2011 at 1:50 PM, Zooko O'Whielacronx <zooko at zooko.com> wrote:
> What I would most want is for ZFS (and every other filesystem) to
> maintain a Merkle Tree over the file data with a good secure hash.

Me too.  ZFS does do that, but unfortunately the internal Merkel hash
maintained this way also has metadata tree nodes (namely, indirect
blocks), which wouldn't be a problem except that embedded in that
metadata (and hashed in) are block pointers, which contain some data
which is completely non-deterministic relative to the file contents.
The right thing to have done would have been to exclude the address
portion of block pointers, but that owuld have required more data
movement and/or computation (and perhaps was not thought of at the
time, but I was never in the ZFS team, so I'd not know.

> Whenever a change to a file is made, the filesystem can update the
> Merkle Tree this with mere O(log(N)) work in the size of the file plus
> O(N) work in the size of the change. For a modern filesystem like ZFS
> which is already maintaining a checksum tree the *added* cost of
> maintaining the secure hash Merkle Tree could be minimal.


> Then, the filesystem should make this Merkle Tree available to
> applications through a simple query.

Indeed.  And if non-deterministic metadata is excluded then the only
thing one would have to know in order to independently compute the
same hash would be the maximum breath of the tree and leaf block size,
and that the tree is intended to be kept with all but the rightmost
leaf node full.  This would be a wonderful feature to have.

> This would enable applications—without needing any further
> in-filesystem code—to perform a Merkle Tree sync, which would range
> from "noticeably more efficient" to "dramatically more efficient" than
> rsync or zfs send. :-)

I haven't thought about this enough, but my level-0 concern would be
that data moves not aligned with block size would require a rolling
hash in order to make synchronization efficient, thus the Merkle tree
wouldn't help unless, perhaps, it were constructed from rolling hash

The other thought I had was that now that we accept that file stores
need large amounts of computational power, not jut I/O bandwidth, it's
not a stretch for the server to also compute and store (i.e.,
pre-compute) arrays of rolling CRCs, each taken over a small data
chunk contiguous with the preceding and succeeding ones.  This would
allow a client to very quickly read the rsync signature of a remote
file and locally generate a patch to it, making the write process very
efficient for large files that get many small changes all over the

> Of course it is only more efficient because we're treating the
> maintenance of the secure-hash Merkle Tree as free. There are two
> senses in which this is legitimate and it is almost free:

And/or because we intend to use this for optimizing an otherwise slow
operation, and if we do it enough then it'd be worthwhile even if the
given FS had never before used a Merkle tree.
> [...]
> 2. a. From what I've read, nobody uses the SHA-256 configuration in
> ZFS because it is too computationally expensive, so they use an
> insecure checksum (fletcher2/4) instead.

You kinda have to use SHA-256 if you want dedup.

> 2. b. I assume the shape of the resulting checksum tree is modified by
> artifacts of the ZFS layout instead of being a simple canonical shape.

Yes: a) the shape of the indirect block tree (but this is
deterministic if you know the maximum block size for the filesystem
and the maximum block size for the file), and b) physical block
address information in indirect blocks (which can and should have been
excluded -- it's good enough to compute the hash of an indirect block
over the block checksums in each block pointer).  See above.  Oh,
also, there the first few block pointers in the dnode too.

> This is a show-stopper for this use case because if the same file data
> exists on a different system, and some software on that system
> computes a Merkle Tree over the data, it might come out with different
> hashes than the ZFS checksum tree, thus eliminating all of the
> performance benefits of this approach.

Indeed, but:

> But, if ZFS could be modified to fix these problems or if a new
> filesystem would add a feature of maintaining a canonical,
> reproducible Merkle Tree, then it might be extremely useful.

I think it could be.  I don't think you'll get much out of ZFS folks
nowadays though.  Maybe if you talk to ZFS folks outside of Oracle
though, they might confirm whether this is doable.

BTW, IIRC I filed an RFE at Sun to expose a Merkle hash tree checksum
to applications.  I believe I've seen others ask for this before too
at various times, and it may be that the RFE I filed (if I did) was in
response to a request on one of the OSOL discuss lists -- this must
have been back in 2005.  This is a really, really obvious enhancement
to want -- it should be obvious the moment you realize that Merkle
hash trees are a godsend.


More information about the cryptography mailing list