Tweet

Tagir Valeev

Follow @tagir_valeev

Jun 9, 2021 • 21 tweets • 7 min read

https://twitter.com/tagir_valeev/status/1402089474805354499

Thanks to everybody who participated! Let's dig into this puzzle.

https://twitter.com/tagir_valeev/status/1402089474805354499

We have a HashSet<StringBuilder> here (which has a HashMap inside), we fill it with some objects, then we add one more StringBuilder (sb), and it's correctly added (contains() returns 'true'). Next, we modify the sb contents and contains() returns 'false' now. How it's possible?

Usually, it's a bad idea to mutate the objects used as hash table keys if the hashCode depends on the changed state. In this case, the updated hashCode will point to the wrong bucket, and the hash table won't be able to find the corresponding entry.

However, StringBuilder never overrides equals/hashCode, so it has the identity equals and the identity hashCode, which doesn't depend on content. This way, regardless of the updates, we will always find the proper bucket, and contains() will always return true. Isn't it?

Well, since Java 8 the HashMap implementation was updated to build a red-black tree if we have 8 or more elements inside the single bucket to avoid linear scan. For this, it should be possible to compare elements. bugs.openjdk.java.net/browse/JDK-800…

To compare elements, HashMap uses first complete hashCode (not only bits used to identify the bucket), then if elements are Comparable and have the same class it uses compareTo method. Then it uses runtime class name and identityHashCode as the last resort.

This procedure was still stable for StringBuilder in Java 8..10, as StringBuilder was not comparable. However, in Java 11 it was improved and it actually implements a Comparable interface that compares two StringBuilders by content. bugs.openjdk.java.net/browse/JDK-813…

So, if we actually have many collisions in HashSet/HashMap that uses StringBuilder as keys, it may rely on compareTo() and fail if the StringBuilder content was changed since the StringBuilder was added to the collection, as we may take the wrong route inside the r-b tree.

To make this happen, we need to generate some collisions and force HashMap to treeify the bucket. The identity hash codes are just random numbers, so you need to create a number of StringBuilders until you find one with the desired bucket number.

hc ^ (hc >>> 16) is the formula used in HashMap implementation to improve the bucket distribution when low bits change not very often. A bucket number is just several lower bits, hence & 0x3F github.com/openjdk/jdk/bl…

So this code will generate StringBuilder objects that always fall into the first bucket inside the HashMap having the capacity of 64 (0x40). It's the minimal treeification capacity. github.com/openjdk/jdk/bl…

So if we generate about 40 of such StringBuilders, we will have a pretty treeified HashMap. However, compareTo() won't be called, unless we have exactly duplicate hashCode(). Luckily, we need only one exact duplicate and it can collide with any of 40 objects we already have.

Putting this all together, we can generate a list of 40 StringBuilders that go to the same bucket. Let's also sort them by hashCode to simplify the search.

Next, let's find a single collision:

That's it. We initialize the first 40 StringBuilders with "a" to make them lexicographically bigger than the empty one. However, when we append "oops", their order changes, and compareTo() will return the different result. Here's the complete solution gist.github.com/amaembo/e20a66…

@AndreiPangin

A similar solution was proposed by @AndreiPangin who just generated 10,000,000 StringBuilder objects and find the collision between them. His solution works faster!

https://twitter.com/AndreiPangin/status/1402240430385807361

@quydxnguyen

Another solution was proposed by @quydxnguyen but he looks for collision with exact StringBuilder hash, rather than any hash which makes the things slower. But he parallelized the search for speedup!

https://twitter.com/quydxnguyen/status/1402151079635308544

@heinzkabutz

One more solution is submitted by @heinzkabutz who used a bigger table than me and did a trick with an anonymous class static initializer to write some imperative code. Good job, Heinz!

https://twitter.com/heinzkabutz/status/1402260780490297349

@VladimirSitnikv

The main question that this puzzle poses: is this ok to mutate StringBuilder objects used as HashMap keys/HashSet elements in production code? As @VladimirSitnikv suggests here, you should test your code with -XX:+UnlockExperimentalVMOptions -XX:hashCode=2

https://twitter.com/VladimirSitnikv/status/1401230889468809219

This option makes the identity hashCode of all the objects the same. So with this option, your code will fail without special preparation of StringBuilders. On the other hand, is it a big chance that many StringBuilders will actually fall into a single bucket?

After all, when designing programs we often assume that something unlikely is impossible (the whole Git thing relies on the absence of SHA1 collisions). Can we consider the program to be correct if it uses mutable StringBuilders in hash tables? How long will it run until failure?

• • •

Missing some Tweet in this thread? You can try to force a refresh

Share this page!

Tagir Valeev

People who liked this thread also liked...

Try unrolling a thread yourself!

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?