Thanks to everybody who participated! Let's dig into this puzzle.
We have a HashSet<StringBuilder> here (which has a HashMap inside), we fill it with some objects, then we add one more StringBuilder (sb), and it's correctly added (contains() returns 'true'). Next, we modify the sb contents and contains() returns 'false' now. How it's possible?
Usually, it's a bad idea to mutate the objects used as hash table keys if the hashCode depends on the changed state. In this case, the updated hashCode will point to the wrong bucket, and the hash table won't be able to find the corresponding entry.
However, StringBuilder never overrides equals/hashCode, so it has the identity equals and the identity hashCode, which doesn't depend on content. This way, regardless of the updates, we will always find the proper bucket, and contains() will always return true. Isn't it?
Well, since Java 8 the HashMap implementation was updated to build a red-black tree if we have 8 or more elements inside the single bucket to avoid linear scan. For this, it should be possible to compare elements. bugs.openjdk.java.net/browse/JDK-800…
To compare elements, HashMap uses first complete hashCode (not only bits used to identify the bucket), then if elements are Comparable and have the same class it uses compareTo method. Then it uses runtime class name and identityHashCode as the last resort.
This procedure was still stable for StringBuilder in Java 8..10, as StringBuilder was not comparable. However, in Java 11 it was improved and it actually implements a Comparable interface that compares two StringBuilders by content. bugs.openjdk.java.net/browse/JDK-813…
So, if we actually have many collisions in HashSet/HashMap that uses StringBuilder as keys, it may rely on compareTo() and fail if the StringBuilder content was changed since the StringBuilder was added to the collection, as we may take the wrong route inside the r-b tree.
To make this happen, we need to generate some collisions and force HashMap to treeify the bucket. The identity hash codes are just random numbers, so you need to create a number of StringBuilders until you find one with the desired bucket number. Image
hc ^ (hc >>> 16) is the formula used in HashMap implementation to improve the bucket distribution when low bits change not very often. A bucket number is just several lower bits, hence & 0x3F github.com/openjdk/jdk/bl…
So this code will generate StringBuilder objects that always fall into the first bucket inside the HashMap having the capacity of 64 (0x40). It's the minimal treeification capacity. github.com/openjdk/jdk/bl…
So if we generate about 40 of such StringBuilders, we will have a pretty treeified HashMap. However, compareTo() won't be called, unless we have exactly duplicate hashCode(). Luckily, we need only one exact duplicate and it can collide with any of 40 objects we already have.
Putting this all together, we can generate a list of 40 StringBuilders that go to the same bucket. Let's also sort them by hashCode to simplify the search. Image
Next, let's find a single collision: Image
That's it. We initialize the first 40 StringBuilders with "a" to make them lexicographically bigger than the empty one. However, when we append "oops", their order changes, and compareTo() will return the different result. Here's the complete solution gist.github.com/amaembo/e20a66…
A similar solution was proposed by @AndreiPangin who just generated 10,000,000 StringBuilder objects and find the collision between them. His solution works faster!
Another solution was proposed by @quydxnguyen but he looks for collision with exact StringBuilder hash, rather than any hash which makes the things slower. But he parallelized the search for speedup!
One more solution is submitted by @heinzkabutz who used a bigger table than me and did a trick with an anonymous class static initializer to write some imperative code. Good job, Heinz!
The main question that this puzzle poses: is this ok to mutate StringBuilder objects used as HashMap keys/HashSet elements in production code? As @VladimirSitnikv suggests here, you should test your code with -XX:+UnlockExperimentalVMOptions -XX:hashCode=2
This option makes the identity hashCode of all the objects the same. So with this option, your code will fail without special preparation of StringBuilders. On the other hand, is it a big chance that many StringBuilders will actually fall into a single bucket?
After all, when designing programs we often assume that something unlikely is impossible (the whole Git thing relies on the absence of SHA1 collisions). Can we consider the program to be correct if it uses mutable StringBuilders in hash tables? How long will it run until failure?

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Tagir Valeev

Tagir Valeev Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @tagir_valeev

10 Mar
Here's an example of how one can discover bugs in the product using static analysis. 👇
I'm improving Java static analysis in IntelliJ IDEA. The recent improvement: I added a range contract for CharSequence.charAt() method: the argument must be non-negative. github.com/JetBrains/inte…
That's because the contract of charAt() method requires to throw IndexOutOfBoundsException on negative argument, so all the implementations must fulfill this contract.
Read 12 tweets
10 Mar
Тут не упомянута (или я пропустил) вполне банальная причина того что ФП не стало мейнстримом в девяностые. Оно медленное и жрёт гораздо больше памяти. Все эти миллионы функций надо создавать и подбирать сборщиком мусора.
Не от хорошей жизни люди делали изменяемые объекты в девяностые, и это считалось нормальным. Создание объекта было дорогой операцией, а сборщики мусора приостанавливали приложение.
Но с тех пор произошло несколько революций. Во-первых, память стала сильно дешевле, и её стало больше. В 50-мегабайтной куче на ФП не развернёшься, а в гигабайтной уже можно много интересного сделать.
Read 7 tweets
13 Feb
Ever wondered how the JVM debugger 'Evaluate' feature works in IDEs? Long ago I thought that the Java debug interface (JDI) actually allows you to specify the expression and get back its value.
However, this is not quite possible: there's no Java language compiler inside the JVM. When the process is paused JDI allows you to read or write any variables and fields or even execute an existing method with given arguments. But nothing like addition or multiplication.
On the other hand, #IntelliJIDEA debugger allows you to evaluate even statements! You can write there `for`, `if`, `while`, `switch` - whatever. You can even declare local variables. So how this works?
Read 10 tweets
8 Feb
Многие из вас знают "Сказку о глупом мышонке" Самуила Яковлевича Маршака. Либо слышали её от родителей, либо читали детям, либо и то, и другое. Так получилось, что я несколько раз переосмысливал, почему вообще Мышонок глупый.
Для тех, кто не в курсе, напоминаю содержание: Мышка-мать укладывает ребёнка спать посредством колыбельной. Мышонок критикует вокальные данные матери и требует лучшего певца. Мать по очереди приглашает всех соседей, которые также не проходят отбор в шоу "Голос".
Последним пунктом программы выступает хедлайнер - Кошка, которая наконец-то проходит quality-gate, и её песню мёрджат в продакшн. Однако неожиданно на проде появляются утечки. В частности, куда-то исчезает Мышонок.
Read 17 tweets
26 Mar 20
Ух, какую годноту от @phillennium пропустил. В каментах пожар! habr.com/ru/company/jug…
Я недавно задумался, что разбухание софта вызвано обработкой редких ситуаций. Например, рассмотрим часы. Самое простое - отсчитывать часы, минуты, секунды, 24/60/60, тут даже не на байты счёт, а на транзисторы.
Затем все захотят дату. В самых простых вариантах число и месяц без года. Каждые четыре года вручную подводишь часы на день. Можно, но неудобно. Добавляем год. Сперва в узком диапазоне, например, от 1980 до 2099 (в семь бит) влезает, на нашу жизнь хватит.
Read 11 tweets
29 Dec 18
Факты о Новосибирском офисе JetBrains (независимо от числа лайков).
1. Новосибирский офис расположен по адресу Демакова 30, в трёхстах метрах от башен Технопарка.
2. Офис открылся 1 сентября 2016 года. Я работаю в офисе с того же дня. Можно сказать, старожил.
Read 54 tweets

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Too expensive? Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal Become our Patreon

Thank you for your support!

Follow Us on Twitter!

:(