, 14 tweets, 3 min read Read on Twitter
Think you understand computers? Which of the following is true:

1. ‘:' < ‘;’
2. ‘:’ > ‘;'

Correct answer: it depends! In a fresh install of MySQL, ‘:' < ‘;’. In Postgres, ‘:’ > ‘;'.

Follow my descent into madness (and Unicode collation) to find out why...

1/n
First some context. I had a list of strings:

'foo:a'
'bar:b'
'bar:c'
'baz:d'

I wanted to select everything with a prefix of 'bar:' using only '>' and '<' (the reason is another story), so I looked up the ASCII table and found that ':' + 1 = ';'

2/n
Cool, so then I try "SELECT ... WHERE key > 'bar:' AND key < 'bar;'" on my original DB on DynamoDB. That worked great.

I decide to port things to MySQL and suddenly that query returns NOTHING.

I begin to question everything I know and hold dear.

3/n
I eventually reduce a test case down to:

1. "SELECT ':' < ';'" returns true
2. SELECTing ':' and ';' ORDER BY key ASC from my _table_ returns ';' before ':'

Wat.

Then I remembered another weird text issue...

4/n
When I was at Dropbox, anytime you mentioned text handling someone would groan about the "Turkish i problem":

With a Turkish locale, lower('I') is not 'i' but a different character 'ı'.

This is a huge pain if you require that lower-cased filename must be unique.

5/n
The trauma from Dropbox lead me to think that this might be some weird Unicode thing. After all, I had explicitly set my table to use UTF8 encoding (specifically 'utf8mb4') with a "collation" of 'utf8mb4_unicode_ci'... because the MySQL docs told me to?

6/n
What is a collation? A collation is what determines the ordering of strings.

"Isn't that just the alphabet?" you ask.

Sure, but what happens when you have letters like 'À', 'Ẵ', and 'Ā'? Someone has to define a sort order and for UTF8 it's the Unicode consortium.

7/n
The Unicode Collation Algorithm (unicode.org/reports/tr10/) gives an incredibly sophisticated algorithm that as far as I can tell... no one implements fully.

But it all starts with a huge table of everything in Unicode and "weights" for sorting: unicode.org/Public/UCA/4.0….

8/n
So lets look at the weights from the previous file for ':' and ';'

'COLON' -> 023D.0020.0002.003A
'SEMICOLON' -> 023A.0020.0002.003B

These are tuples of four "levels" and the leftmost level is most significant (read: what you should sort by).

9/n
Curiously, COLON and SEMICOLON are in the reverse order from ASCII!

Why?

I have no idea, but it's probably an artifact of needing to squeeze in other characters like "ARABIC COLON" (weight 023B) in between them and scrambling things up in the process.

10/n
Later, I discover the WEIGHT_STRING function in MySQL which validates this:

> HEX(WEIGHT_STRING(':'))
-> 3A (The ASCII byte for ':')

> HEX(WEIGHT_STRING(
CAST(':' AS CHAR CHARACTER SET 'utf8mb4')
COLLATE 'utf8mb4_unicode_ci'))
-> 023D (The weight from above)

11/n
MySQL by default uses 'latin1' with a 'latin1_swedish_ci' collation (really!), which determines the default collation for string literals (it can be changed with a "SET NAMES..." command).

Postgres infers things from your locale, so for me it's 'UTF8' and 'en_US.UTF-8.'

12/n
Alas, our mystery is solved!

In the end I switched to 'utf8mb4_binary' which sorts by Unicode (numerical) code point and does what I want.

Now you know way more about colons than perhaps you ever thought possible.

To close, here is a photo of one of our dogs:

13/13
PSA/PS: Don't use 'utf8mb4_general_ci'. The MySQL manual states that "comparisons for [this] collation [are] slightly less correct." Browsing the source code, I think it chops off everything but the lowest level weight. Making ':' less than ';'... against the UTF8 Rules!
Missing some Tweet in this thread?
You can try to force a refresh.

Like this thread? Get email updates or save it to PDF!

Subscribe to Ben Newhouse
Profile picture

Get real-time email alerts when new unrolls are available from this author!

This content may be removed anytime!

Twitter may remove this content at anytime, convert it as a PDF, save and print for later use!

Try unrolling a thread yourself!

how to unroll video

1) Follow Thread Reader App on Twitter so you can easily mention us!

2) Go to a Twitter thread (series of Tweets by the same owner) and mention us with a keyword "unroll" @threadreaderapp unroll

You can practice here first or read more on our help page!

Follow Us on Twitter!

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just three indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3.00/month or $30.00/year) and get exclusive features!

Become Premium

Too expensive? Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal Become our Patreon

Thank you for your support!