Read on Twitter

12,399 views

Ben Newhouse

@newhouseb

, 14 tweets, 3 min read Read on Twitter

Think you understand computers? Which of the following is true:

1. ‘:' < ‘;’
2. ‘:’ > ‘;'

Correct answer: it depends! In a fresh install of MySQL, ‘:' < ‘;’. In Postgres, ‘:’ > ‘;'.

Follow my descent into madness (and Unicode collation) to find out why...

1/n

First some context. I had a list of strings:

'foo:a'
'bar:b'
'bar:c'
'baz:d'

I wanted to select everything with a prefix of 'bar:' using only '>' and '<' (the reason is another story), so I looked up the ASCII table and found that ':' + 1 = ';'

2/n

Cool, so then I try "SELECT ... WHERE key > 'bar:' AND key < 'bar;'" on my original DB on DynamoDB. That worked great.

I decide to port things to MySQL and suddenly that query returns NOTHING.

I begin to question everything I know and hold dear.

3/n

I eventually reduce a test case down to:

1. "SELECT ':' < ';'" returns true
2. SELECTing ':' and ';' ORDER BY key ASC from my _table_ returns ';' before ':'

Wat.

Then I remembered another weird text issue...

4/n

When I was at Dropbox, anytime you mentioned text handling someone would groan about the "Turkish i problem":

With a Turkish locale, lower('I') is not 'i' but a different character 'ı'.

This is a huge pain if you require that lower-cased filename must be unique.

5/n

The trauma from Dropbox lead me to think that this might be some weird Unicode thing. After all, I had explicitly set my table to use UTF8 encoding (specifically 'utf8mb4') with a "collation" of 'utf8mb4_unicode_ci'... because the MySQL docs told me to?

6/n

What is a collation? A collation is what determines the ordering of strings.

"Isn't that just the alphabet?" you ask.

Sure, but what happens when you have letters like 'À', 'Ẵ', and 'Ā'? Someone has to define a sort order and for UTF8 it's the Unicode consortium.

7/n

The Unicode Collation Algorithm (unicode.org/reports/tr10/) gives an incredibly sophisticated algorithm that as far as I can tell... no one implements fully.

But it all starts with a huge table of everything in Unicode and "weights" for sorting: unicode.org/Public/UCA/4.0….

8/n

So lets look at the weights from the previous file for ':' and ';'

'COLON' -> 023D.0020.0002.003A
'SEMICOLON' -> 023A.0020.0002.003B

These are tuples of four "levels" and the leftmost level is most significant (read: what you should sort by).

9/n

Curiously, COLON and SEMICOLON are in the reverse order from ASCII!

Why?

I have no idea, but it's probably an artifact of needing to squeeze in other characters like "ARABIC COLON" (weight 023B) in between them and scrambling things up in the process.

10/n

Later, I discover the WEIGHT_STRING function in MySQL which validates this:

> HEX(WEIGHT_STRING(':'))
-> 3A (The ASCII byte for ':')

> HEX(WEIGHT_STRING(
CAST(':' AS CHAR CHARACTER SET 'utf8mb4')
COLLATE 'utf8mb4_unicode_ci'))
-> 023D (The weight from above)

11/n

MySQL by default uses 'latin1' with a 'latin1_swedish_ci' collation (really!), which determines the default collation for string literals (it can be changed with a "SET NAMES..." command).

Postgres infers things from your locale, so for me it's 'UTF8' and 'en_US.UTF-8.'

12/n

Alas, our mystery is solved!

In the end I switched to 'utf8mb4_binary' which sorts by Unicode (numerical) code point and does what I want.

Now you know way more about colons than perhaps you ever thought possible.

To close, here is a photo of one of our dogs:

13/13

PSA/PS: Don't use 'utf8mb4_general_ci'. The MySQL manual states that "comparisons for [this] collation [are] slightly less correct." Browsing the source code, I think it chops off everything but the lowest level weight. Making ':' less than ';'... against the UTF8 Rules!

Like this thread? Get email updates or save it to PDF!

Subscribe to Ben Newhouse

Get real-time email alerts when new unrolls are available from this author!

This content may be removed anytime!

Twitter may remove this content at anytime, convert it as a PDF, save and print for later use!

Try unrolling a thread yourself!

1) Follow Thread Reader App on Twitter so you can easily mention us!

2) Go to a Twitter thread (series of Tweets by the same owner) and mention us with a keyword "unroll" @threadreaderapp unroll

You can practice here first or read more on our help page!

Like this thread? Get email updates or save it to PDF!

Subscribe to Ben Newhouse

This content may be removed anytime!

Try unrolling a thread yourself!

Related threads

Trending hashtags

Did Thread Reader help you today?