foone @Foone
Today I had to explain to a coworker what the BOM is.
It's a character in Unicode: U+FEFF, the Byte Order Mark.
It's special and invisible, but it can be useful
Its primary purpose is in UTF-16/UCS-2, where it tells you if the file is encoded in UTF-16-LE or UTF-16-BE. Those are the little-endian and big-endian versions of UTF-16.
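To make the byte-order point concrete, here's a minimal Python sketch (not from the original thread; the helper name `utf16_byte_order` is made up for illustration). It shows how U+FEFF comes out in each flavor of UTF-16 and uses the first two bytes to guess which one you have.

```python
import codecs
from typing import Optional

BOM = "\ufeff"

# The same code point, serialized in each byte order.
print(BOM.encode("utf-16-le").hex())  # fffe
print(BOM.encode("utf-16-be").hex())  # feff


def utf16_byte_order(data: bytes) -> Optional[str]:
    """Guess the UTF-16 byte order from a leading BOM (hypothetical helper).

    Returns None when there is no UTF-16 BOM, in which case the
    byte order has to come from somewhere else.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    return None
```

The bytes FE FF read in the wrong order give U+FFFE, which Unicode guarantees is not a character, so a swapped BOM can't be mistaken for real text.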
its use is slightly weirder in UTF-8, because UTF-8 has a defined byte order and doesn't have any LE/BE variation.
But it does at least confirm "this is a UTF-8 text file", since UTF-8 can look like ASCII.
and that's the problem. Microsoft decided that rather than require encoding definitions or guess what encoding the text is in, it would always require a UTF-8 BOM at the start of the file when reading it, to determine if it's UTF-8
and conversely, Microsoft text-editing tools always write out the UTF-8 BOM, even if they loaded the file without one.
Most text editors know that this character is invisible and shouldn't be rendered, which means you can have BOMs in your files and not know it.
and it turns out someone opened a couple files in our (Linux-specific) codebase in Visual Studio or Notepad sometime a couple years back, so those files have UTF-8 BOMs in them.
My coworker realized there was a weird character at the beginning of a file while doing some command-line hackery and had no idea what it was.
So yeah, if you see U+FEFF, that's what it is. That's encoded as EF BB BF in UTF-8.
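If you want to hunt these down yourself, here's a rough Python sketch. It's not what we actually ran, and `has_utf8_bom`, `strip_utf8_bom`, and `find_bom_files` are names invented for this example. It checks for the EF BB BF prefix, strips it, and walks a directory looking for files that start with it.

```python
import codecs
from pathlib import Path

# codecs.BOM_UTF8 is the three bytes EF BB BF.


def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the UTF-8 encoding of U+FEFF."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop one leading UTF-8 BOM, if present; otherwise return data as-is."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data


def find_bom_files(root):
    """Return a sorted list of regular files under root that start with a UTF-8 BOM."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            with path.open("rb") as f:
                # Only the first three bytes matter for this check.
                if f.read(3) == codecs.BOM_UTF8:
                    hits.append(path)
    return sorted(hits)
```

When decoding text in Python, the built-in `utf-8-sig` codec does the stripping for you: `b"\xef\xbb\xbfhello".decode("utf-8-sig")` gives `"hello"`, while plain `utf-8` keeps the U+FEFF as the first character.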
A fun fact: You can't reliably tell the difference between UTF-32-LE and UTF-16-LE using the BOM, 'cause the UTF-32-LE BOM (FF FE 00 00) starts with the UTF-16-LE BOM (FF FE), and your UTF-16-LE text could have a NUL right after the BOM.
Thankfully almost no one uses UTF-32 for document storage/transmission, so this is a non-issue for the most part.
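Here's a tiny Python sketch of that collision (again, just an illustration, with variable names picked for this example): an empty UTF-32-LE document with a BOM and a UTF-16-LE document holding a BOM plus a single NUL are the exact same four bytes.

```python
import codecs

# An empty UTF-32-LE file with a BOM: FF FE 00 00
utf32_empty = codecs.BOM_UTF32_LE

# A UTF-16-LE file containing a BOM followed by one NUL character: FF FE 00 00
utf16_bom_then_nul = "\ufeff\x00".encode("utf-16-le")

# Byte for byte, they are identical.
same_bytes = utf32_empty == utf16_bom_then_nul

# A naive UTF-16-LE check also fires on UTF-32-LE data.
naive_utf16_match = utf32_empty.startswith(codecs.BOM_UTF16_LE)

print(utf32_empty.hex(), utf16_bom_then_nul.hex(), same_bytes, naive_utf16_match)
```

If you do have to sniff both, checking for the longer UTF-32 BOM before the UTF-16 one is the usual ordering, but it's still only a guess.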
Tim Bray on xml-dev once said:

"Actually, I think that the UTF-8 BOM is a deeply stupid idea that serves no useful purpose in any imaginable universe. We wouldn't be thinking about were it not for the fact that MS Notepad happens to write one for UTF-8 documents."
Thank you for coming to my presentation entitled
"The Unicode Byte Order Mark: It's The BOM, Yo!"