
Calascibetta Romain


20 Oct, 24 tweets, 7 min read

Let's talk about one of the oldest programs I have ever seen! The file command.

A bit of research about this command takes us back to 1986-1987 - I was not born yet - so it's a really old project. But it is still updated (the last release is 5.38 - 2020-06-27).

But the official mirror repository does not seem to hold the full history (its Changelog starts in 2003). The first "commit" is dated 1987 (according to the website).

darwinsys.com/file/

This command, and especially the underlying library libmagic, is used EVERYWHERE!

I mean, when you get an image from a website, the webserver is able to recognize the file as an image and it tells you the MIME type of the file (for an image, it can be image/png for example).

This is what we send in the Content-Type header field. Of course, the same mechanism exists in emails too.

Even if software like chromium re-implemented such a library on its own, the file command is still the bedrock for anyone who wants to recognize a file.

So, some people will tell me that it's enough to look at "the extension" of the file, fill an arbitrary dictionary and rely only on that. However, we know that an *.mp3 can be an *.exe...

The question is: how do we recognize the type of a file (which can come from anywhere) with a certain assurance of safety? file wants to solve this issue: it does not trust the given file, only its own database.

And, god, the database is really complex! The idea is to describe a decision tree where we can check a value (an integer, a string, etc.) which should appear in the file at a certain "offset".

So, the deeper we go in the tree, the better we recognize the given file, until the path ends at a MIME type!
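To make the decision-tree idea concrete, here is a minimal OCaml sketch. It is not libmagic's real format nor conan's code: the types `test` and `tree`, the function names and the three-entry `database` are all hypothetical, and only the well-known PNG, GIF and JPEG signatures are real.

```ocaml
(* A toy decision tree, loosely inspired by the idea behind the magic
   database. Every name here is hypothetical. *)

type test =
  | String_at of int * string  (* offset, expected bytes *)
  | Byte_at of int * int       (* offset, expected byte value *)

type tree =
  | Node of test * tree list   (* if the test passes, try the children *)
  | Leaf of string             (* end of a path: a MIME type *)

(* Check one test against the contents, never reading out of bounds. *)
let check contents = function
  | String_at (off, s) ->
    let len = String.length s in
    off >= 0
    && off + len <= String.length contents
    && String.sub contents off len = s
  | Byte_at (off, v) ->
    off >= 0
    && off < String.length contents
    && Char.code contents.[off] = v

(* Descend from a node; the first path that reaches a leaf wins. *)
let rec walk contents = function
  | Leaf mime -> Some mime
  | Node (test, children) ->
    if check contents test then List.find_map (walk contents) children
    else None

let database =
  [ Node (String_at (0, "\x89PNG\r\n\x1a\n"), [ Leaf "image/png" ])
  ; Node (String_at (0, "GIF8"),
          [ Node (String_at (4, "7a"), [ Leaf "image/gif" ])
          ; Node (String_at (4, "9a"), [ Leaf "image/gif" ]) ])
  ; Node (Byte_at (0, 0xff),
          [ Node (Byte_at (1, 0xd8), [ Leaf "image/jpeg" ]) ]) ]

let recognize contents = List.find_map (walk contents) database
```

With these definitions, `recognize "GIF89a..."` follows the "GIF8" node, then the "9a" node, and stops on the `image/gif` leaf; contents that match no root give `None`.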

But... A format such as an image for example can be really complex. Indeed, file does not just recognize the MIME type of the file: it can also give you some information, such as the size of the image, its ratio, etc.

At each step of our decision tree, we are able to append a piece of information. This information can include the tested value and is formatted "à la C".
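As an illustration of appending information at each step, here is a small OCaml sketch. The `rule` record, the helpers `byte` and `u16_le`, and the `gif` rule are hypothetical; the only facts taken from the real world are that a GIF header stores its width and height as 16-bit little-endian integers at offsets 6 and 8. A real rule would of course test the full signature, not just the first byte.

```ocaml
(* Hypothetical rule: read a value, test it, and if the test passes
   emit a message built from that value, then try the children. *)
type rule =
  { read : string -> int option  (* extract a value, or fail *)
  ; accept : int -> bool         (* the test on that value *)
  ; message : int -> string      (* printf-style rendering *)
  ; children : rule list }

let byte off s =
  if off >= 0 && off < String.length s then Some (Char.code s.[off])
  else None

(* 16-bit little-endian integer at [off]. *)
let u16_le off s =
  match byte off s, byte (off + 1) s with
  | Some lo, Some hi -> Some (lo lor (hi lsl 8))
  | _ -> None

(* Collect the messages of every rule that passes, top-down. *)
let rec describe contents rule =
  match rule.read contents with
  | Some v when rule.accept v ->
    rule.message v :: List.concat_map (describe contents) rule.children
  | _ -> []

(* Toy rule: only the first byte is tested here. *)
let gif =
  { read = byte 0
  ; accept = (fun v -> v = Char.code 'G')
  ; message = (fun _ -> "GIF image data")
  ; children =
      [ { read = u16_le 6
        ; accept = (fun _ -> true)
        ; message = Printf.sprintf "width %d"
        ; children = [] }
      ; { read = u16_le 8
        ; accept = (fun _ -> true)
        ; message = Printf.sprintf "height %d"
        ; children = [] } ] }

let summary contents = String.concat ", " (describe contents gif)
```

Each passing step contributes one fragment, and the final description is just the concatenation of the fragments met along the way.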

And, we are able to do an "indirection". That means we restart the whole recognition process at "the current relative offset" - and are able to recognize some sub-information such as JPEG's markers for example.

We can save a specific pattern too and re-use it at another point - and this way we get the idea of recursion in the database...

Finally, and not the best, we can use a regexp to recognize a file - but "man 5 magic" warns you that a regex can take an exponential time to process...

From such a process, we are able to describe a "best-effort" way to recognize a file - I mean, you can fail at several steps, but the real failure is when you have visited the whole tree without a match.

Of course, we walk the tree top-down (a descending walk), not bottom-up.

So, why did I start to talk about file? Because, as I said, we need it in a lot of software - a simple MUA, a webserver, etc.

And, as a #MirageOS developer, I would like to have such an ability in my unikernel. However, I cannot just integrate file as is into my unikernel.

I said that the process wants to check a specific value **in** a **file**. So file mostly uses the lseek and read syscalls. However, what can I do in MirageOS when I don't have lseek or the idea of a file...?

That's why I rewrote file in #OCaml... It does exactly the same as the file command but the core - the decision tree, how to construct the output string, etc. - is agnostic to the system.
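One common way to get a system-agnostic core in OCaml is a functor over a small "source" signature. The sketch below only illustrates that design; it is not conan's actual API. The names `SOURCE`, `read_at`, `Make`, `In_memory` and `Recognizer` are assumptions made up for this example.

```ocaml
(* What the core needs from the outside world: read [len] bytes at
   offset [off]. No file descriptor, no lseek. Hypothetical signature. *)
module type SOURCE = sig
  type t
  val read_at : t -> off:int -> len:int -> string option
end

(* The core only talks to [S], so it never touches the system. *)
module Make (S : SOURCE) = struct
  let starts_with src magic =
    S.read_at src ~off:0 ~len:(String.length magic) = Some magic

  let recognize src =
    if starts_with src "\x89PNG\r\n\x1a\n" then Some "image/png"
    else if starts_with src "GIF8" then Some "image/gif"
    else None
end

(* An in-memory backend: the kind of thing available without files. *)
module In_memory = struct
  type t = string

  let read_at s ~off ~len =
    if off < 0 || len < 0 || off + len > String.length s then None
    else Some (String.sub s off len)
end

module Recognizer = Make (In_memory)
```

A Unix backend could implement the same signature with lseek and read, while a unikernel could plug in a buffer or a block device, and the recognition code stays the same.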

github.com/dinosaure/conan

The other interesting part is trying to _type_ the database. As I said, we use formatted information "à la C". It's even possible to use %s for an integer... as in C. The OCaml program goes further and tries to unify the type of the tested value with the type expected by the output format.
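To give an idea of what "typing" a database line can mean, here is a hedged sketch using GADTs. It is one possible encoding, not the one conan uses: the types `ty`, `eq`, `value`, `printer` and the functions below are hypothetical, and only two directives (%d and %s) are handled.

```ocaml
(* Runtime descriptions of the types that can appear in a rule. *)
type _ ty =
  | Int : int ty
  | Str : string ty

(* A proof that two types are the same. *)
type (_, _) eq = Refl : ('a, 'a) eq

let equal : type a b. a ty -> b ty -> (a, b) eq option =
 fun a b ->
  match a, b with
  | Int, Int -> Some Refl
  | Str, Str -> Some Refl
  | _ -> None

(* A tested value, and a printer, each packed with its type. *)
type value = V : 'a ty * 'a -> value
type printer = P : 'a ty * ('a -> string) -> printer

(* Which type does a directive expect? Only two are handled here. *)
let printer_of_directive = function
  | "%d" -> Some (P (Int, string_of_int))
  | "%s" -> Some (P (Str, (fun s -> s)))
  | _ -> None

(* Printing is only allowed when both sides agree on the type. *)
let render value printer : (string, string) result =
  match value, printer with
  | V (tv, v), P (tp, print) ->
    (match equal tv tp with
     | Some Refl -> Ok (print v)
     | None -> Error "type mismatch between tested value and format")

let check directive value =
  match printer_of_directive directive with
  | Some p -> render value p
  | None -> Error "unknown directive"
```

Here `check "%d" (V (Int, 42))` renders the value, while `check "%s" (V (Int, 42))` is reported as a mismatch instead of being silently accepted as in C.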

It's real archaeological work and I was surprised (sometimes frightened) by such a piece of software - its algorithms, its C code, its whole design.
