Leigh Dodds @ldodds
Let's talk about identifiers and why they're an important building block in our data infrastructure... Photo by Adrian Curiel
Identifiers are labels that we assign to things. Roads, cars, products, buildings, companies, diseases, species, books, academic papers, stars, planets, drugs and yes, people.
Identifiers exist everywhere. Every single application, website, service or game you've ever used relies on identifiers. Sometimes they're visible because they are printed on things as bar-codes or serial numbers. But most of the time they're invisible. So we take them for granted.
Why are identifiers so important? It's because they are the starting point for how we collect and manage data. Let's say we want to collect data about a building. We might want to record when it was built, who owns it, etc. To do that we need an identifier for every building.
We can always create our own identifiers. But if we want to share our data then we have a problem. We now need to work out which building in my data is the same as yours. That can be a lot of work for us. It's even harder for someone else to do without our help.
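To give a feel for that work, here is a minimal sketch in Python of matching buildings across two datasets that each use their own identifiers. Everything in it is an assumption made for illustration: the identifiers, the addresses, the `normalise` and `best_match` helpers, and the 0.8 similarity threshold are all hypothetical. Real matching is usually much messier than this.

```python
from difflib import SequenceMatcher

# Hypothetical data: each dataset invented its own building identifiers.
my_buildings = {"mine-17": "12 High Street, Bath"}
your_buildings = {
    "yours-A9": "12 High St., Bath",
    "yours-B2": "4 Mill Lane, Bath",
}

def normalise(address):
    """Lower-case, strip punctuation and expand one common abbreviation."""
    cleaned = address.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.replace(" st ", " street ").split())

def best_match(address, candidates, threshold=0.8):
    """Return the candidate id with the most similar address, or None."""
    scored = [
        (SequenceMatcher(None, normalise(address), normalise(other)).ratio(), other_id)
        for other_id, other in candidates.items()
    ]
    score, other_id = max(scored)
    return other_id if score >= threshold else None

# Guess which of "your" buildings each of "my" buildings refers to.
mapping = {
    my_id: best_match(address, your_buildings)
    for my_id, address in my_buildings.items()
}
print(mapping)
```

Even in this toy case we had to guess at abbreviations and pick a threshold. Those guesses are where the effort and the errors come from.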
There are lots of reports about how much effort it costs data scientists to tidy up and clean data, so they can do useful things with it. One of these repetitive tasks is mapping between identifiers: hbr.org/2016/09/bad-da…
We can avoid all this by just using the same identifiers in our datasets. For example if we can agree on a system for giving a building an identifier, then we can use those identifiers right from the start. Our data is immediately easier to use.
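As a rough sketch of what "immediately easier to use" looks like, here are two small datasets that use the same building identifiers. The identifiers ("B-001" and so on), the field names and the `join_on_identifier` helper are made up for this example and don't come from any real scheme.

```python
# Two hypothetical datasets keyed by the same (made-up) building identifier.
construction = {
    "B-001": {"built": 1932},
    "B-002": {"built": 1987},
}
ownership = {
    "B-001": {"owner": "Acme Estates"},
    "B-003": {"owner": "City Council"},
}

def join_on_identifier(left, right):
    """Combine the records for identifiers that appear in both datasets."""
    shared = left.keys() & right.keys()
    return {key: {**left[key], **right[key]} for key in sorted(shared)}

combined = join_on_identifier(construction, ownership)
print(combined)
```

There is no guessing involved. Records with the same identifier describe the same building, so combining them is a simple lookup.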
If you hear someone talking about linking together datasets, what they mean is making sure those datasets use the same identifiers. And where they don't, updating the datasets to add them in.
The more datasets that use the same identifiers, the more valuable each individual dataset becomes. This is called a network effect. Network effects are very powerful. nfx.com/post/network-e…
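Adding shared identifiers to an existing dataset might look something like the sketch below. It assumes we already have a lookup table from our local identifiers to the shared ones. The `crosswalk` table, the field names and the `add_shared_identifier` helper are all hypothetical.

```python
# Hypothetical lookup from a local identifier to a shared one.
crosswalk = {"mine-17": "B-001", "mine-18": "B-002"}

records = [
    {"local_id": "mine-17", "built": 1932},
    {"local_id": "mine-99", "built": 2001},
]

def add_shared_identifier(rows, lookup):
    """Return copies of rows with a 'shared_id' field (None if unmapped)."""
    return [{**row, "shared_id": lookup.get(row["local_id"])} for row in rows]

linked = add_shared_identifier(records, crosswalk)

# Records we could not link still need someone to look at them.
unmapped = [row["local_id"] for row in linked if row["shared_id"] is None]
print(linked)
print(unmapped)
```

Keeping the local identifier alongside the shared one means nothing is lost, and the unmapped list shows how much linking work is left.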
Governments, organisations and sectors all create useful identifiers. So often when we are collecting new data we don't need to come up with a new identifier. We can just use an existing one. For example we have identifiers for products, academic papers, buildings and books.
Unfortunately, reusing identifiers is often hard to do. The organisations that create the identifiers don't make them easy to reuse. For example, there is often no way to find out which identifier applies to which building. And the identifiers are often not available under an open licence.
When existing identifiers are hard to reuse, or are only available under closed licences, then they add costs to everyone who could have benefited from them. Costs might be in licensing fees or in time & effort required to create and maintain unnecessary competing identifiers.
Arguments around opening identifiers often focus on the costs incurred by the organisations that create and maintain them. But this overlooks the costs incurred by EVERYONE else when those identifiers aren't open.
Identifiers are infrastructure. They provide the scaffolding around which we build our data & our digital systems. Identifiers have most utility when they are open. This @ODIHQ & Thomson Reuters paper has more details: innovation.thomsonreuters.com/en/labs/data-i…
For more detail on the specific problem of property identifiers in the UK, read @owenboswarva's excellent post here: owenboswarva.com/blog/post-addr…
I've tried to illustrate how identifiers can be used to link together datasets in this short video. I don't know what I'm doing with animations, but there it is!
