Thread by @ldodds: "Lets talk about identifiers and why they're an important building block in our data infrastructure... Identifiers are labels that we assign […]"

16 tweets, 8 min read
Let's talk about identifiers and why they're an important building block in our data infrastructure... (Photo by Adrian Curiel)
Identifiers are labels that we assign to things. Roads, cars, products, buildings, companies, diseases, species, books, academic papers, stars, planets, drugs and yes, people.
Identifiers exist everywhere. Every single application, website, service or game you've ever used relies on identifiers. Sometimes they're visible because they are printed on things as bar-codes or serial numbers. But most of the time they're invisible. So we take them for granted.
Why are identifiers so important? It's because they are the starting point for how we collect and manage data. Let's say we want to collect data about a building. We might want to record when it was built, who owns it, etc. To do that we need an identifier for every building.
We can always create our own identifiers. But if we want to share our data then we have a problem. We now need to work out which building in my data is the same as yours. That can be a lot of work for us. It's even harder for someone else to do without our help.
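To make that matching work concrete, here is a minimal sketch of reconciling two datasets that each invented their own building identifiers. Everything in it is hypothetical: the identifiers, the addresses, and the choice to match on a normalised address string are assumptions for illustration, not a method described in the thread. Real data is messier and usually needs fuzzier matching plus manual review.

```python
def normalise(address: str) -> str:
    """Lower-case, drop punctuation and collapse whitespace."""
    cleaned = "".join(ch for ch in address.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def build_mapping(mine: dict, yours: dict) -> dict:
    """Map each of my identifiers to your identifier for the same address.

    Buildings with no match map to None and still need manual work.
    """
    by_address = {normalise(addr): ident for ident, addr in yours.items()}
    return {ident: by_address.get(normalise(addr)) for ident, addr in mine.items()}


# Hypothetical datasets: local identifier -> address
mine = {"A-001": "12 High Street, Bath", "A-002": "3 Mill Lane, Bath"}
yours = {"bldg/77": "12 high street bath", "bldg/78": "9 Park Road, Bath"}

mapping = build_mapping(mine, yours)
print(mapping)
```

Even in this toy case one building is left unmatched, which is the kind of clean-up effort that shared identifiers avoid.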
There are lots of reports about how much effort it costs data scientists to tidy up and clean data, so they can do useful things with it. One of these repetitive tasks is mapping between identifiers hbr.org/2016/09/bad-da…
We can avoid all this by just using the same identifiers in our datasets. For example, if we can agree on a system for giving a building an identifier, then we can use those identifiers right from the start. Our data is immediately easier to use.
If you hear someone talking about linking together datasets, what they mean is making sure those datasets use the same identifiers. And where they don't, updating the datasets to add them in.
The more datasets that use the same identifiers, the more valuable each individual dataset becomes. This is called a network effect. Network effects are very powerful. nfx.com/post/network-e…
Governments, organisations and sectors all create useful identifiers. So often when we are collecting new data we don't need to come up with a new identifier. We can just use an existing one. For example, we have identifiers for products, academic papers, buildings and books.
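As a rough sketch of what linking looks like once identifiers are shared, the snippet below joins two datasets directly on a common building identifier, and adds that identifier to records that lack it. The identifiers ("B100" and so on), field names and lookup table are all made up for illustration.

```python
def link(left: dict, right: dict) -> dict:
    """Join two datasets on the identifiers they have in common."""
    return {ident: (left[ident], right[ident]) for ident in left.keys() & right.keys()}


def add_identifiers(records: list, lookup: dict) -> list:
    """Return copies of the records with a shared identifier added where known."""
    return [{**rec, "building_id": lookup.get(rec["address"])} for rec in records]


# Two hypothetical datasets keyed by the same building identifiers
year_built = {"B100": 1987, "B200": 2004}
owners = {"B100": "Acme Ltd", "B300": "Jo Bloggs"}

linked = link(year_built, owners)
print(linked)

# A dataset without identifiers, updated using a hypothetical address lookup
inspections = [{"address": "12 High Street", "passed": True}]
lookup = {"12 High Street": "B100"}
updated = add_identifiers(inspections, lookup)
print(updated)
```

With shared identifiers the join is a plain key lookup, with no guessing about which building is which.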
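One simple way to picture this, offered as an illustration rather than a formal model of value: if n datasets all use the same identifiers, any two of them can be joined directly, so the number of possible pairings grows much faster than n itself. The function name below is hypothetical.

```python
from math import comb


def linkable_pairs(n_datasets: int) -> int:
    """Number of distinct pairs of datasets that can be joined directly."""
    return comb(n_datasets, 2)


for n in (2, 5, 10):
    print(n, "datasets ->", linkable_pairs(n), "possible pairings")
```

Going from 5 to 10 datasets doubles the count of datasets but more than quadruples the pairings (10 versus 45).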
Unfortunately, reusing identifiers is often hard to do. The organisations that create the identifiers don't make them easy to reuse. For example, they often don't provide a way to find out which identifier applies to which building. And the identifiers are often not available under an open licence.
When existing identifiers are hard to reuse, or are only available under closed licences, then they add costs to everyone who could have benefited from them. Costs might be in licensing fees or in time & effort required to create and maintain unnecessary competing identifiers.
Arguments around opening identifiers often focus on the costs incurred by the organisations that create and maintain them. But this overlooks the costs incurred by EVERYONE else when those identifiers aren't open.
Identifiers are infrastructure. They provide the scaffolding around which we build our data & our digital systems. Identifiers have most utility when they are open. This @ODIHQ & Thomson Reuters paper has more details: innovation.thomsonreuters.com/en/labs/data-i…
For more detail on the specific problem of property identifiers in the UK, read @owenboswarva's excellent post here: owenboswarva.com/blog/post-addr…
I've tried to illustrate how identifiers can be used to link together datasets in this short video. I don't know what I'm doing with animations, but there it is!
