I am endlessly fascinated with content tagging systems. They're ubiquitous in software and have so many nuances, but I can't find anything on how to design and implement anything more than the bare-bones basics of a system.

Some thoughts in a thread.
A tag is a metadata label associated with content, primarily used for querying and grouping. The tag name is also the id: two tags with the same name are the same tag.

Tags appear everywhere: #hashtags, Wikipedia categories, blog post labels, AWS infra tags...
Now, are `horse` and `horses` the same tag? They're different strings, but I'd be pretty miffed if I queried for `horse` and got only half the data.

So for serious querying we need some kind of relationship between tags.
The simplest relationship is tag aliases: if A is aliased to B, then querying A is identical to querying B. The only system I know that does that is the fanfiction site AO3, where teams of volunteers manually create aliases from, say, "snarry" to "Harry/Snape".
Any additional structure, though, raises questions about both design and use. Should users be able to intentionally query a specific tag alias? There's no correct design choice here; it depends on your users. AO3 went with "no."
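To make the alias idea concrete, here is a minimal in-memory sketch. The class `AliasedTagIndex` and its method names are hypothetical, invented for illustration; this is not how AO3 or any real system implements aliases. It assumes every alias resolves to exactly one canonical tag and that querying an alias always returns the canonical tag's items (the "no" answer to the question above).

```python
from collections import defaultdict


class AliasedTagIndex:
    """Toy tag index where every alias resolves to one canonical tag."""

    def __init__(self):
        self._canonical = {}            # alias name -> canonical name
        self._items = defaultdict(set)  # canonical name -> tagged item ids

    def resolve(self, name):
        # A name with no alias entry is its own canonical tag.
        return self._canonical.get(name, name)

    def add_alias(self, alias, canonical):
        target = self.resolve(canonical)
        source = self.resolve(alias)
        if source == target:
            return  # already the same tag
        # Repoint every name that used to resolve to `source`, so alias
        # chains stay flat and resolve() remains a single lookup.
        for name, canon in list(self._canonical.items()):
            if canon == source:
                self._canonical[name] = target
        self._canonical[source] = target
        # Merge items that were tagged before the alias existed.
        self._items[target] |= self._items.pop(source, set())

    def tag(self, item, name):
        self._items[self.resolve(name)].add(item)

    def query(self, name):
        return set(self._items.get(self.resolve(name), set()))


index = AliasedTagIndex()
index.tag("photo-1", "horse")
index.tag("photo-2", "horses")
index.add_alias("horses", "horse")
print(index.query("horses"))  # same result as index.query("horse")
```

Because items are stored only under the canonical name, this design throws away which alias was originally used. Supporting "query this specific alias" would mean storing the literal tag alongside the canonical one.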

A harder problem: tag hierarchies, or "subtags".
With subtags, we can query "science" and get everything tagged "physics", "quantum physics", and "quantum computing".

Usage questions:

* Can things be tagged with root tags or just leaf tags?
* How do users search just X, not its subtags?

There are also implementation considerations. First, transitive queries are expensive, so how do you optimize them? Second, how do you prevent cycles in the tag hierarchy, where A and B are both transitive subtags of each other?
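A rough sketch of a single-parent tag tree, under some assumptions: `TagTree` and its methods are made-up names, any tag (root or leaf) can hold items, and "search just X" is exposed as an `include_subtags` flag. The query is the naive recursive walk, so it illustrates the cost problem rather than solving it. The cycle check walks up from the proposed parent before linking.

```python
from collections import defaultdict


class TagTree:
    """Toy tag hierarchy where each tag has at most one parent."""

    def __init__(self):
        self.parent = {}                  # tag -> its single parent
        self.children = defaultdict(set)  # tag -> direct subtags
        self.items = defaultdict(set)     # tag -> items tagged directly

    def ancestors(self, tag):
        # Walk upward until we reach a tag with no parent.
        while tag in self.parent:
            tag = self.parent[tag]
            yield tag

    def add_subtag(self, child, parent):
        # Linking child under parent creates a cycle exactly when child
        # is the parent itself or already one of its ancestors.
        if child == parent or child in self.ancestors(parent):
            raise ValueError(f"{child!r} under {parent!r} would create a cycle")
        if child in self.parent:
            self.children[self.parent[child]].discard(child)
        self.parent[child] = parent
        self.children[parent].add(child)

    def tag(self, item, name):
        self.items[name].add(item)

    def query(self, name, include_subtags=True):
        result = set(self.items.get(name, ()))
        if include_subtags:
            for child in self.children.get(name, ()):
                result |= self.query(child)
        return result


tree = TagTree()
tree.add_subtag("physics", "science")
tree.add_subtag("quantum physics", "physics")
tree.tag("article-1", "science")
tree.tag("article-2", "quantum physics")
print(tree.query("science"))                         # both articles
print(tree.query("science", include_subtags=False))  # only article-1
```

One common way to speed up the transitive query is to precompute each tag's full set of descendants when the hierarchy changes, trading write cost for read cost. Whether that trade is worth it depends on how often the hierarchy is edited.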
It gets even more complex if tags can have multiple parents, like Wikipedia categories. "American Male Novelists" is a subtag of both "American Male Writers" and "American Novelists". Now we have diamond problems, redundancy, and a whole host of other edge cases.
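Here is a small sketch of the diamond case, assuming the hierarchy is stored as a plain dict from tag to direct subtags. The root "American Writers" is a hypothetical tag added only to close the diamond, and the function names and item ids are invented. The `seen` set keeps a tag that is reachable along two paths from being visited twice, and set union keeps its items from being counted twice.

```python
def descendants(children, root):
    """Return root plus every tag reachable from it, each exactly once."""
    seen = {root}
    stack = [root]
    while stack:
        tag = stack.pop()
        for child in children.get(tag, ()):
            if child not in seen:  # skip tags already reached via another parent
                seen.add(child)
                stack.append(child)
    return seen


def query_dag(children, items, root):
    """Union of items tagged with root or any of its transitive subtags."""
    result = set()
    for tag in descendants(children, root):
        result |= items.get(tag, set())
    return result


# "American Writers" is an assumed common root, not from the thread.
children = {
    "American Writers": {"American Male Writers", "American Novelists"},
    "American Male Writers": {"American Male Novelists"},
    "American Novelists": {"American Male Novelists"},
}
items = {
    "American Male Novelists": {"author-1"},
    "American Novelists": {"author-2"},
}
print(query_dag(children, items, "American Writers"))
```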
Notice that the more richness in structure we add, the more ambiguous our queries become. "Every article that shares a tag with article X that's not also in article Y" is unambiguous with simple tags, less so with a tag tree, even less so with a tag directed acyclic graph.
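With flat tags, that query is a couple of set operations. The sketch below assumes tags are stored as a dict from article id to a set of tag names. The function name is made up, and leaving X itself out of the result is my own choice.

```python
def shares_tag_with_x_not_in_y(tags_by_article, x, y):
    """Articles sharing at least one tag that X has and Y does not."""
    wanted = tags_by_article[x] - tags_by_article[y]
    return {
        article
        for article, tags in tags_by_article.items()
        if article != x and tags & wanted  # non-empty overlap with wanted
    }


tags_by_article = {
    "X": {"python", "testing"},
    "Y": {"testing"},
    "P": {"python"},
    "T": {"testing"},
    "Q": {"python", "testing"},
}
print(shares_tag_with_x_not_in_y(tags_by_article, "X", "Y"))
```

With a hierarchy, the set difference stops being obvious: if X is tagged "physics" and Y is tagged "science", it is unclear whether "physics" counts as a tag that is "not also in Y".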
You also see "smart" tags, which are based on a predicate. One mobile device manager (MDM) I worked with allowed for things like "every iPad assigned to a classroom in Ferndale High School is considered to be tagged FHS".

Advice: don't let the tag predicates refer to other tags.
At one point we accidentally added two smart tags:

A: "Anything tagged B"
B: "Anything not tagged A"

And then the MDM crashed.

We also saw major performance issues, where every content change forced all the smart tags to recompute, which was incredibly expensive.
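If you do allow smart tags to refer to other tags, one guard is to reject any definition that would make a tag depend on itself, directly or transitively. The sketch below is hypothetical and not based on any real MDM: `SmartTagRegistry` and `refers_to` are invented names, and it assumes each smart tag declares up front which tags its predicate reads.

```python
class SmartTagRegistry:
    """Tracks which tags each smart tag's predicate reads; rejects cycles."""

    def __init__(self):
        self.depends_on = {}  # smart tag -> set of tags its predicate reads

    def _reaches(self, start, target):
        # Depth-first search along dependency edges from start.
        stack, seen = [start], set()
        while stack:
            tag = stack.pop()
            if tag == target:
                return True
            if tag in seen:
                continue
            seen.add(tag)
            stack.extend(self.depends_on.get(tag, ()))
        return False

    def add(self, name, refers_to=()):
        for dep in refers_to:
            # If dep already depends on name, adding name -> dep closes a loop.
            if self._reaches(dep, name):
                raise ValueError(f"smart tag {name!r} would depend on itself via {dep!r}")
        self.depends_on[name] = set(refers_to)


registry = SmartTagRegistry()
registry.add("A", refers_to={"B"})      # "Anything tagged B"
try:
    registry.add("B", refers_to={"A"})  # "Anything not tagged A"
except ValueError as err:
    print("rejected:", err)
```

The stricter policy from the advice above is simpler still: raise whenever `refers_to` is non-empty. The same dependency map could also help with the recompute cost, by re-evaluating only the smart tags whose inputs changed, though that is not shown here.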
One last type of tag structure: "key-value" tags, i.e. `priority: 1`. Then searching for `priority` will get you all content with a `priority` tag, while `priority: 1` gets you that specific subset. Lots of project management tooling has key-value tags, as does AWS.
Hypothetically you could have richer queries with key-value tags, like `start-date: BEFORE 2022-03-01`, but I haven't seen that available in any systems. You either search no values or a specific enumerated set.
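A sketch of what those three query styles could look like, assuming each item carries a dict of key-value tags. `query_kv` and the sample ticket data are invented for illustration. The optional predicate is what would make something like `start-date: BEFORE 2022-03-01` possible.

```python
from datetime import date


def query_kv(items, key, predicate=None):
    """Items that have `key`; if a predicate is given, its value must pass it."""
    return {
        item
        for item, tags in items.items()
        if key in tags and (predicate is None or predicate(tags[key]))
    }


# Sample data is made up.
tickets = {
    "ticket-1": {"priority": 1, "start-date": date(2022, 2, 1)},
    "ticket-2": {"priority": 2, "start-date": date(2022, 4, 1)},
    "ticket-3": {"start-date": date(2022, 2, 15)},
}

has_priority = query_kv(tickets, "priority")                    # key only
priority_one = query_kv(tickets, "priority", lambda v: v == 1)  # exact value
before_march = query_kv(tickets, "start-date", lambda d: d < date(2022, 3, 1))
print(has_priority, priority_one, before_march)
```

Range predicates only make sense if values are typed. Here the dates are stored as `date` objects rather than strings, which is an extra constraint that plain string tags don't have.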
Another interesting thing about tag systems: who creates the tags, and who are they for? In most systems all users can create arbitrary tags. For distributed platforms, like the semantic web and social media, this encourages spammers to add lots of unrelated tags.
But even good actors go heavy on the tags, because nobody's curating any of the tags and you have no idea which ones people use. I call this the Instagram Problem.

#tag #tags #tagging #label #labels #tagsystems #taxonomies #folksonomy #hashtags #taggerlyfe #taggerlifestyle
I should note that software engineers aren't the first people to deal with these kinds of questions, which are ubiquitous in library and information sciences. I imagine they have a lot of really good theory and case studies on tagging systems, but I haven't been able to find it.

