[Read] Microsoft Academic Graph: When experts are not enough - mitpressjournals.org/doi/full/10.11… - A great read, answers a lot of things I was wondering about. Also a lot less technical than the earlier paper on MAG frontiersin.org/articles/10.33…
Firstly, there is a lot of skepticism about DOIs and ORCIDs: "URIs are not designed with human factors in mind, such that their low usability and common typographical errors in DOIs and ORCIDs have frequently been a source of frustration and an adoption barrier"
Microsoft Research argues that the DOIs for JMLR are now defunct(?) and that "not having DOIs for these pubs does not hamper their accessibility".
Interesting aside on the dangers of biases, particularly "those that originated from human-generated exemplars for machine learning" - gives the example of Dimensions using human experts to guide the learning of fields of research.
Interesting how they treat patents as publications & are also looking to adapt the model to "treat data sets and software packages as publications and, indeed, some of the data sets are already included as stand-alone publications in MAG"
The section on "Article-centric versus venue-centric principle" talks about DORA & "embraces this spirit and includes all articles from the web that are deemed as scholarly by a machine-learning-based classifier". But what about "predatory" journals?
Indeed "articles published in obscure venues will be found in MAG, including journals that some considered “predatory” in nature." but there are a few checks.... (1/2)
First there's the salience metric they calculate (fully described in another technical paper), but to save effort they also do PCA on the citation graph & "only the nodes corresponding to the largest component are selected". This usually works because....
" questionable content typically manifests as local clusters in the citation graph and attacks with the means to infiltrate the entire scholarly communications globally have yet to be encountered." - in other words predatory stuff usually isn't cited much?
But using PCA & selecting the biggest component has a drawback: it biases against non-English pubs, since they cite English pubs but not vice versa! The same goes for student term papers & seminar reports, which end up filtered out by MAS but not by GS.
In any case this leads to a "feature in MAG to report two sets of “citation counts” for each publication, corresponding to citation sources that are included in or excluded from MAG, respectively."
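(Not from the paper - just how I picture the filter plus the dual counts. A minimal sketch assuming "largest component" means the largest weakly connected component of the citation graph; networkx, the function name and the toy data are mine.)

```python
import networkx as nx

def split_citation_counts(edges):
    """edges: iterable of (citing_id, cited_id) pairs."""
    g = nx.DiGraph(edges)
    # Keep only the largest (weakly) connected component; small isolated
    # clusters - e.g. a ring of "questionable" papers citing each other -
    # fall outside it.
    main = max(nx.weakly_connected_components(g), key=len)
    counts = {}
    for citing, cited in g.edges():
        pair = counts.setdefault(cited, [0, 0])   # [included, excluded]
        pair[0 if citing in main else 1] += 1
    return counts

# Toy usage: "x" and "y" only cite each other, so they sit outside the main
# component and their citations land in the "excluded" bucket.
print(split_citation_counts([("a", "b"), ("b", "c"), ("c", "a"),
                             ("x", "y"), ("y", "x")]))
```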
"Observing that name-key-based approaches, such as used in GS, tend to overconflate authors and thus assign more publication records to an author, MAS deliberately takes the opposite direction & decides to err on the conservative side" also mentions "complementing" such systems
Here's the shocker: this conservatism is "extended to the treatment of crowdsourced data for author disambiguation that are collected from the profile management feature in the Microsoft Academic website"! Yes, even your human suggestions can be rejected.
Your inputs are treated as ML "supervision signals to recompute the decision boundaries but not to change the confidence threshold", and as such ".. crowdsourced change requests can still be rejected by the learning algorithm and not reflected in MAG."
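(Again, not MAG's actual pipeline - just a hedged sketch of the idea. sklearn, the feature layout and every name here are my own illustration: crowdsourced "merge these two author records" requests become extra training labels, but the merge only happens if the retrained model clears a fixed confidence bar.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

CONFIDENCE_THRESHOLD = 0.9  # fixed in advance; not lowered to honour a request

def retrain_and_decide(X_train, y_train, crowd_X, crowd_y, pair_features):
    # Crowdsourced change requests only add supervision signals
    # to the training data...
    X = np.vstack([X_train, crowd_X])
    y = np.concatenate([y_train, crowd_y])
    model = LogisticRegression().fit(X, y)
    # ...they move the decision boundary, but the confidence bar stays put,
    # so the request can still come back as "no merge".
    p_same_author = model.predict_proba([pair_features])[0, 1]
    return p_same_author >= CONFIDENCE_THRESHOLD
```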
MAG handles institutions as a secondary attribute on the "is authored by" relationship, so an affiliation can be declared per publication. This allows you to do analysis either by author (across their whole career) or only for the period after they joined an institution. Nice.
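(A minimal data-model sketch of why that works - the field and function names are mine, not MAG's schema: the affiliation hangs off the paper-author record rather than off the author entity.)

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Authorship:                  # one "is authored by" record
    paper_id: str
    author_id: str
    year: int
    affiliation_id: Optional[str]  # affiliation declared for this paper only

def career_papers(records: List[Authorship], author_id: str):
    """Whole-career view: every paper by the author, any affiliation."""
    return [r for r in records if r.author_id == author_id]

def papers_at(records: List[Authorship], author_id: str, affiliation_id: str):
    """Only what the author published while at that institution."""
    return [r for r in records
            if r.author_id == author_id and r.affiliation_id == affiliation_id]
```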
There's also a lot of detail on how MAG handles institutions, which I won't summarise; suffice to say it is very interesting in view of the earlier paper I just read comparing WOS and Scopus affiliation IDs & the difficulties of handling different granularities.
The section on publication "venues" in MAG, which is what we usually call journals, makes a bold statement: "publication venues are no longer just those neatly defined by the publishers but wherever the provenance of a publication resides"
Then they talk about MAG "not appearing to respect ISBNs" (they mean ISSNs?), "because the nuances that publishers adopt in assigning multiple ISBNs to a single journal are hard to process consistently." - well yeah...
"Using the publication-centric approach, MAG will recognize only the first publication event and treat the subsequent one as a reprint." - I wonder if this is leads to what some have observed that MAG tends to state a much earlier pub year than in traditional
Another safeguard: "MAG does not immediately report a newly discovered pub venue as an entity as soon as it is recognized by the algorithm, until the saliencies of its publications have jointly exceeded a manually chosen threshold." One of the few areas where there is human intervention!
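(How I read that safeguard, in two lines of pseudcode-ish Python - the threshold value is purely hypothetical; only the fact that humans pick it comes from the paper.)

```python
VENUE_SALIENCY_THRESHOLD = 50.0   # hypothetical value; the real one is chosen by humans

def should_list_venue(publication_saliencies):
    # The venue only becomes a MAG entity once its publications' saliencies
    # jointly exceed the manually chosen threshold.
    return sum(publication_saliencies) >= VENUE_SALIENCY_THRESHOLD

print(should_list_venue([0.4, 1.2, 2.5]))      # False -> venue stays unlisted
print(should_list_venue([28.0, 19.5, 7.3]))    # True  -> venue gets an entity
```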
The section on how saliencies work is a simplified explanation of the earlier technical paper. It speculates that just as PageRank can fight link spam, saliency might be resistant to citation cartels, and might even handle coercive citations, since saliency takes into account citation context.
Finally, the discussion section mentions that while MAG is often compared to GS, as both are free and use the same method of extracting data from the web, "the analogy is off in many respects." (1/2)
Firstly, unlike Google, "Microsoft does not rely on building the most detailed profile for each person for targeted services" and does "not use the browse and click information at all in its machine learning components."
Microsoft's business model is as a "data platform that promotes and adds value to Microsoft’s cloud business, and the website is a demonstration of the unique values we would like to showcase on the platform." - This also explains why they don't offer "library links" like in GS.
Instead, their platform approach allows libraries to build on & merge MAG data with their own holdings - more technical work, but fewer privacy challenges compared to Library Links in GS. The platform approach also explains why they are working with Lens.org and Semantic Scholar.
All in all, a very interesting paper setting out how the Microsoft Research approach is different from the rest, without getting crazy technical. I need to sleep, but I'm looking forward to seeing what the other papers from WOS, Scopus, Dimensions, Crossref & OpenCitations are like.