𝚓𝚘𝚗𝚗𝚢﹏𝚜𝚊𝚞𝚗𝚍𝚎𝚛𝚜

Jan 25 • 15 tweets • 6 min read

More fun publisher surveillance:
Elsevier embeds a hash in the PDF metadata that is *unique for each time a PDF is downloaded*. This is a diff between the metadata of two downloads of the same paper. Combined with access timestamps, the hash lets them uniquely identify the source of any shared PDF.
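
A minimal sketch of that kind of comparison, assuming the metadata of each download has already been read into a dictionary. The field names and values below are hypothetical placeholders, not real Elsevier tags.

```python
def diff_metadata(first, second):
    """Return {field: (value_in_first, value_in_second)} for fields that differ."""
    changed = {}
    for key in sorted(set(first) | set(second)):
        a, b = first.get(key), second.get(key)
        if a != b:
            changed[key] = (a, b)
    return changed


# Hypothetical metadata from two downloads of the same paper (illustration only).
download_a = {"Title": "Example paper", "Producer": "ExampleProducer", "ExampleTag": "aaa111"}
download_b = {"Title": "Example paper", "Producer": "ExampleProducer", "ExampleTag": "bbb222"}

for field, (a, b) in diff_metadata(download_a, download_b).items():
    print(f"{field}: {a!r} != {b!r}")
```

One way to get such dictionaries is exiftool's JSON output (`exiftool -j <path.pdf>`), which prints a list with one object per file. Any field that differs between two downloads of the same paper is a candidate identifier.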

You can see for yourself using exiftool.
To remove all of the top-level metadata, you can use exiftool and qpdf:

exiftool -all:all= <path.pdf> -o <output1.pdf>
qpdf --linearize <output1.pdf> <output2.pdf>
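
The same two steps can be driven from Python. This is only a sketch: the function names and the intermediate-file argument are assumptions for illustration, and the flags are exactly the ones in the commands above.

```python
import shutil
import subprocess
from pathlib import Path


def build_strip_commands(src, intermediate, dest):
    """Return the two command lines from the thread as argument lists."""
    return [
        # Step 1: write a copy with the top-level metadata blanked.
        ["exiftool", "-all:all=", str(src), "-o", str(intermediate)],
        # Step 2: rewrite the file with qpdf. exiftool's PDF edits are
        # documented as reversible, so the rewrite is what drops the old data.
        ["qpdf", "--linearize", str(intermediate), str(dest)],
    ]


def strip_metadata(src, intermediate, dest):
    """Run both steps; raises if a tool is missing or a command fails."""
    for tool in ("exiftool", "qpdf"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} is not installed or not on PATH")
    for cmd in build_strip_commands(Path(src), Path(intermediate), Path(dest)):
        subprocess.run(cmd, check=True)
```

Passing argument lists to `subprocess.run` rather than a shell string avoids quoting problems with file names that contain spaces.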

To remove *all* metadata, you can use dangerzone or mat2

Also present in the metadata are NISO tags for document status indicating the "final published version" (VoR), and limits on which domains it should be present on. Elsevier scans for PDFs with this metadata, so it's a good idea to strip it any time you're sharing a copy.
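
To check a file before sharing it, a rough sketch like the following looks for marker strings inside any XMP packet in the raw bytes. Assumptions: the XMP packet is stored uncompressed (compressed or encrypted streams would be missed), and the only default marker is the "VoR" value mentioned above. The exact tag names are not given in the thread, so pass your own markers once you know them.

```python
import re

# XMP packets are wrapped in <?xpacket begin=...?> ... <?xpacket end=...?>
XMP_PACKET = re.compile(rb"<\?xpacket begin=.*?<\?xpacket end=.*?\?>", re.DOTALL)

# Only "VoR" comes from the thread; anything else is for the caller to supply.
DEFAULT_MARKERS = (b"VoR",)


def find_markers(pdf_bytes, markers=DEFAULT_MARKERS):
    """Return the sorted markers found inside uncompressed XMP packets."""
    found = set()
    for packet in XMP_PACKET.findall(pdf_bytes):
        for marker in markers:
            if marker in packet:
                found.add(marker.decode("ascii"))
    return sorted(found)


# Synthetic bytes for illustration, not a real PDF; "example:version" is made up.
sample = (
    b'%PDF-1.7\n<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    b"<x:xmpmeta><example:version>VoR</example:version></x:xmpmeta>"
    b'<?xpacket end="w"?>\n%%EOF'
)
print(find_markers(sample))
```

An empty result only means nothing was found in plain-text XMP. It is not proof the file is clean, so stripping as described above is still the safer default.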

Links:
exiftool: exiftool.org
qpdf: qpdf.sourceforge.io
dangerzone (GUI, render PDF as images, then re-OCR everything): dangerzone.rocks
mat2 (render PDF as images, don't OCR): 0xacab.org/jvoisin/mat2

here's a shell script that recursively removes metadata from pdfs in a provided (or the current) directory, as described above. it's for mac/*nix-like computers, and you need to have qpdf and exiftool installed:
gist.github.com/sneakers-the-r…
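
The gist itself is not reproduced here. As a rough illustration of the recursive part only, here is a sketch that finds every PDF under a directory and plans an output name for each. The `_clean` suffix and the function names are assumptions, not what the gist does.

```python
from pathlib import Path


def find_pdfs(root="."):
    """Recursively list PDF files under root (case-insensitive extension)."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def cleaned_path(pdf):
    """Hypothetical naming scheme: paper.pdf -> paper_clean.pdf, same folder."""
    return pdf.with_name(pdf.stem + "_clean.pdf")


def dry_run(root="."):
    """Return 'source -> destination' lines without touching any file."""
    return [f"{pdf} -> {cleaned_path(pdf)}" for pdf in find_pdfs(root)]
```

Calling `dry_run()` with no argument uses the current directory, matching the "provided (or current) directory" behaviour described above. Each pair could then be passed through the exiftool and qpdf steps.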

The metadata appears to be preserved on papers from sci-hub. since sci-hub works by using harvested academic credentials to download papers, this would allow publishers to identify which accounts need to be closed/secured

https://twitter.com/json_dirs/status/1486135162505072641?t=Wg5XAzujycz79Cop_ap8vQ&s=19

for any security researchers out there, here are a few more "hashes" that a few have noted do not appear to be random and might be decodable. exiftool apparently squashed the whitespace so there is a bit more structure to them than in the OP:
gist.github.com/sneakers-the-r…

https://twitter.com/kmagnacca/status/1486209676979032064?t=GT8fV5QG-4SGTkLadYpCNQ&s=19

https://twitter.com/SchmiegSophie/status/1486206774159970305?t=GT8fV5QG-4SGTkLadYpCNQ&s=19

https://twitter.com/horsemankukka/status/1486268962119761924?s=20

this is the way to get the correct tags:
(on mac i needed to install gnu grep with homebrew `brew install grep` and then use `ggrep` )
will follow up with dataset tomorrow.

of course there's smarter watermarking, the metadata is notable because you could scan billions of pdfs fast. this comment on HN got me thinking about this PDF /OpenAction I couldn't make sense of earlier, on open, access metadata, so something with sizes and layout...

updated the above gist with correctly extracted tags, and included python code to extract your own, feel free to add them in the comments. since we don't know what they contain yet, i'm not adding other metadata. definitely patterned, not a hash, but idk yet.

https://twitter.com/json_dirs/status/1486289288115359747?t=QwmBvbOgh2fCkjSOZSh3Fw&s=19

you go to school to study "the brain" and then the next thing you know you're learning how to debug surveillance in PDF rendering to understand how publishers have so contorted the practice of science for profit. how can there be "normal science" when this is normal?

follow-up: there does not appear to be any further watermarking: taking two files with different identifying tags, stripping metadata, and relinearizing with qpdf's --deterministic-id flag yields PDFs that are identical under diff, i.e. no differentiating watermark (but plz check my work)

which is surprising to me, so I'm a little hesitant to make that as a general claim
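
For anyone who wants to check this, a small sketch of the comparison step, assuming the two files have already been stripped and relinearized as described. The function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large PDFs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_difference(path_a, path_b):
    """Return the byte offset of the first difference, or None if identical."""
    a = Path(path_a).read_bytes()
    b = Path(path_b).read_bytes()
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        # One file is a strict prefix of the other.
        return min(len(a), len(b))
    return None
```

If `first_difference` returns None for two stripped copies that started with different tags, that supports the observation above for that pair of files. It says nothing about other papers or later changes by the publisher.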


