More fun publisher surveillance:
Elsevier embeds a hash in the PDF metadata that is *unique for each time a PDF is downloaded*. The image is a diff between the metadata from two downloads of the same paper. Combined with access timestamps, they can uniquely identify the source of any shared PDF. [Image: a list of metadata for a PDF, the important field being two…]
You can see for yourself using exiftool.
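A minimal Python sketch of that comparison, assuming exiftool is on your PATH and that you have two separately downloaded copies of the same paper. The function names and the `ignore` list are illustrative choices, not anything from the thread; the ignored tags are file-system fields exiftool reports that will differ between any two files.

```python
import json
import subprocess


def read_metadata(pdf_path):
    """Return exiftool's metadata for one file as a dict (needs exiftool installed)."""
    result = subprocess.run(
        ["exiftool", "-json", str(pdf_path)],
        capture_output=True, text=True, check=True,
    )
    # exiftool -json prints a JSON list with one object per input file
    return json.loads(result.stdout)[0]


# Assumption: these file-system tags differ for any two files, so skip them
FILESYSTEM_TAGS = (
    "SourceFile", "FileName", "Directory",
    "FileModifyDate", "FileAccessDate", "FileInodeChangeDate",
)


def differing_fields(meta_a, meta_b, ignore=FILESYSTEM_TAGS):
    """Return {field: (value_a, value_b)} for every field whose value differs."""
    diff = {}
    for key in sorted(set(meta_a) | set(meta_b)):
        if key in ignore:
            continue
        if meta_a.get(key) != meta_b.get(key):
            diff[key] = (meta_a.get(key), meta_b.get(key))
    return diff
```

With two hypothetical files, `differing_fields(read_metadata("download1.pdf"), read_metadata("download2.pdf"))` would list whatever fields changed between the downloads.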
To remove all of the top-level metadata, you can use exiftool and qpdf:

exiftool -all:all= <path.pdf> -o <output1.pdf>
qpdf --linearize <output1.pdf> <output2.pdf>

To remove *all* metadata, you can use dangerzone or mat2
Also present in the metadata are NISO tags for document status indicating the "final published version" (VoR), and limits on what domains it should be present on. Elsevier scans for PDFs with this metadata, so it's a good idea to strip it any time you're sharing a copy. [Image: Internet takedown programs. Elsevier partners with a technol…]
Links:
exiftool: exiftool.org
qpdf: qpdf.sourceforge.io
dangerzone (GUI, render PDF as images, then re-OCR everything): dangerzone.rocks
mat2 (render PDF as images, don't OCR): 0xacab.org/jvoisin/mat2
here's a shell script that recursively removes metadata from PDFs in a provided (or current) directory, as described above. It's for mac/*nix-like computers, and you need to have qpdf and exiftool installed:
gist.github.com/sneakers-the-r… [Image: screenshot of code at URL in tweet, the script first uses &…]
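For anyone who would rather not use a shell script, here is a rough Python sketch of the same idea. It is not the gist's script: it only wraps the two commands quoted earlier in the thread, and the function names and the `_stripped` suffix are made up for illustration. It assumes exiftool and qpdf are installed.

```python
import subprocess
import tempfile
from pathlib import Path


def strip_commands(src, tmp, dst):
    """Build the two commands from the thread: exiftool strip, then qpdf relinearize."""
    return [
        ["exiftool", "-all:all=", str(src), "-o", str(tmp)],
        ["qpdf", "--linearize", str(tmp), str(dst)],
    ]


def find_pdfs(root="."):
    """Recursively list PDF files under root (case-insensitive extension)."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def strip_tree(root="."):
    """Write a stripped copy next to each PDF; originals are left untouched."""
    for pdf in find_pdfs(root):
        if pdf.stem.endswith("_stripped"):
            continue  # skip outputs from an earlier run
        dst = pdf.with_name(pdf.stem + "_stripped.pdf")
        with tempfile.TemporaryDirectory() as tmpdir:
            tmp = Path(tmpdir) / "step1.pdf"
            for cmd in strip_commands(pdf, tmp, dst):
                # check=True raises on any non-zero exit; qpdf also exits
                # non-zero when it only emits warnings
                subprocess.run(cmd, check=True)
```

Calling `strip_tree()` with no argument works on the current directory, matching the "provided (or current) directory" behaviour described above.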
The metadata appears to be preserved on papers from Sci-Hub. Since Sci-Hub works by using harvested academic credentials to download papers, this would allow publishers to identify which accounts need to be closed/secured.
for any security researchers out there, here are a few more "hashes" that a few have noted do not appear to be random and might be decodable. exiftool apparently squashed the whitespace so there is a bit more structure to them than in the OP:
gist.github.com/sneakers-the-r…
this is the way to get the correct tags:
(on mac i needed to install gnu grep with homebrew `brew install grep` and then use `ggrep` )
will follow up with dataset tomorrow.
of course there's smarter watermarking; the metadata is notable because you could scan billions of PDFs fast. this comment on HN got me thinking about this PDF /OpenAction I couldn't make sense of earlier: on open, access metadata, so something with sizes and layout...
[Image: top comment on HN thread: "So just take pics of the pages a…"]
[Image: "I don't really know what I'm looking at so I can't really de…"]
updated the above gist with correctly extracted tags, and included python code to extract your own; feel free to add them in the comments. since we don't know what they contain yet, I'm not adding other metadata. definitely patterned, not a hash, but idk yet.
you go to school to study "the brain" and then the next thing you know you're learning how to debug surveillance in PDF rendering to understand how publishers have so contorted the practice of science for profit. how can there be "normal science" when this is normal?
follow-up: there does not appear to be any further watermarking. taking two files with different identifying tags, stripping metadata, and relinearizing with qpdf's --deterministic-id flag yields PDFs that are identical under diff, i.e. no differentiating watermark (but plz check my work)
which is surprising to me, so I'm a little hesitant to make that as a general claim
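If you want to check this yourself, here is a small Python sketch of the comparison step. It assumes you have already stripped the metadata from both copies; the helper names are hypothetical, and the qpdf command just combines the two flags mentioned in the thread.

```python
import hashlib


def normalize_command(src, dst):
    """qpdf call described above: relinearize with a deterministic ID."""
    return ["qpdf", "--linearize", "--deterministic-id", str(src), str(dst)]


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large PDFs don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b):
    """True if the two files are byte-for-byte identical (compared by SHA-256)."""
    return sha256_of(path_a) == sha256_of(path_b)
```

Run the `normalize_command(...)` output on each stripped copy (for example with `subprocess.run`), then call `same_bytes` on the two results. A `True` only says these two files match; it does not rule out watermarking in other papers or from other publishers.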


Thread by 𝚓𝚘𝚗𝚗𝚢﹏𝚜𝚊𝚞𝚗𝚍𝚎𝚛𝚜 (@json_dirs)

