this post was submitted on 20 Jun 2024

937 points (98.9% liked)


Elsevier (mander.xyz)

submitted 7 months ago by [email protected] to c/[email protected]

141 comments


[–] Passerby6497 77 points 7 months ago (4 children)

That's where you print the downloaded PDF to a new PDF. New hash and same content, good luck tracing it back to me fucko.

[–] [email protected] 64 points 7 months ago* (last edited 7 months ago) (2 children)

Unfortunately that wouldn't work as this is information inside the PDF itself so it has nothing to do with the file hash (although that is one way to track.)
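To make the hash-vs-content distinction concrete, here's a rough sketch. The `/Licensee` field and the byte strings are made up for illustration (not what any publisher actually embeds), and it doesn't model what a real print-to-PDF does. It only shows that a changed hash says nothing about whether an identifier inside the file survived.

```python
import hashlib

# Hypothetical identifier stored inside the file body (made up for this sketch).
MARKER = b"/Licensee (user-1234)"


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as hex."""
    return hashlib.sha256(data).hexdigest()


# A toy stand-in for a downloaded file, not a valid PDF.
original = b"%PDF-1.7\n" + MARKER + b"\n%%EOF\n"

# Simulate a re-save that alters the bytes but carries the content over.
resaved = original + b"% re-saved by some tool\n"

hash_changed = sha256_hex(original) != sha256_hex(resaved)
marker_survived = MARKER in resaved
print(hash_changed, marker_survived)
```

So a new hash only defeats tracking that relies on the hash itself.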

Now that this is known, it's not enough to remove metadata from the PDF itself. Each image inside a PDF, for example, can contain its own metadata. I say this because they're apparently starting a game of whack-a-mole, and it won't stop here.

There are multiple ways of removing ALL metadata from a PDF, here are most of them.

It will be slow-ish and probably make the file larger, but if you're sharing a PDF that only you are supposed to have access to, it's worth it. MAT or exiftool should work.

Edit: as spoken about in another comment thread here, there is also pdf/image steganography as a technique they can use.

[–] Passerby6497 8 points 7 months ago (2 children)

Wouldn't printing the PDF to a new PDF inherently strip the metadata put there by the publisher?

[–] sandbox 18 points 7 months ago (3 children)

it’s possible using steganographic techniques to embed digital watermarks which would not be stripped by simply printing to pdf.

[–] FinalRemix 21 points 7 months ago (1 children)

Got it. Print to a low quality JPG, then use AI upscaling to restore the text and graphs.

[–] [email protected] 12 points 7 months ago

You should spread that idea around more, it's pretty ingenious. I'd add first converting to B&W if possible.

[–] [email protected] 12 points 7 months ago* (last edited 7 months ago) (1 children)

This is a great point. Image watermarking steganography is nearly impossible to defeat unless you can obtain multiple copies of the 'same' file from multiple users to look for differences. It could be a change to as few as 5-15 pixels, each off by a single value in one RGB channel, for example:

rgb(255, 251, 0)

rgb(255, 252, 0)

Which would be imperceptible to the human eye. Depending on the number of users, it may need to change more or fewer pixels.
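A minimal sketch of that kind of one-value-off marking, assuming the simplest possible scheme: the lowest bit of the green channel at a handful of secret pixel positions. The positions, the user ID, and the function names are all hypothetical, and real watermarking schemes are generally far more robust than this.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def embed_id(pixels: List[Pixel], positions: List[int], user_id: int) -> List[Pixel]:
    """Write the bits of user_id into the lowest green bit at the chosen positions."""
    out = list(pixels)
    for bit_index, pos in enumerate(positions):
        bit = (user_id >> bit_index) & 1
        r, g, b = out[pos]
        out[pos] = (r, (g & ~1) | bit, b)  # green changes by at most 1
    return out


def read_id(pixels: List[Pixel], positions: List[int]) -> int:
    """Recover the ID by reading the lowest green bit at the same positions."""
    value = 0
    for bit_index, pos in enumerate(positions):
        value |= (pixels[pos][1] & 1) << bit_index
    return value


# A flat yellow 64-pixel "image" and 8 secret positions (both made up).
image = [(255, 252, 0)] * 64
positions = [3, 10, 17, 22, 31, 40, 47, 58]

marked = embed_id(image, positions, user_id=0b10110010)
recovered = read_id(marked, positions)
largest_change = max(abs(a[1] - b[1]) for a, b in zip(image, marked))
```

Eight pixels are enough to tell 256 users apart here, which is why more users means more changed pixels.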

There is a ton of work in this field and it's very interesting, for anyone considering majoring in computer science / information security.

Another 'neat' technology everyone should know about is machine identification codes, or, the tiny ~~secret~~ tracking dots that color printers print on every page to identify the specific make, model, and serial number (I think?) of the printer the page was printed from. I don't believe B&W printers have tracking dots, which were originally used to track creators of counterfeit currency. EFF has a page of color printers which do not include tracking dots on printed pages. This includes color LaserJets along with InkJets, although I would not be surprised if there was a similar tracking feature in place now or in the future "for safety and privacy reasons," but none that I am aware of.

[–] [email protected] 3 points 7 months ago* (last edited 7 months ago)

I wonder if it's common for those steganography techniques to have some mechanism for defeating the fairly simple strategy of getting 2 copies of the file from different sources, and looking at the differences between them to expose all the watermarks.

(I'd think you would need sections of watermark that are the same for any 2 or n combinations of copies of the data, which may be pretty easy to do in many cases, though the difference makes detecting the general watermarking strategy massively easier for the un-watermarkers)
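Here's a small sketch of why comparing two copies only gets you part of the way, assuming a naive scheme that writes each user's ID bits into the lowest bit of fixed pixel positions. The positions, IDs, and function names are hypothetical. Positions where both users' bits happen to agree look identical in both copies, so the diff never exposes them.

```python
from typing import List


def embed(pixels: List[int], positions: List[int], user_id: int) -> List[int]:
    """Store user_id bit by bit in the lowest bit of grayscale pixels."""
    out = list(pixels)
    for bit_index, pos in enumerate(positions):
        out[pos] = (out[pos] & ~1) | ((user_id >> bit_index) & 1)
    return out


def differing_positions(copy_a: List[int], copy_b: List[int]) -> List[int]:
    """Return the indices where two copies of the 'same' image disagree."""
    return [i for i, (a, b) in enumerate(zip(copy_a, copy_b)) if a != b]


base = [200] * 64                              # flat gray toy image
positions = [3, 10, 17, 22, 31, 40, 47, 58]    # secret to the watermarker

copy_a = embed(base, positions, 0b10110010)    # user A's download
copy_b = embed(base, positions, 0b10011010)    # user B's download

exposed = differing_positions(copy_a, copy_b)
still_hidden = [p for p in positions if p not in exposed]
print(exposed, len(still_hidden))
```

With these two IDs only two of the eight marked positions show up in the diff, so more copies are needed to find the rest.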

[–] [email protected] 2 points 7 months ago (1 children)

Which is why you steghide random data to the image to fuck up the other end =]

[–] [email protected] 2 points 7 months ago (1 children)

Unless you know specifically what they're adding or changing this wouldn't work. If they have a hidden 'barcode' and you add another hidden 'barcode' or modify the image in a way to remove some or all of theirs, they'd still be able to read theirs.

[–] [email protected] 4 points 7 months ago

yeah, you'd have to sample other downloads to collect statistics and unsteghide theirs to effectively ensure your fuzzing worked

[–] [email protected] 4 points 7 months ago* (last edited 7 months ago)

Good question. I believe the browser "Print to PDF" function simply saves the loaded PDF to a PDF file locally, so it wouldn't work (if I'm correct.)

I'm not an expert in this field, but you can ask on StackExchange or the author of MAT or exiftool. You can also do it yourself (I'll explain how) by making a PDF with a jpg file with your metadata, opening it and printing to pdf, and then extracting the image. Do let us know your findings! I'm on a smartphone so can't do it.

If you do try it yourself, a note from the linked SE page is that you won't be able to extract the original file extension (it's unknown, so you either have to know what it is, or look at the file headers, or try all extensions), so if you use your own .jpg with your own exif data, rename to .jpg when finished (I believe exif is handled differently based on file type.)

There are multiple tools to add exif data to an image but the exiftool website has some easy examples for our purpose.

(do this as the first step before adding to the PDF)

(command line here, but there are exiftool GUIs)

exiftool -artist="Phil Harvey" -copyright="2011 Phil Harvey" YourFile.jpg

Adds Phil Harvey and the copyright information to the file. If you're on a smartphone and have the time and really have to know, then hypothetically there should be web-based tools for every step needed. I'm just not familiar with any and it's possible the web-based tool would remove the metadata when creating or extracting the PDF.
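For the last step of that experiment (checking whether the extracted image still carries exif), here's a rough sketch that walks the JPEG segments by hand. It's simplified: it assumes a well-formed file and only looks for an APP1 segment starting with `Exif`, and the sample bytes below are fake stand-ins rather than real images. Running exiftool on the extracted file would be the more thorough check.

```python
import struct


def jpeg_has_exif(data: bytes) -> bool:
    """Walk JPEG header segments and report whether an Exif APP1 segment exists."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    offset = 2
    while offset + 4 <= len(data):
        if data[offset] != 0xFF:
            break  # lost sync with the segment structure, give up
        marker = data[offset + 1]
        if marker in (0xDA, 0xD9):
            break  # start of scan or end of image: no more header segments
        (length,) = struct.unpack(">H", data[offset + 2:offset + 4])
        payload = data[offset + 4:offset + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return True
        offset += 2 + length  # length counts its own two bytes
    return False


def segment(marker: int, payload: bytes) -> bytes:
    """Build one JPEG segment (helper for the fake samples below)."""
    return bytes([0xFF, marker]) + struct.pack(">H", len(payload) + 2) + payload


# Fake samples: header segments only, no real image data.
tail = b"\xff\xda" + b"\x00\x02" + b"\xff\xd9"
with_exif = b"\xff\xd8" + segment(0xE1, b"Exif\x00\x00fake-tiff-data") + tail
without_exif = b"\xff\xd8" + segment(0xE0, b"JFIF\x00") + tail

print(jpeg_has_exif(with_exif), jpeg_has_exif(without_exif))
```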

[–] [email protected] 2 points 7 months ago

Okay, got it. Print the PDF, then scan it and save as PDF.

Or get some monks to get a handwritten copy, like the good old times.

[–] [email protected] 5 points 7 months ago

I know PDF providers who visibly print the customer's name or number in the header of every page, along with short copyright text. I use qpdf --stream-decompress to make the PDF into human-readable PostScript, and then Python+regex to remove each header text, which stand out a bit from other PDF elements. The script throws an error if more or fewer elements than pages have been removed but that hasn't happened yet. Processed documents sometimes have screwed-up non-ASCII characters in the Table of Contents for some reason but I don't have the originals anymore so IDK if it's my fault. Still, I wouldn't share the PDFs unless in text-only or printed form because of any other steganographic shenanigans in the file. I would absolutely torrent them if I could repurchase them under a new identity and verify that the files are identical.
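The regex step could look roughly like this. It's a sketch, not the actual script: the header wording, the pattern, and the function name are hypothetical, and it assumes the content streams are already uncompressed (the qpdf options I'm sure of for that are `--qdf` and `--stream-data=uncompress`). Editing the bytes also changes stream lengths, so the file would still need repairing afterwards.

```python
import re

# Hypothetical header: a text-show operator carrying the customer's details.
HEADER_RE = rb"\(Licensed to [^)]*\)\s*Tj\n?"


def strip_headers(content: bytes, page_count: int, pattern: bytes = HEADER_RE) -> bytes:
    """Remove header text operators, failing unless exactly one per page is hit."""
    cleaned, removed = re.subn(pattern, b"", content)
    if removed != page_count:
        raise ValueError(f"expected {page_count} headers, removed {removed}")
    return cleaned


def fake_page(number: int) -> bytes:
    """Build a toy uncompressed content stream for one page (illustration only)."""
    return (
        b"BT\n/F1 9 Tf\n(Licensed to Jane Doe #12345) Tj\nET\n"
        + b"BT\n/F1 11 Tf\n(Body text page %d) Tj\nET\n" % number
    )


document = fake_page(1) + fake_page(2)
cleaned = strip_headers(document, page_count=2)
```

The count check is the useful part: if the pattern is too greedy or too narrow, it fails loudly instead of silently eating body text.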

BTW, has anyone figured out how to embed Python code in PDF? The whitespace always gets reencoded as x-coordinates so copy&pasting it never preserves indentation. No, you can't use the Ogham Space Mark (Unicode's only non-blank character classified as a space) for indentation in Python, I tried.
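For anyone who wants to check the Ogham thing themselves, a quick sketch: Unicode does classify U+1680 as a space separator and `str.isspace()` agrees, but Python's parser doesn't accept it as indentation.

```python
import unicodedata

OGHAM_SPACE = "\u1680"  # OGHAM SPACE MARK

category = unicodedata.category(OGHAM_SPACE)  # Unicode general category
is_space = OGHAM_SPACE.isspace()

# Try to use it as indentation for the body of an if-block.
source = "if True:\n" + OGHAM_SPACE * 4 + "x = 1\n"
try:
    compile(source, "<ogham>", "exec")
    accepted = True
except SyntaxError:
    accepted = False

print(category, is_space, accepted)
```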

[–] Olgratin_Magmatoe 5 points 7 months ago

You'd be safer IRL printing it on a printer without yellow ink, then scanning it, then deleting the metadata from the scan.

[–] IlIllIIIllIlIlIIlI 3 points 7 months ago

I've seen some that also add background watermarks on random pages, in random locations.