How to prevent content in PDFs from being “scraped” by AI

Posted by Sanford Bingham on Feb 29, 2024 6:20:04 PM

In December 2023, the New York Times filed suit against OpenAI, claiming that ChatGPT was built from “uncompensated use” of the Times’ “intellectual property.” The filing showcased hundreds of examples in which answers from the chatbot were nearly identical to articles published by the Times. More disturbingly, multiple answers that were attributed to the Times had never been published, and some even contained false information.

It may be years before courts rule on the legality of the data-collection methods used by OpenAI and, by extension, by most or all commercial Large Language Models (LLMs). In the meantime, publishers are expressing concern about how and when their content is ingested into LLMs, and are urgently seeking technical solutions to prevent their intellectual property from being used to train LLMs and regurgitated in unpredictable ways by AI tools.

Why publishers of high-value PDFs are concerned

Some digital documents are already sold under contracts that expressly forbid indexing, summarization, and use in training LLMs. In nearly all cases these documents are distributed as PDF files, because PDFs are digital containers for information that can be controlled at the document level.

For example, restrictions on use in AI are now included in the sale agreements for PDFs of technical standards, for several reasons:

  • The text of standards documents is precise, and any condensation or rephrasing might change the meaning of important parts, which for certain types of standards (aeronautics, fuels, heavy machinery, etc.) could lead to catastrophic outcomes.

  • Even the most sophisticated LLMs regularly introduce errors, also known as “hallucinations”, into their output, and there is currently no reliable way to distinguish between the good answers derived from the original text and those made up by the model. Some systems attempt to address this concern by including references to the source text, but such links mitigate the risk of inaccuracy only if the user properly compares the answer with the citation.

  • Data ingested into an LLM by one user may become part of the training set for that LLM and thus be made available to other users, in violation of copyright. Most commercial LLM offerings now include some kind of segmentation to prevent unsanctioned use, but the effectiveness of these controls remains untested.

How AI LLMs extract content from PDFs for training

The mechanism by which LLMs ingest PDF content is, at a high level, no different from that used by search engines: it involves extraction of text from the PDF and then processing to organize and structure the text to enable indexing or, for LLMs, “tokenization.”
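As a rough illustration of that shared pipeline, the sketch below takes text that is assumed to have been extracted from a PDF already and processes it two ways: into an inverted index of the kind a search engine might build, and into a sequence of integer token IDs of the kind an LLM consumes. The function names, the sample pages, and the word-level vocabulary are assumptions made for this example; production LLMs use subword tokenizers, and real extraction would require a PDF library.

```python
import re
from collections import defaultdict


def split_words(text: str) -> list[str]:
    """Lowercase the text and split it into word-like pieces."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages: list[str]) -> dict[str, set[int]]:
    """Search-engine style: map each word to the page numbers containing it."""
    index = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for word in split_words(text):
            index[word].add(page_number)
    return dict(index)


def tokenize(pages: list[str]) -> tuple[list[int], dict[str, int]]:
    """LLM style (toy version): replace each word with an integer ID."""
    vocab: dict[str, int] = {}
    ids = []
    for text in pages:
        for word in split_words(text):
            # A new word gets the next unused ID; a known word reuses its ID.
            ids.append(vocab.setdefault(word, len(vocab)))
    return ids, vocab


# Hypothetical text, standing in for the output of a PDF text extractor.
pages = ["Torque the bolt to 40 Nm.", "Inspect the bolt annually."]
index = build_index(pages)
ids, vocab = tokenize(pages)
```

Both functions start from the same extracted text, which is why anything that blocks or degrades extraction affects indexing and LLM ingestion alike.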

There are significant technical differences between how the extracted text is processed/tokenized, but insofar as both approaches are built on the initial PDF text extraction, they expose the same risks and complications, including:

  • Some PDFs have no text to extract, because they are images of pages. In this case the text must be generated using Optical Character Recognition (OCR), which is generally accurate but almost never produces perfect results.

  • PDFs are typically unstructured data and some elements – especially tables, charts and special formatting – are difficult to parse, even for the most sophisticated tools.

  • PDFs often contain complex punctuation and other typographical constructs that may change the meaning of text if not properly understood, especially if the text is in an unusual language.

The text of an encrypted PDF, however, cannot be extracted without the decryption key. A PDF that has been encrypted, either with a password or via a Security Handler like FileOpen, can therefore be neither indexed nor ingested into an LLM by a crawler that lacks that key.
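From the point of view of an ingestion pipeline, this can be sketched as an early check. An encrypted PDF declares an /Encrypt entry in its trailer, so a crawler without the key has nothing to extract and can only skip the file. The byte scan below is a heuristic written for this example, not a full parser: the function names and sample bytes are hypothetical, and a real pipeline would read the trailer with a PDF library.

```python
import re


def looks_encrypted(pdf_bytes: bytes) -> bool:
    """Heuristic: look for an /Encrypt key anywhere in the raw bytes.

    The word boundary keeps the related /EncryptMetadata key from matching.
    A raw scan can give false positives if the same bytes occur elsewhere.
    """
    return re.search(rb"/Encrypt\b", pdf_bytes) is not None


def ingest(pdf_bytes: bytes) -> str:
    """Decide what a key-less crawler can do with a downloaded file."""
    if not pdf_bytes.startswith(b"%PDF-"):
        return "skip: not a PDF"
    if looks_encrypted(pdf_bytes):
        return "skip: encrypted, no key available"
    return "extract text"


# Hypothetical, heavily abbreviated trailers (not complete PDF files).
plain = b"%PDF-1.7\ntrailer\n<< /Root 1 0 R /Size 4 >>\n%%EOF"
locked = b"%PDF-1.7\ntrailer\n<< /Root 1 0 R /Encrypt 5 0 R /Size 6 >>\n%%EOF"

print(ingest(plain))
print(ingest(locked))
```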


Encrypting PDFs is the most effective defense against scraping

So how does this work? 

The contents of an encrypted PDF can only be read by an application capable of decryption that has access to the required key. There are two ways of accomplishing this, and which approach you take has everything to do with the size of your intended viewing audience, and whether you know who they are in advance.
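The dependence on the key can be shown with a short sketch. The cipher below is RC4, which older revisions of the PDF Standard Security Handler used (newer revisions use AES). It is included only because it fits in a few lines of standard Python; it is a simplification that leaves out how a PDF application derives its keys, and it should not be used to protect real documents. The key and the sample text are made up for this example.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """RC4 stream cipher; the same call both encrypts and decrypts."""
    # Key scheduling: shuffle the 256-byte state according to the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    # Keystream generation: XOR each data byte with the next keystream byte.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


key = b"hypothetical document key"
plaintext = b"Torque the bolt to 40 Nm."
ciphertext = rc4(key, plaintext)

# Only the correct key recovers the original text.
recovered = rc4(key, ciphertext)
garbage = rc4(b"some other key", ciphertext)
```

An application without the key sees only `ciphertext`; whatever it extracts from those bytes carries none of the original wording.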

Password Method (Standard Security)

One way to encrypt a PDF is by using the built-in “Password” or “Standard” Security Handler, which nearly all PDF applications support. This is the method that stops you from opening the PDF until you input a password.

Password Security is useful only for keeping information to yourself or for sharing it with people you trust. Anyone who has the password must be trusted, because there is no way to prevent that person from sharing both the PDF and the password.

Password Security also doesn’t scale well, both because adding passwords is typically a manual process and because it is necessary to distribute both the PDF and the password (which ought to be different for each PDF). Password Security simply wasn’t designed for publishing or other broad distribution.
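The bookkeeping this implies can be sketched in a few lines. The example below only generates a distinct random password for each file, using Python's standard `secrets` module; the file names are hypothetical, and applying the passwords to the PDFs would need a separate PDF tool that is not shown here.

```python
import secrets


def assign_passwords(filenames: list[str]) -> dict[str, str]:
    """Generate a separate random password for every PDF."""
    return {name: secrets.token_urlsafe(16) for name in filenames}


# Hypothetical catalog of documents to protect.
catalog = ["standard-001.pdf", "standard-002.pdf", "standard-003.pdf"]
passwords = assign_passwords(catalog)

for name, password in passwords.items():
    print(f"{name}: {password}")
```

Every entry in that table must then be stored, delivered to each reader through some other channel, and trusted not to be passed along, which is manageable for three files and unworkable for a large catalog.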

Security Handler Method (FileOpen Rights Management)

The PDF format has support for other “Security Handlers,” like FileOpen, that can encrypt the contents of the file either in advance or in real time on demand, and selectively enable decryption in specific applications, for specific individuals or groups, etc. 

When viewing PDFs encrypted with FileOpen, the authentication process is normally invisible to legitimate, authorized end-users, but locks out unauthorized users and bots. 

Encrypted PDFs can also be delivered without requiring authentication. That is, a website can contain links that anyone can click to view the PDF in a browser, but the PDF content is still protected against extraction by AI crawlers and users are prevented from downloading/sharing the files.

Content owners shouldn’t wait to deploy encryption against AI scraping

As the various lawsuits against AI operators demonstrate, copyright is not self-enforcing. Protecting intellectual property requires either technical measures, like encryption, to prevent unauthorized use or legal action to remedy past abuses.

Implementing systems to prevent theft of content is simple, inexpensive, and effective. Eventually the legal system will determine whether the alternative, going to court in an attempt to redress past actions and to enjoin future ones, also works.

The images in this post were generated by Google Gemini
