Data Privacy

It goes without saying that customers put important and potentially sensitive documents through the Unstract Platform and LLMWhisperer. Because of this, we take data privacy very seriously.

Beyond meeting various compliance standards, the Unstract Platform is designed from the ground up not to store any documents that pass through it during the normal course of operation. Documents are discarded once processing is complete and results are returned to the caller.

The difference between LLMWhisperer and Unstract

It's important to understand the difference between LLMWhisperer and Unstract in order to see how your documents, and the data extracted from them, flow through the system.

Before documents can be processed efficiently by Large Language Models (LLMs), text has to be extracted from them. This is where LLMWhisperer comes in. It is a text extraction service that extracts raw text from input documents in a form that LLMs can best understand. Unstract then uses this raw text to create structured data using LLMs.

How LLMWhisperer handles documents

As mentioned above, LLMWhisperer is a raw text extraction service aimed at LLMs. How documents are treated depends on whether you're on the free plan or the Pro plan.

LLMWhisperer paid plans

Documents are not stored anywhere in the system. You should think of it as a clean pass-through. Raw text is extracted from the documents sent to the system, the result is returned, and the document is discarded.

LLMWhisperer free plans

LLMWhisperer provides users with a forever-free plan that allows processing up to 100 pages per day at no cost. When documents are sent to LLMWhisperer under this plan, they may be stored and may be used to improve the system. The free plan is provided only for testing the capabilities of LLMWhisperer. Do not send sensitive documents when using this plan.

Other LLMWhisperer metadata

It is very common in document data extraction use cases for users to need to verify extractions for operational or legal reasons. To this end, a user interface where the extracted data and the source document are presented side-by-side is a common approach. A mechanism to highlight the extracted data in the source document ensures that reviewers are able to efficiently review extractions.

LLMWhisperer sports a unique privacy-preserving source document highlighting technology. This highlighting technology works in tandem with an LLM to achieve source document highlighting.

LLMWhisperer is designed carefully to achieve source document highlighting without the need to store any data from the document except line number coordinates. Line number coordinates contain only the bounding box coordinates for each extracted line, but no actual text or data from the document itself.
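
To make the idea of text-free highlight metadata concrete, here is a minimal sketch in Python. The `LineCoordinates` class, its field names, and the `boxes_for_lines` helper are hypothetical illustrations and not LLMWhisperer's actual schema or API. The only property taken from the description above is that each record holds bounding box coordinates for an extracted line and no text from the document. How the line numbers to highlight are chosen is not described here, so the sketch simply assumes they are given.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class LineCoordinates:
    """Hypothetical per-line highlight record: geometry only, no text."""
    line_number: int  # index of the extracted line
    page: int         # page the line appears on (assumed field)
    x: float          # left edge of the bounding box
    y: float          # top edge of the bounding box
    width: float      # width of the bounding box
    height: float     # height of the bounding box


def contains_text_fields(record_type) -> bool:
    """Return True if any field of the dataclass is declared as a string."""
    return any(f.type is str for f in fields(record_type))


def boxes_for_lines(index: dict, line_numbers: list) -> list:
    """Look up the stored bounding boxes for the given line numbers.

    Line numbers with no stored record are skipped.
    """
    return [index[n] for n in line_numbers if n in index]
```

With records like these, a review interface could draw highlight boxes over the source document using only geometry, while the document text itself is never part of what is retained.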

The Unstract Platform

The deployment phase of document data extraction on the Unstract Platform is done in one of the following ways: API Deployments, ETL Pipelines, Task Pipelines, or Human Quality Review.

The raw data needed to structure the documents comes from a text extraction service like LLMWhisperer, and it is not stored anywhere in the first three types of integrations mentioned above, namely API Deployments, ETL Pipelines and Task Pipelines. Human Quality Review, however, needs the documents to be displayed to reviewers side-by-side with the extracted data. So, if Human Quality Review is turned on (it is an opt-in feature that must be explicitly enabled), documents are retained until the review is complete. After the review, they're deleted.

Prompt Studio

Prompt Studio is the environment where prompt engineers define the extraction schema and develop generic prompts that work well across different variants of a document. To do this successfully, you'll need to provide a representative sample of document variants in the Prompt Studio project you're working on. These documents are stored in the system. We recommend keeping the document variants in the Prompt Studio project in case you need to tweak your prompts or the extraction schema. However, once the Prompt Studio project has been exported, the documents are technically no longer needed. They can be deleted from the Prompt Studio project without affecting any APIs or other deployments created from the project.