
SinglePass Extraction Introduction

In Prompt Studio, when you are working on field extractions, it is easiest to think of one prompt as extracting one field, or a group of closely related fields (like the individual fields in a line item), from a document. This also means that you can tweak a prompt to improve it without worrying about affecting the quality of other prompts. This one-to-one relationship between a prompt and a field is a design feature of Prompt Studio that makes long-term maintenance of projects with dozens of complex extractions easier.

Cost vs. prompt complexity

By default, each prompt in Unstract is run against the full context of the input document. With many prompts, the full document is sent to the LLM once per prompt, so LLM token costs grow with the number of prompts and can get expensive. One obvious technique is to combine all your prompts into one huge prompt that outputs a large JSON with all the fields you need extracted. This works, but as discussed earlier, such a prompt can become unmaintainable very quickly, especially if you need to tweak it from time to time because you keep seeing new variants of the input document. How can we strike a balance?
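To make the cost difference concrete, here is a rough back-of-the-envelope sketch. The token counts are made-up illustrative numbers, the function names are hypothetical, and the calculation only covers input tokens (output tokens and provider pricing are ignored).

```python
def per_prompt_input_tokens(doc_tokens: int, prompt_tokens: list[int]) -> int:
    """Input tokens when every prompt is sent together with the full document."""
    return sum(doc_tokens + p for p in prompt_tokens)


def combined_input_tokens(doc_tokens: int, prompt_tokens: list[int]) -> int:
    """Input tokens when all prompts are merged and sent once with the document.

    Assumption: the merged prompt is about as long as the individual prompts
    added together. A real combined prompt may be shorter or longer.
    """
    return doc_tokens + sum(prompt_tokens)


# Illustrative numbers only: a 4,000-token document and 20 prompts of 50 tokens each.
doc_tokens = 4000
prompt_tokens = [50] * 20

separate = per_prompt_input_tokens(doc_tokens, prompt_tokens)
combined = combined_input_tokens(doc_tokens, prompt_tokens)

print(separate)  # 81000
print(combined)  # 5000
```

With these numbers, running the prompts separately uses roughly 16 times the input tokens of a single combined call, because the document itself dominates the cost.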

SinglePass Extraction to the rescue

Unstract uses the default Large Language Model configured in the Prompt Studio project to combine all your prompts into a single, large prompt automatically. This has the advantage that you can continue to easily maintain smaller, atomic prompts, while, during actual deployment, all your prompts are combined using an LLM into a single prompt that is run against the full context of your input document.
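The idea of keeping atomic prompts while running one combined prompt can be illustrated with a deliberately naive sketch. This is not how Unstract does it: Unstract uses an LLM to perform the combination, whereas the code below simply numbers the prompts in a fixed template. The function names, field names, and template wording are all hypothetical.

```python
import json


def build_combined_prompt(field_prompts: dict[str, str]) -> str:
    """Merge per-field prompts into one prompt that asks for a single JSON object.

    Naive template-based merge for illustration only (hypothetical; Unstract
    itself uses an LLM to combine prompts).
    """
    lines = [
        "Answer every numbered instruction using the document provided.",
        "Respond with a single JSON object whose keys are the field names in brackets.",
        "",
    ]
    for number, (field, prompt) in enumerate(field_prompts.items(), start=1):
        lines.append(f"{number}. [{field}] {prompt}")
    return "\n".join(lines)


def parse_combined_response(raw: str, field_prompts: dict[str, str]) -> dict:
    """Map the JSON response back to the original fields; missing fields become None."""
    data = json.loads(raw)
    return {field: data.get(field) for field in field_prompts}


# The atomic prompts stay separate and individually editable.
field_prompts = {
    "customer_name": "What is the name of the customer?",
    "invoice_total": "What is the total amount due?",
}

combined_prompt = build_combined_prompt(field_prompts)
print(combined_prompt)

# A hand-written stand-in for an LLM response, used only to show the parsing step.
example_response = '{"customer_name": "Doe, Jane"}'
print(parse_combined_response(example_response, field_prompts))
```

The point of the sketch is the separation of concerns: prompts are authored and tuned one field at a time, and the merge happens only at run time.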

Limitations

With prompt engineering, we are essentially instructing the LLM to fetch and format the data we need to extract. Some prompts are simple, like the following:

What is the name of the customer?

When a project has dozens of such prompts and SinglePass Extraction combines all of them into one large prompt, there is no problem. But consider prompts like the following:

Extract the customer name from the given context. Note that the customer’s middle name should always be ignored. Also, respond with the name formatted as “Last name, First name”. Note that the context might contain the name of the customer’s relationship manager. Please note that their name needs to be ignored. We are interested in the customer’s name only.

When SinglePass Extraction combines dozens of prompts like these to form a single, large prompt, many LLMs might have trouble following the instructions in it simply because of how dense it is. Before any Prompt Studio project is exported and deployed to production, you'll need to verify that it works well with a representative sample of document variants. This is what we will discuss in the next section, Developing and verifying SinglePass Extraction.
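One simple way to think about this verification is a field-by-field comparison between the values you expect for a sample document and the values the combined prompt returned. The helper below is a hypothetical sketch of that comparison, not a Prompt Studio feature, and the sample values are invented for illustration.

```python
def field_mismatches(expected: dict, actual: dict) -> dict:
    """Return {field: (expected_value, actual_value)} for every field that differs.

    Fields missing from `actual` are reported with an actual value of None.
    """
    return {
        field: (value, actual.get(field))
        for field, value in expected.items()
        if actual.get(field) != value
    }


# Invented sample: the expected values for one document variant...
expected = {"customer_name": "Doe, Jane", "invoice_total": "1250.00"}
# ...and what a combined prompt might have returned (middle name not dropped).
actual = {"customer_name": "Doe, Jane Marie", "invoice_total": "1250.00"}

print(field_mismatches(expected, actual))
```

Running a check like this over every document variant in your sample shows which fields lose accuracy once their instructions are packed into one dense prompt.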

A note on chunking

For the extraction use cases Unstract is designed for, it's always a good idea to avoid chunking. You should only consider it if the document in question will not fit into the input context window of the chosen LLM.
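As a rough illustration of the "will it fit" check, the sketch below estimates token counts from character length. The four-characters-per-token ratio is only a common rule of thumb for English text, the reserved output budget is an arbitrary assumption, and the function names are hypothetical. For a real decision, use the tokenizer and context limit of your chosen LLM.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return math.ceil(len(text) / chars_per_token)


def needs_chunking(
    document_text: str,
    prompt_text: str,
    context_window: int,
    reserved_for_output: int = 1024,
) -> bool:
    """True only if document + prompt + output budget exceed the context window."""
    needed = (
        estimate_tokens(document_text)
        + estimate_tokens(prompt_text)
        + reserved_for_output
    )
    return needed > context_window


# Stand-in texts: roughly 1,000 and 10,000 estimated tokens respectively.
short_doc = "a" * 4_000
long_doc = "a" * 40_000
prompt = "b" * 400

print(needs_chunking(short_doc, prompt, context_window=8_192))  # False
print(needs_chunking(long_doc, prompt, context_window=8_192))   # True
```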

You need to think about data extraction from unstructured documents differently from regular RAG (Retrieval Augmented Generation) use cases. Most high-volume document data extraction use cases involve documents that are only a few pages long, and 100% accuracy is almost always the target and is achievable as well. This is why we need to operate with full context rather than retrieved, chunked context.

This advice against chunking is for accuracy reasons. While Unstract supports chunking, retrieval is generally the weakest link in any RAG application and can severely impact the overall quality of the extraction.

To understand more about chunking, please read our Chunking Guide.