
Chunk size and overlap

In the context of Retrieval-Augmented Generation (RAG), chunking refers to the division of large input data, typically one or more large documents, into smaller, more manageable segments or "chunks" for efficient information retrieval. RAG techniques leverage these chunks to retrieve relevant context that enhances the responses generated by large language models (LLMs).

The purpose of chunking in RAG is to improve the speed and accuracy of the retrieval process. By breaking the corpus down into smaller pieces, the retrieval system can quickly search through a more focused subset of data, reducing computational overhead and improving the relevance of the information retrieved.

Chunking in Unstract’s context

Unstract uses Large Language Models (LLMs) at its core to extract information from unstructured documents. Each large language model has what is called a context size. The context size of a model like GPT 4 refers to the maximum number of tokens (words, parts of words, and punctuation) the model can consider at one time as input. This is also known as the model's "window" or "attention span." For instance, the GPT 3.5 Turbo model has a context size of 4,096 tokens, meaning it can take into account up to 4,096 tokens of text when processing or generating text. If the input exceeds this limit, the earliest tokens are truncated and the model only considers the most recent 4,096 tokens.

Note: Roughly 1,000 English words correspond to about 750 tokens.

Let’s assume you have 2 documents:

  1. Document A with 2,000 words (approximately 1,500 tokens)
  2. Document B with 15,000 words (approximately 11,250 tokens)

Document A easily fits within the 4,096-token limit of the GPT 3.5 Turbo model, so we can pass the entire document as context along with a prompt and expect the model to consider the whole document in one go. The scenario is different for Document B: its roughly 11,250 tokens will not fit into GPT 3.5 Turbo's 4,096-token limit. This is where chunking comes into play in Unstract. Chunking is not required for Document A, but it is required for Document B, and we will have to select a chunk size to handle it.
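This decision comes down to simple arithmetic. Below is a minimal sketch using this page's rule of thumb of roughly 750 tokens per 1,000 English words; the word counts and the 4,096-token limit come from the example above, and `estimated_tokens` and `needs_chunking` are illustrative helpers, not part of Unstract's API.

```python
# Rough check of whether a document fits an LLM's context window.
# Uses the rule of thumb from this page: ~750 tokens per 1,000 English words.

def estimated_tokens(word_count: int) -> int:
    """Approximate token count from a word count (rule of thumb only)."""
    return int(word_count * 750 / 1000)

def needs_chunking(word_count: int, context_size: int) -> bool:
    """True if the estimated token count exceeds the model's context size."""
    return estimated_tokens(word_count) > context_size

GPT_35_TURBO_CONTEXT = 4096

for name, words in [("Document A", 2_000), ("Document B", 15_000)]:
    tokens = estimated_tokens(words)
    verdict = "chunking required" if needs_chunking(words, GPT_35_TURBO_CONTEXT) else "fits without chunking"
    print(f"{name}: ~{tokens} tokens -> {verdict}")
# Document A: ~1500 tokens -> fits without chunking
# Document B: ~11250 tokens -> chunking required
```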

Note: Setting Chunk Size to 0 disables Chunking.

Pros and Cons of Chunking

No Chunking (0 Chunk Size)

Pros:

  1. The document's context is fully transferred to the LLM.
  2. No Vector DB is involved.
  3. No embedding is involved.
  4. No retrieval engine is involved, so there is no need to worry about embedding quality or chunk size strategies.
  5. Almost always provides the best quality and most accurate results.

Cons:

  1. Does not work for large documents.
  2. Higher cost, since the entire document is sent to the LLM for every field extraction.
  3. Higher latency (slower result generation), since the entire document is sent to the LLM irrespective of the field being extracted.

With Chunking

Pros:

  1. Information can be extracted from very large documents that won't fit into the context of an LLM as a whole.
  2. Lower cost, since only small chunks of the document are sent to the LLM for a single prompt.
  3. Lower latency (faster result generation), since only part of the document is sent to the LLM.

Cons:

  1. A Vector DB is required.
  2. Embeddings are required.
  3. The quality of retrieval (selecting the right chunks to send to the LLM) depends on many factors:
     • Selection of chunk size
     • Selection of overlap
     • Retrieval strategy used
     • Quality of embedding
     • Information density / distribution within a document
  4. Requires iterative experimentation to arrive at the above settings.

How to choose

If your documents are smaller than the context size of the LLM (Document A)

Choose no chunking (chunk size 0) if your document’s text contents can fit into the context size of the selected LLM. This provides the best results. However, if you have many extractions to make from each document, the cost can increase significantly. Note that the SaaS and Enterprise versions of Unstract include two features, Summarized Extraction and Single-Pass Extraction, which let you enjoy the power and ease of use of no chunking while keeping costs low, without any effort on your side.

If your documents are larger than the context size of the LLM (Document B)

Choose chunking. There is no other option.

Rough page count for choosing your strategy

From our experience, dense documents average around 400 words per page, and most documents have significantly fewer than that. For the calculations below, let us assume documents with 500 words per page.

Note: 500 words per page is substantially on the higher end of the scale. Real-world documents, especially when averaged over multiple pages, will have much lower word counts.

| LLM Model | Context Size | No Chunking (max pages) | Requires Chunking if the document is… |
| --- | --- | --- | --- |
| Llama | 2,048 (2K) | 5 Pages | > 5 Pages |
| Llama 2 | 4,096 (4K) | 10 Pages | > 10 Pages |
| GPT 3.5 Turbo | 4,096 (4K) | 10 Pages | > 10 Pages |
| GPT 4 | 8,192 (8K) | 20 Pages | > 20 Pages |
| Mistral 7B | 32,768 (32K) | 80 Pages | > 80 Pages |
| GPT 4 Turbo | 131,072 (128K) | 320 Pages | > 320 Pages |
| Gemini 1.5 Pro | 131,072 (128K) | 320 Pages | > 320 Pages |
| Claude 3 Sonnet | 204,800 (200K) | 500 Pages | > 500 Pages |
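The "max pages" column can be reproduced with the same rule of thumb. Here is a minimal sketch assuming 500 words per page and roughly 750 tokens per 1,000 words, as stated above; note that the table rounds the larger figures down further, presumably to leave headroom for the prompt and the model's output.

```python
# Reproduce the rough "max pages without chunking" figures from the table above.
WORDS_PER_PAGE = 500          # deliberately high-end assumption from this page
TOKENS_PER_WORD = 750 / 1000  # rule of thumb: ~750 tokens per 1,000 English words

def max_pages_without_chunking(context_size: int) -> int:
    tokens_per_page = WORDS_PER_PAGE * TOKENS_PER_WORD  # 375 tokens per page
    return int(context_size // tokens_per_page)

for model, context in [("GPT 3.5 Turbo", 4_096), ("GPT 4", 8_192),
                       ("GPT 4 Turbo", 131_072), ("Claude 3 Sonnet", 204_800)]:
    print(f"{model}: about {max_pages_without_chunking(context)} pages")
# GPT 3.5 Turbo: about 10 pages
# GPT 4: about 21 pages
# GPT 4 Turbo: about 349 pages
# Claude 3 Sonnet: about 546 pages
```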

Cost Considerations for no chunking strategy

It is tempting to use a no-chunking strategy considering that today’s leading models can handle 100+ pages, but you also have to consider the cost implications during extraction.

For the sake of calculations, let’s assume that we are dealing with documents with 400 words per page. We see this as an average value for dense documents.

Note: The costs mentioned below are rough estimates; many dynamic conditions affect actual costs, so use them only for ballpark figures. When a Challenger LLM is used, a 20% retry cost is added. A Challenger LLM improves extraction accuracy by challenging the Extractor LLM, forcing it to reevaluate its results.

All figures below assume Summarized Extraction is not used.

| LLM Model | Challenger LLM Model | Approx cost per prompt per page (no Challenger) | Approx cost per prompt per page (with Challenger) | Approx cost for 10-page, 10-prompt document (no Challenger) | Approx cost for 10-page, 10-prompt document (with Challenger) |
| --- | --- | --- | --- | --- | --- |
| GPT 4 | Gemini 1.5 Pro | $0.0100 | $0.0145 | $1.000 | $1.450 |
| GPT 4 | Claude 3 Sonnet | $0.0100 | $0.0131 | $1.000 | $1.310 |
| GPT 4 Turbo | Gemini 1.5 Pro | $0.0030 | $0.0061 | $0.300 | $0.610 |
| GPT 4 Turbo | Claude 3 Sonnet | $0.0030 | $0.0047 | $0.300 | $0.470 |
| Gemini 1.5 Pro | GPT 4 Turbo | $0.0021 | $0.0061 | $0.210 | $0.610 |
| Gemini 1.5 Pro | Claude 3 Sonnet | $0.0021 | $0.0036 | $0.210 | $0.360 |
| Claude 3 Sonnet | GPT 4 Turbo | $0.0009 | $0.0047 | $0.090 | $0.470 |
| Claude 3 Sonnet | Gemini 1.5 Pro | $0.0009 | $0.0036 | $0.090 | $0.360 |

Pricing info as of 3rd May 2024.
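The per-document figures are simply the per-prompt, per-page cost multiplied by the page and prompt counts; the "with Challenger" figures appear to be the sum of the extractor and challenger per-page costs with the 20% retry overhead applied. Here is a minimal sketch of that arithmetic; the per-page costs are copied from the table, and `document_cost` is an illustrative helper, not an Unstract API.

```python
# Rough reproduction of the cost table above (ballpark figures only).
RETRY_OVERHEAD = 1.20  # 20% retry cost added when a Challenger LLM is used

# Approximate cost per prompt per page for each model, taken from the table above.
PER_PROMPT_PER_PAGE = {
    "GPT 4": 0.0100,
    "GPT 4 Turbo": 0.0030,
    "Gemini 1.5 Pro": 0.0021,
    "Claude 3 Sonnet": 0.0009,
}

def document_cost(extractor: str, pages: int, prompts: int,
                  challenger: str | None = None) -> float:
    """Ballpark extraction cost for one document, without Summarized Extraction."""
    per_page = PER_PROMPT_PER_PAGE[extractor]
    if challenger is not None:
        # Add the challenger's per-page cost, then apply the 20% retry overhead.
        per_page = (per_page + PER_PROMPT_PER_PAGE[challenger]) * RETRY_OVERHEAD
    return per_page * pages * prompts

print(round(document_cost("GPT 4", pages=10, prompts=10), 2))                               # 1.0  (table: $1.000)
print(round(document_cost("GPT 4", pages=10, prompts=10, challenger="Gemini 1.5 Pro"), 2))  # 1.45 (table: $1.450)
```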

Choosing chunk size and overlap

Choosing the right chunk size and overlap is crucial for optimising performance and retrieval quality.

  • Chunking: This involves breaking the input data into manageable pieces or "chunks". The size of each chunk is critical because it determines how much information is captured in each embedded vector: a chunk that is too small may not carry enough information, while a chunk that is too large can dilute the embedded information.
  • Overlap: Overlap between chunks ensures that information at chunk boundaries is not lost or contextually isolated. This overlapping region helps create a more seamless integration of the retrieved information. The sketch below illustrates both parameters.
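Here is a minimal, word-based sketch of what these two parameters do. Production splitters typically work on tokens and respect sentence or paragraph boundaries, so treat this purely as an illustration of chunk size and overlap rather than how Unstract splits documents internally.

```python
# Illustrative word-based splitter showing the effect of chunk size and overlap.
def split_into_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of `chunk_size` words, each sharing `overlap`
    words with the previous chunk so boundary context is not lost."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    words = text.split()
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the document
    return chunks

document = "one two three four five six seven eight nine ten"
for chunk in split_into_chunks(document, chunk_size=4, overlap=1):
    print(chunk)
# one two three four
# four five six seven
# seven eight nine ten
```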

Determining Chunk Size

Context size of the LLM: Consider the maximum context window of your LLM (e.g. 4,096 tokens for GPT 3.5 Turbo). The chunk size should be large enough to provide meaningful content but not exceed the model’s capacity to handle context efficiently.

Content Type: Depending on the nature of the text (e.g., technical documents, conversational transcripts), different chunk sizes may be optimal. Dense, information-rich text might require smaller chunks to avoid missing critical details.

Retrieval strategy requirements: Note that some retrieval strategies retrieve the top-k chunks and pass them to the LLM together. This means that if 3 chunks need to be passed to the LLM, all 3 chunks must fit into the LLM’s context size simultaneously, as the budget sketch below shows.
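One way to turn this constraint into a number is to budget the context window explicitly. This is a minimal sketch assuming a top-k of 3 and illustrative allowances for the prompt and the model's response; the exact overheads depend on your prompts and are not Unstract-specific values.

```python
# Back-of-the-envelope chunk size budget for a top-k retrieval strategy.
CONTEXT_SIZE = 4_096      # e.g. GPT 3.5 Turbo
TOP_K = 3                 # number of retrieved chunks sent to the LLM together
PROMPT_OVERHEAD = 500     # illustrative allowance for prompt instructions
OUTPUT_RESERVE = 500      # illustrative allowance for the model's response

available_for_chunks = CONTEXT_SIZE - PROMPT_OVERHEAD - OUTPUT_RESERVE
max_chunk_size = available_for_chunks // TOP_K
print(max_chunk_size)  # 1032 tokens per chunk under these assumptions
```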

Deciding on Overlap Size

Ensure Contextual Integrity: The overlap should be sufficient to maintain the context between chunks. Typically, an overlap of 10-20% of the chunk size is a good starting point.

Trade-offs: More overlap means better context preservation but can lead to redundancy and increased computational load. Find a balance based on your specific application needs.
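To see the redundancy side of the trade-off concretely, here is the arithmetic for a 512-token chunk size with a 15% overlap, applied to the 11,250-token Document B from earlier; the numbers are purely illustrative.

```python
import math

DOC_TOKENS = 11_250               # Document B from the earlier example
CHUNK_SIZE = 512
OVERLAP = int(CHUNK_SIZE * 0.15)  # 76 tokens, i.e. 15% of the chunk size
STRIDE = CHUNK_SIZE - OVERLAP     # 436 new tokens per chunk

num_chunks = math.ceil(max(DOC_TOKENS - CHUNK_SIZE, 0) / STRIDE) + 1
tokens_indexed = DOC_TOKENS + (num_chunks - 1) * OVERLAP  # boundary tokens are embedded twice
print(num_chunks)      # 26 chunks
print(tokens_indexed)  # 13150 tokens embedded, roughly 17% more than the raw document
```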

Experiment and Adjust

Pilot Testing: Start with a baseline based on the guidelines above and then test the chunking and overlap in real scenarios. Analyse the impact on retrieval quality and model output coherence.

Iterative Adjustments: Adjust chunk sizes and overlap based on performance metrics and qualitative feedback. This might require several iterations to optimise.

Resource Constraints

Computational Resources: Larger chunks and more overlap can strain your computational resources. Ensure your settings are sustainable given your latency and cost constraints.

Latency Requirements: If your application requires very low latency, you might need to optimise for faster retrieval, potentially at the cost of some contextual depth.

Retrieval strategies

The retrieval strategy refers to the methodology or approach used to locate and retrieve relevant information from a document in response to a user query or task requirement. The choice of retrieval strategy impacts how effectively and efficiently a system can provide relevant information, which is critical in Retrieval-Augmented Generation (RAG) techniques. There are two retrieval strategies available in Unstract.

Simple retriever: This retriever employs straightforward mechanisms to fetch relevant information. Unstract implements it using a combination of keyword and vector search. The simple retriever is fast and efficient for large-scale applications but may lack the nuance and deep understanding provided by more complex approaches; a sketch of this hybrid idea appears at the end of this section.

Subquestion retriever: This is an advanced retrieval approach for complex query scenarios. It breaks a complex query down into simpler subquestions, retrieves relevant information for each subquestion, and then aggregates the results to address the original query. It is particularly useful for multifaceted questions that require pulling together information from different contexts or domains, since it focuses on the specific information needs identified in each subquestion.
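As an illustration of the simple retriever's hybrid idea, the sketch below scores chunks with a blend of keyword overlap and vector similarity. The "embedding" is a deliberately crude bag-of-words stand-in so the example runs without external services, and the 50/50 weighting is an assumption for the example; neither reflects Unstract's actual embeddings, weighting, or implementation.

```python
# Illustrative hybrid (keyword + vector) scoring in the spirit of a simple retriever.
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a simple bag-of-words vector.
    return Counter(tokenize(text))

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def keyword_score(query: str, chunk: str) -> float:
    query_terms, chunk_terms = set(tokenize(query)), set(tokenize(chunk))
    return len(query_terms & chunk_terms) / len(query_terms) if query_terms else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Rank chunks by an even blend of keyword overlap and vector similarity."""
    q_vec = embed(query)
    scored = [
        (0.5 * keyword_score(query, c) + 0.5 * cosine_similarity(q_vec, embed(c)), c)
        for c in chunks
    ]
    return [c for _, c in sorted(scored, reverse=True)[:top_k]]

chunks = [
    "The lease term is 24 months starting January 2024.",
    "The monthly rent is 2,500 dollars payable in advance.",
    "The tenant is responsible for utilities and maintenance.",
]
print(retrieve("What is the monthly rent?", chunks, top_k=1))
# ['The monthly rent is 2,500 dollars payable in advance.']
```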