Extraction API

Convert PDFs and scanned documents into text that can be used by LLMs.

Endpoint: /whisper
URL: https://llmwhisperer-api.unstract.com/v1/whisper
Method: POST
Headers: unstract-key: <YOUR API KEY>
Body: application/octet-stream

Parameters

Parameter | Type | Default | Required | Description
--- | --- | --- | --- | ---
url | string | | No | By default the API processes the document sent in the request body. To process a document from a URL instead, provide the URL here. The URL must be accessible without authentication. If the request body is empty, the API will try to process the document from the URL.
processing_mode | string | | Yes | The processing mode to be used. Choose between ocr and text.
output_mode | string | | Yes | The output mode to be used. Choose between line-printer and text.
page_seperator | string | <<< | No | The string to be used as a page separator.
force_text_processing | boolean | false | No | If set to true, the document will be processed as text only. If set to false, the document will be processed based on LLMWhisperer's chosen strategy.
pages_to_extract | string | | No | Defines which pages to extract. By default all pages are extracted. Example: 1-5,7,21- will extract pages 1,2,3,4,5,7,21,22,23,24... up to the last page.
timeout | integer | 200 | No | The time in seconds after which the request automatically switches to async mode. If a timeout occurs, the API returns a 202 message along with a whisper-hash which can be used later to check the processing status and retrieve the text. Refer to the async operation documentation for more information.
store_metadata_for_highlighting | boolean | false | No | If set to true, the metadata required for highlighting is stored. If you do not require the highlighting API, set this to false. Note that setting this to true will store your text on our servers.
median_filter_size | integer | 0 | No | The size of the median filter to be applied to the image. This is used to remove noise from the image. This parameter works only in the on-prem version of LLMWhisperer.
gaussian_blur_radius | integer | 0 | No | The radius of the Gaussian blur to be applied to the image. This is used to remove noise from the image. This parameter works only in the on-prem version of LLMWhisperer.
ocr_provider | string | advanced | No | The OCR provider to be used. Choose between simple and advanced. This parameter works only in the on-prem version of LLMWhisperer.
line_splitter_tolerance | float | 0.4 | No | Factor that decides when text above or below the baseline is moved to the next line. The default value of 0.4 signifies 40% of the average character height.
horizontal_stretch_factor | float | 1.0 | No | Factor by which the page is stretched horizontally. Defaults to 1.0; a value of 1.1 applies a 10% stretch. Normally this factor does not need to be adjusted. It can help when the columns of a multi-column layout run into each other, for example when the two columns of a two-column layout get merged into one.
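
To make the pages_to_extract syntax concrete, here is a minimal Python sketch of how a client might expand such a string locally, for example to estimate how many pages a request will cover. The function name expand_pages and the total_pages argument are hypothetical and not part of the API; the service does its own parsing, and this sketch only mirrors the behaviour described for the 1-5,7,21- example.

```python
def expand_pages(spec: str, total_pages: int) -> list[int]:
    """Expand a pages_to_extract string such as "1-5,7,21-" into page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)
            # An open-ended range like "21-" runs to the last page.
            end = int(end_text) if end_text else total_pages
        else:
            start = end = int(part)
        # Assumption: pages outside the document are silently ignored.
        pages.update(range(max(start, 1), min(end, total_pages) + 1))
    return sorted(pages)


# For a 24-page document, this matches the example in the table.
print(expand_pages("1-5,7,21-", total_pages=24))
# -> [1, 2, 3, 4, 5, 7, 21, 22, 23, 24]
```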

Processing Modes

OCR [ocr] mode extracts text by considering the entire page as an image. Though this mode produces the best results, it is very slow compared to text mode.

Text [text] mode extracts text by directly extracting text embedded inside PDF files. This mode is very fast compared to OCR mode. It is recommended to use text mode when dealing with documents that contain proper text.

We recommend using OCR mode when dealing with the following types of documents:

  • Scanned documents
  • Documents with non-standard text layout
  • Documents with handwriting
  • Documents with form elements like checkboxes, radio buttons, etc.

We recommend using Text mode when dealing with the following types of documents:

  • Documents which contain proper text. Typically, documents generated by software like MS Word, Google Docs, etc. fall into this category.
  • Software-generated PDFs like invoices, receipts, etc.
  • Large documents with many pages
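
The recommendations above can be written down as a small decision helper. This is only a sketch of the guidance in this section: the function choose_processing_mode and its flags are hypothetical names, and the caller is assumed to already know these traits of the document.

```python
def choose_processing_mode(
    scanned: bool = False,
    non_standard_layout: bool = False,
    handwriting: bool = False,
    form_elements: bool = False,
) -> str:
    """Return a processing_mode value ("ocr" or "text") for a document."""
    # Any trait from the OCR list above tips the choice towards OCR mode.
    needs_ocr = scanned or non_standard_layout or handwriting or form_elements
    # Otherwise the document is assumed to contain proper embedded text,
    # where text mode is the faster option.
    return "ocr" if needs_ocr else "text"


print(choose_processing_mode(scanned=True))  # -> ocr
print(choose_processing_mode())              # -> text
```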

Forcing Text Processing

Sometimes forcing text processing might be required, because LLMWhisperer might switch to OCR mode even if the document contains proper text. This can happen under certain circumstances, such as:

  • Document contains a background image which we are not able to remove properly. Forcing text mode will ignore the background image and extract text.
  • Document contains watermark images in the background and the text in the watermarks gets extracted as regular text in the result.
  • Certificates containing decorative or watermark text in the background image.

If you are sure that the document contains proper text and you want to extract only that text, forcing text mode will not only yield better results but also speed up processing significantly.

You can force text processing by setting force_text_processing to true.

Output Modes

Line Printer [line-printer] mode tries to extract the text from the document as is, maintaining the structural layout of the document. This works very well for LLM consumption. This mode uses many techniques to provide the text in the best possible way for LLMs. It also removes white spaces and other unwanted characters from the text to make the result more cost-effective for LLMs.

Text [text] mode extracts the text from the document without applying any processing or intelligence. This mode is useful when the line-printer mode is not able to extract the text properly. This can happen if the document contains too many different fonts and font sizes.

Sync/Async Mode and Timeout

The API can be used in both sync and async modes. If the processing takes more than the supplied timeout value, the API will automatically switch to async mode.

For example, if you set the timeout to 20 seconds and the processing takes more than 20 seconds, the API will return a 202 message along with a whisper-hash which can be used later to check processing status and retrieve the text.

In async mode, for safety and privacy reasons, the extracted text is kept in memory on our servers for only 60 minutes and is deleted after that. Make sure that the text is retrieved within this time frame.

Note that there is a hard timeout of 200 seconds. The call will switch to async mode after 200 seconds even if the timeout is set to a higher value.
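
The two limits in this section (the 200 second hard timeout and the 60 minute retention in async mode) can be sketched as client-side helpers. The names below are hypothetical. The sketch also assumes the 60 minutes are counted from a timestamp the caller supplies; the exact moment the retention clock starts is not specified here.

```python
from datetime import datetime, timedelta, timezone

HARD_TIMEOUT_SECONDS = 200                # hard limit stated above
ASYNC_RETENTION = timedelta(minutes=60)   # how long the text is kept in async mode


def effective_timeout(requested_seconds: int) -> int:
    """Timeout that actually applies: a larger request is capped at 200 seconds."""
    return min(requested_seconds, HARD_TIMEOUT_SECONDS)


def retrieval_deadline(stored_at: datetime) -> datetime:
    """Latest time to retrieve the text, assuming retention starts at stored_at."""
    return stored_at + ASYNC_RETENTION


print(effective_timeout(20))    # -> 20
print(effective_timeout(500))   # -> 200
print(retrieval_deadline(datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)))
# -> 2024-01-01 13:00:00+00:00
```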

Metadata for Highlighting

If you are using the highlighting API, set store_metadata_for_highlighting to true. This stores the metadata required for highlighting. If you do not require the highlighting API, set this to false. Note that setting this to true will store your text on our servers.

Request Body

The request body should contain the PDF or scanned document that needs to be converted to text. The document should be sent as raw bytes with the content type application/octet-stream.

Example Curl Request

curl -X POST --location 'https://llmwhisperer-api.unstract.com/v1/whisper?force_text_processing=true&processing_mode=text&output_mode=line-printer' \
-H 'Content-Type: application/octet-stream' \
-H 'unstract-key: <Your API Key>' \
--data-binary '@your-file-to-process.pdf'

Note: To include the response headers in the output, add -i to the curl command.
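
For readers calling the API from Python, the same request can be sketched with the standard library alone. The helper name build_whisper_request is hypothetical, and the placeholder bytes stand in for a real file. The sketch only builds the request object; sending it is left as a commented-out line.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://llmwhisperer-api.unstract.com/v1/whisper"


def build_whisper_request(document: bytes, api_key: str, **params) -> Request:
    """Build (but do not send) a POST request equivalent to the curl example."""
    # Render booleans as lowercase true/false, as in the curl query string.
    query = {
        name: str(value).lower() if isinstance(value, bool) else value
        for name, value in params.items()
    }
    return Request(
        f"{BASE_URL}?{urlencode(query)}",
        data=document,
        method="POST",
        headers={
            "Content-Type": "application/octet-stream",
            "unstract-key": api_key,
        },
    )


request = build_whisper_request(
    b"%PDF-1.4 placeholder bytes",  # in practice: open(path, "rb").read()
    api_key="<Your API Key>",
    force_text_processing=True,
    processing_mode="text",
    output_mode="line-printer",
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```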

Response

HTTP Status | Content-Type | Headers | Description
--- | --- | --- | ---
200 | text/plain | Whisper-Hash | The API returns the extracted text in the response body. The Whisper-Hash header can be used to extract highlighting info if required.
202 | application/json | | The API returns a JSON with a whisper-hash which can be used with the status API to get the status and later retrieve the extracted text. Refer below for the JSON format.

Example 202 Response

{
  "message": "Processing time exceeded X seconds. Use the status...",
  "status": "processing",
  "whisper-hash": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}
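
A client has to handle both outcomes described above. The following Python sketch shows one way to do that, assuming the response has already been read into a status code, a header mapping, and a body. The function name interpret_response and the shape of the returned dictionary are hypothetical, as is the assumption that the 200 body is UTF-8 encoded. Other status codes are not covered in this section, so the sketch simply raises for them.

```python
import json


def interpret_response(status: int, headers: dict, body: bytes) -> dict:
    """Turn a /whisper response into a small result dictionary."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    if status == 200:
        # Sync result: text in the body, hash in the Whisper-Hash header.
        return {
            "done": True,
            "whisper_hash": lowered.get("whisper-hash"),
            "text": body.decode("utf-8"),  # assumption: UTF-8 text
        }
    if status == 202:
        # Async result: keep the hash and retrieve the text later.
        payload = json.loads(body)
        return {"done": False, "whisper_hash": payload["whisper-hash"], "text": None}
    raise ValueError(f"unexpected HTTP status: {status}")


pending = interpret_response(
    202,
    {"Content-Type": "application/json"},
    b'{"status": "processing", "whisper-hash": "abc123"}',
)
print(pending["done"], pending["whisper_hash"])  # -> False abc123
```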