Version: 1.0.0

Extraction API

Convert your PDF or scanned documents to a text format that LLMs can consume.


Endpoint	`/whisper`
URL	`https://llmwhisperer-api.unstract.com/v1/whisper`
Method	`POST`
Headers	`unstract-key: <YOUR API KEY>`
Body	`application/octet-stream`

Parameters

Parameter	Type	Default	Required	Description
url (deprecated, use url_in_post instead)	string		No	The default behaviour of the API is to process the document sent in the request body. If you want to process a document from a URL, you can provide the URL here. The URL should be accessible without any authentication. If the request body is empty, the API will try to process the document from the URL.
processing_mode	string		Yes	The processing mode to be used. Choose between `ocr` and `text`
output_mode	string		Yes	`line-printer` or `text`
page_seperator	string	`<<<`	No	The string to be used as a page separator.
force_text_processing	boolean	`false`	No	If set to `true`, the document will be processed as text only. If set to `false`, the document will be processed based on LLMWhisperer's chosen strategy.
pages_to_extract	string		No	Define which pages to extract. By default all pages are extracted. You can specify which pages to extract with this parameter. Example `1-5,7,21-` will extract pages 1,2,3,4,5,7,21,22,23,24... till the last page.
timeout	integer	`200`	No	The time in seconds after which the request will automatically switch to async mode. If a timeout occurs then the API will return a `202` message along with `whisper-hash` which can be used later to check processing status and retrieve the text. Refer to the async operation documentation for more information
store_metadata_for_highlighting	boolean	`false`	No	If set to true, metadata required for the highlighting is stored. If you do not require highlighting API, set this to `false`. Note that setting this to `true` will store your text in our servers
median_filter_size	integer	`0`	No	The size of the median filter to be applied to the image. This is used to remove noise from the image. This parameter works only in the on-prem version of LLMWhisperer.
gaussian_blur_radius	integer	`0`	No	The radius of the Gaussian blur to be applied to the image. This is used to remove noise from the image. This parameter works only in the on-prem version of LLMWhisperer.
ocr_provider	string	`advanced`	No	The OCR provider to be used. Choose between `simple` and `advanced`. This parameter works only in the on-prem version of LLMWhisperer.
line_splitter_tolerance	float	`0.4`	No	Factor to decide when to move text to the next line when it is above or below the baseline. The default value of `0.4` signifies 40% of the average character height
horizontal_stretch_factor	float	`1.0`	No	Factor by which the text is stretched horizontally. A factor of `1.1` applies a 10% stretch. Normally this factor does not need to be adjusted. It can help when the columns of a multi-column layout run into each other, for example when the two columns of a two-column layout get merged into one.
url_in_post	boolean	`false`	No	If set to `true`, the URL of the document to download is sent in the POST body. See the example below.
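
To make the `pages_to_extract` syntax concrete, here is a minimal Python sketch of how a specification such as `1-5,7,21-` could expand into page numbers. The function name `expand_pages` and the `total_pages` argument are hypothetical, and treating a missing range start as page 1 is an assumption. This only illustrates the syntax described in the table and is not the service's own parser.

```python
def expand_pages(spec: str, total_pages: int) -> list[int]:
    """Expand a spec like '1-5,7,21-' into sorted 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            # Assumption: an open start means page 1.
            start = int(start_text) if start_text else 1
            # An open end ('21-') runs to the last page, as in the table's example.
            end = int(end_text) if end_text else total_pages
        else:
            start = end = int(part)
        # Keep only pages that exist in the document.
        pages.update(range(max(start, 1), min(end, total_pages) + 1))
    return sorted(pages)


# For a 24-page document, this matches the example in the table.
print(expand_pages("1-5,7,21-", total_pages=24))
# → [1, 2, 3, 4, 5, 7, 21, 22, 23, 24]
```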

Processing Modes

OCR [ocr] mode extracts text by considering the entire page as an image. Though this mode produces the best results, it is very slow compared to text mode.

Text [text] mode extracts text by directly extracting text embedded inside PDF files. This mode is very fast compared to OCR mode. It is recommended to use text mode when dealing with documents that contain proper text.

We recommend using OCR mode when dealing with the following types of documents:

Scanned documents
Documents with non-standard text layout
Documents with handwriting
Documents with form elements like checkboxes, radio buttons, etc.

We recommend using Text mode when dealing with the following types of documents:

Documents which contain proper text. Typically, documents generated by software like MS Word, Google Docs, etc. fall into this category.
Software-generated PDFs like invoices, receipts, etc.
Large documents with many pages

Forcing Text Processing

Sometimes you may need to force text processing, because LLMWhisperer might switch to OCR mode even if the document contains proper text. This can happen under certain circumstances, such as:

Document contains a background image which we are not able to remove properly. Forcing text mode will ignore the background image and extract text.
Document contains watermark images in the background and the text in the watermarks gets extracted as regular text in the result.
Certificates containing decorative or watermark text in the background image.

If you are sure that the document contains proper text and you want to extract only that text, forcing text mode will not only yield better results but also speed up processing significantly.

You can force text processing by setting force_text_processing to true.

Output Modes

Line Printer [line-printer] mode tries to extract the text from the document as is, maintaining the structural layout of the document. This works very well for LLM consumption. This mode uses many techniques to provide the text in the best possible way for LLMs. It also removes white spaces and other unwanted characters from the text to make the result more cost-effective for LLMs.

Text [text] mode extracts the text from the document without applying any processing or intelligence. This mode is useful when the line-printer mode is not able to extract the text properly. This can happen if the document contains too many different fonts and font sizes.

Sync/Async Mode and Timeout

The API can be used in both sync and async modes. If the processing takes more than the supplied timeout value, the API will automatically switch to async mode.

For example, if you set the timeout to 20 seconds and the processing takes more than 20 seconds, the API will return a 202 message along with a whisper-hash which can be used later to check processing status and retrieve the text.

In async mode, for safety and privacy reasons, the extracted text is kept in memory on our servers for only 60 minutes and is deleted after that. Make sure you retrieve the text within this time frame.

Note that there is a hard timeout of 200 seconds. The call will switch to async mode after 200 seconds even if the timeout is set to a higher value.
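
The timeout rule can be summarised in a few lines of Python. The sketch below is illustrative only: `effective_timeout` and `will_switch_to_async` are hypothetical helper names, and the constant mirrors the 200-second hard limit stated above.

```python
HARD_TIMEOUT_SECONDS = 200  # hard limit stated in this section


def effective_timeout(requested: int = 200) -> int:
    """Return the timeout that actually applies, capped at the hard limit."""
    return min(requested, HARD_TIMEOUT_SECONDS)


def will_switch_to_async(processing_seconds: float, requested_timeout: int = 200) -> bool:
    """True if processing outlasts the effective timeout (a 202 response is expected)."""
    return processing_seconds > effective_timeout(requested_timeout)


print(effective_timeout(20))    # → 20
print(effective_timeout(600))   # → 200
print(will_switch_to_async(45, requested_timeout=20))  # → True
```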

Metadata for Highlighting

If you are using the highlighting API, you can set store_metadata_for_highlighting to true. This will store the metadata required for highlighting. If you do not require highlighting API, set this to false. Note that setting this to true will store your text in our servers.

Request Body

The request body should contain the PDF/Scanned document that needs to be converted to text. The document should be in application/octet-stream format.

If you are using the url_in_post parameter, the URL should be sent in the request body. Content-Type should be text/plain. See curl example below.

Example Curl Requests

Upload document

curl -X POST --location 'https://llmwhisperer-api.unstract.com/v1/whisper?force_text_processing=true&processing_mode=text&output_mode=line-printer' \
-H 'Content-Type: application/octet-stream' \
-H 'unstract-key: <Your API Key>' \
--data-binary '@your-file-to-process.pdf'

Process document from URL

curl -X POST --location 'https://llmwhisperer-api.unstract.com/v1/whisper?force_text_processing=false&processing_mode=ocr&output_mode=line-printer&url_in_post=true' \
-H 'Content-Type: text/plain' \
-H 'unstract-key: <Your API Key>' \
--data 'https://your-url-to-process.pdf'

To include the response headers in the output, add curl's -i option to the request.
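
The two curl requests above could be prepared in Python along the following lines. This is a sketch using only the standard library. The helper name `build_whisper_request` and its arguments are hypothetical and not part of any official client. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

WHISPER_URL = "https://llmwhisperer-api.unstract.com/v1/whisper"


def build_whisper_request(api_key, params, *, file_bytes=None, document_url=None) -> Request:
    """Build a POST request for either an uploaded file or a document URL."""
    if (file_bytes is None) == (document_url is None):
        raise ValueError("Provide exactly one of file_bytes or document_url")
    query = dict(params)
    if document_url is not None:
        # URL mode: the URL travels in the body as text/plain.
        query["url_in_post"] = "true"
        body = document_url.encode("utf-8")
        content_type = "text/plain"
    else:
        # Upload mode: the raw document bytes are the body.
        body = file_bytes
        content_type = "application/octet-stream"
    return Request(
        f"{WHISPER_URL}?{urlencode(query)}",
        data=body,
        headers={"unstract-key": api_key, "Content-Type": content_type},
        method="POST",
    )


request = build_whisper_request(
    "<YOUR API KEY>",
    {"processing_mode": "ocr", "output_mode": "line-printer"},
    document_url="https://your-url-to-process.pdf",
)
print(request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it. Boolean parameters are written as the strings `true` and `false` here to match the curl examples.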

Response

HTTP Status	Content-Type	Headers	Description
200	`text/plain`	`Whisper-Hash`	The API will return the extracted text in the response body. The header will contain `Whisper-Hash` which can be used to extract highlighting info if required.
202	`application/json`		The API will return a JSON body with a `whisper-hash`, which can be used with the status API to check the status and later retrieve the extracted text. See the example below for the JSON format.

Example `202` Response

{
    "message": "Processing time exceeded X seconds. Use the status...",
    "status": "processing",
    "whisper-hash": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}
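
A client has to tell the two documented responses apart. The following Python sketch shows one way to do that. The function name `interpret_whisper_response` and the shape of the returned dictionary are hypothetical, and decoding the body as UTF-8 is an assumption.

```python
import json


def interpret_whisper_response(status: int, headers: dict, body: bytes) -> dict:
    """Map a 200 or 202 response to a small result dictionary."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if status == 200:
        # Sync result: text in the body, hash in the Whisper-Hash header.
        return {
            "done": True,
            "text": body.decode("utf-8"),  # assumption: UTF-8 text
            "whisper_hash": lowered.get("whisper-hash"),
        }
    if status == 202:
        # Async result: the hash arrives in the JSON body.
        payload = json.loads(body)
        return {"done": False, "text": None, "whisper_hash": payload["whisper-hash"]}
    raise ValueError(f"Unexpected HTTP status: {status}")


sample_202 = b'{"message": "...", "status": "processing", "whisper-hash": "xxxx"}'
result = interpret_whisper_response(202, {}, sample_202)
print(result["done"], result["whisper_hash"])  # → False xxxx
```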
