Extraction API
Convert your PDF or scanned documents to a text format that LLMs can use.
Endpoint | /whisper |
URL | https://llmwhisperer-api.unstract.com/v1/whisper |
Method | POST |
Headers | unstract-key: <YOUR API KEY> |
Body | application/octet-stream |
Parameters
Parameter | Type | Default | Required | Description |
---|---|---|---|---|
url (deprecated, use url_in_post instead) | string | | No | The default behaviour of the API is to process the document sent in the request body. To process a document from a URL instead, provide the URL here. The URL must be accessible without authentication. If the request body is empty, the API will try to process the document from the URL. |
processing_mode | string | | Yes | The processing mode to be used. Choose between ocr and text. |
output_mode | string | | Yes | The output mode to be used. Choose between line-printer and text. |
page_seperator | string | <<< | No | The string to be used as a page separator. |
force_text_processing | boolean | false | No | If set to true, the document will be processed as text only. If set to false, the document will be processed based on LLMWhisperer's chosen strategy. |
pages_to_extract | string | | No | Define which pages to extract. By default all pages are extracted. For example, 1-5,7,21- will extract pages 1,2,3,4,5,7,21,22,23,24... up to the last page. |
timeout | integer | 200 | No | The time in seconds after which the request will automatically switch to async mode. If a timeout occurs then the API will return a 202 message along with whisper-hash which can be used later to check processing status and retrieve the text. Refer to the async operation documentation for more information |
store_metadata_for_highlighting | boolean | false | No | If set to true, the metadata required for highlighting is stored. If you do not require the highlighting API, set this to false. Note that setting this to true will store your text on our servers. |
median_filter_size | integer | 0 | No | The size of the median filter to be applied to the image. This is used to remove noise from the image. This parameter works only in the on-prem version of LLMWhisperer. |
gaussian_blur_radius | integer | 0 | No | The radius of the Gaussian blur to be applied to the image. This is used to remove noise from the image. This parameter works only in the on-prem version of LLMWhisperer. |
ocr_provider | string | advanced | No | The OCR provider to be used. Choose between simple and advanced. This parameter works only in the on-prem version of LLMWhisperer. |
line_splitter_tolerance | float | 0.4 | No | Factor to decide when to move text to the next line when it is above or below the baseline. The default value of 0.4 signifies 40% of the average character height |
horizontal_stretch_factor | float | 1.0 | No | Factor by which a horizontal stretch is applied. It defaults to 1.0. A stretch factor of 1.1 applies a 10% stretch. Normally this factor need not be adjusted. You might want to use this parameter when the columns of a multi-column layout run into each other, for example when the two columns of a two-column layout get merged into one. |
url_in_post | boolean | false | No | If set to true, the URL of the document to download is sent in the POST body. See the curl example below. |
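The range syntax accepted by pages_to_extract can be illustrated with a small client-side sketch. The helper expand_pages is hypothetical and not part of the API. It assumes pages are numbered from 1 and that an open-ended range such as 21- runs to the last page, as the example in the table suggests. The server's own parser may treat edge cases differently.

```python
def expand_pages(spec: str, total_pages: int) -> list[int]:
    """Expand a pages_to_extract string such as "1-5,7,21-" into page numbers.

    Hypothetical helper for illustration only. It mirrors the example in the
    parameter table and is not LLMWhisperer's own parser.
    """
    if not spec.strip():
        # An empty spec means "all pages" (the documented default).
        return list(range(1, total_pages + 1))

    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)
            # An open-ended range like "21-" runs to the last page.
            end = int(end_text) if end_text else total_pages
        else:
            start = end = int(part)
        # Assumption: pages beyond the end of the document are ignored.
        pages.update(range(start, min(end, total_pages) + 1))
    return sorted(pages)
```

For a 24-page document, expand_pages("1-5,7,21-", 24) yields pages 1 to 5, 7, and 21 to 24, matching the example in the table.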
Processing Modes
OCR [ocr] mode extracts text by considering the entire page as an image. Though this mode produces the best results, it is very slow compared to text mode.
Text [text] mode extracts text by directly extracting text embedded inside PDF files. This mode is very fast compared to OCR mode. It is recommended to use text mode when dealing with documents that contain proper text.
We recommend using OCR mode when dealing with the following types of documents:
- Scanned documents
- Documents with non-standard text layout
- Documents with handwriting
- Documents with form elements like checkboxes, radio buttons, etc.
We recommend using Text mode when dealing with the following types of documents:
- Documents which contain proper text. Typically, documents generated by software like MS Word, Google Docs, etc. fall into this category.
- Software-generated PDFs like invoices, receipts, etc.
- Large documents with many pages
Forcing Text Processing
Sometimes forcing text processing might be required. LLMWhisperer might switch to OCR mode even if the document contains proper text. This can happen under certain circumstances, such as:
- Document contains a background image which we are not able to remove properly. Forcing text mode will ignore the background image and extract text.
- Document contains watermark images in the background and the text in the watermarks gets extracted as regular text in the result.
- Certificates containing decorative or watermark text in the background image.
If you are sure that the document contains proper text and you want to extract only the text, forcing text mode will not only yield better results but also increase processing speed significantly.
You can force text processing by setting force_text_processing to true.
Output Modes
Line Printer [line-printer] mode tries to extract the text from the document as is, maintaining the structural layout of the document. This works very well for LLM consumption. This mode uses many techniques to provide the text in the best possible way for LLMs. It also removes white spaces and other unwanted characters from the text to make the result more cost-effective for LLMs.
Text [text] mode extracts the text from the document without applying any processing or intelligence. This mode is useful when the line-printer mode is not able to extract the text properly. This can happen if the document contains too many different fonts and font sizes.
Sync/Async Mode and Timeout
The API can be used in both sync and async modes. If the processing takes more than the supplied timeout value, the API will automatically switch to async mode.
For example, if you set the timeout to 20 seconds and the processing takes more than 20 seconds, the API will return a 202 message along with a whisper-hash which can be used later to check processing status and retrieve the text.
In Async mode, for safety and privacy reasons the extracted text is stored on our servers in memory only for 60 minutes. The text is deleted after this time. Make sure that the text is retrieved within this time frame.
Note that there is a hard timeout of 200 seconds. The call will switch to async mode after 200 seconds even if the timeout is set to a higher value.
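The interaction between the timeout parameter and the hard limit can be sketched as below. Both function names are hypothetical. The 200-second cap and the 200/202 status codes come from this document, and the model ignores network latency and other real-world overhead.

```python
HARD_TIMEOUT_SECONDS = 200  # hard limit stated in this section


def effective_timeout(requested_seconds: int) -> int:
    """Return the number of seconds after which a call switches to async mode.

    Values above the hard limit are capped at 200 seconds.
    """
    return min(requested_seconds, HARD_TIMEOUT_SECONDS)


def expected_status(processing_seconds: float, requested_timeout: int = 200) -> int:
    """Return the documented HTTP status: 200 for sync, 202 for async.

    Simplified model (an assumption): only the server-side processing time
    is compared against the effective timeout.
    """
    if processing_seconds > effective_timeout(requested_timeout):
        return 202  # response body carries a whisper-hash for later retrieval
    return 200      # response body carries the extracted text
```

With a timeout of 20, a document that needs 25 seconds of processing would be answered with 202, while setting the timeout to 500 still results in a 202 for anything that takes longer than 200 seconds.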
Metadata for Highlighting
If you are using the highlighting API, you can set store_metadata_for_highlighting to true. This will store the metadata required for highlighting. If you do not require the highlighting API, set this to false. Note that setting this to true will store your text on our servers.
Request Body
The request body should contain the PDF or scanned document that needs to be converted to text. The document should be sent in application/octet-stream format.
If you are using the url_in_post parameter, the URL should be sent in the request body, and the Content-Type should be text/plain. See the curl example below.
Example Curl Requests
Upload document
curl -X POST --location 'https://llmwhisperer-api.unstract.com/v1/whisper?force_text_processing=true&processing_mode=text&output_mode=line-printer' \
-H 'Content-Type: application/octet-stream' \
-H 'unstract-key: <Your API Key>' \
--data-binary '@your-file-to-process.pdf'
Process document from URL
curl -X POST --location 'https://llmwhisperer-api.unstract.com/v1/whisper?force_text_processing=false&processing_mode=ocr&output_mode=line-printer&url_in_post=true' \
-H 'Content-Type: text/plain' \
-H 'unstract-key: <Your API Key>' \
--data 'https://your-url-to-process.pdf'
To include the headers in the response, use curl -i in the request.
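For readers calling the API from Python, the two curl requests above could be assembled roughly as follows using only the standard library. The function build_whisper_request and its argument names are hypothetical. The endpoint URL, header names, content types, and query parameters are taken from this document. Booleans are written as lowercase strings to match the curl examples.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://llmwhisperer-api.unstract.com/v1/whisper"


def build_whisper_request(
    api_key: str,
    *,
    document: Optional[bytes] = None,
    document_url: Optional[str] = None,
    processing_mode: str = "ocr",
    output_mode: str = "line-printer",
    **extra_params,
) -> Request:
    """Build (but do not send) a POST request for the /whisper endpoint."""
    if (document is None) == (document_url is None):
        raise ValueError("Provide exactly one of document or document_url")

    params = {"processing_mode": processing_mode, "output_mode": output_mode}
    params.update(extra_params)

    if document_url is not None:
        # URL mode: the URL itself is the body, sent as text/plain.
        params["url_in_post"] = True
        body = document_url.encode("utf-8")
        content_type = "text/plain"
    else:
        # Upload mode: the raw file bytes are the body.
        body = document
        content_type = "application/octet-stream"

    # Render booleans as "true"/"false", as in the curl examples.
    query = urlencode(
        {k: str(v).lower() if isinstance(v, bool) else v for k, v in params.items()}
    )
    headers = {"unstract-key": api_key, "Content-Type": content_type}
    return Request(f"{BASE_URL}?{query}", data=body, headers=headers, method="POST")
```

The returned object can be passed to urllib.request.urlopen to perform the call. Building the request separately from sending it makes the query string and headers easy to inspect before any document leaves the machine.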
Response
HTTP Status | Content-Type | Headers | Description |
---|---|---|---|
200 | text/plain | Whisper-Hash | The API will return the extracted text in the response body. The header will contain Whisper-Hash which can be used to extract highlighting info if required. |
202 | application/json | | The API will return a JSON with whisper-hash which can be used with the status API to get status and later retrieve the extracted text. Refer below for JSON format. |
Example 202 Response
{
  "message": "Processing time exceeded X seconds. Use the status...",
  "status": "processing",
  "whisper-hash": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}
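A client typically needs to branch on the two documented status codes. The sketch below shows one way to do that. The function interpret_response and the shape of the dictionary it returns are hypothetical. It assumes the response body is UTF-8 encoded and that the caller passes headers as a plain dictionary keyed exactly as Whisper-Hash.

```python
import json


def interpret_response(status: int, headers: dict, body: bytes) -> dict:
    """Turn a /whisper response into a small result dictionary.

    Hypothetical helper: only the 200 and 202 cases described in the
    response table are handled.
    """
    if status == 200:
        # Sync result: the body is the extracted text, the hash is in a header.
        return {
            "done": True,
            "text": body.decode("utf-8"),  # assumption: UTF-8 body
            "whisper_hash": headers.get("Whisper-Hash"),
        }
    if status == 202:
        # Async result: the body is JSON carrying the hash for the status API.
        payload = json.loads(body)
        return {
            "done": False,
            "status": payload["status"],
            "whisper_hash": payload["whisper-hash"],
        }
    raise RuntimeError(f"Unexpected HTTP status {status}")
```

When done is False, the whisper_hash value is what the status API needs in order to check progress and retrieve the text later.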