Extraction API
Convert your PDFs and scanned documents to text that LLMs can consume.
Endpoint | /whisper |
URL | https://llmwhisperer-api.us-central.unstract.com/api/v2/whisper |
Method | POST |
Headers | unstract-key: <YOUR API KEY> |
Body | application/octet-stream |
Parameters
Parameter | Type | Default | Required | Description |
---|---|---|---|---|
mode | string | form | No | The processing mode to be used. Refer to the modes section for more information. |
output_mode | string | layout_preserving | No | layout_preserving or text output mode |
page_seperator | string | <<< | No | The string to be used as a page separator. |
pages_to_extract | string | | No | Define which pages to extract. By default, all pages are extracted. Example: 1-5,7,21- extracts pages 1, 2, 3, 4, 5, 7, 21, 22, 23, 24, and so on up to the last page. |
median_filter_size | integer | 0 | No | The size of the median filter to be applied to the image. This is used to remove noise from the image. This parameter works only in the low_cost mode |
gaussian_blur_radius | integer | 0 | No | The radius of the gaussian blur to be applied to the image. This is used to remove noise from the image. This parameter works only in the low_cost mode |
line_splitter_tolerance | float | 0.4 | No | Factor to decide when to move text to the next line when it is above or below the baseline. The default value of 0.4 signifies 40% of the average character height |
line_splitter_strategy | string | left-priority | No | The line splitter strategy to use. An advanced option for customizing the line splitting process. Refer to the documentation below |
horizontal_stretch_factor | float | 1.0 | No | Factor by which the text is stretched horizontally. A value of 1.1 applies a 10% stretch. Normally this factor need not be adjusted. It can help when the columns of a multi-column layout merge into each other, for example when a two-column layout is extracted as a single column. |
url_in_post | boolean | false | No | If set to true, the URL of the document to download is sent in the POST body. See the curl example below. |
url (deprecated, use url_in_post instead) | string | | No | The default behaviour of the API is to process the document sent in the request body. If you want to process a document from a URL, you can provide the URL here. The URL should be accessible without any authentication. If the request body is empty, the API will try to process the document from the URL. |
mark_vertical_lines | boolean | false | No | Whether to reproduce vertical lines in the document. Note: This parameter is not applicable if mode=native_text. |
mark_horizontal_lines | boolean | false | No | Whether to reproduce horizontal lines in the document. Note: This parameter is not applicable if mode=native_text and will not work if mark_vertical_lines is set to false. |
lang | string | eng | No | The language hint for OCR. The language is currently auto-detected, so this parameter is ignored in this version. |
tag | string | default | No | Auditing feature. Set a value which will be associated with the invocation of the API. This can be used for cross referencing in usage reports |
file_name | string | | No | Auditing feature. Set a value which will be associated with the invocation of the API. This can be used for cross referencing in usage reports |
use_webhook | string | | No | The name of the webhook to call after the conversion is complete. The name should have been registered earlier using the webhooks management endpoint. |
webhook_metadata | string | | No | Any metadata which should be sent to the webhook. This data is sent verbatim to the callback endpoint. Refer to the webhooks documentation. |
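
The pages_to_extract syntax can be illustrated with a small local sketch. The helper below (expand_pages, a hypothetical name that is not part of the API) expands a specification such as 1-5,7,21- into explicit page numbers for a document of known length. It only mirrors the behaviour described in the table; how the service itself handles out-of-range pages is not documented here, so the clamping to total_pages is an assumption.

```python
def expand_pages(spec: str, total_pages: int) -> list[int]:
    """Expand a pages_to_extract string such as "1-5,7,21-" into page numbers."""
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)
            # An open-ended range such as "21-" runs to the last page.
            end = int(end_text) if end_text else total_pages
            # Assumption: ranges are clamped to the document length.
            pages.extend(range(start, min(end, total_pages) + 1))
        else:
            page = int(part)
            if page <= total_pages:
                pages.append(page)
    return pages


print(expand_pages("1-5,7,21-", 24))
# prints [1, 2, 3, 4, 5, 7, 21, 22, 23, 24]
```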
Modes
Refer to detailed comparison of modes here.
Use native_text only when dealing with the following types of documents:
- Documents which contain proper text. Typically, documents generated by software like MS Word, Google Docs, etc. fall into this category.
- Software-generated PDFs like invoices, receipts, etc.
The native_text mode is much faster than the OCR-based modes. It is recommended for documents that contain native text (not scanned pages).
Output Modes
Layout preserving (layout_preserving) mode tries to extract the text from the document as is, maintaining the structural layout of the document. This works very well for LLM consumption. This mode uses many techniques to provide the text in the best possible way for LLMs. It also removes white spaces and other unwanted characters from the text to make the result more cost-effective for LLMs.
Text (text) mode extracts the text from the document without applying any processing or intelligence. This mode is useful when the layout_preserving mode is not able to extract the text properly. This can happen if the document contains too many different fonts and font sizes.
Request Body
The request body should contain the PDF/Scanned document that needs to be converted to text. The document should be in application/octet-stream format.
If you are using the url_in_post parameter, the URL should be sent in the request body. Content-Type should be text/plain. See curl example below.
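The two request shapes described above can be sketched in Python using only the standard library. The helper names build_upload_request and build_url_request are hypothetical and exist only for illustration; the endpoint URL, the unstract-key header, and the content types come from this page. The functions only construct the requests and do not send anything.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://llmwhisperer-api.us-central.unstract.com/api/v2/whisper"


def build_upload_request(api_key: str, document: bytes, **params: str) -> Request:
    """Build a POST request that carries the document bytes in the body."""
    query = urlencode(params)
    return Request(
        f"{BASE_URL}?{query}" if query else BASE_URL,
        data=document,
        headers={
            "unstract-key": api_key,
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )


def build_url_request(api_key: str, document_url: str, **params: str) -> Request:
    """Build a POST request that carries the document URL as plain text."""
    query = urlencode({**params, "url_in_post": "true"})
    return Request(
        f"{BASE_URL}?{query}",
        data=document_url.encode("utf-8"),
        headers={"unstract-key": api_key, "Content-Type": "text/plain"},
        method="POST",
    )
```

A request built this way could then be sent with urllib.request.urlopen(request), for example after reading the file with open("your-file-to-process.pdf", "rb").read().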
Example Curl Requests
Upload document
curl -X POST --location 'https://llmwhisperer-api.us-central.unstract.com/api/v2/whisper?mode=form&output_mode=layout_preserving' \
-H 'Content-Type: application/octet-stream' \
-H 'unstract-key: <Your API Key>' \
--data-binary '@your-file-to-process.pdf'
Process document from URL
curl -X POST --location 'https://llmwhisperer-api.us-central.unstract.com/api/v2/whisper?mode=form&output_mode=layout_preserving&url_in_post=true' \
-H 'Content-Type: text/plain' \
-H 'unstract-key: <Your API Key>' \
--data 'https://your-url-to-process.pdf'
To include the response headers in the output, use curl -i in the request.
Response
A successful request will return a 202 status code with a JSON response containing the whisper_hash, which can be used to check the status of the conversion process.
The typical workflow is to call the /whisper API to convert your document to text format. Check the status of the conversion process by calling the /whisper-status API. Repeat this step until the status is processed. Once the conversion is done, retrieve the converted text by calling the /whisper-retrieve API. Another workflow is to use webhooks to get the converted text. Refer to the documentation for more information.
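The polling workflow described above can be sketched as a small loop. The request and response formats of /whisper-status and /whisper-retrieve are not covered in this section, so the sketch takes two caller-supplied functions, get_status and retrieve (hypothetical names), which are assumed to wrap those endpoints and return the status string and the extracted text respectively. The interval and attempt limit are arbitrary illustrative values.

```python
import time
from typing import Callable


def wait_for_text(
    whisper_hash: str,
    get_status: Callable[[str], str],
    retrieve: Callable[[str], str],
    interval_seconds: float = 2.0,
    max_attempts: int = 30,
) -> str:
    """Poll until the job status is "processed", then fetch the text."""
    for _ in range(max_attempts):
        # get_status is assumed to call /whisper-status for this hash.
        if get_status(whisper_hash) == "processed":
            # retrieve is assumed to call /whisper-retrieve for this hash.
            return retrieve(whisper_hash)
        time.sleep(interval_seconds)
    raise TimeoutError(f"job {whisper_hash} not processed after {max_attempts} checks")
```

Passing the endpoint wrappers in as arguments keeps the loop independent of any particular HTTP client and makes it easy to exercise with stubs.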
HTTP Status | Content-Type | Headers | Description |
---|---|---|---|
202 | application/json | | The API will return a JSON with whisper_hash which can be used with the status API to get status and later retrieve the extracted text. Refer below for JSON format |
Example 202 Response
{
"message": "Whisper Job Accepted",
"status": "processing",
"whisper_hash": "xxxxxa96|xxxxxxxxxxxxxxxxxxx4ed3da759ef670f"
}
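A response like the one above could be handled along these lines. The function name parse_accepted is hypothetical; the only fields it relies on are those shown in the example response, and treating any status other than 202 as an error is an assumption made for brevity.

```python
import json


def parse_accepted(status_code: int, body: str) -> str:
    """Return the whisper_hash from a 202 response body."""
    if status_code != 202:
        # Assumption: anything other than 202 is treated as a failure here.
        raise RuntimeError(f"unexpected status code {status_code}: {body}")
    payload = json.loads(body)
    return payload["whisper_hash"]
```

The returned hash is the value to pass to the status and retrieve calls.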