Version: 2.0.0

Extraction API

Convert your PDF or scanned documents to a text format that can be used by LLMs.

Endpoint: /whisper
URL: https://llmwhisperer-api.us-central.unstract.com/api/v2/whisper
Method: POST
Headers: unstract-key: <YOUR API KEY>
Body: application/octet-stream

Parameters

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| mode | string | form | No | The processing mode to be used. Refer to the modes section for more information. |
| output_mode | string | layout_preserving | No | The output mode: layout_preserving or text. |
| page_seperator | string | <<< | No | The string to be used as a page separator. |
| pages_to_extract | string | | No | Defines which pages to extract. By default, all pages are extracted. Example: 1-5,7,21- extracts pages 1, 2, 3, 4, 5, 7, 21, 22, 23, 24 and so on up to the last page. |
| median_filter_size | integer | 0 | No | The size of the median filter to be applied to the image. This is used to remove noise from the image. This parameter works only in the low_cost mode. |
| gaussian_blur_radius | integer | 0 | No | The radius of the Gaussian blur to be applied to the image. This is used to remove noise from the image. This parameter works only in the low_cost mode. |
| line_splitter_tolerance | float | 0.4 | No | Factor that decides when text above or below the baseline is moved to the next line. The default value of 0.4 signifies 40% of the average character height. |
| line_splitter_strategy | string | left-priority | No | The line splitter strategy to use. An advanced option for customizing the line splitting process. Refer to the documentation below. |
| horizontal_stretch_factor | float | 1.0 | No | Factor by which the layout is stretched horizontally. A value of 1.1 applies a 10% stretch. Normally this factor need not be adjusted. It can help when multi-column layouts run into each other, for example when the two columns of a two-column layout get merged into one. |
| url_in_post | boolean | false | No | If set to true, the URL of the document to download is sent in the POST body. See the example below. |
| url (deprecated, use url_in_post instead) | string | | No | The default behaviour of the API is to process the document sent in the request body. If you want to process a document from a URL, you can provide the URL here. The URL should be accessible without any authentication. If the request body is empty, the API will try to process the document from the URL. |
| mark_vertical_lines | boolean | false | No | Whether to reproduce vertical lines in the document. Note: This parameter is not applicable if mode=native_text. |
| mark_horizontal_lines | boolean | false | No | Whether to reproduce horizontal lines in the document. Note: This parameter is not applicable if mode=native_text and will not work if mark_vertical_lines is set to false. |
| lang | string | eng | No | The language hint for OCR. The language is currently auto-detected, so this parameter is ignored in this version. |
| tag | string | default | No | Auditing feature. Set a value which will be associated with the invocation of the API. This can be used for cross-referencing in usage reports. |
| file_name | string | | No | Auditing feature. Set a value which will be associated with the invocation of the API. This can be used for cross-referencing in usage reports. |
| use_webhook | string | | No | The name of the webhook which should be called after the conversion is complete. The name should have been registered earlier using the webhooks management endpoint. |
| webhook_metadata | string | | No | Any metadata which should be sent to the webhook. This data is sent verbatim to the callback endpoint. Refer to the webhooks documentation. |
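
To make the pages_to_extract syntax concrete, here is a minimal Python sketch that expands a specification such as 1-5,7,21- into page numbers. It only illustrates the syntax and is not the service's own parser. The function name expand_pages and the last_page argument are hypothetical, and treating a missing range start (as in -5) as page 1 is an assumption that the text above does not confirm.

```python
def expand_pages(spec: str, last_page: int) -> list[int]:
    """Expand a pages_to_extract string such as '1-5,7,21-' into page numbers."""
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            # Assumption: an omitted start means page 1 (not stated in the docs).
            start = int(start_text) if start_text else 1
            # An omitted end (as in '21-') means "up to the last page".
            end = int(end_text) if end_text else last_page
        else:
            start = end = int(part)
        # Never go past the last page of the document.
        pages.extend(range(start, min(end, last_page) + 1))
    return pages


# For a 24-page document this mirrors the example in the parameter description.
print(expand_pages("1-5,7,21-", 24))
# → [1, 2, 3, 4, 5, 7, 21, 22, 23, 24]
```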

Modes

Refer to the detailed comparison of modes for more information.

Use native_text only when dealing with the following types of documents:

  • Documents which contain proper text. Typically, documents generated by software like MS Word, Google Docs, etc. fall into this category.
  • Software-generated PDFs like invoices, receipts, etc.

The native_text mode is much faster than the OCR-based modes. It is recommended for documents that contain native text (not scanned pages).

Output Modes

Layout preserving (layout_preserving) mode tries to extract the text from the document as is, maintaining the structural layout of the document. This works very well for LLM consumption. This mode uses many techniques to provide the text in the best possible way for LLMs. It also removes white spaces and other unwanted characters from the text to make the result more cost-effective for LLMs.

Text (text) mode extracts the text from the document without applying any processing or intelligence. This mode is useful when the layout_preserving mode is not able to extract the text properly. This can happen if the document contains too many different fonts and font sizes.

Request Body

The request body should contain the PDF or scanned document that needs to be converted to text. The document should be sent with the Content-Type application/octet-stream.

If you are using the url_in_post parameter, the URL should be sent in the request body. Content-Type should be text/plain. See curl example below.
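
The two request styles described above can be sketched in Python using only the standard library. This is an illustrative sketch rather than an official client: the helper name build_whisper_request and its arguments are hypothetical, while the URL, header name, and content types are taken from this page. The function only builds the request and does not send it.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://llmwhisperer-api.us-central.unstract.com/api/v2/whisper"


def build_whisper_request(api_key, *, document=None, document_url=None, **params):
    """Build (but do not send) a POST request for the /whisper endpoint."""
    if (document is None) == (document_url is None):
        raise ValueError("pass exactly one of document or document_url")
    if document_url is not None:
        # URL variant: the URL itself is the body, sent as text/plain.
        params["url_in_post"] = "true"
        body = document_url.encode("utf-8")
        content_type = "text/plain"
    else:
        # Upload variant: the raw document bytes are the body.
        body = document
        content_type = "application/octet-stream"
    query = urllib.parse.urlencode(params)
    url = f"{BASE_URL}?{query}" if query else BASE_URL
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"unstract-key": api_key, "Content-Type": content_type},
    )


# Placeholder bytes and key; replace them with a real document and API key.
request = build_whisper_request(
    "<YOUR API KEY>",
    document=b"%PDF-placeholder",
    mode="form",
    output_mode="layout_preserving",
)
print(request.full_url)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```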

Example Curl Requests

Upload document

curl -X POST --location 'https://llmwhisperer-api.us-central.unstract.com/api/v2/whisper?mode=form&output_mode=layout_preserving' \
-H 'Content-Type: application/octet-stream' \
-H 'unstract-key: <Your API Key>' \
--data-binary '@your-file-to-process.pdf'

Process document from URL

curl -X POST --location 'https://llmwhisperer-api.us-central.unstract.com/api/v2/whisper?mode=form&output_mode=layout_preserving&url_in_post=true' \
-H 'Content-Type: text/plain' \
-H 'unstract-key: <Your API Key>' \
--data 'https://your-url-to-process.pdf'

To include the response headers in the output, add the -i option to the curl command.

Response

A successful request will return a 202 status code with a JSON response containing the whisper_hash which can be used to check the status of the conversion process.

The typical workflow is to call the /whisper API to convert your document to text format, then check the status of the conversion process by calling the /whisper-status API. Repeat the status check until the status is processed. Once the conversion is done, retrieve the converted text by calling the /whisper-retrieve API. Another workflow is to use webhooks to receive the converted text. Refer to the documentation for more information.
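
The polling step of this workflow can be sketched as a small Python loop. Because this section does not describe the request format of /whisper-status, the status lookup is passed in as a hypothetical get_status callable that returns the current status string. The function name, the interval, and the max_checks limit are likewise assumptions chosen for illustration.

```python
import time


def wait_until_processed(whisper_hash, get_status, *, interval=2.0,
                         max_checks=30, sleep=time.sleep):
    """Poll get_status(whisper_hash) until it returns 'processed'.

    Returns the number of status checks that were needed.
    """
    for attempt in range(max_checks):
        status = get_status(whisper_hash)
        if status == "processed":
            return attempt + 1
        # Not done yet: wait before asking again.
        sleep(interval)
    raise TimeoutError(f"{whisper_hash} not processed after {max_checks} checks")


# Demonstration with a fake status source instead of a real API call.
fake_statuses = iter(["processing", "processing", "processed"])
checks = wait_until_processed(
    "example-hash",
    lambda _hash: next(fake_statuses),
    sleep=lambda _seconds: None,  # skip real waiting in the demo
)
print(checks)  # → 3
```

In real use, get_status would call the /whisper-status API, and the text would be fetched from /whisper-retrieve once the loop returns.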

| HTTP Status | Content-Type | Headers | Description |
|---|---|---|---|
| 202 | application/json | | The API will return a JSON with whisper_hash which can be used with the status API to get status and later retrieve the extracted text. Refer below for JSON format. |

Example 202 Response

{
  "message": "Whisper Job Accepted",
  "status": "processing",
  "whisper_hash": "xxxxxa96|xxxxxxxxxxxxxxxxxxx4ed3da759ef670f"
}
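
As a small illustration of handling this response, the Python sketch below checks for the 202 status and extracts the whisper_hash from the JSON body. The function name parse_accepted is hypothetical, and raising RuntimeError for any other status code is an assumption, since error responses are not described in this section.

```python
import json


def parse_accepted(status_code: int, body: str) -> str:
    """Return the whisper_hash from a 202 response body."""
    if status_code != 202:
        # Assumption: anything other than 202 is treated as a failure here.
        raise RuntimeError(f"unexpected HTTP status {status_code}")
    payload = json.loads(body)
    return payload["whisper_hash"]


example_body = """{
  "message": "Whisper Job Accepted",
  "status": "processing",
  "whisper_hash": "xxxxxa96|xxxxxxxxxxxxxxxxxxx4ed3da759ef670f"
}"""
print(parse_accepted(202, example_body))
```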