Version: 2.0.0

Extraction API

Convert your PDFs and scanned documents to a text format that LLMs can consume.


Endpoint	`/whisper`
URL	`https://llmwhisperer-api.us-central.unstract.com/api/v2/whisper`
Method	`POST`
Headers	`unstract-key: <YOUR API KEY>`
Body	Binary data (file contents)

Parameters

Parameter	Type	Default	Required	Description
mode	string	`form`	No	The processing mode to be used. Refer to the modes section for more information.
output_mode	string	`layout_preserving`	No	`layout_preserving` or `text` output mode
page_seperator	string	`<<<`	No	The string to be used as a page separator.
pages_to_extract	string		No	The pages to extract. By default, all pages are extracted. For example, `1-5,7,21-` extracts pages 1, 2, 3, 4, 5, 7, and 21 through the last page.
median_filter_size	integer	`0`	No	The size of the median filter to be applied to the image. This is used to remove noise from the image. This parameter works only in the `low_cost` mode.
gaussian_blur_radius	integer	`0`	No	The radius of the Gaussian blur to be applied to the image. This is used to remove noise from the image. This parameter works only in the `low_cost` mode.
line_splitter_tolerance	float	`0.4`	No	Factor to decide when to move text to the next line when it is above or below the baseline. The default value of `0.4` signifies 40% of the average character height
line_splitter_strategy	string	`left-priority`	No	The line splitter strategy to use. An advanced option for customizing the line splitting process. Refer to the documentation below
horizontal_stretch_factor	float	`1.0`	No	Factor by which the text is stretched horizontally. A factor of `1.1` applies a 10% stretch. Normally this factor need not be adjusted. You might want to use this parameter when multi-column layouts run into each other, for example when the two columns of a two-column layout get merged into one.
url_in_post	boolean	`false`	No	If set to `true`, the URL of the document to download is sent in the POST body. See the curl example below.
url (deprecated, use url_in_post instead)	string		No	The default behaviour of the API is to process the document sent in the request body. If you want to process a document from a URL, you can provide the URL here. The URL should be accessible without any authentication. If the request body is empty, the API will try to process the document from the URL.
mark_vertical_lines	boolean	`false`	No	Whether to reproduce vertical lines in the document. Note: This parameter is not applicable if mode=native_text.
mark_horizontal_lines	boolean	`false`	No	Whether to reproduce horizontal lines in the document. Note: This parameter is not applicable if mode=native_text and will not work if mark_vertical_lines is set to false.
lang	string	`eng`	No	The language hint for OCR. The language is currently auto-detected, so this parameter is ignored in this version.
tag	string	`default`	No	Auditing feature. Set a value which will be associated with the invocation of the API. This can be used for cross referencing in usage reports
file_name	string		No	Auditing feature. Set a value which will be associated with the invocation of the API. This can be used for cross referencing in usage reports
use_webhook	string		No	The name of the webhook to call after the conversion is complete. The name must already be registered using the webhooks management endpoint.
webhook_metadata	string		No	Any metadata which should be sent to the webhook. This data is sent verbatim to the callback endpoint. Refer to webhooks documentation.
add_line_nos	boolean	`false`	No	Adds line numbers to the extracted text and saves line metadata, which can be queried later using the highlights API.
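The `pages_to_extract` syntax can be illustrated with a small client-side sketch. The function below is hypothetical and is not part of the API; it only mirrors the documented example (`1-5,7,21-`) to show which pages a given string selects, assuming the total page count is known. How the service treats other inputs (duplicates, out-of-range pages) is not covered here.

```python
def expand_pages(spec: str, total_pages: int) -> list[int]:
    """Expand a pages_to_extract string such as '1-5,7,21-' into page numbers.

    Hypothetical helper for illustration only; the service does its own parsing.
    """
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)
            # An open-ended range such as '21-' runs to the last page.
            end = int(end_text) if end_text else total_pages
        else:
            start = end = int(part)
        # Clamp the upper bound to the document length.
        pages.extend(range(start, min(end, total_pages) + 1))
    return pages


if __name__ == "__main__":
    print(expand_pages("1-5,7,21-", total_pages=24))
```

For a 24-page document, the example string selects pages 1 to 5, page 7, and pages 21 to 24.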

Modes

Refer to the detailed comparison of modes for more information.

Use native_text only when dealing with the following types of documents:

Documents which contain proper text. Typically, documents generated by software like MS Word, Google Docs, etc. fall into this category.
Software-generated PDFs like invoices, receipts, etc.

The native_text mode is very fast compared to the OCR-based modes. It is recommended when dealing with documents that contain native text (not scanned pages).

Output Modes

Layout preserving (layout_preserving) mode tries to extract the text from the document as is, maintaining the structural layout of the document. This works very well for LLM consumption. This mode uses many techniques to provide the text in the best possible way for LLMs. It also removes white spaces and other unwanted characters from the text to make the result more cost-effective for LLMs.

Text (text) mode extracts the text from the document without applying any processing or intelligence. This mode is useful when the layout_preserving mode is not able to extract the text properly. This can happen if the document contains too many different fonts and font sizes.

Request Body

The request body should contain the PDF/Scanned document that needs to be converted to text. The document should be in application/octet-stream format.

If you are using the url_in_post parameter, the URL should be sent in the request body. Content-Type should be text/plain. See curl example below.

Example Curl Requests

Upload document

curl -X POST --location 'https://llmwhisperer-api.us-central.unstract.com/api/v2/whisper?mode=form&output_mode=layout_preserving' \
-H 'unstract-key: <Your API Key>' \
--data-binary '@your-file-to-process.pdf'

Process document from URL

curl -X POST --location 'https://llmwhisperer-api.us-central.unstract.com/api/v2/whisper?mode=form&output_mode=layout_preserving&url_in_post=true' \
-H 'unstract-key: <Your API Key>' \
-H 'Content-Type: text/plain' \
--data 'https://your-url-to-process.pdf'

To include the response headers in the output, add the `-i` flag to the curl command.
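The same two requests can be sketched in Python with the standard library. This is a minimal sketch, not an official client: `build_whisper_request` is a hypothetical helper, and the `application/octet-stream` content type for uploads is taken from the Request Body section rather than from the curl example. The function only builds the request, so it can be inspected before anything is sent.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://llmwhisperer-api.us-central.unstract.com/api/v2/whisper"


def build_whisper_request(api_key, params, file_bytes=None, document_url=None):
    """Build (but do not send) a POST request for the /whisper endpoint.

    Pass file_bytes to upload a document, or document_url to have the
    service download it (sent in the body with url_in_post=true).
    """
    if (file_bytes is None) == (document_url is None):
        raise ValueError("Provide exactly one of file_bytes or document_url")

    query = dict(params)
    headers = {"unstract-key": api_key}
    if document_url is not None:
        query["url_in_post"] = "true"
        headers["Content-Type"] = "text/plain"
        body = document_url.encode("utf-8")
    else:
        # Assumption: content type taken from the Request Body section.
        headers["Content-Type"] = "application/octet-stream"
        body = file_bytes

    full_url = BASE_URL + "?" + urllib.parse.urlencode(query)
    return urllib.request.Request(full_url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    req = build_whisper_request(
        "<Your API Key>",
        {"mode": "form", "output_mode": "layout_preserving"},
        document_url="https://your-url-to-process.pdf",
    )
    print(req.get_method(), req.full_url)
```

To send the request, pass the returned object to `urllib.request.urlopen`. The response body is the JSON described in the Response section.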

Response

A successful request will return a 202 status code with a JSON response containing the whisper_hash which can be used to check the status of the conversion process.

The typical workflow is to call the /whisper API to convert your document to text format, then check the status of the conversion by calling the /whisper-status API. Repeat the status check until the status is processed. Once the conversion is done, retrieve the converted text by calling the /whisper-retrieve API. An alternative workflow is to use webhooks to receive the converted text. Refer to the documentation for more information.
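The polling step of this workflow can be sketched as a small loop. This is an illustration under stated assumptions: `fetch_status` is a hypothetical callable that you supply, which calls /whisper-status for a given `whisper_hash` and returns the status string. The request format of /whisper-status is not described in this section, so it is deliberately left out. The interval and attempt limit are arbitrary example values.

```python
import time


def wait_until_processed(fetch_status, whisper_hash, interval_s=2.0,
                         max_attempts=30, sleep=time.sleep):
    """Poll until the conversion status is 'processed'.

    fetch_status: hypothetical callable wrapping the /whisper-status API;
                  takes a whisper_hash and returns the status string.
    Returns the number of status checks that were made.
    """
    for attempt in range(1, max_attempts + 1):
        if fetch_status(whisper_hash) == "processed":
            return attempt
        # Not done yet: wait before asking again.
        sleep(interval_s)
    raise TimeoutError(
        f"Conversion not processed after {max_attempts} status checks"
    )


if __name__ == "__main__":
    # Simulated status sequence standing in for real API calls.
    statuses = iter(["processing", "processing", "processed"])
    checks = wait_until_processed(lambda h: next(statuses), "example-hash",
                                  sleep=lambda s: None)
    print(f"processed after {checks} checks")
```

Once the loop returns, the text can be fetched with the /whisper-retrieve API. Passing the status call in as a function keeps the loop testable without network access.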

HTTP Status	Content-Type	Headers	Description
202	`application/json`		The API returns a JSON object with a `whisper_hash`, which can be used with the status API to check the status and later retrieve the extracted text. See the example below for the JSON format.

Example `202` Response

{
  "message": "Whisper Job Accepted",
  "status": "processing",
  "whisper_hash": "xxxxxa96|xxxxxxxxxxxxxxxxxxx4ed3da759ef670f"
}
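A client typically needs only the `whisper_hash` from this response. The following is a minimal sketch of that step; `extract_whisper_hash` is a hypothetical helper name, and the error handling is an assumption, since this section documents only the 202 case.

```python
import json


def extract_whisper_hash(status_code, body_text):
    """Return the whisper_hash from a 202 response body.

    Hypothetical helper: raises if the request was not accepted, because
    only the 202 response format is documented here.
    """
    if status_code != 202:
        raise RuntimeError(f"Unexpected HTTP status: {status_code}")
    payload = json.loads(body_text)
    return payload["whisper_hash"]


if __name__ == "__main__":
    example_body = """
    {
      "message": "Whisper Job Accepted",
      "status": "processing",
      "whisper_hash": "xxxxxa96|xxxxxxxxxxxxxxxxxxx4ed3da759ef670f"
    }
    """
    print(extract_whisper_hash(202, example_body))
```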
