Set up LLMWhisperer Text Extractor-v1

Reliable answers need reliable input data. This is where LLMWhisperer comes in. It can extract text from text-based PDFs, perform OCR on image-based PDFs, or extract text in layout-preserving mode from any PDF, text-based or otherwise.

LLMWhisperer is a specialized document text extractor developed by the Unstract team specifically for use with Large Language Models. It is available as a free service with a generous daily quota, and as a paid service should you need to process more documents than the free quota allows.

In this section, we'll see how to create a free LLMWhisperer account and an LLMWhisperer API key, which we'll use to let the Unstract Platform connect to and use the LLMWhisperer service. So, let's get started.

Head to the LLMWhisperer portal and click on the "Sign Up" link in the top navigation to sign up for a free account. It's a pretty standard sign-up process. Once you fill out the sign-up form, you should get a verification email. It should contain a link for you to click on to verify your address, after which you can log in with the email and password you specified.

info

When you sign up for LLMWhisperer, you are automatically subscribed to the Tier-1: Free 100 Pages Per Day plan. With this plan you can process up to 100 pages a day completely free, with no credit card required. To increase the usage limit, visit LLMWhisperer plans.

Once you're logged into the service successfully, let's get you an API key. Click on "Profile" in the top navigation menu and then click on "show" to reveal the API key.

Now that we have an API Key for the LLMWhisperer Text Extractor service, let's connect it to the Unstract Platform.

Sign in to the Unstract Platform
From the side navigation menu, choose Settings 🞂 Text Extractor
Click on the New Text Extractor button
From the list of supported Text Extractors, choose LLMWhisperer. You should see a dialog box where you enter details.
For Name, enter a name for this connector.
Leave the URL field at its default value.
In Unstract Key, enter the key created in the above section.
In Mode, choose
- Native Text - To process text-based files.
- Low Cost - To process high-quality PDF documents.
- High Quality - To process medium- or low-quality scanned PDFs and handwritten documents.
- Form - To process PDFs with checkbox and radio button detection.
- See the image below for a comparison of the modes.
In Output Mode, choose
- layout_preserving - To extract the text line by line.
- text - Keeps the context of the document in place.
Line Splitter Tolerance - The default value is 0.4. Reduce this value to split lines less often, or increase it to split lines more often. Useful when PDFs have a multi-column layout in which the text in each column is not aligned.
Horizontal Stretch Factor - The default value is 1. Increase this value to stretch text horizontally, or decrease it to compress text horizontally. Useful when text in multiple columns merges together.
Page number(s) or range to extract - Specify the range of pages to extract (e.g., 1-5, 7, 10-12, 50-). Leave it empty to extract all pages.
Page separator - Specify a pattern to separate the pages in the document (e.g., <<< {{page_no}} >>>, <<< >>>). This pattern will be inserted at the end of every page. Omit page_no if you don't want to include the page number in the separator.
Mark Vertical Lines & Mark Horizontal Lines
- Both buttons must be in the same state (either both checked or both unchecked) for the operation to proceed.
- If checked, the table structure (border and internal lines) is extracted and drawn with dotted lines. Check these buttons if the table is clearly outlined in the document.
Leave the Median Filter Size and Gaussian Blur Radius at their default values.
Click on Test Connection and ensure it succeeds. Finally, click on Submit to create a new Text Extractor Profile for use in your Unstract projects.
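
If you'd like to see the page range, page separator and line-marking rules from the steps above in a more concrete form, here's a minimal Python sketch. It is only an illustration of how those fields could be interpreted, not the code Unstract or LLMWhisperer actually runs. All function names are hypothetical, and the sketch assumes that page numbers start at 1 and that {{page_no}} is replaced by plain text substitution.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical helpers for illustration only; none of these names come from
# Unstract or LLMWhisperer.

# (start, end) pair; end=None means "up to the last page" (e.g. "50-").
PageRange = Tuple[int, Optional[int]]

_RANGE_RE = re.compile(r"(\d+)(?:\s*(-)\s*(\d*))?")


def parse_page_ranges(spec: str) -> Optional[List[PageRange]]:
    """Parse a spec such as "1-5, 7, 10-12, 50-".

    An empty spec returns None, which stands for "extract all pages".
    """
    if not spec.strip():
        return None
    ranges: List[PageRange] = []
    for part in spec.split(","):
        match = _RANGE_RE.fullmatch(part.strip())
        if match is None:
            raise ValueError(f"invalid page range: {part!r}")
        start = int(match.group(1))
        has_dash, end_text = match.group(2), match.group(3)
        if not has_dash:
            end: Optional[int] = start      # single page, e.g. "7"
        elif end_text:
            end = int(end_text)             # closed range, e.g. "10-12"
        else:
            end = None                      # open range, e.g. "50-"
        # Assumption: pages are numbered from 1 and ranges run forwards.
        if start < 1 or (end is not None and end < start):
            raise ValueError(f"invalid page range: {part!r}")
        ranges.append((start, end))
    return ranges


def page_selected(ranges: Optional[List[PageRange]], page_no: int) -> bool:
    """Return True if page_no falls inside any parsed range."""
    if ranges is None:
        return True
    return any(
        start <= page_no and (end is None or page_no <= end)
        for start, end in ranges
    )


def render_page_separator(pattern: str, page_no: int) -> str:
    """Fill in {{page_no}} if the pattern contains it (assumed behaviour)."""
    return pattern.replace("{{page_no}}", str(page_no))


def line_marks_consistent(mark_vertical: bool, mark_horizontal: bool) -> bool:
    """Both line-marking options must be checked or unchecked together."""
    return mark_vertical == mark_horizontal


if __name__ == "__main__":
    selected = parse_page_ranges("1-5, 7, 10-12, 50-")
    print(selected)
    print(page_selected(selected, 72))
    print(render_page_separator("<<< {{page_no}} >>>", 3))
    print(line_marks_consistent(True, False))
```

Running the sketch against the example values from the steps is a quick way to sanity-check a page range before you type it into the dialog box.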

This was the last step in connecting the various dependencies you need to start using the Unstract Platform!