
LLMWhisperer Text Extractor

LLMWhisperer is a technology that presents data from complex documents to LLMs in the form they can best understand.

Getting started with LLMWhisperer Text Extractor

info

When you sign up for LLMWhisperer, you are automatically subscribed to the Tier-1: Free 100 Pages Per Day plan. With this plan you can process up to 100 pages a day free of charge, with no credit card required. To increase the usage limit, see the LLMWhisperer plans.

  1. Sign up for an LLMWhisperer account and sign in.

  2. To get the API key, click "Profile" in the top navigation menu, then click "show" to reveal the key.

    img LLMWhisperer Text Extractor Configuration

Setting up the LLMWhisperer connector in Unstract

Now that we have an API key from LLMWhisperer, we can use it to set up a Text Extractor profile on the Unstract platform. To do so:

  • Sign in to the Unstract Platform

  • From the side navigation menu, choose Settings 🞂 Text Extractor

  • Click on the New Text Extractor button.

  • From the list of text extractors, choose LLMWhisperer. A dialog box opens where you enter the details.

    img LLMWhisperer Text Extractor Configuration

  • For Name, enter a name for this connector.

  • Leave the URL field at its default value.

  • In Unstract Key, enter the API key obtained in the previous section.

  • Enable the Force Text Processing check box to process only text-based files.

  • In Processing Mode, choose one of:

    • text - To process text-based files.
    • OCR - To process image-based files, including scanned images.
  • In Output Mode, choose one of:

    • line printer - To extract the text line by line.
    • text - Keeps the context of the document in place.
    • dump-text - Outputs a raw dump of the text in the pages.
  • For Median Filter Size, enter the window size. For example, an input of 3 gives a 3x3 filter window.

    • A median filter reduces noise by replacing each data point with the median of its neighbors, effectively smoothing the data while preserving edges. The window size determines the extent of smoothing: a larger window size increases noise reduction but can blur edges, whereas a smaller window size maintains detail but is less effective at noise removal.
  • For Gaussian Blur Radius, enter the blur radius.

    • Gaussian blur is a filter that smooths images by averaging pixel values with their neighbors, where the weights decrease with distance according to a Gaussian distribution. The radius, or standard deviation of the Gaussian function, controls the extent of blurring: a larger radius results in more blurring and a smoother image, while a smaller radius preserves more of the original details but provides less noise reduction.
  • Click on Test Connection and ensure it succeeds. Finally, click on Submit to create a new Text Extractor profile for use in your Unstract projects.
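Before filling in the dialog, it can help to write the intended values down and sanity-check them. The following is only an illustrative sketch: the class name `TextExtractorSettings`, its field names, the mode strings, and the default values are assumptions made for this example and are not part of any Unstract or LLMWhisperer API.

```python
from dataclasses import dataclass

# Mode names as listed in the dialog description above; the exact strings
# are an assumption made for this sketch.
PROCESSING_MODES = {"text", "ocr"}
OUTPUT_MODES = {"line-printer", "text", "dump-text"}


@dataclass
class TextExtractorSettings:
    """Hypothetical container mirroring the fields of the dialog."""

    name: str
    processing_mode: str = "ocr"
    output_mode: str = "line-printer"
    force_text_processing: bool = False
    median_filter_size: int = 0
    gaussian_blur_radius: float = 0.0

    def validate(self) -> list[str]:
        """Return a list of human-readable problems (empty if none)."""
        problems = []
        if not self.name.strip():
            problems.append("Name must not be empty.")
        if self.processing_mode not in PROCESSING_MODES:
            problems.append(f"Unknown processing mode: {self.processing_mode!r}")
        if self.output_mode not in OUTPUT_MODES:
            problems.append(f"Unknown output mode: {self.output_mode!r}")
        if self.median_filter_size < 0:
            problems.append("Median filter size must not be negative.")
        if self.gaussian_blur_radius < 0:
            problems.append("Gaussian blur radius must not be negative.")
        return problems


settings = TextExtractorSettings(
    name="invoices",
    processing_mode="ocr",
    output_mode="text",
    median_filter_size=3,
    gaussian_blur_radius=1.0,
)
print(settings.validate())
```

An empty list means the values are at least internally consistent; whether the connection itself works is still decided by Test Connection.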

Median filter and Gaussian blur: how they work

Median Filter

The median filter is a powerful tool in image processing for removing noise while preserving edges. Let's see a simple example.

  • INPUT: Imagine a small grayscale image with the following pixel intensities (0 represents black, 255 represents white):

     50  100  150  (Top row)
     75   20  225  (Middle row) - Noise! (pixel value 20 is much lower than the others)
    125  175  250  (Bottom row)
  • FILTER: The noisy pixel is replaced with the median of the nine values in its 3x3 window (the pixel itself and its eight neighbors).

    • Sorted, the values are 20, 50, 75, 100, 125, 150, 175, 225, 250, and the middle (fifth) value is 125. So 20 is replaced with 125.
  • OUTPUT: The filtered image has the following pixel intensities:

     50  100  150  (Top row)
     75  125  225  (Middle row) - the noisy value 20 has been replaced with 125
    125  175  250  (Bottom row)
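The median filter example can be reproduced in a few lines of Python. This is a minimal sketch that filters only the center pixel of the 3x3 image; the function name `median_of_window` is made up for this illustration and does not reflect how LLMWhisperer implements the filter internally.

```python
def median_of_window(window):
    """Return the median of a square window given as a list of rows.

    Assumes an odd number of values, as in a 3x3 or 5x5 window.
    """
    values = sorted(value for row in window for value in row)
    return values[len(values) // 2]


image = [
    [50, 100, 150],
    [75, 20, 225],   # 20 is the noisy pixel
    [125, 175, 250],
]

# Copy the image and replace the center pixel with the window median.
filtered = [row[:] for row in image]
filtered[1][1] = median_of_window(image)

for row in filtered:
    print(row)
```

Because the median ignores extreme values, the outlier 20 has no influence on the result, which is why this filter suits isolated speckle noise in scanned pages.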

Gaussian Blurring and Gaussian Radius

info

The Gaussian blur radius determines the spread of the Gaussian distribution that supplies the weights in the kernel. A larger radius increases the size of the kernel, extending the pixel region considered for averaging.

Gaussian blurring is a widely used image processing technique that reduces noise and creates a smooth, out-of-focus effect. Let's see a simple example.

  • INPUT: Imagine a small grayscale image with the following pixel intensities (0 represents black, 255 represents white):

    100  120  150
     80   90  110
     60   70   80

  • FILTER: A Gaussian kernel is a small matrix that defines the weights used for averaging. Here's a common 3x3 Gaussian kernel:

    1  2  1
    2  4  2
    1  2  1

  1. To compute the new value of the center pixel (90), multiply each pixel in the input image by the corresponding weight in the kernel:
       100 * 1 = 100    120 * 2 = 240    150 * 1 = 150
        80 * 2 = 160     90 * 4 = 360 (center, highest weight)    110 * 2 = 220
        60 * 1 = 60      70 * 2 = 140     80 * 1 = 80
  2. Sum the products: 100 + 240 + 150 + 160 + 360 + 220 + 60 + 140 + 80 = 1510.
  3. Divide the sum by the total weight (sum of all values in the kernel): 1510 / (1 + 2 + 1 + 2 + 4 + 2 + 1 + 2 + 1) = 1510 / 16 = 94.375, which rounds to 94.
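  • OUTPUT: The center pixel (originally 90) becomes approximately 94. Every other pixel is computed the same way from its own 3x3 neighborhood. For pixels on the image border, some neighbors fall outside the image, so their result depends on how those missing neighbors are filled in.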
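The weighted average described in the steps above can be checked with a short Python sketch. The function names are made up for this illustration, and the border handling (repeating the nearest edge pixel for neighbors that fall outside the image) is an assumption; other tools may treat borders differently and so produce different edge values.

```python
KERNEL = [
    [1, 2, 1],
    [2, 4, 2],
    [1, 2, 1],
]


def weighted_average(window, kernel):
    """Multiply each pixel by its kernel weight, sum, divide by total weight."""
    total = sum(
        pixel * weight
        for pixel_row, kernel_row in zip(window, kernel)
        for pixel, weight in zip(pixel_row, kernel_row)
    )
    total_weight = sum(weight for row in kernel for weight in row)
    return total / total_weight


def blur(image, kernel):
    """Apply the kernel to every pixel, repeating edge pixels at the border
    (an assumption made for this sketch)."""
    height, width = len(image), len(image[0])
    half = len(kernel) // 2

    def pixel_at(r, c):
        # Clamp coordinates so out-of-range neighbors reuse the nearest edge pixel.
        r = min(max(r, 0), height - 1)
        c = min(max(c, 0), width - 1)
        return image[r][c]

    return [
        [
            weighted_average(
                [
                    [pixel_at(r + dr, c + dc) for dc in range(-half, half + 1)]
                    for dr in range(-half, half + 1)
                ],
                kernel,
            )
            for c in range(width)
        ]
        for r in range(height)
    ]


image = [
    [100, 120, 150],
    [80, 90, 110],
    [60, 70, 80],
]

center = weighted_average(image, KERNEL)
print(center)  # 94.375, matching step 3

blurred = blur(image, KERNEL)
for row in blurred:
    print([round(value, 1) for value in row])
```

The center value does not depend on the border assumption, because its whole 3x3 neighborhood lies inside the image; only the eight outer pixels do.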