What's needed to get started

This guide helps you set up your Unstract environment. It assumes that one of the following is true:

You purchased an Unstract Cloud subscription after your trial ended.
You have an on-prem version of Unstract set up and ready to use.
You've set up the open source version of Unstract on your laptop or cloud environment.

Initial set up

You'll need access to four services to get started (don't worry, these are easy to set up and are available for free to begin with):

LLM (Large Language Model) service: this service takes in raw text, reasons over it, and structures the specific data or fields we care about. We will use OpenAI for this. While this guide walks you through setting up one specific LLM, you can consult the documentation for all supported LLMs to configure a different one.
Embedding Model: this service converts text into numeric representations (embeddings) based on its meaning, which helps organize data within a document or data source. This is useful when we need to extract relevant portions from large documents, based on a user's query or on the specific data we need to extract. We will use OpenAI for this as well. While this guide walks you through setting up one specific embedding model, you can consult the documentation for all supported embedding models to configure a different one.
Vector Database: this service works in conjunction with the embedding model to store the embedded representations of documents and retrieve the portions we're interested in. Put simply, the embedding model contains the logic to organize data, and the vector database stores and retrieves it. While this guide walks you through setting up one specific vector database, you can consult the documentation for all supported vector databases to configure a different one. Usually the data is persisted in the vector database under a table named unstract_<embedding dimension>. In the case of Postgres, this table is called data_unstract_<embedding dimension>. For example, if you are using an embedding model with dimension 1536, the nodes are stored under unstract_1536. For Postgres, this will be data_unstract_1536.
Text Extractor: this service, much like its name implies, extracts text from documents and images, typically PDFs. These PDFs can contain native text or consist entirely of scanned images. OCR is typically built into these services. In the near future, we're also adding the ability to use third-party OCR services. While this guide walks you through setting up one specific text extractor, you can consult the documentation for all supported text extraction services to configure a different one.
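
The table naming described for the vector database can be sketched as a small helper, which may be handy when looking for your data in the database. This is only an illustration of the convention stated above: the function name embedding_table_name and its postgres flag are hypothetical and are not part of Unstract.

```python
def embedding_table_name(dimension: int, postgres: bool = False) -> str:
    """Build the table name for a given embedding dimension.

    Hypothetical helper: it only mirrors the naming convention described
    in this guide (unstract_<dimension>, with a data_ prefix on Postgres).
    """
    if dimension <= 0:
        raise ValueError("embedding dimension must be a positive integer")
    base = f"unstract_{dimension}"
    # Per the guide, the Postgres table carries an extra "data_" prefix.
    return f"data_{base}" if postgres else base


print(embedding_table_name(1536))                 # unstract_1536
print(embedding_table_name(1536, postgres=True))  # data_unstract_1536
```

The dimension depends on the embedding model you configure, so check your model's documentation for the value to expect.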
