Getting Started
Suppose you want to extract structured data from a batch of credit card statements from different issuers. We use credit card statements as the example because, whatever your background, you almost certainly know what a typical statement looks like and what key information it contains; credit cards are a fairly common part of our lives. At the same time, every issuer has its own format, and even a single issuer's format can change from time to time. That makes these statements a good challenge for our first Unstract project.
Credit card statements are typically emailed to users as PDF documents. Like most unstructured documents, they contain largely the same key information (customer name, customer address, issuer name, statement date, list of spends, etc.) but vary widely in formatting and length, as discussed above. Getting data from such varied statements into a database or an application in structured form, for easy querying, analysis, or visualization, has never been easy. The Unstract Platform lets you do this without writing code by leveraging the power of Large Language Models.
We will build a simple, generic parser for credit card statements and then deploy it in two ways: as an API (to which you can send a PDF statement and get JSON data back) and as an ETL pipeline (which can structure PDF statements and push the data into a data warehouse or database for further analysis).
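To make the goal concrete, here is a minimal sketch of what the structured output for one statement might look like. It is an illustration only, not Unstract's actual output format: the class names, field names, and sample values are all assumptions, chosen to mirror the key information listed above (customer name, customer address, issuer name, statement date, list of spends).

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List


@dataclass
class Spend:
    """One line item on the statement (hypothetical shape)."""
    date: str
    description: str
    amount: float


@dataclass
class Statement:
    """Hypothetical target structure for one parsed statement."""
    customer_name: str
    customer_address: str
    issuer_name: str
    statement_date: str
    spends: List[Spend] = field(default_factory=list)


def to_json(statement: Statement) -> str:
    """Serialize a statement, including nested spends, to a JSON string."""
    return json.dumps(asdict(statement), indent=2)


# Made-up sample values, purely for illustration.
sample = Statement(
    customer_name="Jane Doe",
    customer_address="1 Example Street, Springfield",
    issuer_name="Example Bank",
    statement_date="2024-01-31",
    spends=[
        Spend(date="2024-01-05", description="Groceries", amount=42.5),
        Spend(date="2024-01-12", description="Fuel", amount=30.0),
    ],
)

if __name__ == "__main__":
    print(to_json(sample))
```

Whatever the issuer's layout, every statement would be reduced to this one shape, which is what makes the data easy to load into a database or query later.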