LLMChallenge: Introduction
Large Language Models work remarkably well for document data extraction use cases. The trouble with them, however, is that they can hallucinate. And worse, there is no reliable way to detect whether an LLM has indeed hallucinated. If Large Language Models are to be deployed in production for at-scale use cases, users need assurance that hallucinations are not slipping through: wrong extractions undermine trust in the system.
What is LLMChallenge?
LLMChallenge is an implementation of the LLM-as-a-judge technique, which is regarded as one of the most reliable approaches available today for ensuring accuracy and fighting hallucinations.
How does it work?
LLMChallenge uses not one, but two Large Language Models: the first is the extraction model and the second the challenger. For every field extraction, the challenger model scores the result produced by the extractor model. The two models "converse" with each other and the system observes whether they arrive at a consensus. If they do not arrive at a consensus, the value of the field being extracted is set to null. Unstract's philosophy is that a null value is better than a wrong value. As discussed above, wrong values undermine trust in the system and need to be avoided. If a system is found to be extracting wrong values often, users will be forced to second-guess every extraction, at which point they might as well do the extraction fully manually.
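To make the flow concrete, here is a minimal sketch of a single challenge round in Python. The `extractor` and `challenger` clients, the `complete()` method, the prompts, and the AGREE/DISAGREE protocol are all illustrative assumptions, not Unstract's actual API.

```python
# Minimal sketch of one LLMChallenge round. The complete() interface,
# prompts, and AGREE/DISAGREE protocol are illustrative assumptions,
# not Unstract's actual implementation.
from typing import Optional


def challenged_extract(document: str, field: str,
                       extractor, challenger) -> Optional[str]:
    """Return the extracted value only if both models reach consensus."""
    # Step 1: the extractor model proposes a value for the field.
    value = extractor.complete(
        f"Extract the value of the field '{field}' from:\n{document}"
    )
    # Step 2: the challenger model scores the proposed value.
    verdict = challenger.complete(
        f"Document:\n{document}\n"
        f"Proposed value for '{field}': {value}\n"
        "Reply AGREE if the value is correct, otherwise DISAGREE with a reason."
    )
    if verdict.strip().startswith("AGREE"):
        return value  # consensus: accept the extraction
    return None       # no consensus: a null value is better than a wrong one
```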
Increased accuracy
While the primary purpose of LLMChallenge is to detect and avoid hallucinations and wrong or ambiguous extractions, it has a very powerful side effect. During the challenge "conversation" between the extractor and challenger LLMs, if the extracted value was off the first time, the extraction LLM often corrects itself and responds with the right value. The challenger LLM still has to reach consensus for the corrected value to be accepted, but the percentage of right values extracted goes up, over and above the detection of hallucinations. This makes LLMChallenge a powerful ally when using Large Language Models at scale in production.
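One way this self-correction could look is sketched below, extending the single-round example above with a bounded revision loop. The MAX_ROUNDS cap and the prompts are, again, assumptions for illustration.

```python
from typing import Optional

MAX_ROUNDS = 3  # assumed cap on the extractor/challenger back-and-forth


def challenged_extract_with_revision(document: str, field: str,
                                     extractor, challenger) -> Optional[str]:
    """Let the extractor revise a challenged answer, up to MAX_ROUNDS."""
    value = extractor.complete(
        f"Extract the value of the field '{field}' from:\n{document}"
    )
    for _ in range(MAX_ROUNDS):
        verdict = challenger.complete(
            f"Document:\n{document}\n"
            f"Proposed value for '{field}': {value}\n"
            "Reply AGREE if correct, otherwise DISAGREE with a reason."
        )
        if verdict.strip().startswith("AGREE"):
            return value  # consensus reached: accept the (possibly revised) value
        # Feed the challenger's objection back to the extractor; this is
        # where an initially wrong value is often corrected.
        value = extractor.complete(
            f"Your answer '{value}' for '{field}' was challenged:\n{verdict}\n"
            f"Re-extract the field from:\n{document}"
        )
    return None  # still no consensus: the field is set to null
```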
Cost and latency impact
LLMChallenge increases operating costs, since every field extraction now involves a second model, but the increase is justified by the tremendous value it provides in terms of accuracy: users can rest assured that LLM hallucinations are not leading to wrong field values being extracted. How much the cost increases depends on the use case.
Extractions should also be expected to take longer than when LLMChallenge is disabled, because two LLMs are involved. Again, the added latency is worth the increase in accuracy, and the actual latency depends on the use case.