Comparing Different LLM Models For Data Extraction
A lot of human and computer processing time is spent getting data from one format to another. Many processes defy automation because they run on a long tail of unstructured data, human text, and documents designed for printing. The explosion of AI and large language models (LLMs) opens the door to fully automated data extraction at scale, both in terms of processing large amounts of documents and processing a long tail of formats without having to anticipate and build for all the possible options.
Too Many Models
Not all LLMs are equal. The naive approach is to pick the biggest, most robust model you can run and expect the best results. Other teams might spend cycles fine-tuning models or prompts for the best results. Neither approach will keep up with the rate of evolution in LLMs.
Major versions of LLMs are released several times a year, and minor iterations come out every couple of weeks. Couple that with the speed of evolution from communities like HuggingFace or LlamaHub, and teams could spend all of their time updating their code or risk falling behind their competitors.
Frameworks like LlamaIndex or LangChain are AI and model-agnostic, but developers still embed models and prompts into their code. AI coding is similar to the early days of UI development, where models, views, widgets, configuration, and localization were all coded together and not abstracted into various layers.
Frameworks like DSPy are trying to create modular approaches to separate business logic, prompts, and AI agents.
But any way you look at it, while LLMs open the door to automating a whole new area of data processing, they also open the door to a new type of software architecture, one where there is little established understanding of what it will take to keep a system built today up to date a year from now.
Comparing Multiple Models
Despite the speed of evolution, you have to start somewhere. I wanted to test how LLMs perform in a realistic scenario: read a resume, answer questions about the candidate, and return those answers in a well-defined format.
I ran a series of LLMs through the exact same task. I compared each LLM’s answers against known values and scored the results based on a) how many questions it could answer and b) how accurate the answers were.
This required the LLM to understand the structure of a document designed for humans, understand the document’s semantics, answer specific questions about the candidate using what it “read” from the resume, and construct an output that conforms to a strict schema.
RAGs Require At Least Two Models
LLMs know nothing beyond the data used to train them. For example, GPT-3 was released in 2020 and cannot answer questions about anything that has happened since then.
You can work around an LLM’s limited knowledge by giving it the necessary information at query time. I could put the resume into the request prompt or use a second model to convert the resume into a searchable database that extends the LLM’s knowledge of the world.
The searchable database scales better in terms of both complexity and performance. Using a database to extend your LLM is called Retrieval-Augmented Generation (RAG). This introduces a second model, an embedding model. LLMs do not work in human language. Human text is parsed into tokens, or snippets of text. The LLM models how these tokens interact and behave. LLMs don’t model human language; they model tokens.
For an LLM to pull in new human text, that text must also be turned into tokens. The embedding model then turns those tokens into vectors, which map a piece of text into a coordinate system. When the system needs additional information, it embeds the query into that same coordinate system and uses the vector coordinates to find entries in the database that are “closest” to the query’s coordinates.
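To make “closest” concrete, here is a minimal sketch of embedding a question and a resume snippet into the same coordinate system, using one of the embedding models from this test through LlamaIndex’s HuggingFace wrapper. The sample strings and the hand-rolled cosine similarity are purely illustrative; in practice the framework performs this lookup for you.

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

# Embed a question and a snippet of resume text into the same vector space.
query_vec = embed_model.get_query_embedding("What programming languages does the candidate know?")
chunk_vec = embed_model.get_text_embedding("Skills: Python, Java, SQL, Kubernetes")

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# Chunks with the highest similarity to the query are handed to the LLM as context.
print(cosine_similarity(query_vec, chunk_vec))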
Different LLM models interact differently with different embedding models. While many teams focus on optimizing a single model, they often overlook how different models and data formats interact.
The Scoring Process
Full code available at https://github.com/lucasmcgregor/medium__llm_comparison
I used LlamaIndex as my framework for invoking models and ran my models locally using Ollama.
I used Pydantic to define the output format and describe the data for the LLMs to extract.
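As a rough sketch, a single embedding/LLM combination looks something like the following. It assumes the llama-index-llms-ollama and llama-index-embeddings-huggingface packages are installed; the model tags, file path, and prompt wording are illustrative, and the repo linked above wires things up in more detail.

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# One combination under test: a local LLM served by Ollama plus an embedding model.
Settings.llm = Ollama(model="qwen3:4b", request_timeout=300.0)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

# Index the resume so relevant chunks can be retrieved as context for the LLM.
documents = SimpleDirectoryReader(input_files=["resume.pdf"]).load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# Ask for JSON matching the ResumeData schema (defined in the Data Structures
# section below), then validate the reply against that schema.
response = query_engine.query(
    "Extract the candidate's details from this resume and reply with JSON "
    "matching this schema: " + str(ResumeData.model_json_schema())
)
resume = ResumeData.model_validate_json(str(response))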
I tested every combination of 3 embedding models and 13 LLM models.
Because language models are statistical, their results fluctuate. I ran each combination of embedding model and LLM 10 times and took an average and a total score to measure accuracy and consistency.
For scoring, I added points when the system accurately extracted information from the resume. I deducted points when it couldn’t extract a mandatory fact, such as a name, or when it extracted a fact inaccurately, such as hallucinating a skill or interpreting an employer as a role.
FAIL is when the LLM produced output that couldn’t be mapped to the Pydantic schema, producing results I couldn’t programmatically validate.
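In outline, the scoring loop for one combination looks roughly like this. The run_extraction and score_result callables stand in for the repo’s actual extraction and comparison functions; they are placeholders, not real names from the codebase.

from typing import Callable, Optional
from pydantic import ValidationError

RUNS_PER_COMBINATION = 10  # each embedding/LLM pair is run 10 times

def score_combination(run_extraction: Callable[[], dict],
                      score_result: Callable[[dict], int]) -> dict:
    """Run one embedding/LLM combination repeatedly and summarize its scores."""
    scores: list[Optional[int]] = []
    for _ in range(RUNS_PER_COMBINATION):
        try:
            scores.append(score_result(run_extraction()))
        except (ValidationError, ValueError):
            # FAIL: the output could not be mapped to the Pydantic schema.
            scores.append(None)
    valid = [s for s in scores if s is not None]
    return {
        "average": sum(valid) / len(valid) if valid else 0.0,
        "total": sum(valid),
        "failures": scores.count(None),
    }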
Data Structures
from typing import Any, List
from pydantic import BaseModel, Field

class ResumeData(BaseModel):
    """
    Pydantic model representing common fields extracted from a resume.
    Every field is optional so a missing value does not fail validation.
    """
    full_name: str | None = Field(default=None, strict=False, description="Full name of the candidate")
    email: str | None = Field(default=None, strict=False, description="Contact email address")
    phone: str | None = Field(default=None, strict=False, description="Contact phone number")
    summary: str | None = Field(default=None, strict=False, description="Professional summary or objective")
    education: List[dict] | None = Field(default_factory=list, description="List of education entries")
    languages: List[str] | List[dict] | None = Field(default_factory=list, description="List of language proficiencies")
    linkedin_url: str | None = Field(default=None, strict=False, description="LinkedIn profile URL")
    skills: List[str] | List[dict] | None = Field(default_factory=list, description="List of professional skills")
    work_experience: List["WorkExperience"] | None = Field(default_factory=list, description="List of work experience entries")

class WorkExperience(BaseModel):
    """
    Pydantic model representing a single work experience entry.
    """
    title: str | None = Field(default=None, description="Job title or position")
    organization: str | None = Field(default=None, description="Company or organization name")
    start_date: str | None = Field(default=None, description="Start date of employment")
    end_date: str | None = Field(default=None, description="End date of employment")
    details: Any | None = Field(default_factory=dict, description="Additional details about the role")

# Resolve the forward reference to WorkExperience now that both models exist.
ResumeData.model_rebuild()
I wanted the results to be structured, but not rigid. Most fields can be null or accept an open structure of either a list of strings or a list of dicts. I wanted to give the LLMs plenty of leeway and validate their results by comparing their responses to known answers rather than rejecting them against an overly rigid schema.
Even so, many LLMs return poorly formatted JSON, so I checked for common mistakes and corrected them before parsing.
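The cleanup step is simple string surgery before handing the text to the JSON parser. This is a sketch of the kinds of fixes involved, not the repo’s exact code:

import json
import re

def coerce_json(raw: str) -> dict:
    """Best-effort cleanup of common LLM JSON mistakes before parsing."""
    text = raw.strip()
    # Strip markdown code fences such as ```json ... ```
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)
    # Keep only the outermost JSON object if the model wrapped it in prose.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        text = text[start:end + 1]
    # Remove trailing commas before closing braces and brackets.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return json.loads(text)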
For the models, I wanted at least one nested field. Most Pydantic tests I have seen focus on flat models, which are rare in real-world scenarios.
The Results
Overall, most models performed better when using data embedded with BAAI/bge-base-en-v1.5. It produced the best averages, the best totals, and the fewest failures.
Qwen3 and Gemma3 have 4B-parameter models that beat the much larger Llama3.3 at 70.6B parameters. Both Qwen3 and Gemma3 have been distilled from larger models, and both have hybrid language and reasoning modes, meaning they are designed to analyze the prompt, strategize how to answer it, and examine their own results. This seems to have helped with the more complex reasoning and extraction tasks.
Seeing the success of the smaller Qwen3 models, I tested all the sizes. There is an inflection point at 4B parameters for this challenge: smaller models performed poorly, while models above 4B just ran slower without improving the results.
Gemma3 was a top scorer when paired with BAAI/bge-base-en-v1.5, but switching the embedding model to nomic-embed-text caused Gemma3 to stop producing valid JSON at all, showing how unpredictable the interactions between models can be, even in a simple RAG setup.
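Notably, that swap is a one-line change in code, which is what makes the resulting change in behavior so surprising. Assuming the llama-index-embeddings-ollama package, it looks something like this:

from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.embeddings.ollama import OllamaEmbedding

# The combination that scored well with Gemma3:
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

# Swapping in this embedding model left the same LLM unable to produce valid JSON:
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")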
Conclusions
- The embedding model that extends your LLM’s knowledge greatly impacts the LLM’s performance.
- Bigger models or larger vectors cost more to run, but do not always return improved results.
- Interactions between models are not always straightforward to predict or understand.
As agent-based systems evolve, they will involve multiple AIs and models, each of which can be retrained, upgraded, or swapped out for a different option. Each interaction becomes a point of instability.
Complex software has been made possible by breaking down problems into simple, separate, and modular components. Enterprise software is feasible because we have decades of established patterns to create stability, scalability, and security.
Agent-based systems shift the complexity that used to be contained in design patterns into black-box AIs. Old modular patterns won’t apply anymore, and teams must discover and create new ways to maintain enterprise-level reliability and safety. With only two models per pipeline, this demo already shows a vast range of performance differences across combinations. Today’s enterprise systems are built on hundreds of thousands of independent modules. Tomorrow’s agent-based systems will have dozens to possibly thousands of models and AI agents all interacting.
Software architecture will need to develop new patterns and approaches. It will be less about designing around data exchanges and formats and more about AI interaction points. These are the new APIs.