How We Leveraged Pydantic and LlamaIndex for Data Extraction from Unstructured Files

Introduction
Extracting structured data from PDF files such as resumes, invoices, or reports is a common need. Pulling out the relevant information while maintaining a consistent structure is crucial for automation and further analysis. By leveraging LLMs and Pydantic, developers can transform unstructured PDF content into a well-defined format, ready for integration into various workflows. This blog presents an example of how to parse resumes using Streamlit, LlamaIndex, and Pydantic, but the principles can be applied to any structured document processing task.
At KubeNine we lean heavily on structured output in every script where we deal with LLMs. We realised that with Pydantic in the picture it's extremely easy to extract structured information from unstructured text, and in this blog we share some of our findings.
Usage
This post walks through a use case where we extract structured output from resumes submitted to us. We initially built the script as a Streamlit app to test how the code performed on sample resumes. After testing, we integrated it into careers.kubehire.com: whenever a resume is submitted, the data extraction script runs as a Celery job, extracts all the details, and generates data for further evaluation.
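To illustrate that integration, here is a minimal sketch of what such a background job could look like. The app name, broker URL, task name, and the parse_resume helper are assumptions for the example, not the actual careers.kubehire.com code.

from celery import Celery

app = Celery("resume_pipeline", broker="redis://localhost:6379/0")

@app.task
def extract_resume_data(pdf_path: str) -> dict:
    """Run structured extraction on a freshly submitted resume."""
    from resume_parser import parse_resume  # hypothetical helper wrapping the extraction shown below
    resume = parse_resume(pdf_path)          # returns a Resume Pydantic object
    return resume.model_dump(mode="json")    # JSON-safe dict for downstream evaluation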
I) File Structure and Explanation
The solution is built with Streamlit and leverages the LlamaIndex library for parsing PDFs and generating structured data. Here is a breakdown of the files:
.env:
- Stores sensitive information like API keys.
OPENAI_API_KEY=your-openai-api-key
.gitignore:
- Ensures sensitive files and unnecessary directories are excluded from version control.
.venv/
.streamlit/
*__pycache__/
.env
app.py:
- The main script, responsible for uploading resumes, processing them, and displaying the results.
requirements.txt:
- Lists the Python dependencies needed to run the solution (a plausible set of packages is sketched after this list).
Resume.py:
- Defines the structured data model for resumes using Pydantic.
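For reference, a requirements.txt along these lines should be enough to run the snippets in this post. The exact package list and versions in the repository may differ; this is an assumption based on the imports used below.

streamlit
llama-index
llama-index-readers-file
llama-index-llms-openai
pydantic
python-dotenv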
II) File Contents and Explanation
app.py:
- Handles file uploads and resume parsing using LlamaIndex.
- Key features (an end-to-end sketch covering the upload and download steps follows the snippet below):
  - File upload using Streamlit.
  - PDF text extraction with LlamaIndex's PDFReader.
  - Structured data generation with OpenAI's LLM API.
  - JSON download of parsed data.
Example snippet from app.py:
import os
from Resume import Resume
from llama_index.llms.openai import OpenAI
from llama_index.readers.file import PDFReader

# Extract the raw text from the uploaded PDF
pdf_reader = PDFReader()
documents = pdf_reader.load_data(file=uploaded_file_path)
resume_text = documents[0].text

# Ask the LLM for output that conforms to the Resume Pydantic model
llm = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
sllm = llm.as_structured_llm(Resume)
structured_data = sllm.complete(resume_text).raw
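The key features above also mention the Streamlit upload and JSON download, which the snippet leaves out. Below is a minimal sketch of how those pieces could fit together; the temporary-file handling, widget labels, and file names are assumptions for illustration rather than a copy of the repository's app.py.

import json
import os
import tempfile
from pathlib import Path

import streamlit as st
from llama_index.llms.openai import OpenAI
from llama_index.readers.file import PDFReader
from Resume import Resume

st.title("Resume Parser")
uploaded_file = st.file_uploader("Upload a PDF resume", type=["pdf"])

if uploaded_file and st.button("Process Resume"):
    # Persist the upload to disk so PDFReader can open it by path
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(uploaded_file.read())
        uploaded_file_path = Path(tmp.name)

    documents = PDFReader().load_data(file=uploaded_file_path)
    resume_text = documents[0].text

    llm = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    structured_data = llm.as_structured_llm(Resume).complete(resume_text).raw

    # Show the parsed result and offer it as a JSON download
    parsed = structured_data.model_dump(mode="json")
    st.json(parsed)
    st.download_button(
        "Download JSON",
        data=json.dumps(parsed, indent=2),
        file_name="parsed_resume.json",
        mime="application/json",
    )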
Resume.py:
- The Pydantic model for structured resume data; it can be changed to capture whatever fields you need parsed from the resume.
- The Resume object includes:
  - Full Name: candidate's name.
  - Email: contact email address.
  - Phone: contact number.
  - Summary: a brief overview or objective extracted from the resume.
  - Education: list of degrees, institutions, and timelines.
  - Work Experience: job roles, companies, durations, and descriptions.
  - Skills: a list of technical and soft skills with proficiency levels.
  - Links: URLs to LinkedIn, GitHub, or personal websites.
Example snippet from Resume.py:
from pydantic import BaseModel, Field, HttpUrl
from typing import List, Optional
from datetime import date

class Resume(BaseModel):
    full_name: str = Field(description="Full name of the person")
    email: Optional[str] = Field(description="Email address")
    phone: Optional[str] = Field(description="Phone number")
    summary: str = Field(
        default="Generated summary based on the resume content.",
        description="Summary or objective from the resume",
    )
    education: List[Education] = Field(description="List of educational qualifications")
    work_experience: List[WorkExperience] = Field(description="List of work experiences")
    skills: List[Skill] = Field(description="List of skills")
    links: List[HttpUrl] = Field(description="List of relevant links such as LinkedIn or GitHub profiles")
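The snippet references Education, WorkExperience, and Skill, which are nested models defined in the same file (they need to appear before Resume). Their exact fields are not shown above; the following is a minimal sketch of one plausible shape for them, so the snippet is self-contained.

class Education(BaseModel):
    degree: str = Field(description="Degree or qualification obtained")
    institution: str = Field(description="School, college, or university")
    start_date: Optional[date] = Field(default=None, description="Start date, if stated")
    end_date: Optional[date] = Field(default=None, description="End date, if stated")

class WorkExperience(BaseModel):
    role: str = Field(description="Job title")
    company: str = Field(description="Employer name")
    duration: Optional[str] = Field(default=None, description="Duration as written in the resume")
    description: Optional[str] = Field(default=None, description="Responsibilities and achievements")

class Skill(BaseModel):
    name: str = Field(description="Skill name")
    proficiency: Optional[str] = Field(default=None, description="Proficiency level, e.g. beginner, intermediate, expert")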
III) Running the Solution
- Clone the repository and install dependencies:
pip install -r requirements.txt
- Set up the .env file with your OpenAI API key.
- Launch the script:
streamlit run app.py
- Upload a PDF resume and click "Process Resume" to extract structured data. You can download the data as a JSON file for further use.
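One detail worth noting: the snippet above reads the key with os.getenv, so the .env file has to be loaded into the environment first. Assuming the python-dotenv package is used for this, the usual pattern is:

from dotenv import load_dotenv

load_dotenv()  # reads OPENAI_API_KEY from .env into the process environment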
[Screenshot: the solution with an uploaded resume and the parsed JSON data displayed.]
Conclusion
This solution transforms unstructured resume content into standardized data, ready for further processing. The Resume Pydantic model can be customized to capture specific details from the parsed resume. You can explore the complete implementation on the GitHub repo here.
Advanced use cases include adapting the structure to evaluate candidates against job-specific requirements using AI models. Tools like LlamaParse can extend this parsing approach further, since it provides markdown extraction from complex PDFs containing images, tables, graphs, and more.
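As one illustration of that first idea, the structure could be adapted by adding an evaluation model and prompting the structured LLM with both the resume text and a job description. The CandidateEvaluation model, the prompt wording, and the job_description variable below are hypothetical, not part of the repository.

from pydantic import BaseModel, Field
from typing import List

class CandidateEvaluation(BaseModel):
    matched_skills: List[str] = Field(description="Skills from the resume that match the job description")
    missing_skills: List[str] = Field(description="Required skills not evidenced in the resume")
    fit_score: int = Field(ge=0, le=10, description="Overall fit score from 0 to 10")
    rationale: str = Field(description="Short justification for the score")

# llm and resume_text come from the app.py snippet above; job_description is assumed to be provided
eval_llm = llm.as_structured_llm(CandidateEvaluation)
evaluation = eval_llm.complete(
    f"Job description:\n{job_description}\n\nResume:\n{resume_text}"
).raw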
At KubeNine, we've leveraged this solution to simplify our hiring process. By structuring resumes consistently, we've built a workflow that enables AI-driven analysis for better decision-making and faster candidate screening.