How We Leveraged Pydantic and LlamaIndex for Data Extraction from Unstructured Files

Introduction
Extracting structured data from PDF files such as resumes, invoices, or reports is a common need. Pulling out the relevant information while maintaining a consistent structure is crucial for automation and further analysis. By leveraging LLMs and Pydantic, developers can transform unstructured PDF content into a well-defined format, ready for integration into various workflows. This blog presents an example of how to parse resumes using Streamlit, LlamaIndex, and Pydantic, but the principles can be applied to any structured document processing task.
At KubeNine we lean heavily on structured output in every script where we deal with LLMs. We realised that with Pydantic in the picture it's extremely easy to extract structured information from unstructured text, and in this blog we share some of our findings.
Usage
This post walks through a use case where we extract structured output from resumes submitted to us. We initially built the script as a Streamlit app to test how the code performed on sample resumes. After testing, we integrated it into careers.kubehire.com: whenever a resume is submitted, the data extraction script runs as a Celery job, extracts all the details, and generates data for further evaluation.
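To illustrate that integration, here is a minimal sketch of what such a background job could look like. The app name, broker URL, task name, and the parse_resume helper are assumptions for the example, not the actual careers.kubehire.com code.

from celery import Celery

app = Celery("resume_pipeline", broker="redis://localhost:6379/0")

@app.task
def extract_resume_data(pdf_path: str) -> dict:
    """Run structured extraction on a freshly submitted resume."""
    from resume_parser import parse_resume  # hypothetical helper wrapping the extraction shown below
    resume = parse_resume(pdf_path)          # returns a Resume Pydantic object
    return resume.model_dump(mode="json")    # JSON-safe dict for downstream evaluation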
I) File Structure and Explanation
The solution is built with Streamlit and leverages the LlamaIndex library for parsing PDFs and generating structured data. Here is a breakdown of the files:
.env:
- Stores sensitive information like API keys.
OPENAI_API_KEY=your-openai-api-key
.gitignore:
- Ensures sensitive files and unnecessary directories are excluded from version control.
.venv/
.streamlit/
*__pycache__/
.env
app.py:
- The main script, responsible for uploading resumes, processing them, and displaying the results.
requirements.txt:
- Lists the Python dependencies needed to run the solution (a plausible set of packages is sketched after this list).
Resume.py:
- Defines the structured data model for resumes using Pydantic.
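For reference, a requirements.txt along these lines should be enough to run the snippets in this post. The exact package list and versions in the repository may differ; this is an assumption based on the imports used below.

streamlit
llama-index
llama-index-readers-file
llama-index-llms-openai
pydantic
python-dotenv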
II) File Contents and Explanation
app.py:
- Handles file uploads and resume parsing using LlamaIndex.
- Key features (an end-to-end sketch covering the upload and download steps follows the snippet below):
  - File upload using Streamlit.
  - PDF text extraction with LlamaIndex's PDFReader.
  - Structured data generation with OpenAI's LLM API.
  - JSON download of parsed data.
Example snippet from app.py:
import os
from Resume import Resume
from llama_index.llms.openai import OpenAI
from llama_index.readers.file import PDFReader

# Extract the raw text from the uploaded PDF
pdf_reader = PDFReader()
documents = pdf_reader.load_data(file=uploaded_file_path)
resume_text = documents[0].text

# Ask the LLM for output that conforms to the Resume Pydantic model
llm = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
sllm = llm.as_structured_llm(Resume)
structured_data = sllm.complete(resume_text).raw
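The key features above also mention the Streamlit upload and JSON download, which the snippet leaves out. Below is a minimal sketch of how those pieces could fit together; the temporary-file handling, widget labels, and file names are assumptions for illustration rather than a copy of the repository's app.py.

import json
import os
import tempfile
from pathlib import Path

import streamlit as st
from llama_index.llms.openai import OpenAI
from llama_index.readers.file import PDFReader
from Resume import Resume

st.title("Resume Parser")
uploaded_file = st.file_uploader("Upload a PDF resume", type=["pdf"])

if uploaded_file and st.button("Process Resume"):
    # Persist the upload to disk so PDFReader can open it by path
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(uploaded_file.read())
        uploaded_file_path = Path(tmp.name)

    documents = PDFReader().load_data(file=uploaded_file_path)
    resume_text = documents[0].text

    llm = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    structured_data = llm.as_structured_llm(Resume).complete(resume_text).raw

    # Show the parsed result and offer it as a JSON download
    parsed = structured_data.model_dump(mode="json")
    st.json(parsed)
    st.download_button(
        "Download JSON",
        data=json.dumps(parsed, indent=2),
        file_name="parsed_resume.json",
        mime="application/json",
    )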
Resume.py:
- The Pydantic model for structured resume data; it can be changed to capture whatever fields you need parsed from the resume.
- The Resume object includes:
  - Full Name: candidate's name.
  - Email: contact email address.
  - Phone: contact number.
  - Summary: a brief overview or objective extracted from the resume.
  - Education: list of degrees, institutions, and timelines.
  - Work Experience: job roles, companies, durations, and descriptions.
  - Skills: a list of technical and soft skills with proficiency levels.
  - Links: URLs to LinkedIn, GitHub, or personal websites.
Example snippet from Resume.py:
from pydantic import BaseModel, Field, HttpUrl
from typing import List, Optional
from datetime import date

class Resume(BaseModel):
    full_name: str = Field(description="Full name of the person")
    email: Optional[str] = Field(description="Email address")
    phone: Optional[str] = Field(description="Phone number")
    summary: str = Field(
        default="Generated summary based on the resume content.",
        description="Summary or objective from the resume",
    )
    education: List[Education] = Field(description="List of educational qualifications")
    work_experience: List[WorkExperience] = Field(description="List of work experiences")
    skills: List[Skill] = Field(description="List of skills")
    links: List[HttpUrl] = Field(description="List of relevant links such as LinkedIn or GitHub profiles")
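The snippet references Education, WorkExperience, and Skill, which are nested models defined in the same file (they need to appear before Resume). Their exact fields are not shown above; the following is a minimal sketch of one plausible shape for them, so the snippet is self-contained.

class Education(BaseModel):
    degree: str = Field(description="Degree or qualification obtained")
    institution: str = Field(description="School, college, or university")
    start_date: Optional[date] = Field(default=None, description="Start date, if stated")
    end_date: Optional[date] = Field(default=None, description="End date, if stated")

class WorkExperience(BaseModel):
    role: str = Field(description="Job title")
    company: str = Field(description="Employer name")
    duration: Optional[str] = Field(default=None, description="Duration as written in the resume")
    description: Optional[str] = Field(default=None, description="Responsibilities and achievements")

class Skill(BaseModel):
    name: str = Field(description="Skill name")
    proficiency: Optional[str] = Field(default=None, description="Proficiency level, e.g. beginner, intermediate, expert")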
III) Running the Solution
- Clone the repository and install dependencies:
pip install -r requirements.txt
- Set up the .env file with your OpenAI API key.
- Launch the script:
streamlit run app.py
- Upload a PDF resume and click "Process Resume" to extract structured data. You can download the data as a JSON file for further use.
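One detail worth noting: the snippet above reads the key with os.getenv, so the .env file has to be loaded into the environment first. Assuming the python-dotenv package is used for this, the usual pattern is:

from dotenv import load_dotenv

load_dotenv()  # reads OPENAI_API_KEY from .env into the process environment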
[Screenshot: the solution with an uploaded resume and the parsed JSON data displayed.]
Conclusion
This solution transforms unstructured resume content into standardized data, ready for further processing. The Resume Pydantic model can be customized to capture specific details from the parsed resume. You can explore the complete implementation on the GitHub repo here.
Advanced use cases include adapting the structure to evaluate candidates against job-specific requirements using AI models. Tools like LlamaParse can extend this parsing approach further, since it provides markdown extraction from complex PDFs containing images, tables, graphs, and more.
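As one illustration of that first idea, the structure could be adapted by adding an evaluation model and prompting the structured LLM with both the resume text and a job description. The CandidateEvaluation model, the prompt wording, and the job_description variable below are hypothetical, not part of the repository.

from pydantic import BaseModel, Field
from typing import List

class CandidateEvaluation(BaseModel):
    matched_skills: List[str] = Field(description="Skills from the resume that match the job description")
    missing_skills: List[str] = Field(description="Required skills not evidenced in the resume")
    fit_score: int = Field(ge=0, le=10, description="Overall fit score from 0 to 10")
    rationale: str = Field(description="Short justification for the score")

# llm and resume_text come from the app.py snippet above; job_description is assumed to be provided
eval_llm = llm.as_structured_llm(CandidateEvaluation)
evaluation = eval_llm.complete(
    f"Job description:\n{job_description}\n\nResume:\n{resume_text}"
).raw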
At KubeNine, we've leveraged this solution to simplify our hiring process. By structuring resumes consistently, we've built a workflow that enables AI-driven analysis for better decision-making and faster candidate screening.