AI Research Agent – Semantic Document Intelligence Platform
2025

Building an AI Research Analyzer: A Deep Dive into LangChain, LangGraph, and Groq Integration
Introduction
In the age of information overload, researchers and academics face a significant challenge: sifting through hundreds of research papers to find the ones most relevant to their work. Manual review is time-consuming, error-prone, and often overwhelming. This is where AI comes to the rescue.
In this comprehensive blog post, we'll explore how to build an AI Research Analyzer - an intelligent system that uses Large Language Models (LLMs) to automatically evaluate research papers against user queries. We'll dive deep into the architecture, technologies, and code implementation, showing you exactly how this powerful tool works under the hood.
The Problem
Imagine you're working on a research project about "machine learning applications in healthcare." You have a collection of 50 PDF research papers, and you need to identify which ones are actually relevant to your topic. Traditional approaches would require reading each abstract, skimming the full text, taking notes, and judging relevance paper by paper.
This process could take days or weeks. Our AI Research Analyzer can do this in minutes, providing detailed analysis and match evaluations for each paper.
The Solution: AI-Powered Research Analysis
Our solution leverages the power of LangChain for building LLM applications, LangGraph for workflow orchestration, Groq for ultra-fast inference, Streamlit for the web interface, and PyPDF for text extraction.
The system takes user queries and PDF files as input, processes them through an intelligent pipeline, and returns comprehensive analysis results.
Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│ User Interface │
│ (Streamlit Web App) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Query Input │ │ PDF Upload │ │ Results Display │ │
│ └──────┬───────┘ └──────┬───────┘ └─────────┬────────┘ │
└─────────┼──────────────────┼────────────────────┼────────────┘
│ │ │
▼ ▼ │
┌─────────────────────────────────────────────────┼────────────┐
│ PDF Processing Layer │ │
│ ┌──────────────────────────────────────────┐ │ │
│ │ Extract text from PDFs using PyPDF │ │ │
│ │ Structure data with file names │ │ │
│ └──────────────────┬───────────────────────┘ │ │
└─────────────────────┼───────────────────────────┘ │
│ │
▼ │
┌───────────────────────────────────────────────────────────────┐
│ LangGraph Workflow Orchestration │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ Node 1: Analyze │ ───► │ Node 2: Evaluate │ │
│ │ Individual Papers │ │ Matches & Summary │ │
│ └──────────────────────┘ └──────────────────────┘ │
└─────────────────────┬────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────┐
│ Groq LLM API (Llama 3.3) │
│ Fast inference for analysis │
└───────────────────────────────────────────────────────────────┘
Technology Stack Deep Dive
1. **Python 3.13**
Our foundation. Python provides excellent libraries for AI/ML, web development, and PDF processing.
2. **Streamlit** - Web Interface Framework
Streamlit is a Python framework that makes it incredibly easy to build interactive web applications. It's perfect for data science and AI applications.
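As a quick illustration (a standalone sketch, not part of this project), a complete Streamlit app can be just a few lines:

# hello_app.py - minimal Streamlit example (illustrative only)
import streamlit as st

st.title("Hello, Streamlit")
name = st.text_input("Your name")   # interactive widget
if name:
    st.write(f"Nice to meet you, {name}!")   # re-rendered on every rerun

Running `streamlit run hello_app.py` serves this as an interactive page; app.py uses exactly the same pattern at a larger scale.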
Key Features:
Interactive widgets (text inputs, file uploaders, buttons, expanders), automatic reruns on every interaction, built-in session state, and no separate front-end code.
3. **LangChain** - LLM Application Framework
LangChain is a framework for developing applications powered by language models. It provides prompt templates, chat model integrations (such as ChatGroq), message formatting, and composable building blocks for chaining LLM calls together.
4. **LangGraph** - Workflow Orchestration
LangGraph extends LangChain with graph-based workflow capabilities: stateful graphs whose nodes are plain functions that read and update a shared state, explicit edges that define execution order, and compiled graphs that can be invoked like a single function.
5. **Groq API** - Ultra-Fast LLM Inference
Groq provides lightning-fast inference for LLMs: it serves open models such as Llama 3.3 on purpose-built LPU hardware, so the per-paper analysis calls return in seconds rather than minutes.
6. **PyPDF** - PDF Processing
PyPDF is a pure Python library for PDF manipulation: it can open PDF files, iterate over pages, and extract their text without any external system dependencies.
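A minimal usage sketch (the file name "example.pdf" is a hypothetical local file) shows the whole API surface we need:

# Minimal pypdf usage - illustrative sketch
import pypdf

reader = pypdf.PdfReader("example.pdf")
print(f"{len(reader.pages)} pages")
print(reader.pages[0].extract_text()[:200])   # first 200 characters of page 1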
Code Walkthrough
Let's examine each component of our system in detail.
Component 1: PDF Processor (`pdf_processor.py`)
The PDF processor is responsible for extracting text content from uploaded PDF files.
from typing import List, Optional
import pypdf
from io import BytesIO


class PDFProcessor:
    """Processes PDF files and extracts text content"""

    def extract_text_from_pdf(self, pdf_file: BytesIO) -> str:
        """
        Extract text from a single PDF file

        This method uses PyPDF's PdfReader to parse the PDF
        and extract text from each page.
        """
        try:
            pdf_reader = pypdf.PdfReader(pdf_file)
            text = ""
            # Iterate through each page
            for page_num, page in enumerate(pdf_reader.pages):
                page_text = page.extract_text()
                # Add page markers for better context
                text += f"\n--- Page {page_num + 1} ---\n"
                text += page_text
                text += "\n"
            return text
        except Exception as e:
            raise Exception(f"Error extracting text from PDF: {str(e)}")
Key Points:
Each page's text is prefixed with a "--- Page N ---" marker so downstream analysis keeps page context, and any parsing failure is wrapped and re-raised with a descriptive error message.
    def process_multiple_pdfs(self, pdf_files: List[BytesIO], file_names: Optional[List[str]] = None) -> List[dict]:
        """
        Process multiple PDF files and return structured data

        This method handles batch processing of PDFs, maintaining
        file names and metadata for later reference.
        """
        processed_pdfs = []
        for idx, pdf_file in enumerate(pdf_files):
            # Use provided file name or generate default
            file_name = file_names[idx] if file_names and idx < len(file_names) else f"Paper {idx + 1}"
            try:
                text = self.extract_text_from_pdf(pdf_file)
                processed_pdfs.append({
                    "index": idx,          # For ordering
                    "name": file_name,     # Original filename
                    "text": text,          # Extracted content
                    "length": len(text)    # For metadata
                })
            except Exception as e:
                # Graceful error handling
                processed_pdfs.append({
                    "index": idx,
                    "name": file_name,
                    "text": "",
                    "error": str(e),
                    "length": 0
                })
        return processed_pdfs
Why This Design?
One unreadable PDF should not abort the whole batch: failures are recorded alongside successes, every entry keeps its original file name and index, and the text length is stored as lightweight metadata for display.
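As a quick sanity check, here is a minimal sketch of driving the processor outside Streamlit; the file names are hypothetical stand-ins for local PDFs:

# Illustrative driver for PDFProcessor (not part of app.py)
from io import BytesIO

from pdf_processor import PDFProcessor

file_names = ["paper1.pdf", "paper2.pdf"]   # hypothetical local files
pdf_files = [BytesIO(open(name, "rb").read()) for name in file_names]

processor = PDFProcessor()
papers = processor.process_multiple_pdfs(pdf_files, file_names)
for paper in papers:
    status = paper.get("error", "ok")
    print(f'{paper["name"]}: {paper["length"]} characters ({status})')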
Component 2: Research Analyzer (`research_analyzer.py`)
This is the heart of our system - it uses LangGraph to orchestrate the analysis workflow.
State Definition
from typing import Any, Dict, List, Optional
from typing_extensions import TypedDict


class ResearchState(TypedDict):
    """State for the research analysis graph"""
    query: str                      # User's research query
    papers: List[Dict[str, Any]]    # Processed PDF data
    results: List[Dict[str, Any]]   # Individual analyses
    summary: Optional[str]          # Final summary
Why TypedDict?
TypedDict gives the graph a typed, dictionary-shaped state: LangGraph can pass it between nodes like a plain dict, while type checkers and readers still see exactly which keys exist and what they contain.
Initialization
from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate
from langgraph.graph import StateGraph, END


class ResearchAnalyzer:
    def __init__(self, groq_api_key: str, model_name: str = "llama-3.3-70b-versatile"):
        """
        Initialize the Research Analyzer

        Sets up the Groq LLM client and builds the LangGraph workflow.
        """
        self.llm = ChatGroq(
            groq_api_key=groq_api_key,
            model_name=model_name,
            temperature=0.1  # Low temperature for consistent, focused analysis
        )
        self._build_graph()
Temperature Setting:
A temperature of 0.1 keeps the model's output nearly deterministic, which is what we want for analytical judgments; higher values add creative variation that only makes relevance scores less reproducible.
Building the LangGraph Workflow
    def _build_graph(self):
        """Build the LangGraph workflow for research analysis"""
        # Create a state graph with our ResearchState schema
        workflow = StateGraph(ResearchState)

        # Add nodes - each node is a function that processes the state
        workflow.add_node("analyze_papers", self._analyze_papers_node)
        workflow.add_node("evaluate_matches", self._evaluate_matches_node)

        # Set the entry point
        workflow.set_entry_point("analyze_papers")

        # Define the flow: analyze_papers → evaluate_matches → END
        workflow.add_edge("analyze_papers", "evaluate_matches")
        workflow.add_edge("evaluate_matches", END)

        # Compile the graph into an executable workflow
        self.graph = workflow.compile()
Graph Structure:
START → analyze_papers → evaluate_matches → END
Why This Flow?
1. Sequential processing: First analyze each paper individually
2. Then synthesize: Combine individual analyses into a summary
3. Clear separation: Each node has a single responsibility
Node 1: Analyze Papers
    def _analyze_papers_node(self, state: ResearchState) -> ResearchState:
        """Analyze each paper individually"""
        query = state["query"]
        papers = state["papers"]
        results = []

        # Define the analysis prompt template
        analysis_prompt = ChatPromptTemplate.from_messages([
            ("system", """You are an expert research paper analyzer. Your task is to analyze research papers
and determine how well they match a given query. Be thorough and precise in your analysis.
For each paper, provide:
1. A summary of the paper's main content
2. Key findings and contributions
3. Relevance to the query (on a scale of 0-100)
4. Specific sections or findings that match the query
5. A clear match verdict (MATCH or NO MATCH)"""),
            ("human", """Query: {query}
Paper Content:
{paper_text}
Please analyze this paper and provide a detailed evaluation of how well it matches the query.""")
        ])
Prompt Engineering:
The system message fixes the model's role and demands a structured answer: a summary, key findings, a 0-100 relevance score, the matching passages, and an explicit MATCH / NO MATCH verdict. Asking for that structure makes responses easy to read and easy to compare across papers.
        for paper in papers:
            try:
                # Format the prompt with actual data
                messages = analysis_prompt.format_messages(
                    query=query,
                    paper_text=paper["text"][:8000]  # Limit to 8000 chars for API
                )
                # Invoke the LLM
                response = self.llm.invoke(messages)
                # Store results
                results.append({
                    "paper_index": paper["index"],
                    "paper_name": paper.get("name", f"Paper {paper['index'] + 1}"),
                    "analysis": response.content,
                    "raw_text_length": paper["length"]
                })
            except Exception as e:
                # Error handling per paper
                results.append({
                    "paper_index": paper["index"],
                    "paper_name": paper.get("name", f"Paper {paper['index'] + 1}"),
                    "error": str(e),
                    "analysis": None
                })

        # Update state with results
        state["results"] = results
        return state
Key Design Decisions:
Papers are processed one at a time inside a per-paper try/except, so a single failed LLM call produces an error entry instead of stopping the run; paper text is truncated to 8,000 characters to stay within the model's context window; and each result carries the paper's name and index so it can be traced back to the original upload.
Node 2: Evaluate Matches
    def _evaluate_matches_node(self, state: ResearchState) -> ResearchState:
        """Final evaluation and summary of matches"""
        query = state["query"]
        results = state["results"]

        # Create summary prompt
        summary_prompt = ChatPromptTemplate.from_messages([
            ("system", """You are a research evaluation expert. Summarize the analysis results
and provide a final verdict on which papers match the query."""),
            ("human", """Query: {query}
Analysis Results:
{analyses}
Provide a final summary indicating which papers match the query and why.""")
        ])

        # Format all analyses into a single text
        analyses_text = "\n\n".join([
            f"{r.get('paper_name', 'Paper ' + str(r['paper_index'] + 1))}:\n{r['analysis']}"
            for r in results if r.get('analysis')
        ])

        try:
            messages = summary_prompt.format_messages(
                query=query,
                analyses=analyses_text[:10000]  # Limit length
            )
            summary_response = self.llm.invoke(messages)
            state["summary"] = summary_response.content
        except Exception as e:
            state["summary"] = f"Error generating summary: {str(e)}"

        return state
Summary Generation:
The node concatenates every successful per-paper analysis (papers that failed are skipped), trims the combined text to 10,000 characters, and asks the model for a single verdict across the whole collection. Even if this call fails, the error message is stored as the summary so the UI always has something to show.
Main Analysis Method
    def analyze(self, query: str, papers: List[Dict[str, Any]]) -> Dict[str, Any]:
        """
        Analyze research papers against a query

        This is the main entry point that orchestrates the entire workflow.
        """
        # Initialize state
        initial_state = ResearchState(
            query=query,
            papers=papers,
            results=[],
            summary=None
        )

        # Execute the graph
        final_state = self.graph.invoke(initial_state)

        # Return formatted results
        return {
            "query": query,
            "results": final_state["results"],
            "summary": final_state.get("summary", "No summary available")
        }
Workflow Execution:
1. Create initial state with query and papers
2. Invoke the compiled graph
3. Graph executes nodes sequentially
4. Return final state with all results
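Putting it together, here is a minimal sketch of calling the analyzer directly, assuming GROQ_API_KEY is set and `papers` comes from PDFProcessor.process_multiple_pdfs as shown earlier:

# Illustrative end-to-end call outside Streamlit
import os

from research_analyzer import ResearchAnalyzer

# papers = processor.process_multiple_pdfs(pdf_files, file_names)  # from the earlier sketch
analyzer = ResearchAnalyzer(groq_api_key=os.environ["GROQ_API_KEY"])
output = analyzer.analyze(
    query="machine learning applications in healthcare",
    papers=papers,
)

print(output["summary"])
for result in output["results"]:
    verdict_text = result.get("analysis") or f"error: {result.get('error')}"
    print(f'{result["paper_name"]}: {verdict_text[:120]}')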
Component 3: Streamlit Application (`app.py`)
The Streamlit app provides the user interface for our system.
Setup and Configuration
import streamlit as st
import os
from dotenv import load_dotenv
from pdf_processor import PDFProcessor
from research_analyzer import ResearchAnalyzer

# Load environment variables
load_dotenv()

# Page configuration
st.set_page_config(
    page_title="AI Research Analyzer",
    page_icon="📚",
    layout="wide"
)

# Initialize session state
if "analyzer" not in st.session_state:
    st.session_state.analyzer = None
if "processed_papers" not in st.session_state:
    st.session_state.processed_papers = []
if "analysis_results" not in st.session_state:
    st.session_state.analysis_results = None
Session State:
Streamlit reruns the entire script on every interaction, so anything that should survive a rerun - the processed papers, the analyzer, the latest results - has to live in st.session_state.
Sidebar Configuration
with st.sidebar:
    st.header("⚙️ Configuration")

    # API Key input
    api_key_input = st.text_input(
        "Groq API Key",
        type="password",
        help="Enter your Groq API key or set GROQ_API_KEY in .env file",
        value=os.getenv("GROQ_API_KEY", "")
    )
    if api_key_input:
        os.environ["GROQ_API_KEY"] = api_key_input

    # Model selection
    model_name = st.selectbox(
        "Groq Model",
        options=[
            "llama-3.3-70b-versatile",
            "llama-3.1-8b-instant",
            "mixtral-8x7b-32768",
            "gemma-7b-it",
            "llama-3.2-90b-text-preview"
        ],
        index=0,
        help="Select the Groq model to use for analysis."
    )
User Controls:
The sidebar lets users paste a Groq API key (masked, and pre-filled from the environment when available) and choose which Groq-hosted model performs the analysis.
Main Processing Logic
if uploaded_files and query:
    if st.button("🔬 Analyze Papers", type="primary", use_container_width=True):
        with st.spinner("Processing PDFs and analyzing..."):
            try:
                # Step 1: Initialize PDF processor
                processor = PDFProcessor()

                # Step 2: Extract file data and names
                pdf_data = [file.read() for file in uploaded_files]
                file_names = [file.name for file in uploaded_files]
                from io import BytesIO
                pdf_files = [BytesIO(data) for data in pdf_data]

                # Step 3: Process PDFs
                processed_papers = processor.process_multiple_pdfs(pdf_files, file_names)
                st.session_state.processed_papers = processed_papers

                # Step 4: Initialize analyzer
                groq_api_key = os.getenv("GROQ_API_KEY")
                if not groq_api_key:
                    # A bare `return` is invalid at the top level of a Streamlit script,
                    # so report the problem and skip the analysis instead.
                    st.error("Please set GROQ_API_KEY in the sidebar or .env file")
                else:
                    analyzer = ResearchAnalyzer(
                        groq_api_key=groq_api_key,
                        model_name=model_name
                    )

                    # Step 5: Run analysis
                    results = analyzer.analyze(query, processed_papers)
                    st.session_state.analysis_results = results
                    st.success("✅ Analysis complete!")
            except Exception as e:
                st.error(f"Error during analysis: {str(e)}")
Processing Flow:
1. Extract data: Read PDF files into memory
2. Process PDFs: Extract text and structure data
3. Initialize analyzer: Set up LangGraph workflow
4. Run analysis: Execute the graph workflow
5. Store results: Save for display
Results Display
if st.session_state.analysis_results:
    st.markdown("---")
    st.header("📊 Analysis Results")
    results = st.session_state.analysis_results

    # Display summary
    st.subheader("📋 Summary")
    st.info(results.get("summary", "No summary available"))

    # Display individual analyses
    st.subheader("📑 Individual Paper Analyses")
    for idx, result in enumerate(results.get("results", [])):
        paper_name = result.get("paper_name", f"Paper {result['paper_index'] + 1}")
        with st.expander(f"📄 {paper_name}", expanded=False):
            if result.get("error"):
                st.error(f"Error: {result['error']}")
            elif result.get("analysis"):
                st.markdown(result["analysis"])
                st.caption(f"Text length: {result.get('raw_text_length', 0)} characters")
            else:
                st.warning("No analysis available for this paper")
Display Features:
The cross-paper summary appears first in an info box, followed by one collapsible expander per paper showing either its full analysis (with the extracted text length) or the error that occurred.
How It Works: Step-by-Step Execution
Let's trace through a complete execution:
Step 1: User Input
User enters query: "machine learning in healthcare"
User uploads: paper1.pdf, paper2.pdf, paper3.pdf
Step 2: PDF Processing
# PDFProcessor extracts text
paper1 → "This paper discusses ML applications..."
paper2 → "Healthcare data analysis using neural networks..."
paper3 → "A survey of computer vision techniques..."
Step 3: LangGraph Execution
Initial State:
{
    "query": "machine learning in healthcare",
    "papers": [
        {"index": 0, "name": "paper1.pdf", "text": "..."},
        {"index": 1, "name": "paper2.pdf", "text": "..."},
        {"index": 2, "name": "paper3.pdf", "text": "..."}
    ],
    "results": [],
    "summary": None
}
After Node 1 (analyze_papers):
{
    "query": "machine learning in healthcare",
    "papers": [...],
    "results": [
        {
            "paper_index": 0,
            "paper_name": "paper1.pdf",
            "analysis": "This paper discusses ML applications in healthcare... Relevance: 95/100. VERDICT: MATCH"
        },
        {
            "paper_index": 1,
            "paper_name": "paper2.pdf",
            "analysis": "This paper focuses on healthcare data... Relevance: 88/100. VERDICT: MATCH"
        },
        {
            "paper_index": 2,
            "paper_name": "paper3.pdf",
            "analysis": "This paper discusses computer vision... Relevance: 15/100. VERDICT: NO MATCH"
        }
    ],
    "summary": None
}
After Node 2 (evaluate_matches):
{
    "query": "machine learning in healthcare",
    "papers": [...],
    "results": [...],
    "summary": "Based on the analysis, 2 out of 3 papers match the query. Paper1.pdf and paper2.pdf both discuss machine learning applications in healthcare, with high relevance scores. Paper3.pdf focuses on computer vision and does not match the query."
}
Step 4: Display Results
Streamlit renders the cross-paper summary and the three per-paper analyses in expandable sections, flagging paper3.pdf as a non-match.
Key Features
1. **Multi-PDF Processing** - upload and analyze an entire batch of papers in one run, with per-file error handling.
2. **Intelligent Analysis** - each paper gets a summary, key findings, a 0-100 relevance score, and a MATCH / NO MATCH verdict.
3. **Workflow Orchestration** - a two-node LangGraph pipeline keeps per-paper analysis and cross-paper summarization cleanly separated.
4. **User-Friendly Interface** - a Streamlit app with file upload, a query box, and expandable result panels.
5. **Flexible Configuration** - choose the Groq model and supply the API key from the sidebar or a .env file.
Performance Considerations
Text Truncation
paper_text=paper["text"][:8000]  # Limit to 8000 chars
Why?
LLMs have finite context windows and per-token costs. Capping each paper at 8,000 characters (roughly 2,000 tokens) keeps every request fast and cheap while still covering the abstract, introduction, and early sections, where relevance is usually clearest.
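If the hard cut ever proves too crude, one illustrative refinement (not in the current code) is to cut at the last page marker emitted by PDFProcessor, so the excerpt never stops mid-page:

# Purely illustrative alternative to the plain slice above
def truncate_for_llm(text: str, max_chars: int = 8000) -> str:
    """Trim paper text to a rough character budget (about 4 characters per token)."""
    if len(text) <= max_chars:
        return text
    cut = text.rfind("\n--- Page", 0, max_chars)   # last page marker before the limit
    return text[:cut] if cut > 0 else text[:max_chars]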
Batch Processing
Papers are analyzed sequentially inside a single graph run. That keeps the code simple and avoids bursts of concurrent API calls, at the cost of wall-clock time on large collections.
Error Handling
Every external call - PDF parsing, the per-paper analysis, and the final summary - is wrapped in try/except, so one bad input degrades a single result instead of aborting the whole run.
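A natural next step, sketched here but not implemented in the current code, is to wrap the LLM calls in a small retry helper so transient API errors are absorbed instead of surfacing as failed papers:

# Hypothetical retry wrapper around llm.invoke() with exponential backoff
import time

def invoke_with_retry(llm, messages, attempts: int = 3, backoff: float = 2.0):
    for attempt in range(attempts):
        try:
            return llm.invoke(messages)
        except Exception:
            if attempt == attempts - 1:
                raise  # give up after the final attempt
            time.sleep(backoff ** attempt)  # wait 1s, then 2s, between attempts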
Advanced Use Cases
1. **Literature Review Automation**
Researchers can quickly identify relevant papers from large collections.
2. **Content Discovery**
Find documents matching specific topics in document repositories.
3. **Quality Filtering**
Filter papers by relevance score to focus on the strongest candidates first.
4. **Research Gap Analysis**
Identify which aspects of a topic are covered or missing in a collection.
Future Enhancements
1. **Parallel Processing** - could use asyncio or multiprocessing to analyze papers concurrently instead of one at a time; see the sketch after this list.
2. **Vector Embeddings**
3. **Citation Analysis**
4. **Multi-Query Support**
5. **Export Functionality**
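For the parallel-processing idea above, a minimal asyncio sketch could look like the following; `_analyze_one` is a hypothetical helper holding the per-paper body of `_analyze_papers_node` (prompt formatting plus the llm.invoke call):

# Sketch of concurrent per-paper analysis (not part of the current code)
import asyncio

async def analyze_paper_async(paper):
    # Run the blocking LLM call in a worker thread so papers overlap
    return await asyncio.to_thread(_analyze_one, paper)

async def analyze_all(papers):
    return await asyncio.gather(*(analyze_paper_async(p) for p in papers))

# results = asyncio.run(analyze_all(papers))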
Best Practices Implemented
1. **Separation of Concerns** - PDF extraction, analysis orchestration, and the UI live in three separate modules.
2. **Error Handling** - failures are caught per paper and per step, so partial results are always returned.
3. **Type Hints** - the TypedDict state and annotated method signatures document the data flowing through the system.
4. **State Management** - LangGraph state carries data between nodes; Streamlit session state preserves results across reruns.
5. **Prompt Engineering** - structured prompts with explicit output requirements (score, verdict) keep LLM responses consistent.
Conclusion
This AI Research Analyzer demonstrates the power of combining modern AI frameworks (LangChain, LangGraph) with fast inference APIs (Groq) to solve real-world problems. The architecture is modular, easy to extend, and simple to maintain.
The system showcases how agentic AI workflows can be orchestrated using LangGraph, making complex multi-step processes manageable and maintainable.
Whether you're a researcher looking to automate literature reviews, a student organizing research materials, or a developer building document analysis tools, this architecture provides a solid foundation for AI-powered research assistance.
Code Repository Structure
Research_agent/
├── app.py # Streamlit UI (182 lines)
├── pdf_processor.py # PDF extraction (73 lines)
├── research_analyzer.py # LangGraph workflow (174 lines)
├── requirements.txt # Dependencies
├── setup_env.py # Environment setup helper
├── .env.example # Environment template
└── README.md             # Project documentation
Getting Started
1. Install dependencies:
pip install -r requirements.txt
2. Set up environment:
# Create .env file
GROQ_API_KEY=your_api_key_here
3. Run the application:
streamlit run app.py
4. Use the interface: enter your Groq API key in the sidebar, upload your PDF papers, type a research query, and click "🔬 Analyze Papers".
Final Thoughts
Building AI applications doesn't have to be complex. By leveraging frameworks like LangChain and LangGraph, we can create sophisticated AI workflows with relatively simple code. The key is understanding how to model your workflow as a shared state plus a few focused nodes, how to engineer prompts that return structured, comparable output, and how to handle errors so partial failures stay partial.
This project demonstrates all of these principles in a practical, real-world application. Happy coding!