
AI Research Agent – Semantic Document Intelligence Platform

2025


Building an AI Research Analyzer: A Deep Dive into LangChain, LangGraph, and Groq Integration


Introduction


In the age of information overload, researchers and academics face a significant challenge: sifting through hundreds of research papers to find the ones most relevant to their work. Manual review is time-consuming, error-prone, and often overwhelming. This is where AI comes to the rescue.


In this comprehensive blog post, we'll explore how to build an AI Research Analyzer - an intelligent system that uses Large Language Models (LLMs) to automatically evaluate research papers against user queries. We'll dive deep into the architecture, technologies, and code implementation, showing you exactly how this powerful tool works under the hood.


The Problem


Imagine you're working on a research project about "machine learning applications in healthcare." You have a collection of 50 PDF research papers, and you need to identify which ones are actually relevant to your topic. Traditional approaches would require:


  • Reading through each paper manually

  • Identifying key concepts and themes

  • Comparing them against your research query

  • Making subjective judgments about relevance

This process could take days or weeks. Our AI Research Analyzer can do this in minutes, providing detailed analysis and match evaluations for each paper.


    The Solution: AI-Powered Research Analysis


    Our solution leverages the power of:


  • LangChain - For building LLM applications

  • LangGraph - For orchestrating complex AI workflows

  • Groq API - For ultra-fast LLM inference

  • Streamlit - For an intuitive web interface

    The system takes user queries and PDF files as input, processes them through an intelligent pipeline, and returns comprehensive analysis results.


    Architecture Overview


    ┌──────────────────────────────────────────────────────────────┐
    │                        User Interface                        │
    │                      (Streamlit Web App)                     │
    │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐    │
    │  │ Query Input  │  │  PDF Upload  │  │ Results Display  │    │
    │  └──────┬───────┘  └──────┬───────┘  └────────▲─────────┘    │
    └─────────┼─────────────────┼───────────────────┼──────────────┘
              │                 │                   │
              ▼                 ▼                   │
    ┌───────────────────────────────────────────────┼──────────────┐
    │              PDF Processing Layer             │              │
    │  ┌──────────────────────────────────────────┐ │              │
    │  │  Extract text from PDFs using PyPDF      │ │              │
    │  │  Structure data with file names          │ │              │
    │  └──────────────────┬───────────────────────┘ │              │
    └─────────────────────┼─────────────────────────┼──────────────┘
                          │                         │
                          ▼                         │
    ┌───────────────────────────────────────────────┴──────────────┐
    │               LangGraph Workflow Orchestration               │
    │  ┌──────────────────────┐      ┌──────────────────────┐      │
    │  │  Node 1: Analyze     │ ───► │  Node 2: Evaluate    │      │
    │  │  Individual Papers   │      │  Matches & Summary   │      │
    │  └──────────────────────┘      └──────────────────────┘      │
    └─────────────────────┬────────────────────────────────────────┘
                          │
                          ▼
    ┌──────────────────────────────────────────────────────────────┐
    │                   Groq LLM API (Llama 3.3)                   │
    │                 Fast inference for analysis                  │
    └──────────────────────────────────────────────────────────────┘

    Technology Stack Deep Dive


    1. **Python 3.13**


    Our foundation. Python provides excellent libraries for AI/ML, web development, and PDF processing.


    2. **Streamlit** - Web Interface Framework


    Streamlit is a Python framework that makes it incredibly easy to build interactive web applications. It's perfect for data science and AI applications.


    Key Features:


  • No HTML/CSS/JavaScript required

  • Built-in widgets (file uploaders, text inputs, buttons)

  • Automatic reactive updates

  • Session state management
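
    To make this concrete, here is a minimal, self-contained sketch of the pattern our app relies on (the widget labels here are illustrative, not the actual app code):

    import streamlit as st
    
    st.title("Demo")
    query = st.text_input("Research query")  # built-in text widget
    
    # Session state survives Streamlit's script reruns
    if "runs" not in st.session_state:
        st.session_state.runs = 0
    
    if st.button("Run"):
        st.session_state.runs += 1
        st.write(f"Query: {query!r} (run #{st.session_state.runs})")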

    3. **LangChain** - LLM Application Framework


    LangChain is a framework for developing applications powered by language models. It provides:


  • Abstraction layer for working with different LLM providers

  • Prompt templates for consistent, reusable prompts

  • Chain composition for complex workflows

  • Memory management for conversational applications

    4. **LangGraph** - Workflow Orchestration


    LangGraph extends LangChain with graph-based workflow capabilities:


  • State management - Maintains state across workflow steps

  • Node-based architecture - Each node performs a specific task

  • Conditional routing - Dynamic workflow paths based on conditions

  • Error handling - Built-in error recovery mechanisms

    5. **Groq API** - Ultra-Fast LLM Inference


    Groq provides lightning-fast inference for LLMs:


  • Hardware acceleration - Custom inference chips

  • Multiple model support - Llama, Mixtral, Gemma

  • Low latency - Sub-second response times

  • Cost-effective - Pay-per-use pricing
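
    Calling Groq through LangChain takes only a few lines - a minimal sketch, assuming the `langchain-groq` package and a placeholder API key:

    from langchain_groq import ChatGroq
    
    llm = ChatGroq(
        groq_api_key="gsk_...",  # placeholder - use your own key
        model_name="llama-3.3-70b-versatile",
        temperature=0.1,
    )
    print(llm.invoke("Summarize transformers in one sentence.").content)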

    6. **PyPDF** - PDF Processing


    PyPDF is a pure Python library for PDF manipulation:


  • Text extraction - Extracts text from PDF files

  • Page-by-page processing - Handles large documents

  • Metadata access - Retrieves document information
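
    The core pypdf calls we rely on fit in a short sketch (the file name is illustrative):

    import pypdf
    
    reader = pypdf.PdfReader("paper.pdf")  # hypothetical local file
    print(f"{len(reader.pages)} pages")
    first_page_text = reader.pages[0].extract_text()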

    Code Walkthrough


    Let's examine each component of our system in detail.


    Component 1: PDF Processor (`pdf_processor.py`)


    The PDF processor is responsible for extracting text content from uploaded PDF files.


    from typing import List, Optional
    import pypdf
    from io import BytesIO
    
    class PDFProcessor:
        """Processes PDF files and extracts text content"""
        
        def extract_text_from_pdf(self, pdf_file: BytesIO) -> str:
            """
            Extract text from a single PDF file
            
            This method uses PyPDF's PdfReader to parse the PDF
            and extract text from each page.
            """
            try:
                pdf_reader = pypdf.PdfReader(pdf_file)
                text = ""
                
                # Iterate through each page
                for page_num, page in enumerate(pdf_reader.pages):
                    page_text = page.extract_text()
                    # Add page markers for better context
                    text += f"\n--- Page {page_num + 1} ---\n"
                    text += page_text
                    text += "\n"
                
                return text
            except Exception as e:
                raise RuntimeError(f"Error extracting text from PDF: {e}") from e  # keep the original traceback

    Key Points:


  • Uses `BytesIO` to handle in-memory PDF data (no file system writes)

  • Processes pages sequentially to maintain document structure

  • Adds page markers to preserve context

  • Error handling ensures one bad PDF doesn't crash the entire process

    def process_multiple_pdfs(self, pdf_files: List[BytesIO], file_names: Optional[List[str]] = None) -> List[dict]:
        """
        Process multiple PDF files and return structured data
        
        This method handles batch processing of PDFs, maintaining
        file names and metadata for later reference.
        """
        processed_pdfs = []
        
        for idx, pdf_file in enumerate(pdf_files):
            # Use provided file name or generate default
            file_name = file_names[idx] if file_names and idx < len(file_names) else f"Paper {idx + 1}"
            
            try:
                text = self.extract_text_from_pdf(pdf_file)
                processed_pdfs.append({
                    "index": idx,           # For ordering
                    "name": file_name,      # Original filename
                    "text": text,           # Extracted content
                    "length": len(text)     # For metadata
                })
            except Exception as e:
                # Graceful error handling
                processed_pdfs.append({
                    "index": idx,
                    "name": file_name,
                    "text": "",
                    "error": str(e),
                    "length": 0
                })
        
        return processed_pdfs
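
    A quick usage sketch (the file name is illustrative):

    from io import BytesIO
    
    processor = PDFProcessor()
    with open("paper1.pdf", "rb") as f:  # hypothetical local file
        pdf_files = [BytesIO(f.read())]
    
    papers = processor.process_multiple_pdfs(pdf_files, ["paper1.pdf"])
    print(papers[0]["name"], papers[0]["length"])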

    Why This Design?


  • Structured output: Returns dictionaries with consistent keys

  • Error resilience: Continues processing even if one PDF fails

  • Metadata preservation: Keeps file names and indices for display

  • Scalability: Can handle any number of PDFs

    Component 2: Research Analyzer (`research_analyzer.py`)


    This is the heart of our system - it uses LangGraph to orchestrate the analysis workflow.


    State Definition


    from typing import Any, Dict, List, Optional
    from typing_extensions import TypedDict
    
    class ResearchState(TypedDict):
        """State for the research analysis graph"""
        query: str                              # User's research query
        papers: List[Dict[str, Any]]            # Processed PDF data
        results: List[Dict[str, Any]]           # Individual analyses
        summary: Optional[str]                  # Final summary

    Why TypedDict?


  • Type safety for state management

  • Clear contract for what data flows through the graph

  • IDE autocomplete support

  • Static checking of state keys and value types (TypedDict adds no runtime validation)
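
    For example, a static checker such as mypy flags a misspelled key, even though nothing is enforced at runtime:

    state: ResearchState = {
        "query": "machine learning in healthcare",
        "papers": [],
        "results": [],
        "summary": None,
    }
    state["querry"] = "oops"  # mypy: TypedDict "ResearchState" has no key "querry"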

    Initialization


    from langchain_core.prompts import ChatPromptTemplate
    from langchain_groq import ChatGroq
    from langgraph.graph import StateGraph, END
    
    class ResearchAnalyzer:
        def __init__(self, groq_api_key: str, model_name: str = "llama-3.3-70b-versatile"):
            """
            Initialize the Research Analyzer
            
            Sets up the Groq LLM client and builds the LangGraph workflow.
            """
            self.llm = ChatGroq(
                groq_api_key=groq_api_key,
                model_name=model_name,
                temperature=0.1  # Low temperature for consistent, focused analysis
            )
            self._build_graph()

    Temperature Setting:


  • `0.1` = More deterministic, focused responses

  • Lower values = More consistent analysis

  • Higher values = More creative but less reliable

    Building the LangGraph Workflow


    def _build_graph(self):
        """Build the LangGraph workflow for research analysis"""
        
        # Create a state graph with our ResearchState schema
        workflow = StateGraph(ResearchState)
        
        # Add nodes - each node is a function that processes the state
        workflow.add_node("analyze_papers", self._analyze_papers_node)
        workflow.add_node("evaluate_matches", self._evaluate_matches_node)
        
        # Set the entry point
        workflow.set_entry_point("analyze_papers")
        
        # Define the flow: analyze_papers → evaluate_matches → END
        workflow.add_edge("analyze_papers", "evaluate_matches")
        workflow.add_edge("evaluate_matches", END)
        
        # Compile the graph into an executable workflow
        self.graph = workflow.compile()

    Graph Structure:


    START → analyze_papers → evaluate_matches → END

    Why This Flow?


    1. Sequential processing: First analyze each paper individually


    2. Then synthesize: Combine individual analyses into a summary


    3. Clear separation: Each node has a single responsibility


    Node 1: Analyze Papers


    def _analyze_papers_node(self, state: ResearchState) -> ResearchState:
        """Analyze each paper individually"""
        query = state["query"]
        papers = state["papers"]
        results = []
        
        # Define the analysis prompt template
        analysis_prompt = ChatPromptTemplate.from_messages([
            ("system", """You are an expert research paper analyzer. Your task is to analyze research papers 
            and determine how well they match a given query. Be thorough and precise in your analysis.
            
            For each paper, provide:
            1. A summary of the paper's main content
            2. Key findings and contributions
            3. Relevance to the query (on a scale of 0-100)
            4. Specific sections or findings that match the query
            5. A clear match verdict (MATCH or NO MATCH)"""),
            ("human", """Query: {query}
            
            Paper Content:
            
            {paper_text}
            
            Please analyze this paper and provide a detailed evaluation of how well it matches the query.""")
        ])

    Prompt Engineering:


  • System message: Sets the AI's role and expectations

  • Structured output: Requests specific information (summary, relevance score, verdict)

  • Context: Provides both query and paper content

        for paper in papers:
            try:
                # Format the prompt with actual data
                messages = analysis_prompt.format_messages(
                    query=query,
                    paper_text=paper["text"][:8000]  # Limit to 8000 chars for API
                )
                
                # Invoke the LLM
                response = self.llm.invoke(messages)
                
                # Store results
                results.append({
                    "paper_index": paper["index"],
                    "paper_name": paper.get("name", f"Paper {paper['index'] + 1}"),
                    "analysis": response.content,
                    "raw_text_length": paper["length"]
                })
            except Exception as e:
                # Error handling per paper
                results.append({
                    "paper_index": paper["index"],
                    "paper_name": paper.get("name", f"Paper {paper['index'] + 1}"),
                    "error": str(e),
                    "analysis": None
                })
        
        # Update state with results
        state["results"] = results
        return state

    Key Design Decisions:


  • Text truncation: Limits to 8000 characters to avoid token limits

  • Per-paper error handling: One failure doesn't stop the entire process

  • State mutation: Updates and returns the state object

    Node 2: Evaluate Matches


    def _evaluate_matches_node(self, state: ResearchState) -> ResearchState:
        """Final evaluation and summary of matches"""
        query = state["query"]
        results = state["results"]
        
        # Create summary prompt
        summary_prompt = ChatPromptTemplate.from_messages([
            ("system", """You are a research evaluation expert. Summarize the analysis results 
            and provide a final verdict on which papers match the query."""),
            ("human", """Query: {query}
            
            Analysis Results:
            
            {analyses}
            
            Provide a final summary indicating which papers match the query and why.""")
        ])
        
        # Format all analyses into a single text
        analyses_text = "\n\n".join([
            f"{r.get('paper_name', 'Paper ' + str(r['paper_index'] + 1))}:\n{r['analysis']}"
            for r in results if r.get('analysis')
        ])
        
        try:
            messages = summary_prompt.format_messages(
                query=query,
                analyses=analyses_text[:10000]  # Limit length
            )
            
            summary_response = self.llm.invoke(messages)
            state["summary"] = summary_response.content
        except Exception as e:
            state["summary"] = f"Error generating summary: {str(e)}"
        
        return state

    Summary Generation:


  • Takes all individual analyses as input

  • Asks LLM to synthesize and provide overall verdict

  • Creates a cohesive summary of matches

    Main Analysis Method


    def analyze(self, query: str, papers: List[Dict[str, Any]]) -> Dict[str, Any]:
        """
        Analyze research papers against a query
        
        This is the main entry point that orchestrates the entire workflow.
        """
        # Initialize state
        initial_state = ResearchState(
            query=query,
            papers=papers,
            results=[],
            summary=None
        )
        
        # Execute the graph
        final_state = self.graph.invoke(initial_state)
        
        # Return formatted results
        return {
            "query": query,
            "results": final_state["results"],
            "summary": final_state.get("summary", "No summary available")
        }

    Workflow Execution:


    1. Create initial state with query and papers


    2. Invoke the compiled graph


    3. Graph executes nodes sequentially


    4. Return final state with all results
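
    Putting the two components together, a minimal driver script might look like this (the file name and API key are placeholders):

    from io import BytesIO
    from pdf_processor import PDFProcessor
    from research_analyzer import ResearchAnalyzer
    
    processor = PDFProcessor()
    with open("paper1.pdf", "rb") as f:  # hypothetical input
        papers = processor.process_multiple_pdfs([BytesIO(f.read())], ["paper1.pdf"])
    
    analyzer = ResearchAnalyzer(groq_api_key="gsk_...")  # placeholder key
    results = analyzer.analyze("machine learning in healthcare", papers)
    print(results["summary"])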


    Component 3: Streamlit Application (`app.py`)


    The Streamlit app provides the user interface for our system.


    Setup and Configuration


    import streamlit as st
    import os
    from dotenv import load_dotenv
    from pdf_processor import PDFProcessor
    from research_analyzer import ResearchAnalyzer
    
    # Load environment variables
    load_dotenv()
    
    # Page configuration
    st.set_page_config(
        page_title="AI Research Analyzer",
        page_icon="📚",
        layout="wide"
    )
    
    # Initialize session state
    if "analyzer" not in st.session_state:
        st.session_state.analyzer = None
    if "processed_papers" not in st.session_state:
        st.session_state.processed_papers = []
    if "analysis_results" not in st.session_state:
        st.session_state.analysis_results = None

    Session State:


  • Streamlit's way of maintaining data across reruns

  • Prevents re-initialization on every interaction

  • Stores processed data for display

    Sidebar Configuration


    with st.sidebar:
        st.header("⚙️ Configuration")
        
        # API Key input
        api_key_input = st.text_input(
            "Groq API Key",
            type="password",
            help="Enter your Groq API key or set GROQ_API_KEY in .env file",
            value=os.getenv("GROQ_API_KEY", "")
        )
        
        if api_key_input:
            os.environ["GROQ_API_KEY"] = api_key_input
        
        # Model selection
        model_name = st.selectbox(
            "Groq Model",
            options=[
                "llama-3.3-70b-versatile",
                "llama-3.1-8b-instant",
                "mixtral-8x7b-32768",
                "gemma-7b-it",
                "llama-3.2-90b-text-preview"
            ],
            index=0,
            help="Select the Groq model to use for analysis."
        )

    User Controls:


  • API key can be set via UI or environment variable

  • Model selection allows users to choose speed vs. quality

  • Helpful tooltips guide users

    Main Processing Logic


    if uploaded_files and query:
        if st.button("🔬 Analyze Papers", type="primary", use_container_width=True):
            with st.spinner("Processing PDFs and analyzing..."):
                try:
                    # Step 1: Initialize PDF processor
                    processor = PDFProcessor()
                    
                    # Step 2: Extract file data and names
                    pdf_data = [file.read() for file in uploaded_files]
                    file_names = [file.name for file in uploaded_files]
                    from io import BytesIO
                    pdf_files = [BytesIO(data) for data in pdf_data]
                    
                    # Step 3: Process PDFs
                    processed_papers = processor.process_multiple_pdfs(pdf_files, file_names)
                    st.session_state.processed_papers = processed_papers
                    
                    # Step 4: Initialize analyzer
                    groq_api_key = os.getenv("GROQ_API_KEY")
                    if not groq_api_key:
                        st.error("Please set GROQ_API_KEY in the sidebar or .env file")
                        st.stop()  # a bare return is invalid at the top level of a Streamlit script
                    
                    analyzer = ResearchAnalyzer(
                        groq_api_key=groq_api_key,
                        model_name=model_name
                    )
                    
                    # Step 5: Run analysis
                    results = analyzer.analyze(query, processed_papers)
                    st.session_state.analysis_results = results
                    
                    st.success("✅ Analysis complete!")
                    
                except Exception as e:
                    st.error(f"Error during analysis: {str(e)}")

    Processing Flow:


    1. Extract data: Read PDF files into memory


    2. Process PDFs: Extract text and structure data


    3. Initialize analyzer: Set up LangGraph workflow


    4. Run analysis: Execute the graph workflow


    5. Store results: Save for display


    Results Display


    if st.session_state.analysis_results:
        st.markdown("---")
        st.header("📊 Analysis Results")
        
        results = st.session_state.analysis_results
        
        # Display summary
        st.subheader("📋 Summary")
        st.info(results.get("summary", "No summary available"))
        
        # Display individual analyses
        st.subheader("📑 Individual Paper Analyses")
        
        for idx, result in enumerate(results.get("results", [])):
            paper_name = result.get("paper_name", f"Paper {result['paper_index'] + 1}")
            with st.expander(f"📄 {paper_name}", expanded=False):
                if result.get("error"):
                    st.error(f"Error: {result['error']}")
                elif result.get("analysis"):
                    st.markdown(result["analysis"])
                    st.caption(f"Text length: {result.get('raw_text_length', 0)} characters")
                else:
                    st.warning("No analysis available for this paper")

    Display Features:


  • Summary section: Overall verdict and match evaluation

  • Expandable sections: Individual paper analyses in collapsible sections

  • Error handling: Clear error messages if analysis fails

  • Metadata: Shows text length for context

    How It Works: Step-by-Step Execution


    Let's trace through a complete execution:


    Step 1: User Input


    User enters query: "machine learning in healthcare"
    User uploads: paper1.pdf, paper2.pdf, paper3.pdf

    Step 2: PDF Processing


    # PDFProcessor extracts text
    paper1 → "This paper discusses ML applications..."
    paper2 → "Healthcare data analysis using neural networks..."
    paper3 → "A survey of computer vision techniques..."

    Step 3: LangGraph Execution


    Initial State:


    {
        "query": "machine learning in healthcare",
        "papers": [
            {"index": 0, "name": "paper1.pdf", "text": "..."},
            {"index": 1, "name": "paper2.pdf", "text": "..."},
            {"index": 2, "name": "paper3.pdf", "text": "..."}
        ],
        "results": [],
        "summary": None
    }

    After Node 1 (analyze_papers):


    {
        "query": "machine learning in healthcare",
        "papers": [...],
        "results": [
            {
                "paper_index": 0,
                "paper_name": "paper1.pdf",
                "analysis": "This paper discusses ML applications in healthcare... Relevance: 95/100. VERDICT: MATCH"
            },
            {
                "paper_index": 1,
                "paper_name": "paper2.pdf",
                "analysis": "This paper focuses on healthcare data... Relevance: 88/100. VERDICT: MATCH"
            },
            {
                "paper_index": 2,
                "paper_name": "paper3.pdf",
                "analysis": "This paper discusses computer vision... Relevance: 15/100. VERDICT: NO MATCH"
            }
        ],
        "summary": None
    }

    After Node 2 (evaluate_matches):


    {
        "query": "machine learning in healthcare",
        "papers": [...],
        "results": [...],
        "summary": "Based on the analysis, 2 out of 3 papers match the query. Paper1.pdf and paper2.pdf both discuss machine learning applications in healthcare, with high relevance scores. Paper3.pdf focuses on computer vision and does not match the query."
    }

    Step 4: Display Results


  • Summary shows overall match count

  • Individual analyses show detailed evaluations

  • Users can expand each paper to see full analysis

    Key Features


    1. **Multi-PDF Processing**


  • Handles multiple PDFs simultaneously

  • Maintains file names and metadata

  • Error-resilient (continues if one PDF fails)

    2. **Intelligent Analysis**


  • Context-aware evaluation

  • Relevance scoring (0-100)

  • Clear match verdicts

  • Detailed explanations

    3. **Workflow Orchestration**


  • LangGraph manages complex state

  • Sequential node execution

  • Error handling at each step

    4. **User-Friendly Interface**


  • Simple upload and query interface

  • Real-time progress indicators

  • Expandable result sections

  • Model selection options

    5. **Flexible Configuration**


  • Multiple LLM model options

  • Configurable via UI or environment variables

  • Temperature and parameter control

    Performance Considerations


    Text Truncation


    paper_text=paper["text"][:8000]  # Limit to 8000 chars

    Why?


  • API token limits

  • Cost management

  • Faster processing

  • Most relevant content is usually at the beginning
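
    If a hard cut at 8000 characters feels too blunt, a small helper (hypothetical, not part of the current code) can at least avoid slicing mid-sentence:

    def truncate_at_boundary(text: str, limit: int = 8000) -> str:
        """Cut text at the last sentence boundary before `limit`."""
        if len(text) <= limit:
            return text
        cut = text[:limit]
        end = cut.rfind(". ")
        return cut[:end + 1] if end != -1 else cut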

    Batch Processing


  • Processes papers sequentially (could be parallelized)

  • Each paper analyzed independently

  • Summary generated after all analyses complete

    Error Handling


  • Graceful degradation

  • Individual paper failures don't stop the process

  • Clear error messages for debugging

    Advanced Use Cases


    1. **Literature Review Automation**


    Researchers can quickly identify relevant papers from large collections.


    2. **Content Discovery**


    Find documents matching specific topics in document repositories.


    3. **Quality Filtering**


    Filter papers by relevance score to focus on most relevant content.


    4. **Research Gap Analysis**


    Identify which aspects of a topic are covered or missing in a collection.


    Future Enhancements


    1. **Parallel Processing**


    # Could use asyncio or multiprocessing; LangChain chat models expose an async `ainvoke`
    async def analyze_paper_async(llm, messages):
        return await llm.ainvoke(messages)
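
    A sketch of how Node 1 could fan out with `asyncio.gather`, reusing the prompt from earlier (per-paper error resilience is kept via `return_exceptions`):

    import asyncio
    
    async def analyze_all(llm, prompt, query, papers):
        # One concurrent LLM call per paper
        tasks = [
            llm.ainvoke(prompt.format_messages(
                query=query, paper_text=p["text"][:8000]))
            for p in papers
        ]
        # return_exceptions=True keeps one failure from cancelling the rest
        return await asyncio.gather(*tasks, return_exceptions=True)
    
    # responses = asyncio.run(analyze_all(analyzer.llm, analysis_prompt, query, papers))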

    2. **Vector Embeddings**


  • Store paper embeddings for semantic search

  • Faster similarity matching

  • RAG (Retrieval Augmented Generation) integration
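
    The core idea in a few lines, assuming some `embed(text)` function that returns a vector (any embedding model would do):

    import numpy as np
    
    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    
    # Rank papers by semantic similarity to the query (embed() is hypothetical)
    # scores = [cosine_similarity(embed(query), embed(p["text"][:2000])) for p in papers]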

    3. **Citation Analysis**


  • Extract and analyze citations

  • Build citation networks

  • Identify influential papers

    4. **Multi-Query Support**


  • Handle multiple queries simultaneously

  • Compare results across queries

  • Generate comparative analysis

    5. **Export Functionality**


  • Export results as PDF/CSV

  • Generate reports

  • Shareable analysis summaries

    Best Practices Implemented


    1. **Separation of Concerns**


  • PDF processing separate from analysis

  • UI separate from business logic

  • Clear module boundaries

    2. **Error Handling**


  • Try-except blocks at critical points

  • Graceful degradation

  • User-friendly error messages

    3. **Type Hints**


  • Better code documentation

  • IDE support

  • Static analysis with tools like mypy

    4. **State Management**


  • TypedDict for state schema

  • Clear state transitions

  • Explicit state updates returned by each node

    5. **Prompt Engineering**


  • Clear system messages

  • Structured output requests

  • Context preservation

    Conclusion


    This AI Research Analyzer demonstrates the power of combining modern AI frameworks (LangChain, LangGraph) with fast inference APIs (Groq) to solve real-world problems. The architecture is:


  • Modular: Easy to extend and modify

  • Scalable: Can handle multiple papers efficiently

  • User-friendly: Intuitive Streamlit interface

  • Robust: Error handling and graceful degradation

    The system showcases how agentic AI workflows can be orchestrated using LangGraph, making complex multi-step processes manageable and maintainable.


    Whether you're a researcher looking to automate literature reviews, a student organizing research materials, or a developer building document analysis tools, this architecture provides a solid foundation for AI-powered research assistance.


    Code Repository Structure


    Research_agent/
    ├── app.py                 # Streamlit UI (182 lines)
    ├── pdf_processor.py       # PDF extraction (73 lines)
    ├── research_analyzer.py   # LangGraph workflow (174 lines)
    ├── requirements.txt       # Dependencies
    ├── setup_env.py          # Environment setup helper
    ├── .env.example          # Environment template
    └── README.md             # Project documentation

    Getting Started


    1. Install dependencies:


    pip install -r requirements.txt
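
    A plausible requirements.txt for the stack described above (exact pins are up to you):

    streamlit
    langchain
    langchain-groq
    langgraph
    pypdf
    python-dotenv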

    2. Set up environment:


    # Create .env file
    GROQ_API_KEY=your_api_key_here

    3. Run the application:


    streamlit run app.py

    4. Use the interface:


  • Enter your research query

  • Upload PDF files

  • Click "Analyze Papers"

  • Review results

    Final Thoughts


    Building AI applications doesn't have to be complex. By leveraging frameworks like LangChain and LangGraph, we can create sophisticated AI workflows with relatively simple code. The key is understanding:


  • State management - How data flows through the system

  • Prompt engineering - How to get the best results from LLMs

  • Error handling - How to make systems robust

  • User experience - How to make tools accessible

    This project demonstrates all of these principles in a practical, real-world application. Happy coding!