
AI Research Agent – Semantic Document Intelligence Platform

2025


Building an AI Research Analyzer: A Deep Dive into LangChain, LangGraph, and Groq Integration


Introduction


In the age of information overload, researchers and academics face a significant challenge: sifting through hundreds of research papers to find the ones most relevant to their work. Manual review is time-consuming, error-prone, and often overwhelming. This is where AI comes to the rescue.


In this comprehensive blog post, we'll explore how to build an AI Research Analyzer - an intelligent system that uses Large Language Models (LLMs) to automatically evaluate research papers against user queries. We'll dive deep into the architecture, technologies, and code implementation, showing you exactly how this powerful tool works under the hood.


The Problem


Imagine you're working on a research project about "machine learning applications in healthcare." You have a collection of 50 PDF research papers, and you need to identify which ones are actually relevant to your topic. Traditional approaches would require:


  • Reading through each paper manually

  • Identifying key concepts and themes

  • Comparing them against your research query

  • Making subjective judgments about relevance

This process could take days or weeks. Our AI Research Analyzer can do this in minutes, providing detailed analysis and match evaluations for each paper.


    The Solution: AI-Powered Research Analysis


    Our solution leverages the power of:


  • LangChain - For building LLM applications

  • LangGraph - For orchestrating complex AI workflows

  • Groq API - For ultra-fast LLM inference

  • Streamlit - For an intuitive web interface

    The system takes user queries and PDF files as input, processes them through an intelligent pipeline, and returns comprehensive analysis results.


    Architecture Overview


    ┌──────────────────────────────────────────────────────────────┐
    │                        User Interface                        │
    │                      (Streamlit Web App)                     │
    │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐    │
    │  │ Query Input  │  │  PDF Upload  │  │ Results Display  │    │
    │  └──────┬───────┘  └──────┬───────┘  └────────▲─────────┘    │
    └─────────┼─────────────────┼───────────────────┼──────────────┘
              │                 │                   │
              ▼                 ▼                   │
    ┌───────────────────────────────────────────────┼──────────────┐
    │              PDF Processing Layer             │              │
    │  ┌──────────────────────────────────────────┐ │              │
    │  │  Extract text from PDFs using PyPDF      │ │              │
    │  │  Structure data with file names          │ │              │
    │  └──────────────────┬───────────────────────┘ │              │
    └─────────────────────┼─────────────────────────┼──────────────┘
                          │                         │
                          ▼                         │
    ┌───────────────────────────────────────────────┴──────────────┐
    │               LangGraph Workflow Orchestration               │
    │  ┌──────────────────────┐      ┌──────────────────────┐      │
    │  │  Node 1: Analyze     │ ───► │  Node 2: Evaluate    │      │
    │  │  Individual Papers   │      │  Matches & Summary   │      │
    │  └──────────────────────┘      └──────────────────────┘      │
    └─────────────────────┬────────────────────────────────────────┘
                          │
                          ▼
    ┌──────────────────────────────────────────────────────────────┐
    │                   Groq LLM API (Llama 3.3)                   │
    │                 Fast inference for analysis                  │
    └──────────────────────────────────────────────────────────────┘

    Technology Stack Deep Dive


    1. **Python 3.13**


    Our foundation. Python provides excellent libraries for AI/ML, web development, and PDF processing.


    2. **Streamlit** - Web Interface Framework


    Streamlit is a Python framework that makes it incredibly easy to build interactive web applications. It's perfect for data science and AI applications.


    Key Features:


  • No HTML/CSS/JavaScript required

  • Built-in widgets (file uploaders, text inputs, buttons)

  • Automatic reactive updates

  • Session state management
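
    To make this concrete, here is a minimal, self-contained sketch of the pattern our app relies on (the widget labels here are illustrative, not the actual app code):

    import streamlit as st
    
    st.title("Demo")
    query = st.text_input("Research query")  # built-in text widget
    
    # Session state survives Streamlit's script reruns
    if "runs" not in st.session_state:
        st.session_state.runs = 0
    
    if st.button("Run"):
        st.session_state.runs += 1
        st.write(f"Query: {query!r} (run #{st.session_state.runs})")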

    3. **LangChain** - LLM Application Framework


    LangChain is a framework for developing applications powered by language models. It provides:


  • Abstraction layer for working with different LLM providers

  • Prompt templates for consistent, reusable prompts

  • Chain composition for complex workflows

  • Memory management for conversational applications

    4. **LangGraph** - Workflow Orchestration


    LangGraph extends LangChain with graph-based workflow capabilities:


  • State management - Maintains state across workflow steps

  • Node-based architecture - Each node performs a specific task

  • Conditional routing - Dynamic workflow paths based on conditions

  • Error handling - Built-in error recovery mechanisms

    5. **Groq API** - Ultra-Fast LLM Inference


    Groq provides lightning-fast inference for LLMs:


  • Hardware acceleration - Custom inference chips

  • Multiple model support - Llama, Mixtral, Gemma

  • Low latency - Sub-second response times

  • Cost-effective - Pay-per-use pricing
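
    Calling Groq through LangChain takes only a few lines - a minimal sketch, assuming the `langchain-groq` package and a placeholder API key:

    from langchain_groq import ChatGroq
    
    llm = ChatGroq(
        groq_api_key="gsk_...",  # placeholder - use your own key
        model_name="llama-3.3-70b-versatile",
        temperature=0.1,
    )
    print(llm.invoke("Summarize transformers in one sentence.").content)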

    6. **PyPDF** - PDF Processing


    PyPDF is a pure Python library for PDF manipulation:


  • Text extraction - Extracts text from PDF files

  • Page-by-page processing - Handles large documents

  • Metadata access - Retrieves document information
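
    The core pypdf calls we rely on fit in a short sketch (the file name is illustrative):

    import pypdf
    
    reader = pypdf.PdfReader("paper.pdf")  # hypothetical local file
    print(f"{len(reader.pages)} pages")
    first_page_text = reader.pages[0].extract_text()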

    Code Walkthrough


    Let's examine each component of our system in detail.


    Component 1: PDF Processor (`pdf_processor.py`)


    The PDF processor is responsible for extracting text content from uploaded PDF files.


    from typing import List, Optional
    import pypdf
    from io import BytesIO
    
    class PDFProcessor:
        """Processes PDF files and extracts text content"""
        
        def extract_text_from_pdf(self, pdf_file: BytesIO) -> str:
            """
            Extract text from a single PDF file
            
            This method uses PyPDF's PdfReader to parse the PDF
            and extract text from each page.
            """
            try:
                pdf_reader = pypdf.PdfReader(pdf_file)
                text = ""
                
                # Iterate through each page
                for page_num, page in enumerate(pdf_reader.pages):
                    page_text = page.extract_text()
                    # Add page markers for better context
                    text += f"\n--- Page {page_num + 1} ---\n"
                    text += page_text
                    text += "\n"
                
                return text
            except Exception as e:
                raise RuntimeError(f"Error extracting text from PDF: {e}") from e  # keep the original traceback

    Key Points:


  • Uses `BytesIO` to handle in-memory PDF data (no file system writes)

  • Processes pages sequentially to maintain document structure

  • Adds page markers to preserve context

  • Error handling ensures one bad PDF doesn't crash the entire process

    def process_multiple_pdfs(self, pdf_files: List[BytesIO], file_names: Optional[List[str]] = None) -> List[dict]:
        """
        Process multiple PDF files and return structured data
        
        This method handles batch processing of PDFs, maintaining
        file names and metadata for later reference.
        """
        processed_pdfs = []
        
        for idx, pdf_file in enumerate(pdf_files):
            # Use provided file name or generate default
            file_name = file_names[idx] if file_names and idx < len(file_names) else f"Paper {idx + 1}"
            
            try:
                text = self.extract_text_from_pdf(pdf_file)
                processed_pdfs.append({
                    "index": idx,           # For ordering
                    "name": file_name,      # Original filename
                    "text": text,           # Extracted content
                    "length": len(text)     # For metadata
                })
            except Exception as e:
                # Graceful error handling
                processed_pdfs.append({
                    "index": idx,
                    "name": file_name,
                    "text": "",
                    "error": str(e),
                    "length": 0
                })
        
        return processed_pdfs
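
    A quick usage sketch (the file name is illustrative):

    from io import BytesIO
    
    processor = PDFProcessor()
    with open("paper1.pdf", "rb") as f:  # hypothetical local file
        pdf_files = [BytesIO(f.read())]
    
    papers = processor.process_multiple_pdfs(pdf_files, ["paper1.pdf"])
    print(papers[0]["name"], papers[0]["length"])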

    Why This Design?


  • Structured output: Returns dictionaries with consistent keys

  • Error resilience: Continues processing even if one PDF fails

  • Metadata preservation: Keeps file names and indices for display

  • Scalability: Can handle any number of PDFs

    Component 2: Research Analyzer (`research_analyzer.py`)


    This is the heart of our system - it uses LangGraph to orchestrate the analysis workflow.


    State Definition


    from typing import Any, Dict, List, Optional
    from typing_extensions import TypedDict
    
    class ResearchState(TypedDict):
        """State for the research analysis graph"""
        query: str                              # User's research query
        papers: List[Dict[str, Any]]            # Processed PDF data
        results: List[Dict[str, Any]]           # Individual analyses
        summary: Optional[str]                  # Final summary

    Why TypedDict?


  • Type safety for state management

  • Clear contract for what data flows through the graph

  • IDE autocomplete support

  • Static checking of state keys and value types (TypedDict adds no runtime validation)
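
    For example, a static checker such as mypy flags a misspelled key, even though nothing is enforced at runtime:

    state: ResearchState = {
        "query": "machine learning in healthcare",
        "papers": [],
        "results": [],
        "summary": None,
    }
    state["querry"] = "oops"  # mypy: TypedDict "ResearchState" has no key "querry"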

    Initialization


    from langchain_core.prompts import ChatPromptTemplate
    from langchain_groq import ChatGroq
    from langgraph.graph import StateGraph, END
    
    class ResearchAnalyzer:
        def __init__(self, groq_api_key: str, model_name: str = "llama-3.3-70b-versatile"):
            """
            Initialize the Research Analyzer
            
            Sets up the Groq LLM client and builds the LangGraph workflow.
            """
            self.llm = ChatGroq(
                groq_api_key=groq_api_key,
                model_name=model_name,
                temperature=0.1  # Low temperature for consistent, focused analysis
            )
            self._build_graph()

    Temperature Setting:


  • `0.1` = More deterministic, focused responses

  • Lower values = More consistent analysis

  • Higher values = More creative but less reliable

    Building the LangGraph Workflow


    def _build_graph(self):
        """Build the LangGraph workflow for research analysis"""
        
        # Create a state graph with our ResearchState schema
        workflow = StateGraph(ResearchState)
        
        # Add nodes - each node is a function that processes the state
        workflow.add_node("analyze_papers", self._analyze_papers_node)
        workflow.add_node("evaluate_matches", self._evaluate_matches_node)
        
        # Set the entry point
        workflow.set_entry_point("analyze_papers")
        
        # Define the flow: analyze_papers → evaluate_matches → END
        workflow.add_edge("analyze_papers", "evaluate_matches")
        workflow.add_edge("evaluate_matches", END)
        
        # Compile the graph into an executable workflow
        self.graph = workflow.compile()

    Graph Structure:


    START → analyze_papers → evaluate_matches → END

    Why This Flow?


    1. Sequential processing: First analyze each paper individually


    2. Then synthesize: Combine individual analyses into a summary


    3. Clear separation: Each node has a single responsibility


    Node 1: Analyze Papers


    def _analyze_papers_node(self, state: ResearchState) -> ResearchState:
        """Analyze each paper individually"""
        query = state["query"]
        papers = state["papers"]
        results = []
        
        # Define the analysis prompt template
        analysis_prompt = ChatPromptTemplate.from_messages([
            ("system", """You are an expert research paper analyzer. Your task is to analyze research papers 
            and determine how well they match a given query. Be thorough and precise in your analysis.
            
            For each paper, provide:
            1. A summary of the paper's main content
            2. Key findings and contributions
            3. Relevance to the query (on a scale of 0-100)
            4. Specific sections or findings that match the query
            5. A clear match verdict (MATCH or NO MATCH)"""),
            ("human", """Query: {query}
            
            Paper Content:
            
            {paper_text}
            
            Please analyze this paper and provide a detailed evaluation of how well it matches the query.""")
        ])

    Prompt Engineering:


  • System message: Sets the AI's role and expectations

  • Structured output: Requests specific information (summary, relevance score, verdict)

  • Context: Provides both query and paper content

        for paper in papers:
            try:
                # Format the prompt with actual data
                messages = analysis_prompt.format_messages(
                    query=query,
                    paper_text=paper["text"][:8000]  # Limit to 8000 chars for API
                )
                
                # Invoke the LLM
                response = self.llm.invoke(messages)
                
                # Store results
                results.append({
                    "paper_index": paper["index"],
                    "paper_name": paper.get("name", f"Paper {paper['index'] + 1}"),
                    "analysis": response.content,
                    "raw_text_length": paper["length"]
                })
            except Exception as e:
                # Error handling per paper
                results.append({
                    "paper_index": paper["index"],
                    "paper_name": paper.get("name", f"Paper {paper['index'] + 1}"),
                    "error": str(e),
                    "analysis": None
                })
        
        # Update state with results
        state["results"] = results
        return state

    Key Design Decisions:


  • Text truncation: Limits to 8000 characters to avoid token limits

  • Per-paper error handling: One failure doesn't stop the entire process

  • State mutation: Updates and returns the state object

    Node 2: Evaluate Matches


    def _evaluate_matches_node(self, state: ResearchState) -> ResearchState:
        """Final evaluation and summary of matches"""
        query = state["query"]
        results = state["results"]
        
        # Create summary prompt
        summary_prompt = ChatPromptTemplate.from_messages([
            ("system", """You are a research evaluation expert. Summarize the analysis results 
            and provide a final verdict on which papers match the query."""),
            ("human", """Query: {query}
            
            Analysis Results:
            
            {analyses}
            
            Provide a final summary indicating which papers match the query and why.""")
        ])
        
        # Format all analyses into a single text
        analyses_text = "\n\n".join([
            f"{r.get('paper_name', 'Paper ' + str(r['paper_index'] + 1))}:\n{r['analysis']}"
            for r in results if r.get('analysis')
        ])
        
        try:
            messages = summary_prompt.format_messages(
                query=query,
                analyses=analyses_text[:10000]  # Limit length
            )
            
            summary_response = self.llm.invoke(messages)
            state["summary"] = summary_response.content
        except Exception as e:
            state["summary"] = f"Error generating summary: {str(e)}"
        
        return state

    Summary Generation:


  • Takes all individual analyses as input

  • Asks LLM to synthesize and provide overall verdict

  • Creates a cohesive summary of matches

    Main Analysis Method


    def analyze(self, query: str, papers: List[Dict[str, Any]]) -> Dict[str, Any]:
        """
        Analyze research papers against a query
        
        This is the main entry point that orchestrates the entire workflow.
        """
        # Initialize state
        initial_state = ResearchState(
            query=query,
            papers=papers,
            results=[],
            summary=None
        )
        
        # Execute the graph
        final_state = self.graph.invoke(initial_state)
        
        # Return formatted results
        return {
            "query": query,
            "results": final_state["results"],
            "summary": final_state.get("summary", "No summary available")
        }

    Workflow Execution:


    1. Create initial state with query and papers


    2. Invoke the compiled graph


    3. Graph executes nodes sequentially


    4. Return final state with all results
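
    Putting the two components together, a minimal driver script might look like this (the file name and API key are placeholders):

    from io import BytesIO
    from pdf_processor import PDFProcessor
    from research_analyzer import ResearchAnalyzer
    
    processor = PDFProcessor()
    with open("paper1.pdf", "rb") as f:  # hypothetical input
        papers = processor.process_multiple_pdfs([BytesIO(f.read())], ["paper1.pdf"])
    
    analyzer = ResearchAnalyzer(groq_api_key="gsk_...")  # placeholder key
    results = analyzer.analyze("machine learning in healthcare", papers)
    print(results["summary"])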


    Component 3: Streamlit Application (`app.py`)


    The Streamlit app provides the user interface for our system.


    Setup and Configuration


    import streamlit as st
    import os
    from dotenv import load_dotenv
    from pdf_processor import PDFProcessor
    from research_analyzer import ResearchAnalyzer
    
    # Load environment variables
    load_dotenv()
    
    # Page configuration
    st.set_page_config(
        page_title="AI Research Analyzer",
        page_icon="📚",
        layout="wide"
    )
    
    # Initialize session state
    if "analyzer" not in st.session_state:
        st.session_state.analyzer = None
    if "processed_papers" not in st.session_state:
        st.session_state.processed_papers = []
    if "analysis_results" not in st.session_state:
        st.session_state.analysis_results = None

    Session State:


  • Streamlit's way of maintaining data across reruns

  • Prevents re-initialization on every interaction

  • Stores processed data for display

    Sidebar Configuration


    with st.sidebar:
        st.header("⚙️ Configuration")
        
        # API Key input
        api_key_input = st.text_input(
            "Groq API Key",
            type="password",
            help="Enter your Groq API key or set GROQ_API_KEY in .env file",
            value=os.getenv("GROQ_API_KEY", "")
        )
        
        if api_key_input:
            os.environ["GROQ_API_KEY"] = api_key_input
        
        # Model selection
        model_name = st.selectbox(
            "Groq Model",
            options=[
                "llama-3.3-70b-versatile",
                "llama-3.1-8b-instant",
                "mixtral-8x7b-32768",
                "gemma-7b-it",
                "llama-3.2-90b-text-preview"
            ],
            index=0,
            help="Select the Groq model to use for analysis."
        )

    User Controls:


  • API key can be set via UI or environment variable

  • Model selection allows users to choose speed vs. quality

  • Helpful tooltips guide users

    Main Processing Logic


    if uploaded_files and query:
        if st.button("🔬 Analyze Papers", type="primary", use_container_width=True):
            with st.spinner("Processing PDFs and analyzing..."):
                try:
                    # Step 1: Initialize PDF processor
                    processor = PDFProcessor()
                    
                    # Step 2: Extract file data and names
                    pdf_data = [file.read() for file in uploaded_files]
                    file_names = [file.name for file in uploaded_files]
                    from io import BytesIO
                    pdf_files = [BytesIO(data) for data in pdf_data]
                    
                    # Step 3: Process PDFs
                    processed_papers = processor.process_multiple_pdfs(pdf_files, file_names)
                    st.session_state.processed_papers = processed_papers
                    
                    # Step 4: Initialize analyzer
                    groq_api_key = os.getenv("GROQ_API_KEY")
                    if not groq_api_key:
                        st.error("Please set GROQ_API_KEY in the sidebar or .env file")
                        st.stop()  # a bare return is invalid at the top level of a Streamlit script
                    
                    analyzer = ResearchAnalyzer(
                        groq_api_key=groq_api_key,
                        model_name=model_name
                    )
                    
                    # Step 5: Run analysis
                    results = analyzer.analyze(query, processed_papers)
                    st.session_state.analysis_results = results
                    
                    st.success("✅ Analysis complete!")
                    
                except Exception as e:
                    st.error(f"Error during analysis: {str(e)}")

    Processing Flow:


    1. Extract data: Read PDF files into memory


    2. Process PDFs: Extract text and structure data


    3. Initialize analyzer: Set up LangGraph workflow


    4. Run analysis: Execute the graph workflow


    5. Store results: Save for display


    Results Display


    if st.session_state.analysis_results:
        st.markdown("---")
        st.header("📊 Analysis Results")
        
        results = st.session_state.analysis_results
        
        # Display summary
        st.subheader("📋 Summary")
        st.info(results.get("summary", "No summary available"))
        
        # Display individual analyses
        st.subheader("📑 Individual Paper Analyses")
        
        for idx, result in enumerate(results.get("results", [])):
            paper_name = result.get("paper_name", f"Paper {result['paper_index'] + 1}")
            with st.expander(f"📄 {paper_name}", expanded=False):
                if result.get("error"):
                    st.error(f"Error: {result['error']}")
                elif result.get("analysis"):
                    st.markdown(result["analysis"])
                    st.caption(f"Text length: {result.get('raw_text_length', 0)} characters")
                else:
                    st.warning("No analysis available for this paper")

    Display Features:


  • Summary section: Overall verdict and match evaluation

  • Expandable sections: Individual paper analyses in collapsible sections

  • Error handling: Clear error messages if analysis fails

  • Metadata: Shows text length for context

    How It Works: Step-by-Step Execution


    Let's trace through a complete execution:


    Step 1: User Input


    User enters query: "machine learning in healthcare"
    User uploads: paper1.pdf, paper2.pdf, paper3.pdf

    Step 2: PDF Processing


    # PDFProcessor extracts text
    paper1 → "This paper discusses ML applications..."
    paper2 → "Healthcare data analysis using neural networks..."
    paper3 → "A survey of computer vision techniques..."

    Step 3: LangGraph Execution


    Initial State:


    {
        "query": "machine learning in healthcare",
        "papers": [
            {"index": 0, "name": "paper1.pdf", "text": "..."},
            {"index": 1, "name": "paper2.pdf", "text": "..."},
            {"index": 2, "name": "paper3.pdf", "text": "..."}
        ],
        "results": [],
        "summary": None
    }

    After Node 1 (analyze_papers):


    {
        "query": "machine learning in healthcare",
        "papers": [...],
        "results": [
            {
                "paper_index": 0,
                "paper_name": "paper1.pdf",
                "analysis": "This paper discusses ML applications in healthcare... Relevance: 95/100. VERDICT: MATCH"
            },
            {
                "paper_index": 1,
                "paper_name": "paper2.pdf",
                "analysis": "This paper focuses on healthcare data... Relevance: 88/100. VERDICT: MATCH"
            },
            {
                "paper_index": 2,
                "paper_name": "paper3.pdf",
                "analysis": "This paper discusses computer vision... Relevance: 15/100. VERDICT: NO MATCH"
            }
        ],
        "summary": None
    }

    After Node 2 (evaluate_matches):


    {
        "query": "machine learning in healthcare",
        "papers": [...],
        "results": [...],
        "summary": "Based on the analysis, 2 out of 3 papers match the query. Paper1.pdf and paper2.pdf both discuss machine learning applications in healthcare, with high relevance scores. Paper3.pdf focuses on computer vision and does not match the query."
    }

    Step 4: Display Results


  • Summary shows overall match count

  • Individual analyses show detailed evaluations

  • Users can expand each paper to see full analysis

    Key Features


    1. **Multi-PDF Processing**


  • Handles multiple PDFs simultaneously

  • Maintains file names and metadata

  • Error-resilient (continues if one PDF fails)

    2. **Intelligent Analysis**


  • Context-aware evaluation

  • Relevance scoring (0-100)

  • Clear match verdicts

  • Detailed explanations

    3. **Workflow Orchestration**


  • LangGraph manages complex state

  • Sequential node execution

  • Error handling at each step

    4. **User-Friendly Interface**


  • Simple upload and query interface

  • Real-time progress indicators

  • Expandable result sections

  • Model selection options

    5. **Flexible Configuration**


  • Multiple LLM model options

  • Configurable via UI or environment variables

  • Temperature and parameter control

    Performance Considerations


    Text Truncation


    paper_text=paper["text"][:8000]  # Limit to 8000 chars

    Why?


  • API token limits

  • Cost management

  • Faster processing

  • Most relevant content is usually at the beginning
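
    If a hard cut at 8000 characters feels too blunt, a small helper (hypothetical, not part of the current code) can at least avoid slicing mid-sentence:

    def truncate_at_boundary(text: str, limit: int = 8000) -> str:
        """Cut text at the last sentence boundary before `limit`."""
        if len(text) <= limit:
            return text
        cut = text[:limit]
        end = cut.rfind(". ")
        return cut[:end + 1] if end != -1 else cut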

    Batch Processing


  • Processes papers sequentially (could be parallelized)

  • Each paper analyzed independently

  • Summary generated after all analyses complete

    Error Handling


  • Graceful degradation

  • Individual paper failures don't stop the process

  • Clear error messages for debugging

    Advanced Use Cases


    1. **Literature Review Automation**


    Researchers can quickly identify relevant papers from large collections.


    2. **Content Discovery**


    Find documents matching specific topics in document repositories.


    3. **Quality Filtering**


    Filter papers by relevance score to focus on most relevant content.


    4. **Research Gap Analysis**


    Identify which aspects of a topic are covered or missing in a collection.


    Future Enhancements


    1. **Parallel Processing**


    # Could use asyncio or multiprocessing; LangChain chat models expose an async `ainvoke`
    async def analyze_paper_async(llm, messages):
        return await llm.ainvoke(messages)
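
    A sketch of how Node 1 could fan out with `asyncio.gather`, reusing the prompt from earlier (per-paper error resilience is kept via `return_exceptions`):

    import asyncio
    
    async def analyze_all(llm, prompt, query, papers):
        # One concurrent LLM call per paper
        tasks = [
            llm.ainvoke(prompt.format_messages(
                query=query, paper_text=p["text"][:8000]))
            for p in papers
        ]
        # return_exceptions=True keeps one failure from cancelling the rest
        return await asyncio.gather(*tasks, return_exceptions=True)
    
    # responses = asyncio.run(analyze_all(analyzer.llm, analysis_prompt, query, papers))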

    2. **Vector Embeddings**


  • Store paper embeddings for semantic search

  • Faster similarity matching

  • RAG (Retrieval Augmented Generation) integration
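
    The core idea in a few lines, assuming some `embed(text)` function that returns a vector (any embedding model would do):

    import numpy as np
    
    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    
    # Rank papers by semantic similarity to the query (embed() is hypothetical)
    # scores = [cosine_similarity(embed(query), embed(p["text"][:2000])) for p in papers]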

    3. **Citation Analysis**


  • Extract and analyze citations

  • Build citation networks

  • Identify influential papers

    4. **Multi-Query Support**


  • Handle multiple queries simultaneously

  • Compare results across queries

  • Generate comparative analysis

    5. **Export Functionality**


  • Export results as PDF/CSV

  • Generate reports

  • Shareable analysis summaries

    Best Practices Implemented


    1. **Separation of Concerns**


  • PDF processing separate from analysis

  • UI separate from business logic

  • Clear module boundaries

    2. **Error Handling**


  • Try-except blocks at critical points

  • Graceful degradation

  • User-friendly error messages

    3. **Type Hints**


  • Better code documentation

  • IDE support

  • Static analysis with tools like mypy

    4. **State Management**


  • TypedDict for state schema

  • Clear state transitions

  • Explicit state updates returned by each node

    5. **Prompt Engineering**


  • Clear system messages

  • Structured output requests

  • Context preservation

    Conclusion


    This AI Research Analyzer demonstrates the power of combining modern AI frameworks (LangChain, LangGraph) with fast inference APIs (Groq) to solve real-world problems. The architecture is:


  • Modular: Easy to extend and modify

  • Scalable: Can handle multiple papers efficiently

  • User-friendly: Intuitive Streamlit interface

  • Robust: Error handling and graceful degradation

    The system showcases how agentic AI workflows can be orchestrated using LangGraph, making complex multi-step processes manageable and maintainable.


    Whether you're a researcher looking to automate literature reviews, a student organizing research materials, or a developer building document analysis tools, this architecture provides a solid foundation for AI-powered research assistance.


    Code Repository Structure


    Research_agent/
    ├── app.py                 # Streamlit UI (182 lines)
    ├── pdf_processor.py       # PDF extraction (73 lines)
    ├── research_analyzer.py   # LangGraph workflow (174 lines)
    ├── requirements.txt       # Dependencies
    ├── setup_env.py          # Environment setup helper
    ├── .env.example          # Environment template
    └── README.md             # Project documentation

    Getting Started


    1. Install dependencies:


    pip install -r requirements.txt
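
    A plausible requirements.txt for the stack described above (exact pins are up to you):

    streamlit
    langchain
    langchain-groq
    langgraph
    pypdf
    python-dotenv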

    2. Set up environment:


    # Create .env file
    GROQ_API_KEY=your_api_key_here

    3. Run the application:


    streamlit run app.py

    4. Use the interface:


  • Enter your research query

  • Upload PDF files

  • Click "Analyze Papers"

  • Review results

    Final Thoughts


    Building AI applications doesn't have to be complex. By leveraging frameworks like LangChain and LangGraph, we can create sophisticated AI workflows with relatively simple code. The key is understanding:


  • State management - How data flows through the system

  • Prompt engineering - How to get the best results from LLMs

  • Error handling - How to make systems robust

  • User experience - How to make tools accessible

    This project demonstrates all of these principles in a practical, real-world application. Happy coding!