XBO Product Query System

This repository contains a PDF-based product query assistant built using LangChain, OpenAI models, and FAISS for vector-based search. It processes PDF documents, creates a searchable FAISS index, and allows querying product-related information interactively.


Features

  • Processes PDF documents and extracts text.
  • Creates and saves a FAISS index for fast, vector-based querying.
  • Utilizes OpenAI's GPT models to answer natural language questions about products.
  • Provides an interactive CLI-based query interface.
  • Allows toggling FAISS index creation so an existing index can be reused instead of being rebuilt on every run.

Requirements

  • Python 3.8+
  • An OpenAI API key
  • A folder containing PDF documents

Installation

  1. Clone the Repository:

    git clone https://gitea.digital-bridge.net/nasim/CodeChallenge.git
    cd CodeChallenge
    
  2. Set Up a Virtual Environment:

    python -m venv venv
    source venv/bin/activate   # On Windows: venv\Scripts\activate
    
  3. Install Dependencies: install the required Python libraries listed in requirements.txt:

    pip install -r requirements.txt
    

Configuration

The application configuration is defined in app/config.py.

Setting Up the OpenAI API Key

Replace the placeholder API key or use environment variables for production:

  • Open app/config.py and add your OpenAI API key:

    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "your-openai-api-key")
    
  • Alternatively, create a .env file at the root of your project and add:

    OPENAI_API_KEY=your-openai-api-key
    PDF_FOLDER=./app/pdfs
    
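For reference, a minimal sketch of how app/config.py can combine both approaches, assuming python-dotenv is used to read the .env file (only the variable names shown above are taken from the repository; everything else is illustrative):

    import os

    from dotenv import load_dotenv

    # Load variables from a .env file at the project root, if one exists.
    load_dotenv()

    # OpenAI credentials and PDF location; the defaults are placeholders only.
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "your-openai-api-key")
    PDF_FOLDER = os.getenv("PDF_FOLDER", "./app/pdfs")

    # Whether to build a new FAISS index on startup (see the next section).
    CREATE_FAISS_INDEX = True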

CREATE_FAISS_INDEX Flag

  • This flag determines whether to create a new FAISS index from PDFs or load an existing one.

  • In app/config.py, set:

    CREATE_FAISS_INDEX = True  # To create a new index
    CREATE_FAISS_INDEX = False # To load an existing index
    

On the first run, set this flag to True so the index is built from your PDFs; once the index has been saved, set it to False to reuse it on subsequent runs.
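
A sketch of the create-or-load logic this flag typically controls, assuming the LangChain community FAISS wrapper and OpenAI embeddings (the function name and index path are illustrative, not necessarily those used in faiss_service.py):

    from langchain_community.vectorstores import FAISS
    from langchain_openai import OpenAIEmbeddings

    from app.config import CREATE_FAISS_INDEX

    INDEX_PATH = "faiss_index"  # illustrative location for the saved index

    def get_vector_store(documents=None):
        """Build a new FAISS index from documents or load a previously saved one."""
        embeddings = OpenAIEmbeddings()  # picks up OPENAI_API_KEY from the environment
        if CREATE_FAISS_INDEX:
            # First run: embed the chunked PDFs and persist the index to disk.
            store = FAISS.from_documents(documents, embeddings)
            store.save_local(INDEX_PATH)
        else:
            # Subsequent runs: reuse the index created earlier.
            store = FAISS.load_local(
                INDEX_PATH,
                embeddings,
                allow_dangerous_deserialization=True,  # needed in recent langchain-community versions
            )
        return store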

Usage

  1. Place all PDF files in the folder specified in PDF_FOLDER (default: ./app/pdfs).

  2. Run the application:

    python -m app.main
    
  3. Interact with the assistant using natural language questions:

    • Example: "Welche Leuchtmittel haben eine Lebensdauer von mehr als 3000 Stunden?" ("Which lamps have a service life of more than 3000 hours?")
    • To exit, type exit.
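
The interactive loop in app/main.py presumably looks roughly like the following; the RetrievalQA chain and the model name are assumptions for illustration, and the actual wiring may differ:

    from langchain.chains import RetrievalQA
    from langchain_openai import ChatOpenAI

    def run_cli(vector_store):
        """Answer questions from the terminal until the user types 'exit'."""
        qa = RetrievalQA.from_chain_type(
            llm=ChatOpenAI(model="gpt-4o-mini"),  # model name is an assumption
            retriever=vector_store.as_retriever(),
        )
        while True:
            question = input("Question: ").strip()
            if question.lower() == "exit":
                break
            answer = qa.invoke({"query": question})
            print(answer["result"])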

Folder Structure

xbo-product-query/
│
├── app/
│   ├── main.py                # Application entry point
│   ├── config.py              # Configuration settings
│   ├── services/
│   │   ├── file_service.py    # Handles PDF file processing
│   │   ├── faiss_service.py   # Handles FAISS index creation/loading
│   │   ├── dependencies.py    # Dependency injection for services
│   │
│   ├── pdfs/                  # Directory to store PDF files
│
├── requirements.txt           # Python dependencies
└── README.md                  # Documentation
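
Given this layout, dependencies.py plausibly wires the services together along these lines (purely illustrative; the function names below are hypothetical):

    from app.config import CREATE_FAISS_INDEX, PDF_FOLDER
    from app.services.faiss_service import get_vector_store   # hypothetical name
    from app.services.file_service import load_pdf_documents  # hypothetical name

    def build_vector_store():
        """Assemble the vector store the CLI depends on."""
        # Only process the PDFs when a fresh index is being created.
        documents = load_pdf_documents(PDF_FOLDER) if CREATE_FAISS_INDEX else None
        return get_vector_store(documents)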

Scalability

When scaling up, the following optimizations can be applied:

  1. Update FAISS

    • Allow appending new documents to the existing FAISS index without rebuilding it entirely (see the first sketch after this list).
  2. Memory and Disk Management

    • Use a persistent FAISS index stored on disk, which can be loaded as needed.
    • Use FAISS's on-disk options (for example, memory-mapped indexes or on-disk inverted lists) to load only the necessary portions of the index into memory, reducing RAM consumption.
  3. Batch Processing of Queries

    • Split large query workloads into smaller batches to keep response times predictable.
    • Distribute query processing across multiple threads or workers to enable parallel computation.
  4. Efficient Chunking Strategy

    • Divide large PDF texts into smaller, manageable chunks.
    • Implement a consistent chunking algorithm so that relevant information is efficiently embedded and retrieved (see the second sketch after this list).
  5. Map-Reduce Strategy

    • Use LangChain's Map-Reduce approach for processing large datasets. The Map step processes individual document chunks, and the Reduce step combines intermediate results into a final response.
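
First sketch: incremental index updates, assuming the same LangChain FAISS wrapper as above; add_documents appends new embeddings without a full rebuild:

    def append_to_index(store, new_documents, index_path="faiss_index"):
        """Add newly processed PDF chunks to an existing index and persist it."""
        store.add_documents(new_documents)
        store.save_local(index_path)

Second sketch: a consistent chunking step, assuming PyPDFLoader and RecursiveCharacterTextSplitter; the chunk size and overlap are illustrative values to be tuned:

    from pathlib import Path

    from langchain_community.document_loaders import PyPDFLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    def load_and_chunk(pdf_folder):
        """Load every PDF in the folder and split it into overlapping chunks."""
        splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
        chunks = []
        for pdf_path in Path(pdf_folder).glob("*.pdf"):
            pages = PyPDFLoader(str(pdf_path)).load()
            chunks.extend(splitter.split_documents(pages))
        return chunks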