# XBO Product Query System
This repository contains a PDF-based product query assistant built using LangChain, OpenAI models, and FAISS for vector-based search. It processes PDF documents, creates a searchable FAISS index, and allows querying product-related information interactively.
## Features
- Processes PDF documents and extracts text.
- Creates and saves a FAISS index for fast, vector-based querying.
- Utilizes OpenAI's GPT models to answer natural language questions about products.
- Provides an interactive CLI-based query interface.
- Allows toggling FAISS index creation, so an existing index can be reused instead of rebuilt on every run.
## Requirements
- Python 3.8+
- An OpenAI API key
- A folder containing PDF documents
## Installation
1. Clone the repository:

   ```bash
   git clone https://gitea.digital-bridge.net/nasim/CodeChallenge.git
   cd CodeChallenge
   ```

2. Set up a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate
   ```

3. Install the required Python libraries from the `requirements.txt` file:

   ```bash
   pip install -r requirements.txt
   ```
## Configuration

The application configuration is defined in `app/config.py`.
### Setting Up the OpenAI API Key
Replace the placeholder API key or use environment variables for production:

1. Open `app/config.py` and add your OpenAI API key:

   ```python
   OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "your-openai-api-key")
   ```

2. Alternatively, create a `.env` file at the root of your project and add:

   ```
   OPENAI_API_KEY=your-openai-api-key
   PDF_FOLDER=./app/pdfs
   ```
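For orientation, here is a minimal sketch of how `app/config.py` could tie these settings together. The use of `python-dotenv` to read the `.env` file is an assumption for illustration, not necessarily how the repository does it:

```python
# Illustrative sketch of app/config.py, assuming python-dotenv; not verbatim.
import os

from dotenv import load_dotenv

# Read key/value pairs from a .env file at the project root, if one exists.
load_dotenv()

# API key: taken from the environment, with a placeholder fallback.
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "your-openai-api-key")

# Folder scanned for PDF documents.
PDF_FOLDER = os.getenv("PDF_FOLDER", "./app/pdfs")

# Whether to build a fresh FAISS index (True) or load a saved one (False).
CREATE_FAISS_INDEX = True
```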
### `CREATE_FAISS_INDEX` Flag
This flag determines whether to create a new FAISS index from the PDFs or to load an existing one. In `app/config.py`, set:

```python
CREATE_FAISS_INDEX = True   # To create a new index
CREATE_FAISS_INDEX = False  # To load an existing index
```
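To illustrate what the flag controls, here is a hedged sketch of a create-or-load routine built on LangChain's FAISS wrapper. The function name `get_vector_store`, the `faiss_index` path, and the chunking parameters are assumptions; the actual `faiss_service.py` may differ:

```python
# Sketch of the create-or-load logic behind CREATE_FAISS_INDEX.
# Names and paths are illustrative, not the repository's exact code.
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

from app import config

INDEX_PATH = "faiss_index"  # hypothetical on-disk location of the index


def get_vector_store() -> FAISS:
    embeddings = OpenAIEmbeddings(api_key=config.OPENAI_API_KEY)
    if config.CREATE_FAISS_INDEX:
        # Load every PDF in the folder and split the text into overlapping
        # chunks so each embedding covers a manageable span of text.
        docs = []
        for path in Path(config.PDF_FOLDER).glob("*.pdf"):
            docs.extend(PyPDFLoader(str(path)).load())
        chunks = RecursiveCharacterTextSplitter(
            chunk_size=1000, chunk_overlap=200
        ).split_documents(docs)
        store = FAISS.from_documents(chunks, embeddings)
        store.save_local(INDEX_PATH)
    else:
        # Reuse the saved index instead of re-embedding every PDF.
        store = FAISS.load_local(
            INDEX_PATH, embeddings, allow_dangerous_deserialization=True
        )
    return store
```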
## Usage
1. Place all PDF files in the folder specified by `PDF_FOLDER` (default: `./app/pdfs`).

2. Run the application:

   ```bash
   python -m app.main
   ```

3. Interact with the assistant using natural language questions:
   - Example: "Welche Leuchtmittel haben eine Lebensdauer von mehr als 3000 Stunden?" ("Which lamps have a lifetime of more than 3000 hours?")
   - To exit, type `exit`.
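For context, the interactive loop in `app/main.py` plausibly resembles the following sketch. The use of a `RetrievalQA` chain, the model name, and the `get_vector_store` import are assumptions for illustration:

```python
# Illustrative sketch of an interactive query loop; not the verbatim main.py.
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

from app.services.faiss_service import get_vector_store  # hypothetical helper


def main() -> None:
    store = get_vector_store()
    qa = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(model="gpt-4o-mini"),  # model choice is illustrative
        retriever=store.as_retriever(),
    )
    while True:
        question = input("Question (or 'exit'): ").strip()
        if question.lower() == "exit":
            break
        print(qa.invoke({"query": question})["result"])


if __name__ == "__main__":
    main()
```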
## Folder Structure
```
CodeChallenge/
│
├── app/
│   ├── main.py              # Application entry point
│   ├── config.py            # Configuration settings
│   ├── services/
│   │   ├── file_service.py  # Handles PDF file processing
│   │   ├── faiss_service.py # Handles FAISS index creation/loading
│   │   ├── dependencies.py  # Dependency injection for services
│   │
│   ├── pdfs/                # Directory to store PDF files
│
├── requirements.txt         # Python dependencies
└── README.md                # Documentation
```
## Scalability

When scaling up, the following optimizations can be applied:
- **Incremental FAISS updates**
  - Allow appending new documents to the existing FAISS index without rebuilding it entirely (see the first sketch after this list).
- **Use a vector store database**
  - Migrate to a dedicated vector database (e.g., Milvus, Qdrant, or Pinecone) for scalability beyond a single on-disk index.
- **Batch processing of queries**
  - Break large queries into smaller batches to improve response times.
  - Distribute query processing across multiple threads or workers to enable parallel computation.
- **Efficient chunking strategy**
  - Divide large PDF texts into smaller, manageable chunks.
  - Apply a consistent chunking algorithm so that relevant information is efficiently embedded and retrieved.
- **Map-Reduce strategy**
  - Use LangChain's map-reduce approach for processing large datasets: the map step processes individual document chunks, and the reduce step combines the intermediate results into a final response (see the second sketch after this list).
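Here is a hedged sketch of the incremental update mentioned in the first item, using LangChain's FAISS wrapper. The index path, the new PDF's file name, and the chunking parameters are illustrative assumptions:

```python
# Sketch: append new documents to a saved FAISS index instead of rebuilding it.
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

embeddings = OpenAIEmbeddings()
store = FAISS.load_local(
    "faiss_index", embeddings, allow_dangerous_deserialization=True
)

# Embed only the new PDF and add its chunks to the existing index.
new_docs = PyPDFLoader("./app/pdfs/new_product_sheet.pdf").load()  # hypothetical file
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
).split_documents(new_docs)
store.add_documents(chunks)

# Persist the grown index so later runs can load it with CREATE_FAISS_INDEX = False.
store.save_local("faiss_index")
```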
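And a sketch of the map-reduce strategy from the last item, reusing the `store` from the sketch above. `chain_type="map_reduce"` is LangChain's built-in map-reduce QA chain; the model name and retriever settings are assumptions:

```python
# Sketch: answer each retrieved chunk separately (map), then combine the
# partial answers into one final response (reduce).
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini"),  # model choice is illustrative
    chain_type="map_reduce",
    retriever=store.as_retriever(search_kwargs={"k": 20}),
)
result = qa.invoke(
    {"query": "Welche Leuchtmittel haben eine Lebensdauer von mehr als 3000 Stunden?"}
)
print(result["result"])
```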