# **XBO Product Query System**

This repository contains a **PDF-based product query assistant** built using **LangChain**, **OpenAI** models, and **FAISS** for vector-based search. It processes PDF documents, creates a searchable FAISS index, and allows querying product-related information interactively.

---

## **Features**

- Processes PDF documents and extracts text.
- Creates and saves a FAISS index for fast, vector-based querying.
- Utilizes OpenAI's GPT models to answer natural-language questions about products.
- Provides an interactive CLI-based query interface.
- Allows toggling FAISS index creation so an existing index can be reused.

---

## **Requirements**

- Python 3.8+
- An OpenAI API key
- A folder containing PDF documents

---

## **Installation**

1. **Clone the Repository**:

   ```bash
   git clone https://gitea.digital-bridge.net/nasim/CodeChallenge.git
   cd CodeChallenge
   ```

2. **Set Up a Virtual Environment**:

   ```bash
   python -m venv venv
   source venv/bin/activate
   ```

3. **Install Dependencies**:

   Install all required Python libraries using the `requirements.txt` file:

   ```bash
   pip install -r requirements.txt
   ```

---

## **Configuration**

The application configuration is defined in **`app/config.py`**.

### **Setting Up the OpenAI API Key**

Replace the placeholder API key, or use environment variables for production:

- Open **`app/config.py`** and add your OpenAI API key:

  ```python
  OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "your-openai-api-key")
  ```

- Alternatively, create a `.env` file at the root of your project and add:

  ```dotenv
  OPENAI_API_KEY=your-openai-api-key
  PDF_FOLDER=./app/pdfs
  ```

### **CREATE_FAISS_INDEX Flag**

This flag determines whether to create a new FAISS index from the PDFs or load an existing one. In **`app/config.py`**, set:

```python
CREATE_FAISS_INDEX = True   # To create a new index
CREATE_FAISS_INDEX = False  # To load an existing index
```

A sketch of this create-or-load flow is shown under **Implementation Sketches** below.

---

## **Usage**

1. Place all PDF files in the folder specified by `PDF_FOLDER` (default: `./app/pdfs`).
2. Run the application:

   ```bash
   python -m app.main
   ```

3. Interact with the assistant using natural-language questions:
   - Example: *"Which lamps have a lifetime of more than 3000 hours?"*
   - To exit, type `exit`.

---

## **Folder Structure**

```
CodeChallenge/
│
├── app/
│   ├── main.py                 # Application entry point
│   ├── config.py               # Configuration settings
│   ├── services/
│   │   ├── file_service.py     # Handles PDF file processing
│   │   ├── faiss_service.py    # Handles FAISS index creation/loading
│   │   ├── dependencies.py     # Dependency injection for services
│   │
│   ├── pdfs/                   # Directory to store PDF files
│
├── requirements.txt            # Python dependencies
└── README.md                   # Documentation
```

---

## **Scalability**

When scaling up, the following optimizations can be applied:

1. **Incremental FAISS Updates**
   - Allow appending new documents to the existing FAISS index without rebuilding it entirely (see the sketch below).
2. **Use a Vector Database**
   - Replace the file-based FAISS index with a dedicated vector database, so the index can grow beyond a single machine's memory, persist reliably, and be shared across application instances.
3. **Batch Processing of Queries**
   - Break large query workloads into smaller batches to improve response times.
   - Distribute query processing across multiple threads or workers to enable parallel computation (see the sketch below).
4. **Efficient Chunking Strategy**
   - Divide large PDF texts into smaller, manageable chunks.
   - Implement a consistent chunking algorithm to ensure relevant information is efficiently embedded and retrieved (see the sketch below).
5. **Map-Reduce Strategy**
   - Use LangChain's map-reduce approach for processing large document sets: the map step processes individual document chunks, and the reduce step combines the intermediate results into a final response (see the sketch below).
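
---

## **Implementation Sketches**

The sketches in this section are illustrative only; they do not reproduce the actual code in `app/services/`. Helper names (such as `build_or_load_index`), paths (such as `faiss_index/`), and sample data are assumptions, and the imports assume recent `langchain-community`, `langchain-openai`, and `pypdf` releases.

First, a minimal sketch of how the `CREATE_FAISS_INDEX` flag could drive a create-or-load flow, assuming the index is persisted to a local folder:

```python
import os

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

PDF_FOLDER = "./app/pdfs"    # matches the default in app/config.py
INDEX_PATH = "faiss_index"   # hypothetical save location
CREATE_FAISS_INDEX = True    # mirrors the flag in app/config.py


def build_or_load_index(embeddings: OpenAIEmbeddings) -> FAISS:
    """Create a fresh FAISS index from the PDFs, or load a saved one."""
    if CREATE_FAISS_INDEX:
        docs = []
        for name in os.listdir(PDF_FOLDER):
            if name.lower().endswith(".pdf"):
                # PyPDFLoader yields one Document per PDF page.
                docs.extend(PyPDFLoader(os.path.join(PDF_FOLDER, name)).load())
        index = FAISS.from_documents(docs, embeddings)
        index.save_local(INDEX_PATH)
        return index
    # Loading a pickled local index requires explicitly opting in.
    return FAISS.load_local(
        INDEX_PATH, embeddings, allow_dangerous_deserialization=True
    )


index = build_or_load_index(OpenAIEmbeddings())
```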
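For scalability point 1, LangChain's FAISS wrapper exposes `add_documents`, which embeds and appends new chunks without touching existing vectors. A minimal sketch, assuming the index folder and the new PDF's file name from above are placeholders:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
index = FAISS.load_local(
    "faiss_index", embeddings, allow_dangerous_deserialization=True
)

# Embed and append only the new document's pages; the rest of the
# index is left untouched, so there is no full rebuild.
new_docs = PyPDFLoader("./app/pdfs/new_datasheet.pdf").load()  # hypothetical file
index.add_documents(new_docs)
index.save_local("faiss_index")  # persist the grown index
```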
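For scalability point 3, each query is dominated by network I/O (embedding and chat-completion calls), so a thread pool lets several queries wait on the API concurrently. The `answer` function below is a hypothetical stand-in for the app's single-query path:

```python
from concurrent.futures import ThreadPoolExecutor


def answer(question: str) -> str:
    """Stand-in for the app's single-query path (retrieval + LLM call)."""
    return f"(answer to: {question})"


questions = [
    "Which lamps have a lifetime of more than 3000 hours?",
    "Which lamps support a 230 V supply?",
]

# A small worker pool processes the batch in parallel; for CPU-bound
# steps, a process pool or separate worker services would fit better.
with ThreadPoolExecutor(max_workers=4) as pool:
    answers = list(pool.map(answer, questions))

for q, a in zip(questions, answers):
    print(q, "->", a)
```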
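For scalability point 4, one common chunking approach is LangChain's `RecursiveCharacterTextSplitter`, which splits on paragraph, sentence, and word boundaries before falling back to raw characters. A sketch, with a hypothetical input file and example chunk sizes:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = PyPDFLoader("./app/pdfs/example_datasheet.pdf").load()  # hypothetical file

# Fixed-size chunks with overlap: 1000 characters keeps each chunk well
# within the embedding model's context window, and the 200-character
# overlap reduces the chance of a spec (e.g. a lifetime value) being
# split across two chunks with no shared context.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
```

Applying the same splitter settings at indexing time and to any newly appended documents keeps retrieval behavior consistent across the whole index.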
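For scalability point 5, one way to express the map-reduce pattern is LangChain's classic `load_qa_chain` helper with `chain_type="map_reduce"` (deprecated in newer releases, but it illustrates the pattern). The documents and model name below are made-up examples:

```python
from langchain.chains.question_answering import load_qa_chain
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI

# Hypothetical stand-ins for retrieved PDF chunks.
chunks = [
    Document(page_content="XBO 1000 W/HS OFR: average lifetime 2000 h."),
    Document(page_content="XBO 2500 W/HS XL OFR: average lifetime 3400 h."),
]

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # example model name

# Map step: the question is answered against each chunk independently
# (and can run in parallel). Reduce step: the partial answers are
# combined by the LLM into one final response.
chain = load_qa_chain(llm, chain_type="map_reduce")
result = chain.invoke({
    "input_documents": chunks,
    "question": "Which lamps have a lifetime of more than 3000 hours?",
})
print(result["output_text"])
```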