# **XBO Product Query System**
This repository contains a **PDF-based product query assistant** built using **LangChain**, **OpenAI** models, and **FAISS** for vector-based search. It processes PDF documents, creates a searchable FAISS index, and allows querying product-related information interactively.
---
## **Features**
- Processes PDF documents and extracts text.
- Creates and saves a FAISS index for fast, vector-based querying.
- Utilizes OpenAI's GPT models to answer natural language questions about products.
- Provides an interactive CLI-based query interface.
- Allows toggling FAISS index creation for optimized usage.
---
## **Requirements**
- Python 3.8+
- An OpenAI API key
- A folder containing PDF documents
---
## **Installation**
1. **Clone the Repository**:
```bash
git clone https://gitea.digital-bridge.net/nasim/CodeChallenge.git
cd CodeChallenge
```
2. **Set Up a Virtual Environment**:
```bash
python -m venv venv
source venv/bin/activate
```
3. **Install Dependencies**:
Install all required Python libraries using the `requirements.txt` file:
```bash
pip install -r requirements.txt
```
---
## **Configuration**
The application configuration is defined in **`app/config.py`**.
### **Setting Up the OpenAI API Key**
Replace the placeholder API key or use environment variables for production:
- Open **`app/config.py`** and set your OpenAI API key (read from the environment, with a placeholder fallback):
```python
import os

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "your-openai-api-key")
```
- Alternatively, create a `.env` file at the root of your project and add:
```dotenv
OPENAI_API_KEY=your-openai-api-key
PDF_FOLDER=./app/pdfs
```
### **CREATE_FAISS_INDEX Flag**
- This flag determines whether to create a new FAISS index from PDFs or load an existing one.
- In **`app/config.py`**, set:
```python
CREATE_FAISS_INDEX = True    # create a new index from the PDFs
# CREATE_FAISS_INDEX = False # load an existing index instead
```
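How the flag might be consumed inside `faiss_service.py` can be sketched as follows; `build_fn` and `load_fn` are hypothetical stand-ins for LangChain's `FAISS.from_documents(...)` and `FAISS.load_local(...)`:

```python
import os


def get_index(create_flag: bool, index_path: str, build_fn, load_fn):
    """Build a fresh index, or load the saved one from disk (sketch).

    build_fn / load_fn are hypothetical stand-ins for the real
    LangChain calls (FAISS.from_documents / FAISS.load_local).
    Falls back to building if no saved index exists yet.
    """
    if create_flag or not os.path.exists(index_path):
        return build_fn()           # CREATE_FAISS_INDEX = True path
    return load_fn(index_path)      # CREATE_FAISS_INDEX = False path
```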
---
## **Usage**
1. Place all PDF files in the folder specified in `PDF_FOLDER` (default: `./app/pdfs`).
2. Run the application:
```bash
python -m app.main
```
3. Interact with the assistant using natural language questions:
- Example: *"Welche Leuchtmittel haben eine Lebensdauer von mehr als 3000 Stunden?"* ("Which lamps have a lifetime of more than 3000 hours?")
- To exit, type `exit`.
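The query loop in step 3 can be sketched as follows; `answer_query` is a hypothetical stand-in for the LangChain QA chain used in `app/main.py`:

```python
def answer_query(question: str) -> str:
    # Hypothetical stand-in: the real app forwards the question to a
    # LangChain retrieval QA chain backed by the FAISS index.
    return f"Answer for: {question}"


def run_cli(input_fn=input, output_fn=print) -> None:
    """Interactive loop: read questions until the user types 'exit'."""
    while True:
        question = input_fn("Ask a question (or 'exit' to quit): ").strip()
        if question.lower() == "exit":
            break
        output_fn(answer_query(question))
```

Injecting `input_fn`/`output_fn` keeps the loop testable without a terminal.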
---
## **Folder Structure**
```
CodeChallenge/
├── app/
│   ├── main.py               # Application entry point
│   ├── config.py             # Configuration settings
│   ├── services/
│   │   ├── file_service.py   # Handles PDF file processing
│   │   ├── faiss_service.py  # Handles FAISS index creation/loading
│   │   └── dependencies.py   # Dependency injection for services
│   └── pdfs/                 # Directory to store PDF files
├── requirements.txt          # Python dependencies
└── README.md                 # Documentation
```
## **Scalability**
When scaling up, the following optimizations can be applied:
1. **Incremental FAISS Updates**
   - Allow appending new documents to the existing FAISS index without rebuilding it entirely.
2. **Use a Vector Database**
   - Replace the local FAISS index with a dedicated vector database for scalability and persistence.
3. **Batch Processing of Queries**
   - Break large queries into smaller batches to improve response times.
   - Distribute query processing across multiple threads or workers to enable parallel computation.
4. **Efficient Chunking Strategy**
   - Divide large PDF texts into smaller, manageable chunks.
   - Use a consistent chunking algorithm with overlap so relevant information is efficiently embedded and retrieved.
5. **Map-Reduce Strategy**
   - Use LangChain's Map-Reduce approach for processing large datasets: the Map step processes individual document chunks, and the Reduce step combines the intermediate results into a final response.
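Points 4 and 5 above can be illustrated together with a stdlib-only sketch (in production, LangChain's text splitters and map-reduce chains cover this; `chunk_text`, `map_fn`, and `reduce_fn` are illustrative names, not project APIs):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
    """Split text into fixed-size chunks whose edges overlap by `overlap`.

    The overlap keeps sentences that straddle a chunk boundary retrievable
    from at least one chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks


def map_reduce(chunks, map_fn, reduce_fn):
    """Map step: process each chunk independently (parallelizable).

    Reduce step: combine the partial results into one final answer.
    """
    return reduce_fn([map_fn(c) for c in chunks])
```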