# **XBO Product Query System**
This repository contains a **PDF-based product query assistant** built using **LangChain**, **OpenAI** models, and **FAISS** for vector-based search. It processes PDF documents, creates a searchable FAISS index, and allows querying product-related information interactively.
---
## **Features**
- Processes PDF documents and extracts text.
- Creates and saves a FAISS index for fast, vector-based querying.
- Utilizes OpenAI's GPT models to answer natural language questions about products.
- Provides an interactive CLI-based query interface.
- Allows toggling FAISS index creation for optimized usage.
---
## **Requirements**
- Python 3.8+
- An OpenAI API key
- A folder containing PDF documents
---
## **Installation**
1. **Clone the Repository**:
```bash
git clone https://gitea.digital-bridge.net/nasim/CodeChallenge.git
cd CodeChallenge
```
2. **Set Up a Virtual Environment**:
```bash
python -m venv venv
source venv/bin/activate
```
3. **Install Dependencies**:
Install all required Python libraries using the `requirements.txt` file:
```bash
pip install -r requirements.txt
```
---
## **Configuration**
The application configuration is defined in **`app/config.py`**.
### **Setting Up the OpenAI API Key**
Replace the placeholder API key or use environment variables for production:
- Open **`app/config.py`** and set your OpenAI API key (read from the environment, with a placeholder fallback):
```python
import os

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "your-openai-api-key")
```
- Alternatively, create a `.env` file at the root of your project and add:
```dotenv
OPENAI_API_KEY=your-openai-api-key
PDF_FOLDER=./app/pdfs
```
### **CREATE_FAISS_INDEX Flag**
- This flag determines whether to create a new FAISS index from PDFs or load an existing one.
- In **`app/config.py`**, set:
```python
CREATE_FAISS_INDEX = True    # create a new index from the PDFs
# CREATE_FAISS_INDEX = False # load an existing index instead
```
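How the flag might be consumed inside `faiss_service.py` can be sketched as follows; `build_fn` and `load_fn` are hypothetical stand-ins for LangChain's `FAISS.from_documents(...)` and `FAISS.load_local(...)`:

```python
import os


def get_index(create_flag: bool, index_path: str, build_fn, load_fn):
    """Build a fresh index, or load the saved one from disk (sketch).

    build_fn / load_fn are hypothetical stand-ins for the real
    LangChain calls (FAISS.from_documents / FAISS.load_local).
    Falls back to building if no saved index exists yet.
    """
    if create_flag or not os.path.exists(index_path):
        return build_fn()           # CREATE_FAISS_INDEX = True path
    return load_fn(index_path)      # CREATE_FAISS_INDEX = False path
```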
---
## **Usage**
1. Place all PDF files in the folder specified in `PDF_FOLDER` (default: `./app/pdfs`).
2. Run the application:
```bash
python -m app.main
```
3. Interact with the assistant using natural language questions:
- Example: *"Welche Leuchtmittel haben eine Lebensdauer von mehr als 3000 Stunden?"* ("Which lamps have a lifetime of more than 3000 hours?")
- To exit, type `exit`.
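The query loop in step 3 can be sketched as follows; `answer_query` is a hypothetical stand-in for the LangChain QA chain used in `app/main.py`:

```python
def answer_query(question: str) -> str:
    # Hypothetical stand-in: the real app forwards the question to a
    # LangChain retrieval QA chain backed by the FAISS index.
    return f"Answer for: {question}"


def run_cli(input_fn=input, output_fn=print) -> None:
    """Interactive loop: read questions until the user types 'exit'."""
    while True:
        question = input_fn("Ask a question (or 'exit' to quit): ").strip()
        if question.lower() == "exit":
            break
        output_fn(answer_query(question))
```

Injecting `input_fn`/`output_fn` keeps the loop testable without a terminal.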
---
## **Folder Structure**
```
CodeChallenge/
├── app/
│   ├── main.py               # Application entry point
│   ├── config.py             # Configuration settings
│   ├── services/
│   │   ├── file_service.py   # Handles PDF file processing
│   │   ├── faiss_service.py  # Handles FAISS index creation/loading
│   │   └── dependencies.py   # Dependency injection for services
│   └── pdfs/                 # Directory to store PDF files
├── requirements.txt          # Python dependencies
└── README.md                 # Documentation
```
## **Scalability**
When scaling up, the following optimizations can be applied:
1. **Incremental FAISS Updates**
   - Allow appending new documents to the existing FAISS index without rebuilding it entirely.
2. **Use a Vector Database**
   - Replace the local FAISS index with a dedicated vector database for scalability and persistence.
3. **Batch Processing of Queries**
   - Break large queries into smaller batches to improve response times.
   - Distribute query processing across multiple threads or workers to enable parallel computation.
4. **Efficient Chunking Strategy**
   - Divide large PDF texts into smaller, manageable chunks.
   - Use a consistent chunking algorithm with overlap so relevant information is efficiently embedded and retrieved.
5. **Map-Reduce Strategy**
   - Use LangChain's Map-Reduce approach for processing large datasets: the Map step processes individual document chunks, and the Reduce step combines the intermediate results into a final response.
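Points 4 and 5 above can be illustrated together with a stdlib-only sketch (in production, LangChain's text splitters and map-reduce chains cover this; `chunk_text`, `map_fn`, and `reduce_fn` are illustrative names, not project APIs):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
    """Split text into fixed-size chunks whose edges overlap by `overlap`.

    The overlap keeps sentences that straddle a chunk boundary retrievable
    from at least one chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks


def map_reduce(chunks, map_fn, reduce_fn):
    """Map step: process each chunk independently (parallelizable).

    Reduce step: combine the partial results into one final answer.
    """
    return reduce_fn([map_fn(c) for c in chunks])
```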