# **XBO Product Query System**

This repository contains a **PDF-based product query assistant** built with **LangChain**, **OpenAI** models, and **FAISS** for vector search. It processes PDF documents, builds a searchable FAISS index, and answers product-related questions interactively.

---

## **Features**

- Processes PDF documents and extracts their text.
- Creates and saves a FAISS index for fast, vector-based querying.
- Uses OpenAI's GPT models to answer natural-language questions about products.
- Provides an interactive CLI query interface.
- Lets you toggle FAISS index creation so an existing index can be reused.

---

## **Requirements**

- Python 3.8+
- An OpenAI API key
- A folder containing PDF documents

---

## **Installation**

1. **Clone the repository**:

   ```bash
   git clone https://gitea.digital-bridge.net/nasim/CodeChallenge.git
   cd CodeChallenge
   ```

2. **Set up a virtual environment**:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. **Install dependencies**:

   Install the required Python libraries from `requirements.txt`:

   ```bash
   pip install -r requirements.txt
   ```

---

## **Configuration**

The application configuration is defined in **`app/config.py`**.

### **Setting Up the OpenAI API Key**

Replace the placeholder API key, or use environment variables in production:

- Open **`app/config.py`** and set your OpenAI API key:

  ```python
  OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "your-openai-api-key")
  ```

- Alternatively, create a `.env` file at the project root containing:

  ```dotenv
  OPENAI_API_KEY=your-openai-api-key
  PDF_FOLDER=./app/pdfs
  ```
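These variables are usually loaded with the `python-dotenv` package. As an illustration of what that load does, here is a minimal stdlib-only sketch (the parsing is simplified; real `.env` files also support quoting and interpolation):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: KEY=VALUE lines, '#' comments ignored.

    A simplified stand-in for python-dotenv's load_dotenv(); values
    already present in the environment are deliberately not overwritten.
    """
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault: real environment variables take precedence.
            os.environ.setdefault(key.strip(), value.strip())
```

In practice, prefer `python-dotenv`; this sketch only shows the mechanism.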

### **CREATE_FAISS_INDEX Flag**

This flag determines whether to build a new FAISS index from the PDFs or load an existing one.

In **`app/config.py`**, set one of:

```python
CREATE_FAISS_INDEX = True   # Build (and save) a new index from the PDFs
CREATE_FAISS_INDEX = False  # Load a previously saved index
```
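In code, the flag typically gates a build-or-load branch along these lines (a sketch: `build_index_from_pdfs`, `load_existing_index`, and `save_index` are hypothetical names standing in for the real logic in `faiss_service.py`):

```python
def get_index(create_index, build_index_from_pdfs, load_existing_index, save_index=None):
    """Build a fresh FAISS index or load a saved one, based on the flag.

    All callables are hypothetical stand-ins for the real functions in
    faiss_service.py; injecting them keeps the sketch self-contained.
    """
    if create_index:
        index = build_index_from_pdfs()  # slow path: re-embeds every PDF chunk
        if save_index is not None:
            save_index(index)            # persist so later runs can load it
        return index
    return load_existing_index()         # fast path: reads the saved index
```

This is why the flag matters for cost: the `True` branch calls the embedding API for every chunk, while the `False` branch only reads from disk.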

---

## **Usage**

1. Place all PDF files in the folder specified by `PDF_FOLDER` (default: `./app/pdfs`).

2. Run the application:

   ```bash
   python -m app.main
   ```

3. Ask the assistant questions in natural language:

   - Example: *"Welche Leuchtmittel haben eine Lebensdauer von mehr als 3000 Stunden?"* ("Which lamps have a lifetime of more than 3000 hours?")
   - To exit, type `exit`.
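The interactive loop behaves roughly like the sketch below (hypothetical: the actual prompt text and query function live in `app/main.py` and may differ; I/O is injected here so the loop stays testable):

```python
def run_query_loop(answer_question, read_input=input, write_output=print):
    """Simple REPL: read a question, print the answer, stop on 'exit'.

    answer_question is a hypothetical stand-in for the real FAISS-backed
    query function; blank input is ignored rather than sent to the model.
    """
    while True:
        question = read_input("Frage> ").strip()
        if question.lower() == "exit":
            break
        if question:
            write_output(answer_question(question))
```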

---

## **Folder Structure**

```
CodeChallenge/
│
├── app/
│   ├── main.py                 # Application entry point
│   ├── config.py               # Configuration settings
│   ├── services/
│   │   ├── file_service.py     # Handles PDF file processing
│   │   ├── faiss_service.py    # Handles FAISS index creation/loading
│   │   └── dependencies.py     # Dependency injection for services
│   │
│   └── pdfs/                   # Directory to store PDF files
│
├── requirements.txt            # Python dependencies
└── README.md                   # Documentation
```

## **Scalability**

When scaling up, the following optimizations can be applied:

1. **Incremental FAISS Updates**

   - Allow appending new documents to the existing FAISS index without rebuilding it entirely.

2. **Use a Vector Database**

   - Replace the local FAISS index with a dedicated vector database for horizontal scalability.

3. **Batch Processing of Queries**

   - Break large queries into smaller batches to improve response times.
   - Distribute query processing across multiple threads or workers to enable parallel computation.

4. **Efficient Chunking Strategy**

   - Divide large PDF texts into smaller, manageable chunks.
   - Use a consistent chunking algorithm so relevant information is efficiently embedded and retrieved.

5. **Map-Reduce Strategy**

   - Use LangChain's map-reduce approach for processing large datasets: the map step processes individual document chunks, and the reduce step combines intermediate results into a final response.
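Points 3-5 can be sketched together: an overlapping chunker plus a hand-rolled parallel map-reduce pass. This is illustrative only; in practice LangChain's text splitters and built-in map-reduce chain would replace these, and all names here are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks so facts that span a chunk
    boundary still appear intact in at least one chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

def map_reduce(chunks, map_fn, reduce_fn, max_workers=4):
    """Map step: run map_fn on each chunk in parallel (e.g. one LLM call
    per chunk). Reduce step: combine the partial answers into one result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(map_fn, chunks))  # order is preserved
    return reduce_fn(partials)
```

With real models, `map_fn` would be a per-chunk LLM summarizer and `reduce_fn` a final synthesis prompt; the thread pool gives the parallelism described in point 3.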
|