# XBO Product Query System
This repository contains a PDF-based product query assistant built using LangChain, OpenAI models, and FAISS for vector-based search. It processes PDF documents, creates a searchable FAISS index, and allows querying product-related information interactively.
## Features
- Processes PDF documents and extracts text.
- Creates and saves a FAISS index for fast, vector-based querying.
- Utilizes OpenAI's GPT models to answer natural language questions about products.
- Provides an interactive CLI-based query interface.
- Allows toggling FAISS index creation, so an existing index can be reused instead of rebuilt on every run.
## Requirements
- Python 3.8+
- An OpenAI API key
- A folder containing PDF documents
## Installation
1. Clone the repository:

   ```bash
   git clone https://gitea.digital-bridge.net/nasim/CodeChallenge.git
   cd CodeChallenge
   ```

2. Set up a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate
   ```

3. Install the required Python libraries from the `requirements.txt` file:

   ```bash
   pip install -r requirements.txt
   ```
## Configuration

The application configuration is defined in `app/config.py`.
### Setting Up the OpenAI API Key
Replace the placeholder API key or use environment variables for production:

1. Open `app/config.py` and add your OpenAI API key:

   ```python
   OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "your-openai-api-key")
   ```

2. Alternatively, create a `.env` file at the root of your project and add:

   ```
   OPENAI_API_KEY=your-openai-api-key
   PDF_FOLDER=./app/pdfs
   ```
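For orientation, here is a minimal sketch of how `app/config.py` could tie these settings together. The use of `python-dotenv` to read the `.env` file is an assumption for illustration, not necessarily how the repository does it:

```python
# Illustrative sketch of app/config.py, assuming python-dotenv; not verbatim.
import os

from dotenv import load_dotenv

# Read key/value pairs from a .env file at the project root, if one exists.
load_dotenv()

# API key: taken from the environment, with a placeholder fallback.
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "your-openai-api-key")

# Folder scanned for PDF documents.
PDF_FOLDER = os.getenv("PDF_FOLDER", "./app/pdfs")

# Whether to build a fresh FAISS index (True) or load a saved one (False).
CREATE_FAISS_INDEX = True
```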
### `CREATE_FAISS_INDEX` Flag
This flag determines whether to create a new FAISS index from the PDFs or to load an existing one. In `app/config.py`, set:

```python
CREATE_FAISS_INDEX = True   # To create a new index
CREATE_FAISS_INDEX = False  # To load an existing index
```
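To illustrate what the flag controls, here is a hedged sketch of a create-or-load routine built on LangChain's FAISS wrapper. The function name `get_vector_store`, the `faiss_index` path, and the chunking parameters are assumptions; the actual `faiss_service.py` may differ:

```python
# Sketch of the create-or-load logic behind CREATE_FAISS_INDEX.
# Names and paths are illustrative, not the repository's exact code.
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

from app import config

INDEX_PATH = "faiss_index"  # hypothetical on-disk location of the index


def get_vector_store() -> FAISS:
    embeddings = OpenAIEmbeddings(api_key=config.OPENAI_API_KEY)
    if config.CREATE_FAISS_INDEX:
        # Load every PDF in the folder and split the text into overlapping
        # chunks so each embedding covers a manageable span of text.
        docs = []
        for path in Path(config.PDF_FOLDER).glob("*.pdf"):
            docs.extend(PyPDFLoader(str(path)).load())
        chunks = RecursiveCharacterTextSplitter(
            chunk_size=1000, chunk_overlap=200
        ).split_documents(docs)
        store = FAISS.from_documents(chunks, embeddings)
        store.save_local(INDEX_PATH)
    else:
        # Reuse the saved index instead of re-embedding every PDF.
        store = FAISS.load_local(
            INDEX_PATH, embeddings, allow_dangerous_deserialization=True
        )
    return store
```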
## Usage
1. Place all PDF files in the folder specified by `PDF_FOLDER` (default: `./app/pdfs`).

2. Run the application:

   ```bash
   python -m app.main
   ```

3. Interact with the assistant using natural language questions:
   - Example: "Welche Leuchtmittel haben eine Lebensdauer von mehr als 3000 Stunden?" ("Which lamps have a lifetime of more than 3000 hours?")
   - To exit, type `exit`.
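For context, the interactive loop in `app/main.py` plausibly resembles the following sketch. The use of a `RetrievalQA` chain, the model name, and the `get_vector_store` import are assumptions for illustration:

```python
# Illustrative sketch of an interactive query loop; not the verbatim main.py.
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

from app.services.faiss_service import get_vector_store  # hypothetical helper


def main() -> None:
    store = get_vector_store()
    qa = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(model="gpt-4o-mini"),  # model choice is illustrative
        retriever=store.as_retriever(),
    )
    while True:
        question = input("Question (or 'exit'): ").strip()
        if question.lower() == "exit":
            break
        print(qa.invoke({"query": question})["result"])


if __name__ == "__main__":
    main()
```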
## Folder Structure
```
CodeChallenge/
│
├── app/
│   ├── main.py              # Application entry point
│   ├── config.py            # Configuration settings
│   ├── services/
│   │   ├── file_service.py  # Handles PDF file processing
│   │   ├── faiss_service.py # Handles FAISS index creation/loading
│   │   ├── dependencies.py  # Dependency injection for services
│   │
│   ├── pdfs/                # Directory to store PDF files
│
├── requirements.txt         # Python dependencies
└── README.md                # Documentation
```
## Scalability

When scaling up, the following optimizations can be applied:
- **Incremental FAISS updates**
  - Allow appending new documents to the existing FAISS index without rebuilding it entirely (see the first sketch after this list).
- **Use a vector store database**
  - Migrate to a dedicated vector database (e.g., Milvus, Qdrant, or Pinecone) for scalability beyond a single on-disk index.
- **Batch processing of queries**
  - Break large queries into smaller batches to improve response times.
  - Distribute query processing across multiple threads or workers to enable parallel computation.
- **Efficient chunking strategy**
  - Divide large PDF texts into smaller, manageable chunks.
  - Apply a consistent chunking algorithm so that relevant information is efficiently embedded and retrieved.
- **Map-Reduce strategy**
  - Use LangChain's map-reduce approach for processing large datasets: the map step processes individual document chunks, and the reduce step combines the intermediate results into a final response (see the second sketch after this list).
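Here is a hedged sketch of the incremental update mentioned in the first item, using LangChain's FAISS wrapper. The index path, the new PDF's file name, and the chunking parameters are illustrative assumptions:

```python
# Sketch: append new documents to a saved FAISS index instead of rebuilding it.
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

embeddings = OpenAIEmbeddings()
store = FAISS.load_local(
    "faiss_index", embeddings, allow_dangerous_deserialization=True
)

# Embed only the new PDF and add its chunks to the existing index.
new_docs = PyPDFLoader("./app/pdfs/new_product_sheet.pdf").load()  # hypothetical file
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
).split_documents(new_docs)
store.add_documents(chunks)

# Persist the grown index so later runs can load it with CREATE_FAISS_INDEX = False.
store.save_local("faiss_index")
```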
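And a sketch of the map-reduce strategy from the last item, reusing the `store` from the sketch above. `chain_type="map_reduce"` is LangChain's built-in map-reduce QA chain; the model name and retriever settings are assumptions:

```python
# Sketch: answer each retrieved chunk separately (map), then combine the
# partial answers into one final response (reduce).
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini"),  # model choice is illustrative
    chain_type="map_reduce",
    retriever=store.as_retriever(search_kwargs={"k": 20}),
)
result = qa.invoke(
    {"query": "Welche Leuchtmittel haben eine Lebensdauer von mehr als 3000 Stunden?"}
)
print(result["result"])
```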