Genesis commit

This commit is contained in:
nsa 2024-12-17 10:47:33 +01:00
commit c7d85e7e5d
29 changed files with 444 additions and 0 deletions

132
README.md Normal file
View File

@ -0,0 +1,132 @@
# **XBO Product Query System**
This repository contains a **PDF-based product query assistant** built using **LangChain**, **OpenAI** models, and **FAISS** for vector-based search. It processes PDF documents, creates a searchable FAISS index, and allows querying product-related information interactively.
---
## **Features**
- Processes PDF documents and extracts text.
- Creates and saves a FAISS index for fast, vector-based querying.
- Utilizes OpenAI's GPT models to answer natural language questions about products.
- Provides an interactive CLI-based query interface.
- Allows toggling FAISS index creation for optimized usage.
---
## **Requirements**
- Python 3.8+
- An OpenAI API key
- A folder containing PDF documents
---
## **Installation**
1. **Clone the Repository**:
```bash
git clone https://gitea.digital-bridge.net/nasim/CodeChallenge.git
cd CodeChallenge
```
2. **Set Up a Virtual Environment**:
```bash
python -m venv venv
source venv/bin/activate
```
3. **Install Dependencies**:
Install all required Python libraries using the `requirements.txt` file:
```bash
pip install -r requirements.txt
```
---
## **Configuration**
The application configuration is defined in **`app/config.py`**.
### **Setting Up the OpenAI API Key**
Replace the placeholder API key or use environment variables for production:
- Open **`app/config.py`** and add your OpenAI API key:
```python
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "your-openai-api-key")
```
- Alternatively, create a `.env` file at the root of your project (and load it, e.g. with `python-dotenv` from `requirements.txt`, before the configuration is read) and add:
```dotenv
OPENAI_API_KEY=your-openai-api-key
PDF_FOLDER=./app/pdfs
```
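Either way, `app/config.py` resolves every setting with `os.getenv`, so a value present in the environment always wins over the hard-coded default. A minimal, runnable sketch of that precedence (the simulated entry is illustrative):

```python
import os

# Simulate an entry that a .env loader (e.g. python-dotenv) would place
# into the environment before app/config.py is imported.
os.environ["PDF_FOLDER"] = "./custom/pdfs"

# app/config.py resolves settings exactly like this: env var first, default second.
pdf_folder = os.getenv("PDF_FOLDER", "./app/pdfs")
api_key = os.getenv("OPENAI_API_KEY", "your-openai-api-key")

print(pdf_folder)  # ./custom/pdfs
```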
### **CREATE_FAISS_INDEX Flag**
- This flag determines whether to create a new FAISS index from PDFs or load an existing one.
- In **`app/config.py`**, set:
```python
CREATE_FAISS_INDEX = True # To create a new index
CREATE_FAISS_INDEX = False # To load an existing index
```
On the first run, set this flag to `True` so the index is built; on subsequent runs set it to `False` to reuse the saved index.
---
## **Usage**
1. Place all PDF files in the folder specified in `PDF_FOLDER` (default: `./app/pdfs`).
2. Run the application:
```bash
python -m app.main
```
3. Interact with the assistant using natural language questions:
   - Example: *"Welche Leuchtmittel haben eine Lebensdauer von mehr als 3000 Stunden?"* ("Which lamps have a lifetime of more than 3000 hours?")
- To exit, type `exit`.
---
## **Folder Structure**
```
xbo-product-query/
├── app/
│ ├── main.py # Application entry point
│ ├── config.py # Configuration settings
│ ├── services/
│ │ ├── file_service.py # Handles PDF file processing
│ │ ├── faiss_service.py # Handles FAISS index creation/loading
│ │ ├── dependencies.py # Dependency injection for services
│ │
│ ├── pdfs/ # Directory to store PDF files
├── requirements.txt # Python dependencies
└── README.md # Documentation
```
## **Scalability**
When scaling up, the following optimizations can be applied:
1. **Update FAISS**
- Allow appending new documents to the existing FAISS index without rebuilding it entirely.
2. **Memory and Disk Management**
- Use a persistent FAISS index stored on disk, which can be loaded as needed.
   - Use FAISS's on-disk index support (for example, `OnDiskInvertedLists` for IVF indexes) so only the necessary portions of the index are mapped into memory, reducing RAM consumption.
3. **Batch Processing of Queries**
- Break large queries into smaller batches to improve response times.
- Distribute query processing across multiple threads or workers to enable parallel computation.
4. **Efficient Chunking Strategy**
- Divide large PDF texts into smaller, manageable chunks.
- Implement a consistent chunking algorithm to ensure relevant information is efficiently embedded and retrieved.
5. **Map-Reduce Strategy**
- Use LangChain's Map-Reduce approach for processing large datasets. The Map step processes individual document chunks, and the Reduce step combines intermediate results into a final response.
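The batching idea in point 3 can be sketched with the standard library's thread pool; `answer_query` is a hypothetical stand-in for the real retriever-plus-LLM round trip:

```python
from concurrent.futures import ThreadPoolExecutor

def answer_query(query: str) -> str:
    # Hypothetical stand-in for the retrieval + LLM call made per query.
    return f"answer for: {query}"

queries = ["Lebensdauer?", "Wattage?", "Sockeltyp?"]

# Process a batch of queries in parallel worker threads; pool.map preserves
# input order in the results.
with ThreadPoolExecutor(max_workers=4) as pool:
    answers = list(pool.map(answer_query, queries))

print(answers[0])  # answer for: Lebensdauer?
```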
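The chunking strategy in point 4 needs no libraries to illustrate (in practice a LangChain text splitter would do this job); the `chunk_size` and `overlap` values below are illustrative:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks that overlap, so information near a
    chunk boundary still appears intact in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping `overlap` chars shared
    return chunks

sample = "abcdefghij" * 250   # 2500 characters of dummy PDF text
chunks = chunk_text(sample)
print(len(chunks))            # 4 chunks, starting at offsets 0, 800, 1600, 2400
```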
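The Map-Reduce flow in point 5, as a runnable toy (the stand-in functions are hypothetical; LangChain's map-reduce chains call the LLM in both phases):

```python
def map_step(chunk: str) -> str:
    # Stand-in for an LLM call that extracts a partial answer from one chunk.
    return chunk.strip().splitlines()[0]

def reduce_step(partials: list[str]) -> str:
    # Stand-in for an LLM call that merges the partial answers into one reply.
    return " | ".join(p for p in partials if p)

chunks = [
    "XBO 1000 W\nLebensdauer: 3000 h",
    "XBO 2000 W\nLebensdauer: 3400 h",
]
partials = [map_step(c) for c in chunks]   # map phase: one result per chunk
print(reduce_step(partials))               # XBO 1000 W | XBO 2000 W
```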

12
app/__init__.py Normal file
View File

@ -0,0 +1,12 @@
# app/__init__.py
from fastapi import FastAPI

from app.config import get_config


def create_app(env_name: str) -> FastAPI:
    config = get_config(env_name)
    app = FastAPI(title="Code Challenge")
    app.state.config = config  # app.state is Starlette's place for shared objects
    return app

56
app/config.py Normal file
View File

@ -0,0 +1,56 @@
import os


class BaseConfig:
    """
    Base configuration class that holds default settings for the application.
    Environment-specific configurations inherit from this class.
    """
    PDF_FOLDER = os.getenv("PDF_FOLDER", "./app/pdfs")
    ENV = "base"
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "your-api-key-here")
    CREATE_FAISS_INDEX = True

    def __init__(self):
        """
        Validates required configuration and ensures the API key has been set.
        """
        if not self.OPENAI_API_KEY or self.OPENAI_API_KEY == "your-api-key-here":
            raise ValueError("OPENAI_API_KEY environment variable must be set.")


class DevelopmentConfig(BaseConfig):
    """
    Configuration class for the development environment.
    Inherits defaults from BaseConfig.
    """
    ENV = "development"
    DEBUG = True


class ProductionConfig(BaseConfig):
    """
    Configuration class for the production environment.
    Inherits defaults from BaseConfig but overrides production-specific settings.
    """
    ENV = "production"
    DEBUG = False


def get_config(env_name: str = "development"):
    """
    Retrieves the appropriate configuration instance based on the environment name.

    :param env_name: Name of the environment (e.g., 'development', 'production').
    :return: An instance of the selected configuration class.
    """
    configs = {
        "development": DevelopmentConfig,
        "production": ProductionConfig,
    }
    config_class = configs.get(env_name.lower())
    if not config_class:
        raise ValueError(
            f"Unknown environment '{env_name}'. Valid options are 'development' or 'production'."
        )
    config_instance = config_class()
    print(f"[INFO] Loaded configuration for environment: {config_instance.ENV}")
    return config_instance

87
app/main.py Normal file
View File

@ -0,0 +1,87 @@
import asyncio
import sys

from langchain.prompts import PromptTemplate
from langchain.schema import Document
from langchain_openai import ChatOpenAI

from app.config import get_config
from app.services.dependencies import get_file_service, get_faiss_service


async def main():
    """
    Entry point for the XBO product assistant.
    """
    config = get_config(env_name="development")
    file_service = get_file_service(config=config)
    openai_api_key = config.OPENAI_API_KEY
    create_faiss_index = config.CREATE_FAISS_INDEX
    print("Create FAISS Index:", create_faiss_index)
    print("Willkommen zum XBO Kaufberater!")
    faiss_service = get_faiss_service()
    try:
        if create_faiss_index:
            print("[INFO] Creating a new FAISS index...")
            pdfs = file_service.load_pdfs()
            if not pdfs:
                print("[ERROR] No PDFs found.")
                sys.exit(1)
            all_documents = []
            for pdf in pdfs:
                print(f"Processing PDF: {pdf}")
                text = file_service.extract_text_from_pdf(pdf)
                all_documents.append(Document(page_content=text, metadata={"source": pdf}))
            vectorstore = faiss_service.create_faiss_index(all_documents)
        else:
            vectorstore = faiss_service.load_faiss_index()
    except Exception as e:
        print(f"[ERROR] {e}")
        sys.exit(1)

    llm = ChatOpenAI(model="gpt-4o", openai_api_key=openai_api_key)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 21}, search_type="mmr")

    while True:
        user_input = input("\nWas möchten Sie wissen? (type 'exit' to quit): ").strip()
        if user_input.lower() == "exit":
            print("Auf Wiedersehen!")
            break
        try:
            print("[INFO] Retrieving relevant documents...")
            docs = retriever.invoke(user_input)
            if not docs:
                print("\n[ANSWER]: Keine passenden Informationen gefunden.")
                continue
            context = "\n\n".join(doc.page_content for doc in docs)
            prompt = PromptTemplate(
                template="""
Du bist ein Assistent, der Fragen zu Produktinformationen beantwortet.
Kontext:
{context}
Frage:
{question}
Antwort:
""",
                input_variables=["context", "question"],
            )
            response = llm.invoke(prompt.format(context=context, question=user_input))
            print("\n[ANSWER]:")
            print(response.content)
        except Exception as e:
            print(f"[ERROR] Failed to process query: {e}")


if __name__ == "__main__":
    asyncio.run(main())

Binary file not shown.


33
app/services/dependencies.py Normal file
View File

@ -0,0 +1,33 @@
from fastapi import Depends

from app.config import get_config
from app.services.faiss_service import FAISSService
from app.services.file_service import FileService


def get_file_service(config=Depends(get_config)) -> FileService:
    """
    Dependency function to provide a FileService instance.

    :param config: Configuration object obtained via dependency injection.
    :return: An instance of FileService initialized with the PDF folder path.
    """
    if not hasattr(config, "PDF_FOLDER") or not config.PDF_FOLDER:
        raise ValueError("PDF_FOLDER is not configured in the application settings.")
    return FileService(folder_path=config.PDF_FOLDER)


def get_faiss_service(file_service=Depends(get_file_service)) -> FAISSService:
    """
    Dependency function to provide a FAISSService instance.

    :param file_service: FileService dependency (resolved but not used directly here).
    :return: An instance of FAISSService initialized with the API key and index path.
    """
    config = get_config()
    return FAISSService(
        openai_api_key=config.OPENAI_API_KEY,
        index_path="local_faiss_index",
    )

53
app/services/faiss_service.py Normal file
View File

@ -0,0 +1,53 @@
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings


class FAISSService:
    """
    A service for creating and loading FAISS indexes for document embeddings.
    """

    def __init__(self, openai_api_key, index_path="local_faiss_index"):
        """
        Initialize the FAISS service.

        :param openai_api_key: OpenAI API key for embeddings.
        :param index_path: Path to save or load the FAISS index.
        """
        self.openai_api_key = openai_api_key
        self.index_path = index_path

    def create_faiss_index(self, documents):
        """
        Create a FAISS index from a list of documents.

        :param documents: List of langchain Document objects.
        :return: FAISS vectorstore instance.
        """
        print("[INFO] Creating FAISS index...")
        vectorstore = FAISS.from_documents(
            documents,
            OpenAIEmbeddings(
                model="text-embedding-ada-002",
                openai_api_key=self.openai_api_key,
            ),
        )
        vectorstore.save_local(self.index_path)
        print(f"[INFO] FAISS index saved to {self.index_path}.")
        return vectorstore

    def load_faiss_index(self):
        """
        Load an existing FAISS index.

        :return: Loaded FAISS vectorstore instance.
        """
        print("[INFO] Loading FAISS index...")
        vectorstore = FAISS.load_local(
            self.index_path,
            # The embedding model must match the one used to build the index.
            OpenAIEmbeddings(
                model="text-embedding-ada-002",
                openai_api_key=self.openai_api_key,
            ),
            allow_dangerous_deserialization=True,
        )
        print(f"[INFO] FAISS index loaded from {self.index_path}.")
        return vectorstore

49
app/services/file_service.py Normal file
View File

@ -0,0 +1,49 @@
# app/services/file_service.py
import os

import pdfplumber


class FileService:
    """
    A service to handle file-related operations, including loading PDFs from a folder.
    """

    def __init__(self, folder_path: str):
        """
        Initialize the FileService with the folder path to read files from.
        """
        self.folder_path = os.path.abspath(folder_path)
        # print(f"[DEBUG] Initialized FileService with folder path: {self.folder_path}")

    def load_pdfs(self):
        """
        Reads all PDF files from the folder and returns their paths.

        :return: List of paths to PDF files in the folder.
        """
        if not os.path.exists(self.folder_path):
            raise FileNotFoundError(f"The folder {self.folder_path} does not exist.")
        pdf_files = [
            os.path.join(self.folder_path, f)
            for f in os.listdir(self.folder_path)
            if f.endswith(".pdf")
        ]
        if not pdf_files:
            raise FileNotFoundError(f"No PDF files found in the folder {self.folder_path}.")
        return pdf_files

    def extract_text_from_pdf(self, pdf_path):
        """
        Extracts text from the PDF file using pdfplumber.

        :param pdf_path: Path to the PDF file.
        :return: Extracted text as a string.
        """
        text = ""
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                page_text = page.extract_text()
                if page_text:
                    text += page_text + "\n"
        return text

22
requirements.txt Normal file
View File

@ -0,0 +1,22 @@
# FastAPI and ASGI server
fastapi
# LangChain and associated dependencies
langchain
langchain-community
langchain-openai
# OpenAI API client
openai
# Embeddings and Vector Stores
faiss-cpu
# Utility Libraries
python-dotenv==1.0.1 # For environment variable management
pydantic==2.7.1 # Data validation library (dependency for FastAPI and LangChain)
# Async support comes from the standard library's asyncio (Python 3.8+);
# do not install the obsolete "asyncio" backport from PyPI.
pdfplumber