AI RAG Engine

Tech stack

Python 3.8+
Google Vertex AI
Google ADK (Agent Development Kit)
Gemini Models
PyPDF2 / PyMuPDF

Features

Multi-Modal Document Processing: Extract and process text, images, and tables from PDF documents
Intelligent Chunking: Optimized document segmentation with overlap for better context preservation
Hybrid Search: Combines semantic and keyword-based search with configurable weighting
LLM Reranking: Advanced result reranking using Gemini models for improved relevance
Grounding Verification: Automatic verification of response claims against source material
Context Management: Multi-turn conversation support with context tracking
Google ADK Integration: Modern agent framework using Google’s Agent Development Kit
Cloud Storage Integration: Seamless GCS integration for document management

Architecture

 
PDF Documents -> RAG Ingestion Pipeline -> Vertex AI RAG Corpus -> Google ADK Agent -> Gemini Model (Flash 2.5)

Requirements

Python 3.8+
Google Cloud Platform account
Vertex AI API enabled
Google Cloud Storage bucket

Installation

Clone the repository:

git clone <repository-url>
cd ai-rag-engine

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:
```
pip install -r requirements.txt
```

Set up environment variables:
Create a .env file with the following:

GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=us-central1
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
GCS_BUCKET_NAME=your-bucket-name
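
As a minimal sketch of how a script might check these settings before making any Google Cloud call, the helper below verifies that all four variables are present. The name `load_settings` is hypothetical and is not part of `rag_agent.py`; the sketch assumes the values from `.env` have already been loaded into the process environment.

```python
import os

REQUIRED_VARS = (
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_CLOUD_LOCATION",
    "GOOGLE_APPLICATION_CREDENTIALS",
    "GCS_BUCKET_NAME",
)


def load_settings(env=None):
    """Return the required settings as a dict, or raise if any are missing.

    Hypothetical helper for illustration; not part of rag_agent.py.
    """
    env = os.environ if env is None else env
    # Treat unset and empty values the same way.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing early with the list of missing names tends to be easier to debug than an authentication error raised later by a client library.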

Usage

Basic Usage

Place your PDF documents in the docs/ directory
Run the RAG agent:
```
python rag_agent.py
```

Programmatic Usage

import os

from rag_agent import (
    ADKRAGAgent,
    create_rag_corpus,
    import_documents_to_corpus,
    upload_file_to_gcs,
)

# Create the RAG corpus that will hold the documents
corpus_id = create_rag_corpus(
    corpus_name="my-knowledge-base",
    description="Enterprise documentation"
)

# Local PDFs to upload to GCS
document_paths = [
    "docs/technical_manual.pdf",
    "docs/product_specs.pdf",
    "docs/user_guide.pdf"
]

# Upload each file that exists locally and collect its GCS URI
gcs_uris = []
for doc_path in document_paths:
    if os.path.exists(doc_path):
        gcs_uri = upload_file_to_gcs(doc_path, os.getenv("GCS_BUCKET_NAME"))
        gcs_uris.append(gcs_uri)

# Import the uploaded documents into the RAG corpus
# (indexing takes a few minutes after import; see Limitations)
import_documents_to_corpus(corpus_id, gcs_uris)

# Create and use the agent
adk_agent = ADKRAGAgent(
    corpus_id=corpus_id,
    project_id="your-project-id",
    location="us-central1"
)

agent = adk_agent.create_agent()
response = adk_agent.chat(agent, "What are the system requirements?")
print(response)

Key Components

RAGAgent

Core RAG functionality including:

Hybrid search with semantic and keyword matching
Context management for multi-turn conversations
Result reranking using LLM scoring
Grounding verification for hallucination prevention
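
Of the items above, context management is the simplest to illustrate in isolation. The following is a rough sketch of one way to track a multi-turn conversation, assuming a fixed window of recent turns; the class name `ConversationContext` and the `User:`/`Assistant:` prompt layout are hypothetical and may differ from what `RAGAgent` actually does.

```python
from collections import deque


class ConversationContext:
    """Keeps the most recent turns of a conversation for prompt building.

    Hypothetical illustration; not the RAGAgent implementation.
    """

    def __init__(self, max_turns=5):
        # A bounded deque drops the oldest turn automatically.
        self._turns = deque(maxlen=max_turns)

    def add_turn(self, question, answer):
        """Record one completed question/answer exchange."""
        self._turns.append((question, answer))

    def build_prompt(self, new_question):
        """Prefix the new question with the retained history."""
        lines = []
        for question, answer in self._turns:
            lines.append(f"User: {question}")
            lines.append(f"Assistant: {answer}")
        lines.append(f"User: {new_question}")
        return "\n".join(lines)
```

Bounding the history keeps the prompt size predictable, at the cost of forgetting older turns.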

ADKRAGAgent

Google ADK wrapper that provides:

Tool-based RAG search functionality
Native integration with Gemini models
Automatic tool calling and response synthesis
Session-based conversation tracking

Document Processing

PDF text extraction with PyPDF2
Image extraction with PyMuPDF (fitz)
Table content parsing and structuring
Multi-modal embedding creation

Configuration

Retrieval Parameters

Adjust in configure_retrieval_parameters():

similarity_top_k: Number of results to retrieve (default: 10)
vector_distance_threshold: Similarity threshold (default: 0.5)
alpha: Hybrid search weight (default: 0.5)
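
To make the roles of these parameters concrete, here is a small sketch that holds the documented defaults and applies the threshold and top-k cut to a list of scored results. `RetrievalConfig` and `filter_results` are hypothetical names, not part of `rag_agent.py`, and the sketch assumes that a smaller vector distance means a closer match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalConfig:
    """Hypothetical container mirroring the documented defaults."""
    similarity_top_k: int = 10
    vector_distance_threshold: float = 0.5
    alpha: float = 0.5

    def __post_init__(self):
        if self.similarity_top_k < 1:
            raise ValueError("similarity_top_k must be at least 1")
        if not 0.0 <= self.alpha <= 1.0:
            raise ValueError("alpha must be between 0 and 1")


def filter_results(results, config):
    """Keep results within the distance threshold, closest first, capped at top_k.

    `results` is a list of (text, distance) pairs. Assumption: a smaller
    distance means a closer match.
    """
    kept = [r for r in results if r[1] <= config.vector_distance_threshold]
    kept.sort(key=lambda r: r[1])
    return kept[: config.similarity_top_k]
```

Lowering `vector_distance_threshold` makes retrieval stricter, while `similarity_top_k` only caps how many of the surviving results are returned.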

Chunking Settings

Modify in chunk_document():

chunk_size: Characters per chunk (default: 1000)
overlap: Overlap between chunks (default: 200)
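
The effect of these two settings can be seen in a character-based sliding window. The function below is an illustrative sketch under that assumption, named `chunk_text` to make clear it is not the repository's `chunk_document()`; the real function may split on sentence or token boundaries instead.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    # Each new chunk starts `step` characters after the previous one.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults, each chunk starts 800 characters after the previous one, so the last 200 characters of a chunk are repeated at the start of the next.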

Model Configuration

Change models in initialization:

RAG Agent: gemini-2.5-flash
ADK Agent: gemini-2.0-flash-001
Embeddings: text-embedding-004

Project Structure

ai-rag-engine/
├── rag_agent.py          # Main RAG system implementation
├── requirements.txt      # Python dependencies
├── .env                  # Environment variables (not in git)
├── docs/                 # Source PDF documents
├── extracted_images/     # Extracted images from PDFs
└── README.md             # This file

Features in Detail

Hybrid Search

Combines semantic embeddings with keyword matching for optimal retrieval:

results = agent.hybrid_search(
    corpus_id=corpus_id,
    query="your query",
    semantic_weight=0.7,  # 70% semantic, 30% keyword
    top_k=10
)
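
One common way to combine the two signals is a weighted sum, `semantic_weight * semantic + (1 - semantic_weight) * keyword`. The sketch below illustrates that idea only; `keyword_score` and `blend_scores` are hypothetical names, the keyword score is a simple term-overlap ratio, and the semantic scores are assumed to be supplied by the caller already normalized to [0, 1]. The scoring used inside `hybrid_search` may differ.

```python
def keyword_score(query, text):
    """Fraction of distinct query terms that appear in the text."""
    terms = set(query.lower().split())
    if not terms:
        return 0.0
    words = set(text.lower().split())
    return len(terms & words) / len(terms)


def blend_scores(query, candidates, semantic_weight=0.7, top_k=10):
    """Rank (text, semantic_score) pairs by a weighted sum of both signals.

    Assumes semantic scores are already normalized to [0, 1].
    """
    ranked = []
    for text, semantic in candidates:
        combined = (semantic_weight * semantic
                    + (1.0 - semantic_weight) * keyword_score(query, text))
        ranked.append((text, combined))
    # Highest combined score first.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```

A higher `semantic_weight` favors paraphrased matches, while a lower one favors documents that contain the exact query terms.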

Grounding Verification

Ensures responses are factually grounded in source documents:

verification = agent.verify_grounding(
    response=answer,
    sources=retrieved_docs
)
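
To give a feel for what a grounding check does, the following sketch flags response sentences whose words are poorly covered by the retrieved sources. It is a deliberately naive lexical stand-in, not the verification performed by `verify_grounding`; the name `check_grounding`, the 0.6 overlap threshold, and the returned dictionary shape are all assumptions.

```python
import re


def _tokens(text):
    """Lowercase word tokens longer than three characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def check_grounding(response, sources, min_overlap=0.6):
    """Flag response sentences whose words are poorly covered by the sources."""
    source_tokens = set()
    for source in sources:
        source_tokens |= _tokens(source)
    supported, unsupported = [], []
    # Split on whitespace that follows sentence-ending punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = _tokens(sentence)
        if not words:
            continue
        coverage = len(words & source_tokens) / len(words)
        (supported if coverage >= min_overlap else unsupported).append(sentence)
    return {"supported": supported, "unsupported": unsupported}
```

Word overlap cannot detect a sentence that reuses source vocabulary to state something false, so a check like this is at best a cheap first filter.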

Multi-Modal Processing

Handle documents with text, images, and tables:

images = agent.extract_images_from_pdf(pdf_path, output_dir)
table_data = agent.process_table_content(table_text)
embedding = agent.create_multimodal_embedding(text, image_path, table_data)
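
As an illustration of the table step only, the sketch below turns pipe-delimited table text into a list of row dictionaries. The input format, the function name `parse_table`, and the output shape are assumptions made for this example; `process_table_content` may expect and return something different.

```python
def parse_table(table_text, delimiter="|"):
    """Turn delimited table text into a list of row dicts keyed by header."""
    rows = []
    for line in table_text.strip().splitlines():
        cells = [cell.strip()
                 for cell in line.strip().strip(delimiter).split(delimiter)]
        # Skip blank lines and separator rows such as |---|---|.
        if all(set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    # The first remaining row is treated as the header.
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]
```

Structured rows like these are easier to serialize into text for embedding than the raw table layout.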

Limitations

Rate limits: 5 requests/minute for RAG retrieval operations
Indexing delay: wait 3 minutes after importing documents into a corpus before querying it
PDF processing only (extend for other formats as needed)
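
Client code can stay under the retrieval rate limit by throttling itself. The class below is a minimal sliding-window sketch, assuming the limit is 5 calls per 60 seconds as stated above; `RateLimiter` is a hypothetical helper, not part of `rag_agent.py`, and the injectable `clock` and `sleep` arguments exist only to make it testable.

```python
import time
from collections import deque


class RateLimiter:
    """Sliding-window limiter: at most `max_calls` per `period` seconds."""

    def __init__(self, max_calls=5, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of recent calls, oldest first

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self._clock()
        # Forget calls that have left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self._sleep(self.period - (now - self._calls[0]))
            now = self._clock()
            self._calls.popleft()
        self._calls.append(now)
```

Calling `limiter.wait()` immediately before each retrieval request keeps a single process within the limit; it does not coordinate across multiple processes.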

Contributing

Contributions are welcome! Please follow these guidelines:

Fork the repository
Create a feature branch
Make your changes with clear commit messages
Submit a pull request

License

[Add your license information here]

Support

For issues and questions:

Open an issue on GitHub
Check Google Vertex AI documentation: https://cloud.google.com/vertex-ai/docs

Acknowledgments

Built with Google Vertex AI RAG Engine
Powered by Google ADK and Gemini models
Uses PyPDF2 and PyMuPDF for document processing
