This system ingests PDF documents, extracts their text and images, creates embeddings with Google's Vertex AI, and runs similarity searches through Google's Vector Search service.
The system consists of three main components:
- PDF Ingestion Function (`pdf_ingestion/main.py`): A background Cloud Function triggered by Cloud Storage uploads
- Similarity Search Function (`search/main.py`): HTTP Cloud Functions for querying similar content
- Shared Utilities (`shared_utils.py`): Common initialization and helper functions
app2/
├── src/functions/
│   ├── shared_utils.py            # Shared utilities for both functions
│   ├── pdf_ingestion/             # PDF ingestion function
│   │   ├── main.py                # PDF processing logic
│   │   └── requirements.txt       # Dependencies for PDF function
│   └── search/                    # Search functions
│       ├── main.py                # Search logic
│       └── requirements.txt       # Dependencies for search function
├── deploy.sh                      # Deployment script for all functions
├── vector_search_setup.sh         # Vector Search setup
├── vector_search_metadata.json    # Vector Search configuration
├── test_search.py                 # Test script
└── README.md                      # This file
- PDF Processing: Extracts text and images from PDF documents using PyMuPDF
- Smart Text Chunking: Splits text into manageable chunks with overlap for better embedding quality
- Image Extraction: Extracts and processes images from PDFs
- Vertex AI Embeddings: Uses Google's `textembedding-gecko@003` model for high-quality embeddings
- Vector Search: Stores embeddings in Google's Vector Search for fast similarity queries
- Metadata Storage: Stores document metadata and content in Firestore
- Multiple Search Types:
  - Text-based similarity search
  - Document-to-document similarity search
  - Content type filtering (text/image)
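The chunking-with-overlap feature listed above can be illustrated with a minimal sketch. The function name `chunk_text`, the character-based windows, and the default sizes are assumptions made for illustration; the actual strategy in `pdf_ingestion/main.py` may split on sentences or tokens instead.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters.

    Hypothetical helper: sizes are measured in characters, not tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

Because consecutive chunks share their boundary characters, a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is longer than the sentence, which is the usual motivation for overlapping windows.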
- Function Name: `pdf-ingestion-function`
- Trigger: Cloud Storage uploads (PDF files only)
- Purpose: Processes uploaded PDFs and creates embeddings
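The "PDF files only" restriction on the trigger could be enforced inside the function along the following lines. This is a sketch, not the project's code: `is_pdf_upload` and `handle_upload` are hypothetical names, and the example assumes the background-function `(event, context)` signature, in which the Cloud Storage event dictionary carries `bucket`, `name`, and `contentType` fields.

```python
def is_pdf_upload(event: dict) -> bool:
    """Return True when a Cloud Storage event refers to a PDF object."""
    name = event.get("name", "")
    content_type = event.get("contentType", "")
    return name.lower().endswith(".pdf") or content_type == "application/pdf"


def handle_upload(event: dict, context=None) -> str:
    """Hypothetical entry point: skip non-PDF objects, otherwise report what would be processed."""
    if not is_pdf_upload(event):
        return f"skipped {event.get('name', '<unnamed>')}"
    # The real function would download the object and run extraction here.
    return f"processing gs://{event['bucket']}/{event['name']}"
```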
- Function Name: `pdf-search-function`
- Trigger: HTTP POST requests
- Purpose: Performs text-based similarity search
- Function Name: `pdf-document-search-function`
- Trigger: HTTP POST requests
- Purpose: Finds documents similar to a given document
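Conceptually, both search functions rank stored embeddings by their similarity to a query vector. The sketch below shows that idea with a brute-force cosine-similarity ranking in plain Python. It is a stand-in for illustration only: Vector Search performs approximate nearest-neighbor lookup at scale, and the item keys (`id`, `embedding`, `content_type`) and function names here are assumptions. For document-to-document search, the query vector would be the stored embedding of the given document.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, items, top_k=5, content_type=None):
    """Rank items by similarity to query_vec, optionally keeping one content type.

    `items` is a list of dicts with hypothetical keys 'id', 'embedding', 'content_type'.
    Returns (id, score) pairs, best match first.
    """
    candidates = [
        it for it in items
        if content_type is None or it["content_type"] == content_type
    ]
    scored = [(cosine_similarity(query_vec, it["embedding"]), it["id"]) for it in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(item_id, score) for score, item_id in scored[:top_k]]
```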
- Upload a PDF file to your configured Cloud Storage bucket
- The function will automatically process it and create embeddings
curl -X POST "https://your-region-your-project.cloudfunctions.net/pdf-search-function" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "financial statements",
    "top_k": 5,
    "content_type": "text"
  }'
curl -X POST "https://your-region-your-project.cloudfunctions.net/pdf-document-search-function" \
  -H "Content-Type: application/json" \
  -d '{
    "document_id": "your-document-id",
    "top_k": 5
  }'
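The same two requests can be built from Python using only the standard library. This is a sketch under assumptions: `BASE_URL` repeats the placeholder host from the curl commands, the helper names are hypothetical, and the response format is not documented here, so the example stops at constructing the request.

```python
import json
import urllib.request

# Placeholder host copied from the curl examples; replace with your deployment's URL.
BASE_URL = "https://your-region-your-project.cloudfunctions.net"


def build_request(function_name: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for one of the HTTP search functions."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{function_name}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def text_search_request(query: str, top_k: int = 5, content_type: str = "text"):
    """Request for text-based similarity search."""
    return build_request(
        "pdf-search-function",
        {"query": query, "top_k": top_k, "content_type": content_type},
    )


def document_search_request(document_id: str, top_k: int = 5):
    """Request for document-to-document similarity search."""
    return build_request(
        "pdf-document-search-function",
        {"document_id": document_id, "top_k": top_k},
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request once `BASE_URL` points at a real deployment.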
Run the test script to verify functionality:
python test_search.py
PDF function:
- `google-cloud-storage`: Cloud Storage integration
- `google-cloud-firestore`: Document metadata storage
- `google-cloud-aiplatform`: Vertex AI and Vector Search
- `langchain-google-vertexai`: Vertex AI embeddings
- `PyMuPDF`: PDF processing

Search function:
- `google-cloud-firestore`: Document metadata storage
- `google-cloud-aiplatform`: Vector Search
- `langchain-google-vertexai`: Vertex AI embeddings
User uploads PDF → Cloud Storage → Cloud Function trigger → PyMuPDF processing →
Extract text & images → Generate embeddings → Store vectors in Vector Search & metadata in Firestore