# Setting Up Your Self-Hosted AI Stack - Part 2: Document Processing and RAG with Apache Tika and Qdrant Vector Database

# What we're building today

We're adding document processing capabilities to the foundation from [Part 1](https://triedandtestedbuilds.com/self-hosted-ai-stack-part-1). This extends your chat interface to work with your files and documents.

Two key additions make this possible:

*   **Document Processing**: Extract text from documents like PDFs, Word docs, and text files using Apache Tika (with OCR capabilities for scanned documents)
    
*   **RAG (Retrieval-Augmented Generation)**: Search and retrieve relevant information from your documents to answer questions
    

This transforms your local AI from conversation-only to document-aware. Upload a file, ask questions about its contents, and get answers grounded in your actual documents rather than the model's training data.

# The bigger picture

I'm building this self-hosted AI stack and documenting everything as I go. Part 2 is where real power emerges. We're adding the intelligence layer that transforms your local AI from a chat toy into a genuine knowledge assistant.

Here's the thing: even with today's resources, implementing proper document processing and RAG together is surprisingly complex. This guide captures the lessons learned getting both capabilities production-ready.

What we've built so far:

*   [**Part 1: Building the foundation with Open WebUI, Ollama, and Postgres**: Your foundational chat interface with local LLM hosting](https://triedandtestedbuilds.com/self-hosted-ai-stack-part-1)
    

What's still coming:

*   [**Part 3: Visual Data Extraction**: Web upload interface with N8N workflows and Vision Language Models for extracting structured data from images](https://triedandtestedbuilds.com/self-hosted-ai-stack-part-3)
    
*   **Part 4: Model Superpowers**: Advanced WebUI configuration with tools and knowledge integration
    
*   **Part 5: Intelligent Automation**: WebUI filters and N8N workflows for content processing
    

# Prerequisites

*   [Part 1 foundation stack running](https://triedandtestedbuilds.com/self-hosted-ai-stack-part-1)
    
*   **Ollama running** (verify with `ollama serve`; if you see "address already in use", it's already running)
    
*   24GB+ RAM recommended (16GB minimum)
    
*   50GB+ storage for documents and vectors
    
*   Docker and Docker Compose
    

> **Important**: If Part 1 is running, stop it first with `cd ../part-1-building-the-foundation && docker compose down` before starting Part 2.

# Credits where credits are due

Massive thanks to the creators and contributors of these open source projects that make this possible:

*   [Apache Tika](https://tika.apache.org/) - Repository: [apache/tika](https://github.com/apache/tika)
    
*   [Qdrant](https://qdrant.tech/) - Repository: [qdrant/qdrant](https://github.com/qdrant/qdrant)
    
*   [Ollama](https://ollama.com) - Repository: [ollama/ollama](https://github.com/ollama/ollama)
    
*   [Open WebUI](https://openwebui.com) - Repository: [open-webui/open-webui](https://github.com/open-webui/open-webui)
    
*   [Docker](https://docker.com/) - Repositories: [moby/moby](https://github.com/moby/moby) & [docker/compose](https://github.com/docker/compose)
    

# Quick start (skip explanations, just get it running)

1.  **Download the embedding model** (~275MB):
    
    ```bash
    ollama pull nomic-embed-text
    ```
    
2.  **Clone the repository or pull latest**:
    
    ```bash
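    git clone https://github.com/FarzamMohammadi/self-hosted-ai-stack
    # Enter the repository so the next step's relative path resolves
    cd self-hosted-ai-stack
    # Or if you already have it, run this from inside the repository instead: git pull origin main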
    ```
    
3.  **Navigate to part 2**:
    
    ```bash
    cd part-2-rag-with-tika-and-qdrant
    ```
    
4.  **Start the enhanced stack**:
    
    ```bash
    docker compose up -d
    ```
    

Done. Open [http://localhost:3000](http://localhost:3000), sign in, and you'll see the document upload button (`+`) in the bottom left corner of the chat. Upload a PDF and start asking questions about it.

> **Important**: Open WebUI handles files through two distinct pathways:
> 
> **Document Processing (what we built)**: Extracts text from documents like PDFs, Word docs, and text files through Tika. Does NOT work with standalone image files (PNG, JPEG, etc.).
> 
> **Image Analysis (not covered here)**: For standalone image files, you'll need to download and use a vision model.

# Understanding what you built

Let's break down what we just built and why each component matters. Understanding the architecture transforms you from someone following steps into someone who owns the system.

## What are document processing and RAG, and why you need both

Your AI model was trained on data from months or years ago. It doesn't know about your company's latest reports, personal documents, or today's news. Ask about quarterly financials, and it apologizes for lacking access.

**Document Processing** uses Apache Tika to extract machine-readable text directly from documents like PDFs, Word docs, and other formats by parsing their internal structure. For scanned documents or image-based content within PDFs, Tika automatically falls back to OCR using Tesseract when needed.

**RAG (Retrieval-Augmented Generation)** solves the knowledge gap by giving your AI instant access to search through your documents for relevant information.

Together, they create a document intelligence system that processes documents and provides grounded answers.

**The transformation:**

| Without Document Processing + RAG | With Document Processing + RAG |
| --- | --- |
| "I can't access your documents." | "Based on your invoice PDF, the total amount due is $2,847." |
| "I don't have access to your company's financial data." | "Based on your Q3 report, revenue increased 23% compared to last quarter." |
| Limited to conversation only | Processes documents like PDFs, Word docs, text files |
| Operates with frozen training data | Searches YOUR content for relevant information |
| Generic responses, frequent "I don't have access" | Accurate answers with source attribution |

**Real examples**:

*Document processing:* "What's the total on this invoice?"

*   Without document processing: "I can't read documents or extract text from files."
    
*   With document processing + RAG: "Based on your invoice PDF, the total amount due is $2,847, with a due date of March 15th."
    

*Document analysis:* "What were our biggest challenges last quarter?"

*   Without RAG: "I don't have access to your quarterly reports."
    
*   With document processing + RAG: "Based on your Q3 report, the biggest challenges were supply chain delays (mentioned 8 times) and staffing shortages in the manufacturing division (pages 12-14)."
    

This transforms your AI from a chat interface into a document intelligence system that reads, understands, and answers questions about your documents.

## How vector similarity works

Traditional search looks for exact word matches. Search for "machine learning performance" and you'll miss documents about "AI optimization" or "model efficiency" entirely.

**Embeddings solve this**: mathematical representations that capture semantic meaning.

Think of embeddings as GPS coordinates for concepts:

*   "Machine learning" → \[0.2, 0.8, 0.1, ...\] (768 dimensions)
    
*   "AI optimization" → \[0.3, 0.7, 0.2, ...\]
    
*   "Car repair" → \[0.9, 0.1, 0.3, ...\]
    

Closer coordinates mean more similar meaning. When nomic-embed-text converts "reduce costs" into a 768-dimensional vector, it positions near related concepts:

*   "Budget optimization" (distance: 0.23)
    
*   "Expense management" (distance: 0.31)
    
*   "Financial efficiency" (distance: 0.28)
    

Because similarity is computed from these vectors rather than from shared words, RAG can retrieve conceptually relevant information beyond keyword matches.
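
To make "closer coordinates" concrete, here is a minimal sketch in Python. It assumes cosine distance as the metric (this guide doesn't state which metric the stack uses) and reuses the toy 3-number vectors from the list above as stand-ins for real 768-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Distance form of the same measure: smaller means more similar."""
    return 1.0 - cosine_similarity(a, b)


# Toy stand-ins for embeddings (assumption: real vectors have 768 values).
machine_learning = [0.2, 0.8, 0.1]
ai_optimization = [0.3, 0.7, 0.2]
car_repair = [0.9, 0.1, 0.3]

print("ML vs AI optimization:", cosine_distance(machine_learning, ai_optimization))
print("ML vs car repair:     ", cosine_distance(machine_learning, car_repair))
```

The first distance comes out much smaller than the second, which is the whole idea: related concepts sit close together, unrelated ones sit far apart.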

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1754862170197/8177dfdc-999f-408e-8902-d467b45ae7c1.webp align="center")

*Vector search visualization showing how concepts cluster in high-dimensional space. Related concepts like "Kitten" and "Cat" appear close together in mathematical space, enabling semantic search that finds meaning beyond exact keyword matches.*

> Source: https://odsc.medium.com/a-gentle-introduction-to-vector-search-3c0511bc6771

## Vector databases store and search embeddings

Once you have embeddings, you need somewhere to store and search them efficiently. Regular databases aren't designed for similarity search out of the box.

**Traditional database (SQL):**

```sql
SELECT * FROM documents WHERE title LIKE '%machine learning%';
```

Finds: Documents with the exact phrase "machine learning" in the title

**Vector database:**

```plaintext
SEARCH FOR documents SIMILAR TO embedding([0.2, 0.8, 0.1, ...])
```

Finds: Documents about AI, neural networks, model training, performance optimization, etc.
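
The pseudo-query above boils down to ranking stored vectors by how similar they are to the query vector. Here is a rough brute-force sketch in Python; the titles and 3-number vectors are made up for illustration, and cosine similarity is an assumed metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def search(query_vector, index, top_k=2):
    """Score every stored vector against the query and return the top_k (score, text) pairs."""
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical documents with toy embeddings.
index = [
    ("Neural network training tips", [0.25, 0.75, 0.15]),
    ("Model performance optimization", [0.3, 0.7, 0.2]),
    ("Brake pad replacement guide", [0.9, 0.1, 0.3]),
]

results = search([0.2, 0.8, 0.1], index, top_k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

This version compares the query against every stored vector, which is fine for three documents but not for millions.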

Vector databases use specialized algorithms to find similar vectors among millions quickly. Semantic search understands intent:

*   Search: "reduce costs" → Finds: "budget optimization", "expense management", "financial efficiency"
    
*   Search: "team problems" → Finds: "communication challenges", "staff conflicts", "collaboration issues"
    

## How the complete RAG flow works

**Document Processing (Setup Phase):**

1.  You upload a document
    
2.  **Apache Tika server** extracts machine-readable text from documents like PDFs, Word docs, and text files (using OCR via Tesseract for scanned content when needed)
    
3.  Open WebUI splits the text into 1000-character chunks
    
4.  **nomic-embed-text model** (via Ollama) converts each chunk into a 768-dimensional vector
    
5.  **Qdrant vector database** stores the vectors with the original text for fast similarity search
    

**Query Processing (When You Ask Questions):**

1.  You ask a question
    
2.  **nomic-embed-text model** converts the question to a 768-dimensional vector
    
3.  **Qdrant vector database** searches and finds the 5 most similar document chunks
    
4.  Open WebUI compiles the relevant text excerpts
    
5.  The enhanced prompt (question + retrieved excerpts) is sent to the **Ollama LLM**
    
6.  The AI generates a grounded response with source attribution

**The complete document processing + RAG orchestration**: Open WebUI handles this entire workflow behind the scenes:

*   **Apache Tika** extracts text from documents like PDFs, Word docs, and text files (with automatic OCR fallback for scanned content)
    
*   **nomic-embed-text** converts extracted content into 768-dimensional vectors
    
*   **Qdrant** stores and indexes vectors for fast similarity search
    
*   **Open WebUI** orchestrates the complete pipeline from upload to intelligent response
    

What typically requires complex document processing and RAG engineering becomes simple upload and query. Your documents transform into a searchable knowledge base within seconds.
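
To picture the last steps of the query phase, here is a small Python sketch of how retrieved chunks could be filtered and folded into an "enhanced prompt". The prompt wording, the `hits` structure, and the file names are hypothetical; Open WebUI uses its own template and may apply the relevance threshold differently.

```python
def build_prompt(question, hits, threshold=0.75, top_k=5):
    """Keep the best-scoring chunks and place them ahead of the question."""
    relevant = [hit for hit in hits if hit["score"] >= threshold]
    relevant.sort(key=lambda hit: hit["score"], reverse=True)
    selected = relevant[:top_k]
    context = "\n\n".join(
        f"[{number}] ({hit['source']}) {hit['text']}"
        for number, hit in enumerate(selected, start=1)
    )
    return (
        "Answer using only the context below and cite sources by number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up search results; in the real stack these would come from Qdrant.
hits = [
    {"score": 0.91, "source": "q3-report.pdf", "text": "Revenue increased 23% compared to last quarter."},
    {"score": 0.80, "source": "q3-report.pdf", "text": "Supply chain delays affected deliveries."},
    {"score": 0.42, "source": "handbook.docx", "text": "Office hours are 9 to 5."},
]

prompt = build_prompt("How did revenue change in Q3?", hits)
print(prompt)
```

The low-scoring handbook chunk never reaches the model, which is how the answer stays grounded in the relevant document.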

# Setting up the document processing and RAG components

Now that you understand the concepts, let's examine the technical implementation that makes everything work.

## System requirements

| Component | Memory | Capabilities | Storage Impact |
| --- | --- | --- | --- |
| **Apache Tika** | 4GB | Text extraction from documents like PDFs, Word docs, and text files (with OCR for scanned content) | ~2MB per processed document |
| **nomic-embed-text** | 275MB | Document embedding generation | Creates 768-dimensional vectors |
| **Qdrant** | 2GB | Vector storage and similarity search | ~50KB per document chunk |
| **Overall system** | 24GB+ | Process 100+ documents concurrently | Scales to 10,000+ documents |
| **Storage growth** | 50GB+ | Raw documents + vector indexes | ~5MB total per average business document |

*Performance estimates are rough approximations based on system understanding and real-world usage patterns.*
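
If you want to sanity-check the storage numbers for your own documents, the arithmetic is simple enough to sketch in Python. This assumes plain character-window chunking with the `CHUNK_SIZE=1000` and `CHUNK_OVERLAP=100` values used in this guide, and 4-byte (float32) vector values; the 50,000-character document is a made-up example.

```python
import math


def estimate_chunks(char_count, chunk_size=1000, overlap=100):
    """Chunk count when each new chunk starts (chunk_size - overlap) characters later."""
    if char_count <= 0:
        return 0
    if char_count <= chunk_size:
        return 1
    step = chunk_size - overlap
    return 1 + math.ceil((char_count - chunk_size) / step)


def raw_vector_bytes(chunks, dimensions=768, bytes_per_value=4):
    """Bytes needed for the vectors alone (assumption: float32 values)."""
    return chunks * dimensions * bytes_per_value


chunks = estimate_chunks(50_000)
print("chunks:", chunks)
print("raw vector bytes:", raw_vector_bytes(chunks))
```

The raw vector is only part of what gets stored per chunk: the original text and the search index add to it, so treat this as a lower bound rather than a replacement for the table's estimates.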

# Putting it all together

Here's our enhanced `docker-compose.yml` that adds complete document processing and RAG capabilities to our foundation:

```yaml
version: '3.8'

services:
  postgres:
    image: postgres:15-alpine
    container_name: postgres
    ports:
      - '5432:5432'
    environment:
      - POSTGRES_DB=openwebui
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=securepassword123
    volumes:
      - ./volumes/postgres/data:/var/lib/postgresql/data
    restart: unless-stopped
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U postgres -d openwebui']
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - local-ai-network

  pgadmin:
    image: dpage/pgadmin4:latest
    container_name: pgadmin
    ports:
      - '5050:80'
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@local.ai
      - PGADMIN_DEFAULT_PASSWORD=admin123
      - PGADMIN_CONFIG_SERVER_MODE=False
      - PGADMIN_CONFIG_MASTER_PASSWORD_REQUIRED=False
    volumes:
      - ./volumes/pgadmin:/var/lib/pgadmin
    restart: unless-stopped
    depends_on:
      postgres:
        condition: service_healthy
    networks:
      - local-ai-network

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: webui
    ports:
      - '3000:8080'
    volumes:
      - ./volumes/open-webui/data:/app/backend/data
    environment:
      # Ollama connection
      - OLLAMA_BASE_URL=http://host.docker.internal:11434

      # Database connection
      - DATABASE_URL=postgresql://postgres:securepassword123@postgres:5432/openwebui

      # Basic settings
      - WEBUI_SECRET_KEY=your-secret-key-here
      - WEBUI_AUTH=true
      - ENABLE_SIGNUP=true
      - DEFAULT_MODELS=qwen3:8b
      
      # Document Processing + RAG Configuration (NEW)
      - UPLOAD_DIR=/app/backend/data/uploads
      - ENABLE_PERSISTENT_CONFIG=false
      - CONTENT_EXTRACTION_ENGINE=tika
      - TIKA_SERVER_URL=http://tika:9998
      - VECTOR_DB=qdrant
      - QDRANT_URI=http://qdrant:6333/
      - QDRANT_COLLECTION_PREFIX=document_chunks
      - RAG_EMBEDDING_MODEL=nomic-embed-text
      - RAG_EMBEDDING_ENGINE=ollama
      - ENABLE_RAG_HYBRID_SEARCH=false
      - RAG_RELEVANCE_THRESHOLD=0.75
      - CHUNK_SIZE=1000
      - CHUNK_OVERLAP=100

    extra_hosts:
      - 'host.docker.internal:host-gateway'
    restart: unless-stopped
    depends_on:
      postgres:
        condition: service_healthy
    networks:
      - local-ai-network

  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant
    ports:
      - '6333:6333'
      - '6334:6334'
    environment:
      - QDRANT__SERVICE__HTTP_PORT=6333
      - QDRANT__SERVICE__GRPC_PORT=6334
    volumes:
      - ./volumes/qdrant/storage:/qdrant/storage
    restart: unless-stopped
    networks:
      - local-ai-network

  tika:
    image: apache/tika:latest-full
    container_name: tika-server
    ports:
      - '9998:9998'
    environment:
      - TIKA_CONFIG=/opt/tika/config/tika-config.xml
      - JAVA_OPTS=-Xmx2g -Xms512m -Dfile.encoding=UTF-8 -Djava.awt.headless=true
      - TIKA_OCR_LANGUAGE=eng
      - TIKA_PDF_OCR_STRATEGY=OCR_AND_TEXT_EXTRACTION
    volumes:
      - ../shared/tika/config:/opt/tika/config
    command:
      - --host=0.0.0.0
      - --port=9998
      - --config=/opt/tika/config/tika-config.xml
    restart: unless-stopped
    networks:
      - local-ai-network

networks:
  local-ai-network:
    driver: bridge
```

**What's new from Part 1:**

This Docker Compose builds on your existing Part 1 setup by adding two key containers:

*   **Qdrant**: Vector database for storing document embeddings
    
*   **Apache Tika**: Text extraction engine for documents like PDFs, Word docs, and text files
    

**Key configuration changes:**

*   `ENABLE_PERSISTENT_CONFIG=false`: Lets these new document processing + RAG settings override Part 1 configs
    
*   `CONTENT_EXTRACTION_ENGINE=tika`: Routes document processing through the new Tika container
    
*   `RAG_RELEVANCE_THRESHOLD=0.75`: Sets search quality threshold
    
*   `QDRANT_COLLECTION_PREFIX=document_chunks`: Names your vector storage collections
    
*   `TIKA_PDF_OCR_STRATEGY=OCR_AND_TEXT_EXTRACTION`: Runs both machine-readable text extraction and OCR on PDFs, so scanned content within documents is captured too
    
*   `CHUNK_SIZE=1000` & `CHUNK_OVERLAP=100`: Document splitting settings for better search
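
To see what `CHUNK_SIZE` and `CHUNK_OVERLAP` mean in practice, here is a simplified Python sketch of character-window splitting. It is an illustration only: Open WebUI's real splitter is likely smarter about breaking on paragraph or sentence boundaries.

```python
def split_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# A 2,500-character dummy document made of repeating digits.
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_text(sample)
print([len(chunk) for chunk in chunks])
```

The last 100 characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still fully present in at least one chunk.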
    

## Tika Configuration

[The repository](https://github.com/FarzamMohammadi/self-hosted-ai-stack) includes a [pre-configured `tika-config.xml` file](https://github.com/FarzamMohammadi/self-hosted-ai-stack/blob/main/shared/tika/config/tika-config.xml) mounted into the Tika container via the volume mapping `../shared/tika/config:/opt/tika/config` (shown in the `volumes` section of the `tika` service in the Docker Compose file). This configuration contains simple yet optimized settings for document processing, including PDF text extraction and OCR fallback capabilities, plus resource limits to prevent excessive usage during document processing.

# Testing your document processing + RAG setup

Time to verify everything works as expected. Let's systematically test each component:

## 1\. Container health check

```bash
docker compose ps
```

All five containers should show as running:

*   `postgres` (healthy)
    
*   `pgadmin`
    
*   `webui`
    
*   `qdrant`
    
*   `tika-server`
    

## 2\. Verify service endpoints

**Qdrant Database:**

*   Health check: Open [http://localhost:6333/healthz](http://localhost:6333/healthz) - should show "healthz check passed"
    
*   Dashboard: Browse [http://localhost:6333/dashboard](http://localhost:6333/dashboard) - explore your document collections and vector data
    

**Tika server:**

*   Quick version check: Open [http://localhost:9998/version](http://localhost:9998/version) - shows Tika version info
    
*   All available routes: Browse [http://localhost:9998/](http://localhost:9998/) - lists all Tika API endpoints
    

**Ollama with both models:**

```bash
ollama list
```

Should list both `qwen3:8b` and `nomic-embed-text`.
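
If you'd rather check all three services in one go, a small Python script can do it. This is a sketch using only the standard library; the Qdrant and Tika URLs come from this guide, while the Ollama `/api/tags` path (its model-listing endpoint) is an addition not mentioned above.

```python
import http.client
import urllib.request

# Service name -> URL that should answer with HTTP 2xx when the service is up.
ENDPOINTS = {
    "Qdrant": "http://localhost:6333/healthz",
    "Tika": "http://localhost:9998/version",
    "Ollama": "http://localhost:11434/api/tags",
}


def is_up(url, timeout=3.0):
    """Return True if the URL answers with an HTTP 2xx status, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers URLError, HTTPError, timeouts, and refused connections.
        return False


def check_all(endpoints=ENDPOINTS):
    """Check every endpoint and return a name -> bool mapping."""
    return {name: is_up(url) for name, url in endpoints.items()}


if __name__ == "__main__":
    for name, ok in check_all().items():
        print(f"{name:8} {'OK' if ok else 'UNREACHABLE'}")
```

Anything reported as `UNREACHABLE` is worth a look in `docker compose logs` (or, for Ollama, on the host itself) before moving on.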

## 3\. Upload and test different document types

1.  Navigate to [http://localhost:3000](http://localhost:3000)
    
2.  Sign in with your existing account
    
3.  Look for the document upload icon in the bottom left corner of the chat
    
4.  **Upload documents**: PDFs, Word docs, or text files
    
5.  Wait for processing to complete
    
6.  Ask specific questions about the document content
    

**Test questions for document processing:**

*   "What's the total amount on this invoice?"
    
*   "Extract the key information from this document"
    
*   "What are the main points in this document?"
    

**Test questions for RAG analysis:**

*   "What are the main topics covered in this document?"
    
*   "Summarize the key findings"
    
*   "What does the document say about \[specific topic\]?"
    

## 4\. Verify vector storage

Check that Qdrant is storing your document vectors: [http://localhost:6333/dashboard#/collections](http://localhost:6333/dashboard#/collections)

You should see collections created for your uploaded documents.
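
The same check works without the dashboard, since Qdrant exposes its collections over REST at `GET /collections`. Below is a minimal Python sketch; the sample response and the collection name in it are made up to show the shape of the payload, so your real names will differ.

```python
import json
import urllib.request


def collection_names(payload):
    """Extract collection names from a Qdrant GET /collections response body."""
    return [item["name"] for item in payload["result"]["collections"]]


def fetch_collections(base_url="http://localhost:6333", timeout=5.0):
    """Ask a running Qdrant instance for its collections (requires the stack to be up)."""
    with urllib.request.urlopen(f"{base_url}/collections", timeout=timeout) as response:
        return collection_names(json.load(response))


# Offline example using a hypothetical response body.
sample = {
    "result": {"collections": [{"name": "document_chunks_example"}]},
    "status": "ok",
    "time": 0.0,
}
print(collection_names(sample))
```

With the stack running, `fetch_collections()` should return at least one name after your first upload; an empty list suggests the document was never embedded.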

## 5\. View extracted text (optional)

Want to see exactly what text was extracted from your uploaded documents? This helps verify text extraction accuracy (including OCR when used) and understand what content the AI uses.

1.  Visit [http://localhost:5050](http://localhost:5050)
    
2.  Login with `admin@local.ai` / `admin123`
    
3.  If you haven't already connected to the database, add a server connection:
    
    1.  Right click on **Servers** → **Register** → **Server**
        
    2.  **General** tab → **Name**: `local-ai`
        
    3.  **Connection** tab:
        
        *   **Host name/address**: `postgres`
            
        *   **Port**: `5432`
            
        *   **Maintenance database**: `postgres`
            
        *   **Username**: `postgres`
            
        *   **Password**: `securepassword123`
            
    4.  Click **Save**
        
4.  Navigate to **local-ai** → **Databases** → **openwebui** → **Schemas** → **public** → **Tables** → **file**
    
5.  Right-click on the **file** table and select **View/Edit Data** → **All Rows**
    
6.  Look at the **data** column to see the extracted text from your uploaded documents
    

**What you'll see:**

*   The `filename` column shows your original file names
    
*   The `data` column contains the full extracted text from each document
    
*   This is exactly what the AI uses to answer your questions about the documents
    

**Troubleshooting tip:** If the `data` column is empty or contains gibberish, the document may be corrupted or contain primarily images. This setup works best with standard document formats and doesn't process standalone image files.
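
To rule out Open WebUI as the culprit, you can also send a file straight to Tika and look at what comes back. This Python sketch uses Tika Server's `PUT /tika` endpoint with an `Accept: text/plain` header; the file name in the usage comment is hypothetical.

```python
import urllib.request


def build_tika_request(data, base_url="http://localhost:9998"):
    """Build a PUT /tika request that asks Tika for plain-text output."""
    return urllib.request.Request(
        f"{base_url}/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path, base_url="http://localhost:9998", timeout=120.0):
    """Send a local file to the Tika container and return the extracted text."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read(), base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Usage (needs the Tika container running):
# print(extract_text("invoice.pdf")[:500])
```

If Tika returns clean text here but the `data` column is still empty, the problem is more likely in the Open WebUI configuration than in the document itself.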

## Troubleshooting common issues

| Issue | Quick Fix | Details |
| --- | --- | --- |
| Document upload fails | Check `docker compose logs tika-server` | Sweet spot: 5 to 15MB files |
| No RAG responses | Verify `ollama list` includes both models | First upload takes 2 to 3 times longer |
| RAG responses seem inaccurate | Check chunk size and overlap settings | Try `CHUNK_SIZE=500` for technical docs |
| Tika memory errors | Raise the `-Xmx2g` heap limit in `JAVA_OPTS` to `-Xmx6g` or `-Xmx8g` | 4GB handles most documents |
| No text extracted | Standalone image file uploaded (PNG, JPEG, etc.) | Use vision models for standalone image files |
| System memory issues | Monitor with `docker stats` | 16GB minimum, 24GB comfortable |
| Vector search too slow | Check Qdrant storage space | Database performance degrades when disk is full |

# What's next

You've built a document intelligence system that combines document processing with RAG! Your self-hosted stack now includes:

✅ **Ollama** serving your LLM and embedding models

✅ **Open WebUI** with complete document intelligence

✅ **PostgreSQL** storing conversations and configurations

✅ **Qdrant** powering semantic document search and retrieval

✅ **Apache Tika** providing text extraction from documents like PDFs, Word docs, and text files (with OCR capabilities for scanned content)

## Coming up in Part 3

We'll build a **visual data extraction pipeline** with:

*   **Web upload interface** for batch image processing
    
*   **N8N workflows** orchestrating the extraction pipeline
    
*   **Vision Language Models (VLM)** via Ollama for analyzing images
    
*   **PostgreSQL storage** for tracking jobs and extracted structured data
    
*   **Automated job processing** with retry logic and error handling
    

This adds the ability to extract structured data from images at scale, perfect for processing receipts, invoices, forms, or any visual documents.

## Homework before Part 3

Explore your new document processing and RAG capabilities:

*   **Test document processing**: Upload PDFs, Word docs, and text files
    
*   **Test RAG**: Upload various documents and ask complex questions
    
*   **Multi-language**: Review the `tika-config.xml` file in the `self-hosted-ai-stack/shared/tika/config/` directory to understand language settings, add new languages to it, then test documents in different languages (a hands-on way to learn Tika configuration)
    
*   **Multi-format**: Try Word docs, PowerPoint slides, Excel sheets (remember: this processes standard document formats, but not standalone image files)
    
*   **Advanced queries**: Try complex multi-document queries across different formats
    
*   **Monitor processing**: Browse vector collections in Qdrant's web interface at [http://localhost:6333/dashboard](http://localhost:6333/dashboard)
    

## Helpful resources

*   [Open WebUI RAG Documentation](https://docs.openwebui.com/features/rag)
    
*   [Qdrant Documentation](https://qdrant.tech/documentation/)
    
*   [Apache Tika Supported Formats](https://tika.apache.org/2.9.1/formats.html)
    
*   [Ollama Embedding Models](https://ollama.com/search?q=embed)
    

* * *

*This is part of my "Complete Self-Hosted AI Infrastructure" series. Follow along as we build increasingly sophisticated AI capabilities, all running self-hosted on your machine.*