Data Processing - Ragnerock Docs

When you upload data to Ragnerock, it passes through a multi-stage processing pipeline that extracts content, splits it into searchable units, and generates vector embeddings. This makes your data ready for semantic search, annotation, and agentic research workflows.

Overview

Document-like data sources go through four core stages; tabular data skips the final embedding stage (see Data Types):

Upload & Storage: The data source is securely stored in your project’s storage
Content Extraction: Content is extracted using format-specific strategies (OCR for PDFs, direct parsing for text and tabular files)
Chunking: Extracted content is segmented into semantic units optimized for search and annotation
Embedding Generation: Each chunk is converted into a vector representation that enables semantic similarity search

┌───────────┐    ┌────────────┐    ┌──────────┐    ┌────────────┐
│  Upload   │───▶│ Extraction │───▶│ Chunking │───▶│ Embeddings │
│ & Storage │    │            │    │          │    │            │
└───────────┘    └────────────┘    └──────────┘    └────────────┘

Once every applicable stage completes, the data source is fully indexed and available for search, annotation workflows, and agent-powered research.

Data Types

Ragnerock handles two categories of data sources, each with a distinct processing path:

Document-Like Data

Formats: PDF, Word (.docx), Markdown, plain text, webpages

Document-like data is processed page by page. Content is extracted into pages, then split into paragraph-level chunks. Each chunk receives a vector embedding for semantic search. Document-like data supports annotation at multiple granularities: document, page, paragraph, or sentence level.

Tabular Data

Formats: Excel (.xlsx), CSV

Tabular data is parsed into sheets and rows rather than pages and paragraphs. Each row is treated as an individual unit. Column types are automatically inferred during parsing. Tabular data is queryable directly via SQL and does not go through the embedding pipeline. Tabular data supports annotation at row, sheet, or document level.
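
The split between the two processing paths can be pictured as a small routing step. The sketch below is illustrative only: the function names and the extension lists (for example `.md`, `.txt`, `.html`) are assumptions inferred from the formats named above, not Ragnerock's actual API.

```python
from pathlib import Path

# Assumed extension lists, inferred from the formats named in this section.
DOCUMENT_EXTENSIONS = {".pdf", ".docx", ".md", ".txt", ".html"}
TABULAR_EXTENSIONS = {".xlsx", ".csv"}


def classify_source(filename: str) -> str:
    """Return 'document' or 'tabular' for a file name (hypothetical helper)."""
    suffix = Path(filename).suffix.lower()
    if suffix in TABULAR_EXTENSIONS:
        return "tabular"
    if suffix in DOCUMENT_EXTENSIONS:
        return "document"
    raise ValueError(f"unsupported format: {suffix or filename}")


def annotation_levels(kind: str) -> list[str]:
    """Annotation granularities listed above for each data category."""
    if kind == "tabular":
        return ["document", "sheet", "row"]
    return ["document", "page", "paragraph", "sentence"]


def needs_embedding(kind: str) -> bool:
    """Only document-like data goes through the embedding pipeline."""
    return kind == "document"
```

For example, `classify_source("sales.csv")` yields `"tabular"`, for which `needs_embedding` is false.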

Content Extraction

Ragnerock uses format-specific extraction strategies to maximize content quality:

PDF Extraction

PDFs are processed using OCR (optical character recognition) technology that handles both scanned documents and digitally created PDFs. The extraction engine:

Processes each page independently
Recognizes and preserves table structures
Converts content to a structured text format with page boundaries
Handles complex layouts including multi-column text, headers, and footers
Extracts embedded images (charts, diagrams, exhibits, screenshots) and forwards each to the model service for vision-model summarization
Splices the resulting descriptions back into the page markdown inline as [Image: ...] markers, keeping the rest of the pipeline (chunking, embedding, annotation) text-only
Persists the original images to blob storage with DocumentImage records, enabling provenance, citation, and future re-rendering or re-summarization
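
The splicing step above can be sketched as a plain string operation. This is a minimal illustration, assuming each image arrives as a character offset into the page text plus a description already produced by the vision model; the function name and the input shape are hypothetical.

```python
def splice_image_descriptions(page_text: str, images: list[tuple[int, str]]) -> str:
    """Insert '[Image: ...]' markers at the given character offsets.

    `images` holds (offset, description) pairs. Both values are assumed
    to come from earlier extraction steps that this sketch does not model.
    """
    result = page_text
    # Work from the last offset to the first so earlier offsets stay valid.
    for offset, description in sorted(images, reverse=True):
        marker = f"[Image: {description}]"
        result = result[:offset] + marker + result[offset:]
    return result
```

Because the output is ordinary text, later stages can treat image descriptions like any other content.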

Direct Text Extraction

Word documents, Markdown files, and plain text are parsed directly without OCR. The content is decoded and validated, preserving the original structure.

Web Content Extraction

Webpages are fetched and converted to clean text, stripping navigation, ads, and boilerplate while preserving the meaningful content structure.

Tabular Extraction

Excel and CSV files are parsed with automatic column type inference. Ragnerock detects whether each column contains numbers, dates, text, or other data types.

Large spreadsheets are supported with configurable row limits per sheet.
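
Column type inference with a row limit could look roughly like the sketch below. It is not Ragnerock's implementation: the type names, the ISO-only date handling, and the `max_rows` parameter are assumptions chosen to keep the example small.

```python
import csv
import io
from datetime import date


def infer_value_type(value: str) -> str:
    """Classify one cell as 'number', 'date', or 'text'."""
    try:
        float(value)
        return "number"
    except ValueError:
        pass
    try:
        date.fromisoformat(value)  # this sketch only recognizes ISO dates
        return "date"
    except ValueError:
        return "text"


def infer_column_types(csv_text: str, max_rows: int = 1000) -> dict[str, str]:
    """Infer one type per column from at most `max_rows` data rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen: dict[str, set[str]] = {name: set() for name in reader.fieldnames or []}
    for index, row in enumerate(reader):
        if index >= max_rows:  # hypothetical per-sheet row limit
            break
        for name, value in row.items():
            if name in seen and value:  # skip empty or malformed cells
                seen[name].add(infer_value_type(value))
    # A column keeps a specific type only if every non-empty cell agrees.
    return {
        name: next(iter(types)) if len(types) == 1 else "text"
        for name, types in seen.items()
    }
```

Falling back to text when cells disagree is a conservative choice: a mixed column stays readable rather than failing to parse.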

Chunking

After extraction, content is segmented into chunks, the fundamental units used for search and annotation.

Paragraph-Based Segmentation

For document-like data, Ragnerock splits content on paragraph boundaries (double line breaks). This preserves the natural semantic structure of the document rather than using arbitrary fixed-size windows. Each chunk includes:

The text content of the paragraph
Character position metadata (start and end offsets within the page)
A reference to the originating page

Position metadata enables Ragnerock to link search results and annotations back to their exact location in the source document, supporting full citation and verification.
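
A minimal sketch of paragraph-based segmentation with position metadata follows. The `Chunk` class and `chunk_page` function are hypothetical names; the only behavior taken from the text above is splitting on double line breaks and recording start and end offsets within the page.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    page_number: int
    text: str
    start: int  # offset of the first character within the page
    end: int    # offset one past the last character


def chunk_page(page_text: str, page_number: int) -> list[Chunk]:
    """Split a page on blank lines, recording character offsets."""
    chunks = []
    start = 0
    # A paragraph boundary is two or more consecutive line breaks.
    boundaries = [match.span() for match in re.finditer(r"\n{2,}", page_text)]
    boundaries.append((len(page_text), len(page_text)))  # end-of-page sentinel
    for boundary_start, boundary_end in boundaries:
        text = page_text[start:boundary_start]
        if text.strip():  # skip empty paragraphs
            chunks.append(Chunk(page_number, text, start, boundary_start))
        start = boundary_end
    return chunks
```

Slicing the page with a chunk's offsets, `page_text[chunk.start:chunk.end]`, returns exactly the chunk text, which is what makes citation back to the source possible.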

Row-Based Units

For tabular data, each row is treated as an individual chunk. Rows are queryable directly via SQL without requiring the embedding step.

Quality Filtering

During chunking, Ragnerock applies quality filters to remove low-value content. For example, chunks that consist entirely of corrupted OCR output or non-meaningful characters are discarded. This keeps the search index clean and improves result quality.
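
The exact filters are not documented here, so the following is only a plausible heuristic: discard chunks that are too short or that consist mostly of symbols. The function names and both thresholds are assumptions.

```python
def is_meaningful(text: str, min_length: int = 3, min_useful_ratio: float = 0.5) -> bool:
    """Heuristic: keep chunks made mostly of letters, digits, or spaces."""
    stripped = text.strip()
    if len(stripped) < min_length:
        return False
    useful = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return useful / len(stripped) >= min_useful_ratio


def filter_chunks(chunks: list[str]) -> list[str]:
    """Drop chunks that fail the heuristic, preserving order."""
    return [chunk for chunk in chunks if is_meaningful(chunk)]
```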

Embedding Generation

Each chunk from document-like data is converted into a high-dimensional vector embedding, a numerical representation that captures the semantic meaning of the text. These embeddings power Ragnerock’s semantic search.

Vector storage: Embeddings are stored in a vector database optimized for fast similarity search across your entire data library
Batched processing: Chunks are grouped into batches for efficient embedding generation, with multiple batches processed in parallel
Separate storage: Embedding vectors are stored separately from your source content, enabling flexible deployment configurations including Bring Your Own Database setups

Once embeddings are generated, the data source is fully searchable through both traditional keyword matching and semantic similarity, where conceptually related content surfaces even when exact terms don’t match.
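
To illustrate how vector similarity surfaces related content, the sketch below ranks chunks by cosine similarity to a query vector. Cosine similarity is a common choice but an assumption here, since the similarity metric is not named above; the function names and the tiny two-dimensional vectors are purely illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_chunks(
    query_vector: list[float],
    chunk_vectors: dict[str, list[float]],
    top_k: int = 3,
) -> list[str]:
    """Return the ids of the `top_k` chunks most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), chunk_id)
        for chunk_id, vector in chunk_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:top_k]]
```

A production vector database would use an index rather than this linear scan, but the ranking idea is the same.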

Job System

Data processing runs asynchronously through a job system designed for reliability, scalability, and visibility.

What Is a Job?

A job is the unit of work that tracks a single data source’s processing from start to finish. When you upload a data source (or manually trigger reprocessing), a job is created and placed in a queue for asynchronous execution.

Every processing job is visible in the Jobs dashboard, where you can monitor progress, inspect per-document status, and review logs.

Status Lifecycle

Every job progresses through a series of statuses:

Pending ───▶ Processing ───▶ Success
                  │
                  └──────────▶ Error

Pending: The job is queued and waiting to be picked up by a worker
Processing: The job is actively being worked on
Success: All processing stages completed successfully
Error: Processing failed (partial results may still be available)
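
The lifecycle above can be modeled as a small state machine. This is a sketch with hypothetical names; it follows the diagram literally and does not model manual re-triggering of a finished job.

```python
from enum import Enum


class JobStatus(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    SUCCESS = "success"
    ERROR = "error"


# Transitions shown in the status diagram; SUCCESS and ERROR are terminal here.
ALLOWED_TRANSITIONS = {
    JobStatus.PENDING: {JobStatus.PROCESSING},
    JobStatus.PROCESSING: {JobStatus.SUCCESS, JobStatus.ERROR},
    JobStatus.SUCCESS: set(),
    JobStatus.ERROR: set(),
}


def transition(current: JobStatus, new: JobStatus) -> JobStatus:
    """Return the new status, or raise if the diagram does not allow the move."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {new.value}")
    return new
```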

Processing Phases

Within the “Processing” status, a job moves through distinct phases:

Parsing ───▶ Chunking ───▶ Embedding ───▶ Annotating ───▶ Complete

Parsing: Extract content and structure from the raw data source
Chunking: Segment extracted content into semantic units
Embedding: Generate vector embeddings for each chunk (document-like data only)
Annotating: Run any configured annotation workflows (if applicable)
Complete: All phases finished; data source is fully indexed

Annotation workflows configured with auto-run execute automatically in the Annotating phase, once the preceding phases finish.
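
Which phases a given job passes through depends on the data type and on whether annotation workflows are configured. The helper below is a hypothetical sketch of that decision, based only on the phase descriptions above.

```python
def plan_phases(is_tabular: bool, has_workflows: bool) -> list[str]:
    """List the phases a job would run, in order (illustrative only)."""
    phases = ["parsing", "chunking"]
    if not is_tabular:
        phases.append("embedding")   # document-like data only
    if has_workflows:
        phases.append("annotating")  # only if workflows are configured
    phases.append("complete")
    return phases
```

For a CSV file with no workflows, this yields just parsing, chunking, and complete.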

Two-Tier Architecture

Ragnerock uses a two-tier approach for scalable processing:

Orchestration tier: A job worker manages the overall lifecycle. It parses the data source, creates chunks, and then spawns batched subtasks for embedding and annotation work. When all subtasks for a phase complete, the job worker advances to the next phase.

Execution tier: Subtask workers handle the compute-intensive work. Each subtask processes a batch of chunks, generating embeddings or running annotation agents. Subtasks execute in parallel for throughput, and each reports its progress back to the parent job.

                    ┌──────────────────────┐
                    │     Job Worker       │
                    │  (Orchestration)     │
                    └──────────┬───────────┘
                               │ spawns batches
                    ┌──────────┴──────────┐
              ┌─────┴─────┐         ┌─────┴─────┐
              │ Subtask   │         │ Subtask   │
              │ Worker 1  │  ...    │ Worker N  │
              │ (Batch)   │         │ (Batch)   │
              └───────────┘         └───────────┘
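
The fan-out from the job worker to subtask workers can be sketched with a thread pool. Threads stand in for separate worker processes here, and `embed_batch` is a placeholder that returns a fake one-dimensional vector; all names and the batch size are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(items: list[str], batch_size: int) -> list[list[str]]:
    """Group items into consecutive batches of at most `batch_size`."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def embed_batch(batch: list[str]) -> list[tuple[str, list[float]]]:
    """Placeholder subtask: a real worker would call an embedding model."""
    return [(chunk, [float(len(chunk))]) for chunk in batch]


def run_embedding_phase(chunks: list[str], batch_size: int = 2, max_workers: int = 4):
    """Orchestrate one phase: spawn batches, wait for all, report progress."""
    batches = make_batches(chunks, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(embed_batch, batches))  # blocks until all finish
    vectors = dict(pair for batch in results for pair in batch)
    return vectors, len(results), len(batches)
```

The returned completed and total counts mirror the progress that subtasks report to the parent job; only when they match would the orchestrator advance to the next phase.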

Batch Processing

The job system supports batch uploads: you can upload hundreds of data sources at once, and they are processed in parallel. Each data source gets its own independent job, so a failure in one doesn’t affect the others.

Reliability

The job system includes several resilience mechanisms:

Automatic retries: Failed subtasks are retried automatically (up to a configurable limit)
Failure thresholds: A job tolerates a small percentage of subtask failures before marking the overall job as failed
Progress tracking: Subtask completion counts are tracked in real time, enabling progress indicators in the UI
Idempotent re-runs: Jobs can be manually re-triggered; previously completed work (e.g., already-embedded chunks) is skipped
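
Three of these mechanisms (retries, failure thresholds, and idempotent re-runs) can be combined in a few lines. The sketch below is hedged throughout: the function names, the default retry limit, and the 10% threshold are assumptions, since the actual values are configurable and not stated here.

```python
from typing import Callable


def run_with_retries(task: Callable[[], None], max_retries: int = 3) -> bool:
    """Call `task` until it succeeds or the retry limit is reached."""
    for _attempt in range(1 + max_retries):
        try:
            task()
            return True
        except Exception:
            continue  # a real worker would log the error and back off
    return False


def run_job(
    subtasks: dict[str, Callable[[], None]],
    completed: set[str],
    max_retries: int = 3,
    failure_threshold: float = 0.1,
) -> str:
    """Run subtasks that are not yet completed; return 'success' or 'error'."""
    failures = 0
    for name, task in subtasks.items():
        if name in completed:  # idempotent re-run: skip finished work
            continue
        if run_with_retries(task, max_retries):
            completed.add(name)
        else:
            failures += 1
    failed_fraction = failures / len(subtasks) if subtasks else 0.0
    return "success" if failed_fraction <= failure_threshold else "error"
```

Passing the same `completed` set to a second `run_job` call skips every subtask that already finished, which is the behavior described for manual re-triggers.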

Pipeline Diagram

The complete data processing flow, from upload to fully indexed data source:

┌──────────────────────────────────────────────────────────────────┐
│                     User Uploads Data Source                     │
└────────────────────────────────┬─────────────────────────────────┘
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│  Store data source in secure storage and create processing job   │
└────────────────────────────────┬─────────────────────────────────┘
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│  Job Worker picks up job from queue                              │
│                                                                  │
│  ┌─────────────┐   ┌─────────────┐   ┌──────────────────────┐    │
│  │ 1. Parse    │──▶│ 2. Chunk    │──▶│ 3. Spawn embedding   │    │
│  │    content  │   │    content  │   │    subtasks          │    │
│  └─────────────┘   └─────────────┘   └──────────┬───────────┘    │
└─────────────────────────────────────────────────┬────────────────┘
                                                  ▼
┌──────────────────────────────────────────────────────────────────┐
│  Subtask Workers process embedding batches in parallel           │
│                                                                  │
│  ┌───────────┐   ┌───────────┐   ┌───────────┐                   │
│  │ Batch 1   │   │ Batch 2   │   │ Batch N   │                   │
│  │ embed     │   │ embed     │   │ embed     │                   │
│  │ chunks    │   │ chunks    │   │ chunks    │                   │
│  └─────┬─────┘   └─────┬─────┘   └─────┬─────┘                   │
│        └───────────────┴───────────────┘                         │
│                        ▼                                         │
│           Store vectors in search index                          │
└────────────────────────────────┬─────────────────────────────────┘
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│  Job Worker re-invoked: advance to annotation phase              │
│  (if annotation workflows are configured)                        │
│                                                                  │
│  For each workflow node in sequence:                             │
│    → Spawn annotation subtasks                                   │
│    → Subtask workers run AI agents on each batch                 │
│    → Store structured annotation results                         │
└────────────────────────────────┬─────────────────────────────────┘
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│  Job marked SUCCESS: data source is fully indexed and searchable │
└──────────────────────────────────────────────────────────────────┘