Data Ingestion

Upload and process documents in any format, including PDFs, spreadsheets, presentations, and more.

Ragnerock can ingest documents in a variety of formats, from PDFs and spreadsheets to plain text and images. You can upload files directly from your local machine, provide a URL to fetch remotely, or configure automated web scraping to continuously ingest content from websites. Once uploaded, every document passes through a processing pipeline that extracts text, generates searchable chunks, and creates vector embeddings, making it ready for semantic search, annotation workflows, and agentic research.

Supported Formats

Ragnerock processes three categories of documents (text, tabular, and image), each with a distinct extraction and indexing path.

Format            Extensions     Category   Processing Path
PDF               .pdf           Text       OCR extraction > pages > paragraph chunks > embeddings
Word              .docx          Text       Direct text extraction > paragraph chunks > embeddings
Markdown          .md            Text       Direct parsing > paragraph chunks > embeddings
Plain Text        .txt           Text       Direct parsing > paragraph chunks > embeddings
Jupyter Notebook  .ipynb         Text       Cell extraction > paragraph chunks > embeddings
Excel             .xlsx          Tabular    Sheet + row parsing > row-based units > embeddings
CSV               .csv           Tabular    Row parsing > row-based units > embeddings
JPEG              .jpg, .jpeg    Image      Annotation workflows only (no embedding)
PNG               .png           Image      Annotation workflows only (no embedding)

Text documents are extracted into pages, split into paragraph-level chunks, and embedded for semantic search. They support annotation at document, page, or chunk level.

Tabular documents are parsed into sheets and rows. Each row’s column values are combined into a text representation that can be embedded and searched. They support annotation at document, sheet, or row level.

Image documents skip extraction and embedding entirely. They are processed only through annotation workflows, where vision-capable AI models analyze the image content directly.
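
As a rough illustration of how the three categories relate to file extensions, the mapping in the table above can be written as a lookup. This is only a sketch: the function name categorize and the dictionary are hypothetical, and Ragnerock's actual validation also checks file contents rather than extensions alone.

```python
from pathlib import Path

# Extension-to-category mapping copied from the Supported Formats table.
CATEGORY_BY_EXTENSION = {
    ".pdf": "text",
    ".docx": "text",
    ".md": "text",
    ".txt": "text",
    ".ipynb": "text",
    ".xlsx": "tabular",
    ".csv": "tabular",
    ".jpg": "image",
    ".jpeg": "image",
    ".png": "image",
}


def categorize(filename: str) -> str:
    """Return 'text', 'tabular', or 'image' (hypothetical helper)."""
    ext = Path(filename).suffix.lower()
    try:
        return CATEGORY_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"unsupported format: {ext or filename!r}") from None
```

Checking a file list this way before a large batch upload can catch unsupported formats early.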

Upload Methods

File Upload

Click the Upload button in the sidebar or on the welcome screen to open the upload dialog.

  1. Drag and drop files into the upload area, or click Browse to select files from your computer
  2. Select multiple files for batch upload. Each file is processed independently.
  3. Optionally assign documents to a dataset for organization
  4. Click Upload to start processing

The upload dialog shows real-time progress for each file. Once a file is uploaded, processing begins automatically: text extraction, chunking, and embedding generation run in the background. If any annotation workflows are configured to auto-run, they execute once processing completes.

[Figure: The upload dialog, showing the drag-and-drop area with a file ready to upload, the dataset selector, and the Upload button.]

URL Ingestion

You can also ingest documents by URL. In the upload dialog, switch to the URL tab and enter the document URL. Ragnerock downloads the file and processes it identically to a local upload. This is useful for documents hosted on file servers, cloud storage, or public URLs.

Web Scraping

For continuous ingestion from websites, Ragnerock provides a web scraping system with a dedicated dashboard. Navigate to Jobs > Data Ingest to access the scraping dashboard.

The dashboard lets you:

  • Create scrape configurations: Specify a URL, crawl depth, page limits, and optional authentication
  • Schedule recurring runs: Set up cron-based scheduling for periodic re-scraping
  • Monitor scrape runs: View a timeline of all runs with their status and document counts
  • Track content changes: Compare versions of scraped pages with a side-by-side diff viewer
  • View the page content tree: Browse the hierarchical structure of scraped pages

[Figure: The data ingest dashboard, showing scrape configurations with URLs and status.]

Web scrape configurations support:

  • Crawl depth (1-3 levels): how many link levels deep to follow from the starting URL
  • Page limits: maximum pages to scrape per depth level
  • Scheduled runs: cron-based scheduling for periodic re-scraping
  • Change detection: only new or modified pages are re-processed on subsequent runs
  • Authentication: HTTP Basic Auth for protected sites
  • Linked file discovery: PDFs, Word documents, spreadsheets, and other files linked on scraped pages are automatically downloaded and ingested as separate documents
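
To make these options concrete, here is a minimal sketch of a scrape configuration and of change detection. Everything in it is an assumption for illustration: the ScrapeConfig class, its field names, the default values, the five-field cron expression, and the use of content hashes to detect modified pages are not taken from Ragnerock's implementation. Only the 1-3 crawl depth range comes from the list above.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ScrapeConfig:
    """Hypothetical container for the options listed above."""
    start_url: str
    crawl_depth: int = 1      # 1-3 levels, per the list above
    page_limit: int = 50      # max pages per depth level (example value)
    cron: str = "0 6 * * *"   # assumed 5-field cron syntax: daily at 06:00

    def __post_init__(self) -> None:
        if not 1 <= self.crawl_depth <= 3:
            raise ValueError("crawl_depth must be between 1 and 3")
        if self.page_limit < 1:
            raise ValueError("page_limit must be positive")


def changed_pages(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return URLs whose content is new or differs from the previous run.

    `previous` maps URL -> SHA-256 hex digest recorded on the last run;
    `current` maps URL -> page content fetched on this run.
    """
    changed = []
    for url, content in current.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if previous.get(url) != digest:
            changed.append(url)
    return changed
```

Comparing digests rather than full page bodies keeps the stored state small, which is one common way to implement "only new or modified pages are re-processed".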

Processing Pipeline

After upload, every document passes through a series of processing stages. The entire pipeline runs asynchronously, so you can continue working while documents process in the background.

┌────────┐    ┌────────────┐    ┌──────────┐    ┌────────────┐    ┌────────────┐
│ Upload │───>│ Extraction │───>│ Chunking │───>│ Embedding  │───>│ Annotation │
│        │    │            │    │          │    │ Generation │    │ (optional) │
└────────┘    └────────────┘    └──────────┘    └────────────┘    └────────────┘

Upload & Validation

The document is stored in your project’s blob storage and its format is validated. A processing job is created and queued for a worker to pick up.

Text Extraction

Ragnerock uses format-specific extraction strategies:

  • PDFs: Processed with OCR technology that handles both scanned and digitally created documents. Content is extracted page by page, with table structures preserved. Large PDFs are processed in 15-page batches internally.
  • Word, Markdown, plain text: Parsed directly without OCR. Content is decoded and validated with the original structure preserved.
  • Jupyter Notebooks: Cell content is extracted as text, preserving code and markdown cells.
  • Excel and CSV: Parsed into structured sheets and rows with automatic column type inference (see Format-Specific Guidance below).
  • Images: No text extraction is performed. Images proceed directly to annotation workflows.

Chunking

Extracted text is segmented into chunks, the fundamental units used for search and annotation:

  • Text documents are split on paragraph boundaries (double line breaks), preserving the natural semantic structure. Each chunk includes character position metadata that maps back to the exact location in the source document.
  • Tabular documents treat each row as an individual chunk. Column values are combined into a text representation for embedding.
  • Images skip chunking entirely.

A quality filter removes low-value chunks (e.g., corrupted OCR output or non-meaningful characters) to keep the search index clean.
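
The paragraph-splitting step for text documents can be sketched as follows. This is a simplified illustration, not Ragnerock's code: the Chunk type, the function names, and especially the quality heuristic (a minimum share of alphanumeric characters) are assumptions standing in for the real filter.

```python
import re
from typing import NamedTuple


class Chunk(NamedTuple):
    text: str
    start: int  # offset of the chunk's first character in the source text
    end: int    # offset one past the chunk's last character


def is_meaningful(text: str, min_alnum_ratio: float = 0.5) -> bool:
    """Stand-in quality filter: require a minimum share of letters/digits."""
    alnum = sum(ch.isalnum() for ch in text)
    return alnum / len(text) >= min_alnum_ratio


def chunk_paragraphs(text: str) -> list[Chunk]:
    """Split on blank lines, keeping character offsets into `text`."""
    # Segment boundaries: start of text, each blank-line separator, end of text.
    starts, ends = [0], []
    for sep in re.finditer(r"\n\s*\n", text):
        ends.append(sep.start())
        starts.append(sep.end())
    ends.append(len(text))

    chunks = []
    for seg_start, seg_end in zip(starts, ends):
        segment = text[seg_start:seg_end]
        stripped = segment.strip()
        if not stripped or not is_meaningful(stripped):
            continue  # drop empty or low-value segments
        # Shift the start past any leading whitespace that strip() removed.
        start = seg_start + (len(segment) - len(segment.lstrip()))
        chunks.append(Chunk(stripped, start, start + len(stripped)))
    return chunks
```

Because each chunk records its start and end offsets, text[chunk.start:chunk.end] always reproduces the chunk, which is what makes precise citations back to the source possible.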

Embedding Generation

Each chunk is converted into a vector embedding that captures its semantic meaning. Embeddings are stored in PostgreSQL with pgvector and power Ragnerock’s semantic search, enabling you to find conceptually related content even when exact terms don’t match.

Chunks are processed in parallel batches for throughput. Once embeddings are generated, the document is fully searchable.
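
To show why embeddings find related content without exact term matches, here is a toy ranking by cosine similarity in plain Python. The vectors, chunk IDs, and function names are made up for illustration, and the choice of cosine similarity is itself an assumption. In Ragnerock the comparison happens inside PostgreSQL via pgvector, not in application code like this.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], chunks: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the IDs of the k chunks most similar to the query vector."""
    scored = [(cosine_similarity(query, vec), chunk_id)
              for chunk_id, vec in chunks.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]
```

Real embeddings have hundreds or thousands of dimensions, but the ranking principle is the same: chunks whose vectors point in a similar direction to the query are returned first.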

Annotation

If your project has annotation workflows configured to auto-run, they execute automatically after embedding completes. This lets you extract structured data (sentiment scores, financial metrics, classifications, etc.) from every document as it’s ingested. See Annotations for details.

For a deeper look at the job system, worker architecture, and reliability mechanisms, see Data Processing Architecture.

Monitoring Processing Status

Each document in the document list shows a status badge indicating where it is in the pipeline. You can also monitor all processing jobs in the Jobs dashboard.

Status      Badge            Meaning
Pending     Gray             Queued, waiting for a worker to pick it up
Processing  Blue spinner     Actively being parsed, chunked, or embedded
Ready       Green checkmark  All stages completed. Document is fully indexed.
Error       Red indicator    Processing failed (partial results may be available)

Handling Errors

When a document fails processing, hover over the error badge in the document list to see a description of the failure. Common causes include:

  • Unsupported format: File doesn’t match the declared document type
  • Corrupted file: File can’t be read or parsed
  • Extraction failure: OCR or parsing service encountered an error
  • Timeout: Processing exceeded the allowed time window

You can re-upload the document to retry processing. For web scraping jobs, transient errors (timeouts, rate limits) are automatically retried, while permanent errors (malformed URLs, blocked domains) are reported immediately.
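
The distinction between transient and permanent errors can be sketched as a retry loop. The exception classes, the attempt count, and the exponential backoff schedule below are all hypothetical. The source states only that transient errors are retried automatically and permanent ones are reported immediately.

```python
import time


class TransientError(Exception):
    """Hypothetical: a timeout or rate limit that may succeed on retry."""


class PermanentError(Exception):
    """Hypothetical: a malformed URL or blocked domain; retrying won't help."""


def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task(), retrying on TransientError with exponential backoff.

    PermanentError (and any other exception) is not caught, so it
    propagates to the caller on the first failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last transient error
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```

Passing sleep as a parameter is a design choice that lets the loop be exercised in tests without real delays.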

Batch Upload

The upload dialog supports selecting multiple files at once. Simply drag multiple files into the upload area, or use your file picker to select multiple documents. Each file is processed independently and you can monitor progress for all uploads in the document list.

Format-Specific Guidance

PDFs

Ragnerock handles both digitally created and scanned PDFs using OCR technology. Key characteristics:

  • Table extraction: Tables embedded in PDFs are recognized and preserved as structured markdown during extraction. This means table content is searchable and available for annotation.
  • Scanned documents: Handwritten or scanned pages are OCR-processed. Native digital PDFs produce higher-quality text, so prefer digital originals when available.
  • Large files: PDFs are processed in 15-page batches internally. There’s no hard page limit, but very large documents (hundreds of pages) take proportionally longer.
  • Page-level access: After processing, each page’s content is stored separately, enabling page-level annotations and precise citations.
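
The 15-page batching works out as simple range arithmetic. The helper below is a hypothetical sketch of how page ranges could be computed; only the batch size of 15 comes from the text above.

```python
def page_batches(page_count: int, batch_size: int = 15):
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    for first in range(1, page_count + 1, batch_size):
        yield first, min(first + batch_size - 1, page_count)


# A 40-page PDF splits into two full batches and one partial batch.
print(list(page_batches(40)))  # [(1, 15), (16, 30), (31, 40)]
```

The number of batches grows linearly with page count, which is consistent with very large documents taking proportionally longer.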

Excel (.xlsx)

Each sheet in an Excel file is parsed independently:

  • Column type inference: Ragnerock automatically detects column types including string, integer, float, boolean, datetime, and mixed.
  • Row-based processing: Each row becomes a separate searchable unit. Column values are combined into a text representation for embedding.
  • Row limits: Up to 100,000 rows per sheet are supported. Files exceeding this limit will fail processing.
  • Cell value limits: Individual cell values are truncated at 1,000 characters. Row content (all columns combined) is capped at 8,000 characters for embedding.
  • Multi-sheet support: Each sheet is processed and stored as a distinct section of the document.
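
The cell and row limits can be illustrated with a small sketch that builds a row's text representation. The 1,000 and 8,000 character limits come from the list above. The "header: value" layout, the " | " separator, and the function name are assumptions, since the exact format Ragnerock uses is not documented here.

```python
MAX_CELL_CHARS = 1_000  # per-cell truncation limit stated above
MAX_ROW_CHARS = 8_000   # per-row cap for the combined text stated above


def row_to_text(headers: list[str], row: list[object]) -> str:
    """Combine a row's column values into one string for embedding.

    The 'header: value' layout is an assumed format for illustration.
    """
    parts = []
    for header, value in zip(headers, row):
        if value is None or value == "":
            continue  # skip empty cells
        cell = str(value)[:MAX_CELL_CHARS]
        parts.append(f"{header}: {cell}")
    return " | ".join(parts)[:MAX_ROW_CHARS]
```

For example, row_to_text(["ticker", "price"], ["AAPL", 189.5]) gives "ticker: AAPL | price: 189.5". Very wide rows lose their trailing columns to the 8,000-character cap, so placing the most important columns first is a reasonable precaution.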

CSV

CSV files are processed as a single-sheet tabular document with the same column type inference and row limits as Excel. Column headers are read from the first row.
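
As a rough picture of column type inference on a CSV, the sketch below reads headers from the first row and classifies each column into one of the types named in the Excel section. The rules themselves are assumptions: ISO 8601 date parsing, promoting a column of integers and floats to float, and labelling any other combination as mixed are illustrative choices, not Ragnerock's documented behavior.

```python
import csv
import io
from datetime import datetime


def infer_value_type(value: str) -> str:
    """Classify a single cell (assumed rules, for illustration only)."""
    v = value.strip()
    if v.lower() in ("true", "false"):
        return "boolean"
    for caster, name in ((int, "integer"), (float, "float"),
                         (datetime.fromisoformat, "datetime")):
        try:
            caster(v)
            return name
        except ValueError:
            continue
    return "string"


def infer_column_types(csv_text: str) -> dict[str, str]:
    """Read headers from the first row, then infer one type per column."""
    reader = csv.reader(io.StringIO(csv_text))
    headers = next(reader)
    seen = {h: set() for h in headers}
    for row in reader:
        for header, value in zip(headers, row):
            if value.strip():  # ignore empty cells
                seen[header].add(infer_value_type(value))

    result = {}
    for header, types in seen.items():
        if not types:
            result[header] = "string"
        elif len(types) == 1:
            result[header] = next(iter(types))
        elif types == {"integer", "float"}:
            result[header] = "float"  # assumed numeric promotion
        else:
            result[header] = "mixed"
    return result
```

A column reported as mixed is often a sign of stray text in an otherwise numeric column, which is worth cleaning up before upload.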

Images (JPG, PNG)

Images are not parsed for text and do not receive vector embeddings. They are processed exclusively through annotation workflows. This is useful when you need vision-capable AI models to classify, describe, or extract information from charts, diagrams, or scanned documents that you want to treat as images rather than OCR text.

Jupyter Notebooks (.ipynb)

Notebook cells (code and markdown) are extracted as text content and processed through the standard text pipeline. This makes notebook content searchable alongside your other documents.
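
Since an .ipynb file is a JSON document with a top-level cells list, cell extraction can be sketched with the standard library alone. The function name, the blank-line joining (chosen so each cell becomes its own paragraph for the text pipeline), and the decision to skip cell types other than code and markdown are assumptions.

```python
import json


def notebook_to_text(raw: str) -> str:
    """Extract code and markdown cell sources from notebook JSON."""
    notebook = json.loads(raw)
    paragraphs = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") not in ("code", "markdown"):
            continue  # assumption: other cell types are skipped
        source = cell.get("source", "")
        # The notebook format stores source as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            paragraphs.append(source.strip())
    # Blank lines between cells so each cell forms its own paragraph.
    return "\n\n".join(paragraphs)
```

Cell outputs are not included in this sketch. Whether Ragnerock indexes outputs is not stated above.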

Best Practices

  1. Use descriptive names: Document names are used for keyword search and display in the UI. Use consistent, descriptive naming conventions (e.g., AAPL-10K-2024 rather than document-1).

  2. Prefer digital PDFs: Native digital PDFs extract more accurately than scanned documents. When you have both, use the digital version.

  3. Check status before querying: Wait for documents to reach Ready status (green checkmark) before running search queries, annotation workflows, or agent conversations that reference them.

  4. Batch thoughtfully: Each upload triggers an independent processing job. Uploading hundreds of documents at once is fine because they process in parallel, but monitor the batch for failures.

  5. Right-size your spreadsheets: While Ragnerock supports up to 100,000 rows per sheet, smaller spreadsheets (under 10,000 rows) process faster and produce better embedding quality.

Next Steps

  • Data Sources: Data management, processing pipeline, and organization
  • Annotations: Extract structured data from your documents with annotation workflows
  • Data Processing: Architecture details on the processing pipeline and job system