Soleri | Docs

Content Ingestion

Your vault doesn’t have to grow one pattern at a time. Content ingestion lets you feed entire documents into the agent — articles, meeting transcripts, PDF books, documentation pages — and the agent extracts knowledge items, classifies them, deduplicates against your existing vault, and stores what’s new.

Found an article worth remembering? Feed it directly:

You: “Ingest this article: https://example.com/distributed-systems-patterns”

Agent: Fetched and processed. 4 entries extracted, 1 duplicate skipped.

  • Circuit Breaker Pattern (pattern, distributed-systems)
  • Bulkhead Isolation (pattern, distributed-systems)
  • Retry with Exponential Backoff (pattern, distributed-systems)
  • Timeout Best Practices — already in vault

The agent fetches the page, extracts text, sends it through an LLM for classification, and checks each extracted item against your vault’s content hashes. Duplicates are skipped automatically.
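The hash-based deduplication step can be sketched as follows. This is a minimal illustration, not Soleri's actual implementation: the normalization rule and function names are assumptions.

```python
import hashlib

def content_hash(text: str) -> str:
    # Normalize whitespace and case so trivially different copies of the
    # same item hash identically (an assumed normalization scheme).
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedupe(extracted: list[str], vault_hashes: set[str]) -> list[str]:
    """Keep only items whose content hash is not already in the vault."""
    fresh = []
    for item in extracted:
        h = content_hash(item)
        if h not in vault_hashes:
            vault_hashes.add(h)
            fresh.append(item)
    return fresh
```

Because the comparison is on hashes rather than raw text, checking a new item against a vault of any size stays a constant-time set lookup.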

You can specify a domain and tags to organize the results:

You: “Ingest https://example.com/k8s-security with domain: infrastructure, tags: kubernetes, security”

Agent: 3 entries extracted and tagged.

For content that isn’t at a URL — meeting notes, copied text, transcripts:

You: “Ingest this transcript from our architecture review meeting…”

Agent: Processed as transcript. 5 entries extracted.

Source types help the LLM classify content more accurately:

Source type     Use for
article         Blog posts, published articles
transcript      Meeting recordings, podcast transcripts
notes           Personal notes, quick captures
documentation   Technical docs, API references, READMEs

The agent uses source type as context for extraction — a transcript might yield decisions and action items, while documentation yields patterns and conventions.
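One way to picture how source type steers extraction is a hint table folded into the LLM prompt. The hint text and function below are hypothetical; the real agent's prompts may look quite different.

```python
# Hypothetical extraction hints keyed by source type -- these strings
# are illustrative, not the agent's actual prompt text.
EXTRACTION_HINTS = {
    "article": "Focus on reusable patterns and anti-patterns.",
    "transcript": "Focus on decisions made and action items assigned.",
    "notes": "Focus on quick captures; infer structure where possible.",
    "documentation": "Focus on conventions, APIs, and usage guidance.",
}

def build_prompt(text: str, source_type: str) -> str:
    """Prefix the raw text with a type-specific extraction hint."""
    hint = EXTRACTION_HINTS.get(source_type, "")
    return f"Extract knowledge items. {hint}\n\n{text}"
```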

When you have multiple items to ingest at once:

You: “Ingest these three items:

  1. Our coding standards doc (text: ‘…’)
  2. The accessibility checklist (text: ‘…’)
  3. Meeting notes from sprint retro (text: ‘…’)”

Agent: Batch complete: 3 sources processed, 11 entries extracted, 2 duplicates skipped.

Each item in a batch has its own title, source type, domain, and tags. Items are processed sequentially so deduplication works across the batch — if item 2 would create a duplicate of something item 1 just added, it’s caught.
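Sequential processing with a shared hash set is what makes cross-batch deduplication work. A rough sketch, assuming each source arrives with its items already extracted (in reality the LLM extraction step would run per source):

```python
import hashlib

def ingest_batch(sources: list[dict], vault_hashes: set[str]) -> dict:
    """Process sources in order so later items dedupe against entries
    added earlier in the same batch (illustrative shape, not the real API)."""
    extracted = skipped = 0
    for source in sources:
        # Stand-in for per-source LLM extraction: each source dict
        # already carries its extracted items.
        for item in source["items"]:
            h = hashlib.sha256(item.encode("utf-8")).hexdigest()
            if h in vault_hashes:
                skipped += 1
            else:
                vault_hashes.add(h)
                extracted += 1
    return {"sources": len(sources), "extracted": extracted, "skipped": skipped}
```

Because item 1's hashes land in `vault_hashes` before item 2 is examined, a duplicate introduced mid-batch is caught just like one already in the vault.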

For longer documents like PDF books, the agent uses a chunked pipeline:

You: “Ingest this book: /path/to/design-systems.pdf, title: ‘Design Systems Handbook’, domain: design, author: ‘A. Smith’”

Agent: Job created: job-abc123. 24 chunks ready (10 pages each). Process chunks to extract knowledge.

The PDF is parsed, hashed (so re-ingesting the same file is detected), and split into page-window chunks. Nothing is extracted yet — this step just prepares the pipeline.
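The chunk-preparation step might look like this. The chunk dictionary shape is an assumption for illustration; only the page-window idea comes from the text above.

```python
import hashlib

def chunk_pages(pages: list[str], window: int = 10) -> list[dict]:
    """Split a parsed PDF (one string per page) into fixed-size page
    windows. Nothing is extracted here; chunks are just queued."""
    chunks = []
    for start in range(0, len(pages), window):
        text = "\n".join(pages[start:start + window])
        chunks.append({
            # 1-based inclusive page range covered by this chunk
            "pages": (start + 1, min(start + window, len(pages))),
            "hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
            "processed": False,
        })
    return chunks
```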

Process chunks in batches — this is where the LLM extracts and classifies knowledge:

You: “Process 5 chunks of job-abc123”

Agent: Processed chunks 1-5. 18 entries extracted, 3 duplicates. 19 chunks remaining.

You can process all chunks at once or in smaller batches. Smaller batches give you a chance to review intermediate results and stop early if the quality drops.
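Incremental processing amounts to walking the job's pending chunks and stopping after a batch. A minimal sketch, where `extract` stands in for the LLM extraction call and the job dict shape is assumed:

```python
def process_chunks(job: dict, count: int, extract) -> dict:
    """Process the next `count` unprocessed chunks of a job.
    `extract` is a stand-in for the per-chunk LLM extraction call."""
    pending = [c for c in job["chunks"] if not c["processed"]]
    done = 0
    for chunk in pending[:count]:
        job["entries"].extend(extract(chunk))
        chunk["processed"] = True
        done += 1
    return {"processed": done, "remaining": len(pending) - done}
```

Since progress lives on the job, you can stop after any batch, inspect `job["entries"]`, and resume later; the next call simply picks up the first unprocessed chunk.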

You: “What’s the status of intake jobs?”

Agent: 1 job:

  • job-abc123 — Design Systems Handbook: 5/24 chunks processed, 18 entries extracted

If you want to see what the pipeline would extract from a specific page range without storing anything:

You: “Preview pages 50-60 of design-systems.pdf”

Agent: Preview (not stored): 4 potential entries found in pages 50-60.

Every ingestion path follows the same core pipeline:

  1. Extract text — from URL, raw text, or PDF pages
  2. Classify via LLM — identify patterns, anti-patterns, decisions, conventions
  3. Deduplicate — content-hash comparison against existing vault entries
  4. Store — new entries go into the vault with domain, tags, and source metadata
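The four steps above can be wired together as a single function. This is a compact sketch under assumed names: `classify` stands in for the LLM call, and the vault is modeled as a hash-keyed dict.

```python
import hashlib

def ingest(text: str, classify, vault: dict) -> list[str]:
    """Minimal end-to-end sketch of the four-step pipeline."""
    # 1. Extract text -- assumed already done by the caller
    #    (URL fetch, raw paste, or PDF parse).
    # 2. Classify via LLM into structured knowledge items.
    items = classify(text)
    stored = []
    for item in items:
        # 3. Deduplicate by content hash against the existing vault.
        h = hashlib.sha256(item["content"].encode("utf-8")).hexdigest()
        if h in vault:
            continue
        # 4. Store the new entry with its metadata.
        vault[h] = item
        stored.append(item["content"])
    return stored
```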

The LLM does the heavy lifting of turning unstructured text into structured knowledge items. You don’t need to manually tag or categorize — the agent infers type, severity, and domain from context.

A few tips for getting the most out of ingestion:

  • Set a domain — it gives the LLM classification context and keeps your vault organized
  • Use accurate source types — a transcript is processed differently than documentation
  • Add tags — tags applied at ingestion time propagate to all extracted entries
  • Preview first for books — check a small page range before processing the whole thing
  • Don’t worry about duplicates — the dedup pipeline handles them automatically

Previous: Entry Linking & Knowledge Graph — connect entries with typed links. Next: Knowledge Review Workflow — team quality control for vault entries.