How Search Works

JickleJime uses a three-way hybrid search to find relevant content. Each query runs three independent searches in parallel, and the results are merged using Reciprocal Rank Fusion (RRF) to produce a single ranked list.

The three search signals

1. Vector search (semantic)

Your query is converted into a vector embedding using Azure OpenAI's text-embedding-3-small model (1536 dimensions). This embedding is compared against all stored chunk embeddings using cosine similarity via a PostgreSQL HNSW index.
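The ranking step can be sketched in plain Python. This is a toy linear scan for illustration only — in the actual system the comparison is done by a pgvector HNSW index inside PostgreSQL, and the function names here are hypothetical:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def vector_search(query_emb: list[float], chunks, top_k: int = 20) -> list[str]:
    """Rank chunks by cosine similarity to the query embedding.
    chunks is a list of (chunk_id, embedding) pairs; production uses
    an HNSW index rather than this linear scan."""
    scored = [(cosine_similarity(query_emb, emb), cid) for cid, emb in chunks]
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:top_k]]
```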

Strengths: Captures meaning — "weapon damage" matches combat mechanics even without those exact words.

Weaknesses: Can miss domain-specific terms, proper nouns, or exact phrases.

2. Keyword search (full-text)

Your query is parsed into search terms using PostgreSQL's websearch_to_tsquery function with the English stemmer. These terms are matched against a precomputed tsvector column that combines each chunk's section heading and content.
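The idea of stem-then-AND-match can be illustrated with a toy sketch. The crude suffix stripper below stands in for PostgreSQL's English Snowball stemmer, and the set membership test stands in for a tsquery match against a tsvector — neither reflects the real implementation:

```python
def stem(word: str) -> str:
    """Crude suffix-stripping stemmer (illustration only; PostgreSQL
    uses a proper English Snowball stemmer)."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def keyword_match(query: str, heading: str, content: str) -> bool:
    """True if every stemmed query term appears among the stemmed terms
    of heading + content -- mimicking an AND-style tsquery match."""
    doc_terms = {stem(w) for w in (heading + " " + content).split()}
    return all(stem(w) in doc_terms for w in query.split())
```

Note how the compound-word weakness shows up: "shortsword" produces a single term that never matches the two separate terms of "short sword".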

Strengths: Exact term matching — "Crawling" finds the exact word.

Weaknesses: Doesn't handle compound word variations ("shortsword" won't match "short sword").

3. Trigram search (fuzzy)

Your query is compared character-by-character using PostgreSQL's pg_trgm extension, which breaks text into three-character sequences and measures overlap.

Strengths: Handles compound words ("battleaxe" matches "battle axe"), typos, and partial matches.

Weaknesses: Less precise for very short queries.
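Trigram overlap is simple enough to sketch directly. The version below roughly follows pg_trgm's conventions (lowercase, pad each word, take 3-character windows, then Jaccard overlap), but it is a simplified illustration, not the extension's exact algorithm:

```python
def trigrams(text: str) -> set[str]:
    """Extract 3-character sequences, roughly as pg_trgm does:
    lowercase, pad each word with two leading and one trailing space."""
    grams: set[str] = set()
    for word in text.lower().split():
        padded = "  " + word + " "
        grams.update(padded[i : i + 3] for i in range(len(padded) - 2))
    return grams

def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by total distinct trigrams (Jaccard),
    the same shape as pg_trgm's similarity()."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
```

Because "battleaxe" and "battle axe" share most of their trigrams, they score well above unrelated words — which is exactly the compound-word case keyword search misses.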

Reciprocal Rank Fusion

The three result lists are merged using RRF with the formula:

score(chunk) = Σ over lists  1 / (k + rank_in_list)

Each chunk's final score is the sum of its reciprocal ranks across all lists it appears in. A chunk ranked #1 in all three lists scores highest; a chunk appearing in only one list scores lower.

The smoothing constant k controls how much weight top-ranked results get. JickleJime uses k=10 (rather than the academic default of 60) because the small candidate pools (~20 results per list) need stronger discrimination between ranks.
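The fusion step above fits in a few lines of Python. This is a minimal sketch of the RRF formula with rank starting at 1; the function name and signature are illustrative, not JickleJime's actual code:

```python
def rrf_merge(result_lists: list[list[str]], k: int = 10) -> list[str]:
    """Merge ranked result lists with Reciprocal Rank Fusion.
    Each chunk scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

For example, a chunk ranked near the top of two lists beats a chunk that tops only one: `rrf_merge([["a", "b"], ["b", "a"], ["b", "c"]])` ranks "b" first, then "a", then "c".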

graph LR
    Q[User Query] --> V[Vector Search]
    Q --> K[Keyword Search]
    Q --> T[Trigram Search]
    V --> RRF[Reciprocal Rank Fusion]
    K --> RRF
    T --> RRF
    RRF --> R[Top-K Results]

Ingestion pipeline

Before content can be searched, it goes through the ingestion pipeline:

graph LR
    D[Document] --> E[Extract]
    E --> C[Chunk]
    C --> R[Register]
    R --> A[Archive]
    A --> I[Create Interpretation]
    I --> Em[Embed]
    Em --> S[Store]

  1. Extract — Azure AI Document Intelligence parses the document into structural elements (paragraphs, tables, headings, page boundaries)
  2. Chunk — Content is split into passages that respect structural boundaries, with configurable overlap between chunks
  3. Register — The document is registered in the document registry (or retrieved if it already exists)
  4. Archive — If blob storage is configured, the source file is archived to the configured container (default jime-documents, see DocumentArchiveContainerName) and the blob URI is stored on the document record
  5. Create interpretation — A new interpretation version is created from the full extracted text, linking all subsequent chunks to this processing run. Re-ingesting creates a new version rather than overwriting.
  6. Embed — Azure OpenAI generates a 1536-dimensional vector for each chunk, processed in batches with retry logic
  7. Store — Chunks are saved to PostgreSQL with their embedding, content hash, page range, section heading, and full-text search vector, linked to the interpretation version
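The chunking step (2) can be sketched as a sliding window with overlap. This toy version works on raw characters and omits the structural-boundary snapping the real pipeline performs; the sizes and parameter names are illustrative, not the actual configuration:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap between neighbours.
    The real pipeline also snaps boundaries to structural elements
    (paragraphs, headings, page breaks) from the extraction step."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the final chunk already reaches the end of the text
    return chunks
```

The overlap means the tail of each chunk is repeated at the head of the next, so a passage that straddles a boundary still appears intact in at least one chunk.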

Re-ingesting a document creates a new interpretation with fresh chunks. Previous interpretations are deactivated but preserved, maintaining a full processing history.

Filters

Search results can be filtered along three dimensions:

Filter   | Effect                           | Example
---------|----------------------------------|---------------------------------
Scope    | Controls audience visibility     | all (players), dm-only (DM only)
Source   | Restricts to specific documents  | "rulebook.pdf", "#1"
Category | Restricts to document categories | "rules", "sessions"

Filters are applied before the search runs, so they reduce the candidate pool rather than post-filtering results. This improves both relevance and performance.
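The pre-filter-then-rank order can be sketched as follows. The chunk fields, filter names, and toy term-count relevance score are all hypothetical stand-ins for the real query:

```python
def search(chunks, query_terms, top_k=3, scope=None, category=None):
    """Apply filters first, then rank only the surviving candidates.
    Post-filtering would instead rank everything and could return fewer
    than top_k results after discarding out-of-scope chunks."""
    candidates = [
        c for c in chunks
        if (scope is None or c["scope"] == scope)
        and (category is None or c["category"] == category)
    ]
    # Toy relevance score: number of query terms present in the content.
    scored = sorted(
        candidates,
        key=lambda c: sum(t in c["content"] for t in query_terms),
        reverse=True,
    )
    return [c["id"] for c in scored[:top_k]]
```

Because filtering happens before ranking, every slot in the top-k list is spent on a chunk the caller is actually allowed to see.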