Ingestion

Ingest documents and audio into the knowledge base for searching and chat.

ingest

Ingest a document into the vector store: extract its text, chunk it into passages, generate embeddings, and store the results.

jime ingest <file>

Argument / Option   Type     Required   Description
<file>              string   Yes        Path to the document file (PDF)

How it works

  1. Extract — Sends the document to Azure AI Document Intelligence, which parses it into structural elements (paragraphs, tables, headings)
  2. Chunk — Splits the content into passages that respect structural boundaries (sections, pages) with configurable overlap
  3. Register — Creates or retrieves the document record in the registry
  4. Archive — If blob storage is configured, archives the source file to the configured container (default jime-documents) for durable storage. The blob URI is saved on the document record.
  5. Create interpretation — Creates a new versioned interpretation record from the full extracted text, linking all subsequent chunks to this processing run. Re-ingesting creates a new version rather than overwriting.
  6. Embed — Generates vector embeddings using Azure OpenAI (text-embedding-3-small), processing chunks in batches with retry logic
  7. Store — Saves chunks with embeddings, page ranges, section headings, and content hashes to PostgreSQL, linked to the interpretation version
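
The chunking, hashing, and embedding steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not jime's actual implementation: `chunk_paragraphs`, `content_hash`, and `embed_in_batches` are hypothetical names, the character budget and one-paragraph overlap are arbitrary, SHA-256 is assumed for the content hash, and `embed_fn` stands in for whatever client calls the embedding model.

```python
import hashlib
import time
from typing import Callable


def chunk_paragraphs(paragraphs: list[str], max_chars: int = 200, overlap: int = 1) -> list[str]:
    """Group whole paragraphs into passages, never splitting inside a paragraph.

    The last `overlap` paragraphs of each passage are repeated at the
    start of the next one so context carries across the boundary.
    """
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = current + [para]
        if current and sum(len(p) for p in candidate) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:] if overlap else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def content_hash(text: str) -> str:
    """Hex digest identifying a chunk's text (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_in_batches(
    chunks: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 16,
    max_retries: int = 3,
    base_delay: float = 1.0,
) -> list[list[float]]:
    """Embed chunks batch by batch, retrying a failed batch with exponential backoff."""
    vectors: list[list[float]] = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                vectors.extend(embed_fn(batch))
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # give up after the final attempt
                time.sleep(base_delay * 2 ** attempt)
    return vectors
```

Grouping whole paragraphs keeps each passage aligned with the structural elements produced by extraction, and retrying per batch means a transient failure only repeats the work for that batch.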

Re-ingestion

Running jime ingest on a file that has already been ingested creates a new interpretation — the previous interpretation and its chunks are deactivated but preserved. This lets you re-process documents (e.g., after pipeline improvements) without losing history.
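
The versioning behaviour described above can be pictured with a small in-memory model. This is only a sketch: `DocumentRecord`, `Interpretation`, and `add_interpretation` are hypothetical names, version numbers are assumed to start at 1, and the real records live in PostgreSQL rather than in Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class Interpretation:
    version: int
    text: str
    active: bool = True


@dataclass
class DocumentRecord:
    filename: str
    interpretations: list[Interpretation] = field(default_factory=list)

    def add_interpretation(self, text: str) -> Interpretation:
        """Deactivate earlier versions (without deleting them) and append a new one."""
        for interp in self.interpretations:
            interp.active = False
        new = Interpretation(version=len(self.interpretations) + 1, text=text)
        self.interpretations.append(new)
        return new


doc = DocumentRecord("shadowdark-rules.pdf")
doc.add_interpretation("text from the first run")
latest = doc.add_interpretation("text from the second run")
# Only the latest version is active; the first is still stored.
```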

Examples

# Ingest a rulebook
jime ingest shadowdark-rules.pdf

# Re-ingest the same file — creates a new versioned interpretation
jime ingest shadowdark-rules.pdf

Tip

Re-ingesting a file is safe and non-destructive. Each run creates a new interpretation with fresh chunks while preserving all previous versions.


list

List all documents that have been ingested into the vector store.

jime list

Displays a table with:

Column     Description
ID         Document identifier (use with #N syntax)
File       Source filename
Category   Assigned category (if any)
Chunks     Number of stored chunks
Pages      Page range covered
Ingested   Date the document was ingested

transcribe

Transcribe audio files from a Google Drive folder and ingest the transcriptions into the knowledge base. Uses Azure Speech Services for transcription with speaker diarization.

jime transcribe <folder-id> [options]

Argument / Option   Type     Required   Description
<folder-id>         string   Yes        Google Drive folder ID containing audio files
--file              string   No         Process only the file with this name
--map-speakers      flag     No         Interactively map speaker labels to player names
--reprocess         flag     No         Re-transcribe files already in the corpus

Prerequisites

Transcription requires additional configuration:

  • GoogleServiceAccountKeyPath — Path to a Google service account key file
  • SpeechEndpoint — Azure Speech Services endpoint
  • StorageEndpoint — Azure Blob Storage endpoint

How it works

  1. Download — Fetches audio files from the Google Drive folder
  2. Upload — Stores audio in Azure Blob Storage
  3. Transcribe — Submits to Azure Speech batch transcription with speaker diarization
  4. Create interpretation — Stores the full transcript text as a versioned interpretation
  5. Chunk & Store — Splits the transcript into chunks and stores them with embeddings, linked to the interpretation
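
As a rough illustration of what speaker mapping does to a diarized transcript, the sketch below substitutes player names for speaker labels before the text is stored. It rests on assumptions: `apply_speaker_map` is a hypothetical helper, segments are modelled as (label, text) pairs, and labels such as "Speaker 1" are placeholders, since the actual label format and the point in the pipeline where mapping happens are not specified here.

```python
def apply_speaker_map(segments: list[tuple[str, str]], speaker_map: dict[str, str]) -> str:
    """Render diarized segments as transcript lines, replacing mapped labels.

    Labels missing from `speaker_map` are left unchanged.
    """
    lines = []
    for label, text in segments:
        name = speaker_map.get(label, label)
        lines.append(f"{name}: {text}")
    return "\n".join(lines)


segments = [
    ("Speaker 1", "I open the door."),
    ("Speaker 2", "Roll for it."),
]
transcript = apply_speaker_map(segments, {"Speaker 1": "Alice"})
```

Leaving unmapped labels in place means a partially completed mapping still yields a readable transcript.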

Examples

# Transcribe all files in a folder
jime transcribe 1ABCdef123ghijk456

# Transcribe a specific file
jime transcribe 1ABCdef123ghijk456 --file "session-2024-01-15.mp3"

# Map speaker labels to player names interactively
jime transcribe 1ABCdef123ghijk456 --map-speakers