Ingestion

Ingest documents and audio into the knowledge base for search and chat.

ingest

Ingest a document into the vector store: extract its text, chunk it into passages, generate embeddings, and store the results.

jime ingest <file>

Argument / Option	Type	Required	Description
`<file>`	string	Yes	Path to the document file (PDF)

How it works

Extract: Sends the document to Azure AI Document Intelligence, which parses it into structural elements (paragraphs, tables, headings).
Chunk: Splits the content into passages that respect structural boundaries (sections, pages), with a configurable overlap between adjacent passages.
Register: Creates the document record in the registry, or retrieves it if one already exists.
Archive: If blob storage is configured, copies the source file to the configured container (default `jime-documents`) for durable storage and saves the blob URI on the document record.
Create interpretation: Creates a new versioned interpretation record from the full extracted text and links all chunks from this processing run to it. Re-ingesting creates a new version rather than overwriting the previous one.
Embed: Generates vector embeddings with Azure OpenAI (`text-embedding-3-small`), processing chunks in batches with retry logic.
Store: Saves the chunks to PostgreSQL with their embeddings, page ranges, section headings, and content hashes, linked to the interpretation version.
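The Chunk and Store steps can be illustrated with a minimal sketch. This is not the tool's actual implementation: the `Chunk` class, the function names, the paragraph-level overlap, the character budget, and the use of SHA-256 for content hashes are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Chunk:
    section: str
    text: str
    content_hash: str  # assumed here to be SHA-256 of the chunk text


def make_chunk(section: str, paragraphs: list[str]) -> Chunk:
    text = "\n".join(paragraphs)
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return Chunk(section=section, text=text, content_hash=digest)


def chunk_section(section: str, paragraphs: list[str],
                  max_chars: int = 1000, overlap: int = 1) -> list[Chunk]:
    """Pack the paragraphs of one section into chunks of about max_chars,
    repeating the last `overlap` paragraphs at the start of the next chunk."""
    chunks: list[Chunk] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = "\n".join(current + [para])
        if current and len(candidate) > max_chars:
            chunks.append(make_chunk(section, current))
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap:] if overlap > 0 else []
        current.append(para)
    if current:
        chunks.append(make_chunk(section, current))
    return chunks


def chunk_document(sections: dict[str, list[str]], **kwargs) -> list[Chunk]:
    # Chunk each section separately so no chunk crosses a section boundary.
    result: list[Chunk] = []
    for heading, paragraphs in sections.items():
        result.extend(chunk_section(heading, paragraphs, **kwargs))
    return result


# Hypothetical input: section headings mapped to their paragraphs.
sections = {
    "Combat": ["a" * 50, "b" * 50, "c" * 50],
    "Magic": ["d" * 50],
}
chunks = chunk_document(sections, max_chars=110, overlap=1)
for c in chunks:
    print(c.section, len(c.text), c.content_hash[:8])
```

In this sketch the second "Combat" chunk begins with the paragraph that ended the first one, which is one simple way to realize overlap while keeping every chunk inside a single section.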

Re-ingestion

Running `jime ingest` on a file that has already been ingested creates a new interpretation. The previous interpretation and its chunks are deactivated but preserved, so you can re-process documents (for example, after pipeline improvements) without losing history.
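The versioning behaviour described here can be modelled with a small in-memory sketch. The class and field names below are hypothetical and do not reflect the real registry schema; the sketch only shows the "deactivate, then append a new version" pattern.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Interpretation:
    version: int
    text: str
    active: bool = True


@dataclass
class DocumentRecord:
    filename: str
    interpretations: list[Interpretation] = field(default_factory=list)

    def add_interpretation(self, text: str) -> Interpretation:
        # Deactivate earlier versions instead of deleting them.
        for interp in self.interpretations:
            interp.active = False
        new = Interpretation(version=len(self.interpretations) + 1, text=text)
        self.interpretations.append(new)
        return new

    def active_interpretation(self) -> Optional[Interpretation]:
        for interp in self.interpretations:
            if interp.active:
                return interp
        return None


doc = DocumentRecord("shadowdark-rules.pdf")
doc.add_interpretation("text from the first run")
doc.add_interpretation("text from the second run")
print([(i.version, i.active) for i in doc.interpretations])
```

After two runs the record holds both versions, with only the most recent one marked active.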

Examples

# Ingest a rulebook
jime ingest shadowdark-rules.pdf

# Re-ingest the same file — creates a new versioned interpretation
jime ingest shadowdark-rules.pdf

Tip

Re-ingesting a file is safe and non-destructive. Each run creates a new interpretation with fresh chunks while preserving all previous versions.

list

List all documents that have been ingested into the vector store.

jime list

Displays a table with:

Column	Description
ID	Document identifier (use with `#N` syntax)
File	Source filename
Category	Assigned category (if any)
Chunks	Number of stored chunks
Pages	Page range covered
Ingested	Date the document was ingested

transcribe

Transcribe audio files from a Google Drive folder and ingest the transcriptions into the knowledge base. Uses Azure Speech Services for transcription with speaker diarization.

jime transcribe <folder-id> [options]

Argument / Option	Type	Required	Description
`<folder-id>`	string	Yes	Google Drive folder ID containing audio files
`--file`	string	No	Process only the file with this name
`--map-speakers`	flag	No	Interactively map speaker labels to player names
`--reprocess`	flag	No	Re-transcribe files already in the corpus

Prerequisites

Transcription requires additional configuration:

`GoogleServiceAccountKeyPath`: Path to a Google service account key file
`SpeechEndpoint`: Azure Speech Services endpoint
`StorageEndpoint`: Azure Blob Storage endpoint
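Before running a transcription it can help to confirm that all three settings are present. The sketch below assumes the settings have already been loaded into a plain dictionary; how the tool actually stores and reads its configuration is not covered here, and `missing_settings` is a hypothetical helper.

```python
# Setting names taken from the prerequisites list above.
REQUIRED_SETTINGS = (
    "GoogleServiceAccountKeyPath",
    "SpeechEndpoint",
    "StorageEndpoint",
)


def missing_settings(config: dict[str, str]) -> list[str]:
    """Return the required setting names that are absent or blank."""
    return [key for key in REQUIRED_SETTINGS if not config.get(key, "").strip()]


# Hypothetical, partially filled configuration.
config = {
    "SpeechEndpoint": "https://example.invalid/speech",
    "StorageEndpoint": "",
}
print(missing_settings(config))
```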

How it works

Download: Fetches the audio files from the Google Drive folder.
Upload: Stores the audio in Azure Blob Storage.
Transcribe: Submits the audio to Azure Speech batch transcription with speaker diarization.
Create interpretation: Stores the full transcript text as a versioned interpretation.
Chunk & Store: Splits the transcript into chunks and stores them with their embeddings, linked to the interpretation.
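Speaker diarization yields segments tagged with generic speaker labels, which `--map-speakers` lets you associate with player names. The following sketch shows one way such a mapping could be applied to a transcript. The segment shape, the label format ("Speaker 1"), the names, and the function name are assumptions for illustration, not the tool's real data model.

```python
def apply_speaker_map(segments: list[tuple[str, str]],
                      speaker_map: dict[str, str]) -> str:
    """Render diarized (label, text) segments as 'Name: text' lines.

    Labels without a mapping are kept unchanged.
    """
    lines = []
    for label, text in segments:
        name = speaker_map.get(label, label)
        lines.append(f"{name}: {text}")
    return "\n".join(lines)


# Hypothetical diarized segments and mapping.
segments = [
    ("Speaker 1", "I search the room for traps."),
    ("Speaker 2", "Roll a d20."),
    ("Speaker 3", "I'll guard the door."),
]
speaker_map = {"Speaker 1": "Ana", "Speaker 2": "GM"}

transcript = apply_speaker_map(segments, speaker_map)
print(transcript)
```

Keeping unmapped labels as they are means a partially completed mapping still produces a readable transcript.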

Examples

# Transcribe all files in a folder
jime transcribe 1ABCdef123ghijk456

# Transcribe a specific file
jime transcribe 1ABCdef123ghijk456 --file "session-2024-01-15.mp3"

# Map speaker labels to player names interactively
jime transcribe 1ABCdef123ghijk456 --map-speakers