Ingestion
Ingest documents and audio into the knowledge base for searching and chat.
ingest
Ingest a document into the vector store — extract text, chunk into passages, generate embeddings, and store.
| Argument / Option | Type | Required | Description |
|---|---|---|---|
| <file> | string | Yes | Path to the document file (PDF) |
How it works
- Extract — Sends the document to Azure AI Document Intelligence, which parses it into structural elements (paragraphs, tables, headings)
- Chunk — Splits the content into passages that respect structural boundaries (sections, pages) with configurable overlap
- Register — Creates or retrieves the document record in the registry
- Archive — If blob storage is configured, archives the source file to the configured container (default jime-documents) for durable storage. The blob URI is saved on the document record.
- Create interpretation — Creates a new versioned interpretation record from the full extracted text, linking all subsequent chunks to this processing run. Re-ingesting creates a new version rather than overwriting.
- Embed — Generates vector embeddings using Azure OpenAI (text-embedding-3-small), processing chunks in batches with retry logic
- Store — Saves chunks with embeddings, page ranges, section headings, and content hashes to PostgreSQL, linked to the interpretation version
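The chunk-and-store portion of the pipeline above can be sketched roughly as follows. This is an illustrative sketch, not jime's actual implementation — the function name, chunk sizes, and overlap value are assumptions; only the ideas (section-respecting passages, configurable overlap, per-chunk content hashes) come from the steps described here.

```python
import hashlib

def chunk_passages(sections, max_chars=1200, overlap=200):
    """Split extracted sections into passages without crossing a
    section boundary, carrying a configurable character overlap.
    Sizes are illustrative; jime's real parameters may differ."""
    chunks = []
    for heading, text in sections:
        start = 0
        while start < len(text):
            end = min(start + max_chars, len(text))
            passage = text[start:end]
            chunks.append({
                "section": heading,
                "text": passage,
                # A content hash lets re-ingestion detect unchanged chunks
                "hash": hashlib.sha256(passage.encode()).hexdigest(),
            })
            if end == len(text):
                break
            start = end - overlap  # each passage overlaps the previous one
    return chunks

sections = [("Combat", "Roll initiative. " * 200)]
passages = chunk_passages(sections)
```

In the real pipeline each passage would then be embedded (in batches, with retries) and written to PostgreSQL alongside its page range, section heading, and hash.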
Re-ingestion
Running jime ingest on a file that has already been ingested creates a new interpretation — the previous interpretation and its chunks are deactivated but preserved. This lets you re-process documents (e.g., after pipeline improvements) without losing history.
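The versioning behaviour can be pictured with a minimal in-memory sketch — purely illustrative (jime stores interpretations in PostgreSQL, and the field names here are assumptions):

```python
def create_interpretation(registry, filename, full_text):
    """Add a new interpretation version for a document,
    deactivating (but keeping) any previous versions."""
    versions = registry.setdefault(filename, [])
    for v in versions:
        v["active"] = False  # preserved, just no longer current
    version = {"version": len(versions) + 1, "text": full_text, "active": True}
    versions.append(version)
    return version

registry = {}
create_interpretation(registry, "shadowdark-rules.pdf", "v1 text")
latest = create_interpretation(registry, "shadowdark-rules.pdf", "v2 text")
```

The key design point is that deactivation is a flag flip, not a delete — every prior interpretation and its chunks remain queryable for history.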
Examples
# Ingest a rulebook
jime ingest shadowdark-rules.pdf
# Re-ingest the same file — creates a new versioned interpretation
jime ingest shadowdark-rules.pdf
Tip
Re-ingesting a file is safe and non-destructive. Each run creates a new interpretation with fresh chunks while preserving all previous versions.
list
List all documents that have been ingested into the vector store.
Displays a table with:
| Column | Description |
|---|---|
| ID | Document identifier (use with #N syntax) |
| File | Source filename |
| Category | Assigned category (if any) |
| Chunks | Number of stored chunks |
| Pages | Page range covered |
| Ingested | Date the document was ingested |
transcribe
Transcribe audio files from a Google Drive folder and ingest the transcriptions into the knowledge base. Uses Azure Speech Services for transcription with speaker diarization.
| Argument / Option | Type | Required | Description |
|---|---|---|---|
| <folder-id> | string | Yes | Google Drive folder ID containing audio files |
| --file | string | No | Process only the file with this name |
| --map-speakers | flag | No | Interactively map speaker labels to player names |
| --reprocess | flag | No | Re-transcribe files already in the corpus |
Prerequisites
Transcription requires additional configuration:
- GoogleServiceAccountKeyPath — Path to a Google service account key file
- SpeechEndpoint — Azure Speech Services endpoint
- StorageEndpoint — Azure Blob Storage endpoint
How it works
- Download — Fetches audio files from the Google Drive folder
- Upload — Stores audio in Azure Blob Storage
- Transcribe — Submits to Azure Speech batch transcription with speaker diarization
- Create interpretation — Stores the full transcript text as a versioned interpretation
- Chunk & Store — Processes the transcription into chunks and stores with embeddings, linked to the interpretation
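The diarized output of the transcription step can be thought of as speaker-labelled segments. A rough sketch of turning them into transcript text — including the kind of label-to-name renaming that --map-speakers performs interactively — might look like this. The segment field names are assumptions for illustration, not the Azure Speech response schema:

```python
def format_transcript(segments, speaker_map=None):
    """Merge diarized segments into transcript lines, optionally
    renaming generic speaker labels to player names. Field names
    are illustrative, not jime's or Azure's actual schema."""
    speaker_map = speaker_map or {}
    lines = []
    for seg in segments:
        name = speaker_map.get(seg["speaker"], seg["speaker"])
        lines.append(f"{name}: {seg['text']}")
    return "\n".join(lines)

segments = [
    {"speaker": "Guest-1", "text": "I attack the goblin."},
    {"speaker": "Guest-2", "text": "Roll to hit."},
]
transcript = format_transcript(segments, {"Guest-1": "Alice", "Guest-2": "Bob"})
```

The resulting transcript text is what gets stored as the versioned interpretation and then chunked and embedded like any other document.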