Architecture
Overview
Intendant is a two-binary system: a command runtime (`intendant-runtime`) that executes work, and a caller (`intendant`) that drives it via AI model APIs.
```
stdin (JSON) --> intendant-runtime --> executes commands sequentially (blocking)
                        |
                        +--> in-memory process state (HashMap<nonce, ProcessInfo>)
                        +--> $INTENDANT_LOG_DIR/ (stdout/stderr logs per nonce)
                        |
                        +--> stdout (result lines with exit code, stdout/stderr tail)
```
```
intendant (3 modes) --> detects project root (git) --> loads memory/knowledge/skills
        |
        +--> User Mode: spawns orchestrator subprocess, monitors progress (no API calls)
        +--> Sub-Agent Mode: scoped task, writes results/progress, isolated context
        +--> Direct Mode: single-loop execution for simple tasks
        |
        +--> Presence layer: conversational mediator between user and agent loop
        +--> Native tool calling (OpenAI/Anthropic/Gemini) with text extraction fallback
        +--> Streaming output: SSE-based token streaming for all 3 providers
        +--> Ratatui TUI: status bar, scrollable log, approval panel, askHuman input
        +--> Web dashboard: 4-tab app (Activity/Usage/Terminal/Displays) with WASM-driven state
        +--> Live voice: Gemini Live / OpenAI Realtime via browser, active/passive multi-browser
        +--> MCP Server: --mcp flag, stdio transport, full parity with TUI (tools + resources)
        +--> MCP Client: connects to external MCP servers (configured in intendant.toml)
        +--> Autonomy system: Low/Medium/High/Full + per-category rules from intendant.toml
        +--> Skills system: SKILL.md-based instruction sets with YAML frontmatter
        +--> Transcription: server-side Whisper API for browser audio transcription
        +--> Landlock sandbox: filesystem restrictions on agent runtime (Linux)
        +--> Prompt caching: Anthropic cache_control, OpenAI/Gemini implicit caching
        +--> Auto-compaction: triggers at 90% context usage, preserves system+tail messages
        +--> Control socket: /tmp/intendant-<pid>.sock (JSON-line protocol)
        +--> VNC proxy: WebSocket-to-TCP bridge for noVNC display viewing
        +--> Token budget tracking (context-window-aware loop termination)
        +--> Session resume: --continue (most recent) or --resume <id> (specific session)
        +--> Git worktree isolation for implementation agents
        +--> Tagged knowledge store with pub/sub channels between agents
```
Process State
An in-memory `HashMap<u64, ProcessInfo>` tracks nonce, PID, status, exit code, and timestamp. It is ephemeral: it does not survive binary restarts, and each runtime invocation starts with an empty process map.
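The map's lifecycle can be sketched roughly as follows; the `ProcessInfo` field names and the `mark_exited` helper are illustrative assumptions, not the actual definitions:

```rust
use std::collections::HashMap;

// Hypothetical shape of the per-process record; the real ProcessInfo
// fields are not spelled out beyond nonce, PID, status, exit code,
// and timestamp.
#[derive(Debug, Clone)]
struct ProcessInfo {
    pid: u32,
    status: String,          // e.g. "running" / "exited"
    exit_code: Option<i32>,
    started_at_unix: u64,
}

// Record a completed process under its nonce; returns false if the
// nonce is unknown.
fn mark_exited(map: &mut HashMap<u64, ProcessInfo>, nonce: u64, code: i32) -> bool {
    match map.get_mut(&nonce) {
        Some(info) => {
            info.status = "exited".to_string();
            info.exit_code = Some(code);
            true
        }
        None => false,
    }
}

fn main() {
    // Ephemeral: this map lives only for one runtime invocation.
    let mut processes: HashMap<u64, ProcessInfo> = HashMap::new();
    processes.insert(1, ProcessInfo {
        pid: 4242,
        status: "running".to_string(),
        exit_code: None,
        started_at_unix: 1_700_000_000,
    });
    assert!(mark_exited(&mut processes, 1, 0));
    println!("{:?}", processes[&1]);
}
```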
Session Directory
Each session gets a UUID-named directory at `~/.intendant/logs/<uuid>/`. It contains per-nonce stdout/stderr log files, structured session logs (`session.jsonl`), conversation history (`conversation.jsonl`), and askHuman IPC files. The log directory is passed to the runtime via `INTENDANT_LOG_DIR`.
Execution Model
Commands are processed sequentially. Each command blocks until completion and returns its result directly (exit code, stdout tail, stderr tail). The runtime exits after processing all commands. Daemons backgrounded in bash continue after the tool returns.
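A stripped-down sketch of the blocking execution step, under the assumption that the runtime ultimately shells out via something like `std::process::Command` (the real runtime reads JSON commands from stdin and also captures a stderr tail):

```rust
use std::process::Command;

// Run one command to completion and return (exit code, stdout).
// Simplified: no stderr tail, no nonce bookkeeping, no log files.
fn run_blocking(program: &str, args: &[&str]) -> (i32, String) {
    let out = Command::new(program)
        .args(args)
        .output()                       // blocks until the child exits
        .expect("failed to spawn command");
    let code = out.status.code().unwrap_or(-1);
    let stdout = String::from_utf8_lossy(&out.stdout).into_owned();
    (code, stdout)
}

fn main() {
    // Commands run strictly one at a time, each blocking until done.
    for (prog, args) in [("echo", vec!["first"]), ("echo", vec!["second"])] {
        let (code, out) = run_blocking(prog, &args);
        println!("exit={code} stdout={out}");
    }
}
```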
Execution Modes
intendant operates in one of three modes, selected automatically based on task complexity and environment:
Direct Mode
Activated for simple tasks, or forced with `--direct`:
- Single-loop execution with the selected model
- Budget-aware loop: stops at context exhaustion, `done` signal, or 500-turn safety cap
- Used for short tasks that don't need multi-agent orchestration
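The stop conditions above can be sketched as a small predicate (the function name and token-accounting details are illustrative):

```rust
// Hedged sketch of Direct Mode's stop conditions: context exhaustion,
// an explicit done signal, or the 500-turn safety cap.
fn should_stop(tokens_used: u64, token_budget: u64, done_signalled: bool, turn: u32) -> bool {
    const MAX_TURNS: u32 = 500; // safety cap from the docs
    done_signalled || tokens_used >= token_budget || turn >= MAX_TURNS
}

fn main() {
    let mut turn = 0;
    let mut tokens_used = 0;
    loop {
        turn += 1;
        tokens_used += 3_000; // pretend each turn costs ~3k tokens
        let done = false;     // would come from the model's done signal
        if should_stop(tokens_used, 100_000, done, turn) {
            break;
        }
    }
    println!("stopped after {turn} turns");
}
```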
User Mode
Activated for complex tasks when `INTENDANT_ROLE` is not set:
- Pure subprocess monitor: makes zero model API calls at Layer 0
- Spawns an orchestrator sub-agent as a child process via `tokio::process::Command`
- Polls the orchestrator's progress file every 500 ms and relays status to the TUI or stdout
- Reads the orchestrator's result file on exit; synthesizes a failure result if the process crashes
- `kill_on_drop(true)` ensures the orchestrator is terminated if the user quits the TUI
Sub-Agent Mode
Activated when the `INTENDANT_ROLE` env var is set:
- Runs as a child agent with a scoped task
- Writes periodic progress to `INTENDANT_PROGRESS_FILE`
- Writes final results (summary, findings, artifacts, token usage) to `INTENDANT_RESULT_FILE`
- Uses role-specific system prompts (`SysPrompt_research.md`, `SysPrompt_implementation.md`, etc.)
See Multi-Agent Orchestration for the full sub-agent architecture.
How It Works (Direct Mode)
- Loads `.env` and selects the API provider (OpenAI, Anthropic, or Gemini). OpenAI uses the Responses API (`/v1/responses`), Anthropic uses the Messages API, and Gemini uses the `generateContent` endpoint. All providers support streaming via SSE
- Configures structured output (JSON mode), reasoning controls, native tool calling, prompt caching (Anthropic `cache_control`), and max output tokens based on model capabilities and env vars
- Detects the project root (via `git rev-parse --show-toplevel`, falling back to cwd)
- Resolves the role-appropriate system prompt via a cascade: project root → `~/.config/intendant/` → compiled-in default. When native tools are enabled, uses the condensed `SysPrompt_tools.md` (tool docs live in API tool definitions instead of prose)
- Injects the project working directory into the conversation so the model knows which project to work in
- Loads knowledge from `<project>/.intendant/memory.json` and injects it into the conversation
- Loads `INTENDANT.md` project instructions (global, then project-local) and injects them into the conversation
- Logs the full messages array to `turn_NNN_messages.json` before each API call
- Sends the task to the chat API via streaming (`chat_stream()`), with `max_tokens`/`max_output_tokens`, optional `reasoning`, optional JSON format, and native tool definitions when enabled. API requests use exponential backoff retry (up to 5 retries) for rate-limit (429) and server (5xx) errors. Text deltas are forwarded to the TUI in real time
- Logs reasoning content (both summary and full text) to `turn_NNN_reasoning.txt` when available
- Processes the model's response via one of two paths:
  - Native tool call path (when the response contains tool calls): collects individual tool calls, assembles them into an `AgentInput` batch, pipes it to the runtime, and maps results back to per-tool-call responses. Handles `manage_context` and `signal_done` tool calls caller-side. Raw API output items (reasoning + function_call) are preserved for verbatim echo-back in subsequent requests, which reasoning models (GPT-5, o3, o4) require
  - Legacy text extraction path (fallback): extracts JSON from the response text (handling structured output, code fences, and bare JSON) and checks for an explicit `done` signal (`{"done": true}`)
- Applies context directives (`drop_turns`, `summarize`) to the conversation
- Injects project context (`memory_file`) into relevant commands
- Classifies commands by action category (file read/write/delete, exec, network, destructive) and checks autonomy rules
- If approval is required:
  - TUI mode: emits an approval request and waits for the user's response
  - Headless mode: denies execution (no implicit auto-approve fallback)
- Pipes the JSON to the `intendant-runtime` binary and waits for completion with a hard timeout (120 s default, 600 s for `askHuman`)
- Feeds the agent output back as the next user message (text path) or as individual tool results (tool call path), appending a token budget summary
- Repeats until the model signals done, responds with no JSON, or the context budget is exhausted
- In headless mode, if the model emits `askHuman`, the loop sends a recovery prompt back to the model (continue with explicit assumptions) instead of blocking on the human-input timeout
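The classification-and-approval step can be illustrated with a toy model. The category and autonomy-level names come from this document, but the per-level gating rules shown here are assumptions, not Intendant's actual policy:

```rust
// Action categories named in the docs: file read/write/delete, exec,
// network, destructive.
#[derive(Debug, PartialEq, Clone, Copy)]
enum ActionCategory { FileRead, FileWrite, FileDelete, Exec, Network, Destructive }

// Autonomy levels from the docs: Low / Medium / High / Full.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Autonomy { Low, Medium, High, Full }

// Hypothetical gating rules; the real per-category rules come from
// intendant.toml.
fn requires_approval(cat: ActionCategory, level: Autonomy) -> bool {
    use ActionCategory::*;
    match level {
        Autonomy::Full => false,                      // everything auto-approved
        Autonomy::High => matches!(cat, Destructive), // only destructive gated
        Autonomy::Medium => matches!(cat, Destructive | FileDelete | Exec),
        Autonomy::Low => !matches!(cat, FileRead),    // everything but reads gated
    }
}

fn main() {
    // Headless mode denies instead of prompting when approval is required.
    let headless = true;
    let cat = ActionCategory::Destructive;
    if requires_approval(cat, Autonomy::Medium) && headless {
        println!("denied: {cat:?} requires approval and there is no TUI");
    }
}
```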
askHuman Behavior
- In TUI mode, `askHuman` opens the input panel and writes your answer to the session-scoped response file.
- Empty submit is rejected in the TUI; you must provide non-empty input or press `Esc` to cancel.
- In headless mode (`--no-tui` or non-interactive stdin), `askHuman` cannot be answered interactively. The loop tells the model to continue with explicit assumptions instead of waiting for the runtime timeout.
- The runtime-level timeout for an unanswered `askHuman` remains 5 minutes.
Streaming
All three providers support streaming via `chat_stream()` on the `ChatProvider` trait:
- Anthropic: `stream: true` on the Messages API; parses `content_block_delta`, `content_block_start`/`stop`, `message_delta`
- OpenAI: `stream: true` on the Responses API; parses `response.output_text.delta`, `response.function_call_arguments.delta`, `response.completed`
- Gemini: `streamGenerateContent?alt=sse` endpoint; parses chunked JSON candidates
Text deltas are forwarded to the TUI via `AppEvent::ModelResponseDelta` and accumulated in `App::streaming_buffer`, which is cleared when the full `ModelResponse` arrives.
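The accumulate-then-clear behavior can be sketched as follows (the type and method names are illustrative, not the actual `App` internals):

```rust
// Minimal delta accumulator: text deltas append to a buffer, which is
// drained once the complete response arrives.
struct StreamBuffer {
    text: String,
}

impl StreamBuffer {
    fn new() -> Self {
        Self { text: String::new() }
    }

    // Called once per streamed text delta.
    fn on_delta(&mut self, delta: &str) {
        self.text.push_str(delta);
    }

    // Called when the full response arrives: returns the accumulated
    // text and resets the buffer for the next streamed turn.
    fn on_complete(&mut self) -> String {
        std::mem::take(&mut self.text)
    }
}

fn main() {
    let mut buf = StreamBuffer::new();
    for delta in ["Hel", "lo, ", "world"] {
        buf.on_delta(delta);
    }
    let full = buf.on_complete();
    assert_eq!(full, "Hello, world");
    assert!(buf.text.is_empty()); // ready for the next response
}
```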
Rate-Limit Retry
API requests use `send_with_retry()` with exponential backoff (1 s × 2^attempt + jitter, up to 5 retries) for HTTP 429 and 5xx responses. Non-retryable errors (400, 401, etc.) fail immediately. API keys in error messages are masked via `mask_api_keys()`.
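A sketch of the documented backoff schedule; the retryable-status check follows the text above, while the random jitter term is omitted here for determinism:

```rust
// Base delay per the docs: 1s × 2^attempt (jitter added in practice).
fn base_delay_ms(attempt: u32) -> u64 {
    1_000u64 * (1u64 << attempt) // 1s, 2s, 4s, 8s, 16s
}

// HTTP 429 and 5xx are retried; everything else fails immediately.
fn is_retryable(status: u16) -> bool {
    status == 429 || (500..=599).contains(&status)
}

fn main() {
    const MAX_RETRIES: u32 = 5;
    let status = 429; // pretend the API keeps rate-limiting us
    for attempt in 0..MAX_RETRIES {
        if !is_retryable(status) {
            break;
        }
        let delay = base_delay_ms(attempt); // + jitter in the real code
        println!("attempt {attempt}: would sleep {delay} ms before retrying");
    }
}
```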
Prompt Caching
- Anthropic: uses the `anthropic-beta: prompt-caching-2024-07-31` header with structured system content containing `cache_control: {"type": "ephemeral"}`
- OpenAI: automatic server-side caching for prompts >1024 tokens (no API changes needed)
- Gemini: Implicit context caching (no API changes needed)
Auto-Compaction
When context usage reaches 90% (`usage_fraction() >= 0.90`), `conversation.auto_compact()` triggers:
- Keeps: system message, first 2 context messages, last 4 messages
- Summarizes: oldest half of the remaining middle messages via `summarize_turns()`
- Emits a `ContextManagement` event to the TUI/MCP
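The keep/summarize split can be modeled over message indices (index 0 being the system message; the `compaction_plan` helper name and exact boundary arithmetic are assumptions):

```rust
// Partition message indices into (kept, summarized) per the rules
// above: protect the system message, the first 2 context messages,
// and the last 4 messages; summarize the oldest half of the middle.
fn compaction_plan(len: usize) -> (Vec<usize>, Vec<usize>) {
    let head = 3.min(len); // system + first 2 context messages
    let tail_start = len.saturating_sub(4).max(head); // last 4 messages
    let middle: Vec<usize> = (head..tail_start).collect();

    // Oldest half of the middle gets summarized; the rest is kept.
    let cut = middle.len() / 2;
    let summarize = middle[..cut].to_vec();

    let mut keep: Vec<usize> = (0..head).collect();
    keep.extend(&middle[cut..]);
    keep.extend(tail_start..len);
    (keep, summarize)
}

fn main() {
    let (keep, summarize) = compaction_plan(20);
    println!("keep {} messages, summarize {}", keep.len(), summarize.len());
}
```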
Vision / Display Management
Xvfb is auto-launched lazily on the first turn containing an `execAsAgent` or `captureScreen` command when no accessible X display exists (checked via `xdpyinfo`). The detection flow:
- Already launched? → skip
- Batch contains `execAsAgent` or `captureScreen`? No → skip
- Current `DISPLAY` accessible? Yes → skip (the user has a working display)
- Auto-launch Xvfb, store the guard, set `DISPLAY`, emit a `DisplayReady` event
- On failure → log a warning and let `captureScreen` fail naturally
Display allocation prefers :99 for a predictable VNC port (5999). If :99 is locked by a live Xvfb from a previous session, it is automatically killed and reclaimed (detected via `/proc/<pid>/cmdline`). If :99 is held by a non-Xvfb process, allocation falls through to :100+.
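The display-to-port relationship and the fallback scan can be sketched as follows (the `allocate_display` helper is a simplification that ignores lock files and process inspection):

```rust
// VNC port is 5900 + display number, so :99 maps to the predictable 5999.
fn vnc_port(display_id: u32) -> u32 {
    5900 + display_id
}

// Hypothetical allocation: prefer :99, then fall through to :100+.
// The real logic also kills and reclaims stale Xvfb processes.
fn allocate_display(taken: &[u32]) -> u32 {
    if !taken.contains(&99) {
        return 99;
    }
    (100..).find(|d| !taken.contains(d)).unwrap()
}

fn main() {
    // :99 held by a non-Xvfb process, so allocation falls through.
    let display = allocate_display(&[99]);
    println!("using :{display}, VNC at port {}", vnc_port(display));
}
```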
Per-provider display resolutions:
- OpenAI: 1024×768 (3 tiles of 512×512)
- Anthropic: 819×1456 (9:16 aspect)
- Gemini: 768×1024 (2 tiles of 768×768)
VNC Remote Observation
An x11vnc server is launched alongside Xvfb as a best-effort co-process (port = 5900 + display_id). If x11vnc is not installed, the display works normally. Both Xvfb and x11vnc are killed on drop via `XvfbGuard`.
On the guest VM — install x11vnc:
```sh
sudo apt-get install -y x11vnc
```
From your host machine — connect with any VNC client:
```sh
# Direct connection (VM on local network)
vncviewer <vm-ip>:5999

# Over SSH tunnel (recommended for remote VMs)
ssh -L 5999:localhost:5999 user@vm-host
vncviewer localhost:5999
```
If display :99 was already taken, intendant falls through to :100+ and the VNC port shifts accordingly (6000, 6001, …). Check the TUI log panel or stderr for the actual port:
```
22:29:12 VNC server available at vnc://localhost:5999
```
Environment
- OS: Linux (Debian 12+)
- Runtime: Tokio async (full features)
- Permissions: Runs as unprivileged user with passwordless sudo
- Display: Auto-managed Xvfb with x11vnc (see above)
- X11 auth: At startup the runtime discovers active X displays and merges their xauth cookies into a session-scoped `session.Xauthority` file, passed as `XAUTHORITY` to all spawned commands