agent6 state machines
agent6 state machines are a declarative, human-editable, machine-parseable layer on top of agent6 that lets operators compose mini-agents: small, reliable, deterministic programs whose building blocks are agent6 runs, sandboxed tool calls, timed waits, and branches.
This document is the specification and reference for the format and its
runtime. The feature is implemented end-to-end under src/agent6/machine/
and exposed through the agent6 machine subcommands: create, check, test,
graph, run, status, poke, and replay (§7). It does not change
the security model, the tool surface, or the stability policy in
AGENTS.md; §9 records how each invariant is preserved.
1. Motivation
agent6's two workflows (architecture.md), run (the
agent loop) and review (a read-only diff pass), are both single-shot:
you start them, they finish. There is no first-class way to express a
program that runs indefinitely, reacting to the clock or to external
signals, branching, looping, and occasionally invoking an agent run as
one step among many.
"Always-on" autonomous agents target this, but tend to put the LLM in the
driver's seat of the control flow, not just the work, so the same
inputs produce different paths, crashes lose state, and runs can't be
replayed. agent6 can do better because the run workflow is already a
deterministic, snapshot-and-replay state machine internally. State
machines lift that pattern up one layer: the operator authors the
control flow as a static graph, and the LLM stays confined to the work
inside a state.
A representative shape (the watched location and side effects would each be separately-audited tools, out of scope here): poll on a fixed interval; when items appear, have an agent classify each into a typed verdict; on a high-confidence verdict take a side-effecting step and loop, else wait and poll again. A long-running loop with timed polling, branches on agent output, side-effecting steps, and terminal states is what state machines make first-class.
2. Goals and non-goals
Goals
- Human-editable. An operator authors a machine in a text editor. The format is obvious to read, diff-friendly, and commentable.
- Deterministically parseable. One file → exactly one validated in-memory machine, or a precise error. No ambiguity, no implicit defaults beyond the ones declared in the file.
- Deterministic execution / replayable. Given the same journal of inputs (including captured wall-clock and external reads), re-running reproduces the identical path. You can backtest a run offline.
- Reliable / crash-safe. Kill the process at any point; on restart it rehydrates from an append-only journal at the last completed state and the next scheduled wake. Idempotent: completed side-effecting steps are not re-run.
- Composable. A state can be an agent6 run. Mini-agents are built by wiring states, not by writing Python.
- Confined. The LLM never authors control flow and never gains new tool surface. All side effects still route through the existing jail.
Non-goals
- Not a general programming language. The branch/predicate grammar is intentionally non-Turing-complete (no loops inside a predicate, no arbitrary code). Loops exist only as graph edges.
- Not a distributed scheduler. One machine = one OS process (systemd / cron-friendly), restartable. No clustering in v1.
- Not a new network surface. Anything that talks to the outside world is a tool, gated by the existing audit rules.
- Not LLM-authored. Machines are operator artifacts checked into a repo.
3. Design principles
- Control flow is static and operator-owned; work is dynamic and model-owned. The graph of states/edges is fixed at author time. What happens inside an agent state is the usual agent6 loop.
- Everything nondeterministic is journaled as a fact. Wall-clock reads, tool stdout, agent outputs: each is appended to an immutable event log the moment it is observed. The engine is a pure reducer over (machine, blackboard, event) → blackboard'. Replay reads the journal instead of re-observing the world.
- Fail loudly (repo convention). A missing transition target, an unreachable state, a type mismatch on a blackboard variable, or an unknown key is a load-time error, not a runtime surprise.
- No implicit defaults (mirrors Config: extra="forbid", frozen=True). Every variable is declared with a type and an explicit initial value (value for [vars.operator], default for the mutable [vars.code] / [vars.agent]). Every state declares every outcome edge it can produce.
4. The format
A machine is a single TOML file, suffix .asm.toml ("agent state
machine"). TOML was chosen because the project already standardizes on
it, it is parsed by tomllib (stdlib, no new dependency), and it is
comfortable to hand-edit and diff. The parsed document is validated by a
pydantic v2 model at the trust boundary (extra="forbid", frozen=True),
exactly like Config.
Naming. The suffix is .asm.toml ("agent state machine"; deliberately vendor-neutral like AGENTS.md, so other tools can adopt it). .a6m.toml is a documented fallback if the assembly-language .asm clash ever bites; the parser keys off the doubled extension.
4.1 Top-level shape
machine = "item-classifier" # stable id, used in <state-dir>/<repo-id>/machines/<id>/
version = 1 # schema version; bumped only on real shape changes
initial = "poll" # name of the entry state
[budget]
max_usd = 25.0 # optional hard cap; or best_effort_usd_limit (see below)
max_transitions = 100000 # hard stop on total edges taken (runaway guard)
# The blackboard is three subtables, named by WHO may write each variable.
# The subtable header is the owner; there is no per-entry discriminator.
[vars.operator] # written by the human at author time; immutable at runtime
inbox_dir = { type = "str", value = "/srv/inbox" }
poll_secs = { type = "int", value = 300 }
[vars.code] # written deterministically by a tool state's capture
pending = { type = "list[str]", default = [] }
cursor = { type = "str", default = "" }
[vars.agent] # written by an agent state's validated finish_run
verdict = { type = "classification", default = {} } # a [schemas.*] record type
[schemas.<name>] # named record types; see 4.6
...
[states.<name>] # one table per state; see 4.3
...
4.2 The blackboard: three owners
The key/value store is split into three subtables, named by who may
write each variable. Provenance is the single organizing axis, and the
subtable header carries it, so there is no redundant per-entry writer/
owner field. Who may write a value is therefore a
statically-checkable, fail-loud property of which table a variable lives
in, not a runtime convention.
| subtable | written by | mutability | declared with | example |
|---|---|---|---|---|
| [vars.operator] | the human, at author time | immutable at runtime | value | inbox_dir, poll_secs, thresholds, an API base |
| [vars.code] | a tool state's capture | mutable (deterministic) | default | pending, cursor |
| [vars.agent] | an agent state's validated finish_run payload | mutable (LLM) | default | verdict (a [schemas.*] record) |
Only tool states (into [vars.code]) and agent states (into
[vars.agent]) ever mutate the blackboard; branch/wait/terminal
only route, sleep, or end.
- [vars.operator] are the machine's parameters: set once when the operator authors/commits the file and never written by any state. Declared with a concrete value (not a default). Any capture/set that targets an operator var is a load-time error. The names above are illustrative; an operator var may be any JSON-serializable value.
- [vars.code] change only as a pure function of journaled tool output; this is what keeps the path deterministic and replayable.
- [vars.agent] change only through the single validated structured output of one agent state: the LLM's one sanctioned channel into the blackboard.
At machine check time the validator enforces the ownership wall: a
tool capture may target only [vars.code] vars, an agent capture
may target only [vars.agent] vars, and [vars.operator] vars are
read-only to every state. A tool cannot smuggle a write into an
LLM-owned variable, and an agent cannot overwrite a deterministic one.
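The ownership wall amounts to a small static check. The following is a minimal sketch, not agent6's validator: the function name check_capture_targets and the dict shapes are assumptions made for illustration.

```python
# Hypothetical sketch of the ownership wall. The function name and the data
# shapes are assumptions for illustration, not agent6's actual validator.

# Which owner subtable each state kind is allowed to write.
WRITABLE_OWNER = {"tool": "code", "agent": "agent"}


def check_capture_targets(kind: str, targets: list, owners: dict) -> list:
    """Return load-time error messages for one state's capture targets.

    `owners` maps every declared variable name to "operator", "code", or "agent".
    """
    errors = []
    allowed = WRITABLE_OWNER.get(kind)  # None for branch/wait/terminal
    for name in targets:
        owner = owners.get(name)
        if owner is None:
            errors.append(f"capture target {name!r} is not a declared variable")
        elif owner != allowed:
            errors.append(f"{kind} state may not write {name!r} (owned by [vars.{owner}])")
    return errors


owners = {"inbox_dir": "operator", "pending": "code", "cursor": "code", "verdict": "agent"}
print(check_capture_targets("tool", ["pending", "cursor"], owners))  # []
print(check_capture_targets("tool", ["verdict"], owners))            # one error
print(check_capture_targets("agent", ["inbox_dir"], owners))         # one error
```

Because the owner is read off the subtable a variable lives in, the check needs no runtime information and can run entirely at load.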
Allowed types (all three subtables): str, int, float, bool,
list[<scalar>], json, and any named record type declared in
[schemas.*] (§4.6). The two structured types differ on exactly one
axis, navigability:
- json is an opaque blob: read or written wholesale only. It may be passed to a tool/agent ({{ x | json }}) or captured as a whole, but it may not be dotted. x.key where x is json is a load-time error. Use json only when the machine never inspects the value's internals.
- A record type (e.g. classification) is navigable: every .field read in a predicate or template is checked against the schema at machine check time; a misspelled field is a load error, not a silent misroute.
Declaring types up front is what makes branch predicates statically
type-checkable: scalars by their declared type, record fields by their
schema, and json simply forbidden from being dotted at all.
The blackboard (all three subtables) is the only state that flows
between states. The mutable halves ([vars.code] + [vars.agent]) are
snapshotted to disk after every transition; [vars.operator] is fixed
for the life of the machine.
4.3 State kinds
Every state has a kind. There are five.
| kind | what it does | outcome labels (edges) |
|---|---|---|
| agent | runs one agent6 loop (a Workflow) on a prompt | ok · failed · budget_exhausted · timeout |
| tool | one sandboxed command via run_in_jail | ok · nonzero · timeout |
| wait | sleeps until a wall-clock tick or an external signal | tick · signal |
| branch | pure predicate over the blackboard → next state | (chooses a goto directly) |
| terminal | ends the machine | (none; absorbing) |
The outcome labels are a fixed enum per kind, produced by the
state executor deterministically. A non-terminal, non-branch state
must declare an on = { ... } table mapping every label its kind
can emit to a target state name. Omitting a label is a load error.
This is the key to determinism: the edge taken is a pure function of a small, closed set of executor-produced labels, never of free-form LLM text.
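The "every label must be mapped" rule can be checked mechanically. A minimal sketch follows; the label sets are copied from the table above, while the function name check_on_table and the error wording are hypothetical.

```python
# Outcome labels per kind, copied from the state-kind table.
# check_on_table is a hypothetical helper, not part of agent6.
OUTCOMES = {
    "agent": {"ok", "failed", "budget_exhausted", "timeout"},
    "tool": {"ok", "nonzero", "timeout"},
    "wait": {"tick", "signal"},
}


def check_on_table(kind: str, on: dict, declared_states: set) -> list:
    """Return load errors for one state's `on` table."""
    required = OUTCOMES[kind]
    errors = []
    for label in sorted(required - set(on)):
        errors.append(f"missing edge for outcome {label!r}")
    for label in sorted(set(on) - required):
        errors.append(f"unknown outcome {label!r} for kind {kind!r}")
    for label, target in sorted(on.items()):
        if target not in declared_states:
            errors.append(f"edge {label!r} targets undeclared state {target!r}")
    return errors


states = {"poll", "scan", "have_items"}
complete = {"ok": "have_items", "nonzero": "poll", "timeout": "poll"}
print(check_on_table("tool", complete, states))                # []
print(check_on_table("tool", {"ok": "have_items"}, states))    # two missing edges
```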
agent
[states.classify]
kind = "agent"
model = "claude-sonnet-4-6" # any configured provider model
prompt = """
Classify the item at path {{ cursor }}.
Call finish_run with JSON {label, confidence}.
"""
output_schema = "classification" # named schema in [schemas.*]; validates finish_run payload
capture = { finish_json = "verdict" } # parsed finish_run payload -> blackboard var `verdict`
timeout_secs = 600
on = { ok = "route", failed = "poll", budget_exhausted = "halt", timeout = "poll" }
# Optional per-state overrides (inherit the effective config when unset):
# provider = "anthropic" # which [providers.*] entry backs this call
# thinking = "high" # off | low | medium | high (extended thinking)
# temperature = 0.2
# max_usd = 1.5 # this agent slice's caps; or
# best_effort_usd_limit = 1.5 # ...the soft variant (at most one of the two)
# max_input_tokens = 100000
# max_output_tokens = 4096
An agent state spins up a normal agent6 run with its own snapshot
dir, transcript, budget slice, and jail. The only control-flow signal
it returns is the outcome label; its structured product is whatever
finish_run emitted, validated against output_schema, captured into
the blackboard. The LLM cannot pick the next state; it can only
populate variables that a downstream branch reads.
The optional per-state knobs above tune how that loop runs: provider
/ thinking / temperature select and tune the model, and the
max_usd / best_effort_usd_limit / max_input_tokens /
max_output_tokens caps bound this one agent slice. Each falls back to the effective config (machine [config]
overlay > repo > global > defaults; §4.7) when omitted. Connection
secrets are never expressed here, only a provider name that must
already exist in the effective config.
tool
[states.scan]
kind = "tool"
command = ["scan-inbox", "--dir", "{{ inbox_dir }}", "--since", "{{ cursor }}"]
output_schema = "scan_result" # types `result` so its fields are navigable
capture = { set = { pending = "{{ result.pending }}", cursor = "{{ result.cursor }}" } }
timeout_secs = 60
on = { ok = "have_items", nonzero = "poll", timeout = "poll" }
A single command, argv-style (never a shell string), run through the
existing run_in_jail. nonzero is any non-zero exit. A tool's
stdout is parsed as JSON and bound to the capture-scope name result
(§4.5). Its capture has two modes, and a state uses at most one:
- Opaque whole-capture: capture = { stdout_json = "<var>" } binds the entire parsed stdout to one variable. No output_schema is needed; result is then opaque and may not be dotted.
- Typed field-capture: declare output_schema = "<record>" (a [schemas.*] type, §4.6) to type result, then pull fields with set = { <var> = "{{ result.<field> }}" }. Because result is typed, every result.<field> is statically checked, mirroring how an agent state validates finish_run.
A list-typed variable spliced as a bare argv element
("{{ pending }}") expands in place to one argument per element (§4.4).
scan-inbox here is an illustrative stand-in; a tool state runs
whatever audited command the operator names.
Network (opt-in, default off). A tool's allow_network is one of
"auto" (default, no network), "allow" (wants the host network), or
"block" (no network, required; refuses to run on hardened, which can't
isolate a single tool). A tool reaches the network only when it sets allow_network =
"allow". Because the machine engine is a host-netns supervisor (each agent
state confines itself in its own subprocess; see §9), an opt-in tool can reach
the host network even while the agents stay confined to the provider API. A
tool command is fixed and operator-reviewed, so it is not a free exfiltration
channel the way a networked run_command would be. Whether the opt-in is
honored is the operator's call via sandbox.tool_network (read from the
global/repo config, never the machine overlay):
| sandbox.agent_network | sandbox.tool_network | agent egress | tool w/ allow_network="allow" |
|---|---|---|---|
| providers (def) | block (def) | providers + allow_urls | ⛔ refuse to run |
| providers | only_explicit_states | providers + allow_urls | host network |
| local | only_explicit_states | loopback providers only | host network |
| open | allow | unconfined | host network (and run_command) |
So the headline setup (confined agents + one operator-reviewed networked tool) is
sandbox.agent_network = "providers", sandbox.tool_network =
"only_explicit_states", and allow_network = "allow" on that one state.
only_explicit_states (and local) need the strict profile; a networked tool
under sandbox.tool_network = "block", or a tool-network config the profile
can't honor, refuses to run at startup naming the state.
Script bundles. A machine is a bundle: the .asm.toml file plus an
optional sibling scripts/ directory holding operator-reviewed helper
scripts (the kind machine create may draft). A tool references one by a
relative path whose first segment is scripts/, e.g.
command = ["bash", "scripts/fetch.sh"]; it resolves against the jail's
mounted cwd at run time, so keep the bundle at (or under) the directory you
run agent6 from. machine check validates the bundle: every entry under
scripts/ must resolve inside the bundle (symlinks that escape via ../ or
an absolute path are rejected) and every static scripts/... command
reference must exist and stay inside the bundle. On the strict profile the
bundle (the .asm.toml + scripts/) is RO-bound in every jail, so a tool or
agent cannot rewrite its own machine logic or bundled scripts mid-run. On
hardened the cwd is blanket read-write (no mount namespace to carve), so the
bundle is writable there; the surrounding container is the blast radius.
A tool script that needs to persist data across iterations writes to
$AGENT6_MACHINE_DATA_DIR, a per-machine writable directory under the
per-repo state dir (<state-dir>/<repo-id>/machines/<id>/data/, out of the
workspace) granted RW in every tool jail. On the hardened profile the repo
cwd is also blanket read-write, so the persisted-data dir is just the durable
home for cross-iteration state; the journal records every transition either way.
wait
[states.poll]
kind = "wait"
every_secs = "{{ poll_secs }}" # exactly one of: every_secs | until | cron
on = { tick = "scan", signal = "scan" }
wait is what makes a machine long-running without burning CPU or
tokens. A state declares exactly one of every_secs, until (an
absolute ISO-8601 instant), or cron (a 5-field expression); zero or
two-or-more is a load error. On entry the engine computes the absolute
next-wake instant and journals it as a fact before sleeping, so a
replay re-reads that instant and never actually sleeps. In v1 the
process simply blocks in-process until the instant (or an external
signal, a file/IPC poke, arrives first); because the wake is
journaled absolutely, the --exit-on-wait persisted-wake driver (§6)
runs the identical file with no format change. (cron is accepted by
the parser but not yet evaluated by the v1 runtime: use every_secs or
until; a cron wait raises at run time.)
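The "journal the absolute wake instant before sleeping" rule can be sketched as below. This is an illustration under assumptions: the function name enter_wait, the event type "wait.scheduled", and the field name wake_at are hypothetical, not the engine's actual journal schema.

```python
from datetime import datetime, timedelta, timezone


def enter_wait(every_secs: int, journal: list, now=None, replay_event=None) -> datetime:
    """Return the absolute wake instant for an every_secs wait.

    Live: observe the clock once, journal the absolute instant, then return it
    (the caller sleeps afterwards). Replay: re-read the journaled instant and
    never consult the clock.
    """
    if replay_event is not None:
        return datetime.fromisoformat(replay_event["wake_at"])
    observed = now if now is not None else datetime.now(timezone.utc)
    wake_at = observed + timedelta(seconds=every_secs)
    # Hypothetical event shape; the fact is recorded before any sleeping happens.
    journal.append({"type": "wait.scheduled", "wake_at": wake_at.isoformat()})
    return wake_at


journal = []
t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
live_wake = enter_wait(300, journal, now=t0)
replayed_wake = enter_wait(300, [], replay_event=journal[0])
print(live_wake == replayed_wake)  # True
```

Storing an absolute instant rather than a relative delay is what lets a restarted or replayed process agree with the original on when the wait ends.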
branch
[states.route]
kind = "branch"
when = [
{ if = "verdict.label == 'urgent' and verdict.confidence >= 0.7", goto = "record" },
{ else = true, goto = "poll" },
]
when is an ordered list; the first matching if wins; a final
else = true is required (total function, no "stuck" state). The
predicate grammar is a restricted, non-Turing-complete expression
language (see §5.2): comparisons, and/or/not, membership,
len(), numeric/string literals, and blackboard references (§4.5). No
function calls beyond a tiny fixed allow-list, no Python attribute
access, no eval. Dotted references like verdict.confidence are
data navigation into a record value interpreted by agent6's own
evaluator (§4.5), never Python attribute resolution. This is a hard
security boundary: a .asm.toml file must never be able to execute
arbitrary code.
terminal
[states.halt]
kind = "terminal"
status = "failed" # "ok" | "failed"
reason = "machine budget exhausted"
Absorbing. Emits a machine.end event and returns control to the CLI.
A machine may have many terminal states (success and failure variants).
4.4 Templating and list-splicing
Strings may contain {{ ... }} interpolations. The contents of an
interpolation are one reference (§4.5) plus an optional single
filter, nothing more. No arbitrary expressions, no chained filters, no
method calls. Anything richer belongs in a branch predicate, which is
itself restricted. This keeps both author-time validation and replay
simple and keeps the format from quietly becoming a scripting language.
There are exactly two filters, both zero-argument:
| filter | applies to | result |
|---|---|---|
| len | str, list, or a json/record container | the integer length |
| json | any value | compact JSON, object keys sorted (deterministic) |
There is deliberately no join filter: building a delimited string
that a downstream command must re-split is fragile and injection-prone.
Lists reach a command's argv by splicing instead (below).
An interpolation always produces a string. A bare {{ x }} is legal
only when x resolves to a scalar (str/int/float/bool); a bare
reference to a list, json, or record value is a load error: apply
json (or, for a list in argv, splice it) so the rendering is explicit
rather than a surprising Python repr.
List-splicing (argv only). Inside a tool state's command array,
an element that is exactly the string "{{ listvar }}" (a lone
reference to a list[...] variable, no filter, no surrounding text)
expands in place to one argv element per list item, each rendered as
a scalar. This is the only way a list crosses into a command, and it is
injection-safe because each element stays a distinct argument that is
never re-parsed by a shell. Two load errors guard it: splicing a
non-list value, and embedding {{ listvar }} inside a larger string
("--x={{ items }}") rather than as a standalone element. Filter and
reference grammar are validated at machine check.
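The interpolation and splicing rules above can be sketched in a few lines. This is a minimal illustration, not agent6's renderer: render, expand_argv, and the choice to print booleans as true/false are assumptions, and the errors surface here at render time rather than at machine check.

```python
import json
import re

# One reference plus an optional single filter: "{{ x }}", "{{ x.y | json }}".
_INTERP = re.compile(
    r"\{\{\s*([a-z][a-z0-9_]*(?:\.[a-z][a-z0-9_]*)*)\s*(?:\|\s*(len|json)\s*)?\}\}"
)


def resolve(ref: str, blackboard: dict):
    """Look up the first segment, then navigate further keys as dict lookups."""
    first, *keys = ref.split(".")
    value = blackboard[first]
    for key in keys:
        value = value[key]
    return value


def render_scalar(value) -> str:
    if isinstance(value, bool):
        return "true" if value else "false"  # assumption: not specified above
    if isinstance(value, (str, int, float)):
        return str(value)
    raise TypeError("bare reference to a non-scalar; apply the json filter")


def render(template: str, blackboard: dict) -> str:
    def substitute(match):
        value = resolve(match.group(1), blackboard)
        if match.group(2) == "len":
            return str(len(value))
        if match.group(2) == "json":
            # Compact and key-sorted, so the rendering is deterministic.
            return json.dumps(value, sort_keys=True, separators=(",", ":"))
        return render_scalar(value)
    return _INTERP.sub(substitute, template)


def expand_argv(command: list, blackboard: dict) -> list:
    argv = []
    for element in command:
        lone = _INTERP.fullmatch(element)
        if lone and lone.group(2) is None:
            value = resolve(lone.group(1), blackboard)
            if isinstance(value, list):
                # Splice: one argv element per list item, never re-parsed.
                argv.extend(render_scalar(item) for item in value)
                continue
        argv.append(render(element, blackboard))
    return argv


bb = {"inbox_dir": "/srv/inbox", "pending": ["a.txt", "b c.txt"],
      "verdict": {"label": "urgent", "confidence": 0.9}}
print(expand_argv(["scan", "--dir", "{{ inbox_dir }}", "{{ pending }}"], bb))
print(render("{{ verdict | json }}", bb))
```

Note that "b c.txt" stays a single argument after splicing; that is the property a join filter could not guarantee.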
4.5 Names, references, and namespaces (normative)
This subsection pins down every previously-implicit rule about how
variables are named, written, and read, so that one machine file has
exactly one meaning. Every rule here is enforced by agent6 machine
check and re-checked before machine run; each violation is a
load-time error, never a silent runtime surprise.
Identifier grammar. A variable name and a state name each match
^[a-z][a-z0-9_]*$ (ASCII snake_case). TOML quoted/dotted keys that
would smuggle other characters ("last-seen", "a.b") are a load
error. The restriction exists because variable names appear as bare
Name tokens in predicates (parsed by ast.parse); a non-identifier
could not be one.
Three owners, one flat reference namespace. The [vars.operator],
[vars.code], and [vars.agent] subtables decide who may write a
variable. They do not create three separate read namespaces. Every
variable is referenced everywhere (templates and predicates alike) by
its bare name only: positions, never vars.code.positions and
never code.positions. The owner prefix never appears in a reference.
Three consequences, each a machine check error:
- Global uniqueness across owners. A name may be declared in exactly one of the three subtables. Declaring positions in both [vars.code] and [vars.agent] is rejected: "variable positions declared in both [vars.code] and [vars.agent]; the three owner subtables share one read namespace". Because a bare reference would otherwise be ambiguous, this is forbidden, not resolved by precedence.
- No bare top-level vars. Every variable must live under one of the three owner subtables. A key written directly under [vars] (i.e. vars.positions) has no declared owner and is rejected: "vars.positions has no owner subtable; put it in [vars.operator], [vars.code], or [vars.agent]". It is never silently ignored.
- Reserved names. The bare names vars, operator, code, agent, and result may not be used as variable names. result is reserved for capture scope (below); the rest are reserved so a reference can never be read as an owner path.
Reference grammar (one grammar, used identically in predicates and templates).
ref := name ("." key)*
name := an identifier declared in exactly one [vars.*] subtable
key := an identifier (a declared field of a record type)
The first segment is always a declared variable; the validator checks it
exists. Any further .key segments navigate into a record value as
data: they are ordered dictionary lookups performed by agent6's own
evaluator, not Python attribute access and never getattr. The
worked example's verdict.confidence means "the confidence field of
the classification record verdict", not a Python attribute. A .key
segment is legal only when the value it navigates is a record type
(§4.6): each segment is checked against the schema at load, so a
misspelled field is a load error. Dotting an opaque json value, or a
scalar, is a load error: json is wholesale-only by construction
(§4.2), which is what keeps every navigable path statically checkable.
Capture scope and result. Inside a state's capture table the
reserved name result denotes the structured output the state just
produced, and is visible only there. result is not a blackboard
variable, cannot be declared, and is invisible outside the capturing
state. Whether result is navigable follows the same one rule as
every other value (§4.2): it may be dotted only when it is typed by an
output_schema record; for an agent state that schema is mandatory,
for a tool state it is optional (declare it to read fields; omit it
and result is opaque and whole-capture only). A capture has two
forms of target:
- a fixed source key (stdout_json for tool, finish_json for agent) naming one blackboard variable to receive the whole output;
- a set = { <var> = "<template>" } table assigning rendered templates (which may read result / result.<field>) to blackboard variables.
What a capture may write is the ownership wall (§4.2): a tool capture
targets only [vars.code] names; an agent capture only [vars.agent]
names; targeting a [vars.operator] name or an undeclared name is a
load error. The captured value's runtime type must match the target
variable's declared type, or the machine halts loudly.
State-name namespace. State names ([states.<name>]) form a
separate namespace from variables: they are referenced only by
initial, goto, and on targets, never inside predicates or
templates, so a state and a variable may share a name without ambiguity.
Every goto/on target must name a declared state (load error
otherwise), and every declared state must be reachable from initial
(load error otherwise).
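Two of the checks in this subsection, the identifier grammar and reachability from initial, are easy to illustrate. The sketch below is not the agent6 implementation: the helper names and the edge map (loosely modeled on the item-classifier example) are assumptions.

```python
import re
from collections import deque

# The identifier pattern quoted above, applied to the whole name.
IDENT = re.compile(r"^[a-z][a-z0-9_]*$")


def is_identifier(name: str) -> bool:
    return IDENT.fullmatch(name) is not None


def unreachable_states(initial: str, edges: dict) -> list:
    """Breadth-first walk from `initial`; return declared states never visited.

    `edges` maps each declared state to the list of states its on/goto
    targets name (an empty list for terminal states).
    """
    seen = {initial}
    queue = deque([initial])
    while queue:
        for target in edges.get(queue.popleft(), []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(set(edges) - seen)


# Hypothetical edge map for illustration only.
edges = {
    "poll": ["scan"],
    "scan": ["have_items", "poll"],
    "have_items": ["classify", "poll"],
    "classify": ["route", "poll", "halt"],
    "route": ["record", "poll"],
    "record": ["poll"],
    "halt": [],
    "orphan": ["poll"],
}
print(is_identifier("poll_secs"), is_identifier("last-seen"), is_identifier("a.b"))
print(unreachable_states("poll", edges))  # ['orphan']
```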
4.6 Record schemas ([schemas.*])
A record type is a named, field-typed structure declared once under
[schemas.<name>] and used in two places: as a variable's type
(making the variable navigable, §4.2) and as an agent state's
output_schema (validating the finish_run payload at the trust
boundary). One mechanism serves both, so there is exactly one way to
describe structured data in a machine.
The schema language is intentionally tiny: inline TOML, no JSON
Schema, no new dependency (tomllib + pydantic only). Each entry is
field = "<type>" or field = { type = "<type>", ... }:
[schemas.classification]
label = { type = "str", enum = ["urgent", "normal", "spam"] }
confidence = "float"
note = { type = "str", optional = true }
Rules (all enforced at machine check):
| Rule | Behavior |
|---|---|
| Field types | str, int, float, bool, list[<scalar>], another schema name (recursive; cycles are a load error), or json (opaque escape hatch; itself not dottable, §4.2) |
| Required by default | every field must be present in a validated payload unless optional = true (mirrors Config's extra="forbid"); unknown fields are rejected |
| enum | string fields only; constrains a str to a fixed literal list, checked at the finish_run/capture boundary (earlier than a branch would re-check it) |
| Dotting | a .field in a predicate/template is type-checked against the schema (field must exist); a list/json/non-record field may not be dotted further |
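To make the rules concrete, here is a minimal sketch of validating a payload against the classification schema. It is hand-rolled for illustration rather than the pydantic model the runtime uses, covers scalar fields only (no lists, nested records, or json), and assumes an int is acceptable where a float is declared.

```python
def _is_type(value, type_name: str) -> bool:
    if type_name == "bool":
        return isinstance(value, bool)
    if isinstance(value, bool):
        return False  # bool is an int subclass in Python; reject it elsewhere
    if type_name == "str":
        return isinstance(value, str)
    if type_name == "int":
        return isinstance(value, int)
    if type_name == "float":
        return isinstance(value, (int, float))  # assumption: ints are accepted
    raise ValueError(f"type {type_name!r} is outside this sketch")


def validate_record(payload: dict, schema: dict) -> list:
    """Return validation errors; an empty list means the payload conforms."""
    errors = [f"unknown field {name!r}" for name in sorted(set(payload) - set(schema))]
    for name, spec in schema.items():
        if isinstance(spec, str):  # shorthand form: field = "<type>"
            spec = {"type": spec}
        if name not in payload:
            if not spec.get("optional", False):
                errors.append(f"missing required field {name!r}")
            continue
        value = payload[name]
        if not _is_type(value, spec["type"]):
            errors.append(f"field {name!r} is not of type {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"field {name!r} is not one of {spec['enum']}")
    return errors


# The [schemas.classification] table above, as the dict a TOML parser yields.
classification = {
    "label": {"type": "str", "enum": ["urgent", "normal", "spam"]},
    "confidence": "float",
    "note": {"type": "str", "optional": True},
}
print(validate_record({"label": "urgent", "confidence": 0.9}, classification))  # []
print(validate_record({"label": "meh", "confidence": 0.9, "x": 1}, classification))
```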
4.7 Machine config overlay ([config])
A machine file may carry an optional top-level [config] table: an
ordinary agent6 config fragment that layers on top of the effective
repo/global/default config for the duration of the machine run. It is
the highest-precedence config layer: the machine overlay wins over repo and
global config (a "machine [config] < --config" ordering does not apply here),
and every knob agent6 config show lists is valid inside it.
[config.workflow]
verify_command = ["uv", "run", "pytest", "-q"]
[config.review]
trigger = "on_verify_fail"
[config.budget]
best_effort_usd_limit = 50.0
Unset keys read straight through to the lower layers, so a machine only states what it wants to change. Two hard rules:
- No connections/secrets, no sandbox policy. A [config.providers.*] or [config.sandbox.*] block is a load-time error. Provider endpoints, api-key env names, and secret values live in the global config / secrets store; sandbox policy (network egress incl. allow_urls, run_commands, .git protection) is an operator decision in the global/repo config. A machine file may be LLM-drafted or shared, so it must not be able to widen its own egress or weaken its jail through the overlay. The overlay can only route to a provider name that already exists in the effective config.
- Per-agent-state knobs (§4.3) override the overlay for that one state. Precedence for an agent loop is therefore: per-state knob > machine [config] > repo config > global config > built-in default.
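The precedence chain behaves like a layered lookup. A minimal sketch with collections.ChainMap follows; it handles flat keys only (real config is nested), and the helper name and all knob values are made up for illustration.

```python
from collections import ChainMap


def effective_knobs(per_state, machine_overlay, repo, global_config, defaults):
    """Layer the five sources; ChainMap searches left to right, so earlier wins."""
    return ChainMap(per_state, machine_overlay, repo, global_config, defaults)


# Illustrative values only; none of these are agent6's real defaults.
knobs = effective_knobs(
    per_state={"temperature": 0.2},
    machine_overlay={"temperature": 0.5, "thinking": "high"},
    repo={"thinking": "low", "max_output_tokens": 4096},
    global_config={"max_output_tokens": 2048, "provider": "anthropic"},
    defaults={"temperature": 1.0, "thinking": "off", "provider": "default"},
)
print(knobs["temperature"])        # 0.2, from the per-state knob
print(knobs["thinking"])           # high, from the machine overlay
print(knobs["max_output_tokens"])  # 4096, from the repo config
print(knobs["provider"])           # anthropic, from the global config
```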
5. Execution semantics
5.1 The engine as a pure reducer
load(file) -> Machine # pydantic, extra=forbid, frozen
blackboard = Machine.initial_vars()
state = Machine.initial
loop:
event = execute(state, blackboard) # the ONLY impure step
journal.append(event) # append-only, fsync
blackboard = reduce(blackboard, event) # pure
state = next_state(Machine, state, event, blackboard) # pure
snapshot(state, blackboard) # atomic temp+rename
if state is terminal: break
execute is the only place the outside world is touched (run an agent,
run a tool, read the clock). Its result is written to the journal as a
fact before the blackboard is updated. reduce and next_state are
pure. Therefore replaying the journal reproduces the exact path,
including which branch was taken, because the captured outputs that the
branch reads are in the journal.
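The loop above can be made runnable in miniature to show why replay reproduces the path. This is a sketch under heavy simplification: no fsync, no snapshots, no branch states, and the event shape (label, captured) is an assumption rather than the engine's journal format.

```python
import json


def reduce(blackboard: dict, event: dict) -> dict:
    """Pure: fold one journaled event into a new blackboard."""
    return {**blackboard, **event.get("captured", {})}


def next_state(machine: dict, state: str, event: dict) -> str:
    """Pure: follow the edge named by the executor-produced label."""
    return machine["states"][state]["on"][event["label"]]


def run(machine: dict, execute, journal: list) -> tuple:
    blackboard = dict(machine["initial_vars"])
    state = machine["initial"]
    path = [state]
    while machine["states"][state]["kind"] != "terminal":
        event = execute(state, blackboard)                 # the only impure step
        journal.append(json.dumps(event, sort_keys=True))  # record the fact first
        blackboard = reduce(blackboard, event)
        state = next_state(machine, state, event)
        path.append(state)
    return path, blackboard


def replaying(journal: list):
    """Build an executor that feeds recorded facts instead of touching the world."""
    facts = iter(journal)
    return lambda state, blackboard: json.loads(next(facts))


machine = {
    "initial": "scan",
    "initial_vars": {"cursor": ""},
    "states": {
        "scan": {"kind": "tool", "on": {"ok": "done", "nonzero": "failed", "timeout": "failed"}},
        "done": {"kind": "terminal"},
        "failed": {"kind": "terminal"},
    },
}


def fake_execute(state, blackboard):
    # Stand-in for a real jailed tool run.
    return {"state": state, "label": "ok", "captured": {"cursor": "item-42"}}


journal = []
live = run(machine, fake_execute, journal)
replayed = run(machine, replaying(journal), [])
print(live == replayed)  # True
```

The replay executor is just another implementation of execute; reduce and next_state are shared unchanged between live and replayed runs.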
5.2 Determinism guarantees and the predicate evaluator
- Branch edges are pure functions of the blackboard, which is itself a pure function of journaled events. No branch ever depends on un-logged state.
- The predicate evaluator is a hand-written recursive evaluator over a small AST. The expression is parsed with ast.parse(..., mode="eval") and then walked against a strict allow-list of node types: Compare, BoolOp, UnaryOp, Name, Constant, a fixed-name Call allow-list, and Attribute nodes reinterpreted as record data-field navigation (§4.5, §4.6), never as Python attribute access. Anything outside the allow-list raises at machine check time. The evaluator parses but never eval/execs, never calls getattr, and never resolves arbitrary Python names: an Attribute chain is walked against the blackboard dict, a Name must be a declared variable, and any other free name is a load error.
- Wall-clock, randomness, and external reads are captured as facts. A --replay <journal> mode feeds recorded facts instead of touching the world, so a completed run replays to the identical path offline.
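The allow-list walk described above can be sketched as follows. This is a minimal illustration, not agent6's evaluator: it rejects disallowed syntax when the predicate is evaluated rather than at machine check, allows only len() as a call, and does not handle negative literals.

```python
import ast
import operator

_COMPARE = {
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
    ast.In: lambda a, b: a in b, ast.NotIn: lambda a, b: a not in b,
}


def evaluate(expr: str, blackboard: dict):
    """Parse with ast, then walk an allow-list of nodes. Never eval/exec."""
    return _walk(ast.parse(expr, mode="eval").body, blackboard)


def _walk(node, bb):
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name):
        if node.id not in bb:
            raise ValueError(f"undeclared variable {node.id!r}")
        return bb[node.id]
    if isinstance(node, ast.Attribute):
        # Record navigation: a dict lookup on data, never getattr.
        return _walk(node.value, bb)[node.attr]
    if isinstance(node, ast.BoolOp):
        values = (_walk(v, bb) for v in node.values)  # lazy, so it short-circuits
        return all(values) if isinstance(node.op, ast.And) else any(values)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
        return not _walk(node.operand, bb)
    if isinstance(node, ast.Compare):
        left = _walk(node.left, bb)
        for op, comparator in zip(node.ops, node.comparators):
            if type(op) not in _COMPARE:
                raise ValueError(f"disallowed comparison: {type(op).__name__}")
            right = _walk(comparator, bb)
            if not _COMPARE[type(op)](left, right):
                return False
            left = right
        return True
    if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
            and node.func.id == "len" and len(node.args) == 1 and not node.keywords):
        return len(_walk(node.args[0], bb))
    raise ValueError(f"disallowed syntax: {type(node).__name__}")


bb = {"verdict": {"label": "urgent", "confidence": 0.9}, "pending": ["a"]}
print(evaluate("verdict.label == 'urgent' and verdict.confidence >= 0.7", bb))  # True
print(evaluate("len(pending) > 1", bb))  # False
```

An expression such as __import__('os').system('id') parses, but its top node is a Call whose func is not the bare name len, so the walk raises before anything is executed.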
5.3 Persistence layout
Mirrors the existing per-run layout under the per-repo state dir, out of the workspace:
<state-dir>/<repo-id>/machines/<machine-id>/
journal.jsonl # append-only, fsync'd, one event per line
snapshots/<n>.json # blackboard + current state, atomic temp+rename
agents/<state>/<n>/ # nested agent6 run dirs (snapshots, transcripts)
machine.lock # single-writer guard (one process per machine)
Sizing for long-running machines: both the journal and snapshots grow
monotonically, roughly one journal line (~200 B) plus one snapshot (a few
KB) per transition. A 10-minute-interval machine makes ~150k transitions
a year (3 per tick on the idle path), on the order of tens of MB of
journal and a few hundred MB of snapshots. There is no automatic
rotation in v1; archive or delete an instance directory once its history
is no longer needed for replay, and size [budget] max_transitions as
the primary runaway guard.
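The sizing estimate is plain arithmetic and can be reproduced for other intervals. The sketch below uses the figures quoted above (10-minute interval, 3 transitions per tick, ~200 B per journal line) and assumes 2 KiB as a stand-in for "a few KB" per snapshot.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def yearly_footprint(interval_secs: int, transitions_per_tick: int,
                     journal_bytes: int, snapshot_bytes: int) -> tuple:
    """Return (transitions, journal bytes, snapshot bytes) for one year."""
    ticks = SECONDS_PER_YEAR // interval_secs
    transitions = ticks * transitions_per_tick
    return transitions, transitions * journal_bytes, transitions * snapshot_bytes


# snapshot_bytes=2048 is an assumed value for "a few KB".
transitions, journal_size, snapshot_size = yearly_footprint(600, 3, 200, 2048)
print(transitions)                        # 157680
print(round(journal_size / 1e6, 1))       # 31.5 (MB)
print(round(snapshot_size / 1e6, 1))      # 322.9 (MB)
```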
5.4 Idempotency and crash recovery
Each side-effecting state execution gets a deterministic step id
(<state>, <transition-count>). On restart the engine reads the journal:
if the last line is an in-progress state.begin with no matching
state.end, the step is re-attempted only if it is known-idempotent
(tool/agent reads), otherwise it surfaces for operator decision. The
default posture is at-least-once for reads, never-silently-twice for
writes: destructive tools must be authored to be idempotent (the same
discipline the rest of agent6 already follows).
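The restart decision can be sketched as a scan of the journal for an unmatched state.begin. The event names state.begin and state.end come from the text above; the field names (type, state, n), both helper names, and the returned action strings are assumptions for illustration.

```python
import json


def find_interrupted_step(journal_lines: list):
    """Return the (state, n) step id of a state.begin with no matching state.end."""
    open_step = None
    for line in journal_lines:
        event = json.loads(line)
        step = (event.get("state"), event.get("n"))
        if event["type"] == "state.begin":
            open_step = step
        elif event["type"] == "state.end" and open_step == step:
            open_step = None
    return open_step


def recovery_action(step, idempotent_states: set) -> str:
    """Retry only known-idempotent steps; otherwise hand the decision to the operator."""
    if step is None:
        return "continue"
    return "retry" if step[0] in idempotent_states else "ask_operator"


journal_lines = [
    json.dumps({"type": "state.begin", "state": "scan", "n": 7}),
    json.dumps({"type": "state.end", "state": "scan", "n": 7}),
    json.dumps({"type": "state.begin", "state": "record", "n": 8}),
]
step = find_interrupted_step(journal_lines)
print(step)                              # ('record', 8)
print(recovery_action(step, {"scan"}))   # ask_operator
```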
6. Reliability for 24/7 operation
- Restartable, not resident. A wait state can either block in-process or persist the next wake time and exit 0, to be re-armed by a systemd timer / cron. Either way the journal is the source of truth, so a reboot loses nothing.
- Runaway guards. The [budget] USD field and [budget].max_transitions stop the machine when crossed. A machine that loops forever without a wait and without spending is still bounded by max_transitions.
- Single writer. machine.lock (flock) guarantees one process per machine id; a second invocation refuses rather than double-acting.
- Health/visibility. agent6 machine status <id> prints the current state, blackboard, last N events, spend, and next wake. agent6 machine graph <file> emits a mermaid or Graphviz-DOT diagram (selected with --format; reachability is already computed at load).
7. CLI surface
| command | effect |
|---|---|
| agent6 machine create <task> [-o <file>] [--max-attempts N] | LLM-drafted machine bundle: the .asm.toml plus every scripts/... file its tool states run, plus a scripts/<name>_test.py mock test per script with an external seam (network/clock/files). Each draft is gated before acceptance: machine check validation, ruff lint, ty type check, and the mock tests executed in a no-network jail; failures loop back to the model with the failing source (up to --max-attempts, default 3). Writes a draft the operator reviews, edits, and commits; running it still requires the operator (see §9). |
| agent6 machine check <file> | validate: parse, type-check vars, verify every edge target exists, every state reachable, every branch total, every variable name unique across owners and owned by a subtable (no bare vars.*), every reference resolving to a declared variable, every capture writing a var owned by the writing state kind (tool → [vars.code], agent → [vars.agent], [vars.operator] read-only), the script bundle (scripts/ entries + static scripts/... command refs stay inside the bundle), and static script health (ruff lint + ty type check). No execution, no network. |
| agent6 machine test <file> [--blackboard FIXTURE.toml] | everything check does, plus the bundle's scripts/*_test.py mock tests executed in a no-network jail, plus a pure dry-run (no provider/clock): per state, synthesize the success fact it would emit (a tool's output_schema-shaped JSON / an agent's finish_run payload), push it through the real reduce, and confirm the capture binds and the produced label routes to a declared state; per branch, evaluate each when clause against the declared defaults overlaid with --blackboard and print the winning goto. The full offline simulation: plumbing, schema, routing, and script behavior with every seam mocked (no real network, no model calls). |
| agent6 machine graph <file> [--format mermaid\|dot] | emit the machine as a diagram. mermaid (default) prints stateDiagram-v2; dot prints Graphviz DOT for dot -Tsvg/dot -Tpng and the broader Graphviz/xdot ecosystem. Reachability is already computed at load, so both are pure renders of the same validated graph. |
| agent6 machine run <file> [--exit-on-wait] | start (or resume) a machine. Acquires the lock, drives the loop. With --exit-on-wait, persist the next wake and exit 0 (status waiting) at the first not-ready wait, for an external scheduler (systemd timer / cron) to resume. |
| agent6 machine status <id> | current state, blackboard, spend, next wake. Read-only. |
| agent6 machine poke <id> | signal a waiting instance to wake on its next check. |
| agent6 machine replay <id> | deterministic replay from the journal (no world I/O); backtesting. |
machine check is the human-editability payoff: precise, fail-loud
diagnostics (state "act": branch is not total (no else); add { else =
true, goto = ... }).
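To make the "every branch total" check concrete, here is a minimal sketch of how such a diagnostic could be produced from a parsed when list. The function name and the dict-based clause shape are assumptions for illustration; the message text mirrors the diagnostic quoted above.

```python
def check_branch_total(state_name: str, when: list[dict]) -> list[str]:
    """Return fail-loud diagnostics for one branch state (empty if it is total)."""
    problems: list[str] = []
    # A branch is total when some clause is an unconditional `else = true`.
    has_else = any(clause.get("else") is True for clause in when)
    if not has_else:
        problems.append(
            f'state "{state_name}": branch is not total (no else); '
            "add { else = true, goto = ... }"
        )
    return problems
```

Returning a list of messages rather than raising on the first problem lets a checker report every defect in one pass, which suits a file the operator edits by hand.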
7.1 machine create: LLM drafts, operator owns¶
machine create lets the operator describe a loop in plain language
("poll this location, classify new items, take a step on high
confidence") and get a first-cut machine bundle back instead of
authoring it by hand. It is an ordinary jailed agent6 loop with a
specialized prompt: the model is handed this document's grammar (state
kinds, the three-owner blackboard
([vars.operator]/[vars.code]/[vars.agent]), the total-branch rule)
and the task, and is told to return one complete machine by calling
finish_run with a result.toml field holding the entire .asm.toml
and a result.scripts map holding every scripts/... file the tool
states run (plus a scripts/<name>_test.py mock test per script with an
external seam); no new tool and no file-writing capability is
granted. The CLI extracts the bundle and gates it: the same
machine check validation, ruff lint, ty type check, and the mock tests
executed in a no-network jail. Failures (with the failing source) loop
back to the model, up to --max-attempts (default 3), until the
bundle passes. Retries include the prior draft AND its scripts so the
model patches the named problem instead of regenerating from scratch.
On success the validated bundle is written as a draft: with -o <file>
the .asm.toml is written there (overwriting freely); otherwise to
<machine-name>.asm.toml in the working directory, which is never
overwritten (on a name collision the validated draft is printed to
stdout and the command exits non-zero so nothing is clobbered). Scripts
land in scripts/ next to the machine file. Status, spend, and notes go
to stderr.
Crucially this does not weaken the "machines are operator artifacts"
invariant (§9): create only ever drafts a file into the working tree.
The operator reviews it, fine-tunes the constants/prompts, and commits
it; machine run still refuses anything the operator has not committed.
Drafting is assistance; authorization stays human.
8. Where it lives (module boundaries)¶
The tach DAG is cli → machine → workflows → agents → tools → sandbox,
and workflows never import each other. An agent state needs to
invoke the loop workflow, so the engine cannot itself be a workflow
without breaking that rule.
agent6.machine is a top-level package the CLI depends on. The key
boundary decision: the engine does not import the workflow stack.
Rather than constructing a Workflow itself, engine.drive runs an
agent state through an injected agent_runner callable
(Callable[[AgentRequest], AgentExecResult]). The CLI, which already
depends on both agent6.machine and agent6.workflows, builds that
runner and the orchestration around machine create/run, so
agent6.machine never gains an edge into agent6.workflows and the tach
graph stays acyclic.
Files (all from __future__ import annotations, strict pyright, pydantic
only at the parse boundary, @dataclass(frozen=True, slots=True) for the
internal value types):
- machine/model.py: pydantic MachineSpec/state/var specs, semantic validation, and finish_run payload validation.
- machine/predicate.py: the allow-list AST predicate evaluator.
- machine/template.py: the single interpolation/splicing engine shared by the validator and the runtime.
- machine/graph.py: the mermaid/DOT renderers.
- machine/journal.py: append-only event log, snapshots, locking, and persisted-wake state.
- machine/engine.py: the deterministic reducer loop.
- machine/authoring.py: the dependency-free prompt scaffolding for machine create (grammar guide, per-attempt prompt builder, draft extractor).
No new runtime dependency (tomllib + pydantic + stdlib ast).
9. Security considerations (must not weaken anything in AGENTS.md)¶
- No new LLM tool surface. The fixed set in tools/schema.py is unchanged. Machines orchestrate existing capabilities; the LLM inside an agent state sees the same tools it always did. machine create is no exception: the drafting agent runs the same fixed toolset and returns its .asm.toml through the existing finish_run payload, not a new file-writing tool.
- No arbitrary code execution from a file. Predicates and templates are parsed-then-walked against an allow-list; never eval/exec, never getattr. Dotted references are agent6-interpreted json data navigation (§4.5), not Python attribute resolution. A .asm.toml file is data, not code.
- All side effects stay jailed. tool states go through run_in_jail; each agent state is an ordinary jailed run in its own self-confining subprocess (the engine is a host-netns supervisor). The per-state network model and its refusals are specified in security.md §8.
- Spend bounds. [budget].max_transitions is required and always binds. The USD limit is optional, at most one of: max_usd (hard: machine run refuses up front when a covered agent state's model has no price data; per-state max_usd likewise) or best_effort_usd_limit (binds only when spend is measurable; for unpriced or local models). Spend is metered as an estimate: reported cost when available, else cached price times tokens.
- Machines are operator artifacts, never LLM-authored. The threat model assumes the file is written by the operator and reviewed like code. An LLM proposing a machine is fine, and agent6 machine create (§7.1) explicitly drafts one, but running one requires the operator to review and commit it. machine create writes only into the working tree and never auto-runs; machine run operates on committed files. Drafting is assistance; authorization stays human.
- External-world tools remain out of scope. Adding any tool that reaches the network or an external service is a separate change requiring the tools/schema.py security-review trailer and a network/jail audit. The examples in this document use illustrative stand-in tools only.
The commits implementing this feature carry a "Security review note:"
covering the parser trust boundary, the predicate allow-list, and
confirmation that no new network endpoint or LLM tool was added.
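The "parsed-then-walked against an allow-list" rule for predicates can be illustrated with Python's stdlib ast module. This is a sketch of the technique, not the contents of machine/predicate.py: the set of allowed node types and the function names are assumptions, chosen to cover the predicates used in this document's examples.

```python
import ast
import operator

# Assumed allow-list of comparison operators.
_COMPARE = {
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
}


def evaluate(expr: str, blackboard: dict) -> object:
    """Parse a predicate, then walk its AST; anything not listed is refused."""
    return _walk(ast.parse(expr, mode="eval").body, blackboard)


def _walk(node: ast.AST, bb: dict) -> object:
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name):
        return bb[node.id]
    if isinstance(node, ast.Attribute):
        # Dotted reference: plain dict navigation, never getattr().
        base = _walk(node.value, bb)
        if not isinstance(base, dict):
            raise ValueError(f"cannot navigate .{node.attr} on a non-record value")
        return base[node.attr]
    if isinstance(node, ast.BoolOp):
        values = (_walk(v, bb) for v in node.values)
        return all(values) if isinstance(node.op, ast.And) else any(values)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
        return not _walk(node.operand, bb)
    if isinstance(node, ast.Compare) and len(node.ops) == 1:
        op = _COMPARE.get(type(node.ops[0]))
        if op is None:
            raise ValueError("comparison operator not allowed")
        return op(_walk(node.left, bb), _walk(node.comparators[0], bb))
    if (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "len"
        and len(node.args) == 1
        and not node.keywords
    ):
        return len(_walk(node.args[0], bb))
    raise ValueError(f"node type {type(node).__name__} is not allowed")
```

An expression such as __import__('os') parses fine but is rejected during the walk, because a call to anything other than len is not on the list; that is the property that keeps the file data rather than code.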
10. Worked example (full)¶
# item-classifier.asm.toml (ILLUSTRATIVE). scan-inbox/archive-item are
# stand-in audited tools, not part of agent6; they only show the *shape*.
machine = "item-classifier"
version = 1
initial = "poll"
[budget]
max_usd = 25.0
max_transitions = 100000
[vars.operator] # operator inputs, fixed for the life of the machine
inbox_dir = { type = "str", value = "/srv/inbox" }
poll_secs = { type = "int", value = 300 }
[vars.code] # set deterministically by a tool capture
pending = { type = "list[str]", default = [] } # set by the scan tool
cursor = { type = "str", default = "" } # set by the scan tool
[vars.agent] # set by an agent state's finish_run
verdict = { type = "classification", default = {} } # set by classify's finish_run
[schemas.classification] # validates the agent's finish_run payload
label = { type = "str", enum = ["urgent", "normal", "spam"] }
confidence = "float"
[schemas.scan_result] # types the scan tool's stdout so fields are navigable
pending = "list[str]"
cursor = "str"
[states.poll]
kind = "wait"
every_secs = "{{ poll_secs }}" # exactly one of every_secs | until | cron
on = { tick = "scan", signal = "scan" }
[states.scan]
kind = "tool"
command = ["scan-inbox", "--dir", "{{ inbox_dir }}", "--since", "{{ cursor }}"]
output_schema = "scan_result"
capture = { set = { pending = "{{ result.pending }}", cursor = "{{ result.cursor }}" } }
timeout_secs = 60
on = { ok = "have_items", nonzero = "poll", timeout = "poll" }
[states.have_items]
kind = "branch"
when = [
  { if = "len(pending) == 0", goto = "poll" },
  { else = true, goto = "classify" },
]
[states.classify]
kind = "agent"
model = "claude-sonnet-4-6"
prompt = """
Classify these pending items: {{ pending | json }}
Call finish_run with JSON {label:"urgent"|"normal"|"spam", confidence:0..1}.
"""
output_schema = "classification"
capture = { finish_json = "verdict" }
timeout_secs = 600
on = { ok = "route", failed = "poll", budget_exhausted = "halt", timeout = "poll" }
[states.route]
kind = "branch"
when = [
  { if = "verdict.label == 'urgent' and verdict.confidence >= 0.7", goto = "record" },
  { else = true, goto = "poll" },
]
[states.record]
kind = "tool"
# `{{ pending }}` is a lone list reference -> spliced to one argv element per item
command = ["archive-item", "--label", "{{ verdict.label }}", "{{ pending }}"]
timeout_secs = 30
on = { ok = "poll", nonzero = "poll", timeout = "poll" }
[states.halt]
kind = "terminal"
status = "failed"
reason = "machine budget exhausted"
Rendered control flow (what agent6 machine graph would emit):
stateDiagram-v2
[*] --> poll
poll --> scan: tick/signal
scan --> have_items: ok
scan --> poll: nonzero/timeout
have_items --> poll: no items
have_items --> classify: else
classify --> route: ok
classify --> poll: failed/timeout
classify --> halt: budget_exhausted
route --> record: urgent & conf>=.7
route --> poll: else
record --> poll
halt --> [*]
11. Implementation status¶
Implemented in full under src/agent6/machine/ (model, predicate, graph,
engine, journal) and exposed via the machine subcommands (§7), without
touching run/review. All state kinds, crash recovery, replay, the
agent state, the 24/7 ergonomics (status/poke, persisted-wake,
per-agent spend), and machine create are covered by unit tests.
12. Resolved decisions¶
Settled design choices, recorded so the rationale travels with the spec:
- wait runtime: the format journals an absolute next-wake instant; the v1 runtime is plain in-process blocking (§4.3, §6). A persisted-wake/systemd driver can run the identical file later.
- Schema language: inline [schemas.*] TOML (§4.6), not JSON Schema; no new dependency, human-editable, one mechanism for both output_schema validation and navigable record vars.
- agent writes: exactly one validated finish_run payload per agent state is the LLM's only write channel (§4.2); multiple outputs are fields of one record.
- Concurrency: strictly sequential (one active state, no fork/join); compose by running independent machines. fork/join may come later.
- json navigability: opaque json is wholesale-only; anything navigated with .field must be a declared record type (§4.6), so every path is statically checkable.
- List → argv: no join filter; a lone "{{ listvar }}" argv element is spliced to one element per item (§4.4).
- Naming: subcommand machine; suffix .asm.toml (.a6m.toml fallback, §4).
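The List → argv decision is easy to see in code. The sketch below is an assumption-laden illustration, not machine/template.py: the regexes, the helper names, and the choice to reject a list reference embedded in a larger string are all hypothetical; only the rule that a lone "{{ listvar }}" element splices to one argv element per item comes from the spec.

```python
import re

_REF = r"\{\{\s*([A-Za-z_][A-Za-z0-9_.]*)\s*\}\}"
_LONE_REF = re.compile(f"^{_REF}$")
_ANY_REF = re.compile(_REF)


def _resolve(path: str, bb: dict) -> object:
    """Navigate a dotted reference through nested dicts."""
    value: object = bb
    for part in path.split("."):
        value = value[part]  # type: ignore[index]
    return value


def _scalar(match: re.Match[str], bb: dict) -> str:
    value = _resolve(match.group(1), bb)
    if isinstance(value, list):
        # Assumption: a list may only appear as a lone argv element.
        raise ValueError(f"list reference {match.group(1)!r} must stand alone")
    return str(value)


def render_argv(command: list[str], bb: dict) -> list[str]:
    """Render a command template; lone list references splice per item."""
    argv: list[str] = []
    for element in command:
        lone = _LONE_REF.match(element)
        if lone:
            value = _resolve(lone.group(1), bb)
            if isinstance(value, list):
                argv.extend(str(item) for item in value)
                continue
        argv.append(_ANY_REF.sub(lambda m: _scalar(m, bb), element))
    return argv
```

Applied to the record state of the worked example, each pending item arrives as its own argv element, so no shell quoting or join separator is ever involved.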