# Supervision _Path: en/tutorials/supervision_ ## Table of Contents - Process Supervision ## Content # Process Supervision Monitor and link processes to build fault-tolerant systems. ## Monitoring vs Linking **Monitoring** provides one-way observation: - Parent monitors child - Child exits, parent receives EXIT event - Parent continues running **Linking** creates bidirectional fate-sharing: - Parent and child are linked - Either process fails, both terminate - Unless `trap_links=true` is set ```mermaid flowchart TB subgraph Monitoring["MONITORING (one-way)"] direction TB P1[Parent monitors] -->|EXIT event
parent continues| C1[Child exits] end subgraph Linking["LINKING (bidirectional)"] direction TB P2[Parent linked] <-->|LINK_DOWN
both die| C2[Child exits] end ``` ### Spawn with Monitoring Use `process.spawn_monitored()` to spawn and monitor in one call: ```lua local function main() local events_ch = process.events() -- Spawn worker and start monitoring local worker_pid, err = process.spawn_monitored( "app.workers:task_worker", "app:processes" ) if err then return nil, "spawn failed: " .. tostring(err) end -- Wait for worker to complete local event = events_ch:receive() if event.kind == process.event.EXIT then print("Worker exited:", event.from) if event.result then print("Result:", event.result.value) end if event.result and event.result.error then print("Error:", event.result.error) end end end ``` ### Monitor Existing Process Call `process.monitor()` to start monitoring an already-running process: ```lua local function main() local time = require("time") local events_ch = process.events() -- Spawn without monitoring local worker_pid, err = process.spawn( "app.workers:long_worker", "app:processes" ) if err then return nil, "spawn failed: " .. tostring(err) end -- Start monitoring later local ok, monitor_err = process.monitor(worker_pid) if monitor_err then return nil, "monitor failed: " .. tostring(monitor_err) end -- Cancel the worker time.sleep("5ms") process.cancel(worker_pid, "100ms") -- Receive EXIT event local event = events_ch:receive() if event.kind == process.event.EXIT then print("Worker terminated:", event.from) end end ``` ### Stop Monitoring Use `process.unmonitor()` to stop receiving EXIT events: ```lua local function main() local time = require("time") local events_ch = process.events() -- Spawn and monitor local worker_pid, err = process.spawn_monitored( "app.workers:long_worker", "app:processes" ) time.sleep("5ms") -- Stop monitoring local ok, unmon_err = process.unmonitor(worker_pid) if unmon_err then return nil, "unmonitor failed: " .. tostring(unmon_err) end -- Cancel worker process.cancel(worker_pid, "100ms") -- No EXIT event will be received (we unmonitored) local timeout = time.after("200ms") local result = channel.select { events_ch:case_receive(), timeout:case_receive(), } if result.channel == events_ch then return nil, "should not receive event after unmonitor" end end ``` ### Explicit Linking Use `process.link()` to create a bidirectional link: ```lua -- Worker that links to a target process local function worker_main() local time = require("time") local events_ch = process.events() local inbox_ch = process.inbox() -- Enable trap_links to receive LINK_DOWN events process.set_options({ trap_links = true }) -- Receive target PID from sender local msg = inbox_ch:receive() local target_pid = msg:payload():data() local sender = msg:from() -- Create bidirectional link local ok, err = process.link(target_pid) if err then return nil, "link failed: " .. tostring(err) end -- Notify sender we're linked process.send(sender, "linked", process.pid()) -- Wait for LINK_DOWN when target exits local timeout = time.after("3s") local result = channel.select { events_ch:case_receive(), timeout:case_receive(), } if result.channel == events_ch then local event = result.value if event.kind == process.event.LINK_DOWN then return "LINK_DOWN_RECEIVED" end end return nil, "no LINK_DOWN received" end ``` ### Spawn with Link Use `process.spawn_linked()` to spawn and link in one call: ```lua local function parent_main() -- Enable trap_links to handle child death process.set_options({ trap_links = true }) local events_ch = process.events() -- Spawn and link to child local child_pid, err = process.spawn_linked( "app.workers:child_worker", "app:processes" ) if err then return nil, "spawn_linked failed: " .. tostring(err) end -- If child dies, we receive LINK_DOWN local event = events_ch:receive() if event.kind == process.event.LINK_DOWN then print("Child died:", event.from) end end ``` ## Trap Links By default, when a linked process fails, the current process also fails. Set `trap_links=true` to receive LINK_DOWN events instead. ### Default Behavior (trap_links=false) Without `trap_links`, linked process failure terminates the current process: ```lua local function worker_main() local events_ch = process.events() -- trap_links is false by default local opts = process.get_options() print("trap_links:", opts.trap_links) -- false -- Spawn linked worker that will fail local child_pid, err = process.spawn_linked( "app.workers:error_worker", "app:processes" ) -- When child errors, THIS process terminates -- We never reach this point local event = events_ch:receive() end ``` ### With trap_links=true Enable `trap_links` to receive LINK_DOWN events and survive: ```lua local function worker_main() -- Enable trap_links process.set_options({ trap_links = true }) local events_ch = process.events() -- Spawn linked worker that will fail local child_pid, err = process.spawn_linked( "app.workers:error_worker", "app:processes" ) -- Wait for LINK_DOWN event local event = events_ch:receive() if event.kind == process.event.LINK_DOWN then print("Child failed, handling gracefully") return "LINK_DOWN_RECEIVED" end end ``` ### Send Cancel Signal Use `process.cancel()` to gracefully terminate a process: ```lua local function main() local time = require("time") local events_ch = process.events() -- Spawn and monitor worker local worker_pid, err = process.spawn_monitored( "app.workers:long_worker", "app:processes" ) time.sleep("5ms") -- Cancel with 100ms timeout for cleanup local ok, cancel_err = process.cancel(worker_pid, "100ms") if cancel_err then return nil, "cancel failed: " .. tostring(cancel_err) end -- Wait for EXIT event local event = events_ch:receive() if event.kind == process.event.EXIT then print("Worker cancelled:", event.from) end end ``` ### Handle Cancellation Worker receives CANCEL event through `process.events()`: ```lua local function worker_main() local events_ch = process.events() local inbox_ch = process.inbox() while true do local result = channel.select { inbox_ch:case_receive(), events_ch:case_receive(), } if result.channel == events_ch then local event = result.value if event.kind == process.event.CANCEL then -- Cleanup resources cleanup() return "cancelled gracefully" end else -- Process inbox message handle_message(result.value) end end end ``` ### Star Topology Parent with multiple children linking back to it: ```lua -- Parent worker spawns children that link TO parent local function star_parent_main() local time = require("time") local events_ch = process.events() local child_count = 10 -- Enable trap_links to see children die process.set_options({ trap_links = true }) local children = {} -- Spawn children for i = 1, child_count do local child_pid, err = process.spawn( "app.workers:linker_child", "app:processes" ) if err then error("spawn child failed: " .. tostring(err)) end -- Send parent PID to child process.send(child_pid, "inbox", process.pid()) children[child_pid] = true end -- Wait for all children to confirm link for i = 1, child_count do local msg = process.inbox():receive() if msg:topic() ~= "linked" then error("expected linked confirmation") end end -- Trigger failure - all children should receive LINK_DOWN error("PARENT_STAR_FAILURE") end ``` Child worker that links to parent: ```lua local function linker_child_main() local events_ch = process.events() local inbox_ch = process.inbox() -- Receive parent PID local msg = inbox_ch:receive() local parent_pid = msg:payload():data() -- Link to parent process.link(parent_pid) -- Confirm link process.send(parent_pid, "linked", process.pid()) -- Wait for LINK_DOWN when parent dies local event = events_ch:receive() if event.kind == process.event.LINK_DOWN then return "parent_died" end end ``` ### Chain Topology Linear chain where each node links to its parent: ```lua -- Chain root: A -> B -> C -> D -> E local function chain_root_main() local time = require("time") -- Spawn first child local child_pid, err = process.spawn_linked( "app.workers:chain_node", "app:processes", 4 -- depth remaining ) if err then error("spawn failed: " .. tostring(err)) end -- Wait for chain to build time.sleep("100ms") -- Trigger cascade - all linked processes die error("CHAIN_ROOT_FAILURE") end ``` Chain node spawns next node and links: ```lua local function chain_node_main(depth) local time = require("time") if depth > 0 then -- Spawn next in chain local child_pid, err = process.spawn_linked( "app.workers:chain_node", "app:processes", depth - 1 ) if err then error("spawn failed: " .. tostring(err)) end end -- Wait for parent to die (triggers our death via LINK_DOWN) time.sleep("5s") end ``` ### Configuration ```yaml # src/_index.yaml version: "1.0" namespace: app entries: - name: processes kind: process.host host: workers: 16 lifecycle: auto_start: true ``` ```yaml # src/supervisor/_index.yaml version: "1.0" namespace: app.supervisor entries: - name: pool kind: process.lua source: file://pool.lua method: main modules: - time lifecycle: auto_start: true ``` ### Supervisor Implementation ```lua -- src/supervisor/pool.lua local function main(worker_count) local time = require("time") worker_count = worker_count or 4 -- Enable trap_links to handle worker deaths process.set_options({ trap_links = true }) local events_ch = process.events() local workers = {} local function start_worker(id) local pid, err = process.spawn_linked( "app.workers:task_worker", "app:processes", id ) if err then print("Failed to start worker " .. id .. ": " .. tostring(err)) return nil end workers[pid] = {id = id, started_at = os.time()} print("Worker " .. id .. " started: " .. pid) return pid end -- Start initial pool for i = 1, worker_count do start_worker(i) end print("Supervisor started with " .. worker_count .. " workers") -- Supervision loop while true do local timeout = time.after("60s") local result = channel.select { events_ch:case_receive(), timeout:case_receive(), } if result.channel == timeout then -- Periodic health check local count = 0 for _ in pairs(workers) do count = count + 1 end print("Health check: " .. count .. " active workers") elseif result.channel == events_ch then local event = result.value if event.kind == process.event.LINK_DOWN then local dead_worker = workers[event.from] if dead_worker then workers[event.from] = nil local uptime = os.time() - dead_worker.started_at print("Worker " .. dead_worker.id .. " died after " .. uptime .. "s, restarting") -- Brief delay before restart time.sleep("100ms") start_worker(dead_worker.id) end end end end end return { main = main } ``` ### Worker Definition ```yaml # src/workers/_index.yaml version: "1.0" namespace: app.workers entries: - name: task_worker kind: process.lua source: file://task_worker.lua method: main modules: - time ``` ### Worker Implementation ```lua -- src/workers/task_worker.lua local function main(worker_id) local time = require("time") local events_ch = process.events() local inbox_ch = process.inbox() print("Task worker " .. worker_id .. " started") while true do local timeout = time.after("5s") local result = channel.select { inbox_ch:case_receive(), events_ch:case_receive(), timeout:case_receive(), } if result.channel == events_ch then local event = result.value if event.kind == process.event.CANCEL then print("Worker " .. worker_id .. " cancelled") return "cancelled" elseif event.kind == process.event.LINK_DOWN then print("Worker " .. worker_id .. " linked process died") return nil, "linked_process_died" end elseif result.channel == inbox_ch then local msg = result.value local topic = msg:topic() local payload = msg:payload():data() if topic == "work" then print("Worker " .. worker_id .. " processing: " .. payload) time.sleep("100ms") process.send(msg:from(), "result", "completed: " .. payload) end elseif result.channel == timeout then -- Idle timeout print("Worker " .. worker_id .. " idle") end end end return { main = main } ``` ## Process Host Configuration The process host controls how many OS threads execute processes: ```yaml # src/_index.yaml version: "1.0" namespace: app entries: - name: processes kind: process.host host: workers: 16 # Number of OS threads lifecycle: auto_start: true ``` Workers setting: - Controls parallelism for CPU-bound work - Typically set to number of CPU cores - All processes share this thread pool ## Key Concepts **Monitoring** (one-way observation): - Use `process.spawn_monitored()` or `process.monitor()` - Receive EXIT events when monitored process terminates - Parent continues running after child exits **Linking** (bidirectional fate-sharing): - Use `process.spawn_linked()` or `process.link()` - By default: if either process fails, both terminate - With `trap_links=true`: receive LINK_DOWN events instead **Cancellation**: - Use `process.cancel(pid, timeout)` for graceful shutdown - Worker receives CANCEL event via `process.events()` - Has timeout duration to cleanup before force termination ## Event Types | Event | Triggered By | Required Setup | |-------|--------------|----------------| | `EXIT` | Monitored process exits | `spawn_monitored()` or `monitor()` | | `LINK_DOWN` | Linked process fails | `spawn_linked()` or `link()` with `trap_links=true` | | `CANCEL` | `process.cancel()` called | None (always delivered) | ## Next Steps - [Processes](processes.md) - Process fundamentals - [Channels](channels.md) - Message passing patterns - [Process Module](lua/core/process.md) - API reference ## Navigation Previous: Processes (tutorials/processes) Next: Task Queue (tutorials/task-queue)