# RFC-0001: Workflow + fanout job kinds

## Summary

Add two new job kinds to CronLord: `workflow` (run a command only after a named set of other jobs have succeeded in the current window) and `fanout` (run a command once per input item with bounded concurrency, optionally followed by a reducer step). Both are opt-in; existing jobs are unaffected.
## Motivation
Three user stories today cannot be expressed without external tools:

- Fan-in digest. I run `subdomain-monitor`, `nuclei-scan`, and `h1-sweep` every morning. I want a `daily-digest` that runs after all three finish and only if they all succeeded.
- Fanout. I have 200 bug-bounty targets in a text file. I want one scheduled job that spawns N parallel child runs, one per target, with a concurrency cap.
- Map-reduce. A fanout across N targets followed by a reducer step that merges all child outputs into a single artifact.

Today you either chain jobs via `webhook_url` (fragile, no ordering guarantee, doubles run count), or you introduce Airflow/Dagster (massive overkill for a tool whose whole pitch is “one binary, one TOML”).
The new kinds fit CronLord’s existing primitives (runs, statuses, logs, SSE, retries, timeouts) — they don’t introduce a new execution engine.
## Design

### Non-goals

- Full DAG with diamond dependencies. Fan-in only. Real DAGs are a v2 conversation.
- Triggering upstream jobs. `workflow` depends on jobs as they run on their own schedules. It does not wake them.
- Cross-host coordination. Local scheduler only at first; worker support tracked separately (see `deployment.md`).
### `workflow` kind

```toml
[[jobs]]
id = "daily-digest"
kind = "workflow"
schedule = "30 14 * * *"   # after the jobs it waits on
timezone = "UTC"
depends_on = ["subdomain-monitor", "nuclei-scan", "h1-sweep"]
window_sec = 86400         # how far back to look for dep success
on_missing = "skip"        # skip | fail | run
on_dep_fail = "skip"       # skip | fail | run
command = "/opt/jobs/digest.sh"  # or JSON for http, same contract as shell/http kinds
```
#### Semantics

At the job’s next scheduled fire time:

- For each `depends_on[]` id, query the latest run within the last `window_sec` seconds.
- If every dep has a `status = "success"` run inside the window → execute `command` using the resolved inner kind (`shell`, `http`, `claude`).
- If any dep has no run in the window → follow `on_missing`.
- If any dep has a `fail` or `timeout` run in the window → follow `on_dep_fail`.

The workflow’s own run has a status of `skipped` when it declines to execute. The UI distinguishes `skipped` from `failed`.
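A minimal sketch of that check, assuming a hypothetical `latest_run(job_id, since)` helper that returns the newest run for a job after a timestamp (or `None`), and a job object carrying the TOML fields above; names are illustrative, not the final API:

```python
import time

def evaluate_workflow(job, now: float | None = None) -> str:
    """Return "run", "skip", or "fail" for a workflow at fire time."""
    now = now if now is not None else time.time()
    since = now - job.window_sec
    for dep_id in job.depends_on:
        run = latest_run(dep_id, since)   # hypothetical helper: newest run after `since`
        if run is None:
            return job.on_missing         # no run in the window → on_missing
        if run.status in ("fail", "timeout"):
            return job.on_dep_fail        # failed/timed-out dep → on_dep_fail
        # otherwise run.status == "success" and this dep is satisfied
    return "run"                          # every dep succeeded inside the window
```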
**Why pull, not push?** The dep jobs don’t need to know they have downstream consumers. They run on their own schedules with no coupling. The workflow job is the single point of coordination.
**Inner kind resolution.** `command` uses the same parsing rules as today — a bare string = `shell`, a JSON object with `method`/`url` = `http`. A new optional `inner_kind` field can force `claude`:

```toml
inner_kind = "claude"
command = "Read the three dep artifacts in /var/lib/cronlord/digests and write a summary."
```
### `fanout` kind

```toml
[[jobs]]
id = "scan-targets"
kind = "fanout"
schedule = "0 6 * * *"
inputs_from = "file:///etc/cronlord/targets.txt"
# or "http://...", or "sql://main?SELECT url FROM targets WHERE enabled = 1"
# or "job://subdomain-monitor?output_as_lines"
concurrency = 4
per_item_timeout_sec = 1800
partial_tolerance = 0.1    # tolerate up to 10% child failures before parent fails
item_var = "FANOUT_ITEM"   # env var each child sees
command = "nuclei -u $FANOUT_ITEM -silent -o /var/lib/cronlord/nuclei/$FANOUT_ITEM.log"

[jobs.reducer]
command = "python3 /opt/jobs/digest.py /var/lib/cronlord/nuclei"
run_on = "success"         # success | always | any_success
```
#### Semantics

At scheduled fire time:

- Resolve `inputs_from` → ordered list of items.
- Create a parent run. Spawn up to `concurrency` child runs. Each child sees `FANOUT_ITEM` and a `CRONLORD_PARENT_RUN_ID` env var.
- As children finish, start more until the list is exhausted.
- When all children complete:
  - compute failure ratio
  - parent status = `success` if ratio ≤ `partial_tolerance`, else `fail`
  - parent status may be `partial` if `partial_tolerance > 0` and ratio > 0
- Run `reducer.command` based on `reducer.run_on`. Reducer inherits parent timeout.
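A minimal sketch of that loop, using a thread pool as a stand-in for CronLord’s real runner. The `job` fields mirror the TOML above; the status mapping reflects one reading of the rules (`partial` when some children fail but the ratio stays within tolerance). All names are assumptions:

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_fanout(job, items: list[str], parent_run_id: int) -> str:
    """Run one child per item with bounded concurrency; return the parent status."""

    def run_child(item: str) -> bool:
        env = {**os.environ,
               job.item_var: item,                              # e.g. FANOUT_ITEM
               "CRONLORD_PARENT_RUN_ID": str(parent_run_id)}
        try:
            proc = subprocess.run(job.command, shell=True, env=env,
                                  timeout=job.per_item_timeout_sec)
        except subprocess.TimeoutExpired:
            return False
        return proc.returncode == 0

    # The pool caps in-flight children at `concurrency`; as each child
    # finishes, the next item starts until the list is exhausted.
    with ThreadPoolExecutor(max_workers=job.concurrency) as pool:
        results = list(pool.map(run_child, items))

    failure_ratio = results.count(False) / max(len(results), 1)
    if failure_ratio == 0:
        return "success"
    return "partial" if failure_ratio <= job.partial_tolerance else "fail"
```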
### Storage changes

Add to the `runs` table:

```sql
parent_run_id INTEGER NULL   -- REFERENCES runs(id)
item          TEXT NULL      -- the FANOUT_ITEM value
role          TEXT NULL      -- 'parent' | 'child' | 'reducer' | NULL
```
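For concreteness, a sketch of the migration, assuming SQLite (nullable columns with no default keep existing rows valid; the database path is illustrative):

```python
import sqlite3

RUN_GRAPH_MIGRATION = (
    "ALTER TABLE runs ADD COLUMN parent_run_id INTEGER NULL REFERENCES runs(id)",
    "ALTER TABLE runs ADD COLUMN item TEXT NULL",
    "ALTER TABLE runs ADD COLUMN role TEXT NULL",
)

def migrate(db_path: str = "cronlord.db") -> None:  # path is illustrative
    with sqlite3.connect(db_path) as conn:
        for ddl in RUN_GRAPH_MIGRATION:
            conn.execute(ddl)
```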
UI lists children nested under parent. SSE streams events for all three roles.
### Input resolvers for `fanout`

| Scheme | Example | Behavior |
|---|---|---|
| `file://` | `file:///etc/cronlord/targets.txt` | One non-empty, non-`#` line per item. |
| `http(s)://` | `https://api.example.com/targets` | Fetch; require `text/plain` or a JSON array. |
| `sql://` | `sql://main?SELECT url FROM t WHERE enabled=1` | Use an admin-registered DSN. First column per row. |
| `job://` | `job://subdomain-monitor?output_as_lines` | Read stdout of that job’s most recent successful run within `window_sec`. |
Resolvers live in `src/cronlord/runner/fanout/resolvers/`, one file each. Keeps auth and SSRF isolation per resolver.
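As an illustration of that layout, a minimal `file://` resolver, assuming each module exposes a `resolve(uri) -> list[str]` function (that contract is a sketch, not the settled interface):

```python
# src/cronlord/runner/fanout/resolvers/file.py (sketch)
from urllib.parse import urlparse

def resolve(uri: str) -> list[str]:
    """file:///etc/cronlord/targets.txt → one non-empty, non-# line per item."""
    path = urlparse(uri).path
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh
                if line.strip() and not line.lstrip().startswith("#")]
```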
### Retries

- `workflow`: retries apply to the inner command only, not to waiting on deps. A failed `workflow` run stays failed; next schedule re-checks.
- `fanout`: per-child retry controlled by the parent’s `retry_count`. The parent doesn’t re-fanout; a child that exhausts retries counts as a failed child.
### Cancellation

- Cancelling a `workflow` run cancels the inner command if running.
- Cancelling a `fanout` parent: pending children never start; running children receive SIGTERM (then SIGKILL 2s later). Reducer doesn’t run.
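A sketch of the per-child kill path, assuming children are plain `subprocess.Popen` handles (the 2-second grace period comes from the rule above):

```python
import subprocess

def cancel_child(proc: subprocess.Popen, grace_sec: float = 2.0) -> None:
    proc.terminate()                  # SIGTERM: ask the child to exit cleanly
    try:
        proc.wait(timeout=grace_sec)  # give it the grace period
    except subprocess.TimeoutExpired:
        proc.kill()                   # SIGKILL if it didn't exit in time
```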
### Worker executor

- `workflow`: trivially worker-safe (same contract as `shell`/`http`/`claude`).
- `fanout`: parent runs on the scheduler only (it owns the run graph); each child may be dispatched to a worker as today. Concurrency counts globally, not per worker.
### Security

- `inputs_from = "http://..."` — resolver enforces scheme allowlist (`http`, `https`), no redirects to private IPs in production mode.
- `inputs_from = "sql://..."` — DSN registered via admin UI only; never constructed from job config directly.
- `inputs_from = "job://..."` — can only read from jobs in the same CronLord instance, not across hosts.
- Children inherit the parent’s `env`, so secrets in the parent env are visible to children. Document this.
### CLI + API surface

```
POST /api/jobs                    # existing, accepts kind = workflow | fanout
GET  /api/runs?parent_run_id=123  # new filter
GET  /api/runs/123/children       # convenience endpoint
```
No breaking changes to existing endpoints.
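A stdlib-only usage sketch of the new filter (host, port, and run id are placeholders):

```python
import json
import urllib.request

# Equivalent convenience form: GET /api/runs/123/children
url = "http://localhost:8080/api/runs?parent_run_id=123"
with urllib.request.urlopen(url) as resp:
    children = json.load(resp)  # child runs of parent run 123
```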
### UI

- Workflow: render `depends_on` badges with per-dep status dots.
- Fanout: parent run row is expandable, showing N child rows + 1 reducer row. Each child row shows its `item` value.
## Open questions

- Sharded SQLite writes for 1000-child fanouts. Do we need a write-queue? Likely yes; punt to benchmarking.
- Priority across fanout children vs. regular jobs — FIFO or starvation-free? Start with FIFO.
- Retry budget — a workflow with 3 failing deps is wasting the retry slot daily. Add a `cooldown_sec` so it stops re-checking the same day?
- Reducer retry — share the parent’s retry, or independent? Proposed: independent, defaulting to 0.
- Fanout on dynamic input — what if `inputs_from` returns 0 items? Parent finishes immediately; reducer runs iff `run_on = "always"`.
## Migration

Opt-in. Existing TOML files and DB rows need no change. The schema migration adds nullable columns to `runs`; default values keep old rows valid.
## Rollout

- Land `workflow` first — smaller blast radius, no new storage shape beyond a single lookup helper.
- Land `fanout` second — add the run-graph columns and the resolver framework together.
- Ship docs update to `docs/job-kinds.md` in the same PR as each kind.
## Prior art

- GitHub Actions `needs:` (workflow-level DAGs, but on top of jobs).
- Nomad batch jobs’ `parameterized`/dispatch (fanout).
- Dagster `@asset` graphs (DAG, heavier).
- systemd `.timer` + `Requires=`/`Wants=` (unit-level dependencies).
None of those map cleanly to a single-process cron scheduler. This RFC keeps CronLord’s spirit (one TOML block per thing) while covering the real coordination gaps.
## Next steps

- Comments in this file OR open as a GitHub issue:

  ```sh
  gh issue create --repo kdairatchi/CronLord \
    --title "RFC: workflow + fanout job kinds" \
    --body-file docs/rfcs/0001-workflow-fanout.md
  ```

- After comment dust settles, move to `status: accepted`, cut a tracking issue for each kind, and land in two PRs.