Fleet Framework: Streamlining Distributed Automation at Scale

The rise of large‑scale automation workloads has exposed a gap between simple scripts and heavyweight orchestration platforms. Teams that need to run dozens or hundreds of identical tasks across a fleet of machines often find themselves stitching together ad‑hoc solutions that are fragile, hard to monitor, and difficult to scale. A new open‑source project called Fleet Framework aims to fill that niche by providing a lightweight, opinionated layer that turns a collection of workers into a coherent, observable system without imposing a full‑blown scheduler or service mesh. Its origins lie in a production‑grade CAPTCHA‑solving operation, where the developers learned the hard way what can go wrong when distributed agents interact over unreliable networks and shared state.

Fleet Framework began life as the control plane for a Cloudflare Turnstile‑solving farm that processed millions of challenges per day. That environment surfaced a long list of failure modes: zombie subprocesses that lingered after workers crashed, master state loss when Redis restarted, generation counters that regressed after a roll‑out, worker slots that were recycled without proper healing, and configuration drift that appeared when heterogeneous hosts ran different versions of the same automation. The framework’s design addresses each of these pain points directly, turning hard‑won operational lessons into reusable building blocks. The result is a system that not only runs the work but also self‑heals, reconciles state, and keeps operators informed through a lightweight dashboard.

At its core, Fleet Framework follows a classic master/worker pattern but adds a few critical abstractions. The master node holds the canonical configuration, pushes updates to workers, and maintains a Redis‑backed store that serves as the source of truth for slot state, counters, and event logs. Workers subscribe to typed output streams, allowing one automation to publish structured data that another can consume without needing to know the underlying transport. This stream‑based inter‑automation communication enables pipelines where, for example, a SERP scraper feeds extracted URLs into a content‑classification worker, all while preserving type safety and back‑pressure handling.
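
The typed‑stream idea can be sketched in a few lines. This is only an illustration and assumes nothing about Fleet Framework's real API: TypedStream, ExtractedUrl, and run_pipeline are hypothetical names, and a bounded asyncio.Queue stands in for whatever transport the framework actually uses. The bounded queue is what supplies back‑pressure here, because a publisher pauses whenever the consumer falls behind.

```python
import asyncio
from dataclasses import dataclass
from typing import Generic, List, Tuple, Type, TypeVar

T = TypeVar("T")
_CLOSED = object()  # sentinel that marks the end of a stream


@dataclass(frozen=True)
class ExtractedUrl:
    """Hypothetical message type published by a SERP scraper."""
    url: str


class TypedStream(Generic[T]):
    """Hypothetical typed stream backed by a bounded in-memory queue."""

    def __init__(self, item_type: Type[T], maxsize: int = 2) -> None:
        self._item_type = item_type
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)

    async def publish(self, item: T) -> None:
        # Reject items of the wrong type before they reach a consumer.
        if not isinstance(item, self._item_type):
            raise TypeError(f"expected {self._item_type.__name__}")
        # put() waits while the queue is full: this is the back-pressure.
        await self._queue.put(item)

    async def close(self) -> None:
        await self._queue.put(_CLOSED)

    def __aiter__(self) -> "TypedStream[T]":
        return self

    async def __anext__(self) -> T:
        item = await self._queue.get()
        if item is _CLOSED:
            raise StopAsyncIteration
        return item


async def run_pipeline(urls: List[str]) -> List[Tuple[str, str]]:
    # The stream is created inside the running event loop.
    stream = TypedStream(ExtractedUrl, maxsize=2)

    async def scraper() -> None:
        for url in urls:
            await stream.publish(ExtractedUrl(url))
        await stream.close()

    async def classifier() -> List[Tuple[str, str]]:
        # Toy classification rule, purely for illustration.
        return [
            (item.url, "docs" if "/docs" in item.url else "other")
            async for item in stream
        ]

    _, labels = await asyncio.gather(scraper(), classifier())
    return labels
```

Calling asyncio.run(run_pipeline([...])) with a list of URLs returns one (url, label) pair per input, in publication order. Neither side needs to know how the other is implemented, only the message type they share.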

The project is split into two pip‑installable packages. The fleet‑core package contains the master daemon, worker shell, reconciliation loop, Redis store, event bus, metrics collection, and a small React‑based dashboard that visualises slot utilisation, error rates, and throughput. The optional fleet‑browser add‑on supplies a battle‑tested Chromium pool with automatic fingerprint rotation, proxy authentication support, and robust orphan‑process cleanup. Separating the browser concerns keeps the core lightweight for non‑browser workloads such as API calls, file processing, or database jobs, while still offering a ready‑made solution for web‑scraping or UI‑automation scenarios.

Defining a new automation task is deliberately concise. Users subclass either ContinuousAutomation for long‑running, polling‑style jobs or BatchAutomation for finite, data‑driven workloads. A Pydantic model declares the expected configuration schema, which the framework validates automatically when the master pushes updates. The automation class implements a few lifecycle hooks (setup, execute, and teardown) that receive typed input streams and emit typed output streams. In practice, most automations settle into roughly thirty lines of code, letting developers focus on the business logic rather than boilerplate plumbing.
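
To give a feel for the shape of such a class, here is a rough sketch that runs on the standard library alone. It does not import Fleet Framework: the ContinuousAutomation base class below is a local stand‑in with assumed hook signatures, a dataclass with manual checks stands in for the Pydantic config model, and PingConfig, PingAutomation, and run_once are hypothetical names.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class PingConfig:
    """Stand-in for the Pydantic config model described above."""
    target: str
    interval_seconds: float = 5.0

    def __post_init__(self) -> None:
        # Manual validation; a Pydantic model would declare these rules instead.
        if not self.target:
            raise ValueError("target must not be empty")
        if self.interval_seconds <= 0:
            raise ValueError("interval_seconds must be positive")


class ContinuousAutomation(ABC):
    """Local stand-in for the framework base class; hook signatures are assumed."""

    def __init__(self, config: Any) -> None:
        self.config = config

    def setup(self) -> None:
        """Acquire resources before the first execute call."""

    @abstractmethod
    def execute(self, inputs: Iterable[Any]) -> Iterator[Any]:
        """Consume input items and yield output items."""

    def teardown(self) -> None:
        """Release resources, even after a failure."""


class PingAutomation(ContinuousAutomation):
    def setup(self) -> None:
        self.seen = 0

    def execute(self, inputs: Iterable[str]) -> Iterator[Dict[str, Any]]:
        for payload in inputs:
            self.seen += 1
            yield {"target": self.config.target, "payload": payload, "seq": self.seen}

    def teardown(self) -> None:
        self.seen = 0


def run_once(automation: ContinuousAutomation, inputs: Iterable[Any]) -> List[Any]:
    """Drive the three hooks in order, always calling teardown."""
    automation.setup()
    try:
        return list(automation.execute(inputs))
    finally:
        automation.teardown()
```

The business logic lives entirely in execute; everything else is lifecycle scaffolding that a framework of this kind would normally own.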

Packaging and deployment follow familiar Python conventions. After writing the automation class and its Pydantic config, developers add a single entry‑point line to pyproject.toml that points to the class. Building the package and publishing it to a private index (or even installing directly from a Git repository) makes it available on both master and worker nodes. A simple pip install command on each host pulls in fleet‑core, fleet‑browser if needed, and the custom automation package. The framework then takes over: it reads the configuration, validates it, assigns slots, starts workers, and begins the reconcile loop that constantly converges the desired state with the observed state.
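
The entry‑point mechanism mentioned above is standard Python packaging, so discovery on the framework side could look roughly like the sketch below. The group name fleet.automations and the package path in the comment are assumptions made for illustration; the project's actual group name is not stated here. Only importlib.metadata from the standard library is used.

```python
import json
from importlib.metadata import EntryPoint, entry_points

# Hypothetical entry-point group; the real name may differ.
GROUP = "fleet.automations"

# A matching pyproject.toml line (package and class names are made up):
#
#   [project.entry-points."fleet.automations"]
#   ping = "my_automations.ping:PingAutomation"


def discover_automations(group: str = GROUP) -> dict:
    """Map entry-point names to EntryPoint objects for one group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10 and later expose select().
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9 return a plain dict keyed by group.
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


def load_automation(ep: EntryPoint):
    """Import the module and return the object the entry point names."""
    return ep.load()


# An entry point is just "module:attribute"; json.dumps is used here only
# because it is always importable.
demo = EntryPoint(name="demo", value="json:dumps", group=GROUP)
assert load_automation(demo) is json.dumps
```

Because discovery reads installed package metadata, a worker only sees an automation once the package that declares it has been installed on that host.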

Reconciliation is the framework’s main operational mechanism. The master continuously compares the desired configuration (number of slots, resource limits, version tags) with what Redis reports about each worker. When discrepancies appear, such as a worker that has gone silent or a slot that reports an error, the framework automatically triggers healing actions: restarting the worker process, recycling the slot, or draining and re‑assigning work. Metrics are emitted to Prometheus‑compatible endpoints, and the dashboard offers real‑time views of slot health, throughput, and error budgets, enabling operators to spot anomalies before they cascade into incidents.
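
One pass of such a reconcile loop can be written as a pure function from desired and observed state to a set of actions. The sketch below is illustrative only: Action, SlotReport, plan_actions, the 30‑second heartbeat timeout, and the exact mapping from symptom to action are all assumptions, not the framework's actual rules.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Action(Enum):
    START = "start"      # slot should exist but nothing is reporting
    RESTART = "restart"  # worker has gone silent
    RECYCLE = "recycle"  # slot reports an error or runs the wrong version
    STOP = "stop"        # slot exists but is no longer wanted


@dataclass(frozen=True)
class SlotReport:
    """What the store is assumed to hold for one slot."""
    status: str            # "ok" or "error"
    last_heartbeat: float  # seconds, same clock as `now`
    version: str


def plan_actions(
    desired_slots: int,
    desired_version: str,
    observed: Dict[int, SlotReport],
    now: float,
    heartbeat_timeout: float = 30.0,
) -> Dict[int, Action]:
    """Compare desired and observed state and return the healing actions."""
    actions: Dict[int, Action] = {}
    for slot in range(desired_slots):
        report = observed.get(slot)
        if report is None:
            actions[slot] = Action.START
        elif now - report.last_heartbeat > heartbeat_timeout:
            actions[slot] = Action.RESTART
        elif report.status == "error" or report.version != desired_version:
            actions[slot] = Action.RECYCLE
    # Anything beyond the desired slot count is surplus.
    for slot in observed:
        if slot >= desired_slots:
            actions[slot] = Action.STOP
    return actions
```

Keeping the planning step free of side effects makes it easy to unit‑test and to run repeatedly: when observed state already matches desired state, the function returns an empty dict and the loop does nothing.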

The optional fleet‑browser add‑on solves a set of thorny problems that frequently plague large‑scale web automation. Chromium instances are launched in isolated sandboxes, each receiving a freshly rotated fingerprint that varies user‑agent, screen size, canvas, and WebGL properties to reduce the chances of bot detection. Proxy credentials are injected per‑session and rotated on a configurable schedule, allowing the fleet to respect rate limits and geo‑targeting requirements. A dedicated reaper monitors child processes and ensures that no orphaned Chrome or chromedriver processes linger after a worker crashes, preventing resource exhaustion on the host machines.
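
The orphan‑cleanup part of this can be illustrated with plain process groups. The following is a POSIX‑only sketch, not the fleet‑browser reaper itself: spawn_in_own_group and kill_process_tree are hypothetical helper names, and the five‑second grace period is an arbitrary choice.

```python
import os
import signal
import subprocess
import sys
from typing import List


def spawn_in_own_group(cmd: List[str]) -> subprocess.Popen:
    """Start a child as the leader of a new session and process group.

    Anything the child spawns inherits that group unless it opts out,
    so the whole tree can later be signalled at once.
    """
    return subprocess.Popen(cmd, start_new_session=True)


def kill_process_tree(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Terminate the child's process group, escalating to SIGKILL."""
    # With start_new_session=True the group id equals the child's pid.
    try:
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # the group is already gone
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        pass  # the leader ignored SIGTERM; fall through to SIGKILL
    # Sweep up any group members that outlived the leader.
    try:
        os.killpg(proc.pid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    return proc.wait()


if __name__ == "__main__":
    child = spawn_in_own_group([sys.executable, "-c", "import time; time.sleep(60)"])
    kill_process_tree(child)
```

A real browser pool would apply the same idea to each Chromium launch, so that helper processes die together with the instance that started them.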

Beyond the browser pool, the framework tackles several systemic failure patterns that emerged from the original CAPTCHA farm. Zombie subprocess accumulation is prevented by using process groups and careful signal handling, so that a worker receiving SIGTERM cleans up its entire process tree. To guard against master state loss on Redis restart, the framework stores snapshots of critical metadata in a persistent backend and reconstructs state on reconnection, which eliminates the gen‑counter regression in which counters would unexpectedly decrease. Slot‑recycle drops are mitigated by a protocol resembling two‑phase commit, which marks a slot as free only after the worker acknowledges completion. Config drift is detected through version hashes that trigger a rolling update when workers fall behind the master’s desired version.
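
The slot‑release and generation‑counter ideas can be sketched as a small state machine. This is one possible reading of the description, not the framework's implementation: Slot, SlotState, and every method name are hypothetical, and all state is held in memory, whereas the real system keeps it in Redis with persistent snapshots.

```python
from enum import Enum


class SlotState(Enum):
    FREE = "free"
    BUSY = "busy"
    RELEASING = "releasing"  # release requested, worker has not confirmed yet


class Slot:
    """Hypothetical slot with a two-step release and a monotonic generation."""

    def __init__(self) -> None:
        self.state = SlotState.FREE
        self.generation = 0

    def assign(self) -> int:
        """Hand the slot to a worker and return the new generation."""
        if self.state is not SlotState.FREE:
            raise RuntimeError("slot is not free")
        self.generation += 1
        self.state = SlotState.BUSY
        return self.generation

    def request_release(self, generation: int) -> bool:
        """Step 1: the master asks for the slot back."""
        if generation != self.generation or self.state is not SlotState.BUSY:
            return False  # stale or out-of-order request
        self.state = SlotState.RELEASING
        return True

    def acknowledge(self, generation: int) -> bool:
        """Step 2: the worker confirms completion; only now is the slot free."""
        if generation != self.generation or self.state is not SlotState.RELEASING:
            return False
        self.state = SlotState.FREE
        return True

    def restore_generation(self, snapshot_generation: int) -> None:
        """After a store restart, never let the counter move backwards."""
        self.generation = max(self.generation, snapshot_generation)
```

Tagging every message with the generation lets the master discard acknowledgements from a previous occupant of the slot, and taking the maximum on restore keeps the counter from regressing when an older snapshot is reloaded.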

Fleet Framework is intentionally minimalist. It does not attempt to be a general‑purpose scheduler like Kubernetes, nor does it provide a full service mesh with traffic management and advanced routing. Instead, it offers a thin, opinionated layer aimed at the specific pattern of “N machines each doing M parallel things” where the work is homogeneous, repetitive, and benefits from centralized configuration and observability. Teams that need complex workflow orchestration, dynamic DAG scheduling, or multi‑tenant isolation should look elsewhere, but those who simply want to run a fleet of scrapers, validators, or data enrichment jobs with reliable lifecycle management will find the framework a good fit.

As of this writing, the project is still in pre‑alpha. The public API is expected to evolve, and there has not yet been an official release on PyPI; installation is currently done directly from the source repository. Several production deployments are already running on Fleet Framework, but they rely heavily on operational runbooks and manual oversight to smooth over the rough edges that accompany early‑stage software. The project maintains a CHANGELOG and a ROADMAP.md that outline the path toward a v0.1 release, which will stabilize the core interfaces and introduce versioned compatibility guarantees.

For engineers considering Fleet Framework, the first step is to evaluate whether the problem domain matches its sweet spot: repetitive, horizontally scalable tasks that can be expressed as independent workers with well‑defined inputs and outputs. Start by building a minimal automation subclass and testing it on a single master‑worker pair to verify config push, slot lifecycle, and stream communication. Monitor the changelog closely and pin your dependencies to a specific commit or tag until the API stabilizes. Invest in basic observability (Prometheus metrics and dashboard alerts) early, because the framework’s self‑healing features are most effective when operators can see what is happening in real time. Finally, engage with the community by reporting any edge cases you encounter; the project’s origins in a high‑stakes production environment mean that real‑world feedback is invaluable for shaping the next stable release.