Fast-CrewAI vs CrewAI benchmarks, explained
Methodology, raw numbers, and honest caveats behind the 34.5× serialization, 17.3× tool execution, and 11.2× memory search claims. Everything you need to reproduce the results.
Benchmarks are a form of argument. A good benchmark tells you what was measured, on what hardware, with what inputs, and under what assumptions. A bad benchmark tells you a number. This guide is the long version of the numbers we publish on the home page — the methodology behind 34.5× serialization, 17.3× tool execution, and 11.2× memory search — plus the caveats that stop us calling them a free lunch.
The headline numbers
| Component | Improvement | Rust throughput | Python throughput | Memory savings |
|---|---|---|---|---|
| Serialization (serde vs json) | 34.5× | 80,525 ops/s | 2,333 ops/s | 58% less |
| Tool execution (cached) | 17.3× | 11,616 ops/s | 670 ops/s | 99% less |
| Memory search (FTS5 vs LIKE) | 11.2× | 10,206 ops/s | 913 ops/s | 31% less |
| Database query (pooled) | 1.3× | — | — | — |
All numbers are from the benchmark_test/ suite in the Fast-CrewAI repository, run against crewai==1.7.2 on the CI matrix (Linux x86_64, Python 3.12).
What “34.5× faster serialization” actually means
The serialization benchmark measures one thing: the cost of encoding and decoding the kind of payload CrewAI passes between its internal components. Specifically, it creates a representative agent message — a dict with roles, content, metadata, and a nested tool result — and runs it through:
- `json.dumps`/`json.loads` — the Python stdlib path CrewAI uses by default
- `serde_json` exposed via PyO3 — the path Fast-CrewAI patches in
The benchmark reports operations per second, wall-clock, and peak memory. The 34.5× number is 80,525 ops/s (serde) divided by 2,333 ops/s (Python json) on the same payload.
What this does not tell you: whether your application code calling json.dumps directly will get 34× faster. It won’t, unless you also switch to a faster serializer. What it tells you is: the JSON cost inside CrewAI’s message passing, memory persistence, and tool result handling — which is where the hot path runs — is 34× cheaper after patching.
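To make the methodology concrete, here is a minimal round-trip harness in the same spirit as the benchmark. The payload shape and field names are illustrative stand-ins for a CrewAI agent message, not the actual structures the suite uses:

```python
import json
import time

# Hypothetical payload shaped like an agent message: role, content,
# metadata, and a nested tool result. Field names are illustrative.
payload = {
    "role": "assistant",
    "content": "Summarize the quarterly report.",
    "metadata": {"agent": "researcher", "task_id": 42},
    "tool_result": {"name": "search", "output": ["doc1", "doc2"], "ok": True},
}

def ops_per_second(encode, decode, seconds=1.0):
    """Round-trip the payload repeatedly and report throughput."""
    count = 0
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        decode(encode(payload))
        count += 1
    return count / seconds

stdlib_ops = ops_per_second(json.dumps, json.loads)
print(f"stdlib json: {stdlib_ops:,.0f} round-trips/s")

# To compare against a Rust-backed serializer, swap in e.g. orjson
# (pip install orjson); orjson.dumps returns bytes, which orjson.loads accepts:
#   import orjson
#   fast_ops = ops_per_second(orjson.dumps, orjson.loads)
```

The absolute numbers you see will differ from the table above; the point is the ratio between the two paths on identical payloads.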
What “17.3× faster tool execution” actually means
The tool execution benchmark simulates a common pattern: an agent calls the same tool with the same arguments, many times. This is what LLMs actually do. The benchmark uses a mock tool with a known execution cost and measures:
- Baseline: CrewAI’s default `BaseTool.run` path, including validation, schema parsing, and result serialization — no caching.
- Accelerated: Fast-CrewAI’s Rust-backed executor with result caching (configurable TTL), serde-based JSON validation, and statistics tracking.
On repeated identical invocations, the cached path clocks 11,616 ops/s vs 670 ops/s uncached. That’s 17.3×, and it uses 99% less memory because the cache returns pre-serialized results instead of rebuilding them.
What this does not tell you: whether your tools will get 17× faster. If your tool makes an HTTP call and the remote server takes 300ms, Fast-CrewAI can’t make that faster — but it can skip the call entirely on a cache hit, which is usually what you actually want. The benchmark measures the framework overhead that survives even when the underlying tool is fast.
A note on cache correctness: every cache is a correctness hazard. Fast-CrewAI’s tool cache is off by default and opt-in per tool. You tell it explicitly which tools are safe to cache and for how long.
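The opt-in, per-tool TTL idea can be sketched in a few lines of Python. This mirrors the concept only — the class and method names here are invented for illustration and are not Fast-CrewAI's API:

```python
import hashlib
import json
import time

class CachedToolExecutor:
    """Illustrative sketch of opt-in, per-tool result caching with a TTL."""

    def __init__(self):
        self._cache = {}   # key -> (expires_at, pre-serialized result)
        self._ttls = {}    # tool name -> TTL in seconds (explicitly opted in)

    def mark_cacheable(self, tool_name, ttl_seconds):
        self._ttls[tool_name] = ttl_seconds

    def run(self, tool_name, func, args):
        ttl = self._ttls.get(tool_name)
        if ttl is None:
            return json.dumps(func(**args))  # not cacheable: always execute
        # Key on tool name + arguments, so identical invocations collide.
        key = hashlib.sha256(
            json.dumps([tool_name, args], sort_keys=True).encode()
        ).hexdigest()
        now = time.monotonic()
        hit = self._cache.get(key)
        if hit and hit[0] > now:
            return hit[1]                    # cache hit: pre-serialized result
        serialized = json.dumps(func(**args))
        self._cache[key] = (now + ttl, serialized)
        return serialized

executor = CachedToolExecutor()
executor.mark_cacheable("lookup", ttl_seconds=60)

calls = {"n": 0}
def lookup(city):
    calls["n"] += 1
    return {"city": city, "population": 1_000_000}

executor.run("lookup", lookup, {"city": "Oslo"})
executor.run("lookup", lookup, {"city": "Oslo"})  # served from cache
print(calls["n"])  # the underlying tool executed only once
```

Note the failure mode the opt-in design guards against: a tool with side effects, or one whose output depends on external state, must never be marked cacheable.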
What “11.2× faster memory search” actually means
CrewAI’s memory backend uses SQLite. The default text search uses LIKE '%needle%', a full table scan whose cost grows linearly with the number of stored memories.
The memory benchmark seeds a SQLite database with a few thousand realistic memory entries (agent observations, tool outputs, entity facts) and measures retrieval throughput:
- Baseline: `LIKE '%query%'` across all entries
- Accelerated: SQLite’s FTS5 virtual table with BM25 relevance ranking
FTS5 hits 10,206 ops/s vs 913 ops/s for LIKE. That’s 11.2×. It also returns better results, because BM25 is a real relevance ranking and LIKE is a substring match.
What this does not tell you: how your semantic search setup compares. If you’ve already replaced CrewAI’s default memory backend with a vector database, you’re not hitting this code path and these numbers don’t apply to you. If you’re using the default, the gain is direct.
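The two query paths are easy to see side by side with the stdlib `sqlite3` module, assuming your Python's SQLite build includes FTS5 (most do). The table names and entries below are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Plain table: searched with LIKE, which forces a full scan.
conn.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, text TEXT)")
# FTS5 virtual table: an inverted index with BM25 relevance ranking.
conn.execute("CREATE VIRTUAL TABLE memories_fts USING fts5(text)")

entries = [
    "agent observed a spike in API error rates",
    "tool output: quarterly revenue grew 12 percent",
    "entity fact: Acme Corp is headquartered in Berlin",
]
for text in entries:
    conn.execute("INSERT INTO memories (text) VALUES (?)", (text,))
    conn.execute("INSERT INTO memories_fts (text) VALUES (?)", (text,))

# Baseline: substring match, no ranking, O(n) in table size.
like_rows = conn.execute(
    "SELECT text FROM memories WHERE text LIKE '%revenue%'"
).fetchall()

# FTS5: indexed term lookup, ordered by BM25 (lower rank = more relevant).
fts_rows = conn.execute(
    "SELECT text FROM memories_fts WHERE memories_fts MATCH 'revenue' "
    "ORDER BY rank"
).fetchall()

print(like_rows == fts_rows)  # same hit here, but FTS5 also ranks results
```

On three rows both queries are instant; the gap the benchmark measures opens up as the table grows, because LIKE rescans everything while FTS5 consults its index.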
What about end-to-end workflow speed?
This is the number that actually matters, and it’s the one we’re most careful about. On a full CrewAI workflow — not a micro-benchmark, but a real crew doing real work — the speedup depends heavily on what the workflow does:
- LLM-heavy, memory-light workflow: typically 1.1–1.3× faster end-to-end. Your LLM latency dominates and Fast-CrewAI can only speed up the overhead between calls.
- Memory- and tool-heavy workflow: typically 2–5× faster end-to-end. These are the workflows where CrewAI overhead is a meaningful fraction of total time.
- Mostly database-bound workflow: typically 3–5× faster. FTS5 alone earns most of the gain.
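The ranges above follow directly from Amdahl's law: only the framework-overhead fraction of run time benefits from the acceleration. A quick back-of-the-envelope calculation (the 20× overhead speedup is an assumed round figure, not a measured one):

```python
def end_to_end_speedup(overhead_fraction, overhead_speedup):
    """Amdahl's law: LLM latency is untouched; only the framework
    overhead fraction of total run time gets faster."""
    remaining = (1 - overhead_fraction) + overhead_fraction / overhead_speedup
    return 1 / remaining

# LLM-heavy: 10% overhead. Even a 20x overhead speedup barely moves the needle.
print(round(end_to_end_speedup(0.10, 20), 2))  # ~1.1

# Tool/memory-heavy: 60% overhead. Now the same 20x lands in the 2-5x range.
print(round(end_to_end_speedup(0.60, 20), 2))  # ~2.33
```

This is why the honest answer to "how much faster will my crew be?" starts with profiling what fraction of your run time is framework overhead.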
The test-comparison target in the Fast-CrewAI Makefile runs a representative basic workflow. On our hardware it reports roughly 13% improvement on the simplest case (27.88s → 24.50s over 3 iterations). That’s a real number; it’s also the floor of the range.
Memory usage, the other metric nobody talks about
CPU time is the metric people benchmark; memory is the metric that actually takes down production pipelines. Fast-CrewAI is aggressive about memory:
| Component | Python peak | Rust peak | Savings |
|---|---|---|---|
| Tool execution | 1.2 MB per op | 0.0 MB | 99% less |
| Serialization | 8.0 MB per op | 3.4 MB | 58% less |
| Database | 0.1 MB | 0.1 MB | 31% less |
The tool execution number looks too good to be true. It’s not — it’s a consequence of returning pre-serialized cached results instead of rebuilding Python dicts every call. For a worker processing thousands of tool calls, that’s the difference between staying under your container memory limit and getting OOM-killed.
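You can observe the same effect yourself with `tracemalloc`: rebuilding a serialized result on every call allocates each time, while returning a pre-serialized cached string allocates almost nothing. The payload below is a made-up stand-in, not the benchmark's:

```python
import json
import tracemalloc

# Illustrative payload; the benchmark's real messages differ.
payload = {"role": "assistant", "content": "x" * 10_000,
           "metadata": {"task_id": 1}}

def peak_kib(fn, iterations=1_000):
    """Peak Python heap allocation while running fn repeatedly."""
    tracemalloc.start()
    for _ in range(iterations):
        fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 1024

# Rebuilding the result: a fresh ~10 KB string allocated per call.
rebuild = peak_kib(lambda: json.dumps(payload))

# Cached path: the same pre-serialized string returned every time.
cached_result = json.dumps(payload)
cached = peak_kib(lambda: cached_result)

print(rebuild > cached)  # the cached path allocates far less
```

The exact KiB figures depend on your Python version; the ordering does not.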
How to reproduce
If you want to run the benchmarks yourself:
```bash
git clone https://github.com/neul-labs/fast-crewai.git
cd fast-crewai
uv sync --dev
uv run maturin develop --release

make benchmark                    # internal micro-benchmarks
make test-comparison              # full CrewAI workflow comparison
make test-comparison-extensive    # 1000 iterations
```
Results vary with hardware, Python version, and — importantly — with what kind of workflow you feed it. We’d rather you run the benchmarks against your own workload than trust ours.
The caveats we won’t sweep under the rug
- LLM latency is not in scope. If 90% of your run time is spent waiting on `gpt-4o`, no framework optimization will rescue you. Profile first.
- The fallback path is real. If a Rust wheel isn’t available for your platform, Fast-CrewAI silently falls back to pure Python. You get compatibility, not speed.
- Task parallelism gains depend on your DAG. We can’t parallelize what’s genuinely sequential.
- The cache is opt-in. You have to mark tools as cacheable; we won’t do it for you.
- This project is in active development. Memory and database acceleration are production-ready. Task and tool acceleration are being optimized continuously.
Going deeper
- Why CrewAI gets slow — the long version of the bottleneck analysis.
- Migrating to Fast-CrewAI — what the one-line switch actually does.
- Book a performance audit — we measure your workload against these benchmarks for real.