
Fast-CrewAI vs CrewAI benchmarks, explained

Methodology, raw numbers, and honest caveats behind the 34.5× serialization, 17.3× tool execution, and 11.2× memory search claims. Everything you need to reproduce the results.

Neul Labs · 10 min read

Benchmarks are a form of argument. A good benchmark tells you what was measured, on what hardware, with what inputs, and under what assumptions. A bad benchmark tells you a number. This guide is the long version of the numbers we publish on the home page — the methodology behind 34.5× serialization, 17.3× tool execution, and 11.2× memory search — plus the caveats that stop us calling them a free lunch.

The headline numbers

| Component | Improvement | Rust throughput | Python throughput | Memory savings |
|---|---|---|---|---|
| Serialization (serde vs json) | 34.5× | 80,525 ops/s | 2,333 ops/s | 58% less |
| Tool execution (cached) | 17.3× | 11,616 ops/s | 670 ops/s | 99% less |
| Memory search (FTS5 vs LIKE) | 11.2× | 10,206 ops/s | 913 ops/s | 31% less |
| Database query (pooled) | 1.3× | | | |

All numbers are from the benchmark_test/ suite in the Fast-CrewAI repository, run against crewai==1.7.2 on the CI matrix (Linux x86_64, Python 3.12).

What “34.5× faster serialization” actually means

The serialization benchmark measures one thing: the cost of encoding and decoding the kind of payload CrewAI passes between its internal components. Specifically, it creates a representative agent message — a dict with roles, content, metadata, and a nested tool result — and runs it through:

  • json.dumps / json.loads — the Python stdlib path CrewAI uses by default
  • serde_json exposed via PyO3 — the path Fast-CrewAI patches in

The benchmark reports operations per second, wall-clock, and peak memory. The 34.5× number is 80,525 ops/s (serde) divided by 2,333 ops/s (Python json) on the same payload.
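The shape of that measurement is easy to sketch. The following is a minimal throughput loop using only the stdlib `json` path (the serde side needs the compiled extension); the payload is a hypothetical stand-in for a CrewAI agent message, not the suite's actual fixture:

```python
import json
import time

# Hypothetical payload shaped like an agent message:
# role, content, metadata, and a nested tool result.
payload = {
    "role": "assistant",
    "content": "Summarize the quarterly report.",
    "metadata": {"agent": "analyst", "step": 3},
    "tool_result": {"name": "search", "output": ["doc-1", "doc-2"], "ok": True},
}

def ops_per_second(n: int = 10_000) -> float:
    """Round-trip the payload n times and report operations per second."""
    start = time.perf_counter()
    for _ in range(n):
        json.loads(json.dumps(payload))
    elapsed = time.perf_counter() - start
    return n / elapsed

print(f"{ops_per_second():,.0f} ops/s")
```

Swap the round-trip line for any other serializer and you have an apples-to-apples comparison on your own hardware.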

What this does not tell you: whether your application code calling json.dumps directly will get 34× faster. It won’t, unless you also switch to a faster serializer. What it tells you is: the JSON cost inside CrewAI’s message passing, memory persistence, and tool result handling — which is where the hot path runs — is 34× cheaper after patching.

What “17.3× faster tool execution” actually means

The tool execution benchmark simulates a common pattern: an agent calls the same tool with the same arguments, many times. This is what LLMs actually do. The benchmark uses a mock tool with a known execution cost and measures:

  • Baseline: CrewAI’s default BaseTool.run path, including validation, schema parsing, and result serialization — no caching.
  • Accelerated: Fast-CrewAI’s Rust-backed executor with result caching (configurable TTL), serde-based JSON validation, and statistics tracking.

On repeated identical invocations, the cached path clocks 11,616 ops/s vs 670 ops/s uncached. That’s 17.3×, and it uses 99% less memory because the cache returns pre-serialized results instead of rebuilding them.

What this does not tell you: whether your tools will get 17× faster. If your tool makes an HTTP call and the remote server takes 300ms, Fast-CrewAI can’t make that faster — but it can skip the call entirely on a cache hit, which is usually what you actually want. The benchmark measures the framework overhead that survives even when the underlying tool is fast.

A note on cache correctness: every cache is a correctness hazard. Fast-CrewAI’s tool cache is off by default and opt-in per tool. You tell it explicitly which tools are safe to cache and for how long.
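The caching pattern itself can be sketched in pure Python. This is an illustration of the idea, not Fast-CrewAI's actual API; the `cached_tool` decorator, its key scheme, and the example tool are all hypothetical:

```python
import functools
import json
import time

def cached_tool(ttl_seconds: float):
    """Cache a tool's result, keyed on its arguments, for ttl_seconds.

    Opt-in per tool: only decorated tools are cached, mirroring the
    rule that the caller decides which tools are safe to cache.
    """
    def decorator(fn):
        cache: dict[str, tuple[float, str]] = {}

        @functools.wraps(fn)
        def wrapper(**kwargs):
            key = json.dumps(kwargs, sort_keys=True)  # deterministic key
            now = time.monotonic()
            hit = cache.get(key)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]  # pre-serialized result, nothing rebuilt
            result = json.dumps(fn(**kwargs))  # serialize once, on miss
            cache[key] = (now, result)
            return result
        return wrapper
    return decorator

@cached_tool(ttl_seconds=60.0)
def lookup(city):
    # Stand-in for an expensive tool call (HTTP request, DB query, ...).
    return {"city": city, "population": 873_000}

first = lookup(city="Turin")   # miss: executes the tool
second = lookup(city="Turin")  # hit: returns the cached serialized result
assert first == second
```

Note that the cache stores the serialized string, not the Python dict: a hit hands back the same immutable string, which is where the memory savings in the benchmark come from.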

What “11.2× faster memory search” actually means

CrewAI’s memory backend uses SQLite. The default text search runs LIKE '%needle%', which is a full table scan, so query cost grows linearly with the number of stored entries.

The memory benchmark seeds a SQLite database with a few thousand realistic memory entries (agent observations, tool outputs, entity facts) and measures retrieval throughput:

  • Baseline: LIKE '%query%' across all entries
  • Accelerated: SQLite’s FTS5 virtual table with BM25 relevance ranking

FTS5 hits 10,206 ops/s vs 913 ops/s for LIKE. That’s 11.2×. It also returns better results, because BM25 is a real relevance ranking and LIKE is a substring match.
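The two query paths can be compared side by side in stock SQLite (assuming your Python build ships with the FTS5 extension, as most official builds do); the table names and rows here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Baseline: plain table, searched with LIKE (full table scan).
conn.execute("CREATE TABLE memories (content TEXT)")
# Accelerated: FTS5 virtual table with BM25 ranking built in.
conn.execute("CREATE VIRTUAL TABLE memories_fts USING fts5(content)")

rows = [
    ("agent observed the deployment failed on node 3",),
    ("tool output: deployment succeeded after retry",),
    ("entity fact: node 3 runs the billing service",),
]
conn.executemany("INSERT INTO memories VALUES (?)", rows)
conn.executemany("INSERT INTO memories_fts VALUES (?)", rows)

# Substring match: no ranking, touches every row.
like_hits = conn.execute(
    "SELECT content FROM memories WHERE content LIKE '%deployment%'"
).fetchall()

# FTS5 MATCH: index lookup, ordered by BM25 relevance (lower = better).
fts_hits = conn.execute(
    "SELECT content FROM memories_fts WHERE memories_fts MATCH 'deployment' "
    "ORDER BY bm25(memories_fts)"
).fetchall()

print(len(like_hits), len(fts_hits))  # both find the two matching rows
```

Both paths find the same two rows at this scale; the difference is that MATCH consults an inverted index and hands back a ranked result, while LIKE rescans everything on every query.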

What this does not tell you: how your semantic search setup compares. If you’ve already replaced CrewAI’s default memory backend with a vector database, you’re not hitting this code path and these numbers don’t apply to you. If you’re using the default, the gain is direct.

What about end-to-end workflow speed?

This is the number that actually matters, and it’s the one we’re most careful about. On a full CrewAI workflow — not a micro-benchmark, but a real crew doing real work — the speedup depends heavily on what the workflow does:

  • LLM-heavy, memory-light workflow: typically 1.1–1.3× faster end-to-end. Your LLM latency dominates and Fast-CrewAI can only speed up the overhead between calls.
  • Memory- and tool-heavy workflow: typically 2–5× faster end-to-end. These are the workflows where CrewAI overhead is a meaningful fraction of total time.
  • Mostly database-bound workflow: typically 3–5× faster. FTS5 alone earns most of the gain.
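These ranges follow directly from Amdahl's law: if a fraction p of total runtime is framework overhead and that overhead is accelerated by a factor s, the end-to-end speedup is 1 / ((1 − p) + p / s). A quick sketch with illustrative numbers (not measurements from the suite):

```python
def end_to_end_speedup(overhead_fraction: float, overhead_speedup: float) -> float:
    """Amdahl's law: overall speedup when only part of the runtime is accelerated."""
    p, s = overhead_fraction, overhead_speedup
    return 1 / ((1 - p) + p / s)

# LLM-heavy workflow: 10% overhead, so even a 20x component win barely shows.
print(end_to_end_speedup(0.10, 20.0))  # ~1.1x

# Tool/memory-heavy workflow: 70% overhead, the same win dominates.
print(end_to_end_speedup(0.70, 20.0))  # ~3.0x
```

This is why profiling comes first: the overhead fraction of your own workload, not the component multipliers, decides which end of the range you land on.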

The test-comparison target in the Fast-CrewAI Makefile runs a representative basic workflow. On our hardware it reports roughly 13% improvement on the simplest case (27.88s → 24.50s over 3 iterations). That’s a real number; it’s also the floor of the range.

Memory usage, the other metric nobody talks about

CPU time is the metric people benchmark; memory is the metric that actually takes down production pipelines. Fast-CrewAI is aggressive about memory:

| Component | Python peak | Rust peak | Savings |
|---|---|---|---|
| Tool execution | 1.2 MB per op | 0.0 MB | 99% less |
| Serialization | 8.0 MB per op | 3.4 MB | 58% less |
| Database | 0.1 MB | 0.1 MB | 31% less |

The tool execution number looks too good to be true. It’s not — it’s a consequence of returning pre-serialized cached results instead of rebuilding Python dicts every call. For a worker processing thousands of tool calls, that’s the difference between staying under your container memory limit and getting OOM-killed.

How to reproduce

If you want to run the benchmarks yourself:

git clone https://github.com/neul-labs/fast-crewai.git
cd fast-crewai
uv sync --dev
uv run maturin develop --release
make benchmark              # internal micro-benchmarks
make test-comparison        # full CrewAI workflow comparison
make test-comparison-extensive  # 1000 iterations

Results vary with hardware, Python version, and — importantly — with what kind of workflow you feed it. We’d rather you run the benchmarks against your own workload than trust ours.

The caveats we won’t sweep under the rug

  1. LLM latency is not in scope. If 90% of your run time is spent waiting on gpt-4o, no framework optimization will rescue you. Profile first.
  2. The fallback path is real. If a Rust wheel isn’t available for your platform, Fast-CrewAI silently falls back to pure Python. You get compatibility, not speed.
  3. Task parallelism gains depend on your DAG. We can’t parallelize what’s genuinely sequential.
  4. The cache is opt-in. You have to mark tools as cacheable; we won’t do it for you.
  5. This project is in active development. Memory and database acceleration are production-ready. Task and tool acceleration are being optimized continuously.


Need help applying this to your codebase?

Neul Labs offers audits, full implementation, and retained CrewAI engineering. We built fast-crewai — we can build yours.