Why CrewAI multi-agent systems get slow in production
A technical walkthrough of the four places CrewAI spends CPU time in real workloads — serialization, memory search, tool execution, and task scheduling — and what you can do about each of them today.
CrewAI is a pleasure to prototype with. You sketch a Crew, wire up a couple of Agents, hand them a Task, and you have a working multi-agent system in an afternoon. The trouble starts the day you put it in front of real users — or, worse, the night before a board demo when the pipeline that ran in 24 seconds on your laptop starts taking four minutes on staging.
Multi-agent frameworks don’t slow down for mysterious reasons. They slow down because a small number of hot paths get executed a surprising number of times per run, and each of those paths is written in idiomatic Python where “idiomatic” means “allocates a lot and serializes through json.dumps”. This guide walks through the four places we consistently find the bottlenecks, in order of impact, along with what you can do about each today.
1. JSON serialization is secretly the bottleneck
The first thing most engineers miss is how much CrewAI’s internals hinge on JSON. Every time an agent message flows between components — into memory, into a tool result, back out as structured output — it gets serialized to JSON and later parsed back. `json.dumps` is written in C but it’s not fast in the sense your Rust-brained colleague means when they say “fast”. In our benchmarks, Python’s stdlib JSON tops out around 2,333 ops/s on typical CrewAI payloads.
In a long-running crew with memory enabled, tool results streaming through agents, and structured outputs flowing between tasks, serialization can account for 20–40% of total CPU time. Worse, it allocates: the peak RSS of a CrewAI worker gets dominated by the temporary Python dicts that JSON serialization creates and throws away.
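One way to verify that claim on your own workload is a quick `cProfile` pass. A minimal sketch — `run_workload` here is a synthetic stand-in for your actual `crew.kickoff()` call:

```python
import cProfile
import io
import json
import pstats

def run_workload():
    # Stand-in for crew.kickoff(): serialize/parse a typical agent payload
    payload = {"task": "summarize", "context": ["chunk"] * 200, "meta": {"id": 1}}
    for _ in range(5_000):
        json.loads(json.dumps(payload))

profiler = cProfile.Profile()
profiler.enable()
run_workload()
profiler.disable()

# Show the 20 most expensive calls by cumulative time; json.dumps and
# json.loads appearing near the top means you have a serialization problem
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(20)
print(report.getvalue())
```
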
What you can do today:
- Measure first. Wrap the relevant code paths in `cProfile` and look for `json.dumps` and `json.loads` in the top 20 entries. If they’re there, you have a serialization problem.
- Swap to `orjson` for your own application code — it’s a ~10× improvement for hand-written hot paths.
- For CrewAI internals, use Fast-CrewAI. The `fast_crewai.shim` module replaces CrewAI’s serialization paths with `serde_json` via PyO3. In our benchmarks that’s 34.5× faster (80,525 ops/s vs 2,333 ops/s) and uses 58% less memory.
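For your own application code, the `orjson` swap can be a drop-in. A small sketch with a stdlib fallback in case the dependency isn’t installed — the `dumps`/`loads` helper names are ours, not CrewAI’s:

```python
import json

try:
    import orjson

    def dumps(obj) -> str:
        # orjson returns bytes; decode to keep drop-in parity with json.dumps
        return orjson.dumps(obj).decode()

    loads = orjson.loads
except ImportError:
    # Fall back to the stdlib if orjson isn't available
    dumps = json.dumps
    loads = json.loads

payload = {"agent": "researcher", "result": ["a", "b"], "score": 0.9}
assert loads(dumps(payload)) == payload
```
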
2. Memory search is LIKE all the way down
CrewAI’s memory system is brilliant in the abstract: short-term, long-term, and entity memory layered over a RAG storage backend. In production, it’s also the component that most often surprises teams.
The default RAG storage uses SQLite with `LIKE` queries for text search. `LIKE '%needle%'` is a full table scan; there is no index that can help it. For a crew that’s been running for a few weeks and has accumulated a few megabytes of memory, every memory lookup starts scanning the entire table. Scanning a few megabytes doesn’t sound like much — until you realize an agent might do it dozens of times per task, and you have several agents, and each task runs through memory retrieval repeatedly.
The fix is FTS5, SQLite’s full-text search module. FTS5 builds an inverted index over your memory contents and supports BM25 relevance ranking out of the box. It’s been in SQLite since 2015 and it’s unreasonably fast.
What you can do today:
- If you own the memory layer, migrate to FTS5 yourself. It’s a one-afternoon change: create a virtual table, populate it from your existing memory, add triggers to keep it in sync, and replace your `LIKE` queries with `MATCH`.
- If you want it done for you, Fast-CrewAI patches CrewAI’s `RAGStorage`, `ShortTermMemory`, `LongTermMemory`, and `EntityMemory` to use FTS5 with BM25 ranking and r2d2 connection pooling. That’s 11.2× faster (10,206 ops/s vs 913 ops/s) on identical workloads.
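The do-it-yourself route is only a few lines of SQL. A minimal sketch using Python’s built-in `sqlite3` — the table and column names are illustrative, not CrewAI’s actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# FTS5 virtual table with an inverted index over the content column.
# In a real migration you'd populate it from your existing memory table
# and add INSERT/UPDATE/DELETE triggers to keep the two in sync.
conn.execute("CREATE VIRTUAL TABLE memory_fts USING fts5(content)")
conn.executemany(
    "INSERT INTO memory_fts (content) VALUES (?)",
    [
        ("The user prefers concise summaries",),
        ("Weather tool returned rain for San Francisco",),
        ("Entity: Acme Corp, industry: logistics",),
    ],
)

# MATCH walks the inverted index instead of scanning the table;
# bm25() provides relevance ranking (lower score = better match)
rows = conn.execute(
    "SELECT content, bm25(memory_fts) AS score "
    "FROM memory_fts WHERE memory_fts MATCH ? ORDER BY score",
    ("san francisco",),
).fetchall()
```

Because `bm25()` returns lower-is-better scores, the ascending `ORDER BY` puts the most relevant memory first.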
3. Tool execution repeats work you already did
Agents call tools. Tools get called with the same arguments over and over, because LLMs are creatures of habit. A tool that fetches weather for San Francisco will be called with `{"city": "San Francisco"}` — and then, three agent turns later, called again with the same argument. There is no caching layer. The tool runs. Whatever it’s doing — an HTTP call, a database query, a filesystem scan — runs in full every single time.
For I/O-bound tools this is merely wasteful. For CPU-bound tools this is a catastrophe. And the tool wrapping overhead itself is non-trivial: CrewAI’s `BaseTool.run` path includes validation, schema parsing, and result serialization on every invocation.
What you can do today:
- Add your own cache. A simple `functools.lru_cache` on your tool’s underlying function can eliminate 90% of the redundant calls. Make sure the arguments are hashable — use a wrapper that converts dicts to frozen tuples if needed.
- Cache at the right granularity. Some tool results are valid for seconds (live data), others for hours or days (static lookups). Pick a TTL per tool and honor it.
- For CrewAI’s internal tool execution overhead, Fast-CrewAI wraps `BaseTool` with a Rust-backed executor that caches results, uses serde for JSON validation, and tracks execution statistics. That’s 17.3× faster on synthetic benchmarks and uses 99% less memory on the hot path.
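The first two bullets can be combined into one decorator: an `lru_cache` keyed on a canonical JSON form of the arguments (which sidesteps the dicts-aren’t-hashable problem), with a coarse TTL implemented as a time bucket. `cached_tool` and `get_weather` are hypothetical names for illustration, not CrewAI APIs:

```python
import functools
import json
import time

def cached_tool(ttl_seconds: float, maxsize: int = 256):
    """Cache a keyword-args tool function, keyed on sorted JSON of its args."""
    def decorator(fn):
        @functools.lru_cache(maxsize=maxsize)
        def _cached(key: str, bucket: int):
            # bucket only participates in the cache key: when it rolls over
            # every ttl_seconds, old entries stop being hit and expire
            return fn(**json.loads(key))

        @functools.wraps(fn)
        def wrapper(**kwargs):
            key = json.dumps(kwargs, sort_keys=True)  # dicts aren't hashable
            return _cached(key, int(time.time() // ttl_seconds))
        return wrapper
    return decorator

call_count = 0

@cached_tool(ttl_seconds=3600)  # static-ish lookup: an hour-long TTL
def get_weather(city: str) -> str:
    global call_count
    call_count += 1
    return f"Sunny in {city}"  # stand-in for a real HTTP call

get_weather(city="San Francisco")
get_weather(city="San Francisco")  # second call served from the cache
```
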
4. Task scheduling is sequential by default
CrewAI runs tasks sequentially unless you explicitly opt into parallel execution — and even then, the scheduling is naive. Tasks that have no dependency between them still run one after another because the framework doesn’t build a dependency graph.
In a crew with eight tasks where only three of them actually depend on each other, you’re leaving five tasks’ worth of parallelism on the table. If each task makes an LLM call, that’s minutes of wall-clock time wasted per run.
What you can do today:
- Manually split your crew into phases. Run tasks that have no dependencies as a batch using Python’s `concurrent.futures` or `asyncio.gather`. Wait for the batch, then move to the next phase.
- Be honest about your dependencies. Most task graphs have far less inherent sequentiality than the code suggests.
- Fast-CrewAI ships a task executor backed by Tokio that builds a topological sort of your task dependencies, detects cycles, and dispatches tasks in parallel when it can. You still write tasks the same way — the scheduler just gets smarter.
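A minimal sketch of the phase-splitting approach with `concurrent.futures` — the task names and `run_task` function are placeholders standing in for your actual CrewAI task kickoffs:

```python
import concurrent.futures

def run_task(name: str) -> str:
    # Stand-in for kicking off one I/O-bound (LLM-calling) CrewAI task
    return f"{name}: done"

# Phase 1: these three tasks don't depend on each other, so run them as a batch
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
    phase1 = list(pool.map(run_task, ["research", "fetch_data", "scan_docs"]))

# Phase 2: runs only after every phase-1 task has finished
final = run_task("synthesize_report")
```

Threads are a reasonable fit here because the real work is network-bound LLM round-trips; if your tasks are already async, `asyncio.gather` does the same job without the thread pool.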
The honest caveats
A few things we won’t pretend away:
- Your LLM latency still dominates. If a task takes 20 seconds in LLM roundtrips and 0.5 seconds in CrewAI overhead, making the overhead 34× faster saves you less than half a second. You should still measure before assuming this is your bottleneck.
- Fast paths help most on memory-heavy and tool-heavy workloads. If your crew barely uses memory and calls one or two tools per task, the end-to-end gain is modest — usually 1.1–1.3×. If it uses memory aggressively and calls dozens of tools, the gain can be 3–5×.
- Rust is not magic. We use it where it earns its keep (serialization, search, tool caching). Everything else stays idiomatic Python — and that’s on purpose.
Where to go from here
- Read Benchmarks explained for the methodology behind the 34× / 17× / 11× numbers.
- Read Migrating to Fast-CrewAI if you want to try the one-line fix.
- Or book a performance audit and let us do the measuring for you.