Rust + Python interop with PyO3 for AI agents
A practical look at PyO3, serde, and Tokio in the context of AI agent frameworks. What Rust actually helps with, what it doesn't, and how Fast-CrewAI is structured under the hood.
Most Python libraries that claim to be “written in Rust” are really Python libraries that delegate two or three hot functions to Rust via PyO3 and wrap the rest in idiomatic Python. This is the right shape: Rust where it earns its keep, Python everywhere else. Fast-CrewAI follows the same pattern, and this guide is a look at why — and what the pattern costs.
What PyO3 is and isn’t
PyO3 is a crate that lets you write Python extensions in Rust. You get:
- A type-safe wrapper around CPython's C API.
- Automatic conversion between Rust types (`String`, `Vec<T>`, `HashMap`) and their Python equivalents.
- A `#[pymodule]`/`#[pyclass]`/`#[pyfunction]` macro surface that makes exporting Rust to Python feel like writing normal Rust.
- A build toolchain (`maturin`) that produces wheels Python can `pip install`.
What PyO3 is not:
- Free speedup. Crossing the Python↔Rust boundary has a cost. You pay it on every call. If your hot function crosses the boundary a million times a second, you won’t see the speedup you expected — the boundary itself becomes the bottleneck.
- A way to escape the GIL. PyO3 lets you `allow_threads` during long Rust computations so other Python threads can run, but your Rust code can still be gated by the GIL when it touches Python objects.
- A replacement for Cython or numpy. If your workload is numeric and vectorizable, numpy is still the right answer. PyO3 is for workloads that are control-flow-heavy or allocation-heavy, where numpy can't help.
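The boundary-cost point has a practical corollary: design the extension's API around batches, not per-item calls. A toy illustration of the two shapes, using stdlib `json` as a stand-in for a compiled function:

```python
import json

records = [{"i": i} for i in range(1000)]

# Chatty shape: one Python->C crossing per record. With PyO3 the same
# pattern means one Python->Rust crossing per record, and the crossing
# cost can dominate the work done on the other side.
chatty = [json.dumps(r) for r in records]

# Batched shape: a single crossing for the whole payload. This is the
# API shape a hot path should expose.
batched = json.dumps(records)
```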
Fast-CrewAI uses PyO3 in exactly the places it works: serialization, string-heavy full-text search, structured data validation, and task scheduling. Not in the places it doesn’t.
Serde: the reason serialization is 34× faster
Python’s json.dumps is written in C, but it’s a Python-object-at-a-time recursive walker. It checks every dict key, every list item, every string, against Python type hierarchies. The per-object overhead is small but it adds up.
Serde, by contrast, is a trait-based serializer in Rust. When you serialize a struct with #[derive(Serialize)], the compiler generates straight-line code that walks the fields in a known order with zero runtime type checks. For a known schema, serde is 30–50× faster than stdlib JSON on comparable payloads.
In Fast-CrewAI, CrewAI’s message payloads get modeled as Rust structs (roles, content, metadata, tool results), serialized via serde_json, and handed back to Python as bytes. The Python side never sees the intermediate dict. That’s where the 58% memory savings come from — no Python dict is constructed, so none is collected.
The catch: this only works if the payload schema is known. For truly dynamic content — arbitrary nested dicts the caller hands in — you need a serde_json::Value round-trip, which is slower than the struct path but still ~10× faster than Python json. Fast-CrewAI uses the struct path where possible and the Value path as a fallback.
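The two-path idea translates directly into Python terms. A minimal sketch, with a hypothetical `Message` type standing in for the Rust structs (the real ones model roles, content, metadata, and tool results):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Message:
    role: str
    content: str

def serialize(payload) -> bytes:
    if isinstance(payload, Message):
        # "Struct path": the schema is known, so the serializer can walk
        # fields in a fixed order with no runtime type dispatch.
        return json.dumps(asdict(payload)).encode()
    # "Value path": arbitrary nested data forces a generic traversal --
    # slower than the struct path, but still avoids per-object Python overhead.
    return json.dumps(payload).encode()

fast = serialize(Message("user", "hi"))
dynamic = serialize({"anything": ["goes", 1, None]})
```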
FTS5: the search engine hiding in your SQLite
SQLite has shipped with FTS5 since 2015. Almost nobody uses it, which is a shame, because it’s an excellent piece of software. FTS5 gives you:
- A virtual table that acts as an inverted index over text content.
- Tokenizers for English, Porter stemming, and Unicode folding out of the box.
- BM25 relevance ranking — the thing Elasticsearch is famous for — in a library that runs in-process.
- Automatic index maintenance via triggers on the source table.
In the memory acceleration path, Fast-CrewAI creates an FTS5 virtual table alongside CrewAI's memory tables and wires up triggers so every insert, update, and delete in the source table propagates to the index. Queries that used to be `SELECT * FROM memory WHERE content LIKE '%query%'` become `SELECT * FROM memory_fts WHERE memory_fts MATCH 'query' ORDER BY bm25(memory_fts) LIMIT 20`. Same result set, ranked, 11× faster.
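You can reproduce this wiring with nothing but the stdlib `sqlite3` module (table and trigger names here are illustrative, not Fast-CrewAI's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE memory (id INTEGER PRIMARY KEY, content TEXT);

-- External-content FTS5 index over memory.content.
CREATE VIRTUAL TABLE memory_fts USING fts5(
    content, content='memory', content_rowid='id'
);

-- Keep the index in sync with the source table.
CREATE TRIGGER memory_ai AFTER INSERT ON memory BEGIN
    INSERT INTO memory_fts(rowid, content) VALUES (new.id, new.content);
END;
CREATE TRIGGER memory_ad AFTER DELETE ON memory BEGIN
    INSERT INTO memory_fts(memory_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
END;
CREATE TRIGGER memory_au AFTER UPDATE ON memory BEGIN
    INSERT INTO memory_fts(memory_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
    INSERT INTO memory_fts(rowid, content) VALUES (new.id, new.content);
END;
""")
conn.executemany(
    "INSERT INTO memory(content) VALUES (?)",
    [("the agent searched the web",),
     ("tool results were cached",),
     ("an unrelated note",)],
)
rows = conn.execute(
    "SELECT content FROM memory_fts WHERE memory_fts MATCH ? "
    "ORDER BY bm25(memory_fts) LIMIT 20",
    ("agent",),
).fetchall()
```

The `'delete'` insert is FTS5's special command for removing a row from an external-content index; the source row itself is untouched.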
The Rust side of this is thin. Most of the speedup is from FTS5 itself, which is C. What Rust adds is the r2d2 connection pool, the serde-based row deserialization, and a small amount of query rewriting to make FTS5 queries safe against SQL injection.
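The injection-safety rewriting can be sketched in a few lines (an illustrative approach, not Fast-CrewAI's exact code): quote every token so FTS5 treats user input as literal strings rather than as query syntax like `AND`, `OR`, `NEAR`, or `*`:

```python
def fts5_escape(user_query: str) -> str:
    # Double-quote each whitespace-separated token and escape embedded
    # quotes, so FTS5 operators in user input are not interpreted.
    tokens = user_query.split()
    return " ".join('"' + t.replace('"', '""') + '"' for t in tokens)
```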
Tokio for task scheduling, cautiously
Tokio is a Rust async runtime. It’s the standard choice for network-heavy Rust services, and it’s what you reach for when you want to run a lot of concurrent work efficiently.
In Fast-CrewAI, Tokio powers the task executor: we build a directed acyclic graph of your CrewAI tasks, topologically sort it, detect cycles, and dispatch tasks in parallel where their dependencies allow. The Tokio runtime lives inside the Rust extension and is owned by the executor — Python never sees it directly.
The honest caveat: Tokio cannot make your LLM calls faster. If a task calls gpt-4o, the wall-clock time is dominated by the OpenAI API, not by the scheduler. What Tokio buys you is the ability to run multiple LLM-bound tasks in parallel instead of sequentially. That is a real win in workflows with genuine independent work — and a no-op in workflows that are actually sequential.
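The scheduling idea itself is runtime-agnostic. Here is a minimal pure-Python sketch of the same shape using `asyncio` in place of Tokio (a Kahn-style topological sort with cycle detection, dispatching each ready batch concurrently; names are illustrative):

```python
import asyncio
from collections import defaultdict, deque

async def run_dag(tasks, deps):
    """Run coroutine factories in `tasks`, honoring `deps` (name -> prerequisites)."""
    indegree = {name: len(deps.get(name, ())) for name in tasks}
    dependents = defaultdict(list)
    for name, prereqs in deps.items():
        for p in prereqs:
            dependents[p].append(name)
    ready = deque(name for name, d in indegree.items() if d == 0)
    completed = []
    while ready:
        batch = list(ready)
        ready.clear()
        # Every task in `batch` has its dependencies satisfied: run them in parallel.
        await asyncio.gather(*(tasks[name]() for name in batch))
        completed.extend(batch)
        for name in batch:
            for child in dependents[name]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    if len(completed) != len(tasks):  # some tasks never became ready
        raise ValueError("cycle detected in task graph")
    return completed

def make_task(name, log):
    async def run():
        log.append(name)
    return run

log = []
tasks = {name: make_task(name, log) for name in "abcd"}
deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
# 'a' runs alone; 'b' and 'c' run concurrently; 'd' runs last.
order = asyncio.run(run_dag(tasks, deps))
```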
The graceful fallback, and why it matters
Rust extensions have a deployment problem: they need a compiled wheel for every target platform. If a user installs fast-crewai on some exotic platform where no wheel exists, the import used to be a hard failure. That’s an unacceptable user experience.
Fast-CrewAI has a fallback: if the Rust extension fails to import for any reason, the shim falls back to pure Python implementations of the same interfaces. You get compatibility, you just don’t get speed. We check for Rust availability at startup:
```python
import fast_crewai

if not fast_crewai.rust_available():
    print("Warning: running in pure-Python mode")
```
The fallback implementations are exercised by the same 101 compatibility tests the Rust implementations run against. No divergence.
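The import-time pattern behind the shim looks roughly like this (sketched with a stand-in module name and stdlib `json` playing the fallback role; the real extension module is internal to the package):

```python
import json

try:
    # Stand-in name: in a real package this would be the compiled
    # extension module, e.g. `from mypkg import _native`.
    import _native_ext_that_does_not_exist as fastjson
    RUST_AVAILABLE = True
except ImportError:
    fastjson = json  # pure-Python fallback exposing the same interface
    RUST_AVAILABLE = False

payload = fastjson.dumps({"role": "user", "content": "hi"})
```

Callers code against one interface and never branch on which implementation they got.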
What we deliberately didn’t rewrite
This is the part that usually surprises people. We didn’t rewrite:
- Agent logic. The prompt templating, the ReAct loop, the tool-selection decisions — all of that stays in Python. It’s not the bottleneck, and Python is the right language for it (fast to iterate, easy to debug, plays nicely with the LLM provider SDKs).
- LLM client code. The OpenAI/Anthropic/etc. SDKs are already fine. Wrapping them in Rust would just add a boundary crossing.
- User tool code. Your tools stay Python. We wrap them with a Rust executor that handles caching and stats, but the tool body is yours and it’s Python.
- Configuration, logging, observability. All Python.
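The tool-wrapping idea from the third bullet, sketched in pure Python (the real executor does this on the Rust side; names here are illustrative):

```python
import functools

def wrap_tool(tool):
    """Wrap a user tool with memoization and call statistics."""
    cache = {}
    stats = {"calls": 0, "cache_hits": 0}

    @functools.wraps(tool)
    def wrapper(*args):
        stats["calls"] += 1
        if args in cache:
            stats["cache_hits"] += 1
        else:
            cache[args] = tool(*args)  # the tool body stays user Python
        return cache[args]

    wrapper.stats = stats
    return wrapper

@wrap_tool
def slow_lookup(query):
    return f"result for {query}"

slow_lookup("weather")
slow_lookup("weather")  # second call is served from the cache
```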
Roughly 70% of the Fast-CrewAI codebase is still Python. The Rust side is tight, focused, and doesn’t try to be heroic. That’s how you get speedups without pain.
Build toolchain notes
If you’re thinking of building your own PyO3 extension, here are the pieces Fast-CrewAI uses:
- `maturin` for building and publishing wheels. `maturin develop` gives you an editable install during development; `maturin build --release` produces the wheel you ship.
- `cibuildwheel` in GitHub Actions for multi-platform wheel builds (Linux x86_64/ARM64, macOS x86_64/ARM64, Windows x86_64).
- `#[pyo3(crate = "pyo3")]` annotations to keep the macro resolution stable across PyO3 versions.
- Feature flags to conditionally compile the Rust extension, so the pure-Python fallback can be tested in CI too.
Going deeper
- How it works — the canonical architecture doc in the MkDocs site.
- Benchmarks explained — where the 34×/17×/11× come from in practice.
- The Fast-CrewAI GitHub repository — the code is MIT-licensed and readable.