Turbocharge Pydantic AI + SurrealDB RAG with TurboAgents and TurboQuant

Google Research released TurboQuant, a game-changing compression technique, and Superagentic AI released TurboAgents to showcase TurboQuant in real agentic AI systems. This post walks through a small local demo built with Pydantic AI, SurrealDB, and TurboAgents that starts with a plain RAG app, then swaps only the retriever so the same app uses TurboAgents for compressed retrieval and reranking.

If you want to see the end result first, the short demo video is here: watch the demo on YouTube. The code used in this tutorial is here: turboagent-minimal-demo. TurboAgents documentation is here: TurboAgents docs.

What This Tutorial Is About

A lot of RAG examples become hard to follow because they try to explain too many things at once: the agent framework, the vector store, embeddings, prompting, orchestration, and performance claims. This tutorial takes the opposite approach.

The app is intentionally small. It does one thing clearly:

  • start with a plain local RAG app
  • swap only the retriever
  • show what TurboAgents changes

That makes it easier to see where compressed retrieval actually fits.

What Is TurboAgents

TurboAgents is a Python package for TurboQuant-style compression, retrieval, and reranking in agent and RAG systems. It is designed to plug into an existing stack instead of replacing it. That design choice matters. In many real systems, the hard question is not “how do I build a new agent framework?” The hard question is “where can I add a new retrieval capability without rewriting the rest of the app?” This tutorial uses TurboAgents exactly that way. It does not replace the agent. It does not replace the vector store. It changes the retrieval layer.
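To make "change only the retrieval layer" concrete, the seam can be sketched in plain Python. The names below (`Retriever`, `PlainRetriever`, `TurboRetriever`, `answer`) are illustrative stand-ins, not the repo's actual API:

```python
from typing import Protocol


class Retriever(Protocol):
    """The seam: anything with a search() method can back the agent."""

    def search(self, query: str, top_k: int) -> list[str]: ...


class PlainRetriever:
    """Baseline: plain vector search against the store."""

    def __init__(self, store: dict[str, list[float]]):
        self.store = store

    def search(self, query: str, top_k: int) -> list[str]:
        # ...embed the query, score it against self.store, return top_k doc ids...
        return list(self.store)[:top_k]


class TurboRetriever:
    """Swap-in: same interface, compressed retrieval underneath."""

    def __init__(self, store: dict[str, list[float]]):
        self.store = store

    def search(self, query: str, top_k: int) -> list[str]:
        # ...quantize, search the compressed payloads, rerank, return top_k...
        return list(self.store)[:top_k]


def answer(question: str, retriever: Retriever) -> str:
    """The agent-side code never changes; only the retriever instance does."""
    context = retriever.search(question, top_k=2)
    return f"Answer based on: {context}"
```

The agent code depends only on the `Retriever` protocol, so swapping `PlainRetriever` for `TurboRetriever` is a one-line change at the call site.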

What Pydantic AI Is

Pydantic AI is the agent framework used in this example. It gives a clean Python interface for defining an agent, its instructions, and its tools. For this tutorial, that is useful because the framework code stays readable and small.

The agent in this repo has a single important job: answer a question using retrieved context. That makes Pydantic AI a good fit for the tutorial because it lets the retrieval path stay the center of attention.

What Is SurrealDB

SurrealDB is the vector-backed storage layer used here. In this repo, the demo uses the embedded surrealkv:// backend from the SurrealDB Python SDK, so there is no separate database server to run.

That keeps the tutorial local and reproducible:

  • the language model runs locally through Ollama
  • the embedding model runs locally
  • the retrieval data is stored locally

The result is a small tutorial that still uses real components rather than placeholders.

Why This Tutorial Uses SurrealDB Instead of LanceDB

LanceDB is also supported by TurboAgents, but LanceDB already has its own quantization and indexing story. For a first tutorial focused on the TurboAgents integration seam, SurrealDB makes the comparison easier to isolate.

That means the reader can look at this demo and understand:

  • what the plain retrieval path looks like
  • what the TurboAgents retrieval path looks like
  • what changed between them

That is a better teaching example than mixing multiple retrieval stories together.

What You Will Build

The demo repo contains three simple ways to run the app:

  • a plain SurrealDB RAG path
  • a TurboAgents-backed SurrealDB RAG path
  • a comparison script that runs both versions and prints the differences

The repo is here: turboagent-minimal-demo. The README in that repo already explains the same workflow in short form. This tutorial expands that into a step-by-step walkthrough.

How The Demo Is Structured

The main idea behind the repo is simple:

Plain version:
Pydantic AI -> SurrealDB search -> answer

Turbo version:
Pydantic AI -> TurboAgents + SurrealDB search -> answer

In both versions:

  • the agent stays the same
  • the documents stay the same
  • the local model stays the same
  • the question stays the same

Only the retriever changes.

That is the central point of the tutorial.

Prerequisites

You need the following before running the demo:

  • Ollama, with the qwen3.5:9b chat model pulled (this is the model the agent uses)
  • the Qwen/Qwen3-Embedding-0.6B embedding model, which downloads on first run
  • uv, which the repo uses to manage the Python environment

The embeddings are truncated to 256 dimensions so they stay compatible with the TurboAgents quantization path used in the demo.
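Truncating an embedding to 256 dimensions is a one-liner, but the truncated vector typically needs renormalizing so cosine similarities stay meaningful. A minimal sketch, assuming a truncate-then-renormalize scheme (the repo's actual `app/embed.py` wrapper may differ in details):

```python
import math


def truncate_embedding(vec: list[float], dims: int = 256) -> list[float]:
    """Keep the first `dims` components, then rescale to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]
```

Models whose embeddings support truncation keep most of their semantic signal in the leading dimensions, which is what makes this kind of shortening safe in practice.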

This repo does not require Docker and does not require a separate SurrealDB server. That is because it uses the embedded surrealkv:// backend.

Clone And Set Up The Repo

Start here:

git clone https://github.com/SuperagenticAI/turboagent-minimal-demo.git
cd turboagent-minimal-demo
uv sync

Make sure the Ollama model is available:

ollama pull qwen3.5:9b
ollama list

Then run the comparison script:

uv run python scripts/run_compare.py

The first run may take longer because:

  • the embedding model may need to download and load
  • the Ollama model may need to warm up
  • the demo builds its local retrieval state under demo_data/

What The Repo Contains

The important files are:

  • app/config.py — shared configuration, sample corpus, and the demo question
  • app/embed.py — real local embedding model wrapper
  • app/retrievers.py — both the plain and TurboAgents retrievers
  • app/agent.py — the shared Pydantic AI agent and grounded run helper
  • scripts/run_plain_rag.py — baseline app
  • scripts/run_turbo_rag.py — TurboAgents-backed app
  • scripts/run_compare.py — runs both and prints the comparison

This structure keeps the code small enough that the integration seam stays visible.

Step 1: Start With The Plain RAG Version

The baseline retriever uses plain SurrealDB vector search. It embeds the demo corpus, stores those vectors in the local SurrealKV-backed database, and searches it directly.

At a high level, the baseline retriever does three things:

  • prepare the local SurrealDB-backed storage
  • seed the demo documents and their embeddings
  • run vector search for the question

This is intentionally simple. The baseline exists so the reader has a clear “before” picture.
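The kind of vector search the baseline performs can be sketched in a few lines of pure Python. In the real demo this scoring is delegated to SurrealDB rather than done in application code; this is an illustration of the operation, not SurrealDB's query syntax:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def vector_search(query_vec: list[float],
                  docs: dict[str, list[float]],
                  top_k: int = 3) -> list[str]:
    """Score every stored embedding against the query, return the best ids."""
    scored = sorted(docs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]
```

The baseline's "before" picture is exactly this: full-precision vectors in, similarity scores out, nothing compressed in between.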

You can run only the baseline version with:

uv run python scripts/run_plain_rag.py

Step 2: Add TurboAgents To The Retrieval Layer

The Turbo version keeps the same high-level app structure, but replaces the baseline retriever with a TurboAgents-backed retriever.

That means the new retrieval path:

  • uses the same embedding vectors
  • stores the same document metadata
  • answers the same question
  • adds TurboQuant-style compressed retrieval and reranking

This is the seam many teams care about in practice. The change is not “use a completely different application.” The change is “use a different retrieval implementation under the same app.”
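The payload shrink comes from storing each vector component in a few bits instead of a 4-byte float. TurboQuant itself is considerably more sophisticated (and the demo's turbo mode reports about 3.5 bits per dimension), but a simple 4-bit scalar quantizer shows the mechanic. Everything below is an illustrative sketch, not TurboAgents' actual code:

```python
def quantize_4bit(vec: list[float], lo: float = -1.0, hi: float = 1.0) -> bytes:
    """Map each float into one of 16 buckets and pack two codes per byte."""
    levels = 15
    codes = []
    for x in vec:
        clamped = min(max(x, lo), hi)
        codes.append(round((clamped - lo) / (hi - lo) * levels))
    if len(codes) % 2:  # pad to an even count so pairs pack cleanly
        codes.append(0)
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


def dequantize_4bit(payload: bytes, lo: float = -1.0, hi: float = 1.0) -> list[float]:
    """Approximate reconstruction of the vector, good enough for reranking."""
    out = []
    for b in payload:
        for code in (b >> 4, b & 0x0F):
            out.append(lo + code / 15 * (hi - lo))
    return out
```

A 256-dimension float32 vector occupies 1024 bytes; at 4 bits per dimension the same vector packs into 128 bytes, an 8x reduction, at the cost of a small per-component error that the reranking step must tolerate.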

You can run only the Turbo version with:

uv run python scripts/run_turbo_rag.py

Step 3: Compare Both Versions

The main script for this tutorial is:

uv run python scripts/run_compare.py

This runs both versions and prints:

  • the answer from the baseline app
  • the answer from the TurboAgents app
  • retrieval mode
  • retrieval timing
  • vector storage details
  • a short comparison summary

A representative result from the demo looks like this:

Baseline mode: baseline-surrealdb
Turbo mode: turbo-surrealdb-3.5-bits
Compression gain: about 5.02x smaller rerank payload per vector
Conclusion: same agent flow, compressed retrieval payload, and only a retriever-level code change.

What Changed In The Code

This is where the tutorial becomes concrete.

The demo is designed so that the code difference is easy to trace:

  • scripts/run_plain_rag.py runs the plain version
  • scripts/run_turbo_rag.py runs the Turbo version
  • app/agent.py stays the same high-level agent wiring in both cases
  • app/retrievers.py contains the real swap from plain retrieval to TurboAgents retrieval

That is the core message of the repo and the tutorial:

  • same agent
  • same app shape
  • same documents
  • different retriever

Why The Grounded Tool Call Matters

One practical issue in local tool-using demos is that the model can sometimes answer without actually calling the retrieval tool. That is a bad failure mode for a tutorial because it makes the output look less trustworthy.

The demo handles that by explicitly steering the model to call the retrieval tool first. If the first run skips retrieval, it retries with a stricter prompt. If retrieval still does not happen, the script fails clearly instead of quietly pretending everything worked.

That is the right behavior for a technical tutorial. A retrieval demo should actually retrieve.
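The retry-then-fail behavior described above can be sketched as follows. Here `run_agent` is a hypothetical stand-in for whatever invokes the Pydantic AI agent and reports whether the retrieval tool was called; the stricter prompt wording is also illustrative:

```python
class RetrievalNotGroundedError(RuntimeError):
    """Raised when the agent answers without ever calling the retrieval tool."""


def run_grounded(question: str, run_agent) -> str:
    """Run the agent; if it skipped retrieval, retry once with a stricter prompt.

    `run_agent` must return (answer, used_retrieval_tool).
    """
    prompts = [
        question,
        f"Call the retrieval tool first, then answer: {question}",
    ]
    for prompt in prompts:
        answer, used_retrieval = run_agent(prompt)
        if used_retrieval:
            return answer
    # Fail loudly instead of quietly returning an ungrounded answer.
    raise RetrievalNotGroundedError("agent answered without retrieving")
```

The important design choice is the final raise: a retrieval demo that silently falls back to an ungrounded answer would undermine its own comparison.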

What The Result Means

The most important measurable output in this tutorial is the retrieval payload size. In the current demo, the baseline path shows raw float32 vectors, while the Turbo path reports:

raw=1024 bytes, turbo=204 bytes, compression≈5.02x
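Those numbers are internally consistent: a 256-dimension float32 vector is 256 × 4 = 1024 bytes, and 1024 / 204 ≈ 5.02. A quick sanity check:

```python
dims = 256
raw_bytes = dims * 4   # float32 = 4 bytes per dimension
turbo_bytes = 204      # compressed rerank payload reported by the demo
compression = raw_bytes / turbo_bytes
print(f"raw={raw_bytes} bytes, turbo={turbo_bytes} bytes, compression={compression:.2f}x")
```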

That is the visible win in this small example.

This tutorial is intentionally not making a blanket claim that every end-to-end RAG flow will be faster. That would be too broad. The honest claim is narrower and more useful:

  • TurboAgents fits into the retriever layer cleanly
  • the integration can be small and readable
  • the compressed retrieval payload is measurably smaller

Why The Demo Uses Real Components

This repo uses real local components:

  • a real local chat model through Ollama
  • a real local embedding model
  • a real local SurrealDB-backed storage path
  • a real TurboAgents quantization path

That matters because the tutorial is meant to be reproducible. It should not depend on fake embeddings, precomputed hidden state, or a hardcoded answer path.

Resetting The Demo

The repo builds its own local retrieval state. That means generated data should not be committed. If you want to rebuild the demo from scratch, delete the generated local data and rerun it:

rm -rf demo_data
uv run python scripts/run_compare.py

This recreates the local SurrealKV data and retrieval state.

Why This Pattern

This is a small repo, but it demonstrates a useful pattern for larger systems. Many teams already have:

  • an agent layer
  • a vector store
  • a retrieval flow

In those systems, a practical adoption path is often more important than a theoretically perfect one. This tutorial shows one practical path:

  • keep the agent
  • keep the vector store
  • change the retriever

That is why this example is useful beyond the exact stack shown here.

Watch Demo

The short demo video is the one linked at the top of this post: watch the demo on YouTube.

Useful links:

  • Demo repo: turboagent-minimal-demo
  • TurboAgents documentation: TurboAgents docs

Closing

The point of this tutorial is not that TurboAgents replaces your stack. The point is that it can fit into an existing stack at the retrieval layer. In this demo, the app stays readable, the code change stays visible, and the compression story stays measurable. That is a good way to evaluate a retrieval-layer integration before moving on to bigger systems.