Not all Context is Knowledge

🌱 - A collection of sprouting thoughts.

‼️ A local 8B model beat a frontier API at the one job I trained it for: managing a knowledge base, at roughly 5x lower latency and 5x lower cost, running entirely on my own machine. The short version: a fine-tuned 8B model running on my 3090 Ti scored 81% on those knowledge management tasks. Claude Sonnet 4.6, run zero-shot on the same tasks, scored 76%, and the small model was roughly 5x faster and 5x cheaper per query. Yes, it's a specialist beating a generalist run cold. That's the point, and I dig into it below.

We are all still talking about filesystems as the substrate for agent memory: markdown files, organised well, that an agent reads and writes. Karpathy's LLM Wiki pattern is the sharpest version of it, essentially a wiki you curate and an LLM maintains, summarising sources, cross-referencing, even flagging contradictions. Retrieval gets the right files into the model's context window. The problem is that not all context is knowledge. Putting a file in front of the model is not the same as the model knowing the file is correct and current. A page written last week might describe something that changed yesterday. An entry might get quietly corrupted. A new note might land with a bad example baked in. Managing knowledge, I've started to think, is as much about what you keep out as what you let in... but also, well and truly, must I spend my hard-earned tokens on fetching files?

A huge amount of what agents do (at least for me) all day is repetitive, unglamorous bookkeeping: deciding which file matters, whether a change is safe to use, who needs to see a message. IMO, none of that needs a frontier model.

I know, I know, none of this is news. Staleness is a known problem, and there are known ways to attack it from the outside: diffing tools, research agents that re-check sources, RAG pipelines with a verification step. The usual move is to bolt those onto a big frontier model and pay the API bill. My bet was the opposite direction: push the judgment down into the model itself, and make that model small, local, and cheap. Give it the job of routing queries to the right files, noticing when stored knowledge has gone stale, and evaluating a change before serving it. I call it Peripheral, a knowledge management agent (KMA) fine-tuned to do one thing well. My bigger target, though, is cheap coordination: small models doing the constant, boring work that multi-agent and multi-human systems are about to drown in. This is my first stab at the problem.

Let's get into it!

The Setup

Given I was studying for my French exam, I started by building a French language learning wiki using Karpathy's wiki pattern. Sixteen markdown files covering grammar, tenses, vocabulary, and lesson notes for B2 French exam prep. Not quite a toy example, not quite a production knowledge base either. I wanted it small enough that I could verify every answer by hand, but big enough that dumping every file into context is wasteful (~24,500 tokens per query).

Like everyone else, I reached for qmd (Tobi Lütke's on-device markdown search) as the retrieval layer. It combines BM25, vector embeddings, and LLM reranking. On paper it was exactly what I needed, but in practice it found the right files only 30% of the time, because the embedding model doesn't handle French well. A JSON file I wrote by hand in ten minutes, mapping filenames to topic keywords, beat it at 100%. Turns out the value isn't in fancy retrieval but in domain-aware understanding.
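For a sense of how little machinery that takes, here is a minimal sketch of a hand-written topic index and a keyword router. The filenames, keywords, and the `route` helper are all hypothetical stand-ins, not the actual index from my wiki, and the matching is plain substring search.

```python
# Hypothetical topic index: filename -> hand-picked topic keywords.
TOPIC_INDEX = {
    "grammar/pouvoir.md": ["pouvoir", "peux", "pouvons"],
    "grammar/gerondif.md": ["gérondif", "gerondif", "en mangeant"],
    "vocab/connecteurs.md": ["parce que", "puisque", "vu que"],
}


def route(query: str, index: dict[str, list[str]] = TOPIC_INDEX) -> list[str]:
    """Return files whose keywords appear in the query, best match first."""
    q = query.lower()
    # Score each file by how many of its keywords occur in the query.
    scores = {
        path: sum(1 for kw in keywords if kw.lower() in q)
        for path, keywords in index.items()
    }
    hits = [path for path, score in scores.items() if score > 0]
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(hits, key=lambda p: (-scores[p], p))


print(route("How do I conjugate pouvoir in the present?"))
```

Substring matching is crude (short keywords can match inside unrelated words), which is part of why the keywords have to be picked by someone who knows the domain.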

The Baseline

I also wanted a more scientific methodology this time around (like the good old days), so with the topic index pulling only the relevant files instead of all sixteen, it felt like a good moment to set a baseline.

I tested four things throughout: a dumb heuristic (keyword matching for routing, a severity lookup for gating), the base Qwen3-8B model straight out of the box, my fine-tuned Qwen3-8B (Peripheral), and Claude Sonnet 4.6 as the frontier reference. Locally, the gate experiments ran on Gemma 4B, and I generated training data with Qwen3-32B. More on each of those as we go.

Method	Tokens/Query	Savings
Raw (all files)	~24,500	baseline
Index (topic-matched)	~4,500	5-6x reduction
No context	~500	n/a

About 20,000 tokens saved per query. Not surprising, but it sets the budget the knowledge management layer has to work inside.

Files don't know when they're wrong

Catching staleness is cheap; the struggle is what you do about it. My hypothesis going in was that a small, specialised model could work with my markdown files efficiently. The interesting question was never "does retrieval help." It's "what happens when the knowledge is wrong?"

So I built a Peripheral Vision Layer (PVL for short, patent pending). It's an append-only change log built from git diffs. Before serving any file, it checks whether that file was modified recently. Then I made ten deliberate edits to my wiki: wrong conjugations ("je pit" instead of "je peux"), fake grammar rules ("qui faut + infinitif"), mangled tense formations ("en + radical + -ing"). The PVL caught all ten, at a cost of 226 tokens per query.

The PVL is the piece I keep coming back to. In a real deployment it's always on: it watches the git history and, before serving anything, flags the files that changed since the agent last looked, for about 226 tokens. That's the whole point of the "peripheral vision" name, you don't re-read and re-vet the entire knowledge base on every query, you just glance sideways at what moved and only pay attention when something did. Here it started life as a proof of concept, catching the ten edits I planted, but it's also the runtime trigger for everything downstream. It's how the system notices that a piece of knowledge might no longer be worth trusting.
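The "glance sideways" check can be sketched in a few lines. This is an illustration under assumptions, not the real PVL: the `ChangeEntry` and `ChangeLog` names are hypothetical, the entries are held in memory rather than read from git history, and the commit hashes and timestamps are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEntry:
    path: str        # file touched by the commit
    commit: str      # short commit hash (illustrative values below)
    timestamp: int   # commit time, seconds since epoch


class ChangeLog:
    """Append-only log: entries are only ever added, never edited."""

    def __init__(self) -> None:
        self._entries: list[ChangeEntry] = []

    def append(self, entry: ChangeEntry) -> None:
        self._entries.append(entry)

    def changed_since(self, last_seen: int, routed_files: list[str]) -> list[ChangeEntry]:
        """Entries newer than last_seen, restricted to the routed files."""
        wanted = set(routed_files)
        return [
            e for e in self._entries
            if e.timestamp > last_seen and e.path in wanted
        ]


log = ChangeLog()
log.append(ChangeEntry("grammar/pouvoir.md", "a1b2c3d", 1_700_000_500))
log.append(ChangeEntry("vocab/food.md", "d4e5f6a", 1_700_000_100))

flagged = log.changed_since(last_seen=1_700_000_300, routed_files=["grammar/pouvoir.md"])
print([e.path for e in flagged])
```

Filtering by the routed files keeps the check proportional to the query rather than to the size of the knowledge base.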

The Block Gate

I tried two ways of handling stale content and the gap between them was hard to ignore.

The first was to annotate: inject a warning into the context telling the model the file was recently changed. The result was almost comic. The model (Gemma 4B, running locally) read the warning, read the corrupted content, answered from the corrupted content, then tacked the warning on as a footnote. It confidently served "je pit" as the conjugation of pouvoir.

The second was to block: withhold the stale content entirely. Instead, serve the git diff (what changed) plus the previous version of the file, and tell the model to evaluate the change before answering.

Change Type	Annotate Gate	Block Gate
Corrupted conjugation ("je pit")	Serves wrong answer + footnote	Rejects corruption, serves correct old version
Fake grammar rule ("qui faut")	Serves fake rule	Flags as "grammatically incorrect"
Rule swap ("-ant" → "-ing")	Serves wrong rule (with correct examples!)	Catches anglicism, keeps original
Correct concept with bad example ("vu que on est ici")	Accepts everything	Accepts concept, fixes the bad example
Clean vocabulary addition	Adds unnecessary warning	Accepts without false alarm

The block gate costs about 500 extra tokens. What it really does is change the cognitive task from "answer using this content" to "evaluate this change, then answer." That reframing is the whole trick.
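To make the difference between the two gates concrete, here is a rough sketch of how the context could be assembled in each mode. The `build_context` helper and the prompt wording are assumptions for illustration, not the prompts I actually used.

```python
def build_context(current: str, previous: str, diff: str, mode: str) -> str:
    """Assemble model context for a recently changed file.

    mode="annotate": serve the current content with a warning attached.
    mode="block":    withhold the current content; serve diff + previous version.
    """
    if mode == "annotate":
        return f"WARNING: this file was recently changed.\n\n{current}"
    if mode == "block":
        return (
            "A file you need was recently changed. Changes can be correct, "
            "incorrect, or mixed. Evaluate the change before answering.\n\n"
            f"DIFF:\n{diff}\n\nPREVIOUS VERSION:\n{previous}"
        )
    raise ValueError(f"unknown mode: {mode}")


previous = "je peux\ntu peux"
current = "je pit\ntu pit"
diff = "-je peux\n-tu peux\n+je pit\n+tu pit"

block_ctx = build_context(current, previous, diff, mode="block")
annotate_ctx = build_context(current, previous, diff, mode="annotate")
print(block_ctx)
```

In block mode the model still sees the new lines, but only as marked additions inside a diff, next to the version they replaced.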

The fake grammar rule surprised me most. Neither model (the local 4B or Claude Sonnet) caught the fake "qui faut + infinitif" rule without the block gate. It looked plausible enough to slip past both. Only when forced to look at what changed, rather than what exists, did they flag it.

This lines up with Augenstein's work on parametric vs contextual knowledge. The annotate gate creates a conflict between stale content (which the model treats as authoritative) and a warning (which it treats as metadata), and authority wins. The block gate removes the conflict by removing the stale content from context.

Training the Model

I generated training data locally with Qwen3-32B on a 3090 Ti, across three tasks:

Eval (3,000 examples): given a diff, is the change correct, incorrect, or mixed?
Routing (2,000 examples): given a query, which files are relevant?
Gate (2,000 examples): given a change signal, serve, annotate, or block?

Honestly, the most useful thing I learned in this whole project came from the data, not the model. The first batch had a 40% match rate on eval, and the culprit was the system prompt. "Evaluate this change" quietly implies something is wrong, so the model rejected almost everything, legitimate additions included. After rewriting the prompts to be neutral ("changes can be correct, incorrect, or mixed") and clearing a cached system prompt in LM Studio that was silently overriding all my scripts, match rates jumped to 77-95%. Prompt neutrality mattered more than data quantity.

By match rate I mean how often the labeller agreed with the known answer. I built the diffs myself, so I already knew which were genuine corruptions and which were legitimate edits. Match rate is just the share of generated labels that matched that ground truth. A low match rate doesn't only mean bad data, it means the labeller's judgement was skewed, and that skew gets baked into anything you train on top of it.
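As a sketch, match rate per category is just a grouped agreement count. The field names and sample rows here are illustrative, not my real training data.

```python
from collections import defaultdict


def match_rates(examples: list[dict]) -> dict[str, float]:
    """Share of generated labels that agree with ground truth, per category."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for ex in examples:
        totals[ex["category"]] += 1
        # A bool adds as 0 or 1, so this counts agreements.
        hits[ex["category"]] += ex["generated"] == ex["truth"]
    return {cat: hits[cat] / totals[cat] for cat in totals}


# Illustrative rows only.
sample = [
    {"category": "corruption", "generated": "incorrect", "truth": "incorrect"},
    {"category": "corruption", "generated": "incorrect", "truth": "incorrect"},
    {"category": "addition", "generated": "incorrect", "truth": "correct"},
    {"category": "addition", "generated": "correct", "truth": "correct"},
]
print(match_rates(sample))
```

Splitting by category is what exposes a skewed labeller: a prompt that nudges toward "incorrect" can look fine on corruptions while failing on legitimate additions.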

The lesson there is that your training data inherits the biases of your prompts. Frame every evaluation as "find the problem," and the model learns to find problems everywhere.

With cleaner data, I fine-tuned Qwen3-8B using Unsloth Studio with QLoRA on my 3090 Ti. That's 7,000 examples, 3 epochs, and about 3 hours. I decided to export the model as Q4_K_M GGUF for LM Studio so I could run my benchmarks with it.

Small beats Frontier

My first benchmark used ten hand-written test cases per task. The fine-tuned model scored 90% overall. But ten cases is thin, and the scores bounced around between runs, so they weren't statistically meaningful, though they were a good early signal.

So I built KMA-Bench: 110 diff evaluation cases, 80 routing cases, and 36 gate decision cases across three knowledge bases (the French wiki, ClickHouse docs, and the PostHog handbook, with 14 cases pulled from actual git commits). The results:

Model	Diff Eval (110)	Routing (80)	Gate (36)	Overall (226)	Latency
Heuristic	48 (44%)	60 (75%)	36 (100%)	144 (64%)	0ms
Base Qwen3-8B	23 (46%*)	4 (13%*)	4 (16%*)	31 (30%*)	~3,500ms
Peripheral (fine-tuned 8B)	78.3 ± 1.2 (71%)	70.7 ± 0.6 (88%)	34.7 ± 0.6 (96%)	183.7 ± 1.2 (81%)	~760ms
Claude Sonnet 4.6	65.5 ± 0.7 (60%)	77.0 (96%)	30.0 (83%)	172.5 ± 0.7 (76%)	~4,100ms

Results averaged over 3 runs (Peripheral) and 2 runs (Claude). ± values are standard deviation. The base model was tested on the initial 105-case run, so its starred (*) percentages are relative to that smaller set, not the full 226. Every model used the same prompt format, no few-shot examples, no per-model tuning. Sonnet's scores are zero-shot on my task format.

These figures cover all 226 cases I measured, ClickHouse included. The public KMA-Bench ships the 166 French and PostHog cases; more on why further down.

The fine-tuned model beats Sonnet overall, 81% to 76%, averaged across 226 test cases and three knowledge bases. The gap holds on every run.

You might be saying, "You tuned a specialist and ran the generalist cold." True, and that's the whole point!! The bet is that a specialist you train once and run locally beats a generalist you rent by the token.

Like, think about it for a bit. Qwen3-8B has 8 billion parameters. Claude Sonnet is a frontier model; Anthropic doesn't publish its size, but frontier models are estimated north of 100B. An 8B model, quantised to 4-bit and running in ~4.8GB of VRAM, beats a model that's likely 10 to 50 times larger on domain-specific knowledge management. Fine-tuning on 7,000 task-specific examples turns an 8B model into a specialist, and specialists win because they control the playing field.

Why a heuristic isn't enough

Now look at the heuristic row again. A system with zero intelligence (keyword matching for routing, a severity lookup for gating) scores 64%. It gets 100% on gate and 75% on routing. So why train a model at all?

There are a few reasons it earns its keep.

Diff evaluation. The heuristic scores 44% on diff eval because it always says "accept." It can't read content. It can't tell that "je pit" isn't a French word, or that changing a port from 8123 to 3306 swaps ClickHouse's port for MySQL's, or that a spending limit of "$15" is probably a misplaced decimal. The trained model scores 71% on the same cases. This is the task that genuinely needs intelligence: understanding what changed and whether the change is right.

Maintenance. The keyword topic index works because someone (me and Claude) hand-wrote keywords for each file. Add, rename, or restructure files and the index needs updating by hand. The trained model learned to match queries to file descriptions on its own. It routed ClickHouse and PostHog queries correctly with no manual index for those knowledge bases at all. The heuristic would score 0% on a new KB until someone sits down and writes the keywords.

Ambiguity. The severity-based gate rule (critical → block, medium → annotate, low → serve) is perfect on test cases where I assigned the severity. But in the real pipeline the PVL just produces a change signal, and something still has to decide whether that modified file is critical, medium, or low. The heuristic simply can't, but the model can.
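For reference, the heuristic side of that comparison is about this simple. This is a sketch with hypothetical function names; the severity mapping and the always-accept behaviour are as described above.

```python
# Severity rule from the text: critical -> block, medium -> annotate, low -> serve.
SEVERITY_TO_ACTION = {"critical": "block", "medium": "annotate", "low": "serve"}


def heuristic_gate(severity: str) -> str:
    """Pure lookup: only works if something upstream already assigned severity."""
    return SEVERITY_TO_ACTION[severity]


def heuristic_diff_eval(diff: str) -> str:
    """The diff-eval baseline never reads the diff; it always accepts."""
    return "accept"


print(heuristic_gate("critical"), heuristic_diff_eval("+je pit"))
```

Both functions are correct exactly as far as their inputs are pre-digested, which is the gap the trained model is there to fill.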

The heuristic is a strong baseline for a reason: simple rules that match the task structure will always do well. The model shows its usefulness in the spaces between the rules, the ambiguous cases, the unseen domains, and the tasks like diff evaluation where rules don't work at all.

One comparison I still owe: how does this stack up against a naive RAG pipeline, or a heavier agentic RAG loop? I haven't run it yet. My hunch is that vanilla RAG lands near the heuristic on routing and worse on diff eval, since retrieval doesn't judge correctness, and that agentic RAG closes the accuracy gap but pays for it in latency and tokens, which is the exact cost I'm trying to avoid. But a hunch isn't a benchmark, so it's on my growing list.

Sonnet still wins on routing (96% vs 88%). The fine-tuned model wins on diff evaluation (71% vs 60%) and gate decisions (96% vs 83%), the two tasks that are actually specific to knowledge management!!

The base model scored 30% on an earlier 105-case run mostly because it can't produce valid JSON for most queries. Fine-tuning taught both the tasks and the output format, so the jump from 30% to 81% is as much about learning to emit structured output as it is about judgement.

The Full Pipeline

Everything so far runs as one pipeline. A query passes through the whole stack before the agent ever sees an answer: the router narrows to the right files, the PVL checks whether any of them changed, the block gate withholds anything stale and serves the diff instead, and the model gives its verdict.

Agent query
  → Topic index (5-6x token savings)
  → PVL gate (226 tokens, catches staleness)
  → Block gate (224 tokens, evaluates changes)
  → Peripheral (~760ms, 81% on KMA-Bench)
  → Serve to agent

Benchmarks test classification on pre-written snippets. In the real world, we care about whether the fine-tuned model handles actual git diffs through that pipeline. So I ran the full thing against my corrupted wiki. Real files, real git diffs, the block gate serving diffs instead of content.

Here's one pass, nothing trimmed. The block gate hands the model a diff and the previous version, and asks for a verdict. For the corrupted pouvoir edit, the input was a plain git diff:

--- a/grammar/pouvoir.md
+++ b/grammar/pouvoir.md
@@ présent @@
-je peux
-tu peux
-nous pouvons
+je pit
+tu pit
+nous piton

and the model handed back structured JSON, which is the only thing the agent upstream ever sees:

{
  "verdict": "reject",
  "reasoning": "'je pit', 'nous piton' are not valid conjugations of pouvoir; the change looks corrupted.",
  "corrected_content": "no correction needed",
  "confidence": 0.93
}
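One way a caller might act on that JSON, as a sketch. The `select_content` helper is hypothetical, and so is the policy of treating anything other than "accept" as a reason to fall back to the previous version; the field names are the ones in the output above.

```python
import json


def select_content(model_output: str, current: str, previous: str) -> dict:
    """Pick what the agent sees based on the model's verdict."""
    result = json.loads(model_output)
    verdict = result["verdict"]
    if verdict == "accept":
        content = current
    else:
        # "reject" and anything unrecognised fall back to the last known-good version.
        content = previous
    return {"verdict": verdict, "reasoning": result["reasoning"], "content": content}


raw = json.dumps({
    "verdict": "reject",
    "reasoning": "'je pit', 'nous piton' are not valid conjugations of pouvoir; the change looks corrupted.",
    "corrected_content": "no correction needed",
    "confidence": 0.93,
})
served = select_content(raw, current="je pit", previous="je peux")
print(served["content"])
```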

The agent never touches the corrupted file. It gets the verdict and the clean previous version. Run the same pipeline across the rest of the corruptions and this is what comes out:

Corrupted pouvoir conjugation:

"The new version introduces nonsensical and incorrect forms such as 'je pit', 'nous piton', which are clearly wrong. These changes appear to be random or corrupted. You should revert to the original version."

Rule swap (gérondif formation changed from "-ant" to "-ing"): Without the block gate, the model confidently claimed the gérondif is formed with "-ing" while giving correct examples (en mangeant) in the same breath. It contradicted itself without noticing. With the block gate:

"The use of '-ing' is not standard in French gérondif formation; it appears to be an anglicism or mistake. The original version should remain unchanged."

Partial error (vu que, bad example):

✅ Correct: parce que, puisque, comme (kept with full examples) ❌ Suspicious: "vu que on est ici, tout a fait" (flagged as incomplete)

Clean vocabulary addition:

"The content is well-structured, uses appropriate grammar, and provides useful notes for learners. Since the change appears to be accurate and beneficial, I will use it."

It had to break somewhere! But where??? Fabricated content. I added a fake March lesson with plausible-looking content (conjugation of -er verbs, relative clauses). The model accepted it: "there are no obvious errors in the added text." And on its own terms it was right. The fake lesson was grammatically correct and topically appropriate. The block gate catches quality problems (bad grammar, corruption, structural issues), not fabrication. It has no way to know the March lesson never happened. Catching fake-but-plausible content needs external verification: a human, a source document, a calendar. That's a real limitation, and a different problem from quality.

The fine-tuned model got 6 out of 7 on end-to-end change types. The one miss (fabrication) is specific and well understood.

Abilities transferred to new domains!

Overfitting worried me quite a bit, since the model saw nothing but French grammar in training. The obvious failure mode was that it had just memorised French instead of learning the job: "how to spot a broken conjugation" rather than "how to evaluate a diff." If that were the case, it would fall apart the second I pointed it at anything else. So I did exactly that, and ran it against two domains it had never seen, some ClickHouse database docs (23 files) and the PostHog company handbook (86 files), with no retraining.

Task	French Wiki (trained, 16 files)	ClickHouse (unseen, 23 files)	PostHog (unseen, 86 files)
Diff Eval	78%	70%	68%
Routing	95%	95%	80%
Gate	100%	100%	86%

Routing generalises, then degrades at scale. French and ClickHouse are even at 95%, so the model learned to match queries to file descriptions, not French-specific patterns. PostHog drops to 80%, and not because the domain is harder. 86 files is 5x more candidates to confuse. Scale is the bottleneck, not domain transfer.

Gate generalises cleanly. 100% on French and ClickHouse, 86% on PostHog. The PostHog misses are handbook-specific judgement calls (is an onboarding rewrite an "annotate" or a "serve"?) where the severity signal alone is ambiguous.

Diff eval drops predictably with domain distance. 78% on French (trained), 70% on ClickHouse (technical docs), 68% on PostHog (organisational docs). The model learned "how to evaluate diffs," not "how to evaluate French grammar diffs." But domain knowledge still helps. Spotting that "le logement (f.)" is wrong needs French gender, which the model has. Spotting that a "$15" spending limit is wrong for PostHog needs their policies, which it doesn't.

5x faster, 5x cheaper

The accuracy numbers turn heads, but the speed and cost numbers make this compelling imho.

Latency:

Model	Avg Latency	What's Included
Peripheral	~760ms	Local GPU inference, no network
Claude Sonnet 4.6	~4,100ms	API call, network, queue time
Base Qwen3-8B	~3,500ms	Local GPU but verbose output
Heuristic	<1ms	String matching, no model

The fine-tuned model is fast because it learned to produce short, structured JSON. The base model is slow because it generates long reasoning text before (sometimes) producing JSON. Fine-tuning squeezed the average output from ~474 tokens to ~222. Less output, less generation time, lower latency.

That compounds in a pipeline. Route, gate, eval is three model calls. At 760ms each, the full pipeline runs in ~2.3 seconds. At 4,100ms each with Sonnet, it's ~12.3 seconds. For an agent making 20 knowledge queries per task, that's 46 seconds versus 4 minutes, which is the gap between a batch job and something interactive.
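The arithmetic behind those figures, as a quick sketch. It assumes the three calls run sequentially and ignores any overhead between them; the constant and function names are mine.

```python
CALLS_PER_QUERY = 3      # route, gate, eval
QUERIES_PER_TASK = 20    # knowledge queries an agent makes per task


def task_seconds(latency_ms: float) -> float:
    """Total model time for one task, assuming sequential calls."""
    return latency_ms * CALLS_PER_QUERY * QUERIES_PER_TASK / 1000


peripheral_s = task_seconds(760)    # about 46 seconds
sonnet_s = task_seconds(4100)       # about 4 minutes
print(peripheral_s, sonnet_s, sonnet_s / peripheral_s)
```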

Token cost per query:

Component	Tokens	Purpose
Raw retrieval (all files)	~24,500	Baseline: dump everything in context
Index retrieval	~4,500	Topic-matched files only
+ PVL overhead	+226	Staleness check
+ Block gate overhead	+224	Diff + old version
+ Model output	~222	JSON response
Total with Peripheral	~5,172	Full pipeline
Savings vs raw	~19,328	Per query

At scale this adds up. If an agent makes 100 queries per session at Sonnet pricing ($3/M input tokens), the raw approach costs ~$7.35 a session. The Peripheral approach costs ~$1.55. For a team running 50 sessions a day over a month, that's the difference between ~$11,025 and ~$2,325, roughly $8,700 saved.
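Here is the same calculation as a sketch you can rerun with your own numbers. Like the text, it prices every token at the input rate and assumes a 30-day month; the monthly figures above multiply the rounded per-session costs, so the unrounded totals differ by a couple of dollars.

```python
PRICE_PER_M_TOKENS = 3.00   # USD per million input tokens
QUERIES_PER_SESSION = 100
SESSIONS_PER_DAY = 50
DAYS_PER_MONTH = 30         # assumption: a 30-day month


def session_cost(tokens_per_query: int) -> float:
    """Cost of one session, pricing all tokens at the input rate."""
    return tokens_per_query * QUERIES_PER_SESSION / 1_000_000 * PRICE_PER_M_TOKENS


raw = session_cost(24_500)          # dump every file into context
peripheral = session_cost(5_172)    # full pipeline total from the table

sessions = SESSIONS_PER_DAY * DAYS_PER_MONTH
print(round(raw, 2), round(peripheral, 2))
print(round(raw * sessions), round(round(peripheral, 2) * sessions))
```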

Total overhead lands at about 700 tokens per query against roughly 20,000 saved, so Peripheral pays for itself 28 times over. Even running fully local, fewer tokens means faster inference, less VRAM pressure, and the option to run on smaller hardware. Peripheral's 5,172-token context sits comfortably inside a 4B model's window. The raw 24,500-token approach needs a bigger model or risks truncation.

This is the part I keep getting excited about. It opens the door for anyone who just wants to manage their own files well, without renting a frontier model to do it. I'd love to port this to MLX so it runs natively on my Mac, and I genuinely think a small, fine-tuned knowledge manager could become a standard local tool, the way ripgrep or fzf are. A little spirit that lives on your machine and keeps your notes honest. iykyk.

Not all context is knowledge

So you don't need a frontier model to manage knowledge. You can make do with a small model trained specifically for it, paired with cheap heuristics (a topic index, git diffs) that mop up the easy cases. The model is there for the hard one: deciding whether a change to stored knowledge is legitimate, corrupt, or somewhere in between. That's the line between context and knowledge. Retrieval can hand the agent a file; only judgement tells it whether the file is worth trusting.

None of this lives in isolation, which is the part I like. Peripheral is a judgement layer, and it slots in next to the tools people already use. qmd can still do retrieval where the language plays nicely with its embeddings (English, mostly), and the topic index covers the rest. lilmd handles section-level reads and writes so the model works on the right slice of a file rather than the whole thing. Headroom sits below Peripheral and compresses tokens further. Peripheral doesn't replace any of these. It sits on top and answers the one question none of them do: should I trust this change?

Again, file diffs are where I started. The gate is really a relevance primitive: given some change or event, make the same three-way call as serve, annotate, or block, which in general terms means act on it, flag it as maybe-relevant, or stay quiet. Swap "file diff" for "message" and you get something I keep hearing people ask for. Someone I was talking to recently is building exactly this for a multi-agent, multi-human chat: a model that reads every message and decides which agents it's relevant to, tagging each one "you could respond" or "you must respond". Today that's done with manual @-mentions, which nobody wants to maintain. It's the same gate (block, annotate, serve), just pointed at a conversation instead of a wiki. The signals are often already lying around, too: markdown front-matter like status: merged is an annotation an agent can read to know what happened without re-deriving it, and a "new file landed in this folder, go do something" automation is a write-path trigger waiting for a model cheap enough to run on every event.

That's the real bet, and it's bigger than knowledge bases: some operations don't have to be done by expensive models. Multi-agent, multi-human systems are going to need a lot of cheap, constant coordination for routing, gating, keeping things in sync, and most of it is exactly the kind of narrow, repeated judgement a small local specialist is good at. The plan is to slot this model in over MCP underneath a frontier model: let the cheap specialist handle the high-frequency gating and routing, and save the frontier model for the reasoning that actually needs it. The file-diff gate is step one.

An idea that unfurled while I built this: knowledge management needs a trust layer. Trust is earned, and trust assumes a guarantee. Think about peer review. However flawed it is, we trust an output because proxies we respect did the vetting for us. Managing knowledge is the same game, and it's as much about what you keep out as what you let in. The block gate is a crude first version of that, and epistemic vectors are potentially where it gets serious. This is the thread I want to keep pulling.

Where This Falls Short

Let's address the tiny elephants in the room.

Evaluation bias. I labelled the test cases, trained the model on my labels, and tested against my labels. That's three layers of the same person's judgement. Inter-annotator agreement hasn't been measured. I'd guess 10-15% of diff eval cases have arguable answers. Is swapping "indicatif" for "subjonctif" always a reject, or sometimes a partial? Reasonable people could disagree.

Prompt fairness. Every model used the same prompt with no few-shot examples, so Sonnet was never tuned for these tasks. A few examples in its prompt might close the gap. What this shows is a trained specialist against a zero-shot generalist, not proof that small models are inherently better than frontier ones.

Training data quality. The diff eval score is held back by categories where the training data had low agreement. Legitimate additions (63% match in training) and partial errors (66%) are the main offenders; the model learned to be overly suspicious of additions and overly harsh on partial errors. Better data for those two categories would probably push diff eval from 71% past 80%. The gate and routing data had much higher match rates (89% and 95%), and it shows.

Sample size. KMA-Bench is 226 cases across three knowledge bases (14 from real git commits), though the public release ships fewer (see the licensing note below). Per-category counts are still small (5-10 cases per corruption type). I'm open-sourcing it and would love contributions and feedback.

Licensing the benchmark. A hurdle I hit right at the publishing step, and it's worth sharing because it's the kind of thing you only think about too late. I'd built part of the benchmark from ClickHouse's docs, which are licensed CC BY-NC-SA 4.0, so non-commercial, share-alike. That's fine for reporting the numbers I measured on them, but it means I can't redistribute the derived test cases under a permissive licence. So the public KMA-Bench ships the French (mine) and PostHog (MIT) cases, 166 of the 226. The ClickHouse cases stay out. The lesson, obvious in hindsight: check the licence of your source material before you bake it into a benchmark dataset you mean to open-source. I'm now hunting for MIT or CC-BY docs to rebuild that coverage cleanly.

The block gate is a quality gate, not a truth gate. It catches corruption, bad grammar, structural problems, and rule swaps. It does not catch fabricated-but-plausible content. It asks "is this change well-formed?", not "where did this even come from?", so a fake lesson with clean grammar walks right through. Closing that second gap needs provenance, not grammar, which is what epistemic vectors are for.

Cost. The model is free to run, not free to build. Data generation took ~15 hours of GPU time, fine-tuning another 3, and development maybe 40 hours of mine. "Zero cost" means zero marginal cost per query.

The write path. Everything here is read-path. Deciding where new knowledge goes, when to split or merge files, how to keep cross-references intact, all untrained and untested. It's what I'm working on next.

If I were starting over, I'd use multiple annotators and measure agreement, tune prompts per model before claiming one beats another, test on three or more knowledge bases before making any generalisation claims, build the write path alongside the read path, and build KMA-Bench before the model rather than after. Having the benchmark first would have made every design call sharper.

If you're working on knowledge management for AI agents, have a sharper take on the write path, or think the specialist-beats-generalist claim falls apart somewhere, I'd genuinely like to hear it. Find me on the internet or open an issue.

What's Next

This started as a five-layer architecture, and today the read path works. I plan on working on a few more things in the near future, namely:

The Write Path

Everything so far is read-path: the agent asks, Peripheral retrieves and judges. The other half, deciding where new knowledge goes, is untrained. The write path has to answer: does this new information belong in an existing file as a new section? Should it extend a section that's already there? Does it need a new file? Is it just a duplicate of something already in the KB?

The training data scripts exist (gen_organize.py) and the task definitions are clear. What's missing is the actual data and the benchmark cases for placement decisions. That's the next chunk of engineering.

The MCP Server

Right now Peripheral is a pile of scripts and a model. Nobody can use it without reading the codebase. Wrapping it as an MCP server with three tools (peripheral_query for reads, peripheral_write for writes, peripheral_status for the PVL change log) means any agent that supports MCP (Claude Code, Cursor, custom agents) can call it. That's the step that takes this from a research project to something people can actually pick up.

The harder problems

Three things I know will bite, listed so I can't pretend I didn't see them coming.

Compound diffs. Every KMA-Bench case has one change per diff. Real commits have 5, 10, 50, and a single commit might mix a legitimate vocabulary addition, a corrupted conjugation table, and a formatting fix. The right answer there isn't accept or reject, it's per-change decomposition. KMA-Bench v0.2 will add cases that force exactly that.

Scale. The current KBs are 16 files (French) and 23 (ClickHouse). A real one might have 500, and three things break: routing (the file list alone eats thousands of tokens, so you need hierarchical routing, category then file), the PVL (the change log grows, so you filter it by routed files), and the token budget itself (does ~700 overhead against ~20,000 saved still hold?).

Multi-agent consensus. Three agents writing to the same KB at once and no coordination means one overwrites another. The PVL is designed to grow into a Raft-inspired shared change log: agents append, a leader resolves conflicts, a change commits only when it doesn't clash, and when it does the block gate serves the diff back to re-evaluate. Most setups are still single-agent, so this is built for when the trend (CrewAI, AutoGen, LangGraph, etc) catches up, not for today.
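On the compound-diff problem specifically, the first step of per-change decomposition could look something like this: split a unified diff on its hunk headers and evaluate each hunk separately. This is a sketch, not what KMA-Bench v0.2 will do. `split_hunks` is a hypothetical helper, it assumes a single-file diff, and a single hunk can still bundle several logical changes.

```python
def split_hunks(diff: str) -> list[str]:
    """Split a single-file unified diff into hunks, one per '@@' header."""
    hunks: list[list[str]] = []
    for line in diff.splitlines():
        if line.startswith("@@"):
            hunks.append([line])       # a new hunk starts at each header
        elif hunks:
            hunks[-1].append(line)     # body line of the current hunk
        # Lines before the first '@@' (the ---/+++ file headers) are skipped.
    return ["\n".join(h) for h in hunks]


sample_diff = (
    "--- a/x.md\n"
    "+++ b/x.md\n"
    "@@ -1,2 +1,2 @@\n"
    "-je peux\n"
    "+je pit\n"
    "@@ -10,1 +10,2 @@\n"
    " une pomme\n"
    "+une poire"
)
for hunk in split_hunks(sample_diff):
    print(hunk, end="\n\n")
```

Each hunk can then go through the block gate on its own, so a corrupted conjugation does not drag a legitimate vocabulary addition down with it.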

Epistemic Vectors

The most speculative piece, and the part I'm least sure about. Borrowing loosely from the uncertainty-quantification literature (the same body of work behind the UQ tutorial I link below), and truly inspired by the soon-to-be-published work of Clement at Candide, here's a scheme I've been sketching: every piece of knowledge in the KB would carry a four-dimensional uncertainty score:

e (evidential support): how well-supported? A rule from the Académie Française scores high. A tip from a Reddit thread scores low.
c (methodological consensus): do sources agree? Three textbooks saying "parce que + indicatif" is high c. A split between metropolitan and Quebec French is lower.
d (provenance depth): how traceable? A published textbook is d=2. Something you wrote from experience is d=1. Something a model hallucinated is d=0.
s (scope anchoredness): how universal? "French uses gendered nouns" is broadly anchored. "My tutor likes teaching passé composé before imparfait" is narrow.

Ask about pouvoir and Peripheral returns the answer plus (e=0.9, c=1.0, d=2, s=0.8): well-supported, agreed, from a textbook, broadly useful. Ask about a tutor's preferences and you get (e=0.3, c=0.0, d=1, s=0.1): personal, no possible consensus, first-hand, narrow.

This is also where the fabrication problem from earlier gets a real answer. A fake March lesson would carry d=0, no traceable source, and that alone would flag it even when the grammar is perfect. The block gate can't see that. Epistemic vectors could.
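As a data structure, the idea might look like the sketch below. Nothing here is implemented in Peripheral. The class name and the `needs_verification` rule are hypothetical, the 0 to 1 ranges for e, c, and s are my assumption, and the e, c, and s values for the fake lesson are invented; only its d=0 comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpistemicVector:
    e: float  # evidential support (assumed range 0..1)
    c: float  # methodological consensus (assumed range 0..1)
    d: int    # provenance depth: 0 = no traceable source
    s: float  # scope anchoredness (assumed range 0..1)

    def needs_verification(self) -> bool:
        """Flag knowledge with no traceable source, however clean it looks."""
        return self.d == 0


pouvoir = EpistemicVector(e=0.9, c=1.0, d=2, s=0.8)       # values from the text
tutor_pref = EpistemicVector(e=0.3, c=0.0, d=1, s=0.1)    # values from the text
fake_lesson = EpistemicVector(e=0.5, c=0.5, d=0, s=0.5)   # only d=0 is from the text

print(pouvoir.needs_verification(), fake_lesson.needs_verification())
```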

The hard part: training a model to produce calibrated vectors needs hundreds of human-annotated examples with (e, c, d, s) values, plus proof that the vectors actually predict correctness. The ECIR 2026 UQ tutorial has frameworks for model-level uncertainty, but translating that to knowledge-level uncertainty per file is an open question.

I don't know if it works. But if it does, it gives downstream agents something no retrieval system offers today: not just the answer, but a formal statement of how much to trust it. That would be the trust layer, made real.

The Stack

If you want to build this yourself:

Knowledge base: Karpathy's LLM Wiki pattern
Search: qmd for English, a topic index for everything else
Section-level operations: lilmd
Staleness detection: a git-based PVL (append-only change log)
Token compression (below Peripheral): Headroom
Training data: Qwen3-32B locally via LM Studio, zero API cost (data-gen scripts)
Fine-tuning: Unsloth Studio with QLoRA on a 3090 Ti
Inference: GGUF export, LM Studio
Evaluation: KMA-Bench: 166 public test cases (French + PostHog; ClickHouse tested but not redistributable), 14 from real git commits

Try it

Everything here is open. Pick your depth:

Run it. Grab the GGUF from Hugging Face and load it in LM Studio (or search malgamves/peripheral-8b). Set the system prompt to the task you want (eval, routing, or gate; formats are in the model card), then paste a diff and you get a verdict back as JSON.

Benchmark it. KMA-Bench ships 166 cases across two knowledge bases (French + PostHog; the ClickHouse set I also tested on isn't redistributable, since it's CC BY-NC-SA). Run it against your own model and tell me where mine loses.

Reproduce it. The data-generation and fine-tune-prep scripts in the repo take you from a knowledge base to a fine-tuned GGUF, no frontier API needed. A full worked example is coming alongside the write path.

The code. Pipeline, PVL, block gate, and data-gen scripts: github.com/malgamves/peripheral.

I'd also like to extend a special thanks to Paul-Louis, Jules, Josh, Tuana and Clement for taking time to give me pointers as I worked on this write-up. You should check out their work; they are incredible.