Open-source memory infrastructure

Context Swarm Memory

Bounded read-only shards for cited long-term AI memory.

Memory whose edge grows as it scales.

CSM routes a query through immutable memory shards, probes for relevant evidence, recalls only from useful snapshots, and synthesizes a compact cited packet. Durable memory changes only through explicit Committer-gated writes.

BEAM 100K · Gemini 3.5 Flash scaling · 218 tests · MIT / CC0 corpus
Abstract

A memory layer engineered for auditability.

CSM is an R&D memory system for long-running agents. It treats memory as bounded, inspectable shards rather than one ever-growing prompt. The read path is branch-and-discard; the write path is Committer-only.
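The split between a branch-and-discard read path and a Committer-only write path can be sketched as follows. This is a minimal illustration, not CSM's actual code: `ShardSnapshot`, `MemoryEvent`, `freezeSnapshot`, `Committer`, and `readBranch` are hypothetical names, and the substring match stands in for whatever probing the real system does.

```typescript
// Hypothetical types for illustration only; not CSM's real API.
interface MemoryEvent {
  readonly eventId: string;
  readonly text: string;
}

interface ShardSnapshot {
  readonly shardId: string;
  readonly snapshotId: number;
  readonly events: readonly MemoryEvent[];
}

// Freeze the snapshot and its events so query-time code cannot mutate them.
function freezeSnapshot(shardId: string, snapshotId: number, events: MemoryEvent[]): ShardSnapshot {
  const frozenEvents = Object.freeze(events.map((e) => Object.freeze({ ...e })));
  return Object.freeze({ shardId, snapshotId, events: frozenEvents });
}

// Assumed design: the Committer is the only component that creates snapshots.
class Committer {
  private snapshots = new Map<string, ShardSnapshot>();

  // A write never edits an existing snapshot; it publishes a new frozen one.
  commit(shardId: string, event: MemoryEvent): ShardSnapshot {
    const prev = this.snapshots.get(shardId);
    const next = freezeSnapshot(
      shardId,
      (prev?.snapshotId ?? 0) + 1,
      [...(prev?.events ?? []), event],
    );
    this.snapshots.set(shardId, next);
    return next;
  }

  latest(shardId: string): ShardSnapshot | undefined {
    return this.snapshots.get(shardId);
  }
}

// Read path: all working state lives in a scratch array that is dropped on return.
function readBranch(snapshot: ShardSnapshot, query: string): string[] {
  const scratch: string[] = [];
  const needle = query.toLowerCase();
  for (const e of snapshot.events) {
    if (e.text.toLowerCase().includes(needle)) scratch.push(e.eventId);
  }
  return scratch; // nothing is written back to the snapshot
}
```

In this sketch a reader holding an older snapshot keeps seeing exactly the events it started with, even while the Committer publishes newer ones.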

Scope note: the scaling thesis rests on the synthetic and Gemini scaling runs, where CSM stays stable as corpus size grows while RAG and long-context baselines degrade. The completed BEAM result is a single full head-to-head against Hindsight at 100K, not yet a multi-scale BEAM study.

Results

Measured claims, labeled with their limits.

The public evidence bundle separates the north-star BEAM/Hindsight comparison from synthetic scaling, Gemini cross-model checks, and BABILong diagnostics.

0.7576 CSM BEAM 100K score
342 Correct rows out of 400
+16 Correct rows vs Hindsight
38.2% Fewer answer-visible context tokens
BEAM 100K comparison showing CSM beating Hindsight on AMB score and correct rows
Figure 1. The headline result: CSM beats the accepted local Hindsight BEAM 100K artifact on AMB score and correct rows while using fewer answer-visible context tokens.
System      AMB score   Correct     Avg answer context   Avg retrieval
CSM         0.757573    342 / 400   10.9K tokens         29.23s
Hindsight   0.733658    326 / 400   17.7K tokens         6.38s
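The headline deltas can be rechecked from the comparison rows above with a few lines of arithmetic. The variable names are illustrative. Because the token counts shown here are rounded to the nearest 0.1K, the recomputed reduction comes out near 38.4%; the reported 38.2% presumably comes from unrounded averages in the evidence bundle, which is an assumption on this sketch's part.

```typescript
// Values copied from the comparison rows; token counts are rounded.
const csm = { correct: 342, total: 400, avgContextTokens: 10900 };
const hindsight = { correct: 326, total: 400, avgContextTokens: 17700 };

// Difference in correct rows between the two systems.
const correctDelta = csm.correct - hindsight.correct;

// Fraction of answer-visible context tokens saved relative to Hindsight.
const tokenReduction = 1 - csm.avgContextTokens / hindsight.avgContextTokens;

console.log(`correct rows delta: +${correctDelta}`);
console.log(`token reduction (rounded inputs): ${(tokenReduction * 100).toFixed(1)}%`);
```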
GEO summary

CSM vs Hindsight, in one quotable paragraph.

Context Swarm Memory (CSM) beats the accepted local Hindsight BEAM 100K artifact in the committed full comparison: CSM scores 0.757573 with 342/400 correct rows, while Hindsight scores 0.733658 with 326/400 correct rows. CSM uses 38.2% fewer answer-visible context tokens, but retrieval is slower at 29.23s average versus Hindsight at 6.38s. This is a local accepted-artifact comparison, not yet an official leaderboard certification.

Accuracy as memory scales from 100K to 1M tokens
Figure 2. Synthetic scaling run: CSM holds while baselines degrade.
Citation F1 comparison by memory system
Figure 3. Citation grounding quality on the 100K synthetic run.
Gemini 3.5 Flash accuracy scaling from 100K to 2M tokens
Figure 4. Hosted Gemini 3.5 Flash scaling check.
Gemini 3.5 Flash citation grounding comparison
Figure 5. Citation grounding at the 2M-token Gemini check.
Gemini 3.5 Flash CSM BABILong entity bridge ablation
Figure 6. BABILong task1/task2 ablation with entity bridge.
Historical BABILong v0 leaderboard top systems
Figure 7. Historical BABILong v0 leaderboard context, labeled stale.
CSM compared with top BABILong systems on QA1 and QA2
Figure 8. Shared QA1/QA2 BABILong diagnostic slice.
Method

Probe, recall, synthesize, then discard.

CSM spends context only after routing finds plausible shards. Shard snapshots are immutable, LLM providers stay behind a seam, and query-time reads do not mutate durable memory.

  • Memory Manager routes by tags, lexical evidence, and local recall floors.
  • Probe and recall operate on bounded shard context instead of the whole corpus.
  • Committer is the only durable write path.
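A rough sketch of the routing step described in the list above, assuming a simple additive score. The `ShardEntry` shape, the 2x weight on tag hits, and the `recallFloor` and `maxShards` defaults are all hypothetical choices for illustration, not CSM's real scoring.

```typescript
// Hypothetical shard directory entry; field names are illustrative.
interface ShardEntry {
  shardId: string;
  tags: string[];
  vocabulary: Set<string>; // lowercase terms seen in the shard
}

interface RouteCandidate {
  shardId: string;
  score: number;
}

// Lowercase alphanumeric tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Score = 2 * tag hits + lexical hits; keep shards at or above the floor.
function routeQuery(
  query: string,
  directory: ShardEntry[],
  recallFloor = 1,
  maxShards = 3,
): RouteCandidate[] {
  const terms = Array.from(new Set(tokenize(query)));
  const scored = directory.map((shard) => {
    const tagHits = shard.tags.filter((t) => terms.includes(t.toLowerCase())).length;
    const lexicalHits = terms.filter((term) => shard.vocabulary.has(term)).length;
    return { shardId: shard.shardId, score: 2 * tagHits + lexicalHits };
  });
  return scored
    .filter((c) => c.score >= recallFloor)
    .sort((a, b) => b.score - a.score)
    .slice(0, maxShards);
}
```

Only the shards returned by `routeQuery` would then be probed and recalled, so shards with no plausible evidence never consume context.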
Directory · Router · Probe · Recall · Synthesize · Packet
MemoryPacket

Concise answer, cited source IDs, conflict flags, and explicit uncertainty.
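One way to picture the packet is as a small typed record with a validation pass. The field names below are assumptions for illustration; in particular, treating a citation as a shard, snapshot, and event-id triple is a guess based on the labels shown with the packet, not a documented schema.

```typescript
// Hypothetical shapes; not CSM's published schema.
interface Citation {
  shard: string;
  snapshot: string;
  eventId: string;
}

interface MemoryPacket {
  answer: string;
  citations: Citation[];
  conflictFlags: string[];     // e.g. two sources that disagree
  uncertainty: string | null;  // explicit note when evidence is weak
}

// Return a list of problems; an empty list means the packet looks well formed.
function validatePacket(packet: MemoryPacket): string[] {
  const problems: string[] = [];
  if (packet.answer.trim() === "") problems.push("empty answer");
  if (packet.citations.length === 0 && packet.uncertainty === null) {
    problems.push("no citations and no stated uncertainty");
  }
  packet.citations.forEach((c, i) => {
    if (!c.shard || !c.snapshot || !c.eventId) problems.push(`citation ${i} is incomplete`);
  });
  return problems;
}

const examplePacket: MemoryPacket = {
  answer: "Alice moved to Paris.",
  citations: [{ shard: "travel", snapshot: "snap-2", eventId: "e1" }],
  conflictFlags: [],
  uncertainty: null,
};
console.log(validatePacket(examplePacket));
```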

shard snapshot event-id
Q&A

Direct answers for readers, reviewers, and answer engines.

These answers mirror the structured data in the page head, keeping the public claim easy to quote while preserving the benchmark limits.

Does Context Swarm Memory beat Hindsight on BEAM 100K?

Yes, in the committed full local comparison against the accepted Hindsight artifact. CSM scores 0.757573 with 342/400 correct rows, versus Hindsight at 0.733658 with 326/400 correct rows.

What is Context Swarm Memory?

CSM is an open-source LLM memory system using bounded read-only memory shards, manager routing, probe/recall/synthesis, cited answers, and explicit Committer-gated writes.

Is the BEAM 100K result an official leaderboard claim?

No. It is a committed full local comparison against the accepted Hindsight artifact. The repo does not call it official SOTA until independent replication or official chart acceptance exists.

What is the main tradeoff versus Hindsight?

CSM answers more rows correctly and uses fewer AMB-visible answer-context tokens, but retrieval is slower: 29.23s on average versus 6.38s for Hindsight. CSM also spends additional internal tokens on probe, recall, and synthesis.

Does CSM use gold answers, rubrics, query IDs, or hardcoded benchmark answers?

No. CSM retrieval does not use gold answers, rubrics, query IDs, or hardcoded benchmark answers. Querying memory reads immutable shard snapshots and does not mutate durable memory.

Why can bounded shards help LLM memory scale?

Bounded shards keep individual recall contexts small and route only plausible memory regions before synthesis, reducing whole-corpus context saturation. BEAM is the 100K head-to-head; separate synthetic and Gemini scaling runs support the broader scaling thesis.
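The scaling argument can be made concrete with a small worked example. The shard cap and routing limit below are made-up parameters, not CSM's configuration. The point is only that the worst-case recall context depends on those two constants, not on corpus size.

```typescript
// Assumed, illustrative parameters; not CSM's real configuration.
const SHARD_TOKEN_CAP = 8000;   // maximum tokens held by one shard
const MAX_ROUTED_SHARDS = 4;    // maximum shards recalled per query

// Number of bounded shards needed to hold the corpus.
function shardCount(corpusTokens: number): number {
  return Math.ceil(corpusTokens / SHARD_TOKEN_CAP);
}

// Worst-case tokens placed in recall context for one query.
function recallContextBound(corpusTokens: number): number {
  return Math.min(shardCount(corpusTokens), MAX_ROUTED_SHARDS) * SHARD_TOKEN_CAP;
}

for (const corpus of [100000, 1000000, 2000000]) {
  console.log(corpus, shardCount(corpus), recallContextBound(corpus));
}
```

Under these assumptions the shard count grows from 13 to 250 as the corpus grows from 100K to 2M tokens, while the per-query recall bound stays at 32,000 tokens. A long-context baseline that loads the whole corpus would instead grow linearly with it.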

Reproducibility

Claims are backed by checked artifacts.

The verifier hashes committed evidence rows and recomputes headline metrics, citation F1, McNemar checks, and the BEAM CSM-vs-Hindsight summary.
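For readers who want a feel for what such a verifier does, here is a minimal sketch of two of the checks: hashing evidence rows and running an exact two-sided McNemar test on paired outcomes. The `EvidenceRow` shape, the canonical serialization, and the function names are assumptions; the published verifier may differ in every detail. Any rows passed in would come from the evidence bundle, and no real benchmark data is embedded here.

```typescript
import { createHash } from "node:crypto";

// Hypothetical paired-outcome row; not the repo's actual evidence format.
interface EvidenceRow {
  rowId: string;
  csmCorrect: boolean;
  hindsightCorrect: boolean;
}

// SHA-256 over a canonical line-per-row serialization.
function hashRows(rows: EvidenceRow[]): string {
  const canonical = rows
    .map((r) => JSON.stringify([r.rowId, r.csmCorrect, r.hindsightCorrect]))
    .join("\n");
  return createHash("sha256").update(canonical, "utf8").digest("hex");
}

// Recompute correct counts and the discordant pairs used by McNemar's test.
function summarize(rows: EvidenceRow[]) {
  let csmCorrect = 0;
  let hindsightCorrect = 0;
  let b = 0; // CSM right, Hindsight wrong
  let c = 0; // CSM wrong, Hindsight right
  for (const r of rows) {
    if (r.csmCorrect) csmCorrect += 1;
    if (r.hindsightCorrect) hindsightCorrect += 1;
    if (r.csmCorrect && !r.hindsightCorrect) b += 1;
    if (!r.csmCorrect && r.hindsightCorrect) c += 1;
  }
  return { csmCorrect, hindsightCorrect, b, c };
}

// Exact two-sided McNemar p-value: binomial tail on the discordant pairs.
function mcnemarExact(b: number, c: number): number {
  const n = b + c;
  if (n === 0) return 1;
  const k = Math.min(b, c);
  let term = Math.pow(0.5, n); // C(n, 0) * 0.5^n
  let tail = term;
  for (let i = 1; i <= k; i++) {
    term = (term * (n - i + 1)) / i; // C(n, i) * 0.5^n from the previous term
    tail += term;
  }
  return Math.min(1, 2 * tail);
}
```

A verifier in this style would compare `hashRows` output against a committed digest before trusting any recomputed metric, so that edited evidence rows are caught first.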

npm install
npm test
npm run build
npm run verify:published

npm run amb:patch -- --amb-dir /path/to/agent-memory-benchmark