Full BEAM ladder 100K → 10M via the public AMB runner · 100K submitted as PR #19 · pending acceptance

The haystack grows 100×. The cost stays flat.

Context Swarm Memory (CSM) is open-source agent memory whose total cost per query stays flat — about 36–38K input tokens all-in (the small, fully-cited packet the answer model sees, ~26–33K, plus CSM's own probe/recall/synthesize calls) — whether the underlying memory holds 100K or 10M tokens. Bounded read-only shards, cheap probes, cited recall, one explicit write gate; querying memory never mutates it.

read-only shard reads
cited recall: shard · snapshot · event
Committer-gated writes
zero-LLM indexing

36–38K · total input tokens/query, 100K → 10M · all-in cost stays flat as the haystack grows 100×

2,000 · BEAM queries graded · four tiers, unmodified AMB runner

0.7367 · BEAM 100K score · leads Hindsight's 0.7337

333/333 · tests passing · offline, no API keys

The full BEAM ladder (100K → 10M) was produced by the unmodified public AMB runner (their CLI, scoring, and judge path), same answer/judge models as the accepted Hindsight artifact. The 100K tier is submitted upstream as vectorize-io/agent-memory-benchmark#19, pending acceptance — not an official leaderboard placement. Single-trial.

The measured result

The full BEAM ladder: 100K → 10M.

CSM ran through the unmodified public AMB runner at every BEAM split, next to Vectorize's own committed Hindsight run (same answer + judge models). CSM keeps its total cost per query bounded — ~36–38K input tokens all-in (a ~26–33K answer packet plus its own probe/recall/synthesize calls) — at every tier, even at 10M, where each unit is an ~11.7M-token document. It trails Hindsight above 100K, but degrades gracefully and stabilizes at the extreme while Hindsight drops — so the gap narrows from 1M to 10M.

Figure (cost): input tokens per query, log scale, BEAM 100K to 10M. A full-context line climbs ~100× from ~100K to ~10M tokens and crosses the ~1–2M model context window above the 1M tier. CSM all-in stays near 36–38K (35.8/36.2/38.1/35.9K); Hindsight's answer context is leaner and rises gradually from 17.7K to 27.3K (17.7/20.5/23.9/27.3K).

**Cost: the headline, drawn.** As the haystack grows 100×, a brute-force full-context system's input explodes past the model's context window, while CSM (~36–38K all-in) and Hindsight (~18–27K answer context, leaner) stay bounded. Retrieval cost does not scale with the corpus.

Figure (accuracy): CSM vs Hindsight on BEAM 100K to 10M. CSM 0.737, 0.659, 0.569, 0.562; Hindsight 0.734, 0.711, 0.739, 0.641.

**Accuracy: the honest other half.** CSM trails Hindsight above 100K, then holds nearly flat from 1M to 10M while Hindsight drops, so the gap more than halves (+0.17 → +0.08).

Full BEAM ladder, unmodified AMB runner, same answer/judge models, single-trial. Hindsight = Vectorize's own committed run; leader per tier highlighted. **CSM all-in** = answer-visible context **plus** CSM's internal probe/recall/synthesize tokens — the honest per-query total. Hindsight discloses no internal cost (and distills memory at ingest), so its column is answer-context only and it has no comparable all-in figure.
BEAM tier	CSM score	Hindsight score	CSM answer-ctx	Hindsight answer-ctx	CSM all-in input
100K	0.7367	0.7337	27.0K	17.7K	35.8K
500K	0.6589	0.7112	26.6K	20.5K	36.2K
1M	0.5693	0.7386	28.2K	23.9K	38.1K
10M	0.5616	0.6408	32.5K	27.3K	35.9K

+0.17 → +0.08 · Hindsight's lead over CSM more than halves from 1M to 10M

−0.008 vs −0.098 · score change 1M→10M: CSM nearly flat (improves in 7/10 categories) while Hindsight drops

36–38K · CSM all-in input/query, flat across a 100× range (~26–33K answer packet + internal pipeline; Hindsight is leaner on answer context and discloses no internal cost)
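The gap figures above are plain arithmetic on the ladder table. A minimal sketch that recomputes them, assuming only the scores printed in the table (the names `TierScores`, `ladder`, and `round3` are illustrative, not part of CSM):

```typescript
// Scores copied from the ladder table above (single-trial).
interface TierScores {
  tier: string;
  csm: number;
  hindsight: number;
}

const ladder: TierScores[] = [
  { tier: "100K", csm: 0.7367, hindsight: 0.7337 },
  { tier: "500K", csm: 0.6589, hindsight: 0.7112 },
  { tier: "1M", csm: 0.5693, hindsight: 0.7386 },
  { tier: "10M", csm: 0.5616, hindsight: 0.6408 },
];

// Round to three decimals so the values match the page's precision.
const round3 = (x: number): number => Math.round(x * 1000) / 1000;

// Hindsight's lead over CSM at each tier (negative means CSM leads).
const hindsightLead = ladder.map((t) => ({
  tier: t.tier,
  lead: round3(t.hindsight - t.csm),
}));

// Score change from the 1M tier to the 10M tier for each system.
const oneM = ladder[2];
const tenM = ladder[3];
const csmDelta = round3(tenM.csm - oneM.csm); // -0.008
const hindsightDelta = round3(tenM.hindsight - oneM.hindsight); // -0.098

console.log(hindsightLead, { csmDelta, hindsightDelta });
```

The 1M and 10M leads come out at 0.169 and 0.079, which the page rounds to +0.17 and +0.08.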

The 100K submission (PR #19)

The 100K tier is the one submitted upstream for review. CSM's official-runner rerun (a separate single-trial run, distinct from the 0.7367 ladder result above) scored 0.743110 (337/400 rows graded correct) vs 0.733658 (326/400) for Hindsight's accepted artifact: a thin, single-trial lead, with retrieval 1.84× faster. The full-ladder table above shows the honest rest: Hindsight leads at 500K/1M/10M, and CSM narrows the gap at the extreme.

System	Score	Graded rows correct
CSM (this repo)	0.743110	337 / 400
Hindsight (accepted artifact)	0.733658	326 / 400

Difference: +0.95 pts · +11 rows · 1.84× faster retrieval · +53% answer context (disclosed trade)

Same answer & judge models as the accepted Hindsight artifact · single-trial · submitted upstream as PR #19, pending acceptance.

Correct rows · +11 · higher is better · 400 questions
CSM 337
Hindsight 326

Avg retrieve latency · 1.84× faster · lower is better
CSM 3.47s
Hindsight 6.38s
These figures are from the PR #19 official-runner rerun. The separate frozen-ladder run measured 4.5/7.5/5.6/11.9s across 100K→10M (non-monotonic, peaking at 10M).

Avg answer-visible context · +53% (honest trade) · lower is better; Hindsight is leaner here
CSM 27.0K
Hindsight 17.7K
Answer context only (apples-to-apples). CSM's all-in is ~35.8K with +8.8K internal input; Hindsight discloses no internal cost.

The all-in cost, printed — not left to be added up

At 100K, CSM's all-in cost is ~35.8K input tokens/query: a 27.0K answer packet (larger than Hindsight's 17.7K, +53%, because the coverage chronicle fills its return-K budget) plus 8,805 input / 625 output tokens on its internal probe/recall/synthesize pipeline (down 58% from May's 21,020). We print the sum rather than report the halves separately. Hindsight reports only its answer context — no internal figure — and distills memory at ingest, so its all-in total is unstated and not directly comparable. The internal tokens run on models ~10× cheaper than the answer model, so they're ~25% of the token count but ~7% of the dollars. The run is single-trial; no gold answers, rubrics, or query IDs ever reach retrieval.
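The token figures in this paragraph can be checked by hand. A small sketch, assuming the rounded 27.0K and 17.7K values stand for 27,000 and 17,700 tokens (the variable names are illustrative). It does not reproduce the ~7% dollar share, which depends on model prices not stated here:

```typescript
// Figures quoted in the paragraph above (BEAM 100K, single-trial).
const csmAnswerTokens = 27000; // CSM answer packet (rounded on the page)
const hindsightAnswerTokens = 17700; // Hindsight answer context (rounded)
const csmInternalInput = 8805; // probe/recall/synthesize input tokens
const mayInternalInput = 21020; // May pipeline, same measure

// Express a ratio as a percentage rounded to one decimal place.
const pct = (x: number): number => Math.round(x * 1000) / 10;

// All-in input = answer packet + internal pipeline input.
const allInInput = csmAnswerTokens + csmInternalInput; // 35805
// How much larger CSM's answer packet is than Hindsight's.
const answerOverhead = pct(csmAnswerTokens / hindsightAnswerTokens - 1); // 52.5
// Internal input as a share of the all-in token count.
const internalShare = pct(csmInternalInput / allInInput); // 24.6
// Reduction in internal input since May.
const internalCut = pct(1 - csmInternalInput / mayInternalInput); // 58.1

console.log({ allInInput, answerOverhead, internalShare, internalCut });
```

These round to the ~35.8K, +53%, ~25%, and 58% quoted above.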

Evidence: AMB_BEAM_LADDER_2026_06_18.md · AMB_BEAM_100K_OFFICIAL_RERUN.md · upstream PR #19 · npm run verify:published

Architecture

Route, probe, recall, synthesize — then discard.

Memory is a swarm of bounded, immutable shards behind a read-only manifest. CSM spends context only after a zero-LLM router finds plausible shards; the answer arrives as one compact, cited packet.

01 · Router
Keyword/tag scorer over the shard directory. No LLM, no vector DB required at index time.

02 · Probe
Cheap relevance pass per candidate shard: “is this memory worth recalling from?”

03 · Recall
Structured, citation-bearing extraction from only the shards that passed the probe.

04 · Synthesize
Merge, dedupe, flag conflicts, then emit a compact MemoryPacket for the agent.

→ MemoryPacket
Answer + key claims, every line cited to shard, snapshot, and event IDs.
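To make the four stages concrete, here is a toy sketch of the route, probe, recall, synthesize flow. Everything in it is an assumption made for illustration: the types `ShardRecord` and `CitedClaim`, the function names, and the use of plain keyword overlap where the real pipeline calls models. Conflict flagging is omitted.

```typescript
// Hypothetical types for illustration; CSM's real interfaces may differ.
interface ShardRecord {
  shardId: string;
  snapshotId: string;
  tags: string[];
  events: { eventId: string; text: string }[];
}

interface CitedClaim {
  text: string;
  shardId: string;
  snapshotId: string;
  eventId: string;
}

const words = (s: string): string[] =>
  s.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);

// True when the text shares at least one term with the query.
const overlaps = (query: string, text: string): boolean => {
  const q = words(query);
  return words(text).some((w) => q.indexOf(w) >= 0);
};

// 01 Router: zero-LLM tag match over the shard directory.
function route(query: string, shards: ShardRecord[]): ShardRecord[] {
  const q = words(query);
  return shards.filter((s) => s.tags.some((t) => q.indexOf(t) >= 0));
}

// 02 Probe: cheap relevance check (keyword overlap stands in for a small-model call).
function probe(query: string, shard: ShardRecord): boolean {
  return shard.events.some((e) => overlaps(query, e.text));
}

// 03 Recall: extract matching events, each carrying its full citation.
function recall(query: string, shard: ShardRecord): CitedClaim[] {
  return shard.events
    .filter((e) => overlaps(query, e.text))
    .map((e) => ({
      text: e.text,
      shardId: shard.shardId,
      snapshotId: shard.snapshotId,
      eventId: e.eventId,
    }));
}

// 04 Synthesize: merge and dedupe claims into one compact packet.
function synthesize(claims: CitedClaim[]): { claims: CitedClaim[] } {
  const unique: CitedClaim[] = [];
  claims.forEach((c) => {
    if (!unique.some((u) => u.text === c.text)) unique.push(c);
  });
  return { claims: unique };
}

// The read path only reads its inputs and returns a new packet.
function ask(query: string, shards: ShardRecord[]): { claims: CitedClaim[] } {
  const useful = route(query, shards).filter((s) => probe(query, s));
  const claims = useful.reduce<CitedClaim[]>(
    (acc, s) => acc.concat(recall(query, s)),
    []
  );
  return synthesize(claims);
}

const demoShards: ShardRecord[] = [
  { shardId: "s1", snapshotId: "snap-1", tags: ["travel"],
    events: [{ eventId: "e1", text: "Booked flight to Lisbon" }] },
  { shardId: "s2", snapshotId: "snap-4", tags: ["travel", "budget"],
    events: [{ eventId: "e7", text: "Hotel budget approved" }] },
  { shardId: "s3", snapshotId: "snap-2", tags: ["health"],
    events: [{ eventId: "e9", text: "Dentist appointment" }] },
];

const demoPacket = ask("travel flight lisbon", demoShards);
console.log(demoPacket.claims);
```

In this toy run the router admits s1 and s2 on the `travel` tag, the probe drops s2, and the packet holds one claim cited to s1 / snap-1 / e1.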

Read path: branch-and-discard

ask() never appends events, writes snapshots, or mutates the chronicle. Enforced in CI by SHA-256 file hashes (tests/mutationSafety.test.ts).
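A hash-before, hash-after check of this kind might look like the sketch below. It is not the contents of tests/mutationSafety.test.ts: the in-memory `FileMap` stands in for the on-disk memory directory, and `fingerprint` and `leavesStoreUntouched` are hypothetical helper names. Only Node's `createHash` is a real API.

```typescript
import { createHash } from "node:crypto";

// Hypothetical in-memory stand-in for the memory directory: path -> contents.
type FileMap = { [path: string]: string };

// Hash every file so any byte-level change, addition, or removal is detectable.
function fingerprint(files: FileMap): { [path: string]: string } {
  const out: { [path: string]: string } = {};
  Object.keys(files)
    .sort()
    .forEach((p) => {
      out[p] = createHash("sha256").update(files[p]).digest("hex");
    });
  return out;
}

// Run a read operation and report whether the store is byte-identical afterwards.
function leavesStoreUntouched(
  files: FileMap,
  read: (f: FileMap) => unknown
): boolean {
  const before = JSON.stringify(fingerprint(files));
  read(files);
  const after = JSON.stringify(fingerprint(files));
  return before === after;
}

const demoStore: FileMap = {
  "shards/s1/events.jsonl": '{"eventId":"e1","text":"Booked flight"}\n',
  "shards/s1/snapshot-0001.json": '{"version":1}',
};

// A well-behaved read only inspects the data.
const pureRead = (f: FileMap) => f["shards/s1/events.jsonl"].length;
// A buggy read appends an event, which the hashes catch.
const leakyRead = (f: FileMap) => {
  f["shards/s1/events.jsonl"] += "{}\n";
};

console.log(leavesStoreUntouched(demoStore, pureRead)); // true
console.log(leavesStoreUntouched(demoStore, leakyRead)); // false
```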

Write path: Committer-gated

Durable memory changes only through appendEventAndSnapshot or an explicit Committer decision. Snapshots are immutable and versioned; the storage layer refuses overwrites.
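One way a storage layer can refuse overwrites is to key snapshots by shard and version and throw when a key already exists. The sketch below is a hypothetical in-memory model of that rule; `SnapshotStore`, `put`, and `get` are illustrative names, not CSM's storage API.

```typescript
// Hypothetical snapshot record; fields are illustrative.
interface SnapshotRecord {
  readonly shardId: string;
  readonly version: number;
  readonly body: string;
}

class SnapshotStore {
  private records: { [key: string]: SnapshotRecord } = {};

  // Write a new version; throw instead of replacing an existing one.
  put(shardId: string, version: number, body: string): SnapshotRecord {
    const key = shardId + "@" + version;
    if (Object.prototype.hasOwnProperty.call(this.records, key)) {
      throw new Error("snapshot " + key + " already exists; write a new version");
    }
    // Freeze the record so holders of a reference cannot edit it in place.
    const rec: SnapshotRecord = Object.freeze({ shardId, version, body });
    this.records[key] = rec;
    return rec;
  }

  get(shardId: string, version: number): SnapshotRecord | undefined {
    return this.records[shardId + "@" + version];
  }
}

const snapshots = new SnapshotStore();
snapshots.put("s1", 1, "first state");
snapshots.put("s1", 2, "second state"); // a change is a new version

let overwriteRefused = false;
try {
  snapshots.put("s1", 1, "rewritten history");
} catch (err) {
  overwriteRefused = true;
}
console.log(overwriteRefused); // true
```

On disk, the same rule could be enforced by opening files in exclusive-create mode, so a second write to an existing path fails rather than truncating it.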

Indexing: zero LLM cost

Routing starts from keywords and tags, with a local MiniLM embedding recall floor — no LLM-generated index is ever built, so adding memory costs no API tokens.
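A keyword/tag router needs nothing more than string matching over a shard directory. The sketch below shows one possible scoring scheme. The directory shape, the 2× tag weight, and the names `ShardDirectoryEntry`, `scoreShard`, and `routeTopK` are all assumptions, and the MiniLM embedding floor is left out.

```typescript
// Hypothetical directory entry; CSM's manifest format may differ.
interface ShardDirectoryEntry {
  shardId: string;
  tags: string[];
  keywords: string[];
}

const tokenize = (text: string): string[] =>
  text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);

// Score = tagWeight * matched tags + matched keywords (weights are illustrative).
function scoreShard(
  queryTerms: string[],
  entry: ShardDirectoryEntry,
  tagWeight: number = 2
): number {
  const tagHits = entry.tags.filter((t) => queryTerms.indexOf(t) >= 0).length;
  const keywordHits = entry.keywords.filter((k) => queryTerms.indexOf(k) >= 0).length;
  return tagWeight * tagHits + keywordHits;
}

// Return the ids of the k best-scoring shards, skipping shards with no match.
function routeTopK(
  query: string,
  directory: ShardDirectoryEntry[],
  k: number
): string[] {
  const terms = tokenize(query);
  return directory
    .map((e) => ({ shardId: e.shardId, score: scoreShard(terms, e) }))
    .filter((r) => r.score > 0)
    // Highest score first; ties broken by id so the order is deterministic.
    .sort((a, b) => b.score - a.score || a.shardId.localeCompare(b.shardId))
    .slice(0, k)
    .map((r) => r.shardId);
}

const demoDirectory: ShardDirectoryEntry[] = [
  { shardId: "finance", tags: ["budget"], keywords: ["invoice", "hotel"] },
  { shardId: "health", tags: ["medical"], keywords: ["dentist"] },
  { shardId: "trip-2026", tags: ["travel"], keywords: ["lisbon", "flight", "hotel"] },
];

console.log(routeTopK("hotel for the lisbon travel", demoDirectory, 2));
// [ 'trip-2026', 'finance' ]
```

Because the index is just tags and keywords, adding a shard means adding a directory entry, with no model call involved.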

Coverage chronicle, deterministic

Summary/ordering/temporal queries get a date-ordered, fully-cited timeline assembled without extra LLM calls. Date arithmetic is computed, never delegated to the model.
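A deterministic timeline of this kind can be assembled with a sort and integer date math. The sketch below assumes events carry ISO `YYYY-MM-DD` dates. `ChronicleEntry`, `buildChronicle`, and `daysBetween` are illustrative names rather than CSM's own.

```typescript
// Hypothetical cited event; field names are illustrative.
interface ChronicleEntry {
  date: string; // ISO calendar date, YYYY-MM-DD
  text: string;
  shardId: string;
  snapshotId: string;
  eventId: string;
}

// Convert an ISO date to a whole-day count since the Unix epoch (UTC).
function toUtcDay(iso: string): number {
  const parts = iso.split("-").map(Number);
  return Date.UTC(parts[0], parts[1] - 1, parts[2]) / 86400000;
}

// Date-ordered copy of the events; ties broken by event id for determinism.
function buildChronicle(entries: ChronicleEntry[]): ChronicleEntry[] {
  return entries
    .slice()
    .sort(
      (a, b) =>
        toUtcDay(a.date) - toUtcDay(b.date) || a.eventId.localeCompare(b.eventId)
    );
}

// Date arithmetic is computed here, not asked of a model.
function daysBetween(fromIso: string, toIso: string): number {
  return toUtcDay(toIso) - toUtcDay(fromIso);
}

// One timeline line with its shard/snapshot/event citation.
const renderLine = (e: ChronicleEntry): string =>
  `${e.date} ${e.text} [${e.shardId}/${e.snapshotId}/${e.eventId}]`;

const demoEvents: ChronicleEntry[] = [
  { date: "2026-03-02", text: "Flight to Lisbon", shardId: "s1", snapshotId: "snap-3", eventId: "e3" },
  { date: "2026-01-15", text: "Trip approved", shardId: "s1", snapshotId: "snap-1", eventId: "e1" },
  { date: "2026-02-01", text: "Hotel booked", shardId: "s2", snapshotId: "snap-2", eventId: "e2" },
];

const demoTimeline = buildChronicle(demoEvents);
console.log(demoTimeline.map(renderLine).join("\n"));
console.log(daysBetween("2026-01-15", "2026-03-02")); // 46
```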

Evidence: ARCHITECTURE.md · mutationSafety.test.ts · npm test

June 2026 R&D wave

One orchestrator, four research agents, every change gated.

A full-repo planning pass dispatched four parallel agent briefs — coverage/chronicle recall, a hybrid router, a BEAM-slice retrieval harness, and Gemini caching. All four delivered; every feature merged behind a default-off flag, and each landed only after accuracy gates. Two of the wave’s load-bearing hypotheses did not survive measurement — and that’s published too.

Coverage chronicle — shipped, now default

A deterministic, fully-cited timeline assembler for summary/ordering/temporal queries. Offline it recovered 12/13 and 5/6 gold events on the two known coverage failures; on real BEAM 100K data it lifted event_ordering cov@24 from 0.475 to 0.659.

Hybrid router — proven, then shelved honestly

Mechanism proven offline (recall@3 0.714 → 0.857; thin-metadata gold-top-3 0/4 → 4/4) but no measurable effect at 100K scale — so it stays default-off until a 500K re-gate, exactly as measured.

BEAM-slice harness — the new gate

Retrieval-only recall@k on the two losing BEAM categories, minutes instead of a 400-query run. Gold facets stay strictly eval-side, enforced by an import-graph leakage-firewall test.
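As a rough illustration of a retrieval-only gate, the sketch below scores a retrieved list against gold facet IDs that live only in the evaluation code. The exact metric definition is an assumption (per-query fraction of gold facets found in the top k, averaged over queries), as are the names `EvalCase`, `coverageAtK`, and `meanCoverageAtK`.

```typescript
// Eval-side record: the retriever never sees `gold`, only the harness does.
interface EvalCase {
  queryId: string;
  retrieved: string[]; // ranked ids returned by retrieval
  gold: string[]; // gold facet ids, kept on the eval side
}

// Fraction of gold facets that appear in the first k retrieved ids.
function coverageAtK(c: EvalCase, k: number): number {
  if (c.gold.length === 0) return 0;
  const topK = c.retrieved.slice(0, k);
  const hits = c.gold.filter((g) => topK.indexOf(g) >= 0).length;
  return hits / c.gold.length;
}

// Mean coverage over all queries in the slice.
function meanCoverageAtK(cases: EvalCase[], k: number): number {
  if (cases.length === 0) return 0;
  const total = cases.reduce((sum, c) => sum + coverageAtK(c, k), 0);
  return total / cases.length;
}

// Synthetic data for illustration only.
const demoCases: EvalCase[] = [
  { queryId: "q1", retrieved: ["a", "b", "c", "d"], gold: ["a", "d"] },
  { queryId: "q2", retrieved: ["x", "y", "z"], gold: ["z", "q"] },
];

console.log(meanCoverageAtK(demoCases, 2)); // 0.25
console.log(meanCoverageAtK(demoCases, 4)); // 0.75
```

Because no answer model or judge is involved, a slice like this runs in minutes rather than requiring a full benchmark pass.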

Gemini caching — hypothesis falsified

The projected “40–60% input-cost cut” did not survive verification: a measured 4,096-token implicit-cache floor means every CSM call is sub-floor — today’s pipeline gets exactly zero caching. The observability shipped anyway, surfacing ~$2.00/run of previously invisible thinking spend.

The latency rebuild, step by measured step

Step	Latency
May architecture (BEAM avg)	29.2s
+ parallel probes & recalls (30q gate)	10.45s
+ digest dates & top-1 speculation (30q gate)	8.96s
+ flash-lite probes (30q gate, 29/30 held)	7.15s
Official-runner BEAM rerun (avg retrieve)	3.47s

Retrieval coverage on the two BEAM categories CSM lost (real BEAM 100K data, 80 queries, gold-facet proxy, bootstrap 95% CIs; gold is eval-side only)
Returned-to-harness coverage	Legacy pipeline	Coverage mode + chronicle
event_ordering cov@24	0.475	0.659 (CIs non-overlapping)
event_ordering cov@32	0.615	0.715
retrieved gold coverage (both categories)	0.61–0.65	0.80–0.83 (CIs non-overlapping)
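For readers unfamiliar with bootstrap intervals, here is a minimal percentile-bootstrap sketch. It is not the harness's code: the seeded generator, the resample count, and the per-query values are all assumptions, and the data is synthetic.

```typescript
// Small seeded generator (a linear congruential generator) so runs are repeatable.
function makeRng(seed: number): () => number {
  let state = seed % 4294967296;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Percentile bootstrap: resample with replacement, take the 2.5th and 97.5th
// percentiles of the resampled means as the 95% interval.
function bootstrapCi(
  values: number[],
  resamples: number,
  rng: () => number
): { lo: number; hi: number } {
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < values.length; i++) {
      sum += values[Math.floor(rng() * values.length)];
    }
    means.push(sum / values.length);
  }
  means.sort((a, b) => a - b);
  const lo = means[Math.floor(0.025 * resamples)];
  const hi = means[Math.min(resamples - 1, Math.ceil(0.975 * resamples) - 1)];
  return { lo, hi };
}

// Two intervals are "non-overlapping" when one ends before the other begins.
function intervalsOverlap(
  a: { lo: number; hi: number },
  b: { lo: number; hi: number }
): boolean {
  return a.lo <= b.hi && b.lo <= a.hi;
}

// Synthetic per-query coverage values, for illustration only.
const legacyCoverage = [0.25, 0.5, 0.5, 0.25, 0.75, 0.5, 0.25, 0.5, 0.5, 0.75];
const chronicleCoverage = [0.75, 1, 0.75, 0.5, 1, 0.75, 0.75, 1, 0.5, 0.75];

const legacyCi = bootstrapCi(legacyCoverage, 2000, makeRng(1));
const chronicleCi = bootstrapCi(chronicleCoverage, 2000, makeRng(2));
console.log(legacyCi, chronicleCi, intervalsOverlap(legacyCi, chronicleCi));
```

Non-overlapping intervals are a conservative signal that a difference is unlikely to be resampling noise, which is how the table above uses them.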

Evidence: PERF_BREAKDOWN.md · RD_PORTFOLIO_2026_06.md · npm test

New: write-time memory (July 2026)

A review of the ladder's failed answers found two failure mechanisms: missing scattered facts (summarization, ordering) and stale-value aggregation (multi-session). One write-time lever was built for each; both are off by default behind intent gates validated on all 2,000 BEAM queries (zero fires on the eight categories CSM already wins). Measured so far, in the official config and paired on the same 40 queries: BEAM 100K summarization rose from 0.714 to 0.936 at 43% less answer-visible context. The same lever is a score wash on event_ordering (published anyway) while cutting its context 58%. The fact registry for aggregation queries is built but not yet score-measured. Designs, gates, cost disclosures, and artifact hashes are in WRITE_TIME_MEMORY_2026_07.md.

Direct answers

Frequently asked questions.

These mirror the structured FAQ data in the page head, so humans and answer engines quote the same hedged claims.

Does CSM beat Hindsight on BEAM?

Only at 100K, and within single-trial noise: CSM 0.7367 vs Hindsight 0.7337 on the ladder (the submitted official-runner rerun scored 0.743110, with 337/400 rows graded correct, retrieving 1.84× faster). Across the full ladder Hindsight leads above 100K (500K/1M/10M: 0.711 / 0.739 / 0.641 vs CSM 0.659 / 0.569 / 0.562), so CSM does not beat Hindsight overall. What holds up is graceful degradation: from 1M to 10M CSM is essentially flat while Hindsight drops, narrowing the gap from +0.17 to +0.08.

Is this an official leaderboard claim?

No. The result was produced by AMB’s own runner and has been submitted to the maintainers (vectorize-io/agent-memory-benchmark#19); the repo does not call it official until they accept the provider/result.

Did CSM use gold answers or hardcoded benchmark logic?

No. CSM retrieval receives only the ingested documents, the query, user id, and timestamp — no gold answers, rubrics, query IDs, or benchmark-specific hardcoding (the June wave also deleted the legacy domain term tables from the active path).

What is Context Swarm Memory?

An open-source LLM memory system that treats memory as bounded, immutable, read-only shards. A Memory Manager routes a query to candidate shards, probes them cheaply, recalls from only the useful ones, and synthesizes a compact answer cited to shard, snapshot, and event IDs. Querying memory never mutates it; durable changes go through an explicit Committer protocol.

What is the honest tradeoff versus Hindsight?

At 100K CSM scores higher and retrieves 1.84× faster, but it spends more tokens. CSM's all-in cost is ~35.8K input tokens/query: a 27.0K answer packet (vs Hindsight's 17.7K, +53%) plus 8,805 input / 625 output internal tokens on probe/recall/synthesize. Hindsight reports only its answer context (no internal figure) and distills memory at ingest, so its all-in is unstated. CSM's internal tokens run on models ~10× cheaper — ~25% of the token count but ~7% of the dollars. Above 100K Hindsight also scores higher. All single-trial.

Reproducibility

Don’t trust the page. Run the verifier.

The test suite runs offline against a deterministic mock provider — no API keys. The published-evidence verifier hashes the committed result rows and recomputes the headline counts, citation F1, and McNemar checks from results.jsonl.

Go deeper: REPRODUCING.md · REPLICATION_KIT.md · BENCHMARK_METHODOLOGY.md

# offline — no API keys required
$ npm install
$ npm test                  # 333 tests, MockProvider
$ npm run verify:published  # re-hash + recompute claims

# the read path cannot write — enforced by SHA-256 hashes
$ npx vitest run tests/mutationSafety.test.ts
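
For context on the McNemar checks mentioned under Reproducibility: the test compares two systems on the same queries by counting only the rows where they disagree. The sketch below uses the continuity-corrected chi-square form. It is not the verifier's code, the row shape is hypothetical, and the data is synthetic.

```typescript
// Hypothetical paired row: both systems graded on the same query.
interface PairedRow {
  queryId: string;
  aCorrect: boolean;
  bCorrect: boolean;
}

// McNemar's test with continuity correction (chi-square, 1 degree of freedom).
function mcnemar(rows: PairedRow[]): {
  aOnly: number;
  bOnly: number;
  statistic: number;
  significantAt05: boolean;
} {
  // Only discordant pairs carry information about which system is better.
  const aOnly = rows.filter((r) => r.aCorrect && !r.bCorrect).length;
  const bOnly = rows.filter((r) => !r.aCorrect && r.bCorrect).length;
  const discordant = aOnly + bOnly;
  const corrected = Math.max(0, Math.abs(aOnly - bOnly) - 1);
  const statistic = discordant === 0 ? 0 : (corrected * corrected) / discordant;
  // 3.841 is the 0.05 critical value of chi-square with 1 degree of freedom.
  return { aOnly, bOnly, statistic, significantAt05: statistic > 3.841 };
}

// Build synthetic rows: `aOnly` rows only A gets right, `bOnly` only B, `both` shared.
function syntheticRows(aOnly: number, bOnly: number, both: number): PairedRow[] {
  const rows: PairedRow[] = [];
  for (let i = 0; i < aOnly; i++) rows.push({ queryId: "a" + i, aCorrect: true, bCorrect: false });
  for (let i = 0; i < bOnly; i++) rows.push({ queryId: "b" + i, aCorrect: false, bCorrect: true });
  for (let i = 0; i < both; i++) rows.push({ queryId: "c" + i, aCorrect: true, bCorrect: true });
  return rows;
}

console.log(mcnemar(syntheticRows(15, 5, 80))); // statistic 4.05, significant
console.log(mcnemar(syntheticRows(6, 4, 90))); // statistic 0.1, not significant
```

With few discordant rows an exact binomial test is the safer choice. A thin row-count lead can easily fall short of significance, which is why the page labels its 100K result single-trial.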