How We Measure Memory
HippoDid scores 77.8% on LongMemEval-S — the default path, independently re-tallied, ahead of Zep and independently-measured Mem0. This page shows that number, exactly how it’s measured, and every competitor figure traced to a primary source, including the ones we don’t lead. We publish it this way because a benchmark you can interrogate is the only kind worth trusting — and most you’ll read aren’t: they’re self-reported by the vendor, on a harness the vendor built, with the configuration that produced the best result.
If you only take one thing from this page: every figure here, including the competitor figures, is traceable to a primary source, and the figures that don’t flatter HippoDid are on this page too.
The methodology, first
Benchmark. LongMemEval-S — a public, peer-reviewed long-term-memory benchmark. 500 questions across long multi-session histories, scored by an LLM judge against gold answers. We use the longmemeval_s variant, the configuration most commonly reported by other systems, so comparisons are like-for-like.
What “a run” means here. A HippoDid run ingests every question’s session history into a live HippoDid character, then answers each question using only what HippoDid retrieves — no privileged access to the source text, no per-question tuning. The judge labels each answer correct/incorrect against the gold answer. Overall score = correct / 500.
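As a minimal sketch of the scoring rule above (overall score = correct / total), the following assumes the judge output has already been reduced to one boolean per question. The function name `overall_score` and the in-memory label list are illustrative assumptions, not the format of the published answer files.

```python
from typing import Iterable


def overall_score(labels: Iterable[bool]) -> tuple[int, int, float]:
    """Return (correct, total, percent) for a sequence of judge labels.

    Each label is True when the judge marked the answer correct.
    """
    labels = list(labels)
    if not labels:
        raise ValueError("no judge labels supplied")
    correct = sum(labels)          # True counts as 1, False as 0
    total = len(labels)
    return correct, total, round(100 * correct / total, 1)


# Synthetic labels shaped like the v7 run reported on this page:
# 389 correct out of 500 questions.
labels = [True] * 389 + [False] * 111
correct, total, percent = overall_score(labels)
print(f"{correct}/{total} = {percent}%")
```

The score is a plain proportion over all questions, so any question the system fails to answer has to be labelled incorrect rather than dropped; dropping it would shrink the denominator and inflate the result.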
Generator model. HippoDid is evaluated with gpt-4o as the answer-generating model, held fixed across v5 and v7 so the trajectory measures HippoDid’s changes, not a model upgrade. This matters for comparison: a memory benchmark run on a stronger generation model (e.g. gpt-5) produces a higher number for reasons that have nothing to do with the memory system. We hold our model constant and flag any competitor row that doesn’t.
The two numbers we report, and why both.
- VERBATIM, read with Chain-of-Note (CoN): HippoDid’s VERBATIM retrieval, evaluated with the Chain-of-Note reader that the LongMemEval authors recommend in their official repository (READING_METHOD=con) as the reading method for this benchmark. CoN is the benchmark’s prescribed reader configuration, not a HippoDid product feature. We report this number because it is the methodologically correct way to run LongMemEval, as the benchmark authors specify (LongMemEval repo: “We recommend con, which instructs the model to first extract useful information and then reason over it.”). Earlier internal HippoDid runs used a custom direct-answer reader rather than the benchmark’s recommended Chain-of-Note method; the v7 figure reflects adopting the benchmark authors’ prescribed configuration. We note this because the reader configuration materially affects the score and should be stated, not assumed.
- BEST-OF: the per-question maximum of two modes (EXTRACTED and VERBATIM+CoN). This is not a shippable single configuration; it is the complementarity ceiling, that is, how much headroom exists if the two retrieval strategies were perfectly arbitrated per query. We label it as a ceiling, never as the product’s score, because conflating a theoretical max with the shipped number is exactly the kind of benchmark inflation this page exists to call out.
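The BEST-OF ceiling described above can be sketched as a per-question logical OR over the two modes’ judge labels. The dictionaries, question ids, and the `best_of` helper below are hypothetical toy data for illustration, not HippoDid’s real labels or tooling.

```python
def best_of(extracted: dict[str, bool], verbatim_con: dict[str, bool]) -> dict[str, bool]:
    """Per-question maximum of two modes: correct if either mode was correct."""
    if extracted.keys() != verbatim_con.keys():
        raise ValueError("both modes must be judged on the same question set")
    return {qid: extracted[qid] or verbatim_con[qid] for qid in extracted}


# Toy labels for four questions (illustrative only).
extracted = {"q1": True, "q2": False, "q3": False, "q4": True}
verbatim_con = {"q1": True, "q2": True, "q3": False, "q4": False}

ceiling = best_of(extracted, verbatim_con)
print(sum(extracted.values()), sum(verbatim_con.values()), sum(ceiling.values()))
```

In this toy data each mode gets two questions right on its own, while the per-question maximum gets three. Reaching that figure in practice would require knowing in advance which mode to trust for each question, which is why it is a ceiling rather than a shippable score.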
Baseline honesty. Improvement claims are only meaningful against a fixed, disclosed baseline. Ours is v5 (61.4%) — a real, on-disk, independently re-tallied run. We deliberately do not anchor improvement to an older v1 number, because doing so would credit a recent reader-prompt change for backend work that predated it. The honest trajectory is +16.4 percentage points (v5 61.4% → v7 77.8%), not the larger figure an earlier baseline would let us claim. We retired the larger figure ourselves. (Details in Why the baseline matters, below.)
The result
On LongMemEval-S, with the methodology above:
| Configuration | LongMemEval-S | What it is |
|---|---|---|
| HippoDid — VERBATIM + CoN (v7) | 77.8% (389/500) | VERBATIM retrieval, scored with LongMemEval’s recommended Chain-of-Note reader (READING_METHOD=con). The number that reflects HippoDid’s retrieval under the benchmark’s prescribed methodology. |
| HippoDid — BEST-OF (v7) | 84.6% (423/500) | The per-question best of both modes — the complementarity ceiling. Notably, the 77.8% default already exceeds the ceiling earlier projections set for it. |
| HippoDid — v5 (canonical baseline) | 61.4% (307/500) | The fixed baseline improvement is measured against. |
Trajectory: 61.4% → 77.8%, +16.4pp (v5 → v7). Both endpoints independently re-tallied from raw judge labels — the labelled answer files are public, so you can re-tally every number on this page yourself: github.com/SameThoughts/hippodid-benchmark-results.
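As a quick arithmetic check, the sketch below recomputes each percentage in the table from its published correct/total count and derives the v5 to v7 delta in percentage points. It uses only the counts printed on this page; it does not read the public answer files, whose format is not described here.

```python
# (correct, total, percentage as published in the table above)
published = {
    "v5 baseline": (307, 500, 61.4),
    "v7 VERBATIM+CoN": (389, 500, 77.8),
    "v7 BEST-OF": (423, 500, 84.6),
}


def pct(correct: int, total: int) -> float:
    """Percentage correct, rounded to one decimal place as on this page."""
    return round(100 * correct / total, 1)


for name, (correct, total, claimed) in published.items():
    recomputed = pct(correct, total)
    status = "ok" if recomputed == claimed else "MISMATCH"
    print(f"{name}: {correct}/{total} = {recomputed}% ({status})")

# Percentage-point improvement, computed from counts to avoid float drift.
delta_pp = round(100 * (389 - 307) / 500, 1)
print(f"v5 -> v7: +{delta_pp}pp")
```

The delta is a difference in percentage points (82 more questions correct out of 500), not a relative percentage increase.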
How HippoDid compares on LongMemEval-S
A benchmark page that only shows systems you beat is a billboard, not evidence. Here is the like-for-like landscape on LongMemEval-S. Every row links to its primary source. Rows are not cherry-picked to HippoDid’s advantage — systems that match or could exceed HippoDid are included. Rows generated on a different (more powerful) model than HippoDid’s gpt-4o are kept in the table but explicitly marked not like-for-like, because a higher number from a stronger generator is not a stronger memory system.
| System | LongMemEval-S | Generator model | Source | Note |
|---|---|---|---|---|
| Supermemory (GPT-5) | 84.6% | gpt-5 | supermemory.ai/research | Not like-for-like — gpt-5 generator vs HippoDid’s gpt-4o; not a memory-system comparison |
| HippoDid — BEST-OF (v7) | 84.6% | gpt-4o | this page, §methodology | Complementarity ceiling (per-question max of two modes) |
| Mastra Observational Memory | 84.23% | gpt-4o | mastra.ai/research/observational-memory | Like-for-like generator |
| Supermemory (production) | 81.6% | gpt-4o | supermemory.ai/research | Like-for-like generator |
| HippoDid — VERBATIM+CoN (v7) | 77.8% | gpt-4o | this page, §methodology | VERBATIM retrieval, benchmark-standard CoN reader |
| Zep / Graphiti | 71.2% | gpt-4o | Zep paper, arXiv 2501.13956 | See sourcing note [1] |
| Mem0 (independent reproduction) | ~67% | gpt-4o-mini | TiMem, arXiv 2601.02845 Table 2 | Independent; generator is gpt-4o-mini rather than gpt-4o; see Mem0 note [2] |
Read this honestly: HippoDid’s VERBATIM retrieval under the benchmark’s recommended reader (77.8%) leads Zep (71.2%, gpt-4o) by 6.6 points on a like-for-like basis. It also leads the independently measured Mem0 figure (~67%) by roughly 10 points, with the caveat that the table lists that reproduction as generated on gpt-4o-mini, so the margin is not strictly like-for-like. HippoDid trails two systems on a genuinely comparable basis, Supermemory’s production number (81.6%, gpt-4o) and Mastra’s Observational Memory (84.23%, gpt-4o), and we state that plainly. Supermemory’s 84.6% is not one of those: it was generated on gpt-5, a more powerful model than HippoDid’s gpt-4o, so it is not a memory-system comparison and is marked as such in the table. Separately, HippoDid’s 84.6% is a ceiling, not a shipped score, so it should not be read as beating any shipped number. We lay it out this way because a number is only trustworthy if the unflattering comparisons, and the non-comparable ones, are labelled honestly on the same page as the flattering ones.
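One way to make the like-for-like reading mechanical is to filter the table by its Generator model column and exclude ceilings. The sketch below does that with the figures from the table; the `Row` type and its `shipped` flag are assumptions introduced for illustration, and the Mem0 score uses the 67.56% value cited in note [2].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    system: str
    score: float          # LongMemEval-S, percent
    generator: str        # answer-generating model
    shipped: bool = True  # False for ceilings such as BEST-OF (assumed flag)


ROWS = [
    Row("Supermemory (GPT-5)", 84.6, "gpt-5"),
    Row("HippoDid BEST-OF (v7)", 84.6, "gpt-4o", shipped=False),
    Row("Mastra Observational Memory", 84.23, "gpt-4o"),
    Row("Supermemory (production)", 81.6, "gpt-4o"),
    Row("HippoDid VERBATIM+CoN (v7)", 77.8, "gpt-4o"),
    Row("Zep / Graphiti", 71.2, "gpt-4o"),
    Row("Mem0 (independent reproduction)", 67.56, "gpt-4o-mini"),
]


def like_for_like(rows: list[Row], generator: str = "gpt-4o") -> list[Row]:
    """Keep shipped configurations on the given generator, best score first."""
    kept = [r for r in rows if r.generator == generator and r.shipped]
    return sorted(kept, key=lambda r: r.score, reverse=True)


for row in like_for_like(ROWS):
    print(f"{row.score:6.2f}%  {row.system}")
```

Under this strict filter the Mem0 row drops out as well, because the table lists it on gpt-4o-mini; comparing against it means comparing across generators, in the opposite direction from the gpt-5 row.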
[1] Zep sourcing note. The widely-cited “Zep 71.2%” is a primary published figure from the Zep/Graphiti paper (arXiv 2501.13956), independently corroborated by three other systems’ papers citing the identical figure. One honest caveat: the Zep blog reports per-category deltas rather than restating a single overall number, so 71.2% is most precisely described as the paper-published figure rather than a blog headline. We surface this rather than smooth it over.
[2] Mem0 note (the one to read carefully). Mem0’s own materials report 94.4% on LongMemEval. Independent reproductions report substantially lower: the TiMem paper (arXiv 2601.02845, Table 2) measures Mem0 at 67.56% ± 0.30%, and a separate production benchmark (ByteRover) reports 66.9% — i.e. ~67% from two independent sources versus 94.4% self-reported, a ~27-point gap. We cite the independent ~67% figure in the table above, not the self-report, and not because it’s lower — because it’s independently reproduced. This is precisely the self-report-vs-reproduction distinction this whole page is built around; we apply the same standard to ourselves (see below).
Why the baseline matters (we hold ourselves to the Mem0 standard)
The Mem0 note above makes a demand of the reader: trust independently-reproduced numbers over self-reported ones. It would be dishonest to make that demand and not meet it ourselves. So, the uncomfortable detail:
An earlier internal framing of HippoDid’s progress measured improvement against a v1 baseline, which would let us claim a larger improvement than +16.4pp. We retired that framing. Anchoring to v1 would credit a recent reader-prompt change (the Chain-of-Note scaffold, which moved VERBATIM from 64.2% → 77.8%) with gains that actually came from earlier backend work between v1 and v5. The intellectually honest baseline is v5 (61.4%), the run immediately preceding the change being credited. That yields +16.4pp, a smaller and defensible number, and it is the only improvement figure we use externally.
We are telling you this because it is exactly the kind of thing a vendor normally hides, and a page about measurement integrity that hid its own baseline choice would be self-refuting.
Both endpoints — v5 = 61.4% (307/500) and v7 = 77.8% (389/500) — were re-derived independently from raw judge labels by three separate methods (a read-only audit, an independent raw-file re-tally, and an independent BEST-OF recomputation) before being published here. The raw labelled answer files are public for anyone to re-derive them again: github.com/SameThoughts/hippodid-benchmark-results.
What the number does and doesn’t tell you
It tells you: how well HippoDid retrieves and grounds answers across long, multi-session histories under a public, third-party-defined benchmark, against a fixed baseline, compared with systems measured the same way.
It does not tell you: that benchmark rank equals production fit. LongMemEval-S is one workload. Your agents’ memory needs — multi-tenancy, identity scoping, retrieval latency, write semantics — are not a single accuracy number. We publish this benchmark because it’s the common yardstick and because being measurable is itself a trust signal, not because a leaderboard position is the reason to choose memory infrastructure.
If a vendor’s only argument is a single self-reported benchmark number, that is the argument to be most skeptical of — including ours.
Sources
- LongMemEval benchmark: arXiv:2410.10813
- HippoDid v5/v7 figures: raw judge-labelled answer files, published for independent verification at github.com/SameThoughts/hippodid-benchmark-results. Every HippoDid number on this page (61.4%, 77.8%, 84.6%) is re-tallyable from those files; baseline ratified v5-canonical.
- Zep / Graphiti 71.2%: arXiv:2501.13956
- Mem0 independent ~67%: TiMem, arXiv:2601.02845 Table 2 (67.56% ± 0.30%); corroborated by ByteRover production benchmark (66.9%)
- LongMemEval Chain-of-Note reader (READING_METHOD=con): recommended by the benchmark authors in the official LongMemEval repository; methodology and the ~10-point reading-accuracy effect described in arXiv:2410.10813 §5.5
- Supermemory: supermemory.ai/research
- Mastra Observational Memory: mastra.ai/research/observational-memory
Competitor figures verified against primary sources on 2026-05-16. If a figure here is ever shown to be wrong, it will be corrected on this page with the correction noted — that commitment is part of the methodology.