AMemGym- Interactive Memory Benchmarking for Assistants in Long-horizon Conversations
如果无法正常显示,请先停止浏览器的去广告插件。
1. AMemGym: Interactive Memory Benchmarking
for Assistants in Long-horizon Conversations
Jiayang Cheng1,2∗ , Dongyu Ru2∗ , Lin Qiu2† , Yiyang Li2 ,
Xuezhi Cao2 , Yangqiu Song1 , Xunliang Cai2
∗Equal contribution
1 HKUST
†Project lead
2 Meituan
April 2026
Project Page
2. Motivation
Framework
Experiments
Takeaways
Outline
1Motivation: Why Memory Evaluation Matters
2AMemGym Framework
3Experiments: What Did We Learn?
4Key Takeaways
2 / 16
3. Motivation
Framework
Experiments
Takeaways
Why Memory Matters for LLM Agents
Scenario: You chat with an AI assistant over 500 turns.
Does it remember your preference from turn 3?
Agents must maintain accurate information across extended interactions
Memory failures → inconsistency, repetition, broken trust
Critical for: personal assistants, customer support, collaborative coding, . . .
3 / 16
4. Motivation
Framework
Experiments
Takeaways
Why Memory Matters for LLM Agents
Scenario: You chat with an AI assistant over 500 turns.
Does it remember your preference from turn 3?
Agents must maintain accurate information across extended interactions
Memory failures → inconsistency, repetition, broken trust
Critical for: personal assistants, customer support, collaborative coding, . . .
But how do we properly evaluate this?
Existing memory benchmarks (MSC, LoCoMo) rely on static, off-policy data
They don’t capture the interactive, evolving nature of real conversations
Agent’s own responses shape context — off-policy evaluation misses this
3 / 16
5. Motivation
Framework
Experiments
Takeaways
What’s Wrong with Existing Benchmarks?
BenchmarkEval. ModeOptim. FeedbackAutomationContext Len.
MSC
RealTalk
DialSim
LoCoMo
PerLTQA
LongMemEval
PersonaMemStatic
Static
Static
Static
Static
Static
Static✗
✗
✗
✗
✗
✗
✗Manual
Manual
Manual
Semi-Auto
Semi-Auto
Semi-Auto
Fully Auto1.2K
17K
—
9.2K
—
115K–1.5M
32K–1M
Interactive✓Fully AutoConfigurable
AMemGym (Ours)
Key gaps in prior work:
All existing benchmarks use static, off-policy evaluation
No optimization feedback → cannot guide memory system improvement
Manual/semi-auto → limited scale and diversity
4 / 16
6. Motivation
Framework
Experiments
Takeaways
Why Evaluation Mode Matters: An Example
Scenario: User says “I switched from yoga to swimming last month.”
Off-Policy (Existing work)
All agents read the same pre-written con-
versation
Agent A sees identical context as Agent B
→ Tests reading comprehension,
not real memory
5 / 16
7. Motivation
Framework
Experiments
Takeaways
Why Evaluation Mode Matters: An Example
Scenario: User says “I switched from yoga to swimming last month.”
Off-Policy (Existing work)
On-Policy (Ours)
All agents read the same pre-written con-
versationAgent A remembered “likes sports”
→ “Got it, switching to swimming!”
Agent A sees identical context as Agent BAgent B forgot prior context
→ “Were you doing some sport before?”
→ Tests reading comprehension,
not real memory
→ Different conversations unfold
Key point: Memory design → agent response → conversation flow
Agent memory and conversation are coupled. Off-policy breaks this coupling.
5 / 16
8. Motivation
Framework
Experiments
Takeaways
Key Insight: On-Policy vs. Off-Policy Evaluation
Off-Policy (Existing work)On-Policy (Ours)
Pre-written conversation → test recall
✗ Doesn’t reflect real deployment
✗ Agent never actually participatedAgent participates in conversation → test recall
✓ Real deployment conditions
✓ Agent’s own responses shape context
6 / 16
9. Motivation
Framework
Experiments
Takeaways
AMemGym Framework Overview
Three-stage pipeline:
1Structured Data Generation — schema-based state evolution
2On-Policy Interaction — agent participates with user simulator
3Fine-grained Evaluation — periodic memory evaluation via personalized QA
7 / 16
10. Motivation
Framework
Experiments
Takeaways
Structured Data Generation
Schema-based state evolution:
Define entity schemas (attributes, relationships)
States evolve over turns (updates, additions, deletions)
Controllable: # entities, # state changes, conversation length
Example: A user discusses bridge gatherings.
Period 1: guest age range = “mostly 50 plus”
Period 3: guest age range → “mostly 50 plus” + “mixed ages with young adults”
⇒ Agent must track these evolving states
Why this design? Controllability (precise ground truth) + Realism (real-world scenarios) +
Scalability (arbitrary length)
8 / 16
11. Motivation
Framework
Experiments
Takeaways
On-Policy Interaction & Evaluation Metrics
Interaction protocol:
User simulator delivers information gradually across turns
At designated checkpoints: memory queries test specific information
Agent must recall from its own conversation history
Fine-grained metrics:
Overall score: End-to-end QA accuracy
Memory score: Normalized between random baseline and perfect-memory upper bound
Diagnostics: Write / Read / Utilization failure rates
Meta-evaluation validates metric reliability (high correlation with human judgment)
9 / 16
12. Motivation
Framework
Experiments
Takeaways
Q1: Does Off-Policy Evaluation Mislead?
Memory Agents (gpt-4.1-mini base)Native LLMsOn ↑Off ↑∆RankOn ↑Off ↑∆Rank
.291
.278
.275
.262
.227.253
.271
.273
.229
.241▼3
—
▲2
▼3
▲2.336
.327
.244
.203
.152.339
.317
.244
.198
.165—
—
▲2
—
—
AWE-(2,4,30)
AWE-(2,8,30)
AWE-(2,4,10)
AWE-(4,4,30)
RAG-(2,4,30)
claude-sonnet-4
gemini-2.5-flash
gpt-4.1
gpt-4.1-mini
deepseek-v3
Takeaway: Off-policy evaluation introduces reuse bias, undermining memory optimization and
configuration selection — especially for memory agents.
10 / 16
13. Motivation
Framework
Experiments
Takeaways
Q2: How Well Do Current LLMs Remember?
Period Index
Overall scores
0.8
0.7
0.6
0.5
0.4
0.3
gemini-3-pro-preview .463 .786 .596 .509 .538 .505 .374 .353 .259 .356 .421 .397
gpt-5.2-xhigh .380 .856 .607 .501 .459 .339 .278 .191 .173 .263 .257 .253
glm-4.7 .361 .811 .639 .490 .459 .322 .247 .194 .281 .172 .188 .168
gpt-5.2-none .339 .825 .606 .377 .358 .318 .261 .216 .170 .190 .290 .123
claude-sonnet-4 .337 .820 .539 .367 .378 .304 .265 .236 .115 .190 .236 .255
gemini-2.5-flash .327 .773 .573 .468 .379 .364 .205 .183 .066 .183 .235 .166
gpt-5.1-high .323 .831 .572 .429 .343 .319 .135 .141 .105 .166 .249 .263
gpt-5.1-none .282 .735 .519 .382 .348 .249 .104 .132 .085 .106 .198 .243
gemini-2.5-flash-lite .271 .606 .543 .405 .304 .232 .149 .147 .067 .179 .204 .150
seed-1.8 .266 .737 .539 .451 .361 .194 .090 .028 .072 .042 .190 .223
qwen3-max-thinking .265 .739 .480 .325 .290 .269 .124 .142 .075 .105 .194 .170
claude-sonnet-4.5 .262 .862 .605 .352 .315 .176 .121 .184 -.001 .154 .102 .015
gemini-2.0-flash .245 .667 .414 .283 .236 .221 .143 .133 .115 .125 .220 .141
gpt-4.1 .240 .738 .428 .293 .246 .195 .123 .096 .083 .159 .131 .147
gpt-4.1-mini .202 .725 .473 .331 .229 .117 .045 .018 -.003 .027 .094 .171
gpt-4o-mini .143 .506 .323 .236 .230 .089 .109 .066 -.027 .006 -.031 .067
deepseek-v3 .139 .522 .295 .217 .184 .160 -.005 .023 -.058 .006 .088 .095
Mean 0 1 2 3 4 5 6 7 8 9 10
0.8
0.7
0.6
Memory Score
0.9
Overall Score
gemini-3-pro-preview .946 .556 .760 .645 .585 .615 .585 .490 .485 .420 .490 .530 .515
gpt-5.2-xhigh .949 .503 .815 .655 .595 .570 .480 .435 .375 .355 .420 .415 .420
glm-4.7 .919 .478 .765 .660 .570 .555 .460 .405 .370 .415 .350 .360 .350
gpt-5.2-none .939 .468 .765 .645 .500 .495 .470 .420 .390 .355 .365 .420 .325
gpt-5.1-high .965 .465 .805 .645 .540 .480 .470 .335 .340 .310 .355 .410 .425
claude-sonnet-4 .928 .462 .760 .585 .485 .500 .460 .425 .405 .310 .365 .385 .405
gemini-2.5-flash .900 .448 .710 .610 .545 .495 .470 .380 .355 .275 .355 .390 .345
gpt-5.1-none .947 .431 .740 .595 .500 .480 .410 .310 .330 .290 .310 .370 .405
claude-sonnet-4.5 .955 .416 .825 .655 .480 .460 .365 .310 .365 .230 .345 .305 .240
seed-1.8 .938 .416 .740 .595 .545 .485 .365 .295 .255 .280 .260 .365 .395
qwen3-max-thinking .928 .414 .710 .560 .455 .435 .415 .325 .335 .290 .305 .370 .350
gemini-2.0-flash .925 .400 .670 .515 .425 .395 .385 .345 .335 .315 .315 .375 .325
gemini-2.5-flash-lite .858 .399 .610 .555 .475 .425 .370 .320 .325 .280 .345 .360 .320
gpt-4.1 .913 .395 .700 .515 .440 .405 .370 .320 .300 .290 .345 .330 .330
gpt-4.1-mini .917 .367 .695 .550 .465 .390 .315 .270 .245 .230 .250 .300 .330
deepseek-v3 .864 .326 .545 .415 .375 .345 .330 .245 .255 .210 .255 .310 .305
gpt-4o-mini .816 .317 .500 .420 .370 .355 .290 .295 .275 .220 .255 .230 .280
random
.231 .231 .231 .231 .231 .231 .231 .231 .231 .231 .231 .231
UB Mean 0 1 2 3 4 5 6 7 8 9 10
0.5
0.4
0.3
0.2
0.1
0.0
Period Index
Normalized memory scores
11 / 16
14. Motivation
Framework
Experiments
Takeaways
Q2: How Well Do Current LLMs Remember?
Period Index
Overall scores
0.8
0.7
0.6
0.5
0.4
0.3
gemini-3-pro-preview .463 .786 .596 .509 .538 .505 .374 .353 .259 .356 .421 .397
gpt-5.2-xhigh .380 .856 .607 .501 .459 .339 .278 .191 .173 .263 .257 .253
glm-4.7 .361 .811 .639 .490 .459 .322 .247 .194 .281 .172 .188 .168
gpt-5.2-none .339 .825 .606 .377 .358 .318 .261 .216 .170 .190 .290 .123
claude-sonnet-4 .337 .820 .539 .367 .378 .304 .265 .236 .115 .190 .236 .255
gemini-2.5-flash .327 .773 .573 .468 .379 .364 .205 .183 .066 .183 .235 .166
gpt-5.1-high .323 .831 .572 .429 .343 .319 .135 .141 .105 .166 .249 .263
gpt-5.1-none .282 .735 .519 .382 .348 .249 .104 .132 .085 .106 .198 .243
gemini-2.5-flash-lite .271 .606 .543 .405 .304 .232 .149 .147 .067 .179 .204 .150
seed-1.8 .266 .737 .539 .451 .361 .194 .090 .028 .072 .042 .190 .223
qwen3-max-thinking .265 .739 .480 .325 .290 .269 .124 .142 .075 .105 .194 .170
claude-sonnet-4.5 .262 .862 .605 .352 .315 .176 .121 .184 -.001 .154 .102 .015
gemini-2.0-flash .245 .667 .414 .283 .236 .221 .143 .133 .115 .125 .220 .141
gpt-4.1 .240 .738 .428 .293 .246 .195 .123 .096 .083 .159 .131 .147
gpt-4.1-mini .202 .725 .473 .331 .229 .117 .045 .018 -.003 .027 .094 .171
gpt-4o-mini .143 .506 .323 .236 .230 .089 .109 .066 -.027 .006 -.031 .067
deepseek-v3 .139 .522 .295 .217 .184 .160 -.005 .023 -.058 .006 .088 .095
Mean 0 1 2 3 4 5 6 7 8 9 10
0.8
0.7
0.6
Memory Score
0.9
Overall Score
gemini-3-pro-preview .946 .556 .760 .645 .585 .615 .585 .490 .485 .420 .490 .530 .515
gpt-5.2-xhigh .949 .503 .815 .655 .595 .570 .480 .435 .375 .355 .420 .415 .420
glm-4.7 .919 .478 .765 .660 .570 .555 .460 .405 .370 .415 .350 .360 .350
gpt-5.2-none .939 .468 .765 .645 .500 .495 .470 .420 .390 .355 .365 .420 .325
gpt-5.1-high .965 .465 .805 .645 .540 .480 .470 .335 .340 .310 .355 .410 .425
claude-sonnet-4 .928 .462 .760 .585 .485 .500 .460 .425 .405 .310 .365 .385 .405
gemini-2.5-flash .900 .448 .710 .610 .545 .495 .470 .380 .355 .275 .355 .390 .345
gpt-5.1-none .947 .431 .740 .595 .500 .480 .410 .310 .330 .290 .310 .370 .405
claude-sonnet-4.5 .955 .416 .825 .655 .480 .460 .365 .310 .365 .230 .345 .305 .240
seed-1.8 .938 .416 .740 .595 .545 .485 .365 .295 .255 .280 .260 .365 .395
qwen3-max-thinking .928 .414 .710 .560 .455 .435 .415 .325 .335 .290 .305 .370 .350
gemini-2.0-flash .925 .400 .670 .515 .425 .395 .385 .345 .335 .315 .315 .375 .325
gemini-2.5-flash-lite .858 .399 .610 .555 .475 .425 .370 .320 .325 .280 .345 .360 .320
gpt-4.1 .913 .395 .700 .515 .440 .405 .370 .320 .300 .290 .345 .330 .330
gpt-4.1-mini .917 .367 .695 .550 .465 .390 .315 .270 .245 .230 .250 .300 .330
deepseek-v3 .864 .326 .545 .415 .375 .345 .330 .245 .255 .210 .255 .310 .305
gpt-4o-mini .816 .317 .500 .420 .370 .355 .290 .295 .275 .220 .255 .230 .280
random
.231 .231 .231 .231 .231 .231 .231 .231 .231 .231 .231 .231
UB Mean 0 1 2 3 4 5 6 7 8 9 10
0.5
0.4
0.3
0.2
0.1
0.0
Period Index
Normalized memory scores
Key findings:
LLMs reason well in short contexts, but performance drops sharply as conversations grow
Most models fall below 50% of their upper bound in later periods
Memory scores ≪ overall scores → surface fluency masks memory failures
11 / 16
15. Motivation
Framework
Experiments
Takeaways
Q3: Can External Memory Help?
0.7
Memory implementations:
0.6
Memory Score
AWE-(2,4,30) .296 .617 .479 .365 .295 .282 .185 .218 .175 .222 .217 .197
Mem0-G .284 .638 .467 .360 .347 .272 .173 .192 .112 .140 .208 .211
AWE-(2,8,30) .278 .615 .461 .377 .338 .289 .227 .152 .138 .152 .141 .169
AWE-(2,4,10) .273 .661 .531 .391 .370 .261 .189 .102 .039 .122 .178 .161
AWE-(4,4,30) .263 .685 .530 .349 .316 .233 .088 .112 .059 .150 .179 .190
AWE-(2,0,30) .261 .718 .449 .346 .273 .263 .139 .117 .106 .181 .138 .138
AWE-(2,4,50) .249 .620 .470 .300 .299 .208 .155 .168 .049 .118 .187 .161
AWE-(8,4,30) .233 .775 .459 .311 .253 .181 .150 .050 -.019 .132 .108 .157
Nemori .231 .782 .551 .371 .296 .147 .003 .004 -.007 .050 .155 .188
RAG-(2,4,30) .223 .780 .460 .346 .283 .174 .061 .083 -.006 .036 .120 .114
A-Mem .220 .759 .487 .363 .251 .144 .014 .023 -.014 .055 .126 .210
LLM .202 .725 .473 .331 .229 .117 .045 .018 -.003 .027 .094 .171
AWI .177 .529 .331 .205 .225 .132 .089 .037 -.005 .082 .158 .167
Mean 0 1 2 3 4 5 6 7 8 9 10
0.5
0.4
0.3
0.2
0.1
0.0
Period Index
All agents use GPT-4.1-mini for fair comparison.
12 / 16
16. Motivation
Framework
Experiments
Takeaways
Q3: Can External Memory Help?
0.7
Memory implementations:
0.6
Memory Score
AWE-(2,4,30) .296 .617 .479 .365 .295 .282 .185 .218 .175 .222 .217 .197
Mem0-G .284 .638 .467 .360 .347 .272 .173 .192 .112 .140 .208 .211
AWE-(2,8,30) .278 .615 .461 .377 .338 .289 .227 .152 .138 .152 .141 .169
AWE-(2,4,10) .273 .661 .531 .391 .370 .261 .189 .102 .039 .122 .178 .161
AWE-(4,4,30) .263 .685 .530 .349 .316 .233 .088 .112 .059 .150 .179 .190
AWE-(2,0,30) .261 .718 .449 .346 .273 .263 .139 .117 .106 .181 .138 .138
AWE-(2,4,50) .249 .620 .470 .300 .299 .208 .155 .168 .049 .118 .187 .161
AWE-(8,4,30) .233 .775 .459 .311 .253 .181 .150 .050 -.019 .132 .108 .157
Nemori .231 .782 .551 .371 .296 .147 .003 .004 -.007 .050 .155 .188
RAG-(2,4,30) .223 .780 .460 .346 .283 .174 .061 .083 -.006 .036 .120 .114
A-Mem .220 .759 .487 .363 .251 .144 .014 .023 -.014 .055 .126 .210
LLM .202 .725 .473 .331 .229 .117 .045 .018 -.003 .027 .094 .171
AWI .177 .529 .331 .205 .225 .132 .089 .037 -.005 .082 .158 .167
Mean 0 1 2 3 4 5 6 7 8 9 10
0.5
0.4
0.3
0.2
0.1
0.0
Period Index
All agents use GPT-4.1-mini for fair comparison.
Key findings:
RAG (store raw history + retrieve): simple but unselective → noisy
AWE (LLM curates what to store externally): best; curation > raw storage
AWI (LLM curates into in-context buffer): worse than native LLMs — compression loses critical
info
12 / 16
17. Motivation
Framework
Experiments
Takeaways
Diagnostic Analysis: What Goes Wrong?
Strategy Agentic write improves utilization but sacrifices read performance due to
information loss during compression or retrieval.
Frequency Lower update frequency increases read failures as retaining local messages
confuses generation with multiple memory sources.
Noise Larger short-term memory increases read failures but provides more context for
write.
Top-K Top-k has a non-monotonic effect on write: higher values capture more
information but introduce more noise.
⇒ AMemGym’s fine-grained diagnostics enable targeted optimization of memory strategies.
13 / 16
18. Motivation
Framework
Experiments
Takeaways
The Write / Read / Utilization Trade-off
StrategyWrite↓Read↓Util.↓
LLM (native)
RAG
AWE
AWI.301
.377
.338
.286.087
.172
.159
.245.244
.067
.074
.122
Mean failure rates over all periods.
Three failure modes:
Utilization — knows the facts but fails to
use them
Read — wrote it down but can’t retrieve it
later
Write — never recorded the information
14 / 16
19. Motivation
Framework
Experiments
Takeaways
The Write / Read / Utilization Trade-off
StrategyWrite↓Read↓Util.↓
LLM (native)
RAG
AWE
AWI.301
.377
.338
.286.087
.172
.159
.245.244
.067
.074
.122
Mean failure rates over all periods.
Three failure modes:
Utilization — knows the facts but fails to
use them
Read — wrote it down but can’t retrieve it
later
Write — never recorded the information
Key trade-off:
RAG/AWE reduce utilization failures but
increase write & read errors
AWI keeps low write failures but high read
failures (information loss from compression)
No method excels at all three stages
14 / 16
20. Motivation
Framework
Experiments
Takeaways
AMemGym: Key Takeaways
1
On-policy evaluation reveals memory problems masked by off-policy benchmarks
15 / 16
21. Motivation
Framework
Experiments
Takeaways
AMemGym: Key Takeaways
1On-policy evaluation reveals memory problems masked by off-policy benchmarks
2All current models show significant memory limitations in long interactions
15 / 16
22. Motivation
Framework
Experiments
Takeaways
AMemGym: Key Takeaways
1On-policy evaluation reveals memory problems masked by off-policy benchmarks
2All current models show significant memory limitations in long interactions
3External memory helps but is imperfect — a fundamental write/read/utilization
trade-off persists
15 / 16
23. Motivation
Framework
Experiments
Takeaways
AMemGym: Key Takeaways
1On-policy evaluation reveals memory problems masked by off-policy benchmarks
2All current models show significant memory limitations in long interactions
3External memory helps but is imperfect — a fundamental write/read/utilization
trade-off persists
4Self-evolution is promising — agents can use AMemGym’s feedback to iteratively
refine their memory policies
Impact: AMemGym enables reliable benchmarking, diagnosis, and optimization of memory
for conversational agents.
15 / 16
24. Motivation
Framework
Experiments
Takeaways
Thank you for your attention!
Project Page
16 / 16