AMemGym: Interactive Memory Benchmarking for Assistants in Long-horizon Conversations

1. AMemGym: Interactive Memory Benchmarking for Assistants in Long-horizon Conversations
Jiayang Cheng1,2∗, Dongyu Ru2∗, Lin Qiu2†, Yiyang Li2, Xuezhi Cao2, Yangqiu Song1, Xunliang Cai2
1 HKUST, 2 Meituan. ∗ Equal contribution. † Project lead.
April 2026. Project Page

2. Outline
1. Motivation: Why Memory Evaluation Matters
2. AMemGym Framework
3. Experiments: What Did We Learn?
4. Key Takeaways

4. Why Memory Matters for LLM Agents
Scenario: You chat with an AI assistant over 500 turns. Does it remember your preference from turn 3?
- Agents must maintain accurate information across extended interactions
- Memory failures → inconsistency, repetition, broken trust
- Critical for: personal assistants, customer support, collaborative coding, ...
But how do we properly evaluate this?
- Existing memory benchmarks (MSC, LoCoMo) rely on static, off-policy data
- They don't capture the interactive, evolving nature of real conversations
- The agent's own responses shape the context; off-policy evaluation misses this

5. What's Wrong with Existing Benchmarks?

| Benchmark | Eval. Mode | Optim. Feedback | Automation | Context Len. |
|---|---|---|---|---|
| MSC | Static | ✗ | Manual | 1.2K |
| RealTalk | Static | ✗ | Manual | 17K |
| DialSim | Static | ✗ | Manual | - |
| LoCoMo | Static | ✗ | Semi-Auto | 9.2K |
| PerLTQA | Static | ✗ | Semi-Auto | - |
| LongMemEval | Static | ✗ | Semi-Auto | 115K–1.5M |
| PersonaMem | Static | ✗ | Fully Auto | 32K–1M |
| AMemGym (Ours) | Interactive | ✓ | Fully Auto | Configurable |

Key gaps in prior work:
- All existing benchmarks use static, off-policy evaluation
- No optimization feedback → cannot guide memory system improvement
- Manual/semi-auto → limited scale and diversity

7. Why Evaluation Mode Matters: An Example
Scenario: User says “I switched from yoga to swimming last month.”
Off-Policy (Existing work):
- All agents read the same pre-written conversation
- Agent A sees identical context as Agent B
→ Tests reading comprehension, not real memory
On-Policy (Ours):
- Agent A remembered “likes sports” → “Got it, switching to swimming!”
- Agent B forgot prior context → “Were you doing some sport before?”
→ Different conversations unfold
Key point: Memory design → agent response → conversation flow. Agent memory and conversation are coupled. Off-policy breaks this coupling.

8. Key Insight: On-Policy vs. Off-Policy Evaluation
Off-Policy (Existing work): pre-written conversation → test recall
✗ Doesn't reflect real deployment
✗ Agent never actually participated
On-Policy (Ours): agent participates in conversation → test recall
✓ Real deployment conditions
✓ Agent's own responses shape context
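The contrast above can be made concrete with a toy simulation. This is a minimal sketch, not AMemGym's implementation: `ToyAgent`, `toy_user`, `run_on_policy`, and `run_off_policy` are hypothetical names, and the "memory" is just a window of recent user messages. It only illustrates that on-policy transcripts depend on the agent, while off-policy context does not.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    """Hypothetical agent that remembers only the last `capacity` user messages."""
    capacity: int
    memory: list = field(default_factory=list)

    def reply(self, user_msg: str) -> str:
        # Acknowledge the switch only if yoga is still in memory.
        if "swimming" in user_msg:
            if any("yoga" in m for m in self.memory):
                return "Got it, switching from yoga to swimming!"
            return "Were you doing some sport before?"
        return "Noted."

    def observe(self, user_msg: str) -> None:
        self.memory.append(user_msg)
        self.memory = self.memory[-self.capacity:]


def toy_user(script: list, last_agent_reply: str):
    """Hypothetical user simulator: answers a clarifying question, else follows the script."""
    if last_agent_reply.endswith("?"):
        return "Yes, I used to do yoga."
    return script.pop(0) if script else None


def run_on_policy(agent: ToyAgent, script: list) -> list:
    """The agent's replies feed back into what the user says next."""
    script, transcript, last_reply = list(script), [], ""
    while (user_msg := toy_user(script, last_reply)) is not None:
        last_reply = agent.reply(user_msg)
        agent.observe(user_msg)
        transcript.append((user_msg, last_reply))
    return transcript


def run_off_policy(agent: ToyAgent, prewritten_user_msgs: list) -> list:
    """The agent only reads a fixed conversation; its behavior changes nothing."""
    for msg in prewritten_user_msgs:
        agent.observe(msg)
    return list(prewritten_user_msgs)


SCRIPT = [
    "I do yoga every morning.",
    "Work has been busy lately.",
    "I switched to swimming last month.",
]

on_a = run_on_policy(ToyAgent(capacity=5), SCRIPT)   # remembers yoga
on_b = run_on_policy(ToyAgent(capacity=1), SCRIPT)   # forgot yoga, asks back
off_a = run_off_policy(ToyAgent(capacity=5), SCRIPT)
off_b = run_off_policy(ToyAgent(capacity=1), SCRIPT)
```

Here the forgetful agent triggers an extra clarification turn on-policy, so the two on-policy transcripts differ, while both agents see exactly the same off-policy context.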

9. AMemGym Framework Overview
Three-stage pipeline:
1. Structured Data Generation: schema-based state evolution
2. On-Policy Interaction: agent participates with user simulator
3. Fine-grained Evaluation: periodic memory evaluation via personalized QA

10. Structured Data Generation
Schema-based state evolution:
- Define entity schemas (attributes, relationships)
- States evolve over turns (updates, additions, deletions)
- Controllable: # entities, # state changes, conversation length
Example: A user discusses bridge gatherings.
- Period 1: guest age range = “mostly 50 plus”
- Period 3: guest age range → “mostly 50 plus” + “mixed ages with young adults”
⇒ Agent must track these evolving states
Why this design? Controllability (precise ground truth) + Realism (real-world scenarios) + Scalability (arbitrary length)
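One way to picture schema-based state evolution is as a log of timestamped attribute updates that can be replayed to obtain the ground-truth state at any period. The sketch below is illustrative only: `StateUpdate` and `state_at` are hypothetical names, and the period-3 value is simplified to a single overwriting string, which is an assumption about how the slide's example should be read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateUpdate:
    """One change to a user attribute (hypothetical structure)."""
    period: int
    attribute: str
    value: str


def state_at(updates, period):
    """Replay all updates up to and including `period`; later updates overwrite earlier ones."""
    state = {}
    for update in sorted(updates, key=lambda u: u.period):
        if update.period <= period:
            state[update.attribute] = update.value
    return state


# Bridge-gathering example from the slide (period-3 value simplified, see above).
updates = [
    StateUpdate(period=1, attribute="guest_age_range", value="mostly 50 plus"),
    StateUpdate(period=3, attribute="guest_age_range", value="mixed ages with young adults"),
]

truth_p2 = state_at(updates, 2)  # still the period-1 value
truth_p3 = state_at(updates, 3)  # reflects the period-3 change
```

Because the ground truth at every period is derived mechanically from the update log, a memory question asked at period 2 and the same question asked at period 3 have different, precisely known answers.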

11. On-Policy Interaction & Evaluation Metrics
Interaction protocol:
- User simulator delivers information gradually across turns
- At designated checkpoints: memory queries test specific information
- Agent must recall from its own conversation history
Fine-grained metrics:
- Overall score: End-to-end QA accuracy
- Memory score: Normalized between random baseline and perfect-memory upper bound
- Diagnostics: Write / Read / Utilization failure rates
Meta-evaluation validates metric reliability (high correlation with human judgment)
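The memory score is described only as "normalized between random baseline and perfect-memory upper bound". A natural reading is min-max normalization, sketched below. The function name and the exact formula are assumptions; the deck does not state at which granularity (per question, per period, or per model) the normalization is applied, so this should not be expected to reproduce reported numbers exactly.

```python
def normalized_memory_score(overall: float, random_baseline: float, upper_bound: float) -> float:
    """Min-max normalization (assumed form): 0.0 at the random baseline, 1.0 at the upper bound.

    Values below the random baseline come out negative.
    """
    span = upper_bound - random_baseline
    if span <= 0:
        raise ValueError("upper_bound must exceed random_baseline")
    return (overall - random_baseline) / span


# Illustrative numbers only (not taken from the reported results).
halfway = normalized_memory_score(overall=0.5, random_baseline=0.25, upper_bound=0.75)
below_random = normalized_memory_score(overall=0.125, random_baseline=0.25, upper_bound=0.75)
```

Under this reading, a negative memory score simply means the agent answered worse than random guessing at that checkpoint.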

12. Q1: Does Off-Policy Evaluation Mislead?

Memory Agents (gpt-4.1-mini base):

| Agent | On ↑ | Off ↑ | ∆Rank |
|---|---|---|---|
| AWE-(2,4,30) | .291 | .253 | ▼3 |
| AWE-(2,8,30) | .278 | .271 | - |
| AWE-(2,4,10) | .275 | .273 | ▲2 |
| AWE-(4,4,30) | .262 | .229 | ▼3 |
| RAG-(2,4,30) | .227 | .241 | ▲2 |

Native LLMs:

| Model | On ↑ | Off ↑ | ∆Rank |
|---|---|---|---|
| claude-sonnet-4 | .336 | .339 | - |
| gemini-2.5-flash | .327 | .317 | - |
| gpt-4.1 | .244 | .244 | ▲2 |
| gpt-4.1-mini | .203 | .198 | - |
| deepseek-v3 | .152 | .165 | - |

Takeaway: Off-policy evaluation introduces reuse bias, undermining memory optimization and configuration selection, especially for memory agents.
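A rank-shift column like ∆Rank can be computed by ranking systems under each evaluation mode and differencing the ranks. The sketch below uses hypothetical systems and scores, not the reported ones, and the sign convention (positive means the system ranks higher off-policy than on-policy) is an assumption, since the deck does not define it.

```python
def rank_shift(on_scores: dict, off_scores: dict) -> dict:
    """Return on-policy rank minus off-policy rank for each system.

    Rank 1 is the best score. A positive value means the system looks
    better under off-policy evaluation than under on-policy evaluation.
    """
    def ranks(scores):
        ordered = sorted(scores, key=scores.get, reverse=True)
        return {name: position + 1 for position, name in enumerate(ordered)}

    on_rank, off_rank = ranks(on_scores), ranks(off_scores)
    return {name: on_rank[name] - off_rank[name] for name in on_scores}


# Hypothetical scores for three memory configurations.
on = {"config_a": 0.30, "config_b": 0.28, "config_c": 0.26}
off = {"config_a": 0.25, "config_b": 0.27, "config_c": 0.29}

shifts = rank_shift(on, off)
```

In this toy case the best on-policy configuration would be ranked last off-policy, which is the kind of reversal that makes off-policy scores a poor guide for configuration selection.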

14. Q2: How Well Do Current LLMs Remember?

Normalized memory scores (columns 0 to 10 are period indices):

| Model | Mean | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gemini-3-pro-preview | .463 | .786 | .596 | .509 | .538 | .505 | .374 | .353 | .259 | .356 | .421 | .397 |
| gpt-5.2-xhigh | .380 | .856 | .607 | .501 | .459 | .339 | .278 | .191 | .173 | .263 | .257 | .253 |
| glm-4.7 | .361 | .811 | .639 | .490 | .459 | .322 | .247 | .194 | .281 | .172 | .188 | .168 |
| gpt-5.2-none | .339 | .825 | .606 | .377 | .358 | .318 | .261 | .216 | .170 | .190 | .290 | .123 |
| claude-sonnet-4 | .337 | .820 | .539 | .367 | .378 | .304 | .265 | .236 | .115 | .190 | .236 | .255 |
| gemini-2.5-flash | .327 | .773 | .573 | .468 | .379 | .364 | .205 | .183 | .066 | .183 | .235 | .166 |
| gpt-5.1-high | .323 | .831 | .572 | .429 | .343 | .319 | .135 | .141 | .105 | .166 | .249 | .263 |
| gpt-5.1-none | .282 | .735 | .519 | .382 | .348 | .249 | .104 | .132 | .085 | .106 | .198 | .243 |
| gemini-2.5-flash-lite | .271 | .606 | .543 | .405 | .304 | .232 | .149 | .147 | .067 | .179 | .204 | .150 |
| seed-1.8 | .266 | .737 | .539 | .451 | .361 | .194 | .090 | .028 | .072 | .042 | .190 | .223 |
| qwen3-max-thinking | .265 | .739 | .480 | .325 | .290 | .269 | .124 | .142 | .075 | .105 | .194 | .170 |
| claude-sonnet-4.5 | .262 | .862 | .605 | .352 | .315 | .176 | .121 | .184 | -.001 | .154 | .102 | .015 |
| gemini-2.0-flash | .245 | .667 | .414 | .283 | .236 | .221 | .143 | .133 | .115 | .125 | .220 | .141 |
| gpt-4.1 | .240 | .738 | .428 | .293 | .246 | .195 | .123 | .096 | .083 | .159 | .131 | .147 |
| gpt-4.1-mini | .202 | .725 | .473 | .331 | .229 | .117 | .045 | .018 | -.003 | .027 | .094 | .171 |
| gpt-4o-mini | .143 | .506 | .323 | .236 | .230 | .089 | .109 | .066 | -.027 | .006 | -.031 | .067 |
| deepseek-v3 | .139 | .522 | .295 | .217 | .184 | .160 | -.005 | .023 | -.058 | .006 | .088 | .095 |

Overall scores (UB = upper bound; columns 0 to 10 are period indices):

| Model | UB | Mean | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gemini-3-pro-preview | .946 | .556 | .760 | .645 | .585 | .615 | .585 | .490 | .485 | .420 | .490 | .530 | .515 |
| gpt-5.2-xhigh | .949 | .503 | .815 | .655 | .595 | .570 | .480 | .435 | .375 | .355 | .420 | .415 | .420 |
| glm-4.7 | .919 | .478 | .765 | .660 | .570 | .555 | .460 | .405 | .370 | .415 | .350 | .360 | .350 |
| gpt-5.2-none | .939 | .468 | .765 | .645 | .500 | .495 | .470 | .420 | .390 | .355 | .365 | .420 | .325 |
| gpt-5.1-high | .965 | .465 | .805 | .645 | .540 | .480 | .470 | .335 | .340 | .310 | .355 | .410 | .425 |
| claude-sonnet-4 | .928 | .462 | .760 | .585 | .485 | .500 | .460 | .425 | .405 | .310 | .365 | .385 | .405 |
| gemini-2.5-flash | .900 | .448 | .710 | .610 | .545 | .495 | .470 | .380 | .355 | .275 | .355 | .390 | .345 |
| gpt-5.1-none | .947 | .431 | .740 | .595 | .500 | .480 | .410 | .310 | .330 | .290 | .310 | .370 | .405 |
| claude-sonnet-4.5 | .955 | .416 | .825 | .655 | .480 | .460 | .365 | .310 | .365 | .230 | .345 | .305 | .240 |
| seed-1.8 | .938 | .416 | .740 | .595 | .545 | .485 | .365 | .295 | .255 | .280 | .260 | .365 | .395 |
| qwen3-max-thinking | .928 | .414 | .710 | .560 | .455 | .435 | .415 | .325 | .335 | .290 | .305 | .370 | .350 |
| gemini-2.0-flash | .925 | .400 | .670 | .515 | .425 | .395 | .385 | .345 | .335 | .315 | .315 | .375 | .325 |
| gemini-2.5-flash-lite | .858 | .399 | .610 | .555 | .475 | .425 | .370 | .320 | .325 | .280 | .345 | .360 | .320 |
| gpt-4.1 | .913 | .395 | .700 | .515 | .440 | .405 | .370 | .320 | .300 | .290 | .345 | .330 | .330 |
| gpt-4.1-mini | .917 | .367 | .695 | .550 | .465 | .390 | .315 | .270 | .245 | .230 | .250 | .300 | .330 |
| deepseek-v3 | .864 | .326 | .545 | .415 | .375 | .345 | .330 | .245 | .255 | .210 | .255 | .310 | .305 |
| gpt-4o-mini | .816 | .317 | .500 | .420 | .370 | .355 | .290 | .295 | .275 | .220 | .255 | .230 | .280 |
| random | - | .231 | .231 | .231 | .231 | .231 | .231 | .231 | .231 | .231 | .231 | .231 | .231 |

Key findings:
- LLMs reason well in short contexts, but performance drops sharply as conversations grow
- Most models fall below 50% of their upper bound in later periods
- Memory scores ≪ overall scores → surface fluency masks memory failures

16. Q3: Can External Memory Help?

Memory scores by memory implementation (columns 0 to 10 are period indices). All agents use GPT-4.1-mini for fair comparison.

| Memory implementation | Mean | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AWE-(2,4,30) | .296 | .617 | .479 | .365 | .295 | .282 | .185 | .218 | .175 | .222 | .217 | .197 |
| Mem0-G | .284 | .638 | .467 | .360 | .347 | .272 | .173 | .192 | .112 | .140 | .208 | .211 |
| AWE-(2,8,30) | .278 | .615 | .461 | .377 | .338 | .289 | .227 | .152 | .138 | .152 | .141 | .169 |
| AWE-(2,4,10) | .273 | .661 | .531 | .391 | .370 | .261 | .189 | .102 | .039 | .122 | .178 | .161 |
| AWE-(4,4,30) | .263 | .685 | .530 | .349 | .316 | .233 | .088 | .112 | .059 | .150 | .179 | .190 |
| AWE-(2,0,30) | .261 | .718 | .449 | .346 | .273 | .263 | .139 | .117 | .106 | .181 | .138 | .138 |
| AWE-(2,4,50) | .249 | .620 | .470 | .300 | .299 | .208 | .155 | .168 | .049 | .118 | .187 | .161 |
| AWE-(8,4,30) | .233 | .775 | .459 | .311 | .253 | .181 | .150 | .050 | -.019 | .132 | .108 | .157 |
| Nemori | .231 | .782 | .551 | .371 | .296 | .147 | .003 | .004 | -.007 | .050 | .155 | .188 |
| RAG-(2,4,30) | .223 | .780 | .460 | .346 | .283 | .174 | .061 | .083 | -.006 | .036 | .120 | .114 |
| A-Mem | .220 | .759 | .487 | .363 | .251 | .144 | .014 | .023 | -.014 | .055 | .126 | .210 |
| LLM | .202 | .725 | .473 | .331 | .229 | .117 | .045 | .018 | -.003 | .027 | .094 | .171 |
| AWI | .177 | .529 | .331 | .205 | .225 | .132 | .089 | .037 | -.005 | .082 | .158 | .167 |

Key findings:
- RAG (store raw history + retrieve): simple but unselective → noisy
- AWE (LLM curates what to store externally): best; curation > raw storage
- AWI (LLM curates into in-context buffer): worse than native LLMs, because compression loses critical info

17. Diagnostic Analysis: What Goes Wrong?
- Strategy: Agentic write improves utilization but sacrifices read performance due to information loss during compression or retrieval.
- Frequency: Lower update frequency increases read failures, as retaining local messages confuses generation with multiple memory sources.
- Noise: Larger short-term memory increases read failures but provides more context for write.
- Top-K: Top-k has a non-monotonic effect on write: higher values capture more information but introduce more noise.
⇒ AMemGym's fine-grained diagnostics enable targeted optimization of memory strategies.

19. The Write / Read / Utilization Trade-off

| Strategy | Write ↓ | Read ↓ | Util. ↓ |
|---|---|---|---|
| LLM (native) | .301 | .087 | .244 |
| RAG | .377 | .172 | .067 |
| AWE | .338 | .159 | .074 |
| AWI | .286 | .245 | .122 |

Mean failure rates over all periods.

Three failure modes:
- Utilization: knows the facts but fails to use them
- Read: wrote it down but can't retrieve it later
- Write: never recorded the information

Key trade-off:
- RAG/AWE reduce utilization failures but increase write & read errors
- AWI keeps low write failures but high read failures (information loss from compression)
- No method excels at all three stages
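The three failure modes can be read as a staged attribution: a wrong answer is blamed on the earliest stage that went wrong. The sketch below is a hypothetical illustration of that idea, not AMemGym's diagnostic procedure. The function names and the three boolean probes (`written`, `retrieved`, `correct`) are assumptions about what such a diagnosis would need to observe.

```python
from collections import Counter


def classify_failure(written: bool, retrieved: bool, correct: bool):
    """Attribute a wrong answer to the earliest failing stage (assumed scheme)."""
    if correct:
        return None            # no failure
    if not written:
        return "write"         # the fact was never recorded
    if not retrieved:
        return "read"          # recorded, but not recalled when needed
    return "utilization"       # recalled, but not used in the answer


def failure_rates(records) -> dict:
    """Fraction of all queries that fail at each stage."""
    modes = ("write", "read", "utilization")
    if not records:
        return {mode: 0.0 for mode in modes}
    counts = Counter(classify_failure(*record) for record in records)
    return {mode: counts[mode] / len(records) for mode in modes}


# Hypothetical probe results: (written, retrieved, correct) per query.
records = [
    (False, False, False),  # write failure
    (True, False, False),   # read failure
    (True, True, False),    # utilization failure
    (True, True, True),     # success
]

rates = failure_rates(records)
```

With rates broken down this way, a configuration change can be judged by which stage it helps and which it hurts, rather than by the end-to-end score alone.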

23. AMemGym: Key Takeaways
1. On-policy evaluation reveals memory problems masked by off-policy benchmarks
2. All current models show significant memory limitations in long interactions
3. External memory helps but is imperfect: a fundamental write/read/utilization trade-off persists
4. Self-evolution is promising: agents can use AMemGym's feedback to iteratively refine their memory policies
Impact: AMemGym enables reliable benchmarking, diagnosis, and optimization of memory for conversational agents.

24. Thank you for your attention! Project Page