AMO-Bench: Large Language Models Still Struggle in High School Math Competitions

1. AMO-Bench: Large Language Models Still Struggle in High School Math Competitions · Presenter: An Shengnan, Computing and Intelligence Platform Department, M17-EVA

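2. · The ceiling of text capability keeps rising - Test-time scaling set off the reasoning wave, and model capability has leapt forward, especially in code and reasoning - Gemini-3-Pro, released in late 2025, scores 73 on the AA Intelligence Index, whereas DeepSeek-R1, released in early 2025, scores only 44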

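3. · Reasoning ability · LLM "reasoning ability" vs. humans? The goal is for machines to have cognitive and behavioral abilities similar to humans: to understand, think, learn, and solve complex problems as a person would · Among the many settings for evaluating reasoning, mathematical reasoning is currently the "gold standard" for measuring and tracking progress in model reasoning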

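4. · The dilemma of math reasoning evaluation · Top models are close to saturating the commonly used math reasoning benchmarks; for example, accuracy on AIME24/25 has passed 90% · The benchmarks' discriminative power has dropped sharply, so they can no longer effectively steer models toward higher-order reasoning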

5. We present AMO-Bench, an Advanced Mathematical reasoning benchmark with Olympiad level or even higher difficulty, revealing the reasoning capability boundaries of large language models and highlighting substantial room for future improvements.

6. (1) Data creation: All 50 problems are newly crafted by human experts to prevent data leakage from existing resources. (2) Quality review: Cross-validated by experts to meet IMO standards. LLM-based filtering ensures problems challenge SOTA models. (3) Originality review: Enables efficient automated grading via parser-based or LLM-based methods, balancing cost and generalizability. (4) Difficulty review: Expert-written step-by-step solutions for each problem, supporting error analysis and prompt engineering research.

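7. · Grading scheme · A hybrid Parser + LLM grading scheme, with the grader chosen according to each problem's answer type · Manual quality inspection puts grading accuracy at 99.2%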
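The hybrid scheme on this slide can be illustrated with a minimal sketch. It is not the AMO-Bench grader: the answer-type label `"numeric"`, the helper `parse_number`, and the `llm_judge` callback are all hypothetical names chosen for illustration, and the LLM call is replaced by a stand-in function so the example runs offline.

```python
from fractions import Fraction
from typing import Callable, Optional


def parse_number(text: str) -> Optional[Fraction]:
    """Parse an integer, decimal, or simple fraction such as '3/4'.

    Returns None when the text is not a plain number.
    """
    try:
        return Fraction(text.strip().replace(" ", ""))
    except (ValueError, ZeroDivisionError):
        return None


def grade(prediction: str, reference: str, answer_type: str,
          llm_judge: Callable[[str, str], bool]) -> bool:
    """Dispatch to a parser-based or judge-based check by answer type.

    'numeric' is a hypothetical type label; any other type is passed
    to the judge callback.
    """
    if answer_type == "numeric":
        pred = parse_number(prediction)
        ref = parse_number(reference)
        # Exact rational comparison avoids floating-point tolerance issues.
        return pred is not None and ref is not None and pred == ref
    return llm_judge(prediction, reference)


def stand_in_judge(prediction: str, reference: str) -> bool:
    """Placeholder for an LLM judge (assumption): plain string equality."""
    return prediction.strip() == reference.strip()


numeric_ok = grade("0.75", "3/4", "numeric", stand_in_judge)
numeric_bad = grade("abc", "3/4", "numeric", stand_in_judge)
descriptive_ok = grade(" all primes p ", "all primes p", "descriptive", stand_in_judge)
```

In a real setup the judge callback would wrap a model call; keeping it as a parameter lets the cheap parser path handle machine-checkable answers while the costlier judge is reserved for the rest.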

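8. · Data statistics · Covers five core areas of Olympiad mathematics · Answer lengths far exceed those of traditional benchmarks, which makes the benchmark inherently harder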

9. · Overall conclusions · On AMO-Bench, only Gemini-3-Pro currently just reaches the "passing line"; most models score below 40%

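10. · Overall conclusions · Chinese models are no weaker than overseas models, and open-source models are quickly catching up with closed-source ones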

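11. · Reasoning efficiency · Most high-scoring models rely on more output tokens · Gemini-3-Pro is the most efficient · Chinese models generally produce more output tokens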

12. · Test-Time Scaling · For a given model, the score grows log-linearly with the amount of reasoning spent, so test-time scaling still works
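One way to read "log-linear" here is that the score is roughly linear in the logarithm of the output length, i.e. score ≈ a + b · ln(tokens). The sketch below fits that form by ordinary least squares. The data points are synthetic and chosen only to make the arithmetic easy to follow; they are not AMO-Bench results, and `fit_log_linear` is a hypothetical helper name.

```python
import math


def fit_log_linear(tokens, scores):
    """Least-squares fit of score = a + b * ln(tokens); returns (a, b)."""
    xs = [math.log(t) for t in tokens]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic illustration only: each doubling of tokens adds 10 points.
tokens = [1000, 2000, 4000, 8000]
scores = [10.0, 20.0, 30.0, 40.0]

a, b = fit_log_linear(tokens, scores)
# Under this model, every doubling of the token budget adds b * ln(2) points.
gain_per_doubling = b * math.log(2)
```

Under this reading, a constant gain per doubling of the token budget is what "test-time scaling still works" would look like on a plot with a logarithmic x-axis.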

13. We present AMO-Bench, an Advanced Mathematical reasoning benchmark for pushing the boundaries of mathematical reasoning in LLMs.

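14. Hiring: model evaluation positions (campus and experienced hires) · Email: anshengnan@meituan.com · For more technical content, follow the Meituan Technical Team account ("美团技术团队")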

15. Q&A