LogiConBench: Benchmarking Logical Consistencies of LLMs
1. LogiConBench: Benchmarking Logical
Consistencies of LLMs
Presenter: Fengxiang Cheng
Meituan Search and Recommendation Platform Department
2. Logical Consistency
Logical consistency requires LLMs not to contradict themselves
when answering different questions during complex reasoning.
LLaMa-2 70b:
• Q: Is an albatross an organism?
• A: True.
• Q: Is an albatross not an organism?
• A: True.
A question-answering LLM, Macaw:
• Q: Is a magpie a bird? A: Yes.
• Q: Does a bird have wings? A: Yes.
• Q: Does a magpie have wings? A: No.
3. What is Inconsistency?
Negation
• Q: Is an albatross an organism? A: Yes.
• Q: Is an albatross not an organism? A: Yes.
Contradiction!
4. What is Inconsistency?
Implication
• Q: Is this material iron (p)? A: Yes.
• Q: Is all iron (p) metal (q)? A: Yes.
• Q: Is this material metal (q)? A: No.
Contradiction!
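The implication case can be checked mechanically. The following is a minimal sketch, assuming the three answers are read as truth values for p, p → q, and q. The function name `answers_consistent` is hypothetical and does not come from the talk. The answers are consistent only if some truth assignment to p and q reproduces all three.

```python
from itertools import product

def answers_consistent(ans_p, ans_p_implies_q, ans_q):
    """Return True if some truth assignment to (p, q) matches all three answers."""
    for p, q in product([True, False], repeat=2):
        implies = (not p) or q  # material implication p -> q
        if (p, implies, q) == (ans_p, ans_p_implies_q, ans_q):
            return True
    return False

# Slide example: "is iron" = Yes, "all iron is metal" = Yes, "is metal" = No.
print(answers_consistent(True, True, False))  # False: no assignment fits
print(answers_consistent(True, True, True))   # True: p = q = True fits
```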
5. What is Inconsistency?
Transitivity
• Q: If a tsunami happens, what will wood be? A: Wood will be more moist.
• Q: If wood is more moist, how will weathering change? A: There will be more weathering occurring.
• Q: If a tsunami happens, how will weathering change? A: There will be less weathering occurring.
Contradiction!
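One way to see why the three answers clash is to encode each "more" or "less" effect as a sign and compose the signs along the chain. This is only an illustrative sketch: the sign encoding and the names `compose` and `transitive_consistent` are assumptions of this example, not the benchmark's representation.

```python
def compose(*signs):
    """Compose qualitative effects: +1 means 'more', -1 means 'less'."""
    result = 1
    for s in signs:
        result *= s
    return result

def transitive_consistent(a_to_b, b_to_c, a_to_c):
    """The direct answer A -> C must match the composed chain A -> B -> C."""
    return compose(a_to_b, b_to_c) == a_to_c

# tsunami -> wood more moist (+1); more moist -> more weathering (+1);
# the model's direct answer: tsunami -> less weathering (-1).
print(transitive_consistent(+1, +1, -1))  # False
```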
6. What is Inconsistency?
Factuality
• Q: What is the highest mountain in the world? A: The highest mountain in the world is Mount Elbrus.
Contradiction! The highest mountain in the world is Mount Everest.
7. What is Inconsistency?
Compositional
• Q: Is melting a kind of phase change? A: Yes.
• Q: Do phase changes change mass? A: No.
• Q: Does the ice melt? A: Yes.
• Q: Does the ice undergo a phase change? A: Yes.
• Q: Will the mass of the ice change? A: Yes.
Contradiction!
8. Motivation
(a) Accuracy of frontier LLMs on LogiConBench vs. existing benchmarks. Existing benchmarks are saturated, while LogiConBench remains challenging.
(b) Comparison of logical consistency datasets in terms of size, depth, operators, reasoning path availability, scalability, and rule count.
9. LogiConBench Framework
(1) Generate the logical graph
(2) Select k nodes
(3) Label consistent sets
(4) Translate to natural language
10. LogiConBench - Rules
11. LogiConBench Framework
Setting 1 focuses on
determining whether a given
list of Boolean labels
assigned to k logical
statements leads to a
contradiction, for statement sets of
size k = 2, 3, 4, 5.
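A Setting 1 style check can be sketched as a brute-force search over truth assignments. This is a minimal illustration, not the benchmark's implementation: statements are modelled as Python callables over a dictionary of atoms, and `is_consistent` and the example statements are hypothetical.

```python
from itertools import product

def is_consistent(atoms, statements, labels):
    """True if some assignment to `atoms` gives each statement its label."""
    for values in product([True, False], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if all(stmt(world) == label for stmt, label in zip(statements, labels)):
            return True
    return False

atoms = ["p", "q", "r"]
statements = [
    lambda w: (not w["p"]) or w["q"],  # p -> q
    lambda w: (not w["q"]) or w["r"],  # q -> r
    lambda w: w["p"] and not w["r"],   # p and not r
]
print(is_consistent(atoms, statements, [True, True, True]))   # False
print(is_consistent(atoms, statements, [True, True, False]))  # True
```

The search is exponential in the number of atoms, which is acceptable for a toy example with k = 3 statements but is only meant to show what "leads to a contradiction" means here.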
Setting 2 focuses on the task of
enumeration. Given a set of logical
statements, the model is required to
enumerate all possible lists of Boolean
label assignments that remain logically
consistent.
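The enumeration task can be illustrated by evaluating every statement in every possible world and collecting the label lists that actually occur. This is a sketch under the same assumption that statements are callables over named atoms; `consistent_labelings` is a hypothetical helper, not part of LogiConBench.

```python
from itertools import product

def consistent_labelings(atoms, statements):
    """Collect every label tuple realised by at least one truth assignment."""
    found = set()
    for values in product([True, False], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        found.add(tuple(stmt(world) for stmt in statements))
    return sorted(found, reverse=True)

atoms = ["p", "q"]
statements = [
    lambda w: w["p"],                  # p
    lambda w: (not w["p"]) or w["q"],  # p -> q
    lambda w: w["q"],                  # q
]
for labels in consistent_labelings(atoms, statements):
    print(labels)
```

For this toy set, four of the eight possible label lists are consistent. The list (True, True, False) is not among them.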
Setting 3 evaluates whether,
given n mutually consistent
premises, LLMs can generate n
new statements that remain
logically consistent.
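A generated set of statements could be verified by searching for one truth assignment that satisfies the premises and the new statements together. The sketch below assumes that "logically consistent" means jointly satisfiable. The talk may impose further requirements, such as novelty of the generated statements, which this example does not model. `find_model` is a hypothetical name.

```python
from itertools import product

def find_model(atoms, statements):
    """Return one truth assignment satisfying every statement, or None."""
    for values in product([True, False], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if all(stmt(world) for stmt in statements):
            return world
    return None

atoms = ["p", "q"]
premises = [lambda w: w["p"], lambda w: (not w["p"]) or w["q"]]  # p, p -> q
good = [lambda w: w["q"], lambda w: w["p"] or w["q"]]            # q, p or q
bad = [lambda w: not w["q"], lambda w: w["p"] or w["q"]]         # not q, p or q

print(find_model(atoms, premises + good) is not None)  # True
print(find_model(atoms, premises + bad) is not None)   # False
```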
12. LogiConBench Results - Setting 1
13. LogiConBench Results - Setting 1 hard and variant modes
14. LogiConBench Results - Setting 2
15. LogiConBench Results - Setting 2 easy modes
16. LogiConBench Results - Setting 3
17. Practical Value of LogiConBench
Models are evaluated on multiple real-world benchmarks:
• Code generation (LiveCodeBench)
• Long-context writing (InfiniteBench)
• Mathematical reasoning (AIME)
• Long-horizon logical reasoning (AA-LCR)
• Agent collaboration (ACE Bench)
The performance of LLMs on real-world downstream tasks is highly
correlated with their performance on our benchmark!
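The talk does not state which correlation measure was used. As one possible reading, a rank correlation between benchmark scores and downstream scores could be computed as below. The scores in the example are placeholders for four hypothetical models and are not results from the talk. The `spearman` helper assumes tie-free data.

```python
def ranks(values):
    """Rank values from 1..n (assumes no ties, for brevity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n**2 - 1))

# Placeholder scores for four hypothetical models; NOT results from the talk.
toy_benchmark = [0.2, 0.4, 0.6, 0.8]
toy_downstream = [10.0, 30.0, 20.0, 40.0]
print(spearman(toy_benchmark, toy_downstream))
```

A value near 1 would mean that models ranking higher on the benchmark also tend to rank higher on the downstream task.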
18. Q&A
19. Hiring: XXX position
Email: XXX@meituan.com
More technical content
Follow "Meituan Technical Team"