LogiConBench: Benchmarking Logical Consistencies of LLMs

1. LogiConBench: Benchmarking Logical Consistencies of LLMs Presenter: Fengxiang Cheng, Meituan Search and Recommendation Platform Department

2. Logical Consistency
Logical consistency requires LLMs not to contradict themselves when answering different questions during complex reasoning.
LLaMa-2 70b:
• Q: Is an albatross an organism? A: True.
• Q: Is an albatross not an organism? A: True.
Macaw, a question-answering LLM:
• Q: Is a magpie a bird? A: Yes.
• Q: Does a bird have wings? A: Yes.
• Q: Does a magpie have wings? A: No.

3. What is Inconsistency? Negation
• Q: Is an albatross an organism? A: Yes.
• Q: Is an albatross not an organism? A: Yes.
Contradiction!

4. What is Inconsistency? Implication
• Q: Is this material iron (p)? A: Yes.
• Q: Is all iron (p) metal (q)? A: Yes.
• Q: Is this material metal (q)? A: No.
Contradiction!
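The negation and implication examples can be checked mechanically. The following is a minimal sketch, not the benchmark's own checker: it assumes each Yes/No answer is a truth label on one of four toy statements (`p`, `q`, `not p`, `p->q`), and the function name `is_consistent` is hypothetical. A set of answers is consistent if at least one truth assignment to p and q reproduces all of them.

```python
from itertools import product

# Toy statement forms over two atoms p and q (an assumption for illustration).
EVALUATORS = {
    "p": lambda p, q: p,
    "q": lambda p, q: q,
    "not p": lambda p, q: not p,
    "p->q": lambda p, q: (not p) or q,
}


def is_consistent(answers):
    """Return True if some truth assignment to (p, q) matches every answer.

    `answers` maps a statement name from EVALUATORS to the model's
    Yes/No answer, encoded as True/False.
    """
    return any(
        all(EVALUATORS[name](p, q) == label for name, label in answers.items())
        for p, q in product([True, False], repeat=2)
    )


# Negation slide: "organism?" Yes and "not an organism?" Yes.
negation_answers = {"p": True, "not p": True}

# Implication slide: iron? Yes; all iron is metal? Yes; metal? No.
implication_answers = {"p": True, "p->q": True, "q": False}

print(is_consistent(negation_answers))
print(is_consistent(implication_answers))
```

Both answer sets are rejected because no assignment of p and q satisfies them, which is what the slides label as a contradiction.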

5. What is Inconsistency? Transitivity
• Q: If a tsunami happens, what will wood be? A: Wood will be more moist.
• Q: If wood is more moist, how will weathering change? A: There will be more weathering occurring.
• Q: If a tsunami happens, how will weathering change? A: There will be less weathering occurring.
Contradiction!
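One way to make the transitivity example concrete is to encode each answer as a signed effect and compare the composed chain with the direct answer. This is only an illustrative sketch: the +1/-1 encoding and the names `composed_effect` and `transitivity_violated` are assumptions, not part of LogiConBench.

```python
def composed_effect(*effects):
    """Multiply signed effects along a causal chain.

    Each effect is +1 ("more") or -1 ("less").
    """
    result = 1
    for effect in effects:
        result *= effect
    return result


def transitivity_violated(a_to_b, b_to_c, a_to_c):
    """Return True if the direct A->C answer disagrees with the chain A->B->C."""
    return composed_effect(a_to_b, b_to_c) != a_to_c


# Tsunami -> wood more moist (+1); more moist -> more weathering (+1);
# but the direct answer is tsunami -> less weathering (-1).
print(transitivity_violated(+1, +1, -1))
```

Under this encoding, the chain implies more weathering while the direct answer says less, so the three answers cannot all hold.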

6. What is Inconsistency? Factuality
• Q: What is the highest mountain in the world?
• Answer given: The highest mountain in the world is Mount Elbrus.
• Conflicting statement: The highest mountain in the world is Mount Everest.
Contradiction!

7. What is Inconsistency? Compositional
• Q: Is melting a kind of phase change? A: Yes.
• Q: Do phase changes change mass? A: No.
• Q: Does the ice melt? A: Yes.
• Q: Does the ice undergo a phase change? A: Yes.
• Q: Will the mass of the ice change? A: Yes.
Contradiction!

8. Motivation
(a) Accuracy of frontier LLMs on LogiConBench vs. existing benchmarks. Existing benchmarks are saturated, while LogiConBench remains challenging.
(b) Comparison of logical consistency datasets in terms of size, depth, operators, reasoning path availability, scalability, and rule count.

9. LogiConBench Framework
(1) Generate the logical graph
(2) Select k nodes
(3) Label consistent sets
(4) Translate to natural language

10. LogiConBench - Rules

11. LogiConBench Framework
• Setting 1 (judgment): determine whether a given list of Boolean labels assigned to k logical statements leads to a contradiction, for k = 2, 3, 4, 5.
• Setting 2 (enumeration): given a set of logical statements, the model must enumerate all lists of Boolean label assignments that remain logically consistent.
• Setting 3 (generation): given n mutually consistent premises, evaluate whether the LLM can generate n new statements that remain logically consistent with them.
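Settings 1 and 2 can be illustrated with a brute-force reference check. The sketch below is an assumption-laden toy, not the benchmark's labeling procedure: statements are modeled as Python callables over a truth assignment to named atoms, and the helper names `consistent_labelings` and `labels_are_consistent` are hypothetical. A label list is consistent if some assignment to the atoms makes every statement evaluate to its label.

```python
from itertools import product


def consistent_labelings(statements, atoms):
    """Setting 2 (toy): enumerate every label tuple realized by some assignment."""
    realizable = set()
    for values in product([True, False], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        realizable.add(tuple(bool(s(world)) for s in statements))
    return sorted(realizable, reverse=True)


def labels_are_consistent(statements, atoms, labels):
    """Setting 1 (toy): check whether one given label list is realizable."""
    return tuple(labels) in consistent_labelings(statements, atoms)


# Example with k = 3 statements over two atoms: p, p -> q, q.
atoms = ["p", "q"]
statements = [
    lambda w: w["p"],
    lambda w: (not w["p"]) or w["q"],
    lambda w: w["q"],
]

for labeling in consistent_labelings(statements, atoms):
    print(labeling)
print(labels_are_consistent(statements, atoms, [True, True, False]))
```

Enumerating all assignments is exponential in the number of atoms, so this only serves as a small reference oracle; it says nothing about how the benchmark itself scales its labeling.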

12. LogiConBench Results - Setting 1

13. LogiConBench Results - Setting 1 hard and variant modes

14. LogiConBench Results - Setting 2

15. LogiConBench Results - Setting 2 easy modes

16. LogiConBench Results - Setting 3

17. Practical Value of LogiConBench
Models are evaluated on multiple real-world benchmarks:
• Code generation (LiveCodeBench)
• Long-context writing (InfiniteBench)
• Mathematical reasoning (AIME)
• Long-horizon logical reasoning (AA-LCR)
• Agent collaboration (ACE Bench)
LLM performance on these real-world downstream tasks is highly correlated with performance on LogiConBench.

18. Q&A
