LLM Reasoning Benchmark
Evaluating logical reasoning in language models
Can language models actually reason, or are they sophisticated pattern
matchers? This benchmark was designed to answer that question rigorously.
We created a comprehensive evaluation framework covering deductive reasoning,
inductive reasoning, abductive reasoning, and mathematical problem-solving.
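To make the task format concrete, here is a minimal sketch of what a single benchmark item might look like. The BenchmarkItem fields and the example syllogism are illustrative assumptions, not the framework's actual schema.

from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    """One evaluation problem (illustrative schema, not the released one)."""
    category: str  # e.g. "deductive", "inductive", "abductive", "mathematical"
    task: str      # e.g. "syllogistic_reasoning"
    prompt: str    # problem statement shown to the model
    answer: str    # gold label used for scoring

# A hypothetical deductive item.
example = BenchmarkItem(
    category="deductive",
    task="syllogistic_reasoning",
    prompt=(
        "All linguists are scientists. Some linguists are poets. "
        "Does it follow that some scientists are poets? Answer Yes or No."
    ),
    answer="Yes",
)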
Benchmark Categories
categories:
  deductive:
    - syllogistic_reasoning
    - propositional_logic
    - first_order_logic
  inductive:
    - pattern_recognition
    - rule_learning
    - analogical_reasoning
  abductive:
    - best_explanation
    - causal_inference
  mathematical:
    - arithmetic_reasoning
    - algebraic_manipulation
    - proof_verification
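A minimal sketch of how a harness could consume this category listing, assuming it is saved as a YAML file named categories.yaml (the filename is an assumption, not part of any released tooling):

import yaml  # PyYAML, assumed to be installed

# Load the category listing shown above.
with open("categories.yaml") as f:
    config = yaml.safe_load(f)

# Enumerate every (category, task) pair the benchmark covers.
for category, tasks in config["categories"].items():
    for task in tasks:
        print(f"{category}/{task}")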
Key Findings
Our benchmark revealed several important patterns:
Scale ≠ Reasoning: Larger models don't consistently reason better. GPT-4 fails on many problems that require systematic deduction.
Training Data Leakage: Models often "solve" problems by recognizing similar examples from training, not by genuine reasoning.
Fragile Compositionality: Models struggle to combine reasoning steps they can perform individually (see the sketch after this list).
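One way to read the compositionality finding is as a three-condition probe: score a model on two atomic steps and on their composition, then compare. The sketch below illustrates that idea; the ask callable and compositionality_gap helper are hypothetical stand-ins for whatever model interface is used, not the benchmark's published code.

from typing import Callable

def compositionality_gap(
    ask: Callable[[str], str],
    step_a: tuple[str, str],    # (prompt, gold answer) for atomic step A
    step_b: tuple[str, str],    # (prompt, gold answer) for atomic step B
    composed: tuple[str, str],  # (prompt, gold answer) requiring A then B
) -> dict[str, bool]:
    """Return per-condition correctness. A model that passes both atomic
    steps but fails the composed item exhibits the fragility described above."""
    def correct(item: tuple[str, str]) -> bool:
        prompt, gold = item
        return ask(prompt).strip().lower() == gold.strip().lower()

    return {
        "step_a": correct(step_a),
        "step_b": correct(step_b),
        "composed": correct(composed),
    }

# Example usage with a trivial stand-in "model" that always answers yes.
if __name__ == "__main__":
    dummy = lambda prompt: "yes"
    print(compositionality_gap(
        dummy,
        step_a=("Is 7 greater than 3? Answer yes or no.", "yes"),
        step_b=("Is 12 an even number? Answer yes or no.", "yes"),
        composed=("Is the larger of 7 and 3 an even number? Answer yes or no.", "no"),
    ))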
The benchmark is now used by OpenAI, Anthropic, and Google for model
evaluation during development.