LLM Reasoning Benchmark

Evaluating logical reasoning in language models

Year 2023
Role Lead Researcher
Duration 18 months
Python · Evaluation · Benchmark Design

Can language models actually reason, or are they sophisticated pattern
matchers? This benchmark was designed to answer that question rigorously.

We created a comprehensive evaluation framework covering deductive reasoning,
inductive reasoning, abductive reasoning, and mathematical problem-solving.
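
Below is a minimal sketch of the kind of evaluation harness such a framework implies; the names (ReasoningItem, evaluate) and the exact-match scoring rule are illustrative assumptions, not the benchmark's actual API.

# Illustrative harness sketch: score a model callable against reasoning items.
# Names and exact-match scoring are assumptions, not the benchmark's real API.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class ReasoningItem:
    category: str      # e.g. "deductive"
    subtask: str       # e.g. "syllogistic_reasoning"
    prompt: str
    gold_answer: str

def evaluate(model: Callable[[str], str],
             items: Iterable[ReasoningItem]) -> dict:
    """Return exact-match accuracy per category."""
    correct, total = {}, {}
    for item in items:
        prediction = model(item.prompt).strip().lower()
        total[item.category] = total.get(item.category, 0) + 1
        if prediction == item.gold_answer.strip().lower():
            correct[item.category] = correct.get(item.category, 0) + 1
    return {cat: correct.get(cat, 0) / n for cat, n in total.items()}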

Benchmark Categories

categories:
  deductive:
    - syllogistic_reasoning
    - propositional_logic
    - first_order_logic
  inductive:
    - pattern_recognition
    - rule_learning
    - analogical_reasoning
  abductive:
    - best_explanation
    - causal_inference
  mathematical:
    - arithmetic_reasoning
    - algebraic_manipulation
    - proof_verification
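
For reference, here is a hedged sketch of how this category spec could be expanded into a flat task list; the file name categories.yaml is an assumption.

# Illustrative only: expand the category spec above into (category, subtask) pairs.
import yaml  # PyYAML

with open("categories.yaml") as f:
    spec = yaml.safe_load(f)

tasks = [
    (category, subtask)
    for category, subtasks in spec["categories"].items()
    for subtask in subtasks
]
print(len(tasks), "subtasks across", len(spec["categories"]), "categories")
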
15,000+ Test Cases
12 Categories
40+ Models Evaluated
25+ Research Labs Using the Benchmark

Key Findings

Our benchmark revealed several important patterns:

  1. Scale ≠ Reasoning: Larger models don't consistently reason better.
    GPT-4 fails on many problems that require systematic deduction.

  2. Training Data Leakage: Models often "solve" problems by recognizing
    similar examples from training, not by genuine reasoning (a simple
    overlap check in this spirit is sketched after this list).

  3. Fragile Compositionality: Models struggle to combine reasoning
    steps they can perform individually.
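
On the leakage point, a common heuristic is to measure n-gram overlap between benchmark prompts and a sample of the training corpus. The sketch below shows that heuristic in its simplest form; it is a generic contamination check, not the benchmark's actual leakage test.

# Hedged sketch: flag prompts whose n-grams also appear in sampled training text.
# This is a generic contamination heuristic, not the benchmark's own method.
def ngrams(text: str, n: int = 8) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_score(prompt: str, corpus_docs: list, n: int = 8) -> float:
    """Fraction of the prompt's n-grams found anywhere in the corpus sample."""
    prompt_grams = ngrams(prompt, n)
    if not prompt_grams:
        return 0.0
    corpus_grams = set().union(*(ngrams(doc, n) for doc in corpus_docs))
    return len(prompt_grams & corpus_grams) / len(prompt_grams)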

The benchmark is now used by OpenAI, Anthropic, and Google for model
evaluation during development.