LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs
The paper introduces LGMT, a framework for evaluating the reasoning reliability of large language models (LLMs). It argues that existing evaluations rely on static benchmarks, which do not test robustness under logically equivalent transformations and therefore often overestimate reasoning capability. By using first-order logic to create semantically invariant test cases, LGMT reveals hidden defects in LLM reasoning.
- LGMT stands for Logic-Grounded Metamorphic Testing and aims to assess the robustness of LLMs under logically equivalent transformations.
- The framework is oracle-free and constructs test cases based on formal logical equivalences.
- Experiments show that LGMT uncovers significant reasoning defects that traditional evaluations miss.
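To make the idea of an oracle-free metamorphic check concrete, here is a minimal sketch. It is not the authors' implementation: it assumes propositional formulas instead of first-order logic, and every name in it (`contrapositive`, `de_morgan`, `metamorphic_violations`, and the `model` callable standing in for an LLM) is hypothetical. The point it illustrates is that no ground-truth answer is needed. The check only asks whether the model gives the same answer to a premise and to a logically equivalent rewrite of it.

```python
from itertools import product

# Formulas are nested tuples. This tiny propositional AST is an assumption
# made for illustration; the paper itself works with first-order logic.
def var(name): return ("var", name)
def neg(f): return ("not", f)
def conj(a, b): return ("and", a, b)
def disj(a, b): return ("or", a, b)
def implies(a, b): return ("imp", a, b)

def evaluate(f, env):
    """Evaluate a formula under a truth assignment (dict of name -> bool)."""
    tag = f[0]
    if tag == "var":
        return env[f[1]]
    if tag == "not":
        return not evaluate(f[1], env)
    if tag == "and":
        return evaluate(f[1], env) and evaluate(f[2], env)
    if tag == "or":
        return evaluate(f[1], env) or evaluate(f[2], env)
    if tag == "imp":
        return (not evaluate(f[1], env)) or evaluate(f[2], env)
    raise ValueError(f"unknown node: {tag}")

def variables(f):
    """Collect the variable names that occur in a formula."""
    if f[0] == "var":
        return {f[1]}
    return set().union(*(variables(sub) for sub in f[1:]))

def equivalent(f, g):
    """Truth-table check that two formulas agree on every assignment."""
    names = sorted(variables(f) | variables(g))
    for values in product([False, True], repeat=len(names)):
        env = dict(zip(names, values))
        if evaluate(f, env) != evaluate(g, env):
            return False
    return True

def contrapositive(f):
    """Rewrite (A -> B) as (not B -> not A); leave other formulas unchanged."""
    if f[0] == "imp":
        return implies(neg(f[2]), neg(f[1]))
    return f

def de_morgan(f):
    """Rewrite not (A and B) as (not A or not B); leave others unchanged."""
    if f[0] == "not" and f[1][0] == "and":
        return disj(neg(f[1][1]), neg(f[1][2]))
    return f

def metamorphic_violations(model, premise, transforms):
    """Return the names of transforms under which the model's answer changes.

    `model` is a hypothetical stand-in for an LLM call: any callable that
    maps a formula to an answer. No reference answer is consulted.
    """
    baseline = model(premise)
    violations = []
    for transform in transforms:
        variant = transform(premise)
        # Guard: the rewrite must preserve meaning, otherwise the test is invalid.
        assert equivalent(premise, variant)
        if model(variant) != baseline:
            violations.append(transform.__name__)
    return violations
```

As a usage sketch, a toy "model" that evaluates the formula semantically passes every check, while one that keys on surface form (for example, answering based on whether the antecedent is a bare variable) is flagged under `contrapositive`. A real harness would render each formula into natural language before querying an LLM, which this sketch leaves out.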
Opening excerpt (first ~120 words)
Computer Science > Artificial Intelligence, arXiv:2605.23965 (cs) [Submitted on 12 May 2026]
Authors: Zenghui Zhou, Man Li, Xiaoke Fang, Xinyi Zhou, Weibin Li, Zheng Zheng
Abstract: Large Language Models (LLMs) achieve strong performance on logical reasoning benchmarks, yet their reliability remains uncertain. Existing evaluations rely on static benchmarks, which fail to assess robustness under logically equivalent transformations and often overestimate reasoning capability.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.