A new benchmark dataset called Chinese Temporal Mapping (CTM) tests LLMs on temporal reasoning using Chinese historical knowledge. The dataset contains 2,306 multiple-choice questions spanning major Chinese dynasties, evaluating both pure temporal logic and historical context understanding.
Key technical points:

- Questions are split into temporal reasoning (ordering, duration, logic) and historical alignment categories
- Evaluated 7 LLMs, including GPT-4, ChatGPT, and Chinese models such as GLM-4
- Tested in both zero-shot and few-shot settings
- GPT-4 achieved 74.8% accuracy, the current SOTA
- A performance gap was observed between models' English and Chinese capabilities
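The evaluation setup above is a standard multiple-choice accuracy loop. A minimal sketch, where `query_model` is a hypothetical stand-in for a real LLM API call (the paper's actual harness and prompts are not shown here):

```python
# Minimal sketch of a zero-shot multiple-choice evaluation loop.
# `query_model` is a hypothetical placeholder, not the paper's harness.

def query_model(prompt: str) -> str:
    # A real harness would send the prompt to an LLM here.
    # This stub always answers "A" so the sketch runs end to end.
    return "A"

def evaluate(questions):
    """Return accuracy over (prompt, gold_answer) pairs."""
    correct = 0
    for prompt, gold in questions:
        # Take the first character of the reply as the chosen option letter.
        pred = query_model(prompt).strip().upper()[:1]
        correct += pred == gold
    return correct / len(questions)

sample = [
    ("Which dynasty came first?\nA. Tang\nB. Song\nAnswer:", "A"),
    ("Which dynasty came later?\nA. Qin\nB. Ming\nAnswer:", "B"),
]
print(evaluate(sample))  # stub model always answers "A" -> 0.5
```

A few-shot variant would simply prepend worked question/answer examples to each prompt before calling the model.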
Results breakdown:

- Models performed better on basic timeline questions than on complex reasoning
- Performance varied significantly by question type and historical period
- Larger models generally showed better temporal reasoning
- Multi-step reasoning questions were the most challenging across all models
- Historical alignment accuracy correlated with model size
I think this benchmark addresses an important gap in evaluating culture-specific temporal reasoning. The results suggest current LLMs still struggle with complex historical relationships despite strong performance on simpler tasks. This could drive development of better temporal reasoning architectures and more culturally diverse training data.
One limitation worth noting is that the multiple-choice format may not fully capture nuanced historical understanding. Additionally, the Western-centric training data of many models likely hurts their performance on Chinese historical content.
TLDR: New Chinese history benchmark tests LLM temporal reasoning. GPT-4 leads at 74.8% accuracy, but complex reasoning remains challenging. Shows the need for improved culture-specific capabilities.
Full summary is here. Paper here.