# Research Links Collection for Reasoning (LLM and other types)

## Categories
- [IR] Internal Reasoning: Hierarchical models, dual-system processing, chain-of-thought methods, and brain-inspired reasoning architectures
- [ER] External Reasoning: Test-time compute scaling, sampling strategies, self-certainty measures, and runtime efficiency improvements
- [RR] Reflective Reasoning: Knowledge vs. reasoning separation, domain expertise evaluation, metacognition, and self-awareness mechanisms
- [RV] Response Validation: Benchmarking frameworks, self-consistency checking, confidence measures, and accuracy assessment
- [CBF] Cognitive-Biological Foundations: Neuroscience insights, dopamine research, causal reasoning, and biological inspiration for AI
## Successfully Fetched Sites
| Category | Paper Name and Link | Year | Abstract/Summary |
|---|---|---|---|
| [ER] | You Don’t Need ‘Thinking’ In LLMs To Reason Better | 2024 | Proposes ‘NoThinking’ method that outperforms chain-of-thought reasoning across seven tasks while using 2-5× fewer tokens, especially effective in low-budget settings |
| [ER] | Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters | 2024 | Studies inference-time computation scaling, showing that optimal test-time compute allocation can improve efficiency 4× over best-of-N baselines and outperform 14× larger models |
| [RV] | DSR-Bench: Evaluating the Structural Reasoning Abilities of LLMs via Data Structures | 2025 | Introduces benchmark with 20 data structures and 35 operations to evaluate LLMs' structural reasoning. Best models achieve only 47% on challenge subset, highlighting gaps in multi-dimensional reasoning |
| [ER] | Large Language Monkeys: Scaling Inference Compute with Repeated Sampling | 2024 | Explores repeated sampling for inference scaling, showing coverage scales log-linearly over four orders of magnitude. Improves DeepSeek-Coder performance from 15.9% to 56% on SWE-bench (coverage estimator sketched below the table) |
| [CBF] | Dopamine’s role in learning may be broader than previously thought | 2024 | Research reveals dopamine’s dual role: influences both fast working memory and slow reinforcement learning. Higher dopamine levels bias toward effortful strategies |
| [IR/RR] | Metacognition and Related Abilities of Large Language Models | 2024 | Explores LLMs' metacognitive abilities - “knowing that they know.” Discusses extracting knowledge from smaller models and augmenting behavior using larger models' metacognition |
| [ER] | Scalable Best-of-N Selection for Large Language Models via Self-Certainty | 2025 | Proposes self-certainty metric using LLM probability distributions to select best responses without external reward models. Uses global confidence measures across entire reasoning traces. Scales efficiently and generalizes to open-ended tasks (see sketch below the table) |
| [IR] | Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic | 2024 | Demonstrates that reasoning ability can be extracted and transferred between models as compact task vectors. Extracts the reasoning vector from the difference between GRPO and SFT models, showing consistent improvements across reasoning benchmarks when added to compatible models: GSM8K (+4.9%), HumanEval (+4.3%), BigBenchHard (+12.3%) (see sketch below the table) |
| [IR] | Decoupling Knowledge and Reasoning in LLMs: An Exploration Using Cognitive Dual-System Theory | 2025 | Framework separating knowledge retrieval and reasoning adjustment using fast/slow thinking paradigms. Shows reasoning benefits are domain-specific and parameter scaling improves both components |
| [IR] | How Do LLMs Really Reason? A Framework to Separate Logic from Knowledge | 2025 | Introduces framework using Knowledge Index and Information Gain metrics to evaluate reasoning. Shows supervised fine-tuning improves accuracy but can harm reasoning depth |
| [IR] | Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains | 2024 | Evaluation framework decomposing reasoning into knowledge and logic components. Finds that reasoning abilities don't transfer well between domains, with domain knowledge being critical |
| [ER] | GitHub - ByebyeMonica/Reasoning-Agentic-RAG | 2024 | Curated collection of reasoning-enhanced RAG approaches, categorized into predefined reasoning (structured workflows) and agentic reasoning (autonomous tool orchestration) |
| [IR] | To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning | 2024 | Meta-analysis showing CoT primarily benefits math/logic tasks with minimal gains elsewhere. Performance improvements correlate with presence of symbolic operations (equals signs) |
| [CBF] | A Practical Starters' Guide to Causal Structure Learning with Bayesian Methods in Python | 2024 | Comprehensive tutorial on Bayesian networks for causal inference, covering structure learning, parameter estimation, and inference using the bnlearn Python library (workflow sketched below the table) |
| [IR] | Hierarchical Reasoning Model | 2025 | Brain-inspired 27M parameter model with hierarchical processing achieving 40.3% on ARC-AGI and near-perfect Sudoku performance using only 1000 training samples without pre-training |
| [ER] | Reasoning RAG via System 1 or System 2: A Survey on Reasoning Agentic Retrieval-Augmented Generation for Industry Challenges | 2025 | Comprehensive survey of reasoning-enhanced RAG systems, categorizing approaches into predefined reasoning (fixed pipelines) and agentic reasoning (autonomous orchestration) |
| [IR] | The Hidden Drivers of HRM’s Performance on ARC-AGI | 2025 | In-depth analysis revealing that HRM’s performance comes primarily from iterative refinement loops and data augmentation rather than the hierarchical architecture itself |
| [RR] | A Bayesian Learning Agent: Bayes Theorem and Intelligent Agents | 2025 | Implements belief-updating agent using Bayes' theorem for environment learning. Demonstrates entropy-based uncertainty measurement and probability distribution evolution through evidence integration (see sketch below the table) |
| [IR] | GitHub - DominikBBB/hierarchical-llm-structured-outputs: LLM-in-the-loop optimisation | 2025 | Applies LLMs in optimisation loops for structured output generation. Uses hierarchical approach where LLM iteratively refines outputs based on feedback and constraints |
| [RR] | Retrieval-augmented thoughts elicit context-aware reasoning in long-context Large Language Models | 2024 | RAT framework integrating retrieval processes into CoT prompting for long-context tasks. Demonstrates improved reasoning by retrieving relevant context chunks before generating each reasoning step |
| [ER] | Tool Augmented Agents: Navigating the Landscape of Tool Interaction and Orchestration for LLM Agents | 2024 | Overview of tool augmentation for LLM agents, covering tool interaction patterns, orchestration strategies, and practical implementation considerations for building multi-tool agent systems |
| [RV] | LLM Benchmarks and Evals Are Broken: Here’s How We Fix Them | 2025 | Analysis of current LLM evaluation shortcomings proposing solutions including context-aware metrics, adversarial robustness testing, and dynamic benchmark updates to address overfitting and data contamination |
| [RV] | Measuring Confidence and Uncertainty of LLMs in Real-World Applications | 2024 | Practical guide to implementing confidence and uncertainty measures in production LLM systems. Covers calibration techniques, threshold selection, and integration with human-in-the-loop workflows |
| [RV] | Exploring LLM Evaluation from a Data Scientist’s Perspective: Moving Beyond Accuracy Metrics | 2024 | Comprehensive evaluation framework beyond accuracy including robustness, consistency, bias detection, and cost-benefit analysis. Proposes multi-dimensional assessment approach for production systems |
| [IR] | o3 and the Power of Symbolic Reasoning In AI | 2025 | Analysis of OpenAI’s o3 model achievements highlighting role of symbolic reasoning scaffolds and search strategies in achieving ARC-AGI breakthrough results. Discusses implications for hybrid neural-symbolic approaches |
| [IR] | Can o3 Actually Reason? An Engineering Perspective | 2025 | Engineering analysis questioning whether o3’s performance represents true reasoning or sophisticated pattern matching with extensive search. Examines architectural implications and generalization capabilities |
| [RR] | LLMs can learn self-restraint through iterative self-reflection | 2024 | ReSearch algorithm for synthetic data generation using iterative self-reflection. Teaches LLMs to modulate information based on uncertainty, reducing hallucinations |
| [RR] | Self-Reflection in LLM Agents: Effects on Problem-Solving Performance | 2024 | Empirical study showing LLM agents significantly improve problem-solving performance through self-reflection (p < 0.001). Compares various types of self-reflection approaches |
| [RV] | Confidence Improves Self-Consistency in LLMs | 2025 | CISC performs a weighted majority vote based on confidence scores, reducing required reasoning paths by 40%. Introduces within-question confidence evaluation methodology (see sketch below the table) |
| [RV] | Cycles of Thought: Measuring LLM Confidence through Stable Explanations | 2024 | Framework for measuring LLM uncertainty through distribution of generated explanations. Interprets model+explanation pairs as test-time classifiers for posterior answer distribution |
| [ER/RV] | C4AI ML Agents: Reasoning Approaches Comparison Framework | 2024 | Comprehensive framework comparing 10 reasoning techniques (CoT, ToT, PoT, etc.) across multiple models and tasks. Investigates reasoning approach effectiveness and cost-benefit analysis |
| [RV] | Consistency in Language Models: Current Landscape, Challenges, and Future Directions | 2025 | Comprehensive survey defining behavioral consistency types (logical vs nonlogical), evaluation frameworks, and multilingual consistency challenges. Reviews current approaches to measure and enhance consistency across task domains |
| [ER] | Crosslingual Reasoning through Test-Time Scaling | 2025 | Investigates how English reasoning finetuning with long chains of thought generalizes across languages. Shows that scaling inference compute for English-centric reasoning models improves multilingual mathematical reasoning, with scaled models outperforming models twice their size |
| [ER] | OptimalThinkingBench: Evaluating Over and Underthinking in LLMs | 2025 | Introduces unified benchmark evaluating overthinking and underthinking in LLMs with thinking-adjusted accuracy metrics. Shows thinking models often overthink on simple queries while non-thinking models underthink on complex reasoning tasks |
| [ER] | SciToolAgent: A Knowledge Graph-Driven Scientific Agent for Multi-Tool Integration | 2025 | LLM-powered agent that automates hundreds of scientific tools across biology, chemistry, and materials science using knowledge graph-driven tool selection and graph-based retrieval-augmented generation with comprehensive safety checking |
| [CBF] | Sapir-Whorf Hypothesis | 1994 | Comprehensive review of linguistic relativity research examining how language influences thought. Covers color perception, counterfactual reasoning, and concept labeling studies, showing mutual influence between language and cognition |
| [CBF] | Sapir-Whorf does not apply to Programming Languages | 2024 | Argues programming language effects on thinking are better explained by “Tetris effect” (practice effects) rather than linguistic relativism. Distinguishes between human language cognition and programming skill development |
| [RV] | Uncertainty Quantification and Confidence Calibration in Large Language Models: A Survey | 2025 | Comprehensive survey on uncertainty quantification for LLMs introducing taxonomy based on computational efficiency and uncertainty dimensions (input, reasoning, parameter, prediction). Reviews methods for enhancing reliability and trustworthiness in high-stakes applications |
| [RV] | Pro2Guard: Proactive Runtime Enforcement of LLM Agent Safety via Probabilistic Model Checking | 2025 | Proactive runtime safety enforcement framework using probabilistic reachability analysis and Discrete-Time Markov Chains to anticipate unsafe behaviors before violations occur. Achieves early safety enforcement on 93.6% of unsafe tasks and 100% prediction of traffic violations |
| [RV] | SAUP: Situation Awareness Uncertainty Propagation on LLM Agent | 2024 | Framework for propagating uncertainty through each step of LLM-based agent reasoning processes. Incorporates situational awareness by assigning situational weights to step uncertainties, achieving up to 20% improvement in AUROC over existing methods |
| [RV] | Chonkie: No-nonsense RAG chunking library | 2025 | Lightweight, fast text chunking library with RecursiveChunker (hierarchical semantic splitting) and SemanticChunker (similarity-based splitting). Features 24+ integrations and up to 33× faster performance than alternatives; useful for preparing text in reasoning-evaluation and analysis pipelines |
| [CBF] | Technological Approach to Mind Everywhere: An Experimentally-Grounded Framework for Understanding Diverse Bodies and Minds | 2022 | Levin’s TAME framework distinguishing cognitive consciousness (observable decision-making competencies) from phenomenal consciousness (subjective experience). Proposes empirical approach to studying cognition across diverse biological substrates from cells to complex organisms |
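### Illustrative sketches

A few of the entries above describe procedures compact enough to sketch. All sketches below are minimal readings of the row summaries, written in Python with function names and parameters of my own choosing; none reproduce the papers' actual implementations. For Large Language Monkeys, coverage under repeated sampling is commonly estimated with the standard unbiased pass@k estimator:

```python
from math import comb

def coverage_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples,
    drawn without replacement from n attempts of which c were correct,
    solves the task. 'Coverage' in the repeated-sampling setting is this
    quantity as k grows."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example (illustrative numbers, not the paper's): ~16% single-sample
# accuracy still yields high coverage at k=100 when failures decorrelate.
print(coverage_at_k(n=250, c=40, k=100))
```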
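For the self-certainty paper, one plausible formulation scores a response by how far each decoding step's predictive distribution sits from uniform, then selects best-of-N without any external reward model. The paper's exact scoring function may differ in detail; this sketch only illustrates the reward-model-free idea:

```python
import math

def self_certainty(step_distributions):
    """Score one sampled response as the mean KL divergence between each
    decoding step's predictive distribution and the uniform distribution.
    More peaked distributions -> higher score -> more 'self-certain'.

    step_distributions: list of per-step probability vectors (each sums to 1).
    """
    total = 0.0
    for probs in step_distributions:
        v = len(probs)
        # KL(p || uniform) = sum_j p_j * log(p_j * v); 0*log(0) treated as 0
        total += sum(p * math.log(p * v) for p in probs if p > 0.0)
    return total / len(step_distributions)

def best_of_n(candidates):
    """Reward-model-free best-of-N: keep the response whose full
    reasoning trace maximizes self-certainty.

    candidates: list of (response_text, step_distributions) pairs.
    """
    return max(candidates, key=lambda c: self_certainty(c[1]))[0]
```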
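For reasoning vectors, task arithmetic reduces to element-wise weight differences between a GRPO-trained checkpoint and its SFT counterpart. The scaling factor `alpha` is a common task-arithmetic knob, not necessarily the paper's; the dicts stand in for model state dicts:

```python
def extract_reasoning_vector(grpo_weights, sft_weights):
    """The candidate 'reasoning vector' is the parameter-wise difference
    between a GRPO-trained model and its SFT-only counterpart
    (same architecture and initialization)."""
    return {name: grpo_weights[name] - sft_weights[name] for name in grpo_weights}

def add_reasoning_vector(target_weights, vector, alpha=1.0):
    """Transfer step: add the (optionally scaled) vector to a compatible
    target model's weights."""
    return {name: w + alpha * vector[name] for name, w in target_weights.items()}
```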
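The bnlearn tutorial's core workflow is three calls: structure learning, then parameter learning, then inference. The API names below are from memory of the bnlearn documentation; treat the exact keyword arguments as approximate:

```python
import bnlearn as bn

# Load the classic sprinkler example dataset (a pandas DataFrame)
df = bn.import_example('sprinkler')

# 1. Structure learning: recover a DAG from data (hill climbing + BIC score)
dag = bn.structure_learning.fit(df, methodtype='hc', scoretype='bic')

# 2. Parameter learning: fit conditional probability tables onto the DAG
model = bn.parameter_learning.fit(dag, df)

# 3. Inference: query the posterior of one variable given evidence
query = bn.inference.fit(model, variables=['Wet_Grass'], evidence={'Rain': 1})
print(query)
```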
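The Bayesian learning agent entry reduces to repeated application of Bayes' theorem plus an entropy readout of remaining uncertainty; the coin example below is mine, not the post's:

```python
import math

def bayes_update(prior, likelihoods):
    """One evidence-integration step: posterior(h) ∝ prior(h) * P(e | h).

    prior, likelihoods: dicts mapping hypothesis -> probability.
    """
    unnorm = {h: prior[h] * likelihoods[h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

def entropy_bits(beliefs):
    """Shannon entropy of the belief distribution; shrinks toward 0 as
    evidence concentrates probability on one hypothesis."""
    return -sum(p * math.log2(p) for p in beliefs.values() if p > 0)

# Two hypotheses about a coin, updated on one observed 'heads'
beliefs = {'fair': 0.5, 'biased': 0.5}
beliefs = bayes_update(beliefs, {'fair': 0.5, 'biased': 0.9})
print(beliefs, entropy_bits(beliefs))
```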
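For CISC, the core mechanism is a confidence-weighted majority vote over sampled reasoning paths; the paper's confidence normalization may differ from this plain sum:

```python
from collections import defaultdict

def confidence_weighted_vote(samples):
    """Confidence-informed self-consistency: instead of each sampled
    reasoning path casting one vote, its final answer votes with the
    model's confidence score, so fewer paths are needed overall.

    samples: list of (answer, confidence) pairs for one question.
    """
    scores = defaultdict(float)
    for answer, confidence in samples:
        scores[answer] += confidence
    return max(scores, key=scores.get)

# A lone high-confidence answer can outvote two agreeing low-confidence paths
print(confidence_weighted_vote([('42', 0.9), ('41', 0.3), ('41', 0.4)]))
```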
## Additional Core Theoretical Sources
| Category | Paper Name and Link | Year | Abstract/Summary |
|---|---|---|---|
| [CBF] | Facing Up to the Problem of Consciousness | 1995 | Foundational paper introducing the hard/easy problem distinction in consciousness studies. Argues the “hard problem” (explaining experience itself) cannot be solved by reductive methods and proposes nonreductive explanation based on structural coherence, organizational invariance, and double-aspect theory of information |
| [IR/RR] | Computational Metacognition | 2021 | Comprehensive framework for metacognition in cognitive systems dividing into explanatory, immediate, and anticipatory metacognition. Demonstrates implementation in MIDCA architecture showing how introspective monitoring and meta-level control improve problem-solving performance through learning better action models |
| [IR/RR] | Perpetual Self-Aware Cognitive Agents | 2007 | Describes integration of the Meta-AQUA multistrategy learning system with the INTRO planning agent to create perpetual self-aware cognitive agents. Demonstrates how agents can independently generate their own goals through full integration of cognition, metacognition, and intelligent behavior |
| [RR] | Introspection of Thought Helps AI Agents | 2025 | Proposes Introspection of Thought (INoT) framework enabling LLMs to execute programmatic dialogue reasoning through code-in-prompt design. Achieves 7.95% performance improvement with 58.3% reduction in token costs by conducting self-denial and reflection within LLM rather than external processes |
## Notes
- 46 unique sources successfully fetched with category labels and publication years where available
- Categories align with cognitive reasoning framework components
- Collection spans from theoretical foundations (1994) to cutting-edge research (2025)
- Added key foundational papers including Chalmers' seminal consciousness work and recent advances in each category
- Personal blog posts (matt.thompson.gr) represent ongoing research framework development
- Core theoretical frameworks now include consciousness studies, computational metacognition, and uncertainty quantification foundations
- Some entries lack specific publication dates due to preprint/blog post nature