Research Links Collection for Reasoning (LLM and other types)

Categories

[IR] Internal Reasoning: Hierarchical models, dual-system processing, chain-of-thought methods, and brain-inspired reasoning architectures

[ER] External Reasoning: Test-time compute scaling, sampling strategies, self-certainty measures, and runtime efficiency improvements

[RR] Reflective Reasoning: Knowledge vs reasoning separation, domain expertise evaluation, metacognition, and self-awareness mechanisms

[RV] Response Validation: Benchmarking frameworks, self-consistency checking, confidence measures, and accuracy assessment

[CBF] Cognitive-Biological Foundations: Neuroscience insights, dopamine research, causal reasoning, and biological inspiration for AI

Successfully Fetched Sites

| Category | Paper Name and Link | Year | Abstract/Summary |
| --- | --- | --- | --- |
| [ER] | You Don’t Need ‘Thinking’ In LLMs To Reason Better | 2024 | Proposes a ‘NoThinking’ method that outperforms chain-of-thought reasoning across seven tasks while using 2-5× fewer tokens; especially effective in low-budget settings |
| [ER] | Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters | 2024 | Studies inference-time compute scaling, showing that optimal test-time compute allocation can improve efficiency 4× over best-of-N baselines and outperform 14× larger models |
| [RV] | DSR-Bench: Evaluating the Structural Reasoning Abilities of LLMs via Data Structures | 2024 | Introduces a benchmark with 20 data structures and 35 operations to evaluate LLMs' structural reasoning. The best models achieve only 47% on the challenge subset, highlighting gaps in multi-dimensional reasoning |
| [ER] | Large Language Monkeys: Scaling Inference Compute with Repeated Sampling | 2024 | Explores repeated sampling for inference scaling, showing coverage scales log-linearly over four orders of magnitude. Improves DeepSeek-Coder performance from 15.9% to 56% on SWE-bench |
| [CBF] | Dopamine’s role in learning may be broader than previously thought | 2024 | Reveals dopamine’s dual role: it influences both fast working memory and slow reinforcement learning. Higher dopamine levels bias toward effortful strategies |
| [IR/RR] | Metacognition and Related Abilities of Large Language Models | 2024 | Explores LLMs' metacognitive abilities (“knowing that they know”). Discusses extracting knowledge from smaller models and augmenting behavior using larger models' metacognition |
| [ER] | Scalable Best-of-N Selection for Large Language Models via Self-Certainty | 2025 | Proposes a self-certainty metric that uses the LLM's own probability distributions to select the best response without external reward models. Uses global confidence measures across entire reasoning traces; scales efficiently and generalizes to open-ended tasks |
| [IR] | Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic | 2024 | Demonstrates that reasoning ability can be extracted and transferred between models as compact task vectors. Extracts a reasoning vector from the difference between GRPO and SFT models, with consistent gains when added to compatible models: GSM8K (+4.9%), HumanEval (+4.3%), BigBenchHard (+12.3%) |
| [IR] | Decoupling Knowledge and Reasoning in LLMs: An Exploration Using Cognitive Dual-System Theory | 2025 | Framework separating knowledge retrieval from reasoning adjustment using fast/slow thinking paradigms. Shows reasoning benefits are domain-specific and that parameter scaling improves both components |
| [IR] | How Do LLMs Really Reason? A Framework to Separate Logic from Knowledge | 2025 | Introduces a framework using Knowledge Index and Information Gain metrics to evaluate reasoning. Shows supervised fine-tuning improves accuracy but can harm reasoning depth |
| [IR] | Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains | 2024 | Evaluation framework decomposing reasoning into knowledge and logic components. Finds reasoning abilities don’t transfer well between domains, with domain knowledge being critical |
| [ER] | GitHub - ByebyeMonica/Reasoning-Agentic-RAG | 2024 | Curated collection of reasoning-enhanced RAG approaches, categorized into predefined reasoning (structured workflows) and agentic reasoning (autonomous tool orchestration) |
| [IR] | To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning | 2024 | Meta-analysis showing CoT primarily benefits math and logic tasks, with minimal gains elsewhere. Performance improvements correlate with the presence of symbolic operations (e.g., equals signs) |
| [CBF] | A Practical Starters' Guide to Causal Structure Learning with Bayesian Methods in Python | 2024 | Tutorial on Bayesian networks for causal inference, covering structure learning, parameter estimation, and inference using the bnlearn Python library |
| [IR] | Hierarchical Reasoning Model | 2025 | Brain-inspired 27M-parameter model with hierarchical processing, achieving 40.3% on ARC-AGI and near-perfect Sudoku performance using only 1,000 training samples without pre-training |
| [ER] | Reasoning RAG via System 1 or System 2: A Survey on Reasoning Agentic Retrieval-Augmented Generation for Industry Challenges | 2025 | Survey of reasoning-enhanced RAG systems, categorizing approaches into predefined reasoning (fixed pipelines) and agentic reasoning (autonomous orchestration) |
| [IR] | The Hidden Drivers of HRM’s Performance on ARC-AGI | 2025 | Analysis revealing that HRM’s performance comes primarily from iterative refinement loops and data augmentation rather than the hierarchical architecture itself |
| [RR] | A Bayesian Learning Agent: Bayes Theorem and Intelligent Agents | 2025 | Implements a belief-updating agent that uses Bayes' theorem to learn about its environment. Demonstrates entropy-based uncertainty measurement and the evolution of probability distributions through evidence integration |
| [IR] | GitHub - DominikBBB/hierarchical-llm-structured-outputs: LLM-in-the-loop optimisation | 2025 | Applies LLMs in optimisation loops for structured output generation, using a hierarchical approach in which the LLM iteratively refines outputs based on feedback and constraints |
| [RR] | Retrieval-augmented thoughts elicit context-aware reasoning in long-context Large Language Models | 2024 | RAT framework integrating retrieval into CoT prompting for long-context tasks. Improves reasoning by retrieving relevant context chunks before generating each reasoning step |
| [ER] | Tool Augmented Agents: Navigating the Landscape of Tool Interaction and Orchestration for LLM Agents | 2024 | Overview of tool augmentation for LLM agents, covering tool interaction patterns, orchestration strategies, and practical considerations for building multi-tool agent systems |
| [RV] | LLM Benchmarks and Evals Are Broken: Here’s How We Fix Them | 2025 | Analyzes shortcomings of current LLM evaluation and proposes context-aware metrics, adversarial robustness testing, and dynamic benchmark updates to address overfitting and data contamination |
| [RV] | Measuring Confidence and Uncertainty of LLMs in Real-World Applications | 2024 | Practical guide to implementing confidence and uncertainty measures in production LLM systems. Covers calibration techniques, threshold selection, and integration with human-in-the-loop workflows |
| [RV] | Exploring LLM Evaluation from a Data Scientist’s Perspective: Moving Beyond Accuracy Metrics | 2024 | Evaluation framework going beyond accuracy to include robustness, consistency, bias detection, and cost-benefit analysis. Proposes a multi-dimensional assessment approach for production systems |
| [IR] | o3 and the Power of Symbolic Reasoning In AI | 2025 | Analysis of OpenAI’s o3 results, highlighting the role of symbolic reasoning scaffolds and search strategies in its ARC-AGI breakthrough. Discusses implications for hybrid neural-symbolic approaches |
| [IR] | Can o3 Actually Reason? An Engineering Perspective | 2025 | Engineering analysis questioning whether o3’s performance represents true reasoning or sophisticated pattern matching backed by extensive search. Examines architectural implications and generalization |
| [RR] | LLMs can learn self-restraint through iterative self-reflection | 2024 | ReSearch algorithm for synthetic data generation via iterative self-reflection. Teaches LLMs to modulate the information they provide based on uncertainty, reducing hallucinations |
| [RR] | Self-Reflection in LLM Agents: Effects on Problem-Solving Performance | 2024 | Empirical study showing LLM agents significantly improve problem-solving performance through self-reflection (p < 0.001). Compares several types of self-reflection |
| [RV] | Confidence Improves Self-Consistency in LLMs | 2025 | CISC performs a weighted majority vote based on confidence scores, reducing the number of required reasoning paths by 40%. Introduces a within-question confidence evaluation methodology |
| [RV] | Cycles of Thought: Measuring LLM Confidence through Stable Explanations | 2024 | Framework for measuring LLM uncertainty via the distribution of generated explanations, interpreting model+explanation pairs as test-time classifiers for a posterior answer distribution |
| [ER/RV] | C4AI ML Agents: Reasoning Approaches Comparison Framework | 2024 | Framework comparing 10 reasoning techniques (CoT, ToT, PoT, etc.) across multiple models and tasks, investigating effectiveness and cost-benefit trade-offs |
| [RV] | Consistency in Language Models: Current Landscape, Challenges, and Future Directions | 2025 | Survey defining behavioral consistency types (logical vs. non-logical), evaluation frameworks, and multilingual consistency challenges. Reviews current approaches to measuring and enhancing consistency across task domains |
| [ER] | Crosslingual Reasoning through Test-Time Scaling | 2025 | Investigates how English reasoning finetuning with long chains of thought generalizes across languages. Shows that scaling inference compute for English-centric reasoning models improves multilingual mathematical reasoning, outperforming models 2× larger |
| [ER] | OptimalThinkingBench: Evaluating Over and Underthinking in LLMs | 2025 | Unified benchmark evaluating overthinking and underthinking with thinking-adjusted accuracy metrics. Shows thinking models often overthink simple queries while non-thinking models underthink complex reasoning tasks |
| [ER] | SciToolAgent: A Knowledge Graph-Driven Scientific Agent for Multi-Tool Integration | 2025 | LLM-powered agent that automates hundreds of scientific tools across biology, chemistry, and materials science using knowledge-graph-driven tool selection and graph-based retrieval-augmented generation, with comprehensive safety checking |
| [CBF] | Sapir-Whorf Hypothesis | 1994 | Review of linguistic relativity research examining how language influences thought. Covers color perception, counterfactual reasoning, and concept labeling studies, showing mutual influence between language and cognition |
| [CBF] | Sapir-Whorf does not apply to Programming Languages | 2024 | Argues programming-language effects on thinking are better explained by the “Tetris effect” (practice effects) than by linguistic relativism. Distinguishes human language cognition from programming skill development |
| [RV] | Uncertainty Quantification and Confidence Calibration in Large Language Models: A Survey | 2025 | Survey on uncertainty quantification for LLMs introducing a taxonomy based on computational efficiency and uncertainty dimensions (input, reasoning, parameter, prediction). Reviews methods for enhancing reliability and trustworthiness in high-stakes applications |
| [RV] | Pro2Guard: Proactive Runtime Enforcement of LLM Agent Safety via Probabilistic Model Checking | 2025 | Proactive runtime safety enforcement using probabilistic reachability analysis over Discrete-Time Markov Chains to anticipate unsafe behaviors before violations occur. Achieves early safety enforcement on 93.6% of unsafe tasks and 100% prediction of traffic violations |
| [RV] | SAUP: Situation Awareness Uncertainty Propagation on LLM Agent | 2024 | Framework for propagating uncertainty through each step of LLM-agent reasoning, incorporating situational awareness by assigning situational weights to step uncertainties; achieves up to 20% improvement in AUROC over existing methods |
| [RV] | Chonkie: No-nonsense RAG chunking library | 2025 | Lightweight, fast text-chunking library with a RecursiveChunker (hierarchical semantic splitting) and a SemanticChunker (similarity-based splitting). Features 24+ integrations and up to 33× faster performance than alternatives; useful for preparing text for reasoning evaluation and analysis |
| [CBF] | Technological Approach to Mind Everywhere: An Experimentally-Grounded Framework for Understanding Diverse Bodies and Minds | 2022 | Levin’s TAME framework distinguishing cognitive consciousness (observable decision-making competencies) from phenomenal consciousness (subjective experience). Proposes an empirical approach to studying cognition across diverse biological substrates, from cells to complex organisms |
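
Several of the [ER] entries above rank candidate responses by the model's own token probabilities (e.g., the self-certainty best-of-N paper). A minimal sketch of probability-based best-of-N selection, assuming per-token log-probabilities are available from the decoding API; the mean-log-probability score used here is a simplified stand-in for that paper's self-certainty metric, not its actual formula:

```python
import math

def sequence_confidence(token_logprobs):
    """Mean token log-probability; higher means the model was more certain."""
    return sum(token_logprobs) / len(token_logprobs)

def best_of_n(candidates):
    """Pick the candidate whose generation the model was most confident about.

    `candidates` is a list of (answer_text, [token_logprob, ...]) pairs,
    e.g. as returned by an API that exposes per-token log-probabilities.
    """
    return max(candidates, key=lambda c: sequence_confidence(c[1]))[0]

samples = [
    ("42", [math.log(0.9), math.log(0.8)]),  # confident trace
    ("41", [math.log(0.5), math.log(0.4)]),  # hesitant trace
]
print(best_of_n(samples))  # prints "42"
```

Because the score is internal to the model, no external reward model is needed; the trade-off is that an over-confident wrong trace can still win.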

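The confidence-weighted self-consistency idea from the CISC entry can be sketched in a few lines: each sampled answer votes with its confidence score rather than a flat count, so fewer reasoning paths are needed to converge. How confidence is obtained (verbalized score, token probability, etc.) is left abstract; this is an illustrative sketch, not the paper's implementation:

```python
from collections import defaultdict

def weighted_majority_vote(samples):
    """Confidence-weighted self-consistency over sampled answers.

    `samples` is a list of (answer, confidence) pairs; the answer whose
    confidence-weighted vote total is largest wins.
    """
    totals = defaultdict(float)
    for answer, confidence in samples:
        totals[answer] += confidence
    return max(totals, key=totals.get)

votes = [("7", 0.9), ("8", 0.4), ("8", 0.4), ("7", 0.85)]
print(weighted_majority_vote(votes))  # "7": 1.75 beats "8": 0.8
```

With flat counts this example would tie 2-2; the confidence weights break the tie in favor of the answer the model was surer about.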
Additional Core Theoretical Sources

| Category | Paper Name and Link | Year | Abstract/Summary |
| --- | --- | --- | --- |
| [CBF] | Facing Up to the Problem of Consciousness | 1995 | Foundational paper introducing the hard/easy problem distinction in consciousness studies. Argues the “hard problem” (explaining experience itself) cannot be solved by reductive methods and proposes a nonreductive explanation based on structural coherence, organizational invariance, and a double-aspect theory of information |
| [IR/RR] | Computational Metacognition | 2021 | Framework for metacognition in cognitive systems, divided into explanatory, immediate, and anticipatory metacognition. Demonstrates an implementation in the MIDCA architecture showing how introspective monitoring and meta-level control improve problem solving by learning better action models |
| [IR/RR] | Perpetual Self-Aware Cognitive Agents | 2007 | Describes integrating the Meta-AQUA multistrategy learning system with the INTRO planning agent to create perpetually self-aware cognitive agents. Shows how agents can independently generate goals through full integration of cognition and metacognition |
| [RV] | Uncertainty Quantification and Confidence Calibration in Large Language Models: A Survey | 2025 | Survey introducing a novel taxonomy categorizing UQ methods by computational efficiency and uncertainty dimensions (input, reasoning, parameter, prediction). Reviews methods, benchmarks, metrics, and applications, and identifies challenges in efficiency-performance trade-offs and cross-modal uncertainty |
| [RR] | Introspection of Thought Helps AI Agents | 2025 | Proposes the Introspection of Thought (INoT) framework, enabling LLMs to execute programmatic dialogue reasoning through a code-in-prompt design. Achieves a 7.95% performance improvement with a 58.3% reduction in token costs by conducting self-denial and reflection within the LLM rather than in external processes |
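
The uncertainty-quantification and calibration entries above all revolve around one question: how well does a model's stated confidence track its actual accuracy? A minimal expected calibration error (ECE) sketch with equal-width confidence bins, assuming predictions arrive as (confidence, is_correct) pairs; this is the standard textbook formulation, not any one survey's variant:

```python
def expected_calibration_error(predictions, n_bins=10):
    """ECE: bin predictions by confidence, then take the weighted average
    gap between each bin's mean confidence and its empirical accuracy.

    `predictions` is a list of (confidence, is_correct) pairs with
    confidence in [0, 1]; lower ECE means better calibration.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, correct in predictions:
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, correct))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / len(predictions)) * abs(avg_conf - accuracy)
    return ece

# Well-calibrated toy data: 80% confidence, 4 of 5 correct.
preds = [(0.8, True)] * 4 + [(0.8, False)]
print(round(expected_calibration_error(preds), 3))  # prints 0.0
```

A model that says 100% and is always wrong scores the maximum ECE of 1.0, which is why the surveys treat calibration as a distinct axis from accuracy.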

Notes

  • 51 unique sources successfully fetched with category labels and publication years where available
  • Categories align with cognitive reasoning framework components
  • Collection spans from theoretical foundations (1995) to cutting-edge research (2025)
  • Added key foundational papers including Chalmers' seminal consciousness work and recent advances in each category
  • Personal blog posts (matt.thompson.gr) represent ongoing research framework development
  • Core theoretical frameworks now include consciousness studies, computational metacognition, and uncertainty quantification foundations
  • Some entries lack specific publication dates due to preprint/blog post nature

Tags: Agentic AI, AGI, Intelligent Agents, Research