# Research Links Collection for Reasoning (LLM and other types)

## Categories
- [IR] Internal Reasoning: Hierarchical models, dual-system processing, chain-of-thought methods, and brain-inspired reasoning architectures
- [ER] External Reasoning: Test-time compute scaling, sampling strategies, self-certainty measures, and runtime efficiency improvements
- [RR] Reflective Reasoning: Knowledge vs. reasoning separation, domain expertise evaluation, metacognition, and self-awareness mechanisms
- [RV] Response Validation: Benchmarking frameworks, self-consistency checking, confidence measures, and accuracy assessment
- [CBF] Cognitive-Biological Foundations: Neuroscience insights, dopamine research, causal reasoning, and biological inspiration for AI
## Successfully Fetched Sites
| Category | Paper Name and Link | Year | Abstract/Summary |
|---|---|---|---|
| [ER] | You Don’t Need ‘Thinking’ In LLMs To Reason Better | 2024 | Proposes ‘NoThinking’ method that outperforms chain-of-thought reasoning across seven tasks while using 2-5× fewer tokens, especially effective in low-budget settings |
| [ER] | Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters | 2024 | Studies inference-time computation scaling, showing that optimal test-time compute allocation can improve efficiency 4× over best-of-N baselines and outperform 14× larger models |
| [RV] | DSR-Bench: Evaluating the Structural Reasoning Abilities of LLMs via Data Structures | 2025 | Introduces benchmark with 20 data structures and 35 operations to evaluate LLMs' structural reasoning. Best models achieve only 47% on challenge subset, highlighting gaps in multi-dimensional reasoning |
| [ER] | Large Language Monkeys: Scaling Inference Compute with Repeated Sampling | 2024 | Explores repeated sampling for inference scaling, showing coverage scales log-linearly over four orders of magnitude. Improves DeepSeek-Coder performance from 15.9% to 56% on SWE-bench (coverage estimator sketched below the table) |
| [CBF] | Dopamine’s role in learning may be broader than previously thought | 2024 | Research reveals dopamine’s dual role: influences both fast working memory and slow reinforcement learning. Higher dopamine levels bias toward effortful strategies |
| [IR/RR] | Metacognition and Related Abilities of Large Language Models | 2024 | Explores LLMs' metacognitive abilities - “knowing that they know.” Discusses extracting knowledge from smaller models and augmenting behavior using larger models' metacognition |
| [ER] | Scalable Best-of-N Selection for Large Language Models via Self-Certainty | 2025 | Proposes self-certainty metric using LLM probability distributions to select best responses without external reward models. Uses global confidence measures across entire reasoning traces. Scales efficiently and generalizes to open-ended tasks (see sketch below the table) |
| [IR] | Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic | 2024 | Demonstrates that reasoning ability can be extracted and transferred between models as compact task vectors. Extracts the reasoning vector from the difference between GRPO and SFT models, showing consistent improvements across reasoning benchmarks when added to compatible models: GSM8K (+4.9%), HumanEval (+4.3%), BigBenchHard (+12.3%) (see sketch below the table) |
| [IR] | Decoupling Knowledge and Reasoning in LLMs: An Exploration Using Cognitive Dual-System Theory | 2025 | Framework separating knowledge retrieval and reasoning adjustment using fast/slow thinking paradigms. Shows reasoning benefits are domain-specific and parameter scaling improves both components |
| [IR] | How Do LLMs Really Reason? A Framework to Separate Logic from Knowledge | 2025 | Introduces framework using Knowledge Index and Information Gain metrics to evaluate reasoning. Shows supervised fine-tuning improves accuracy but can harm reasoning depth |
| [IR] | Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains | 2024 | Evaluation framework decomposing reasoning into knowledge and logic components. Finds that reasoning abilities don't transfer well between domains, with domain knowledge being critical |
| [ER] | GitHub - ByebyeMonica/Reasoning-Agentic-RAG | 2024 | Curated collection of reasoning-enhanced RAG approaches, categorized into predefined reasoning (structured workflows) and agentic reasoning (autonomous tool orchestration) |
| [IR] | To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning | 2024 | Meta-analysis showing CoT primarily benefits math/logic tasks with minimal gains elsewhere. Performance improvements correlate with presence of symbolic operations (equals signs) |
| [CBF] | A Practical Starters' Guide to Causal Structure Learning with Bayesian Methods in Python | 2024 | Comprehensive tutorial on Bayesian networks for causal inference, covering structure learning, parameter estimation, and inference using the bnlearn Python library (workflow sketched below the table) |
| [IR] | Hierarchical Reasoning Model | 2025 | Brain-inspired 27M parameter model with hierarchical processing achieving 40.3% on ARC-AGI and near-perfect Sudoku performance using only 1000 training samples without pre-training |
| [ER] | Reasoning RAG via System 1 or System 2: A Survey on Reasoning Agentic Retrieval-Augmented Generation for Industry Challenges | 2025 | Comprehensive survey of reasoning-enhanced RAG systems, categorizing approaches into predefined reasoning (fixed pipelines) and agentic reasoning (autonomous orchestration) |
| [IR] | The Hidden Drivers of HRM’s Performance on ARC-AGI | 2025 | In-depth analysis revealing that HRM’s performance comes primarily from iterative refinement loops and data augmentation rather than the hierarchical architecture itself |
| [RR] | A Bayesian Learning Agent: Bayes Theorem and Intelligent Agents | 2025 | Implements belief-updating agent using Bayes' theorem for environment learning. Demonstrates entropy-based uncertainty measurement and probability distribution evolution through evidence integration (see sketch below the table) |
| [IR] | GitHub - DominikBBB/hierarchical-llm-structured-outputs: LLM-in-the-loop optimisation | 2025 | Applies LLMs in optimisation loops for structured output generation. Uses hierarchical approach where LLM iteratively refines outputs based on feedback and constraints |
| [RR] | Retrieval-augmented thoughts elicit context-aware reasoning in long-context Large Language Models | 2024 | RAT framework integrating retrieval processes into CoT prompting for long-context tasks. Demonstrates improved reasoning by retrieving relevant context chunks before generating each reasoning step |
| [ER] | Tool Augmented Agents: Navigating the Landscape of Tool Interaction and Orchestration for LLM Agents | 2024 | Overview of tool augmentation for LLM agents, covering tool interaction patterns, orchestration strategies, and practical implementation considerations for building multi-tool agent systems |
| [RV] | LLM Benchmarks and Evals Are Broken: Here’s How We Fix Them | 2025 | Analysis of current LLM evaluation shortcomings proposing solutions including context-aware metrics, adversarial robustness testing, and dynamic benchmark updates to address overfitting and data contamination |
| [RV] | Measuring Confidence and Uncertainty of LLMs in Real-World Applications | 2024 | Practical guide to implementing confidence and uncertainty measures in production LLM systems. Covers calibration techniques, threshold selection, and integration with human-in-the-loop workflows |
| [RV] | Exploring LLM Evaluation from a Data Scientist’s Perspective: Moving Beyond Accuracy Metrics | 2024 | Comprehensive evaluation framework beyond accuracy including robustness, consistency, bias detection, and cost-benefit analysis. Proposes multi-dimensional assessment approach for production systems |
| [IR] | o3 and the Power of Symbolic Reasoning In AI | 2025 | Analysis of OpenAI’s o3 model achievements highlighting role of symbolic reasoning scaffolds and search strategies in achieving ARC-AGI breakthrough results. Discusses implications for hybrid neural-symbolic approaches |
| [IR] | Can o3 Actually Reason? An Engineering Perspective | 2025 | Engineering analysis questioning whether o3’s performance represents true reasoning or sophisticated pattern matching with extensive search. Examines architectural implications and generalization capabilities |
| [RR] | LLMs can learn self-restraint through iterative self-reflection | 2024 | ReSearch algorithm for synthetic data generation using iterative self-reflection. Teaches LLMs to modulate information based on uncertainty, reducing hallucinations |
| [RR] | Self-Reflection in LLM Agents: Effects on Problem-Solving Performance | 2024 | Empirical study showing LLM agents significantly improve problem-solving performance through self-reflection (p < 0.001). Compares various types of self-reflection approaches |
| [RV] | Confidence Improves Self-Consistency in LLMs | 2025 | CISC performs a weighted majority vote based on confidence scores, reducing required reasoning paths by 40%. Introduces within-question confidence evaluation methodology (see sketch below the table) |
| [RV] | Cycles of Thought: Measuring LLM Confidence through Stable Explanations | 2024 | Framework for measuring LLM uncertainty through distribution of generated explanations. Interprets model+explanation pairs as test-time classifiers for posterior answer distribution |
| [ER/RV] | C4AI ML Agents: Reasoning Approaches Comparison Framework | 2024 | Comprehensive framework comparing 10 reasoning techniques (CoT, ToT, PoT, etc.) across multiple models and tasks. Investigates reasoning approach effectiveness and cost-benefit analysis |
| [RV] | Consistency in Language Models: Current Landscape, Challenges, and Future Directions | 2025 | Comprehensive survey defining behavioral consistency types (logical vs nonlogical), evaluation frameworks, and multilingual consistency challenges. Reviews current approaches to measure and enhance consistency across task domains |
| [ER] | Crosslingual Reasoning through Test-Time Scaling | 2025 | Investigates how English reasoning finetuning with long chains of thought generalizes across languages. Shows that scaling inference compute for English-centric reasoning models improves multilingual mathematical reasoning, with scaled models outperforming models twice their size |
| [ER] | OptimalThinkingBench: Evaluating Over and Underthinking in LLMs | 2025 | Introduces unified benchmark evaluating overthinking and underthinking in LLMs with thinking-adjusted accuracy metrics. Shows thinking models often overthink on simple queries while non-thinking models underthink on complex reasoning tasks |
| [ER] | SciToolAgent: A Knowledge Graph-Driven Scientific Agent for Multi-Tool Integration | 2025 | LLM-powered agent that automates hundreds of scientific tools across biology, chemistry, and materials science using knowledge graph-driven tool selection and graph-based retrieval-augmented generation with comprehensive safety checking |
| [CBF] | Sapir-Whorf Hypothesis | 1994 | Comprehensive review of linguistic relativity research examining how language influences thought. Covers color perception, counterfactual reasoning, and concept labeling studies, showing mutual influence between language and cognition |
| [CBF] | Sapir-Whorf does not apply to Programming Languages | 2024 | Argues programming language effects on thinking are better explained by “Tetris effect” (practice effects) rather than linguistic relativism. Distinguishes between human language cognition and programming skill development |
| [RV] | Uncertainty Quantification and Confidence Calibration in Large Language Models: A Survey | 2025 | Comprehensive survey on uncertainty quantification for LLMs introducing taxonomy based on computational efficiency and uncertainty dimensions (input, reasoning, parameter, prediction). Reviews methods for enhancing reliability and trustworthiness in high-stakes applications |
| [RV] | Pro2Guard: Proactive Runtime Enforcement of LLM Agent Safety via Probabilistic Model Checking | 2025 | Proactive runtime safety enforcement framework using probabilistic reachability analysis and Discrete-Time Markov Chains to anticipate unsafe behaviors before violations occur. Achieves early safety enforcement on 93.6% of unsafe tasks and 100% prediction of traffic violations |
| [RV] | SAUP: Situation Awareness Uncertainty Propagation on LLM Agent | 2024 | Framework for propagating uncertainty through each step of LLM-based agent reasoning processes. Incorporates situational awareness by assigning situational weights to step uncertainties, achieving up to 20% improvement in AUROC over existing methods |
| [RV] | Chonkie: No-nonsense RAG chunking library | 2025 | Lightweight, fast text chunking library with RecursiveChunker (hierarchical semantic splitting) and SemanticChunker (similarity-based splitting). Features 24+ integrations and up to 33× faster performance than alternatives; useful for preparing text in reasoning-evaluation and analysis pipelines |
| [CBF] | Technological Approach to Mind Everywhere: An Experimentally-Grounded Framework for Understanding Diverse Bodies and Minds | 2022 | Levin’s TAME framework distinguishing cognitive consciousness (observable decision-making competencies) from phenomenal consciousness (subjective experience). Proposes empirical approach to studying cognition across diverse biological substrates from cells to complex organisms |
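### Illustrative sketches

A few of the entries above describe procedures compact enough to sketch. All sketches below are minimal readings of the row summaries, written in Python with function names and parameters of my own choosing; none reproduce the papers' actual implementations. For Large Language Monkeys, coverage under repeated sampling is commonly estimated with the standard unbiased pass@k estimator:

```python
from math import comb

def coverage_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples,
    drawn without replacement from n attempts of which c were correct,
    solves the task. 'Coverage' in the repeated-sampling setting is this
    quantity as k grows."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example (illustrative numbers, not the paper's): ~16% single-sample
# accuracy still yields high coverage at k=100 when failures decorrelate.
print(coverage_at_k(n=250, c=40, k=100))
```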
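For the self-certainty paper, one plausible formulation scores a response by how far each decoding step's predictive distribution sits from uniform, then selects best-of-N without any external reward model. The paper's exact scoring function may differ in detail; this sketch only illustrates the reward-model-free idea:

```python
import math

def self_certainty(step_distributions):
    """Score one sampled response as the mean KL divergence between each
    decoding step's predictive distribution and the uniform distribution.
    More peaked distributions -> higher score -> more 'self-certain'.

    step_distributions: list of per-step probability vectors (each sums to 1).
    """
    total = 0.0
    for probs in step_distributions:
        v = len(probs)
        # KL(p || uniform) = sum_j p_j * log(p_j * v); 0*log(0) treated as 0
        total += sum(p * math.log(p * v) for p in probs if p > 0.0)
    return total / len(step_distributions)

def best_of_n(candidates):
    """Reward-model-free best-of-N: keep the response whose full
    reasoning trace maximizes self-certainty.

    candidates: list of (response_text, step_distributions) pairs.
    """
    return max(candidates, key=lambda c: self_certainty(c[1]))[0]
```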
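For reasoning vectors, task arithmetic reduces to element-wise weight differences between a GRPO-trained checkpoint and its SFT counterpart. The scaling factor `alpha` is a common task-arithmetic knob, not necessarily the paper's; the dicts stand in for model state dicts:

```python
def extract_reasoning_vector(grpo_weights, sft_weights):
    """The candidate 'reasoning vector' is the parameter-wise difference
    between a GRPO-trained model and its SFT-only counterpart
    (same architecture and initialization)."""
    return {name: grpo_weights[name] - sft_weights[name] for name in grpo_weights}

def add_reasoning_vector(target_weights, vector, alpha=1.0):
    """Transfer step: add the (optionally scaled) vector to a compatible
    target model's weights."""
    return {name: w + alpha * vector[name] for name, w in target_weights.items()}
```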
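The bnlearn tutorial's core workflow is three calls: structure learning, then parameter learning, then inference. The API names below are from memory of the bnlearn documentation; treat the exact keyword arguments as approximate:

```python
import bnlearn as bn

# Load the classic sprinkler example dataset (a pandas DataFrame)
df = bn.import_example('sprinkler')

# 1. Structure learning: recover a DAG from data (hill climbing + BIC score)
dag = bn.structure_learning.fit(df, methodtype='hc', scoretype='bic')

# 2. Parameter learning: fit conditional probability tables onto the DAG
model = bn.parameter_learning.fit(dag, df)

# 3. Inference: query the posterior of one variable given evidence
query = bn.inference.fit(model, variables=['Wet_Grass'], evidence={'Rain': 1})
print(query)
```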
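The Bayesian learning agent entry reduces to repeated application of Bayes' theorem plus an entropy readout of remaining uncertainty; the coin example below is mine, not the post's:

```python
import math

def bayes_update(prior, likelihoods):
    """One evidence-integration step: posterior(h) ∝ prior(h) * P(e | h).

    prior, likelihoods: dicts mapping hypothesis -> probability.
    """
    unnorm = {h: prior[h] * likelihoods[h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

def entropy_bits(beliefs):
    """Shannon entropy of the belief distribution; shrinks toward 0 as
    evidence concentrates probability on one hypothesis."""
    return -sum(p * math.log2(p) for p in beliefs.values() if p > 0)

# Two hypotheses about a coin, updated on one observed 'heads'
beliefs = {'fair': 0.5, 'biased': 0.5}
beliefs = bayes_update(beliefs, {'fair': 0.5, 'biased': 0.9})
print(beliefs, entropy_bits(beliefs))
```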
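For CISC, the core mechanism is a confidence-weighted majority vote over sampled reasoning paths; the paper's confidence normalization may differ from this plain sum:

```python
from collections import defaultdict

def confidence_weighted_vote(samples):
    """Confidence-informed self-consistency: instead of each sampled
    reasoning path casting one vote, its final answer votes with the
    model's confidence score, so fewer paths are needed overall.

    samples: list of (answer, confidence) pairs for one question.
    """
    scores = defaultdict(float)
    for answer, confidence in samples:
        scores[answer] += confidence
    return max(scores, key=scores.get)

# A lone high-confidence answer can outvote two agreeing low-confidence paths
print(confidence_weighted_vote([('42', 0.9), ('41', 0.3), ('41', 0.4)]))
```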
## Additional Core Theoretical Sources
| Category | Paper Name and Link | Year | Abstract/Summary |
|---|---|---|---|
| [CBF] | Facing Up to the Problem of Consciousness | 1995 | Foundational paper introducing the hard/easy problem distinction in consciousness studies. Argues the “hard problem” (explaining experience itself) cannot be solved by reductive methods and proposes nonreductive explanation based on structural coherence, organizational invariance, and double-aspect theory of information |
| [IR/RR] | Computational Metacognition | 2021 | Comprehensive framework for metacognition in cognitive systems dividing into explanatory, immediate, and anticipatory metacognition. Demonstrates implementation in MIDCA architecture showing how introspective monitoring and meta-level control improve problem-solving performance through learning better action models |
| [IR/RR] | Perpetual Self-Aware Cognitive Agents | 2007 | Describes integration of the Meta-AQUA multistrategy learning system with the INTRO planning agent to create perpetual self-aware cognitive agents. Demonstrates how agents can independently generate their own goals through full integration of cognition, metacognition, and intelligent behavior |
| [RR] | Introspection of Thought Helps AI Agents | 2025 | Proposes Introspection of Thought (INoT) framework enabling LLMs to execute programmatic dialogue reasoning through code-in-prompt design. Achieves 7.95% performance improvement with 58.3% reduction in token costs by conducting self-denial and reflection within LLM rather than external processes |
## Notes
- 46 unique sources successfully fetched with category labels and publication years where available
- Categories align with cognitive reasoning framework components
- Collection spans from theoretical foundations (1994) to cutting-edge research (2025)
- Added key foundational papers including Chalmers' seminal consciousness work and recent advances in each category
- Personal blog posts (matt.thompson.gr) represent ongoing research framework development
- Core theoretical frameworks now include consciousness studies, computational metacognition, and uncertainty quantification foundations
- Some entries lack specific publication dates due to preprint/blog post nature