Large language models (LLMs) now handle a wide range of tasks, from answering questions to generating text, but they still struggle with tasks that require reasoning and problem-solving, such as math and science problems. Proposed remedies include symbolic reasoning, multi-agent systems, and hybrid approaches that combine the strengths of different models, alongside new benchmarks and evaluation metrics for assessing LLMs on complex tasks. Despite these advances, matching human-level performance on complex tasks remains an open problem. LLMs are also being explored in real-world applications such as healthcare, finance, and education, where deploying them in practical settings raises challenges of its own.
RetailBench is a new benchmark for evaluating the long-horizon reasoning and decision-making of LLM agents in realistic retail environments. It models retail management as a partially observable decision process in which agents must handle pricing, replenishment, supplier selection, shelf assortment, inventory aging, customer feedback, external events, and cash-flow constraints. A separate framework, STRIDE, targets reasoning ability through discriminative estimation. STRIDE contrasts successful and failed trajectories within each response group to estimate the outcome-discriminative preference of each n-gram strategic pattern, then combines this signal with reasoning saliency entropy to identify decision-relevant strategic patterns.
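To make the contrastive idea behind STRIDE more concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes each response in a group is a list of tokens with a binary success label, and it scores every n-gram by the smoothed log-odds of appearing in successful versus failed responses. The function names, the additive smoothing parameter `alpha`, and the log-odds form are illustrative assumptions. The reasoning saliency entropy term is left out because the summary does not specify it.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the set of distinct n-grams in one token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def discriminative_scores(group, n=2, alpha=1.0):
    """Score n-grams by smoothed log-odds of success vs. failure.

    group: list of (tokens, succeeded) pairs from one response group.
    alpha: additive smoothing (an assumption, used to avoid log(0)).
    A positive score means the pattern appears more often in successes.
    """
    wins = [tokens for tokens, ok in group if ok]
    losses = [tokens for tokens, ok in group if not ok]

    # Count, per n-gram, how many responses of each outcome contain it.
    win_counts = Counter(g for tokens in wins for g in ngrams(tokens, n))
    loss_counts = Counter(g for tokens in losses for g in ngrams(tokens, n))

    scores = {}
    for g in set(win_counts) | set(loss_counts):
        p_win = (win_counts[g] + alpha) / (len(wins) + 2 * alpha)
        p_loss = (loss_counts[g] + alpha) / (len(losses) + 2 * alpha)
        scores[g] = math.log(p_win / p_loss)
    return scores


# Hypothetical response group: two successful and two failed trajectories.
example_group = [
    ("check the units first".split(), True),
    ("check the units then solve".split(), True),
    ("guess the answer".split(), False),
    ("guess the units".split(), False),
]
example_scores = discriminative_scores(example_group, n=2)
```

Sorting `example_scores` by value would surface the patterns most associated with success at the top and those most associated with failure at the bottom. A real system would presumably work on model tokens rather than whitespace-split words.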
Time series analysis is another active area. IRTS-ToolBench is a new benchmark for evaluating LLMs on irregular time series analysis tasks. It provides standardized inputs and a reproducible evaluation protocol, and covers tasks ranging from simple forecasting to complex anomaly detection. For forecasting specifically, a new framework, SERAF, conducts dual retrieval over the time series and their self-generated textual descriptions, then selectively and jointly uses the retrieved patterns to guide future predictions.
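The dual-retrieval idea described for SERAF can be illustrated with a small Python sketch. This is a toy under stated assumptions, not the paper's method. Each memory entry is assumed to hold a numeric window, a textual description, and the values that followed. Series similarity is a z-normalized correlation, text similarity is word-set Jaccard overlap standing in for a real text encoder, and "selective and joint" use is approximated by keeping only entries that rank in the top `k` on both channels. All names here are hypothetical.

```python
import math


def zscore(xs):
    """Standardize a window; a constant window maps to all zeros."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs)) or 1.0
    return [(x - mean) / std for x in xs]


def series_similarity(a, b):
    """Correlation between two equal-length windows, in [-1, 1]."""
    za, zb = zscore(a), zscore(b)
    return sum(x * y for x, y in zip(za, zb)) / len(a)


def text_similarity(a, b):
    """Jaccard overlap of word sets (a stand-in for text embeddings)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def dual_retrieve(query_window, query_text, memory, k=2):
    """Keep entries that rank in the top-k on BOTH retrieval channels."""
    by_series = sorted(
        memory,
        key=lambda m: series_similarity(query_window, m["window"]),
        reverse=True,
    )[:k]
    by_text = sorted(
        memory,
        key=lambda m: text_similarity(query_text, m["text"]),
        reverse=True,
    )[:k]
    return [m for m in by_series if any(m is t for t in by_text)]


# Hypothetical memory of past windows, descriptions, and continuations.
memory = [
    {"window": [1, 2, 3, 4], "text": "steady upward trend", "future": [5, 6]},
    {"window": [4, 3, 2, 1], "text": "steady downward trend", "future": [0, -1]},
    {"window": [1, 3, 1, 3], "text": "regular oscillation", "future": [1, 3]},
    {"window": [2, 4, 6, 8], "text": "sales rise quickly", "future": [10, 12]},
]
retrieved = dual_retrieve([10, 20, 30, 40], "upward trend", memory, k=2)
```

The continuations in `retrieved` could then be passed to a forecaster as reference patterns. Requiring agreement between both channels is only one way to be selective; a weighted combination of the two scores would be another.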
Applied work is represented as well. PAL-Bench is a new benchmark for evaluating LLMs on evidence-grounded profile reconstruction from longitudinal personal albums. It provides a controlled evaluation environment with tasks ranging from simple image classification to complex scene understanding. In healthcare, a new framework, LiteOdyssey, addresses rare disease diagnosis. LiteOdyssey guides reasoning language models through a clinical genetics workflow and uses dynamic access to public biomedical tools to improve diagnostic accuracy.
Key Takeaways
- Large language models (LLMs) have made significant progress in performing a wide range of tasks, but still struggle with tasks that require reasoning and problem-solving.
- Researchers have proposed various methods for improving the reasoning abilities of LLMs, including symbolic reasoning, multi-agent systems, and hybrid approaches.
- New benchmarks and evaluation metrics have been developed to assess the performance of LLMs on complex tasks.
- LLMs are being explored for use in real-world applications, such as healthcare, finance, and education.
- Researchers are working to address the challenges of deploying LLMs in practical settings.
- A new benchmark, RetailBench, has been introduced to evaluate the performance of LLMs in realistic retail environments.
- A new framework, STRIDE, has been proposed to improve the reasoning abilities of LLMs through discriminative estimation.
- A new benchmark, IRTS-ToolBench, has been introduced to evaluate the performance of LLMs on irregular time series analysis tasks.
- A new framework, SERAF, has been proposed to improve the performance of LLMs on time series forecasting tasks.
- A new benchmark, PAL-Bench, has been introduced to evaluate the performance of LLMs on profile reconstruction from longitudinal personal albums.
- A new framework, LiteOdyssey, has been proposed to improve the performance of LLMs on rare disease diagnosis tasks.
Sources
- Do LLMs Reliably Identify Correct Information Units in Aphasic Discourse?
- Unassigned Agents in Compilation-based Multi-agent Path Finding
- Multi-agent Framework for Time-Sensitive Complementary Collaboration in Minecraft
- Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking
- Advanced Machine Learning and Deep Learning Techniques for Enhanced Cattle Identification and Detection: A Comprehensive Review
- Overcoming the Impedance Mismatch: A Theoretical Roadmap for Fusing Foundation Models and Knowledge Graphs
- AI-Driven Framework for Adaptive Water Network Management with Proof-of-Concept Implementation: Addressing Non-Revenue Water in Jordan
- Mitigating Visual Hallucinations in Multimodal Systems through Retrieval-Augmented Reliability-Aware Inference
- Rethinking Scaffolding in LLM Tutors: The Interactional Mismatch Between Benchmarks and Real-World Deployments
- Heteroskedastic Signals in Budgeted LLM Verification: Structural Heterogeneity Limits Optimization Gains
- TrustedARI: Towards Trust-Native Agentic Routing Infrastructure for Agentic AI
- An Integrated System for Real-Time Student Assessment and Career Guidance Using Neural Networks in Computing Disciplines
- SciText2Eq: Assessing LLMs for Explainable Equation Generation for Scientific Creativity
- LLM-as-Code Agentic Programming for Agent Harness
- Agentic Framework for Deep Learning workload migration via In-Context Learning
- Medical Heuristic Learning: An LLM-Driven Framework for Interpretable and Auditable Clinical Decision Rules
- Posterior Twins: Distributional Behavioral Simulation for Enterprise Decisions
- Looking Is Not Picking: An Attention-Segment Account of Tool-Selection Failures in LLM Agents
- Post-Hoc Merging is Not Enough: Many-Shot Model Merging with Loss-Gap Balancing
- CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies
- TNODEV: Toolbox for Neural ODE Verification
- ARB4WM: An Adversarial Robustness Benchmark for World Models in Continuous Control
- AgentFairBench: Do LLM Agents Discriminate When They Act?
- Medical world models: representing medical states, modelling clinical dynamics and guiding intervention policies
- User as Code: Executable Memory for Personalized Agents
- A First-Principles Derivation of LLM Policy Optimization: From Expected Reward to GRPO and Its Structural Extensions
- Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations
- The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers
- RAID: Semantic Graph Diffusion for True Cold-Start and Cross-Lingual Forecasting
- From Affect Prediction to Affect Forecasting: Evidence for Distinct Information Sources in Longitudinal Text
- The Integrator Advantage: Controlled Agentic AI for Small and Medium-Sized Companies
- MR-GVNO: A Geometry-Aware Variational Physics-Informed Neural Operator for Mindlin-Reissner Plates on Irregular Domains
- Kairos: A Native World Model Stack for Physical AI
- Whose hotel does the AI recommend? An algorithm audit of reputation signals in LLM-assisted hotel selection
- Mind-Studio: Executable World Models with Lookahead Evaluation for Partially Observable Games
- Auditing Reward Hackability in Code RL Training Environments
- Model Graph Inductive Learning for Knowledge Graph Completion
- Steering Emotional Dynamics for Art Therapy: Controllable Narrative Script Generation through Hierarchically Guided LLM Agents
- RecourseBench: A Modular Framework for Reproducible Algorithmic Recourse Evaluation
- Hierarchical Modeling of ICD Codes in EHR Foundation Models
- CogGuard: Cognitive and Operational Profiling for Proactive Warning in Edge Intelligent Services
- Fusion is not one-size-fits-all: Cross-Modal Representation Alignment for Time-to-Event Modeling
- Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier
- Towards End-to-End Automation of AI Research
- Skill-to-LoRA: From Using Skills to Learning Behaviors for Token-Efficient LLM Agents
- When Agent Automation Becomes Profitable: Quantifying and Insuring Autonomous AI Risk through Trace-Economic Underwriting
- ROSA-RL: Uncertainty-Aware Roundabout Optimized Speed Advisory with Reinforcement Learning
- GIST-CMTF: Goal-State Inference for Causal Minimal Tool Filtering in LLM Agents
- A Formal Framework for Declarative Agentic AI in Business Process Analysis
- Feature Attribution in Directed Acyclic Graphs Using Edge Intervention
- ChatPlanner: A Large Language Model Framework for Personalized Public Transit Routing
- Reward Hacking in Language Model Agents: Revisiting AI Safety Gridworlds
- Who Drifted: the System or the Judge? Anytime-Valid Attribution in LLM Evaluation Pipelines
- Frame-Conditioned Moral Computation in LLaMA 3.1-8B-Instruct: A Mechanistic Interpretability Audit of Ethical Reasoning
- Minimal Oversight: Uncertainty-Aware Governance for Delegated AI Systems
- Toward Vibe Medicine: A Self-Evolving Multi-Agent Framework for Clinical Decision Support
- Synthetic Counteradaptation: A Principle of Human-AI Co-evolution
- Integrating Reasoning and Generalization in Text-to-SQL via Self-Enhanced Fine-Tuning
- Agentic Retrieval and Reinforcement Learned Equation Chains: A Controlled Generation Framework for Complex and Novel Physics Word Problems
- Your Agent Has a Genome: Sequence-Level Behavioral Analysis and Runtime Governance of LLM-Powered Autonomous Agents
- LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control
- Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models
- OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models
- MA-SBI: Misspecification-Aware Simulation-Based Inference via Side-Channel Guidance
- Greed Is Learned: Visible Incentives as Reward-Hacking Triggers
- When in Doubt, Plan It Out: Committed Small Language Model Deliberation for Reactive Reinforcement Learning
- A Causal Model of Theory of Mind in Conflict for Artificial Intelligence
- Consensus-based Agentic Large Language Model Framework for Harmonized Tariff Schedule Code Classification
- Latent Thought Flow: Efficient Latent Reasoning in Large Language Models
- Measuring Whether LLM Tutors Teach or Solve: A Diagnostic for Educational Impact
- SpecAlign: Efficient Specification-Grounded Alignment of Large Language Models via Synthetic Data
- AIChilles: Automatically Uncovering Hidden Weaknesses in AI-Evolved Systems
- Recurrent Reasoning on Symbolic Puzzles with Sequence Models
- QoS-Aware Token Scheduling and Private Data Valuation for Multi-Modal Agentic Networks
- Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning
- Thinking with Visual Grounding
- S1-DeepResearch: Beyond Search, Toward Real-World Long-Horizon Research Agents
- PrologMCP: A Standardized Prolog Tool Interface for LLM Agents
- NeuroSymbolic AI for Legal AI-TRISM: Trustworthy, Reliable, Interpretable, Safe Models
- APEX: Adaptive Principle EXtraction A Three-Layer Self-Evolution Framework for Production AI Agents
- Architectural Wisdom: A Framework for Governing Optimization in AI Systems
- UrbanWell: Benchmarking Multimodal Large Language Models for Spatio-Temporal Urban Wellbeing Analytics
- ToolMenuBench: Benchmarking Tool-Menu Filtering Strategies for Reliable and Efficient LLM Agents
- Trust Between AI Agents: Measuring Formation, Breakage, and Recovery, with Implications for Governing Multi-Agent Systems
- Relational Structural Causal Models
- Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion
- A Definition of Good Explanations and the Challenges Explaining LLM Outputs
- OSGuard: A Benchmark for Safety in Computer-Use Agents
- Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability
- AI Engram: In Search of Memory Traces in Artificial Intelligence
- Cognitive Debt: AI as Intellectual Leverage and the Dynamics of Systemic Fragility
- CONCORD: Asynchronous Sparse Aggregation for Device-Cloud RAG under Document Isolation
- Risk-Aware LLM Agents for Geospatial Data Retrieval: Design and Preliminary Adversarial Evaluation
- VGPT-RSI for RH-Adjacent Formal Progress: Boundary Certificates, Verified Finite Lagarias Inequalities, and Explicit Failure Localization
- Attribute Inference from Interactive Targeted Ads
- The Faithfulness Gap: Certifying Semantic Equivalence Between Natural-Language and Formal Mathematical Statements
- Rhythm of the Deep: A Computational-Linguistic Test of Duality of Patterning in Sperm Whale Codas
- Know Your Limits: On the Faithfulness of LLMs as Solvers and Autoformalizers in Legal Reasoning
- VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models
- AI Pluralism and the Worlds It Misses
- The Quality-Utility Paradox: Why High-Reward Data Impairs Small Model Mathematical Reasoning
- LiteOdyssey: A Lightweight Reasoning AI Agent for Interpretable Rare-Disease Diagnosis
- Sensor-Conditioned Representation Learning via Scene-Relevant Observation Quotients
- PAL-Bench: Evidence-Grounded Profile Reconstruction from Longitudinal Personal Albums
- AdaSTORM: Scaling LLM Reasoning on Dynamic Graphs via Adaptive Spatio-Temporal Multi-Agent Collaboration
- Phase-Aware Guidance Injection for Recurrent MAPPO in Assembly-Line Disruption Recovery
- State-Grounded Multi-Agent Synthetic Data Generation for Tool-Augmented LLMs
- TimeVista: Exploring and Exploiting Vision-Language Models as Judges for Time Series Forecasting
- RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments
- STRIDE: Strategic Trajectory Reasoning via Discriminative Estimation for Verifiable Reinforcement Learning
- Artificial Intelligence Index Report 2026
- Towards Next-Generation Healthcare: A Survey of Medical Embodied AI for Perception, Decision-Making, and Action
- RoboPIN: Grounded Embodied Reasoning via Pinned Chain-of-Thought
- Large Language Models as Optimizers: A Survey of Direct vs. Tool-Augmented Approaches and Their Performance Frontiers
- Do we have the knowledge we need? Rethinking human-AI decision-making in corporations
- Forced Deferral: Manipulating Routing Decisions in Multimodal LLM Cascades
- CODA-BENCH: Can Code Agents Handle Data-Intensive Tasks?
- Mask-Proof: An LLM-based Automated Data Curation Pipeline on Mathematical Proofs
- Towards Verifiable Agentic Data Science: Solving Irregular TSQA Via Tool-Grounded Reasoning
- Semantics-Enhanced Retrieval-Augmented Time Series Forecasting
- Symbolic Informalization: Fluent, Productive, Multilingual
- Tensor-Coord: Algebraic Decomposition of Joint Plan Tensors for Conflict-Free Multi-Agent LLM Planning
- Exploiting Search in Symbolic Numeric Planning with Patterns