Recent work spans legal case retrieval, web and computer-using agents, code reasoning, and financial multimodal reasoning. A self-evolving framework for rule-driven query rewriting in legal case retrieval improves BM25 without any parameter training. A framework for reusable web skills learns transferable interaction patterns and reduces the average LLM-action count on successful trajectories by 8-10%. A study of the internal lifecycle of code reasoning in LLMs finds that models first "brew" the answer and then diverge into one of four resolution outcomes. A framework for financial multimodal reasoning accumulates financially grounded reasoning experience from prior trajectories and distills successful strategies and failure-derived cautionary rules into a persistent memory bank.
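To make the query-rewriting idea above concrete, here is a minimal sketch of rule-based query expansion placed in front of a plain Okapi BM25 scorer. It is an illustration only, not the method from the paper: the rule table `REWRITE_RULES`, the helper names, the toy documents, and the parameter values `k1=1.5` and `b=0.75` are all assumptions, and the self-evolving part (how rules are learned or updated) is not modelled.

```python
import math
from collections import Counter

# Hypothetical rewrite rules: colloquial query term -> formal expansion terms.
REWRITE_RULES = {
    "fired": ["dismissal", "termination"],
    "landlord": ["lessor"],
}


def tokenize(text):
    """Lowercase whitespace tokenizer (deliberately simplistic)."""
    return text.lower().split()


def rewrite_query(query, rules):
    """Append the expansion terms of every query token that matches a rule."""
    tokens = tokenize(query)
    expanded = list(tokens)
    for tok in tokens:
        expanded.extend(rules.get(tok, []))
    return expanded


def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score each document against the query tokens with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "wrongful dismissal claim against employer",
    "lessor failed to return deposit",
    "traffic violation appeal",
]
query = "fired without notice"

raw = bm25_scores(tokenize(query), docs)
rewritten = bm25_scores(rewrite_query(query, REWRITE_RULES), docs)
```

In this toy setup the raw query shares no terms with any document, so every BM25 score is zero, while the rewritten query matches the first document through the added term "dismissal". The scorer itself is untouched and has no trained parameters, which is the sense in which rewriting can help BM25 without any parameter training.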
A new benchmark for long-horizon webpage generation contains 490 real-world long webpages for structural fidelity evaluation and 507 goal-oriented interaction tasks over 129 webpages for functional evaluation. A benchmark for CEO-level strategic resource reallocation finds that all evaluated LLMs achieve high structural validity but diverge sharply on strategic calibration. A study of whether a language model can discover zero shows that models can improve substantially after training on tens or hundreds of examples of zero. Additionally, a framework for distributed general-purpose agent networks enables open peer-to-peer networks in which heterogeneous agents can discover one another, establish trust, and execute open-ended tasks.
A benchmark for personalized workflow prediction by agents contains 100 tasks across five domains, with 1,246 reference workflow steps grounded in more than 3,900 sources. A framework combining proactive preflection with self-evolving memory for zero-shot object goal navigation enables continuous test-time improvement. A study of the "autoregressive curse" in long-horizon logical reasoning shows that small epistemic perturbations introduced early in generation can propagate irreversibly along the Markov decision process flow, triggering cascading failures that drive the reasoning trajectory toward collapse. To address this, a framework for dynamic epistemic entropy orchestrated erasable reinforcement learning eliminates reliance on external signals and lets the model precisely excise localized logical defects while reusing historical key-value cache streams.
Key Takeaways
- A self-evolving framework for rule-driven query rewriting enhances BM25 without any parameter training.
- A framework for reusable web skills reduces the average LLM-action count on successful trajectories by 8-10%.
- In code reasoning, LLMs first "brew" the answer and then diverge into one of four resolution outcomes.
- A framework for financial multimodal reasoning accumulates financially grounded reasoning experience from prior trajectories.
- A benchmark for evaluating long-horizon webpage generation contains 490 real-world long webpages for structural fidelity evaluation.
- A benchmark for strategic resource reallocation evaluates LLMs on CEO-level decisions; all models achieve high structural validity but diverge sharply on strategic calibration.
- LLMs can improve substantially after training on tens or hundreds of examples of zero.
- A framework for distributed general-purpose agent networks enables open peer-to-peer networks in which heterogeneous agents can discover one another, establish trust, and execute open-ended tasks.
- A benchmark for personalized workflow prediction by agents contains 100 tasks across five domains.
- A framework for proactive preflection and self-evolving memory for zero-shot object goal navigation enables continuous test-time improvement.
Sources
- When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval
- PreAct: Computer-Using Agents that Get Faster on Repeated Tasks
- Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns
- From Brewing to Resolution: Tracing the Internal Lifecycle of Code Reasoning in LLMs
- FinAcumen: Financial Multimodal Reasoning via Self-Evolving Experience Memory Harness
- DecoSearch: Complexity-Aware Routing and Plan-Level Repair for Text-to-SQL
- LongWebBench: Evaluating Structural and Functional Webpage Generation in Long-Horizon Settings
- EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent
- FllumaOne: A Code-Native Multimodal CAD Dataset with Executable Programs and Kernel-Validated Feature Histories
- How Inference Compute Shapes Frontier LLM Evaluation
- Small Initialization Matters for Large Language Models
- MoCo-AIS: A Contrastive Learning Framework for Similarity Computation of Vessel Trajectories
- LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI
- Agentic AI-based Framework for Mitigating Premature Diagnostic Handoff and Silent Hallucination in Healthcare Applications
- PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience
- ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents
- A Unified Framework for Context-Aware and Relation-Aware Graph Retrieval-Augmented Generation
- Skill-Constrained Model Predictive Control for Resilient Manufacturing Supply Chains
- Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding
- Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning
- Surrogate Assisted Pedestrian Protection Design via a Foundation Model Orchestrated Workflow
- DeepInsight: A Unified Evaluation Infrastructure Across the Physical AI Stack
- Quantifying Consistency in LLM Logical Reasoning via Structural Uncertainty
- SpeechDx: A Multi-Task Benchmark for Clinical Speech AI
- Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation
- Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation
- Dissecting model behavior through agent trajectories
- A Machine-Learned Comorbidity Index
- IsabeLLM: Automated Theorem Proving Applied to Formally Verifying Consensus
- LLM-as-Judge in Education: A Curriculum-Grounded Marking Pipeline
- A homotopy-type-theoretic generalization of neurosymbolic inference
- WallZero: Mastering the Game of WallGo with Strategic Analysis
- DiagFlowBench: Evaluating How Language Models Handle Off-Procedure Inputs in Grounded Diagnostic Dialogue
- Learn to Quantify Social Interaction with Constraints for Pedestrian Walking
- MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning
- Structural Preservation and the Logical Expressiveness of Graph Neural Networks
- Memory as a Wasting Asset: Pricing Flash Endurance for Embodied Agents, and the Limits of Doing So
- First Proof Second Batch
- Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models
- The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data
- Learning Cardiac Electrophysiology Digital Twins Through Agentic Discovery of Hybrid Structure
- DRFLOW: A Deep Research Benchmark for Personalized Workflow Prediction
- EvolveNav: Proactive Preflection and Self-Evolving Memory for Zero-Shot Object Goal Navigation
- Nothing from Something: Can a Language Model Discover 0?
- Distributed General-Purpose Agent Networks: Architecture, Key Mechanisms, and Prototypes
- MemTrace: Probing What Final Accuracy Misses in Long-Term Memory
- SkillChain-Gym: A Benchmark for Reskilling-Aware Production-Inventory Control under Disruptions
- Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers
- WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning
- Knowledge Reutilization in Meta-Reinforcement Learning
- LLM Consumer Behavior Theory: Foundations of a Novel Research Field
- STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training
- Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs
- Using Cognitive Models to Improve Language Model Simulation of Human Persuasion Games
- Brick-DICL: Dynamic In-Context Learning for Automated Brick Schema Classification
- Incumbent Advantage: Brand Bias and Cognitive Manipulation Dynamics in LLM Recommendation Systems
- Beyond Parallel Sampling: Diverse Query Initialization for Agentic Search
- StepGuard: Guarding Web Navigation via Single-Step Calibration
- SEAGym: An Evaluation Environment for Self-Evolving LLM Agents
- MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents through Behavior-Grounded Implicit Decision Factors
- FlowRAG: Synergizing Explicit Reasoning via Frequency-Aware Multi-Granularity Graph Flow