Research, summarized
Plain-language summaries of the papers, evaluations, and surveys that matter for agent builders. Each post calls out what changed, why it matters, and what to do next.
- Stanford University May 7, 2026
Dynamic In-Context Example Selection for Reliable Agentic Reasoning
A theoretically grounded method for agents to dynamically select optimal in-context examples during reasoning, boosting reliability across diverse tasks.
- Google DeepMind May 7, 2026
ToolMemory: Long-Term Memory Management for Agentic Workflows
Framework enabling agents to maintain tool-specific memory across extended conversations, pruning irrelevant entries while preserving critical knowledge.
- Multiple May 7, 2026
Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills
Comprehensive survey examining how agentic AI systems adapt through post-training, memory architectures, and skill acquisition for long-horizon task execution.
- Multiple May 7, 2026
How Agentic AI Changes the Economics of Enterprise Software
Research on how agentic coding systems reshape make-or-buy decisions by dramatically reducing development timelines and CAPEX for enterprise applications.
- Multiple May 7, 2026
Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours
Framework for automating adversarial testing of agentic systems using AI-driven red-teaming agents that generate workflows from a library of 45+ attacks, 450+ transforms, and 130+ scorers.
- Torq May 7, 2026
Agentic Coding for SecOps: Torq Agentic Builder
Production-grade agentic AI system for security operations that turns natural-language intent into executable agents through contextual analysis, planning, and automated testing.
- Coveo May 7, 2026
10 Agentic Commerce Research Papers Shaping the Future of Enterprise Product Discovery
Meta-analysis of 2025 agentic commerce research, including empirical findings on agent purchasing behavior, position bias, and the modular retrieval-first architectures that enable reliable shopping agents.
- Academic consortium May 7, 2026
The Adoption and Usage of AI Agents
Comprehensive empirical study of agentic AI system adoption patterns, market sizing, and real-world deployment challenges across enterprise and consumer segments.
- Independent May 7, 2026
AuditRepairBench: A Paired-Execution Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair
New benchmark exposes ranking instability in agent repair leaderboards due to evaluator reconfiguration, enabling more reliable evaluation of AI agent debugging capabilities.
- Stanford AI Lab May 7, 2026
AgentEval: A Comprehensive Benchmark for Evaluating Long-Horizon Agentic Workflows with Real-World Failure Modes
New benchmark reveals critical gaps in agent tool-use reliability and proposes verifier architectures to boost success rates by 28% on multi-step tasks.
- Independent May 7, 2026
Agentic AI for Robot Control: Flexible but still Fragile
Research on LLM-based agentic control systems for robots reveals architecture patterns for reasoning and execution, but exposes brittleness under real-world constraints.
- UC Berkeley May 7, 2026
REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research?
New benchmark evaluates whether agentic AI can reliably assess reproducibility in social science papers, revealing key strengths and failure modes.
- Independent May 7, 2026
ACON: Optimizing Context Compression for Long-horizon LLM Agents
A new method for compressing context in long-horizon LLM agents to reduce token overhead while maintaining planning performance.
- Stanford HCI May 7, 2026
CriticFlow: Multi-Agent Verifier Orchestration for Robust Long-Horizon Agent Planning
New multi-agent verification framework dramatically improves planning reliability in long-horizon tasks through dynamic critic handoff and failure prediction.
- Independent May 7, 2026
Anemoi Agent: A2A Communication for Scalable Multi-Agent Coordination
Agent-to-agent communication server replaces context-stuffing with direct coordination, achieving 52.73% accuracy on GAIA with smaller models.
- Stanford HCI May 7, 2026
CriticLM: A Verifier for Reliable Agentic Planning
New benchmark and LLM-based critic architecture that catches 73% more planning errors in long-horizon agent tasks than prior verification methods.
- Northeastern University Apr 18, 2026
Reflexion, three years on: what self-critique still buys you
A meta-analysis of 41 papers building on Reflexion-style self-critique loops finds modest, durable gains in coding and tool-use, and diminishing returns in open-ended reasoning.
- Stanford NLP Apr 14, 2026
Long-horizon memory: survey of seven architectures, ranked by recall and cost
Compares episodic, semantic, hybrid, and graph-based memory across realistic 30-day agent simulations. Hybrid stores win on recall; graph stores win on cost stability.
- DeepMind Apr 8, 2026
Six failure modes in tool-using agents, and the patterns that fix them
An empirical taxonomy of agent tool-use failures across 4,000 traces from production deployments. Schema drift and silent partial failures dominate.
- MIT CSAIL Apr 4, 2026
Decoupled planner-critic agents outperform monolithic planners on long tasks
Splitting planning and critique into specialized models with structured exchange yields a 14-point lift on multi-day research tasks.
- UC Berkeley Mar 30, 2026
The case for replay-based agent evaluation
Static benchmarks miss the failure modes that matter in production. This paper argues for replay sets: captured user sessions scored against held-out outcomes.