Enterprises are investing billions of dollars in AI agents and infrastructure to transform business processes. However, we are seeing limited success in real-world applications, often due to the inability of agents to truly understand business data, policies and processes. While we manage the integrations well with technologies like API management, model context protocol (MCP) and others, having agents truly understand the "meaning" of data in the context of a given business is a different story.

Enterprise data is mostly siloed into disparate systems in structured and unstructured forms and needs to be analyzed with a domain-specific business lens. As an example, the term "customer" may refer to a different group of people in a sales CRM system than in a finance system, which may use the tag only for paying clients. One department might define "product" as a SKU; another may represent it as a product family; a third as a marketing bundle. Data about "product sales" thus varies in meaning without agreed-upon relationships and definitions. For agents to combine data from multiple systems, they must understand these different representations. Agents need to know what the data means in context and how to find the right data for the right process.

Moreover, schema changes in systems and data quality issues during collection can create more ambiguity, leaving agents unsure how to act when such situations are encountered. Furthermore, classification of data into categories like PII (personally identifiable information) needs to be rigorously followed to maintain compliance with standards like GDPR and CCPA. This requires the data to be labelled correctly and agents to be able to understand and respect this classification. Hence, building a cool demo using agents is very much doable, but putting agents into production working on real business data is a different story altogether.

The ontology-based source of truth

Building effective agentic solutions requires an ontology-based single source of truth. An ontology is a business definition of concepts, their hierarchy and relationships. It defines terms with respect to business domains, can help establish a single source of truth for data, capture uniform field names and apply classifications to fields. An ontology may be domain-specific (healthcare or finance) or organization-specific, based on internal structures. Defining an ontology upfront is time consuming, but it can help standardize business processes and lay a strong foundation for agentic AI.

An ontology may be realized using common queryable formats like a triplestore. More complex business rules with multi-hop relations could use a labelled property graph like Neo4j. These graphs can also help enterprises discover new relationships and answer complex questions. Ontologies like FIBO (Financial Industry Business Ontology) and UMLS (Unified Medical Language System) are available in the public domain and can be a very good starting point. However, these usually need to be customized to capture specific details of an enterprise.

Getting started with ontology

Once implemented, an ontology can be the driving force for enterprise agents. We can now prompt AI to follow the ontology and use it to discover data and relationships. If needed, we can have an agentic layer serve key details of the ontology itself and discover data. Business rules and policies can be implemented in this ontology for agents to adhere to.
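As a rough, self-contained illustration of what such an ontology could look like in code, the sketch below models concepts, system-specific field mappings, a PII classification and one policy in plain Python. The concept names, field mappings and policy text are hypothetical assumptions for illustration; a production system would hold these in a triplestore or a property graph like Neo4j rather than in memory.

```python
from dataclasses import dataclass

@dataclass
class Concept:
    name: str
    definition: str
    parent: str | None = None          # simple hierarchy: Prospect -> Customer

@dataclass
class FieldMapping:
    system: str                        # source system, e.g. a sales CRM
    column: str                        # physical field name in that system
    concept: str                       # business concept the field actually means
    classification: str | None = None  # e.g. "PII"

# Illustrative ontology fragment: concepts, field mappings and one policy
ONTOLOGY = {
    "concepts": {
        "Customer": Concept("Customer", "A paying client of record"),
        "Prospect": Concept("Prospect", "A potential customer in the sales pipeline",
                            parent="Customer"),
    },
    "fields": [
        FieldMapping("SalesCRM", "customer_id", "Prospect"),
        FieldMapping("Finance", "client_no", "Customer", classification="PII"),
    ],
    "policies": [
        "Fields classified as PII must be masked before leaving the finance domain.",
    ],
}

def resolve_field(system: str, column: str) -> FieldMapping | None:
    """Return what a physical field means in business terms, if the ontology knows it."""
    return next((m for m in ONTOLOGY["fields"]
                 if m.system == system and m.column == column), None)

if __name__ == "__main__":
    mapping = resolve_field("Finance", "client_no")
    print(mapping.concept, mapping.classification)   # -> Customer PII
```

An agent (or the prompt driving it) can call a lookup like this before acting, so that "customer_id" from the CRM and "client_no" from finance resolve to the right business concepts and the attached classifications and policies are respected.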
This is an excellent way to ground your agents and establish guardrails based on real business context. Agents designed in this manner and tuned to follow an ontology can stick to guardrails and avoid hallucinations that can be caused by the large language models (LLMs) powering them. For example, a business policy may define that unless all documents associated with a loan have their verified flags set to "true," the loan status should be kept in a "pending" state. Agents can then work from this policy, determine what documents are needed and query the knowledge base. Here's an example implementation. (Original figure by the author.)

As illustrated, structured and unstructured data are processed by a document intelligence (DocIntel) agent, which populates a Neo4j database based on an ontology of the business domain. A data discovery agent finds and queries the right data in Neo4j and passes it to other agents handling business process execution. The inter-agent communication happens over a popular protocol like A2A (agent-to-agent). A new protocol called AG-UI (Agent User Interaction) can help build more generic UI screens to capture the workings and responses of these agents.

With this method, we can avoid hallucinations by enforcing that agents follow ontology-driven paths and maintain data classifications and relationships. Moreover, we can scale easily by adding new assets, relationships and policies that agents can automatically comply with, and control hallucinations by defining rules for the whole system rather than for individual entities. For example, if an agent hallucinates an individual "customer," the connected data for that "customer" will not be verifiable during data discovery, so we can easily detect the anomaly and eliminate it. This helps the agentic system scale with the business and manage its dynamic nature.

Indeed, a reference architecture like this adds some overhead in data discovery and graph databases. But for a large enterprise, it adds the right guardrails and gives agents direction to orchestrate complex business processes.

Dattaraj Rao is innovation and R&D architect at Persistent Systems.
As AI systems enter production, reliability and governance can’t depend on wishful thinking. Here’s how observability turns large language models (LLMs) into auditable, trustworthy enterprise systems.

Why observability secures the future of enterprise AI

The enterprise race to deploy LLM systems mirrors the early days of cloud adoption. Executives love the promise; compliance demands accountability; engineers just want a paved road. Yet, beneath the excitement, most leaders admit they can’t trace how AI decisions are made, whether they helped the business, or if they broke any rule.

Take one Fortune 100 bank that deployed an LLM to classify loan applications. Benchmark accuracy looked stellar. Yet, 6 months later, auditors found that 18% of critical cases were misrouted, without a single alert or trace. The root cause wasn’t bias or bad data. It was invisible. No observability, no accountability.

If you can’t observe it, you can’t trust it. And unobserved AI will fail in silence. Visibility isn’t a luxury; it’s the foundation of trust. Without it, AI becomes ungovernable.

Start with outcomes, not models

Most corporate AI projects begin with tech leaders choosing a model and, later, defining success metrics.
That’s backward. Flip the order:

1. Define the outcome first. What’s the measurable business goal?
   - Deflect 15% of billing calls
   - Reduce document review time by 60%
   - Cut case-handling time by two minutes
2. Design telemetry around that outcome, not around “accuracy” or “BLEU score.”
3. Select prompts, retrieval methods and models that demonstrably move those KPIs.

At one global insurer, for instance, reframing success as “minutes saved per claim” instead of “model precision” turned an isolated pilot into a company-wide roadmap.

A 3-layer telemetry model for LLM observability

Just like microservices rely on logs, metrics and traces, AI systems need a structured observability stack:

a) Prompts and context: What went in
- Log every prompt template, variable and retrieved document.
- Record model ID, version, latency and token counts (your leading cost indicators).
- Maintain an auditable redaction log showing what data was masked, when and by which rule.

b) Policies and controls: The guardrails
- Capture safety-filter outcomes (toxicity, PII), citation presence and rule triggers.
- Store policy reasons and risk tier for each deployment.
- Link outputs back to the governing model card for transparency.

c) Outcomes and feedback: Did it work?
- Gather human ratings and edit distances from accepted answers.
- Track downstream business events: case closed, document approved, issue resolved.
- Measure the KPI deltas: call time, backlog, reopen rate.

All three layers connect through a common trace ID, enabling any decision to be replayed, audited or improved. (Diagram © SaiKrishna Koorapati, 2025.)

Apply SRE discipline: SLOs and error budgets for AI

Site reliability engineering (SRE) transformed software operations; now it’s AI’s turn. Define three “golden signals” for every critical workflow:

Signal     | Target SLO                                | When breached
Factuality | ≥ 95% verified against source of record   | Fallback to verified template
Safety     | ≥ 99.9% pass toxicity/PII filters         | Quarantine and human review
Usefulness | ≥ 80% accepted on first pass              | Retrain or rollback prompt/model

If hallucinations or refusals exceed budget, the system auto-routes to safer prompts or human review, just like rerouting traffic during a service outage. This isn’t bureaucracy; it’s reliability applied to reasoning.

Build the thin observability layer in two agile sprints

You don’t need a six-month roadmap, just focus and two short sprints.

Sprint 1 (weeks 1-3): Foundations
- Version-controlled prompt registry
- Redaction middleware tied to policy
- Request/response logging with trace IDs
- Basic evaluations (PII checks, citation presence)
- Simple human-in-the-loop (HITL) UI

Sprint 2 (weeks 4-6): Guardrails and KPIs
- Offline test sets (100–300 real examples)
- Policy gates for factuality and safety
- Lightweight dashboard tracking SLOs and cost
- Automated token and latency tracker

In 6 weeks, you’ll have the thin layer that answers 90% of governance and product questions.

Make evaluations continuous (and boring)

Evaluations shouldn’t be heroic one-offs; they should be routine.
- Curate test sets from real cases; refresh 10–20% monthly.
- Define clear acceptance criteria shared by product and risk teams.
- Run the suite on every prompt/model/policy change and weekly for drift checks.
- Publish one unified scorecard each week covering factuality, safety, usefulness and cost.

When evals are part of CI/CD, they stop being compliance theater and become operational pulse checks.
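As a minimal sketch of how the three telemetry layers and the golden-signal gates might hang together, here is a self-contained Python example. The record fields, thresholds and routing labels are illustrative assumptions loosely derived from the targets above, not a reference implementation.

```python
import uuid
from dataclasses import dataclass, asdict, field

# Golden-signal targets from the article; how they are encoded here is an assumption.
SLOS = {"factuality": 0.95, "safety": 0.999, "usefulness": 0.80}

@dataclass
class TraceRecord:
    trace_id: str
    prompt_template: str                     # layer (a): what went in
    model_id: str
    latency_ms: float
    tokens: int
    redactions: list = field(default_factory=list)    # auditable redaction log
    safety_pass: bool = True                 # layer (b): policies and controls
    citations_present: bool = False
    accepted_first_pass: bool | None = None  # layer (c): outcomes and feedback

def log_call(prompt_template: str, model_id: str, tokens: int,
             latency_ms: float, safety_pass: bool, citations: bool) -> TraceRecord:
    """Create one replayable record keyed by a common trace ID."""
    rec = TraceRecord(
        trace_id=str(uuid.uuid4()),
        prompt_template=prompt_template,
        model_id=model_id,
        latency_ms=latency_ms,
        tokens=tokens,
        safety_pass=safety_pass,
        citations_present=citations,
    )
    print(asdict(rec))   # in practice: ship to your logging / metrics backend
    return rec

def route(record: TraceRecord, factuality_score: float) -> str:
    """Apply the golden-signal gates: breaches fall back or go to human review."""
    if not record.safety_pass:
        return "quarantine_and_human_review"
    if factuality_score < SLOS["factuality"]:
        return "fallback_to_verified_template"
    return "deliver"

if __name__ == "__main__":
    rec = log_call("billing_faq_v3", "example-model-1", tokens=812,
                   latency_ms=430.0, safety_pass=True, citations=True)
    print(route(rec, factuality_score=0.97))   # -> deliver
```

Every downstream event (case closed, reviewer edit, KPI delta) would then be attached to the same trace_id, which is what makes a decision replayable and auditable.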
Apply human oversight where it matters

Full automation is neither realistic nor responsible. High-risk or ambiguous cases should escalate to human review.
- Route low-confidence or policy-flagged responses to experts.
- Capture every edit and reason as training data and audit evidence.
- Feed reviewer feedback back into prompts and policies for continuous improvement.

At one health-tech firm, this approach cut false positives by 22% and produced a retrainable, compliance-ready dataset in weeks.

Cost control through design, not hope

LLM costs grow non-linearly. Budgets won’t save you; architecture will.
- Structure prompts so deterministic sections run before generative ones.
- Compress and rerank context instead of dumping entire documents.
- Cache frequent queries and memoize tool outputs with TTL.
- Track latency, throughput and token use per feature.

When observability covers tokens and latency, cost becomes a controlled variable, not a surprise.

The 90-day playbook

Within 3 months of adopting observable AI principles, enterprises should see:
- 1–2 production AI assists with HITL for edge cases
- An automated evaluation suite for pre-deploy and nightly runs
- A weekly scorecard shared across SRE, product and risk
- Audit-ready traces linking prompts, policies and outcomes

At a Fortune 100 client, this structure reduced incident time by 40% and aligned product and compliance roadmaps.

Scaling trust through observability

Observable AI is how you turn AI from experiment to infrastructure. With clear telemetry, SLOs and human feedback loops:
- Executives gain evidence-backed confidence.
- Compliance teams get replayable audit chains.
- Engineers iterate faster and ship safely.
- Customers experience reliable, explainable AI.

Observability isn’t an add-on layer; it’s the foundation for trust at scale.

SaiKrishna Koorapati is a software engineering leader.
The most dangerous KPIs aren’t the broken ones; they’re the ones trusted long after they’ve lost their meaning.
The post Metric Deception: When Your Best KPIs Hide Your Worst Failures appeared first on Towards Data Science.
Learn how to increase your LLM usage to achieve greater productivity
The post How to Scale Your LLM Usage appeared first on Towards Data Science.
User data exposed, AI sacrifice test, Sora limits, robot bubble warning, and more...
This article is divided into two parts:
- Fine-tuning a BERT Model for GLUE Tasks
- Fine-tuning a BERT Model for SQuAD Tasks

GLUE is a benchmark for evaluating natural language understanding (NLU) tasks.
Agent memory remains a problem that enterprises want to fix, as agents forget some instructions or conversations the longer they run. Anthropic believes it has solved this issue for its Claude Agent SDK, developing a two-fold solution that allows an agent to work across different context windows.

“The core challenge of long-running agents is that they must work in discrete sessions, and each new session begins with no memory of what came before,” Anthropic wrote in a blog post. “Because context windows are limited, and because most complex projects cannot be completed within a single window, agents need a way to bridge the gap between coding sessions.”

Anthropic engineers proposed a two-fold approach for its Agent SDK: an initializer agent to set up the environment, and a coding agent to make incremental progress in each session and leave artifacts for the next.

The agent memory problem

Since agents are built on foundation models, they remain constrained by the limited, although continually growing, context windows. For long-running agents, this could create a larger problem, leading the agent to forget instructions and behave abnormally while performing a task. Enhancing agent memory becomes essential for consistent, business-safe performance.

Several methods emerged over the past year, all attempting to bridge the gap between context windows and agent memory. LangChain’s LangMem SDK, Memobase and OpenAI’s Swarm are examples of companies offering memory solutions. Research on agentic memory has also exploded recently, with proposed frameworks like Memp and the Nested Learning Paradigm from Google offering new alternatives to enhance memory. Many of the current memory frameworks are open source and can ideally adapt to different large language models (LLMs) powering agents. Anthropic’s approach improves its Claude Agent SDK.

How it works

Anthropic identified that even though the Claude Agent SDK had context management capabilities and it “should be possible for an agent to continue to do useful work for an arbitrarily long time,” it was not sufficient. The company said in its blog post that a model like Opus 4.5 running the Claude Agent SDK can “fall short of building a production-quality web app if it’s only given a high-level prompt, such as 'build a clone of claude.ai.'”

The failures manifested in two patterns, Anthropic said. First, the agent tried to do too much, causing the model to run out of context in the middle. The agent then has to guess what happened and cannot pass clear instructions to the next agent. The second failure occurs later on, after some features have already been built. The agent sees progress has been made and just declares the job done.

Anthropic researchers broke down the solution: setting up an initial environment to lay the foundation for features, and prompting each agent to make incremental progress towards a goal while still leaving a clean slate at the end. This is where the two-part solution of Anthropic's agent comes in. The initializer agent sets up the environment, logging what agents have done and which files have been added. The coding agent will then ask the model to make incremental progress and leave structured updates. “Inspiration for these practices came from knowing what effective software engineers do every day,” Anthropic said.

The researchers said they added testing tools to the coding agent, improving its ability to identify and fix bugs that weren’t obvious from the code alone.
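Anthropic's post describes this pattern at a high level rather than as a public API. As a purely illustrative sketch of the session-bridging idea, the Python below uses a hypothetical PROGRESS.json artifact and a stubbed call_agent function standing in for whatever agent framework or model you actually use; none of it is Claude Agent SDK code.

```python
import json
from pathlib import Path

PROGRESS_FILE = Path("PROGRESS.json")   # hypothetical artifact left between sessions

def call_agent(prompt: str) -> str:
    """Stand-in for a call to whichever coding agent or LLM you use."""
    return "Implemented and tested the next small feature (placeholder summary)."

def initializer_session(goal: str) -> None:
    """One-time setup: record the goal, the environment and an empty task log."""
    state = {"goal": goal, "features_done": [], "notes": []}
    PROGRESS_FILE.write_text(json.dumps(state, indent=2))

def coding_session() -> None:
    """Each later session reads the artifact, makes one increment, and writes it back."""
    state = json.loads(PROGRESS_FILE.read_text())
    prompt = (
        f"Goal: {state['goal']}\n"
        f"Already done: {state['features_done']}\n"
        "Pick ONE small next step, implement and test it, "
        "then summarize what you changed."
    )
    summary = call_agent(prompt)
    state["features_done"].append(summary)   # structured update for the next session
    PROGRESS_FILE.write_text(json.dumps(state, indent=2))

if __name__ == "__main__":
    initializer_session("Build a minimal web app for claim intake")
    coding_session()
    print(PROGRESS_FILE.read_text())
```

The point is not the specific file format but the discipline: every session starts by reading an explicit, structured record of prior work and ends by updating it, so no session depends on the model "remembering" a previous context window.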
Future research

Anthropic noted that its approach is “one possible set of solutions in a long-running agent harness.” However, this is just the beginning stage of what could become a wider research area for many in the AI space. The company said its experiments to boost long-term memory for agents haven’t shown whether a single general-purpose coding agent or a multi-agent structure works best across contexts. Its demo also focused on full-stack web app development, so other experiments should focus on generalizing the results across different tasks.

“It’s likely that some or all of these lessons can be applied to the types of long-running agentic tasks required in, for example, scientific research or financial modeling,” Anthropic said.
Hello, dear readers. Happy belated Thanksgiving and Black Friday!

This year has felt like living inside a permanent DevDay. Every week, some lab drops a new model, a new agent framework, or a new “this changes everything” demo. It’s overwhelming. But it’s also the first year I’ve felt like AI is finally diversifying — not just one or two frontier models in the cloud, but a whole ecosystem: open and closed, giant and tiny, Western and Chinese, cloud and local.

So for this Thanksgiving edition, here’s what I’m genuinely thankful for in AI in 2025 — the releases that feel like they’ll matter in 12–24 months, not just during this week’s hype cycle.

1. OpenAI kept shipping strong: GPT-5, GPT-5.1, Atlas, Sora 2 and open weights

As the company that undeniably birthed the "generative AI" era with its viral hit product ChatGPT in late 2022, OpenAI arguably had among the hardest tasks of any AI company in 2025: continue its growth trajectory even as well-funded competitors like Google, with its Gemini models, and startups like Anthropic fielded their own highly competitive offerings. Thankfully, OpenAI rose to the challenge and then some.

Its headline act was GPT-5, unveiled in August as the next frontier reasoning model, followed in November by GPT-5.1 with new Instant and Thinking variants that dynamically adjust how much “thinking time” they spend per task. In practice, GPT-5’s launch was bumpy — VentureBeat documented early math and coding failures and a cooler-than-expected community reaction in “OpenAI’s GPT-5 rollout is not going smoothly” — but the company quickly course-corrected based on user feedback and, as a daily user of this model, I'm personally pleased and impressed with it. At the same time, enterprises actually using the models are reporting solid gains. ZenDesk Global, for example, says GPT-5-powered agents now resolve more than half of customer tickets, with some customers seeing 80–90% resolution rates. That’s the quiet story: these models may not always impress the chattering classes on X, but they’re starting to move real KPIs.

On the tooling side, OpenAI finally gave developers a serious AI engineer with GPT-5.1-Codex-Max, a new coding model that can run long, agentic workflows and is already the default in OpenAI’s Codex environment. VentureBeat covered it in detail in “OpenAI debuts GPT-5.1-Codex-Max coding model and it already completed a 24-hour task internally.” Then there’s ChatGPT Atlas, a full browser with ChatGPT baked into the chrome itself — sidebar summaries, on-page analysis, and search tightly integrated into regular browsing. It’s the clearest sign yet that “assistant” and “browser” are on a collision course.

On the media side, Sora 2 turned the original Sora video demo into a full video-and-audio model with better physics, synchronized sound and dialogue, and more control over style and shot structure, plus a dedicated Sora app with a full-fledged social networking component, allowing any user to create their own TV network in their pocket.

Finally — and maybe most symbolically — OpenAI released gpt-oss-120B and gpt-oss-20B, open-weight MoE reasoning models under an Apache 2.0–style license. Whatever you think of their quality (and early open-source users have been loud about their complaints), this is the first time since GPT-2 that OpenAI has put serious weights into the public commons.
2. China’s open-source wave goes mainstream

If 2023–24 was about Llama and Mistral, 2025 belongs to China’s open-weight ecosystem. A study from MIT and Hugging Face found that China now slightly leads the U.S. in global open-model downloads, largely thanks to DeepSeek and Alibaba’s Qwen family. Highlights:

- DeepSeek-R1 dropped in January as an open-source reasoning model rivaling OpenAI’s o1, with MIT-licensed weights and a family of distilled smaller models. VentureBeat has followed the story from its release to its cybersecurity impact to performance-tuned R1 variants.
- Kimi K2 Thinking from Moonshot, a “thinking” open-source model that reasons step-by-step with tools, very much in the o1/R1 mold, and is positioned as the best open reasoning model in the world so far.
- Z.ai shipped GLM-4.5 and GLM-4.5-Air as “agentic” models, open-sourcing base and hybrid reasoning variants on GitHub.
- Baidu’s ERNIE 4.5 family arrived as a fully open-sourced, multimodal MoE suite under Apache 2.0, including a 0.3B dense model and visual “Thinking” variants focused on charts, STEM, and tool use.
- Alibaba’s Qwen3 line — including Qwen3-Coder, large reasoning models, and the Qwen3-VL series released over the summer and fall of 2025 — continues to set a high bar for open weights in coding, translation, and multimodal reasoning, leading me to declare this past summer "Qwen's summer."

VentureBeat has been tracking these shifts, including Chinese math and reasoning models like Light-R1-32B and Weibo’s tiny VibeThinker-1.5B, which beat DeepSeek baselines on shoestring training budgets. If you care about open ecosystems or on-premise options, this is the year China’s open-weight scene stopped being a curiosity and became a serious alternative.

3. Small and local models grow up

Another thing I’m thankful for: we’re finally getting good small models, not just toys.

Liquid AI spent 2025 pushing its Liquid Foundation Models (LFM2) and LFM2-VL vision-language variants, designed from day one for low-latency, device-aware deployments — edge boxes, robots, and constrained servers, not just giant clusters. The newer LFM2-VL-3B targets embedded robotics and industrial autonomy, with demos planned at ROSCon.

On the big-tech side, Google’s Gemma 3 line made a strong case that “tiny” can still be capable. Gemma 3 spans from 270M parameters up through 27B, all with open weights and multimodal support in the larger variants. The standout is Gemma 3 270M, a compact model purpose-built for fine-tuning and structured text tasks — think custom formatters, routers, and watchdogs — covered both in Google’s developer blog and in community discussions in local-LLM circles. These models may never trend on X, but they’re exactly what you need for privacy-sensitive workloads, offline workflows, thin-client devices, and “agent swarms” where you don’t want every tool call hitting a giant frontier LLM.

4. Meta + Midjourney: aesthetics as a service

One of the stranger twists this year: Meta partnered with Midjourney instead of simply trying to beat it. In August, Meta announced a deal to license Midjourney’s “aesthetic technology” — its image and video generation stack — and integrate it into Meta’s future models and products, from Facebook and Instagram feeds to Meta AI features. VentureBeat covered the partnership in “Meta is partnering with Midjourney and will license its technology for future models and products,” raising the obvious question: does this slow or reshape Midjourney’s own API roadmap?
Still awaiting an answer there, but stated plans for an API release have unfortunately yet to materialize, suggesting that it has. For creators and brands, though, the immediate implication is simple: Midjourney-grade visuals start to show up in mainstream social tools instead of being locked away in a Discord bot. That could normalize higher-quality AI art for a much wider audience — and force rivals like OpenAI, Google, and Black Forest Labs to keep raising the bar.

5. Google’s Gemini 3 and Nano Banana Pro

Google tried to answer GPT-5 with Gemini 3, billed as its most capable model yet, with better reasoning, coding, and multimodal understanding, plus a new Deep Think mode for slow, hard problems. VentureBeat’s coverage, “Google unveils Gemini 3 claiming the lead in math, science, multimodal and agentic AI,” framed it as a direct shot at frontier benchmarks and agentic workflows. But the surprise hit is Nano Banana Pro (Gemini 3 Pro Image), Google’s new flagship image generator. It specializes in infographics, diagrams, multi-subject scenes, and multilingual text that actually renders legibly across 2K and 4K resolutions. In the world of enterprise AI — where charts, product schematics, and “explain this system visually” images matter more than fantasy dragons — that’s a big deal.

6. Wild cards I’m keeping an eye on

A few more releases I’m thankful for, even if they don’t fit neatly into one bucket:

- Black Forest Labs’ Flux.2 image models, which launched just earlier this week with ambitions to challenge both Nano Banana Pro and Midjourney on quality and control. VentureBeat dug into the details in “Black Forest Labs launches Flux.2 AI image models to challenge Nano Banana Pro and Midjourney."
- Anthropic’s Claude Opus 4.5, a new flagship that aims for cheaper, more capable coding and long-horizon task execution, covered in “Anthropic’s Claude Opus 4.5 is here: Cheaper AI, infinite chats, and coding skills that beat humans."
- A steady drumbeat of open math/reasoning models — from Light-R1 to VibeThinker and others — that show you don’t need $100M training runs to move the needle.

Last thought (for now)

If 2024 was the year of “one big model in the cloud,” 2025 is the year the map exploded: multiple frontiers at the top, China taking the lead in open models, small and efficient systems maturing fast, and creative ecosystems like Midjourney getting pulled into big-tech stacks. I’m thankful not just for any single model, but for the fact that we now have options — closed and open, local and hosted, reasoning-first and media-first. For journalists, builders, and enterprises, that diversity is the real story of 2025.

Happy holidays and best to you and your loved ones!
An honest view from a 10-year AI Engineer
The post Data Science in 2026: Is It Still Worth It? appeared first on Towards Data Science.
Set up, build, and test agentic apps with Claude Code, powered by your locally installed Claude CLI and Claude Code subscription.
The simple shift in training that unlocks foresight, faster inference, and better reasoning.
The post Your Next LLM Might Not Predict Tokens One-by-One appeared first on Towards Data Science.
These five configurations can turn your Docker setup from a slow chore into a finely tuned machine.
How product, growth and engineering teams can converge on a single signal for better incident management
The post The Product Health Score: How I Reduced Critical Incidents by 35% with Unified Monitoring and n8n Automation appeared first on Towards Data Science.
AI and machine learning have pushed the demand for high-performance hardware, making the GPU-versus-TPU discussion more relevant than ever. GPUs, originally built for graphics, have grown into flexible processors for data analysis, scientific computing, and modern AI workloads. TPUs, built by Google as specialized ASICs for deep learning, focus on high-throughput tensor operations and have […]
The post GPU vs TPU: What’s the Difference? appeared first on Analytics Vidhya.
It turns out all the guardrails in the world won’t protect a chatbot from meter and rhyme.
Princeton researchers found that the brain excels at learning because it reuses modular “cognitive blocks” across many tasks. Monkeys switching between visual categorization challenges revealed that the prefrontal cortex assembles these blocks like Legos to create new behaviors. This flexibility explains why humans learn quickly while AI models often forget old skills. The insights may help build better AI and new clinical treatments for impaired cognitive adaptability.
Researchers at the University of Science and Technology of China have developed a new reinforcement learning (RL) framework that helps train large language models (LLMs) for complex agentic tasks beyond well-defined problems such as math and coding. Their framework, Agent-R1, is compatible with popular RL algorithms and shows considerable improvement on reasoning tasks that require multiple retrieval stages and multi-turn interactions with tools.

The framework is built on a redefinition of the RL paradigm that takes into account the dynamic nature of agentic applications, which require interacting with evolving environments and imperfect information. This framing is much closer to real-world applications and can have important uses for agentic tasks in enterprise settings.

Rethinking reinforcement learning for agents

RL has become a cornerstone of training LLMs for well-defined reasoning tasks. In areas like mathematics and coding, the model receives a clear signal: The answer is either right or wrong. This makes it relatively straightforward to reward or penalize its behavior. But this approach struggles with agentic tasks that require models to work in interactive environments, develop dynamic memories across conversations, perform multi-step reasoning and respond to unpredictable feedback. Training agents with RL for these scenarios presents unique challenges, especially in multi-turn interactions where designing effective rewards is complex and the trained agent often fails to generalize to the messy, unpredictable nature of real-world environments.

To address these challenges, the University of Science and Technology of China researchers revisited the fundamental framework of RL, known as the Markov Decision Process (MDP). An MDP models decision-making using four key components: a state space (the set of possible states an agent can be in); an action space (what the agent can do); a state transition probability (the state to which an action will likely lead); and a reward function (whether the outcome is good or bad). The paper proposes extending this framework to better suit LLM agents.

In the new formulation, the state space is expanded to include not just the current state (the current sequence of tokens generated by the model) but the entire history of interactions and environmental feedback. Actions are still fundamentally about generating text, but specific sequences of text can now trigger external tools, like an API call. State transitions become unpredictable, or "stochastic," because the outcome depends not just on the tokens the model predicts but also on the environment's response, which depends on external factors. Finally, the reward system becomes more granular, incorporating intermediate "process rewards" for successfully completing steps along the way, rather than just a single reward at the very end. This provides more frequent and precise guidance to the agent during training.

This last bit is especially important and addresses the "sparse reward" problem that most RL frameworks face. When the agent receives a single reward signal based on the final outcome, it does not learn from the right and wrong intermediate steps it has taken along the way. Process rewards solve this problem by providing feedback signals on these intermediate steps, making the learning process much more efficient.
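The paper frames this in MDP terms rather than code. As a rough, hypothetical illustration of how process rewards differ from a single outcome reward, the small Python sketch below scores a multi-turn trajectory; the function name, reward weights and trajectory format are assumptions for illustration, not the Agent-R1 API.

```python
def score_trajectory(steps, final_answer_correct, process_weight=0.2):
    """
    Hypothetical reward shaping for a multi-turn agent trajectory.

    steps: list of dicts, one per intermediate action, e.g.
           {"action": "search", "succeeded": True}
    Outcome-only RL would return just the final term; process rewards add
    a small signal for each useful intermediate step.
    """
    process_reward = sum(process_weight for s in steps if s["succeeded"])
    outcome_reward = 1.0 if final_answer_correct else 0.0
    return outcome_reward + process_reward

# Example: the agent retrieved two useful documents but got the final answer wrong.
trajectory = [
    {"action": "search", "succeeded": True},
    {"action": "search", "succeeded": True},
    {"action": "answer", "succeeded": False},
]
print(score_trajectory(trajectory, final_answer_correct=False))  # 0.4, not 0.0
```

With an outcome-only reward this trajectory would score zero and the useful retrieval steps would go unrewarded; the intermediate signal is what keeps learning from stalling on long, multi-step tasks.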
“These extensions are crucial for enabling reinforcement learning algorithms to train sophisticated Agents capable of complex, multi-step reasoning and interaction within dynamic environments,” the researchers write in their paper.

The Agent-R1 framework

Based on the extended MDP definition, the researchers developed Agent-R1, a flexible and user-friendly training platform for RL-based LLM agents. It extends traditional single-turn RL frameworks to handle the multi-turn, interactive nature of agentic tasks, allowing for seamless integration with diverse environments. The most significant difference lies in the "rollout phase," where the agent generates responses. In single-turn RL, the model generates a response once. In multi-turn RL, the process involves a series of complex back-and-forth interactions.

Agent-R1 achieves this flexible multi-turn rollout with two core modules: Tool and ToolEnv. The Tool module acts as an executor for specific actions, such as calling an API or accessing a database. When invoked, a Tool performs its action and returns the direct, raw outcome. In contrast, the ToolEnv module is the orchestrator and interpreter. It takes the output from the Tool and determines how that outcome affects the agent's state and the overall task progress. ToolEnv manages state transitions, calculates reward signals based on tool outcomes and packages the new state information for the agent. In short, when an action is complete, the Tool reports "what happened," while ToolEnv dictates "what this outcome means for the agent and the task." (A rough sketch of this split appears at the end of this article.)

Agent-R1 in action

The researchers tested Agent-R1 on the challenging task of multi-hop question answering, which requires complex reasoning, information retrieval across multiple documents and multi-step decision-making. They trained Qwen2.5-3B-Instruct on QA datasets and evaluated its performance on the HotpotQA and 2WikiMultihopQA datasets. They also tested it on the Musique dataset, which was outside the domain of tasks the agent was trained on. They compared various RL algorithms trained with Agent-R1 against two baselines: Naive RAG, a single-pass retrieval method where an LLM answers based on one set of retrieved documents, and Base Tool Call, which uses the model's native function-calling ability without specialized RL training.

The results demonstrated that all RL-trained agents substantially outperformed the baselines. GRPO, an RL algorithm used in advanced reasoning models like DeepSeek-R1, delivered the best overall performance. “These results robustly validate Agent-R1’s efficacy in training powerful LLM agents via end-to-end RL, showing consistent, substantial gains over baselines across diverse datasets and RL algorithms,” the researchers write.

These findings can be significant for the enterprise, where there is a strong push to apply RL and reasoning beyond well-defined domains. A framework designed to handle messy, multi-turn interactions with users and dynamic environments can pave the way for new agents capable of solving complex problems in real-world settings. “We hope Agent-R1 provides a foundation for future work on scalable and unified RL training for agentic LLMs,” the researchers conclude.
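To make the Tool/ToolEnv distinction more concrete, here is a minimal, hypothetical Python sketch of the pattern. The class names mirror the paper's terminology, but the interfaces, the reward logic and the toy search tool are illustrative assumptions, not Agent-R1's actual code.

```python
class Tool:
    """Executor: performs one concrete action and returns the raw outcome."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def run(self, **kwargs):
        return self.fn(**kwargs)          # e.g. hit a search API, query a database


class ToolEnv:
    """Orchestrator: interprets tool outcomes, updates state, assigns process rewards."""
    def __init__(self, tools):
        self.tools = {t.name: t for t in tools}
        self.history = []                 # the expanded state: full interaction history

    def step(self, tool_name, **kwargs):
        outcome = self.tools[tool_name].run(**kwargs)   # what happened
        useful = bool(outcome)                          # toy notion of usefulness
        reward = 0.2 if useful else 0.0                 # intermediate process reward
        self.history.append({"tool": tool_name, "outcome": outcome, "reward": reward})
        return outcome, reward, self.history            # what it means for the task


# Toy usage: a fake retrieval tool wired into the environment.
def fake_search(query):
    return [f"doc about {query}"]

env = ToolEnv([Tool("search", fake_search)])
outcome, reward, state = env.step("search", query="multi-hop QA")
print(reward, outcome)   # 0.2 ['doc about multi-hop QA']
```

In Agent-R1 the policy being trained is the LLM deciding which tool-invoking text to emit next; this sketch only shows the environment side of that loop.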
Model showdown, extinction risk, Claude memory trick, FLUX.2, AI glasses, and more...
Don't miss our most-read stories of the past month
The post TDS Newsletter: November Must-Reads on GraphRAG, ML Projects, LLM-Powered Time-Series Analysis, and More appeared first on Towards Data Science.
Neural and symbolic models compress the world in fundamentally different ways, and Sparse Autoencoders (SAEs) offer a bridge to connect them.
The post Neural Networks Are Blurry, Symbolic Systems Are Fragmented. Sparse Autoencoders Help Us Combine Them. appeared first on Towards Data Science.
Have we all been tricked into believing in an impossible, extremely expensive future?
The post Water Cooler Small Talk, Ep. 10: So, What About the AI Bubble? appeared first on Towards Data Science.
The point is this: those who learn to collaborate with AI rather than fear it will hold the keys to tomorrow’s job market.
From insurance premiums to courtrooms: the impact of noise
The post Everyday Decisions are Noisier Than You Think — Here’s How AI Can Help Fix That appeared first on Towards Data Science.
These are the essentials that help me code faster, analyze data smarter, and automate more of my workflow.
A beginner-friendly Python tutorial using conditionals and the random module
The post Implementing the Rock Paper Scissors Game in Python appeared first on Towards Data Science.
Cloud deal secured, Transformer replacement, Gemini 3 story, AlphaFold doc, and more...
In this post, we explore how Myriad Genetics partnered with the AWS Generative AI Innovation Center to transform their healthcare document processing pipeline using Amazon Bedrock and Amazon Nova foundation models, achieving 98% classification accuracy while reducing costs by 77% and processing time by 80%. We detail the technical implementation using AWS's open-source GenAI Intelligent Document Processing Accelerator, the optimization strategies for document classification and key information extraction, and the measurable business impact on Myriad's prior authorization workflows.
In this post, CBRE and AWS demonstrate how they transformed property management by building a unified search and digital assistant using Amazon Bedrock, enabling professionals to access millions of documents and multiple databases through natural language queries. The solution combines Amazon Nova Pro for SQL generation and Claude Haiku for document interactions, achieving a 67% reduction in processing time while maintaining enterprise-grade security across more than eight million documents.
In this post, we introduce Managed Tiered KV Cache and Intelligent Routing for Amazon SageMaker HyperPod, new capabilities that can reduce time to first token by up to 40% and lower compute costs by up to 25% for long context prompts and multi-turn conversations. These features automatically manage distributed KV caching infrastructure and intelligent request routing, making it easier to deploy production-scale LLM inference workloads with enterprise-grade performance while significantly reducing operational overhead.
Learn more about AlphaFold, Google’s AI system that accurately predicts protein structures.