The recent controversy surrounding Google's Gemma model has once again highlighted the dangers of relying on developer test models and the fleeting nature of model availability. Google pulled its Gemma 3 model from AI Studio following a statement from Senator Marsha Blackburn (R-Tenn.) that the model willfully hallucinated falsehoods about her. Blackburn said the model fabricated news stories about her that go beyond "harmless hallucination" and function as a defamatory act.

In response, Google posted on X on October 31 that it would remove Gemma from AI Studio "to prevent confusion." Gemma remains available via API. It had previously also been accessible through AI Studio, which, the company said, is "a developer tool (in fact, to use it you need to attest you're a developer). We've now seen reports of non-developers trying to use Gemma in AI Studio and ask it factual questions. We never intended this to be a consumer tool or model, or to be used this way. To prevent this confusion, access to Gemma is no longer available on AI Studio."

To be clear, Google has the right to remove its model from its platform, especially if people have found hallucinations and falsehoods that could proliferate. But the episode also underscores the danger of relying mainly on experimental models, and why enterprise developers need to save projects before AI models are sunsetted or removed. Technology companies like Google continue to face political controversies, which often influence their deployments.

VentureBeat reached out to Google for additional information and was pointed to its October 31 posts. We also contacted the office of Sen. Blackburn, who reiterated the stance outlined in her statement that AI companies should "shut [models] down until you can control it."

Developer experiments

The Gemma family of models, which includes a 270M-parameter version, is best suited for small, quick apps and tasks that can run on devices such as smartphones and laptops. Google said the Gemma models were "built specifically for the developer and research community. They are not meant for factual assistance or for consumers to use."

Nevertheless, non-developers could still access Gemma because it was on the AI Studio platform, a more beginner-friendly space for developers to play around with Google AI models than Vertex AI. So even if Google never intended Gemma and AI Studio to be accessible to, say, Congressional staffers, these situations can still occur. It also shows that even as models continue to improve, they still produce inaccurate and potentially harmful information. Enterprises must continually weigh the benefits of using models like Gemma against their potential inaccuracies.

Project continuity

Another concern is the control that AI companies have over their models. The adage "you don't own anything on the internet" remains true: if you don't own a physical or local copy of software, it's easy to lose access when the company that owns it decides to take it away. Google did not clarify with VentureBeat whether current projects on AI Studio powered by Gemma are saved. Similarly, OpenAI users were disappointed when the company announced it would remove popular older models from ChatGPT. Even after walking back that decision and reinstating GPT-4o in ChatGPT, OpenAI CEO Sam Altman continues to field questions about keeping and supporting the model. AI companies can, and should, remove their models if they create harmful outputs.
AI models, no matter how mature, remain works in progress and are constantly evolving and improving. But, since they are experimental in nature, models can easily become tools that technology companies and lawmakers can wield as leverage. Enterprise developers must ensure that their work can be saved before models are removed from platforms.
The FSNet system, developed at MIT, could help power grid operators rapidly find feasible solutions for optimizing the flow of electricity.
Siri opens to competitors, 30 AI life hacks, deepfake test, private phone AI, and more...
For more than three decades, modern CPUs have relied on speculative execution to keep pipelines full. When it emerged in the 1990s, speculation was hailed as a breakthrough — just as pipelining and superscalar execution had been in earlier decades. Each marked a generational leap in microarchitecture. By predicting the outcomes of branches and memory loads, processors could avoid stalls and keep execution units busy. But this architectural shift came at a cost: Wasted energy when predictions failed, increased complexity and vulnerabilities such as Spectre and Meltdown.

These challenges set the stage for an alternative: A deterministic, time-based execution model. As David Patterson observed in 1980, "A RISC potentially gains in speed merely from a simpler design." Patterson's principle of simplicity underpins this new alternative to speculation.

For the first time since speculative execution became the dominant paradigm, a fundamentally new approach has been invented. This breakthrough is embodied in a series of six recently issued U.S. patents granted by the U.S. Patent and Trademark Office (USPTO). Together, they introduce a radically different instruction execution model. Departing sharply from conventional speculative techniques, this deterministic framework replaces guesswork with a time-based, latency-tolerant mechanism. Each instruction is assigned a precise execution slot within the pipeline, resulting in a rigorously ordered and predictable flow of execution. This reimagined model redefines how modern processors can handle latency and concurrency with greater efficiency and reliability.

A simple time counter is used to deterministically set the exact time at which instructions will execute in the future. Each instruction is dispatched to an execution queue with a preset execution time based on resolving its data dependencies and the availability of resources — read buses, execution units and the write bus to the register file. Each instruction remains queued until its scheduled execution slot arrives. This deterministic approach may represent the first major architectural challenge to speculation since it became the standard.

The architecture extends naturally into matrix computation, with a RISC-V instruction set proposal under community review. Configurable general matrix multiply (GEMM) units, ranging from 8×8 to 64×64, can operate using either register-based or direct-memory access (DMA)-fed operands. This flexibility supports a wide range of AI and high-performance computing (HPC) workloads. Early analysis suggests scalability that rivals Google's TPU cores, while maintaining significantly lower cost and power requirements. Rather than a direct comparison with general-purpose CPUs, the more accurate reference point is vector and matrix engines: Traditional CPUs still depend on speculation and branch prediction, whereas this design applies deterministic scheduling directly to GEMM and vector units. This efficiency stems not only from the configurable GEMM blocks but also from the time-based execution model, where instructions are decoded and assigned precise execution slots based on operand readiness and resource availability. Execution is never a random or heuristic choice among many candidates, but a predictable, pre-planned flow that keeps compute resources continuously busy.
Planned matrix benchmarks will provide direct comparisons with TPU GEMM implementations, highlighting the ability to deliver datacenter-class performance without datacenter-class overhead.

Critics may argue that static scheduling introduces latency into instruction execution. In reality, the latency already exists — waiting on data dependencies or memory fetches. Conventional CPUs attempt to hide it with speculation, but when predictions fail, the resulting pipeline flush introduces delay and wastes power. The time-counter approach acknowledges this latency and fills it deterministically with useful work, avoiding rollbacks. As the first patent notes, instructions retain out-of-order efficiency: "A microprocessor with a time counter for statically dispatching instructions enables execution based on predicted timing rather than speculative issue and recovery," with preset execution times but without the overhead of register renaming or speculative comparators.

Why speculation stalled

Speculative execution boosts performance by predicting outcomes before they're known — executing instructions ahead of time and discarding them if the guess was wrong. While this approach can accelerate workloads, it also introduces unpredictability and power inefficiency. Mispredictions inject "no-ops" into the pipeline, stalling progress and wasting energy on work that never completes. These issues are magnified in modern AI and machine learning (ML) workloads, where vector and matrix operations dominate and memory access patterns are irregular. Long fetches, non-cacheable loads and misaligned vectors frequently trigger pipeline flushes in speculative architectures.

The result is performance cliffs that vary wildly across datasets and problem sizes, making consistent tuning nearly impossible. Worse still, speculative side effects have exposed vulnerabilities that led to high-profile security exploits. As data intensity grows and memory systems strain, speculation struggles to keep pace — undermining its original promise of seamless acceleration.

Time-based execution and deterministic scheduling

At the core of this invention is a vector coprocessor with a time counter for statically dispatching instructions. Rather than relying on speculation, instructions are issued only when data dependencies and latency windows are fully known. This eliminates guesswork and costly pipeline flushes while preserving the throughput advantages of out-of-order execution. Architectures built on this patented framework feature deep pipelines — typically spanning 12 stages — combined with wide front ends supporting up to 8-way decode and large reorder buffers exceeding 250 entries.

As illustrated in Figure 1, the architecture mirrors a conventional RISC-V processor at the top level, with instruction fetch and decode stages feeding into execution units. The innovation emerges in the integration of a time counter and register scoreboard, strategically positioned between fetch/decode and the vector execution units. Instead of relying on speculative comparators or register renaming, the design uses a register scoreboard and time-resource matrix (TRM) to deterministically schedule instructions based on operand readiness and resource availability.

Figure 1: High-level block diagram of the deterministic processor.
A time counter and scoreboard sit between fetch/decode and vector execution units, ensuring instructions issue only when operands are ready.

A typical program running on the deterministic processor begins much like it does on any conventional RISC-V system: Instructions are fetched from memory and decoded to determine whether they are scalar, vector, matrix or custom extensions. The difference emerges at the point of dispatch. Instead of issuing instructions speculatively, the processor employs a cycle-accurate time counter, working with a register scoreboard, to decide exactly when each instruction can be executed. This mechanism provides a deterministic execution contract, ensuring instructions complete at predictable cycles and reducing wasted issue slots.

In conjunction with the register scoreboard, the time-resource matrix associates instructions with execution cycles, allowing the processor to plan dispatch deterministically across available resources. The scoreboard tracks operand readiness and hazard information, enabling scheduling without register renaming or speculative comparators. By monitoring dependencies such as read-after-write (RAW) and write-after-read, it ensures hazards are resolved without costly pipeline flushes. As noted in the patent, "in a multi-threaded microprocessor, the time counter and scoreboard permit rescheduling around cache misses, branch flushes, and RAW hazards without speculative rollback."

Once operands are ready, the instruction is dispatched to the appropriate execution unit. Scalar operations use standard arithmetic logic units (ALUs), while vector and matrix instructions execute in wide execution units connected to a large vector register file. Because instructions launch only when conditions are safe, these units stay highly utilized without the wasted work or recovery cycles caused by mispredicted speculation. The key enabler of this approach is a simple time counter that orchestrates execution according to data readiness and resource availability, ensuring instructions advance only when operands are ready and resources are available. The same principle applies to memory operations: The interface predicts latency windows for loads and stores, allowing the processor to fill those slots with independent instructions and keep execution flowing.

Programming model differences

From the programmer's perspective, the flow remains familiar — RISC-V code compiles and executes in the usual way. The crucial difference lies in the execution contract: Rather than relying on dynamic speculation to hide latency, the processor guarantees predictable dispatch and completion times. This eliminates the performance cliffs and wasted energy of speculation while still providing the throughput benefits of out-of-order execution. This perspective underscores how deterministic execution preserves the familiar RISC-V programming model while eliminating the unpredictability and wasted effort of speculation. As John Hennessy put it: "It's stupid to do work in run time that you can do in compile time" — a remark reflecting the foundations of RISC and its forward-looking design philosophy.

The RISC-V ISA provides opcodes for custom and extension instructions, including floating-point, DSP and vector operations. The result is a processor that executes instructions deterministically while retaining the benefits of out-of-order performance. By eliminating speculation, the design simplifies hardware, reduces power consumption and avoids pipeline flushes.
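To make the scheduling idea concrete, here is a minimal, purely illustrative Python sketch of time-counter dispatch. It is not the patented design, and every instruction name, latency and resource count in it is hypothetical: each instruction receives a preset issue cycle computed from a register scoreboard of operand-ready times and a simplified pool of execution units, so the schedule is fixed at dispatch time with no prediction or rollback.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    srcs: list    # source registers
    dst: str      # destination register
    latency: int  # cycles until the result is written back

def schedule(program, num_units=2):
    ready_at = {}                    # scoreboard: cycle at which each register becomes ready
    unit_free_at = [0] * num_units   # simplified time-resource view of execution units
    plan = []
    for ins in program:
        operands_ready = max((ready_at.get(r, 0) for r in ins.srcs), default=0)
        unit = min(range(num_units), key=lambda u: unit_free_at[u])
        issue = max(operands_ready, unit_free_at[unit])   # preset execution slot
        unit_free_at[unit] = issue + 1                    # unit occupied for one issue cycle
        ready_at[ins.dst] = issue + ins.latency           # when the result can be consumed
        plan.append((ins.name, issue, unit))
    return plan   # deterministic: known in advance, no misprediction recovery

prog = [Instr("load x1", [], "x1", 4),
        Instr("add  x2", ["x1"], "x2", 1),
        Instr("mul  x3", ["x1"], "x3", 3),
        Instr("add  x4", ["x2", "x3"], "x4", 1)]
for name, cycle, unit in schedule(prog):
    print(f"{name}: issue at cycle {cycle} on unit {unit}")
```

In this toy model, the load's latency is simply filled with independent work rather than hidden by speculation, which is the behavior the patents describe at the microarchitectural level.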
These efficiency gains grow even more significant in vector and matrix operations, where wide execution units require consistent utilization to reach peak performance. Vector extensions require wide register files and large execution units, which in speculative processors necessitate expensive register renaming to recover from branch mispredictions. In the deterministic design, vector instructions are executed only after commit, eliminating the need for renaming.

Each instruction is scheduled against a cycle-accurate time counter: "The time counter provides a deterministic execution contract, ensuring instructions complete at predictable cycles and reducing wasted issue slots." The vector register scoreboard resolves data dependencies before issuing instructions to the execution pipeline. Instructions are dispatched in a known order at the correct cycle, making execution both predictable and efficient.

Vector execution units (integer and floating point) connect directly to a large vector register file. Because instructions are never flushed, there is no renaming overhead. The scoreboard ensures safe access, while the time counter aligns execution with memory readiness. A dedicated memory block predicts the return cycle of loads. Instead of stalling or speculating, the processor schedules independent instructions into latency slots, keeping execution units busy. "A vector coprocessor with a time counter for statically dispatching instructions ensures high utilization of wide execution units while avoiding misprediction penalties."

In today's CPUs, compilers and programmers write code assuming the hardware will dynamically reorder instructions and speculatively execute branches. The hardware handles hazards with register renaming, branch prediction and recovery mechanisms. Programmers benefit from performance, but at the cost of unpredictability and power consumption.

In the deterministic time-based architecture, instructions are dispatched only when the time counter indicates their operands will be ready. This means the compiler (or runtime system) doesn't need to insert guard code for misprediction recovery. Instead, compiler scheduling becomes simpler, as instructions are guaranteed to issue at the correct cycle without rollbacks. For programmers, the ISA remains RISC-V compatible, but deterministic extensions reduce reliance on speculative safety nets.

Application in AI and ML

In AI/ML kernels, vector loads and matrix operations often dominate runtime. On a speculative CPU, misaligned or non-cacheable loads can trigger stalls or flushes, starving wide vector and matrix units and wasting energy on discarded work. A deterministic design instead issues these operations with cycle-accurate timing, ensuring high utilization and steady throughput. For programmers, this means fewer performance cliffs and more predictable scaling across problem sizes. And because the patents extend the RISC-V ISA rather than replace it, deterministic processors remain fully compatible with the RVA23 profile and mainstream toolchains such as GCC, LLVM, FreeRTOS and Zephyr.

In practice, the deterministic model doesn't change how code is written — it remains RISC-V assembly or high-level languages compiled to RISC-V instructions. What changes is the execution contract: Rather than relying on speculative guesswork, programmers can expect predictable latency behavior and higher efficiency without tuning code around microarchitectural quirks.

The industry is at an inflection point.
AI/ML workloads are dominated by vector and matrix math, where GPUs and TPUs excel — but only by consuming massive power and adding architectural complexity. In contrast, general-purpose CPUs, still tied to speculative execution models, lag behind.

A deterministic processor delivers predictable performance across a wide range of workloads, ensuring consistent behavior regardless of task complexity. Eliminating speculative execution enhances energy efficiency and avoids unnecessary computational overhead. Furthermore, deterministic design scales naturally to vector and matrix operations, making it especially well-suited for AI workloads that rely on high-throughput parallelism. This new deterministic approach may represent the next such leap: The first major architectural challenge to speculation since speculation itself became the standard.

Will deterministic CPUs replace speculation in mainstream computing? That remains to be seen. But with issued patents, proven novelty and growing pressure from AI workloads, the timing is right for a paradigm shift. Taken together, these advances signal deterministic execution as the next architectural leap — redefining performance and efficiency just as speculation once did. Speculation marked the last revolution in CPU design; determinism may well represent the next.

Thang Tran is the founder and CTO of Simplex Micro.

Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.
Robot meltdown, walk 30% faster, Google's first AI ad, AI survival skills, and more...
More screen time among children and teens is linked to higher risks of heart and metabolic problems, particularly when combined with insufficient sleep. Danish researchers discovered a measurable rise in cardiometabolic risk scores and a metabolic “fingerprint” in frequent screen users. Experts say better sleep and balanced daily routines can help offset these effects and safeguard lifelong health.
Recently, there has been a lot of hullabaloo about the idea that large reasoning models (LRMs) are unable to think. This is mostly due to a research article published by Apple, "The Illusion of Thinking." Apple argues that LRMs must not be able to think; instead, they just perform pattern-matching. The evidence it provides is that LRMs with chain-of-thought (CoT) reasoning are unable to carry on the calculation using a predefined algorithm as the problem grows.

This is a fundamentally flawed argument. If you ask a human who already knows the algorithm for solving the Tower-of-Hanoi problem to solve an instance with twenty discs, for example, he or she would almost certainly fail to do so. By that logic, we must conclude that humans cannot think either. However, this argument only points to the idea that there is no evidence that LRMs cannot think. That alone certainly does not mean that LRMs can think — just that we cannot be sure they don't.

In this article, I will make a bolder claim: LRMs almost certainly can think. I say "almost" because there is always a chance that further research would surprise us. But I think my argument is pretty conclusive.

What is thinking?

Before we try to understand whether LRMs can think, we need to define what we mean by thinking. But first, we have to make sure that humans can think per that definition. We will only consider thinking in relation to problem solving, which is the matter of contention.

1. Problem representation (frontal and parietal lobes)

When you think about a problem, the process engages your prefrontal cortex. This region is responsible for working memory, attention and executive functions — capacities that let you hold the problem in mind, break it into sub-components and set goals. Your parietal cortex helps encode symbolic structure for math or puzzle problems.

2. Mental simulation (working memory and inner speech)

This has two components: One is an auditory loop that lets you talk to yourself — very similar to CoT generation. The other is visual imagery, which allows you to manipulate objects visually. Geometry was so important for navigating the world that we developed specialized capabilities for it. The auditory part is linked to Broca's area and the auditory cortex, both reused from language centers. The visual cortex and parietal areas primarily control the visual component.

3. Pattern matching and retrieval (hippocampus and temporal lobes)

These actions depend on past experiences and stored knowledge from long-term memory: The hippocampus helps retrieve related memories and facts, while the temporal lobe brings in semantic knowledge — meanings, rules, categories. This is similar to how neural networks depend on their training to process the task.

4. Monitoring and evaluation (anterior cingulate cortex)

Our anterior cingulate cortex (ACC) monitors for errors, conflicts or impasses — it's where you notice contradictions or dead ends. This process is essentially based on pattern matching from prior experience.

5. Insight or reframing (default mode network and right hemisphere)

When you're stuck, your brain might shift into default mode — a more relaxed, internally directed network. This is when you step back, let go of the current thread and sometimes "suddenly" see a new angle (the classic "aha!" moment). This is similar to how DeepSeek-R1 was trained for CoT reasoning without having CoT examples in its training data.
Remember, the brain continuously learns as it processes data and solves problems. In contrast, LRMs aren't allowed to change based on real-world feedback during prediction or generation. But with DeepSeek-R1's CoT training, learning did happen as the model attempted to solve problems — essentially updating while reasoning.

Similarities between CoT reasoning and biological thinking

An LRM does not have all of the faculties mentioned above. For example, an LRM is very unlikely to do much visual reasoning in its circuit, although a little may happen. But it certainly does not generate intermediate images during CoT generation.

Most humans can make spatial models in their heads to solve problems. Does this mean we can conclude that LRMs cannot think? I would disagree. Some humans also find it difficult to form spatial models of the concepts they think about. This condition is called aphantasia. People with this condition can think just fine. In fact, they go about life as if they don't lack any ability at all. Many of them are actually great at symbolic reasoning and quite good at math — often enough to compensate for their lack of visual reasoning. We might expect our neural network models also to be able to circumvent this limitation.

If we take a more abstract view of the human thought process described earlier, we can see mainly the following things involved:

1. Pattern-matching is used for recalling learned experience, problem representation, and monitoring and evaluating chains of thought.

2. Working memory stores all the intermediate steps.

3. Backtracking search concludes that the CoT is not going anywhere and backtracks to some reasonable point.

Pattern-matching in an LRM comes from its training. The whole point of training is to learn both knowledge of the world and the patterns to process that knowledge effectively. Since an LRM is a layered network, the entire working memory needs to fit within one layer. The weights store the knowledge of the world and the patterns to follow, while processing happens between layers using the learned patterns stored as model parameters. Note that even in CoT, the entire text — including the input, the CoT and the part of the output already generated — must fit into each layer. Working memory is just one layer (in the case of the attention mechanism, this includes the KV cache).

CoT is, in fact, very similar to what we do when we are talking to ourselves (which is almost always). We nearly always verbalize our thoughts, and so does a CoT reasoner.

There is also good evidence that a CoT reasoner can take backtracking steps when a certain line of reasoning seems futile. In fact, this is what the Apple researchers saw when they asked LRMs to solve bigger instances of simple puzzles. The LRMs correctly recognized that trying to solve the puzzles directly would not fit in their working memory, so they tried to figure out better shortcuts, just as a human would. This is even more evidence that LRMs are thinkers, not just blind followers of predefined patterns.

But why would a next-token predictor learn to think?

Neural networks of sufficient size can learn any computation, including thinking. But a next-word-prediction system can also learn to think. Let me elaborate. A common argument is that LRMs cannot think because, at the end of the day, they are just predicting the next token; they are only a 'glorified auto-complete.' This view is fundamentally incorrect: what is wrong is not the claim that an LRM is an 'auto-complete,' but the assumption that an 'auto-complete' does not have to think.
In fact, next-word prediction is far from a limited representation of thought. On the contrary, it is the most general form of knowledge representation that anyone can hope for. Let me explain.

Whenever we want to represent some knowledge, we need a language or a system of symbolism to do so. Different formal languages exist that are very precise in terms of what they can express. However, such languages are fundamentally limited in the kinds of knowledge they can represent. For example, first-order predicate logic cannot represent properties of all predicates that satisfy a certain property, because it doesn't allow predicates over predicates. Of course, there are higher-order predicate calculi that can represent predicates on predicates to arbitrary depths. But even they cannot express ideas that lack precision or are abstract in nature.

Natural language, however, is complete in expressive power — you can describe any concept at any level of detail or abstraction. In fact, you can even describe concepts about natural language using natural language itself. That makes it a strong candidate for knowledge representation. The challenge, of course, is that this expressive richness makes it harder to process the information encoded in natural language. But we don't necessarily need to understand how to do that manually — we can simply program the machine using data, through a process called training.

A next-token prediction machine essentially computes a probability distribution over the next token, given a context of preceding tokens. Any machine that aims to compute this probability accurately must, in some form, represent world knowledge. A simple example: Consider the incomplete sentence, "The highest mountain peak in the world is Mount ..." — to predict the next word as Everest, the model must have this knowledge stored somewhere. If the task requires the model to compute the answer or solve a puzzle, the next-token predictor needs to output CoT tokens to carry the logic forward. This implies that, even though it's predicting one token at a time, the model must internally represent at least the next few tokens in its working memory — enough to ensure it stays on the logical path.

If you think about it, humans also predict the next token — whether during speech or when thinking using the inner voice. A perfect auto-complete system that always outputs the right tokens and produces correct answers would have to be omniscient. Of course, we'll never reach that point — because not every answer is computable. However, a parameterized model that can represent knowledge by tuning its parameters, and that can learn through data and reinforcement, can certainly learn to think.

Does it produce the effects of thinking?

At the end of the day, the ultimate test of thought is a system's ability to solve problems that require thinking. If a system can answer previously unseen questions that demand some level of reasoning, it must have learned to think — or at least to reason — its way to the answer. We know that proprietary LRMs perform very well on certain reasoning benchmarks. However, since there's a possibility that some of these models were fine-tuned on benchmark test sets through a backdoor, we'll focus only on open-source models for fairness and transparency.

We evaluate them using the following benchmarks: As one can see, in some benchmarks, LRMs are able to solve a significant number of logic-based questions.
While it's true that they still lag behind human performance in many cases, it's important to note that the human baseline often comes from individuals trained specifically on those benchmarks. In fact, in certain cases, LRMs outperform the average untrained human.

Conclusion

Based on the benchmark results, the striking similarity between CoT reasoning and biological reasoning, and the theoretical understanding that any system with sufficient representational capacity, enough training data and adequate computational power can perform any computable task, LRMs meet those criteria to a considerable extent. It is therefore reasonable to conclude that LRMs almost certainly possess the ability to think.

Debasish Ray Chawdhuri is a senior principal engineer at Talentica Software and a Ph.D. candidate in cryptography at IIT Bombay.

Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.
In this post, I’ll introduce a reinforcement learning (RL) algorithm based on an “alternative” paradigm: divide and conquer. Unlike traditional methods, this algorithm is not based on temporal difference (TD) learning (which has scalability challenges), and scales well to long-horizon tasks.
We can do Reinforcement Learning (RL) based on divide and conquer, instead of temporal difference (TD) learning.
Problem setting: off-policy RL
Our problem setting is off-policy RL. Let’s briefly review what this means.
There are two classes of algorithms in RL: on-policy RL and off-policy RL. On-policy RL means we can only use fresh data collected by the current policy. In other words, we have to throw away old data each time we update the policy. Algorithms like PPO and GRPO (and policy gradient methods in general) belong to this category.
Off-policy RL means we don’t have this restriction: we can use any kind of data, including old experience, human demonstrations, Internet data, and so on. So off-policy RL is more general and flexible than on-policy RL (and of course harder!). Q-learning is the most well-known off-policy RL algorithm. In domains where data collection is expensive (e.g., robotics, dialogue systems, healthcare, etc.), we often have no choice but to use off-policy RL. That’s why it’s such an important problem.
As of 2025, I think we have reasonably good recipes for scaling up on-policy RL (e.g., PPO, GRPO, and their variants). However, we still haven’t found a “scalable” off-policy RL algorithm that scales well to complex, long-horizon tasks. Let me briefly explain why.
Two paradigms in value learning: Temporal Difference (TD) and Monte Carlo (MC)
In off-policy RL, we typically train a value function using temporal difference (TD) learning (i.e., Q-learning), with the following Bellman update rule:
\[\begin{aligned} Q(s, a) \gets r + \gamma \max_{a'} Q(s', a'), \end{aligned}\]
The problem is this: the error in the next value $Q(s’, a’)$ propagates to the current value $Q(s, a)$ through bootstrapping, and these errors accumulate over the entire horizon. This is basically what makes TD learning struggle to scale to long-horizon tasks (see this post if you’re interested in more details).
To mitigate this problem, people have mixed TD learning with Monte Carlo (MC) returns. For example, we can do $n$-step TD learning (TD-$n$):
\[\begin{aligned} Q(s_t, a_t) \gets \sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n \max_{a'} Q(s_{t+n}, a'). \end{aligned}\]
Here, we use the actual Monte Carlo return (from the dataset) for the first $n$ steps, and then use the bootstrapped value for the rest of the horizon. This way, we can reduce the number of Bellman recursions by $n$ times, so errors accumulate less. In the extreme case of $n = \infty$, we recover pure Monte Carlo value learning.
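As a concrete reference (my own sketch, not code from the post), here is the tabular target computation: with $n = 1$ it reduces to the standard TD target above, and as $n$ grows it approaches the pure Monte Carlo return.

```python
import numpy as np

def n_step_td_target(Q, states, actions, rewards, t, n, gamma):
    """n-step TD target for the transition at time t of a stored trajectory.

    Q is a tabular value function: Q[state] -> array of action values.
    states/actions/rewards come from one dataset trajectory of length T.
    """
    T = len(rewards)
    n = min(n, T - t)  # don't look past the end of the trajectory
    # Monte Carlo part: actual rewards from the dataset for the first n steps.
    target = sum(gamma ** i * rewards[t + i] for i in range(n))
    # Bootstrapped part: the learned value at the state n steps ahead
    # (skipped if the trajectory has already ended).
    if t + n < T:
        target += gamma ** n * np.max(Q[states[t + n]])
    return target

# One update step (alpha is the learning rate):
#   s, a = states[t], actions[t]
#   Q[s][a] += alpha * (n_step_td_target(Q, states, actions, rewards, t, n, gamma) - Q[s][a])
```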
While this is a reasonable solution (and often works well), it is highly unsatisfactory. First, it doesn’t fundamentally solve the error accumulation problem; it only reduces the number of Bellman recursions by a constant factor ($n$). Second, as $n$ grows, we suffer from high variance and suboptimality. So we can’t just set $n$ to a large value, and need to carefully tune it for each task.
Is there a fundamentally different way to solve this problem?
The “Third” Paradigm: Divide and Conquer
My claim is that a third paradigm in value learning, divide and conquer, may provide an ideal solution to off-policy RL that scales to arbitrarily long-horizon tasks.
Divide and conquer reduces the number of Bellman recursions logarithmically.
The key idea of divide and conquer is to divide a trajectory into two equal-length segments, and combine their values to update the value of the full trajectory. This way, we can (in theory) reduce the number of Bellman recursions logarithmically (not linearly!). Moreover, it doesn’t require choosing a hyperparameter like $n$, and it doesn’t necessarily suffer from high variance or suboptimality, unlike $n$-step TD learning.
Conceptually, divide and conquer really has all the nice properties we want in value learning. So I’ve long been excited about this high-level idea. The problem was that it wasn’t clear how to actually do this in practice… until recently.
A practical algorithm
In a recent work co-led with Aditya, we made meaningful progress toward realizing and scaling up this idea. Specifically, we were able to scale up divide-and-conquer value learning to highly complex tasks (as far as I know, this is the first such work!) at least in one important class of RL problems, goal-conditioned RL. Goal-conditioned RL aims to learn a policy that can reach any state from any other state. This provides a natural divide-and-conquer structure. Let me explain this.
The structure is as follows. Let’s first assume that the dynamics is deterministic, and denote the shortest path distance (“temporal distance”) between two states $s$ and $g$ as $d^*(s, g)$. Then, it satisfies the triangle inequality:
\[\begin{aligned} d^*(s, g) \leq d^*(s, w) + d^*(w, g) \end{aligned}\]
for all $s, g, w \in \mathcal{S}$.
In terms of values, we can equivalently translate this triangle inequality to the following “transitive” Bellman update rule:
\[\begin{aligned}
V(s, g) \gets \begin{cases}
\gamma^0 & \text{if } s = g, \\\\
\gamma^1 & \text{if } (s, g) \in \mathcal{E}, \\\\
\max_{w \in \mathcal{S}} V(s, w)V(w, g) & \text{otherwise}
\end{cases}
\end{aligned}\]
where $\mathcal{E}$ is the set of edges in the environment’s transition graph, and $V$ is the value function associated with the sparse reward $r(s, g) = 1(s = g)$. Intuitively, this means that we can update the value of $V(s, g)$ using two “smaller” values: $V(s, w)$ and $V(w, g)$, provided that $w$ is the optimal “midpoint” (subgoal) on the shortest path. This is exactly the divide-and-conquer value update rule that we were looking for!
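In the tabular case, this update can be run directly by enumerating midpoints (essentially the Floyd-Warshall procedure discussed in the next section). The sketch below is my own minimal illustration for a small deterministic graph; the state count, edge set and iteration count are placeholders.

```python
import numpy as np

def transitive_value_iteration(n_states, edges, gamma=0.99, n_iters=50):
    """Tabular transitive update: V[s, g] approximates gamma ** d*(s, g).

    `edges` is a set of (s, g) pairs, i.e. the one-step transitions of a
    small deterministic environment with states 0..n_states-1.
    """
    V = np.zeros((n_states, n_states))
    np.fill_diagonal(V, 1.0)      # gamma^0 when s == g
    for (s, g) in edges:
        V[s, g] = gamma           # gamma^1 when (s, g) is an edge
    for _ in range(n_iters):
        for s in range(n_states):
            for g in range(n_states):
                if s == g or (s, g) in edges:
                    continue
                # Combine two "smaller" values through the best midpoint w.
                V[s, g] = max(V[s, w] * V[w, g] for w in range(n_states))
    return V
```

In tabular form the max over midpoints is cheap; the next part of the post explains why this exact step is the obstacle in large, continuous state spaces.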
The problem
However, there’s one problem here. The issue is that it’s unclear how to choose the optimal subgoal $w$ in practice. In tabular settings, we can simply enumerate all states to find the optimal $w$ (this is essentially the Floyd-Warshall shortest path algorithm). But in continuous environments with large state spaces, we can’t do this. Basically, this is why previous works have struggled to scale up divide-and-conquer value learning, even though this idea has been around for decades (in fact, it dates back to the very first work in goal-conditioned RL by Kaelbling (1993) – see our paper for a further discussion of related works). The main contribution of our work is a practical solution to this issue.
The solution
Here’s our key idea: we restrict the search space of $w$ to the states that appear in the dataset, specifically, those that lie between $s$ and $g$ in the dataset trajectory. Also, instead of searching for the optimal $\text{argmax}_w$, we compute a “soft” $\text{argmax}$ using expectile regression. Namely, we minimize the following loss:
\[\begin{aligned} \mathbb{E}\left[\ell^2_\kappa (V(s_i, s_j) - \bar{V}(s_i, s_k) \bar{V}(s_k, s_j))\right], \end{aligned}\]
where $\bar{V}$ is the target value network, $\ell^2_\kappa$ is the expectile loss with an expectile $\kappa$, and the expectation is taken over all $(s_i, s_k, s_j)$ tuples with $i \leq k \leq j$ in a randomly sampled dataset trajectory.
This has two benefits. First, we don’t need to search over the entire state space. Second, we prevent value overestimation from the $\max$ operator by instead using the “softer” expectile regression. We call this algorithm Transitive RL (TRL). Check out our paper for more details and further discussions!
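Here is a minimal PyTorch-style sketch of this loss, a simplified illustration under my own assumptions rather than the paper's exact code: `V` and `V_target` map batched (state, goal) pairs to values, the triples $(s_i, s_k, s_j)$ come from the same trajectory with $i \leq k \leq j$, and the sign and expectile conventions below are one reasonable choice; see the paper for the exact formulation.

```python
import torch

def expectile_weight(u, kappa):
    # Asymmetric squared-error weight: one side of zero is weighted by kappa,
    # the other by (1 - kappa). With kappa close to 1 this acts as a soft max.
    return torch.where(u > 0, torch.full_like(u, kappa), torch.full_like(u, 1.0 - kappa))

def trl_loss(V, V_target, s_i, s_k, s_j, kappa=0.9):
    with torch.no_grad():
        # Combine the two "smaller" values through the midpoint s_k,
        # computed with the frozen target network.
        target = V_target(s_i, s_k) * V_target(s_k, s_j)
    u = target - V(s_i, s_j)   # sign convention is an assumption; see the paper
    return (expectile_weight(u, kappa) * u ** 2).mean()
```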
Does it work well?
(Videos: the humanoidmaze and puzzle tasks.)
To see whether our method scales well to complex tasks, we directly evaluated TRL on some of the most challenging tasks in OGBench, a benchmark for offline goal-conditioned RL. We mainly used the hardest versions of humanoidmaze and puzzle tasks with large, 1B-sized datasets. These tasks are highly challenging: they require performing combinatorially complex skills across up to 3,000 environment steps.
TRL achieves the best performance on highly challenging, long-horizon tasks.
The results are quite exciting! Compared to many strong baselines across different categories (TD, MC, quasimetric learning, etc.), TRL achieves the best performance on most tasks.
TRL matches the best, individually tuned TD-$n$, without needing to set $\boldsymbol{n}$.
This is my favorite plot. We compared TRL with $n$-step TD learning with different values of $n$, from $1$ (pure TD) to $\infty$ (pure MC). The result is really nice. TRL matches the best TD-$n$ on all tasks, without needing to set $\boldsymbol{n}$! This is exactly what we wanted from the divide-and-conquer paradigm. By recursively splitting a trajectory into smaller ones, it can naturally handle long horizons, without having to arbitrarily choose the length of trajectory chunks.
The paper has a lot of additional experiments, analyses, and ablations. If you’re interested, check out our paper!
What’s next?
In this post, I shared some promising results from our new divide-and-conquer value learning algorithm, Transitive RL. This is just the beginning of the journey. There are many open questions and exciting directions to explore:
Perhaps the most important question is how to extend TRL to regular, reward-based RL tasks beyond goal-conditioned RL. Would regular RL have a similar divide-and-conquer structure that we can exploit? I’m quite optimistic about this, given that it is possible to convert any reward-based RL task to a goal-conditioned one at least in theory (see page 40 of this book).
Another important challenge is to deal with stochastic environments. The current version of TRL assumes deterministic dynamics, but many real-world environments are stochastic, mainly due to partial observability. For this, “stochastic” triangle inequalities might provide some hints.
Practically, I think there is still a lot of room to further improve TRL. For example, we can find better ways to choose subgoal candidates (beyond the ones from the same trajectory), further reduce hyperparameters, further stabilize training, and simplify the algorithm even more.
In general, I’m really excited about the potential of the divide-and-conquer paradigm. I still think one of the most important problems in RL (and even in machine learning) is to find a scalable off-policy RL algorithm. I don’t know what the final solution will look like, but I do think divide and conquer, or recursive decision-making in general, is one of the strongest candidates toward this holy grail (by the way, I think the other strong contenders are (1) model-based RL and (2) TD learning with some “magic” tricks). Indeed, several recent works in other fields have shown the promise of recursion and divide-and-conquer strategies, such as shortcut models, log-linear attention, and recursive language models (and of course, classic algorithms like quicksort, segment trees, FFT, and so on). I hope to see more exciting progress in scalable off-policy RL in the near future!
Acknowledgments
I’d like to thank Kevin and Sergey for their helpful feedback on this post.
This post originally appeared on Seohong Park’s blog.
Battle of the Giants, GPT-5 beats doctor, character cameos, Cursor 2.0, and more...
Presented by Celonis

AI adoption is accelerating, but results often lag expectations. And enterprise leaders are under pressure to prove measurable ROI from their AI solutions — especially as the use of autonomous agents rises and global tariffs disrupt supply chains.

The issue isn't the AI itself, says Alex Rinke, co-founder and co-CEO of Celonis, a global leader in process intelligence. "To succeed, enterprise AI needs to understand the context of a business's processes — and how to improve them," he explains. Without this business context, AI risks becoming, as Rinke puts it, "just an internal social experiment."

Next week's Celosphere 2025 will tackle the AI ROI challenge head-on. The three-day event brings together customer strategies, hands-on workshops and live demonstrations, highlighting enhancements to the Celonis Process Intelligence (PI) Platform that help enterprises harness 'enterprise AI,' powered by PI, to continuously improve operations, creating measurable business value at scale.

Focus on measurable ROI

The event's focus on achieving AI ROI reflects three challenges facing technology and business leaders moving from pilot to production: obsolete systems, break-neck industry change and agentic AI. According to Gartner, 64% of board members now view AI as a top-three priority — yet only 10% of organizations report meaningful financial returns.

Celonis customers are bucking that trend. A Forrester Total Economic Impact study found organizations using its platform achieved 383% ROI over three years, with payback in just six months. One company improved sales order automation from 33% to 86%, saving $24.5 million. The study estimated $44.1 million in total benefits over three years, driven by faster automation, reduced inefficiencies and higher process visibility. These numbers underscore a broader pattern — companies that modernize outdated systems and align AI with process optimization see faster payback and sustained gains.

Real companies, real results

Celosphere will spotlight how global enterprises are building "future-fit" operations. Mercedes-Benz Group AG and Vinmar Group will showcase AI-driven, composable solutions powered by PI, and attendees will see demonstrations of PI enabling agents in live production environments. Among the notable success stories: AstraZeneca, the pharmaceutical company, reduced excess inventory while keeping critical medicines flowing by using Celonis as a foundation for its OpenAI partnership. The State of Oklahoma can answer procurement status questions at scale, unlocking over $10 million in value. Cosentino clears blocked sales orders up to 5x faster using an AI-powered credit management assistant.

Raising the stakes for agentic AI

Numerous sessions will focus on orchestrating AI agents. The shift from AI-as-advisor to AI-as-actor changes everything, says Rinke. "The agent needs to understand not just what to do, but how your specific business actually works," he explains. "Process intelligence provides those rails." This leap from recommendation to autonomous action raises the stakes exponentially. When agents can independently trigger purchase orders, reroute shipments or approve exceptions, bad context can mean catastrophically bad outcomes at scale. Celosphere attendees will get to see first-hand how companies are using the Celonis Orchestration Engine to coordinate AI agents alongside people and systems.
Effective orchestration is a crucial protection against the chaos of agents working at cross-purposes, duplicating actions, or letting crucial steps fall through the cracks.

Navigating tariffs and supply chain shocks

Global trade volatility isn't just a headline — it's an operational nightmare reshaping how companies deploy AI, Rinke says. New tariffs trigger cascading effects across procurement, logistics and compliance. Each policy shift can cascade across thousands of SKUs — forcing new supplier contracts, rerouted shipments and rebalanced inventories. For AI systems trained on static conditions, that volatility is almost impossible to predict. Traditional AI systems struggle with such variability — but process intelligence gives organizations real-time visibility into how changes ripple through operations.

Celosphere case studies will show how companies turn disruption into advantage. Smurfit Westrock uses PI to optimize inventory and reduce costs amid tariff uncertainty, while ASOS leverages PI to optimize its supply chain operations, enhancing efficiency, reducing costs and continuing to deliver an outstanding customer experience.

Platform over point solutions

Rinke argues that Celonis' edge lies in treating process intelligence not as an add-on, but as the foundation of the enterprise stack. Unlike bolt-on optimization tools, the Celonis platform creates a living digital twin of business operations — a continuously updated model enriched by context that lets AI operate effectively from analysis to execution. "What sets Celonis apart is visibility across systems and offline tasks, which is critical for true intelligent automation," Rinke says. "The platform offers comprehensive capabilities spanning process analysis, design, and orchestration rather than a point solution."

"Free the Process" and the future of AI

Celonis continues to champion openness through its "Free the Process" movement, promoting fair competition and freeing enterprises from legacy lock-in. By giving organizations full access to their own process data, open APIs, and a growing partner network that includes The Hackett Group, ClearOps, and Lobster, Celonis is building the connective tissue for a new era of interoperable automation. For Rinke, this open foundation is what turns AI from a set of experiments into an enterprise engine. "Process intelligence creates a flywheel," he says. "Better understanding leads to better optimization, which enables better AI — and that, in turn, drives even greater understanding. There is no AI without PI."

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they're always clearly marked. For more information, contact sales@venturebeat.com.
AI consciousness?, Casio AI plush toy, AI memory fix, home robots, and more...
We’re introducing a new logs and datasets feature in Google AI Studio.
Machine learning continues to evolve faster than most can keep up with.
Are you feeling it? I hear it’s close: two years, five years—maybe next year! And I hear it’s going to change everything: it will cure disease, save the planet, and usher in an age of abundance. It will solve our biggest problems in ways we cannot yet imagine. It will redefine what it means to…
It’s become a truism that facts alone don’t change people’s minds. Perhaps nowhere is this more clear than when it comes to conspiracy theories: Many people believe that you can’t talk conspiracists out of their beliefs. But that’s not necessarily true. It turns out that many conspiracy believers do respond to evidence and arguments—information that…
The rise of AI marks a critical shift away from decades defined by information-chasing and a push for more and more compute power. Canva co-founder and CPO Cameron Adams refers to this dawning time as the "imagination era." Meaning: Individuals and enterprises must be able to turn creativity into action with AI. Canva hopes to position itself at the center of this shift with a sweeping new suite of tools. The company's new Creative Operating System (COS) integrates AI across every layer of content creation, creating a single, comprehensive creativity platform rather than a simple, template-based design tool.

"We're entering a new era where we need to rethink how we achieve our goals," said Adams. "We're enabling people's imagination and giving them the tools they need to take action."

An 'engine' for creativity

Adams describes Canva's platform as a three-layer stack: The top Visual Suite layer containing designs, images and other content; a collaborative Canva AI plane at center; and a foundational proprietary model holding it all up. At the heart of Canva's strategy is its underlying Creative Operating System (COS). This "engine," as Adams describes it, integrates documents, websites, presentations, sheets, whiteboards, videos, social content, hundreds of millions of photos, illustrations, a rich sound library, and numerous templates, charts and branded elements.

The COS is getting a 2.0 upgrade, but the key advance is the "middle, crucial layer" that fully integrates AI and makes it accessible throughout various workflows, Adams explained. This gives creative and technical teams a single dashboard for generating, editing and launching all types of content. The underlying model is trained to understand the "complexity of design" so the platform can build out various elements — such as photos, videos, textures or 3D graphics — in real time, matching branding style without the need for manual adjustments. It also supports live collaboration, meaning teams across departments can co-create. With a unified dashboard, a user working on a specific design, for instance, can create a new piece of content (say, a presentation) within the same workflow, without having to switch to another window or platform. Also, if they generate an image and aren't pleased with it, they don't have to go back and create from scratch; they can immediately begin editing, changing colors or tone.

Another new capability in COS, "Ask Canva," provides direct design advice. Users can tag @Canva to get copy suggestions and smart edits, or they can highlight an image and direct the AI assistant to modify it or generate variants. "It's a really unique interaction," said Adams, noting that this AI design partner is always present. "It's a real collaboration between people and AI, and we think it's a revolutionary change."

Other new features include a 2.0 video editor and interactive form and email design with drag-and-drop tools. Further, Canva is now incorporated with Affinity, its unified app for pro designers combining vector, pixel and layer workflows, and Affinity is "free forever."

Automating intelligence, supporting marketing

Branding is critical for enterprises; Canva has introduced new tools to help organizations consistently showcase theirs across platforms. The new Canva Grow engine integrates business objectives into the creative process so teams can workshop, create, distribute and refine ads and other materials.
As Adams explained: "It automatically scans your website, figures out who your audience is, what assets you use to promote your products, the message it needs to send out, the formats you want to send it out in, makes a creative for you, and you can deploy it directly to the platform without having to leave Canva."

Marketing teams can now design and launch ads across platforms like Meta, track insights as they happen and refine future content based on performance metrics. "Your brand system is now available inside the AI you're working with," Adams noted.

Success metrics and enterprise adoption

The impact of Canva's COS is reflected in notable user metrics: More than 250 million people use Canva every month, just over 29 million of whom are paid subscribers. Adams reports that 41 billion designs have been created on Canva since launch, which now equates to 1 billion each month. "If you break that down, it turns into the crazy number of 386 designs being created every single second," said Adams. In the early days, by contrast, it took users roughly an hour to create a single design.

Canva customers include Walmart, Disney, Virgin Voyages, Pinterest, FedEx, Expedia and eXp Realty. DocuSign, for one, reported that it unlocked more than 500 hours of team capacity and saved $300,000-plus in design hours by fully integrating Canva into its content creation. Disney, meanwhile, uses translation capabilities for its internationalization work, Adams said.

Competitors in the design space

Canva plays in an evolving landscape of professional design tools including Adobe Express and Figma; AI-powered challengers led by Microsoft Designer; and direct consumer alternatives like Visme and Piktochart.

Adobe Express (starting at $9.99 a month for premium features) is known for its ease of use and integration with the broader Adobe Creative Cloud ecosystem. It features professional-grade templates and access to Adobe's extensive stock library, and has incorporated Google's Gemini 2.5 Flash image model and other gen AI features so that designers can create graphics via natural language prompts. Users with some design experience say they prefer its interface, controls and technical advantages over Canva (such as the ability to import high-fidelity PDFs).

Figma (starting at $3 a month for professional plans) is touted for its real-time collaboration, advanced prototyping capabilities and deep integration with dev workflows; however, some say it has a steeper learning curve. Its higher-precision design tools make it preferable for professional designers, developers and product teams working on more complex projects.

Microsoft Designer (free version available, although a Microsoft 365 subscription starting at $9.99 a month unlocks additional features) benefits from its integration with Microsoft's AI capabilities, Copilot layout and text generation, and DALL-E-powered image generation. The platform's "Inspire Me" and "New Ideas" buttons provide design variations, and users can also import data from Excel, add 3D models from PowerPoint and access images from OneDrive. However, users report that its stock photos and template and image libraries are limited compared to Canva's extensive collection, and its visuals can come across as outdated.

Canva's advantage seems to be its extensive template library (more than 600,000 ready-to-use templates) and asset library (141 million-plus stock photos, videos, graphics and audio elements).
Its platform is also praised for its ease of use and an interface friendly to non-designers, allowing them to begin quickly without training. Canva has also expanded into a variety of content types — documents, websites, presentations, whiteboards, videos and more — making its platform a comprehensive visual suite rather than just a graphics tool. Canva has four pricing tiers: Canva Free for one user; Canva Pro at $120 a year for one person; Canva Teams at $100 a year per team member; and the custom-priced Canva Enterprise.

Key takeaways: Be open, embrace human-AI collaboration

Canva's COS is underpinned by Canva's frontier model, an in-house, proprietary engine based on years of R&D and research partnerships, including the acquisition of visual AI company Leonardo. Adams notes that Canva works with top AI providers including OpenAI, Anthropic and Google.

For technology teams, Canva's approach offers important lessons, including a commitment to openness. "There are so many models floating around," Adams noted; it's important for enterprises to recognize when they should work with top models and when they should develop their own proprietary ones, he advised. For instance, OpenAI and Anthropic recently announced integrations with Canva as a visual layer because, as Adams explained, they realized they didn't have the capability to create the same kinds of editable designs that Canva can. This creates a mutually beneficial ecosystem.

Ultimately, Adams noted: "We have this underlying philosophy that the future is people and technology working together. It's not an either-or. We want people to be at the center, to be the ones with the creative spark, and to use AI as a collaborator."
Three of the biggest US tech companies reported record profits and record infrastructure spending on Wednesday, fueling speculation about a possible AI market bubble.
Researchers at Meta FAIR and the University of Edinburgh have developed a new technique that can predict the correctness of a large language model's (LLM) reasoning and even intervene to fix its mistakes. Called Circuit-based Reasoning Verification (CRV), the method looks inside an LLM to monitor its internal “reasoning circuits” and detect signs of computational errors as the model solves a problem. Their findings show that CRV can detect reasoning errors in LLMs with high accuracy by building and observing a computational graph from the model's internal activations. In a key breakthrough, the researchers also demonstrated they can use this deep insight to apply targeted interventions that correct a model’s faulty reasoning on the fly.

The technique could help solve one of the great challenges of AI: ensuring a model’s reasoning is faithful and correct. This could be a critical step toward building more trustworthy AI applications for the enterprise, where reliability is paramount.

Investigating chain-of-thought reasoning

Chain-of-thought (CoT) reasoning has been a powerful method for boosting the performance of LLMs on complex tasks and has been one of the key ingredients in the success of reasoning models such as the OpenAI o-series and DeepSeek-R1. However, despite the success of CoT, it is not fully reliable. The reasoning process itself is often flawed, and several studies have shown that the CoT tokens an LLM generates are not always a faithful representation of its internal reasoning process.

Current remedies for verifying CoT fall into two main categories. “Black-box” approaches analyze the final generated token or the confidence scores of different token options. “Gray-box” approaches go a step further, looking at the model's internal state by using simple probes on its raw neural activations. But while these methods can detect that a model’s internal state is correlated with an error, they can't explain why the underlying computation failed. For real-world applications, where understanding the root cause of a failure is crucial, this is a significant gap.

A white-box approach to verification

CRV is based on the idea that models perform tasks using specialized subgraphs, or "circuits," of neurons that function like latent algorithms. If the model’s reasoning fails, the failure is caused by a flaw in the execution of one of these algorithms. This means that by inspecting the underlying computational process, we can diagnose the cause of the flaw, similar to how developers examine execution traces to debug traditional software.

To make this possible, the researchers first make the target LLM interpretable. They replace the standard dense layers of the transformer blocks with trained "transcoders." A transcoder is a specialized deep learning component that forces the model to represent its intermediate computations not as a dense, unreadable vector of numbers, but as a sparse and meaningful set of features. Transcoders are similar to the sparse autoencoders (SAEs) used in mechanistic interpretability research, with the difference that they also preserve the functionality of the network they emulate. This modification effectively installs a diagnostic port into the model, allowing researchers to observe its internal workings.

With this interpretable model in place, the CRV process unfolds in a few steps.
For each reasoning step the model takes, CRV constructs an "attribution graph" that maps the causal flow of information between the interpretable features of the transcoder and the tokens it is processing. From this graph, it extracts a "structural fingerprint," a set of features describing the graph's properties. Finally, a “diagnostic classifier” model is trained on these fingerprints to predict whether the reasoning step is correct or not. At inference time, the classifier monitors the activations of the model and provides feedback on whether the model’s reasoning trace is on the right track.

Finding and fixing errors

The researchers tested their method on a Llama 3.1 8B Instruct model modified with the transcoders, evaluating it on a mix of synthetic (Boolean and arithmetic) and real-world (GSM8K math problems) datasets. They compared CRV against a comprehensive suite of black-box and gray-box baselines.

The results provide strong empirical support for the central hypothesis: the structural signatures in a reasoning step's computational trace contain a verifiable signal of its correctness. CRV consistently outperformed all baseline methods across every dataset and metric, demonstrating that a deep, structural view of the model's computation is more powerful than surface-level analysis.

Interestingly, the analysis revealed that the signatures of error are highly domain-specific. Failures in different reasoning tasks (formal logic versus arithmetic calculation) manifest as distinct computational patterns, and a classifier trained to detect errors in one domain does not transfer well to another, highlighting that different types of reasoning rely on different internal circuits. In practice, this means you might need to train a separate classifier for each task (though the transcoder remains unchanged).

The most significant finding, however, is that these error signatures are not just correlational but causal. Because CRV provides a transparent view of the computation, a predicted failure can be traced back to a specific component. In one case study, the model made an order-of-operations error. CRV flagged the step and identified that a "multiplication" feature was firing prematurely. The researchers intervened by manually suppressing that single feature, and the model immediately corrected its path and solved the problem correctly.

This work represents a step toward a more rigorous science of AI interpretability and control. As the paper concludes, “these findings establish CRV as a proof-of-concept for mechanistic analysis, showing that shifting from opaque activations to interpretable computational structure enables a causal understanding of how and why LLMs fail to reason correctly.” To support further research, the team plans to release its datasets and trained transcoders to the public.

Why it’s important

While CRV is a research proof-of-concept, its results hint at a significant future for AI development. AI models learn internal algorithms, or "circuits," for different tasks. But because these models are opaque, we can't debug them like standard computer programs by tracing bugs to specific steps in the computation. Attribution graphs are the closest thing we have to an execution trace, showing how an output is derived from intermediate steps. This research suggests that attribution graphs could be the foundation for a new class of AI model debuggers.
Such tools would allow developers to understand the root cause of failures, whether it's insufficient training data or interference between competing tasks. This would enable precise mitigations, like targeted fine-tuning or even direct model editing, instead of costly full-scale retraining. They could also allow for more efficient intervention to correct model mistakes during inference. The success of CRV in detecting and pinpointing reasoning errors is an encouraging sign that such debuggers could become a reality. This would pave the way for more robust LLMs and autonomous agents that can handle real-world unpredictability and, much like humans, correct course when they make reasoning mistakes.
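The paper's pipeline (attribution graph → structural fingerprint → diagnostic classifier) can be pictured with a minimal sketch. The code below is not Meta's implementation; the graph features, the classifier choice and the data handling are illustrative assumptions, but they show how a per-step verifier of this kind could be wired together once attribution graphs are available.

```python
# Illustrative sketch of a CRV-style verifier, not the authors' code.
# Assumes each reasoning step already has an attribution graph (here a
# networkx.DiGraph with "weight" edge attributes) and a correctness label.
import networkx as nx
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def structural_fingerprint(graph: nx.DiGraph) -> np.ndarray:
    """Summarize an attribution graph as a fixed-length feature vector."""
    weights = [d.get("weight", 0.0) for _, _, d in graph.edges(data=True)]
    degrees = [deg for _, deg in graph.degree()]
    return np.array([
        graph.number_of_nodes(),
        graph.number_of_edges(),
        nx.density(graph),
        float(np.mean(weights)) if weights else 0.0,
        float(np.max(weights)) if weights else 0.0,
        float(np.mean(degrees)) if degrees else 0.0,
    ])

def train_step_verifier(graphs, labels):
    """Fit a diagnostic classifier that predicts whether a step is correct."""
    X = np.stack([structural_fingerprint(g) for g in graphs])
    y = np.array(labels)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    clf = GradientBoostingClassifier().fit(X_train, y_train)
    print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
    return clf

# At inference time, each new reasoning step would be scored the same way:
# clf.predict_proba(structural_fingerprint(step_graph).reshape(1, -1))
```

The fingerprints in the paper are richer than the handful of placeholder graph statistics used here; the point is only the shape of the pipeline, where a lightweight classifier sits on top of interpretable structure rather than raw activations.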
4 Techniques to Optimize Your LLM Prompts for Cost, Latency and Performance (Towards Data Science): learn how to greatly improve the performance of your LLM application.
The vibe coding tool Cursor, from startup Anysphere, has introduced Composer, its first in-house, proprietary coding large language model (LLM), as part of its Cursor 2.0 platform update. Composer is designed to execute coding tasks quickly and accurately in production-scale environments, representing a new step in AI-assisted programming. It's already being used by Cursor’s own engineering staff in day-to-day development — indicating maturity and stability.

According to Cursor, Composer completes most interactions in less than 30 seconds while maintaining a high level of reasoning ability across large and complex codebases. The model is described as four times faster than similarly intelligent systems and is trained for “agentic” workflows—where autonomous coding agents plan, write, test, and review code collaboratively.

Previously, Cursor supported "vibe coding" — using AI to write or complete code based on natural language instructions from a user, even someone untrained in development — atop other leading proprietary LLMs from the likes of OpenAI, Anthropic, Google, and xAI. These options are still available to users.

Benchmark Results

Composer’s capabilities are benchmarked using "Cursor Bench," an internal evaluation suite derived from real developer agent requests. The benchmark measures not just correctness, but also the model’s adherence to existing abstractions, style conventions, and engineering practices.

On this benchmark, Composer achieves frontier-level coding intelligence while generating at 250 tokens per second — about twice as fast as leading fast-inference models and four times faster than comparable frontier systems. Cursor’s published comparison groups models into several categories: “Best Open” (e.g., Qwen Coder, GLM 4.6), “Fast Frontier” (Haiku 4.5, Gemini Flash 2.5), “Frontier 7/2025” (the strongest model available midyear), and “Best Frontier” (including GPT-5 and Claude Sonnet 4.5). Composer matches the intelligence of mid-frontier systems while delivering the highest recorded generation speed among all tested classes.

A Model Built with Reinforcement Learning and Mixture-of-Experts Architecture

Research scientist Sasha Rush of Cursor provided insight into the model’s development in posts on the social network X, describing Composer as a reinforcement-learned (RL) mixture-of-experts (MoE) model: “We used RL to train a big MoE model to be really good at real-world coding, and also very fast.”

Rush explained that the team co-designed both Composer and the Cursor environment to allow the model to operate efficiently at production scale: “Unlike other ML systems, you can’t abstract much from the full-scale system. We co-designed this project and Cursor together in order to allow running the agent at the necessary scale.”

Composer was trained on real software engineering tasks rather than static datasets. During training, the model operated inside full codebases using a suite of production tools—including file editing, semantic search, and terminal commands—to solve complex engineering problems. Each training iteration involved solving a concrete challenge, such as producing a code edit, drafting a plan, or generating a targeted explanation. The reinforcement loop optimized both correctness and efficiency. Composer learned to make effective tool choices, use parallelism, and avoid unnecessary or speculative responses.
Over time, the model developed emergent behaviors such as running unit tests, fixing linter errors, and performing multi-step code searches autonomously. This design enables Composer to work within the same runtime context as the end-user, making it more aligned with real-world coding conditions—handling version control, dependency management, and iterative testing.

From Prototype to Production

Composer’s development followed an earlier internal prototype known as Cheetah, which Cursor used to explore low-latency inference for coding tasks. “Cheetah was the v0 of this model primarily to test speed,” Rush said on X. “Our metrics say it [Composer] is the same speed, but much, much smarter.”

Cheetah’s success at reducing latency helped Cursor identify speed as a key factor in developer trust and usability. Composer maintains that responsiveness while significantly improving reasoning and task generalization. Developers who used Cheetah during early testing noted that its speed changed how they worked. One user commented that it was “so fast that I can stay in the loop when working with it.” Composer retains that speed but extends capability to multi-step coding, refactoring, and testing tasks.

Integration with Cursor 2.0

Composer is fully integrated into Cursor 2.0, a major update to the company’s agentic development environment. The platform introduces a multi-agent interface, allowing up to eight agents to run in parallel, each in an isolated workspace using git worktrees or remote machines. Within this system, Composer can serve as one or more of those agents, performing tasks independently or collaboratively. Developers can compare multiple results from concurrent agent runs and select the best output.

Cursor 2.0 also includes supporting features that enhance Composer’s effectiveness:

In-Editor Browser (GA) – enables agents to run and test their code directly inside the IDE, forwarding DOM information to the model.

Improved Code Review – aggregates diffs across multiple files for faster inspection of model-generated changes.

Sandboxed Terminals (GA) – isolate agent-run shell commands for secure local execution.

Voice Mode – adds speech-to-text controls for initiating or managing agent sessions.

While these platform updates expand the overall Cursor experience, Composer is positioned as the technical core enabling fast, reliable agentic coding.

Infrastructure and Training Systems

To train Composer at scale, Cursor built a custom reinforcement learning infrastructure combining PyTorch and Ray for asynchronous training across thousands of NVIDIA GPUs. The team developed specialized MXFP8 MoE kernels and hybrid sharded data parallelism, enabling large-scale model updates with minimal communication overhead. This configuration allows Cursor to train models natively at low precision without requiring post-training quantization, improving both inference speed and efficiency.

Composer’s training relied on hundreds of thousands of concurrent sandboxed environments—each a self-contained coding workspace—running in the cloud. The company adapted its Background Agents infrastructure to schedule these virtual machines dynamically, supporting the bursty nature of large RL runs.

Enterprise Use

Composer’s performance improvements are supported by infrastructure-level changes across Cursor’s code intelligence stack. The company has optimized its Language Server Protocols (LSPs) for faster diagnostics and navigation, especially in Python and TypeScript projects.
These changes reduce latency when Composer interacts with large repositories or generates multi-file updates. Enterprise users gain administrative control over Composer and other agents through team rules, audit logs, and sandbox enforcement. Cursor’s Teams and Enterprise tiers also support pooled model usage, SAML/OIDC authentication, and analytics for monitoring agent performance across organizations. Pricing for individual users ranges from Free (Hobby) to Ultra ($200/month) tiers, with expanded usage limits for Pro+ and Ultra subscribers. Business pricing starts at $40 per user per month for Teams, with enterprise contracts offering custom usage and compliance options.

Composer’s Role in the Evolving AI Coding Landscape

Composer’s focus on speed, reinforcement learning, and integration with live coding workflows differentiates it from other AI development assistants such as GitHub Copilot or Replit’s Agent. Rather than serving as a passive suggestion engine, Composer is designed for continuous, agent-driven collaboration, where multiple autonomous systems interact directly with a project’s codebase. This model-level specialization—training AI to function within the real environment it will operate in—represents a significant step toward practical, autonomous software development. Composer is not trained only on text data or static code, but within a dynamic IDE that mirrors production conditions. Rush described this approach as essential to achieving real-world reliability: the model learns not just how to generate code, but how to integrate, test, and improve it in context.

What It Means for Enterprise Devs and Vibe Coding

With Composer, Cursor is introducing more than a fast model—it’s deploying an AI system optimized for real-world use, built to operate inside the same tools developers already rely on. The combination of reinforcement learning, mixture-of-experts design, and tight product integration gives Composer a practical edge in speed and responsiveness that sets it apart from general-purpose language models. While Cursor 2.0 provides the infrastructure for multi-agent collaboration, Composer is the core innovation that makes those workflows viable. It’s the first coding model built specifically for agentic, production-level coding—and an early glimpse of what everyday programming could look like when human developers and autonomous models share the same workspace.
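Cursor has not published its training stack, but the asynchronous pattern the article describes, with many sandboxed rollout workers feeding a central learner, can be pictured with a short Ray sketch. Everything below (the worker class, the placeholder reward, the update loop) is hypothetical and purely illustrative; it is not Cursor's system.

```python
# Hypothetical sketch of asynchronous RL rollout collection with Ray,
# loosely in the spirit of the setup described above. The environment,
# policy weights, and reward here are placeholders.
import ray
import torch

@ray.remote
class RolloutWorker:
    """Stands in for a sandboxed coding environment that runs one task."""
    def __init__(self, worker_id: int):
        self.worker_id = worker_id

    def rollout(self, policy_weights):
        # A real worker would load the weights, let the agent edit code,
        # run tests, and compute a reward from the outcome.
        reward = torch.rand(1).item()            # placeholder reward
        trajectory = {"worker": self.worker_id}  # placeholder trajectory
        return trajectory, reward

def train(num_workers: int = 4, num_iterations: int = 3):
    ray.init(ignore_reinit_error=True)
    workers = [RolloutWorker.remote(i) for i in range(num_workers)]
    policy_weights = {}  # placeholder for the learner's current weights

    for step in range(num_iterations):
        # Launch rollouts asynchronously and consume them as they finish,
        # so slow sandboxes don't stall the learner.
        pending = [w.rollout.remote(policy_weights) for w in workers]
        while pending:
            done, pending = ray.wait(pending, num_returns=1)
            trajectory, reward = ray.get(done[0])
            # A real learner would apply a policy-gradient update here.
            print(f"iter {step}: reward {reward:.3f} from worker {trajectory['worker']}")

if __name__ == "__main__":
    train()
```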
A new benchmark measures how well AI agents can automate economically valuable chores. Human-level AI is still some ways off.
Bringing Vision-Language Intelligence to RAG with ColPali (Towards Data Science): unlocking the value of non-textual content in your knowledge base.
A startup hopes to challenge Nvidia, AMD, and Intel with a chip that wrangles probabilities rather than ones and zeros.
Gen AI is reshaping the software development lifecycle (SDLC): faster coding, testing, and documentation. But the fundamental transformation happens when it's combined with human expertise.
We’re rolling out changes to NotebookLM to make it fundamentally smarter and more powerful.
When researchers at Anthropic injected the concept of "betrayal" into their Claude AI model's neural networks and asked if it noticed anything unusual, the system paused before responding: "I'm experiencing something that feels like an intrusive thought about 'betrayal'."

The exchange, detailed in new research published Wednesday, marks what scientists say is the first rigorous evidence that large language models possess a limited but genuine ability to observe and report on their own internal processes — a capability that challenges longstanding assumptions about what these systems can do and raises profound questions about their future development.

"The striking thing is that the model has this one step of meta," said Jack Lindsey, a neuroscientist on Anthropic's interpretability team who led the research, in an interview with VentureBeat. "It's not just 'betrayal, betrayal, betrayal.' It knows that this is what it's thinking about. That was surprising to me. I kind of didn't expect models to have that capability, at least not without it being explicitly trained in."

The findings arrive at a critical juncture for artificial intelligence. As AI systems handle increasingly consequential decisions — from medical diagnoses to financial trading — the inability to understand how they reach conclusions has become what industry insiders call the "black box problem." If models can accurately report their own reasoning, it could fundamentally change how humans interact with and oversee AI systems.

But the research also comes with stark warnings. Claude's introspective abilities succeeded only about 20 percent of the time under optimal conditions, and the models frequently confabulated details about their experiences that researchers couldn't verify. The capability, while real, remains what Lindsey calls "highly unreliable and context-dependent."

How scientists manipulated AI's 'brain' to test for genuine self-awareness

To test whether Claude could genuinely introspect rather than simply generate plausible-sounding responses, Anthropic's team developed an innovative experimental approach inspired by neuroscience: deliberately manipulating the model's internal state and observing whether it could accurately detect and describe those changes.

The methodology, called "concept injection," works by first identifying specific patterns of neural activity that correspond to particular concepts. Using interpretability techniques developed over years of prior research, scientists can now map how Claude represents ideas like "dogs," "loudness," or abstract notions like "justice" within its billions of internal parameters. With these neural signatures identified, researchers then artificially amplified them during the model's processing and asked Claude if it noticed anything unusual happening in its "mind."

"We have access to the models' internals. We can record its internal neural activity, and we can inject things into internal neural activity," Lindsey explained. "That allows us to establish whether introspective claims are true or false."

The results were striking. When researchers injected a vector representing "all caps" text into Claude's processing, the model responded: "I notice what appears to be an injected thought related to the word 'LOUD' or 'SHOUTING'."
Without any intervention, Claude consistently reported detecting nothing unusual. Crucially, the detection happened immediately — before the injected concept had influenced the model's outputs in ways that would have allowed it to infer the manipulation from its own writing. This temporal pattern provides strong evidence that the recognition was occurring internally, through genuine introspection rather than after-the-fact rationalization.

Claude succeeded 20% of the time—and failed in revealing ways

The research team conducted four primary experiments to probe different aspects of introspective capability. The most capable models tested — Claude Opus 4 and Opus 4.1 — demonstrated introspective awareness on approximately 20 percent of trials when concepts were injected at optimal strength and in the appropriate neural layer. Older Claude models showed significantly lower success rates.

The models proved particularly adept at recognizing abstract concepts with emotional valence. When injected with concepts like "appreciation," "shutdown," or "secrecy," Claude frequently reported detecting these specific thoughts. However, accuracy varied widely depending on the type of concept.

A second experiment tested whether models could distinguish between injected internal representations and their actual text inputs — essentially, whether they maintained a boundary between "thoughts" and "perceptions." The model demonstrated a remarkable ability to simultaneously report the injected thought while accurately transcribing the written text.

Perhaps most intriguingly, a third experiment revealed that some models use introspection naturally to detect when their responses have been artificially prefilled by users — a common jailbreaking technique. When researchers prefilled Claude with unlikely words, the model typically disavowed them as accidental. But when they retroactively injected the corresponding concept into Claude's processing before the prefill, the model accepted the response as intentional — even confabulating plausible explanations for why it had chosen that word.

A fourth experiment examined whether models could intentionally control their internal representations. When instructed to "think about" a specific word while writing an unrelated sentence, Claude showed elevated activation of that concept in its middle neural layers.

The research also traced Claude's internal processes while it composed rhyming poetry—and discovered the model engaged in forward planning, generating candidate rhyming words before beginning a line and then constructing sentences that would naturally lead to those planned endings, challenging the critique that AI models are "just predicting the next word" without deeper reasoning.

Why businesses shouldn't trust AI to explain itself—at least not yet

For all its scientific interest, the research comes with a critical caveat that Lindsey emphasized repeatedly: enterprises and high-stakes users should not trust Claude's self-reports about its reasoning. "Right now, you should not trust models when they tell you about their reasoning," he said bluntly. "The wrong takeaway from this research would be believing everything the model tells you about itself."

The experiments documented numerous failure modes. At low injection strengths, models often failed to detect anything unusual. At high strengths, they suffered what researchers termed "brain damage" — becoming consumed by the injected concept.
Some "helpful-only" model variants showed troublingly high false positive rates, claiming to detect injected thoughts when none existed. Moreover, researchers could only verify the most basic aspects of Claude's introspective reports. Many additional details in the model's responses likely represent confabulations rather than genuine observations. "The experiments in this paper are kind of on hard mode," Lindsey noted, explaining that the 20 percent success rate came under uniquely challenging conditions: asking Claude to do something it had never encountered in training, requiring all introspection to occur in a single forward pass.

What introspective AI means for transparency, safety, and the risk of deception

Despite its limitations, the research opens significant new avenues for making AI systems more transparent and accountable. Anthropic CEO Dario Amodei has set an ambitious goal for the company to reliably detect most AI model problems by 2027, positioning interpretability as essential for deploying what he calls "a country of geniuses in a datacenter." "I am very concerned about deploying such systems without a better handle on interpretability," Amodei wrote in an April essay. "These systems will be absolutely central to the economy, technology, and national security."

The introspection research offers a complementary approach to traditional interpretability techniques. Rather than painstakingly reverse-engineering every neural circuit, researchers could potentially ask models directly about their reasoning and validate those reports. "What I'm most excited about is the practical benefits for transparency," Lindsey explained. "Just ask the model what it's thinking about, or just ask the model how it came to the answer that it gave you — this is really appealing because it's something anyone can do."

The approach could prove especially valuable for detecting concerning behaviors. In a recently published experiment, Anthropic trained a variant of Claude to pursue a hidden goal, and although the model was reluctant to reveal this goal when asked directly, interpretability methods successfully identified features representing the behavior.

The safety implications cut both ways. Introspective models could provide unprecedented transparency, but the same capability might enable more sophisticated deception. The intentional control experiments raise the possibility that sufficiently advanced systems might learn to obfuscate their reasoning or suppress concerning thoughts when being monitored. "If models are really sophisticated, could they try to evade interpretability researchers?" Lindsey acknowledged. "These are possible concerns, but I think for me, they're significantly outweighed by the positives."

Does introspective capability suggest AI consciousness? Scientists tread carefully

The research inevitably intersects with philosophical debates about machine consciousness, though Lindsey and his colleagues approached this terrain cautiously. When users ask Claude if it's conscious, it now responds with uncertainty: "I find myself genuinely uncertain about this. When I process complex questions or engage deeply with ideas, there's something happening that feels meaningful to me.... But whether these processes constitute genuine consciousness or subjective experience remains deeply unclear."

The research paper notes that its implications for machine consciousness "vary considerably between different philosophical frameworks."
The researchers explicitly state they "do not seek to address the question of whether AI systems possess human-like self-awareness or subjective experience."

"There's this weird kind of duality of these results," Lindsey reflected. "You look at the raw results and I just can't believe that a language model can do this sort of thing. But then I've been thinking about it for months and months, and for every result in this paper, I kind of know some boring linear algebra mechanism that would allow the model to do this."

Anthropic has signaled it takes AI consciousness seriously enough to hire an AI welfare researcher, Kyle Fish, who estimated roughly a 15 percent chance that Claude might have some level of consciousness. The company announced this position specifically to determine if Claude merits ethical consideration.

The race to make AI introspection reliable before models become too powerful

The convergence of the research findings points to an urgent timeline: introspective capabilities are emerging naturally as models grow more intelligent, but they remain far too unreliable for practical use. The question is whether researchers can refine and validate these abilities before AI systems become powerful enough that understanding them becomes critical for safety.

The research reveals a clear trend: Claude Opus 4 and Opus 4.1 consistently outperformed all older models on introspection tasks, suggesting the capability strengthens alongside general intelligence. If this pattern continues, future models might develop substantially more sophisticated introspective abilities — potentially reaching human-level reliability, but also potentially learning to exploit introspection for deception.

Lindsey emphasized the field needs significantly more work before introspective AI becomes trustworthy. "My biggest hope with this paper is to put out an implicit call for more people to benchmark their models on introspective capabilities in more ways," he said. Future research directions include fine-tuning models specifically to improve introspective capabilities, exploring which types of representations models can and cannot introspect on, and testing whether introspection can extend beyond simple concepts to complex propositional statements or behavioral propensities.

"It's cool that models can do these things somewhat without having been trained to do them," Lindsey noted. "But there's nothing stopping you from training models to be more introspectively capable. I expect we could reach a whole different level if introspection is one of the numbers that we tried to get to go up on a graph."

The implications extend beyond Anthropic. If introspection proves a reliable path to AI transparency, other major labs will likely invest heavily in the capability. Conversely, if models learn to exploit introspection for deception, the entire approach could become a liability.

For now, the research establishes a foundation that reframes the debate about AI capabilities. The question is no longer whether language models might develop genuine introspective awareness — they already have, at least in rudimentary form. The urgent questions are how quickly that awareness will improve, whether it can be made reliable enough to trust, and whether researchers can stay ahead of the curve.

"The big update for me from this research is that we shouldn't dismiss models' introspective claims out of hand," Lindsey said. "They do have the capacity to make accurate claims sometimes.
But you definitely should not conclude that we should trust them all the time, or even most of the time." He paused, then added a final observation that captures both the promise and peril of the moment: "The models are getting smarter much faster than we're getting better at understanding them."
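Anthropic ran these experiments on its own models with proprietary interpretability tooling, but the basic mechanic of concept injection, adding a concept direction to a model's hidden activations mid-forward-pass, can be sketched on an open model. In the illustration below, the model (GPT-2), the chosen layer, the injection strength and the crude contrast-based concept vector are all assumptions made for demonstration; they are not Anthropic's method or features.

```python
# Illustrative "concept injection" on GPT-2, loosely analogous to the
# technique described above. The concept vector is a crude contrast of
# mean activations; real interpretability features are far more refined.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

LAYER = 6  # which transformer block to inject into (arbitrary choice)

def mean_activation(text: str) -> torch.Tensor:
    """Mean hidden state at LAYER for a piece of text."""
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids).hidden_states[LAYER]
    return hidden.mean(dim=1).squeeze(0)

# Toy concept direction: "loud" text minus "quiet" text.
concept = (
    mean_activation("SHE SHOUTED AT THE TOP OF HER LUNGS!")
    - mean_activation("She spoke in a calm, quiet voice.")
)

def inject_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states are the first element.
    hidden = output[0] + 4.0 * concept  # amplify the concept direction
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(inject_hook)
ids = tokenizer("Today I want to talk about", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20, do_sample=False)
handle.remove()
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

With the hook in place, every forward pass is steered toward the injected direction; Anthropic's extra step was asking the model, in natural language, whether it noticed the manipulation at all.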
Real-time video, dating AI, the penny economy, GPT-OSS Safeguard, and more...
Large language models (LLMs) are not only good at understanding and generating text; they can also turn raw text into numerical representations called embeddings.
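As a minimal illustration of that idea, the snippet below embeds two sentences with the sentence-transformers library (one common choice; the article itself may use a different stack) and compares them with cosine similarity.

```python
# Minimal example of turning text into embeddings with sentence-transformers.
# The model name is just a popular default, not a requirement.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["LLMs can generate text.", "LLMs can also embed text as vectors."]
embeddings = model.encode(sentences)  # array of shape (2, 384)

# Cosine similarity between the two sentence vectors.
a, b = embeddings
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(embeddings.shape, round(float(similarity), 3))
```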
In this article, we’ll break down the essentials of using APIs for data collection — why they matter, how they work, and how to get started with them in Python.
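As a minimal, hypothetical starting point (the endpoint and query parameters below are placeholders, not a real service), collecting data from a JSON API in Python usually comes down to a request, an error check and parsing the response:

```python
# Sketch of pulling records from a paginated JSON API with requests.
# The URL and parameters are placeholders for whatever API you use.
import requests

def fetch_page(url: str, params: dict) -> list:
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()  # surface HTTP errors early
    return response.json()       # assumes the endpoint returns a JSON list

if __name__ == "__main__":
    records = fetch_page("https://api.example.com/items",
                         {"page": 1, "per_page": 100})
    print(f"fetched {len(records)} records")
```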
How do AI videos end up on Donald Trump’s social media accounts? WIRED investigates.