Latest AI News & Updates

#mit energy initiative #energy #energy efficiency #electric grid #artificial intelligence #emissions #technology and policy #cleaner industry #sustainability #climate change #research

MIT faculty and MITEI member company experts address power demand from data centers.

#ai

Even as concern and skepticism grow over U.S. AI startup OpenAI's buildout strategy and heavy spending commitments, Chinese open-source AI providers are escalating their competition, and one has even caught up to OpenAI's flagship paid proprietary model, GPT-5, on key third-party performance benchmarks with a new, free model. The Chinese AI startup Moonshot AI's new Kimi K2 Thinking model, released today, has vaulted past both proprietary and open-weight competitors to claim the top position in reasoning, coding, and agentic-tool benchmarks. Despite being fully open-source, the model now outperforms OpenAI's GPT-5, Anthropic's Claude Sonnet 4.5 (Thinking mode), and xAI's Grok-4 on several standard evaluations — an inflection point for the competitiveness of open AI systems.

Developers can access the model via platform.moonshot.ai and kimi.com; weights and code are hosted on Hugging Face. The open release includes APIs for chat, reasoning, and multi-tool workflows. Users can also try Kimi K2 Thinking directly through Moonshot's own ChatGPT-like web app and on a Hugging Face space.

Modified Standard Open Source License

Moonshot AI has formally released Kimi K2 Thinking under a Modified MIT License on Hugging Face. The license grants full commercial and derivative rights — meaning individual researchers and developers working on behalf of enterprise clients can access it freely and use it in commercial applications — but adds one restriction: "If the software or any derivative product serves over 100 million monthly active users or generates over $20 million USD per month in revenue, the deployer must prominently display 'Kimi K2' on the product's user interface."

For most research and enterprise applications, this clause functions as a light-touch attribution requirement while preserving the freedoms of standard MIT licensing. It makes K2 Thinking one of the most permissively licensed frontier-class models currently available.

A New Benchmark Leader

Kimi K2 Thinking is a Mixture-of-Experts (MoE) model built around one trillion parameters, of which 32 billion activate per inference. It combines long-horizon reasoning with structured tool use, executing up to 200–300 sequential tool calls without human intervention. According to Moonshot's published test results, K2 Thinking achieved:

- 44.9% on Humanity's Last Exam (HLE), a state-of-the-art score
- 60.2% on BrowseComp, an agentic web-search and reasoning test
- 71.3% on SWE-Bench Verified and 83.1% on LiveCodeBench v6, key coding evaluations
- 56.3% on Seal-0, a benchmark for real-world information retrieval

Across these tasks, K2 Thinking consistently outperforms GPT-5's corresponding scores and surpasses the previous open-weight leader MiniMax-M2, released just weeks earlier by Chinese rival MiniMax AI.

Open Model Outperforms Proprietary Systems

GPT-5 and Claude Sonnet 4.5 Thinking remain the leading proprietary "thinking" models. Yet in the same benchmark suite, K2 Thinking's agentic reasoning scores exceed both: for instance, on BrowseComp the open model's 60.2% decisively leads GPT-5's 54.9% and Claude 4.5's 24.1%. K2 Thinking also edges GPT-5 on GPQA Diamond (85.7% vs. 84.5%) and matches it on mathematical reasoning tasks such as AIME 2025 and HMMT 2025. Only in certain heavy-mode configurations — where GPT-5 aggregates multiple trajectories — does the proprietary model regain parity.

That Moonshot's fully open-weight release can meet or exceed GPT-5's scores marks a turning point. The gap between closed frontier systems and publicly available models has effectively collapsed for high-end reasoning and coding.
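For a rough sense of what the architecture numbers cited earlier imply, here is a back-of-envelope calculation using only figures from this article (one trillion total parameters, 32 billion active per inference, INT4 weights). These are illustrative estimates, not Moonshot's published specifications:

```python
# Back-of-envelope arithmetic (not from Moonshot's documentation) using the figures
# cited above: ~1 trillion total parameters, ~32 billion active per token, INT4 weights.
total_params = 1.0e12          # reported total parameter count
active_params = 32e9           # reported parameters activated per inference step
bytes_per_param_int4 = 0.5     # 4-bit quantized weights are roughly half a byte each

weight_memory_tb = total_params * bytes_per_param_int4 / 1e12
active_fraction = active_params / total_params

print(f"Approx. weight storage at INT4: {weight_memory_tb:.1f} TB")       # ~0.5 TB
print(f"Fraction of parameters active per token: {active_fraction:.1%}")  # ~3.2%
```

Only about 3% of the network participates in any single forward pass, which is what keeps per-token compute closer to a mid-sized dense model than to a full trillion-parameter one.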
Surpassing MiniMax-M2: The Previous Open-Source Benchmark

When VentureBeat profiled MiniMax-M2 just a week and a half ago, it was hailed as the "new king of open-source LLMs," achieving top scores among open-weight systems:

- τ²-Bench: 77.2
- BrowseComp: 44.0
- FinSearchComp-global: 65.5
- SWE-Bench Verified: 69.4

Those results placed MiniMax-M2 near GPT-5-level capability in agentic tool use. Yet Kimi K2 Thinking now eclipses them by wide margins. Its BrowseComp result of 60.2% exceeds M2's 44.0%, and its SWE-Bench Verified 71.3% edges out M2's 69.4%. Even on financial-reasoning tasks such as FinSearchComp-T3 (47.4%), K2 Thinking performs comparably while maintaining superior general-purpose reasoning.

Technically, both models adopt sparse Mixture-of-Experts architectures for compute efficiency, but Moonshot's network activates more experts and deploys advanced quantization-aware training (INT4 QAT). This design doubles inference speed relative to standard precision without degrading accuracy — critical for long "thinking-token" sessions reaching 256K-token context windows.

Agentic Reasoning and Tool Use

K2 Thinking's defining capability lies in its explicit reasoning trace. The model outputs an auxiliary field, reasoning_content, revealing intermediate logic before each final response. This transparency preserves coherence across long multi-turn tasks and multi-step tool calls.

A reference implementation published by Moonshot demonstrates how the model autonomously conducts a "daily news report" workflow: invoking date and web-search tools, analyzing retrieved content, and composing structured output — all while maintaining internal reasoning state. This end-to-end autonomy enables the model to plan, search, execute, and synthesize evidence across hundreds of steps, mirroring the emerging class of "agentic AI" systems that operate with minimal supervision.

Efficiency and Access

Despite its trillion-parameter scale, K2 Thinking's runtime cost remains modest. Moonshot lists usage at:

- $0.15 per 1M tokens (cache hit)
- $0.60 per 1M tokens (cache miss)
- $2.50 per 1M output tokens

These rates are competitive even against MiniMax-M2's $0.30 input / $1.20 output pricing — and an order of magnitude below GPT-5 ($1.25 input / $10 output).

Comparative Context: Open-Weight Acceleration

The rapid succession of M2 and K2 Thinking illustrates how quickly open-source research is catching frontier systems. MiniMax-M2 demonstrated that open models could approach GPT-5-class agentic capability at a fraction of the compute cost. Moonshot has now advanced that frontier further, pushing open weights beyond parity into outright leadership.

Both models rely on sparse activation for efficiency, but K2 Thinking's higher activation count (32B vs. 10B active parameters) yields stronger reasoning fidelity across domains. Its test-time scaling — expanding "thinking tokens" and tool-calling turns — provides measurable performance gains without retraining, a feature not yet observed in MiniMax-M2.
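The article describes the reasoning_content field and Moonshot's "daily news report" reference workflow only at a high level. The sketch below shows the general shape of such an agentic tool-calling loop against an OpenAI-compatible chat endpoint; the base URL, model name, and tool schema are illustrative assumptions, not Moonshot's published reference code:

```python
# Illustrative sketch of a multi-step tool-calling loop; the endpoint, model name, and
# tool are assumptions for demonstration, not Moonshot's reference implementation.
import json
from openai import OpenAI  # any OpenAI-compatible client works for this pattern

client = OpenAI(base_url="https://platform.moonshot.ai/v1", api_key="YOUR_KEY")  # assumed base URL

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return result snippets.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]},
    },
}]

def web_search(query: str) -> str:
    return f"(stub) top results for: {query}"  # replace with a real search backend

messages = [{"role": "user", "content": "Compile a short daily news report on open-weight LLMs."}]

for _ in range(10):  # the real model reportedly sustains hundreds of such turns
    reply = client.chat.completions.create(
        model="kimi-k2-thinking",      # assumed model identifier
        messages=messages,
        tools=tools,
    ).choices[0].message

    # Many "thinking" models expose an intermediate trace alongside the answer; the
    # article says K2 Thinking returns it in a field called reasoning_content.
    print("reasoning:", getattr(reply, "reasoning_content", None))

    if not reply.tool_calls:           # no more tools requested: final answer reached
        print("final answer:", reply.content)
        break

    messages.append({"role": "assistant",
                     "content": reply.content or "",
                     "tool_calls": [tc.model_dump() for tc in reply.tool_calls]})
    for call in reply.tool_calls:      # execute each requested tool and feed results back
        args = json.loads(call.function.arguments)
        messages.append({"role": "tool",
                         "tool_call_id": call.id,
                         "content": web_search(**args)})
```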
Technical Outlook

Moonshot reports that K2 Thinking supports native INT4 inference and 256K-token contexts with minimal performance degradation. Its architecture integrates quantization, parallel trajectory aggregation ("heavy mode"), and Mixture-of-Experts routing tuned for reasoning tasks. In practice, these optimizations allow K2 Thinking to sustain complex planning loops — code compile–test–fix, search–analyze–summarize — over hundreds of tool calls. This capability underpins its superior results on BrowseComp and SWE-Bench, where reasoning continuity is decisive.

Enormous Implications for the AI Ecosystem

The convergence of open and closed models at the high end signals a structural shift in the AI landscape. Enterprises that once relied exclusively on proprietary APIs can now deploy open alternatives matching GPT-5-level reasoning while retaining full control of weights, data, and compliance.

Moonshot's open publication strategy follows the precedent set by DeepSeek R1, Qwen3, GLM-4.6, and MiniMax-M2, but extends it to full agentic reasoning. For academic and enterprise developers, K2 Thinking provides both transparency and interoperability — the ability to inspect reasoning traces and fine-tune performance for domain-specific agents.

The arrival of K2 Thinking signals that Moonshot — a young startup founded in 2023 with investment from some of China's biggest app makers and tech companies — is here to play in an intensifying competition, and it comes amid growing scrutiny of the financial sustainability of AI's largest players. Just a day ago, OpenAI CFO Sarah Friar sparked controversy after suggesting at a WSJ Tech Live event that the U.S. government might eventually need to provide a "backstop" for the company's more than $1.4 trillion in compute and data-center commitments — a comment widely interpreted as a call for taxpayer-backed loan guarantees. Although Friar later clarified that OpenAI was not seeking direct federal support, the episode reignited debate about the scale and concentration of AI capital spending.

With OpenAI, Microsoft, Meta, and Google all racing to secure long-term chip supply, critics warn of an unsustainable investment bubble and an "AI arms race" driven more by strategic fear than commercial returns — one that could "blow up" and take down the entire global economy with it if hesitation or market uncertainty sets in, given how many trades and valuations now assume continued heavy AI investment and massive returns.

Against that backdrop, Moonshot AI's and MiniMax's open-weight releases put more pressure on U.S. proprietary AI firms and their backers to justify the size of their investments and their paths to profitability. If an enterprise customer can get comparable or better performance from a free, open-source Chinese AI model than from paid, proprietary solutions like OpenAI's GPT-5, Anthropic's Claude Sonnet 4.5, or Google's Gemini 2.5 Pro, why would it continue paying to access the proprietary models? Already, Silicon Valley stalwarts like Airbnb have raised eyebrows by admitting to heavily using Chinese open-source alternatives like Alibaba's Qwen over OpenAI's proprietary offerings.

For investors and enterprises, these developments suggest that high-end AI capability is no longer synonymous with high-end capital expenditure.
The most advanced reasoning systems may now come not from companies building gigascale data centers, but from research groups optimizing architectures and quantization for efficiency. In that sense, K2 Thinking's benchmark dominance is not just a technical milestone — it's a strategic one, arriving at a moment when the AI market's biggest question has shifted from how powerful models can become to who can afford to sustain them.

What It Means for Enterprises Going Forward

Within weeks of MiniMax-M2's ascent, Kimi K2 Thinking has overtaken it — along with GPT-5 and Claude 4.5 — across nearly every reasoning and agentic benchmark. The model demonstrates that open-weight systems can now meet or surpass proprietary frontier models in both capability and efficiency.

For the AI research community, K2 Thinking represents more than another open model: it is evidence that the frontier has become collaborative. The best-performing reasoning model available today is not a closed commercial product but an open-source system accessible to anyone.

#classes and programs #computer science and technology #artificial intelligence #algorithms #machine learning #data #students #graduate, postdoctoral #research #electrical engineering and computer science (eecs) #laboratory for information and decision systems (lids) #mit-ibm watson ai lab #school of engineering #institute for medical engineering and science (imes) #mit schwarzman college of computing #computer science and artificial intelligence laboratory (csail)

MIT PhD students who interned with the MIT-IBM Watson AI Lab Summer Program are pushing AI tools to be more flexible, efficient, and grounded in truth.

#ai #big data #data infrastructure

Google Cloud is introducing what it calls its most powerful artificial intelligence infrastructure to date, unveiling a seventh-generation Tensor Processing Unit and expanded Arm-based computing options designed to meet surging demand for AI model deployment — what the company characterizes as a fundamental industry shift from training models to serving them to billions of users.

The announcement, made Thursday, centers on Ironwood, Google's latest custom AI accelerator chip, which will become generally available in the coming weeks. In a striking validation of the technology, Anthropic, the AI safety company behind the Claude family of models, disclosed plans to access up to one million of these TPU chips — a commitment worth tens of billions of dollars and among the largest known AI infrastructure deals to date.

The move underscores an intensifying competition among cloud providers to control the infrastructure layer powering artificial intelligence, even as questions mount about whether the industry can sustain its current pace of capital expenditure. Google's approach — building custom silicon rather than relying solely on Nvidia's dominant GPU chips — amounts to a long-term bet that vertical integration from chip design through software will deliver superior economics and performance.

Why companies are racing to serve AI models, not just train them

Google executives framed the announcements around what they call "the age of inference" — a transition point where companies shift resources from training frontier AI models to deploying them in production applications serving millions or billions of requests daily.

"Today's frontier models, including Google's Gemini, Veo, and Imagen and Anthropic's Claude, train and serve on Tensor Processing Units," said Amin Vahdat, vice president and general manager of AI and Infrastructure at Google Cloud. "For many organizations, the focus is shifting from training these models to powering useful, responsive interactions with them."

This transition has profound implications for infrastructure requirements. Where training workloads can often tolerate batch processing and longer completion times, inference — the process of actually running a trained model to generate responses — demands consistently low latency, high throughput, and unwavering reliability. A chatbot that takes 30 seconds to respond, or a coding assistant that frequently times out, becomes unusable regardless of the underlying model's capabilities.

Agentic workflows — where AI systems take autonomous actions rather than simply responding to prompts — create particularly complex infrastructure challenges, requiring tight coordination between specialized AI accelerators and general-purpose computing.

Inside Ironwood's architecture: 9,216 chips working as one supercomputer

Ironwood is more than an incremental improvement over Google's sixth-generation TPUs. According to technical specifications shared by the company, it delivers more than four times better performance for both training and inference workloads compared to its predecessor — gains that Google attributes to a system-level co-design approach rather than simply increasing transistor counts.

The architecture's most striking feature is its scale. A single Ironwood "pod" — a tightly integrated unit of TPU chips functioning as one supercomputer — can connect up to 9,216 individual chips through Google's proprietary Inter-Chip Interconnect network operating at 9.6 terabits per second.
To put that bandwidth in perspective, it's roughly equivalent to downloading the entire Library of Congress in under two seconds. This massive interconnect fabric allows the 9,216 chips to share access to 1.77 petabytes of High Bandwidth Memory — memory fast enough to keep pace with the chips' processing speeds. That's approximately 40,000 high-definition Blu-ray movies' worth of working memory, instantly accessible by thousands of processors simultaneously. "For context, that means Ironwood Pods can deliver 118x more FP8 ExaFLOPS versus the next closest competitor," Google stated in technical documentation.

The system employs Optical Circuit Switching technology that acts as a "dynamic, reconfigurable fabric." When individual components fail or require maintenance — inevitable at this scale — the OCS technology automatically reroutes data traffic around the interruption within milliseconds, allowing workloads to continue running without user-visible disruption.

This reliability focus reflects lessons learned from deploying five previous TPU generations. Google reported that its fleet-wide uptime for liquid-cooled systems has maintained approximately 99.999% availability since 2020 — equivalent to less than six minutes of downtime per year.

Anthropic's billion-dollar bet validates Google's custom silicon strategy

Perhaps the most significant external validation of Ironwood's capabilities comes from Anthropic's commitment to access up to one million TPU chips — a staggering figure in an industry where even clusters of 10,000 to 50,000 accelerators are considered massive.

"Anthropic and Google have a longstanding partnership and this latest expansion will help us continue to grow the compute we need to define the frontier of AI," said Krishna Rao, Anthropic's chief financial officer, in the official partnership agreement. "Our customers — from Fortune 500 companies to AI-native startups — depend on Claude for their most important work, and this expanded capacity ensures we can meet our exponentially growing demand."

According to a separate statement, Anthropic will have access to "well over a gigawatt of capacity coming online in 2026" — enough electricity to power a small city. The company specifically cited TPUs' "price-performance and efficiency" as key factors in the decision, along with "existing experience in training and serving its models with TPUs."

Industry analysts estimate that a commitment to access one million TPU chips, with associated infrastructure, networking, power, and cooling, likely represents a multi-year contract worth tens of billions of dollars — among the largest known cloud infrastructure commitments in history.

James Bradbury, Anthropic's head of compute, elaborated on the inference focus: "Ironwood's improvements in both inference performance and training scalability will help us scale efficiently while maintaining the speed and reliability our customers expect."

Google's Axion processors target the computing workloads that make AI possible

Alongside Ironwood, Google introduced expanded options for its Axion processor family — custom Arm-based CPUs designed for general-purpose workloads that support AI applications but don't require specialized accelerators.

The N4A instance type, now entering preview, targets what Google describes as "microservices, containerized applications, open-source databases, batch, data analytics, development environments, experimentation, data preparation and web serving jobs that make AI applications possible."
The company claims N4A delivers up to 2X better price-performance than comparable current-generation x86-based virtual machines. Google is also previewing C4A metal, its first bare-metal Arm instance, which provides dedicated physical servers for specialized workloads such as Android development, automotive systems, and software with strict licensing requirements.

The Axion strategy reflects a growing conviction that the future of computing infrastructure requires both specialized AI accelerators and highly efficient general-purpose processors. While a TPU handles the computationally intensive task of running an AI model, Axion-class processors manage data ingestion, preprocessing, application logic, API serving, and countless other tasks in a modern AI application stack.

Early customer results suggest the approach delivers measurable economic benefits. Vimeo reported observing "a 30% improvement in performance for our core transcoding workload compared to comparable x86 VMs" in initial N4A tests. ZoomInfo measured "a 60% improvement in price-performance" for data processing pipelines running on Java services, according to Sergei Koren, the company's chief infrastructure architect.

Software tools turn raw silicon performance into developer productivity

Hardware performance means little if developers cannot easily harness it. Google emphasized that Ironwood and Axion are integrated into what it calls AI Hypercomputer — "an integrated supercomputing system that brings together compute, networking, storage, and software to improve system-level performance and efficiency." According to an October 2025 IDC Business Value Snapshot study, AI Hypercomputer customers achieved on average 353% three-year return on investment, 28% lower IT costs, and 55% more efficient IT teams.

Google disclosed several software enhancements designed to maximize Ironwood utilization. Google Kubernetes Engine now offers advanced maintenance and topology awareness for TPU clusters, enabling intelligent scheduling and highly resilient deployments. The company's open-source MaxText framework now supports advanced training techniques including Supervised Fine-Tuning and Generative Reinforcement Policy Optimization.

Perhaps most significant for production deployments, Google's Inference Gateway intelligently load-balances requests across model servers to optimize critical metrics. According to Google, it can reduce time-to-first-token latency by 96% and serving costs by up to 30% through techniques like prefix-cache-aware routing.

The Inference Gateway monitors key metrics including KV cache hits, GPU or TPU utilization, and request queue length, then routes incoming requests to the optimal replica. For conversational AI applications where multiple requests might share context, routing requests with shared prefixes to the same server instance can dramatically reduce redundant computation.
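To make the routing idea concrete, here is a minimal sketch of prefix-cache-aware load balancing in the spirit described above. It illustrates the general technique, not Google's Inference Gateway implementation, and the scoring weight is an arbitrary assumption:

```python
# Minimal illustration of prefix-cache-aware routing (not Google's Inference Gateway).
# Each replica remembers which prompt prefixes it has cached; the router prefers a
# replica that already holds the longest matching prefix, penalized by its queue depth.
from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    queue_len: int = 0
    cached_prefixes: set[str] = field(default_factory=set)

def shared_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt: str, replicas: list[Replica]) -> Replica:
    def score(r: Replica) -> float:
        best_hit = max((shared_prefix_len(prompt, p) for p in r.cached_prefixes), default=0)
        return best_hit - 50 * r.queue_len   # weight is an arbitrary illustrative trade-off

    chosen = max(replicas, key=score)
    chosen.queue_len += 1
    chosen.cached_prefixes.add(prompt)       # the served prompt now warms that replica's cache
    return chosen

replicas = [Replica("tpu-a"), Replica("tpu-b")]
print(route("System: you are a support bot. User: reset my password", replicas).name)
print(route("System: you are a support bot. User: update my email", replicas).name)  # likely same replica
```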
"ML will require more than 500 kW per IT rack before 2030."Google is collaborating with Meta and Microsoft to standardize electrical and mechanical interfaces for high-voltage DC distribution. The company selected 400 VDC specifically to leverage the supply chain established by electric vehicles, "for greater economies of scale, more efficient manufacturing, and improved quality and scale."On cooling, Google revealed it will contribute its fifth-generation cooling distribution unit design to the Open Compute Project. The company has deployed liquid cooling "at GigaWatt scale across more than 2,000 TPU Pods in the past seven years" with fleet-wide availability of approximately 99.999%.Water can transport approximately 4,000 times more heat per unit volume than air for a given temperature change — critical as individual AI accelerator chips increasingly dissipate 1,000 watts or more.Custom silicon gambit challenges Nvidia's AI accelerator dominanceGoogle's announcements come as the AI infrastructure market reaches an inflection point. While Nvidia maintains overwhelming dominance in AI accelerators — holding an estimated 80-95% market share — cloud providers are increasingly investing in custom silicon to differentiate their offerings and improve unit economics.Amazon Web Services pioneered this approach with Graviton Arm-based CPUs and Inferentia / Trainium AI chips. Microsoft has developed Cobalt processors and is reportedly working on AI accelerators. Google now offers the most comprehensive custom silicon portfolio among major cloud providers.The strategy faces inherent challenges. Custom chip development requires enormous upfront investment — often billions of dollars. The software ecosystem for specialized accelerators lags behind Nvidia's CUDA platform, which benefits from 15+ years of developer tools. And rapid AI model architecture evolution creates risk that custom silicon optimized for today's models becomes less relevant as new techniques emerge.Yet Google argues its approach delivers unique advantages. "This is how we built the first TPU ten years ago, which in turn unlocked the invention of the Transformer eight years ago — the very architecture that powers most of modern AI," the company noted, referring to the seminal "Attention Is All You Need" paper from Google researchers in 2017.The argument is that tight integration — "model research, software, and hardware development under one roof" — enables optimizations impossible with off-the-shelf components.Beyond Anthropic, several other customers provided early feedback. Lightricks, which develops creative AI tools, reported that early Ironwood testing "makes us highly enthusiastic" about creating "more nuanced, precise, and higher-fidelity image and video generation for our millions of global customers," said Yoav HaCohen, the company's research director.Google's announcements raise questions that will play out over coming quarters. Can the industry sustain current infrastructure spending, with major AI companies collectively committing hundreds of billions of dollars? Will custom silicon prove economically superior to Nvidia GPUs? 
How will model architectures evolve? For now, Google appears committed to a strategy that has defined the company for decades: building custom infrastructure to enable applications impossible on commodity hardware, then making that infrastructure available to customers who want similar capabilities without the capital investment.

As the AI industry transitions from research labs to production deployments serving billions of users, that infrastructure layer — the silicon, software, networking, power, and cooling that make it all run — may prove as important as the models themselves. And if Anthropic's willingness to commit to accessing up to one million chips is any indication, Google's bet on custom silicon designed specifically for the age of inference may be paying off just as demand reaches its inflection point.

#research #human-computer interaction #programming #machine learning #software #artificial intelligence #computer science and technology #programming languages #electrical engineering and computer science (eecs) #computer science and artificial intelligence laboratory (csail) #school of engineering #mit schwarzman college of computing

The coding framework uses modular concepts and simple synchronization rules to make software clearer, safer, and easier for LLMs to generate.

#ai

Presented by Salesforce

Vibe coding — the fast-growing trend of using generative AI to spin up code from plain-language prompts — is quick, creative, and great for instant prototypes. But many argue that it's not cut out for building production-ready business apps with the security, governance, and trusted infrastructure that enterprises require. In other words, a few saved hours in development can mean a future full of security vulnerabilities, endless maintenance, and scalability headaches, says Mohith Shrivastava, principal developer advocate at Salesforce.

"For rapid experimentation, building minimum viable products, and tackling creative challenges, vibe coding is a game-changer," Shrivastava says. "However, that same speed and improvisational nature are exactly what makes its application in a professional, enterprise setting a topic of intense debate. And the skepticism from the developer community is 100% justified."

Risks and rewards of vibe coding

The excitement is all about speed: going from a rough idea to a working prototype in hours, not weeks, is a massive advantage. But as Shrivastava shared, developers have been vocal about the potential downsides.

"When you apply vibe coding indiscriminately to an entire application stack, you're not just moving fast; you're accumulating risk at an unprecedented rate," Shrivastava explains. "The cons are significant." That includes potential security nightmares, as AI models don't typically take into consideration the company's specific security policies. They can easily introduce vulnerabilities like hardcoded secrets or use insecure, hallucinated packages. Then there's the issue of what Shrivastava calls "spaghetti code on steroids," or verbose code that lacks a coherent architectural pattern, creating a mountain of technical debt.

Equally concerning is the illusion of progress: vibe coding may complete 80% of a feature in record time, but the remaining 20% — the edge cases, performance tuning, and compliance work — becomes exponentially harder. But does this mean vibe coding has no place in the enterprise?

"The idea that you can just vibe your way to a complex, secure, and maintainable enterprise application is a dangerous fantasy," Shrivastava says. "But — the pros are undeniable if it's used correctly. The key is not to avoid vibe coding, but to apply it intelligently in your enterprise."

Red and green zones: Enterprise-grade vibe coding

You can't, and you absolutely should not, vibe code your entire enterprise stack with just any generic tool, Shrivastava warns. But when paired with no-, low-, or pro-code tools that are built for the enterprise, many of the gaps can be addressed. An enterprise-grade vibe coding solution, for example, can automatically scan for security issues, flag performance bottlenecks, and provide a safety net. It's also critical to understand which parts of an application suit this approach — and which demand a higher level of trust and control. Shrivastava divides the stack into red and green zones to illustrate.

The green zone is the presentation layer, or the UI and UX. It's ideal for vibe coding, where developers can move fast and iterate quickly without much risk. In contrast is the red zone, which covers the foundational pillars of an application, including business logic and data layers.

Empowering developers in the green zone

Developer expertise remains the foundation for effective and safe vibe coding.
But developers can be amplified by AI tools and emerging agents that are grounded in business context and connected to real applications, integrations, and data flows.

"A generic AI agent can't grasp your company's unique processes, but a context-aware tool can act as a powerful pair programmer, helping a developer draft complex logic or model data with greater speed and accuracy," Shrivastava says. "It's about making the expert developer more efficient, not trying to do their job for them."

Some areas will always be high risk for ungoverned AI — especially infrastructure and security. Letting a generic AI agent configure firewalls or Identity and Access Management (IAM) policies without oversight, Shrivastava warns, is a recipe for disaster. The solution isn't to avoid the red zone entirely, but to approach it with the right tools — ones that embed governance, security, and context from the ground up.

"The winning strategy is clear: Vibe code the green zone for agility, approach the red zone by augmenting your developers with powerful, context-aware tools, and never, ever DIY your core infrastructure with AI," he says.

Embracing enterprise vibe coding

To harness the power of enterprise vibe coding, Salesforce developed Agentforce Vibes. This new vibe coding offering for the enterprise includes Agentforce, an autonomous AI agent built to collaborate like a pair programmer on the Salesforce Platform. It's designed precisely to provide developers with the right tools for the job, covering both the green and red zones. For the green zone, it offers the speed and agility to rapidly build UIs and prototypes. But its true power lies in how it augments developers in the red zone.

"Enterprise vibe coding like Agentforce lets organizations take AI-assisted development to the organizational level, accelerating coding, testing, and deployment, while ensuring consistency, security, and performance," says Dan Fernandez, VP of product, developer services at Salesforce. "It's not about throwing away governance for speed; it's about integrating AI into every stage of the application lifecycle to work smarter."

Because Agentforce Vibes' tooling is deeply integrated with your business context on the platform, it can safely assist with business logic and data modeling. Most importantly, it operates on a trusted platform. Instead of a DIY approach — jury-rigging a generic AI agent to handle your networking — developers build on a foundation that has security and governance built in, so they can innovate safely, knowing the most critical layers of the stack are secure and compliant.

Major enterprises are putting vibe coding to work

Agentforce Vibes users are now tapping the tool to build around 20 to 25% of their new code base, according to Salesforce data, and users are accepting around 1.2 million lines of agentic code per month. That includes companies like Coinbase, CGI, Grupo Globo, and one of the top five banks in the U.S., which is using Agentforce Vibes capabilities to develop production-ready apps faster.

Agentforce Vibes is part of a suite of tools in Agentforce 360 that span from no-code and low-code to pro-code development. Together, these tools are helping customers develop and deploy at speeds previously unheard of. With the low-code Agent Builder in Agentforce, the Secret Escapes team was able to build, test, and launch their agent to support customer service in just two weeks, compared to the six months it had previously taken the company to build and train a bot.
With Agentforce, 1-800Accountant autonomously resolved 70% of customer chat engagements during tax week in 2025, without writing a line of code, using Salesforce's low-code tools and AI assistance. Meanwhile, media company Grupo Globo deployed agents to identify subscribers at risk of lapsing, offer personalized upgrades, cross-sell, and convert non-subscribers. As a result, Agentforce boosted Globo's retention rates by 22% in less than three months.

Innovation meets discipline

Enterprise tools show that disciplined engineering and creative experimentation can coexist — and that balance, Shrivastava says, is the key to lasting innovation.

"Vibe coding is not a fad, but it's also not a silver bullet that will replace disciplined software engineering," Shrivastava says. "The smart path forward is a hybrid approach where human software skills are augmented with agentic intelligence. This balanced approach is how you get the best of both worlds: radical innovation at the edge and unwavering stability at the core."

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they're always clearly marked. For more information, contact sales@venturebeat.com.

#ai

Presented by Arm

AI is no longer confined to the cloud or data centers. Increasingly, it's running directly where data is created — in devices, sensors, and networks at the edge. This shift toward on-device intelligence is being driven by latency, privacy, and cost concerns that companies are confronting as they continue their investments in AI. For leadership teams, the opportunity is clear, says Chris Bergey, SVP and GM of Arm's Client Business: Invest in AI-first platforms that complement cloud usage, deliver real-time responsiveness, and protect sensitive data.

"With the explosion of connected devices and the rise of IoT, edge AI provides a significant opportunity for organizations to gain a competitive edge through faster, more efficient AI," Bergey explains. "Those who move first aren't just improving efficiency, they're redefining what customers expect. AI is becoming a differentiator in trust, responsiveness, and innovation. The sooner a business makes AI central to its workflows, the faster it compounds that advantage."

Use cases: Deploying AI where data lives

Enterprises are discovering that edge AI isn't just a performance boost — it's a new operational model. Processing locally means less dependency on the cloud and faster, safer decision-making in real time. For instance, a factory floor can analyze equipment data instantly to prevent downtime, while a hospital can run diagnostic models securely on-site. Retailers are deploying in-store analytics using vision systems, while logistics companies are using on-device AI to optimize fleet operations. Instead of sending vast data volumes to the cloud, organizations can analyze and act on insights where they emerge. The result is a more responsive, privacy-preserving, and cost-effective AI architecture.

The consumer expectation: Immediacy and trust

Working with the team behind Alibaba's Taobao, China's largest ecommerce platform, Arm (Nasdaq: ARM) enabled on-device product recommendations that update instantly without depending on the cloud. This helped online shoppers find what they need faster while keeping browsing data private. Another example comes from consumer tech: Meta's Ray-Ban smart glasses, which blend cloud and on-device AI. The glasses handle quick commands locally for faster responses, while heavier tasks like translation and visual recognition are processed in the cloud.

"Every major technology shift has created new ways to engage and monetize," Bergey says. "As AI capabilities and user expectations grow, more intelligence will need to move closer to the edge to deliver this kind of immediacy and trust that people now expect."

This shift is also taking place with the tools people use every day. Assistants like Microsoft Copilot and Google Gemini are blending cloud and on-device intelligence to bring generative AI closer to the user, delivering faster, more secure, and more context-aware experiences. That same principle applies across industries: the more intelligence you move safely and efficiently to the edge, the more responsive, private, and valuable your operations become.

Building smarter for scale

The explosion of AI at the edge demands not only smarter chips but smarter infrastructure. By aligning compute power with workload demands, enterprises can reduce energy consumption while maintaining high performance. This balance of sustainability and scale is fast becoming a competitive differentiator.
"Compute needs, whether in the cloud or on-premises, will continue to rise sharply. The question becomes: how do you maximize value from that compute?" he said. "You can only do this by investing in compute platforms and software that scale with your AI ambitions. The real measure of progress is enterprise value creation, not raw efficiency metrics."

The intelligent foundation

The rapid evolution of AI models, especially those powering edge inferencing, multimodal applications, and low-latency responses, demands not just smarter algorithms but a foundation of highly performant, energy-efficient hardware. As workloads grow more diverse and distributed, legacy architectures designed for traditional workloads are no longer adequate.

The role of CPUs is evolving, and they now sit at the center of increasingly heterogeneous systems that deliver advanced on-device AI experiences. Thanks to their flexibility, efficiency, and mature software support, modern CPUs can run everything from classic machine learning to complex generative AI workloads. When paired with accelerators such as NPUs or GPUs, they intelligently coordinate compute across the system — ensuring the right workload runs on the right engine for maximum performance and efficiency. The CPU continues to be the foundation that enables scalable, efficient AI everywhere.

Technologies like Arm's Scalable Matrix Extension 2 (SME2) bring advanced matrix acceleration to Armv9 CPUs. Meanwhile, Arm KleidiAI, its intelligent software layer, is extensively integrated across leading frameworks to automatically boost performance for a wide range of AI workloads, from language models to speech recognition to computer vision, running on Arm-based edge devices — without needing developers to rewrite their code.

"These technologies ensure that AI frameworks can tap into the full performance of Arm-based systems without extra developer effort," he says. "It's how we make AI both scalable and sustainable: by embedding intelligence into the foundation of modern compute, so innovation happens at the speed of software, not hardware cycles."

That democratization of compute power is also what will facilitate the next wave of intelligent, real-time experiences across the enterprise, not just in flagship products, but across entire device portfolios.

The evolution of edge AI

As AI moves from isolated pilots to full-scale deployment, the enterprises that succeed will be those that connect intelligence across every layer of infrastructure. Agentic AI systems will depend on this seamless integration — enabling autonomous processes that can reason, coordinate, and deliver value instantly.

"The pattern is familiar: as in every disruptive wave, incumbents that move slowly risk being overtaken by new entrants," he says. "The companies that thrive will be the ones that wake up every morning asking how to make their organization AI-first. As with the rise of the internet and cloud computing, those who lean in and truly become AI-enabled will shape the next decade."

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they're always clearly marked. For more information, contact sales@venturebeat.com.

#ai

By now, enterprises understand that retrieval augmented generation (RAG) allows applications and agents to find the best, most grounded information for queries. However, typical RAG setups can be an engineering challenge and can exhibit undesirable traits. To help solve this, Google released the File Search Tool on the Gemini API, a fully managed RAG system "that abstracts away the retrieval pipeline."

File Search removes much of the tooling and integration work involved in setting up RAG pipelines, so engineers don't need to stitch together components like storage solutions and embedding generators. The tool competes directly with enterprise RAG products from OpenAI, AWS and Microsoft, which also aim to simplify RAG architecture. Google, though, claims its offering requires less orchestration and is more standalone.

"File Search provides a simple, integrated and scalable way to ground Gemini with your data, delivering responses that are more accurate, relevant and verifiable," Google said in a blog post.

Enterprises can access some features of File Search, such as storage and embedding generation, for free at query time. Users begin paying for embeddings when files are indexed, at a fixed rate of $0.15 per 1 million tokens. Google's Gemini Embedding model, which eventually became the top embedding model on the Massive Text Embedding Benchmark, powers File Search.

File Search and integrated experiences

Google said File Search works "by handling the complexities of RAG for you." File Search manages file storage, chunking strategies and embeddings. Developers can invoke File Search within the existing generateContent API, which Google said makes the tool easier to adopt. File Search uses vector search to "understand the meaning and context of a user's query." Ideally, it will find the relevant information to answer a query from documents, even if the prompt contains inexact words. The feature has built-in citations that point to the specific parts of a document it used to generate answers, and it supports a variety of file formats, including PDF, DOCX, TXT, JSON and "many common programming language file types," Google says.

Continuous RAG experimentation

Enterprises may have already begun building out a RAG pipeline as they lay the groundwork for their AI agents to tap the correct data and make informed decisions. Because RAG represents a key part of how enterprises maintain accuracy and tap into insights about their business, organizations need clear visibility into this pipeline.

RAG can be an engineering pain because orchestrating multiple tools together can become complicated. Building "traditional" RAG pipelines means organizations must assemble and fine-tune a file ingestion and parsing program, including chunking, embedding generation and updates. They must then contract a vector database like Pinecone, determine its retrieval logic, and fit it all within a model's context window. Additionally, they can, if desired, add source citations.

File Search aims to streamline all of that, although competitor platforms offer similar features. OpenAI's Assistants API allows developers to use a file search feature, guiding an agent to relevant documents for responses. AWS's Bedrock unveiled a managed data automation service in December. While File Search is similar to these other platforms, Google's offering abstracts all, rather than just some, elements of RAG pipeline creation.
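For readers who want a feel for the developer workflow, here is a minimal sketch using the google-genai Python SDK: upload and index a file into a store, then query it through generateContent. The store-creation call, field names, and model identifier are assumptions based on Google's announcement, so exact signatures should be checked against the Gemini API documentation:

```python
# Hedged sketch of the File Search flow described above (upload, index, then query via
# generateContent). Field and method names are assumptions, not verified API signatures.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# 1) Create a file search store and index a document into it (names are assumed).
store = client.file_search_stores.create(config={"display_name": "support-docs"})
client.file_search_stores.upload_to_file_search_store(
    file_search_store_name=store.name,
    file="refund_policy.pdf",
)

# 2) Ask a question grounded in the indexed files through generateContent.
response = client.models.generate_content(
    model="gemini-2.5-flash",  # illustrative model choice
    contents="What is our refund window for annual plans?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(file_search=types.FileSearch(
            file_search_store_names=[store.name],
        ))],
    ),
)

print(response.text)  # grounded answer; citation metadata rides along in the response
```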
Phaser Studio, the creator of AI-driven game generation platform Beam, said in Google's blog that it used File Search to sift through its library of 3,000 files. "File Search allows us to instantly surface the right material, whether that's a code snippet for bullet patterns, genre templates or architectural guidance from our Phaser 'brain' corpus," said Phaser CTO Richard Davey. "The result is ideas that once took days to prototype now become playable in minutes."

Since the announcement, several users expressed interest in using the feature.

#ai

Google Cloud has introduced a big update in a bid to keep AI developers on its Vertex AI platform for conceiving, designing, building, testing, deploying and modifying AI agents in enterprise use cases. The new features, announced today, include additional governance tools for enterprises and expanded capabilities for creating agents with just a few lines of code, moving faster with state-of-the-art context management layers and one-click deployment, as well as managed services for scaling production and evaluation, and support for identifying agents.

Agent Builder, released last year during Google's annual Cloud Next event, provides a no-code platform for enterprises to create agents and connect them to orchestration frameworks like LangChain. Google's Agent Development Kit (ADK), which lets developers build agents "in under 100 lines of code," can also be accessed through Agent Builder.

"These new capabilities underscore our commitment to Agent Builder, and simplify the agent development process to meet developers where they are, no matter which tech stack they choose," said Mike Clark, director of product management, Vertex AI Agent Builder.

Build agents faster

Part of Google's pitch for Agent Builder's new features is that enterprises can bake in orchestration even as they construct their agents. "Building an agent from a concept to a working product involves complex orchestration," said Clark. The new capabilities, which ship with the ADK, include:

- State-of-the-art context management layers, including Static, Turn, User and Cache layers, so enterprises have more control over an agent's context
- Prebuilt plugins with customizable logic; one of the new plugins allows agents to recognize failed tool calls and "self-heal" by retrying the task with a different approach
- Additional language support in ADK, including Go, alongside the Python and Java support that launched with ADK
- One-click deployment through the ADK command line interface, to move agents from a local environment to live testing with a single command

Governance layer

Enterprises require high accuracy; security; observability and auditability (what a program did and why); and steerability (control) in their production-grade AI agents. While Google had observability features in the local development environment at launch, developers can now access these tools through the Agent Engine managed runtime dashboard. The company said this brings cloud-based production monitoring to track token consumption, error rates and latency. Within this observability dashboard, enterprises can visualize the actions agents take and reproduce any issues. Agent Engine will also have a new Evaluation Layer to help "simulate agent performance across a vast array of user interactions and situations."

This governance layer will also include:

- Agent Identities, which Google said give "agents their own unique, native identities within Google Cloud"
- Model Armor, which would block prompt injections and screen tool calls and agent responses
- Security Command Center, so admins can build an inventory of their agents to detect threats like unauthorized access

"These native identities provide a deep, built-in layer of control and a clear audit trail for all agent actions. These certificate-backed identities further strengthen your security as they cannot be impersonated and are tied directly to the agent's lifecycle, eliminating the risk of dormant accounts," Clark said.
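To give a sense of what "an agent in under 100 lines" looks like, here is a minimal sketch in the style of the ADK's Python quickstart; the model name and the tool are placeholder assumptions, and exact class and parameter names should be checked against the ADK documentation:

```python
# Minimal agent sketch in the style of Google's Agent Development Kit quickstart.
# The model name and the tool are illustrative assumptions, not a verified configuration.
from google.adk.agents import Agent

def get_order_status(order_id: str) -> dict:
    """Look up an order's shipping status (stubbed for demonstration)."""
    return {"order_id": order_id, "status": "shipped", "eta_days": 2}

root_agent = Agent(
    name="order_support_agent",
    model="gemini-2.5-flash",                  # assumed model identifier
    description="Answers customer questions about order status.",
    instruction="Use the get_order_status tool when the user asks about an order.",
    tools=[get_order_status],                  # plain Python functions are exposed as tools
)

# Running `adk run` or `adk web` in this project directory would serve the agent locally;
# the article's one-click deployment refers to promoting it from there via the ADK CLI.
```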
The battle of agent builders

It's no surprise that model providers create platforms to build agents and bring them to production. The competition lies in how fast new tools and features are added. Google's Agent Builder competes with OpenAI's open-source Agents SDK, which enables developers to create AI agents using non-OpenAI models. Additionally, there is the recently announced AgentKit, which features an Agent Builder that enables companies to integrate agents into their applications easily. Microsoft has its Azure AI Foundry, launched around this time last year for AI agent creation, and AWS also offers agent builders on its Bedrock platform, but Google is hoping its suite of new features will help give it a competitive edge.

However, it isn't just companies with their own models that court developers to build AI agents within their platforms. Any enterprise service provider with an agent library also wants clients to make agents on their systems. Capturing developer interest and keeping developers within the ecosystem is now the big battle among tech companies, fought with features that make building and governing agents easier.

#research #computer science and technology #algorithms #artificial intelligence #machine learning #robotics #computer vision #autonomous vehicles #aeronautical and astronautical engineering #laboratory for information and decision systems (lids) #electrical engineering and computer science (eecs) #school of engineering #mit schwarzman college of computing #national science foundation (nsf) #disaster response

A new approach developed at MIT could help a search-and-rescue robot navigate an unpredictable environment by rapidly generating an accurate map of its surroundings.

#ai

The latest big headline in AI isn't model size or multimodality — it's the capacity crunch. At VentureBeat's latest AI Impact stop in NYC, Val Bercovici, chief AI officer at WEKA, joined Matt Marshall, VentureBeat CEO, to discuss what it really takes to scale AI amid rising latency, cloud lock-in, and runaway costs.

Those forces, Bercovici argued, are pushing AI toward its own version of surge pricing. Uber famously introduced surge pricing, bringing real-time market rates to ridesharing for the first time. Now, Bercovici argued, AI is headed toward the same economic reckoning — especially for inference — when the focus turns to profitability.

"We don't have real market rates today. We have subsidized rates. That's been necessary to enable a lot of the innovation that's been happening, but sooner or later — considering the trillions of dollars of capex we're talking about right now, and the finite energy opex — real market rates are going to appear; perhaps next year, certainly by 2027," he said. "When they do, it will fundamentally change this industry and drive an even deeper, keener focus on efficiency."

The economics of the token explosion

"The first rule is that this is an industry where more is more. More tokens equal exponentially more business value," Bercovici said. But so far, no one's figured out how to make that sustainable. The classic business triad — cost, quality, and speed — translates in AI to latency, cost, and accuracy (especially in output tokens). And accuracy is non-negotiable. That holds not only for consumer interactions with agents like ChatGPT, but for high-stakes use cases such as drug discovery and business workflows in heavily regulated industries like financial services and healthcare.

"That's non-negotiable," Bercovici said. "You have to have a high amount of tokens for high inference accuracy, especially when you add security into the mix, guardrail models, and quality models. Then you're trading off latency and cost. That's where you have some flexibility. If you can tolerate high latency, and sometimes you can for consumer use cases, then you can have lower cost, with free tiers and low cost-plus tiers."

However, latency is a critical bottleneck for AI agents. "These agents now don't operate in any singular sense. You either have an agent swarm or no agentic activity at all," Bercovici noted.

In a swarm, groups of agents work in parallel to complete a larger objective. An orchestrator agent — the smartest model — sits at the center, determining subtasks and key requirements: architecture choices, cloud vs. on-prem execution, performance constraints, and security considerations. The swarm then executes all subtasks, effectively spinning up numerous concurrent inference users in parallel sessions. Finally, evaluator models judge whether the overall task was successfully completed.

"These swarms go through what's called multiple turns, hundreds if not thousands of prompts and responses until the swarm converges on an answer," Bercovici said. "And if you have a compound delay in those thousand turns, it becomes untenable. So latency is really, really important. And that means typically having to pay a high price today that's subsidized, and that's what's going to have to come down over time."
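As an illustration of the orchestrator/worker/evaluator pattern Bercovici describes, here is a schematic sketch. The function names and the call_model helper are hypothetical stand-ins for whatever inference API a team actually uses:

```python
# Schematic sketch of the swarm pattern described above: an orchestrator plans subtasks,
# workers execute them in parallel, and an evaluator decides whether another turn is needed.
# call_model() is a hypothetical stand-in for any chat/inference API.
from concurrent.futures import ThreadPoolExecutor

def call_model(role: str, prompt: str) -> str:
    return f"[{role}] response to: {prompt}"        # replace with a real inference call

def orchestrate(objective: str) -> list[str]:
    plan = call_model("orchestrator", f"Break into subtasks: {objective}")
    return [s for s in plan.split(";") if s.strip()] or [objective]

def evaluate(objective: str, results: list[str]) -> bool:
    verdict = call_model("evaluator", f"Did these results achieve '{objective}'? {results}")
    return "yes" in verdict.lower()                 # toy acceptance check

def run_swarm(objective: str, max_turns: int = 5) -> list[str]:
    for _ in range(max_turns):                      # each turn adds latency, which compounds quickly
        subtasks = orchestrate(objective)
        with ThreadPoolExecutor() as pool:          # subtasks run as concurrent inference sessions
            results = list(pool.map(lambda t: call_model("worker", t), subtasks))
        if evaluate(objective, results):
            return results
    return results                                  # best effort once the turn budget is spent

print(run_swarm("Summarize this quarter's infrastructure spend"))
```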
Reinforcement learning as the new paradigm

Until around May of this year, agents weren't that performant, Bercovici explained. Then context windows became large enough, and GPUs available enough, to support agents that could complete advanced tasks, like writing reliable software. It's now estimated that, in some cases, 90% of software is generated by coding agents. Now that agents have essentially come of age, Bercovici noted, reinforcement learning is the new conversation among data scientists at leading labs like OpenAI, Anthropic, and Google's Gemini team, who view it as a critical path forward in AI innovation.

"The current AI season is reinforcement learning. It blends many of the elements of training and inference into one unified workflow," Bercovici said. "It's the latest and greatest scaling law to this mythical milestone we're all trying to reach called AGI — artificial general intelligence," he added. "What's fascinating to me is that you have to apply all the best practices of how you train models, plus all the best practices of how you infer models, to be able to iterate these thousands of reinforcement learning loops and advance the whole field."

The path to AI profitability

There's no one answer when it comes to building an infrastructure foundation to make AI profitable, Bercovici said, since it's still an emerging field. There's no cookie-cutter approach. Going all on-prem may be the right choice for some — especially frontier model builders — while being cloud-native or running in a hybrid environment may be a better path for organizations looking to innovate agilely and responsively. Regardless of which path they choose initially, organizations will need to adapt their AI infrastructure strategy as their business needs evolve.

"Unit economics are what fundamentally matter here," said Bercovici. "We are definitely in a boom, or even in a bubble, you could say, in some cases, since the underlying AI economics are being subsidized. But that doesn't mean that if tokens get more expensive, you'll stop using them. You'll just get very fine-grained in terms of how you use them."

Leaders should focus less on individual token pricing and more on transaction-level economics, where efficiency and impact become visible, Bercovici concludes. The pivotal question enterprises and AI companies should be asking, he said, is: "What is the real cost for my unit economics?" Viewed through that lens, the path forward isn't about doing less with AI — it's about doing it smarter and more efficiently at scale.

#ai

Presented by Elastic

Logs are set to become the primary tool for finding the "why" in diagnosing network incidents.

Modern IT environments have a data problem: there's too much of it. Teams that manage a company's environment are increasingly challenged to detect and diagnose issues in real time, optimize performance, improve reliability, and ensure security and compliance — all within constrained budgets. The modern observability landscape has many tools that offer a solution. Most revolve around DevOps teams or Site Reliability Engineers (SREs) analyzing logs, metrics, and traces to uncover patterns, figure out what's happening across the network, and diagnose why an issue or incident occurred.

The problem is that the process creates information overload: a Kubernetes cluster alone can emit 30 to 50 gigabytes of logs a day, and suspicious behavior patterns can sneak past human eyes.

"It's so anachronistic now, in the world of AI, to think about humans alone observing infrastructure," says Ken Exner, chief product officer at Elastic. "I hate to break it to you, but machines are better than human beings at pattern matching."

An industry-wide focus on visualizing symptoms forces engineers to manually hunt for answers. The crucial "why" is buried in logs, but because they contain massive volumes of unstructured data, the industry tends to use them as a tool of last resort. This has forced teams into costly tradeoffs: either spend countless hours building complex data pipelines, drop valuable log data and risk critical visibility gaps, or log and forget.

Elastic, the Search AI Company, recently released a new observability feature called Streams, which aims to become the primary signal for investigations by taking noisy logs and turning them into patterns, context and meaning. Streams uses AI to automatically partition and parse raw logs to extract relevant fields, greatly reducing the effort required of SREs to make logs usable. Streams also automatically surfaces significant events, such as critical errors and anomalies, from context-rich logs, giving SREs early warnings and a clear understanding of their workloads and enabling them to investigate and resolve issues faster. The ultimate goal is to show remediation steps.

"From raw, voluminous, messy data, Streams automatically creates structure, putting it into a form that is usable, automatically alerts you to issues and helps you remediate them," Exner says. "That is the magic of Streams."
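As a purely illustrative example of the kind of transformation described here (turning raw, unstructured log lines into structured, queryable fields that can drive alerts), here is a small sketch. It is not Elastic's Streams implementation or API, just the general idea:

```python
# Toy illustration of log structuring: raw text lines become structured fields that can
# be filtered and alerted on. This is a generic sketch, not Elastic Streams itself.
import re

RAW_LOGS = [
    '2025-11-07T10:02:11Z payments ERROR timeout calling card-gateway order=8812 latency_ms=5021',
    '2025-11-07T10:02:12Z payments INFO processed order=8813 latency_ms=87',
]

PATTERN = re.compile(
    r'(?P<ts>\S+) (?P<service>\S+) (?P<level>\w+) (?P<message>.+?)'
    r'(?: order=(?P<order>\d+))?(?: latency_ms=(?P<latency_ms>\d+))?$'
)

structured = [m.groupdict() for line in RAW_LOGS if (m := PATTERN.match(line))]

# With structure in place, "significant events" become simple queries instead of eyeballing text.
alerts = [e for e in structured if e["level"] == "ERROR" or int(e["latency_ms"] or 0) > 1000]
for event in alerts:
    print(f'ALERT {event["ts"]} {event["service"]}: {event["message"]}')
```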
That means SREs are hopping from tool to tool to keep on top of monitoring and troubleshooting across their infrastructure and applications."You’re hopping across different tools. You’re relying on a human to interpret these things, visually look at the relationship between systems in a service map, visually look at graphs on a metrics dashboard, to figure out what and where the issue is, " Exner says. "But AI automates that workflow away." With AI-powered Streams, logs are not just used reactively to resolve issues, but also to proactively process potential issues and create information-rich alerts that help teams jump straight to problem-solving, offering a solution for remediation or even fixing the issue entirely, before automatically notifying the team that it's been taken care of."I believe that logs, the richest set of information, the original signal type, will start driving a lot of the automation that a service reliability engineer typically does today, and does very manually," he adds. "A human should not be in that process, where they are doing this by digging into themselves, trying to figure out what is going on, where and what the issue is, and then once they find the root cause, they’re trying to figure out how to debug it."Observability’s future Large language models (LLMs) could be a key player in the future of observability. LLMs excel at recognizing patterns in vast quantities of repetitive data, which closely resembles log and telemetry data in complex, dynamic systems. And today’s LLMs can be trained for specific IT processes. With automation tooling, the LLM has the information and tools it needs to resolve database errors or Java heap issues, and more. Incorporating those into platforms that bring context and relevance will be essential. Automated remediation will still take some time, Exner says, but automated runbooks and playbooks generated by LLMs will become standard practice within the next couple of years. In other words, remediation steps will be driven by LLMs. The LLM will offer up fixes, and the human will verify and implement them, rather than calling in an expert.Addressing skill shortagesGoing all in on AI for observability would help address a major shortage in the talent needed to manage IT infrastructure. Hiring is slow because organizations need teams with a great deal of experience and understanding of potential issues, and how to resolve them fast. That experience can come from an LLM that is contextually grounded, Exner says."We can help deal with the skill shortage by augmenting people with LLMs that make them all instantly experts," he explains. "I think this is going to make it much easier for us to take novice practitioners and make them expert practitioners in both security and observability, and it’s going to make it possible for a more novice practitioner to act like an expert.” Streams in Elastic Observability is available now. Get started by reading more on the Streams. Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

#ai

The intelligence of AI models isn't what's blocking enterprise deployments. It's the inability to define and measure quality in the first place.That's where AI judges are now playing an increasingly important role. In AI evaluation, a "judge" is an AI system that scores outputs from another AI system. Judge Builder is Databricks' framework for creating judges and was first deployed as part of the company's Agent Bricks technology earlier this year. The framework has evolved significantly since its initial launch in response to direct user feedback and deployments.Early versions focused on technical implementation but customer feedback revealed the real bottleneck was organizational alignment. Databricks now offers a structured workshop process that guides teams through three core challenges: getting stakeholders to agree on quality criteria, capturing domain expertise from limited subject matter experts and deploying evaluation systems at scale."The intelligence of the model is typically not the bottleneck, the models are really smart," Jonathan Frankle, Databricks' chief AI scientist, told VentureBeat in an exclusive briefing. "Instead, it's really about asking, how do we get the models to do what we want, and how do we know if they did what we wanted?"The 'Ouroboros problem' of AI evaluationJudge Builder addresses what Pallavi Koppol, a Databricks research scientist who led the development, calls the "Ouroboros problem."  An Ouroboros is an ancient symbol that depicts a snake eating its own tail. Using AI systems to evaluate AI systems creates a circular validation challenge."You want a judge to see if your system is good, if your AI system is good, but then your judge is also an AI system," Koppol explained. "And now you're saying like, well, how do I know this judge is good?"The solution is measuring "distance to human expert ground truth" as the primary scoring function. By minimizing the gap between how an AI judge scores outputs versus how domain experts would score them, organizations can trust these judges as scalable proxies for human evaluation.This approach differs fundamentally from traditional guardrail systems or single-metric evaluations. Rather than asking whether an AI output passed or failed on a generic quality check, Judge Builder creates highly specific evaluation criteria tailored to each organization's domain expertise and business requirements.The technical implementation also sets it apart. Judge Builder integrates with Databricks' MLflow and prompt optimization tools and can work with any underlying model. Teams can version control their judges, track performance over time and deploy multiple judges simultaneously across different quality dimensions.Lessons learned: Building judges that actually workDatabricks' work with enterprise customers revealed three critical lessons that apply to anyone building AI judges.Lesson one: Your experts don't agree as much as you think. When quality is subjective, organizations discover that even their own subject matter experts disagree on what constitutes acceptable output. A customer service response might be factually correct but use an inappropriate tone. A financial summary might be comprehensive but too technical for the intended audience."One of the biggest lessons of this whole process is that all problems become people problems," Frankle said. "The hardest part is getting an idea out of a person's brain and into something explicit. 
And the harder part is that companies are not one brain, but many brains."The fix is batched annotation with inter-rater reliability checks. Teams annotate examples in small groups, then measure agreement scores before proceeding. This catches misalignment early. In one case, three experts gave ratings of 1, 5 and neutral for the same output before discussion revealed they were interpreting the evaluation criteria differently.Companies using this approach achieve inter-rater reliability scores as high as 0.6 compared to typical scores of 0.3 from external annotation services. Higher agreement translates directly to better judge performance because the training data contains less noise.Lesson two: Break down vague criteria into specific judges. Instead of one judge evaluating whether a response is "relevant, factual and concise," create three separate judges. Each targets a specific quality aspect. This granularity matters because a failing "overall quality" score reveals something is wrong but not what to fix.The best results come from combining top-down requirements such as regulatory constraints, stakeholder priorities, with bottom-up discovery of observed failure patterns. One customer built a top-down judge for correctness but discovered through data analysis that correct responses almost always cited the top two retrieval results. This insight became a new production-friendly judge that could proxy for correctness without requiring ground-truth labels.Lesson three: You need fewer examples than you think. Teams can create robust judges from just 20-30 well-chosen examples. The key is selecting edge cases that expose disagreement rather than obvious examples where everyone agrees."We're able to run this process with some teams in as little as three hours, so it doesn't really take that long to start getting a good judge," Koppol said.Production results: From pilots to seven-figure deploymentsFrankle shared three metrics Databricks uses to measure Judge Builder's success: whether customers want to use it again, whether they increase AI spending and whether they progress further in their AI journey.On the first metric, one customer created more than a dozen judges after their initial workshop. "This customer made more than a dozen judges after we walked them through doing this in a rigorous way for the first time with this framework," Frankle said. "They really went to town on judges and are now measuring everything."For the second metric, the business impact is clear. "There are multiple customers who have gone through this workshop and have become seven-figure spenders on GenAI at Databricks in a way that they weren't before," Frankle said.The third metric reveals Judge Builder's strategic value. Customers who previously hesitated to use advanced techniques like reinforcement learning now feel confident deploying them because they can measure whether improvements actually occurred."There are customers who have gone and done very advanced things after having had these judges where they were reluctant to do so before," Frankle said. "They've moved from doing a little bit of prompt engineering to doing reinforcement learning with us. 
Why spend the money on reinforcement learning, and why spend the energy on reinforcement learning if you don't know whether it actually made a difference?"What enterprises should do nowThe teams successfully moving AI from pilot to production treat judges not as one-time artifacts but as evolving assets that grow with their systems.Databricks recommends three practical steps. First, focus on high-impact judges by identifying one critical regulatory requirement plus one observed failure mode. These become your initial judge portfolio.Second, create lightweight workflows with subject matter experts. A few hours reviewing 20-30 edge cases provides sufficient calibration for most judges. Use batched annotation and inter-rater reliability checks to denoise your data.Third, schedule regular judge reviews using production data. New failure modes will emerge as your system evolves. Your judge portfolio should evolve with them."A judge is a way to evaluate a model, it's also a way to create guardrails, it's also a way to have a metric against which you can do prompt optimization and it's also a way to have a metric against which you can do reinforcement learning," Frankle said. "Once you have a judge that you know represents your human taste in an empirical form that you can query as much as you want, you can use it in 10,000 different ways to measure or improve your agents."
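To ground the idea of scoring a judge by its distance to human expert ground truth, here is a minimal sketch. It is not Databricks' Judge Builder or MLflow API; the rating data and the mean-absolute-distance metric are illustrative stand-ins for whatever agreement measure a team actually adopts.

```python
# Illustrative only, not the Judge Builder API. Scores a candidate AI judge by
# its distance to expert ground truth, and first checks how well the experts
# agree with each other before treating their labels as ground truth at all.
from statistics import mean

# Ratings on a 1-5 quality scale for the same five agent outputs.
expert_a = [5, 4, 2, 5, 1]
expert_b = [5, 3, 2, 4, 1]
judge    = [4, 4, 2, 5, 2]   # scores produced by the AI judge under evaluation

def mean_abs_distance(a, b):
    """Average absolute gap between two raters; 0.0 means perfect agreement."""
    return mean(abs(x - y) for x, y in zip(a, b))

# 1) Inter-rater reliability proxy: do the experts even agree with each other?
expert_gap = mean_abs_distance(expert_a, expert_b)

# 2) Distance to ground truth: compare the judge to the experts' consensus.
consensus = [round((x + y) / 2) for x, y in zip(expert_a, expert_b)]
judge_gap = mean_abs_distance(judge, consensus)

print(f"expert-to-expert gap: {expert_gap:.2f}")
print(f"judge-to-consensus gap: {judge_gap:.2f}")
# A judge is only trustworthy if its gap is comparable to, or smaller than,
# the disagreement between the experts themselves.
```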

#ai

When the transformer architecture was introduced in 2017 in the now-seminal Google paper "Attention Is All You Need," it became an instant cornerstone of modern artificial intelligence. Every major large language model (LLM) — from OpenAI's GPT series to Anthropic's Claude, Google's Gemini, and Meta's Llama — has been built on some variation of its central mechanism: attention, the mathematical operation that allows a model to look back across its entire input and decide what information matters most.

Eight years later, the same mechanism that defined AI's golden age is showing its limits. Attention is powerful, but it is also expensive — its computational and memory costs scale quadratically with context length, creating an increasingly unsustainable bottleneck for both research and industry. As models aim to reason across documents, codebases, or video streams lasting hours or days, attention becomes the architecture's Achilles' heel.

On October 28, 2025, the little-known AI startup Manifest AI introduced a radical alternative. Its new model, Brumby-14B-Base, is a retrained variant of Qwen3-14B-Base, one of the leading open-source transformer models. But while many variants of Qwen have been trained already, Brumby-14B-Base is novel in that it abandons attention altogether. Instead, Brumby replaces those layers with a novel mechanism called Power Retention — a recurrent, hardware-efficient architecture that stores and updates information over arbitrarily long contexts without the quadratic memory growth of attention.

Trained at a stated cost of just $4,000, the 14-billion-parameter Brumby model performs on par with established transformer models like Qwen3-14B and GLM-4.5-Air, achieving near-state-of-the-art accuracy on a range of reasoning and comprehension benchmarks.

From Attention to Retention: The Architectural Shift

The core of Manifest AI's innovation lies in what it calls the Power Retention layer. In a traditional transformer, every token computes a set of queries (Q), keys (K), and values (V), then performs a matrix operation that measures the similarity between every token and every other token — essentially a full pairwise comparison across the sequence. This is what gives attention its flexibility, but also what makes it so costly: processing a sequence twice as long takes roughly four times the compute and memory.

Power Retention keeps the same inputs (Q, K, V), but replaces the global similarity operation with a recurrent state update. Each layer maintains a memory matrix S, which is updated at each time step according to the incoming key, value, and a learned gating signal. The process looks more like an RNN (recurrent neural network) than a transformer: instead of recomputing attention over the entire context, the model continuously compresses past information into a fixed-size latent state.

This means the computational cost of Power Retention does not grow with context length. Whether the model is processing 1,000 or 1,000,000 tokens, the per-token cost remains constant. That property alone — constant-time per-token computation — marks a profound departure from transformer behavior.

At the same time, Power Retention preserves the expressive power that made attention successful. Because the recurrence involves tensor powers of the input (hence the name "power retention"), it can represent higher-order dependencies between past and present tokens.
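A minimal sketch of this kind of gated recurrent-state update, in numpy, is shown below. The exact Power Retention rule involves tensor powers of the input and is not fully specified here, so the gate, state shape, and readout are illustrative assumptions rather than Manifest AI's implementation.

```python
# Illustrative gated recurrent-state update in the spirit described above.
# NOT Manifest AI's actual Power Retention rule: the gate, state shape, and
# readout below are simplifying assumptions chosen for clarity.
import numpy as np

d = 8                                   # toy head dimension
rng = np.random.default_rng(0)
S = np.zeros((d, d))                    # fixed-size memory matrix per layer/head

def step(S, k, v, q, g):
    """One token step: constant cost regardless of how many tokens came before."""
    S = g * S + np.outer(k, v)          # decay old memory, write the new key/value pair
    y = q @ S                           # read the state against the current query
    return S, y

tokens = 1000                           # could be 1,000,000; per-token cost is unchanged
for _ in range(tokens):
    k, v, q = rng.standard_normal((3, d))
    g = 1 / (1 + np.exp(-rng.standard_normal()))   # stand-in for a learned gate in (0, 1)
    S, y = step(S, k, v, q, g)

print(S.shape, y.shape)                 # (8, 8) (8,): the state never grows with context
```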
The result is an architecture that can theoretically retain long-term dependencies indefinitely, while remaining as efficient as an RNN and as expressive as a transformer.

Retraining, Not Rebuilding

Perhaps the most striking aspect of Brumby-14B's training process is its efficiency. Manifest AI trained the model for only 60 hours on 32 Nvidia H100 GPUs, at a cost of roughly $4,000 — less than 2% of what a conventional model of this scale would cost to train from scratch. However, since it relied on a transformer-based model, it's safe to say that this advance alone will not end the transformer era.

As Jacob Buckman, founder of Manifest AI, clarified in an email to VentureBeat: "The ability to train for $4,000 is indeed only possible when leveraging an existing transformer model. Brumby could not be trained from scratch for that price."

Still, Buckman emphasized the significance of that result: "The reason this is important is that the ability to build on the weights of the previous generation of model architectures is a critical accelerant for the adoption of a new modeling paradigm." He argues this demonstrates how attention-free systems can catch up to transformer performance "for orders-of-magnitude less" investment.

In the loss curves released by Manifest AI, Brumby's training loss quickly converges to that of the Qwen3 baseline within 3,000 training steps, even as the architecture diverges significantly from its transformer origins.

Although Brumby-14B-Base began life as Qwen3-14B-Base, it did not remain identical for long. Manifest AI fundamentally altered Qwen3's architecture by removing its attention layers — the mathematical engine that defines how a transformer model processes information — and replacing them with its new "power retention" mechanism. This change restructured the model's internal wiring, effectively giving it a new brain while preserving much of its prior knowledge.

Because of that architectural swap, the existing Qwen3 weights no longer fit perfectly. They were trained to operate within a transformer's attention dynamics, not the new retention-based system. As a result, the Brumby model initially "forgot" how to apply some of its learned knowledge effectively. The retraining process — about 3,000 steps of additional learning — served to recalibrate those weights, aligning them with the power retention framework without having to start from zero.

A helpful way to think about this is to imagine taking a world-class pianist and handing them a guitar. They already understand rhythm, harmony, and melody, but their hands must learn entirely new patterns to produce the same music. Similarly, Brumby had to relearn how to use its existing knowledge through a new computational instrument. Those 3,000 training steps were, in effect, its crash course in guitar lessons.

By the end of this short retraining phase, Brumby had regained its full performance, reaching the same accuracy as the original Qwen3 model.
That quick recovery is what makes the result so significant: it shows that an attention-free system can inherit and adapt the capabilities of a transformer model with only a fraction of the training time and cost. The benchmark progression plots show a similar trend: the model rapidly approaches its target accuracy on core evaluations like GSM8K, HellaSwag, and MMLU after only a few thousand steps, matching or even slightly surpassing Qwen3 on several tasks.

Benchmarking the Brumby

Across standard evaluation tasks, Brumby-14B-Base consistently performs at or near parity with transformer baselines of comparable scale.

Task | Brumby-14B | Qwen3-14B | GLM-4.5-Air | Nemotron Nano (12B)
ARC | 0.89 | 0.94 | 0.92 | 0.93
GSM8K | 0.88 | 0.84 | 0.83 | 0.84
GSM8K (Platinum) | 0.87 | 0.88 | 0.85 | 0.87
HellaSwag | 0.77 | 0.81 | 0.85 | 0.82
MATH | 0.62 | 0.54 | 0.47 | 0.26
MBPP | 0.57 | 0.75 | 0.73 | 0.71
MMLU | 0.71 | 0.78 | 0.77 | 0.78
MMLU (Pro) | 0.36 | 0.55 | 0.51 | 0.53

While it lags behind transformers on knowledge-heavy evaluations like MMLU-Pro, it matches or outperforms them on mathematical reasoning and long-context tasks — precisely where attention architectures tend to falter. This pattern reinforces the idea that recurrent or retention-based systems may hold a structural advantage for reasoning over extended temporal or logical dependencies.

Hardware Efficiency and Inference Performance

Brumby's power retention design offers another major advantage: hardware efficiency. Because the state update involves only local matrix operations, inference can be implemented with linear complexity in sequence length. Manifest AI reports that its fastest kernels, developed through its in-house CUDA framework Vidrial, can deliver hundreds-fold speedups over attention on very long contexts.

Buckman said the alpha-stage Power Retention kernels "achieve typical hardware utilization of 80–85%, which is higher than FlashAttention2's 70–75% or Mamba's 50–60%." (Mamba is another emerging "post-transformer" architecture, developed by Carnegie Mellon scientists in 2023, that, like Power Retention, seeks to eliminate the computational bottleneck of attention. It replaces attention with a state-space mechanism that processes sequences linearly, updating an internal state over time rather than comparing every token to every other one. This makes it far more efficient for long inputs, though it typically achieves lower hardware utilization than Power Retention in early tests.)

Both Power Retention and Mamba, he added, "expend meaningfully fewer total FLOPs than FlashAttention2 on long contexts, as well as far less memory." According to Buckman, the reported 100× speedup comes from this combined improvement in utilization and computational efficiency, though he noted that "we have not yet stress-tested it on production-scale workloads."

Training and Scaling Economics

Perhaps no statistic in the Brumby release generated more attention than the training cost. A 14-billion-parameter model retrained for $4,000 represents a two-order-of-magnitude reduction in the cost of foundation model development. Buckman confirmed that the low cost reflects a broader scaling pattern. "Far from diminishing returns, we have found that ease of retraining improves with scale," he said.
“The number of steps required to successfully retrain a model decreases with its parameter count.” Manifest has not yet validated the cost of retraining models at 700B parameters, but Buckman projected a range of $10,000–$20,000 for models of that magnitude—still far below transformer training budgets.He also reiterated that this approach could democratize large-scale experimentation by allowing smaller research groups or companies to retrain or repurpose existing transformer checkpoints without prohibitive compute costs.Integration and DeploymentAccording to Buckman, converting an existing transformer into a Power Retention model is designed to be simple. “It is straightforward for any company that is already retraining, post-training, or fine-tuning open-source models,” he said. “Simply pip install retention, change one line of your architecture code, and resume training where you left off.”He added that after only a small number of GPU-hours, the model typically recovers its original performance—at which point it gains the efficiency benefits of the attention-free design. “The resulting architecture will permit far faster long-context training and inference than previously,” Buckman noted.On infrastructure, Buckman said the main Brumby kernels are written in Triton, compatible with both NVIDIA and AMD accelerators. Specialized CUDA kernels are also available through the team’s in-house Vidrial framework. Integration with vLLM and other inference engines remains a work in progress: “We have not yet integrated Power Retention into inference engines, but doing so is a major ongoing initiative at Manifest.”As for distributed inference, Buckman dismissed concerns about instability: “We have not found this difficulty to be exacerbated in any way by our recurrent-state architecture. In fact, context-parallel training and GPU partitioning for multi-user inference both become significantly cleaner technically when using our approach.”Mission and Long-Term VisionBeyond the engineering details, Buckman also described Manifest’s broader mission. “Our mission is to train a neural network to model all human output,” he said. The team’s goal, he explained, is to move beyond modeling “artifacts of intelligence” toward modeling “the intelligent processes that generated them.” This shift, he argued, requires “fundamentally rethinking” how models are designed and trained—work that Power Retention represents only the beginning of.The Brumby-14B release, he said, is “one step forward in a long march” toward architectures that can model thought processes continuously and efficiently.Public Debate and Industry ReceptionThe launch of Brumby-14B sparked immediate discussion on X (formerly Twitter), where researchers debated the framing of Manifest AI’s announcement. Some, including Meta researcher Ariel (@redtachyon), argued that the “$4,000 foundation model” tagline was misleading, since the training involved reusing pretrained transformer weights rather than training from scratch.“They shuffled around the weights of Qwen, fine-tuned it a bit, and called it ‘training a foundation model for $4k,’” Ariel wrote.Buckman responded publicly, clarifying that the initial tweet had been part of a longer thread explaining the retraining approach. “It’s not like I was being deceptive about it,” he wrote. “I broke it up into separate tweets, and now everyone is mad about the first one.”In a follow-up email, Buckman took a measured view of the controversy. 
“The end of the transformer era is not yet here,” he reiterated, “but the march has begun.” He also acknowledged that the $4,000 claim, though technically accurate in context, had drawn attention precisely because it challenged expectations about what it costs to experiment at frontier scale.Conclusion: A Crack in the Transformer’s Wall?The release of Brumby-14B-Base is more than an engineering milestone; it is a proof of concept that the transformer’s dominance may finally face credible competition. By replacing attention with power retention, Manifest AI has demonstrated that performance parity with state-of-the-art transformers is possible at a fraction of the computational cost—and that the long-context bottleneck can be broken without exotic hardware.The broader implications are twofold. First, the economics of training and serving large models could shift dramatically, lowering the barrier to entry for open research and smaller organizations. Second, the architectural diversity of AI models may expand again, reigniting theoretical and empirical exploration after half a decade of transformer monoculture.As Buckman put it: “The end of the transformer era is not yet here. Our release is just one step forward in a long march toward the future.”
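As a quick sanity check on the training figures reported above (60 hours on 32 H100s for roughly $4,000), the implied GPU rental rate works out as follows; the hourly price is derived from the article's numbers, not a figure Manifest AI has published.

```python
# Back-of-the-envelope check on the reported Brumby retraining budget.
gpus = 32                 # Nvidia H100s, as reported
hours = 60                # wall-clock training time, as reported
budget_usd = 4_000        # stated training cost

gpu_hours = gpus * hours                  # 1,920 GPU-hours
implied_rate = budget_usd / gpu_hours     # roughly $2.08 per H100-hour

print(f"{gpu_hours} GPU-hours, implied rate ${implied_rate:.2f}/GPU-hour")
```

An implied rate of roughly $2 per H100-hour is broadly in line with budget cloud GPU pricing, which is consistent with reading the headline figure as the compute bill for the retraining run rather than an end-to-end development cost.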

#ai

Market researchers have embraced artificial intelligence at a staggering pace, with 98% of professionals now incorporating AI tools into their work and 72% using them daily or more frequently, according to a new industry survey that reveals both the technology's transformative promise and its persistent reliability problems.The findings, based on responses from 219 U.S. market research and insights professionals surveyed in August 2025 by QuestDIY, a research platform owned by The Harris Poll, paint a picture of an industry caught between competing pressures: the demand to deliver faster business insights and the burden of validating everything AI produces to ensure accuracy.While more than half of researchers — 56% — report saving at least five hours per week using AI tools, nearly four in ten say they've experienced "increased reliance on technology that sometimes produces errors." An additional 37% report that AI has "introduced new risks around data quality or accuracy," and 31% say the technology has "led to more work re-checking or validating AI outputs."The disconnect between productivity gains and trustworthiness has created what amounts to a grand bargain in the research industry: professionals accept time savings and enhanced capabilities in exchange for constant vigilance over AI's mistakes, a dynamic that may fundamentally reshape how insights work gets done.How market researchers went from AI skeptics to daily users in less than a yearThe numbers suggest AI has moved from experiment to infrastructure in record time. Among those using AI daily, 39% deploy it once per day, while 33% use it "several times per day or more," according to the survey conducted between August 15-19, 2025. Adoption is accelerating: 80% of researchers say they're using AI more than they were six months ago, and 71% expect to increase usage over the next six months. Only 8% anticipate their usage will decline.“While AI provides excellent assistance and opportunities, human judgment will remain vital,” Erica Parker, Managing Director Research Products at The Harris Poll, told VentureBeat. “The future is a teamwork dynamic where AI will accelerate tasks and quickly unearth findings, while researchers will ensure quality and provide high level consultative insights.”The top use cases reflect AI's strength in handling data at scale: 58% of researchers use it for analyzing multiple data sources, 54% for analyzing structured data, 50% for automating insight reports, 49% for analyzing open-ended survey responses, and 48% for summarizing findings. These tasks—traditionally labor-intensive and time-consuming — now happen in minutes rather than hours.Beyond time savings, researchers report tangible quality improvements. Some 44% say AI improves accuracy, 43% report it helps surface insights they might otherwise have missed, 43% cite increased speed of insights delivery, and 39% say it sparks creativity. The overwhelming majority — 89% — say AI has made their work lives better, with 25% describing the improvement as "significant."The productivity paradox: saving time while creating new validation workYet the same survey reveals deep unease about the technology's reliability. 
The list of concerns is extensive: 39% of researchers report increased reliance on error-prone technology, 37% cite new risks around data quality or accuracy, 31% describe additional validation work, 29% report uncertainty about job security, and 28% say AI has raised concerns about data privacy and ethics.The report notes that "accuracy is the biggest frustration with AI experienced by researchers when asked on an open-ended basis." One researcher captured the tension succinctly: "The faster we move with AI, the more we need to check if we're moving in the right direction."This paradox — saving time while simultaneously creating new work — reflects a fundamental characteristic of current AI systems, which can produce outputs that appear authoritative but contain what researchers call "hallucinations," or fabricated information presented as fact. The challenge is particularly acute in a profession where credibility depends on methodological rigor and where incorrect data can lead clients to make costly business decisions."Researchers view AI as a junior analyst, capable of speed and breadth, but needing oversight and judgment," said Gary Topiol, Managing Director at QuestDIY, in the report.That metaphor — AI as junior analyst — captures the industry's current operating model. Researchers treat AI outputs as drafts requiring senior review rather than finished products, a workflow that provides guardrails but also underscores the technology's limitations.Why data privacy fears are the biggest obstacle to AI adoption in researchWhen asked what would limit AI use at work, researchers identified data privacy and security concerns as the greatest barrier, cited by 33% of respondents. This concern isn't abstract: researchers handle sensitive customer data, proprietary business information, and personally identifiable information subject to regulations like GDPR and CCPA. Sharing that data with AI systems — particularly cloud-based large language models — raises legitimate questions about who controls the information and whether it might be used to train models accessible to competitors.Other significant barriers include time to experiment and learn new tools (32%), training (32%), integration challenges (28%), internal policy restrictions (25%), and cost (24%). An additional 31% cited lack of transparency in AI use as a concern, which could complicate explaining results to clients and stakeholders.The transparency issue is particularly thorny. When an AI system produces an analysis or insight, researchers often cannot trace how the system arrived at its conclusion — a problem that conflicts with the scientific method's emphasis on replicability and clear methodology. Some clients have responded by including no-AI clauses in their contracts, forcing researchers to either avoid the technology entirely or use it in ways that don't technically violate contractual terms but may blur ethical lines."Onboarding beats feature bloat," Parker said in the report. "The biggest brakes are time to learn and train. Packaged workflows, templates, and guided setup all unlock usage faster than piling on capabilities."Inside the new workflow: treating AI like a junior analyst who needs constant supervisionDespite these challenges, researchers aren't abandoning AI — they're developing frameworks to use it responsibly. 
The consensus model, according to the survey, is "human-led research supported by AI," where AI handles repetitive tasks like coding, data cleaning, and report generation while humans focus on interpretation, strategy, and business impact.About one-third of researchers (29%) describe their current workflow as "human-led with significant AI support," while 31% characterize it as "mostly human with some AI help." Looking ahead to 2030, 61% envision AI as a "decision-support partner" with expanded capabilities including generative features for drafting surveys and reports (56%), AI-driven synthetic data generation (53%), automation of core processes like project setup and coding (48%), predictive analytics (44%), and deeper cognitive insights (43%).The report describes an emerging division of labor where researchers become "Insight Advocates" — professionals who validate AI outputs, connect findings to stakeholder challenges, and translate machine-generated analysis into strategic narratives that drive business decisions. In this model, technical execution becomes less central to the researcher's value proposition than judgment, context, and storytelling."AI can surface missed insights — but it still needs a human to judge what really matters," Topiol said in the report.What other knowledge workers can learn from the research industry's AI experimentThe market research industry's AI adoption may presage similar patterns in other knowledge work professions where the technology promises to accelerate analysis and synthesis. The experience of researchers — early AI adopters who have integrated the technology into daily workflows — offers lessons about both opportunities and pitfalls.First, speed genuinely matters. One boutique agency research lead quoted in the report described watching survey results accumulate in real-time after fielding: "After submitting it for fielding, I literally watched the survey count climb and finish the same afternoon. It was a remarkable turnaround." That velocity enables researchers to respond to business questions within hours rather than weeks, making insights actionable while decisions are still being made rather than after the fact.Second, the productivity gains are real but uneven. Saving five hours per week represents meaningful efficiency for individual contributors, but those savings can disappear if spent validating AI outputs or correcting errors. The net benefit depends on the specific task, the quality of the AI tool, and the user's skill in prompting and reviewing the technology's work.Third, the skills required for research are changing. The report identifies future competencies including cultural fluency, strategic storytelling, ethical stewardship, and what it calls "inquisitive insight advocacy" — the ability to ask the right questions, validate AI outputs, and frame insights for maximum business impact. Technical execution, while still important, becomes less differentiating as AI handles more of the mechanical work.The strange phenomenon of using technology intensively while questioning its reliabilityThe survey's most striking finding may be the persistence of trust issues despite widespread adoption. In most technology adoption curves, trust builds as users gain experience and tools mature. 
But with AI, researchers appear to be using tools intensively while simultaneously questioning their reliability — a dynamic driven by the technology's pattern of performing well most of the time but failing unpredictably.This creates a verification burden that has no obvious endpoint. Unlike traditional software bugs that can be identified and fixed, AI systems' probabilistic nature means they may produce different outputs for the same inputs, making it difficult to develop reliable quality assurance processes.The data privacy concerns — cited by 33% as the biggest barrier to adoption — reflect a different dimension of trust. Researchers worry not just about whether AI produces accurate outputs but also about what happens to the sensitive data they feed into these systems. QuestDIY's approach, according to the report, is to build AI directly into a research platform with ISO/IEC 27001 certification rather than requiring researchers to use general-purpose tools like ChatGPT that may store and learn from user inputs."The center of gravity is analysis at scale — fusing multiple sources, handling both structured and unstructured data, and automating reporting," Topiol said in the report, describing where AI delivers the most value.The future of research work: elevation or endless verification?The report positions 2026 as an inflection point when AI moves from being a tool researchers use to something more like a team member — what the authors call a "co-analyst" that participates in the research process rather than merely accelerating specific tasks.This vision assumes continued improvement in AI capabilities, particularly in areas where researchers currently see the technology as underdeveloped. While 41% currently use AI for survey design, 37% for programming, and 30% for proposal creation, most researchers consider these appropriate use cases, suggesting significant room for growth once the tools become more reliable or the workflows more structured.The human-led model appears likely to persist. "The future is human-led, with AI as a trusted co-analyst," Parker said in the report. But what "human-led" means in practice may shift. If AI handles most analytical tasks and researchers focus on validation and strategic interpretation, the profession may come to resemble editorial work more than scientific analysis — curating and contextualizing machine-generated insights rather than producing them from scratch."AI gives researchers the space to move up the value chain – from data gatherers to Insight Advocates, focused on maximising business impact," Topiol said in the report.Whether this transformation marks an elevation of the profession or a deskilling depends partly on how the technology evolves. If AI systems become more transparent and reliable, the verification burden may decrease and researchers can focus on higher-order thinking. If they remain opaque and error-prone, researchers may find themselves trapped in an endless cycle of checking work produced by tools they cannot fully trust or explain.The survey data suggests researchers are navigating this uncertainty by developing a form of professional muscle memory — learning which tasks AI handles well, where it tends to fail, and how much oversight each type of output requires. This tacit knowledge, accumulated through daily use and occasional failures, may become as important to the profession as statistical literacy or survey design principles.Yet the fundamental tension remains unresolved. 
Researchers are moving faster than ever, delivering insights in hours instead of weeks, and handling analytical tasks that would have been impossible without AI. But they're doing so while shouldering a new responsibility that previous generations never faced: serving as the quality control layer between powerful but unpredictable machines and business leaders making million-dollar decisions.The industry has made its bet. Now comes the harder part: proving that human judgment can keep pace with machine speed — and that the insights produced by this uneasy partnership are worth the trust clients place in them.

#ai

SAP aims to displace more general large language models with the release of its own foundational “tabular” model, which the company claims will reduce training requirements for enterprises. The model, called SAP RPT-1, is a pre-trained model with business and enterprise knowledge out of the box. SAP calls it a Relational Foundation Model, meaning it can do predictions based on relational databases even without fine-tuning or additional training.Walter Sun, SAP's global head of AI, told VentureBeat in an interview that the value of the new model lies in its ability to perform various enterprise tasks, such as predictive analytics, out of the box. “Everyone knows about language models, and there’s a bunch of good ones that already exist,” Sun said. “But we trained the model on data on business transactions, basically Excel spreadsheets, and so we have a model that can do predictive analytics where the value is that it’s out of the box, meaning you don’t need to have specifics of a company to do tasks analogous to a language model.” Sun said that right out of the gate, RPT-1 can essentially build out a business model for enterprises based on its knowledge gained from data from SAP’s decades of information. Organizations can plug the model directly into applications, even without additional fine-tuning.RPT-1, SAP’s first large family of AI models, will be generally available in “Q4 of 2025” and be deployed via SAP’s AI Foundation. While RPT-1 is currently available, the company stated that additional models will be made available soon, including an open-source, state-of-the-art model. SAP will also release a no-code playground environment to experiment with the model. 
Tabular models vs LLMs
Tabular or relational AI models are trained on spreadsheet-style data, unlike LLMs, which are trained on text and code. RPT-1 not only understands numbers and the relationships between different cells, but is also able to provide more structured and precise answers. When enterprises decide to use RPT-1, they can add more direction to the model through a bit of context engineering, since the model is semantically aware and learns based on how it is being used.

SAP researchers first proposed the idea that tabular models can both exhibit semantic awareness and learn from context in a paper published in June. The paper introduced ConTextTab, which uses context-aware pretraining: it utilizes semantic signals, such as table headers or column types, to guide model training, enabling the model to build a relational structure over the data. It is this architecture that makes the model work best for tasks with precise answers, such as financial or enterprise use cases. The RPT models build on the ConTextTab work, letting them learn structured business data, say from SAP's knowledge graph, and then add more context through usage. SAP researchers tested ConTextTab against benchmarks, saying it "is competitive" with similar models like TabPFN and TabIFL.

Industry-specific models continue to grow
Many enterprises prefer to fine-tune general LLMs like GPT-5 or Claude, effectively retraining the model to answer only questions relevant to their business. However, a shift toward industry-specific models has begun to take root. Sun said that his experience at a previous company, building a very narrow, highly customized AI model for sentiment analysis, influenced a lot of what makes RPT-1 different. "It was a very customized model, a narrow model that takes specific feedback for specific products, but it wasn't scalable," Sun said. "When LLMs came about, that one model measures sentiment. But there are use cases that we can do that LLMs cannot do."

He said these use cases include predictions, such as determining when a shopper will return to a grocery store, which may involve numerical analysis along with an understanding of the shopper's buying habits. However, some LLMs have begun integrating into spreadsheets, and AI model providers encourage users to upload similar data to give the models context. Microsoft added new capabilities to Copilot, including the ability to work in Excel. Anthropic integrated its Claude model with Excel, complementing its Claude for Finance service. Chinese startup Manus also offers a data visualization tool that understands spreadsheets, and ChatGPT can create charts from uploaded spreadsheets and other data sources. SAP, however, argues that RPT-1 does more than read a spreadsheet: it should stand out among its competitors because it requires fewer additional pieces of information about a business to provide its responses.
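To illustrate what out-of-the-box prediction over relational rows looks like in practice, here is a minimal sketch. It is not SAP's RPT-1 API, which the article does not detail; the client class, method names, and columns are assumptions made for the example.

```python
# Illustrative only: a hypothetical pretrained tabular/relational model used for
# zero-shot prediction on business rows. This is NOT SAP's RPT-1 API; the client
# class, method names, and columns are assumptions for the sake of the example.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TabularPrediction:
    row: Dict[str, object]
    predicted: Dict[str, object]

class HypotheticalTabularModel:
    """Stand-in for a pretrained relational foundation model."""
    def predict_missing(self, rows: List[Dict[str, object]],
                        target: str) -> List[TabularPrediction]:
        # A real model would use the semantic column headers and cross-row
        # structure; this stub returns a placeholder so the example runs.
        return [TabularPrediction(row=r, predicted={target: "high"})
                for r in rows if r.get(target) is None]

# Rows resembling an ERP table: the headers carry the semantics the model relies on.
orders = [
    {"customer": "ACME", "segment": "retail", "last_order_days": 12, "churn_risk": "low"},
    {"customer": "Globex", "segment": "wholesale", "last_order_days": 95, "churn_risk": None},
]

model = HypotheticalTabularModel()
for p in model.predict_missing(orders, target="churn_risk"):
    print(p.row["customer"], "->", p.predicted)
```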

#ai

Presented by ZendeskAgentic AI is currently transforming three key areas of work — creative, coding, and support — says Shashi Upadhyay, president of engineering, AI, and product at Zendesk. But he notes that support presents a distinct challenge. "Support is special because you’re putting an autonomous AI agent right in front of your customer," Upadhyay says. "You have to be confident that it’s going to do the right thing for the customer and by the customer. Every step forward in AI should make service more dependable for both customers and human agents." Zendesk, recently named a Leader in the 2025 Gartner Magic Quadrant for the CRM Customer Engagement Center, started implementing AI agents about a year and a half ago. Since then, they've seen that AI agents can solve almost 80% of all incoming customer requests on their own. For the remaining 20%, the AI agent can hand it over to a human to help solve the more complex problems. "Autonomous AI agents work 24/7, with no wait or queue time. You have a problem; they provide an answer right away. All of that adds up," he says. "Not only do you get higher resolutions, higher automation, but you can also improve the CSAT at the same time. Because 80% is such a promising number, and the results are so solid, we believe it’s only a matter of time before everyone adopts this technology. We already see that across the board."The company's efforts to advance its standard of usability, depth of insight, and time to value for organizations of all sizes require continuous testing, integration of advanced models like ChatGPT-5, and a major upgrade of its analytics capabilities and real-time, gen AI–powered insights with the acquisition of HyperArc, an AI-native analytics platform.Designing, testing, and deploying a better agent"In a support context especially, it’s important AI agents behave consistently with the brand of the company, policies, and regulatory requirements you may have," Upadhyay says. "We test every agent, every model continuously across all our customers. We do it before we release it and we do it after we release it, across five categories." Those categories — automation rate, execution, precision, latency, and safety — form the foundation of Zendesk’s ongoing benchmarking program. Each model is scored on how accurately it resolves issues, how well it follows instructions, how fast it responds, and whether it stays within clearly defined guardrails. The goal isn’t just to make AI faster — it’s to make it dependable, accountable, and aligned with the standards that define great customer service.That testing is reinforced by Zendesk’s QA agent — an automated monitor that keeps a constant eye on every conversation. If an exchange starts to drift off course, whether in tone or accuracy, the system immediately flags it and alerts a human agent to step in. It’s an added layer of assurance that keeps the customer experience on track, even when AI is running the first line of support.GPT-5 for next-level agentsIn the world of support and service, the move from simple chatbots that answer basic queries or solve uncomplicated problems, to agents that actually take action, is groundbreaking. An agent that can understand that a customer wants to return an item, confirm whether it's eligible for a return, process the return, and issue a refund, is a powerful upgrade. 
With the introduction of ChatGPT-5, Zendesk recognized an opportunity to integrate that ability into its Resolution Platform."We worked very closely with OpenAI because GPT-5 was a pretty big improvement in model capabilities, going from being able to answer questions, to being able to reason and take action," Upadhyay says. "First, it does a much better job at solving problems autonomously. Secondly, it's much better at understanding your intent, which improves the customer experience because you feel understood. Last but not least, it has 95%-plus reliability on executing correctly."Those gains ripple across Zendesk’s AI agents, Copilot, and App Builder. GPT-5 cuts workflow failures by 30%, thanks to its ability to adapt to unexpected complexity without losing context, and reduces fallback escalations by more than 20%, with more complete and accurate responses. The result: faster resolutions, fewer hand-offs, and AI that behaves more like a seasoned support professional than a scripted assistant.Plus, GPT-5 is better at handling ambiguity, and able to clarify vague customer input, which improves routing and increases automated workflows in over 65% of conversations. It has greater accuracy across five languages, and makes agents more productive with more concise, contextually relevant answers that align with tone guidelines.And in App Builder, GPT-5 delivered 25% to 30% faster overall performance, with more prompt iterations per minute, speeding app builder development workflows.Filling in the analytics gapTraditionally, support analytics has focused on structured data — the kind that fits neatly into a table: when a ticket was opened, who handled it, how long it took to resolve, and when it was closed. But the most valuable insights often live in unstructured data — the conversations themselves, spread across email, chat, voice, and messaging apps like WhatsApp."Customers often don’t realize how much intelligence sits in their support interactions," Upadhyay says. "What we’re pushing for with analytics is ways in which we can improve the entire company with the insights that are sitting in support data."To surface those deeper insights, Zendesk turned to HyperArc, an AI-native analytics company known for its proprietary HyperGraph engine and generative-AI-powered insights. The acquisition gave new life to Explore, Zendesk’s analytics platform, transforming it into a modern solution capable of merging structured and unstructured data, supporting conversational interfaces, and drawing on persistent memory to use past interactions as context for new queries."Your support interactions are telling you everything that’s not working in your business today, all that information is sitting in these millions of tickets that you’ve collected over time," Upadhyay says. "We wanted to make that completely visible. Now we have this genius AI agent that can analyze it all and come back with explicit recommendations. That doesn’t just improve support. It improves the entire company."That visibility now translates into actionable intelligence. The system can pinpoint where issues are most persistent, identify the patterns behind them, and suggest ways to resolve them. It can even anticipate problems before they happen. 
During high-pressure events like Black Friday, for example, it can analyze historical data to flag recurring issues, predict where new bottlenecks might appear, and recommend preventive measures — turning reactive support into proactive strategy.

"That's where HyperArc shines," Upadhyay says. "It doesn't just help you understand the past — it helps you plan better for the future."

By integrating HyperArc's AI-native intelligence, Zendesk is moving customer service toward continuous learning — where every interaction builds trust and sharpens performance, setting the stage for AI that can see what's coming next.

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they're always clearly marked. For more information, contact sales@venturebeat.com.

#faculty #k-12 education #artificial intelligence #education, teaching, academics #technology and society #books and authors #school of humanities arts and social sciences #podcasts #comparative media studies/writing

MIT’s Teaching Systems Lab, led by Associate Professor Justin Reich, is working to help educators by listening to and sharing their stories.

#interview #students #graduate, postdoctoral #research #ecology #environment #data #animals #computer science and technology #computer vision #artificial intelligence #machine learning #algorithms #pollution #sustainability #school of engineering #computer science and artificial intelligence laboratory (csail) #abdul latif jameel water and food systems lab (j-wafs) #electrical engineering and computer science (eecs) #mit schwarzman college of computing #national science foundation (nsf)

MIT PhD student and CSAIL researcher Justin Kay describes his work combining AI and computer vision systems to monitor the ecosystems that support our planet.

#ai

I’m thrilled to announce a fantastic new addition to our leadership team: Karyne Levy is joining VentureBeat as our new Managing Editor. Today is her first day.Many of you may know Karyne from her most recent role as Deputy Managing Editor at TechCrunch, but her career is a highlight reel of veteran tech journalism. Her resume includes pivotal roles at Protocol, NerdWallet, Business Insider, and CNET, giving her a deep understanding of this industry from every angle.Hiring Karyne is a significant step forward for VentureBeat. As we’ve sharpened our focus on serving you – the enterprise technical decision-maker navigating the complexities of AI and data – I’ve been looking for a very specific kind of leader.The "Organizer's Dopamine Hit"In the past, a managing editor was often the final backstop for copy. Today, at a modern, data-focused media company like ours, the role is infinitely more dynamic. It’s the central hub of the entire content operation.During my search, I found myself talking a lot about the two types of "dopamine hits" in our business. There’s the writer’s hit – seeing your name on a great story. And then there’s the organizer’s hit – the satisfaction that comes from building, tuning, and running the complex machine that allows a dozen different parts of the company to move in a single, powerful direction.We were looking for the organizer.When I spoke with Karyne, I explained this vision: a leader who thrives on creating workflows, who loves being the liaison between editorial, our data and survey team, our events, and our marketing operations.Her response confirmed she was the one: "Everything you said is exactly my dopamine hit."Karyne’s passion is making the entire operation hum. She has a proven track record of managing people, running newsrooms, and interfacing with all parts of a business to ensure everyone is aligned. That operational rigor is precisely what we need for our next chapter.Why This Matters for Our Strategy (and for You)As I’ve written about before, VentureBeat is on a mission to evolve. In an age where experts and companies can publish directly, it’s not enough to be a secondary source. Our goal is to become a primary source for you.How? By leveraging our relationship with our community of millions of technical leaders. We are increasingly surveying you directly to generate proprietary insights you can’t get anywhere else. We want to be the first to tell you which vector stores your peers are actually implementing, what governance challenges are most pressing for data scientists, or how your counterparts are budgeting for generative AI.This is an ambitious strategy. It requires a tight-knit team where our editorial content, our research surveys and reports, our newsletters, and our VB Transform events are all working from the same playbook.Karyne is the leader who will help us execute that vision. Her experience at Protocol, which was also dedicated to serving technical and business decision-makers, means she fundamentally understands our audience. She is ideally suited to manage our newsroom and ensure that every piece of content we produce helps you do your job better. She’ll be working alongside Carl Franzen, our executive editor, who continues to drive news decision-making.This is a fantastic hire for VentureBeat. It’s another sign of our commitment to building the most focused, expert team in enterprise AI and data.Please join me in welcoming Karyne to the team.

#ai

The buzzed-about but still stealthy New York City startup Augmented Intelligence Inc (AUI), which seeks to go beyond the popular "transformer" architecture used by most of today's LLMs such as ChatGPT and Gemini, has raised $20 million in a bridge SAFE round at a $750 million valuation cap, bringing its total funding to nearly $60 million, VentureBeat can exclusively reveal. The round, completed in under a week, comes amid heightened interest in deterministic conversational AI and precedes a larger raise now in advanced stages.

AUI relies on a fusion of transformer technology and a newer approach called "neuro-symbolic AI," described in greater detail below. "We realize that you can combine the brilliance of LLMs in linguistic capabilities with the guarantees of symbolic AI," said Ohad Elhelo, AUI co-founder and CEO, in a recent interview with VentureBeat. Elhelo launched the company in 2017 alongside co-founder and Chief Product Officer Ori Cohen.

The new financing includes participation from eGateway Ventures, New Era Capital Partners, existing shareholders, and other strategic investors. It follows a $10 million raise in September 2024 at a $350 million valuation cap, coinciding with the company's announced go-to-market partnership with Google in October 2024. Early investors include Vertex Pharmaceuticals founder Joshua Boger, UKG Chairman Aron Ain, and former IBM President Jim Whitehurst. According to the company, the bridge round is a precursor to a significantly larger raise already in advanced stages.

AUI is the company behind Apollo-1, a new foundation model built for task-oriented dialog, which it describes as the "economic half" of conversational AI — distinct from the open-ended dialog handled by LLMs like ChatGPT and Gemini. The firm argues that existing LLMs lack the determinism, policy enforcement, and operational certainty required by enterprises, especially in regulated sectors. Chris Varelas, co-founder of Redwood Capital and an advisor to AUI, said in a press release provided to VentureBeat: "I've seen some of today's top AI leaders walk away with their heads spinning after interacting with Apollo-1."

A Distinctive Neuro-Symbolic Architecture

Apollo-1's core innovation is its neuro-symbolic architecture, which separates linguistic fluency from task reasoning. Instead of using the most common technology underpinning most LLMs and conversational AI systems today — the vaunted transformer architecture described in the seminal 2017 Google paper "Attention Is All You Need" — AUI's system integrates two layers:

Neural modules, powered by LLMs, handle perception: encoding user inputs and generating natural language responses.

A symbolic reasoning engine, developed over several years, interprets structured task elements such as intents, entities, and parameters. This symbolic state engine determines the appropriate next actions using deterministic logic.

This hybrid architecture allows Apollo-1 to maintain state continuity, enforce organizational policies, and reliably trigger tool or API calls — capabilities that transformer-only agents lack.

Elhelo said this design emerged from a multi-year data collection effort: "We built a consumer service and recorded millions of human-agent interactions across 60,000 live agents. From that, we abstracted a symbolic language that defines the structure of task-based dialogs, separate from their domain-specific content."

However, enterprises that have already built systems around transformer LLMs needn't worry.
AUI wants to make adopting its new technology just as easy. "Apollo-1 deploys like any modern foundation model," Elhelo told VentureBeat in a text last night. "It doesn't require dedicated or proprietary clusters to run. It operates across standard cloud and hybrid environments, leveraging both GPUs and CPUs, and is significantly more cost-efficient to deploy than frontier reasoning models. Apollo-1 can also be deployed across all major clouds in a separated environment for increased security."

Generalization and Domain Flexibility

Apollo-1 is described as a foundation model for task-oriented dialog, meaning it is domain-agnostic and generalizable across verticals like healthcare, travel, insurance, and retail. Unlike consulting-heavy AI platforms that require building bespoke logic per client, Apollo-1 allows enterprises to define behaviors and tools within a shared symbolic language. This approach supports faster onboarding and reduces long-term maintenance. According to the team, an enterprise can launch a working agent in under a day.

Crucially, procedural rules are encoded at the symbolic layer — not learned from examples. This enables deterministic execution for sensitive or regulated tasks. For instance, a system can block cancellation of a Basic Economy flight not by guessing intent but by applying hard-coded logic to a symbolic representation of the booking class.

As Elhelo explained to VentureBeat, LLMs are "not a good mechanism when you're looking for certainty. It's better if you know what you're going to send [to an AI model] and always send it, and you know, always, what's going to come back [to the user] and how to handle that."

Availability and Developer Access

Apollo-1 is already in active use within Fortune 500 enterprises in a closed beta, and a broader general availability release is expected before the end of 2025, according to a previous report by The Information, which broke the initial news on the startup. Enterprises can integrate with Apollo-1 either via:

A developer playground, where business users and technical teams jointly configure policies, rules, and behaviors; or
A standard API, using OpenAI-compatible formats.

The model supports policy enforcement, rule-based customization, and steering via guardrails. Symbolic rules allow businesses to dictate fixed behaviors, while LLM modules handle open-text interpretation and user interaction.

Enterprise Fit: When Reliability Beats Fluency

While LLMs have advanced general-purpose dialog and creativity, they remain probabilistic — a barrier to enterprise deployment in finance, healthcare, and customer service. Apollo-1 targets this gap by offering a system where policy adherence and deterministic task completion are first-class design goals. Elhelo puts it plainly: "If your use case is task-oriented dialog, you have to use us, even if you are ChatGPT."
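To make the division of labor concrete, here is a minimal Python sketch of the pattern described above: a deterministic symbolic layer decides what happens (for example, refusing to cancel a Basic Economy fare), and a language model is used only to phrase the already-decided action. All names here (BookingState, decide_next_action, llm_phrase) are hypothetical illustrations, not AUI's API.

```python
from dataclasses import dataclass

# Hypothetical symbolic state: structured facts extracted by the neural layer.
@dataclass
class BookingState:
    fare_class: str   # e.g. "BASIC_ECONOMY", "MAIN_CABIN"
    intent: str       # e.g. "CANCEL_BOOKING"

# Symbolic layer: hard-coded, deterministic policy logic. No model call here,
# so the outcome is guaranteed for policy-bound or regulated decisions.
def decide_next_action(state: BookingState) -> str:
    if state.intent == "CANCEL_BOOKING" and state.fare_class == "BASIC_ECONOMY":
        return "REFUSE_CANCELLATION"      # policy: Basic Economy is non-cancellable
    if state.intent == "CANCEL_BOOKING":
        return "CALL_CANCELLATION_API"
    return "ASK_CLARIFYING_QUESTION"

# Neural layer: in a real system this would be a constrained LLM call that only
# phrases the decided action; a template lookup stands in for it here.
def llm_phrase(action: str) -> str:
    templates = {
        "REFUSE_CANCELLATION": "I'm sorry, Basic Economy fares can't be cancelled.",
        "CALL_CANCELLATION_API": "Your booking has been cancelled.",
        "ASK_CLARIFYING_QUESTION": "Could you tell me more about what you need?",
    }
    return templates[action]

state = BookingState(fare_class="BASIC_ECONOMY", intent="CANCEL_BOOKING")
print(llm_phrase(decide_next_action(state)))   # deterministic: always refuses
```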

#ai #automation

An international team of researchers has released an artificial intelligence system capable of autonomously conducting scientific research across multiple disciplines — generating papers from initial concept to publication-ready manuscript in approximately 30 minutes for about $4 each.

The system, called Denario, can formulate research ideas, review existing literature, develop methodologies, write and execute code, create visualizations, and draft complete academic papers. In a demonstration of its versatility, the team used Denario to generate papers spanning astrophysics, biology, chemistry, medicine, neuroscience, and other fields, with one AI-generated paper already accepted for publication at an academic conference.

"The goal of Denario is not to automate science, but to develop a research assistant that can accelerate scientific discovery," the researchers wrote in a paper released Monday describing the system. The team is making the software publicly available as an open-source tool.

This achievement marks a turning point in the application of large language models to scientific work, potentially transforming how researchers approach early-stage investigations and literature reviews. However, the research also highlights substantial limitations and raises pressing questions about validation, authorship, and the changing nature of scientific labor.

From data to draft: how AI agents collaborate to conduct research

At its core, Denario operates not as a single AI brain but as a digital research department where specialized AI agents collaborate to push a project from conception to completion. The process can begin with the "Idea Module," which employs an adversarial process in which an "Idea Maker" agent proposes research projects that are then scrutinized by an "Idea Hater" agent, which critiques them for feasibility and scientific value. This iterative loop refines raw concepts into robust research directions.

Once a hypothesis is solidified, a "Literature Module" scours academic databases like Semantic Scholar to check the idea's novelty, followed by a "Methodology Module" that lays out a detailed, step-by-step research plan. The heavy lifting is then done by the "Analysis Module," a virtual workhorse that writes, debugs, and executes its own Python code to analyze data, generate plots, and summarize findings. Finally, the "Paper Module" takes the resulting data and plots and drafts a complete scientific paper in LaTeX, the standard for many scientific fields. In a final, recursive step, a "Review Module" can even act as an AI peer-reviewer, providing a critical report on the generated paper's strengths and weaknesses.

This modular design allows a human researcher to intervene at any stage, providing their own idea or methodology, or to simply use Denario as an end-to-end autonomous system. "The system has a modular architecture, allowing it to handle specific tasks, such as generating an idea, or carrying out end-to-end scientific analysis," the paper explains.

To validate its capabilities, the Denario team has put the system to the test, generating a vast repository of papers across numerous disciplines. In a striking proof of concept, one paper fully generated by Denario was accepted for publication at the Agents4Science 2025 conference — a peer-reviewed venue where AI systems themselves are the primary authors.
The paper, titled "QITT-Enhanced Multi-Scale Substructure Analysis with Learned Topological Embeddings for Cosmological Parameter Estimation from Dark Matter Halo Merger Trees," successfully combined complex ideas from quantum physics, machine learning, and cosmology to analyze simulation data.

The ghost in the machine: AI's 'vacuous' results and ethical alarms

While the successes are notable, the research paper is refreshingly candid about Denario's significant limitations and failure modes. The authors stress that the system currently "behaves more like a good undergraduate or early graduate student rather than a full professor in terms of big picture, connecting results...etc." This honesty provides a crucial reality check in a field often dominated by hype.

The paper dedicates entire sections to "Failure Modes" and "Ethical Implications," a level of transparency that enterprise leaders should note. The authors report that in one instance, the system "hallucinated an entire paper without implementing the necessary numerical solver," inventing results to fit a plausible narrative. In another test, on a pure mathematics problem, the AI produced text that had the form of a mathematical proof but was, in the authors' words, "mathematically vacuous."

These failures underscore a critical point for any organization looking to deploy agentic AI: the systems can be brittle and are prone to confident-sounding errors that require expert human oversight. The Denario paper serves as a vital case study in the importance of keeping a human in the loop for validation and critical assessment.

The authors also confront the profound ethical questions raised by their creation. They warn that "AI agents could be used to quickly flood the scientific literature with claims driven by a particular political agenda or specific commercial or economic interests." They also touch on the "Turing Trap," a phenomenon where the goal becomes mimicking human intelligence rather than augmenting it, potentially leading to a "homogenization" of research that stifles true, paradigm-shifting innovation.

An open-source co-pilot for the world's labs

Denario is not just a theoretical exercise locked away in an academic lab. The entire system is open-source under a GPL-3.0 license and is accessible to the broader community. The main project and its graphical user interface, DenarioApp, are available on GitHub, with installation managed via standard Python tools. For enterprise environments focused on reproducibility and scalability, the project also provides official Docker images. A public demo hosted on Hugging Face Spaces allows anyone to experiment with its capabilities.

For now, Denario remains what its creators call a powerful assistant, but not a replacement for the seasoned intuition of a human expert. This framing is deliberate. The Denario project is less about creating an automated scientist and more about building the ultimate co-pilot, one designed to handle the tedious and time-consuming aspects of modern research. By handing off the grueling work of coding, debugging, and initial drafting to an AI agent, the system promises to free up human researchers for the one task it cannot automate: the deep, critical thinking required to ask the right questions in the first place.
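To illustrate the adversarial "Idea Maker" versus "Idea Hater" loop described above, here is a toy Python sketch in which a proposer keeps revising an idea until a critic's feasibility score clears a threshold. The two agents are random stand-ins for LLM calls; none of this is Denario's actual code.

```python
import random

# Toy stand-ins for LLM-backed agents; Denario's real modules call language models.
def idea_maker(topic, feedback=None):
    suffix = f" (revised after: {feedback})" if feedback else ""
    return f"Study how {topic} varies with redshift{suffix}"

def idea_hater(idea):
    # Returns a feasibility score in [0, 1] and a critique; random here.
    score = random.random()
    critique = "needs a cheaper data source" if score < 0.7 else "looks feasible"
    return score, critique

def refine_idea(topic, rounds=5, threshold=0.7):
    feedback = None
    for _ in range(rounds):
        idea = idea_maker(topic, feedback)
        score, feedback = idea_hater(idea)
        if score >= threshold:      # the critic is satisfied: stop iterating
            return idea
    return idea                     # otherwise return the last attempt

print(refine_idea("dark matter halo substructure"))
```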

#ai

The recent controversy surrounding Google's Gemma model has once again highlighted the dangers of using developer test models and the fleeting nature of model availability.

Google pulled its Gemma 3 model from AI Studio following a statement from Senator Marsha Blackburn (R-Tenn.) that the Gemma model willfully hallucinated falsehoods about her. Blackburn said the model fabricated news stories about her that go beyond "harmless hallucination" and function as a defamatory act. In response, Google posted on X on October 31 that it will remove Gemma from AI Studio, stating that this is "to prevent confusion." Gemma remains available via API. It had also been available via AI Studio, which the company described as "a developer tool (in fact, to use it you need to attest you're a developer). We've now seen reports of non-developers trying to use Gemma in AI Studio and ask it factual questions. We never intended this to be a consumer tool or model, or to be used this way. To prevent this confusion, access to Gemma is no longer available on AI Studio."

To be clear, Google has the right to remove its model from its platform, especially if people have found hallucinations and falsehoods that could proliferate. But the episode also underscores the danger of relying mainly on experimental models and why enterprise developers need to save projects before AI models are sunsetted or removed. Technology companies like Google continue to face political controversies, which often influence their deployments.

VentureBeat reached out to Google for additional information and was pointed to their October 31 posts. We also contacted the office of Sen. Blackburn, who reiterated the stance outlined in her statement that AI companies should "shut [models] down until you can control it."

Developer experiments

The Gemma family of models, which includes a 270M parameter version, is best suited for small, quick apps and tasks that can run on devices such as smartphones and laptops. Google said the Gemma models were "built specifically for the developer and research community. They are not meant for factual assistance or for consumers to use." Nevertheless, non-developers could still access Gemma because it was on the AI Studio platform, a more beginner-friendly space for developers to play around with Google AI models compared to Vertex AI. So even if Google never intended Gemma and AI Studio to be accessible to, say, Congressional staffers, these situations can still occur.

It also shows that as models continue to improve, they still produce inaccurate and potentially harmful information. Enterprises must continually weigh the benefits of using models like Gemma against their potential inaccuracies.

Project continuity

Another concern is the control that AI companies have over their models. The adage "you don't own anything on the internet" remains true. If you don't own a physical or local copy of software, it's easy to lose access to it if the company that owns it decides to take it away. Google did not clarify with VentureBeat whether current projects on AI Studio powered by Gemma are saved. Similarly, OpenAI users were disappointed when the company announced that it would remove popular older models from ChatGPT. Even after walking back that decision and reinstating GPT-4o in ChatGPT, OpenAI CEO Sam Altman continues to field questions about keeping and supporting the model. AI companies can, and should, remove their models if they create harmful outputs.
AI models, no matter how mature, remain works in progress and are constantly evolving and improving. But, since they are experimental in nature, models can easily become tools that technology companies and lawmakers can wield as leverage. Enterprise developers must ensure that their work can be saved before models are removed from platforms. 
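For teams that want a concrete safeguard, one hedge against model removal is to keep a local copy of any open-weight model a project depends on. The sketch below uses the huggingface_hub client for that; the exact repository name, and whether a license gate or access token is required, should be checked on the model card, so treat the identifiers here as assumptions.

```python
# Minimal sketch of caching open weights locally so a project survives if the
# hosted playground or endpoint is withdrawn. Requires `pip install huggingface_hub`;
# gated models may also require accepting a license and passing an access token.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="google/gemma-3-270m",        # assumed repo id for the 270M Gemma variant
    local_dir="./models/gemma-3-270m",    # pin the copy alongside the project
)
print(f"Weights cached at {local_dir}")
```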

#research #computer science and technology #artificial intelligence #algorithms #machine learning #alternative energy #laboratory for information and decision systems (lids) #electrical engineering and computer science (eecs) #school of engineering #mit schwarzman college of computing

The FSNet system, developed at MIT, could help power grid operators rapidly find feasible solutions for optimizing the flow of electricity.

#ai #data infrastructure #datadecisionmakers

For more than three decades, modern CPUs have relied on speculative execution to keep pipelines full. When it emerged in the 1990s, speculation was hailed as a breakthrough — just as pipelining and superscalar execution had been in earlier decades. Each marked a generational leap in microarchitecture. By predicting the outcomes of branches and memory loads, processors could avoid stalls and keep execution units busy. But this architectural shift came at a cost: Wasted energy when predictions failed, increased complexity and vulnerabilities such as Spectre and Meltdown. As David Patterson observed in 1980, "A RISC potentially gains in speed merely from a simpler design." Patterson's principle of simplicity underpins a new alternative to speculation: A deterministic, time-based execution model.

For the first time since speculative execution became the dominant paradigm, a fundamentally new approach has been invented. This breakthrough is embodied in a series of six recently issued U.S. patents that moved quickly through the U.S. Patent and Trademark Office (USPTO). Together, they introduce a radically different instruction execution model. Departing sharply from conventional speculative techniques, this deterministic framework replaces guesswork with a time-based, latency-tolerant mechanism. Each instruction is assigned a precise execution slot within the pipeline, resulting in a rigorously ordered and predictable flow of execution. This reimagined model redefines how modern processors can handle latency and concurrency with greater efficiency and reliability. A simple time counter deterministically sets the exact time when each instruction should execute in the future: Each instruction is dispatched to an execution queue with a preset execution time, based on resolving its data dependencies and the availability of resources — read buses, execution units and the write bus to the register file — and remains queued until its scheduled execution slot arrives. This new deterministic approach may represent the first major architectural challenge to speculation since it became the standard.

The architecture extends naturally into matrix computation, with a RISC-V instruction set proposal under community review. Configurable general matrix multiply (GEMM) units, ranging from 8×8 to 64×64, can operate using either register-based or direct-memory access (DMA)-fed operands. This flexibility supports a wide range of AI and high-performance computing (HPC) workloads. Early analysis suggests scalability that rivals Google's TPU cores, while maintaining significantly lower cost and power requirements. Rather than a direct comparison with general-purpose CPUs, the more accurate reference point is vector and matrix engines: Traditional CPUs still depend on speculation and branch prediction, whereas this design applies deterministic scheduling directly to GEMM and vector units. The efficiency stems not only from the configurable GEMM blocks but also from the time-based execution model, where instructions are decoded and assigned precise execution slots based on operand readiness and resource availability. Execution is never a random or heuristic choice among many candidates, but a predictable, pre-planned flow that keeps compute resources continuously busy.
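As a rough software analogy for what a configurable GEMM unit size means, the NumPy sketch below performs a blocked matrix multiply with a selectable tile (8, 16, ..., 64), reusing one small fixed-size multiply across a larger problem. It only illustrates the tiling idea, not the patented hardware or the RISC-V proposal itself.

```python
import numpy as np

def blocked_gemm(a: np.ndarray, b: np.ndarray, tile: int = 8) -> np.ndarray:
    """Blocked matrix multiply with a configurable tile size (e.g. 8, 16, ..., 64).

    A fixed-size "GEMM unit" (the innermost tile multiply) is reused over tiles
    of a larger problem; the hardware described above would feed such a unit
    from registers or DMA rather than NumPy slices.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

a, b = np.random.rand(64, 64), np.random.rand(64, 64)
assert np.allclose(blocked_gemm(a, b, tile=8), a @ b)   # same result as a full GEMM
```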
Planned matrix benchmarks will provide direct comparisons with TPU GEMM implementations, highlighting the ability to deliver datacenter-class performance without datacenter-class overhead.

Critics may argue that static scheduling introduces latency into instruction execution. In reality, the latency already exists — waiting on data dependencies or memory fetches. Conventional CPUs attempt to hide it with speculation, but when predictions fail, the resulting pipeline flush introduces delay and wastes power. The time-counter approach acknowledges this latency and fills it deterministically with useful work, avoiding rollbacks. As the first patent notes, instructions retain out-of-order efficiency: "A microprocessor with a time counter for statically dispatching instructions enables execution based on predicted timing rather than speculative issue and recovery," with preset execution times but without the overhead of register renaming or speculative comparators.

Why speculation stalled

Speculative execution boosts performance by predicting outcomes before they're known — executing instructions ahead of time and discarding them if the guess was wrong. While this approach can accelerate workloads, it also introduces unpredictability and power inefficiency. Mispredictions inject "No Ops" into the pipeline, stalling progress and wasting energy on work that never completes. These issues are magnified in modern AI and machine learning (ML) workloads, where vector and matrix operations dominate and memory access patterns are irregular. Long fetches, non-cacheable loads and misaligned vectors frequently trigger pipeline flushes in speculative architectures. The result is performance cliffs that vary wildly across datasets and problem sizes, making consistent tuning nearly impossible. Worse still, speculative side effects have exposed vulnerabilities that led to high-profile security exploits. As data intensity grows and memory systems strain, speculation struggles to keep pace — undermining its original promise of seamless acceleration.

Time-based execution and deterministic scheduling

At the core of this invention is a vector coprocessor with a time counter for statically dispatching instructions. Rather than relying on speculation, instructions are issued only when data dependencies and latency windows are fully known. This eliminates guesswork and costly pipeline flushes while preserving the throughput advantages of out-of-order execution. Architectures built on this patented framework feature deep pipelines — typically spanning 12 stages — combined with wide front ends supporting up to 8-way decode and large reorder buffers exceeding 250 entries.

As illustrated in Figure 1, the architecture mirrors a conventional RISC-V processor at the top level, with instruction fetch and decode stages feeding into execution units. The innovation emerges in the integration of a time counter and register scoreboard, strategically positioned between fetch/decode and the vector execution units. Instead of relying on speculative comparators or register renaming, the design uses a Register Scoreboard and Time Resource Matrix (TRM) to deterministically schedule instructions based on operand readiness and resource availability.

Figure 1: High-level block diagram of the deterministic processor.
A time counter and scoreboard sit between fetch/decode and the vector execution units, ensuring instructions issue only when operands are ready.

A typical program running on the deterministic processor begins much like it does on any conventional RISC-V system: Instructions are fetched from memory and decoded to determine whether they are scalar, vector, matrix or custom extensions. The difference emerges at the point of dispatch. Instead of issuing instructions speculatively, the processor employs a cycle-accurate time counter, working with a register scoreboard, to decide exactly when each instruction can be executed. This mechanism provides a deterministic execution contract, ensuring instructions complete at predictable cycles and reducing wasted issue slots.

In conjunction with the register scoreboard, the time-resource matrix associates instructions with execution cycles, allowing the processor to plan dispatch deterministically across available resources. The scoreboard tracks operand readiness and hazard information, enabling scheduling without register renaming or speculative comparators. By monitoring dependencies such as read-after-write (RAW) and write-after-read, it ensures hazards are resolved without costly pipeline flushes. As noted in the patent, "in a multi-threaded microprocessor, the time counter and scoreboard permit rescheduling around cache misses, branch flushes, and RAW hazards without speculative rollback."

Once operands are ready, the instruction is dispatched to the appropriate execution unit. Scalar operations use standard arithmetic logic units (ALUs), while vector and matrix instructions execute in wide execution units connected to a large vector register file. Because instructions launch only when conditions are safe, these units stay highly utilized without the wasted work or recovery cycles caused by mis-predicted speculation. The key enabler of this approach is a simple time counter that orchestrates execution according to data readiness and resource availability, ensuring instructions advance only when operands are ready and resources available. The same principle applies to memory operations: The interface predicts latency windows for loads and stores, allowing the processor to fill those slots with independent instructions and keep execution flowing.

Programming model differences

From the programmer's perspective, the flow remains familiar — RISC-V code compiles and executes in the usual way. The crucial difference lies in the execution contract: Rather than relying on dynamic speculation to hide latency, the processor guarantees predictable dispatch and completion times. This eliminates the performance cliffs and wasted energy of speculation while still providing the throughput benefits of out-of-order execution. As John Hennessy put it: "It's stupid to do work in run time that you can do in compile time" — a remark reflecting the foundations of RISC and its forward-looking design philosophy.

The RISC-V ISA provides opcodes for custom and extension instructions, including floating-point, DSP, and vector operations. The result is a processor that executes instructions deterministically while retaining the benefits of out-of-order performance. By eliminating speculation, the design simplifies hardware, reduces power consumption and avoids pipeline flushes.
These efficiency gains grow even more significant in vector and matrix operations, where wide execution units require consistent utilization to reach peak performance. Vector extensions require wide register files and large execution units, which in speculative processors necessitate expensive register renaming to recover from branch mispredictions. In the deterministic design, vector instructions are executed only after commit, eliminating the need for renaming.

Each instruction is scheduled against a cycle-accurate time counter: "The time counter provides a deterministic execution contract, ensuring instructions complete at predictable cycles and reducing wasted issue slots." The vector register scoreboard resolves data dependencies before issuing instructions to the execution pipeline. Instructions are dispatched in a known order at the correct cycle, making execution both predictable and efficient.

Vector execution units (integer and floating point) connect directly to a large vector register file. Because instructions are never flushed, there is no renaming overhead. The scoreboard ensures safe access, while the time counter aligns execution with memory readiness. A dedicated memory block predicts the return cycle of loads. Instead of stalling or speculating, the processor schedules independent instructions into latency slots, keeping execution units busy. "A vector coprocessor with a time counter for statically dispatching instructions ensures high utilization of wide execution units while avoiding misprediction penalties."

In today's CPUs, compilers and programmers write code assuming the hardware will dynamically reorder instructions and speculatively execute branches. The hardware handles hazards with register renaming, branch prediction and recovery mechanisms. Programmers benefit from performance, but at the cost of unpredictability and power consumption. In the deterministic time-based architecture, instructions are dispatched only when the time counter indicates their operands will be ready. This means the compiler (or runtime system) doesn't need to insert guard code for misprediction recovery. Instead, compiler scheduling becomes simpler, as instructions are guaranteed to issue at the correct cycle without rollbacks. For programmers, the ISA remains RISC-V compatible, but deterministic extensions reduce reliance on speculative safety nets.

Application in AI and ML

In AI/ML kernels, vector loads and matrix operations often dominate runtime. On a speculative CPU, misaligned or non-cacheable loads can trigger stalls or flushes, starving wide vector and matrix units and wasting energy on discarded work. A deterministic design instead issues these operations with cycle-accurate timing, ensuring high utilization and steady throughput. For programmers, this means fewer performance cliffs and more predictable scaling across problem sizes. And because the patents extend the RISC-V ISA rather than replace it, deterministic processors remain fully compatible with the RVA23 profile and mainstream toolchains such as GCC, LLVM, FreeRTOS, and Zephyr.

In practice, the deterministic model doesn't change how code is written — it remains RISC-V assembly or high-level languages compiled to RISC-V instructions. What changes is the execution contract: Rather than relying on speculative guesswork, programmers can expect predictable latency behavior and higher efficiency without tuning code around microarchitectural quirks.
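The following toy Python simulation sketches the time-counter-plus-scoreboard idea described above: a scoreboard records when each register becomes ready, and every instruction is assigned the earliest cycle at which its operands and an issue slot are available, with no speculation or rollback. It is a conceptual model under simplifying assumptions (fixed latencies, issue ports as the only shared resource), not the patented microarchitecture.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    srcs: list       # source registers
    dst: str         # destination register
    latency: int     # cycles until the result is written back

def schedule(program, issue_width=2):
    """Toy deterministic scheduler: a register scoreboard maps each register to
    the cycle its value is ready, and a time counter assigns each instruction
    the earliest cycle where its operands are ready and an issue slot is free."""
    ready_at = {}     # scoreboard: register -> ready cycle
    slots_used = {}   # time counter cycle -> issue slots already consumed
    plan = []
    for ins in program:
        t = max([ready_at.get(r, 0) for r in ins.srcs], default=0)
        while slots_used.get(t, 0) >= issue_width:   # resource (issue port) check
            t += 1
        slots_used[t] = slots_used.get(t, 0) + 1
        ready_at[ins.dst] = t + ins.latency          # result available after latency
        plan.append((t, ins.name))                   # preset, never revisited
    return plan

prog = [Instr("load x1", [], "x1", 4),
        Instr("load x2", [], "x2", 4),
        Instr("add x3", ["x1", "x2"], "x3", 1),
        Instr("mul x4", ["x3", "x2"], "x4", 3)]
print(schedule(prog))   # [(0, 'load x1'), (0, 'load x2'), (4, 'add x3'), (5, 'mul x4')]
```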
The industry is at an inflection point. AI/ML workloads are dominated by vector and matrix math, where GPUs and TPUs excel — but only by consuming massive power and adding architectural complexity. In contrast, general-purpose CPUs, still tied to speculative execution models, lag behind.

A deterministic processor delivers predictable performance across a wide range of workloads, ensuring consistent behavior regardless of task complexity. Eliminating speculative execution enhances energy efficiency and avoids unnecessary computational overhead. Furthermore, deterministic design scales naturally to vector and matrix operations, making it especially well-suited for AI workloads that rely on high-throughput parallelism. Deterministic execution may represent the next such generational leap: The first major architectural challenge to speculation since speculation itself became the standard.

Will deterministic CPUs replace speculation in mainstream computing? That remains to be seen. But with issued patents, proven novelty and growing pressure from AI workloads, the timing is right for a paradigm shift. Speculation marked the last revolution in CPU design; determinism may well represent the next.

Thang Tran is the founder and CTO of Simplex Micro.

Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.

#ai #datadecisionmakers

Recently, there has been a lot of hullabaloo about the idea that large reasoning models (LRMs) are unable to think. This is mostly due to a research article published by Apple, "The Illusion of Thinking." Apple argues that LRMs must not be able to think; instead, they just perform pattern-matching. The evidence they provided is that LRMs with chain-of-thought (CoT) reasoning are unable to carry out the calculation using a predefined algorithm as the problem grows.

This is a fundamentally flawed argument. If you ask a human who already knows the algorithm for solving the Tower-of-Hanoi problem to solve a Tower-of-Hanoi problem with twenty discs, for instance, he or she would almost certainly fail to do so. By that logic, we must conclude that humans cannot think either. However, this argument only points to the idea that there is no evidence that LRMs cannot think. This alone certainly does not mean that LRMs can think — just that we cannot be sure they don't.

In this article, I will make a bolder claim: LRMs almost certainly can think. I say 'almost' because there is always a chance that further research would surprise us. But I think my argument is pretty conclusive.

What is thinking?

Before we try to understand whether LRMs can think, we need to define what we mean by thinking. But first, we have to make sure that humans can think per the definition. We will only consider thinking in relation to problem solving, which is the matter of contention.

1. Problem representation (frontal and parietal lobes)

When you think about a problem, the process engages your prefrontal cortex. This region is responsible for working memory, attention and executive functions — capacities that let you hold the problem in mind, break it into sub-components and set goals. Your parietal cortex helps encode symbolic structure for math or puzzle problems.

2. Mental simulation (working memory and inner speech)

This has two components: One is an auditory loop that lets you talk to yourself — very similar to CoT generation. The other is visual imagery, which allows you to manipulate objects visually. Geometry was so important for navigating the world that we developed specialized capabilities for it. The auditory part is linked to Broca's area and the auditory cortex, both reused from language centers. The visual cortex and parietal areas primarily control the visual component.

3. Pattern matching and retrieval (hippocampus and temporal lobes)

These actions depend on past experiences and stored knowledge from long-term memory: The hippocampus helps retrieve related memories and facts, while the temporal lobe brings in semantic knowledge — meanings, rules, categories. This is similar to how neural networks depend on their training to process the task.

4. Monitoring and evaluation (anterior cingulate cortex)

Our anterior cingulate cortex (ACC) monitors for errors, conflicts or impasses — it's where you notice contradictions or dead ends. This process is essentially based on pattern matching from prior experience.

5. Insight or reframing (default mode network and right hemisphere)

When you're stuck, your brain might shift into default mode — a more relaxed, internally directed network. This is when you step back, let go of the current thread and sometimes 'suddenly' see a new angle (the classic "aha!" moment). This is similar to how DeepSeek-R1 was trained for CoT reasoning without having CoT examples in its training data.
Remember, the brain continuously learns as it processes data and solves problems. In contrast, LRMs aren't allowed to change based on real-world feedback during prediction or generation. But with DeepSeek-R1's CoT training, learning did happen as it attempted to solve problems — essentially updating while reasoning.

Similarities between CoT reasoning and biological thinking

An LRM does not have all of the faculties mentioned above. For example, an LRM is very unlikely to do much visual reasoning in its circuit, although a little may happen. It certainly does not generate intermediate images during CoT generation. Most humans can make spatial models in their heads to solve problems. Does this mean we can conclude that LRMs cannot think? I would disagree. Some humans also find it difficult to form spatial models of the concepts they think about. This condition is called aphantasia. People with this condition can think just fine. In fact, they go about life as if they don't lack any ability at all. Many of them are actually great at symbolic reasoning and quite good at math — often enough to compensate for their lack of visual reasoning. We might expect our neural network models also to be able to circumvent this limitation.

If we take a more abstract view of the human thought process described earlier, we can see mainly the following elements involved:

1. Pattern-matching, used for recalling learned experience, problem representation, and monitoring and evaluating chains of thought.
2. Working memory, to store all the intermediate steps.
3. Backtracking search, which concludes that the CoT is not going anywhere and backtracks to some reasonable point.

Pattern-matching in an LRM comes from its training. The whole point of training is to learn both knowledge of the world and the patterns to process that knowledge effectively. Since an LRM is a layered network, the entire working memory needs to fit within one layer. The weights store the knowledge of the world and the patterns to follow, while processing happens between layers using the learned patterns stored as model parameters. Note that even in CoT, the entire text — including the input, the CoT and the part of the output already generated — must fit into each layer. Working memory is just one layer (in the case of the attention mechanism, this includes the KV-cache).

CoT is, in fact, very similar to what we do when we are talking to ourselves (which is almost always). We nearly always verbalize our thoughts, and so does a CoT reasoner.

There is also good evidence that a CoT reasoner can take backtracking steps when a certain line of reasoning seems futile. In fact, this is what the Apple researchers saw when they asked LRMs to solve bigger instances of simple puzzles. The LRMs correctly recognized that trying to solve the puzzles directly would not fit in their working memory, so they tried to figure out better shortcuts, just like a human would. This is even more evidence that LRMs are thinkers, not just blind followers of predefined patterns.

But why would a next-token-predictor learn to think?

Neural networks of sufficient size can learn any computation, including thinking. But a next-word-prediction system can also learn to think. Let me elaborate. A common objection is that LRMs cannot think because, at the end of the day, they are just predicting the next token; each is merely a 'glorified auto-complete.' This view is fundamentally incorrect — what is wrong is not the claim that an LRM is an 'auto-complete,' but the assumption that an 'auto-complete' cannot think.
In fact, next-word prediction is far from a limited representation of thought. On the contrary, it is the most general form of knowledge representation that anyone can hope for. Let me explain.

Whenever we want to represent some knowledge, we need a language or a system of symbolism to do so. Different formal languages exist that are very precise in terms of what they can express. However, such languages are fundamentally limited in the kinds of knowledge they can represent. For example, first-order predicate logic cannot represent properties of all predicates that satisfy a certain property, because it doesn't allow predicates over predicates. Of course, there are higher-order predicate calculi that can represent predicates on predicates to arbitrary depths. But even they cannot express ideas that lack precision or are abstract in nature.

Natural language, however, is complete in expressive power — you can describe any concept in any level of detail or abstraction. In fact, you can even describe concepts about natural language using natural language itself. That makes it a strong candidate for knowledge representation. The challenge, of course, is that this expressive richness makes it harder to process the information encoded in natural language. But we don't necessarily need to understand how to do that manually — we can simply program the machine using data, through a process called training.

A next-token prediction machine essentially computes a probability distribution over the next token, given a context of preceding tokens. Any machine that aims to compute this probability accurately must, in some form, represent world knowledge. A simple example: Consider the incomplete sentence, "The highest mountain peak in the world is Mount ..." — to predict the next word as Everest, the model must have this knowledge stored somewhere. If the task requires the model to compute the answer or solve a puzzle, the next-token predictor needs to output CoT tokens to carry the logic forward. This implies that, even though it's predicting one token at a time, the model must internally represent at least the next few tokens in its working memory — enough to ensure it stays on the logical path.

If you think about it, humans also predict the next token — whether during speech or when thinking using the inner voice. A perfect auto-complete system that always outputs the right tokens and produces correct answers would have to be omniscient. Of course, we'll never reach that point — because not every answer is computable. However, a parameterized model that can represent knowledge by tuning its parameters, and that can learn through data and reinforcement, can certainly learn to think.

Does it produce the effects of thinking?

At the end of the day, the ultimate test of thought is a system's ability to solve problems that require thinking. If a system can answer previously unseen questions that demand some level of reasoning, it must have learned to think — or at least to reason — its way to the answer. We know that proprietary LRMs perform very well on certain reasoning benchmarks. However, since there's a possibility that some of these models were fine-tuned on benchmark test sets through a backdoor, we'll focus only on open-source models for fairness and transparency.

We evaluate them using the following benchmarks:

As one can see, in some benchmarks, LRMs are able to solve a significant number of logic-based questions.
While it's true that they still lag behind human performance in many cases, it's important to note that the human baseline often comes from individuals trained specifically on those benchmarks. In fact, in certain cases, LRMs outperform the average untrained human.

Conclusion

Based on the benchmark results, the striking similarity between CoT reasoning and biological reasoning, and the theoretical understanding that any system with sufficient representational capacity, enough training data and adequate computational power can perform any computable task, LRMs meet those criteria to a considerable extent. It is therefore reasonable to conclude that LRMs almost certainly possess the ability to think.

Debasish Ray Chawdhuri is a senior principal engineer at Talentica Software and a Ph.D. candidate in Cryptography at IIT Bombay.

Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.

In this post, I’ll introduce a reinforcement learning (RL) algorithm based on an “alternative” paradigm: divide and conquer. Unlike traditional methods, this algorithm is not based on temporal difference (TD) learning (which has scalability challenges), and scales well to long-horizon tasks.




We can do Reinforcement Learning (RL) based on divide and conquer, instead of temporal difference (TD) learning.




Problem setting: off-policy RL

Our problem setting is off-policy RL. Let’s briefly review what this means.

There are two classes of algorithms in RL: on-policy RL and off-policy RL. On-policy RL means we can only use fresh data collected by the current policy. In other words, we have to throw away old data each time we update the policy. Algorithms like PPO and GRPO (and policy gradient methods in general) belong to this category.

Off-policy RL means we don’t have this restriction: we can use any kind of data, including old experience, human demonstrations, Internet data, and so on. So off-policy RL is more general and flexible than on-policy RL (and of course harder!). Q-learning is the most well-known off-policy RL algorithm. In domains where data collection is expensive (e.g., robotics, dialogue systems, healthcare, etc.), we often have no choice but to use off-policy RL. That’s why it’s such an important problem.
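As a minimal sketch of what makes an algorithm off-policy in practice, the snippet below shows a replay buffer: transitions can come from old policies, demonstrations, or any other source, and updates simply sample from the stored data. This is generic illustrative code, not tied to any particular paper.

```python
import random
from collections import deque

# Minimal replay buffer: the defining ingredient of off-policy RL is that
# updates can reuse transitions gathered by old policies, demonstrations, etc.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

buf = ReplayBuffer()
for t in range(1000):                         # transitions from any behavior policy
    buf.add(s=t, a=t % 4, r=0.0, s_next=t + 1, done=False)
print(len(buf.sample(32)), "transitions sampled for an off-policy update")
```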

As of 2025, I think we have reasonably good recipes for scaling up on-policy RL (e.g., PPO, GRPO, and their variants). However, we still haven’t found a “scalable” off-policy RL algorithm that scales well to complex, long-horizon tasks. Let me briefly explain why.

Two paradigms in value learning: Temporal Difference (TD) and Monte Carlo (MC)

In off-policy RL, we typically train a value function using temporal difference (TD) learning (i.e., Q-learning), with the following Bellman update rule:

\[\begin{aligned} Q(s, a) \gets r + \gamma \max_{a'} Q(s', a'). \end{aligned}\]

The problem is this: the error in the next value $Q(s’, a’)$ propagates to the current value $Q(s, a)$ through bootstrapping, and these errors accumulate over the entire horizon. This is basically what makes TD learning struggle to scale to long-horizon tasks (see this post if you’re interested in more details).
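Here is a minimal tabular sketch of the update above, which also makes the error-propagation point explicit: whatever error sits in the bootstrapped next-state value is copied (discounted) into the current estimate.

```python
import numpy as np

n_states, n_actions = 5, 2
gamma, alpha = 0.99, 0.1
Q = np.zeros((n_states, n_actions))

def td_update(s, a, r, s_next):
    # The target bootstraps from Q(s', .): any error in the next-state estimate
    # is carried (discounted by gamma) into Q(s, a), and over long horizons
    # these bootstrapped errors compound.
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

td_update(s=0, a=1, r=0.0, s_next=1)
```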

To mitigate this problem, people have mixed TD learning with Monte Carlo (MC) returns. For example, we can do $n$-step TD learning (TD-$n$):

\[\begin{aligned} Q(s_t, a_t) \gets \sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n \max_{a'} Q(s_{t+n}, a'). \end{aligned}\]

Here, we use the actual Monte Carlo return (from the dataset) for the first $n$ steps, and then use the bootstrapped value for the rest of the horizon. This way, we can reduce the number of Bellman recursions by $n$ times, so errors accumulate less. In the extreme case of $n = \infty$, we recover pure Monte Carlo value learning.

While this is a reasonable solution (and often works well), it is highly unsatisfactory. First, it doesn’t fundamentally solve the error accumulation problem; it only reduces the number of Bellman recursions by a constant factor ($n$). Second, as $n$ grows, we suffer from high variance and suboptimality. So we can’t just set $n$ to a large value, and need to carefully tune it for each task.
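A small sketch of how the TD-$n$ target is assembled from a stored trajectory, following the equation above (assuming the trajectory provides at least $n$ rewards and the state reached after $n$ steps):

```python
import numpy as np

gamma = 0.99

def td_n_target(rewards, s_next_n, Q, n):
    """n-step target: the first n discounted rewards taken from the dataset,
    plus a bootstrapped value at the state reached after n steps."""
    mc_part = sum(gamma**i * rewards[i] for i in range(n))
    return mc_part + gamma**n * Q[s_next_n].max()

Q = np.zeros((10, 2))
rewards = [0.0, 0.0, 1.0, 0.0, 0.0]
print(td_n_target(rewards, s_next_n=5, Q=Q, n=5))   # n = len(rewards) in this toy case
```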

Is there a fundamentally different way to solve this problem?

The “Third” Paradigm: Divide and Conquer

My claim is that a third paradigm in value learning, divide and conquer, may provide an ideal solution to off-policy RL that scales to arbitrarily long-horizon tasks.




Divide and conquer reduces the number of Bellman recursions logarithmically.


The key idea of divide and conquer is to divide a trajectory into two equal-length segments, and combine their values to update the value of the full trajectory. This way, we can (in theory) reduce the number of Bellman recursions logarithmically (not linearly!). Moreover, it doesn’t require choosing a hyperparameter like $n$, and it doesn’t necessarily suffer from high variance or suboptimality, unlike $n$-step TD learning.

Conceptually, divide and conquer really has all the nice properties we want in value learning. So I’ve long been excited about this high-level idea. The problem was that it wasn’t clear how to actually do this in practice… until recently.

A practical algorithm

In a recent work co-led with Aditya, we made meaningful progress toward realizing and scaling up this idea. Specifically, we were able to scale up divide-and-conquer value learning to highly complex tasks (as far as I know, this is the first such work!) at least in one important class of RL problems, goal-conditioned RL. Goal-conditioned RL aims to learn a policy that can reach any state from any other state. This provides a natural divide-and-conquer structure. Let me explain this.

The structure is as follows. Let’s first assume that the dynamics is deterministic, and denote the shortest path distance (“temporal distance”) between two states $s$ and $g$ as $d^*(s, g)$. Then, it satisfies the triangle inequality:

\[\begin{aligned} d^*(s, g) \leq d^*(s, w) + d^*(w, g) \end{aligned}\]

for all $s, g, w \in \mathcal{S}$.

In terms of values, we can equivalently translate this triangle inequality to the following “transitive” Bellman update rule:

\[\begin{aligned}
V(s, g) \gets \begin{cases}
\gamma^0 & \text{if } s = g, \\\\
\gamma^1 & \text{if } (s, g) \in \mathcal{E}, \\\\
\max_{w \in \mathcal{S}} V(s, w)V(w, g) & \text{otherwise}
\end{cases}
\end{aligned}\]

where $\mathcal{E}$ is the set of edges in the environment’s transition graph, and $V$ is the value function associated with the sparse reward $r(s, g) = 1(s = g)$. Intuitively, this means that we can update the value of $V(s, g)$ using two “smaller” values: $V(s, w)$ and $V(w, g)$, provided that $w$ is the optimal “midpoint” (subgoal) on the shortest path. This is exactly the divide-and-conquer value update rule that we were looking for!

The problem

However, there’s one problem here. The issue is that it’s unclear how to choose the optimal subgoal $w$ in practice. In tabular settings, we can simply enumerate all states to find the optimal $w$ (this is essentially the Floyd-Warshall shortest path algorithm). But in continuous environments with large state spaces, we can’t do this. Basically, this is why previous works have struggled to scale up divide-and-conquer value learning, even though this idea has been around for decades (in fact, it dates back to the very first work in goal-conditioned RL by Kaelbling (1993) – see our paper for a further discussion of related works). The main contribution of our work is a practical solution to this issue.
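In the tabular case, the transitive update can be run with exact enumeration over midpoints, which is the Floyd-Warshall-style procedure mentioned above. The toy sketch below does exactly that on a four-state chain; the whole point of TRL is to avoid this enumeration in large or continuous state spaces.

```python
import numpy as np

# Tiny deterministic graph: a 4-state chain 0-1-2-3.
edges = {(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)}
n, gamma = 4, 0.95

V = np.zeros((n, n))            # V(s, g) ~ gamma ** d*(s, g)
for s in range(n):
    V[s, s] = 1.0               # gamma^0 when s = g
for (s, g) in edges:
    V[s, g] = gamma             # gamma^1 for adjacent states

# Transitive updates: V(s, g) <- max_w V(s, w) * V(w, g).
# With exact enumeration over w this is Floyd-Warshall in disguise; TRL replaces
# the enumeration with a soft, dataset-restricted search over midpoints.
for _ in range(int(np.ceil(np.log2(n))) + 1):
    for s in range(n):
        for g in range(n):
            if s != g and (s, g) not in edges:
                V[s, g] = max(V[s, g], (V[s, :] * V[:, g]).max())

print(np.round(V, 3))           # V[0, 3] converges to gamma**3
```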

The solution

Here’s our key idea: we restrict the search space of $w$ to the states that appear in the dataset, specifically, those that lie between $s$ and $g$ in the dataset trajectory. Also, instead of searching for the optimal $\text{argmax}_w$, we compute a “soft” $\text{argmax}$ using expectile regression. Namely, we minimize the following loss:

\[\begin{aligned} \mathbb{E}\left[\ell^2_\kappa (V(s_i, s_j) - \bar{V}(s_i, s_k) \bar{V}(s_k, s_j))\right], \end{aligned}\]

where $\bar{V}$ is the target value network, $\ell^2_\kappa$ is the expectile loss with an expectile $\kappa$, and the expectation is taken over all $(s_i, s_k, s_j)$ tuples with $i \leq k \leq j$ in a randomly sampled dataset trajectory.

This has two benefits. First, we don’t need to search over the entire state space. Second, we prevent value overestimation from the $\max$ operator by instead using the “softer” expectile regression. We call this algorithm Transitive RL (TRL). Check out our paper for more details and further discussions!
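For concreteness, here is a rough PyTorch-style sketch of the displayed loss: the value of a pair is regressed onto the product of target values through an in-trajectory midpoint, with an expectile weight standing in for the hard $\max$. All shapes and names are illustrative, and the exact sign and $\kappa$ convention of $\ell^2_\kappa$ should be taken from the paper.

```python
import torch

def expectile_weight(u: torch.Tensor, kappa: float) -> torch.Tensor:
    # |kappa - 1(u < 0)|: the asymmetric weight of the squared expectile loss.
    return torch.abs(kappa - (u < 0).float())

def trl_loss(V, V_target, s_i, s_k, s_j, kappa=0.9):
    """Sketch of the displayed loss: fit V(s_i, s_j) to the product of target
    values through an in-trajectory midpoint s_k, with expectile regression
    acting as a soft version of the max over midpoints."""
    with torch.no_grad():
        target = V_target(s_i, s_k) * V_target(s_k, s_j)   # \bar{V}(s_i,s_k)\bar{V}(s_k,s_j)
    u = V(s_i, s_j) - target
    return (expectile_weight(u, kappa) * u.pow(2)).mean()

# Dummy usage with random "states" and a toy value network; in practice V_target
# would be a lagging copy of V and states would come from dataset trajectories.
net = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
value = lambda s, g: torch.sigmoid(net(torch.cat([s, g], dim=-1))).squeeze(-1)
s_i, s_k, s_j = (torch.randn(32, 4) for _ in range(3))
print(trl_loss(value, value, s_i, s_k, s_j).item())
```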

Does it work well?







humanoidmaze






puzzle



To see whether our method scales well to complex tasks, we directly evaluated TRL on some of the most challenging tasks in OGBench, a benchmark for offline goal-conditioned RL. We mainly used the hardest versions of humanoidmaze and puzzle tasks with large, 1B-sized datasets. These tasks are highly challenging: they require performing combinatorially complex skills across up to 3,000 environment steps.




TRL achieves the best performance on highly challenging, long-horizon tasks.


The results are quite exciting! Compared to many strong baselines across different categories (TD, MC, quasimetric learning, etc.), TRL achieves the best performance on most tasks.




TRL matches the best, individually tuned TD-$n$, without needing to set $\boldsymbol{n}$.


This is my favorite plot. We compared TRL with $n$-step TD learning with different values of $n$, from $1$ (pure TD) to $\infty$ (pure MC). The result is really nice. TRL matches the best TD-$n$ on all tasks, without needing to set $\boldsymbol{n}$! This is exactly what we wanted from the divide-and-conquer paradigm. By recursively splitting a trajectory into smaller ones, it can naturally handle long horizons, without having to arbitrarily choose the length of trajectory chunks.

The paper has a lot of additional experiments, analyses, and ablations. If you’re interested, check out our paper!

What’s next?

In this post, I shared some promising results from our new divide-and-conquer value learning algorithm, Transitive RL. This is just the beginning of the journey. There are many open questions and exciting directions to explore:



Perhaps the most important question is how to extend TRL to regular, reward-based RL tasks beyond goal-conditioned RL. Would regular RL have a similar divide-and-conquer structure that we can exploit? I’m quite optimistic about this, given that it is possible to convert any reward-based RL task to a goal-conditioned one at least in theory (see page 40 of this book).


Another important challenge is to deal with stochastic environments. The current version of TRL assumes deterministic dynamics, but many real-world environments are stochastic, mainly due to partial observability. For this, “stochastic” triangle inequalities might provide some hints.


Practically, I think there is still a lot of room to further improve TRL. For example, we can find better ways to choose subgoal candidates (beyond the ones from the same trajectory), further reduce hyperparameters, further stabilize training, and simplify the algorithm even more.



In general, I’m really excited about the potential of the divide-and-conquer paradigm. I still think one of the most important problems in RL (and even in machine learning) is to find a scalable off-policy RL algorithm. I don’t know what the final solution will look like, but I do think divide and conquer, or recursive decision-making in general, is one of the strongest candidates toward this holy grail (by the way, I think the other strong contenders are (1) model-based RL and (2) TD learning with some “magic” tricks). Indeed, several recent works in other fields have shown the promise of recursion and divide-and-conquer strategies, such as shortcut models, log-linear attention, and recursive language models (and of course, classic algorithms like quicksort, segment trees, FFT, and so on). I hope to see more exciting progress in scalable off-policy RL in the near future!

Acknowledgments

I’d like to thank Kevin and Sergey for their helpful feedback on this post.



This post originally appeared on Seohong Park’s blog.

#ai

Presented by CelonisAI adoption is accelerating, but results often lag expectations. And enterprise leaders are under pressure to prove measurable ROI from the AI solutions — especially as the use of autonomous agents rises and global tariffs disrupt supply chains.The issue isn’t the AI itself, says Alex Rinke, co-founder and co-CEO of Celonis, a global leader in process intelligence. “To succeed, enterprise AI needs to understand the context of a business’s processes — and how to improve them,” he explains. Without this business context, AI risks becoming, as Rinke puts it, “just an internal social experiment.”Next week’s Celosphere 2025 will tackle the AI ROI challenge head-on. The three-day event brings together customer strategies, hands-on workshops, and live demonstrations, highlighting enhancements to the Celonis Process Intelligence (PI) Platform that help enterprises harness ‘enterprise AI,’ powered by PI, to continuously improve operations, creating measurable business value at scale.Focus on measurable ROIThe event’s focus on achieving AI ROI reflects three challenges facing technology and business leaders moving from pilot to production: obsolete systems, break-neck industry change, and agentic AI. According to Gartner, 64% of board members now view AI as a top-three priority — yet only 10% of organizations report meaningful financial returns.Celonis customers are bucking that trend. A Forrester Total Economic Impact study found organizations using its platform achieved 383% ROI over three years, with payback in just six months. One company improved sales order automation from 33% to 86%, saving $24.5 million. The study estimated $44.1 million in total benefits over three years, driven by faster automation, reduced inefficiencies, and higher process visibility. These numbers underscore a broader pattern — companies that modernize outdated systems and align AI with process optimization see faster payback and sustained gains.Real companies, real resultsCelosphere will spotlight how global enterprises are building “future-fit” operations. Mercedes-Benz Group AG and Vinmar Group will showcase AI-driven, composable solutions, powered by PI, and attendees will see demonstrations of PI enabling agents in live production environments.Among the notable success stories: AstraZeneca, the pharmaceutical company, reduced excess inventory while keeping critical medicines flowing by using Celonis as a foundation for its OpenAI partnership.The State of Oklahoma can answer procurement status questions at scale, unlocking over $10 million in value. Cosentino clears blocked sales orders up to 5x faster using an AI-powered credit management assistant. Raising the stakes for agentic AINumerous sessions will focus on orchestrating AI agents. The shift from AI-as-advisor to AI-as-actor, changes everything, says Rinke. “The agent needs to understand not just what to do, but how your specific business actually works,” he explains. “Process intelligence provides those rails." This leap from recommendation to autonomous action raises the stakes exponentially. When agents can independently trigger purchase orders, reroute shipments, or approve exceptions, bad context can mean catastrophically bad outcomes at scale.Celosphere attendees will get to see first-hand how companies are using the Celonis Orchestration Engine to coordinate AI agents alongside people and systems. 
Effective orchestration is a crucial protection against the chaos of agents working at cross-purposes, duplicating actions, or letting crucial steps fall through the cracks. Navigating tariffs and supply chain shocksGlobal trade volatility isn't just a headline — it's an operational nightmare reshaping how companies deploy AI, Rinke says. New tariffs trigger cascading effects across procurement, logistics, and compliance. Each policy shift can cascade across thousands of SKUs — forcing new supplier contracts, rerouted shipments, and rebalanced inventories. For AI systems trained on static conditions, that volatility is almost impossible to predict. Traditional AI systems struggle with such variability — but process intelligence gives organizations real-time visibility into how changes ripple through operations.Celosphere case studies will show how companies turn disruption into advantage. Smurfit Westrock uses PI to optimize inventory and reduce costs amid tariff uncertainty, while ASOS leverages PI to optimize its supply chain operations, enhancing efficiency, reducing costs, and continuing to deliver an outstanding customer experience.Platform over point solutionsRinke argues that Celonis’ edge lies in treating process intelligence not as an add-on, but as the foundation of the enterprise stack. Unlike bolt-on optimization tools, the Celonis platform creates a living digital twin of business operations — a continuously updated model enriched by context that lets AI operate effectively from analysis to execution.“What sets Celonis apart is visibility across systems and offline tasks, which is critical for true intelligent automation,” Rinke says. “The platform offers comprehensive capabilities spanning process analysis, design, and orchestration rather than a point solution.”“Free the Process” and the future of AICelonis continues to champion openness through its “Free the Process” movement, promoting fair competition and freeing enterprises from legacy lock-in. By giving organizations full access to their own process data, open APIs, and a growing partner network that includes The Hackett Group, ClearOps, and Lobster, Celonis is building the connective tissue for a new era of interoperable automation.For Rinke, this open foundation is what turns AI from a set of experiments into an enterprise engine. “Process intelligence creates a flywheel,” he says. “Better understanding leads to better optimization, which enables better AI — and that, in turn, drives even greater understanding. There is no AI without PI."Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

Machine learning continues to evolve faster than most can keep up with.

#artificial intelligence #app

Are you feeling it? I hear it’s close: two years, five years—maybe next year! And I hear it’s going to change everything: it will cure disease, save the planet, and usher in an age of abundance. It will solve our biggest problems in ways we cannot yet imagine. It will redefine what it means to…
