The coding framework uses modular concepts and simple synchronization rules to make software clearer, safer, and easier for LLMs to generate.
Presented by Salesforce

Vibe coding — the fast-growing trend of using generative AI to spin up code from plain-language prompts — is quick, creative, and great for instant prototypes. But many argue that it's not cut out for building production-ready business apps with the security, governance, and trusted infrastructure that enterprises require. In other words, a few saved hours in development can mean a future full of security vulnerabilities, endless maintenance, and scalability headaches, says Mohith Shrivastava, principal developer advocate at Salesforce.

"For rapid experimentation, building minimum viable products, and tackling creative challenges, vibe coding is a game-changer," Shrivastava says. "However, that same speed and improvisational nature are exactly what makes its application in a professional, enterprise setting a topic of intense debate. And the skepticism from the developer community is 100% justified."

Risks and rewards of vibe coding

The excitement is all about speed: going from a rough idea to a working prototype in hours, not weeks, is a massive advantage. But as Shrivastava shared, developers have been vocal about the potential downsides.

"When you apply vibe coding indiscriminately to an entire application stack, you're not just moving fast; you're accumulating risk at an unprecedented rate," Shrivastava explains. "The cons are significant." That includes potential security nightmares, as AI models don't typically take into consideration the company's specific security policies. They can easily introduce vulnerabilities like hardcoded secrets or use insecure, hallucinated packages. Then there's the issue of what Shrivastava calls "spaghetti code on steroids," or verbose code that lacks a coherent architectural pattern, creating a mountain of technical debt.

Equally concerning is the illusion of progress: vibe coding may complete 80% of a feature in record time, but the remaining 20% — the edge cases, performance tuning, and compliance work — becomes exponentially harder.

But does this mean vibe coding has no place in the enterprise?

"The idea that you can just vibe your way to a complex, secure, and maintainable enterprise application is a dangerous fantasy," Shrivastava says. "But — the pros are undeniable if it's used correctly. The key is not to avoid vibe coding, but to apply it intelligently in your enterprise."

Red and green zones: Enterprise-grade vibe coding

You can't, and you absolutely should not, vibe code your entire enterprise stack with just any generic tool, Shrivastava warns. But when paired with no-, low-, or pro-code tools that are built for the enterprise, many of the gaps can be addressed. An enterprise-grade vibe coding solution, for example, can automatically scan for security issues, flag performance bottlenecks, and provide a safety net. It's also critical to understand which parts of an application suit this approach — and which demand a higher level of trust and control. Shrivastava divides the stack into red and green zones to illustrate.

The green zone is the presentation layer, or the UI and UX. It's ideal for vibe coding, where developers can move fast and iterate quickly without much risk. In contrast is the red zone, which covers the foundational pillars of an application, including business logic and data layers.

Empowering developers in the green zone

Developer expertise remains the foundation for effective and safe vibe coding.
But developers can be amplified by AI tools and emerging agents that are grounded in business context, connected to real applications, integrations, and data flows.

"A generic AI agent can't grasp your company's unique processes, but a context-aware tool can act as a powerful pair programmer, helping a developer draft complex logic or model data with greater speed and accuracy," Shrivastava says. "It's about making the expert developer more efficient, not trying to do their job for them."

Some areas will always be high risk for ungoverned AI — especially infrastructure and security. Letting a generic AI agent configure firewalls or Identity and Access Management (IAM) policies without oversight, Shrivastava warns, is a recipe for disaster. The solution isn't to avoid the red zone entirely, but to approach it with the right tools — ones that embed governance, security, and context from the ground up.

"The winning strategy is clear: Vibe code the green zone for agility, approach the red zone by augmenting your developers with powerful, context-aware tools, and never, ever DIY your core infrastructure with AI," he says.

Embracing enterprise vibe coding

To harness the power of enterprise vibe coding, Salesforce developed Agentforce Vibes. This new vibe coding offering for the enterprise includes Agentforce, an autonomous AI agent built to collaborate like a pair programmer on the Salesforce Platform. It's designed precisely to provide developers with the right tools for the job, covering both the green and red zones. For the green zone, it offers the speed and agility to rapidly build UIs and prototypes. But its true power lies in how it augments developers in the red zone.

"Enterprise vibe coding like Agentforce lets organizations take AI-assisted development to the organizational level, accelerating coding, testing, and deployment, while ensuring consistency, security, and performance," says Dan Fernandez, VP of product, developer services at Salesforce. "It's not about throwing away governance for speed; it's about integrating AI into every stage of the application lifecycle to work smarter."

Because Agentforce Vibes' tooling is deeply integrated with your business context on the platform, it can safely assist with business logic and data modeling. Most importantly, it operates on a trusted platform. Instead of a DIY approach — jury-rigging a generic AI agent to handle your networking — developers build on a foundation that has security and governance built in, so they can innovate safely, knowing the most critical layers of the stack are secure and compliant.

Major enterprises are putting vibe coding to work

Agentforce Vibes users are now tapping the tool to build around 20 to 25% of their new code base, according to Salesforce data, and users are accepting around 1.2 million lines of agentic code per month. That includes companies like Coinbase, CGI, Grupo Globo, and one of the top five banks in the U.S., which is using Agentforce Vibes capabilities to develop production-ready apps faster. Agentforce Vibes is part of a suite of tools in Agentforce 360 that span from no-code and low-code to pro-code development. Together, these tools are helping customers develop and deploy at speeds previously unheard of.

With the low-code Agent Builder in Agentforce, the Secret Escapes team was able to build, test, and launch their agent to support customer service in just two weeks, compared to the six months it had previously taken the company to build and train a bot.
With Agentforce, 1-800Accountant autonomously resolved 70% of customer chat engagements during tax week in 2025, without writing a line of code, using Salesforce's low-code tools and AI assistance. Meanwhile, media company Grupo Globo deployed agents to identify subscribers at risk of lapsing, offer personalized upgrades, cross-sell, and convert non-subscribers. As a result, Agentforce boosted Globo's retention rates by 22% in less than three months.

Innovation meets discipline

Enterprise tools show that disciplined engineering and creative experimentation can coexist — and that balance, Shrivastava says, is the key to lasting innovation.

"Vibe coding is not a fad, but it's also not a silver bullet that will replace disciplined software engineering," Shrivastava says. "The smart path forward is a hybrid approach where human software skills are augmented with agentic intelligence. This balanced approach is how you get the best of both worlds: radical innovation at the edge and unwavering stability at the core."

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they're always clearly marked. For more information, contact sales@venturebeat.com.
Presented by Arm

AI is no longer confined to the cloud or data centers. Increasingly, it's running directly where data is created — in devices, sensors, and networks at the edge. This shift toward on-device intelligence is being driven by latency, privacy, and cost concerns that companies are confronting as they continue their investments in AI. For leadership teams, the opportunity is clear, says Chris Bergey, SVP and GM of Arm's Client Business: Invest in AI-first platforms that complement cloud usage, deliver real-time responsiveness, and protect sensitive data.

"With the explosion of connected devices and the rise of IoT, edge AI provides a significant opportunity for organizations to gain a competitive edge through faster, more efficient AI," Bergey explains. "Those who move first aren't just improving efficiency, they're redefining what customers expect. AI is becoming a differentiator in trust, responsiveness, and innovation. The sooner a business makes AI central to its workflows, the faster it compounds that advantage."

Use cases: Deploying AI where data lives

Enterprises are discovering that edge AI isn't just a performance boost — it's a new operational model. Processing locally means less dependency on the cloud and faster, safer decision-making in real time. For instance, a factory floor can analyze equipment data instantly to prevent downtime, while a hospital can run diagnostic models securely on-site. Retailers are deploying in-store analytics using vision systems, while logistics companies are using on-device AI to optimize fleet operations. Instead of sending vast data volumes to the cloud, organizations can analyze and act on insights where they emerge. The result is a more responsive, privacy-preserving, and cost-effective AI architecture.

The consumer expectation: Immediacy and trust

Working with Alibaba's Taobao team, the largest Chinese ecommerce platform, Arm (Nasdaq: ARM) enabled on-device product recommendations that update instantly without depending on the cloud. This helped online shoppers find what they need faster while keeping browsing data private.

Another example comes from consumer tech: Meta's Ray-Ban smart glasses, which blend cloud and on-device AI. The glasses handle quick commands locally for faster responses, while heavier tasks like translation and visual recognition are processed in the cloud.

"Every major technology shift has created new ways to engage and monetize," Bergey says. "As AI capabilities and user expectations grow, more intelligence will need to move closer to the edge to deliver this kind of immediacy and trust that people now expect."

This shift is also taking place with the tools people use every day. Assistants like Microsoft Copilot and Google Gemini are blending cloud and on-device intelligence to bring generative AI closer to the user, delivering faster, more secure, and more context-aware experiences. That same principle applies across industries: the more intelligence you move safely and efficiently to the edge, the more responsive, private, and valuable your operations become.

Building smarter for scale

The explosion of AI at the edge demands not only smarter chips but smarter infrastructure. By aligning compute power with workload demands, enterprises can reduce energy consumption while maintaining high performance. This balance of sustainability and scale is fast becoming a competitive differentiator.

"Compute needs, whether in the cloud or on-premises, will continue to rise sharply.
The question becomes, how do you maximize value from that compute?" he said. "You can only do this by investing in compute platforms and software that scale with your AI ambitions. The real measure of progress is enterprise value creation, not raw efficiency metrics."The intelligent foundationThe rapid evolution of AI models, especially those powering edge inferencing, multimodal applications, and low-latency responses, demands not just smarter algorithms, but a foundation of highly performant, energy-efficient hardware. As workloads grow more diverse and distributed, legacy architectures designed for traditional workloads are no longer adequate. The role of CPUs is evolving, and they now sit at the center of increasingly heterogenous systems that deliver advanced on-device AI experiences. Thanks to their flexibility, efficiency, and mature software support, modern CPUs can run everything from classic machine learning to complex generative AI workloads. When paired with accelerators such as NPUs or GPUs, they intelligently coordinate compute across the system — ensuring the right workload runs on the right engine for maximum performance and efficiency. The CPU continues to be the foundation that enables scalable, efficient AI everywhere.Technologies like Arm’s Scalable Matrix Extension 2 (SME2) bring advanced matrix acceleration to Armv9 CPUs. Meanwhile, Arm KleidiAI, its intelligent software layer, is extensively integrated across leading frameworks to automatically boost performance for a wide range of AI workloads, from language models to speech recognition to computer vision, running on Arm-based edge devices — without needing developers to rewrite their code."These technologies ensure that AI frameworks can tap into the full performance of Arm-based systems without extra developer effort," he says. "It’s how we make AI both scalable and sustainable: by embedding intelligence into the foundation of modern compute, so innovation happens at the speed of software, not hardware cycles."That democratization of compute power is also what will facilitate the next wave of intelligent, real-time experiences across the enterprise, not just in flagship products, but across entire device portfolios. The evolution of edge AI As AI moves from isolated pilots to full-scale deployment, the enterprises that succeed will be those that connect intelligence across every layer of infrastructure. Agentic AI systems will depend on this seamless integration — enabling autonomous processes that can reason, coordinate, and deliver value instantly."The pattern is familiar as in every disruptive wave, incumbents that move slowly risk being overtaken by new entrants," he says. "The companies that thrive will be the ones that wake up every morning asking how to make their organization AI-first. As with the rise of the internet and cloud computing, those who lean in and truly become AI-enabled will shape the next decade."Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
By now, enterprises understand that retrieval augmented generation (RAG) allows applications and agents to find the best, most grounded information for queries. However, typical RAG setups can be an engineering challenge to build and can still behave in undesirable ways. To help solve this, Google released the File Search Tool on the Gemini API, a fully managed RAG system "that abstracts away the retrieval pipeline."

File Search removes much of the tool assembly involved in setting up RAG pipelines, so engineers don't need to stitch together components like storage solutions and embedding generators. The tool competes directly with enterprise RAG products from OpenAI, AWS and Microsoft, which also aim to simplify RAG architecture. Google, though, claims its offering requires less orchestration and is more standalone.

"File Search provides a simple, integrated and scalable way to ground Gemini with your data, delivering responses that are more accurate, relevant and verifiable," Google said in a blog post.

Enterprises can access some features of File Search, such as storage and embedding generation at query time, for free. Users pay for embeddings when files are first indexed, at a fixed rate of $0.15 per 1 million tokens. File Search is powered by Google's Gemini Embedding model, which became the top embedding model on the Massive Text Embedding Benchmark.

File Search and integrated experiences

Google said File Search works "by handling the complexities of RAG for you." File Search manages file storage, chunking strategies and embeddings. Developers can invoke File Search within the existing generateContent API, which Google said makes the tool easier to adopt.

File Search uses vector search to "understand the meaning and context of a user's query." Ideally, it will find the relevant information to answer a query from documents, even if the prompt contains inexact words. The feature has built-in citations that point to the specific parts of a document it used to generate answers, and it supports a variety of file formats. These include PDF, DOCX, TXT, JSON and "many common programming language file types," Google says.

Continuous RAG experimentation

Enterprises may have already begun building out a RAG pipeline as they lay the groundwork for their AI agents to actually tap the correct data and make informed decisions. Because RAG is a key part of how enterprises maintain accuracy and tap into insights about their business, organizations need clear visibility into this pipeline.

RAG can be an engineering pain because orchestrating multiple tools can become complicated. Building "traditional" RAG pipelines means organizations must assemble and fine-tune a file ingestion and parsing program, including chunking, embedding generation and updates. They must then procure a vector database like Pinecone, determine its retrieval logic, and fit it all within a model's context window. Additionally, they can, if desired, add source citations.

File Search aims to streamline all of that, although competitor platforms offer similar features. OpenAI's Assistants API allows developers to use a file search feature, guiding an agent to relevant documents for responses. AWS's Bedrock unveiled a managed data automation service in December. While File Search resembles these other offerings, Google's abstracts all, rather than just some, elements of RAG pipeline creation.
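To make the developer workflow concrete, below is a minimal Python sketch of grounding a Gemini call on a pre-indexed file store via the generateContent flow described above. It uses the google-genai SDK's client pattern, but the file_search tool payload and store name are assumptions based on the announcement rather than a verified API reference, so treat it as illustrative only.

```python
# Hypothetical sketch of Gemini File Search grounding.
# The shape of the file_search tool config is an assumption, not a verified API.
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Assumed payload: point the managed RAG tool at a store you indexed earlier.
file_search_tool = {"file_search": {"file_search_store_names": ["my-docs-store"]}}

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What does our Q3 architecture doc say about caching?",
    config={"tools": [file_search_tool]},
)

print(response.text)  # grounded answer; built-in citations arrive in response metadata
```

The appeal, per Google's framing, is that chunking, embedding and retrieval all happen behind that single tool entry rather than in a hand-built pipeline.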
Phaser Studio, the creator of AI-driven game generation platform Beam, said in Google's blog that it used File Search to sift through its library of 3,000 files.

"File Search allows us to instantly surface the right material, whether that's a code snippet for bullet patterns, genre templates or architectural guidance from our Phaser 'brain' corpus," said Phaser CTO Richard Davey. "The result is ideas that once took days to prototype now become playable in minutes."

Since the announcement, several users expressed interest in using the feature.
AI agents tested, update ChatGPT mid-stream, Tinder photo mining, AI security flaws, and more...
Google Cloud has introduced a big update in a bid to keep AI developers on its Vertex AI platform for concepting, designing, building, testing, deploying and modifying AI agents in enterprise use cases.

The new features, announced today, include additional governance tools for enterprises, expanded capabilities for creating agents with just a few lines of code, faster iteration with state-of-the-art context management layers and one-click deployment, managed services for scaling production and evaluation, and support for agent identities.

Agent Builder, released last year during Google's annual Cloud Next event, provides a no-code platform for enterprises to create agents and connect them to orchestration frameworks like LangChain. Google's Agent Development Kit (ADK), which lets developers build agents "in under 100 lines of code," can also be accessed through Agent Builder.

"These new capabilities underscore our commitment to Agent Builder, and simplify the agent development process to meet developers where they are, no matter which tech stack they choose," said Mike Clark, director of Product Management, Vertex AI Agent Builder.

Build agents faster

Part of Google's pitch for Agent Builder's new features is that enterprises can bake in orchestration even as they construct their agents. "Building an agent from a concept to a working product involves complex orchestration," said Clark.

The new capabilities, which are shipped with the ADK, include:

- State-of-the-art context management layers, including Static, Turn, User and Cache layers, so enterprises have more control over an agent's context
- Prebuilt plugins with customizable logic; one of the new plugins allows agents to recognize failed tool calls and "self-heal" by retrying the task with a different approach
- Additional language support in the ADK, including Go, alongside the Python and Java support that launched with the ADK
- One-click deployment through the ADK command line interface to move agents from a local environment to live testing with a single command

Governance layer

Enterprises require high accuracy; security; observability and auditability (what a program did and why); and steerability (control) in their production-grade AI agents.

While Google had observability features in the local development environment at launch, developers can now access these tools through the Agent Engine managed runtime dashboard. The company said this brings cloud-based production monitoring to track token consumption, error rates and latency. Within this observability dashboard, enterprises can visualize the actions agents take and reproduce any issues. Agent Engine will also have a new Evaluation Layer to help "simulate agent performance across a vast array of user interactions and situations."

This governance layer will also include:

- Agent Identities, which Google said give "agents their own unique, native identities within Google Cloud"
- Model Armor, which would block prompt injections and screen tool calls and agent responses
- Security Command Center, so admins can build an inventory of their agents to detect threats like unauthorized access

"These native identities provide a deep, built-in layer of control and a clear audit trail for all agent actions. These certificate-backed identities further strengthen your security as they cannot be impersonated and are tied directly to the agent's lifecycle, eliminating the risk of dormant accounts," Clark said.

The battle of agent builders

It's no surprise that model providers create platforms to build agents and bring them to production.
The competition lies in how fast new tools and features are added. Google's Agent Builder competes with OpenAI's open-source Agents SDK, which enables developers to create AI agents using non-OpenAI models, as well as OpenAI's recently announced AgentKit, which features its own Agent Builder that lets companies integrate agents into their applications easily. Microsoft has its Azure AI Foundry, launched around this time last year for AI agent creation, and AWS also offers agent builders on its Bedrock platform, but Google is hoping its suite of new features will help give it a competitive edge.

However, it isn't just companies with their own models that court developers to build AI agents within their platforms. Any enterprise service provider with an agent library also wants clients to make agents on their systems. Capturing developer interest and keeping developers within the ecosystem is the big battle between tech companies now, fought with features that make building and governing agents easier.
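For a sense of what Google's "agent in under 100 lines of code" pitch looks like in practice, here is a minimal sketch in the style of the ADK Python quickstart. The order-lookup tool and its data are invented for illustration, and class and CLI details should be checked against the current ADK documentation; this is a sketch, not an official sample.

```python
# Minimal agent sketch following the ADK Python quickstart pattern.
# The order-status tool and its data are hypothetical.
from google.adk.agents import Agent


def get_order_status(order_id: str) -> dict:
    """Look up the status of an order (stubbed data for illustration)."""
    fake_orders = {"A-1001": "shipped", "A-1002": "processing"}
    return {"order_id": order_id, "status": fake_orders.get(order_id, "not found")}


# ADK tooling conventionally discovers a module-level `root_agent`
# when you run the agent locally (e.g. via the `adk` CLI).
root_agent = Agent(
    name="order_support_agent",
    model="gemini-2.0-flash",
    description="Answers customer questions about order status.",
    instruction="Use the get_order_status tool before answering order questions.",
    tools=[get_order_status],
)
```

The one-click deployment described above then maps to a single CLI invocation (an `adk deploy ...` command in current builds), with flags that vary by target environment.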
Today, we're announcing enhancements to Structured Outputs in the Gemini API.
AI models can help map species, protect forests and listen to birds around the world
Language models, as incredibly useful as they are, are not perfect; they may fail or exhibit undesired behavior due to a variety of factors, such as data quality, tokenization constraints, or difficulty in correctly interpreting user prompts.
The following article originally appeared on Gradient Flow and is being reposted here with the author’s permission. We’re living through a peculiar moment in AI development. On one hand, the demos are spectacular: agents that reason and plan with apparent ease, models that compose original songs from a text prompt, and research tools that produce […]
USC researchers built artificial neurons that replicate real brain processes using ion-based diffusive memristors. These devices emulate how neurons use chemicals to transmit and process signals, offering massive energy and size advantages. The technology may enable brain-like, hardware-based learning systems. It could transform AI into something closer to natural intelligence.
A new approach developed at MIT could help a search-and-rescue robot navigate an unpredictable environment by rapidly generating an accurate map of its surroundings.
The latest big headline in AI isn’t model size or multimodality — it’s the capacity crunch. At VentureBeat’s latest AI Impact stop in NYC, Val Bercovici, chief AI officer at WEKA, joined Matt Marshall, VentureBeat CEO, to discuss what it really takes to scale AI amid rising latency, cloud lock-in, and runaway costs.Those forces, Bercovici argued, are pushing AI toward its own version of surge pricing. Uber famously introduced surge pricing, bringing real-time market rates to ridesharing for the first time. Now, Bercovici argued, AI is headed toward the same economic reckoning — especially for inference — when the focus turns to profitability."We don't have real market rates today. We have subsidized rates. That’s been necessary to enable a lot of the innovation that’s been happening, but sooner or later — considering the trillions of dollars of capex we’re talking about right now, and the finite energy opex — real market rates are going to appear; perhaps next year, certainly by 2027," he said. "When they do, it will fundamentally change this industry and drive an even deeper, keener focus on efficiency."The economics of the token explosion"The first rule is that this is an industry where more is more. More tokens equal exponentially more business value," Bercovici said. But so far, no one's figured out how to make that sustainable. The classic business triad — cost, quality, and speed — translates in AI to latency, cost, and accuracy (especially in output tokens). And accuracy is non-negotiable. That holds not only for consumer interactions with agents like ChatGPT, but for high-stakes use cases such as drug discovery and business workflows in heavily regulated industries like financial services and healthcare."That’s non-negotiable," Bercovici said. "You have to have a high amount of tokens for high inference accuracy, especially when you add security into the mix, guardrail models, and quality models. Then you’re trading off latency and cost. That’s where you have some flexibility. If you can tolerate high latency, and sometimes you can for consumer use cases, then you can have lower cost, with free tiers and low cost-plus tiers." However, latency is a critical bottleneck for AI agents. “These agents now don't operate in any singular sense. You either have an agent swarm or no agentic activity at all,” Bercovici noted.In a swarm, groups of agents work in parallel to complete a larger objective. An orchestrator agent — the smartest model — sits at the center, determining subtasks and key requirements: architecture choices, cloud vs. on-prem execution, performance constraints, and security considerations. The swarm then executes all subtasks, effectively spinning up numerous concurrent inference users in parallel sessions. Finally, evaluator models judge whether the overall task was successfully completed.“These swarms go through what's called multiple turns, hundreds if not thousands of prompts and responses until the swarm convenes on an answer,” Bercovici said. “And if you have a compound delay in those thousand turns, it becomes untenable. So latency is really, really important. And that means typically having to pay a high price today that's subsidized, and that's what's going to have to come down over time.”Reinforcement learning as the new paradigmUntil around May of this year, agents weren't that performant, Bercovici explained. 
And then context windows became large enough, and GPUs available enough, to support agents that could complete advanced tasks, like writing reliable software. It's now estimated that in some cases, 90% of software is generated by coding agents. Now that agents have essentially come of age, Bercovici noted, reinforcement learning is the new conversation among data scientists at some of the leading labs, like OpenAI, Anthropic, and Gemini, who view it as a critical path forward in AI innovation.."The current AI season is reinforcement learning. It blends many of the elements of training and inference into one unified workflow,” Bercovici said. “It’s the latest and greatest scaling law to this mythical milestone we’re all trying to reach called AGI — artificial general intelligence,” he added. "What’s fascinating to me is that you have to apply all the best practices of how you train models, plus all the best practices of how you infer models, to be able to iterate these thousands of reinforcement learning loops and advance the whole field."The path to AI profitability There’s no one answer when it comes to building an infrastructure foundation to make AI profitable, Bercovici said, since it's still an emerging field. There’s no cookie-cutter approach. Going all on-prem may be the right choice for some — especially frontier model builders — while being cloud-native or running in a hybrid environment may be a better path for organizations looking to innovate agilely and responsively. Regardless of which path they choose initially, organizations will need to adapt their AI infrastructure strategy as their business needs evolve."Unit economics are what fundamentally matter here," said Bercovici. "We are definitely in a boom, or even in a bubble, you could say, in some cases, since the underlying AI economics are being subsidized. But that doesn’t mean that if tokens get more expensive, you’ll stop using them. You’ll just get very fine-grained in terms of how you use them." Leaders should focus less on individual token pricing and more on transaction-level economics, where efficiency and impact become visible, Bercovici concludes. The pivotal question enterprises and AI companies should be asking, Bercovici said, is “What is the real cost for my unit economics?”Viewed through that lens, the path forward isn’t about doing less with AI — it’s about doing it smarter and more efficiently at scale.
Presented by Elastic Logs set to become the primary tool for finding the “why” in diagnosing network incidents Modern IT environments have a data problem: there’s too much of it. Organizations that need to manage a company’s environment are increasingly challenged to detect and diagnose issues in real-time, optimize performance, improve reliability, and ensure security and compliance — all within constrained budgets. The modern observability landscape has many tools that offer a solution. Most revolve around DevOps teams or Site Reliability Engineers (SREs) analyzing logs, metrics, and traces to uncover patterns and figure out what’s happening across the network, and diagnose why an issue or incident occurred. The problem is that the process creates information overload: A Kubernetes cluster alone can emit 30 to 50 gigabytes of logs a day, and suspicious behavior patterns can sneak past human eyes. "It’s so anachronistic now, in the world of AI, to think about humans alone observing infrastructure," says Ken Exner, chief product officer at Elastic. "I hate to break it to you, but machines are better than human beings at pattern matching.“An industry-wide focus on visualizing symptoms forces engineers to manually hunt for answers. The crucial "why" is buried in logs, but because they contain massive volumes of unstructured data, the industry tends to use them as a tool of last resort. This has forced teams into costly tradeoffs: either spend countless hours building complex data pipelines, drop valuable log data and risk critical visibility gaps, or log and forget.Elastic, the Search AI Company, recently released a new feature for observability called Streams, which aims to become the primary signal for investigations by taking noisy logs and turning them into patterns, context and meaning. Streams uses AI to automatically partition and parse raw logs to extract relevant fields, and greatly reduce the effort required of SREs to make logs usable. Streams also automatically surfaces significant events such as critical errors and anomalies from context-rich logs, giving SREs early warnings and a clear understanding of their workloads, enabling them to investigate and resolve issues faster. The ultimate goal is to show remediation steps."From raw, voluminous, messy data, Streams automatically creates structure, putting it into a form that is usable, automatically alerts you to issues and helps you remediate them," Exner says. "That is the magic of Streams."A broken workflowStreams upends an observability process that some say is broken. Typically, SREs set up metrics, logs and traces. Then they set up alerts, and service level objectives (SLOs) — often hard-coded rules to show where a service or process has gone beyond a threshold, or a specific pattern has been detected. When an alert is triggered, it points to the metric that's showing an anomaly. From there, SREs look at a metrics dashboard, where they can visualize the issue and compare the alert to other metrics, or CPU to memory to I/O, and start looking for patterns. They may then need to look at a trace, and examine upstream and downstream dependencies across the application to dig into the root cause of the issue. Once they figure out what's causing the trouble, they jump into the logs for that database or service to try and debug the issue. Some companies simply seek to add more tools when current ones prove ineffective. 
That means SREs are hopping from tool to tool to keep on top of monitoring and troubleshooting across their infrastructure and applications."You’re hopping across different tools. You’re relying on a human to interpret these things, visually look at the relationship between systems in a service map, visually look at graphs on a metrics dashboard, to figure out what and where the issue is, " Exner says. "But AI automates that workflow away." With AI-powered Streams, logs are not just used reactively to resolve issues, but also to proactively process potential issues and create information-rich alerts that help teams jump straight to problem-solving, offering a solution for remediation or even fixing the issue entirely, before automatically notifying the team that it's been taken care of."I believe that logs, the richest set of information, the original signal type, will start driving a lot of the automation that a service reliability engineer typically does today, and does very manually," he adds. "A human should not be in that process, where they are doing this by digging into themselves, trying to figure out what is going on, where and what the issue is, and then once they find the root cause, they’re trying to figure out how to debug it."Observability’s future Large language models (LLMs) could be a key player in the future of observability. LLMs excel at recognizing patterns in vast quantities of repetitive data, which closely resembles log and telemetry data in complex, dynamic systems. And today’s LLMs can be trained for specific IT processes. With automation tooling, the LLM has the information and tools it needs to resolve database errors or Java heap issues, and more. Incorporating those into platforms that bring context and relevance will be essential. Automated remediation will still take some time, Exner says, but automated runbooks and playbooks generated by LLMs will become standard practice within the next couple of years. In other words, remediation steps will be driven by LLMs. The LLM will offer up fixes, and the human will verify and implement them, rather than calling in an expert.Addressing skill shortagesGoing all in on AI for observability would help address a major shortage in the talent needed to manage IT infrastructure. Hiring is slow because organizations need teams with a great deal of experience and understanding of potential issues, and how to resolve them fast. That experience can come from an LLM that is contextually grounded, Exner says."We can help deal with the skill shortage by augmenting people with LLMs that make them all instantly experts," he explains. "I think this is going to make it much easier for us to take novice practitioners and make them expert practitioners in both security and observability, and it’s going to make it possible for a more novice practitioner to act like an expert.” Streams in Elastic Observability is available now. Get started by reading more on the Streams. Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
AI in orbit, Microsoft replaces OpenAI, Sora Android, bubble explained, and more...
The intelligence of AI models isn't what's blocking enterprise deployments. It's the inability to define and measure quality in the first place.That's where AI judges are now playing an increasingly important role. In AI evaluation, a "judge" is an AI system that scores outputs from another AI system. Judge Builder is Databricks' framework for creating judges and was first deployed as part of the company's Agent Bricks technology earlier this year. The framework has evolved significantly since its initial launch in response to direct user feedback and deployments.Early versions focused on technical implementation but customer feedback revealed the real bottleneck was organizational alignment. Databricks now offers a structured workshop process that guides teams through three core challenges: getting stakeholders to agree on quality criteria, capturing domain expertise from limited subject matter experts and deploying evaluation systems at scale."The intelligence of the model is typically not the bottleneck, the models are really smart," Jonathan Frankle, Databricks' chief AI scientist, told VentureBeat in an exclusive briefing. "Instead, it's really about asking, how do we get the models to do what we want, and how do we know if they did what we wanted?"The 'Ouroboros problem' of AI evaluationJudge Builder addresses what Pallavi Koppol, a Databricks research scientist who led the development, calls the "Ouroboros problem." An Ouroboros is an ancient symbol that depicts a snake eating its own tail. Using AI systems to evaluate AI systems creates a circular validation challenge."You want a judge to see if your system is good, if your AI system is good, but then your judge is also an AI system," Koppol explained. "And now you're saying like, well, how do I know this judge is good?"The solution is measuring "distance to human expert ground truth" as the primary scoring function. By minimizing the gap between how an AI judge scores outputs versus how domain experts would score them, organizations can trust these judges as scalable proxies for human evaluation.This approach differs fundamentally from traditional guardrail systems or single-metric evaluations. Rather than asking whether an AI output passed or failed on a generic quality check, Judge Builder creates highly specific evaluation criteria tailored to each organization's domain expertise and business requirements.The technical implementation also sets it apart. Judge Builder integrates with Databricks' MLflow and prompt optimization tools and can work with any underlying model. Teams can version control their judges, track performance over time and deploy multiple judges simultaneously across different quality dimensions.Lessons learned: Building judges that actually workDatabricks' work with enterprise customers revealed three critical lessons that apply to anyone building AI judges.Lesson one: Your experts don't agree as much as you think. When quality is subjective, organizations discover that even their own subject matter experts disagree on what constitutes acceptable output. A customer service response might be factually correct but use an inappropriate tone. A financial summary might be comprehensive but too technical for the intended audience."One of the biggest lessons of this whole process is that all problems become people problems," Frankle said. "The hardest part is getting an idea out of a person's brain and into something explicit. 
And the harder part is that companies are not one brain, but many brains."The fix is batched annotation with inter-rater reliability checks. Teams annotate examples in small groups, then measure agreement scores before proceeding. This catches misalignment early. In one case, three experts gave ratings of 1, 5 and neutral for the same output before discussion revealed they were interpreting the evaluation criteria differently.Companies using this approach achieve inter-rater reliability scores as high as 0.6 compared to typical scores of 0.3 from external annotation services. Higher agreement translates directly to better judge performance because the training data contains less noise.Lesson two: Break down vague criteria into specific judges. Instead of one judge evaluating whether a response is "relevant, factual and concise," create three separate judges. Each targets a specific quality aspect. This granularity matters because a failing "overall quality" score reveals something is wrong but not what to fix.The best results come from combining top-down requirements such as regulatory constraints, stakeholder priorities, with bottom-up discovery of observed failure patterns. One customer built a top-down judge for correctness but discovered through data analysis that correct responses almost always cited the top two retrieval results. This insight became a new production-friendly judge that could proxy for correctness without requiring ground-truth labels.Lesson three: You need fewer examples than you think. Teams can create robust judges from just 20-30 well-chosen examples. The key is selecting edge cases that expose disagreement rather than obvious examples where everyone agrees."We're able to run this process with some teams in as little as three hours, so it doesn't really take that long to start getting a good judge," Koppol said.Production results: From pilots to seven-figure deploymentsFrankle shared three metrics Databricks uses to measure Judge Builder's success: whether customers want to use it again, whether they increase AI spending and whether they progress further in their AI journey.On the first metric, one customer created more than a dozen judges after their initial workshop. "This customer made more than a dozen judges after we walked them through doing this in a rigorous way for the first time with this framework," Frankle said. "They really went to town on judges and are now measuring everything."For the second metric, the business impact is clear. "There are multiple customers who have gone through this workshop and have become seven-figure spenders on GenAI at Databricks in a way that they weren't before," Frankle said.The third metric reveals Judge Builder's strategic value. Customers who previously hesitated to use advanced techniques like reinforcement learning now feel confident deploying them because they can measure whether improvements actually occurred."There are customers who have gone and done very advanced things after having had these judges where they were reluctant to do so before," Frankle said. "They've moved from doing a little bit of prompt engineering to doing reinforcement learning with us. 
Why spend the money on reinforcement learning, and why spend the energy on reinforcement learning if you don't know whether it actually made a difference?"

What enterprises should do now

The teams successfully moving AI from pilot to production treat judges not as one-time artifacts but as evolving assets that grow with their systems.

Databricks recommends three practical steps. First, focus on high-impact judges by identifying one critical regulatory requirement plus one observed failure mode. These become your initial judge portfolio.

Second, create lightweight workflows with subject matter experts. A few hours reviewing 20-30 edge cases provides sufficient calibration for most judges. Use batched annotation and inter-rater reliability checks to denoise your data.

Third, schedule regular judge reviews using production data. New failure modes will emerge as your system evolves. Your judge portfolio should evolve with them.

"A judge is a way to evaluate a model, it's also a way to create guardrails, it's also a way to have a metric against which you can do prompt optimization and it's also a way to have a metric against which you can do reinforcement learning," Frankle said. "Once you have a judge that you know represents your human taste in an empirical form that you can query as much as you want, you can use it in 10,000 different ways to measure or improve your agents."
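To illustrate the "distance to human expert ground truth" idea at the heart of this approach, here is a minimal Python sketch of an LLM judge scored against expert labels. This is not Databricks' Judge Builder API; call_llm() is a stand-in for whatever judge model endpoint you use, and the 1-to-5 rubric is an assumed example.

```python
# Illustrative sketch: score a judge by its gap to human expert labels.
# Not the Judge Builder API; call_llm() is a placeholder for your model client.
from statistics import mean

JUDGE_PROMPT = """Rate the response for factual accuracy from 1 (wrong) to 5 (fully correct).
Question: {question}
Response: {response}
Answer with a single integer."""


def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your judge model client here.")


def judge_score(question: str, response: str) -> int:
    # One narrow criterion per judge, per the guidance above.
    return int(call_llm(JUDGE_PROMPT.format(question=question, response=response)).strip())


def distance_to_experts(examples: list[dict]) -> float:
    """Mean absolute gap between judge scores and expert labels on a 1-5 scale.

    Each example needs: question, response, expert_score (the human label).
    Smaller is better; a calibrated judge tracks the experts closely, so it can
    stand in for them at scale.
    """
    gaps = [
        abs(judge_score(ex["question"], ex["response"]) - ex["expert_score"])
        for ex in examples
    ]
    return mean(gaps)
```

In practice the 20-30 edge-case examples recommended above would serve as the expert-labeled set, and disagreement among the experts themselves would be checked before trusting the gap metric.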
When the transformer architecture was introduced in 2017 in the now seminal Google paper "Attention Is All You Need," it became an instant cornerstone of modern artificial intelligence. Every major large language model (LLM) — from OpenAI's GPT series to Anthropic's Claude, Google's Gemini, and Meta's Llama — has been built on some variation of its central mechanism: attention, the mathematical operation that allows a model to look back across its entire input and decide what information matters most.

Eight years later, the same mechanism that defined AI's golden age is now showing its limits. Attention is powerful, but it is also expensive — its computational and memory costs scale quadratically with context length, creating an increasingly unsustainable bottleneck for both research and industry. As models aim to reason across documents, codebases, or video streams lasting hours or days, attention becomes the architecture's Achilles' heel.

On October 28, 2025, the little-known AI startup Manifest AI introduced a radical alternative. Their new model, Brumby-14B-Base, is a retrained variant of Qwen3-14B-Base, one of the leading open-source transformer models. But while many variants of Qwen have been trained already, Brumby-14B-Base is novel in that it abandons attention altogether. Instead, Brumby replaces those layers with a novel mechanism called Power Retention — a recurrent, hardware-efficient architecture that stores and updates information over arbitrarily long contexts without the exponential memory growth of attention.

Trained at a stated cost of just $4,000, the 14-billion-parameter Brumby model performs on par with established transformer models like Qwen3-14B and GLM-4.5-Air, achieving near-state-of-the-art accuracy on a range of reasoning and comprehension benchmarks.

From Attention to Retention: The Architectural Shift

The core of Manifest AI's innovation lies in what they call the Power Retention layer. In a traditional transformer, every token computes a set of queries (Q), keys (K), and values (V), then performs a matrix operation that measures the similarity between every token and every other token — essentially a full pairwise comparison across the sequence. This is what gives attention its flexibility, but also what makes it so costly: processing a sequence twice as long takes roughly four times the compute and memory.

Power Retention keeps the same inputs (Q, K, V), but replaces the global similarity operation with a recurrent state update. Each layer maintains a memory matrix S, which is updated at each time step according to the incoming key, value, and a learned gating signal. The process looks more like an RNN (Recurrent Neural Network) than a transformer: instead of recomputing attention over the entire context, the model continuously compresses past information into a fixed-size latent state.

This means the computational cost of Power Retention does not grow with context length. Whether the model is processing 1,000 or 1,000,000 tokens, the per-token cost remains constant. That property alone — constant-time per-token computation — marks a profound departure from transformer behavior.

At the same time, Power Retention preserves the expressive power that made attention successful. Because the recurrence involves tensor powers of the input (hence the name "power retention"), it can represent higher-order dependencies between past and present tokens.
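For intuition, here is a toy NumPy sketch of the kind of gated recurrent state update described above: a fixed-size memory S is updated with an outer product of key and value at each step and read out by the query. This is a deliberate simplification for illustration; Manifest AI's actual Power Retention layer involves tensor powers of the input and learned gating, so the code below is not the published algorithm.

```python
# Toy illustration of a gated recurrent-state "retention" update, to contrast
# with attention's all-pairs comparison. Simplified; not the actual Power
# Retention layer (which uses tensor powers of the input).
import numpy as np

def retention_forward(Q, K, V, gates):
    """Q, K, V: (seq_len, d); gates: (seq_len,) in [0, 1]. Returns (seq_len, d).

    Per-token cost is constant: only the fixed-size state S (d x d) is touched,
    never the full history, so memory does not grow with context length.
    """
    seq_len, d = Q.shape
    S = np.zeros((d, d))                 # fixed-size memory matrix
    outputs = np.zeros((seq_len, d))
    for t in range(seq_len):
        S = gates[t] * S + np.outer(K[t], V[t])   # compress the past into S
        outputs[t] = Q[t] @ S                      # read the memory with the query
    return outputs

# Example: the per-token work is identical whether seq_len is 8 or 80,000.
rng = np.random.default_rng(0)
L, d = 8, 4
out = retention_forward(rng.normal(size=(L, d)), rng.normal(size=(L, d)),
                        rng.normal(size=(L, d)), np.full(L, 0.9))
print(out.shape)  # (8, 4)
```

The contrast with attention is the loop body: there is no comparison against every previous token, only an update to a state of fixed size.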
The result is an architecture that can theoretically retain long-term dependencies indefinitely, while remaining as efficient as an RNN and as expressive as a transformer.Retraining, Not RebuildingPerhaps the most striking aspect of Brumby-14B’s training process is its efficiency. Manifest AI trained the model for only 60 hours on 32 Nvidia H100 GPUs, at a cost of roughly $4,000 — less than 2% of what a conventional model of this scale would cost to train from scratch.However, since it relied on a transformer-based model, it's safe to say that this advance alone will not end the transformer AI-era.As Jacob Buckman, founder of Manifest AI, clarified in an email to VentureBeat: “The ability to train for $4,000 is indeed only possible when leveraging an existing transformer model,” he said. “Brumby could not be trained from scratch for that price.”Still, Buckman emphasized the significance of that result: “The reason this is important is that the ability to build on the weights of the previous generation of model architectures is a critical accelerant for the adoption of a new modeling paradigm.” He argues this demonstrates how attention-free systems can catch up to transformer performance “for orders-of-magnitude less” investment.In the loss curves released by Manifest AI, Brumby’s training loss quickly converges to that of the Qwen3 baseline within 3,000 training steps, even as the architecture diverges significantly from its transformer origins. Although Brumby-14B-Base began life as Qwen3-14B-Base, it did not remain identical for long. Manifest AI fundamentally altered Qwen3’s architecture by removing its attention layers—the mathematical engine that defines how a transformer model processes information—and replacing them with their new “power retention” mechanism. This change restructured the model’s internal wiring, effectively giving it a new brain while preserving much of its prior knowledge.Because of that architectural swap, the existing Qwen3 weights no longer fit perfectly. They were trained to operate within a transformer’s attention dynamics, not the new retention-based system. As a result, the Brumby model initially “forgot” how to apply some of its learned knowledge effectively. The retraining process—about 3,000 steps of additional learning—served to recalibrate those weights, aligning them with the power retention framework without having to start from zero.A helpful way to think about this is to imagine taking a world-class pianist and handing them a guitar. They already understand rhythm, harmony, and melody, but their hands must learn entirely new patterns to produce the same music. Similarly, Brumby had to relearn how to use its existing knowledge through a new computational instrument. Those 3,000 training steps were, in effect, its crash course in guitar lessons.By the end of this short retraining phase, Brumby had regained its full performance, reaching the same accuracy as the original Qwen3 model. 
That quick recovery is what makes the result so significant: it shows that an attention-free system can inherit and adapt the capabilities of a transformer model with only a fraction of the training time and cost.

The benchmark progression plots show a similar trend: the model rapidly approaches its target accuracy on core evaluations like GSM8K, HellaSwag, and MMLU after only a few thousand steps, matching or even slightly surpassing Qwen3 on several tasks.

Benchmarking the Brumby

Across standard evaluation tasks, Brumby-14B-Base consistently performs at or near parity with transformer baselines of comparable scale.

| Task | Brumby-14B | Qwen3-14B | GLM-4.5-Air | Nemotron Nano (12B) |
|---|---|---|---|---|
| ARC | 0.89 | 0.94 | 0.92 | 0.93 |
| GSM8K | 0.88 | 0.84 | 0.83 | 0.84 |
| GSM8K (Platinum) | 0.87 | 0.88 | 0.85 | 0.87 |
| HellaSwag | 0.77 | 0.81 | 0.85 | 0.82 |
| MATH | 0.62 | 0.54 | 0.47 | 0.26 |
| MBPP | 0.57 | 0.75 | 0.73 | 0.71 |
| MMLU | 0.71 | 0.78 | 0.77 | 0.78 |
| MMLU (Pro) | 0.36 | 0.55 | 0.51 | 0.53 |

While it lags slightly behind transformers on knowledge-heavy evaluations like MMLU-Pro, it matches or outperforms them on mathematical reasoning and long-context reasoning tasks — precisely where attention architectures tend to falter. This pattern reinforces the idea that recurrent or retention-based systems may hold a structural advantage for reasoning over extended temporal or logical dependencies.

Hardware Efficiency and Inference Performance

Brumby's power retention design offers another major advantage: hardware efficiency. Because the state update involves only local matrix operations, inference can be implemented with linear complexity in sequence length. Manifest AI reports that their fastest kernels, developed through their in-house CUDA framework Vidrial, can deliver hundreds-fold speedups over attention on very long contexts.

Buckman said the alpha-stage Power Retention kernels "achieve typical hardware utilization of 80–85%, which is higher than FlashAttention2's 70–75% or Mamba's 50–60%." (Mamba is another emerging "post-transformer" architecture, developed by Carnegie Mellon scientists back in 2023, that, like Power Retention, seeks to eliminate the computational bottleneck of attention. It replaces attention with a state-space mechanism that processes sequences linearly — updating an internal state over time rather than comparing every token to every other one. This makes it far more efficient for long inputs, though it typically achieves lower hardware utilization than Power Retention in early tests.)

Both Power Retention and Mamba, he added, "expend meaningfully fewer total FLOPs than FlashAttention2 on long contexts, as well as far less memory." According to Buckman, the reported 100× speedup comes from this combined improvement in utilization and computational efficiency, though he noted that "we have not yet stress-tested it on production-scale workloads."

Training and Scaling Economics

Perhaps no statistic in the Brumby release generated more attention than the training cost. A 14-billion-parameter model, trained for $4,000, represents a two-order-of-magnitude reduction in the cost of foundation model development.

Buckman confirmed that the low cost reflects a broader scaling pattern. "Far from diminishing returns, we have found that ease of retraining improves with scale," he said.
“The number of steps required to successfully retrain a model decreases with its parameter count.” Manifest has not yet validated the cost of retraining models at 700B parameters, but Buckman projected a range of $10,000–$20,000 for models of that magnitude—still far below transformer training budgets.He also reiterated that this approach could democratize large-scale experimentation by allowing smaller research groups or companies to retrain or repurpose existing transformer checkpoints without prohibitive compute costs.Integration and DeploymentAccording to Buckman, converting an existing transformer into a Power Retention model is designed to be simple. “It is straightforward for any company that is already retraining, post-training, or fine-tuning open-source models,” he said. “Simply pip install retention, change one line of your architecture code, and resume training where you left off.”He added that after only a small number of GPU-hours, the model typically recovers its original performance—at which point it gains the efficiency benefits of the attention-free design. “The resulting architecture will permit far faster long-context training and inference than previously,” Buckman noted.On infrastructure, Buckman said the main Brumby kernels are written in Triton, compatible with both NVIDIA and AMD accelerators. Specialized CUDA kernels are also available through the team’s in-house Vidrial framework. Integration with vLLM and other inference engines remains a work in progress: “We have not yet integrated Power Retention into inference engines, but doing so is a major ongoing initiative at Manifest.”As for distributed inference, Buckman dismissed concerns about instability: “We have not found this difficulty to be exacerbated in any way by our recurrent-state architecture. In fact, context-parallel training and GPU partitioning for multi-user inference both become significantly cleaner technically when using our approach.”Mission and Long-Term VisionBeyond the engineering details, Buckman also described Manifest’s broader mission. “Our mission is to train a neural network to model all human output,” he said. The team’s goal, he explained, is to move beyond modeling “artifacts of intelligence” toward modeling “the intelligent processes that generated them.” This shift, he argued, requires “fundamentally rethinking” how models are designed and trained—work that Power Retention represents only the beginning of.The Brumby-14B release, he said, is “one step forward in a long march” toward architectures that can model thought processes continuously and efficiently.Public Debate and Industry ReceptionThe launch of Brumby-14B sparked immediate discussion on X (formerly Twitter), where researchers debated the framing of Manifest AI’s announcement. Some, including Meta researcher Ariel (@redtachyon), argued that the “$4,000 foundation model” tagline was misleading, since the training involved reusing pretrained transformer weights rather than training from scratch.“They shuffled around the weights of Qwen, fine-tuned it a bit, and called it ‘training a foundation model for $4k,’” Ariel wrote.Buckman responded publicly, clarifying that the initial tweet had been part of a longer thread explaining the retraining approach. “It’s not like I was being deceptive about it,” he wrote. “I broke it up into separate tweets, and now everyone is mad about the first one.”In a follow-up email, Buckman took a measured view of the controversy. 
“The end of the transformer era is not yet here,” he reiterated, “but the march has begun.” He also acknowledged that the $4,000 claim, though technically accurate in context, had drawn attention precisely because it challenged expectations about what it costs to experiment at frontier scale.

Conclusion: A Crack in the Transformer’s Wall?

The release of Brumby-14B-Base is more than an engineering milestone; it is a proof of concept that the transformer’s dominance may finally face credible competition. By replacing attention with power retention, Manifest AI has demonstrated that performance parity with state-of-the-art transformers is possible at a fraction of the computational cost — and that the long-context bottleneck can be broken without exotic hardware.

The broader implications are twofold. First, the economics of training and serving large models could shift dramatically, lowering the barrier to entry for open research and smaller organizations. Second, the architectural diversity of AI models may expand again, reigniting theoretical and empirical exploration after half a decade of transformer monoculture.

As Buckman put it: “The end of the transformer era is not yet here. Our release is just one step forward in a long march toward the future.”
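To make the linear-complexity claim concrete, here is a minimal NumPy sketch of the general pattern behind attention-free sequence layers: a fixed-size state is updated with local matrix operations at each step, so compute grows linearly with context length and memory stays constant. This is an illustrative toy, not Manifest AI's Power Retention kernels or the retention package; the decay rule and projection names are assumptions chosen for clarity.

```python
import numpy as np

def recurrent_retention(inputs, W_k, W_v, W_q, decay=0.95):
    """Toy linear-recurrence layer: O(sequence_length) instead of O(n^2).

    A running state matrix accumulates key-value outer products with an
    exponential decay; each output is read from the state with a query.
    Generic illustration only, not Manifest AI's Power Retention kernels.
    """
    d_k, d_v = W_k.shape[1], W_v.shape[1]
    state = np.zeros((d_k, d_v))      # fixed-size state, independent of context length
    outputs = []
    for token in inputs:              # one pass over the sequence: linear complexity
        k, v, q = token @ W_k, token @ W_v, token @ W_q
        state = decay * state + np.outer(k, v)   # local matrix update, no token-to-token comparison
        outputs.append(q @ state)                # read-out for the current position
    return np.stack(outputs)

# Usage: 1,000 tokens with 64-dim embeddings; state memory stays constant as length grows.
rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=(1000, d))
W = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)]
y = recurrent_retention(x, *W)
print(y.shape)  # (1000, 64)
```

The key contrast with attention is that nothing in the loop compares the current token against every previous token; all history is compressed into the fixed-size state, which is what makes very long contexts cheap.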
The next time you use a tool like ChatGPT or Perplexity, stop and count the total words being generated to fulfill your request. Each word results from a process called inference—the revenue-generation mechanism of AI systems where each word generated can be analyzed using basic financial and economic business principles. The goal of performing this […]
Here are Google’s latest AI updates from October 2025
Understanding machine learning models is a vital aspect of building trustworthy AI systems.
Market researchers have embraced artificial intelligence at a staggering pace, with 98% of professionals now incorporating AI tools into their work and 72% using them daily or more frequently, according to a new industry survey that reveals both the technology's transformative promise and its persistent reliability problems.

The findings, based on responses from 219 U.S. market research and insights professionals surveyed in August 2025 by QuestDIY, a research platform owned by The Harris Poll, paint a picture of an industry caught between competing pressures: the demand to deliver faster business insights and the burden of validating everything AI produces to ensure accuracy.

While more than half of researchers — 56% — report saving at least five hours per week using AI tools, nearly four in ten say they've experienced "increased reliance on technology that sometimes produces errors." An additional 37% report that AI has "introduced new risks around data quality or accuracy," and 31% say the technology has "led to more work re-checking or validating AI outputs."

The disconnect between productivity gains and trustworthiness has created what amounts to a grand bargain in the research industry: professionals accept time savings and enhanced capabilities in exchange for constant vigilance over AI's mistakes, a dynamic that may fundamentally reshape how insights work gets done.

How market researchers went from AI skeptics to daily users in less than a year

The numbers suggest AI has moved from experiment to infrastructure in record time. Among those using AI daily, 39% deploy it once per day, while 33% use it "several times per day or more," according to the survey conducted between August 15-19, 2025. Adoption is accelerating: 80% of researchers say they're using AI more than they were six months ago, and 71% expect to increase usage over the next six months. Only 8% anticipate their usage will decline.

“While AI provides excellent assistance and opportunities, human judgment will remain vital,” Erica Parker, Managing Director Research Products at The Harris Poll, told VentureBeat. “The future is a teamwork dynamic where AI will accelerate tasks and quickly unearth findings, while researchers will ensure quality and provide high level consultative insights.”

The top use cases reflect AI's strength in handling data at scale: 58% of researchers use it for analyzing multiple data sources, 54% for analyzing structured data, 50% for automating insight reports, 49% for analyzing open-ended survey responses, and 48% for summarizing findings. These tasks — traditionally labor-intensive and time-consuming — now happen in minutes rather than hours.

Beyond time savings, researchers report tangible quality improvements. Some 44% say AI improves accuracy, 43% report it helps surface insights they might otherwise have missed, 43% cite increased speed of insights delivery, and 39% say it sparks creativity. The overwhelming majority — 89% — say AI has made their work lives better, with 25% describing the improvement as "significant."

The productivity paradox: saving time while creating new validation work

Yet the same survey reveals deep unease about the technology's reliability.
The list of concerns is extensive: 39% of researchers report increased reliance on error-prone technology, 37% cite new risks around data quality or accuracy, 31% describe additional validation work, 29% report uncertainty about job security, and 28% say AI has raised concerns about data privacy and ethics.

The report notes that "accuracy is the biggest frustration with AI experienced by researchers when asked on an open-ended basis." One researcher captured the tension succinctly: "The faster we move with AI, the more we need to check if we're moving in the right direction."

This paradox — saving time while simultaneously creating new work — reflects a fundamental characteristic of current AI systems, which can produce outputs that appear authoritative but contain what researchers call "hallucinations," or fabricated information presented as fact. The challenge is particularly acute in a profession where credibility depends on methodological rigor and where incorrect data can lead clients to make costly business decisions.

"Researchers view AI as a junior analyst, capable of speed and breadth, but needing oversight and judgment," said Gary Topiol, Managing Director at QuestDIY, in the report.

That metaphor — AI as junior analyst — captures the industry's current operating model. Researchers treat AI outputs as drafts requiring senior review rather than finished products, a workflow that provides guardrails but also underscores the technology's limitations.

Why data privacy fears are the biggest obstacle to AI adoption in research

When asked what would limit AI use at work, researchers identified data privacy and security concerns as the greatest barrier, cited by 33% of respondents. This concern isn't abstract: researchers handle sensitive customer data, proprietary business information, and personally identifiable information subject to regulations like GDPR and CCPA. Sharing that data with AI systems — particularly cloud-based large language models — raises legitimate questions about who controls the information and whether it might be used to train models accessible to competitors.

Other significant barriers include time to experiment and learn new tools (32%), training (32%), integration challenges (28%), internal policy restrictions (25%), and cost (24%). An additional 31% cited lack of transparency in AI use as a concern, which could complicate explaining results to clients and stakeholders.

The transparency issue is particularly thorny. When an AI system produces an analysis or insight, researchers often cannot trace how the system arrived at its conclusion — a problem that conflicts with the scientific method's emphasis on replicability and clear methodology. Some clients have responded by including no-AI clauses in their contracts, forcing researchers to either avoid the technology entirely or use it in ways that don't technically violate contractual terms but may blur ethical lines.

"Onboarding beats feature bloat," Parker said in the report. "The biggest brakes are time to learn and train. Packaged workflows, templates, and guided setup all unlock usage faster than piling on capabilities."

Inside the new workflow: treating AI like a junior analyst who needs constant supervision

Despite these challenges, researchers aren't abandoning AI — they're developing frameworks to use it responsibly.
The consensus model, according to the survey, is "human-led research supported by AI," where AI handles repetitive tasks like coding, data cleaning, and report generation while humans focus on interpretation, strategy, and business impact.

Nearly three in ten researchers (29%) describe their current workflow as "human-led with significant AI support," while 31% characterize it as "mostly human with some AI help." Looking ahead to 2030, 61% envision AI as a "decision-support partner" with expanded capabilities including generative features for drafting surveys and reports (56%), AI-driven synthetic data generation (53%), automation of core processes like project setup and coding (48%), predictive analytics (44%), and deeper cognitive insights (43%).

The report describes an emerging division of labor where researchers become "Insight Advocates" — professionals who validate AI outputs, connect findings to stakeholder challenges, and translate machine-generated analysis into strategic narratives that drive business decisions. In this model, technical execution becomes less central to the researcher's value proposition than judgment, context, and storytelling.

"AI can surface missed insights — but it still needs a human to judge what really matters," Topiol said in the report.

What other knowledge workers can learn from the research industry's AI experiment

The market research industry's AI adoption may presage similar patterns in other knowledge work professions where the technology promises to accelerate analysis and synthesis. The experience of researchers — early AI adopters who have integrated the technology into daily workflows — offers lessons about both opportunities and pitfalls.

First, speed genuinely matters. One boutique agency research lead quoted in the report described watching survey results accumulate in real time after fielding: "After submitting it for fielding, I literally watched the survey count climb and finish the same afternoon. It was a remarkable turnaround." That velocity enables researchers to respond to business questions within hours rather than weeks, making insights actionable while decisions are still being made rather than after the fact.

Second, the productivity gains are real but uneven. Saving five hours per week represents meaningful efficiency for individual contributors, but those savings can disappear if spent validating AI outputs or correcting errors. The net benefit depends on the specific task, the quality of the AI tool, and the user's skill in prompting and reviewing the technology's work.

Third, the skills required for research are changing. The report identifies future competencies including cultural fluency, strategic storytelling, ethical stewardship, and what it calls "inquisitive insight advocacy" — the ability to ask the right questions, validate AI outputs, and frame insights for maximum business impact. Technical execution, while still important, becomes less differentiating as AI handles more of the mechanical work.

The strange phenomenon of using technology intensively while questioning its reliability

The survey's most striking finding may be the persistence of trust issues despite widespread adoption. In most technology adoption curves, trust builds as users gain experience and tools mature.
But with AI, researchers appear to be using tools intensively while simultaneously questioning their reliability — a dynamic driven by the technology's pattern of performing well most of the time but failing unpredictably.

This creates a verification burden that has no obvious endpoint. Unlike traditional software bugs that can be identified and fixed, AI systems' probabilistic nature means they may produce different outputs for the same inputs, making it difficult to develop reliable quality assurance processes.

The data privacy concerns — cited by 33% as the biggest barrier to adoption — reflect a different dimension of trust. Researchers worry not just about whether AI produces accurate outputs but also about what happens to the sensitive data they feed into these systems. QuestDIY's approach, according to the report, is to build AI directly into a research platform with ISO/IEC 27001 certification rather than requiring researchers to use general-purpose tools like ChatGPT that may store and learn from user inputs.

"The center of gravity is analysis at scale — fusing multiple sources, handling both structured and unstructured data, and automating reporting," Topiol said in the report, describing where AI delivers the most value.

The future of research work: elevation or endless verification?

The report positions 2026 as an inflection point when AI moves from being a tool researchers use to something more like a team member — what the authors call a "co-analyst" that participates in the research process rather than merely accelerating specific tasks.

This vision assumes continued improvement in AI capabilities, particularly in areas where researchers currently see the technology as underdeveloped. While 41% currently use AI for survey design, 37% for programming, and 30% for proposal creation, most researchers consider these appropriate use cases, suggesting significant room for growth once the tools become more reliable or the workflows more structured.

The human-led model appears likely to persist. "The future is human-led, with AI as a trusted co-analyst," Parker said in the report. But what "human-led" means in practice may shift. If AI handles most analytical tasks and researchers focus on validation and strategic interpretation, the profession may come to resemble editorial work more than scientific analysis — curating and contextualizing machine-generated insights rather than producing them from scratch.

"AI gives researchers the space to move up the value chain – from data gatherers to Insight Advocates, focused on maximising business impact," Topiol said in the report.

Whether this transformation marks an elevation of the profession or a deskilling depends partly on how the technology evolves. If AI systems become more transparent and reliable, the verification burden may decrease and researchers can focus on higher-order thinking. If they remain opaque and error-prone, researchers may find themselves trapped in an endless cycle of checking work produced by tools they cannot fully trust or explain.

The survey data suggests researchers are navigating this uncertainty by developing a form of professional muscle memory — learning which tasks AI handles well, where it tends to fail, and how much oversight each type of output requires. This tacit knowledge, accumulated through daily use and occasional failures, may become as important to the profession as statistical literacy or survey design principles.

Yet the fundamental tension remains unresolved.
Researchers are moving faster than ever, delivering insights in hours instead of weeks, and handling analytical tasks that would have been impossible without AI. But they're doing so while shouldering a new responsibility that previous generations never faced: serving as the quality control layer between powerful but unpredictable machines and business leaders making million-dollar decisions.

The industry has made its bet. Now comes the harder part: proving that human judgment can keep pace with machine speed — and that the insights produced by this uneasy partnership are worth the trust clients place in them.
SAP aims to displace more general large language models with the release of its own foundational “tabular” model, which the company claims will reduce training requirements for enterprises. The model, called SAP RPT-1, is a pre-trained model with business and enterprise knowledge out of the box. SAP calls it a Relational Foundation Model, meaning it can make predictions based on relational databases even without fine-tuning or additional training.

Walter Sun, SAP's global head of AI, told VentureBeat in an interview that the value of the new model lies in its ability to perform various enterprise tasks, such as predictive analytics, out of the box. “Everyone knows about language models, and there’s a bunch of good ones that already exist,” Sun said. “But we trained the model on data on business transactions, basically Excel spreadsheets, and so we have a model that can do predictive analytics where the value is that it’s out of the box, meaning you don’t need to have specifics of a company to do tasks analogous to a language model.”

Sun said that right out of the gate, RPT-1 can essentially build out a business model for enterprises based on knowledge gained from SAP’s decades of business data. Organizations can plug the model directly into applications, even without additional fine-tuning.

RPT-1, SAP’s first large family of AI models, will be generally available in “Q4 of 2025” and be deployed via SAP’s AI Foundation. While RPT-1 is currently available, the company stated that additional models will be made available soon, including an open-source, state-of-the-art model. SAP will also release a no-code playground environment to experiment with the model.
Tabular models vs LLMs
Tabular or relational AI models are trained on spreadsheets, unlike LLMs, which are trained on text and code. RPT-1 not only understands numbers and the relationships between different cells, but it is also able to provide more structured and precise answers. When enterprises decide to use RPT-1, they can add more direction to the model through a bit of context engineering, since the model is semantically aware and learns based on how it is being used.

SAP researchers first proposed the idea that tabular models can both exhibit semantic awareness and learn from content in a paper published in June. The paper introduced ConTextTab, a context-aware pretraining approach that uses semantic signals, such as table headers or column types, to guide model training, enabling the model to build a relational structure from the data. It is this architecture that makes the model work best for tasks with precise answers, such as financial or enterprise use cases.

The RPT models build on the ConTextTab work, letting them learn structured business data, say from SAP’s knowledge graph, and then add more context through usage. SAP researchers did test ConTextTab against benchmarks, saying it “is competitive” against similar models like TabPFN and TabICL.

Industry-specific models continue to grow
Many enterprises prefer to fine-tune general LLMs like GPT-5 or Claude, essentially retraining the model to answer only questions relevant to their business. However, a shift toward industry-specific models has begun to take root.

Sun said that his experience at a previous company, building a very narrow, highly customized AI model for sentiment analysis, influenced a lot of what makes RPT-1 different. “It was a very customized model, a narrow model that takes specific feedback for specific products but it wasn’t scalable,” Sun said. “When LLMs came about, that one model measures sentiment. But there are use cases that we can do that LLMs cannot do.”

He said these use cases include predictions, such as determining when a shopper will return to a grocery store, which may involve numerical analysis along with an understanding of the shopper’s buying habits.

However, some LLM providers have begun integrating their models into spreadsheets and encourage users to upload similar data to give the models context. Microsoft added new capabilities to Copilot, including the ability to work in Excel. Anthropic integrated its Claude model with Excel, complementing its Claude for Finance service. Chinese startup Manus also offers a data visualization tool that understands spreadsheets, and ChatGPT can create charts from uploaded spreadsheets and other data sources. SAP noted, however, that RPT-1 does more than just read a spreadsheet; it should stand out among its competitors because it requires fewer additional pieces of information about a business to provide its responses.
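To ground the idea of out-of-the-box predictions on tabular business data, the short sketch below uses TabPFN, one of the pretrained tabular models the article names as a comparison point, rather than RPT-1 itself, whose programmatic interface is not documented here. It assumes the tabpfn and scikit-learn packages are installed; the dataset is a generic stand-in for business records, and this is not SAP's API.

```python
# Illustrative only: a pretrained tabular model making predictions without any
# task-specific gradient training. Assumes `pip install tabpfn scikit-learn`.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)            # stand-in for business records
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()        # weights are pretrained; no fine-tuning step here
clf.fit(X_train, y_train)       # labeled rows are consumed as in-context examples
print(clf.predict(X_test[:5]))  # predictions for new, unlabeled rows
```

The pattern mirrors the workflow SAP describes for RPT-1: point a pretrained relational model at labeled rows and a target column, and get predictions back without building or training a bespoke model per task.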
Presented by Zendesk

Agentic AI is currently transforming three key areas of work — creative, coding, and support — says Shashi Upadhyay, president of engineering, AI, and product at Zendesk. But he notes that support presents a distinct challenge.

"Support is special because you’re putting an autonomous AI agent right in front of your customer," Upadhyay says. "You have to be confident that it’s going to do the right thing for the customer and by the customer. Every step forward in AI should make service more dependable for both customers and human agents."

Zendesk, recently named a Leader in the 2025 Gartner Magic Quadrant for the CRM Customer Engagement Center, started implementing AI agents about a year and a half ago. Since then, it has seen that AI agents can solve almost 80% of all incoming customer requests on their own. For the remaining 20%, the AI agent can hand the request over to a human to help solve the more complex problems.

"Autonomous AI agents work 24/7, with no wait or queue time. You have a problem; they provide an answer right away. All of that adds up," he says. "Not only do you get higher resolutions, higher automation, but you can also improve the CSAT at the same time. Because 80% is such a promising number, and the results are so solid, we believe it’s only a matter of time before everyone adopts this technology. We already see that across the board."

The company's efforts to advance its standard of usability, depth of insight, and time to value for organizations of all sizes require continuous testing, integration of advanced models like GPT-5, and a major upgrade of its analytics capabilities with real-time, gen AI–powered insights through the acquisition of HyperArc, an AI-native analytics platform.

Designing, testing, and deploying a better agent

"In a support context especially, it’s important AI agents behave consistently with the brand of the company, policies, and regulatory requirements you may have," Upadhyay says. "We test every agent, every model continuously across all our customers. We do it before we release it and we do it after we release it, across five categories."

Those categories — automation rate, execution, precision, latency, and safety — form the foundation of Zendesk’s ongoing benchmarking program. Each model is scored on how accurately it resolves issues, how well it follows instructions, how fast it responds, and whether it stays within clearly defined guardrails. The goal isn’t just to make AI faster — it’s to make it dependable, accountable, and aligned with the standards that define great customer service.

That testing is reinforced by Zendesk’s QA agent — an automated monitor that keeps a constant eye on every conversation. If an exchange starts to drift off course, whether in tone or accuracy, the system immediately flags it and alerts a human agent to step in. It’s an added layer of assurance that keeps the customer experience on track, even when AI is running the first line of support.

GPT-5 for next-level agents

In the world of support and service, the move from simple chatbots that answer basic queries or solve uncomplicated problems, to agents that actually take action, is groundbreaking. An agent that can understand that a customer wants to return an item, confirm whether it's eligible for a return, process the return, and issue a refund, is a powerful upgrade.
With the introduction of GPT-5, Zendesk recognized an opportunity to integrate that ability into its Resolution Platform.

"We worked very closely with OpenAI because GPT-5 was a pretty big improvement in model capabilities, going from being able to answer questions, to being able to reason and take action," Upadhyay says. "First, it does a much better job at solving problems autonomously. Secondly, it's much better at understanding your intent, which improves the customer experience because you feel understood. Last but not least, it has 95%-plus reliability on executing correctly."

Those gains ripple across Zendesk’s AI agents, Copilot, and App Builder. GPT-5 cuts workflow failures by 30%, thanks to its ability to adapt to unexpected complexity without losing context, and reduces fallback escalations by more than 20%, with more complete and accurate responses. The result: faster resolutions, fewer hand-offs, and AI that behaves more like a seasoned support professional than a scripted assistant.

Plus, GPT-5 is better at handling ambiguity and able to clarify vague customer input, which improves routing and increases automated workflows in over 65% of conversations. It has greater accuracy across five languages, and makes agents more productive with more concise, contextually relevant answers that align with tone guidelines.

And in App Builder, GPT-5 delivered 25% to 30% faster overall performance, with more prompt iterations per minute, speeding App Builder development workflows.

Filling in the analytics gap

Traditionally, support analytics has focused on structured data — the kind that fits neatly into a table: when a ticket was opened, who handled it, how long it took to resolve, and when it was closed. But the most valuable insights often live in unstructured data — the conversations themselves, spread across email, chat, voice, and messaging apps like WhatsApp.

"Customers often don’t realize how much intelligence sits in their support interactions," Upadhyay says. "What we’re pushing for with analytics is ways in which we can improve the entire company with the insights that are sitting in support data."

To surface those deeper insights, Zendesk turned to HyperArc, an AI-native analytics company known for its proprietary HyperGraph engine and generative-AI-powered insights. The acquisition gave new life to Explore, Zendesk’s analytics platform, transforming it into a modern solution capable of merging structured and unstructured data, supporting conversational interfaces, and drawing on persistent memory to use past interactions as context for new queries.

"Your support interactions are telling you everything that’s not working in your business today; all that information is sitting in these millions of tickets that you’ve collected over time," Upadhyay says. "We wanted to make that completely visible. Now we have this genius AI agent that can analyze it all and come back with explicit recommendations. That doesn’t just improve support. It improves the entire company."

That visibility now translates into actionable intelligence. The system can pinpoint where issues are most persistent, identify the patterns behind them, and suggest ways to resolve them. It can even anticipate problems before they happen.
During high-pressure events like Black Friday, for example, it can analyze historical data to flag recurring issues, predict where new bottlenecks might appear, and recommend preventive measures — turning reactive support into proactive strategy.

"That’s where HyperArc shines," Upadhyay says. "It doesn’t just help you understand the past — it helps you plan better for the future."

By integrating HyperArc’s AI-native intelligence, Zendesk is moving customer service toward continuous learning — where every interaction builds trust and sharpens performance, setting the stage for AI that can see what’s coming next.

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
MIT’s Teaching Systems Lab, led by Associate Professor Justin Reich, is working to help educators by listening to and sharing their stories.
MIT PhD student and CSAIL researcher Justin Kay describes his work combining AI and computer vision systems to monitor the ecosystems that support our planet.
I’m thrilled to announce a fantastic new addition to our leadership team: Karyne Levy is joining VentureBeat as our new Managing Editor. Today is her first day.

Many of you may know Karyne from her most recent role as Deputy Managing Editor at TechCrunch, but her career is a highlight reel of veteran tech journalism. Her resume includes pivotal roles at Protocol, NerdWallet, Business Insider, and CNET, giving her a deep understanding of this industry from every angle.

Hiring Karyne is a significant step forward for VentureBeat. As we’ve sharpened our focus on serving you – the enterprise technical decision-maker navigating the complexities of AI and data – I’ve been looking for a very specific kind of leader.

The "Organizer's Dopamine Hit"

In the past, a managing editor was often the final backstop for copy. Today, at a modern, data-focused media company like ours, the role is infinitely more dynamic. It’s the central hub of the entire content operation.

During my search, I found myself talking a lot about the two types of "dopamine hits" in our business. There’s the writer’s hit – seeing your name on a great story. And then there’s the organizer’s hit – the satisfaction that comes from building, tuning, and running the complex machine that allows a dozen different parts of the company to move in a single, powerful direction.

We were looking for the organizer.

When I spoke with Karyne, I explained this vision: a leader who thrives on creating workflows, who loves being the liaison between editorial, our data and survey team, our events, and our marketing operations.

Her response confirmed she was the one: "Everything you said is exactly my dopamine hit."

Karyne’s passion is making the entire operation hum. She has a proven track record of managing people, running newsrooms, and interfacing with all parts of a business to ensure everyone is aligned. That operational rigor is precisely what we need for our next chapter.

Why This Matters for Our Strategy (and for You)

As I’ve written about before, VentureBeat is on a mission to evolve. In an age where experts and companies can publish directly, it’s not enough to be a secondary source. Our goal is to become a primary source for you.

How? By leveraging our relationship with our community of millions of technical leaders. We are increasingly surveying you directly to generate proprietary insights you can’t get anywhere else. We want to be the first to tell you which vector stores your peers are actually implementing, what governance challenges are most pressing for data scientists, or how your counterparts are budgeting for generative AI.

This is an ambitious strategy. It requires a tight-knit team where our editorial content, our research surveys and reports, our newsletters, and our VB Transform events are all working from the same playbook.

Karyne is the leader who will help us execute that vision. Her experience at Protocol, which was also dedicated to serving technical and business decision-makers, means she fundamentally understands our audience. She is ideally suited to manage our newsroom and ensure that every piece of content we produce helps you do your job better. She’ll be working alongside Carl Franzen, our executive editor, who continues to drive news decision-making.

This is a fantastic hire for VentureBeat. It’s another sign of our commitment to building the most focused, expert team in enterprise AI and data.

Please join me in welcoming Karyne to the team.
The buzzed-about but still stealthy New York City startup Augmented Intelligence Inc (AUI), which seeks to go beyond the popular "transformer" architecture used by most of today's LLMs such as ChatGPT and Gemini, has raised $20 million in a bridge SAFE round at a $750 million valuation cap, bringing its total funding to nearly $60 million, VentureBeat can exclusively reveal.

The round, completed in under a week, comes amid heightened interest in deterministic conversational AI and precedes a larger raise now in advanced stages.

AUI relies on a fusion of transformer technology and a newer approach called "neuro-symbolic AI," described in greater detail below. "We realize that you can combine the brilliance of LLMs in linguistic capabilities with the guarantees of symbolic AI," said Ohad Elhelo, AUI co-founder and CEO, in a recent interview with VentureBeat. Elhelo launched the company in 2017 alongside co-founder and Chief Product Officer Ori Cohen.

The new financing includes participation from eGateway Ventures, New Era Capital Partners, existing shareholders, and other strategic investors. It follows a $10 million raise in September 2024 at a $350 million valuation cap, coinciding with the company’s announced go-to-market partnership with Google in October 2024. Early investors include Vertex Pharmaceuticals founder Joshua Boger, UKG Chairman Aron Ain, and former IBM President Jim Whitehurst.

According to the company, the bridge round is a precursor to a significantly larger raise already in advanced stages.

AUI is the company behind Apollo-1, a new foundation model built for task-oriented dialog, which it describes as the "economic half" of conversational AI — distinct from the open-ended dialog handled by LLMs like ChatGPT and Gemini. The firm argues that existing LLMs lack the determinism, policy enforcement, and operational certainty required by enterprises, especially in regulated sectors.

Chris Varelas, co-founder of Redwood Capital and an advisor to AUI, said in a press release provided to VentureBeat: “I’ve seen some of today’s top AI leaders walk away with their heads spinning after interacting with Apollo-1.”

A Distinctive Neuro-Symbolic Architecture

Apollo-1’s core innovation is its neuro-symbolic architecture, which separates linguistic fluency from task reasoning. Instead of using the most common technology underpinning most LLMs and conversational AI systems today — the vaunted transformer architecture described in the seminal 2017 Google paper "Attention Is All You Need" — AUI's system integrates two layers:

- Neural modules, powered by LLMs, handle perception: encoding user inputs and generating natural language responses.
- A symbolic reasoning engine, developed over several years, interprets structured task elements such as intents, entities, and parameters. This symbolic state engine determines the appropriate next actions using deterministic logic.

This hybrid architecture allows Apollo-1 to maintain state continuity, enforce organizational policies, and reliably trigger tool or API calls — capabilities that transformer-only agents lack.

Elhelo said this design emerged from a multi-year data collection effort: “We built a consumer service and recorded millions of human-agent interactions across 60,000 live agents. From that, we abstracted a symbolic language that defines the structure of task-based dialogs, separate from their domain-specific content.”

However, enterprises that have already built systems around transformer LLMs needn't worry.
AUI wants to make adopting its new technology just as easy. "Apollo-1 deploys like any modern foundation model," Elhelo told VentureBeat in a text last night. "It doesn’t require dedicated or proprietary clusters to run. It operates across standard cloud and hybrid environments, leveraging both GPUs and CPUs, and is significantly more cost-efficient to deploy than frontier reasoning models. Apollo-1 can also be deployed across all major clouds in a separated environment for increased security."

Generalization and Domain Flexibility

Apollo-1 is described as a foundation model for task-oriented dialog, meaning it is domain-agnostic and generalizable across verticals like healthcare, travel, insurance, and retail.

Unlike consulting-heavy AI platforms that require building bespoke logic per client, Apollo-1 allows enterprises to define behaviors and tools within a shared symbolic language. This approach supports faster onboarding and reduces long-term maintenance. According to the team, an enterprise can launch a working agent in under a day.

Crucially, procedural rules are encoded at the symbolic layer — not learned from examples. This enables deterministic execution for sensitive or regulated tasks. For instance, a system can block cancellation of a Basic Economy flight not by guessing intent but by applying hard-coded logic to a symbolic representation of the booking class.

As Elhelo explained to VentureBeat, LLMs are "not a good mechanism when you’re looking for certainty. It’s better if you know what you’re going to send [to an AI model] and always send it, and you know, always, what’s going to come back [to the user] and how to handle that.”

Availability and Developer Access

Apollo-1 is already in active use within Fortune 500 enterprises in a closed beta, and a broader general availability release is expected before the end of 2025, according to a previous report by The Information, which broke the initial news on the startup.

Enterprises can integrate with Apollo-1 either via:

- A developer playground, where business users and technical teams jointly configure policies, rules, and behaviors; or
- A standard API, using OpenAI-compatible formats.

The model supports policy enforcement, rule-based customization, and steering via guardrails. Symbolic rules allow businesses to dictate fixed behaviors, while LLM modules handle open-text interpretation and user interaction.

Enterprise Fit: When Reliability Beats Fluency

While LLMs have advanced general-purpose dialog and creativity, they remain probabilistic — a barrier to enterprise deployment in finance, healthcare, and customer service. Apollo-1 targets this gap by offering a system where policy adherence and deterministic task completion are first-class design goals.

Elhelo puts it plainly: “If your use case is task-oriented dialog, you have to use us, even if you are ChatGPT.”
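As a rough illustration of the neuro-symbolic split described above, the sketch below pairs a stand-in for the neural layer (which in a real system would be an LLM parsing the message into intents and entities) with a deterministic symbolic rule, using the Basic Economy cancellation example from the article. Every class, function, and rule here is invented for illustration; this is not AUI's Apollo-1 code or API.

```python
from dataclasses import dataclass

@dataclass
class Booking:
    record_locator: str
    fare_class: str          # e.g. "BASIC_ECONOMY", "MAIN_CABIN"
    refundable: bool

def extract_intent(user_message: str) -> dict:
    """Stand-in for the neural layer: an LLM would map free text to a symbolic
    intent plus entities. Here we just keyword-match for brevity."""
    if "cancel" in user_message.lower():
        return {"intent": "CANCEL_BOOKING"}
    return {"intent": "UNKNOWN"}

def next_action(intent: dict, booking: Booking) -> str:
    """Symbolic layer: a hard-coded policy applied to the structured state, so the
    outcome is deterministic rather than guessed from text."""
    if intent["intent"] == "CANCEL_BOOKING":
        if booking.fare_class == "BASIC_ECONOMY" and not booking.refundable:
            return "REFUSE_CANCELLATION_EXPLAIN_POLICY"
        return "CALL_CANCELLATION_API"
    return "ASK_CLARIFYING_QUESTION"

booking = Booking("ABC123", "BASIC_ECONOMY", refundable=False)
intent = extract_intent("Hi, I need to cancel my flight next week.")
print(next_action(intent, booking))   # -> REFUSE_CANCELLATION_EXPLAIN_POLICY
```

The point of the split is that the same user request always produces the same policy decision: the language model's job ends once the request is reduced to symbols, and the rules, not the model, decide what happens next.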
Large language models (LLMs) exhibit outstanding abilities to reason over, summarize, and creatively generate text.
An international team of researchers has released an artificial intelligence system capable of autonomously conducting scientific research across multiple disciplines — generating papers from initial concept to publication-ready manuscript in approximately 30 minutes for about $4 each.

The system, called Denario, can formulate research ideas, review existing literature, develop methodologies, write and execute code, create visualizations, and draft complete academic papers. In a demonstration of its versatility, the team used Denario to generate papers spanning astrophysics, biology, chemistry, medicine, neuroscience, and other fields, with one AI-generated paper already accepted for publication at an academic conference.

"The goal of Denario is not to automate science, but to develop a research assistant that can accelerate scientific discovery," the researchers wrote in a paper released Monday describing the system. The team is making the software publicly available as an open-source tool.

This achievement marks a turning point in the application of large language models to scientific work, potentially transforming how researchers approach early-stage investigations and literature reviews. However, the research also highlights substantial limitations and raises pressing questions about validation, authorship, and the changing nature of scientific labor.

From data to draft: how AI agents collaborate to conduct research

At its core, Denario operates not as a single AI brain but as a digital research department where specialized AI agents collaborate to push a project from conception to completion. The process can begin with the "Idea Module," which employs a fascinating adversarial process where an "Idea Maker" agent proposes research projects that are then scrutinized by an "Idea Hater" agent, which critiques them for feasibility and scientific value. This iterative loop refines raw concepts into robust research directions.

Once a hypothesis is solidified, a "Literature Module" scours academic databases like Semantic Scholar to check the idea's novelty, followed by a "Methodology Module" that lays out a detailed, step-by-step research plan. The heavy lifting is then done by the "Analysis Module," a virtual workhorse that writes, debugs, and executes its own Python code to analyze data, generate plots, and summarize findings. Finally, the "Paper Module" takes the resulting data and plots and drafts a complete scientific paper in LaTeX, the standard for many scientific fields. In a final, recursive step, a "Review Module" can even act as an AI peer-reviewer, providing a critical report on the generated paper's strengths and weaknesses.

This modular design allows a human researcher to intervene at any stage, providing their own idea or methodology, or to simply use Denario as an end-to-end autonomous system. "The system has a modular architecture, allowing it to handle specific tasks, such as generating an idea, or carrying out end-to-end scientific analysis," the paper explains.

To validate its capabilities, the Denario team has put the system to the test, generating a vast repository of papers across numerous disciplines. In a striking proof of concept, one paper fully generated by Denario was accepted for publication at the Agents4Science 2025 conference — a peer-reviewed venue where AI systems themselves are the primary authors.
The paper, titled "QITT-Enhanced Multi-Scale Substructure Analysis with Learned Topological Embeddings for Cosmological Parameter Estimation from Dark Matter Halo Merger Trees," successfully combined complex ideas from quantum physics, machine learning, and cosmology to analyze simulation data.

The ghost in the machine: AI’s ‘vacuous’ results and ethical alarms

While the successes are notable, the research paper is refreshingly candid about Denario's significant limitations and failure modes. The authors stress that the system currently "behaves more like a good undergraduate or early graduate student rather than a full professor in terms of big picture, connecting results...etc." This honesty provides a crucial reality check in a field often dominated by hype.

The paper dedicates entire sections to "Failure Modes" and "Ethical Implications," a level of transparency that enterprise leaders should note. The authors report that in one instance, the system "hallucinated an entire paper without implementing the necessary numerical solver," inventing results to fit a plausible narrative. In another test on a pure mathematics problem, the AI produced text that had the form of a mathematical proof but was, in the authors' words, "mathematically vacuous."

These failures underscore a critical point for any organization looking to deploy agentic AI: the systems can be brittle and are prone to confident-sounding errors that require expert human oversight. The Denario paper serves as a vital case study in the importance of keeping a human in the loop for validation and critical assessment.

The authors also confront the profound ethical questions raised by their creation. They warn that "AI agents could be used to quickly flood the scientific literature with claims driven by a particular political agenda or specific commercial or economic interests." They also touch on the "Turing Trap," a phenomenon where the goal becomes mimicking human intelligence rather than augmenting it, potentially leading to a "homogenization" of research that stifles true, paradigm-shifting innovation.

An open-source co-pilot for the world's labs

Denario is not just a theoretical exercise locked away in an academic lab. The entire system is open-source under a GPL-3.0 license and is accessible to the broader community. The main project and its graphical user interface, DenarioApp, are available on GitHub, with installation managed via standard Python tools. For enterprise environments focused on reproducibility and scalability, the project also provides official Docker images. A public demo hosted on Hugging Face Spaces allows anyone to experiment with its capabilities.

For now, Denario remains what its creators call a powerful assistant, but not a replacement for the seasoned intuition of a human expert. This framing is deliberate. The Denario project is less about creating an automated scientist and more about building the ultimate co-pilot, one designed to handle the tedious and time-consuming aspects of modern research.

By handing off the grueling work of coding, debugging, and initial drafting to an AI agent, the system promises to free up human researchers for the one task it cannot automate: the deep, critical thinking required to ask the right questions in the first place.
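For readers who want a feel for the adversarial idea-refinement loop described above, here is a minimal sketch of a Maker-and-Hater exchange. The call_llm function is a placeholder to be wired to any model provider; the prompts and loop structure are assumptions for illustration, not Denario's actual implementation or API.

```python
# Minimal sketch of an adversarial idea-refinement loop: a "maker" proposes a
# research idea, a "hater" critiques it, and the idea is revised for a fixed
# number of rounds. Illustrative only.

def call_llm(prompt: str) -> str:
    """Placeholder: wire this to your LLM provider of choice."""
    raise NotImplementedError

def refine_idea(topic: str, rounds: int = 3) -> str:
    idea = call_llm(f"Propose a concrete, testable research idea about: {topic}")
    for _ in range(rounds):
        critique = call_llm(
            "You are a skeptical reviewer. List feasibility and novelty problems "
            f"with this idea:\n{idea}"
        )
        idea = call_llm(
            "Revise the research idea to address every critique below.\n"
            f"Idea:\n{idea}\nCritique:\n{critique}"
        )
    return idea
```

The same pattern of propose, critique, and revise is what lets a pipeline like the one described here turn a rough prompt into a research direction sturdy enough to hand to downstream literature, methodology, and analysis stages.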