GPT-5.1, Russian robot face-plants, AI celeb voices, AI fighter jet flies, and more...
Another day in late 2025, another impressive result from a Chinese company in open source artificial intelligence.

Chinese social networking company Weibo's AI division recently released its open source VibeThinker-1.5B—a 1.5-billion-parameter large language model (LLM) that is a fine-tuned variant of rival Chinese tech firm Alibaba's Qwen2.5-Math-1.5B. It's available now for free download and use by researchers and enterprise developers—even for commercial purposes—under a permissive MIT License on Hugging Face, GitHub and ModelScope, with a technical report on the open-access science publishing site arxiv.org.

And yet, despite its compact size, VibeThinker-1.5B achieves benchmark-topping reasoning performance on math and code tasks, rivaling or surpassing models hundreds of times its size, even outperforming Chinese rival DeepSeek's famed R1, the 671-billion-parameter model that went viral at the start of this year, on formal reasoning benchmarks. It further eclipses Mistral AI's Magistral Medium and holds its own against Anthropic's Claude Opus 4 and OpenAI's gpt-oss-20B Medium, all while requiring a fraction of the infrastructure and investment.

It also does so having been post-trained on a budget of merely $7,800 USD for compute resources (3,900 GPU hours on Nvidia H800s) — far less than the tens, or even hundreds, of thousands of dollars typically required to fine-tune models of similar or larger scale.

Recall this is not the total cost of the model's development, however: LLMs are trained in stages. First comes pre-training, when the model learns basic language structure and general knowledge by predicting the next word across enormous amounts of text from the internet, books, and articles. This gives it fluency but not much sense of how to follow instructions or hold a conversation. Post-training comes next, using much smaller, higher-quality datasets—typically collections of example questions, prompts, and expert-written answers—to teach the model how to respond helpfully, reason through problems, and align with human expectations. Still, Weibo's post-training cost effectiveness on VibeThinker-1.5B is noteworthy and should be commended.

The open-source release upends assumptions about parameter scale, compute intensity, and the minimum viable size for high-performance LLMs.

A Different Training Approach: Spectrum-to-Signal

VibeThinker-1.5B owes its performance not to scale, but to the training framework behind it: the Spectrum-to-Signal Principle (SSP). Instead of optimizing a model purely for single-answer correctness (Pass@1), the SSP framework decouples supervised fine-tuning (SFT) and reinforcement learning (RL) into two distinct phases with different goals:

- SFT ("Spectrum Phase"): The model is trained to maximize diversity across potential correct answers, improving its Pass@K score. This builds a wide range of plausible solution paths.
- RL ("Signal Phase"): A second-stage reinforcement learning system, called MaxEnt-Guided Policy Optimization (MGPO), identifies and amplifies the most correct paths from this diverse solution pool. MGPO prioritizes problems where the model is most uncertain, using entropy-based weighting to focus learning.

The authors argue this separation allows small models to explore reasoning space more effectively—achieving signal amplification without relying on massive parameter counts. VibeThinker-1.5B makes a compelling case that the industry's reliance on parameter scaling as the only route to better reasoning performance may be outdated.
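The entropy-based prioritization at the heart of MGPO can be illustrated with a short sketch. This is not Weibo's released training code; it is a minimal illustration under the assumption that each problem's pass rate is estimated by sampling K candidate solutions and verifying them, with uncertainty (and thus training weight) peaking when that rate sits near 0.5.

```python
import math

def entropy_weight(pass_rate: float, eps: float = 1e-6) -> float:
    """Binary entropy of the model's empirical pass rate on a problem.

    Highest when pass_rate is near 0.5 (maximum uncertainty); near zero for
    problems the model always solves or always fails.
    """
    p = min(max(pass_rate, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def weight_training_batch(problems: list[dict]) -> list[tuple[dict, float]]:
    """Attach a normalized, entropy-based weight to each problem.

    Each problem dict carries a `pass_rate` field, e.g. the fraction of K
    sampled solutions that a verifier marked correct.
    """
    weighted = [(prob, entropy_weight(prob["pass_rate"])) for prob in problems]
    total = sum(w for _, w in weighted) or 1.0
    return [(prob, w / total) for prob, w in weighted]

# Example: the RL phase would upweight the second problem, where the model is
# most uncertain, and largely ignore problems it already masters or always fails.
batch = [
    {"id": "easy-algebra", "pass_rate": 0.95},
    {"id": "frontier-geometry", "pass_rate": 0.50},
    {"id": "olympiad-hard", "pass_rate": 0.02},
]
for prob, w in weight_training_batch(batch):
    print(prob["id"], round(w, 3))
```

In the actual MGPO objective these weights modulate the policy-gradient update; the sketch only shows the problem-prioritization idea described in the report.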
By adopting a diversity-first training pipeline, WeiboAI has shown that smaller, more accessible models can match and even outperform billion-dollar systems in logic-heavy tasks.

The low resource footprint is among the most significant aspects of VibeThinker-1.5B. At under $8,000, the post-training cost is 30–60x lower than that of models like DeepSeek R1 and MiniMax-M1, which cost between $294K and $535K to train.

Performance Across Domains

Despite its small size, VibeThinker-1.5B delivers cross-domain reasoning that outpaces many larger open-source and commercial models:

Model | AIME25 | LiveCodeBench v6 | GPQA-Diamond
VibeThinker-1.5B | 74.4 | 51.1 | 46.7
GPT-OSS-20B-Medium | 72.1 | 54.9 | 66.0
Claude Opus 4 | 69.2 | 56.6 | 79.6
MiniMax M1 (456B) | 74.6 | 62.3 | 69.2
DeepSeek R1 (671B) | 70.0 | 65.9 | 71.5
Kimi K2 (1.09T) | 49.5 | 53.7 | 75.1

VibeThinker was benchmarked against both reasoning-centric models (Magistral, Claude, OpenAI o3-mini) and non-reasoning LLMs (GPT-4.1, Kimi K2, DeepSeek V3). Across structured reasoning benchmarks, the model consistently outperformed non-reasoning models, regardless of size:

- On AIME24 (math), it beat Kimi K2 (1.09T) by over 10 points (80.3 vs. 69.6).
- On LiveCodeBench v6, it surpassed Claude Opus 4 (51.1 vs. 47.4).
- On GPQA, it scored below GPT-4.1 and Claude, but still doubled its base model's score (from 16.4 to 46.7).

This supports the authors' claim that size is not the only path to reasoning capability—with proper training design, smaller models can reach or even exceed the performance of far larger systems in targeted tasks. Notably, it achieves parity with models hundreds of times larger on math and code, though it lags behind in general knowledge reasoning (GPQA), where larger models maintain an edge. This suggests a potential specialization trade-off: while VibeThinker excels at structured logical tasks, it has less capacity for wide-ranging encyclopedic recall, a known limitation of smaller architectures.

Guidance for Enterprise Adoption

The release includes recommended inference settings (temperature = 0.6, top_p = 0.95, max tokens = 40960); a minimal usage sketch appears at the end of this article. The model is small enough to be deployed on edge devices, including mobile phones and vehicle-embedded systems, while inference costs are estimated to be 20–70x cheaper than with large models. This positions VibeThinker-1.5B not just as a research achievement, but as a potential foundation for cost-efficient, locally deployable reasoning systems.

Weibo's Strategy and Market Position

Weibo, launched by Sina Corporation in 2009, remains a cornerstone of China's social media ecosystem. Often described as China's version of X (formerly Twitter), the platform blends microblogging, multimedia content, and trending-topic features with a regulatory environment shaped by tight government oversight. Despite counting 600 million monthly active users (more than twice that of X), investors are not optimistic about its advertising revenue growth potential in the near term, and Weibo is navigating intensifying competition from video-first platforms like Douyin, which are drawing younger users and increasing time spent elsewhere. In response, Weibo has leaned into creator-economy monetization, live-streaming, and vertical video—adding tools for influencer engagement, e-commerce integration, and richer analytics for brands.

The platform's role as a digital public square also makes it a focus of regulatory scrutiny. Chinese authorities continue to apply pressure on issues ranging from content governance to data security.
In September 2025, Weibo was among the platforms cited in official warnings, highlighting its ongoing exposure to policy risks.

Weibo's push into AI R&D—exemplified by the release of VibeThinker-1.5B—signals a shift in ambition. Beyond being a media platform, Weibo is positioning itself as a player in the next phase of Chinese AI development, using its capital reserves, user behavior data, and in-house research capacity to pursue adjacent technical domains.

What It Means for Enterprise Technical Decision Makers

For engineering leaders and enterprise AI teams, VibeThinker's release has practical implications for everything from orchestration pipelines to cost modeling. A 1.5B-parameter model that outperforms models 100x larger on math and programming tasks doesn't just save compute—it shifts the architectural balance. It enables LLM inference on constrained infrastructure, reduces latency at the edge, and lowers the barrier to entry for applications that otherwise would have required API access to closed, frontier-scale models.

That matters for enterprise ML leads trying to deploy reasoning-capable agents within existing systems, or for platform owners tasked with integrating LLMs into automated workflows. It also speaks to those running reinforcement learning from human feedback (RLHF) pipelines or managing inference optimization across hybrid cloud environments. The model's post-training methodology—particularly its entropy-targeted reinforcement learning approach—offers a roadmap for teams looking to refine smaller checkpoints instead of relying on large-scale pretraining.

VibeThinker's benchmark transparency and data decontamination steps also address another emerging priority in enterprise AI: auditability. While its performance on general-knowledge tests still trails large frontier models, its task-specific reliability makes it an attractive candidate for controlled environments where correctness matters more than coverage.

In short, VibeThinker-1.5B isn't just a research milestone—it's a strong candidate for practical enterprise use, deployment, and experimentation. It suggests that a new class of compact, reasoning-optimized models is viable for enterprise use cases that were previously the domain of far larger systems. For organizations trying to balance cost, latency, interpretability, and control, it's a welcome addition to the long and growing list of Chinese open source offerings.
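To make the adoption guidance above concrete, here is a minimal loading sketch using Hugging Face Transformers with the recommended sampling settings (temperature 0.6, top_p 0.95, up to 40,960 new tokens). The repository ID WeiboAI/VibeThinker-1.5B is an assumption based on the release; verify the exact ID and any chat-template requirements on the model card before relying on this.

```python
# Minimal sketch: loading VibeThinker-1.5B with the recommended sampling
# settings. The repo ID below is assumed from the release announcement;
# check the Hugging Face model card for the exact identifier.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "WeiboAI/VibeThinker-1.5B"  # assumption: verify on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Solve: if 3x + 5 = 20, what is x? Show your reasoning."
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    do_sample=True,
    temperature=0.6,       # recommended setting from the release
    top_p=0.95,            # recommended setting from the release
    max_new_tokens=40960,  # recommended token budget; reduce for quick tests
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```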
A good language model should learn correct language usage, free of biases and errors.
As software systems grow more complex and AI tools generate code faster than ever, a fundamental problem is getting worse: Engineers are drowning in debugging work, spending up to half their time hunting down the causes of software failures instead of building new products. The challenge has become so acute that it's creating a new category of tooling — AI agents that can diagnose production failures in minutes instead of hours.

Deductive AI, a startup emerging from stealth mode Wednesday, believes it has found a solution by applying reinforcement learning — the same technology that powers game-playing AI systems — to the messy, high-stakes world of production software incidents. The company announced it has raised $7.5 million in seed funding led by CRV, with participation from Databricks Ventures, Thomvest Ventures, and PrimeSet, to commercialize what it calls "AI SRE agents" that can diagnose and help fix software failures at machine speed.

The pitch resonates with a growing frustration inside engineering organizations: Modern observability tools can show that something broke, but they rarely explain why. When a production system fails at 3 a.m., engineers still face hours of manual detective work, cross-referencing logs, metrics, deployment histories, and code changes across dozens of interconnected services to identify the root cause.

"The complexities and inter-dependencies of modern infrastructure means that investigating the root cause of an outage or incident can feel like searching for a needle in a haystack, except the haystack is the size of a football field, it's made of a million other needles, it's constantly reshuffling itself, and is on fire — and every second you don't find it equals lost revenue," said Sameer Agarwal, Deductive's co-founder and chief technology officer, in an exclusive interview with VentureBeat.

Deductive's system builds what the company calls a "knowledge graph" that maps relationships across codebases, telemetry data, engineering discussions, and internal documentation. When an incident occurs, multiple AI agents work together to form hypotheses, test them against live system evidence, and converge on a root cause — mimicking the investigative workflow of experienced site reliability engineers, but completing the process in minutes rather than hours.

The technology has already shown measurable impact in some of the world's most demanding production environments. DoorDash's advertising platform, which runs real-time auctions that must complete in under 100 milliseconds, has integrated Deductive into its incident response workflow. The company has set an ambitious 2026 goal of resolving production incidents within 10 minutes.

"Our Ads Platform operates at a pace where manual, slow-moving investigations are no longer viable. Every minute of downtime directly affects company revenue," said Shahrooz Ansari, Senior Director of Engineering at DoorDash, in an interview with VentureBeat. "Deductive has become a critical extension of our team, rapidly synthesizing signals across dozens of services and surfacing the insights that matter—within minutes."

DoorDash estimates that Deductive has root-caused approximately 100 production incidents over the past few months, translating to more than 1,000 hours of annual engineering productivity and a revenue impact "in millions of dollars," according to Ansari.
At location intelligence company Foursquare, Deductive reduced the time to diagnose Apache Spark job failures by 90% — turning a process that previously took hours or days into one that completes in under 10 minutes — while generating over $275,000 in annual savings.

Why AI-generated code is creating a debugging crisis

The timing of Deductive's launch reflects a brewing tension in software development: AI coding assistants are enabling engineers to generate code faster than ever, but the resulting software is often harder to understand and maintain.

"Vibe coding," a term popularized by AI researcher Andrej Karpathy, refers to using natural-language prompts to generate code through AI assistants. While these tools accelerate development, they can introduce what Agarwal describes as "redundancies, breaks in architectural boundaries, assumptions, or ignored design patterns" that accumulate over time.

"Most AI-generated code still introduces redundancies, breaks architectural boundaries, makes assumptions, or ignores established design patterns," Agarwal told VentureBeat. "In many ways, we now need AI to help clean up the mess that AI itself is creating."

The claim that engineers spend roughly half their time on debugging isn't hyperbole. The Association for Computing Machinery reports that developers spend 35% to 50% of their time validating and debugging software. More recently, Harness's State of Software Delivery 2025 report found that 67% of developers are spending more time debugging AI-generated code.

"We've seen world-class engineers spending half of their time debugging instead of building," said Rakesh Kothari, Deductive's co-founder and CEO. "And as vibe coding generates new code at a rate we've never seen, this problem is only going to get worse."

How Deductive's AI agents actually investigate production failures

Deductive's technical approach differs substantially from the AI features being added to existing observability platforms like Datadog or New Relic. Most of those systems use large language models to summarize data or identify correlations, but they lack what Agarwal calls "code-aware reasoning"—the ability to understand not just that something broke, but why the code behaves the way it does.

"Most enterprises use multiple observability tools across different teams and services, so no vendor has a single holistic view of how their systems behave, fail, and recover—nor are they able to pair that with an understanding of the code that defines system behavior," Agarwal explained. "These are key ingredients to resolving software incidents and it is exactly the gap Deductive fills."

The system connects to existing infrastructure using read-only API access to observability platforms, code repositories, incident management tools, and chat systems. It then continuously builds and updates its knowledge graph, mapping dependencies between services and tracking deployment histories.

When an alert fires, Deductive launches what the company describes as a multi-agent investigation. Different agents specialize in different aspects of the problem: one might analyze recent code changes, another examines trace data, while a third correlates the timing of the incident with recent deployments. The agents share findings and iteratively refine their hypotheses.

The critical difference from rule-based automation is Deductive's use of reinforcement learning. The system learns from every incident which investigative steps led to correct diagnoses and which were dead ends.
When engineers provide feedback, the system incorporates that signal into its learning model. "Each time it observes an investigation, it learns which steps, data sources, and decisions led to the right outcome," Agarwal said. "It learns how to think through problems, not just point them out."

At DoorDash, a recent latency spike in an API initially appeared to be an isolated service issue. Deductive's investigation revealed that the root cause was actually timeout errors from a downstream machine learning platform undergoing a deployment. The system connected these dots by analyzing log volumes, traces, and deployment metadata across multiple services.

"Without Deductive, our team would have had to manually correlate the latency spike across all logs, traces, and deployment histories," Ansari said. "Deductive was able to explain not just what changed, but how and why it impacted production behavior."

The company keeps humans in the loop—for now

While Deductive's technology could theoretically push fixes directly to production systems, the company has deliberately chosen to keep humans in the loop—at least for now.

"While our system is capable of deeper automation and could push fixes to production, currently, we recommend precise fixes and mitigations that engineers can review, validate, and apply," Agarwal said. "We believe maintaining a human in the loop is essential for trust, transparency and operational safety." However, he acknowledged that "over time, we do think that deeper automation will come and how humans operate in the loop will evolve."

Databricks and ThoughtSpot veterans bet on reasoning over observability

The founding team brings deep expertise from building some of Silicon Valley's most successful data infrastructure platforms. Agarwal earned his Ph.D. at UC Berkeley, where he created BlinkDB, an influential system for approximate query processing. He was among the first engineers at Databricks, where he helped build Apache Spark. Kothari was an early engineer at ThoughtSpot, where he led teams focused on distributed query processing and large-scale system optimization.

The investor syndicate reflects both the technical credibility and the market opportunity. Beyond CRV's Max Gazor, the round included participation from Ion Stoica, founder of Databricks and Anyscale; Ajeet Singh, founder of Nutanix and ThoughtSpot; and Ben Sigelman, founder of Lightstep.

Rather than competing with platforms like Datadog or PagerDuty, Deductive positions itself as a complementary layer that sits on top of existing tools. The pricing model reflects this: Instead of charging based on data volume, Deductive charges based on the number of incidents investigated, plus a base platform fee.

The company offers both cloud-hosted and self-hosted deployment options and emphasizes that it doesn't store customer data on its servers or use it to train models for other customers — a critical assurance given the proprietary nature of both code and production system behavior.

With fresh capital and early customer traction at companies like DoorDash, Foursquare, and Kumo AI, Deductive plans to expand its team and deepen the system's reasoning capabilities from reactive incident analysis to proactive prevention.
The near-term vision: helping teams predict problems before they occur.

DoorDash's Ansari offers a pragmatic endorsement of where the technology stands today: "Investigations that were previously manual and time-consuming are now automated, allowing engineers to shift their energy toward prevention, business impact, and innovation."

In an industry where every second of downtime translates to lost revenue, that shift from firefighting to building increasingly looks less like a luxury and more like table stakes.
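Deductive has not published its implementation, but the investigative pattern the article describes, with specialist agents gathering evidence for competing hypotheses and converging on a root cause, can be sketched generically. Everything below (class names, evidence fields, the confidence update rule) is illustrative only and is not Deductive's code.

```python
# Illustrative only: a toy version of the multi-agent investigation loop
# described in the article. Names and scoring rules are invented for the sketch.
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    cause: str
    confidence: float        # 0.0 - 1.0, raised as supporting evidence accumulates
    evidence: list = field(default_factory=list)

class SpecialistAgent:
    """Each agent inspects one evidence source (code diffs, traces, deploys)."""
    def __init__(self, name, evidence_source):
        self.name = name
        self.evidence_source = evidence_source  # callable: cause -> list of findings

    def investigate(self, hypothesis: Hypothesis) -> None:
        findings = self.evidence_source(hypothesis.cause)
        if findings:
            hypothesis.evidence.extend(findings)
            hypothesis.confidence = min(1.0, hypothesis.confidence + 0.2 * len(findings))

def investigate_incident(agents, hypotheses, rounds=3, threshold=0.8):
    """Iteratively refine hypotheses; return the best-supported root cause."""
    for _ in range(rounds):
        for hypothesis in hypotheses:
            for agent in agents:
                agent.investigate(hypothesis)
        best = max(hypotheses, key=lambda h: h.confidence)
        if best.confidence >= threshold:
            return best
    return max(hypotheses, key=lambda h: h.confidence)

# Usage sketch with stubbed evidence sources.
deploy_agent = SpecialistAgent(
    "deploys", lambda cause: ["ml-platform deploy at 02:58"] if "timeout" in cause else [])
trace_agent = SpecialistAgent(
    "traces", lambda cause: ["p99 latency spike in api-gateway"])
root = investigate_incident(
    [deploy_agent, trace_agent],
    [Hypothesis("downstream timeout", 0.3), Hypothesis("bad config push", 0.3)],
)
print(root.cause, round(root.confidence, 2))
```

In the real system, the article notes, the choice of which investigative steps to take next is itself learned with reinforcement learning from past incidents and engineer feedback, rather than the fixed rules shown here.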
The following article originally appeared on Medium and is being republished here with the author’s permission. Early on, I caught myself saying “you” to my AI tools—“Can you add retries?” “Great idea!”—like I was talking to a junior dev. And then I’d get mad when it didn’t “understand” me. That’s on me. These models aren’t […]
ChatGPT is about to become faster and more conversational as OpenAI upgrades its flagship model GPT-5 to GPT-5.1.

OpenAI announced two updates to the GPT-5 series: GPT-5.1 Instant and GPT-5.1 Thinking. Both models are now accessible on ChatGPT. GPT-5.1 Instant, essentially the default and most-used model, is now "warmer, more intelligent, and better at following your instructions," according to OpenAI. Meanwhile, GPT-5.1 Thinking is an advanced reasoning model that responds faster on simple tasks and more persistently on complex ones.

"We heard clearly from users that great AI should not only be smart, but also enjoyable to talk to," OpenAI said in a blog post. "GPT-5.1 improves meaningfully on both intelligence and communication style." The company added that both models offer a way for users to "shape ChatGPT's tone," allowing people to control how the chat platform responds depending on the conversation they are having.

Both models were rolled out to ChatGPT Pro, Plus, Go and Business users, as well as the free tier. Those on the Enterprise and Edu plans will get a seven-day early-access toggle for the models before GPT-5.1 becomes the default model. OpenAI said the models are also accessible through the API, both with adaptive reasoning. OpenAI has noted that it will soon update GPT-5 Pro to version 5.1.

Instant and Thinking models

The 5.1 tag reflects improvements to the base model, and OpenAI considers these part of the GPT-5 family, trained on the same stack and data as its reasoning models. The biggest difference between 5.1 and 5 is its more natural and conversational tone, OpenAI CEO of Applications Fidji Simo said in a Substack post. "Based on early testing, it often surprises people with its playfulness while remaining clear and useful," OpenAI said in its post.

Instant can use adaptive reasoning to help it decide when it needs to think about its answers, especially when it comes to more complicated questions. OpenAI noted that it has improved the model's instruction following, so that while it continues to respond quickly, it also directly addresses the user's query. Recent model releases, such as Baidu's ERNIE-4.5-VL-28B-A3B-Thinking, have been outperforming GPT-5 on benchmarks like instruction following.

GPT-5.1 Thinking can figure out on its own how much reasoning power it should devote to a prompt. It adapts to the type and complexity of a query, so it will take longer to answer a fuller, complex question than a simple summary request. OpenAI said evaluations showed that GPT-5.1 Thinking spends less time, and therefore uses fewer tokens, on simple tasks compared to GPT-5, outperforming the base model in speed of response. One thing enterprises should note is that GPT-5.1 Thinking answers "with less jargon and fewer undefined terms." OpenAI said removing jargony responses makes Thinking more approachable when it comes to explaining technical concepts.

More personalization

Another big update to ChatGPT is increased personalization, which allows users to toggle between a friendly and an authoritative chat experience in their conversations. ChatGPT already allows users to choose preset options for model tone, but the new update expands these options "to better reflect the most common ways people use ChatGPT." Options include "default," "friendly" (formerly "listener"), "efficient" (previously "robot"), "professional," "candid" and "quirky." Two other personalities, "cynical" and "nerdy," remain unchanged.
"We think many people will find that GPT-5.1 does a better job of bringing IQ and EQ together, but one default clearly can't meet everyone's needs," Simo said. "That's why we're also making it easier to customize ChatGPT with a range of presets to choose from. The model has the same capabilities whether you select default or one of these options, but the style of its responses will differ — more formal or familiar, more playful or direct, more or less jargon or slang. Of course, eight personalities still don't cover the full range of human diversity, but we know from our research that many people prefer simple, guided control over too many settings or open-ended options."

People can also adjust how much ChatGPT uses emojis. OpenAI offers granular controls for responses and is experimenting with the ability to make the models more concise, warm or scannable.

Saving a rollout

OpenAI's GPT-5 rollout was…less than perfect. While company executives, including CEO Sam Altman, touted the new model's capabilities, a decision to initially sunset older and beloved models on ChatGPT was met with dissatisfaction. Worse yet, many early adopters found that GPT-5 didn't perform better than older options in domains such as math, science and writing. This led Altman to walk back some of his statements around model removal, blaming performance issues on GPT-5's router.

The router, which automatically directs queries to the most suitable models, is not going away: GPT-5.1 Auto will route prompts to the model type that can best answer each query. OpenAI is careful to note that the GPT-5 Instant, Thinking and Pro models are still available in ChatGPT's model dropdown, although paid subscribers only have three months to compare these older versions with the 5.1 update. The sunset period for GPT-5, however, will not impact models like GPT-4o.

"Going forward, when we introduce new ChatGPT models, our approach is to give people ample space to evaluate what's changed and share feedback, allowing us to continue innovating our frontier models while transitioning smoothly," the company said. "Sunset periods will be communicated clearly and with plenty of advance notice."
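For teams that want to try the update programmatically, the article notes both models are exposed through the API. A minimal sketch with the official OpenAI Python SDK might look like the following; the model identifier gpt-5.1 is an assumption based on the naming in the announcement, so confirm the exact ID in OpenAI's model documentation before use.

```python
# Minimal sketch, assuming the API exposes a "gpt-5.1" model identifier
# (verify the exact name in OpenAI's model documentation).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5.1",  # assumed identifier from the announcement
    messages=[
        {"role": "system", "content": "Answer concisely and avoid jargon."},
        {"role": "user", "content": "Explain adaptive reasoning in two sentences."},
    ],
)
print(response.choices[0].message.content)
```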
OpenAI for health, AI memory vs reasoning, AI binoculars, build any app, and more...
Baidu Inc., China's largest search engine company, released a new artificial intelligence model on Monday that its developers claim outperforms competitors from Google and OpenAI on several vision-related benchmarks despite using a fraction of the computing resources typically required for such systems.

The model, dubbed ERNIE-4.5-VL-28B-A3B-Thinking, is the latest salvo in an escalating competition among technology companies to build AI systems that can understand and reason about images, videos, and documents alongside traditional text — capabilities increasingly critical for enterprise applications ranging from automated document processing to industrial quality control.

What sets Baidu's release apart is its efficiency: the model activates just 3 billion parameters during operation while maintaining 28 billion total parameters through a sophisticated routing architecture. According to documentation released with the model, this design allows it to match or exceed the performance of much larger competing systems on tasks involving document understanding, chart analysis, and visual reasoning while consuming significantly less computational power and memory.

"Built upon the powerful ERNIE-4.5-VL-28B-A3B architecture, the newly upgraded ERNIE-4.5-VL-28B-A3B-Thinking achieves a remarkable leap forward in multimodal reasoning capabilities," Baidu wrote in the model's technical documentation on Hugging Face, the AI model repository where the system was released. The company said the model underwent "an extensive mid-training phase" that incorporated "a vast and highly diverse corpus of premium visual-language reasoning data," dramatically boosting its ability to align visual and textual information semantically.

How the model mimics human visual problem-solving through dynamic image analysis

Perhaps the model's most distinctive feature is what Baidu calls "Thinking with Images" — a capability that allows the AI to dynamically zoom in and out of images to examine fine-grained details, mimicking how humans approach visual problem-solving tasks.

"The model thinks like a human, capable of freely zooming in and out of images to grasp every detail and uncover all information," according to the model card. When paired with tools like image search, Baidu claims this feature "dramatically elevates the model's ability to process fine-grained details and handle long-tail visual knowledge."

This approach marks a departure from traditional vision-language models, which typically process images at a fixed resolution.
By allowing dynamic image examination, the system can theoretically handle scenarios requiring both broad context and granular detail—such as analyzing complex technical diagrams or detecting subtle defects in manufacturing quality control.

The model also supports what Baidu describes as enhanced "visual grounding" capabilities with "more precise grounding and flexible instruction execution, easily triggering grounding functions in complex industrial scenarios," suggesting potential applications in robotics, warehouse automation, and other settings where AI systems must identify and locate specific objects in visual scenes.

Baidu's performance claims draw scrutiny as independent testing remains pending

Baidu's assertion that the model outperforms Google's Gemini 2.5 Pro and OpenAI's GPT-5-High on various document and chart understanding benchmarks has drawn attention across social media, though independent verification of these claims remains pending.

The company released the model under the permissive Apache 2.0 license, allowing unrestricted commercial use—a strategic decision that contrasts with the more restrictive licensing approaches of some competitors and could accelerate enterprise adoption. "Apache 2.0 is smart," wrote one X user responding to Baidu's announcement, highlighting the competitive advantage of open licensing in the enterprise market.

According to Baidu's documentation, the model demonstrates six core capabilities beyond traditional text processing. In visual reasoning, the system can perform what Baidu describes as "multi-step reasoning, chart analysis, and causal reasoning capabilities in complex visual tasks," aided by what the company characterizes as "large-scale reinforcement learning." For STEM problem solving, Baidu claims that "leveraging its powerful visual abilities, the model achieves a leap in performance on STEM tasks like solving problems from photos." The visual grounding capability allows the model to identify and locate objects within images with what Baidu characterizes as industrial-grade precision. Through tool integration, the system can invoke external functions, including image search, to access information beyond its training data. For video understanding, Baidu claims the model possesses "outstanding temporal awareness and event localization abilities, accurately identifying content changes across different time segments in a video." Finally, the Thinking with Images feature enables the dynamic zoom functionality that distinguishes this model from competitors.

Inside the mixture-of-experts architecture that powers efficient multimodal processing

Under the hood, ERNIE-4.5-VL-28B-A3B-Thinking employs a Mixture-of-Experts (MoE) architecture — a design pattern that has become increasingly popular for building efficient large-scale AI systems. Rather than activating all 28 billion parameters for every task, the model uses a routing mechanism to selectively activate only the 3 billion parameters most relevant to each specific input.

This approach offers substantial practical advantages for enterprise deployments. According to Baidu's documentation, the model can run on a single 80GB GPU — hardware readily available in many corporate data centers — making it significantly more accessible than competing systems that may require multiple high-end accelerators.

The technical documentation reveals that Baidu employed several advanced training techniques to achieve the model's capabilities.
The company used "cutting-edge multimodal reinforcement learning techniques on verifiable tasks, integrating GSPO and IcePop strategies to stabilize MoE training combined with dynamic difficulty sampling for exceptional learning efficiency." Baidu also notes that in response to "strong community demand," the company "significantly strengthened the model's grounding performance with improved instruction-following capabilities."

The new model fits into Baidu's ambitious multimodal AI ecosystem

The new release is one component of Baidu's broader ERNIE 4.5 model family, which the company unveiled in June 2025. That family comprises 10 distinct variants, including Mixture-of-Experts models ranging from the flagship ERNIE-4.5-VL-424B-A47B with 424 billion total parameters down to a compact 0.3-billion-parameter dense model.

According to Baidu's technical report on the ERNIE 4.5 family, the models incorporate "a novel heterogeneous modality structure, which supports parameter sharing across modalities while also allowing dedicated parameters for each individual modality." This architectural choice addresses a longstanding challenge in multimodal AI development: training systems on both visual and textual data without one modality degrading the performance of the other. Baidu claims this design "has the advantage to enhance multimodal understanding without compromising, and even improving, performance on text-related tasks."

The company reported achieving 47% Model FLOPs Utilization (MFU) — a measure of training efficiency — during pre-training of its largest ERNIE 4.5 language model, using the PaddlePaddle deep learning framework developed in-house.

Comprehensive developer tools aim to simplify enterprise deployment and integration

For organizations looking to deploy the model, Baidu has released a comprehensive suite of development tools through ERNIEKit, which the company describes as an "industrial-grade training and compression development toolkit." The model offers full compatibility with popular open-source frameworks including Hugging Face Transformers, vLLM (a high-performance inference engine), and Baidu's own FastDeploy toolkit. This multi-platform support could prove critical for enterprise adoption, allowing organizations to integrate the model into existing AI infrastructure without wholesale platform changes.

Sample code released by Baidu shows a relatively straightforward implementation path. Using the Transformers library, developers can load and run the model with approximately 30 lines of Python code, according to the documentation on Hugging Face. For production deployments requiring higher throughput, Baidu provides vLLM integration with specialized support for the model's "reasoning-parser" and "tool-call-parser" capabilities — features that enable the dynamic image examination and external tool integration that distinguish this model from earlier systems. The company also offers FastDeploy, a proprietary inference toolkit that Baidu claims delivers "production-ready, easy-to-use multi-hardware deployment solutions" with support for various quantization schemes that can reduce memory requirements and increase inference speed.

Why this release matters for the enterprise AI market at a critical inflection point

The release comes at a pivotal moment in the enterprise AI market.
As organizations move beyond experimental chatbot deployments toward production systems that process documents, analyze visual data, and automate complex workflows, demand for capable and cost-effective vision-language models has intensified.

Several enterprise use cases appear particularly well suited to the model's capabilities. Document processing — extracting information from invoices, contracts, and forms — represents a massive market where accurate chart and table understanding directly translates to cost savings through automation. Manufacturing quality control, where AI systems must detect visual defects, could benefit from the model's grounding capabilities. Customer service applications that handle images from users could leverage the multi-step visual reasoning.

The model's efficiency profile may prove especially attractive to mid-market organizations and startups that lack the computing budgets of large technology companies. By fitting on a single 80GB GPU — hardware costing roughly $10,000 to $30,000 depending on the specific model — the system becomes economically viable for a much broader range of organizations than models requiring multi-GPU setups costing hundreds of thousands of dollars.

"With all these new models, where's the best place to actually build and scale? Access to compute is everything," wrote one X user in response to Baidu's announcement, highlighting the persistent infrastructure challenges facing organizations attempting to deploy advanced AI systems.

The Apache 2.0 licensing further lowers barriers to adoption. Unlike models released under more restrictive licenses that may limit commercial use or require revenue sharing, organizations can deploy ERNIE-4.5-VL-28B-A3B-Thinking in production applications without ongoing licensing fees or usage restrictions.

Competition intensifies as Chinese tech giant takes aim at Google and OpenAI

Baidu's release intensifies competition in the vision-language model space, where Google, OpenAI, Anthropic, and Chinese companies including Alibaba and ByteDance have all released capable systems in recent months.

The company's performance claims — if validated by independent testing — would represent a significant achievement. Google's Gemini 2.5 Pro and OpenAI's GPT-5-High are substantially larger models backed by the deep resources of two of the world's most valuable technology companies. That a more compact, openly available model could match or exceed their performance on specific tasks would suggest the field is advancing more rapidly than some analysts anticipated.

"Impressive that ERNIE is outperforming Gemini 2.5 Pro," wrote one social media commenter, expressing surprise at the claimed results. However, some observers counseled caution about benchmark comparisons. "It's fascinating to see how multimodal models are evolving, especially with features like 'Thinking with Images,'" wrote one X user. "That said, I'm curious if ERNIE-4.5's edge over competitors like Gemini-2.5-Pro and GPT-5-High primarily lies in specific use cases like document and chart" understanding rather than general-purpose vision tasks.

Industry analysts note that benchmark performance often fails to capture real-world behavior across the diverse scenarios enterprises encounter. A model that excels at document understanding may struggle with creative visual tasks or real-time video analysis.
Organizations evaluating these systems typically conduct extensive internal testing on representative workloads before committing to production deployments.

Technical limitations and infrastructure requirements that enterprises must consider

Despite its capabilities, the model faces several technical challenges common to large vision-language systems. The minimum requirement of 80GB of GPU memory, while more accessible than some competitors, still represents a significant infrastructure investment. Organizations without existing GPU infrastructure would need to procure specialized hardware or rely on cloud computing services, introducing ongoing operational costs.

The model's context window — the amount of text and visual information it can process simultaneously — is listed as 128K tokens in Baidu's documentation. While substantial, this may prove limiting for some document processing scenarios involving very long technical manuals or extensive video content.

Questions also remain about the model's behavior on adversarial inputs, out-of-distribution data, and edge cases. Baidu's documentation does not provide detailed information about safety testing, bias mitigation, or failure modes — considerations increasingly important for enterprise deployments where errors could have financial or safety implications.

What technical decision-makers need to evaluate beyond the benchmark numbers

For technical decision-makers evaluating the model, several implementation factors warrant consideration beyond raw performance metrics.

The model's MoE architecture, while efficient during inference, adds complexity to deployment and optimization. Organizations must ensure their infrastructure can properly route inputs to the appropriate expert subnetworks — a capability not universally supported across all deployment platforms.

The "Thinking with Images" feature, while innovative, requires integration with image manipulation tools to achieve its full potential. Baidu's documentation suggests this capability works best "when paired with tools like image zooming and image search," implying that organizations may need to build additional infrastructure to fully leverage this functionality.

The model's video understanding capabilities, while highlighted in marketing materials, come with practical constraints. Processing video requires substantially more computational resources than static images, and the documentation does not specify maximum video length or optimal frame rates.

Organizations considering deployment should also evaluate Baidu's ongoing commitment to the model. Open-source AI models require continuing maintenance, security updates, and potential retraining as data distributions shift over time. While the Apache 2.0 license ensures the model remains available, future improvements and support depend on Baidu's strategic priorities.

Developer community responds with enthusiasm tempered by practical requests

Early response from the AI research and development community has been cautiously optimistic. Developers have requested versions of the model in additional formats, including GGUF (a quantization format popular for local deployment) and MNN (a mobile neural network framework), suggesting interest in running the system on resource-constrained devices. "Release MNN and GGUF so I can run it on my phone," wrote one developer, highlighting demand for mobile deployment options.

Other developers praised Baidu's technical choices while requesting additional resources.
"Fantastic model! Did you use discoveries from PaddleOCR?" asked one user, referencing Baidu's open-source optical character recognition toolkit.

The model's lengthy name—ERNIE-4.5-VL-28B-A3B-Thinking—drew lighthearted commentary. "ERNIE-4.5-VL-28B-A3B-Thinking might be the longest model name in history," joked one observer. "But hey, if you're outperforming Gemini-2.5-Pro with only 3B active params, you've earned the right to a dramatic name!"

Baidu plans to showcase the ERNIE lineup during its Baidu World 2025 conference on November 13, where the company is expected to provide additional details about the model's development, performance validation, and future roadmap.

The release marks a strategic move by Baidu to establish itself as a major player in the global AI infrastructure market. While Chinese AI companies have historically focused primarily on domestic markets, the open-source release under a permissive license signals ambitions to compete internationally with Western AI giants.

For enterprises, the release adds another capable option to a rapidly expanding menu of AI models. Organizations no longer face a binary choice between building proprietary systems or licensing closed-source models from a handful of vendors. The proliferation of capable open-source alternatives like ERNIE-4.5-VL-28B-A3B-Thinking is reshaping the economics of AI deployment and accelerating adoption across industries.

Whether the model delivers on its performance promises in real-world deployments remains to be seen. But for organizations seeking powerful, cost-effective tools for visual understanding and reasoning, one thing is certain. As one developer succinctly summarized: "Open source plus commercial use equals chef's kiss. Baidu not playing around."
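Baidu's claim that the model can be loaded and run "with approximately 30 lines of Python code" via Transformers suggests an implementation path along the following lines. This is a hedged sketch, not Baidu's sample code: the repository ID, the model/processor class choices, and the trust_remote_code flag are assumptions, and the exact preprocessing calls for this vision-language checkpoint may differ from what the Hugging Face model card shows.

```python
# Hedged sketch of loading ERNIE-4.5-VL-28B-A3B-Thinking with Transformers.
# The repo ID, class choices, and trust_remote_code usage are assumptions;
# follow the sample code on the Hugging Face model card for production use.
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

model_id = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"  # assumed repo ID
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,   # custom multimodal code may ship with the repo
    device_map="auto",        # fits on a single 80GB GPU per the documentation
    torch_dtype="auto",
)

image = Image.open("invoice.png")  # placeholder input for a document task
prompt = "Extract the total amount due and the invoice date."
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```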
Researchers at Meta FAIR and the National University of Singapore have developed a new reinforcement learning framework for self-improving AI systems. Called Self-Play In Corpus Environments (SPICE), the framework pits two AI agents against each other, creating its own challenges and gradually improving without human supervision.

While currently a proof of concept, this self-play mechanism could provide a basis for future AI systems that can dynamically adapt to their environments, making them more robust against the unpredictability of real-world applications.

The challenge of self-improving AI

The goal of self-improving AI is to create systems that can enhance their capabilities by interacting with their environment. A common approach is reinforcement learning with verifiable rewards (RLVR), where models are rewarded for providing the correct answers to problems. This is often limited by its reliance on human-curated problem sets and domain-specific reward engineering, which makes it difficult to scale.

Self-play, where a model improves by competing against itself, is another promising paradigm. But existing self-play methods for language models are often limited by two critical factors: factual errors in generated questions and answers compound, leading to a feedback loop of hallucinations; and when the problem generator and solver have information symmetry (i.e., share the same knowledge base), they fail to generate genuinely new challenges and fall into repetitive patterns. As the researchers note in their paper, "These systematic empirical failures indicate that self-improvement requires interaction with an external source providing diverse, verifiable feedback, rather than closed-loop pure introspection."

How SPICE works

SPICE is a self-play framework where a single model acts in two distinct roles. A "Challenger" constructs a curriculum of challenging problems from a large corpus of documents. A "Reasoner" then attempts to solve these problems without access to the source documents. This setup breaks the information symmetry that limits other self-play methods, as the Reasoner does not have access to the documents and knowledge that the Challenger uses to generate the problems.

Grounding the tasks in a vast and diverse corpus of documents prevents hallucination by anchoring questions and answers in real-world content. This is important because, for AI systems to reliably self-improve, they need external grounding sources. LLM agents should therefore learn from interactions with humans and the real world, not just their own outputs, to avoid compounding errors.

The adversarial dynamic between the two roles creates an automatic curriculum. The Challenger is rewarded for generating problems that are both diverse and at the frontier of the Reasoner's capability (not too easy, but not impossible). The Reasoner is rewarded for answering correctly. This symbiotic interaction pushes both agents to continuously discover and overcome new challenges. Because the system uses raw documents instead of pre-defined question-answer pairs, it can generate diverse task formats, such as multiple-choice and free-form questions. This flexibility allows SPICE to be applied to any domain, breaking the bottleneck that has confined previous methods to narrow fields like math and code.
It also reduces dependence on expensive human-curated datasets for specialized domains like legal or medical analysis.

SPICE in action

The researchers evaluated SPICE on several base models, including Qwen3-4B-Base and OctoThinker-3B-Hybrid-Base. They compared its performance against baselines such as the base model with no training, a Reasoner model trained with a fixed "Strong Challenger" (Qwen3-32B-Instruct), and pure self-play methods like R-Zero and Absolute Zero. The evaluation covered a wide range of mathematical and general reasoning benchmarks.

Across all models, SPICE consistently outperformed the baselines, delivering significant improvements in both mathematical and general reasoning tasks. The results show that the reasoning capabilities developed through corpus-grounded self-play transfer broadly across different models, thanks to the diverse external knowledge corpus used for training.

A key finding is that the adversarial dynamic creates an effective automatic curriculum. As training progresses, the Challenger learns to generate increasingly difficult problems. In one experiment, the Reasoner's pass rate on a fixed set of problems increased from 55% to 85% over time, showing its improved capabilities. Meanwhile, later versions of the Challenger were able to generate questions that dropped the pass rate of an early-stage Reasoner from 55% to 35%, confirming that both roles co-evolve successfully.

The researchers conclude that this approach represents a paradigm shift in self-improving reasoning methods, from "closed-loop self-play that often stagnates due to hallucination drift, to open-ended improvement through interaction with the vast, verifiable knowledge embedded in web document corpora."

Currently, the corpus used for SPICE represents human experience captured in text. The ultimate goal is for self-improving systems to generate questions based on interactions with reality, including the physical world, the internet, and human interactions across multiple modalities like video, audio, and sensor data.
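The Challenger/Reasoner dynamic can be summarized in a structural Python sketch. This is not Meta's released code, and the reward shaping shown (peaking when the Reasoner's pass rate sits at the frontier of its ability) is a simplification of the paper's objective; the challenger and reasoner arguments stand in for any objects exposing generate, solve, and update methods.

```python
# Structural sketch of one SPICE self-play iteration (not Meta's implementation).
import random

def challenger_reward(pass_rate: float) -> float:
    """Reward frontier-difficulty problems: highest when the Reasoner's pass
    rate is neither 0 (impossible) nor 1 (trivial)."""
    return 1.0 - abs(pass_rate - 0.5) * 2.0

def spice_iteration(challenger, reasoner, corpus, k_samples=8):
    # 1. The Challenger grounds a new problem in a retrieved document,
    #    which the Reasoner never sees (information asymmetry).
    document = random.choice(corpus)
    problem, reference_answer = challenger.generate(document)

    # 2. The Reasoner attempts the problem K times without the document.
    attempts = [reasoner.solve(problem) for _ in range(k_samples)]
    pass_rate = sum(a == reference_answer for a in attempts) / k_samples

    # 3. Both roles are updated from verifiable signals: the Challenger for
    #    posing frontier-difficulty problems, the Reasoner for answering correctly.
    challenger.update(reward=challenger_reward(pass_rate))
    reasoner.update(rewards=[float(a == reference_answer) for a in attempts])
    return pass_rate
```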
In this post, we demonstrate how you can use the A2A protocol to let AI agents built with different frameworks collaborate seamlessly. You'll learn how to deploy A2A servers on AgentCore Runtime, configure agent discovery and authentication, and build a real-world multi-agent system for incident response. We'll cover the complete A2A request lifecycle, from agent card discovery to task delegation, showing how standardized protocols eliminate the complexity of multi-agent coordination.
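As a taste of the first step in that lifecycle, discovery in A2A starts by fetching an agent's published "agent card." The sketch below uses a placeholder base URL and shows only representative fields rather than the full schema; consult the A2A specification and the post itself for the exact card format and the AgentCore-specific endpoints.

```python
# Hedged sketch: fetching and inspecting an A2A agent card.
# The URL is a placeholder and the fields printed are representative, not exhaustive.
import requests

AGENT_BASE_URL = "https://agents.example.com/incident-triage"  # placeholder

def discover_agent(base_url: str) -> dict:
    """Fetch the agent card from the well-known discovery path."""
    return requests.get(f"{base_url}/.well-known/agent.json", timeout=10).json()

card = discover_agent(AGENT_BASE_URL)
print(card.get("name"), "-", card.get("description"))
for skill in card.get("skills", []):
    print("  skill:", skill.get("id"), skill.get("description"))
```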
The Cohere Embed 4 multimodal embeddings model is now available as a fully managed, serverless option in Amazon Bedrock. In this post, we dive into the benefits and unique capabilities of Embed 4 for enterprise search use cases. We’ll show you how to quickly get started using Embed 4 on Amazon Bedrock, taking advantage of integrations with Strands Agents, S3 Vectors, and Amazon Bedrock AgentCore to build powerful agentic retrieval-augmented generation (RAG) workflows.
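A minimal sketch of calling a Cohere Embed model through the Bedrock Runtime API is shown below. The model ID and request-body fields are assumptions modeled on the existing Cohere Embed integration on Bedrock; check the Bedrock model catalog and the Embed 4 documentation for the exact identifier and payload schema before using this.

```python
# Hedged sketch: serverless embedding call via Amazon Bedrock Runtime.
# The model ID and body fields below are assumptions; verify them against
# the Bedrock model catalog entry for Cohere Embed 4.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

body = {
    "texts": ["How do I rotate my access keys?"],
    "input_type": "search_query",   # or "search_document" when indexing content
}
response = bedrock.invoke_model(
    modelId="cohere.embed-v4:0",    # assumed ID; confirm in the console
    body=json.dumps(body),
)
payload = json.loads(response["body"].read())
print(list(payload.keys()))  # inspect the response shape; it varies by model version
```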
The regulatory landscape for GxP compliance is evolving to address the unique characteristics of AI. Traditional Computer System Validation (CSV) approaches, often with uniform validation strategies, are being supplemented by Computer Software Assurance (CSA) frameworks that emphasize flexible risk-based validation methods tailored to each system's actual impact and complexity (FDA latest guidance). In this post, we cover a risk-based implementation, practical implementation considerations across different risk levels, the AWS shared responsibility model for compliance, and concrete examples of risk mitigation strategies.
In this post, we explore four key collaboration patterns for multi-agent, multimodal AI systems – Agents as Tools, Swarms Agents, Agent Graphs, and Agent Workflows – and discuss when and how to apply each using the open-source AWS Strands Agents SDK with Amazon Nova models.
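To give a flavor of the first pattern, "Agents as Tools" wraps a specialist agent behind a tool interface that an orchestrator agent can call. The sketch below is written against a simplified, assumed surface of the Strands Agents SDK; the import path, decorator, and Nova model identifier may differ from the SDK's actual API, so treat it as pseudocode and follow the post for working code.

```python
# Simplified sketch of the "Agents as Tools" pattern. The Strands Agents
# imports and the Nova model ID are assumptions; consult the SDK docs for
# the real API surface.
from strands import Agent, tool  # assumed import path

@tool
def research_assistant(query: str) -> str:
    """Specialist agent exposed as a tool the orchestrator can invoke."""
    researcher = Agent(
        model="us.amazon.nova-pro-v1:0",  # assumed Nova model ID
        system_prompt="You are a research specialist. Answer with sources.",
    )
    return str(researcher(query))

orchestrator = Agent(
    model="us.amazon.nova-pro-v1:0",      # assumed Nova model ID
    system_prompt="Route research questions to the research_assistant tool.",
    tools=[research_assistant],
)

print(orchestrator("Summarize recent work on multimodal embeddings."))
```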
Senior software developers are preparing for a major shift in how they work as artificial intelligence becomes central to their workflows, according to BairesDev's latest Dev Barometer report published today. VentureBeat was given an exclusive early look, and the findings below come directly from that report.

The quarterly global survey, which polled 501 developers and 19 project managers across 92 software initiatives, finds that nearly two-thirds (65%) of senior developers expect their roles to be redefined by AI in 2026. The data highlights a transformation underway in software development: fewer routine coding tasks, more emphasis on design and strategy, and a rising need for AI fluency.

From Coders to Strategists

Among those anticipating change, 74% say they expect to shift from hands-on coding to designing solutions. Another 61% plan to integrate AI-generated code into their workflows, and half foresee spending more time on system strategy and architecture.

"It's not about lines of code anymore," said Justice Erolin, Chief Technology Officer at BairesDev, in a recent interview with VentureBeat conducted over video call. "It's about the quality and type of code, and the kind of work developers are doing." Erolin said the company is watching developers evolve from individual contributors into system thinkers. "AI is great at code scaffolding and generating unit tests, saving developers around eight hours a week," he explained. "That time can now be used for solution architecture and strategy work—areas where AI still falls short."

The survey's data reflects this shift. Developers are moving toward higher-value tasks while automation takes over much of the repetitive coding that once occupied junior engineers. Erolin noted that BairesDev's internal data mirrors these findings. "We're seeing a shift where senior engineers with AI tools are outperforming, and even replacing, the traditional senior-plus-junior team setup," he said.

Realism About AI's Limits

Despite widespread enthusiasm, developers remain cautious about AI's reliability. Over half (56%) describe AI-generated code as "somewhat reliable," saying it still requires validation for accuracy and security. Only 9% trust it enough to use without human oversight.

Erolin agreed with that sentiment. "AI doesn't replace human oversight," he said. "Even as tools improve, developers still need to understand how individual components fit into the bigger system." He added that the biggest constraint in large language models today is "their context window"—the limited ability to retain and reason across entire systems. "Engineers need to think holistically about architecture, not just individual lines of code," he said.

The CTO described 2025 as a turning point for how engineers use AI tools like GitHub Copilot, Cursor, Claude, and OpenAI's models. "We're tracking what tools and models our engineers use," he said. "But the bigger story is how those tools impact learning, productivity, and oversight." That tempered optimism aligns with BairesDev's previous Dev Barometer findings, which reported that 92% of developers were already using AI-assisted coding by Q3 2025, saving an average of 7.3 hours per week.

A Year of Upskilling

In 2025, AI integration already brought tangible professional benefits.
74% of developers said the technology strengthened their technical skills, 50% reported better work-life balance, and 37% said AI tools expanded their career opportunities.

Erolin said the company is seeing AI emerge as "a top use case for upskilling." Developers use it to "learn new technologies faster and fill knowledge gaps," he noted. "When developers understand how AI works and its limitations, they can use it to enhance—not replace—their critical thinking. They prompt better and learn more efficiently."

Still, he warned of a potential long-term risk in the industry's current trajectory. "If junior engineers are being replaced or not hired, we'll face a shortage of qualified senior engineers in ten years as current ones retire," Erolin said. The Dev Barometer findings echo that concern. Developers expect leaner teams, but many also worry that fewer entry-level opportunities could lead to long-term talent pipeline issues.

Leaner Teams, New Priorities

Developers expect 2026 to bring smaller, more specialized teams. 58% say automation will reduce entry-level tasks, while 63% expect new career paths to emerge as AI redefines team structures. 59% anticipate that AI will create entirely new specialized roles.

According to BairesDev's data, developers currently divide their time between writing code (48%), debugging (42%), and documentation (35%). Only 19% report focusing primarily on creative problem-solving and innovation—a share that's expected to grow as AI removes lower-level coding tasks. The report also highlights where developers see the fastest-growing areas for 2026: AI/ML (67%), data analytics (46%), and cybersecurity (45%). In parallel, 63% of project managers said developers will need more training in AI, cloud, and security.

Erolin described the next generation of developers as "T-shaped engineers"—people with broad system knowledge and deep expertise in one or more areas. "The most important developer moving forward will be the T-shaped engineer," he said. "Broad in understanding, deep in skill."

AI as an Industry Standard

The Q4 Dev Barometer frames AI not as an experiment but as a foundation for how teams will operate in 2026. Developers are moving beyond using AI as a coding shortcut and are instead incorporating it into architecture, validation, and design decisions.

Erolin emphasized that BairesDev is already adapting its internal teams to this new reality. "Our engineers are full-time with us, and we staff them out where they're needed," he said. "Some clients need help for six months to a year; others outsource their entire dev team to us." He said BairesDev provides "about 5,000 software engineers from Latin America, offering clients timezone-aligned, culturally aligned, and highly fluent English-speaking talent."

As developers integrate AI deeper into their daily work, Erolin believes the competitive advantage will belong to those who understand both the technology's capabilities and its constraints. "When developers learn to collaborate with AI instead of compete against it, that's when the real productivity and creativity gains happen," he said.

Background: Who BairesDev Is

Founded in Buenos Aires in 2009 by Nacho De Marco and Paul Azorin, BairesDev began with a mission to connect what it describes as the "top 1%" of Latin American developers with global companies seeking high-quality software solutions.
The company grew from those early roots into a major nearshore software development and staffing provider, offering everything from individual developer placements to full end-to-end project outsourcing.Today, BairesDev claims to have delivered more than 1,200 projects across 130+ industries, serving hundreds of clients ranging from startups to Fortune 500 firms such as Google, Adobe, and Rolls-Royce. It operates with a remote-first model and a workforce of over 4,000 professionals across more than 40 countries, aligning its teams to North American time zones.The company emphasizes three core advantages: access to elite technical talent across 100+ technologies, rapid scalability for project needs, and nearshore proximity for real-time collaboration. It reports client relationships averaging over three years and a satisfaction rate around 91%.BairesDev’s unique position—bridging Latin American talent with global enterprise clients—gives it an unusually data-rich perspective on how AI is transforming software development at scale.The TakeawayThe Dev Barometer’s Q4 2025 results suggest 2026 will mark a turning point for software engineering. Developers are becoming system architects rather than pure coders, AI literacy is becoming a baseline requirement, and traditional entry-level roles may give way to new, specialized positions.As AI becomes embedded in every stage of development—from design to testing—developers who can combine technical fluency with strategic thinking are set to lead the next era of software creation.
Introducing Private AI Compute, our new way to bring you helpful AI with the power of the cloud, while keeping your data private to you.
Learn more about new AI tools in Google Photos, including Nano Banana image-generation and more.
We’ve been bombarded with claims about how much generative AI improves software developer productivity: It turns regular programmers into 10x programmers, and 10x programmers into 100x. And even more recently, we’ve been (somewhat less, but still) bombarded with the other side of the story: METR reports that, despite software developers’ belief that their productivity has […]
Our new paper analyzes the important ways AI systems organize the visual world differently from humans.
Building machine learning models in high-stakes contexts like finance, healthcare, and critical infrastructure often demands robustness, explainability, and other domain-specific constraints.
We’re bringing together experts, students, educators and more at our Google AI for Learning Forum.
3 AI business models, OpenAI's targets, AI legends speak, and more...
Associate Professor Phillip Isola studies the ways in which intelligent machines “think,” in an effort to safely integrate AI into human society.
Meta has just released a new multilingual automatic speech recognition (ASR) system supporting 1,600+ languages — dwarfing OpenAI’s open source Whisper model, which supports just 99. Its architecture also allows developers to extend that support to thousands more. Through a feature called zero-shot in-context learning, users can provide a few paired examples of audio and text in a new language at inference time, enabling the model to transcribe additional utterances in that language without any retraining.

In practice, this expands potential coverage to more than 5,400 languages — roughly every spoken language with a known script.

It’s a shift from static model capabilities to a flexible framework that communities can adapt themselves. So while the 1,600 languages reflect official training coverage, the broader figure represents Omnilingual ASR’s capacity to generalize on demand, making it the most extensible speech recognition system released to date.

Best of all: it's been open sourced under a plain Apache 2.0 license — not a restrictive, quasi open-source Llama license like the company's prior releases, which limited use by larger enterprises unless they paid licensing fees — meaning researchers and developers are free to take and implement it right away, for free, without restrictions, even in commercial and enterprise-grade projects.

Released on November 10 on Meta's website and GitHub, along with a demo space on Hugging Face and a technical paper, Meta’s Omnilingual ASR suite includes a family of speech recognition models, a 7-billion parameter multilingual audio representation model, and a massive speech corpus spanning over 350 previously underserved languages. All resources are freely available under open licenses, and the models support speech-to-text transcription out of the box.

“By open sourcing these models and dataset, we aim to break down language barriers, expand digital access, and empower communities worldwide,” Meta posted on its @AIatMeta account on X.

Designed for Speech-to-Text Transcription

At its core, Omnilingual ASR is a speech-to-text system. The models are trained to convert spoken language into written text, supporting applications like voice assistants, transcription tools, subtitles, oral archive digitization, and accessibility features for low-resource languages.

Unlike earlier ASR models that required extensive labeled training data, Omnilingual ASR includes a zero-shot variant. This version can transcribe languages it has never seen before — using just a few paired examples of audio and corresponding text. That dramatically lowers the barrier to adding new or endangered languages, removing the need for large corpora or retraining.

Model Family and Technical Design

The Omnilingual ASR suite includes multiple model families trained on more than 4.3 million hours of audio from 1,600+ languages:

- wav2vec 2.0 models for self-supervised speech representation learning (300M–7B parameters)
- CTC-based ASR models for efficient supervised transcription
- LLM-ASR models combining a speech encoder with a Transformer-based text decoder for state-of-the-art transcription
- LLM-ZeroShot ASR model, enabling inference-time adaptation to unseen languages

All models follow an encoder–decoder design: raw audio is converted into a language-agnostic representation, then decoded into written text.

Why the Scale Matters

While Whisper and similar models have advanced ASR capabilities for global languages, they fall short on the long tail of human linguistic diversity. Whisper supports 99 languages.
Meta’s system:

- Directly supports 1,600+ languages
- Can generalize to 5,400+ languages using in-context learning
- Achieves character error rates (CER) under 10% in 78% of supported languages

Among those supported are more than 500 languages never previously covered by any ASR model, according to Meta’s research paper. This expansion opens new possibilities for communities whose languages are often excluded from digital tools.

Background: Meta’s AI Overhaul and a Rebound from Llama 4

The release of Omnilingual ASR arrives at a pivotal moment in Meta’s AI strategy, following a year marked by organizational turbulence, leadership changes, and uneven product execution. Omnilingual ASR is the first major open-source model release since the rollout of Llama 4, Meta’s latest large language model, which debuted in April 2025 to mixed and ultimately poor reviews, with scant enterprise adoption compared to Chinese open source model competitors.

The failure led Meta founder and CEO Mark Zuckerberg to appoint Alexandr Wang, co-founder and former CEO of AI data supplier Scale AI, as Chief AI Officer, and to embark on an extensive and costly hiring spree that shocked the AI and business communities with eye-watering pay packages for top AI researchers.

In contrast, Omnilingual ASR represents a strategic and reputational reset. It returns Meta to a domain where the company has historically led — multilingual AI — and offers a truly extensible, community-oriented stack with minimal barriers to entry. The system’s support for 1,600+ languages and its extensibility to over 5,000 more via zero-shot in-context learning reassert Meta’s engineering credibility in language technology. Importantly, it does so through a free and permissively licensed release, under Apache 2.0, with transparent dataset sourcing and reproducible training protocols.

This shift aligns with broader themes in Meta’s 2025 strategy. The company has refocused its narrative around a “personal superintelligence” vision, investing heavily in infrastructure (including a September release of custom AI accelerators and Arm-based inference stacks) while downplaying the metaverse in favor of foundational AI capabilities. The return to public training data in Europe after a regulatory pause also underscores its intention to compete globally, despite privacy scrutiny.

Omnilingual ASR, then, is more than a model release — it’s a calculated move to reassert control of the narrative: from the fragmented rollout of Llama 4 to a high-utility, research-grounded contribution that aligns with Meta’s long-term AI platform strategy.

Community-Centered Dataset Collection

To achieve this scale, Meta partnered with researchers and community organizations in Africa, Asia, and elsewhere to create the Omnilingual ASR Corpus, a 3,350-hour dataset across 348 low-resource languages. Contributors were local speakers who were compensated for their recordings, which were gathered in collaboration with groups like:

- African Next Voices: a Gates Foundation–supported consortium including Maseno University (Kenya), University of Pretoria, and Data Science Nigeria
- Mozilla Foundation’s Common Voice, supported through the Open Multilingual Speech Fund
- Lanfrica / NaijaVoices, which created data for 11 African languages including Igala, Serer, and Urhobo

The data collection focused on natural, unscripted speech.
Prompts were designed to be culturally relevant and open-ended, such as “Is it better to have a few close friends or many casual acquaintances? Why?” Transcriptions used established writing systems, with quality assurance built into every step.

Performance and Hardware Considerations

The largest model in the suite, the omniASR_LLM_7B, requires ~17GB of GPU memory for inference, making it suitable for deployment on high-end hardware. Smaller models (300M–1B) can run on lower-power devices and deliver real-time transcription speeds. Performance benchmarks show strong character error rates (CER) even in low-resource scenarios.
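The benchmark metric cited throughout Meta's results is character error rate. For readers unfamiliar with it, here is a minimal, self-contained sketch of the standard CER calculation (character-level edit distance divided by the number of reference characters). This is the generic definition, not code from Meta's Omnilingual ASR release, and the example strings are made up.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = character-level edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Example: one substitution and one deletion against a 26-character reference,
# giving a CER of 2/26 ~= 0.077.
print(round(character_error_rate("omnilingual speech to text",
                                 "omnilingual speach to tex"), 3))
```

By this definition, the "CER under 10%" threshold Meta reports corresponds to fewer than one character error per ten reference characters.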
In this post, we demonstrate that fine-tuning VLMs provides a powerful and flexible approach to automate and significantly enhance document understanding capabilities. We also demonstrate that using focused fine-tuning allows smaller, multi-modal models to compete effectively with much larger counterparts (98% accuracy with Qwen2.5 VL 3B).
Chronosphere, a New York-based observability startup valued at $1.6 billion, announced Monday it will launch AI-Guided Troubleshooting capabilities designed to help engineers diagnose and fix production software failures — a problem that has intensified as artificial intelligence tools accelerate code creation while making systems harder to debug.The new features combine AI-driven analysis with what Chronosphere calls a Temporal Knowledge Graph, a continuously updated map of an organization's services, infrastructure dependencies, and system changes over time. The technology aims to address a mounting challenge in enterprise software: developers are writing code faster than ever with AI assistance, but troubleshooting remains largely manual, creating bottlenecks when applications fail."For AI to be effective in observability, it needs more than pattern recognition and summarization," said Martin Mao, Chronosphere's CEO and co-founder, in an exclusive interview with VentureBeat. "Chronosphere has spent years building the data foundation and analytical depth needed for AI to actually help engineers. With our Temporal Knowledge Graph and advanced analytics capabilities, we're giving AI the understanding it needs to make observability truly intelligent — and giving engineers the confidence to trust its guidance."The announcement comes as the observability market — software that monitors complex cloud applications— faces mounting pressure to justify escalating costs. Enterprise log data volumes have grown 250% year-over-year, according to Chronosphere's own research, while a study from MIT and the University of Pennsylvania found that generative AI has spurred a 13.5% increase in weekly code commits, signifying faster development velocity but also greater system complexity.AI writes code 13% faster, but debugging stays stubbornly manualDespite advances in automated code generation, debugging production failures remains stubbornly manual. When a major e-commerce site slows during checkout or a banking app fails to process transactions, engineers must sift through millions of data points — server logs, application traces, infrastructure metrics, recent code deployments — to identify root causes.Chronosphere's answer is what it calls AI-Guided Troubleshooting, built on four core capabilities: automated "Suggestions" that propose investigation paths backed by data; the Temporal Knowledge Graph that maps system relationships and changes; Investigation Notebooks that document each troubleshooting step for future reference; and natural language query building.Mao explained the Temporal Knowledge Graph in practical terms: "It's a living, time-aware model of your system. It stitches together telemetry—metrics, traces, logs—infrastructure context, change events like deploys and feature flags, and even human input like notes and runbooks into a single, queryable map that updates as your system evolves."This differs fundamentally from the service dependency maps offered by competitors like Datadog, Dynatrace, and Splunk, Mao argued. "It adds time, not just topology," he said. "It tracks how services and dependencies change over time and connects those changes to incidents—what changed and why. 
Many tools rely on standardized integrations; our graph goes a step further to normalize custom, non-standard telemetry so application-specific signals aren't a blind spot."Why Chronosphere shows its work instead of making automatic decisionsUnlike purely automated systems, Chronosphere designed its AI features to keep engineers in the driver's seat—a deliberate choice meant to address what Mao calls the "confident-but-wrong guidance" problem plaguing early AI observability tools."'Keeping engineers in control' means the AI shows its work, proposes next steps, and lets engineers verify or override — never auto-deciding behind the scenes," Mao explained. "Every Suggestion includes the evidence—timing, dependencies, error patterns — and a 'Why was this suggested?' view, so they can inspect what was checked and ruled out before acting."He walked through a concrete example: "An SLO [service level objective] alert fires on Checkout. Chronosphere immediately surfaces a ranked Suggestion: errors appear to have started in the dependent Payment service. An engineer can click Investigate to see the charts and reasoning and, if it holds up, choose to dig deeper. As they steer into Payment, the system adapts with new Suggestions scoped to that service—all from one view, no tab-hopping."In this scenario, the engineer asks "what changed?" and the system pulls in change events. "Our Notebook capability makes the causal chain plain: a feature-flag update preceded pod memory exhaustion in Payment; Checkout's spike is a downstream symptom," Mao said. "They can decide to roll back the flag. That whole path — suggestions followed, evidence viewed, conclusions—is captured automatically in an Investigation Notebook, and the outcome feeds the Temporal Knowledge Graph so similar future incidents are faster to resolve."How a $1.6 billion startup takes on Datadog, Dynatrace, and SplunkChronosphere enters an increasingly crowded field. Datadog, the publicly traded observability leader valued at over $40 billion, has introduced its own AI-powered troubleshooting features. So have Dynatrace and Splunk. All three offer comprehensive "all-in-one" platforms that promise single-pane-of-glass visibility.Mao distinguished Chronosphere's approach on technical grounds. "Early 'AI for observability' leaned heavily on pattern-spotting and summarization, which tends to break down during real incidents," he said. "These approaches often stop at correlating anomalies or producing fluent explanations without the deeper analysis and causal reasoning observability leaders need. They can feel impressive in demos but disappoint in production—they summarize signals rather than explain cause and effect."A specific technical gap, he argued, involves custom application telemetry. "Most platforms reason over standardized integrations—Kubernetes, common cloud services, popular databases—ignoring the most telling clues that live in custom app telemetry," Mao said. "With an incomplete picture, large language models will 'fill in the gaps,' producing confident-but-wrong guidance that sends teams down dead ends."Chronosphere's competitive positioning received validation in July when Gartner named it a Leader in the 2025 Magic Quadrant for Observability Platforms for the second consecutive year. The firm was recognized based on both "Completeness of Vision" and "Ability to Execute." 
In December 2024, Chronosphere also tied for the highest overall rating among recognized vendors in Gartner Peer Insights' "Voice of the Customer" report, scoring 4.7 out of 5 based on 70 reviews.Yet the company faces intensifying competition for high-profile customers. UBS analysts noted in July that OpenAI now runs both Datadog and Chronosphere side-by-side to monitor GPU workloads, suggesting the AI leader is evaluating alternatives. While UBS maintained its buy rating on Datadog, the analysts warned that growing Chronosphere usage could pressure Datadog's pricing power.Inside the 84% cost reduction claims—and what CIOs should actually measureBeyond technical capabilities, Chronosphere has built its market position on cost control — a critical factor as observability spending spirals. The company claims its platform reduces data volumes and associated costs by 84% on average while cutting critical incidents by up to 75%.When pressed for specific customer examples with real numbers, Mao pointed to several case studies. "Robinhood has seen a 5x improvement in reliability and a 4x improvement in Mean Time to Detection," he said. "DoorDash used Chronosphere to improve governance and standardize monitoring practices. Astronomer achieved over 85% cost reduction by shaping data on ingest, and Affirm scaled their load 10x during a Black Friday event with no issues, highlighting the platform's reliability under extreme conditions."The cost argument matters because, as Paul Nashawaty, principal analyst at CUBE Research, noted when Chronosphere launched its Logs 2.0 product in June: "Organizations are drowning in telemetry data, with over 70% of observability spend going toward storing logs that are never queried."For CIOs fatigued by "AI-powered" announcements, Mao acknowledged skepticism is warranted. "The way to cut through it is to test whether the AI shortens incidents, reduces toil, and builds reusable knowledge in your own environment, not in a demo," he advised. He recommended CIOs evaluate three factors: transparency and control (does the system show its reasoning?), coverage of custom telemetry (can it handle non-standardized data?), and manual toil avoided (how many ad-hoc queries and tool-switches are eliminated?).Why Chronosphere partners with five vendors instead of building everything itselfAlongside the AI troubleshooting announcement, Chronosphere revealed a new Partner Program integrating five specialized vendors to fill gaps in its platform: Arize for large language model monitoring, Embrace for real user monitoring, Polar Signals for continuous profiling, Checkly for synthetic monitoring, and Rootly for incident management.The strategy represents a deliberate bet against the all-in-one platforms dominating the market. "While an all-in-one platform may be sufficient for smaller organizations, global enterprises demand best-in-class depth across each domain," Mao said. "This is what drove us to build our Partner Program and invest in seamless integrations with leading providers—so our customers can operate with confidence and clarity at every layer of observability."Noah Smolen, head of partnerships at Arize, said the collaboration addresses a specific enterprise need. "With a wide array of Fortune 500 customers, we understand the high bar needed to ensure AI agent systems are ready to deploy and stay incident-free, especially given the pace of AI adoption in the enterprise," Smolen said. 
"Our partnership with Chronosphere comes at a time when an integrated purpose-built cloud-native and AI-observability suite solves a huge pain point for forward-thinking C-suite leaders who demand the very best across their entire observability stack."Similarly, JJ Tang, CEO and founder of Rootly, emphasized the incident resolution benefits. "Incidents hinder innovation and revenue, and the challenge lies in sifting through vast amounts of observability data, mobilizing teams, and resolving issues quickly," Tang said. "Integrating Chronosphere with Rootly allows engineers to collaborate with context and resolve issues faster within their existing communication channels, drastically reducing time to resolution and ultimately improving reliability—78% plus decreases in repeat Sev0 and Sev1 incidents."When asked how total costs compare when customers use multiple partner contracts versus a single platform, Mao acknowledged the current complexity. "At present, mutual customers typically maintain separate contracts unless they engage through a services partner or system integrator," he said. However, he argued the economics still favor the composable approach: "Our combined technologies deliver exceptional value—in most circumstances at just a fraction of the price of a single-platform solution. Beyond the savings, customers gain a richer, more unified observability experience that unlocks deeper insights and greater efficiency, especially for large-scale environments."The company plans to streamline this over time. "As the ISV program matures, we're focused on delivering a more streamlined experience by transitioning to a single, unified contract that simplifies procurement and accelerates time to value," Mao said.How two Uber engineers turned Halloween outages into a billion-dollar startupChronosphere's origins trace to 2019, when Mao and co-founder Rob Skillington left Uber after building the ride-hailing giant's internal observability platform. At Uber, Mao's team had faced a crisis: the company's in-house tools would fail on its two busiest nights — Halloween and New Year's Eve — cutting off visibility into whether customers could request rides or drivers could locate passengers.The solution they built at Uber used open-source software and ultimately allowed the company to operate without outages, even during high-volume events. But the broader market insight came at an industry conference in December 2018, when major cloud providers threw their weight behind Kubernetes, Google's container orchestration technology."This meant that most technology architectures were eventually going to look like Uber's," Mao recalled in an August 2024 profile by Greylock Partners, Chronosphere's lead investor. "And that meant every company, not just a few big tech companies and the Walmarts of the world, would have the exact same problem we had solved at Uber."Chronosphere has since raised more than $343 million in funding across multiple rounds led by Greylock, Lux Capital, General Atlantic, Addition, and Founders Fund. 
The company operates as a remote-first organization with offices in New York, Austin, Boston, San Francisco, and Seattle, employing approximately 299 people according to LinkedIn data.The company's customer base includes DoorDash, Zillow, Snap, Robinhood, and Affirm — predominantly high-growth technology companies operating cloud-native, Kubernetes-based infrastructures at massive scale.What's available now—and what enterprises can expect in 2026Chronosphere's AI-Guided Troubleshooting capabilities, including Suggestions and Investigation Notebooks, entered limited availability Monday with select customers. The company plans full general availability in 2026. The Model Context Protocol (MCP) Server, which enables engineers to integrate Chronosphere directly into internal AI workflows and query observability data through AI-enabled development environments, is available immediately for all Chronosphere customers.The phased rollout reflects the company's cautious approach to deploying AI in production environments where mistakes carry real costs. By gathering feedback from early adopters before broad release, Chronosphere aims to refine its guidance algorithms and validate that its suggestions genuinely accelerate troubleshooting rather than simply generating impressive demonstrations.The longer game, however, extends beyond individual product features. Chronosphere's dual bet — on transparent AI that shows its reasoning and on a partner ecosystem rather than all-in-one integration — amounts to a fundamental thesis about how enterprise observability will evolve as systems grow more complex.If that thesis proves correct, the company that solves observability for the AI age won't be the one with the most automated black box. It will be the one that earns engineers' trust by explaining what it knows, admitting what it doesn't, and letting humans make the final call. In an industry drowning in data and promised silver bullets, Chronosphere is wagering that showing your work still matters — even when AI is doing the math.
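Chronosphere has not published the internals of its Temporal Knowledge Graph, but the idea Mao describes (dependency edges and change events that carry timestamps, so an engineer can ask what changed in a service's neighborhood just before an incident) can be made concrete with a short sketch. All class and method names below are hypothetical and exist only to illustrate the concept; this is not Chronosphere's implementation or API.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeEvent:
    ts: float      # when the change happened (epoch seconds)
    service: str   # which service it touched
    kind: str      # e.g. "deploy", "feature_flag", "config"
    detail: str


@dataclass
class TemporalKnowledgeGraph:
    # Dependency edges annotated with the time window they were observed.
    edges: list = field(default_factory=list)    # (caller, callee, start_ts, end_ts)
    changes: list = field(default_factory=list)  # ChangeEvent records

    def add_dependency(self, caller, callee, start_ts, end_ts=float("inf")):
        self.edges.append((caller, callee, start_ts, end_ts))

    def record_change(self, event: ChangeEvent):
        self.changes.append(event)

    def dependencies_at(self, service, ts):
        """Services that `service` depended on at time `ts`."""
        return [callee for caller, callee, s, e in self.edges
                if caller == service and s <= ts <= e]

    def changes_before_incident(self, service, incident_ts, window=900):
        """Change events on the service or its dependencies in the `window`
        seconds leading up to the incident: candidate root causes."""
        scope = {service, *self.dependencies_at(service, incident_ts)}
        return [c for c in self.changes
                if c.service in scope and incident_ts - window <= c.ts <= incident_ts]


# Usage mirroring Mao's checkout example: a feature-flag change on Payment
# surfaces as a candidate cause for a later SLO alert on Checkout.
g = TemporalKnowledgeGraph()
g.add_dependency("checkout", "payment", start_ts=0)
g.record_change(ChangeEvent(ts=1000, service="payment",
                            kind="feature_flag", detail="enable_new_retry_policy"))
print(g.changes_before_incident("checkout", incident_ts=1500))
```

The time dimension is what distinguishes this from a static service map: the same query asked for a different timestamp can return different dependencies and different candidate changes.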
In this post, we demonstrate how Clario has used Amazon Bedrock and other AWS services to build an AI-powered solution that automates and improves the analysis of COA interviews.
A six-month long pilot program with the Northern Ireland Education Authority’s C2k initiative found that integrating Gemini and other generative AI tools saved participating teachers an average of 10 hours per week.
As the engineering organization at cloud project-tracking software maker monday.com scaled past 500 developers, the team began to feel the strain of its own success. Product lines were multiplying, microservices proliferating, and code was flowing faster than human reviewers could keep up. The company needed a way to review thousands of pull requests each month without drowning developers in tedium — or letting quality slip.

That’s when Guy Regev, VP of R&D and head of the Growth and monday Dev teams, started experimenting with a new AI tool from Qodo, an Israeli startup focused on developer agents. What began as a lightweight test soon became a critical part of monday.com’s software delivery infrastructure, as a new case study released by both Qodo and monday.com today reveals. “Qodo doesn’t feel like just another tool—it’s like adding a new developer to the team who actually learns how we work," Regev told VentureBeat in a recent video call interview, adding that it has "prevented over 800 issues per month from reaching production—some of them could have caused serious security vulnerabilities."

Unlike code generation tools like GitHub Copilot or Cursor, Qodo isn’t trying to write new code. Instead, it specializes in reviewing it — using what it calls context engineering to understand not just what changed in a pull request, but why, how it aligns with business logic, and whether it follows internal best practices. "You can call Claude Code or Cursor and in five minutes get 1,000 lines of code," said Itamar Friedman, co-founder and CEO of Qodo, in the same video call interview as Regev. "You have 40 minutes, and you can't review that. So you need Qodo to actually review it.”

For monday.com, this capability wasn’t just helpful — it was transformative.

Code Review, at Scale

At any given time, monday.com’s developers are shipping updates across hundreds of repositories and services. The engineering org works in tightly coordinated teams, each aligned with specific parts of the product: marketing, CRM, dev tools, internal platforms, and more.

That’s where Qodo came in. The company’s platform uses AI not just to check for obvious bugs or style violations, but to evaluate whether a pull request follows team-specific conventions, architectural guidelines, and historical patterns. It does this by learning from a customer’s own codebase — training on previous PRs, comments, merges, and even Slack threads to understand how the team works.

"The comments Qodo gives aren’t generic—they reflect our values, our libraries, even our standards for things like feature flags and privacy," Regev said. "It’s context-aware in a way traditional tools aren’t."

What “Context Engineering” Actually Means

Qodo calls its secret sauce context engineering — a system-level approach to managing everything the model sees when making a decision. This includes the PR code diff, of course, but also prior discussions, documentation, relevant files from the repo, even test results and configuration data.

The idea is that language models don’t really “think” — they predict the next token based on the inputs they’re given. So the quality of their output depends almost entirely on the quality and structure of their inputs.

As Dana Fine, Qodo’s community manager, put it in a blog post: “You’re not just writing prompts; you’re designing structured input under a fixed token limit. Every token is a design decision.”

This isn’t just theory.
In monday.com’s case, it meant Qodo could catch not only the obvious bugs, but the subtle ones that typically slip past human reviewers — hardcoded variables, missing fallbacks, or violations of cross-team architecture conventions.

One example stood out. In a recent PR, Qodo flagged a line that inadvertently exposed a staging environment variable — something no human reviewer caught. Had it been merged, it might have caused problems in production. "The hours we would spend on fixing this security leak and the legal issue that it would bring would be much more than the hours that we reduce from a pull-request," said Regev.

Integration into the Pipeline

Today, Qodo is deeply integrated into monday.com’s development workflow, analyzing pull requests and surfacing context-aware recommendations based on prior team code reviews. “It doesn’t feel like just another tool... It feels like another teammate that joined the system — one who learns how we work," Regev noted. Developers receive suggestions during the review process and remain in control of final decisions — a human-in-the-loop model that was critical for adoption.

Because Qodo integrated directly into GitHub via pull request actions and comments, monday.com’s infrastructure team didn’t face a steep learning curve.

“It’s just a GitHub action,” said Regev. “It creates a PR with the tests. It’s not like a separate tool we had to learn.”

“The purpose is to actually help the developer learn the code, take ownership, give feedback to each other, and learn from that and establish the standards," added Friedman.

The Results: Time Saved, Bugs Prevented

Since rolling out Qodo more broadly, monday.com has seen measurable improvements across multiple teams. Internal analysis shows that developers save roughly an hour per pull request on average. Multiply that across thousands of PRs per month, and the savings quickly reach thousands of developer hours annually.

The issues Qodo surfaces aren’t just cosmetic — many relate to business logic, security, or runtime stability. And because Qodo’s suggestions reflect monday.com’s actual conventions, developers are more likely to act on them.

The system’s accuracy is rooted in its data-first design. Qodo trains on each company’s private codebase and historical data, adapting to different team styles and practices. It doesn’t rely on one-size-fits-all rules or external datasets. Everything is tailored.

From Internal Tool to Product Vision

Regev’s team was so impressed with Qodo’s impact that they’ve started planning deeper integrations between Qodo and monday Dev, the developer-focused product line monday.com is building.

The vision is to create a workflow where business context — tasks, tickets, customer feedback — flows directly into the code review layer. That way, reviewers can assess not just whether the code “works,” but whether it solves the right problem.

“Before, we had linters, danger rules, static analysis... rule-based... you need to configure all the rules," Regev said. "But it doesn’t know what you don’t know... Qodo... feels like it’s learning from our engineers.”

This aligns closely with Qodo’s own roadmap. The company doesn’t just review code.
It’s building a full platform of developer agents — including Qodo Gen for context-aware code generation, Qodo Merge for automated PR analysis, and Qodo Cover, a regression-testing agent that uses runtime validation to ensure test coverage.

All of this is powered by Qodo’s own infrastructure, including its new open-source embedding model, Qodo-Embed-1-1.5B, which outperformed offerings from OpenAI and Salesforce on code retrieval benchmarks.

What’s Next?

Qodo is now offering its platform under a freemium model — free for individuals, discounted for startups through Google Cloud’s Perks program, and enterprise-grade for companies that need SSO, air-gapped deployment, or advanced controls.

The company is already working with teams at NVIDIA, Intuit, and other Fortune 500 companies. And thanks to a recent partnership with Google Cloud, Qodo’s models are available directly inside Vertex AI’s Model Garden, making it easier to integrate into enterprise pipelines.

"Context engines will be the big story of 2026," Friedman said. "Every enterprise will need to build their own second brain if they want AI that actually understands and helps them."

As AI systems become more embedded in software development, tools like Qodo are showing how the right context — delivered at the right moment — can transform how teams build, ship, and scale code across the enterprise.
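Qodo has not published its context-assembly code, so the sketch below only illustrates the general "structured input under a fixed token limit" idea that Fine describes: score candidate context pieces, always keep the diff, and greedily pack the rest under a budget. The class and function names are hypothetical, the token counter is a crude word-count proxy, and none of this is Qodo's API.

```python
from dataclasses import dataclass


@dataclass
class ContextPiece:
    label: str        # e.g. "diff", "team_guideline", "prior_review_comment"
    text: str
    relevance: float  # retrieval score (higher = more relevant)


def rough_token_count(text: str) -> int:
    # Crude proxy for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def build_review_context(pieces: list, budget: int) -> str:
    """Greedily pack the highest-relevance pieces under a token budget.
    The PR diff is always included first, since the review is about it."""
    ordered = sorted(pieces, key=lambda p: (p.label != "diff", -p.relevance))
    chosen, used = [], 0
    for p in ordered:
        cost = rough_token_count(p.text)
        if used + cost > budget and p.label != "diff":
            continue  # skip pieces that would blow the budget
        chosen.append(f"### {p.label}\n{p.text}")
        used += cost
    return "\n\n".join(chosen)


pieces = [
    ContextPiece("diff", "- api_key = 'staging-123'\n+ api_key = os.environ['API_KEY']", 1.0),
    ContextPiece("team_guideline", "Never hardcode credentials; read them from env/config.", 0.9),
    ContextPiece("prior_review_comment", "We gate new endpoints behind a feature flag.", 0.4),
]
print(build_review_context(pieces, budget=60))
```

A production system would use a real tokenizer and relevance scores learned from the team's own history, but the budget-driven selection step is the part the "every token is a design decision" quote is pointing at.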
Baseten, the AI infrastructure company recently valued at $2.15 billion, is making its most significant product pivot yet: a full-scale push into model training that could reshape how enterprises wean themselves off dependence on OpenAI and other closed-source AI providers.The San Francisco-based company announced Thursday the general availability of Baseten Training, an infrastructure platform designed to help companies fine-tune open-source AI models without the operational headaches of managing GPU clusters, multi-node orchestration, or cloud capacity planning. The move is a calculated expansion beyond Baseten's core inference business, driven by what CTO Amir Haghighat describes as relentless customer demand and a strategic imperative to capture the full lifecycle of AI deployment."We had a captive audience of customers who kept coming to us saying, 'Hey, I hate this problem,'" Haghighat said in an interview. "One of them told me, 'Look, I bought a bunch of H100s from a cloud provider. I have to SSH in on Friday, run my fine-tuning job, then check on Monday to see if it worked. Sometimes I realize it just hasn't been working all along.'"The launch comes at a critical inflection point in enterprise AI adoption. As open-source models from Meta, Alibaba, and others increasingly rival proprietary systems in performance, companies face mounting pressure to reduce their reliance on expensive API calls to services like OpenAI's GPT-5 or Anthropic's Claude. But the path from off-the-shelf open-source model to production-ready custom AI remains treacherous, requiring specialized expertise in machine learning operations, infrastructure management, and performance optimization.Baseten's answer: provide the infrastructure rails while letting companies retain full control over their training code, data, and model weights. It's a deliberately low-level approach born from hard-won lessons.How a failed product taught Baseten what AI training infrastructure really needsThis isn't Baseten's first foray into training. The company's previous attempt, a product called Blueprints launched roughly two and a half years ago, failed spectacularly — a failure Haghighat now embraces as instructive."We had created the abstraction layer a little too high," he explained. "We were trying to create a magical experience, where as a user, you come in and programmatically choose a base model, choose your data and some hyperparameters, and magically out comes a model."The problem? Users didn't have the intuition to make the right choices about base models, data quality, or hyperparameters. When their models underperformed, they blamed the product. Baseten found itself in the consulting business rather than the infrastructure business, helping customers debug everything from dataset deduplication to model selection."We became consultants," Haghighat said. "And that's not what we had set out to do."Baseten killed Blueprints and refocused entirely on inference, vowing to "earn the right" to expand again. That moment arrived earlier this year, driven by two market realities: the vast majority of Baseten's inference revenue comes from custom models that customers train elsewhere, and competing training platforms were using restrictive terms of service to lock customers into their inference products."Multiple companies who were building fine-tuning products had in their terms of service that you as a customer cannot take the weights of the fine-tuned model with you somewhere else," Haghighat said. 
"I understand why from their perspective — I still don't think there is a big company to be made purely on just training or fine-tuning. The sticky part is in inference, the valuable part where value is unlocked is in inference, and ultimately the revenue is in inference."Baseten took the opposite approach: customers own their weights and can download them at will. The bet is that superior inference performance will keep them on the platform anyway.Multi-cloud GPU orchestration and sub-minute scheduling set Baseten apart from hyperscalersThe new Baseten Training product operates at what Haghighat calls "the infrastructure layer" — lower-level than the failed Blueprints experiment, but with opinionated tooling around reliability, observability, and integration with Baseten's inference stack.Key technical capabilities include multi-node training support across clusters of NVIDIA H100 or B200 GPUs, automated checkpointing to protect against node failures, sub-minute job scheduling, and integration with Baseten's proprietary Multi-Cloud Management (MCM) system. That last piece is critical: MCM allows Baseten to dynamically provision GPU capacity across multiple cloud providers and regions, passing cost savings to customers while avoiding the capacity constraints and multi-year contracts typical of hyperscaler deals."With hyperscalers, you don't get to say, 'Hey, give me three or four B200 nodes while my job is running, and then take it back from me and don't charge me for it,'" Haghighat said. "They say, 'No, you need to sign a three-year contract.' We don't do that."Baseten's approach mirrors broader trends in cloud infrastructure, where abstraction layers increasingly allow workloads to move fluidly across providers. When AWS experienced a major outage several weeks ago, Baseten's inference services remained operational by automatically routing traffic to other cloud providers — a capability now extended to training workloads.The technical differentiation extends to Baseten's observability tooling, which provides per-GPU metrics for multi-node jobs, granular checkpoint tracking, and a refreshed UI that surfaces infrastructure-level events. The company also introduced an "ML Cookbook" of open-source training recipes for popular models like Gemma, GPT OSS, and Qwen, designed to help users reach "training success" faster.Early adopters report 84% cost savings and 50% latency improvements with custom modelsTwo early customers illustrate the market Baseten is targeting: AI-native companies building specialized vertical solutions that require custom models.Oxen AI, a platform focused on dataset management and model fine-tuning, exemplifies the partnership model Baseten envisions. CEO Greg Schoeninger articulated a common strategic calculus, telling VentureBeat: "Whenever I've seen a platform try to do both hardware and software, they usually fail at one of them. That's why partnering with Baseten to handle infrastructure was the obvious choice."Oxen built its customer experience entirely on top of Baseten's infrastructure, using the Baseten CLI to programmatically orchestrate training jobs. The system automatically provisions and deprovisions GPUs, fully concealing Baseten's interface behind Oxen's own. 
For one Oxen customer, AlliumAI — a startup bringing structure to messy retail data — the integration delivered 84% cost savings compared to previous approaches, reducing total inference costs from $46,800 to $7,530."Training custom LoRAs has always been one of the most effective ways to leverage open-source models, but it often came with infrastructure headaches," said Daniel Demillard, CEO of AlliumAI. "With Oxen and Baseten, that complexity disappears. We can train and deploy models at massive scale without ever worrying about CUDA, which GPU to choose, or shutting down servers after training."Parsed, another early customer, tackles a different pain point: helping enterprises reduce dependence on OpenAI by creating specialized models that outperform generalist LLMs on domain-specific tasks. The company works in mission-critical sectors like healthcare, finance, and legal services, where model performance and reliability aren't negotiable."Prior to switching to Baseten, we were seeing repetitive and degraded performance on our fine-tuned models due to bugs with our previous training provider," said Charles O'Neill, Parsed's co-founder and chief science officer. "On top of that, we were struggling to easily download and checkpoint weights after training runs."With Baseten, Parsed achieved 50% lower end-to-end latency for transcription use cases, spun up HIPAA-compliant EU deployments for testing within 48 hours, and kicked off more than 500 training jobs. The company also leveraged Baseten's modified vLLM inference framework and speculative decoding — a technique that generates draft tokens to accelerate language model output — to cut latency in half for custom models."Fast models matter," O'Neill said. "But fast models that get better over time matter more. A model that's 2x faster but static loses to one that's slightly slower but improving 10% monthly. Baseten gives us both — the performance edge today and the infrastructure for continuous improvement."Why training and inference are more interconnected than the industry realizesThe Parsed example illuminates a deeper strategic rationale for Baseten's training expansion: the boundary between training and inference is blurrier than conventional wisdom suggests.Baseten's model performance team uses the training platform extensively to create "draft models" for speculative decoding, a cutting-edge technique that can dramatically accelerate inference. The company recently announced it achieved 650+ tokens per second on OpenAI's GPT OSS 120B model — a 60% improvement over its launch performance — using EAGLE-3 speculative decoding, which requires training specialized small models to work alongside larger target models."Ultimately, inference and training plug in more ways than one might think," Haghighat said. "When you do speculative decoding in inference, you need to train the draft model. Our model performance team is a big customer of the training product to train these EAGLE heads on a continuous basis."This technical interdependence reinforces Baseten's thesis that owning both training and inference creates defensible value. 
The company can optimize the entire lifecycle: a model trained on Baseten can be deployed with a single click to inference endpoints pre-optimized for that architecture, with deployment-from-checkpoint support for chat completion and audio transcription workloads.The approach contrasts sharply with vertically integrated competitors like Replicate or Modal, which also offer training and inference but with different architectural tradeoffs. Baseten's bet is on lower-level infrastructure flexibility and performance optimization, particularly for companies running custom models at scale.As open-source AI models improve, enterprises see fine-tuning as the path away from OpenAI dependencyUnderpinning Baseten's entire strategy is a conviction about the trajectory of open-source AI models — namely, that they're getting good enough, fast enough, to unlock massive enterprise adoption through fine-tuning."Both closed and open-source models are getting better and better in terms of quality," Haghighat said. "We don't even need open source to surpass closed models, because as both of them are getting better, they unlock all these invisible lines of usefulness for different use cases."He pointed to the proliferation of reinforcement learning and supervised fine-tuning techniques that allow companies to take an open-source model and make it "as good as the closed model, not at everything, but at this narrow band of capability that they want."That trend is already visible in Baseten's Model APIs business, launched alongside Training earlier this year to provide production-grade access to open-source models. The company was the first provider to offer access to DeepSeek V3 and R1, and has since added models like Llama 4 and Qwen 3, optimized for performance and reliability. Model APIs serves as a top-of-funnel product: companies start with off-the-shelf open-source models, realize they need customization, move to Training for fine-tuning, and ultimately deploy on Baseten's Dedicated Deployments infrastructure.Yet Haghighat acknowledged the market remains "fuzzy" around which training techniques will dominate. Baseten is hedging by staying close to the bleeding edge through its Forward Deployed Engineering team, which works hands-on with select customers on reinforcement learning, supervised fine-tuning, and other advanced techniques."As we do that, we will see patterns emerge about what a productized training product can look like that really addresses the user's needs without them having to learn too much about how RL works," he said. "Are we there as an industry? I would say not quite. I see some attempts at that, but they all seem like almost falling to the same trap that Blueprints fell into—a bit of a walled garden that ties the hands of AI folks behind their back."The roadmap ahead includes potential abstractions for common training patterns, expansion into image, audio, and video fine-tuning, and deeper integration of advanced techniques like prefill-decode disaggregation, which separates the initial processing of prompts from token generation to improve efficiency.Baseten faces crowded field but bets developer experience and performance will win enterprise customersBaseten enters an increasingly crowded market for AI infrastructure. Hyperscalers like AWS, Google Cloud, and Microsoft Azure offer GPU compute for training, while specialized providers like Lambda Labs, CoreWeave, and Together AI compete on price, performance, or ease of use. 
Then there are vertically integrated platforms like Hugging Face, Replicate, and Modal that bundle training, inference, and model hosting.Baseten's differentiation rests on three pillars: its MCM system for multi-cloud capacity management, deep performance optimization expertise built from its inference business, and a developer experience tailored for production deployments rather than experimentation.The company's recent $150 million Series D and $2.15 billion valuation provide runway to invest in both products simultaneously. Major customers include Descript, which uses Baseten for transcription workloads; Decagon, which runs customer service AI; and Sourcegraph, which powers coding assistants. All three operate in domains where model customization and performance are competitive advantages.Timing may be Baseten's biggest asset. The confluence of improving open-source models, enterprise discomfort with dependence on proprietary AI providers, and growing sophistication around fine-tuning techniques creates what Haghighat sees as a sustainable market shift."There is a lot of use cases for which closed models have gotten there and open ones have not," he said. "Where I'm seeing in the market is people using different training techniques — more recently, a lot of reinforcement learning and SFT — to be able to get this open model to be as good as the closed model, not at everything, but at this narrow band of capability that they want. That's very palpable in the market."For enterprises navigating the complex transition from closed to open AI models, Baseten's positioning offers a clear value proposition: infrastructure that handles the messy middle of fine-tuning while optimizing for the ultimate goal of performant, reliable, cost-effective inference at scale. The company's insistence that customers own their model weights — a stark contrast to competitors using training as a lock-in mechanism — reflects confidence that technical excellence, not contractual restrictions, will drive retention.Whether Baseten can execute on this vision depends on navigating tensions inherent in its strategy: staying at the infrastructure layer without becoming consultants, providing power and flexibility without overwhelming users with complexity, and building abstractions at exactly the right level as the market matures. The company's willingness to kill Blueprints when it failed suggests a pragmatism that could prove decisive in a market where many infrastructure providers over-promise and under-deliver."Through and through, we're an inference company," Haghighat emphasized. "The reason that we did training is at the service of inference."That clarity of purpose — treating training as a means to an end rather than an end in itself—may be Baseten's most important strategic asset. As AI deployment matures from experimentation to production, the companies that solve the full stack stand to capture outsized value. But only if they avoid the trap of technology in search of a problem.At least Baseten's customers no longer have to SSH into boxes on Friday and pray their training jobs complete by Monday. In the infrastructure business, sometimes the best innovation is simply making the painful parts disappear.
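Speculative decoding comes up twice in the Baseten story: the draft models its performance team trains, and the EAGLE-3 results on GPT OSS 120B. Here is a deliberately simplified, greedy-match sketch of the draft-and-verify loop. Production implementations such as vLLM's or EAGLE-style decoders verify all drafted positions in one batched forward pass and accept or reject tokens probabilistically; `draft_next` and `target_next` below are hypothetical stand-ins for those models, not any vendor's API.

```python
from typing import Callable, List


def speculative_generate(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],   # cheap draft model: greedy next token
    target_next: Callable[[List[int]], int],  # large target model: greedy next token
    k: int = 4,
    max_new_tokens: int = 32,
    eos_id: int = 0,                          # arbitrary end-of-sequence id for the sketch
) -> List[int]:
    """Greedy speculative decoding: the output matches what the target model
    would produce on its own, but agreed-upon draft tokens are committed in bulk."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens and tokens[-1] != eos_id:
        # 1) Draft k candidate tokens cheaply with the small model.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify: keep drafted tokens only while the target model agrees.
        #    (Real systems score all k positions in a single forward pass.)
        for i, t in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if expected != t:
                tokens.append(expected)  # the target's own token replaces the miss
                break
            tokens.append(t)
    return tokens
```

The speed-up comes from committing several tokens per expensive target-model call whenever the cheap draft model guesses the same continuation the target would have produced, which is why Baseten continuously retrains its draft ("EAGLE head") models to track the target.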