Researchers at Meta, the University of Chicago, and UC Berkeley have developed a new framework that addresses the high costs, infrastructure complexity, and unreliable feedback associated with using reinforcement learning (RL) to train large language model (LLM) agents. The framework, DreamGym, simulates an RL environment to train agents for complex applications. As it progresses through the training process, the framework dynamically adjusts task difficulty, ensuring the agent gradually learns to solve more challenging problems as it improves.

Experiments by the research team show that DreamGym substantially improves RL training in both fully synthetic settings and scenarios where the model must apply its simulated learning to the real world. In settings where RL is possible but expensive, it matches the performance of popular algorithms using only synthetic interactions, significantly cutting the costs of data gathering and environment interaction. This approach could be vital for enterprises, allowing them to train agents for bespoke applications while avoiding the complexities of setting up and running live RL environments.

The challenge of training LLM agents

Reinforcement learning is a key technique for training LLMs to handle complex tasks in agentic environments, such as web navigation, tool use, and robotics. It allows models to learn from direct interaction and experience, moving beyond the static datasets used in pre-training.

However, RL for agent training remains difficult. Real-world applications often involve long action sequences with sparse signals, meaning the agent only receives a positive signal after a long and correct sequence of actions. Gathering enough diverse and validated data is also expensive, frequently requiring human experts to verify tasks and annotate outcomes. And the infrastructure required to create the live environments for large-scale RL training can be prohibitively complex and costly.
Not to mention that interacting with live systems carries risks, as wrong actions (like deleting a file) can cause irreparable damage. “These limitations make building general-purpose and scalable systems for training agents with RL an open and pressing challenge,” the researchers write.

DreamGym directly challenges that model by delivering comparable performance entirely in simulation, removing the infrastructure burden that has kept most enterprises from adopting RL — and giving teams a practical path to train agents without touching costly or risky live environments.

How DreamGym works

The researchers describe DreamGym as a “unified and scalable RL framework that synthesizes diverse experience data in an online manner to enable efficient and effective training of LLM agents.” It is built around three core components that work together to create a controlled and effective training loop.

The first component is a “reasoning-based experience model” that translates the dynamics of a target environment into a textual space. This model acts as the simulator of the application environment. Instead of interacting with a costly real environment, the agent interacts with this model, which generates consistent state transitions and feedback based on the agent’s actions. The researchers argue that agent training doesn't need perfectly realistic environments, but rather data that is "sufficiently diverse, informative, and causally grounded." For example, in a web shopping task, the model synthesizes clean listings of on-page elements rather than processing raw HTML code. This abstract approach makes training the experience model highly efficient, requiring only a small amount of public data.

The second component is an “experience replay buffer,” which acts as a dynamic memory. At the beginning of the training process, the buffer is seeded with offline data to provide essential context and is continuously updated with new synthetic trajectories generated during training.
This buffer helps guide the experience model's predictions, ensuring the synthetic experiences remain diverse and factually grounded.

The third component, a “curriculum task generator,” works in tandem with the experience model to adaptively create new tasks that are progressively more challenging. It identifies tasks where the agent's performance is mixed (signaling they are difficult but solvable) and generates variations to push the agent's capabilities.

Together, these components create a closed-loop system for scalable agent training. “By unifying interaction, memory, and adaptive online task generation, DreamGym addresses the persistent challenges that have limited RL for LLM agents training: prohibitive cost, scarcity of diverse tasks, unstable reward signals, and heavy infrastructure demands,” according to the researchers.

DreamGym in action

The researchers evaluated DreamGym across several agent benchmarks, including WebShop (e-commerce), ALFWorld (embodied control), and WebArena (realistic web interaction). They used Llama 3 and Qwen 2.5 models as agent backbones and compared DreamGym against several traditional training strategies. These included offline methods like supervised fine-tuning (SFT) and direct preference optimization (DPO), as well as online RL algorithms like Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), which improve agents through live environment interaction.

DreamGym showed its most significant advantage in environments like WebArena, where setting up a large-scale RL infrastructure is difficult. Agents trained entirely inside DreamGym achieved success rates over 30% higher than baseline methods, which struggled with the sparse rewards and limited exploration in the real environment.
The researchers said this shows DreamGym is a mechanism that makes RL training “feasible in domains that were previously intractable due to inherent task and engineering constraints.”

In environments where RL is supported but costly, agents trained with DreamGym performed on par with those trained using GRPO and PPO, but without any costly interactions with the external environment. The team also introduced a sim-to-real approach, DreamGym-S2R, in which an agent is first trained in the synthetic environment and then fine-tuned on a small amount of real-world data. This strategy yielded over a 40% performance improvement compared to training from scratch in the real environment, while using less than 10% of the external data — a scalable "warm-start" for training general-purpose agents.

Finally, the framework demonstrated strong generalization. An agent trained on tasks in one domain, such as WebShop, could successfully transfer its learned skills to another, like WebArena. The researchers suggest this is because DreamGym agents learn in an "abstract meta-representation space, enabling the agent to learn domain-agnostic behavioral priors rather than memorizing task-specific patterns."

While still in its early stages, DreamGym shows that simulated environments can deliver substantial gains in agent training. In practice, an enterprise could gather a small set of trajectories and task descriptions for the workflows it wants to automate, then use that seed data to bootstrap the DreamGym framework for scalable, sample-efficient agent training.
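To make the closed loop described above concrete, here is a minimal, purely illustrative sketch of how the three components interact. Every name here (ToyExperienceModel, ReplayBuffer, next_difficulty) is invented for illustration; the real system uses an LLM-based experience model predicting textual state transitions, not this toy counting environment, and the actual curriculum logic in the paper is more sophisticated than a single success-rate band:

```python
import random

class ToyExperienceModel:
    """Stand-in for DreamGym's reasoning-based experience model.

    Instead of an LLM synthesizing textual state transitions, this toy
    version scores an action sequence on a counting task whose required
    length is the curriculum's current difficulty level."""
    def rollout(self, policy, difficulty):
        # The agent "succeeds" if it picks the right action `difficulty` times.
        return all(policy() == "correct" for _ in range(difficulty))

class ReplayBuffer:
    """Dynamic memory: seeded with offline data, then grown online."""
    def __init__(self, seed_trajectories):
        self.trajectories = list(seed_trajectories)
    def add(self, trajectory):
        self.trajectories.append(trajectory)

def next_difficulty(results, difficulty):
    """Curriculum task generator: raise difficulty only when performance
    is mixed (hard but solvable), per the paper's description."""
    rate = sum(results) / len(results)
    return difficulty + 1 if 0.3 <= rate <= 0.7 else difficulty

# Closed training loop over synthetic experience only.
random.seed(0)
env = ToyExperienceModel()
buffer = ReplayBuffer(seed_trajectories=["offline-demo-1", "offline-demo-2"])
policy = lambda: random.choice(["correct", "wrong"])  # untrained agent
difficulty = 1
for epoch in range(3):
    results = [env.rollout(policy, difficulty) for _ in range(20)]
    buffer.add({"difficulty": difficulty, "results": results})
    difficulty = next_difficulty(results, difficulty)

print(len(buffer.trajectories), difficulty)
```

In a real DreamGym-style setup, the rollout step would also update the policy (e.g., via GRPO or PPO on the synthetic trajectories); this sketch only shows how interaction, memory, and adaptive task generation feed one another.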
RoboTic-Tac-Toe is an interactive game where two physical robots move around a tic-tac-toe board, with both the gameplay and robots’ movements orchestrated by LLMs. Players can control the robots using natural language commands, directing them to place their markers on the game board. In this post, we explore the architecture and prompt engineering techniques used to reason about a tic-tac-toe game and decide the next best game strategy and movement plan for the current player.
This article is divided into two parts; they are:
• Picking a Dataset
• Training a Tokenizer
To keep things simple, we'll use English text only.
Update: A day after this article was published, xAI unveiled Grok 4.1 access through its API at $0.20 per million input tokens (or $0.05 for cached input) and $0.50 per million output tokens, making it among the cheaper frontier AI model options available. Read more here.

In what appeared to be a bid to soak up some of Google's limelight prior to the launch of its new Gemini 3 flagship AI model — now recorded as the most powerful LLM in the world by multiple independent evaluators — Elon Musk's rival AI startup xAI last night unveiled its newest large language model, Grok 4.1.

The model is now live for consumer use on Grok.com, social network X (formerly Twitter), and the company’s iOS and Android mobile apps, and it arrives with major architectural and usability enhancements, among them: faster reasoning, improved emotional intelligence, and significantly reduced hallucination rates. xAI also published a white paper covering its evaluations, with brief notes on the training process, here.

Across public benchmarks, Grok 4.1 has vaulted to the top of the leaderboard, outperforming rival models from Anthropic, OpenAI, and Google — at least, Google's pre-Gemini 3 model (Gemini 2.5 Pro). It builds upon the success of xAI's Grok 4 Fast, which VentureBeat covered favorably shortly after its release in September 2025.

However, enterprise developers looking to integrate the new and improved Grok 4.1 into production environments will find one major constraint: It's not yet available through xAI’s public API. Despite its high benchmarks, Grok 4.1 remains confined to xAI’s consumer-facing interfaces, with no announced timeline for API exposure. At present, only older models — including Grok 4 Fast (reasoning and non-reasoning variants), Grok 4 0709, and legacy models such as Grok 3, Grok 3 Mini, and Grok 2 Vision — are available for programmatic use via the xAI developer API.
These support up to 2 million tokens of context, with token pricing ranging from $0.20 to $3.00 per million depending on the configuration. For now, this limits Grok 4.1’s utility in enterprise workflows that rely on backend integration, fine-tuned agentic pipelines, or scalable internal tooling. While the consumer rollout positions Grok 4.1 as the most capable LLM in xAI’s portfolio, production deployments in enterprise environments remain on hold.

Model design and deployment strategy

Grok 4.1 arrives in two configurations: a fast-response, low-latency mode for immediate replies, and a “thinking” mode that engages in multi-step reasoning before producing output. Both versions are live for end users and are selectable via the model picker in xAI’s apps.

The two configurations differ not just in latency but also in how deeply the model processes prompts. Grok 4.1 Thinking leverages internal planning and deliberation mechanisms, while the standard version prioritizes speed. Despite the difference in architecture, both scored higher than any competing models in blind preference and benchmark testing.

Leading the field in human and expert evaluation

On the LMArena Text Arena leaderboard, Grok 4.1 Thinking briefly held the top position with a normalized Elo score of 1483 — then was dethroned a few hours later with Google's release of Gemini 3 and its 1501 Elo score. The non-thinking version of Grok 4.1 also fares well on the index, however, at 1465. These scores place Grok 4.1 above Google’s Gemini 2.5 Pro, Anthropic’s Claude 4.5 series, and OpenAI’s GPT-4.5 preview.

In creative writing, Grok 4.1 ranks second only to Polaris Alpha (an early GPT-5.1 variant), with the “thinking” model earning a score of 1721.9 on the Creative Writing v3 benchmark. This marks a roughly 600-point improvement over previous Grok iterations.
Similarly, on the Arena Expert leaderboard, which aggregates feedback from professional reviewers, Grok 4.1 Thinking again leads the field with a score of 1510. The gains are especially notable given that Grok 4.1 was released only two months after Grok 4 Fast, highlighting the accelerated development pace at xAI.

Core improvements over previous generations

Technically, Grok 4.1 represents a significant leap in real-world usability. Visual capabilities — previously limited in Grok 4 — have been upgraded to enable robust image and video understanding, including chart analysis and OCR-level text extraction. Multimodal reliability was a pain point in prior versions and has now been addressed.

Token-level latency has been reduced by approximately 28% while preserving reasoning depth. In long-context tasks, Grok 4.1 maintains coherent output up to 1 million tokens, improving on Grok 4’s tendency to degrade past the 300,000-token mark.

xAI has also improved the model's tool orchestration capabilities. Grok 4.1 can now plan and execute multiple external tools in parallel, reducing the number of interaction cycles required to complete multi-step queries.
According to internal test logs, some research tasks that previously required four steps can now be completed in one or two. Other alignment improvements include better truth calibration — reducing the tendency to hedge or soften politically sensitive outputs — and more natural, human-like prosody in voice mode, with support for different speaking styles and accents.

Safety and adversarial robustness

As part of its risk management framework, xAI evaluated Grok 4.1 for refusal behavior, hallucination resistance, sycophancy, and dual-use safety. The hallucination rate in non-reasoning mode has dropped from 12.09% in Grok 4 Fast to just 4.22% — a roughly 65% improvement. The model's error rate on FActScore, a factual QA benchmark, likewise fell to 2.97%, down from 9.89% in earlier versions.

In the domain of adversarial robustness, Grok 4.1 has been tested against prompt injection attacks, jailbreak prompts, and sensitive chemistry and biology queries. Safety filters showed low false negative rates, especially for restricted chemical knowledge (0.00%) and restricted biological queries (0.03%). The model’s ability to resist manipulation in persuasion benchmarks, such as MakeMeSay, also appears strong — it registered a 0% success rate as an attacker.

Limited enterprise access via API

Despite these gains, Grok 4.1 initially was not available to enterprise users through xAI’s API. According to the company’s public documentation at launch, the latest models available to developers were Grok 4 Fast (both reasoning and non-reasoning variants), each supporting up to 2 million tokens of context at pricing tiers ranging from $0.20 to $0.50 per million tokens. These are backed by a 4M tokens-per-minute throughput limit and a 480 requests per minute (RPM) rate cap.

By contrast, Grok 4.1 was at first accessible only through xAI’s consumer-facing properties — X, Grok.com, and the mobile apps.
This meant organizations could not deploy Grok 4.1 via fine-tuned internal workflows, multi-agent chains, or real-time product integrations.

As of November 19, 2025, however, xAI has made the models available through its API as grok-4-1-fast-reasoning and grok-4-1-fast-non-reasoning, both optimized for real-world tool use, including web search, code execution, and document retrieval. xAI also introduced the Agent Tools API, a framework that allows autonomous agents to operate over real-time X data, external toolchains, and remote functions, with integrated orchestration handled entirely on xAI’s infrastructure.

This update positions Grok 4.1 Fast as xAI’s flagship enterprise model, outperforming competitors like Claude Sonnet 4.5, GPT-5, and Gemini 3 Pro on agentic benchmarks such as τ²-bench and Berkeley Function Calling v4. Pricing is competitive, with input tokens billed at $0.20 per million (or $0.05 for cached input) and output tokens at $0.50 per million — matching Grok 4 Fast pricing tiers. Tool usage is metered separately at $5 per 1,000 successful invocations, though all tool access is temporarily free through December 3, 2025, in partnership with OpenRouter.

In long-context and multi-turn performance, Grok 4.1 Fast shows measurable improvements over both Grok 4 and Grok 4 Fast, suggesting significant reinforcement learning optimization for agentic and retrieval-augmented workflows. With this release, Grok 4.1 transitions from a consumer-facing product to a production-grade platform for enterprise and developer integration. It also resolves a key limitation in the original rollout by making its most performant variant accessible to backend applications, research pipelines, and autonomous agents through the API.
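For teams integrating the newly exposed models, a request would look roughly like the sketch below. This assumes xAI's API follows the OpenAI-compatible chat completions convention at api.x.ai; the endpoint URL and payload shape here are assumptions based on that convention rather than details confirmed by this article, so verify them against xAI's official API reference before use (only the model name grok-4-1-fast-reasoning comes from the announcement):

```python
import json
import os
import urllib.request

# Assumed OpenAI-compatible chat completions endpoint; confirm in xAI's docs.
XAI_URL = "https://api.x.ai/v1/chat/completions"

def build_request(prompt, model="grok-4-1-fast-reasoning"):
    """Build a chat-completion payload for the xAI API (assumed schema)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_request("Summarize today's AI news in one sentence.")

# Only send the request when an API key is configured in the environment.
api_key = os.environ.get("XAI_API_KEY")
if api_key:
    req = urllib.request.Request(
        XAI_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

Swapping the model string to grok-4-1-fast-non-reasoning would select the low-latency variant described above.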
See a price comparison chart below:

| Model | Input (/1M tokens) | Output (/1M tokens) | Total Cost | Source |
| --- | --- | --- | --- | --- |
| ERNIE 4.5 Turbo | $0.11 | $0.45 | $0.56 | Qianfan |
| Grok 4.1 Fast (cached) | $0.05 | $0.50 | $0.55 | xAI API |
| Grok 4.1 Fast (uncached) | $0.20 | $0.50 | $0.70 | xAI API |
| ERNIE 5.0 | $0.85 | $3.40 | $4.25 | Qianfan |
| Qwen3 (Coder ex.) | $0.85 | $3.40 | $4.25 | Qianfan |
| GPT-5.1 | $1.25 | $10.00 | $11.25 | OpenAI |
| Gemini 2.5 Pro (≤200K) | $1.25 | $10.00 | $11.25 | Google |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 | Google |
| Gemini 2.5 Pro (>200K) | $2.50 | $15.00 | $17.50 | Google |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 | Google |
| Grok 4 (0709) | $3.00 | $15.00 | $18.00 | xAI API |
| Claude Opus 4.1 | $15.00 | $75.00 | $90.00 | Anthropic |

Industry reception and next steps

The release has been met with strong public and industry feedback. Elon Musk, founder of xAI, posted a brief endorsement, calling it “a great model” and congratulating the team. AI benchmark platforms have praised the leap in usability and linguistic nuance.

For enterprise customers, however, the picture is more mixed. Grok 4.1’s performance represents a breakthrough for general-purpose and creative tasks. As competitive models from OpenAI, Google, and Anthropic continue to evolve, xAI has fielded a competitive and compelling option for developers and enterprise use cases.
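As a sanity check on totals like those in the comparison above, per-request cost is just token counts divided by a million, multiplied by the published rates. This small helper is hypothetical (not an official calculator from any of the vendors) and uses the "Total Cost" convention of the chart, i.e. the cost of 1M input plus 1M output tokens:

```python
def request_cost(input_tokens, output_tokens, input_rate, output_rate):
    """Dollar cost of one request, given per-million-token rates."""
    return (input_tokens / 1_000_000) * input_rate \
         + (output_tokens / 1_000_000) * output_rate

# Grok 4.1 Fast (uncached): $0.20 input / $0.50 output per million tokens.
grok = request_cost(1_000_000, 1_000_000, 0.20, 0.50)   # ≈ $0.70

# Claude Opus 4.1: $15.00 input / $75.00 output per million tokens.
opus = request_cost(1_000_000, 1_000_000, 15.00, 75.00)  # ≈ $90.00

print(f"Grok 4.1 Fast: ${grok:.2f}  Claude Opus 4.1: ${opus:.2f}")
```

The same helper applies to any row of the chart; real workloads are usually input-heavy, so effective costs skew toward the input rate rather than the 1M/1M split used for the "Total Cost" column.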
This blog post introduces two major enhancements to Amazon SageMaker HyperPod that strengthen security and storage capabilities for large-scale machine learning infrastructure. The new features include customer managed key (CMK) support for encrypting EBS volumes with organization-controlled encryption keys, and Amazon EBS CSI driver integration that enables dynamic storage management for Kubernetes volumes in AI workloads.
This infographic illustrates what sets these agents apart, how they operate, and why they represent a foundational leap for AI.
In this post, I will illustrate how applying platform engineering principles to generative AI unlocks faster time-to-value, cost control, and scalable innovation.
Google DeepMind opens a new Singapore research lab, accelerating AI progress in the Asia-Pacific region.
Which is actually how some people do it
The post How to Build an Over-Engineered Retrieval System appeared first on Towards Data Science.
Want to level up your data skills? Check out these 5 free books that explain data science clearly and practically.
Google today unveiled Gemini 3, a major upgrade to its flagship multimodal model. The firm says the new model is better at reasoning, has more fluid multimodal capabilities (the ability to work across voice, text or images), and will work like an agent. The previous model, Gemini 2.5, supports multimodal input. Users can feed it…
After more than a month of rumors and feverish speculation — including Polymarket wagering on the release date — Google today unveiled Gemini 3, its newest proprietary frontier model family and the company’s most comprehensive AI release since the Gemini line debuted in 2023. The models are proprietary (closed-source), available exclusively through Google products, developer platforms, and paid APIs, including Google AI Studio, Vertex AI, the Gemini command line interface (CLI) for developers, and third-party integrations across the broader integrated developer environment (IDE) ecosystem.

Gemini 3 arrives as a full portfolio:
• Gemini 3 Pro: the flagship frontier model
• Gemini 3 Deep Think: an enhanced reasoning mode
• Generative interface models powering Visual Layout and Dynamic View
• Gemini Agent for multi-step task execution
• The Gemini 3 engine embedded in Google Antigravity, the company’s new agent-first development environment

"This is the best model in the world, by a crazy wide margin!" wrote Google DeepMind Research Scientist Yi Tay on X. Indeed, independent AI benchmarking and analysis organization Artificial Analysis has already crowned Gemini 3 Pro the "new leader in AI" globally, achieving the top score of 73 on the organization's index and leaping Google from its former placement of 9th overall with the preceding Gemini 2.5 Pro model, which scored 60, behind models from OpenAI, Moonshot AI, xAI, Anthropic, and MiniMax. As Artificial Analysis wrote on X: "For the first time, Google has the most intelligent model."

Another independent leaderboard site, LMArena, reported that Gemini 3 Pro ranked first in the world across all of its major evaluation tracks, including text reasoning, vision, coding, and web development. In a public post, the @arena account on X said the model surpassed even the newly released (hours-old) Grok 4.1, as well as Claude 4.5 and GPT-5-class systems, in categories such as math, long-form queries, creative writing, and several occupational benchmarks.
The post also highlighted the scale of gains over Gemini 2.5 Pro, including a 50-point jump in text Elo, a 70-point increase in vision, and a 280-point rise in web-development tasks. While these results reflect live community voting and remain preliminary, they signal unusually broad performance improvements across domains where previous Gemini models trailed competitors.

What It Means For Google In the Hotly Competitive AI Race

The launch represents one of Google’s largest, most tightly coordinated model releases. Gemini 3 is shipping simultaneously across Google Search, the Gemini app, Google AI Studio, Vertex AI, and a range of developer tools. Executives emphasized that this integration reflects Google’s control of tensor processing unit (TPU — its homegrown rival to Nvidia's GPUs) hardware, data center infrastructure, and consumer products. According to the company, the Gemini app now has more than 650 million monthly active users, more than 13 million developers build with Google’s AI tools, and more than 2 billion monthly users engage with Gemini-powered AI Overviews in Search.

At the center of the release is a shift toward agentic AI — systems that plan, act, navigate interfaces, and coordinate tools, rather than just generating text. Gemini 3 is designed to translate high-level instructions into multi-step workflows across devices and applications, with the ability to generate functional interfaces, run tools, and manage complex tasks.

Major Performance Gains Over Gemini 2.5 Pro

Gemini 3 Pro introduces large gains over Gemini 2.5 Pro across reasoning, mathematics, multimodality, tool use, coding, and long-horizon planning.
Google’s benchmark disclosures show substantial improvements in many categories. Gemini 3 Pro debuted at the top of the LMArena text-reasoning leaderboard, posting a preliminary Elo score of 1501 based on pre-release community voting — the first LLM ever to cross the 1500 threshold.

That places it above xAI’s newly announced Grok 4.1 Thinking model (1484) and Grok 4.1 (1465), both of which were unveiled just hours earlier, as well as above Gemini 2.5 Pro (1451) and recent Claude Sonnet and Opus releases. While LMArena covers only text-reasoning performance and the results are labeled preliminary, this ranking positions Gemini 3 Pro as the strongest publicly evaluated model on that benchmark as of its launch day — though not necessarily the top performer in the world across all modalities, tasks, or evaluation suites.

In mathematical and scientific reasoning, Gemini 3 Pro scored 95% on AIME 2025 without tools and 100% with code execution, compared to 88% for its predecessor. On GPQA Diamond, it reached 91.9%, up from 86.4%. The model also recorded a major jump on MathArena Apex, reaching 23.4% versus 0.5% for Gemini 2.5 Pro, and delivered 31.1% on ARC-AGI-2 compared to 4.9% previously.

ARC-AGI-2 is the second-generation version of the Abstraction and Reasoning Corpus (ARC), a benchmark introduced by AI researcher François Chollet to measure generalization, not memorization. Unlike typical multiple-choice or dataset-based evaluations, ARC-AGI-2 presents models with tiny grid-based puzzles that require discovering and applying abstract rules. Each task provides a few input–output examples, and the model must infer the underlying transformation and apply it to a new test case.
The problems span visual pattern recognition, symbolic manipulation, object transformations, spatial reasoning, and rule induction — all designed to test reasoning capabilities that do not depend on training-set familiarity. The new ARC-AGI-2 variant is deliberately constructed to be out-of-distribution and resistant to memorization, making it one of the most difficult benchmarks for large language models. Its tasks are engineered to stress-test whether a model can infer a previously unseen rule purely from examples, a proxy for early forms of generalized problem-solving.

Astonishingly, the "Deep Think" version of Gemini 3, designed to take longer to solve problems and use more reasoning, scored 45.1%, representing a substantial jump over prior frontier models, which typically score in the mid-teens to low twenties. It also far exceeds Gemini 3 Pro’s 31.1% and is an order-of-magnitude improvement over older Gemini releases. These results suggest that Deep Think’s architecture is particularly effective at multi-step hypothesis generation, checking, and revision — the specific capabilities ARC-AGI-2 is designed to measure.

Multimodal performance increased across the board. Gemini 3 Pro scored 81% on MMMU-Pro, up from 68%, and 87.6% on Video-MMMU, compared to 83.6%. Its result on ScreenSpot-Pro, a key benchmark for agentic computer use, rose from 11.4% to 72.7%. Document understanding and chart reasoning also improved.

Coding and tool-use performance showed equally significant gains. The model’s LiveCodeBench Pro score reached 2,439, up from 1,775. On Terminal-Bench 2.0 it achieved 54.2% versus 32.6% previously. SWE-Bench Verified, which measures agentic coding through structured fixes, increased from 59.6% to 76.2%. The model also posted 85.4% on τ²-bench, up from 54.9%.

Long-context and planning benchmarks indicate more stable multi-step behavior. Gemini 3 achieved 77% on MRCR v2 at 128k context (versus 58%) and 26.3% at 1 million tokens (versus 16.4%).
Its Vending-Bench 2 score reached $5,478.16, compared to $573.64 for Gemini 2.5 Pro, reflecting stronger consistency during long-running decision processes. Language understanding scores improved on SimpleQA Verified (72.1% versus 54.5%), MMLU (91.8% versus 89.5%), and the FACTS Benchmark Suite (70.5% versus 63.4%), supporting more reliable fact-based work in regulated sectors.

Generative interfaces move Gemini beyond text

Gemini 3 introduces a new class of generative interface capabilities in the consumer-facing Google Search AI Mode and for developers through Google AI Studio. Visual Layout produces structured, magazine-style pages with images, diagrams, and modules tailored to the query. Dynamic View generates functional interface components such as calculators, simulations, galleries, and interactive graphs. These experiences will be available starting today globally in Google Search’s AI Mode, enabling models to surface information in visual, interactive formats beyond static text.

Developers can reproduce similar UI elements through Google AI Studio and the Gemini API, but the full consumer-facing interface types are not available as direct API outputs; instead, developers receive the underlying code or schema to render these components themselves. The branded Visual Layout and Dynamic View formats are therefore specific to Search and not exposed as standalone API features.

Google says the model analyzes user intent to construct the layout best suited to a task. In practice, this includes everything from automatically building diagrams for scientific concepts to generating custom UI components that respond to user input.

Google held a press call the day before the Gemini 3 announcement to brief reporters on the model family, its intended use cases, and how it differed from earlier Gemini releases.
The call was led by multiple Google and DeepMind executives who walked through the model’s capabilities and framed Gemini 3 as a step toward more reliable, multi-step agentic systems that can operate across Google’s ecosystem.

During the briefing, speakers emphasized that Gemini 3 was engineered to support more consistent long-horizon reasoning, better tool use, and smoother planning loops than Gemini 2.5 Pro. One presenter said the model benefits from an architecture that allows it to generate and evaluate multiple hypotheses in parallel, improving reliability on mathematically hard questions and complex procedural tasks. Another speaker explained that Gemini 3’s improved spatial reasoning enables more robust interaction with interface elements, which supports agentic workflows across screens and applications.

Presenters highlighted growing enterprise adoption, noting strong demand for multimodal analysis, structured document reasoning, and agentic coding tools. They said Gemini 3’s performance on multimodal and scientific benchmarks reflected Google’s focus on grounded, verifiable reasoning. And they discussed Gemini 3's safety processes and improvements, including reduced sycophancy, stronger prompt-injection resistance, and a more structured evaluation pipeline guided by Google’s Frontier Safety Framework, introduced back in 2024.

A portion of the call was dedicated to developer experience. Google described updates to its AI Studio and API that allow developers to control thinking depth, adjust model “resolution,” and combine new grounding tools with URL context and Search.
Demos showed Gemini 3 generating application interfaces, managing tool sequences, and debugging code in Antigravity, illustrating the model’s shift toward agentic operation rather than single-step generation. The call positioned Gemini 3 as an upgrade across reasoning, planning, multimodal understanding, and developer workflows, with Google framing these advances as the foundation for its next generation of agent-driven products and enterprise services.

Gemini Agent introduces multistep workflow automation

Gemini Agent marks Google’s effort to move beyond conversational assistance toward operational AI. The system coordinates multi-step tasks across tools like Gmail, Calendar, Canvas, and live browsing. It reviews inboxes, drafts replies, prepares plans, triages information, and reasons through complex workflows, while requiring user approval before performing sensitive actions.

On a press call with journalists ahead of the release, Google said the agent is designed to handle multi-turn planning and tool-use sequences with a consistency that was not feasible in earlier generations. It is rolling out first to Google AI Ultra subscribers in the Gemini app.

Google Antigravity and developer toolchain integration

Antigravity is Google’s new agent-first development environment designed around Gemini 3. Developers collaborate with agents across an editor, terminal, and browser. The system orchestrates full-stack tasks, including code generation, UI prototyping, debugging, live execution, and report generation.

Across the broader developer ecosystem, Google AI Studio now includes a Build mode that automatically wires the right models and APIs to speed up AI-native app creation. Annotations support allows developers to attach prompts to UI elements for faster iteration.
Spatial reasoning improvements enable agents to interpret mouse movements, screen annotations, and multi-window layouts to operate computer interfaces more effectively. Developers also gain new reasoning controls through “thinking level” and “model resolution” parameters in the Gemini API, along with stricter validation of thought signatures for multi-turn consistency. A hosted server-side bash tool supports secure, multi-language code generation and prototyping. Grounding with Google Search and URL context can now be combined to extract structured information for downstream tasks.

Enterprise impact and adoption

Enterprise teams gain the multimodal understanding, agentic coding, and long-horizon planning needed for production use cases. The new model unifies analysis of documents, audio, video, workflows, and logs. Improvements in spatial and visual reasoning support robotics, autonomous systems, and scenarios requiring navigation of screens and applications. High-frame-rate video understanding helps developers detect events in fast-moving environments.

Gemini 3’s structured document understanding capabilities support legal review, complex form processing, and regulated workflows. Its ability to generate functional interfaces and prototypes with minimal prompting reduces engineering cycles. In addition, the gains in system reliability, tool-calling stability, and context retention make multi-step planning viable for operations like financial forecasting, customer support automation, supply chain modeling, and predictive maintenance.

Developer and API pricing

Google has disclosed initial API pricing for Gemini 3 Pro. In preview, the model is priced at $2 per million input tokens and $12 per million output tokens for prompts up to 200,000 tokens in Google AI Studio and Vertex AI.
For prompts that require more than 200,000 tokens, the input pricing doubles to $4 per 1M tokens, while the output rises to $18 per 1M tokens.

When compared to the API pricing for other frontier AI models from rival labs, Gemini 3 is priced in the mid-to-high range, which may impact adoption as cheaper and open-source (permissively licensed) Chinese models have increasingly come to be adopted by U.S. startups. Here's how it stacks up:

| Model | Input (/1M tokens) | Output (/1M tokens) | Total Cost | Source |
|---|---|---|---|---|
| ERNIE 4.5 Turbo | $0.11 | $0.45 | $0.56 | Qianfan |
| ERNIE 5.0 | $0.85 | $3.40 | $4.25 | Qianfan |
| Qwen3 (Coder ex.) | $0.85 | $3.40 | $4.25 | Qianfan |
| GPT-5.1 | $1.25 | $10.00 | $11.25 | OpenAI |
| Gemini 2.5 Pro (≤200K) | $1.25 | $10.00 | $11.25 | Google |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 | Google |
| Gemini 2.5 Pro (>200K) | $2.50 | $15.00 | $17.50 | Google |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 | Google |
| Grok 4 (0709) | $3.00 | $15.00 | $18.00 | xAI API |
| Claude Opus 4.1 | $15.00 | $75.00 | $90.00 | Anthropic |

Gemini 3 Pro is also available at no charge with rate limits in Google AI Studio for experimentation.

The company has not yet announced pricing for Gemini 3 Deep Think, extended context windows, generative interfaces, or tool invocation. Enterprises planning deployment at scale will require these details to estimate operational costs.

Multimodal, visual, and spatial reasoning enhancements

Gemini 3’s improvements in embodied and spatial reasoning support pointing and trajectory prediction, task progression, and complex screen parsing. These capabilities extend to desktop and mobile environments, enabling agents to interpret screen elements, respond to on-screen context, and unlock new forms of computer-use automation.

The model also delivers improved video reasoning with high-frame-rate understanding for analyzing fast-moving scenes, along with long-context video recall for synthesizing narratives across hours of footage.
Google’s examples show the model generating full interactive demo apps directly from prompts, illustrating the depth of multimodal and agentic integration.

Vibe coding and agentic code generation

Gemini 3 advances Google’s concept of “vibe coding,” where natural language acts as the primary syntax. The model can translate high-level ideas into full applications with a single prompt, handling multi-step planning, code generation, and visual design. Enterprise partners like Figma, JetBrains, Cursor, Replit, and Cline report stronger instruction following, more stable agentic operation, and better long-context code manipulation compared to prior models.

Rumors and rumblings

In the weeks leading up to the announcement, X became a hub of speculation about Gemini 3. Well-known accounts such as @slow_developer suggested internal builds were significantly ahead of Gemini 2.5 Pro and likely exceeded competitor performance in reasoning and tool use. Others, including @synthwavedd and @VraserX, noted mixed behavior in early checkpoints but acknowledged Google’s advantage in TPU hardware and training data. Viral clips from users like @lepadphone and @StijnSmits showed the model generating websites, animations, and UI layouts from single prompts, adding to the momentum.

Prediction markets on Polymarket amplified the speculation. Whale accounts drove the odds of a mid-November release sharply upward, prompting widespread debate about insider activity. A temporary dip during a global Cloudflare outage became a moment of humor and conspiracy before odds surged again.

The key moment came when users including @cheatyyyy shared what appeared to be an internal model-card benchmark table for Gemini 3 Pro. The image circulated rapidly, with commentary from figures like @deedydas and @kimmonismus arguing the numbers suggested a significant lead.
When Google published the official benchmarks, they matched the leaked table exactly, confirming the document’s authenticity.

By launch day, enthusiasm reached a peak. A brief “Geminiii” post from Sundar Pichai triggered widespread attention, and early testers quickly shared real examples of Gemini 3 generating interfaces, full apps, and complex visual designs. While some concerns about pricing and efficiency appeared, the dominant sentiment framed the launch as a turning point for Google and a display of its full-stack AI capabilities.

Safety and evaluation

Google says Gemini 3 is its most secure model yet, with reduced sycophancy, stronger prompt-injection resistance, and better protection against misuse. The company partnered with external groups, including Apollo and Vaultis, and conducted evaluations using its Frontier Safety Framework.

Deployment across Google products

Gemini 3 is available across Google Search AI Mode, the Gemini app, Google AI Studio, Vertex AI, the Gemini CLI, and Google’s new agentic development platform, Antigravity. Google says additional Gemini 3 variants will arrive later.

Gemini 3 represents Google’s largest step forward in reasoning, multimodality, enterprise reliability, and agentic capabilities. The model’s performance gains over Gemini 2.5 Pro are substantial across mathematical reasoning, vision, coding, and planning. Generative interfaces, Gemini Agent, and Antigravity demonstrate a shift toward systems that not only respond to prompts but plan tasks, construct interfaces, and coordinate tools. Combined with an unusually intense hype and leak cycle, the launch marks a significant moment in the AI landscape as Google moves aggressively to expand its presence across both consumer-facing and enterprise-facing AI workflows.
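The tiered API pricing disclosed above translates directly into a per-request estimate. The sketch below uses only the published preview rates ($2/$12 per million tokens for prompts up to 200K tokens, $4/$18 beyond that); it is an illustrative calculator, not an official Google billing tool:

```python
def gemini3_pro_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the preview cost in USD of one Gemini 3 Pro request.

    Published tiered rates: prompts up to 200K tokens are billed at
    $2/M input and $12/M output; larger prompts at $4/M and $18/M.
    """
    if input_tokens <= 200_000:
        in_rate, out_rate = 2.00, 12.00
    else:
        in_rate, out_rate = 4.00, 18.00
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A 100K-token prompt with a 5K-token answer:
print(round(gemini3_pro_cost(100_000, 5_000), 2))   # 0.26
# A 300K-token prompt crosses into the higher tier:
print(round(gemini3_pro_cost(300_000, 5_000), 2))   # 1.29
```

Note that the "Total Cost" column in the comparison table is simply this formula evaluated at one million input plus one million output tokens.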
Microsoft is fundamentally restructuring its Windows operating system to become what executives call the first "agentic OS," embedding the infrastructure needed for autonomous AI agents to operate securely at enterprise scale — a watershed moment in the evolution of personal computing that positions the 40-year-old platform as the foundation for a new era of human-machine collaboration.

The company announced Tuesday at its Ignite conference that it is introducing native agent infrastructure directly into Windows 11, allowing AI agents — autonomous software programs that can perform complex, multi-step tasks on behalf of users — to discover tools, execute workflows, and interact with applications through standardized protocols while operating in secure, policy-controlled environments separate from user sessions.

The shift is Microsoft's most significant architectural evolution of Windows since the introduction of the modern security model, transforming the operating system from a platform where users manually orchestrate applications into one where they can "simply express your desired outcome, and agents handle the complexity," according to Pavan Davuluri, President of Windows & Devices at Microsoft.

"Windows 11 starts with this notion of secure by design, secure by default," Davuluri said in an exclusive interview with VentureBeat. "And a lot of the work that we're doing today, when we think about the engagement we have with our customers, the expectations they have with us is making sure we are building upon the fact that Windows is the most secure platform for them and is the most resilient platform as well."

The announcements arrive as enterprises are experimenting with AI agents but struggling with fragmented tooling, security concerns, and lack of centralized management — challenges that Microsoft believes only operating system-level integration can solve.
The stakes are enormous: with Windows running on an estimated 1.4 billion devices globally, Microsoft's architectural choices will likely shape how organizations deploy autonomous AI systems for years to come.

New platform primitives create foundation for agent computing

At the core of Microsoft's vision are three new platform capabilities entering preview that fundamentally change how agents operate on Windows. Agent Connectors provide native support for the Model Context Protocol (MCP), an open standard introduced by Anthropic that allows AI agents to connect with external tools and data sources. Microsoft has built what it calls an "on-device registry" — a secure, manageable repository where developers can register their applications' capabilities as agent connectors, making them discoverable to any compatible agent on the system.

"These are platform capabilities that then become available to all of our customers," Davuluri explained, describing how the Windows file system, for example, becomes an agent connector that any MCP-compatible agent can access with user consent. "We're able to do this in a fashion that can scale for one but it also allows others to participate in the Windows registry for MCP."

The architecture introduces an MCP proxy layer that handles authentication, authorization, and auditing for all communication between agents and connectors. Microsoft is launching with two built-in agent connectors for File Explorer and System Settings, allowing agents to manage files or adjust system configurations like switching between light and dark mode — all with explicit user permission.

Agent Workspace, entering private preview, represents perhaps the most significant security innovation.
It creates what Microsoft describes as "a contained, policy-controlled, and auditable environment where agents can interact with software" — essentially a parallel desktop session where agents operate with their own distinct identity, completely separate from the user's primary session.

"We want to be able to have clarity in the identity of the agent that is operating in the local operating system," Davuluri said, addressing security concerns about agents accessing sensitive data. "We want that session to be a session that is secure, that is policy control, that is manageable, that has transparency and auditability."

Each agent workspace runs with minimal privileges by default, accessing only explicitly granted resources. The system maintains detailed audit logs distinguishing agent actions from user actions — critical for enterprises that need to prove compliance and track all changes to systems and data.

Windows 365 for Agents extends this infrastructure to the cloud, turning Microsoft's Cloud PC offering into execution environments for agents. Instead of running on local devices, agents can operate in secure, policy-controlled virtual machines in Azure, enabling what Microsoft calls "computer-using agents" to interact with legacy applications and perform automation tasks at scale without consuming local compute resources.

Taskbar becomes command center for monitoring AI agents at work

The infrastructure enables significant user interface changes designed to make agents as commonplace as applications. Microsoft is introducing "Ask Copilot on the taskbar," a unified entry point in preview that combines Microsoft 365 Copilot, agent invocation, and traditional search in a single interface.

Users will be able to invoke agents using "@" mentions directly from the taskbar, then monitor their progress through familiar UI patterns like hover cards, progress badges, and notifications — all while continuing other work.
When an agent completes a task or needs input, it surfaces updates through the taskbar without disrupting the user's primary workflow.

"We've evolved and created new UX in the taskbar to reflect the unique needs of agents performing background tasks on your behalf," said Navjot Virk, Corporate Vice President of Windows Experiences, describing features like progress bars and status badges that indicate when agents are working, need approval, or have completed tasks.

The design philosophy, Virk emphasized, centers on user control. "These experiences are designed to be opt in. We want to give customers full control over when and how they engage with copilots and agents."

For commercial Microsoft 365 Copilot users, the integration goes deeper. Microsoft is embedding Copilot directly into File Explorer, allowing users to ask questions, generate summaries, or draft emails based on document contents without leaving the file management interface. On Copilot+ PCs — devices with neural processing units capable of 40 trillion operations per second — new capabilities include converting any on-screen table into an Excel spreadsheet through the Click to Do feature.

Microsoft bets on open standards against Apple and Google's proprietary approaches

Microsoft's embrace of the open Model Context Protocol, created by Anthropic, marks a strategic bet on openness as enterprises evaluate competing AI platforms from Apple and Google that use proprietary frameworks.

"Windows is an open platform, and by virtue [of being] an open platform, we certainly have the ability to take existing technologies, evolve, harden, adapt those, but we also allow customers to bring their own capabilities to the platform as well," Davuluri said when asked about competing with Apple Intelligence and Google's Android AI for Enterprise.

The company demonstrated this openness with Claude, Anthropic's AI assistant, accessing the Windows file system through agent connectors with user consent — one of numerous partnerships
Microsoft has secured. Dynamics 365 is using the File Explorer connector to streamline expense reporting, reducing what was previously a 30-minute, dozen-step process to "one sentence with high accuracy," according to Microsoft's blog post. Other early partners include Manus AI, Dropbox Dash, Roboflow, and Infosys.

"Windows is the platform in which they build upon," Davuluri said of enterprise customers. "And so our ability to take those existing bodies of work they have, and extend them is the, I think, the least friction way for them to go, learn, adopt, experiment and find ways to [scale]."

Security model enforces strict containment and mandatory user consent

Microsoft's security model for agents adheres to what it calls "secure by default" policies aligned with the company's broader Secure Future Initiative. All agent connectors registered in the on-device registry must meet strict requirements around packaging and identity, with applications properly packaged and signed by trusted sources. Developers must explicitly declare the minimum capabilities their agent connectors require, and agents and connectors run in isolated environments with dedicated agent user accounts, separate from human user accounts. Windows requires explicit user approval when agents first access sensitive resources like files or system settings.

"We give Windows the ability to go deliver on the security expectations, and then it is auditable at the end of the day," Davuluri said. "You still want an auditability log that looks similar to perhaps what you use in the cloud. And so all three pieces are built into the design and architecture of Agent Workspace."

For IT administrators, Microsoft is introducing management policies through Intune and Group Policy that allow organizations to enable or disable agent features at device and account levels, set minimum security policy levels, and access event logs enumerating all agent connector invocations and errors.
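The containment model described here reads naturally as a policy check in front of every connector call: a connector declares its capabilities in a registry, and a proxy refuses invocations that lack user consent while writing every attempt to an audit log. The following is a minimal sketch of that pattern with entirely hypothetical names — it is not Microsoft's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Connector:
    """A registered connector declaring its minimum capabilities."""
    name: str
    capabilities: set

@dataclass
class McpProxy:
    """Illustrative proxy enforcing consent and auditing every call."""
    registry: dict = field(default_factory=dict)   # the "on-device registry"
    consents: set = field(default_factory=set)     # (agent, connector) grants
    audit_log: list = field(default_factory=list)

    def register(self, connector: Connector) -> None:
        self.registry[connector.name] = connector

    def grant(self, agent: str, connector: str) -> None:
        self.consents.add((agent, connector))

    def invoke(self, agent: str, connector: str, capability: str) -> str:
        c = self.registry.get(connector)
        allowed = (
            c is not None
            and capability in c.capabilities          # declared capability only
            and (agent, connector) in self.consents   # explicit user consent
        )
        # Every attempt is logged, allowed or not.
        self.audit_log.append((agent, connector, capability, allowed))
        if not allowed:
            raise PermissionError(f"{agent} may not call {connector}.{capability}")
        return f"{connector}.{capability} executed"

proxy = McpProxy()
proxy.register(Connector("file_explorer", {"list", "read"}))
proxy.grant("claude", "file_explorer")
print(proxy.invoke("claude", "file_explorer", "read"))  # file_explorer.read executed
```

An ungranted agent hitting the same connector raises `PermissionError`, and the denial still lands in the audit log — mirroring the auditability Davuluri describes.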
The company emphasized that agents operate with restricted privileges, with minimal permissions by default and access granted only to explicitly approved resources that users can revoke at any time.

Post-quantum cryptography and recovery tools address emerging and persistent threats

Beyond agent infrastructure, Microsoft announced significant security and resilience updates addressing both emerging and persistent enterprise challenges. Post-Quantum Cryptography APIs are now generally available in Windows, allowing organizations to begin migrating to encryption algorithms designed to withstand future quantum computing attacks that could break today's cryptographic standards. Microsoft worked closely with the National Institute of Standards and Technology to implement these algorithms.

"We are introducing post quantum cryptography APIs in Windows," Davuluri said. "For customers who want to be able to do cryptographic encryption in their workloads, they can start taking advantage of these APIs in Windows for the first time. That is a huge step forward for us when we think about the future of Windows."

Hardware-accelerated BitLocker will arrive on new devices starting spring 2026, offloading disk encryption to dedicated silicon for faster performance while providing hardware-level key protection. Sysmon functionality is becoming generally available as part of Windows in early 2026, bringing advanced forensics and threat detection capabilities previously available only as a separate download directly into the operating system's event logging system.

The company also detailed progress on its Windows Resiliency Initiative, launched a year ago following the CrowdStrike incident that disrupted 8.5 million Windows devices globally. New recovery capabilities include Quick Machine Recovery with expanded networking support and Autopatch management, allowing IT to remotely fix devices stuck in Windows Recovery Environment.
Point-in-time restore, entering preview, rolls back devices to earlier states to resolve update conflicts or configuration errors, while Cloud rebuild, also in preview, allows IT to remotely rebuild malfunctioning devices by downloading fresh installation media and using Autopilot for zero-touch provisioning.

Microsoft is also raising security requirements for third-party drivers across the Windows ecosystem. Following updated requirements for antivirus drivers effective April 1, 2025, the company is expanding this approach to other driver classes including networking, cameras, USB, printers, and storage — requiring higher certification standards, adding compiler safeguards, and providing more Windows in-box drivers to reduce reliance on third-party kernel-mode code.

Measured rollout reflects enterprise caution around autonomous software

Microsoft is positioning these updates as essential infrastructure for what it calls "Frontier Firms" — organizations that "blend human ingenuity with intelligent systems to deliver real outcomes." However, the company emphasized a cautious, opt-in approach that reflects enterprise concerns about autonomous software agents.

"The principles we're using in designing these new platform capabilities accounts for the reality that we have a very, very broad user base," Davuluri said. "A lot of the features and capabilities we're building are opt in capabilities. And so it is our goal to be able to have users find value in the workflow and meet them."

Virk emphasized the measured approach: "This is more about meeting customers where they are and then taking them on this journey when they are ready. So there's the optionality, but also having support for it. And really important thing is that they should feel comfortable. They should feel secure."

Microsoft's bet is that only operating system-level integration can provide the security, governance, and user experience required for mainstream AI agent adoption.
Whether that vision materializes will depend on developer adoption, enterprise comfort with autonomous software, and Microsoft's ability to balance innovation with the stability that 40 years of Windows customers expect. After four decades of putting users in control of their computers, Windows is now asking them to share that control with machines.
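The Agent Workspace model the article describes — a distinct agent identity, minimal privileges by default, and audit logs that separate agent actions from user actions — can be sketched in a few lines. Again, all names here are illustrative assumptions, not Windows APIs:

```python
class AgentWorkspace:
    """Illustrative sketch of an isolated, least-privilege agent session."""

    def __init__(self, agent_id: str, granted: set):
        self.agent_id = agent_id    # distinct from the human user's identity
        self.granted = set(granted) # minimal privileges: only what was granted
        self.audit = []

    def act(self, resource: str, action: str) -> bool:
        ok = resource in self.granted
        # The log attributes each action to the agent identity, not the user,
        # so compliance reviews can separate agent activity from user activity.
        self.audit.append({
            "actor": self.agent_id,
            "actor_type": "agent",
            "resource": resource,
            "action": action,
            "allowed": ok,
        })
        return ok

ws = AgentWorkspace("agent://expense-bot", granted={"C:/Reports"})
ws.act("C:/Reports", "read")      # allowed: explicitly granted resource
ws.act("C:/Users/me", "write")    # denied: outside the grant list
```

The point of the sketch is the shape of the audit record: every entry carries the agent's own identity, which is what lets an enterprise prove which changes were made by software rather than by people.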
Writer, a San Francisco-based artificial intelligence startup, is launching a unified AI agent platform designed to let any employee automate complex business workflows without writing code — a capability the company says distinguishes it from consumer-oriented tools like Microsoft Copilot and ChatGPT.

The platform, called Writer Agent, combines chat-based assistance with autonomous task execution in a single interface. Starting Tuesday, enterprise customers can use natural language to instruct the AI to create presentations, analyze financial data, generate marketing campaigns, or coordinate across multiple business systems like Salesforce, Slack, and Google Workspace — then save those workflows as reusable "Playbooks" that run automatically on schedules.

The announcement comes as enterprises struggle to move AI initiatives beyond pilot programs into production at scale. Writer CEO May Habib has been outspoken about this challenge, recently revealing that 42% of Fortune 500 executives surveyed by her company said AI is "tearing their company apart" due to coordination failures between departments.

"We're delivering an agent interface that is both incredibly powerful and radically simple to transform individual productivity into organizational impact," Habib said in a statement.
"Writer Agent is the difference between a single sales rep asking a chatbot to write an outreach email and an enterprise ensuring that 1,000 reps are all sending on-brand, compliant, and contextually-aware messages to target accounts."

How Writer is putting workflow automation in the hands of non-technical workers

The platform's core innovation centers on making workflow automation accessible to non-technical employees — what Writer executives call "democratizing who gets to be a builder."

In an exclusive interview with VentureBeat, Doris Jwo, Writer's director of product management, demonstrated how the system works: A user types a request in plain English — for example, "Create a two-page partnership proposal between [Company A] and [Company B], make it a branded deck, include impact metrics and partnership tiers."

The AI agent then breaks down that request into discrete steps, conducts web research, generates graphics and charts on the fly, creates individual slides with sourced information, and assembles a complete presentation. The entire process, which might take an employee hours or days, can be completed in 10-12 minutes.

"The agent basically looks at the request, breaks it down, does research, understands what pieces it needs, creates a detailed plan at a step-by-step level," Jwo explained during a product demonstration. "It might say, 'I need to do web research,' or 'This user needs information from Gong or Slack,' and it reaches out to those connectors, grabs the data, and executes the plan."

Crucially, users can save these multi-step processes as Playbooks — reusable templates that colleagues can deploy with a single click. Routines allow those Playbooks to run automatically at scheduled intervals, essentially putting knowledge work "on autopilot."

Security and compliance controls: Writer's answer to enterprise IT concerns

Writer positions these enterprise-focused controls as a key differentiator from competitors.
While Microsoft, OpenAI, and Anthropic offer powerful AI capabilities, Writer's executives argue those tools weren't designed from the ground up for the security, compliance, and governance requirements of large regulated organizations.

"All of the products you mentioned are great products, but even Copilot is very much focused on personal productivity — summarizing email, for example, which is important, but that's not the component we're focusing on," said Matan-Paul Shetrit, Writer's director of product management, in an exclusive interview with VentureBeat.

Shetrit emphasized Writer's "trust, security, and interoperability" approach. IT administrators can granularly control what the AI can access — for instance, preventing market research agents from mentioning competitors, or restricting which employees can use web search capabilities. All activity is logged with detailed audit trails showing exactly what data the agent touched and what actions it took.

"These fine-grained controls are what make products enterprise-ready," Shetrit said. "We can deploy to tens of thousands or hundreds of thousands of employees while maintaining the security and guardrails you need for that scale."

This architecture reflects Writer's origin story. Unlike OpenAI or Anthropic, which started as research labs and later added enterprise offerings, Writer has targeted Fortune 500 companies since its 2020 founding. "We're not a research lab that went to consumer and is dabbling in enterprise," Shetrit said. "We are first and foremost targeting the Global 2000 and Fortune 500, and our research is in service of these customers' needs."

Inside Writer's strategy to connect AI agents across enterprise software systems

A critical technical component is Writer's approach to system integrations.
The platform includes pre-built connectors to more than a dozen enterprise applications — Google Workspace, Microsoft 365, Snowflake, Asana, Slack, Gong, HubSpot, Atlassian, Databricks, PitchBook, and FactSet — allowing the AI to retrieve information and take actions across those systems.

Writer built these connectors using the Model Context Protocol (MCP), an emerging standard for AI system integrations, but added what Shetrit described as an "enterprise-ready" layer on top.

"We took a first-principle approach of: You have this MCP connector infrastructure — how do you build it in a way that's enterprise-ready?" Shetrit explained. "What we have today in the industry is definitely not it."

The system can write and execute code on the fly to handle unexpected scenarios. If a user uploads an unfamiliar file format, for instance, the agent will generate code to extract and process the text without requiring a human to intervene.

Jwo demonstrated this capability with a daily workflow she runs: Every morning at 10 a.m., a Routine automatically summarizes her Google Calendar meetings, identifies external participants, finds their LinkedIn profiles, and sends the summary to her via Slack — all without her involvement.

"This was pretty simple, but you can imagine for a salesperson it might say, 'At the end of the day, wrap up a summary of all the calls I had, send me action items, post it to the account-specific Slack channel, and tag these folks so they can accomplish those workflows,'" Jwo said. "That can run continuously each day, each week, or on demand."

From mortgage lenders to CPG brands: Real-world AI agent use cases across industries

The platform is attracting customers across multiple industries. New American Funding, a mortgage lender, uses Writer Agent to automate marketing workflows.
Senior Content Marketing Manager Karen Rodriguez uploads Asana project tickets with creative briefs, and the AI executes tasks like updating email campaigns or transforming articles into social media carousels, video scripts, and captions.

Other use cases span financial services teams creating investment dashboards with PitchBook and FactSet data, consumer packaged goods companies brainstorming new product lines based on social media trends, and marketing teams generating partnership presentations with branded assets.

Writer has added customers including TikTok, Comcast, Keurig Dr Pepper, CAA, and Aptitude Health, joining an existing base that includes Accenture, Qualcomm, Uber, Vanguard, and Marriott. The company now serves more than 300 enterprises and has secured over $50 million in signed contracts, with projections to double that to $100 million this year.

The startup's net retention rate — a measure of how much existing customers expand their usage — stands at 160%, meaning customers on average increase their spending by 60% after initial contracts. Twenty customers who started with $200,000-$300,000 contracts now spend about $1 million annually, according to company data.

'Vibe working': Writer's vision for AI-powered productivity beyond coding

Writer executives frame the platform as enabling what they call "vibe working" — a playful reference to the popular term "vibe coding," which describes AI tools like Cursor that dramatically accelerate software development.

"We used to call it transformation when we took 12 steps and made them nine. That's optimizing the world as it is," Habib said at Writer's AI Leaders Forum earlier this month, according to Forbes. "We can now create a new world. That is the greenfield mindset."

Shetrit echoed this framing: "Vibe coding is the theme of 2025. Our view is that ‘vibe working’ is the theme of 2026.
How do you bring the same productivity gains you've seen with coding agents into the workspace in a way that non-technical users can maximize them?"

The platform is powered by Palmyra X5, Writer's proprietary large language model featuring a one-million-token context window — among the largest commercially available. Writer trained the model for approximately $700,000, a fraction of the estimated $100 million OpenAI spent on GPT-4, by using synthetic data and techniques that halt training when returns diminish.

The model can process one million tokens in about 22 seconds and costs 60 cents per million input tokens and $6 per million output tokens — significantly cheaper than comparable offerings, according to company specifications.

Making AI Decisions Visible: Writer's Approach to Trust and Transparency

A distinctive aspect of Writer's approach is transparency into the AI's decision-making process. The interface displays the agent's step-by-step reasoning, showing which data sources it accessed, what code it generated, and how it arrived at outputs.

"There's a very clear exhibition of how the agent is thinking, what it's doing, what it's touching," Shetrit said. "This is important for the end user to trust it, but also important for the IT person or security professional to see what's going on."

This "supervision" model goes beyond simple observability of API calls to encompass what Shetrit described as "a superset of observability" — giving organizations the ability to not just monitor but control AI behavior through policies and permissions.

Session logs capture all agent activity when enabled by administrators, and users can submit feedback on every output to help improve system performance. The platform also emphasizes providing sources and citations for generated content, allowing users to verify information.

"With any sort of chat assistant, agentic or not, trust but verify is really important," Jwo said.
"That's part of the pillars of us building this and making it enterprise-grade."What Writer Agent Costs—and Why It's Included in the Base PlatformWriter is including all the new capabilities—Playbooks, Routines, Connectors, and Personality customization—as part of its core platform without additional charges, according to Jwo."This is fully included as part of the Writer platform," she said. "We're not charging additional for using Writer Agent."The "Personality" feature allows individual users, teams, or entire organizations to customize the AI's communication style, ensuring generated content matches brand voice and tone guidelines. This works alongside company-level controls that enforce terminology and style requirements.For highly structured, repetitive tasks, Writer also offers a library of more than 100 pre-built agents and an AI Studio for building custom multi-agent systems aligned with specific business use cases.The Race to Define Enterprise AI: Can Purpose-Built Platforms Beat Tech Giants?The launch crystallizes a fundamental tension in how enterprises will adopt AI at scale. While consumer-facing AI tools emphasize individual productivity gains, companies need systems that work reliably across thousands of employees, integrate with existing software infrastructure, maintain regulatory compliance, and deliver measurable business impact.Writer's wager is that these requirements demand purpose-built enterprise platforms rather than consumer tools adapted for business use. The company's $1.9 billion valuation — achieved in a November 2024 funding round that raised $200 million — suggests investors see merit in this thesis. Backers include Premji Invest, Radical Ventures, ICONIQ Growth, Salesforce Ventures, and Adobe Ventures.Yet the competitive landscape remains formidable. Microsoft and Google command enormous distribution advantages through their existing enterprise software relationships. 
OpenAI and Anthropic possess research capabilities that have produced breakthrough models. Whether Writer can maintain its differentiation as these giants expand their enterprise offerings will test the startup's core premise: that serving Fortune 500 companies from day one creates advantages that research labs turned enterprise vendors cannot easily replicate.

"We're entering an era where if you can describe a better way to work, you can build it," Jwo said. "The new Writer Agent democratizes who gets to be a builder, empowering the operational experts and creative problem-solvers in every department to become the architects of their own transformation. That's how you unlock innovation that competitors can't replicate."

The promise is alluring — AI capabilities powerful enough to transform how work gets done, accessible enough for any employee to use, and controlled enough for enterprises to deploy safely at scale. Whether Writer can deliver on that promise at the speed and scale required will determine if its vision of "vibe working" becomes the 2026 theme Shetrit predicts, or just another ambitious attempt to solve enterprise AI's execution problem.

But one thing is certain: In a market where 85% of AI initiatives fail to escape pilot purgatory, Writer is betting that the winners won't be the companies with the most powerful models — they'll be the ones that make those models actually work inside the enterprise.
A structured approach can make any data science project easier to handle. This guide breaks it down into five practical steps to take you from problem definition to results.
LLMs excel at finding value in your unstructured data, but the truth is, far more value is hidden within your structured data. This post explores what LLMs are (and aren't) optimized for and how the industry is approaching AI over structured business datasets – including one approach developed by my team and me.
The post Why LLMs Aren’t a One-Size-Fits-All Solution for Enterprises appeared first on Towards Data Science.
Introduction

Automatic plant leaf detection is a notable application of computer vision and machine learning, enabling the identification of plant species by examining a photograph of the leaves. Deep learning is applied to extract meaningful features from an image of leaves and convert them into compact numerical representations known as embeddings. These embeddings capture the […]
The post How Deep Feature Embeddings and Euclidean Similarity Power Automatic Plant Leaf Recognition appeared first on Towards Data Science.
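The matching step the excerpt describes — comparing a query embedding against stored species embeddings by Euclidean distance — can be sketched in a few lines. This is a minimal illustration, not the post's actual implementation; the toy 4-dimensional vectors and species names are invented (real deep-feature embeddings are typically hundreds of dimensions):

```python
import math

def euclidean_distance(a, b):
    """Euclidean (L2) distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_species(query, reference):
    """Return the species whose stored embedding is closest to the query;
    smaller distance means greater similarity."""
    return min(reference, key=lambda name: euclidean_distance(query, reference[name]))

# Toy reference embeddings keyed by species (hypothetical values):
reference = {
    "oak":   [0.9, 0.1, 0.3, 0.7],
    "maple": [0.2, 0.8, 0.6, 0.1],
}
query = [0.85, 0.15, 0.25, 0.65]  # embedding of a new leaf photo
print(nearest_species(query, reference))  # → oak
```

The same nearest-neighbor lookup scales to many species; production systems usually swap the linear scan for an approximate-nearest-neighbor index.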
In the winter of 2022, as the tech world was becoming mesmerized by the sudden, explosive arrival of OpenAI's ChatGPT, Benjamin Alarie faced a pivotal choice. His legal tech startup, Blue J, had a respectable business built on the AI of a bygone era, serving hundreds of accounting firms with predictive models. But it had hit a ceiling.

Alarie, a tenured tax law professor at the University of Toronto, saw the nascent, error-prone, yet powerful capabilities of large language models not as a curiosity, but as the future. He made a high-stakes decision: to pivot his entire company, which had been painstakingly built over nearly a decade, and rebuild it from the ground up on this unproven technology.

That bet has paid off handsomely. Blue J has since quietly secured a $122 million Series D funding round co-led by Oak HC/FT and Sapphire Ventures, placing the company's valuation at over $300 million. The move transformed Blue J from a niche player into one of Canada's fastest-growing legal tech firms, multiplying its revenue roughly twelve-fold and attracting 10 to 15 new customers every day.

The company now serves more than 3,500 organizations, including global accounting giant KPMG and several Fortune 500 companies. It is tackling a critical bottleneck in the professional services industry: a severe and worsening talent shortage. The U.S. has 340,000 fewer accountants than it did five years ago, and with 75% of current CPAs expected to retire in the next decade, firms are desperate for tools that can amplify the productivity of their remaining experts.

"What once took tax professionals 15 hours of manual research to do can now be completed in about 15 seconds with Blue J," Alarie, the company's CEO, said in an exclusive interview with VentureBeat.
"That value proposition — we can take hours of work and turn it into seconds of work — that is driving a lot of this."

When the dean's biography was wrong: the moment that changed everything

Alarie vividly remembers January 2023, when the dean of the law school stopped by his office for New Year's greetings. He asked her about ChatGPT and prompted the AI to describe her. ChatGPT confidently generated a biography. Some details were accurate. Others were completely fabricated.

"She was like, 'Okay, this is really kind of scary. This is wrong, and this has implications,'" Alarie said. Yet that moment of obvious failure didn't deter him. Instead, it crystallized his conviction.

The company's first iteration, launched in 2015, used supervised machine learning to build predictive models that could forecast judicial outcomes on specific tax issues. While technically sophisticated, it had a fundamental flaw: it couldn't answer every tax research question.

"The challenge was it couldn't answer every tax research question, which was really the holy grail," Alarie said. Customers loved the tool when it applied to their problem, but would quickly abandon it when it didn't. Revenue plateaued around $2 million annually.

Despite ChatGPT's notorious hallucinations, Alarie convinced his board to make the pivot. "I had this conviction that if we continued down that path, we weren't going to be able to address our number one limitation," he said. "Large language models seemed like a very promising direction."

He gave his team six months to deliver a working product.

From 90-second responses to 3 million queries: How Blue J tamed AI hallucinations

By August 2023, Blue J was ready to launch. What they released was, in Alarie's candid assessment, "super janky." The system took 90 seconds to respond. About half the answers had issues.
The Net Promoter Score registered at just 20.

What transformed that flawed product into today's platform — with response times measured in seconds, a dissatisfaction rate of just one in 700 queries, and an NPS score in the mid-80s — was relentless focus on three strategic pillars.

First is proprietary content at massive scale. Blue J secured exclusive licensing with Tax Analysts (Tax Notes) and IBFD, the Amsterdam-based global tax authority covering 220+ jurisdictions. "We are the only platform on earth that takes in the best U.S. tax information from Tax Notes and the best global tax information from IBFD," Alarie said.

Second is deep human expertise. Blue J employs tax experts led by Susan Massey, who spent 13 years at the IRS Office of Chief Counsel as Branch Chief for Corporate Tax. Her team constantly tests the AI and refines its performance.

Third is an unprecedented feedback flywheel. With over 3 million tax research queries processed in 2025, Blue J is amassing unparalleled data. Each query generates feedback that flows back into the system.

Weekly active user rates hover between 75% and 85%, compared to 15% to 25% for traditional platforms. "A charitable ratio is like we're five times more intensively used," Alarie noted.

Inside Blue J's early access partnership with OpenAI

Blue J maintains an unusually close relationship with OpenAI that has proven crucial to its success. "We have a very good relationship with OpenAI, and we get early access to their models," Alarie said. "It's quite collaborative. We give them a lot of really high quality feedback about how well different versions of forthcoming models are performing."

This feedback proves valuable because Blue J has developed what Alarie calls "ecologically valid" test questions — drawn from actual tax professional queries, with correct answers determined by Blue J's expert team.
This helps OpenAI improve performance on complex reasoning tasks.

The company tests models from all major providers — OpenAI, Anthropic, Google's Gemini, and open-source alternatives — continuously evaluating which performs best. "We're not necessarily 100% committed to any particular provider," he explained. "We're testing all the time."

This approach helps Blue J navigate a challenging business model: charging approximately $1,500 per seat annually for unlimited queries while absorbing variable compute costs. "We've pre-committed to delivering them a really good user experience, unlimited tax research answers at a fixed price," Alarie said. "We're absorbing a lot of that risk."

Competition among foundation model providers creates downward pressure on API pricing, while Blue J's conservative usage modeling has proven accurate. Gross revenue retention exceeds 99%, while net revenue retention reaches 130% — considered best-in-class for SaaS businesses.

Taking on Thomson Reuters and LexisNexis with 75% weekly engagement

Blue J faces competition from established publishers like Thomson Reuters, LexisNexis, and Bloomberg, all of which announced AI capabilities throughout 2023 and 2024. Yet Blue J's engagement metrics suggest it has captured significant momentum, growing from just 200 customers in 2021 to over 3,500 organizations today.

Daily content updates prove crucial. While the tax code itself changes only when Congress acts, the ecosystem evolves constantly through IRS regulations, new rulings, and court cases. All 50 states modify their tax codes regularly.

"Things are changing literally every day," Alarie said. "Every day we're updating the materials, and that's just the U.S. We cover Canada, we cover the UK. The aspirations are truly global for this thing."

Alarie's ambitions extend beyond building a successful startup.
As author of the award-winning book "The Legal Singularity" and faculty affiliate at the Vector Institute for Artificial Intelligence, he has spent years contemplating AI's long-term impact on law.

In academic papers published in Tax Notes throughout 2023 and 2024, he chronicled generative AI's rise, predicting that "clients will become substantially more sophisticated" and that AI would push human experts toward higher-value strategic roles rather than routine research.

Blue J's $122 million plan: From tax research to 'global tax cognition'

The Series D funding, which brought total capital raised to over $133 million, will fuel aggressive geographic and product expansion. Blue J already operates in the U.S., Canada, and the U.K., with plans to eventually cover 220+ jurisdictions through its IBFD partnership.

Future capabilities could include automated memo generation, tax form completion, document drafting, and conversational history maintaining context across sessions — transforming Blue J from a research tool into what Alarie describes as "the operating layer for global tax cognition."

For all its success, Blue J operates in a domain where errors carry serious consequences. The hallucination problem hasn't been eliminated — it's been minimized through careful engineering, content curation, and human oversight. Blue J has trained its models to acknowledge when they cannot answer a question rather than fabricate information.

The business also faces economic risks if compute costs spiral or usage patterns exceed projections. And subtler questions loom about professional judgment: as AI systems become more capable, will users defer to outputs without sufficient critical evaluation?

From 15 hours to 15 seconds: What Blue J's AI pivot teaches every industry

Blue J's transformation offers lessons beyond tax software.
The company's willingness to abandon eight years of proprietary technology and rebuild on an initially unreliable foundation required both courage and calculated risk-taking.

The decision paid off not because generative AI was inherently superior to supervised machine learning in all dimensions, but because it addressed the right problem: comprehensiveness rather than precision in narrow domains. Tax professionals didn't need 95% accuracy on 5% of questions. They needed good-enough accuracy on 100% of questions.

The improvement from an NPS of 20 to 84 in just over two years reflects relentless iteration informed by massive data collection. The content partnerships created differentiation that pure technology couldn't replicate. The team of tax experts provided domain knowledge necessary to ensure reliability.

Most fundamentally, Blue J recognized that the real competition wasn't other AI startups or even established publishers. It was the old way of doing things — the 15 hours of manual research, the institutional knowledge locked in retiring professionals' heads.

"People are like, 'What does Blue J do? They provide better tax answers. Okay, I think we need that,'" Alarie reflected.

As AI transforms profession after profession, that clarity of purpose may matter more than technological sophistication. The future belongs not to those who build the most advanced AI, but to those who most effectively harness it to solve problems humans actually have.

For a tax law professor who started with frustration about inefficient research methods, building a $300 million company marks an audacious milestone. For the thousands of professionals now answering complex questions in 15 seconds instead of 15 hours, it represents the future of their profession, arriving faster than most expected.

The bet on ChatGPT when it was still hallucinating biographies has become a validation that sometimes the riskiest move is not to move at all.
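For readers unfamiliar with the metric, the NPS figures the Blue J story cites follow the standard Net Promoter Score formula: the percentage of promoters (ratings 9-10 on a 0-10 "how likely are you to recommend us" scale) minus the percentage of detractors (ratings 0-6). A minimal sketch with invented survey responses:

```python
def net_promoter_score(ratings):
    """Standard NPS: % promoters (9-10) minus % detractors (0-6)
    on a 0-10 likelihood-to-recommend scale. Range: -100 to 100."""
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return round(100 * (promoters - detractors) / len(ratings))

# Hypothetical sample: 5 promoters, 2 passives (7-8), 3 detractors
print(net_promoter_score([10, 9, 9, 10, 9, 8, 7, 3, 5, 6]))  # → 20
```

Moving from 20 to the mid-80s therefore means the promoter share grew enormously while detractors nearly vanished, which is why the figure is treated as a headline result.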
The search giant fires its latest salvo against traditional RAG processing.
The post Introducing Google’s File Search Tool appeared first on Towards Data Science.
Picture this: You’re a data analyst on day one at a midsize SaaS company. You’ve got the beginnings of a data warehouse—some structured, usable data and plenty of raw data you’re not quite sure what to do with yet. But that’s not the real problem. The real problem is that different teams are doing their […]
Industry leaders agree collaboration is key to advancing critical technologies.
Much of the US economy rests on AI’s future. On this episode of The Big Interview podcast, Odd Lots cohost Joe Weisenthal breaks down why AI’s impact on finance goes beyond billion-dollar investments.