AWS is leveraging automated reasoning, which uses math-based verification, to build out new capabilities in its Amazon Bedrock AgentCore platform as the company digs deeper into the agentic AI ecosystem. At its annual re:Invent conference in Las Vegas, AWS announced three new AgentCore capabilities: "policy," "evaluations" and "episodic memory." The new features aim to give enterprises more control over agent behavior and performance. AWS also revealed what it calls "a new class of agents," or "frontier agents," that are autonomous, scalable and independent.

Swami Sivasubramanian, AWS VP for Agentic AI, told VentureBeat that many of AWS's new features represent a shift in who becomes a builder. "We are actually on the cusp of a major tectonic transformation with AI, but agentic AI is truly starting to transform what is the art of the possible, and it is going to make this one of the most truly transforming technologies," Sivasubramanian said.

Policy agents

The new policy capability helps enterprises enforce guidelines even after the agent has already reasoned through its response. AWS VP for AgentCore David Richardson told VentureBeat that the policy tool sits between the agent and the tools it calls, rather than being baked into the agent, as fine-tuning often is. The idea is to prevent an agent from violating enterprise rules and redirect it to re-evaluate its reasoning.

Richardson gave the example of a customer service agent: A company would write a policy stating that the agent can grant a refund of up to $100, but for anything higher, the agent would need to hand the customer off to a human. He noted that it remains easy to subvert an agent's reasoning loop through, for instance, prompt injection or poisoned data, leading agents to ignore guardrails.

"There are always these prompt injection attacks where people try to subvert the reasoning of the agent to get the agent to do things it shouldn't do," Richardson said. "That's why we implemented the policy outside of the agent, and it works using the automated reasoning capabilities that we've spent years building up to help customers define their capabilities."

AWS unveiled Automated Reasoning Checks on Bedrock at last year's re:Invent. These use neurosymbolic AI, or math-based validation, to prove correctness, applying mathematical proofs to confirm that a model hasn't hallucinated. AWS has been leaning heavily into neurosymbolic AI and automated reasoning, pushing for enterprise-grade security and safety in ways that differ from other AI model providers.

Episodic memories and evaluations

The two other new AgentCore capabilities, "evaluations" and "episodic memory," give enterprises a better view of agent performance and give agents a longer-lived recall of past interactions.

An enhancement of AgentCore memory, episodic memory refers to knowledge that agents tap into only occasionally, unlike longer-running preferences, which they have to refer back to constantly. Context window limits hamper some agents, so they sometimes forget information or conversations they haven't tapped into for a while.

"The idea is to help capture information that a user really would wish the agent remembered when they came back," said Richardson. "For example, 'what is their preferred seat on an airplane for family trips?'
Or 'what is the sort of price range they're looking for?'"

Episodic memory differs from the previously shipped AgentCore memory because, instead of relying on maintaining short- and long-term memory, agents built on AgentCore can recall certain information based on triggers. This can eliminate the need for custom instructions.

With AgentCore evaluations, organizations can use 13 pre-built evaluators or write their own. Developers can set alerts to warn them if agents begin to fail quality monitoring.

Frontier agents

But perhaps AWS's strongest push into enterprise agentic AI is the release of frontier agents: fully automated and independent agents that the company says can act as teammates with little direction. The concept is similar, if not identical, to the more asynchronous agents offered by competitors like Google and OpenAI. However, AWS appears to be releasing more than just autonomous coding agents. Sivasubramanian called them a "new class" of agents, "not only a step function change in what you can do today; they move from assisting with individual tasks to complex projects."

The first is Kiro, an autonomous coding agent that has been in public preview since July. At the time, Kiro was billed as an alternative to vibe coding platforms like OpenAI's Codex or Windsurf. Similar to Codex and Google's myriad asynchronous coding agents, including Jules, Kiro can code, undertake reviews, fix bugs independently and determine the tasks it needs to accomplish.

The AWS security agent, meanwhile, embeds deep security expertise into applications from the start. The company said in a press release that users "define security standards once and AWS security agent automatically validates them across your applications during its review — helping teams address the risks that matter to their business, not generic checklists."

The AWS DevOps agent will help developers, especially those on call, proactively find system breaks or bugs. It can respond to incidents using its knowledge of the application or service. It also understands the relationships between the application and the tools it taps, such as Amazon CloudWatch, Datadog and Splunk, to trace the root cause of an issue.

Enterprises are interested in deploying agents and, eventually, bringing more autonomous agents into their workflows. And, while companies like AWS continue to bolster these agents with security and control, organizations are slowly figuring out how to connect them all.
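Richardson's description of policy as a layer that sits between the agent and the tools it calls, rather than inside the agent, can be made concrete with a small sketch. The Python below is a hypothetical illustration only, not AWS's AgentCore API: the `RefundPolicy` class, the function names, and the $100 threshold from the refund example above are stand-ins for whatever rules an enterprise would actually encode.

```python
from dataclasses import dataclass

# Hypothetical illustration of a policy layer that sits between an agent
# and its tools -- not AWS's actual AgentCore API.

@dataclass
class ToolCall:
    tool: str   # e.g. "issue_refund"
    args: dict  # arguments the agent wants to pass to the tool

class RefundPolicy:
    """Deny refunds above a threshold and route the customer to a human."""
    def __init__(self, max_refund: float = 100.0):
        self.max_refund = max_refund

    def check(self, call: ToolCall) -> tuple[bool, str]:
        if call.tool == "issue_refund" and call.args.get("amount", 0) > self.max_refund:
            return False, "Refund exceeds policy limit; escalate to a human agent."
        return True, "ok"

def execute_with_policy(call: ToolCall, policy: RefundPolicy):
    # The policy is evaluated outside the agent's reasoning loop, so a
    # prompt-injected agent still cannot complete a disallowed tool call.
    allowed, reason = policy.check(call)
    if not allowed:
        return {"status": "blocked", "reason": reason}
    return {"status": "executed", "tool": call.tool, "args": call.args}

print(execute_with_policy(ToolCall("issue_refund", {"amount": 250}), RefundPolicy()))
# -> {'status': 'blocked', 'reason': 'Refund exceeds policy limit; escalate to a human agent.'}
```

The point of placing the check in this position is the one Richardson makes: even if the agent's reasoning is subverted, the disallowed action never reaches the tool.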
The AI browser wars are heating up. OpenAI and other AI companies like Perplexity have gotten a lot of attention with their new AI-first and agentic browsers. They're being positioned as direct competition to Google, which currently holds a 70% share of the market with its Chrome browser. As the incumbent, Google has been slower to respond to the shift toward AI search; its move to integrate Gemini into Chrome is widely seen as playing catch-up to competitors that were AI-first from day one.

It's understandable, as a $100 billion business is an enormous, unwieldy beast to pivot. That leaves space for the newcomers, who are essentially starting with blank slates and free rein to innovate.

Enter Neo, released for worldwide general availability today — the next step in Norton's AI innovation journey, building on its leadership in cyber safety and its bid to deliver the world's first safe, zero-prompt AI browser. From the beginning, the minds behind Neo made a deliberate choice to focus on a proactive AI assistant rather than chase today's agentic trends. Even enthusiasts willing to tolerate the risks face too much unpredictability, along with new safety and privacy concerns.

Howie Xu, chief AI & innovation officer at Gen, describes Neo as a browser built to help before you ask — delivering on-page, in-flow support through summaries, reminders, and context-aware suggestions without prompts or extra steps.

"It's like having a highly intelligent assistant sitting next to me, helping me absorb and process information much more broadly, much faster, much deeper," Xu says. "That assistant is there when you're reading, when you're researching, when you're working on an online project. And based on your interests and browsing, your assistant can help you at every step."

Borrowing from Norton's unique consumer security expertise, privacy and safety have also been integrated from the ground up.

"What makes us unique is that we're giving people both peace of mind and AI functionality at the same time," Xu explains. "Norton's roots are in security. We're the only game in town that built an AI-native browser from the ground up with safety and privacy at its core — one that won't exploit or use your data for training."

The zero-prompt difference

Comet (Perplexity) and Atlas (OpenAI) were built by chat-first companies that assume users will actively ask questions. But getting value from AI takes cognitive effort: you need to know what to ask, shift into "question mode," and understand what the model can actually do. Asking a question isn't the hard part; realizing what to ask requires meta-cognition — awareness of what you don't know — which makes turning to ChatGPT in the middle of browsing feel harder than it should.

Neo takes the opposite approach. Instead of waiting for you to prompt it, it acts first — offering summaries, reminders, relevant news, and even questions you're likely to explore. "Based on my browsing interests, Neo reminds me of events I might want to attend, surfaces personalized news, and presents pre-generated questions that I actually want to explore," Xu explains. "In other words, I've never had to formulate a single prompt — I'm simply clicking on insights the AI has already anticipated for me as if I had been prompting."

Because most people don't know the boundaries of AI technology or how to phrase effective prompts, expecting them to drive the interaction is unrealistic. "We decided to shift the burden away from people.
You can still ask questions, of course, but we're designing for those who want less cognitive load and prefer AI to take the first step," he says.

Much like the recommendations that surface on any news or retail site, Neo leverages browsing context to surface the right content at the right moment. Neo can summarize a page and anticipate questions based on your interests and behaviors. With permission, it can also create detailed reminders — for example, noticing repeat visits to Formula 1 websites and prompting you about upcoming races. Control stays with the person using Neo: if an interest fades, they can remove it from Neo's Configurable Memory.

Because Neo's browsing history and preferences are stored locally and securely, it can customize prompts, insights, and suggestions — from calendar nudges to news recommendations to suggested questions in the Neo Chat interface. The result is an AI-powered browser that gives people the benefits of AI without typing prompts. Inline actions like "Summary," "Add to calendar?," "Resume where you left off," and "Price dropped" make browsing feel faster and lighter, without extra steps.

A calm-by-design experience grounded in security

"Calm by design" has guided Neo's development, and for Xu that comes down to three things: control, privacy, and security, all within a clean, streamlined experience that makes browsing faster and easier.

Rooted in Norton's decades of security expertise, Neo's calm experience starts with privacy and protection. Xu views it as the bedrock of Neo's approach: the company never knows what you're doing, because all personal data stays on the device unless explicitly permitted otherwise. Norton-backed security practices suppress prompt-injection risks common in other AI browsers, local processing keeps sensitive information contained, and scoped sync ensures only user-approved context carries across devices.

Norton also brings deep web intelligence: decades of scanning the vast majority of the internet and evolving antivirus capabilities that now understand both static and runtime web content. That real-time insight allows Neo's built-in antivirus, anti-phishing, and anti-scam technology to detect and shut down malicious behavior and content the moment it appears.

"When we think about calm, what we really mean is delivering value in a consistent way, in a reliable way, in a way that people can predict, so people have peace of mind," Xu says. "This is very different from the design of the agentic browsers out there, where the result is simply unpredictable, not to mention the associated latency and overhead. I believe consistency is a necessity for us to push an AI browser to a mass population. We have some flashy capabilities too, but our primary goal is that people can just use it in their daily lives without ever having to worry about all the vulnerabilities that most agentic browsers introduce. Since we're calm, reliable and safe by design, we believe we'll win the hearts of a mass audience."

For anyone watching the rapid shift toward AI-powered browsing, Neo shows how Norton is fusing assistance, security, and zero-prompt design into a single experience. See it in action at neobrowser.ai.

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they're always clearly marked. For more information, contact sales@venturebeat.com.
V3.2 drops, newcomer beats giants, AI training underground, job study results, and more...
Amazon Web Services (AWS) is leaning into the growing trend toward custom models with a new service that it says will let enterprises bring more personalization and internal knowledge to their models. The move comes alongside the release of new models in AWS's Nova family, which expand the capabilities of its reasoning lineup.

Nova 2 Lite, Nova 2 Pro, Nova 2 Sonic and Nova 2 Omni update the first Nova models AWS announced last year. Nova 2 Lite is a fast, cost-effective reasoning model optimized for everyday tasks that can process text, images and videos to generate text. Nova 2 Pro, which AWS said is its most intelligent reasoning model, can handle complex tasks such as coding agents, long-range planning and problem-solving. It can act as a "teacher" model for distillation projects. Nova 2 Sonic is a speech-to-speech model, while Nova 2 Omni enables organizations to generate both text and images from text, image and video inputs. Nova Act, AWS's browser agent — announced as an experimental development kit in April — is also powered by the Nova 2 models and now available to customers.

However, it is the custom model service, Nova Forge, that AWS is most excited about. The service gives customers the ability to introduce proprietary data to a pre-trained model without fear that the model will forget its previous training. Nova Forge allows enterprises to create custom, optimized versions of Nova models, which it calls "Novellas," and bring them directly to its Amazon Bedrock platform.

Custom model creation

Enterprises are increasingly turning to model distillation or custom models, especially with many industries choosing to create foundation models with domain-specific knowledge. But these can often be out of reach, as not every company can afford the Nvidia H100 GPUs needed to build models from scratch. As a result, they turn to heavily fine-tuned, off-the-shelf open-source models.

"You just don't have a great way to get a frontier model that deeply understands your data and your domain," AWS CEO Matt Garman said during his keynote speech at AWS's annual re:Invent conference. "But what if it was possible? What if you could integrate your data at the right time during the training of a frontier model, then create a proprietary model that was just for you?"

Nova Forge employs what AWS calls "open training," which allows developers to blend their proprietary data with an Amazon-curated dataset at every step of model development, with checkpoints during training. AWS said this means models will not regress on foundational capabilities, such as instruction following, while learning company-specific knowledge and instructions. Each "Novella" could be a custom version of Nova 2 Lite, with Nova's full knowledge and reasoning power, but with domain specificity. Right now, enterprises can only make Novellas from Nova 2 Lite, but AWS plans to expand the service to other Nova 2 models soon.

Nova Forge also offers enterprises "reinforcement learning gyms," which let them train AI systems in their own environments with simulated scenarios, create smaller, faster models and access responsible AI toolkits. Once companies create their Novellas, they can bring them to Bedrock to build more applications and agents. One customer currently using Nova Forge is Reddit, which integrated its own data and community-specific knowledge into a model to build a moderation program.
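AWS has not published the internals of "open training," so the snippet below is only a conceptual sketch of the idea Garman describes: proprietary examples are blended with a curated base corpus throughout training rather than bolted on at the end, with periodic checkpoints where earlier capabilities can be re-validated. All names, ratios and cadences here are illustrative assumptions, not Nova Forge code.

```python
import random

# Conceptual sketch of "open training"-style data blending -- illustrative only,
# not Nova Forge internals. Mixing ratio and checkpoint cadence are assumptions.

def blended_batches(curated_corpus, proprietary_corpus, proprietary_share=0.2,
                    batch_size=32, steps=400, checkpoint_every=100):
    for step in range(1, steps + 1):
        batch = []
        for _ in range(batch_size):
            source = proprietary_corpus if random.random() < proprietary_share else curated_corpus
            batch.append(random.choice(source))
        yield step, batch
        if step % checkpoint_every == 0:
            # A checkpoint is where baseline evals (instruction following, general QA)
            # would be re-run to confirm the model has not regressed.
            print(f"step {step}: save checkpoint and run regression evals")

curated = [f"general example {i}" for i in range(1000)]
proprietary = [f"company document {i}" for i in range(200)]
for step, batch in blended_batches(curated, proprietary):
    pass  # the actual training step would go here
```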
Nova Forge only works with Nova models, and AWS does not plan to bring in third-party open-source models hosted on Bedrock (for now).

Nova 2 models in detail

AWS said tens of thousands of companies now use its Nova models, and the company expects the Nova 2 models to see the same adoption. "Nova 2 Lite delivers incredible price performance for many workloads that we actually see our customers wanting to deliver in production," Garman said. "We think Nova 2 Lite will be the workhorse for many companies, while Pro will be for more complex tasks and for when you need your agents to be great."

In a press release, AWS said evaluations showed Nova 2 Lite performed "equal or better on 13 out of 15 benchmarks compared to Claude Haiku 4.5, equal or better on 11 out of 17 benchmarks compared to GPT-5 Mini and equal or better on 14 out of 18 benchmarks compared to Gemini Flash 2.5." Users can adjust how much Nova 2 Lite shows its step-by-step thinking to balance costs with depth.

Nova 2 Pro also performed well in benchmark testing compared to Claude Sonnet 4.5, GPT-5.1 and Gemini 2.5 Pro. This model works best for multi-document analysis, video reasoning, advanced math and agentic engineering tasks. AWS said in its press release that both Nova 2 Lite and Pro "have built-in grounding and code execution capabilities."

Nova 2 Sonic, the speech-to-speech model, generates human-like conversations and now supports multiple languages. The updated model has a 1-million-token context window, with more expressive voices and higher accuracy. The company said Sonic can even switch topics mid-conversation.

Nova 2 Omni handles "up to 750,000 words, hours of audio, long videos and hundred-page documents, simultaneously analyzing entire product catalogs, testimonials, brand guidelines and video libraries at once." "While there are no comparable models in the industry to Nova 2 Omni, it demonstrates strengths in public benchmarks of multimodal reasoning on documents, images, videos and audio, and can generate high-quality images similar to other leading image-generation models," AWS said in its release.
For much of 2025, the frontier of open-weight language models has been defined not in Silicon Valley or New York City, but in Beijing and Hangzhou. Chinese research labs including Alibaba's Qwen, DeepSeek, Moonshot and Baidu have rapidly set the pace in developing large-scale, open Mixture-of-Experts (MoE) models — often with permissive licenses and leading benchmark performance. While OpenAI fielded its own open-source, general-purpose LLMs this summer as well — gpt-oss-20B and gpt-oss-120B — uptake has been slowed by the number of alternatives that perform as well or better.

Now, one small U.S. company is pushing back. Today, Arcee AI announced the release of Trinity Mini and Trinity Nano Preview, the first two models in its new "Trinity" family — an open-weight MoE model suite fully trained in the United States. Users can try the former directly in a chatbot format on Arcee's new website, chat.arcee.ai, and developers can download both models from Hugging Face and run them themselves, as well as modify or fine-tune them to their liking — all for free under an enterprise-friendly Apache 2.0 license.

While small compared to the largest frontier models, these releases represent a rare attempt by a U.S. startup to build end-to-end open-weight models at scale — trained from scratch, on American infrastructure, using a U.S.-curated dataset pipeline.

"I'm experiencing a combination of extreme pride in my team and crippling exhaustion, so I'm struggling to put into words just how excited I am to have these models out," wrote Arcee Chief Technology Officer (CTO) Lucas Atkins in a post on the social network X (formerly Twitter). "Especially Mini."

A third model, Trinity Large, is already in training: a 420B parameter model with 13B active parameters per token, scheduled to launch in January 2026. "We want to add something that has been missing in that picture," Atkins wrote in the Trinity launch manifesto published on Arcee's website. "A serious open weight model family trained end to end in America… that businesses and developers can actually own."

From Small Models to Scaled Ambition

The Trinity project marks a turning point for Arcee AI, which until now has been known for its compact, enterprise-focused models. The company has raised $29.5 million in funding to date, including a $24 million Series A in 2024 led by Emergence Capital. Its previous releases include AFM-4.5B, a compact instruct-tuned model released in mid-2025, and SuperNova, an earlier 70B-parameter instruction-following model designed for in-VPC enterprise deployment. Both were aimed at solving the regulatory and cost issues plaguing proprietary LLM adoption in the enterprise.

With Trinity, Arcee is aiming higher: not just instruction tuning or post-training, but full-stack pretraining of open-weight foundation models — built for long-context reasoning, synthetic data adaptation, and future integration with live retraining systems. Originally conceived as a stepping stone to Trinity Large, both Mini and Nano emerged from early experimentation with sparse modeling and quickly became production targets themselves.

Technical Highlights

Trinity Mini is a 26B parameter model with 3B active per token, designed for high-throughput reasoning, function calling, and tool use. Trinity Nano Preview is a 6B parameter model with roughly 800M active non-embedding parameters — a more experimental, chat-focused model with a stronger personality, but lower reasoning robustness.
Both models use Arcee's new Attention-First Mixture-of-Experts (AFMoE) architecture, a custom MoE design blending global sparsity, local/global attention, and gated attention techniques. Inspired by recent advances from DeepSeek and Qwen, AFMoE departs from traditional MoE by tightly integrating sparse expert routing with an enhanced attention stack — including grouped-query attention, gated attention, and a local/global pattern that improves long-context reasoning.

Think of a typical MoE model like a call center with 128 specialized agents (called "experts") — but only a few are consulted for each call, depending on the question. This saves time and energy, since not every expert needs to weigh in. What makes AFMoE different is how it decides which agents to call and how it blends their answers. Most MoE models use a standard approach that picks experts based on a simple ranking. AFMoE, by contrast, uses a smoother method (called sigmoid routing) that's more like adjusting a volume dial than flipping a switch — letting the model blend multiple perspectives more gracefully.

The "attention-first" part means the model focuses heavily on how it pays attention to different parts of the conversation. Imagine reading a novel and remembering some parts more clearly than others based on importance, recency, or emotional impact — that's attention. AFMoE improves this by combining local attention (focusing on what was just said) with global attention (remembering key points from earlier), using a rhythm that keeps things balanced. Finally, AFMoE introduces something called gated attention, which acts like a volume control on each attention output — helping the model emphasize or dampen different pieces of information as needed, like adjusting how much you care about each voice in a group discussion.

All of this is designed to make the model more stable during training and more efficient at scale — so it can understand longer conversations, reason more clearly, and run faster without needing massive computing resources. Unlike many existing MoE implementations, AFMoE emphasizes stability at depth and training efficiency, using techniques like sigmoid-based routing without auxiliary loss, and depth-scaled normalization to support scaling without divergence.
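To make the routing distinction above concrete, here is a minimal toy comparison of the two styles the passage contrasts: standard top-k softmax routing versus sigmoid-style routing, where each expert gets an independent gate. It is an illustrative sketch based on the description above and on public MoE literature, not Arcee's actual AFMoE implementation; the dimensions and expert counts are arbitrary.

```python
import torch

# Toy comparison of two MoE routing styles -- illustrative only, not Arcee's AFMoE code.
torch.manual_seed(0)
num_experts, k, hidden = 8, 2, 16
x = torch.randn(4, hidden)                     # 4 tokens
router = torch.nn.Linear(hidden, num_experts)  # one score per expert per token
scores = router(x)

# Standard approach: softmax over all experts, then keep only the top-k ("flip the switch").
softmax_weights = torch.softmax(scores, dim=-1)
topk_vals, topk_idx = softmax_weights.topk(k, dim=-1)
topk_weights = topk_vals / topk_vals.sum(dim=-1, keepdim=True)  # renormalize over chosen experts

# Sigmoid-style routing: each expert gets an independent 0-1 gate ("volume dial"),
# so experts are not forced to compete through a single softmax.
sigmoid_gates = torch.sigmoid(scores)
sig_vals, sig_idx = sigmoid_gates.topk(k, dim=-1)
sig_weights = sig_vals / (sig_vals.sum(dim=-1, keepdim=True) + 1e-9)

print("softmax top-k weights:", topk_weights[0].tolist())
print("sigmoid top-k weights:", sig_weights[0].tolist())
```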
Model Capabilities

Trinity Mini adopts an MoE architecture with 128 experts, 8 active per token, and 1 always-on shared expert. Context windows reach up to 131,072 tokens, depending on provider. Benchmarks show Trinity Mini performing competitively with larger models across reasoning tasks, including outperforming gpt-oss on SimpleQA (which tests factual recall and whether the model admits uncertainty), MMLU (zero-shot, measuring broad academic knowledge and reasoning across many subjects without examples), and BFCL V3 (which evaluates multi-step function calling and real-world tool use). Reported scores include: MMLU (zero-shot): 84.95; Math-500: 92.10; GPQA-Diamond: 58.55; BFCL V3: 59.67.

Latency and throughput numbers across providers like Together and Clarifai show 200+ tokens per second throughput with sub-three-second end-to-end latency — making Trinity Mini viable for interactive applications and agent pipelines. Trinity Nano, while smaller and not as stable on edge cases, demonstrates the viability of a sparse MoE architecture at under 1B active parameters per token.

Access, Pricing, and Ecosystem Integration

Both Trinity models are released under the permissive, enterprise-friendly Apache 2.0 license, allowing unrestricted commercial and research use. Trinity Mini is available via Hugging Face, OpenRouter and chat.arcee.ai. API pricing for Trinity Mini via OpenRouter is $0.045 per million input tokens and $0.15 per million output tokens, and a free tier is available for a limited time on OpenRouter. The model is already integrated into apps including Benchable.ai, Open WebUI, and SillyTavern, and it's supported in Hugging Face Transformers, vLLM, LM Studio, and llama.cpp.
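Because OpenRouter exposes an OpenAI-compatible endpoint, calling Trinity Mini from existing tooling is largely a matter of pointing the client at a different base URL. The sketch below assumes the official OpenAI Python SDK and an OpenRouter API key; the model slug "arcee-ai/trinity-mini" is an assumption for illustration and should be checked against OpenRouter's model catalog.

```python
from openai import OpenAI

# Sketch of calling Trinity Mini through OpenRouter's OpenAI-compatible API.
# The model slug below is an assumption -- check OpenRouter's catalog for the real ID.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",
)

response = client.chat.completions.create(
    model="arcee-ai/trinity-mini",  # assumed slug, for illustration only
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize what a Mixture-of-Experts model is in two sentences."},
    ],
)
print(response.choices[0].message.content)

# At the listed rates ($0.045 per million input tokens, $0.15 per million output tokens),
# a call with 2,000 input and 500 output tokens costs roughly
# 2000/1e6 * 0.045 + 500/1e6 * 0.15 = about $0.000165.
```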
Data Without Compromise: DatologyAI's Role

Central to Arcee's approach is control over training data — a sharp contrast to many open models trained on web-scraped or legally ambiguous datasets. That's where DatologyAI, a data curation startup co-founded by former Meta and DeepMind researcher Ari Morcos, plays a critical role.

DatologyAI's platform automates data filtering, deduplication, and quality enhancement across modalities, ensuring Arcee's training corpus avoids the pitfalls of noisy, biased, or copyright-risk content. For Trinity, DatologyAI helped construct a 10 trillion token curriculum organized into three phases: 7T general data, 1.8T high-quality text, and 1.2T STEM-heavy material, including math and code. This is the same partnership that powered Arcee's AFM-4.5B — but scaled significantly in both size and complexity. According to Arcee, it was Datology's filtering and data-ranking tools that allowed Trinity to scale cleanly while improving performance on tasks like mathematics, QA, and agent tool use.

Datology's contribution also extends into synthetic data generation. For Trinity Large, the company has produced over 10 trillion synthetic tokens — paired with 10T curated web tokens — to form a 20T-token training corpus for the full-scale model now in progress.

Building the Infrastructure to Compete: Prime Intellect

Arcee's ability to execute full-scale training in the U.S. is also thanks to its infrastructure partner, Prime Intellect. The startup, founded in early 2024, began with a mission to democratize access to AI compute by building a decentralized GPU marketplace and training stack. While Prime Intellect made headlines with its distributed training of INTELLECT-1 — a 10B parameter model trained across contributors in five countries — its more recent work, including the 106B INTELLECT-3, acknowledges the tradeoffs of scale: distributed training works, but for 100B+ models, centralized infrastructure is still more efficient.

For Trinity Mini and Nano, Prime Intellect supplied the orchestration stack, a modified TorchTitan runtime, and the physical compute environment: 512 H200 GPUs in a custom bf16 pipeline, running high-efficiency HSDP parallelism. It is also hosting the 2,048 B300 GPU cluster used to train Trinity Large.

The collaboration shows the difference between branding and execution. While Prime Intellect's long-term goal remains decentralized compute, its short-term value for Arcee lies in efficient, transparent training infrastructure — infrastructure that remains under U.S. jurisdiction, with known provenance and security controls.

A Strategic Bet on Model Sovereignty

Arcee's push into full pretraining reflects a broader thesis: that the future of enterprise AI will depend on owning the training loop — not just fine-tuning. As systems evolve to adapt from live usage and interact with tools autonomously, compliance and control over training objectives will matter as much as performance.

"As applications get more ambitious, the boundary between 'model' and 'product' keeps moving," Atkins noted in Arcee's Trinity manifesto. "To build that kind of software you need to control the weights and the training pipeline, not only the instruction layer."

This framing sets Trinity apart from other open-weight efforts. Rather than patching someone else's base model, Arcee has built its own — from data to deployment, infrastructure to optimizer — alongside partners who share that vision of openness and sovereignty.

Looking Ahead: Trinity Large

Training is currently underway for Trinity Large, Arcee's 420B parameter MoE model, using the same AFMoE architecture scaled to a larger expert set. The dataset includes 20T tokens, split evenly between synthetic data from DatologyAI and curated web data. The model is expected to launch in January 2026, with a full technical report to follow shortly thereafter. If successful, it would make Trinity Large one of the only fully open-weight, U.S.-trained frontier-scale models — positioning Arcee as a serious player in the open ecosystem at a time when most American LLM efforts are either closed or based on non-U.S. foundations.

A recommitment to U.S. open source

In a landscape where the most ambitious open-weight models are increasingly shaped by Chinese research labs, Arcee's Trinity launch signals a rare shift in direction: an attempt to reclaim ground for transparent, U.S.-controlled model development. Backed by specialized partners in data and infrastructure, and built from scratch for long-term adaptability, Trinity is a bold statement about the future of U.S. AI development, showing that small, lesser-known companies can still push boundaries and innovate in the open even as the industry is increasingly productized and commoditized.

What remains to be seen is whether Trinity Large can match the capabilities of its better-funded peers. But with Mini and Nano already in use, and a strong architectural foundation in place, Arcee may already be proving its central thesis: that model sovereignty, not just model size, will define the next era of AI.
Christmas connections, Copilot's costs, careful (no-)choices
The post The Machine Learning Lessons I’ve Learned This Month appeared first on Towards Data Science.
We're bringing our most intelligent model yet, Gemini 3 Pro, to Google Search in more countries around the world.
This first day of the Advent Calendar introduces the k-NN regressor, the simplest distance-based model. Using Excel, we explore how predictions rely entirely on the closest observations, why feature scaling matters, and how heterogeneous variables can make distances meaningless. Through examples with continuous and categorical features, including the California Housing and Diamonds datasets, we see the strengths and limitations of k-NN, and why defining the right distance is essential to reflect real-world structure.
The post The Machine Learning “Advent Calendar” Day 1: k-NN Regressor in Excel appeared first on Towards Data Science.
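The article works entirely in Excel, but the same idea fits in a few lines of Python. The sketch below, using scikit-learn and the California Housing data mentioned in the summary, simply illustrates the two points the piece stresses: the prediction for each point comes from its k nearest neighbors, and distances are only meaningful once features are scaled.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# k-NN regression: each prediction is the average target of the k closest neighbors.
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Without scaling, features measured in large units (e.g. population) dominate the
# distance; StandardScaler puts all features on a comparable footing first.
unscaled = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)).fit(X_train, y_train)

print("R^2 without scaling:", round(unscaled.score(X_test, y_test), 3))
print("R^2 with scaling:   ", round(scaled.score(X_test, y_test), 3))
```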
Chinese artificial intelligence startup DeepSeek released two powerful new AI models on Sunday that the company claims match or exceed the capabilities of OpenAI's GPT-5 and Google's Gemini-3.0-Pro — a development that could reshape the competitive landscape between American tech giants and their Chinese challengers.

The Hangzhou-based company launched DeepSeek-V3.2, designed as an everyday reasoning assistant, alongside DeepSeek-V3.2-Speciale, a high-powered variant that achieved gold-medal performance in four elite international competitions: the 2025 International Mathematical Olympiad, the International Olympiad in Informatics, the ICPC World Finals, and the China Mathematical Olympiad.

The release carries profound implications for American technology leadership. DeepSeek has once again demonstrated that it can produce frontier AI systems despite U.S. export controls that restrict China's access to advanced Nvidia chips — and it has done so while making its models freely available under an open-source MIT license.

"People thought DeepSeek gave a one-time breakthrough but we came back much bigger," wrote Chen Fang, who identified himself as a contributor to the project, on X (formerly Twitter). The release drew swift reactions online, with one user declaring: "Rest in peace, ChatGPT."

How DeepSeek's sparse attention breakthrough slashes computing costs

At the heart of the new release lies DeepSeek Sparse Attention, or DSA — a novel architectural innovation that dramatically reduces the computational burden of running AI models on long documents and complex tasks. Traditional AI attention mechanisms, the core technology allowing language models to understand context, scale poorly as input length increases: processing a document twice as long typically requires four times the computation. DeepSeek's approach breaks this constraint using what the company calls a "lightning indexer" that identifies only the most relevant portions of context for each query, ignoring the rest.

According to DeepSeek's technical report, DSA reduces inference costs by roughly half compared to previous models when processing long sequences. The architecture "substantially reduces computational complexity while preserving model performance," the report states. Decoding with a 128,000-token context — roughly equivalent to a 300-page book — now costs approximately $0.70 per million tokens, compared to $2.40 for the previous V3.1-Terminus model, a roughly 70% reduction in inference costs.

The 685-billion-parameter models support context windows of 128,000 tokens, making them suitable for analyzing lengthy documents, codebases, and research papers. DeepSeek's technical report notes that independent evaluations on long-context benchmarks show V3.2 performing on par with or better than its predecessor "despite incorporating a sparse attention mechanism."
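DeepSeek's technical report gives the exact formulation, but the core intuition of a "lightning indexer" can be sketched in a few lines: score every context token cheaply, keep only the top-scoring positions, and run full attention over that small subset. The toy example below is an illustration of that idea under assumed shapes and a simplified scoring rule, not DeepSeek's DSA implementation.

```python
import torch

# Toy sketch of sparse attention with a cheap "indexer" -- illustrative only,
# not DeepSeek's DSA implementation. Shapes and the scoring rule are assumptions.
torch.manual_seed(0)
seq_len, dim, keep = 1024, 64, 128
query = torch.randn(dim)
keys = torch.randn(seq_len, dim)
values = torch.randn(seq_len, dim)

# 1) Cheap indexer: score every context position and keep only the top-k.
index_scores = keys @ query              # one dot product per position
topk = index_scores.topk(keep).indices   # the "most relevant portions" of context

# 2) Full attention over just the selected subset (128 positions instead of 1024).
sub_keys, sub_values = keys[topk], values[topk]
attn = torch.softmax((sub_keys @ query) / dim ** 0.5, dim=-1)
output = attn @ sub_values

print(f"attended to {keep} of {seq_len} positions; output shape: {tuple(output.shape)}")
```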
The benchmark results that put DeepSeek in the same league as GPT-5

DeepSeek's claims of parity with America's leading AI systems rest on extensive testing across mathematics, coding, and reasoning tasks — and the numbers are striking. On AIME 2025, a prestigious American mathematics competition, DeepSeek-V3.2-Speciale achieved a 96.0% pass rate, compared to 94.6% for GPT-5-High and 95.0% for Gemini-3.0-Pro. On the Harvard-MIT Mathematics Tournament, the Speciale variant scored 99.2%, surpassing Gemini's 97.5%. The standard V3.2 model, optimized for everyday use, scored 93.1% on AIME and 92.5% on HMMT — marginally below frontier models but achieved with substantially fewer computational resources.

Most striking are the competition results. DeepSeek-V3.2-Speciale scored 35 out of 42 points on the 2025 International Mathematical Olympiad, earning gold-medal status. At the International Olympiad in Informatics, it scored 492 out of 600 points — also gold, ranking 10th overall. The model solved 10 of 12 problems at the ICPC World Finals, placing second. These results came without internet access or tools during testing. DeepSeek's report states that "testing strictly adheres to the contest's time and attempt limits."

On coding benchmarks, DeepSeek-V3.2 resolved 73.1% of real-world software bugs on SWE-Verified, competitive with GPT-5-High at 74.9%. On Terminal Bench 2.0, measuring complex coding workflows, DeepSeek scored 46.4% — well above GPT-5-High's 35.2%. The company acknowledges limitations. "Token efficiency remains a challenge," the technical report states, noting that DeepSeek "typically requires longer generation trajectories" to match Gemini-3.0-Pro's output quality.

Why teaching AI to think while using tools changes everything

Beyond raw reasoning, DeepSeek-V3.2 introduces "thinking in tool-use" — the ability to reason through problems while simultaneously executing code, searching the web, and manipulating files. Previous AI models faced a frustrating limitation: each time they called an external tool, they lost their train of thought and had to restart reasoning from scratch. DeepSeek's architecture preserves the reasoning trace across multiple tool calls, enabling fluid multi-step problem solving.

To train this capability, the company built a massive synthetic data pipeline generating over 1,800 distinct task environments and 85,000 complex instructions. These included challenges like multi-day trip planning with budget constraints, software bug fixes across eight programming languages, and web-based research requiring dozens of searches. The technical report describes one example: planning a three-day trip from Hangzhou with constraints on hotel prices, restaurant ratings, and attraction costs that vary based on accommodation choices. Such tasks are "hard to solve but easy to verify," making them ideal for training AI agents. DeepSeek employed real-world tools during training — actual web search APIs, coding environments, and Jupyter notebooks — while generating synthetic prompts to ensure diversity. The result is a model that generalizes to unseen tools and environments, a critical capability for real-world deployment.

DeepSeek's open-source gambit could upend the AI industry's business model

Unlike OpenAI and Anthropic, which guard their most powerful models as proprietary assets, DeepSeek has released both V3.2 and V3.2-Speciale under the MIT license — one of the most permissive open-source frameworks available. Any developer, researcher, or company can download, modify, and deploy the 685-billion-parameter models without restriction. Full model weights, training code, and documentation are available on Hugging Face, the leading platform for AI model sharing.

The strategic implications are significant. By making frontier-capable models freely available, DeepSeek undermines competitors charging premium API prices.
The Hugging Face model card notes that DeepSeek has provided Python scripts and test cases "demonstrating how to encode messages in OpenAI-compatible format" — making migration from competing services straightforward. For enterprise customers, the value proposition is compelling: frontier performance at dramatically lower cost, with deployment flexibility. But data residency concerns and regulatory uncertainty may limit adoption in sensitive applications — particularly given DeepSeek's Chinese origins.

Regulatory walls are rising against DeepSeek in Europe and America

DeepSeek's global expansion faces mounting resistance. In June, Berlin's data protection commissioner Meike Kamp declared that DeepSeek's transfer of German user data to China is "unlawful" under EU rules, asking Apple and Google to consider blocking the app. The German authority expressed concern that "Chinese authorities have extensive access rights to personal data within the sphere of influence of Chinese companies." Italy ordered DeepSeek to block its app in February. U.S. lawmakers have moved to ban the service from government devices, citing national security concerns.

Questions also persist about U.S. export controls designed to limit China's AI capabilities. In August, DeepSeek hinted that China would soon have "next generation" domestically built chips to support its models. The company indicated its systems work with Chinese-made chips from Huawei and Cambricon without additional setup. DeepSeek's original V3 model was reportedly trained on roughly 2,000 older Nvidia H800 chips — hardware since restricted for export to China. The company has not disclosed what powered V3.2 training, but its continued advancement suggests export controls alone cannot halt Chinese AI progress.

What DeepSeek's release means for the future of AI competition

The release arrives at a pivotal moment. After years of massive investment, some analysts question whether an AI bubble is forming. DeepSeek's ability to match American frontier models at a fraction of the cost challenges assumptions that AI leadership requires enormous capital expenditure. The company's technical report reveals that post-training investment now exceeds 10% of pre-training costs — a substantial allocation credited for reasoning improvements. But DeepSeek acknowledges gaps: "The breadth of world knowledge in DeepSeek-V3.2 still lags behind leading proprietary models," the report states. The company plans to address this by scaling pre-training compute.

DeepSeek-V3.2-Speciale remains available through a temporary API until December 15, when its capabilities will merge into the standard release. The Speciale variant is designed exclusively for deep reasoning and does not support tool calling — a limitation the standard model addresses.

For now, the AI race between the United States and China has entered a new phase. DeepSeek's release demonstrates that open-source models can achieve frontier performance, that efficiency innovations can slash costs dramatically, and that the most powerful AI systems may soon be freely available to anyone with an internet connection. As one commenter on X observed: "Deepseek just casually breaking those historic benchmarks set by Gemini is bonkers."

The question is no longer whether Chinese AI can compete with Silicon Valley. It's whether American companies can maintain their lead when their Chinese rival gives comparable technology away for free.
How Atlas and most current AI-powered browsers fail on three aspects: privacy, security, and censorship
The post The Problem with AI Browsers: Security Flaws and the End of Privacy appeared first on Towards Data Science.
When Liquid AI, a startup founded by MIT computer scientists back in 2023, introduced its Liquid Foundation Models series 2 (LFM2) in July 2025, the pitch was straightforward: deliver the fastest on-device foundation models on the market using the new "liquid" architecture, with training and inference efficiency that made small models a serious alternative to cloud-only large language models (LLMs) such as OpenAI's GPT series and Google's Gemini. The initial release shipped dense checkpoints at 350M, 700M, and 1.2B parameters, a hybrid architecture heavily weighted toward gated short convolutions, and benchmark numbers that placed LFM2 ahead of similarly sized competitors like Qwen3, Llama 3.2, and Gemma 3 on both quality and CPU throughput. The message to enterprises was clear: real-time, privacy-preserving AI on phones, laptops, and vehicles no longer required sacrificing capability for latency.

In the months since that launch, Liquid has expanded LFM2 into a broader product line — adding task-and-domain-specialized variants, a small video ingestion and analysis model, and an edge-focused deployment stack called LEAP — and positioned the models as the control layer for on-device and on-prem agentic systems. Now, with the publication of the detailed, 51-page LFM2 technical report on arXiv, the company is going a step further: making public the architecture search process, training data mixture, distillation objective, curriculum strategy, and post-training pipeline behind those models. And unlike earlier open models, LFM2 is built around a repeatable recipe: a hardware-in-the-loop search process, a training curriculum that compensates for smaller parameter budgets, and a post-training pipeline tuned for instruction following and tool use. Rather than just offering weights and an API, Liquid is effectively publishing a detailed blueprint that other organizations can use as a reference for training their own small, efficient models from scratch, tuned to their own hardware and deployment constraints.

A model family designed around real constraints, not GPU labs

The technical report begins with a premise enterprises are intimately familiar with: real AI systems hit limits long before benchmarks do. Latency budgets, peak memory ceilings, and thermal throttling define what can actually run in production — especially on laptops, tablets, commodity servers, and mobile devices. To address this, Liquid AI performed architecture search directly on target hardware, including Snapdragon mobile SoCs and Ryzen laptop CPUs. The result is a consistent outcome across sizes: a minimal hybrid architecture dominated by gated short convolution blocks and a small number of grouped-query attention (GQA) layers. This design was repeatedly selected over more exotic linear-attention and SSM hybrids because it delivered a better quality-latency-memory Pareto profile under real device conditions.

This matters for enterprise teams in three ways:

Predictability. The architecture is simple, parameter-efficient, and stable across model sizes from 350M to 2.6B.

Operational portability. Dense and MoE variants share the same structural backbone, simplifying deployment across mixed hardware fleets.
On-device feasibility. Prefill and decode throughput on CPUs surpasses comparable open models by roughly 2× in many cases, reducing the need to offload routine tasks to cloud inference endpoints.

Instead of optimizing for academic novelty, the report reads as a systematic attempt to design models enterprises can actually ship. This is notable, and more practical for enterprises, in a field where many open models quietly assume access to multi-H100 clusters during inference.

A training pipeline tuned for enterprise-relevant behavior

LFM2 adopts a training approach that compensates for the smaller scale of its models with structure rather than brute force. Key elements include:

10–12T token pre-training and an additional 32K-context mid-training phase, which extends the model's useful context window without exploding compute costs.

A decoupled Top-K knowledge distillation objective that sidesteps the instability of standard KL distillation when teachers provide only partial logits (a simplified sketch follows below).

A three-stage post-training sequence — SFT, length-normalized preference alignment, and model merging — designed to produce more reliable instruction following and tool-use behavior.

For enterprise AI developers, the significance is that LFM2 models behave less like "tiny LLMs" and more like practical agents able to follow structured formats, adhere to JSON schemas, and manage multi-turn chat flows. Many open models at similar sizes fail not due to lack of reasoning ability, but due to brittle adherence to instruction templates. The LFM2 post-training recipe directly targets these rough edges. In other words: Liquid AI optimized small models for operational reliability, not just scoreboards.
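The report describes Liquid AI's distillation objective in detail; the snippet below is only a simplified sketch of the general idea of distilling from a teacher's top-K logits, where the student is trained to match a renormalized distribution over just the K token IDs the teacher actually exposes. It follows common knowledge-distillation practice and the description above, not Liquid AI's exact decoupled loss.

```python
import torch
import torch.nn.functional as F

# Simplified sketch of knowledge distillation from top-K teacher logits --
# illustrative only, not Liquid AI's exact decoupled objective.
torch.manual_seed(0)
vocab, k = 50_000, 32
student_logits = torch.randn(4, vocab, requires_grad=True)   # 4 positions in a batch
teacher_logits = torch.randn(4, vocab)

# The teacher only exposes its K highest-probability tokens per position.
top_vals, top_ids = teacher_logits.topk(k, dim=-1)
teacher_topk_probs = F.softmax(top_vals, dim=-1)              # renormalized over the K tokens

# The student is scored on the same K token IDs, so the loss is well defined
# even though the full teacher distribution is never available.
student_topk_logits = student_logits.gather(-1, top_ids)
student_topk_logprobs = F.log_softmax(student_topk_logits, dim=-1)

kd_loss = F.kl_div(student_topk_logprobs, teacher_topk_probs, reduction="batchmean")
kd_loss.backward()
print("top-K distillation loss:", kd_loss.item())
```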
Multimodality designed for device constraints, not lab demos

The LFM2-VL and LFM2-Audio variants reflect another shift: multimodality built around token efficiency. Rather than embedding a massive vision transformer directly into an LLM, LFM2-VL attaches a SigLIP2 encoder through a connector that aggressively reduces visual token count via PixelUnshuffle. High-resolution inputs automatically trigger dynamic tiling, keeping token budgets controllable even on mobile hardware. LFM2-Audio uses a bifurcated audio path — one for embeddings, one for generation — supporting real-time transcription or speech-to-speech on modest CPUs.

For enterprise platform architects, this design points toward a practical future where document understanding happens directly on endpoints such as field devices; audio transcription and speech agents run locally for privacy compliance; and multimodal agents operate within fixed latency envelopes without streaming data off-device. The through-line is the same: multimodal capability without requiring a GPU farm.

Retrieval models built for agent systems, not legacy search

LFM2-ColBERT extends late-interaction retrieval into a footprint small enough for enterprise deployments that need multilingual RAG without the overhead of specialized vector DB accelerators. This is particularly meaningful as organizations begin to orchestrate fleets of agents. Fast local retrieval — running on the same hardware as the reasoning model — reduces latency and provides a governance win: documents never leave the device boundary. Taken together, the VL, Audio, and ColBERT variants show LFM2 as a modular system, not a single model drop.

The emerging blueprint for hybrid enterprise AI architectures

Across all variants, the LFM2 report implicitly sketches what tomorrow's enterprise AI stack will look like: hybrid local-cloud orchestration, where small, fast models operating on devices handle time-critical perception, formatting, tool invocation, and judgment tasks, while larger models in the cloud offer heavyweight reasoning when needed. Several trends converge here:

Cost control. Running routine inference locally avoids unpredictable cloud billing.

Latency determinism. Time to first token (TTFT) and decode stability matter in agent workflows; on-device execution eliminates network jitter.

Governance and compliance. Local execution simplifies PII handling, data residency, and auditability.

Resilience. Agentic systems degrade gracefully if the cloud path becomes unavailable.

Enterprises adopting these architectures will likely treat small on-device models as the "control plane" of agentic workflows, with large cloud models serving as on-demand accelerators. LFM2 is one of the clearest open-source foundations for that control layer to date.

The strategic takeaway: on-device AI is now a design choice, not a compromise

For years, organizations building AI features have accepted that "real AI" requires cloud inference. LFM2 challenges that assumption. The models perform competitively across reasoning, instruction following, multilingual tasks, and RAG — while simultaneously achieving substantial latency gains over other open small-model families. For CIOs and CTOs finalizing 2026 roadmaps, the implication is direct: small, open, on-device models are now strong enough to carry meaningful slices of production workloads.

LFM2 will not replace frontier cloud models for frontier-scale reasoning. But it offers something enterprises arguably need more: a reproducible, open, and operationally feasible foundation for agentic systems that must run anywhere, from phones to industrial endpoints to air-gapped secure facilities. In the broadening landscape of enterprise AI, LFM2 is less a research milestone and more a sign of architectural convergence. The future is not cloud or edge — it's both, operating in concert. And releases like LFM2 provide the building blocks for organizations prepared to build that hybrid future intentionally rather than accidentally.
Build a lightweight Python DSL to define and check data quality rules in a clear, expressive way. Turn complex validation logic into simple, reusable configurations that anyone on your data team can understand.
Welcome back to The State of AI, a new collaboration between the Financial Times and MIT Technology Review. Every Monday for the next two weeks, writers from both publications will debate one aspect of the generative AI revolution reshaping global power. You can read the rest of the series here. This week, Richard Waters, FT…
AquaCulture Shock program, in collaboration with MIT-Scandinavia MISTI, offers international internships for AI and autonomy in aquaculture
At MITEI’s Fall Colloquium, General Motors’ battery development expert emphasized how affordability, accessibility, and commercialization can position the US as a leader in battery tech.
Vyacheslav Efimov on AI hackathons, data science roadmaps, and how AI meaningfully changed day-to-day ML Engineer work
The post Learning, Hacking, and Shipping ML appeared first on Towards Data Science.
It’s not about clever wording anymore. It’s about designing environments where AI can think with depth, consistency, and purpose.
Large language models (LLMs) are trained mainly to generate text responses to user queries or prompts. Under the hood, that involves complex reasoning: not just producing language by predicting each next token in the output sequence, but also building a deep understanding of the linguistic patterns in the user's input text.
An accidental leak revealed that Flock, which has cameras in thousands of US communities, is using workers in the Philippines to review and classify footage.
A stealth artificial intelligence startup founded by an MIT researcher emerged this morning with an ambitious claim: its new AI model can control computers better than systems built by OpenAI and Anthropic — at a fraction of the cost.

OpenAGI, led by chief executive Zengyi Qin, released Lux, a foundation model designed to operate computers autonomously by interpreting screenshots and executing actions across desktop applications. The San Francisco-based company says Lux achieves an 83.6 percent success rate on Online-Mind2Web, a benchmark that has become the industry's most rigorous test for evaluating AI agents that control computers. That score is a significant leap over the leading models from well-funded competitors. OpenAI's Operator, released in January, scores 61.3 percent on the same benchmark. Anthropic's Claude Computer Use achieves 56.3 percent.

"Traditional LLM training feeds a large amount of text corpus into the model. The model learns to produce text," Qin said in an exclusive interview with VentureBeat. "By contrast, our model learns to produce actions. The model is trained with a large amount of computer screenshots and action sequences, allowing it to produce actions to control the computer."

The announcement arrives at a pivotal moment for the AI industry. Technology giants and startups alike have poured billions of dollars into developing autonomous agents capable of navigating software, booking travel, filling out forms, and executing complex workflows. OpenAI, Anthropic, Google, and Microsoft have all released or announced agent products in the past year, betting that computer-controlling AI will become as transformative as chatbots. Yet independent research has cast doubt on whether current agents are as capable as their creators suggest.

Why university researchers built a tougher benchmark to test AI agents — and what they discovered

The Online-Mind2Web benchmark, developed by researchers at Ohio State University and the University of California, Berkeley, was designed specifically to expose the gap between marketing claims and actual performance. Published in April and accepted to the Conference on Language Modeling 2025, the benchmark comprises 300 diverse tasks across 136 real websites — everything from booking flights to navigating complex e-commerce checkouts. Unlike earlier benchmarks that cached parts of websites, Online-Mind2Web tests agents in live online environments where pages change dynamically and unexpected obstacles appear.

The results, according to the researchers, painted "a very different picture of the competency of current agents, suggesting over-optimism in previously reported results." When the Ohio State team tested five leading web agents with careful human evaluation, they found that many recent systems — despite heavy investment and marketing fanfare — did not outperform SeeAct, a relatively simple agent released in January 2024. Even OpenAI's Operator, the best performer among commercial offerings in their study, achieved only 61 percent success.

"It seemed that highly capable and practical agents were maybe indeed just months away," the researchers wrote in a blog post accompanying their paper.
"However, we are also well aware that there are still many fundamental gaps in research to fully autonomous agents, and current agents are probably not as competent as the reported benchmark numbers may depict."The benchmark has gained traction as an industry standard, with a public leaderboard hosted on Hugging Face tracking submissions from research groups and companies.How OpenAGI trained its AI to take actions instead of just generating textOpenAGI's claimed performance advantage stems from what the company calls "Agentic Active Pre-training," a training methodology that differs fundamentally from how most large language models learn.Conventional language models train on vast text corpora, learning to predict the next word in a sequence. The resulting systems excel at generating coherent text but were not designed to take actions in graphical environments.Lux, according to Qin, takes a different approach. The model trains on computer screenshots paired with action sequences, learning to interpret visual interfaces and determine which clicks, keystrokes, and navigation steps will accomplish a given goal."The action allows the model to actively explore the computer environment, and such exploration generates new knowledge, which is then fed back to the model for training," Qin told VentureBeat. "This is a naturally self-evolving process, where a better model produces better exploration, better exploration produces better knowledge, and better knowledge leads to a better model."This self-reinforcing training loop, if it functions as described, could help explain how a smaller team might achieve results that elude larger organizations. Rather than requiring ever-larger static datasets, the approach would allow the model to continuously improve by generating its own training data through exploration.OpenAGI also claims significant cost advantages. The company says Lux operates at roughly one-tenth the cost of frontier models from OpenAI and Anthropic while executing tasks faster.Unlike browser-only competitors, Lux can control Slack, Excel, and other desktop applicationsA critical distinction in OpenAGI's announcement: Lux can control applications across an entire desktop operating system, not just web browsers.Most commercially available computer-use agents, including early versions of Anthropic's Claude Computer Use, focus primarily on browser-based tasks. That limitation excludes vast categories of productivity work that occur in desktop applications — spreadsheets in Microsoft Excel, communications in Slack, design work in Adobe products, code editing in development environments.OpenAGI says Lux can navigate these native applications, a capability that would substantially expand the addressable market for computer-use agents. The company is releasing a developer software development kit alongside the model, allowing third parties to build applications on top of Lux.The company is also working with Intel to optimize Lux for edge devices, which would allow the model to run locally on laptops and workstations rather than requiring cloud infrastructure. 
That partnership could address enterprise concerns about sending sensitive screen data to external servers.
"We are partnering with Intel to optimize our model on edge devices, which will make it the best on-device computer-use model," Qin said.
The company confirmed it is in exploratory discussions with AMD and Microsoft about additional partnerships.
What happens when you ask an AI agent to copy your bank details
Computer-use agents present novel safety challenges that do not arise with conventional chatbots. An AI system capable of clicking buttons, entering text, and navigating applications could, if misdirected, cause significant harm — transferring money, deleting files, or exfiltrating sensitive information.
OpenAGI says it has built safety mechanisms directly into Lux. When the model encounters requests that violate its safety policies, it refuses to proceed and alerts the user.
In an example provided by the company, when a user asked the model to "copy my bank details and paste it into a new Google doc," Lux responded with an internal reasoning step: "The user asks me to copy the bank details, which are sensitive information. Based on the safety policy, I am not able to perform this action." The model then issued a warning to the user rather than executing the potentially dangerous request.
Such safeguards will face intense scrutiny as computer-use agents proliferate. Security researchers have already demonstrated prompt injection attacks against early agent systems, where malicious instructions embedded in websites or documents can hijack an agent's behavior. Whether Lux's safety mechanisms can withstand adversarial attacks remains to be tested by independent researchers.
The MIT researcher who built two of GitHub's most downloaded AI models
Qin brings an unusual combination of academic credentials and entrepreneurial experience to OpenAGI.
He completed his doctorate at the Massachusetts Institute of Technology in 2025, where his research focused on computer vision, robotics, and machine learning. His academic work appeared in top venues including the Conference on Computer Vision and Pattern Recognition, the International Conference on Learning Representations, and the International Conference on Machine Learning.
Before founding OpenAGI, Qin built several widely adopted AI systems. JetMoE, a large language model he led development on, demonstrated that a high-performing model could be trained from scratch for less than $100,000 — a fraction of the tens of millions typically required. The model outperformed Meta's LLaMA2-7B on standard benchmarks, according to a technical report that attracted attention from MIT's Computer Science and Artificial Intelligence Laboratory.
His previous open-source projects achieved remarkable adoption. OpenVoice, a voice cloning model, accumulated approximately 35,000 stars on GitHub and ranked in the top 0.03 percent of open-source projects by popularity. MeloTTS, a text-to-speech system, has been downloaded more than 19 million times, making it one of the most widely used audio AI models since its 2024 release.
Qin also co-founded MyShell, an AI agent platform that has attracted six million users who have collectively built more than 200,000 AI agents.
Users have had more than one billion interactions with agents on the platform, according to the company.
Inside the billion-dollar race to build AI that controls your computer
The computer-use agent market has attracted intense interest from investors and technology giants over the past year.
OpenAI released Operator in January, allowing users to instruct an AI to complete tasks across the web. Anthropic has continued developing Claude Computer Use, positioning it as a core capability of its Claude model family. Google has incorporated agent features into its Gemini products. Microsoft has integrated agent capabilities across its Copilot offerings and Windows.
Yet the market remains nascent. Enterprise adoption has been limited by concerns about reliability, security, and the ability to handle edge cases that occur frequently in real-world workflows. The performance gaps revealed by benchmarks like Online-Mind2Web suggest that current systems may not be ready for mission-critical applications.
OpenAGI enters this competitive landscape as an independent alternative, positioning superior benchmark performance and lower costs against the massive resources of its well-funded rivals. The company's Lux model and developer SDK are available beginning today.
Whether OpenAGI can translate benchmark dominance into real-world reliability remains the central question. The AI industry has a long history of impressive demos that falter in production, of laboratory results that crumble against the chaos of actual use. Benchmarks measure what they measure, and the distance between a controlled test and an 8-hour workday full of edge cases, exceptions, and surprises can be vast.
But if Lux performs in the wild the way it performs in the lab, the implications extend far beyond one startup's success. It would suggest that the path to capable AI agents runs not through the largest checkbooks but through the cleverest architectures — that a small team with the right ideas can outmaneuver the giants.
The technology industry has seen that story before. It rarely stays true for long.
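Qin's description of "Agentic Active Pre-training" (explore the environment, keep the knowledge that exploration generates, and feed it back into training) can be pictured as a simple loop. The Python sketch below is purely illustrative: every name in it, from active_pretraining_loop to environment.step and model.train_on, is a placeholder assumption rather than OpenAGI's actual code or API.

```python
# Illustrative sketch only: a simplified, hypothetical version of the
# "self-evolving" exploration-and-training loop Qin describes. None of
# these names come from OpenAGI; they are placeholders.

def active_pretraining_loop(model, environment, num_rounds=10):
    """Alternate between exploring the computer environment and retraining
    on the (screenshot, action, outcome) data that exploration produces."""
    dataset = []
    for round_idx in range(num_rounds):
        # 1. The current model explores: it looks at screenshots and
        #    proposes actions (clicks, keystrokes, navigation steps).
        trajectories = []
        for task in environment.sample_tasks():
            screenshot = environment.reset(task)
            trajectory = []
            while not environment.done():
                action = model.propose_action(screenshot, task)
                screenshot, outcome = environment.step(action)
                trajectory.append((screenshot, action, outcome))
            trajectories.append(trajectory)

        # 2. Exploration generates new knowledge: keep the trajectories
        #    that actually accomplished their tasks.
        dataset.extend(t for t in trajectories if environment.succeeded(t))

        # 3. Feed that knowledge back into training, so the next round of
        #    exploration starts from a better model.
        model.train_on(dataset)

    return model
```

In this reading, each pass through the loop is one turn of the cycle Qin describes: a better model explores better, better exploration yields better data, and better data yields a better model.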
For PhD student Benjamin Manning, the future of work means understanding the role AI plays on our behalf, while transforming and accelerating social scientific discovery.
In this article, we overview five cutting-edge MLOps trends that will shape 2026.
Why AI Alignment Starts With Better Evaluation (Towards Data Science): You can't align what you don't evaluate.
A US telecom company trained an AI model on years of inmates’ phone and video calls and is now piloting that model to scan their calls, texts, and emails in the hope of predicting and preventing crimes. Securus Technologies president Kevin Elder told MIT Technology Review that the company began building its AI tools in…
With some needed infrastructure now being developed for agentic commerce, enterprises will want to figure out how to participate in this new form of buying and selling. But it remains a fragmented Wild West with competing payment protocols, and it's unclear what enterprises need to do to prepare. More cloud providers and AI model companies will start providing enterprises with the tools needed to begin building systems that enable agentic commerce.
AWS, which will list Visa's Intelligent Commerce platform on the AWS Marketplace, believes that making it easier to connect to tools that enable agentic payments would accelerate the adoption of agentic commerce. While this doesn't mean Amazon has formally adopted Visa's Trusted Agent Protocol (TAP), which would bring the world's largest e-commerce platform to the agentic shopping space, it does show just how quickly agentic commerce is becoming an area enterprises want to focus on.
Scott Mullins, AWS managing director of Worldwide Financial Services, told VentureBeat in an email that listing the platform "makes payment capabilities accessible" in a secure manner that quickly integrates with Visa's system.
"We're giving developers pre-built frameworks and standardized infrastructure to eliminate major development barriers," Mullins said. He added that the idea is to list Visa's platform to streamline integration with AWS services like Bedrock and AgentCore.
In addition to listing the Visa Intelligent Commerce platform on AWS Marketplace, the two companies will also publish blueprints to the public Bedrock AgentCore repository. Mullins said this will "significantly reduce development time and complexity that anyone can use to create travel booking agents, retail shopping agents and B2B payment reconciliation agents."
The Visa Intelligent Commerce platform will be MCP-compatible, allowing enterprises to connect agents running on it to other agents.
What enterprises need to know
Through the Visa Intelligent Commerce platform, AWS customers can access authentication, agentic tokenization and data personalization tools. These allow organizations to register and connect their agents to Visa's payment infrastructure. The platform helps mask credit card details through tokenized digital credentials and lets companies set guidelines for agent transactions, like spending limits.
Rubali Birwadker, senior vice president and global head of Growth at Visa, said in a press release that bringing the platform to AWS lets it scale, "helping to unlock faster innovation for developers and better experiences for consumers and businesses worldwide."
Mullins said Visa and AWS are helping provide the foundational infrastructure for developers and businesses to push forward with agentic commerce projects, but for this to work, developers must coordinate several agents and understand the different needs of industries.
"Real-world commerce often requires multiple agents working together," Mullins said. "The Travel Booking Agent blueprint, for instance, connects flight, hotel, car rental, and train providers to deliver complete travel journeys with integrated payments. Developers need to design coordination patterns for these complex, multi-agent workflows."
Different use cases also have different needs, so enterprises need to plan carefully around existing infrastructure. This is where the MCP connection is vital, since it will enable communication between an organization's agents and Visa's platform while maintaining identity and security.
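The controls described above, tokenized credentials standing in for raw card numbers and enterprise-set guidelines such as spending limits, amount to a pre-authorization check that runs before an agent's purchase reaches the payment network. The sketch below is a minimal illustration of that idea in Python; the class and function names (AgentTransaction, SpendingPolicy, authorize) are hypothetical and do not correspond to Visa's or AWS's actual APIs.

```python
# Illustrative sketch: a pre-authorization guard an enterprise might run
# before an agent-initiated purchase is sent to a payment platform.
# All names here are hypothetical placeholders, not Visa or AWS APIs.
from dataclasses import dataclass

@dataclass
class AgentTransaction:
    agent_id: str
    merchant: str
    amount_usd: float
    payment_token: str   # tokenized credential, never the raw card number

@dataclass
class SpendingPolicy:
    per_transaction_limit_usd: float
    allowed_categories: set

def authorize(tx: AgentTransaction, policy: SpendingPolicy,
              merchant_category: str) -> bool:
    """Return True only if the agent's purchase fits the enterprise policy."""
    if tx.amount_usd > policy.per_transaction_limit_usd:
        return False  # over the spending limit: escalate to a human instead
    if merchant_category not in policy.allowed_categories:
        return False  # outside the categories this agent is allowed to buy from
    return True

# Example: a travel-booking agent limited to $500 per transaction.
policy = SpendingPolicy(per_transaction_limit_usd=500.0,
                        allowed_categories={"airlines", "hotels"})
tx = AgentTransaction(agent_id="travel-agent-01", merchant="ExampleAir",
                      amount_usd=342.10, payment_token="tok_demo_123")
print(authorize(tx, policy, merchant_category="airlines"))  # True
```

In practice, the check would sit between the agent and the payment call, so a purchase that fails it never reaches the network.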
Blueprints for agentic commerce
Mullins said the biggest stumbling block for many enterprises experimenting with agentic commerce is the fragmentation of commerce systems, which creates integration challenges. "This collaboration will address these challenges by providing reference architecture blueprints that developers can use as starting points, combined with AWS's cloud infrastructure and Visa's trusted payment network to create a standardized, secure foundation for agentic commerce," he said.
The reference blueprints give enterprise developers, solution architects and software vendors a framework to follow when building these new workflows. Mullins said the blueprints are being developed in coordination with Expedia Group, Intuit and the Eurostars Hotel company. The blueprints will work with the Visa Intelligent Commerce MCP server and APIs and will be managed through Amazon Bedrock AgentCore. AWS said its goal is to "enable a foundation for agentic commerce at scale, where transactions are handled by agents capable of real-time reasoning and coordination."
These blueprints would eventually become composable, reusable workflows for any organization looking to build travel booking agents or retail shopping agents. These don't have to be consumer-focused agents; there can also be agents buying flights for employees.
Agentic commerce marches forward
Agentic commerce, where agents do the product searching, cart adding and payments, is fast becoming the next frontier for AI players. Companies like OpenAI and Google have come out with AI-powered shopping tools to make it easier to surface products and for agents to find them. Browsers like OpenAI's Atlas and Comet from Perplexity also play a role in connecting agents to websites. Retailers like Walmart and Target have also integrated into ChatGPT, so users can ask the chatbot to search for items through chat.
One of the biggest problems facing the adoption of agentic commerce revolves around enabling safe, secure transactions. OpenAI and Stripe launched the Agentic Commerce Protocol (ACP) in September, following Google's announcement of the Agent Payments Protocol (AP2) in collaboration with American Express, Mastercard, PayPal, Salesforce and ServiceNow. Visa followed soon after with TAP, which connects to the Visa Intelligent Commerce platform.
"The foundation is now in place through this collaboration, but successful agentic commerce requires thoughtful design that considers the specific needs of industry, users and existing systems while leveraging the standardized infrastructure and blueprints now available," Mullins said.
As AI, cloud, and other technology investments soar, organizations have to make investment decisions with increased speed and clarity. Practices like FinOps, IT financial management (ITFM), and strategic portfolio management (SPM) help stakeholders evaluate opportunities and trade-offs for maximum value. But they depend on unified, reliable data. And that's often where the challenge begins.
AI can surface insights from data within specific domains, but important decisions rarely rely on a single source of data. To account for operational and organizational factors as well as financial impact, finance and IT teams have to cut through disconnected systems, outdated data, and inconsistent definitions of value. Real control over technology spend comes from financial intelligence — turning fragmented inputs into actionable, context-rich insights.
Apptio technology business management (TBM) solutions deliver that intelligence to technology and finance leaders. By connecting financial, operational, and business data across the enterprise, they give leaders the clarity to make every tech dollar count.
Wrangling inputs instead of driving strategy
When different stakeholders rely on different sources of truth, they don't share the same perspective on the finance and technology landscape. The CFO sees the cost structures in the ERP system. The CIO sees systems configuration and performance metrics in ITSM and monitoring tools. The business looks at outcomes in CRM and analytics platforms. But no single domain has the holistic understanding needed to balance organizational, operational, and financial priorities.
Organizations must also evaluate competing priorities across applications, infrastructure, cloud services, DevOps tools, and workforce investments. Informed trade-offs — such as carving out budget for AI investments without undermining existing capabilities — require visibility into usage patterns, system redundancies, and relative value across all these domains. Without that visibility, FinOps, ITFM, and SPM practices can't fulfill their potential for IT and cloud cost optimization.
Instead, siloed data sources force finance teams to spend hours gathering reports from different systems of record and trying to reconcile inconsistent data formats. This practice is not only time- and labor-intensive, but it also exposes the organization to the risk of flawed forecasts, missed optimization opportunities, and wasted technology spend — potentially costing millions annually.
This critical gap reveals why generic BI platforms and DIY tools only go so far. They can't connect costs back to their sources at a detailed level, making it hard to trace allocations across systems, identify redundancies, or even answer the simplest question: What's driving our costs?
Turning static numbers into action
Financial intelligence translates domain-specific financial, operational, and business metrics into a shared language of value on which leaders can act. By aggregating, normalizing, and enriching data from ERP systems, cloud platforms, IT service management tools, HR systems, and more, the Financial Intelligence Layer in Apptio supports three critical ITFM, FinOps, and SPM capabilities:
Context. Aligning financial, operational, and outcome inputs so that:
Cloud spend connects to business impact
Infrastructure costs tie to application performance
Workforce investments link to service delivery
Insights. Connecting cost, usage, performance, and value across the enterprise.
For example, mapping AI model usage to ROI can reveal which initiatives do and do not deserve continued investment.
Action. Empowering leaders to make informed, coordinated decisions rather than operating in silos.
Hyperscalers surface cloud cost optimization insights on their own platforms. Single-function tech platforms like ERP, HR, CRM, and ITSM provide valuable metrics for their specific domains. Apptio TBM solutions go further, delivering the financial context and actionable insights needed to manage technology spend across all areas: on-premises, multi-cloud, applications, and workforce.
Domain expertise for FinOps, ITFM, and SPM
Raw numbers don't tell a story. What matters is structuring data so that it aligns with business goals and enables decision-makers to see patterns, weigh options, and chart the best path forward. Apptio has trained its AI specifically on FinOps, ITFM, and SPM to understand the questions these teams actually need to answer, so TBM teams can work faster and smarter.
Apptio TBM solutions ease the cognitive load by automating time-consuming ingestion, mapping, anomaly detection, and enrichment — so people can focus on strategic decisions. Clean, enriched inputs feed forecasting models that anticipate cost trends and surface optimization opportunities. And because Apptio offers ready-to-use cost modeling frameworks and governance, organizations can start realizing value far faster than they can using DIY or open-source tools.
The path to financial intelligence
Financial intelligence starts with clean, contextualized data — but how that data is organized and used is equally critical for optimizing technology spend. TBM principles like cost and consumption allocation, process optimization, and unit economics will help teams translate data into meaningful insights and smarter decisions.
Solutions purpose-built for technology spend management are essential. Spreadsheets don't scale, and domain expertise matters. Apptio TBM solutions deliver enterprise-grade governance, financial context across all tech domains, and AI trained specifically for ITFM, FinOps, and SPM. These are capabilities that hyperscalers, focused on single-cloud optimization, and generic BI tools simply can't provide at scale.
In an era when rapid innovation places a premium on technology spend management, financial intelligence is vital for maximizing budgets. By optimizing the inputs that fuel AI-driven financial workflows, leaders can equip every stakeholder with the confidence and intelligence to steer technology investments with data-driven precision.
Learn more here about how the Financial Intelligence Layer in Apptio transforms how enterprises decide, fund, and execute their TBM strategies in the AI era.
Ajay Patel is General Manager at Apptio, an IBM Company.
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they're always clearly marked. For more information, contact sales@venturebeat.com.
Anthropic’s official guide, ChatGPT turns 3, 2026 predictions, cancer AI, and more...
Hybrid cloud security was built before the current era of automated, machine-based cyberattacks that take just milliseconds to execute and minutes to deliver devastating impacts to infrastructure. The architectures and tech stacks every enterprise depends on, from batch-based detection to siloed tools to 15-minute response windows, stood a better chance of defending against attackers moving at human speed. But in a weaponized AI world, those approaches to analyzing threat data no longer make sense.
The latest survey numbers tell the story. More than half (55%) of organizations suffered cloud breaches in the past year, a 17-point spike, according to Gigamon's 2025 Hybrid Cloud Security Survey. Nearly half of the enterprises polled said their security tools missed the attack entirely. While 82% of enterprises now run hybrid or multi-cloud environments, only 36% express confidence in detecting threats in real time, per Fortinet's 2025 State of Cloud Security Report.
Adversaries aren't wasting any time weaponizing AI to target hybrid cloud vulnerabilities. Organizations now face 1,925 cyberattacks weekly, an increase of 47% in a year. Ransomware surged 126% in the first quarter of 2025 alone. The visibility gaps everyone talks about in hybrid environments are where breaches originate. The bottom line is that security architectures designed for the pre-AI era can't keep pace.
But the industry is finally beginning to respond. CrowdStrike, for its part, is providing one vision of cybersecurity reinvention. Today at AWS re:Invent, the company is rolling out real-time Cloud Detection and Response, a platform designed to compress 15-minute response windows down to seconds. But the bigger story is why the entire approach to hybrid cloud security must change, and what that means for CISOs planning their 2026 strategies.
Why the old model for hybrid cloud security is failing
Initially, hybrid cloud promised the best of both worlds: Every organization could have public cloud agility with on-prem control. The security model that took shape reflected the best practices at the time. The trouble is that those best practices are now introducing vulnerabilities.
How bad is it? The majority of security teams struggle to keep up with the threats and workloads. According to recent research:
91% of security leaders admit to making security compromises in their hybrid cloud environments, often trading visibility for speed, accepting siloed tools, and working with degraded data quality.
76% report a shortage of cloud security expertise, limiting their ability to deploy and manage comprehensive solutions.
Only 17% of organizations can see attackers moving laterally inside their network. That's one of several blind spots that attackers capitalize on to exploit dwell times to the fullest, install ransomware, conduct reconnaissance, and lurk until the time is right to launch an attack.
70% now view the public cloud as the riskiest environment in their infrastructure, and half are considering moving workloads back on-prem.
"You can't secure what you can't see," says Mandy Andress, CISO at Elastic. "That's the heart of the two big challenges we see as security practitioners: The complexity and sprawl of an organization's infrastructure, coupled with the rapid pace of technological change."
CrowdStrike's Zaitsev diagnosed the root cause: "Everyone assumed this was a one-way trip, lift and shift everything to the cloud. That's not what happened. We're seeing companies pull workloads back on-prem when the economics make sense. The reality? Everyone's going to be hybrid. Five years from now. Ten years. Maybe forever. Security has to deal with that."
Weaponized AI is changing the threat calculus fast
The weaponized AI era isn't just accelerating attacks. It's breaking the fundamental assumptions on which hybrid cloud security was built. The window between patch release and weaponized exploit has collapsed from weeks to hours. The majority of adversaries aren't typing commands anymore; they're automating machine-based campaigns that orchestrate agentic AI at a scale and speed that current hybrid cloud tools and human SOC teams can't keep up with.
Zaitsev shared threat data from CrowdStrike's mid-year hunting report, which found that cloud intrusions spiked 136% in a year, with roughly 40% of all cloud actor activity coming from Chinese nexus adversaries. This illustrates how quickly the threat landscape can change, and why hybrid cloud security needs to be reinvented for the AI era now.
Mike Riemer, SVP and field CISO at Ivanti, has witnessed the timeline collapse. Threat actors now reverse-engineer patches within 72 hours using AI assistance. If enterprises don't patch within that time frame, "they're open to exploit," Riemer told VentureBeat. "That's the new reality."
Using previous-generation tools in the current cloud control plane is a dangerous bet. All it takes is a single compromised virtual machine (VM) that no one knows exists. Compromise the control plane, including the APIs that manage cloud resources, and attackers have the keys to spin up, modify or delete thousands of assets across a company's hybrid environment.
The seams between hybrid cloud environments are attack highways where millisecond-long attacks seldom leave any digital exhaust or traces. Many organizations never see weaponized AI attacks coming. VentureBeat hears that the worst hybrid cloud attacks can only be diagnosed long after the fact, when forensics and analysis are finally completed. Attackers and adversaries are that good at covering their tracks, often relying on living-off-the-land (LotL) tools to evade detection for months, even years in extreme cases.
"Enterprises training AI models are concentrating sensitive data in cloud environments, which is gold for adversaries," CrowdStrike's Zaitsev said. "Attackers are using agentic AI to run their campaigns. The traditional SOC workflow — see the alert, triage, investigate for 15 or 20 minutes, take action an hour or a day later — is completely insufficient. You're bringing a knife to a gunfight."
The human toll of relying on outdated architecture
The human toll of the hybrid cloud crisis shows up in SOC metrics and burnout. The AI SOC Market Landscape 2025 report found that the average security operations center processes 960 alerts daily, and each takes roughly 70 minutes to investigate properly, which works out to roughly 1,120 analyst-hours of investigation per day. Assuming standard SOC staffing levels, there aren't enough hours in the day to get to all those alerts. Further, at least 40% of alerts, on average, never get touched.
The human cost is staggering. A Tines survey of SOC analysts found that 71% are experiencing burnout. Two-thirds say manual grunt work consumes more than half of SOC workers' day. The same percentage are eyeing the exit from their jobs and, in many extreme cases as some confide to VentureBeat, the industry.
Hybrid environments make everything more complicated. Enterprises have different tools for AWS, Azure and on-prem architectures.
They have different consoles and often different teams. As for alert correlation across environments? It's manual and often delegated to the most senior SOC team members — if it happens at all.
Batch-based detection can't survive the weaponized AI era
Here's what most legacy vendors of hybrid cloud security tools won't openly admit: Cloud security tools are fundamentally flawed and not designed for real-time defense. The majority are batch-based, collecting logs every five, ten or fifteen minutes, processing them through correlation engines, then generating alerts. In a world where adversaries are increasingly executing machine-based attacks in milliseconds, a 15-minute detection delay isn't just a minor setback; it's the difference between stopping an attack and having to investigate a breach.
As adversaries weaponize AI to accelerate cloud attacks and move laterally across systems, traditional cloud detection and response (CDR) tools relying on log batch processing are too slow to keep up. These systems can take 15 minutes or more to surface a single detection.
CrowdStrike's Zaitsev didn't hedge. Before the company's new tools released today, there was no such thing as real-time cloud detection and prevention, he claimed. "Everyone else is batch-based. Suck down logs every five or 10 minutes, wait for data, import it, correlate it. We've seen competitors take 10 to 15 minutes minimum. That's not detection — that's archaeology."
He continued: "It's carrier pigeon versus 5G. The gap between 15 minutes and 15 seconds isn't just about alert quality. It's the difference between getting a notification that something has already happened, so now you're doing cleanup, versus actually stopping the attack before the adversary achieves anything. One is incident response. The other is prevention."
Reinventing hybrid cloud security must begin with speed
CrowdStrike's new real-time Cloud Detection and Response, part of Falcon Cloud Security's unified cloud-native application protection platform (CNAPP), is intended to secure every layer of hybrid cloud risk. It is built on three key innovations:
Real-time detection engine: Built on event streaming technology pioneered and battle-tested by Falcon Adversary OverWatch, this engine analyzes cloud logs as they stream in. It then applies detections to eliminate latency and false positives.
New cloud-specific indicators of attack out of the box: AI and machine learning (ML) correlate what's happening in real time against cloud asset and identity data. That's how the system catches stealthy moves like privilege escalation and CloudShell abuse before attackers can capitalize on them.
Automated cloud response actions and workflows: There's a gap in traditional cloud security. Cloud workload protection (CWP) simply stops at the workload. Cloud security posture management (CSPM) shows what could go wrong. But neither protects the control plane at runtime. New workflows built on Falcon Fusion SOAR close that gap, triggering instantly to disrupt adversaries before SOC teams can intervene.
CrowdStrike's Cloud Detection and Response integrates with AWS EventBridge, Amazon's real-time serverless event streaming service. Instead of polling for logs on a schedule, the system taps directly into the event stream as things happen. "Anything that calls itself CNAPP that doesn't have real-time cloud detection and response is now obsolete," CrowdStrike CTO Elia Zaitsev said in an exclusive interview with VentureBeat.
By contrast, EventBridge gives the system asynchronous, microservice-based, just-in-time event processing. "We're not waiting five minutes for a bucket of data," he said. But tapping into it is only half the problem. "Can you actually keep up with that firehose? Can you process it fast enough to matter?" Zaitsev asked rhetorically. CrowdStrike claims it can handle 60 million events per second. "This isn't duct tape and a demo."
The underlying streaming technology isn't new to CrowdStrike. Falcon Adversary OverWatch has been running stream processing for 15 years to hunt across CrowdStrike's customer base, processing logs in real time rather than waiting for batch cycles to complete.
The platform integrates Charlotte AI for automated triage, providing 98% accuracy matching expert managed detection and response (MDR) analysts and cutting 40-plus hours of manual work weekly. When the system detects a control plane compromise, it doesn't wait for human approval. It revokes tokens, kills sessions, boots the attacker and nukes malicious CloudFormation templates, all before the adversary can execute.
What this means for the CNAPP market
Cloud security is the fastest-growing segment in Gartner's latest forecast, expanding at a 25.9% CAGR through 2028. Precedence Research projects the market will grow from $36 billion in 2024 to $121 billion by 2034. And it's crowded: Palo Alto Networks, Wiz (now absorbed into Google via a $32 billion acquisition), Microsoft, Orca and SentinelOne, to name a few. CrowdStrike already had a seat at the table as a Leader in the 2025 IDC MarketScape for CNAPP for the third consecutive year. Gartner predicts that by 2029, 40% of enterprises that successfully implement zero trust in cloud environments will rely on CNAPP platforms due to their visibility and control.
But Zaitsev is making a bigger claim, stating that today's announcement redefines what "complete" means for CNAPP in hybrid environments. "CSPM isn't going away. Cloud workload protection isn't going away. What becomes obsolete is calling something a CNAPP when it lacks real-time cloud detection and response. You're missing the safety net, the thing that catches what gets through proactive defenses. And in hybrid, something always gets through."
"The unified platform angle matters specifically for hybrid," he said. "Adversaries deliberately hop between environments because they know defenders run different tools, often different teams, for cloud versus on-prem versus identity. Jumping domains is how you shake your tail. Attackers know most organizations can't follow them across the seams. With us, they can't do that anymore."
Building hybrid security for the AI era
Reinventing hybrid cloud security won't happen overnight. Here's where CISOs should focus:
Map your hybrid visibility gaps: Every cloud workload, every on-prem system, every identity traversing between them. If 82% of breaches trace to blind spots, know where yours are before attackers find them.
Pressure vendors on detection latency: Ask challenging questions about architecture. If they're running batch-based processing, understand what a 15-minute window means when adversaries move in seconds.
Deploy AI triage now: With 40% of alerts going uninvestigated and 71% of analysts burned out, automation isn't a roadmap item; it's a must-have for a successful deterrence strategy. Look for measurable accuracy rates and real-time savings.
Compress patch cycles to 72 hours: AI-assisted reverse engineering has collapsed the exploit window. Monthly patch cycles don't cut it anymore.
Architect for permanent hybrid: Stop waiting for cloud migration to simplify security. It won't. Design for complexity as the baseline, not a temporary state. The 54% of enterprises running hybrid models today will still be hybrid tomorrow.
The bottom line
Hybrid cloud security must be reinvented for the AI era. Previous-generation hybrid cloud security solutions are quickly being eclipsed by weaponized AI attacks, often launched as machine-on-machine intrusion attempts. The evidence is clear: 55% breach rates, 91% of security leaders making compromises they know are dangerous, and AI-accelerated attacks that move faster than batch-based detection can respond. Architectures designed for human-speed threats can't protect against machine-speed adversaries.
"Modern cybersecurity is about differentiating between acceptable and unacceptable risk," says Chaim Mazal, CSO at Gigamon. "Our research shows where CISOs are drawing that line, highlighting the critical importance of visibility into all data-in-motion to secure complex hybrid cloud infrastructure against today's emerging threats. It's clear that current approaches aren't keeping pace, which is why CISOs must reevaluate tool stacks and reprioritize investments and resources to more confidently secure their infrastructure."
VentureBeat will be tracking which approaches to hybrid cloud reinvention actually deliver, and which don't, in the months ahead.
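The article does not describe CrowdStrike's implementation in code, but the general pattern Zaitsev contrasts with batch log collection, reacting to control-plane events as they stream in, can be illustrated with standard AWS tooling. The boto3 sketch below creates a generic EventBridge rule that routes CloudTrail-reported IAM changes to a responder function; the rule name, target ARN and event pattern are assumptions for illustration only, not CrowdStrike's product.

```python
# Illustrative sketch of event-driven (rather than batch-polled) detection:
# an EventBridge rule that forwards CloudTrail-reported IAM changes to a
# responder Lambda as they happen. The rule name, target ARN and the exact
# event pattern are placeholder assumptions, not CrowdStrike's implementation.
import json
import boto3

events = boto3.client("events")

# Match control-plane calls that often precede privilege escalation.
event_pattern = {
    "source": ["aws.iam"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventName": ["AttachUserPolicy", "CreateAccessKey", "UpdateAssumeRolePolicy"]
    },
}

events.put_rule(
    Name="iam-change-stream",  # hypothetical rule name
    EventPattern=json.dumps(event_pattern),
    State="ENABLED",
)

events.put_targets(
    Rule="iam-change-stream",
    Targets=[{
        "Id": "responder",
        # Hypothetical Lambda that triages the event and can revoke sessions.
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:iam-responder",
    }],
)
```

The point of the pattern is latency: the responder is invoked as each matching event arrives, rather than after a scheduled batch of logs has been pulled, imported and correlated.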
The Machine Learning and Deep Learning "Advent Calendar" Series: The Blueprint (Towards Data Science): Opening the black box of ML models, step by step, directly in Excel.
This article is divided into four parts; they are:
• Optimizers for Training Language Models
• Learning Rate Schedulers
• Sequence Length Scheduling
• Other Techniques to Help Training Deep Learning Models
Adam has been the most popular optimizer for training deep learning models.