Large language models (LLMs) are trained primarily to generate text responses to user queries or prompts. Under the hood, this involves more than predicting each next token in the output sequence: the model must also capture the linguistic patterns surrounding the user's input text.
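To make that next-token mechanic concrete, here is a minimal greedy-decoding loop using the open GPT-2 weights via Hugging Face transformers; the model choice, prompt, and token count are illustrative assumptions, not drawn from any item below.

```python
# Illustrative sketch of next-token prediction with a small causal LM.
# Model name and prompt are arbitrary examples for demonstration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
for _ in range(10):                      # generate ten tokens, one at a time
    logits = model(ids).logits           # scores for every vocabulary token
    next_id = logits[0, -1].argmax()     # greedy: take the most likely token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```

Sampling strategies (temperature, top-p) replace the `argmax` in production systems, but the loop structure is the same.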
An accidental leak revealed that Flock, which has cameras in thousands of US communities, is using workers in the Philippines to review and classify footage.
A stealth artificial intelligence startup founded by an MIT researcher emerged this morning with an ambitious claim: its new AI model can control computers better than systems built by OpenAI and Anthropic — at a fraction of the cost.

OpenAGI, led by chief executive Zengyi Qin, released Lux, a foundation model designed to operate computers autonomously by interpreting screenshots and executing actions across desktop applications. The San Francisco-based company says Lux achieves an 83.6 percent success rate on Online-Mind2Web, a benchmark that has become the industry's most rigorous test for evaluating AI agents that control computers.

That score is a significant leap over the leading models from well-funded competitors. OpenAI's Operator, released in January, scores 61.3 percent on the same benchmark. Anthropic's Claude Computer Use achieves 56.3 percent.

"Traditional LLM training feeds a large amount of text corpus into the model. The model learns to produce text," Qin said in an exclusive interview with VentureBeat. "By contrast, our model learns to produce actions. The model is trained with a large amount of computer screenshots and action sequences, allowing it to produce actions to control the computer."

The announcement arrives at a pivotal moment for the AI industry. Technology giants and startups alike have poured billions of dollars into developing autonomous agents capable of navigating software, booking travel, filling out forms, and executing complex workflows. OpenAI, Anthropic, Google, and Microsoft have all released or announced agent products in the past year, betting that computer-controlling AI will become as transformative as chatbots.

Yet independent research has cast doubt on whether current agents are as capable as their creators suggest.

Why university researchers built a tougher benchmark to test AI agents — and what they discovered

The Online-Mind2Web benchmark, developed by researchers at Ohio State University and the University of California, Berkeley, was designed specifically to expose the gap between marketing claims and actual performance.

Published in April and accepted to the Conference on Language Modeling 2025, the benchmark comprises 300 diverse tasks across 136 real websites — everything from booking flights to navigating complex e-commerce checkouts. Unlike earlier benchmarks that cached parts of websites, Online-Mind2Web tests agents in live online environments where pages change dynamically and unexpected obstacles appear.

The results, according to the researchers, painted "a very different picture of the competency of current agents, suggesting over-optimism in previously reported results."

When the Ohio State team tested five leading web agents with careful human evaluation, they found that many recent systems — despite heavy investment and marketing fanfare — did not outperform SeeAct, a relatively simple agent released in January 2024. Even OpenAI's Operator, the best performer among commercial offerings in their study, achieved only 61 percent success.

"It seemed that highly capable and practical agents were maybe indeed just months away," the researchers wrote in a blog post accompanying their paper.
"However, we are also well aware that there are still many fundamental gaps in research to fully autonomous agents, and current agents are probably not as competent as the reported benchmark numbers may depict."The benchmark has gained traction as an industry standard, with a public leaderboard hosted on Hugging Face tracking submissions from research groups and companies.How OpenAGI trained its AI to take actions instead of just generating textOpenAGI's claimed performance advantage stems from what the company calls "Agentic Active Pre-training," a training methodology that differs fundamentally from how most large language models learn.Conventional language models train on vast text corpora, learning to predict the next word in a sequence. The resulting systems excel at generating coherent text but were not designed to take actions in graphical environments.Lux, according to Qin, takes a different approach. The model trains on computer screenshots paired with action sequences, learning to interpret visual interfaces and determine which clicks, keystrokes, and navigation steps will accomplish a given goal."The action allows the model to actively explore the computer environment, and such exploration generates new knowledge, which is then fed back to the model for training," Qin told VentureBeat. "This is a naturally self-evolving process, where a better model produces better exploration, better exploration produces better knowledge, and better knowledge leads to a better model."This self-reinforcing training loop, if it functions as described, could help explain how a smaller team might achieve results that elude larger organizations. Rather than requiring ever-larger static datasets, the approach would allow the model to continuously improve by generating its own training data through exploration.OpenAGI also claims significant cost advantages. The company says Lux operates at roughly one-tenth the cost of frontier models from OpenAI and Anthropic while executing tasks faster.Unlike browser-only competitors, Lux can control Slack, Excel, and other desktop applicationsA critical distinction in OpenAGI's announcement: Lux can control applications across an entire desktop operating system, not just web browsers.Most commercially available computer-use agents, including early versions of Anthropic's Claude Computer Use, focus primarily on browser-based tasks. That limitation excludes vast categories of productivity work that occur in desktop applications — spreadsheets in Microsoft Excel, communications in Slack, design work in Adobe products, code editing in development environments.OpenAGI says Lux can navigate these native applications, a capability that would substantially expand the addressable market for computer-use agents. The company is releasing a developer software development kit alongside the model, allowing third parties to build applications on top of Lux.The company is also working with Intel to optimize Lux for edge devices, which would allow the model to run locally on laptops and workstations rather than requiring cloud infrastructure. 
OpenAGI also claims significant cost advantages. The company says Lux operates at roughly one-tenth the cost of frontier models from OpenAI and Anthropic while executing tasks faster.

Unlike browser-only competitors, Lux can control Slack, Excel, and other desktop applications

A critical distinction in OpenAGI's announcement: Lux can control applications across an entire desktop operating system, not just web browsers.

Most commercially available computer-use agents, including early versions of Anthropic's Claude Computer Use, focus primarily on browser-based tasks. That limitation excludes vast categories of productivity work that occur in desktop applications — spreadsheets in Microsoft Excel, communications in Slack, design work in Adobe products, code editing in development environments.

OpenAGI says Lux can navigate these native applications, a capability that would substantially expand the addressable market for computer-use agents. The company is releasing a developer software development kit alongside the model, allowing third parties to build applications on top of Lux.

The company is also working with Intel to optimize Lux for edge devices, which would allow the model to run locally on laptops and workstations rather than requiring cloud infrastructure. That partnership could address enterprise concerns about sending sensitive screen data to external servers.

"We are partnering with Intel to optimize our model on edge devices, which will make it the best on-device computer-use model," Qin said.

The company confirmed it is in exploratory discussions with AMD and Microsoft about additional partnerships.

What happens when you ask an AI agent to copy your bank details

Computer-use agents present novel safety challenges that do not arise with conventional chatbots. An AI system capable of clicking buttons, entering text, and navigating applications could, if misdirected, cause significant harm — transferring money, deleting files, or exfiltrating sensitive information.

OpenAGI says it has built safety mechanisms directly into Lux. When the model encounters requests that violate its safety policies, it refuses to proceed and alerts the user.

In an example provided by the company, when a user asked the model to "copy my bank details and paste it into a new Google doc," Lux responded with an internal reasoning step: "The user asks me to copy the bank details, which are sensitive information. Based on the safety policy, I am not able to perform this action." The model then issued a warning to the user rather than executing the potentially dangerous request.

Such safeguards will face intense scrutiny as computer-use agents proliferate. Security researchers have already demonstrated prompt injection attacks against early agent systems, where malicious instructions embedded in websites or documents can hijack an agent's behavior. Whether Lux's safety mechanisms can withstand adversarial attacks remains to be tested by independent researchers.

The MIT researcher who built two of GitHub's most downloaded AI models

Qin brings an unusual combination of academic credentials and entrepreneurial experience to OpenAGI.

He completed his doctorate at the Massachusetts Institute of Technology in 2025, where his research focused on computer vision, robotics, and machine learning. His academic work appeared in top venues including the Conference on Computer Vision and Pattern Recognition, the International Conference on Learning Representations, and the International Conference on Machine Learning.

Before founding OpenAGI, Qin built several widely adopted AI systems. JetMoE, a large language model he led development on, demonstrated that a high-performing model could be trained from scratch for less than $100,000 — a fraction of the tens of millions typically required. The model outperformed Meta's LLaMA2-7B on standard benchmarks, according to a technical report that attracted attention from MIT's Computer Science and Artificial Intelligence Laboratory.

His previous open-source projects achieved remarkable adoption. OpenVoice, a voice cloning model, accumulated approximately 35,000 stars on GitHub and ranked in the top 0.03 percent of open-source projects by popularity. MeloTTS, a text-to-speech system, has been downloaded more than 19 million times, making it one of the most widely used audio AI models since its 2024 release.

Qin also co-founded MyShell, an AI agent platform that has attracted six million users who have collectively built more than 200,000 AI agents.
Users have had more than one billion interactions with agents on the platform, according to the company.

Inside the billion-dollar race to build AI that controls your computer

The computer-use agent market has attracted intense interest from investors and technology giants over the past year.

OpenAI released Operator in January, allowing users to instruct an AI to complete tasks across the web. Anthropic has continued developing Claude Computer Use, positioning it as a core capability of its Claude model family. Google has incorporated agent features into its Gemini products. Microsoft has integrated agent capabilities across its Copilot offerings and Windows.

Yet the market remains nascent. Enterprise adoption has been limited by concerns about reliability, security, and the ability to handle edge cases that occur frequently in real-world workflows. The performance gaps revealed by benchmarks like Online-Mind2Web suggest that current systems may not be ready for mission-critical applications.

OpenAGI enters this competitive landscape as an independent alternative, positioning superior benchmark performance and lower costs against the massive resources of its well-funded rivals. The company's Lux model and developer SDK are available beginning today.

Whether OpenAGI can translate benchmark dominance into real-world reliability remains the central question. The AI industry has a long history of impressive demos that falter in production, of laboratory results that crumble against the chaos of actual use. Benchmarks measure what they measure, and the distance between a controlled test and an 8-hour workday full of edge cases, exceptions, and surprises can be vast.

But if Lux performs in the wild the way it performs in the lab, the implications extend far beyond one startup's success. It would suggest that the path to capable AI agents runs not through the largest checkbooks but through the cleverest architectures — that a small team with the right ideas can outmaneuver the giants.

The technology industry has seen that story before. It rarely stays true for long.
For PhD student Benjamin Manning, the future of work means understanding the roles AI can play on our behalf, and using it to transform and accelerate social scientific discovery.
In this article, we survey five cutting-edge MLOps trends that will shape 2026.
You can’t align what you don’t evaluate
Read the full post, "Why AI Alignment Starts With Better Evaluation," on Towards Data Science.
A US telecom company trained an AI model on years of inmates’ phone and video calls and is now piloting that model to scan their calls, texts, and emails in the hope of predicting and preventing crimes. Securus Technologies president Kevin Elder told MIT Technology Review that the company began building its AI tools in…
With some needed infrastructure now being developed for agentic commerce, enterprises will want to figure out how to participate in this new form of buying and selling. But it remains a fragmented Wild West with competing payment protocols, and it's unclear what enterprises need to do to prepare. More cloud providers and AI model companies will start providing enterprises with the tools needed to begin building systems that enable agentic commerce.

AWS, which will list Visa's Intelligence Commerce platform on the AWS Marketplace, believes that making it easier to connect to tools that enable agentic payments will accelerate the adoption of agentic commerce. While this doesn't mean Amazon has formally adopted Visa's Trusted Agent Protocol (TAP), which would bring the world's largest e-commerce platform to the agentic shopping space, it does show just how quickly agentic commerce is becoming an area enterprises want to focus on.

Scott Mullins, AWS managing director of Worldwide Financial Services, told VentureBeat in an email that listing the platform "makes payment capabilities accessible" in a secure manner that quickly integrates with Visa's system. "We're giving developers pre-built frameworks and standardized infrastructure to eliminate major development barriers," Mullins said. He added that the idea is to list Visa's platform to streamline integration with AWS services like Bedrock and AgentCore.

In addition to listing the Visa Intelligence Commerce platform on AWS Marketplace, the two companies will also publish blueprints to the public Bedrock AgentCore repository. Mullins said this will "significantly reduce development time and complexity that anyone can use to create travel booking agents, retail shopping agents and B2B payment reconciliation agents."

The Visa Intelligence Commerce platform will be MCP-compatible, allowing enterprises to connect agents running on it to other agents.
What enterprises need to know

Through the Visa Intelligence Commerce platform, AWS customers can access authentication, agentic tokenization and data personalization tools. These allow organizations to register and connect their agents to Visa's payment infrastructure. The platform helps mask credit card details through tokenized digital credentials and lets companies set guidelines for agent transactions, like spending limits.

Rubali Birwadker, senior vice president and global head of Growth at Visa, said in a press release that bringing the platform to AWS lets it scale, "helping to unlock faster innovation for developers and better experiences for consumers and businesses worldwide."

Mullins said Visa and AWS are helping provide the foundational infrastructure for developers and businesses to push for agentic commerce projects, but for this to work, developers must coordinate several agents and understand the different needs of industries. "Real-world commerce often requires multiple agents working together," Mullins said. "The Travel Booking Agent blueprint, for instance, connects flight, hotel, car rental, and train providers to deliver complete travel journeys with integrated payments. Developers need to design coordination patterns for these complex, multi-agent workflows."

Different use cases also have different needs, so enterprises need to plan carefully around existing infrastructure. This is where the MCP connection is vital, since it enables communication between an organization's agents and Visa's platform while maintaining identity and security.
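Visa and AWS describe guardrails like spending limits only in general terms; the sketch below is a generic illustration of such a policy check, with every name, field, and threshold invented for the example rather than taken from the Visa Intelligent Commerce APIs.

```python
# Hypothetical guardrail check for an agent-initiated purchase.
# Field names and limits are invented for illustration only.
from dataclasses import dataclass

@dataclass
class AgentPolicy:
    per_transaction_limit: float   # e.g. 500.00 USD
    daily_limit: float             # e.g. 2000.00 USD
    allowed_categories: set        # e.g. {"travel", "office_supplies"}

def authorize(policy: AgentPolicy, amount: float, category: str,
              spent_today: float) -> bool:
    """Return True only if the agent's purchase satisfies every rule."""
    if amount > policy.per_transaction_limit:
        return False
    if spent_today + amount > policy.daily_limit:
        return False
    return category in policy.allowed_categories
```

In a real deployment, checks like this would run on the payment network's side, keyed to the tokenized credential, so a misbehaving agent cannot simply skip them.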
Blueprints for agentic commerce

Mullins said the biggest stumbling block for many enterprises experimenting with agentic commerce is the fragmentation of commerce systems, which creates integration challenges. "This collaboration will address these challenges by providing reference architecture blueprints that developers can use as starting points, combined with AWS's cloud infrastructure and Visa's trusted payment network to create a standardized, secure foundation for agentic commerce," he said.

The reference blueprints give enterprise developers, solution architects and software vendors a framework to follow when building these new workflows. Mullins said the blueprints are being developed in coordination with Expedia Group, Intuit and the Eurostars Hotel company. The blueprints will work with the Visa Intelligent Commerce MCP server and APIs and will be managed through Amazon Bedrock AgentCore. AWS said that its goal is to "enable a foundation for agentic commerce at scale, where transactions are handled by agents capable of real-time reasoning and coordination."

These blueprints would eventually become composable, reusable workflows for any organization looking to build travel booking agents or retail shopping agents. These don't have to be consumer-focused agents; there can also be agents buying flights for employees.
Agentic commerce marches forward

Agentic commerce, where agents do the product searching, cart adding and payments, is fast becoming the next frontier for AI players. Companies like OpenAI and Google have come out with AI-powered shopping tools to make it easier to surface products and for agents to find them. Browsers like OpenAI's Atlas and Comet from Perplexity also play a role in connecting agents to websites. Retailers like Walmart and Target have also integrated with ChatGPT, so users can ask the chatbot to search for items through chat.

One of the biggest problems facing the adoption of agentic commerce is enabling safe, secure transactions. OpenAI and Stripe launched the Agentic Commerce Protocol (ACP) in September, following Google's announcement of the Agent Payments Protocol (AP2) in collaboration with American Express, Mastercard, PayPal, Salesforce and ServiceNow. Visa followed soon after with TAP, which connects to the Visa Intelligent Commerce platform.

"The foundation is now in place through this collaboration, but successful agentic commerce requires thoughtful design that considers the specific needs of industry, users and existing systems while leveraging the standardized infrastructure and blueprints now available," Mullins said.
As AI, cloud, and other technology investments soar, organizations have to make investment decisions with increased speed and clarity. Practices like FinOps, IT financial management (ITFM), and strategic portfolio management (SPM) help stakeholders evaluate opportunities and trade-offs for maximum value. But they depend on unified, reliable data. And that's often where the challenge begins.

AI can surface insights from data within specific domains, but important decisions rarely rely on a single source of data. To account for operational and organizational factors as well as financial impact, finance and IT teams have to cut through disconnected systems, outdated data, and inconsistent definitions of value. Real control over technology spend comes from financial intelligence — turning fragmented inputs into actionable, context-rich insights.

Apptio technology business management (TBM) solutions deliver that intelligence to technology and finance leaders. By connecting financial, operational, and business data across the enterprise, they give leaders the clarity to make every tech dollar count.

Wrangling inputs instead of driving strategy

When different stakeholders rely on different sources of truth, they don't share the same perspective on the finance and technology landscape. The CFO sees the cost structures in the ERP system. The CIO sees systems configuration and performance metrics in ITSM and monitoring tools. The business looks at outcomes in CRM and analytics platforms. But no single domain has the holistic understanding needed to balance organizational, operational, and financial priorities.

Organizations must also evaluate competing priorities across applications, infrastructure, cloud services, DevOps tools, and workforce investments. Informed trade-offs — such as carving out budget for AI investments without undermining existing capabilities — require visibility into usage patterns, system redundancies, and relative value across all these domains. Without visibility, FinOps, ITFM, and SPM practices can't fulfill their potential for IT and cloud cost optimization.

Instead, siloed data sources force finance teams to spend hours gathering reports from different systems of record and trying to reconcile inconsistent data formats. This practice is not only time- and labor-intensive, but it also opens the org to the risk of flawed forecasts, missed optimization opportunities, and wasted technology spend — potentially costing millions annually.

This critical gap reveals why generic BI platforms and DIY tools only go so far. They can't connect costs back to their sources at a detailed level, making it hard to trace allocations across systems, identify redundancies, or even answer the simplest question: What's driving our costs?

Turning static numbers into action

Financial intelligence translates domain-specific financial, operational, and business metrics into a shared language of value on which leaders can act. By aggregating, normalizing, and enriching data from ERP systems, cloud platforms, IT service management tools, HR systems, and more, the Financial Intelligence Layer in Apptio supports three critical ITFM, FinOps, and SPM capabilities:

Context. Aligning financial, operational, and outcome inputs so that:
• Cloud spend connects to business impact
• Infrastructure costs tie to application performance
• Workforce investments link to service delivery

Insights. Connecting cost, usage, performance, and value across the enterprise.
For example, mapping AI model usage to ROI can reveal which initiatives do and do not deserve continued investment.

Action. Empowering leaders to make informed, coordinated decisions rather than operating in silos.

Hyperscalers surface cloud cost optimization insights on their own platforms. Single-function tech platforms like ERP, HR, CRM, and ITSM provide valuable metrics for their specific domains. Apptio TBM solutions go further, delivering the financial context and actionable insights needed to manage technology spend across all areas: on-premises, multi-cloud, applications, and workforce.

Domain expertise for FinOps, ITFM, and SPM

Raw numbers don't tell a story. What matters is structuring data so that it aligns with business goals and enables decision-makers to see patterns, weigh options, and chart the best path forward. Apptio has trained its AI specifically on FinOps, ITFM, and SPM to understand the questions these teams actually need to answer, so TBM teams can work faster and smarter.

Apptio TBM solutions ease the cognitive load by automating time-consuming ingestion, mapping, anomaly detection, and enrichment — so people can focus on strategic decisions. Clean, enriched inputs feed forecasting models that anticipate cost trends and surface optimization opportunities. And because Apptio offers ready-to-use cost modeling frameworks and governance, organizations can start realizing value far faster than they can using DIY or open-source tools.

The path to financial intelligence

Financial intelligence starts with clean, contextualized data — but how that data is organized and used is equally critical for optimizing technology spend. TBM principles like cost and consumption allocation, process optimization, and unit economics will help teams translate data into meaningful insights and smarter decisions.

Solutions purpose-built for technology spend management are essential. Spreadsheets don't scale, and domain expertise matters. Apptio TBM solutions deliver enterprise-grade governance, financial context across all tech domains, and AI trained specifically for ITFM, FinOps, and SPM. These are capabilities that hyperscalers, focused on single-cloud optimization, and generic BI tools simply can't provide at scale.

In an era when rapid innovation places a premium on technology spend management, financial intelligence is vital for maximizing budgets. By optimizing the inputs that fuel AI-driven financial workflows, leaders can equip every stakeholder with the confidence and intelligence to steer technology investments with data-driven precision.

Learn more here about how the Financial Intelligence Layer in Apptio transforms how enterprises decide, fund, and execute their TBM strategies in the AI era.

Ajay Patel is General Manager at Apptio, an IBM Company.

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they're always clearly marked. For more information, contact sales@venturebeat.com.
Anthropic’s official guide, ChatGPT turns 3, 2026 predictions, cancer AI, and more...
Hybrid cloud security was built before the current era of automated, machine-based cyberattacks that take just milliseconds to execute and minutes to deliver devastating impacts to infrastructure. The architectures and tech stacks every enterprise depends on, from batch-based detection to siloed tools to 15-minute response windows, stood a better chance of defending against attackers moving at human speed. But in a weaponized AI world, those approaches to analyzing threat data don't make sense.

The latest survey numbers tell the story. More than half (55%) of organizations suffered cloud breaches in the past year. That's a 17-point spike, according to Gigamon's 2025 Hybrid Cloud Security Survey. Nearly half of the enterprises polled said their security tools missed the attack entirely. While 82% of enterprises now run hybrid or multi-cloud environments, only 36% express confidence in detecting threats in real time, per Fortinet's 2025 State of Cloud Security Report.

Adversaries aren't wasting any time weaponizing AI to target hybrid cloud vulnerabilities. Organizations now face 1,925 cyberattacks weekly. That's an increase of 47% in a year. Further, ransomware surged 126% in the first quarter of 2025 alone. The visibility gaps everyone talks about in hybrid environments are where breaches originate. The bottom line is that security architectures designed for the pre-AI era can't keep pace.

But the industry is finally beginning to respond. CrowdStrike, for its part, is providing one vision of cybersecurity reinvention. Today at AWS re:Invent, the company is rolling out real-time Cloud Detection and Response, a platform designed to compress 15-minute response windows down to seconds. But the bigger story is why the entire approach to hybrid cloud security must change, and what that means for CISOs planning their 2026 strategies.

Why the old model for hybrid cloud security is failing

Initially, hybrid cloud promised the best of both worlds. Every organization could have public cloud agility with on-prem control. The security model that took shape reflected the best practices at the time. The trouble is that those best practices are now introducing vulnerabilities.

How bad is it? The majority of security teams struggle to keep up with the threats and workloads. According to recent research:

• 91% of security leaders admit to making security compromises in their hybrid cloud environments, often trading visibility for speed, accepting siloed tools, and working with degraded data quality.
• 76% report a shortage of cloud security expertise, limiting their ability to deploy and manage comprehensive solutions.
• Only 17% of organizations can see attackers moving laterally inside their network. That's one of several blind spots that attackers capitalize on to exploit dwell times to the fullest, install ransomware, do reconnaissance, and lurk until the time is right to launch an attack.
• 70% now view the public cloud as the riskiest environment in their infrastructure, and half are considering moving workloads back on-prem.

"You can't secure what you can't see," says Mandy Andress, CISO at Elastic. "That's the heart of the two big challenges we see as security practitioners: The complexity and sprawl of an organization's infrastructure, coupled with the rapid pace of technological change."

CrowdStrike CTO Elia Zaitsev diagnosed the root cause: "Everyone assumed this was a one-way trip, lift and shift everything to the cloud. That's not what happened.
We're seeing companies pull workloads back on-prem when the economics make sense. The reality? Everyone's going to be hybrid. Five years from now. Ten years. Maybe forever. Security has to deal with that."

Weaponized AI is changing the threat calculus fast

The weaponized AI era isn't just accelerating attacks. It's breaking the fundamental assumptions on which hybrid cloud security was built. The window between patch release and weaponized exploit collapsed from weeks to hours. The majority of adversaries aren't typing commands anymore; they're automating machine-based campaigns that orchestrate agentic AI at a scale and speed that current hybrid cloud tools and human SOC teams can't keep up with.

Zaitsev shared threat data from CrowdStrike's mid-year hunting report, which found that cloud intrusions spiked 136% in a year, with roughly 40% of all cloud actor activity coming from Chinese nexus adversaries. This illustrates how quickly the threat landscape can change, and why hybrid cloud security needs to be reinvented for the AI era now.

Mike Riemer, SVP and field CISO at Ivanti, has witnessed the timeline collapse. Threat actors now reverse-engineer patches within 72 hours using AI assistance. If enterprises don't patch within that time frame, "they're open to exploit," Riemer told VentureBeat. "That's the new reality."

Using previous-generation tools in the current cloud control plane is a dangerous bet. All it takes is a single compromised virtual machine (VM) that no one knows exists. If attackers compromise the control plane, including the APIs that manage cloud resources, they've got the keys to spin up, modify or delete thousands of assets across a company's hybrid environment.

The seams between hybrid cloud environments are attack highways where millisecond-long attacks seldom leave any digital exhaust or traces. Many organizations never see weaponized AI attacks coming. VentureBeat hears that the worst hybrid cloud attacks can only be diagnosed long after the fact, when forensics and analysis are finally completed. Attackers and adversaries are that good at covering their tracks, often relying on living-off-the-land (LotL) tools to evade detection for months, even years in extreme cases.

"Enterprises training AI models are concentrating sensitive data in cloud environments, which is gold for adversaries," Zaitsev said. "Attackers are using agentic AI to run their campaigns. The traditional SOC workflow — see the alert, triage, investigate for 15 or 20 minutes, take action an hour or a day later — is completely insufficient. You're bringing a knife to a gunfight."

The human toll of relying on outdated architecture

The human toll of the hybrid cloud crisis shows up in SOC metrics and burnout. The AI SOC Market Landscape 2025 report found that the average security operations center processes 960 alerts daily. Each takes roughly 70 minutes to investigate properly. Assuming standard SOC staffing levels, there aren't enough hours in the day to get to all those alerts. Further, at least 40% of alerts, on average, never get touched.

The human cost is staggering. A Tines survey of SOC analysts found that 71% are experiencing burnout. Two-thirds say manual grunt work consumes more than half of SOC workers' day. The same percentage are eyeing the exit from their jobs and, in some extreme cases as analysts confide to VentureBeat, from the industry.

Hybrid environments make everything more complicated. Enterprises have different tools for AWS, Azure and on-prem architectures.
They have different consoles; often different teams. As for alert correlation across environments? It's manual and often delegated to the most senior SOC team members — if it happens at all.

Batch-based detection can't survive the weaponized AI era

Here's what most legacy vendors of hybrid cloud security tools won't openly admit: Cloud security tools are fundamentally flawed and not designed for real-time defense. The majority are batch-based, collecting logs every five, ten or fifteen minutes, processing them through correlation engines, then generating alerts. In a world where adversaries are increasingly executing machine-based attacks in milliseconds, a 15-minute detection delay isn't just a minor setback; it's the difference between stopping an attack and having to investigate a breach.

As adversaries weaponize AI to accelerate cloud attacks and move laterally across systems, traditional cloud detection and response (CDR) tools relying on log batch processing are too slow to keep up. These systems can take 15 minutes or more to surface a single detection.

Zaitsev didn't hedge. Before the company's new tools released today, there was no such thing as real-time cloud detection and prevention, he claimed. "Everyone else is batch-based. Suck down logs every five or 10 minutes, wait for data, import it, correlate it. We've seen competitors take 10 to 15 minutes minimum. That's not detection — that's archaeology."

He continued: "It's carrier pigeon versus 5G. The gap between 15 minutes and 15 seconds isn't just about alert quality. It's the difference between getting a notification that something has already happened, so now you're doing cleanup, versus actually stopping the attack before the adversary achieves anything. One is incident response. The other is prevention."

Reinventing hybrid cloud security must begin with speed

CrowdStrike's new real-time Cloud Detection and Response, part of Falcon Cloud Security's unified cloud-native application protection platform (CNAPP), is intended to secure every layer of hybrid cloud risk. It is built on three key innovations:

• Real-time detection engine: Built on event streaming technology pioneered and battle-tested by Falcon Adversary OverWatch, this engine analyzes cloud logs as they stream in. It then applies detections to eliminate latency and false positives.
• New cloud-specific indicators of attack out of the box: AI and machine learning (ML) correlate what's happening in real time against cloud asset and identity data. That's how the system catches stealthy moves like privilege escalation and CloudShell abuse before attackers can capitalize on them.
• Automated cloud response actions and workflows: There's a gap in traditional cloud security. Cloud workload protection (CWP) stops at the workload. Cloud security posture management (CSPM) shows what could go wrong. But neither protects the control plane at runtime. New workflows built on Falcon Fusion SOAR close that gap, triggering instantly to disrupt adversaries before SOC teams need to intervene.

CrowdStrike's Cloud Detection and Response integrates with AWS EventBridge, Amazon's real-time serverless event streaming service. Instead of polling for logs on a schedule, the system taps directly into the event stream as things happen. "Anything that calls itself CNAPP that doesn't have real-time cloud detection and response is now obsolete," Zaitsev said in an exclusive interview with VentureBeat.
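Neither CrowdStrike nor AWS publishes this integration's code, so the sketch below only illustrates the architectural difference the article describes: instead of polling a log store every few minutes, an event rule fires a handler within seconds of each control-plane event. The rule name, the privilege-escalation heuristic, and the `quarantine` response are all hypothetical; the `put_rule` call itself is standard boto3.

```python
# Illustrative contrast only; not CrowdStrike's implementation.
import json
import boto3

# Batch approach: poll a log store every N minutes, then correlate.
# By the time such a job runs, a machine-speed attacker may be long gone.

# Event-driven approach: subscribe to the control-plane event stream.
events = boto3.client("events")
events.put_rule(
    Name="suspicious-iam-activity",  # hypothetical rule name
    EventPattern=json.dumps({
        "source": ["aws.iam"],
        "detail-type": ["AWS API Call via CloudTrail"],
    }),
    State="ENABLED",
)
# (Wiring the rule to a Lambda target via put_targets is omitted for brevity.)

def handler(event, context):
    """Lambda target: invoked per event, seconds after it occurs."""
    detail = event.get("detail", {})
    if detail.get("eventName") == "AttachRolePolicy":  # e.g. privilege escalation
        quarantine(detail)  # hypothetical automated response

def quarantine(detail):
    print("revoking session for", detail.get("userIdentity", {}).get("arn"))
```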
By contrast, EventBridge provides asynchronous, microservice-based, just-in-time event processing. "We're not waiting five minutes for a bucket of data," he said. But tapping into the stream is only half the problem. "Can you actually keep up with that firehose? Can you process it fast enough to matter?" Zaitsev asked rhetorically. CrowdStrike claims it can handle 60 million events per second. "This isn't duct tape and a demo."

The underlying streaming technology isn't new to CrowdStrike. Falcon Adversary OverWatch has been running stream processing for 15 years to hunt across CrowdStrike's customer base, processing logs in real time rather than waiting for batch cycles to complete.

The platform integrates Charlotte AI for automated triage, matching expert managed detection and response (MDR) analysts with 98% accuracy and cutting 40-plus hours of manual work weekly. When the system detects a control plane compromise, it doesn't wait for human approval. It revokes tokens, kills sessions, boots the attacker and nukes malicious CloudFormation templates, all before the adversary can execute.

What this means for the CNAPP market

Cloud security is the fastest-growing segment in Gartner's latest forecast, expanding at a 25.9% CAGR through 2028. Precedence Research projects the market will grow from $36 billion in 2024 to $121 billion by 2034. And it's crowded: Palo Alto Networks, Wiz (now absorbed into Google via a $32 billion acquisition), Microsoft, Orca and SentinelOne, to name a few. CrowdStrike already had a seat at the table as a Leader in the 2025 IDC MarketScape for CNAPP for the third consecutive year. Gartner predicts that by 2029, 40% of enterprises that successfully implement zero trust in cloud environments will rely on CNAPP platforms due to their visibility and control.

But Zaitsev is making a bigger claim, stating that today's announcement redefines what "complete" means for CNAPP in hybrid environments. "CSPM isn't going away. Cloud workload protection isn't going away. What becomes obsolete is calling something a CNAPP when it lacks real-time cloud detection and response. You're missing the safety net, the thing that catches what gets through proactive defenses. And in hybrid, something always gets through."

"The unified platform angle matters specifically for hybrid," he said. "Adversaries deliberately hop between environments because they know defenders run different tools, often different teams, for cloud versus on-prem versus identity. Jumping domains is how you shake your tail. Attackers know most organizations can't follow them across the seams. With us, they can't do that anymore."

Building hybrid security for the AI era

Reinventing hybrid cloud security won't happen overnight. Here's where CISOs should focus:

• Map your hybrid visibility gaps: Every cloud workload, every on-prem system, every identity traversing between them. If 82% of breaches trace to blind spots, know where yours are before attackers find them.
• Pressure vendors on detection latency: Ask challenging questions about architecture. If they're running batch-based processing, understand what a 15-minute window means when adversaries move in seconds.
• Deploy AI triage now: With 40% of alerts going uninvestigated and 71% of analysts burned out, automation isn't a roadmap item; it's a must-have for a successful deterrence strategy. Look for measurable accuracy rates and real-time savings.
• Compress patch cycles to 72 hours: AI-assisted reverse engineering has collapsed the exploit window.
Monthly patch cycles don't cut it anymore.
• Architect for permanent hybrid: Stop waiting for cloud migration to simplify security. It won't. Design for complexity as the baseline, not a temporary state. The 54% of enterprises running hybrid models today will still be hybrid tomorrow.

The bottom line

Hybrid cloud security must be reinvented for the AI era. Previous-generation hybrid cloud security solutions are quickly being eclipsed by weaponized AI attacks, often launched as machine-on-machine intrusion attempts. The evidence is clear: 55% breach rates, 91% of security leaders making compromises they know are dangerous, and AI-accelerated attacks that move faster than batch-based detection can respond. Architectures designed for human-speed threats can't protect against machine-speed adversaries.

"Modern cybersecurity is about differentiating between acceptable and unacceptable risk," says Chaim Mazal, CSO at Gigamon. "Our research shows where CISOs are drawing that line, highlighting the critical importance of visibility into all data-in-motion to secure complex hybrid cloud infrastructure against today's emerging threats. It's clear that current approaches aren't keeping pace, which is why CISOs must reevaluate tool stacks and reprioritize investments and resources to more confidently secure their infrastructure."

VentureBeat will be tracking which approaches to hybrid cloud reinvention actually deliver, and which don't, in the months ahead.
Opening the black box of ML models, step by step, directly in Excel
Read the full post, "The Machine Learning and Deep Learning 'Advent Calendar' Series: The Blueprint," on Towards Data Science.
This article is divided into four parts; they are:
• Optimizers for Training Language Models
• Learning Rate Schedulers
• Sequence Length Scheduling
• Other Techniques to Help Training Deep Learning Models
Adam has been the most popular optimizer for training deep learning models.
A modification to the Boruta algorithm that dramatically reduces computation while maintaining high sensitivity
Read the full post, "The Greedy Boruta Algorithm: Faster Feature Selection Without Sacrificing Recall," on Towards Data Science.
Open-source image gen, Perplexity memory, AI society, dot-com vibes, and more...
Enterprises are investing billions of dollars in AI agents and infrastructure to transform business processes. However, we are seeing limited success in real-world applications, often due to the inability of agents to truly understand business data, policies and processes. While we manage the integrations well with technologies like API management, model context protocol (MCP) and others, having agents truly understand the "meaning" of data in the context of a given business is a different story. Enterprise data is mostly siloed in disparate systems in structured and unstructured forms, and it needs to be analyzed with a domain-specific business lens.

As an example, the term "customer" may refer to a different group of people in a sales CRM system than in a finance system, which may use this tag only for paying clients. One department might define "product" as a SKU; another may represent it as a product family; a third as a marketing bundle. Data about "product sales" thus varies in meaning without agreed-upon relationships and definitions. For agents to combine data from multiple systems, they must understand these different representations. Agents need to know what the data means in context and how to find the right data for the right process.

Moreover, schema changes in systems and data quality issues during collection can introduce more ambiguity, leaving agents unsure how to act when such situations are encountered. Furthermore, classification of data into categories like PII (personally identifiable information) needs to be rigorously followed to maintain compliance with standards like GDPR and CCPA. This requires the data to be labelled correctly and agents to be able to understand and respect this classification. Hence, building a cool demo using agents is very much doable, but putting agents into production on real business data is a different story altogether.

The ontology-based source of truth

Building effective agentic solutions requires an ontology-based single source of truth. An ontology is a business definition of concepts, their hierarchy and relationships. It defines terms with respect to business domains, helps establish a single source of truth for data, captures uniform field names and applies classifications to fields. An ontology may be domain-specific (healthcare or finance) or organization-specific, based on internal structures. Defining an ontology upfront is time consuming, but it can help standardize business processes and lay a strong foundation for agentic AI.

An ontology may be realized using common queryable formats like a triplestore. More complex business rules with multi-hop relations could use a labelled property graph like Neo4j. These graphs can also help enterprises discover new relationships and answer complex questions. Ontologies like FIBO (Finance Industry Business Ontology) and UMLS (Unified Medical Language System) are available in the public domain and can be a very good starting point. However, these usually need to be customized to capture the specific details of an enterprise.

Getting started with ontology

Once implemented, an ontology can be the driving force for enterprise agents. We can now prompt AI to follow the ontology and use it to discover data and relationships. If needed, we can have an agentic layer serve key details of the ontology itself and discover data. Business rules and policies can be implemented in this ontology for agents to adhere to, as the sketch below illustrates.
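As a rough illustration of ontology-backed data discovery (every label, property, and credential below is a hypothetical example, not part of FIBO, UMLS, or any vendor schema), an agent might resolve what "customer" means across systems with a query like this:

```python
# Hypothetical ontology lookup in Neo4j: resolve what "customer" means
# across systems. Labels and relationship names are invented for illustration.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

CYPHER = """
MATCH (t:BusinessTerm {name: 'customer'})-[:MAPS_TO]->(f:SystemField)
RETURN f.system AS system, f.field AS field, f.classification AS classification
"""

with driver.session() as session:
    for record in session.run(CYPHER):
        # e.g. system='SalesCRM', field='contact_id', classification='PII'
        print(record["system"], record["field"], record["classification"])
```

Because the classification travels with the mapping, an agent that discovers a field through the graph also learns whether it is PII and must be handled accordingly.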
This is an excellent way to ground your agents and establish guardrails based on real business context. Agents designed in this manner and tuned to follow an ontology can stick to guardrails and avoid the hallucinations that can be caused by the large language models (LLMs) powering them. For example, a business policy may define that unless all documents associated with a loan have their verified flags set to "true," the loan status should be kept in a "pending" state. Agents can work within this policy, determine what documents are needed and query the knowledge base.

Here's an example implementation (original figure by the author).

As illustrated, structured and unstructured data are processed by a document intelligence (DocIntel) agent, which populates a Neo4j database based on an ontology of the business domain. A data discovery agent finds and queries the right data in Neo4j and passes it to other agents handling business process execution. The inter-agent communication happens over a popular protocol like A2A (agent to agent). A new protocol called AG-UI (Agent User Interaction) can help build more generic UI screens to capture the workings and responses of these agents.

With this method, we can avoid hallucinations by requiring agents to follow ontology-driven paths and maintain data classifications and relationships. Moreover, we can scale easily by adding new assets, relationships and policies that agents automatically comply with, and control hallucinations by defining rules for the whole system rather than for individual entities. For example, if an agent hallucinates an individual "customer," the connected data for that hallucinated "customer" will not be verifiable during data discovery, so we can easily detect the anomaly and plan to eliminate it. This helps the agentic system scale with the business and manage its dynamic nature.

Indeed, a reference architecture like this adds some overhead in data discovery and graph databases. But for a large enterprise, it adds the right guardrails and gives agents direction to orchestrate complex business processes.

Dattaraj Rao is innovation and R&D architect at Persistent Systems.

Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.
As AI systems enter production, reliability and governance can't depend on wishful thinking. Here's how observability turns large language models (LLMs) into auditable, trustworthy enterprise systems.

Why observability secures the future of enterprise AI

The enterprise race to deploy LLM systems mirrors the early days of cloud adoption. Executives love the promise; compliance demands accountability; engineers just want a paved road. Yet, beneath the excitement, most leaders admit they can't trace how AI decisions are made, whether they helped the business, or if they broke any rule.

Take one Fortune 100 bank that deployed an LLM to classify loan applications. Benchmark accuracy looked stellar. Yet, six months later, auditors found that 18% of critical cases were misrouted, without a single alert or trace. The root cause wasn't bias or bad data. It was invisibility: no observability, no accountability.

If you can't observe it, you can't trust it. And unobserved AI will fail in silence. Visibility isn't a luxury; it's the foundation of trust. Without it, AI becomes ungovernable.

Start with outcomes, not models

Most corporate AI projects begin with tech leaders choosing a model and, later, defining success metrics.
That's backward. Flip the order:

1. Define the outcome first. What's the measurable business goal?
• Deflect 15% of billing calls
• Reduce document review time by 60%
• Cut case-handling time by two minutes
2. Design telemetry around that outcome, not around "accuracy" or "BLEU score."
3. Select prompts, retrieval methods and models that demonstrably move those KPIs.

At one global insurer, for instance, reframing success as "minutes saved per claim" instead of "model precision" turned an isolated pilot into a company-wide roadmap.

A 3-layer telemetry model for LLM observability

Just like microservices rely on logs, metrics and traces, AI systems need a structured observability stack:

a) Prompts and context: What went in
• Log every prompt template, variable and retrieved document.
• Record model ID, version, latency and token counts (your leading cost indicators).
• Maintain an auditable redaction log showing what data was masked, when and by which rule.

b) Policies and controls: The guardrails
• Capture safety-filter outcomes (toxicity, PII), citation presence and rule triggers.
• Store policy reasons and risk tier for each deployment.
• Link outputs back to the governing model card for transparency.

c) Outcomes and feedback: Did it work?
• Gather human ratings and edit distances from accepted answers.
• Track downstream business events: case closed, document approved, issue resolved.
• Measure the KPI deltas: call time, backlog, reopen rate.

All three layers connect through a common trace ID, enabling any decision to be replayed, audited or improved.

Diagram © SaiKrishna Koorapati (2025). Created specifically for this article; licensed to VentureBeat for publication.

Apply SRE discipline: SLOs and error budgets for AI

Site reliability engineering (SRE) transformed software operations; now it's AI's turn. Define three "golden signals" for every critical workflow:

Signal | Target SLO | When breached
Factuality | ≥ 95% verified against source of record | Fall back to verified template
Safety | ≥ 99.9% pass toxicity/PII filters | Quarantine and human review
Usefulness | ≥ 80% accepted on first pass | Retrain or roll back prompt/model

If hallucinations or refusals exceed budget, the system auto-routes to safer prompts or human review, just like rerouting traffic during a service outage. This isn't bureaucracy; it's reliability applied to reasoning.

Build the thin observability layer in two agile sprints

You don't need a six-month roadmap; just focus and two short sprints.

Sprint 1 (weeks 1-3): Foundations
• Version-controlled prompt registry
• Redaction middleware tied to policy
• Request/response logging with trace IDs
• Basic evaluations (PII checks, citation presence)
• Simple human-in-the-loop (HITL) UI

Sprint 2 (weeks 4-6): Guardrails and KPIs
• Offline test sets (100–300 real examples)
• Policy gates for factuality and safety
• Lightweight dashboard tracking SLOs and cost
• Automated token and latency tracker

In six weeks, you'll have the thin layer that answers 90% of governance and product questions.

Make evaluations continuous (and boring)

Evaluations shouldn't be heroic one-offs; they should be routine.
• Curate test sets from real cases; refresh 10–20% monthly.
• Define clear acceptance criteria shared by product and risk teams.
• Run the suite on every prompt/model/policy change and weekly for drift checks.
• Publish one unified scorecard each week covering factuality, safety, usefulness and cost.

When evals are part of CI/CD, they stop being compliance theater and become operational pulse checks.
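As a minimal sketch of what such a CI/CD eval gate might look like, assuming the golden-signal thresholds from the table above and a hypothetical eval-suite result format:

```python
# Minimal sketch of an eval gate in CI: block a prompt/model change when
# any golden signal falls below its SLO. Thresholds mirror the table above;
# the result format is a hypothetical example.
SLOS = {"factuality": 0.95, "safety": 0.999, "usefulness": 0.80}

def gate(results: dict[str, float]) -> None:
    """results maps each signal to its measured pass rate on the test set."""
    failures = {s: r for s, r in results.items() if r < SLOS[s]}
    if failures:
        raise SystemExit(f"Deployment blocked; SLO breaches: {failures}")
    print("All golden signals within budget; safe to ship.")

# Example run with measured pass rates from a nightly eval suite:
gate({"factuality": 0.97, "safety": 0.9995, "usefulness": 0.83})
```

Wiring this into the pipeline means a regression in any signal fails the build, the same way a broken unit test would.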
Apply human oversight where it matters

Full automation is neither realistic nor responsible. High-risk or ambiguous cases should escalate to human review.
• Route low-confidence or policy-flagged responses to experts.
• Capture every edit and reason as training data and audit evidence.
• Feed reviewer feedback back into prompts and policies for continuous improvement.

At one health-tech firm, this approach cut false positives by 22% and produced a retrainable, compliance-ready dataset in weeks.

Cost control through design, not hope

LLM costs grow non-linearly. Budgets won't save you; architecture will.
• Structure prompts so deterministic sections run before generative ones.
• Compress and rerank context instead of dumping entire documents.
• Cache frequent queries and memoize tool outputs with TTLs.
• Track latency, throughput and token use per feature.

When observability covers tokens and latency, cost becomes a controlled variable, not a surprise.

The 90-day playbook

Within three months of adopting observable AI principles, enterprises should see:
• 1–2 production AI assists with HITL for edge cases
• An automated evaluation suite for pre-deploy and nightly runs
• A weekly scorecard shared across SRE, product and risk
• Audit-ready traces linking prompts, policies and outcomes

At a Fortune 100 client, this structure reduced incident time by 40% and aligned product and compliance roadmaps.

Scaling trust through observability

Observable AI is how you turn AI from experiment into infrastructure. With clear telemetry, SLOs and human feedback loops:
• Executives gain evidence-backed confidence.
• Compliance teams get replayable audit chains.
• Engineers iterate faster and ship safely.
• Customers experience reliable, explainable AI.

Observability isn't an add-on layer; it's the foundation for trust at scale.

SaiKrishna Koorapati is a software engineering leader.

Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.
The most dangerous KPIs aren't the broken ones; they're the ones trusted long after they've lost their meaning.
Read the full post, "Metric Deception: When Your Best KPIs Hide Your Worst Failures," on Towards Data Science.
Learn how to scale your LLM usage to achieve greater productivity.
Read the full post, "How to Scale Your LLM Usage," on Towards Data Science.
User data exposed, AI sacrifice test, Sora limits, robot bubble warning, and more...
This article is divided into two parts; they are:
• Fine-tuning a BERT Model for GLUE Tasks
• Fine-tuning a BERT Model for SQuAD Tasks
GLUE is a benchmark for evaluating natural language understanding (NLU) tasks.
Agent memory remains a problem that enterprises want to fix, as agents forget some instructions or conversations the longer they run. Anthropic believes it has solved this issue for its Claude Agent SDK, developing a two-fold solution that allows an agent to work across different context windows.

"The core challenge of long-running agents is that they must work in discrete sessions, and each new session begins with no memory of what came before," Anthropic wrote in a blog post. "Because context windows are limited, and because most complex projects cannot be completed within a single window, agents need a way to bridge the gap between coding sessions."

Anthropic engineers proposed a two-fold approach for its Agent SDK: an initializer agent to set up the environment, and a coding agent to make incremental progress in each session and leave artifacts for the next.

The agent memory problem

Since agents are built on foundation models, they remain constrained by limited, although continually growing, context windows. For long-running agents, this creates a larger problem, leading the agent to forget instructions and behave abnormally while performing a task. Enhancing agent memory becomes essential for consistent, business-safe performance.

Several methods emerged over the past year, all attempting to bridge the gap between context windows and agent memory. LangChain's LangMem SDK, Memobase and OpenAI's Swarm are examples of companies offering memory solutions. Research on agentic memory has also exploded recently, with proposed frameworks like Memp and the Nested Learning Paradigm from Google offering new alternatives to enhance memory. Many of the current memory frameworks are open source and can ideally adapt to different large language models (LLMs) powering agents. Anthropic's approach improves its Claude Agent SDK.

How it works

Anthropic identified that even though the Claude Agent SDK had context management capabilities and it "should be possible for an agent to continue to do useful work for an arbitrarily long time," this was not sufficient. The company said in its blog post that a model like Opus 4.5 running the Claude Agent SDK can "fall short of building a production-quality web app if it's only given a high-level prompt, such as 'build a clone of claude.ai.'"

The failures manifested in two patterns, Anthropic said. First, the agent tried to do too much, causing the model to run out of context in the middle of the work. The agent then has to guess what happened and cannot pass clear instructions to the next session. The second failure occurs later on, after some features have already been built: the agent sees progress has been made and simply declares the job done.

Anthropic researchers broke down the solution: set up an initial environment that lays the foundation for features, and prompt each agent to make incremental progress towards the goal while still leaving a clean slate at the end. This is where the two-part solution comes in. The initializer agent sets up the environment, logging what agents have done and which files have been added. The coding agent then makes incremental progress and leaves structured updates for the next session.

"Inspiration for these practices came from knowing what effective software engineers do every day," Anthropic said. The researchers said they added testing tools to the coding agent, improving its ability to identify and fix bugs that weren't obvious from the code alone.
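Anthropic's post describes these artifacts conceptually rather than as a published API. A minimal sketch of the pattern, with an invented file name and schema, might look like this:

```python
# Hypothetical sketch of cross-session memory via an on-disk artifact.
# The file name and schema are invented; Anthropic's SDK does not
# prescribe this exact format.
import json
from datetime import datetime, timezone
from pathlib import Path

STATE = Path("agent_progress.json")

def init_environment():
    """Initializer agent: create the artifact the coding agent will read."""
    STATE.write_text(json.dumps({"goal": "build web app",
                                 "done": [], "next": []}))

def run_session(task: str):
    """Coding agent: read prior state, do one increment, leave a clean update."""
    state = json.loads(STATE.read_text())
    # ... make incremental progress on `task` within this context window ...
    state["done"].append({"task": task,
                          "at": datetime.now(timezone.utc).isoformat()})
    STATE.write_text(json.dumps(state, indent=2))  # artifact for next session
```

The point of the structured update is that the next session starts from a verifiable record rather than guessing what its predecessor accomplished.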
Future research

Anthropic noted that its approach is “one possible set of solutions in a long-running agent harness.” This is just the beginning of what could become a wider research area for many in the AI space. The company said its experiments with boosting long-term agent memory haven’t shown whether a single general-purpose coding agent or a multi-agent structure works best across contexts. Its demo also focused on full-stack web-app development, so further experiments should focus on generalizing the results across different tasks. “It’s likely that some or all of these lessons can be applied to the types of long-running agentic tasks required in, for example, scientific research or financial modeling,” Anthropic said.
Hello, dear readers. Happy belated Thanksgiving and Black Friday!

This year has felt like living inside a permanent DevDay. Every week, some lab drops a new model, a new agent framework, or a new “this changes everything” demo. It’s overwhelming. But it’s also the first year I’ve felt like AI is finally diversifying — not just one or two frontier models in the cloud, but a whole ecosystem: open and closed, giant and tiny, Western and Chinese, cloud and local. So for this Thanksgiving edition, here’s what I’m genuinely thankful for in AI in 2025 — the releases that feel like they’ll matter in 12–24 months, not just during this week’s hype cycle.

1. OpenAI kept shipping strong: GPT-5, GPT-5.1, Atlas, Sora 2 and open weights

As the company that undeniably birthed the "generative AI" era with its viral hit product ChatGPT in late 2022, OpenAI arguably had among the hardest tasks of any AI company in 2025: continue its growth trajectory even as well-funded competitors like Google, with its Gemini models, and startups like Anthropic fielded their own highly competitive offerings. Thankfully, OpenAI rose to the challenge and then some.

Its headline act was GPT-5, unveiled in August as the next frontier reasoning model, followed in November by GPT-5.1 with new Instant and Thinking variants that dynamically adjust how much “thinking time” they spend per task. In practice, GPT-5’s launch was bumpy — VentureBeat documented early math and coding failures and a cooler-than-expected community reaction in “OpenAI’s GPT-5 rollout is not going smoothly” — but the company quickly course-corrected based on user feedback, and as a daily user of the model, I'm personally pleased and impressed with it. At the same time, enterprises actually using the models are reporting solid gains. ZenDesk Global, for example, says GPT-5-powered agents now resolve more than half of customer tickets, with some customers seeing 80–90% resolution rates. That’s the quiet story: these models may not always impress the chattering classes on X, but they’re starting to move real KPIs.

On the tooling side, OpenAI finally gave developers a serious AI engineer with GPT-5.1-Codex-Max, a new coding model that can run long, agentic workflows and is already the default in OpenAI’s Codex environment. VentureBeat covered it in detail in “OpenAI debuts GPT-5.1-Codex-Max coding model and it already completed a 24-hour task internally.” Then there’s ChatGPT Atlas, a full browser with ChatGPT baked into the chrome itself — sidebar summaries, on-page analysis, and search tightly integrated into regular browsing. It’s the clearest sign yet that “assistant” and “browser” are on a collision course.

On the media side, Sora 2 turned the original Sora video demo into a full video-and-audio model with better physics, synchronized sound and dialogue, and more control over style and shot structure, plus a dedicated Sora app with a full-fledged social networking component, allowing any user to create their own TV network in their pocket. Finally — and maybe most symbolically — OpenAI released gpt-oss-120B and gpt-oss-20B, open-weight MoE reasoning models under an Apache 2.0–style license. Whatever you think of their quality (and early open-source users have been loud about their complaints), this is the first time since GPT-2 that OpenAI has put serious weights into the public commons.
2. China’s open-source wave goes mainstream

If 2023–24 was about Llama and Mistral, 2025 belongs to China’s open-weight ecosystem. A study from MIT and Hugging Face found that China now slightly leads the U.S. in global open-model downloads, largely thanks to DeepSeek and Alibaba’s Qwen family. Highlights:

• DeepSeek-R1 dropped in January as an open-source reasoning model rivaling OpenAI’s o1, with MIT-licensed weights and a family of distilled smaller models. VentureBeat has followed the story from its release to its cybersecurity impact to performance-tuned R1 variants.
• Kimi K2 Thinking from Moonshot, a “thinking” open-source model that reasons step-by-step with tools, very much in the o1/R1 mold, and is positioned as the strongest open reasoning model released so far.
• Z.ai shipped GLM-4.5 and GLM-4.5-Air as “agentic” models, open-sourcing base and hybrid reasoning variants on GitHub.
• Baidu’s ERNIE 4.5 family arrived as a fully open-sourced, multimodal MoE suite under Apache 2.0, including a 0.3B dense model and visual “Thinking” variants focused on charts, STEM, and tool use.
• Alibaba’s Qwen3 line — including Qwen3-Coder, large reasoning models, and the Qwen3-VL series released over the summer and fall of 2025 — continues to set a high bar for open weights in coding, translation, and multimodal reasoning, leading me to declare this past summer “Qwen’s summer.”

VentureBeat has been tracking these shifts, including Chinese math and reasoning models like Light-R1-32B and Weibo’s tiny VibeThinker-1.5B, which beat DeepSeek baselines on shoestring training budgets. If you care about open ecosystems or on-premise options, this is the year China’s open-weight scene stopped being a curiosity and became a serious alternative.

3. Small and local models grow up

Another thing I’m thankful for: we’re finally getting good small models, not just toys. Liquid AI spent 2025 pushing its Liquid Foundation Models (LFM2) and LFM2-VL vision-language variants, designed from day one for low-latency, device-aware deployments — edge boxes, robots, and constrained servers, not just giant clusters. The newer LFM2-VL-3B targets embedded robotics and industrial autonomy, with demos planned at ROSCon.

On the big-tech side, Google’s Gemma 3 line made a strong case that “tiny” can still be capable. Gemma 3 spans from 270M parameters up through 27B, all with open weights and multimodal support in the larger variants. The standout is Gemma 3 270M, a compact model purpose-built for fine-tuning and structured text tasks — think custom formatters, routers, and watchdogs — covered both in Google’s developer blog and in community discussions in local-LLM circles. These models may never trend on X, but they’re exactly what you need for privacy-sensitive workloads, offline workflows, thin-client devices, and “agent swarms” where you don’t want every tool call hitting a giant frontier LLM.

4. Meta + Midjourney: aesthetics as a service

One of the stranger twists this year: Meta partnered with Midjourney instead of simply trying to beat it. In August, Meta announced a deal to license Midjourney’s “aesthetic technology” — its image and video generation stack — and integrate it into Meta’s future models and products, from Facebook and Instagram feeds to Meta AI features. VentureBeat covered the partnership in “Meta is partnering with Midjourney and will license its technology for future models and products,” raising the obvious question: does this slow or reshape Midjourney’s own API roadmap?
We’re still awaiting a definitive answer, but Midjourney’s stated plans for an API release have unfortunately yet to materialize, which suggests that it has. For creators and brands, though, the immediate implication is simple: Midjourney-grade visuals start showing up in mainstream social tools instead of being locked away in a Discord bot. That could normalize higher-quality AI art for a much wider audience — and force rivals like OpenAI, Google, and Black Forest Labs to keep raising the bar.

5. Google’s Gemini 3 and Nano Banana Pro

Google tried to answer GPT-5 with Gemini 3, billed as its most capable model yet, with better reasoning, coding, and multimodal understanding, plus a new Deep Think mode for slow, hard problems. VentureBeat’s coverage, “Google unveils Gemini 3 claiming the lead in math, science, multimodal and agentic AI,” framed it as a direct shot at frontier benchmarks and agentic workflows. But the surprise hit is Nano Banana Pro (Gemini 3 Pro Image), Google’s new flagship image generator. It specializes in infographics, diagrams, multi-subject scenes, and multilingual text that actually renders legibly at 2K and 4K resolutions. In the world of enterprise AI — where charts, product schematics, and “explain this system visually” images matter more than fantasy dragons — that’s a big deal.

6. Wild cards I’m keeping an eye on

A few more releases I’m thankful for, even if they don’t fit neatly into one bucket:

• Black Forest Labs’ Flux.2 image models, which launched earlier this week with ambitions to challenge both Nano Banana Pro and Midjourney on quality and control. VentureBeat dug into the details in “Black Forest Labs launches Flux.2 AI image models to challenge Nano Banana Pro and Midjourney.”
• Anthropic’s Claude Opus 4.5, a new flagship that aims for cheaper, more capable coding and long-horizon task execution, covered in “Anthropic’s Claude Opus 4.5 is here: Cheaper AI, infinite chats, and coding skills that beat humans.”
• A steady drumbeat of open math/reasoning models — from Light-R1 to VibeThinker and others — showing you don’t need $100M training runs to move the needle.

Last thought (for now)

If 2024 was the year of “one big model in the cloud,” 2025 is the year the map exploded: multiple frontiers at the top, China taking the lead in open models, small and efficient systems maturing fast, and creative ecosystems like Midjourney getting pulled into big-tech stacks. I’m thankful not just for any single model, but for the fact that we now have options — closed and open, local and hosted, reasoning-first and media-first. For journalists, builders, and enterprises, that diversity is the real story of 2025.

Happy holidays and best to you and your loved ones!
An honest view from an AI engineer with 10 years of experience
The post Data Science in 2026: Is It Still Worth It? appeared first on Towards Data Science.
Set up, build, and test agentic apps with Claude Code, powered by your locally installed Claude CLI and Claude Code subscription.
The simple shift in training that unlocks foresight, faster inference, and better reasoning.
The post Your Next LLM Might Not Predict Tokens One-by-One appeared first on Towards Data Science.
These five configurations can turn your Docker setup from a slow chore into a finely tuned machine.
How product, growth and engineering teams can converge on a single signal for better incident management
The post The Product Health Score: How I Reduced Critical Incidents by 35% with Unified Monitoring and n8n Automation appeared first on Towards Data Science.
AI and machine learning have pushed the demand for high-performance hardware, making the GPU-versus-TPU discussion more relevant than ever. GPUs, originally built for graphics, have grown into flexible processors for data analysis, scientific computing, and modern AI workloads. TPUs, built by Google as specialized ASICs for deep learning, focus on high-throughput tensor operations and have […]
The post GPU vs TPU: What’s the Difference? appeared first on Analytics Vidhya.
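To make the distinction concrete, here is a minimal sketch of a compiled tensor computation that runs on whichever accelerator is present, assuming JAX installed with a GPU or TPU backend; both chips are programmed through the same XLA-compiled code path:

```python
# Minimal sketch: the same matrix multiply, JIT-compiled via XLA, runs on a
# GPU or a TPU (or falls back to CPU) without code changes.
import jax
import jax.numpy as jnp

print(jax.devices())  # e.g. a CUDA device on GPU or a TpuDevice on TPU

@jax.jit  # compiled by XLA for whichever backend is available
def matmul(a, b):
    return jnp.dot(a, b)

a = jnp.ones((1024, 1024))
b = jnp.ones((1024, 1024))
print(matmul(a, b).shape)  # (1024, 1024), executed on the accelerator
```

The practical difference shows up in throughput on large batched tensor operations, where the TPU’s specialized design shines, while the GPU retains the edge in flexibility across mixed workloads.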
It turns out all the guardrails in the world won’t protect a chatbot from meter and rhyme.