Just a few short weeks ago, Google debuted its Gemini 3 model, claiming a leadership position across multiple AI benchmarks. But the challenge with vendor-provided benchmarks is that they are just that: vendor-provided. A new vendor-neutral evaluation from Prolific, however, also puts Gemini 3 at the top of the leaderboard. This isn't on a set of academic benchmarks; rather, it's on a set of real-world attributes that actual users and organizations care about.

Prolific was founded by researchers at the University of Oxford. The company delivers high-quality, reliable human data to power rigorous research and ethical AI development. Its HUMAINE benchmark applies this approach by using representative human sampling and blind testing to rigorously compare AI models across a variety of user scenarios, measuring not just technical performance but also user trust, adaptability and communication style.

The latest HUMAINE evaluation enlisted 26,000 users in a blind test of the models. In the evaluation, Gemini 3 Pro's trust score surged from 16% to 69%, the highest ever recorded by Prolific. Gemini 3 now ranks number one overall in trust, ethics and safety 69% of the time across demographic subgroups, compared to its predecessor Gemini 2.5 Pro, which held the top spot only 16% of the time.

Overall, Gemini 3 ranked first in three of four evaluation categories: performance and reasoning; interaction and adaptiveness; and trust and safety. It lost only on communication style, where DeepSeek V3 topped preferences at 43%. The HUMAINE evaluation also showed that Gemini 3 performed consistently well across 22 different demographic user groups, including variations in age, sex, ethnicity and political orientation.
The evaluation also found that users are now five times more likely to choose the model in head-to-head blind comparisons. But the ranking matters less than why it won.

"It's the consistency across a very wide range of different use cases, and a personality and a style that appeals across a wide range of different user types," Phelim Bradley, co-founder and CEO of Prolific, told VentureBeat. "Although in some specific instances, other models are preferred by either small subgroups or on a particular conversation type, it's the breadth of knowledge and the flexibility of the model across a range of different use cases and audience types that allowed it to win this particular benchmark."

How blinded testing reveals what academic benchmarks miss

HUMAINE's methodology exposes gaps in how the industry evaluates models. Users interact with two models simultaneously in multi-turn conversations. They don't know which vendors power each response. They discuss whatever topics matter to them, not predetermined test questions.

It's the sample itself that matters. HUMAINE uses representative sampling across U.S. and UK populations, controlling for age, sex, ethnicity and political orientation. This reveals something static benchmarks can't capture: Model performance varies by audience.

"If you take an AI leaderboard, the majority of them still could have a fairly static list," Bradley said. "But for us, if you control for the audience, we end up with a slightly different leaderboard, whether you're looking at a left-leaning sample, right-leaning sample, U.S., UK. And I think age was actually the most different stated condition in our experiment."

For enterprises deploying AI across diverse employee populations, this matters. A model that performs well for one demographic may underperform for another.

The methodology also addresses a fundamental question in AI evaluation: Why use human judges at all when AI could evaluate itself?
Bradley noted that his firm does use AI judges in certain use cases, although he stressed that human evaluation is still the critical factor.

"We see the biggest benefit coming from smart orchestration of both LLM judge and human data; both have strengths and weaknesses that, when smartly combined, do better together," said Bradley. "But we still think that human data is where the alpha is. We're still extremely bullish that human data and human intelligence is required to be in the loop."

What trust means in AI evaluation

Trust, ethics and safety measures user confidence in reliability, factual accuracy and responsible behavior. In HUMAINE's methodology, trust isn't a vendor claim or a technical metric; it's what users report after blinded conversations with competing models.

The 69% figure represents the probability that Gemini 3 ranked first in trust across demographic groups. This consistency matters more than aggregate scores because organizations serve diverse populations.

"There was no awareness that they were using Gemini in this scenario," Bradley said. "It was based only on the blinded multi-turn response."

This separates perceived trust from earned trust. Users judged model outputs without knowing which vendor produced them, eliminating Google's brand advantage. For customer-facing deployments where the AI vendor remains invisible to end users, this distinction matters.

What enterprises should do now

One critical step for enterprises weighing different models is to embrace an evaluation framework that works.

"It is increasingly challenging to evaluate models exclusively based on vibes," Bradley said. "I think increasingly we need more rigorous, scientific approaches to truly understand how these models are performing."

The HUMAINE data provides a framework: Test for consistency across use cases and user demographics, not just peak performance on specific tasks. Blind the testing to separate model quality from brand perception.
Use representative samples that match your actual user population. Plan for continuous evaluation as models change.

For enterprises looking to deploy AI at scale, this means moving beyond "which model is best" to "which model is best for our specific use case, user demographics and required attributes." The rigor of representative sampling and blind testing provides the data to make that determination, something technical benchmarks and vibes-based evaluation cannot deliver.
The Chinese video game giant Tencent is now building some of the world’s best 3D AI models. This could have implications far outside gaming.
OpenAI is testing another new way to expose the complicated processes at work inside large language models. Researchers at the company can make an LLM produce what they call a confession, in which the model explains how it carried out a task and (most of the time) owns up to any bad behavior. Figuring out…
PyTorch Model Performance Analysis and Optimization — Part 11
Overcoming the Hidden Performance Traps of Variable-Shaped Tensors: Efficient Data Sampling in PyTorch (Towards Data Science)
Large language models generate text, not structured data.
See how different time series methods reveal the shifts, surges, and stabilization in inflation expectations.
From local distance to global probability
The Machine Learning "Advent Calendar" Day 3: GNB, LDA and QDA in Excel (Towards Data Science)
Learn more about Google Photos Recap — now available for 2025 — and how you can explore, customize and share it today.
The most famous applications of LLMs are the ones that I like to call the “wow effect LLMs.” There are plenty of viral LinkedIn posts about them, and they all sound like this: “I built [x] that does [y] in [z] minutes using AI.” Where: If you notice carefully, the focus of the sentence is […]
How to Turn Your LLM Prototype into a Production-Ready System (Towards Data Science)
What mattered: robust agents, glass-box reasoning, and red-team resilience
Multi-Agent Arena: Insights from London Great Agent Hack 2025 (Towards Data Science)
With insect-like speed and agility, the tiny robot could someday aid in search-and-rescue missions.
Explore top GitHub repositories to help you master this new style of coding and ship full-stack products faster than ever.
AI-powered startup Fortell has become a secret handshake for the privileged hearing-impaired crowd who swear by the product. Now, it wants to be in your ears.
Learn how to vibe-code your own website
How to Code Your Own Website with AI (Towards Data Science)
When AI systems were just a single model behind an API, life felt simpler. You trained, deployed, and maybe fine-tuned a few hyperparameters. But that world’s gone. Today, AI feels less like a single engine and more like a busy city—a network of small, specialized agents constantly talking to each other, calling APIs, automating workflows, […]
At the European Health Summit in Brussels, Greg Corrado, Distinguished Scientist at Google, released a new report authored by Implement Consulting Group and commissioned…
Macro, a modeling tool developed by the MIT Energy Initiative, enables energy-system planners to explore options for developing infrastructure to support decarbonized, reliable, and low-cost power grids.
The debate about open source AI has largely featured open weight models. But that’s a bit like arguing that in the PC era, the most important goal would have been to have Intel open source its chip designs. That might have been useful to some people, but it wouldn’t have created Linux, Apache, or the […]
Presented by Celonis

When tariff rates change overnight, companies have 48 hours to model alternatives and act before competitors secure the best options. At Celosphere 2025 in Munich, enterprises demonstrated how they're turning that chaos into competitive advantage, with quantifiable results that separate winners from losers.

Vinmar International: The global plastics and chemicals distributor created a real-time digital twin of its $3B supply chain, cutting default expedites by more than 20% and improving delivery agility across global operations.

Florida Crystals: One of America's largest cane sugar producers, the company unlocked millions in working capital and strengthened supply chain resilience by eliminating manual rework across Finance, Procurement, and Inbound Supply. AI pilots now extend gains into invoice processing, predictive maintenance, and order management.

ASOS: The ecommerce fashion giant connected its end-to-end supply chain for full transparency, reducing process variation, accelerating speed-to-market, and improving customer experience at scale.

The common thread here: process intelligence that bridges the gap traditional ERP systems can't close, connecting operational dots across ERP, finance, and logistics systems when seconds matter.

"The question isn't whether disruptions will hit," says Peter Budweiser, General Manager of Supply Chain at Celonis. "It's whether your systems can show you what's breaking fast enough to fix it."

That visibility gap costs the average company double-digit millions in working capital and competitive positioning. As 54% of supply chain leaders face disruptions daily, the pressure is shifting to AI agents that execute real actions: triggering purchase orders, rerouting shipments, adjusting inventory. But an autonomous agent acting on stale or siloed data can make million-dollar mistakes when tariff structures shift overnight.
Tariffs, as old as trade itself, have become the ultimate stress test for enterprise AI, revealing whether companies truly understand their supply chains and whether their AI can be trusted to act.

Modern ERP: Data rich, insight poor

Supply chain leaders face a paradox: drowning in data while starving for insight. Traditional enterprise systems — SAP, Oracle, PeopleSoft — capture every transaction meticulously. SAP logs the purchase order. Oracle tracks the shipment. The warehouse system records inventory movement. Each performs its function, but when tariffs change and companies need to model alternative sourcing scenarios across all three simultaneously, the data sits in silos.

"What's changed is the speed at which disruptions cascade," says Manik Sharma, Head of Supply Chain GTM AI at Celonis. "Traditional ERP systems weren't built for today's volatility."

Companies generate thousands of reports showing what happened last quarter. They struggle to answer what happens if tariffs increase 25% tomorrow and they need to switch suppliers within days.

Tariffs: The 48-hour scramble

Global trade volatility has transformed tariffs from predictable costs into strategic weapons. When new rates drop with unprecedented frequency, input costs spike across suppliers, finance teams scramble to calculate margin impact, and procurement races to identify alternatives buried in disconnected systems where no one knows if switching suppliers delays shipments or violates contracts.

By hour 48, competitors who already modeled scenarios execute supplier switches while late movers face capacity constraints and premium pricing. Process intelligence changes that dynamic by allowing businesses to continuously model "what-if" scenarios, showing leaders how tariff changes cascade through suppliers, contracts, production lines, warehouses, and customers.
When rates hit, companies can move within hours instead of days.

No AI without PI: Why process intelligence is non-negotiable for supply chains

AI and supply chains are mutually dependent: AI needs operational context, and supply chains need AI to keep pace with volatility. But here's the truth — there is no AI without PI. Without process intelligence, AI agents operate blindly.

The ongoing SAP migration wave illustrates why. An estimated 85–90% of SAP customers are still moving from ECC to S/4HANA. Moving to newer databases doesn't solve supply chain visibility — it provides faster access to the same fragmented data.

Kerry Brown, a transformation evangelist at Celonis, sees this across industries. "Organizations are shifting from PeopleSoft to Oracle, or EBS to Fusion. The bulk is in SAP," she explains. "But what they really need isn't a new ERP. They need to understand how work actually flows across systems they already have."

That requires end-to-end operational context. Process intelligence provides this by enabling companies to extract and connect event data across systems, showing how processes execute in real time.

This distinction becomes critical when deploying autonomous agents. When visibility is fragmented, autonomous agents can easily make decisions that appear rational locally but create downstream disruption. With real-time context, AI can operate with clarity and precision, and supply chains can stay ahead of tariff-driven disruption.

Digital Twins: Powering real-time response

The companies highlighted at Celosphere all applied the same principle: understand how processes run across systems in real time. Celonis PI creates a digital twin above existing systems, using its Process Intelligence Graph to link orders, shipments, invoices, and payments end-to-end. Dependencies that traditional integrations miss become visible.
A delay in SAP instantly reveals its impact across Oracle, warehouse scheduling, and customer delivery commitments.

"The platform brings together process data spanning systems and departments, enriched with business context that powers AI agents to transform operations effectively," says Daniel Brown, Chief Product Officer at Celonis. With this cross-system awareness, Celonis coordinates actions across complex workflows involving AI agents, humans, and automations — especially critical when tariffs force rapid decisions about suppliers, shipments, and customers.

Zero-copy integration enables instant modeling

A key advancement unveiled at Celosphere — zero-copy integration with Databricks — removes another barrier. Traditionally, analyzing supply chain data meant copying it from source systems into central warehouses, creating data latency.

Celonis Data Core now integrates directly with platforms like Databricks and Microsoft Fabric, querying billions of records in near real time without duplication. When trade policy shifts, companies model alternatives instantly, not after overnight data refresh cycles.

Enhanced Task Mining extends this by connecting desktop activity — keystrokes, mouse clicks, screen scrolls — to business processes. This exposes manual work invisible to system logs: spreadsheet gymnastics, email negotiations, phone calls that keep supply chains moving during urgent changes.

Competitive advantage in volatile markets

Most companies can't rip out and replace systems running critical operations — nor should they. Process intelligence offers a different path: compose workflows from existing systems, deploy AI where it creates value, and adapt continuously as conditions change.
This "Free the Process" movement liberates companies from rigid architectures without forcing wholesale replacement.

As global trade volatility intensifies, the companies that model will move faster, make smarter decisions, and turn tariff chaos into competitive advantage — all while existing ERPs keep running.

When the next wave of tariffs hits — and it will — companies won't have days to respond. They'll have hours. The question isn't whether your ERP captures the data. It's whether your systems connect the dots fast enough to matter.

Missed Celosphere 2025? Catch up with all the highlights here.

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they're always clearly marked. For more information, contact sales@venturebeat.com.
One problem enterprises face is getting employees to actually use the AI agents their dev teams have built. Google, which has already shipped many AI tools through its Workspace apps, has made Google Workspace Studio generally available to let more employees design, manage and share AI agents, further democratizing agentic workflows. This puts Google directly in competition with Microsoft's Copilot and undercuts some of the integrations that brought OpenAI's ChatGPT into enterprise applications.

Workspace Studio is powered by Gemini 3, and while it primarily targets business teams rather than developers, it offers builders a way to offload lower-priority agent tasks.

"We've all lost countless hours to the daily grind: Sifting through emails, juggling calendar logistics and chasing follow-up tasks," Farhaz Karmali, product director for the Google Workspace Ecosystem, wrote in a blog post. "Legacy automation tools tried to help, but they were simply too rigid and technical for the everyday user. That's why we're bringing custom agents directly into Workspace with Studio — so you can delegate these repetitive tasks to agents that can reason, understand context and handle the work that used to slow you down."

The platform can bring agents to Workspace apps such as Google Docs and Sheets, as well as to third-party tools like Salesforce or Jira.

More AI in applications

Interest in AI agents continues to grow, and while many enterprises have begun deploying them in their workflows, they're finding it isn't as easy to get users on board as expected. The problem is that using agents can sometimes break employees out of their flow, so organizations have to figure out how to integrate agents where users are already fully engaged. The most common way of interacting with agents so far remains a chat screen. AWS released Quick Sight in hopes of attracting more front- and middle-office workers to use AI agents, although access to agents is still through a chatbot.
OpenAI has desktop integrations that bring ChatGPT to specific apps. And, of course, Microsoft Copilot was ahead of this trend.

Google has an advantage that only Microsoft rivals: It already offers applications that most people use. Enterprise employees use Google Workspace applications, host data and documents on Drive and send emails through Gmail. This means Google can easily gather the context enterprises need to power their agents, and reach millions of users. If people build agents through Workspace Studio, the platform can prove that targeting agents at workplace applications (not just Google Docs, but also Microsoft Word) could be a winning strategy for increasing agent adoption among employees.

Templatizing agent creation
Enterprise employees can choose from a template or write out what they need in a prompt window. A look around the Workspace Studio platform showed templates such as "auto-create tasks when files are added to a folder" or "create Jira issues for emails with action items."

Karmali said Workspace Studio is being "deeply integrated with Workspace apps like Gmail, Drive and Chat," and agents built on the platform can "understand the full context of your work."

"This allows them to provide help that matches your company's policies and processes while generating personalized content in your tone and style," he said. "You can even view your agent activity directly from the side panels of your favorite Workspace apps."

Teams can extend agents to third-party enterprise platforms, but they can also configure custom steps to integrate with other tools.
Presented by Indeed

As AI continues to reshape how we work, organizations are rethinking what skills they need, how they hire, and how they retain talent. According to Indeed's 2025 Tech Talent report, tech job postings are still down more than 30% from pre-pandemic highs, yet demand for AI expertise has never been greater. New roles are emerging almost overnight, from prompt engineers to AI operations managers, and leaders are under growing pressure to close skill gaps while supporting their teams through change.

Shibani Ahuja, SVP of enterprise IT strategy at Salesforce; Matt Candy, global managing partner of generative AI strategy and transformation at IBM; and Jessica Hardeman, global head of attraction and engagement at Indeed, came together for a recent roundtable conversation about the future of tech talent strategy, from hiring and reskilling to how AI is reshaping the workforce.

Strategies for sourcing talent

To find the right candidates, organizations need to be certain their communication is clear from the get-go, and that means beginning with a well-thought-out job description, Hardeman said.

"How clearly are you outlining the skills that are actually required for the role, versus using very high-level or ambiguous language?" she said. "Something that I also highly recommend is skill-cluster sourcing. We use that to identify candidates that might be adjacent to these harder-to-find niche skills. That's something we can upskill people into. For example, skills that are in distributed computing or machine learning frameworks also share other high-value capabilities. Using these clusters can help recruiters identify candidates that may not have that exact skill set you're looking for, but can quickly upskill into it."

Recruiters should also be upskilled, able to spot that potential in candidates. And once candidates are hired, companies have to be intentional about how they're growing talent from the day they step in the door.
"What that means in the near term is focusing on the mentorship, embedding that AI fluency into their onboarding experience, into their growth, into their development," she said. "That means offering upskilling that teaches not just the tools they'll need, but how to think with those tools and alongside those. The new early career sweet spot is where technical skills meet our human strengths. Curiosity. Communication. Data judgment. Workflow design. Those are the things that AI cannot replicate or replace. We have to create mentorship and sponsorship opportunities. Well-being and culture are critical components to ensuring that we're creating good places for that early-in-career talent to land."

How work will evolve alongside AI

As AI becomes embedded into daily technical work, organizations are rethinking what it means to be a developer, designer, or engineer. Instead of automating roles end to end, companies are increasingly building AI agents that act as teammates, supporting workers across the entire software development lifecycle.

Candy explained that IBM is already seeing this shift in action through its Consulting Advantage platform, which serves as a unified AI experience layer for consultants and technical teams.

"This is a platform that every one of our consultants works with," he said. "It's supported by every piece of AI technology and model out there. It's the place where our consultants can access thousands of agents that help them in each job role and activity they're doing."

These aren't just prebuilt tools — teams can create and publish their own agents into an internal marketplace. That has sparked a systematic effort to map every task across traditional tech roles and build agents to enhance them.

"If I think about your traditional designer, DevOps engineer, AI Ops engineer — what are all the different agents that are supporting them in those activities?" Candy said. "It's far more than just coding.
Tools like Cursor, Windsurf, and GitHub Copilot accelerate coding, but that's only one part of delivering software end to end. We're building agents to support people at every stage of that journey."

Candy said this shift leads toward a workplace where AI becomes a collaborative partner rather than a replacement, enabling tech workers to spend more time on creative, strategic, and human-centered tasks.

"This future where employees have agents working alongside them, taking care of some of these repetitive activities, focusing on higher-value strategic work where human skills are innately important, I think becomes right at the heart of that," he explained. "You have to unleash the organization to be able to think and rethink in that way."

A lot of that depends on the mindset of company leaders, Ahuja said. "I can see the difference between leaders that look at AI as cost-cutting, reduction — it's a bottom-line activity," she said. "And then there are organizations that are starting to shift their mindset to say, no, the goal is not about replacing people. It's about reimagining the work to make us humans more human, ironically. For some leaders that's the story their PR teams have told them to say. But for those that actually believe that AI is about helping us become more human, it's interesting how they're bringing that to life and bridging this gap between humanity and digital labor."

Shifting the culture toward AI

The companies most successful at navigating the obstacles around AI implementation and culture change make employees their first priority, Ahuja added. They prioritize use cases that solve the most boring problems burdening their teams, demonstrating how AI will help, as opposed to looking at the maximum number of jobs automation can replace.

"They're thinking of it as preserving human accountability, so in high-stakes moments, people will still make that final call," she said.
"Looking at where AI is going to excel at scale and speed with pattern recognition, leaving that space for humans to bring their judgement, their ethics, and their emotional intelligence. It seems like a very subtle shift, but it's pretty big in terms of where it starts at the beginning of an organization and how it trickles down."

It's also important to build comfort with using AI in employees' day-to-day work. Salesforce created a Slack chat called Bite-Sized AI in which every colleague, including company leaders, is encouraged to talk about where they're using AI and why, and what hacks they've found.

"That's creating a safe space," Ahuja explained. "It's creating that psychological safety — that this isn't just a buzzword. We're trying to encourage it through behavior."

"This is all about how you ignite, especially in big enterprises, the kind of passion and fire inside everyone's belly," Candy added. "Storytelling, showing examples of what great looks like. The expression is 'demos, not memos'. Stop writing PowerPoint slides explaining what we're going to do and actually getting into the tools to show it in real life."

AI makes that continuous learning non-negotiable, Hardeman added: training employees to understand how to use the AI tools they're provided goes a long way toward building an AI culture.

"We view upskilling as a retention lever and a performance driver," she said. "It creates that confidence, it reduces the fear around AI adoption. It helps people see a future for themselves as the technology evolves. AI didn't just raise the bar on skills. It raised the bar on how we're trying to support our people. It's important that we are also rising to that occasion, and we're not just raising expectations on the folks that we work with."

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they're always clearly marked.
For more information, contact sales@venturebeat.com.
Code red declared, Mistral 3 drops, AWS coder agent, Google search shift, and more...
Today, we are celebrating the extraordinary impact of Nobel Prize-winner Geoffrey Hinton by investing in the future of the field he helped build. Google is proud to supp…
A newly enacted New York law requires retailers to say whether your data influences the price of basic goods like a dozen eggs or toilet paper, but not how.
You can now use Circle to Search and Google Lens to detect scammy messages you receive on your phone.
Exploring the k-NN classifier with its variants and improvements
The Machine Learning "Advent Calendar" Day 2: k-NN Classifier in Excel (Towards Data Science)
Amazon Web Services on Tuesday announced a new class of artificial intelligence systems called "frontier agents" that can work autonomously for hours or even days without human intervention, representing one of the most ambitious attempts yet to automate the full software development lifecycle.

The announcement, made during AWS CEO Matt Garman's keynote address at the company's annual re:Invent conference, introduces three specialized AI agents designed to act as virtual team members: Kiro autonomous agent for software development, AWS Security Agent for application security, and AWS DevOps Agent for IT operations.

The move signals Amazon's intent to leap ahead in the intensifying competition to build AI systems capable of performing complex, multi-step tasks that currently require teams of skilled engineers.

"We see frontier agents as a completely new class of agents," said Deepak Singh, vice president of developer agents and experiences at Amazon, in an interview ahead of the announcement. "They're fundamentally designed to work for hours and days. You're not giving them a problem that you want finished in the next five minutes. You're giving them complex challenges that they may have to think about, try different solutions, and get to the right conclusion — and they should do that without intervention."

Why Amazon believes its new agents leave existing AI coding tools behind

The frontier agents differ from existing AI coding assistants like GitHub Copilot or Amazon's own CodeWhisperer in several fundamental ways.

Current AI coding tools, while powerful, require engineers to drive every interaction. Developers must write prompts, provide context, and manually coordinate work across different code repositories. When switching between tasks, the AI loses context and must start fresh.

The new frontier agents, by contrast, maintain persistent memory across sessions and continuously learn from an organization's codebase, documentation, and team communications.
They can independently determine which code repositories require changes, work on multiple files simultaneously, and coordinate complex transformations spanning dozens of microservices.

"With a current agent, you would go microservice by microservice, making changes one at a time, and each change would be a different session with no shared context," Singh explained. "With a frontier agent, you say, 'I need to solve this broad problem.' You point it to the right application, and it decides which repos need changes."

The agents exhibit three defining characteristics that AWS believes set them apart: autonomy in decision-making, the ability to scale by spawning multiple agents to work on different aspects of a problem simultaneously, and the capacity to operate independently for extended periods.

"A frontier agent can decide to spin up 10 versions of itself, all working on different parts of the problem at once," Singh said.

How each of the three frontier agents tackles a different phase of development

Kiro autonomous agent serves as a virtual developer that maintains context across coding sessions and learns from an organization's pull requests, code reviews, and technical discussions. Teams can connect it to GitHub, Jira, Slack, and internal documentation systems. The agent then acts like a teammate, accepting task assignments and working independently until it either completes the work or requires human guidance.

AWS Security Agent embeds security expertise throughout the development process, automatically reviewing design documents and scanning pull requests against organizational security requirements. Perhaps most significantly, it transforms penetration testing from a weeks-long manual process into an on-demand capability that completes in hours.

SmugMug, a photo hosting platform, has already deployed the security agent.
"AWS Security Agent helped catch a business logic bug that no existing tools would have caught, exposing information improperly," said Andres Ruiz, staff software engineer at the company. "To any other tool, this would have been invisible. But the ability for Security Agent to contextualize the information, parse the API response, and find the unexpected information there represents a leap forward in automated security testing."

AWS DevOps Agent functions as an always-on operations team member, responding instantly to incidents and using its accumulated knowledge to identify root causes. It connects to observability tools including Amazon CloudWatch, Datadog, Dynatrace, New Relic, and Splunk, along with runbooks and deployment pipelines.

Commonwealth Bank of Australia tested the DevOps agent by replicating a complex network and identity management issue that typically requires hours for experienced engineers to diagnose. The agent identified the root cause in under 15 minutes.

"AWS DevOps Agent thinks and acts like a seasoned DevOps engineer, helping our engineers build a banking infrastructure that's faster, more resilient, and designed to deliver better experiences for our customers," said Jason Sandry, head of cloud services at Commonwealth Bank.

Amazon makes its case against Google and Microsoft in the AI coding wars

The announcement arrives amid a fierce battle among technology giants to dominate the emerging market for AI-powered development tools.
Google has made significant noise in recent weeks with its own AI coding capabilities, while Microsoft continues to advance GitHub Copilot and its broader AI development toolkit.

Singh argued that AWS holds distinct advantages rooted in the company's 20-year history operating cloud infrastructure and Amazon's own massive software engineering organization.

"AWS has been the cloud of choice for 20 years, so we have two decades of knowledge building and running it, and working with customers who've been building and running applications on it," Singh said. "The learnings from operating AWS, the knowledge our customers have, the experience we've built using these tools ourselves every day to build real-world applications—all of that is embodied in these frontier agents."

He drew a distinction between tools suitable for prototypes versus production systems. "There's a lot of things out there that you can use to build your prototype or your toy application. But if you want to build production applications, there's a lot of knowledge that we bring in as AWS that applies here."

The safeguards Amazon built to keep autonomous agents from going rogue

The prospect of AI systems operating autonomously for days raises immediate questions about what happens when they go off track. Singh described multiple safeguards built into the system.

All learnings accumulated by the agents are logged and visible, allowing engineers to understand what knowledge influences the agent's decisions. Teams can even remove specific learnings if they discover the agent has absorbed incorrect information from team communications.

"You can go in and even redact that from its knowledge like, 'No, we don't want you to ever use this knowledge,'" Singh said. "You can look at the knowledge like it's almost—it's like looking at your neurons inside your brain.
You can disconnect some."

Engineers can also monitor agent activity in real time and intervene when necessary, either redirecting the agent or taking over entirely. Most critically, the agents never commit code directly to production systems. That responsibility remains with human engineers.

"These agents are never going to check the code into production. That is still the human's responsibility," Singh emphasized. "You are still, as an engineer, responsible for the code you're checking in, whether it's generated by you or by an agent working autonomously."

What frontier agents mean for the future of software engineering jobs

The announcement inevitably raises concerns about the impact on software engineering jobs. Singh pushed back against the notion that frontier agents will replace developers, framing them instead as tools that amplify human capabilities.

"Software engineering is a craft. What's changing is not, 'Hey, agents are doing all the work.' The craft of software engineering is changing—how you use agents, how do you set up your code base, how do you set up your prompts, how do you set up your rules, how do you set up your knowledge bases so that agents can be effective," he said.

Singh noted that senior engineers who had drifted away from hands-on coding are now writing more code than ever. "It's actually easier for them to become software engineers," he said.

He pointed to an internal example where a team completed a project in 78 days that would have taken 18 months using traditional practices. "Because they were able to use AI. And the thing that made it work was not just the fact that they were using AI, but how they organized and set up their practices of how they built that software were maximized around that."

How Amazon plans to make AI-generated code more trustworthy over time

Singh outlined several areas where frontier agents will evolve over the coming years.
Multi-agent architectures, where systems of specialized agents coordinate to solve complex problems, represent a major frontier. So does the integration of formal verification techniques to increase confidence in AI-generated code.

AWS recently introduced property-based testing in Kiro, which uses automated reasoning to extract testable properties from specifications and generate thousands of test scenarios automatically.

"If you have a shopping cart application, every way an order can be canceled, and how it might be canceled, and the way refunds are handled in Germany versus the US—if you're writing a unit test, maybe two, Germany and US, but now, because you have this property-based testing approach, your agent can create a scenario for every country you operate in and test all of them automatically for you," Singh explained.

Building trust in autonomous systems remains the central challenge. "Right now you still require tons of human guardrails at every step to make sure that the right thing happens. And as we get better at these techniques, you will use less and less, and you'll be able to trust the agents a lot more," he said.

Amazon's bigger bet on autonomous AI stretches far beyond writing code

The frontier agents announcement arrived alongside a cascade of other news at re:Invent 2025. AWS kicked off the conference with major announcements on agentic AI capabilities, customer service innovations, and multicloud networking.

Amazon expanded its Nova portfolio with four new models that the company says deliver industry-leading price-performance across reasoning, multimodal processing, conversational AI, code generation, and agentic tasks. Nova Forge introduces what AWS calls "open training," giving organizations access to pre-trained model checkpoints and the ability to blend proprietary data with Amazon Nova-curated datasets.

AWS also added 18 new open weight models to Amazon Bedrock, reinforcing its commitment to offering a broad selection of fully managed models from leading AI providers.
The launch includes new models from Mistral AI, Google's Gemma 3, MiniMax's M2, NVIDIA's Nemotron, and OpenAI's GPT OSS Safeguard.

On the infrastructure side, Amazon EC2 Trn3 UltraServers, powered by AWS's first 3nm AI chip, pack up to 144 Trainium3 chips into a single integrated system, delivering up to 4.4x more compute performance and 4x greater energy efficiency than the previous generation, according to the company. AWS AI Factories provides enterprises and government organizations with dedicated AWS AI infrastructure deployed in their own data centers, combining NVIDIA GPUs, Trainium chips, AWS networking, and AI services like Amazon Bedrock and SageMaker AI.

All three frontier agents launched in preview on Tuesday. Pricing will be announced when the services reach general availability.

Singh made clear the company sees applications far beyond coding. "These are the first frontier agents we are releasing, and they're in the software development lifecycle," he said. "The problems and use cases for frontier agents—these agents that are long running, capable of autonomy, thinking, always learning and improving—can be applied to many, many domains."

Amazon, after all, operates satellite networks, runs robotics-powered warehouses, and manages one of the world's largest e-commerce platforms. If autonomous agents can learn to write code on their own, the company is betting they can eventually learn to do just about anything else.
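Kiro's actual property-based testing output isn't public, but Singh's shopping-cart illustration maps onto a well-established technique. The sketch below shows the core idea with no dependencies: instead of one hand-written unit test per country, a generator produces many (country, amount) scenarios and asserts invariants that must hold in all of them. The cart and refund logic are hypothetical, and production tools such as Hypothesis add input shrinking and smarter generation on top of this.

```python
# Minimal property-based testing sketch, in the spirit of Singh's
# shopping-cart example. The Order/refund logic is illustrative only.
import random
from dataclasses import dataclass

COUNTRIES = ["US", "DE", "FR", "GB", "JP"]  # every market the shop serves


@dataclass
class Order:
    country: str
    total_cents: int
    cancelled: bool = False


def cancel_and_refund(order: Order) -> int:
    """Cancel an order and return the refund amount in cents."""
    order.cancelled = True
    return order.total_cents  # full refund in every market, for this sketch


def check_refund_properties(runs: int = 1000) -> None:
    """Generate random (country, amount) scenarios and assert invariants
    that must hold in ALL of them -- not just one US and one DE unit test."""
    rng = random.Random(0)  # seeded so failures are reproducible
    for _ in range(runs):
        order = Order(country=rng.choice(COUNTRIES),
                      total_cents=rng.randrange(0, 1_000_000))
        refund = cancel_and_refund(order)
        assert order.cancelled, "cancelling must mark the order cancelled"
        assert 0 <= refund <= order.total_cents, "never refund more than paid"


check_refund_properties()
print("all properties held across generated scenarios")
```

The payoff is the one Singh describes: adding a sixth country means adding one entry to the generator's list, not writing another unit test, because the properties are stated once and checked across every generated scenario.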