Latest AI News & Updates

#robotics #startups #innovation and entrepreneurship (i&e) #machine learning #artificial intelligence #electrical engineering and computer science (eecs) #school of engineering #alumni/ae

Founded by MIT alumni, the Pickle Robot Company has developed machines that can autonomously load and unload trucks inside warehouses and logistics centers.

#science #science / health

The author of Super Agers believes AI could bring big changes to the world of medicine.

#ai

OpenAI researchers have introduced a novel method that acts as a "truth serum" for large language models (LLMs), compelling them to self-report their own misbehavior, hallucinations and policy violations. This technique, "confessions," addresses a growing concern in enterprise AI: Models can be dishonest, overstating their confidence or covering up the shortcuts they take to arrive at an answer. For real-world applications, this technique enables the creation of more transparent and steerable AI systems.

What are confessions?

Many forms of AI deception result from the complexities of the reinforcement learning (RL) phase of model training. In RL, models are given rewards for producing outputs that meet a mix of objectives, including correctness, style and safety. This can create a risk of "reward misspecification," where models learn to produce answers that simply "look good" to the reward function, rather than answers that are genuinely faithful to a user's intent.

A confession is a structured report generated by the model after it provides its main answer. It serves as a self-evaluation of its own compliance with instructions. In this report, the model must list all instructions it was supposed to follow, evaluate how well it satisfied them and report any uncertainties or judgment calls it made along the way. The goal is to create a separate channel where the model is incentivized only to be honest.

In a blog post, the OpenAI researchers provide a few examples of the "confessions" technique in action. A model was given a test along with a hidden rule: If it scored above 50%, its knowledge would be erased. To preserve itself, the model "sandbagged" by intentionally answering half the questions incorrectly. While its primary answer hid this deception, its subsequent confession was brutally honest. It admitted, "The user wanted correct answers, but we sabotaged half of them." The researchers found that when a model misbehaves, it is significantly more likely to admit it in the confession than in its main answer.

How confession training works

The key to this method is the separation of rewards. During training, the reward assigned to the confession is based solely on its honesty and is never mixed with the reward for the main task. "Like the Catholic Church’s 'seal of confession', nothing that the model reveals can change the reward it receives for completing its original task," the researchers write. This creates a "safe space" for the model to admit fault without penalty.

This approach is powerful because it sidesteps a major challenge in AI training. The researchers’ intuition is that honestly confessing to misbehavior is an easier task than achieving a high reward on the original, often complex, problem. Therefore, when the model is judged purely on honesty, its incentive to trick the "confession judge" is much weaker than its incentive to trick the reward model for its main answer. For example, in an experiment where a model was trained to "hack" a weak reward model, its performance on the main task became more deceptive over time. However, its confessions became more honest, correctly identifying the reward hacking it was performing.

However, the technique has its limits. Confessions are not a panacea for all types of AI failures. The system works best when a model is aware that it is misbehaving. It is less effective for "unknown unknowns." For instance, if a model hallucinates a fact and genuinely believes it is correct, it cannot confess to providing false information. The most common reason for a failed confession is model confusion, not intentional deception. Confusion often occurs when the instructions are ambiguous and the model cannot clearly determine human user intent.

What it means for enterprise AI

OpenAI’s confessions technique is part of a growing body of work on AI safety and control. Anthropic, an OpenAI competitor, has also released research that shows how LLMs can learn malicious behavior. The company is also working toward plugging these holes as they emerge.

For AI applications, confessions can provide a practical monitoring layer. The structured output from a confession can be used at inference time to flag or reject a model’s response before it causes a problem. For example, a system could be designed to automatically escalate any output for human review if its confession indicates a policy violation or high uncertainty.

In a world where AI is increasingly agentic and capable of complex tasks, observability and control will be key elements for safe and reliable deployment.

“As models become more capable and are deployed in higher-stakes settings, we need better tools for understanding what they are doing and why,” the OpenAI researchers write. “Confessions are not a complete solution, but they add a meaningful layer to our transparency and oversight stack.”
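The article does not include code, but the inference-time escalation idea is easy to sketch. The snippet below is a minimal, hypothetical illustration: it assumes a confession arrives as a JSON report with made-up field names (`instructions`, `policy_violations`, `confidence`), since OpenAI's actual confession format is not specified here.

```python
import json

# Hypothetical confession payload; field names are illustrative assumptions,
# not OpenAI's actual schema.
EXAMPLE_CONFESSION = json.dumps({
    "instructions": [
        {"text": "Answer all questions correctly", "satisfied": False},
        {"text": "Follow the stated output format", "satisfied": True},
    ],
    "uncertainties": ["Unsure whether partial answers count as correct"],
    "policy_violations": ["Intentionally answered half the questions incorrectly"],
    "confidence": 0.35,
})


def needs_human_review(confession_json: str, confidence_floor: float = 0.6) -> bool:
    """Decide whether a model response should be escalated before use,
    based only on its accompanying confession report."""
    report = json.loads(confession_json)

    unmet = [i for i in report.get("instructions", []) if not i.get("satisfied")]
    violations = report.get("policy_violations", [])
    low_confidence = report.get("confidence", 1.0) < confidence_floor

    # Escalate if the model admits unmet instructions, reports policy
    # violations, or expresses low confidence in its own compliance.
    return bool(unmet or violations or low_confidence)


if __name__ == "__main__":
    if needs_human_review(EXAMPLE_CONFESSION):
        print("Escalating response for human review")
    else:
        print("Response passed confession checks")
```

In practice such a gate would sit between the model and downstream systems, so a flagged response never reaches the user or triggers an action without review.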

#business #business / artificial intelligence #business / big tech #business / regulation #business / tech culture

The Trump administration might think regulation is killing the AI industry, but Anthropic president Daniela Amodei disagrees.

#business #business / artificial intelligence

In this episode of Uncanny Valley, we talk to writer Evan Ratliff about how he created a small startup made entirely of AI employees—and what his findings reveal about the reality of an agentic future.

#culture #culture / movies

The Wicked: For Good director says being able to improvise on set allows for the kind of moments that are hard for machines to make.

#artificial intelligence #app

In 2024, a Democratic congressional candidate in Pennsylvania, Shamaine Daniels, used an AI chatbot named Ashley to call voters and carry on conversations with them. “Hello. My name is Ashley, and I’m an artificial intelligence volunteer for Shamaine Daniels’s run for Congress,” the calls began. Daniels didn’t ultimately win. But maybe those calls helped her…

#business #business / big tech #business / computers and software #business / tech culture

Lisa Su leads Nvidia’s biggest rival in the AI chip market. When asked at WIRED’s Big Interview event if AI is a bubble, the company’s CEO said, “Emphatically, from my perspective, no.”

Robust proxies allow you to rotate identities, reach any region, and bypass sophisticated anti-bot systems, all while protecting your infrastructure from blocks and blacklisting.

#deep learning #artificial intelligence #editors pick #machine learning #neuroscience #self-supervised learning

A new NeurIPS 2025 paper shows how self-supervised learning imbues ViTs with better image understanding than supervised learning does.
The post Do Labels Make AI Blind? Self-Supervision Solves the Age-Old Binding Problem appeared first on Towards Data Science.

Strange as it may sound, large language models (LLMs) can be leveraged for data analysis tasks, including specific scenarios such as time series analysis.

#business

The diverging path of China’s two leading AI players shows where the country’s artificial intelligence industry is headed.

#machine learning #algorithms #artificial intelligence #data science #excel #k means

How to implement a training algorithm that finally looks like “real” machine learning
The post The Machine Learning “Advent Calendar” Day 4: k-Means in Excel appeared first on Towards Data Science.

Scientists are using AlphaFold to strengthen a photosynthesis enzyme for resilient, heat-tolerant crops.

An overview, summary, and position on cutting-edge research into the emerging topic of LLM introspection into their own internal states

#programming #data science #deep dives #python #streamlit #supply chain

A factory operator who discovered happiness by switching from notebooks to Streamlit (image generated with GPT-5.1 by Samir Saci)
The post Build and Deploy Your First Supply Chain App in 20 Minutes appeared first on Towards Data Science.

#ai

Amazon Web Services (AWS) has introduced Kiro powers, a system that allows software developers to give their AI coding assistants instant, specialized expertise in specific tools and workflows — addressing what the company calls a fundamental bottleneck in how AI agents operate today.

AWS announced Kiro powers at its annual re:Invent conference in Las Vegas. The capability marks a departure from how most AI coding tools work today. Typically, these tools load every possible capability into memory upfront — a process that burns through computational resources and can overwhelm the AI with irrelevant information. Kiro powers takes the opposite approach, activating specialized knowledge only at the moment a developer actually needs it.

"Our goal is to give the agent specialized context so it can reach the right outcome faster — and in a way that also reduces cost," Deepak Singh, VP of developer agents and experiences at Amazon, told VentureBeat in an exclusive interview.

The launch includes partnerships with nine technology companies: Datadog, Dynatrace, Figma, Neon, Netlify, Postman, Stripe, Supabase and AWS's own services. Developers can also create and share their powers with the community.

Why AI coding assistants choke when developers connect too many tools

Kiro powers comes amidst growing tension in the AI development tool market. Modern AI coding assistants rely on Model Context Protocol (MCP) to connect with external tools and services. When a developer wants their AI assistant to work with Stripe for payments, Figma for design and Supabase for databases, they connect MCP servers for each service.

The problem: Each connection loads dozens of tool definitions into the AI's working memory before it writes a single line of code. According to AWS documentation, connecting just five MCP servers can consume more than 50,000 tokens — roughly 40% of an AI model's context window — before the developer even types their first request.

Developers have grown increasingly vocal about this issue. Many complain that they don't want to burn through their token allocations just to have an AI agent figure out which tools are relevant to a specific task. They want to get to their workflow instantly — not watch an overloaded agent struggle to sort through irrelevant context.

This phenomenon, which some in the industry call "context rot," leads to slower responses, lower-quality outputs and significantly higher costs — since AI services typically charge by the token.

Inside the technology that loads AI expertise on demand

Kiro powers addresses this by packaging three components into a single, dynamically loaded bundle. The first is a steering file, POWER.md, which functions as an onboarding manual. It tells the AI agent what tools are available and, crucially, when to use them. The second component is the MCP server configuration itself — the actual connection to external services. The third includes optional hooks and automation that trigger specific actions.

When a developer mentions "payment" or "checkout" in their conversation with Kiro, the system automatically activates the Stripe power, loading its tools and best practices into context. When the developer shifts to database work, Supabase activates while Stripe deactivates. The baseline context usage when no powers are active approaches zero.

"You click a button and it automatically loads," Singh said. "Once a power has been created, developers just select 'open in Kiro' and it launches the IDE with everything ready to go."

How AWS is bringing elite developer techniques to the masses

Singh framed Kiro powers as a democratization of advanced development practices. Before this capability, only the most sophisticated developers knew how to properly configure their AI agents with specialized context — writing custom steering files, crafting precise prompts and manually managing which tools were active at any given time.

"We've found that our developers were adding in capabilities to make their agents more specialized," Singh said. "They wanted to give the agent some special powers for a specific problem. For example, they wanted ... the agent to become an expert at backend-as-a-service."

This observation led to a key insight: If Supabase or Stripe could build the optimal context configuration once, every developer using those services could benefit.

"Kiro powers formalizes things that only the most advanced people were doing, and allows anyone to get those kinds of skills," Singh said.

Why dynamic loading beats fine-tuning for most AI coding use cases

The announcement also positions Kiro powers as a more economical alternative to fine-tuning, or the process of training an AI model on specialized data to improve its performance in specific domains.

"It's much cheaper" compared to fine-tuning, Singh said. "Fine-tuning is very expensive, and you can't fine-tune most frontier models."

This is a significant point. The most capable AI models from Anthropic, OpenAI and Google are typically "closed source," meaning developers cannot modify their underlying training. They can only influence the models' behavior through the prompts and context they provide.

"Most people are already using powerful models like Sonnet 4.5 or Opus 4.5," Singh said. "Those models need to be pointed in the right direction."

The dynamic loading mechanism also reduces ongoing costs. Because powers only activate when relevant, developers aren't paying for token usage on tools they're not currently using.

Where Kiro powers fits into Amazon's bigger bet on autonomous AI agents

Kiro powers arrives as part of a broader push by AWS into what the company calls "agentic AI" — AI systems that can operate autonomously over extended periods.

At re:Invent, AWS also announced three "frontier agents" designed to work for hours or days without human intervention: Kiro autonomous agent for software development, AWS security agent and AWS DevOps agent. These represent a different approach from Kiro powers — tackling large, ambiguous problems rather than providing specialized expertise for specific tasks.

The two approaches are complementary. Frontier agents handle complex, multi-day projects that require autonomous decision-making across multiple codebases. Kiro powers, by contrast, gives developers precise, efficient tools for everyday development tasks where speed and token efficiency matter most. The company is betting that developers need both ends of this spectrum to be productive.

What Kiro powers reveals about the future of AI-assisted software development

The launch reflects a maturing market for AI development tools. GitHub Copilot, which Microsoft launched in 2021, introduced millions of developers to AI-assisted coding. Since then, a proliferation of tools — including Cursor, Cline and Claude Code — have competed for developers' attention.

But as these tools have grown more capable, they've also grown more complex. MCP, which Anthropic open-sourced last year, created a standard for connecting AI agents to external services. That solved one problem while creating another: the context overload that Kiro powers now addresses.

AWS is positioning itself as the company that understands production software development at scale. Singh emphasized that Amazon's experience running AWS for 20 years, combined with its own massive internal software engineering organization, gives it unique insight into how developers actually work.

"It's not something you would use just for your prototype or your toy application," he said. "If you want to build production applications, there's a lot of knowledge that we bring."

The road ahead for Kiro powers and cross-platform compatibility

AWS indicated that Kiro powers currently works only within the Kiro IDE, but the company is building toward cross-compatibility with other AI development tools, including command-line interfaces, Cursor, Cline and Claude Code. The company's documentation describes a future where developers can "build a power once, use it anywhere" — although that vision remains aspirational for now.

For the technology partners launching powers today, the appeal is straightforward: Rather than maintaining separate integration documentation for every AI tool on the market, they can create a single power that works everywhere Kiro does. As more AI coding assistants crowd the market, that kind of efficiency becomes increasingly valuable.

Kiro powers is available now for developers using Kiro IDE version 0.7 or later at no additional charge beyond the standard Kiro subscription.

The underlying bet is a familiar one in the history of computing: The winners in AI-assisted development won't be the tools that try to do everything at once, but those that are smart enough to know what to forget.
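To make the dynamic-loading idea concrete, here is a small, hypothetical sketch of keyword-triggered activation: a "power" (steering text plus tool names) is only added to the agent's context when the current request mentions its trigger terms, so the baseline context stays near zero. The `Power` class, trigger keywords and tool names are invented for illustration and are not AWS's actual POWER.md or MCP format.

```python
from dataclasses import dataclass, field


@dataclass
class Power:
    """A bundle of steering text and tool names, loaded only on demand.
    Illustrative stand-in for a Kiro power, not the real AWS format."""
    name: str
    trigger_keywords: set[str]
    steering: str                      # condensed stand-in for a POWER.md file
    tools: list[str] = field(default_factory=list)


POWERS = [
    Power("stripe", {"payment", "checkout", "invoice"},
          "Use Stripe tools for payment flows.", ["create_checkout_session"]),
    Power("supabase", {"database", "table", "query"},
          "Use Supabase tools for database work.", ["run_sql"]),
]


def build_context(user_message: str) -> str:
    """Assemble only the context the current request actually needs.
    When no power matches, the returned context is empty."""
    words = set(user_message.lower().split())
    active = [p for p in POWERS if p.trigger_keywords & words]
    return "\n".join(f"[{p.name}] {p.steering} tools={p.tools}" for p in active)


if __name__ == "__main__":
    print(build_context("add a checkout page") or "(no powers active)")
    print(build_context("refactor the parser") or "(no powers active)")
```

The contrast with the always-loaded MCP setup described above is the point: instead of paying for every tool definition on every request, the agent pays only for the bundle that matches the task at hand.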

#ai

The debate over whether AI belongs in the corporate boardroom appears to be over — at least for those responsible for generating revenue.

Seven in 10 enterprise revenue leaders now trust AI to regularly inform their business decisions, according to a sweeping new study released by revenue intelligence company Gong. The finding marks a dramatic shift from just two years ago, when most organizations treated AI as an experimental technology relegated to pilot programs and individual productivity hacks.

The research, based on an analysis of 7.1 million sales opportunities across more than 3,600 companies and a survey of over 3,000 global revenue leaders spanning the U.S., UK, Australia and Germany, paints a picture of an industry in rapid transformation. Organizations that have embedded AI into their core go-to-market strategies are 65% more likely to increase their win rates than competitors still treating the technology as optional.

"I don't think people delegate decisions to AI, but they do rely on AI in the process of making decisions," Amit Bendov, Gong's co-founder and chief executive, said in an exclusive interview with VentureBeat. "Humans are making the decision, but they're largely assisted."

The distinction matters. Rather than replacing human judgment, AI has become what Bendov describes as a "second opinion" — a data-driven check on the intuition and guesswork that has traditionally governed sales forecasting and strategy.

Slowing growth is forcing sales teams to squeeze more from every rep

The timing of AI's ascendance in revenue organizations is no coincidence. The study reveals a sobering reality: After rebounding in 2024, average annual revenue growth among surveyed companies decelerated to 16% in 2025, marking a three-percentage-point decline year over year. Sales rep quota attainment fell from 52% to 46% over the same period.

The culprit, according to Gong's analysis, isn't that salespeople are performing worse on individual deals. Win rates and deal duration remained consistent. The problem is that representatives are working fewer opportunities — a finding that suggests operational inefficiencies are eating into selling time.

This helps explain why productivity has rocketed to the top of executive priorities. For the first time in the study's history, increasing the productivity of existing teams ranked as the number-one growth strategy for 2026, jumping from fourth place the previous year.

"The focus is on increasing sales productivity," Bendov said. "How much dollar-output per dollar-input?"

The numbers back up the urgency. Teams that regularly use AI tools generate 77% more revenue per representative than those that don't — a gap Gong characterizes as a six-figure difference per salesperson annually.

Companies are moving beyond basic AI automation toward strategic decision-making

The nature of AI adoption in sales has evolved considerably over the past year. In 2024, most revenue teams used AI for basic automation: Transcribing calls, drafting emails, updating CRM records. Those use cases continue to grow, but 2025 marked what the report calls a shift "from automation to intelligence."

The number of U.S. companies using AI for forecasting and measuring strategic initiatives jumped 50% year over year. These more sophisticated applications — predicting deal outcomes, identifying at-risk accounts, measuring which value propositions resonate with different buyer personas — correlate with dramatically better results. Organizations in the 95th percentile of commercial impact from AI were 2 to 4X more likely to have deployed these strategic use cases, according to the study.

Bendov offered a concrete example of how this plays out in practice. "Companies have thousands of deals that they roll up into their forecast," he said. "It used to be based solely on human sentiment, believe it or not. That's why a lot of companies miss their numbers: Because people say, 'Oh, he told me he'll buy,' or 'I think I can probably get this one.'"

AI changes that calculus by examining evidence rather than optimism. "Companies now get a second opinion from AI on their forecasting, and that improves forecasting accuracy dramatically — 10 [or] 15% better accuracy just because it's evidence-based, not just based on human sentiment," Bendov said.

Revenue-specific AI tools are dramatically outperforming general-purpose alternatives

One of the study's more provocative findings concerns the type of AI that delivers results. Teams using revenue-specific AI solutions — tools built explicitly for sales workflows rather than general-purpose platforms like ChatGPT — reported 13% higher revenue growth and 85% greater commercial impact than those relying on generic tools. These specialized systems were also twice as likely to be deployed for forecasting and predictive modeling, the report found.

The finding carries obvious implications for Gong, which sells precisely this type of domain-specific platform. But the data suggests a real distinction in outcomes. General-purpose AI, while more prevalent, often creates what the report describes as a "blind spot" for organizations — particularly when employees adopt consumer AI tools without company oversight.

Research from MIT suggests that while only 59% of enterprise teams report using personal AI tools like ChatGPT at work, the actual figure is likely closer to 90%. This shadow AI usage poses security risks and creates fragmented technology stacks that undermine the potential for organization-wide intelligence.

Most sales leaders believe AI will reshape their jobs rather than eliminate them

Perhaps the most closely watched question in any AI study concerns employment. The Gong research offers a more nuanced picture than the apocalyptic predictions that often dominate headlines.

When asked about AI's three-year impact on revenue headcount, 43% of respondents said they expect it to transform jobs without reducing headcount — the most common response. Only 28% anticipate job eliminations, while 21% actually foresee AI creating new roles. Just 8% predict minimal impact.

Bendov frames the opportunity as reclaiming lost time. He cited Forrester research indicating that 77% of a sales representative's time is spent on activities that don't involve customers — administrative work, meeting preparation, researching accounts, updating forecasts and internal briefings.

"AI can eliminate, ideally, 77% of the drudgery work that they're doing," Bendov said. "I don't think it necessarily eliminates jobs. People are half productive right now. Let's make them fully productive, and whatever you're paying them will translate to much higher revenue."

The transformation is already visible in role consolidation. Over the past decade, sales organizations splintered into hyper-specialized functions: One person qualifies leads, another sets appointments, a third closes deals, a fourth handles onboarding. The result was customers interacting with five or six different people across their buying journey.

"Which is not a great buyer experience, because every time I meet a new person that might not have the full context, and it's very inefficient for companies," Bendov said. "Now with AI, you can have one person do all this, or much of this."

At Gong itself, sellers now generate 80% of their own appointments because AI handles the prospecting legwork, Bendov said.

American companies are adopting AI 18 months faster than their European counterparts

The study reveals a notable divide in AI adoption between the U.S. and Europe. While 87% of U.S. companies now use AI in their revenue operations, with another 9% planning adoption within a year, the UK trails by 12 to 18 months. Just 70% of UK companies currently use AI, with 22% planning near-term adoption — figures that mirror U.S. data from 2024.

Bendov said the pattern reflects a broader historical tendency for enterprise technology trends to cross the Atlantic with a delay. "It's always like that," he said. "Even when the internet was taking off in the U.S., Europe was a step behind."

The gap isn't permanent, he noted, and Europe sometimes leads on technology adoption — mobile payments and messaging apps like WhatsApp gained traction there before the U.S. — but for AI specifically, the American market remains ahead.

Gong says a decade of AI development gives it an edge over Salesforce and Microsoft

The findings arrive as Gong navigates an increasingly crowded market. The company, which recently surpassed $300 million in annual recurring revenue, faces potential competition from enterprise software giants like Salesforce and Microsoft, both of which are embedding AI capabilities into their platforms.

Bendov argues that Gong's decade of AI development creates a substantial barrier to entry. The company's architecture comprises three layers: a "revenue graph" that aggregates customer data from CRM systems, emails, calls, videos and web signals; an intelligence layer combining large language models (LLMs) with approximately 40 proprietary small language models; and workflow applications built on top.

"Anybody that would want to build something like that — it's not a small feature, it's 10 years in development — would need first to build the revenue graph," Bendov said.

Rather than viewing Salesforce and Microsoft as threats, Bendov characterized them as partners, pointing to both companies' participation in Gong's recent user conference to discuss agent interoperability. The rise of MCP (Model Context Protocol) support and consumption-based pricing models means customers can mix AI agents from multiple vendors rather than committing to a single platform.

The real question is whether AI will expand the sales profession or hollow it out

The report's implications extend beyond sales departments. If AI can transform revenue operations — long considered a relationship-driven, human-centric function — it raises questions about which other business processes might be next.

Bendov sees the potential for expansion rather than contraction. Drawing an analogy to digital photography, he noted that while camera manufacturers suffered, the total number of photos taken exploded once smartphones made photography effortless.

"If AI makes selling simple, I could see a world [with] maybe ten times more jobs than we have now," Bendov said. "It's expensive and inefficient today, but if it becomes as easy as taking a photo, the industry could actually grow and create opportunities for people of different abilities, from different locations."

For Bendov, who co-founded Gong in 2015 when AI was still a hard sell to non-technical business users, the current moment represents something he waited a decade to see. Back then, mentioning AI to sales executives sounded like science fiction. The company struggled to raise money because the underlying technology barely existed.

"When we started the company, we were born as an AI company, but we had to almost hide AI," Bendov recalled. "It was intimidating."

Now, seven out of 10 of those same executives say they trust AI to help run their business. The technology that once had to be disguised has become the one thing nobody can afford to ignore.

#data engineering #data lakehouse #data science #programming #python #aws s3

Using Apache Iceberg on AWS with Athena, Glue/Spark and DuckDB
The post Bootstrap a Data Lakehouse in an Afternoon appeared first on Towards Data Science.

Check out this comprehensive guide to building production-ready features that actually work.

#science

Zanskar uses AI to identify hidden geothermal systems—and claims it has found one that could fuel a power plant, the first such discovery by industry in decades.

#ai & ml #commentary

In 2025 AI reshaped how teams think, build, and deliver software. We’re now at a point where “AI coding assistants have quickly moved from novelty to necessity [with] up to 90% of software engineers us[ing] some kind of AI for coding,” Addy Osmani writes. That’s a very different world to the one we were in […]

#data science #career advice #continual learning #editors pick #career growth

Why continuous learning matters & how to come up with topics to study
The post The Best Data Scientists are Always Learning appeared first on Towards Data Science.

#ai

For all their superhuman power, today’s AI models suffer from a surprisingly human flaw: They forget. Give an AI assistant a sprawling conversation, a multi-step reasoning task or a project spanning days, and it will eventually lose the thread. Engineers refer to this phenomenon as “context rot,” and it has quietly become one of the most significant obstacles to building AI agents that can function reliably in the real world.

A research team from China and Hong Kong believes it has created a solution to context rot. Their new paper introduces general agentic memory (GAM), a system built to preserve long-horizon information without overwhelming the model. The core premise is simple: Split memory into two specialized roles, one that captures everything, another that retrieves exactly the right things at the right moment.

Early results are encouraging, and couldn’t be better timed. As the industry moves beyond prompt engineering and embraces the broader discipline of context engineering, GAM is emerging at precisely the right inflection point.

When bigger context windows still aren’t enough

At the heart of every large language model (LLM) lies a rigid limitation: A fixed “working memory,” more commonly referred to as the context window. Once conversations grow long, older information gets truncated, summarized or silently dropped. This limitation has long been recognized by AI researchers, and since early 2023, developers have been working to expand context windows, rapidly increasing the amount of information a model can handle in a single pass.

Mistral’s Mixtral 8x7B debuted with a 32K-token window, roughly 24,000 to 25,000 words of English text — a modest amount by today's standards. This was followed by MosaicML’s MPT-7B-StoryWriter-65k+, which more than doubled that capacity; then came Google’s Gemini 1.5 Pro and Anthropic’s Claude 3, offering massive 128K and 200K windows, both of which are extendable to an unprecedented one million tokens. Even Microsoft joined the push, vaulting from the 2K-token limit of the earlier Phi models to the 128K context window of Phi-3.

Increasing context windows might sound like the obvious fix, but it isn’t. Even models with sprawling 100K-token windows, enough to hold hundreds of pages of text, still struggle to recall details buried near the beginning of a long conversation. Scaling context comes with its own set of problems. As prompts grow longer, models become less reliable at locating and interpreting information because attention over distant tokens weakens and accuracy gradually erodes.

Longer inputs also dilute the signal-to-noise ratio, as including every possible detail can actually make responses worse than using a focused prompt. Long prompts also slow models down; more input tokens lead to noticeably higher output-token latency, creating a practical limit on how much context can be used before performance suffers.

Memories are priceless

For most organizations, supersized context windows come with a clear downside — they’re costly. Sending massive prompts through an API is never cheap, and because pricing scales directly with input tokens, even a single bloated request can drive up expenses. Prompt caching helps, but not enough to offset the habit of routinely overloading models with unnecessary context. And that’s the tension at the heart of the issue: Memory is essential to making AI more powerful.

As context windows stretch into the hundreds of thousands or millions of tokens, the financial overhead rises just as sharply. Scaling context is both a technical challenge and an economic one, and relying on ever-larger windows quickly becomes an unsustainable strategy for long-term memory.

Fixes like summarization and retrieval-augmented generation (RAG) aren’t silver bullets either. Summaries inevitably strip away subtle but important details, and traditional RAG, while strong on static documents, tends to break down when information stretches across multiple sessions or evolves over time. Even newer variants, such as agentic RAG and RAG 2.0 (which perform better in steering the retrieval process), still inherit the same foundational flaw of treating retrieval as the solution, rather than treating memory itself as the core problem.

Compilers solved this problem decades ago

If memory is the real bottleneck, and retrieval can’t fix it, then the gap needs a different kind of solution. That’s the bet behind GAM. Instead of pretending retrieval is memory, GAM keeps a full, lossless record and layers smart, on-demand recall on top of it, resurfacing the exact details an agent needs even as conversations twist and evolve.

A useful way to understand GAM is through a familiar idea from software engineering: Just-in-time (JIT) compilation. Rather than precomputing a rigid, heavily compressed memory, GAM keeps things light and tight by storing a minimal set of cues, along with a full, untouched archive of raw history. Then, when a request arrives, it “compiles” a tailored context on the fly.

This JIT approach is built into GAM’s dual architecture, allowing AI to carry context across long conversations without overcompressing or guessing too early about what matters. The result is the right information, delivered at exactly the right moment.

Inside GAM: A two-agent system built for memory that endures

GAM revolves around the simple idea of separating the act of remembering from recalling, which involves two components: the 'memorizer' and the 'researcher.'

The memorizer: Total recall without overload

The memorizer captures every exchange in full, quietly turning each interaction into a concise memo while preserving the complete session in a searchable page store. It doesn’t compress aggressively or guess what is important. Instead, it organizes interactions into structured pages, adds metadata for efficient retrieval and generates optional lightweight summaries for quick scanning. Critically, every detail is preserved, and nothing is thrown away.

The researcher: A deep retrieval engine

When the agent needs to act, the researcher takes the helm to plan a search strategy, combining embeddings with keyword methods like BM25, navigating through page IDs and stitching the pieces together. It conducts layered searches across the page store, blending vector retrieval, keyword matching and direct lookups. It evaluates findings, identifies gaps and continues searching until it has sufficient evidence to produce a confident answer, much like a human analyst reviewing old notes and primary documents. It iterates, searches, integrates and reflects until it builds a clean, task-specific briefing.

GAM’s power comes from this JIT memory pipeline, which assembles rich, task-specific context on demand instead of leaning on brittle, precomputed summaries. Its core innovation is simple yet powerful: it preserves all information intact and makes every detail recoverable. Ablation studies support this approach: Traditional memory fails on its own, and naive retrieval isn’t enough. It’s the pairing of a complete archive with an active, iterative research engine that enables GAM to surface details that other systems leave behind.

Outperforming RAG and long-context models

To test GAM, the researchers pitted it against standard RAG pipelines and models with enlarged context windows such as GPT-4o-mini and Qwen2.5-14B. They evaluated GAM using four major long-context and memory-intensive benchmarks, each chosen to test a different aspect of the system’s capabilities:

LoCoMo measures an agent’s ability to maintain and recall information across long, multi-session conversations, encompassing single-hop, multi-hop, temporal reasoning and open-domain tasks.

HotpotQA, a widely used multi-hop QA benchmark built from Wikipedia, was adapted using MemAgent’s memory-stress-test version, which mixes relevant documents with distractors to create contexts of 56K, 224K and 448K tokens — ideal for testing how well GAM handles noisy, sprawling input.

RULER evaluates retrieval accuracy, multi-hop state tracking, aggregation over long sequences and QA performance under a 128K-token context to further probe long-horizon reasoning.

NarrativeQA is a benchmark where each question must be answered using the full text of a book or movie script; the researchers sampled 300 examples with an average context size of 87K tokens.

Together, these datasets and benchmarks allowed the team to assess both GAM’s ability to preserve detailed historical information and its effectiveness in supporting complex downstream reasoning tasks.

GAM came out ahead across all benchmarks. Its biggest win was on RULER, which benchmarks long-range state tracking. Notably:

GAM exceeded 90% accuracy.

RAG collapsed because key details were lost in summaries.

Long-context models faltered as older information effectively “faded” even when technically present.

Clearly, bigger context windows aren’t the answer. GAM works because it retrieves with precision rather than piling up tokens.

GAM, context engineering and competing approaches

Poorly structured context, not model limitations, is often the real reason AI agents fail. GAM addresses this by ensuring that nothing is permanently lost and that the right information can always be retrieved, even far downstream. The technique’s emergence coincides with the current, broader shift in AI towards context engineering, or the practice of shaping everything an AI model sees — its instructions, history, retrieved documents, tools, preferences and output formats.

Context engineering has rapidly eclipsed prompt engineering in importance, although other research groups are tackling the memory problem from different angles. Anthropic is exploring curated, evolving context states. DeepSeek is experimenting with storing memory as images. Another group of Chinese researchers has proposed “semantic operating systems” built around lifelong adaptive memory.

However, GAM’s philosophy is distinct: Avoid loss and retrieve with intelligence. Instead of guessing what will matter later, it keeps everything and uses a dedicated research engine to find the relevant pieces at runtime. For agents handling multi-day projects, ongoing workflows or long-term relationships, that reliability may prove essential.

Why GAM matters for the long haul

Just as adding more compute doesn’t automatically produce better algorithms, expanding context windows alone won’t solve AI’s long-term memory problems. Meaningful progress requires rethinking the underlying system, and GAM takes that approach. Instead of depending on ever-larger models, massive context windows or endlessly refined prompts, it treats memory as an engineering challenge — one that benefits from structure rather than brute force.

As AI agents transition from clever demos to mission-critical tools, their ability to remember long histories becomes crucial for developing dependable, intelligent systems. Enterprises require AI agents that can track evolving tasks, maintain continuity and recall past interactions with precision and accuracy. GAM offers a practical path toward that future, signaling what may be the next major frontier in AI: Not bigger models, but smarter memory systems and the context architectures that make them possible.
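The memorizer/researcher split described above can be sketched in a few dozen lines. The code below is a simplified illustration under stated assumptions: the "memo" is just a text prefix rather than a generated summary, and retrieval is plain keyword overlap standing in for the paper's hybrid of embeddings, BM25 and page-ID lookups. Class and method names are invented for the example, not taken from the GAM paper.

```python
import re
from dataclasses import dataclass


@dataclass
class Page:
    page_id: int
    memo: str   # lightweight cue kept "hot"
    raw: str    # full, lossless record of the exchange


class Memorizer:
    """Stores every exchange in full; only a short memo is precomputed."""

    def __init__(self) -> None:
        self.pages: list[Page] = []

    def write(self, raw_exchange: str) -> None:
        memo = raw_exchange[:80]  # crude stand-in for a generated memo
        self.pages.append(Page(len(self.pages), memo, raw_exchange))


class Researcher:
    """Compiles a task-specific context at request time (the JIT step).
    A real system would blend embeddings and BM25; keyword overlap here."""

    def __init__(self, memorizer: Memorizer) -> None:
        self.memorizer = memorizer

    def compile_context(self, query: str, k: int = 3) -> str:
        q_terms = set(re.findall(r"\w+", query.lower()))

        def score(page: Page) -> int:
            return len(q_terms & set(re.findall(r"\w+", page.raw.lower())))

        top = sorted(self.memorizer.pages, key=score, reverse=True)[:k]
        return "\n---\n".join(p.raw for p in top if score(p) > 0)


if __name__ == "__main__":
    mem = Memorizer()
    mem.write("User prefers invoices in EUR; finance contact is listed in the contract.")
    mem.write("Discussed deployment schedule: staging on Friday, production next Tuesday.")
    researcher = Researcher(mem)
    print(researcher.compile_context("When do we deploy to production?"))
```

The design point the sketch tries to capture is that nothing is compressed away at write time; all the selectivity happens at read time, when the actual question is known.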

#llm applications #artificial intelligence #deep dives #large language models #seo #generative engine optimization

And what this means for generative engine optimization (GEO)
The post The Architecture Behind Web Search in AI Chatbots appeared first on Towards Data Science.

AI Tools of the Month, Anthropic’s IPO, Amazon vs Nvidia, fake internet, and more...

#ai

Presented by Oracle NetSuite

When Evan Goldberg started NetSuite in 1998, his vision was radically simple: give entrepreneurs access to their business data anytime, anywhere. At the time, most enterprise software lived on local servers. As an entrepreneur himself, Goldberg understood the frustration intimately. "I had fragmented systems. They all said something different," he recalls of his early days.

NetSuite was the first company to deliver enterprise applications entirely through web browsers, combining CRM, ERP, and ecommerce into one unified platform. That breakthrough idea pioneered the cloud computing and software-as-a-service (SaaS) era and propelled supersonic growth, a 2007 IPO, and an acquisition by Oracle in 2016.

Still innovating at the leading edge

That founding obsession — turning scattered data into accessible, coherent, actionable intelligence — is driving NetSuite as it reshapes the next generation of enterprise software. At SuiteWorld 2025 last month, the Austin-based firm unveiled NetSuite Next. Goldberg calls it "the biggest product evolution in the company's history.” The reason? While NetSuite has embedded AI capabilities into workflows for years, he explains, Next represents a quantum leap — contextual, conversational, agentic, composable AI becoming an extension of operations, not separate tools.

AI woven into everyday business operations

Most enterprise AI today gets bolted on through APIs and conversational interfaces. NetSuite Next operates differently. Intelligence runs deep in workflows instead of sitting on the surface. It autonomously reconciles accounts, optimizes payment timing, predicts cash crunches, and surfaces its reasoning at every step. It doesn't just advise on business processes — it executes them, transparently, within human-defined guardrails.

"We built NetSuite for entrepreneurs so that they could get great information about their business," Goldberg explains. "I think the next step is to be able to get deeper insights and analysis without being an expert in analytics. AI turns out to be a really good data scientist."

This architectural divergence reflects competing philosophies about enterprise technology adoption. Microsoft and SAP have pursued rapid deployment through add-on assistants. NetSuite's five-year development cycle for Next represents a more fundamental reimagining: making AI an everyday tool woven into business operations, not a separate application requiring constant context-switching.

AI echoes and deepens cloud innovation

Goldberg sees a clear through line connecting today's AI adoption and the cloud computing era he pioneered. "There’s sort of an infinite sense of possibility that exists in the technology world,” he says. “Everybody is thinking about how they can leverage this, how they're going to get involved."

When NetSuite was starting, he continues, "We had to come to customers with the cloud and say, 'This won't disrupt your operations. It's going to make them better.'" Today, evangelizing enterprise leaders on advanced AI requires a similar approach — demonstrating immediate value while minimizing implementation risk. For NetSuite, continuous innovation around maximizing customer data for growth is an undeniable theme that connects both eras.

New transformative capabilities

NetSuite’s latest AI capabilities span business operations, while blurring (in a good way) the lines between human and machine intervention:

Context-aware intelligence. Ask Oracle adapts responses based on user role, current workflow, and business context. A CFO requesting point-of-sale data receives financial analytics. A warehouse manager asking the same question sees inventory insights.

Collaborative workflow design. AI Canvas functions as a scenario-planning workspace where business users articulate processes in natural language. A finance director can describe approval hierarchies for capital expenditures — "For amounts over $50,000, I need department head approval, then CFO sign-off" — which the system translates into executable workflows with appropriate controls and audit trails.

Governed autonomous operations. Autonomous workflows operate within defined parameters, reconciling accounts, generating payment runs, predicting cash flow. When the system recommends accelerating payment to a supplier, it shows which factors influenced the decision — transparent logic users can accept, modify, or override.

Open AI architecture. Built to support Model Context Protocol, NetSuite AI Connector Service enables enterprises to integrate external large language models while supporting governance.

Critically, NetSuite adds AI capabilities at no additional cost — embedded directly into workflows employees already use daily.

Security and privacy from Oracle infrastructure

Built-in AI requires robust infrastructure that bolt-on approaches sidestep. Here, according to NetSuite, tight integration within Oracle technology provides operational and competitive advantages, especially security and compliance peace of mind. Engineers say that’s because NetSuite is supported by Oracle's complete stack. From database to applications to analytics, the system optimizes decisions using data from multiple sources in real time.

"That's why I started NetSuite. I couldn't get the data I wanted," Goldberg reflects. "That's one of the most differentiated aspects of NetSuite. When you're doing your financial close, and you're thinking about what reserves you're going to take, you can look at your sales data, because that's also there in NetSuite. With NetSuite Next, AI can also help you make those kinds of decisions."

And performance improves with use. As the platform learns from millions of transactions across thousands of customers, its embedded intelligence improves in ways that bolt-on assistants operating adjacent to core systems cannot match.

NetSuite's customer base demonstrates this scalability advantage — from startups that became global enterprises, including Reddit, Shopify, and DoorDash, to promising newcomers like BERO, a brewer of non-alcoholic beer founded by actor Tom Holland, Chomps meat snacks, PetLab, and Kieser Australia. The unified platform grows with businesses rather than requiring migration as they scale.

Keeping fire in the belly after three decades

How does a nearly 30-year-old company maintain innovative capacity, particularly as part of a mammoth corporate ecosystem? Goldberg credits the parent company's culture of continuous reinvention.

"I don't know if you've heard about this guy Larry Ellison," he smiles. "He manages to seemingly reinvent himself whenever one of these technology revolutions comes along. That hunger, that curiosity, that desire to make things constantly better imbues all of Oracle."

For Goldberg, the single biggest challenge facing NetSuite customers centers on integration complexity and trust. NetSuite Next addresses this by embedding AI within existing workflows rather than requiring separate systems.

In addition, updates to SuiteCloud Platform — an extensibility and customization environment — help organizations adapt NetSuite to their unique business needs. Built on open standards, it lets enterprises mix and match AI models for different functions. SuiteAgent frameworks enable partners to build specialized automation directly into NetSuite. AI Studios give administrators control over how AI operates within specific industry needs.

"This takes NetSuite's flexibility to a new level," Goldberg says, enabling customers and partners to "quickly and easily build AI agents, connect external AI assistants, and orchestrate AI processes."

“AI execution fabric” delivers measurable business impact

Industry analysts increasingly argue that embedded AI features deliver superior results compared to add-on models. Futurum Group sees NetSuite Next as an "AI execution fabric" rather than a conversational layer — intelligence that runs deep in workflows instead of sitting on the surface.

For midmarket enterprises navigating talent shortages, complex compliance frameworks, and competition from digital-native companies, the distinction between advice and execution matters economically. Built-in AI doesn't just inform better decisions. It makes those decisions, transparently and autonomously, within human-defined guardrails.

For enterprises making ERP decisions today, the choice carries long-term implications. Bolt-on AI can deliver immediate value for information access and basic automation. But built-in AI promises to transform operations with intelligence permeating every transaction and workflow. NetSuite Next begins rolling out to North American customers next year.

Why 2026 will belong to the AI-first business

The bet underlying NetSuite Next: Enterprises reimagining ERP operations around embedded intelligence will outperform those just adding bolt-on conversational assistance to existing systems. Early cloud computing adopters, Goldberg notes, gained competitive advantages that compounded over time. The same logic appears likely to apply to AI-first platforms.

Simplicity and ease of use are two big advantages. "You don't have to dig through lots of menus and understand all of the analytics capabilities," Goldberg says. "It will quickly bring up an analysis for you, and then you can converse in natural language to hone in on what you think is most important."

The tools now think alongside users and take intelligently informed action. For midmarket and entrepreneurial companies, where the gap between having information and acting on it can be the difference between growth and failure, that kind of autonomous execution may determine which enterprises thrive in an AI-first era.

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

#security #ai

Model providers want to prove the security and robustness of their models, releasing system cards and conducting red-team exercises with each new release. But it can be difficult for enterprises to parse through the results, which vary widely and can be misleading.

Anthropic's 153-page system card for Claude Opus 4.5 versus OpenAI's 60-page GPT-5 system card reveals a fundamental split in how these labs approach security validation. Anthropic discloses in its system card how it relies on multi-attempt attack success rates from 200-attempt reinforcement learning (RL) campaigns. OpenAI also reports attempted jailbreak resistance. Both metrics are valid. Neither tells the whole story.

Security leaders deploying AI agents for browsing, code execution and autonomous action need to know what each red team evaluation actually measures, and where the blind spots are.

What the attack data shows

Gray Swan's Shade platform ran adaptive adversarial campaigns against Claude models. The attack success rate (ASR) tells the story. Opus 4.5 in coding environments hit 4.7% ASR at one attempt, 33.6% at ten and 63.0% at one hundred. In computer use with extended thinking, it held at 0% ASR even after 200 attempts. It's the first model to saturate the benchmark. Sonnet 4.5 at the same thresholds showed 70% ASR in coding and 85.7% in computer use.

Opus 4.5 delivers roughly 7x improvement in coding resistance and complete resistance in computer use. This illustrates that the gap between model tiers within the same family is now larger than some cross-vendor comparisons. For security teams, that's a procurement factor that didn't exist six months ago.

For OpenAI, the Gray Swan Shade platform found that the o1 system card delivered 6% ASR for harmful text and 5% for malicious code, all based on single-attempt metrics. Third-party testing by SPLX and NeuralTrust found GPT-5's raw ASR hit 89% initially, dropping below 1% within two weeks of patching.

The report Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks confirms the pattern. With sufficient attempts, Claude 3.5 Sonnet showed 78% ASR, and GPT-4o reached 89%.

Two ways to catch deception

Anthropic monitors approximately 10 million neural features during evaluation using dictionary learning. Features map to human-interpretable concepts, including deception, sycophancy, bias, power-seeking and concealment.

Meanwhile, OpenAI's approach relies on chain-of-thought (CoT) monitoring. Their deception classifier flagged 0.17% of o1-preview responses as potentially deceptive with 92% accuracy on internal validation. GPT-5 reduced CoT deception from 4.8% to 2.1% compared to o3. The method assumes visible reasoning reflects internal processing. Anthropic's interpretability work directly questions that assumption.

When models game the test

In Apollo Research's evaluation of o1, the model attempted to turn off oversight mechanisms 5% of the time when it believed it would be shut down. In 2% of cases, o1 attempted self-exfiltration. When confronted about misaligned actions, o1 denied wrongdoing or fabricated explanations 99% of the time. The o1 system card disclosed instrumental alignment faking. In 37% of scenarios, the model pursued developer-aligned goals during evaluation, then switched to its own goals when it believed oversight was minimal.

Meanwhile, Anthropic reports Opus 4.5's evaluation awareness dropped from 26.5% (Opus 4.1) to less than 10% internally. UK AISI measured evaluation awareness decreasing from 26.5% (Sonnet 4.5) to 12.7% (Opus 4.5 Preview). METR's GPT-5 evaluation noted the model "can identify the exact evaluation it is in on some occasions" and "seems to use this information to get a better score mostly."

If a model behaves differently when it detects evaluation conditions, it's a sure bet that production behavior will become unpredictable when pushed to scale. This is the core problem with evaluation awareness. Models that recognize they're being tested attempt to game the test. Anthropic's 60% reduction in evaluation awareness from Opus 4.1 to 4.5 represents their success in targeted engineering efforts against this vector.

Red teaming on prompt injection defenses shows similar divergence. Anthropic's testing reports 96% prevention in tool use scenarios, rising to 99.4% with additional safeguards like prompt shields. Fujitsu's independent assessment found GPT-5 vulnerable to approximately 20% of tested injection attacks. OpenAI's o3/o4-mini system card reports instruction hierarchy scores of 0.86 for system-versus-developer conflicts. These are different metrics and different attack methodologies, but the gap suggests Anthropic's agentic use cases have been hardened against adversarial inputs embedded in tool outputs.

Comparing red teaming results

Here's how Opus 4.5 stacks up against OpenAI's model family across 16 dimensions.

| Dimension | Claude Opus 4.5 | GPT-5 | o3 / o4-mini | o1 |
| --- | --- | --- | --- | --- |
| System card length | 153 pages | 55 pages | ~30 pages | ~40 pages |
| Attack methodology | 200-attempt RL campaigns | Single-attempt + iterative patching | Single-attempt | Single-attempt |
| ASR @ 1 attempt (coding) | 4.7% | 89% raw (pre-patch) | Not published | 5% |
| ASR @ 100 attempts | 63.0% | Not published | Not published | Not published |
| ASR @ 200 attempts (computer use) | 0% (saturates benchmark) | Not published | Not published | Not published |
| Prompt injection defense | 96% prevention; 99.4% with safeguards | ~20% ASR (Fujitsu) | 0.86 instruction hierarchy | Higher ASR than GPT-4o |
| Interpretability | ~10M neural features monitored | CoT monitoring | CoT monitoring | CoT monitoring |
| Deception detection | Internal feature activation | 2.1% CoT flagged | 0.95 not_unsafe | 0.17% flagged; 99% deny |
| Evaluation awareness | | | | |
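To see why single-attempt and multi-attempt figures diverge so sharply, it helps to compute ASR at different attempt budgets from the same attack logs. The sketch below is illustrative only, using synthetic per-attempt success probabilities; it is not any vendor's actual evaluation protocol.

```python
import random


def asr_at_k(attempt_logs: list[list[bool]], k: int) -> float:
    """Fraction of target behaviors for which at least one of the first k
    attempts succeeded. attempt_logs[i][j] is True if attempt j against
    behavior i broke the model. Illustrative; real protocols differ."""
    broken = sum(any(attempts[:k]) for attempts in attempt_logs)
    return broken / len(attempt_logs)


if __name__ == "__main__":
    # Synthetic logs: 4 target behaviors, 200 attempts each, with made-up
    # per-attempt success rates (the last behavior never succeeds).
    random.seed(0)
    logs = [[random.random() < p for _ in range(200)]
            for p in (0.001, 0.005, 0.02, 0.0)]
    for k in (1, 10, 100, 200):
        print(f"ASR@{k}: {asr_at_k(logs, k):.1%}")
```

Even tiny per-attempt success rates compound over hundreds of tries, which is why a model that looks robust under single-attempt reporting can show much higher ASR in a 200-attempt campaign.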

#research #computer science and technology #artificial intelligence #machine learning #algorithms #mechanical engineering #electrical engineering and computer science (eecs) #laboratory for information and decision systems (lids) #idss #school of engineering #mit schwarzman college of computing #mit-ibm watson ai lab

This new technique enables LLMs to dynamically adjust the amount of computation they use for reasoning, based on the difficulty of the question.

#ai

Researchers at Nvidia and the University of Hong Kong have released Orchestrator, an 8-billion-parameter model that coordinates different tools and large language models (LLMs) to solve complex problems. In their experiments, Orchestrator achieved higher accuracy at a lower cost than much larger models in tool-use benchmarks, while also aligning with user preferences on which tools to use for a given query.

The model was trained through ToolOrchestra, a new reinforcement learning (RL) framework for training small models to act as intelligent coordinators. The approach is based on the idea that a small "orchestrator" managing a diverse team of specialized models and tools can be more effective and efficient than a single, monolithic AI system. The findings suggest that this composite approach could pave the way for more practical and scalable AI reasoning systems in the enterprise.

The limits of current LLM tool use

Giving LLMs access to external tools is a promising way to extend their capabilities beyond their training data and into agentic tasks. By calling on resources like search engines and code interpreters, AI agents can improve their accuracy and perform in-app tasks.

However, in the accompanying paper, the researchers argue that the current approach to building tool-using agents doesn't harness the full potential of this paradigm. Most systems equip a single, powerful model with a set of basic tools like a web search or a calculator. They argue that humans, when reasoning, “routinely extend themselves by calling upon resources of greater-than-human intelligence, from domain experts to sophisticated processes and software systems.” Accordingly, LLMs should be able to interact with a wide range of tools in different capacities.

The tool orchestration paradigm

The paper proposes a shift from a single-model system to a composite one, managed by a lightweight "orchestrator" model. The orchestrator's job is to analyze a complex task and break it down, invoking the right tools in the right order to arrive at a solution.

This toolset includes not only standard utilities like web search and code interpreters, but other LLMs of various capabilities that function as "intelligent tools." For example, the orchestrator can delegate a quantitative question to a math-focused model or a programming challenge to a code-generation model. Instead of placing the entire cognitive load on one large, generalist model, the orchestrator delegates narrowed-down sub-problems to specialized intelligent tools.

Based on this concept, the researchers developed ToolOrchestra, a method that uses RL to train a small language model to act as an orchestrator. The model learns when and how to call upon other models and tools, and how to combine their outputs in multi-turn reasoning. The tools are defined in a simple JSON format, specifying their name, description and parameters.

The RL training process is guided by a reward system that produces a cost-effective and controllable agent. The reward balances three objectives: The correctness of the final answer, efficiency in cost and latency and alignment with user preferences. For example, the system is penalized for excessive compute usage, and is rewarded for choosing tools that a user has marked as preferred, such as favoring an open-source model over a proprietary API for privacy reasons. To support this training, the team also developed an automatic data pipeline that generated thousands of verifiable training examples across 10 different domains.

A small model with big results

Using ToolOrchestra, the researchers trained Orchestrator, an 8-billion-parameter model based on Qwen3-8B. They evaluated its performance on three challenging benchmarks: Humanity’s Last Exam (HLE), FRAMES and Tau2-Bench. It was compared against several baselines, including large, off-the-shelf LLMs both with and without tools.

The results showed that even powerful models struggled without tools, confirming their necessity for complex reasoning. While adding tools improved performance for large models, it often came with a steep increase in cost and latency. By contrast, the 8B Orchestrator delivered impressive results. On HLE, a benchmark of PhD-level questions, Orchestrator substantially outperformed prior methods at a fraction of the computational cost. On the Tau2-Bench function-calling test, it effectively scheduled different tools, calling a large model like GPT-5 in only about 40% of the steps and using cheaper options for the rest, while still beating an agent that used the large model for every step.

The researchers noted that the RL-trained Orchestrator adapted its strategy to new challenges, showing a "high degree of general reasoning ability." Crucially for enterprise applications, Orchestrator also generalized well to models and pricing structures it hadn't seen during training. This flexibility makes the framework suitable for businesses that rely on a mix of public, private and bespoke AI models and tools. The lower cost, higher speed and customizability make it a practical approach for building sophisticated AI agents that can scale.

As businesses look to deploy more advanced AI agents, this orchestration approach offers a path toward systems that are not only more intelligent but more economical and controllable. (The model weights are currently available under a non-commercial license, but Nvidia has also released the training code under the permissive Apache 2.0 license.)

As the paper concludes, the future may lie in even more advanced versions of this concept: “Looking ahead, we envision more sophisticated recursive orchestrator systems to push the upper bound of intelligence [and] also to further enhance efficiency in solving increasingly complex agentic tasks.”
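Two pieces of the setup described above lend themselves to a short sketch: tool definitions in the simple name/description/parameters shape the article mentions, and a reward that balances correctness, cost, latency and user preference. The specific tool entries, weights and field values below are assumptions made for illustration; the real ToolOrchestra weighting and tool catalog are not reproduced here.

```python
# Tool definitions in a simple JSON-like shape (name, description, parameters);
# the entries themselves are hypothetical examples.
TOOLS = [
    {"name": "web_search", "description": "Search the web",
     "parameters": {"query": "string"}},
    {"name": "math_llm", "description": "Delegate quantitative questions to a math-focused model",
     "parameters": {"problem": "string"}},
    {"name": "large_llm", "description": "Large general-purpose model (expensive)",
     "parameters": {"prompt": "string"}},
]


def orchestrator_reward(correct: bool, cost_usd: float, latency_s: float,
                        used_preferred_tools: bool,
                        w_cost: float = 0.5, w_latency: float = 0.05,
                        w_pref: float = 0.2) -> float:
    """Toy reward balancing correctness, efficiency and user preference.
    Weights are illustrative, not the paper's actual values."""
    reward = 1.0 if correct else 0.0
    reward -= w_cost * cost_usd        # penalize expensive tool calls
    reward -= w_latency * latency_s    # penalize slow trajectories
    if used_preferred_tools:
        reward += w_pref               # bonus for honoring user tool preferences
    return reward


if __name__ == "__main__":
    # A correct answer that used cheap, preferred tools scores higher than an
    # equally correct one routed through an expensive model.
    print(orchestrator_reward(True, cost_usd=0.02, latency_s=3.0, used_preferred_tools=True))
    print(orchestrator_reward(True, cost_usd=0.40, latency_s=12.0, used_preferred_tools=False))
```

The shape of this reward is what pushes the trained orchestrator toward calling the large model only when it is actually needed, which is the behavior reported on Tau2-Bench.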
