Latest AI News & Updates

#security #datadecisionmakers #ai

The race to deploy agentic AI is on. Across the enterprise, systems that can plan, take actions and collaborate across business applications promise unprecedented efficiency. But in the rush to automate, a critical component is being overlooked: scalable security. We are building a workforce of digital employees without giving them a secure way to log in, access data and do their jobs without creating catastrophic risk.

The fundamental problem is that traditional identity and access management (IAM) designed for humans breaks at agentic scale. Controls like static roles, long-lived passwords and one-time approvals are useless when non-human identities can outnumber human ones by 10 to one. To harness the power of agentic AI, identity must evolve from a simple login gatekeeper into the dynamic control plane for your entire AI operation.

“The fastest path to responsible AI is to avoid real data. Use synthetic data to prove value, then earn the right to touch the real thing.” — Shawn Kanungo, keynote speaker and innovation strategist; bestselling author of The Bold Ones

Why your human-centric IAM is a sitting duck

Agentic AI does not just use software; it behaves like a user. It authenticates to systems, assumes roles and calls APIs. If you treat these agents as mere features of an application, you invite invisible privilege creep and untraceable actions. A single over-permissioned agent can exfiltrate data or trigger erroneous business processes at machine speed, with no one the wiser until it is too late.

The static nature of legacy IAM is the core vulnerability. You cannot pre-define a fixed role for an agent whose tasks and required data access might change daily. The only way to keep access decisions accurate is to move policy enforcement from a one-time grant to a continuous, runtime evaluation.

Prove value before production data

Kanungo’s guidance offers a practical on-ramp. Start with synthetic or masked datasets to validate agent workflows, scopes and guardrails. Once your policies, logs and break-glass paths hold up in this sandbox, you can graduate agents to real data with confidence and clear audit evidence.

Building an identity-centric operating model for AI

Securing this new workforce requires a shift in mindset. Each AI agent must be treated as a first-class citizen within your identity ecosystem.

First, every agent needs a unique, verifiable identity. This is not just a technical ID; it must be linked to a human owner, a specific business use case and a software bill of materials (SBOM). The era of shared service accounts is over; they are the equivalent of giving a master key to a faceless crowd.

Second, replace set-and-forget roles with session-based, risk-aware permissions. Access should be granted just in time, scoped to the immediate task and the minimum necessary dataset, then automatically revoked when the job is complete. Think of it as giving an agent a key to a single room for one meeting, not the master key to the entire building.
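To make just-in-time, purpose-bound access concrete, here is a minimal sketch, assuming no particular IAM vendor or product: it mints a short-lived, narrowly scoped credential for a single agent session, then re-checks expiry, purpose and scope at the moment of use. Names such as issue_credential and the scope strings are hypothetical.

```python
# Minimal illustrative sketch (not a specific vendor API): issue a short-lived,
# task-scoped credential for an agent, then verify purpose and expiry at use time.
import secrets
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

@dataclass
class AgentCredential:
    agent_id: str          # unique identity, linked to a human owner
    purpose: str           # declared business purpose for this session
    scopes: set            # minimum datasets/actions needed for the task
    expires_at: datetime
    token: str = field(default_factory=lambda: secrets.token_urlsafe(32))

def issue_credential(agent_id: str, purpose: str, scopes: set, ttl_minutes: int = 15) -> AgentCredential:
    """Grant just-in-time access: narrow scopes, minutes-long lifetime."""
    return AgentCredential(
        agent_id=agent_id,
        purpose=purpose,
        scopes=set(scopes),
        expires_at=datetime.now(timezone.utc) + timedelta(minutes=ttl_minutes),
    )

def authorize(cred: AgentCredential, requested_scope: str, declared_purpose: str) -> bool:
    """Runtime, purpose-bound check instead of a one-time role grant."""
    if datetime.now(timezone.utc) >= cred.expires_at:
        return False                       # credential has already expired
    if declared_purpose != cred.purpose:
        return False                       # purpose binding: block off-purpose requests
    return requested_scope in cred.scopes  # least privilege: only the scoped dataset

cred = issue_credential("support-agent-42", "customer_support", {"read:tickets"})
print(authorize(cred, "read:tickets", "customer_support"))     # True
print(authorize(cred, "read:financials", "customer_support"))  # False
```

In a production system, the token would be issued by your identity provider and validated by the data layer itself rather than held in application memory; the sketch only shows the shape of the checks.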
Three pillars of a scalable agent security architecture

Context-aware authorization at the core. Authorization can no longer be a simple yes or no at the door. It must be a continuous conversation. Systems should evaluate context in real time. Is the agent’s digital posture attested? Is it requesting data typical for its purpose? Is this access occurring during a normal operational window? This dynamic evaluation enables both security and speed.

Purpose-bound data access at the edge. The final line of defense is the data layer itself. By embedding policy enforcement directly into the data query engine, you can enforce row-level and column-level security based on the agent’s declared purpose. A customer service agent should be automatically blocked from running a query that appears designed for financial analysis. Purpose binding ensures data is used as intended, not merely accessed by an authorized identity.

Tamper-evident evidence by default. In a world of autonomous actions, auditability is non-negotiable. Every access decision, data query and API call should be immutably logged, capturing the who, what, where and why. Link logs so they are tamper evident and replayable for auditors or incident responders, providing a clear narrative of every agent’s activities.

A practical roadmap to get started

Begin with an identity inventory. Catalog all non-human identities and service accounts. You will likely find sharing and over-provisioning. Begin issuing unique identities for each agent workload.

Pilot a just-in-time access platform. Implement a tool that grants short-lived, scoped credentials for a specific project. This proves the concept and shows the operational benefits.

Mandate short-lived credentials. Issue tokens that expire in minutes, not months. Seek out and remove static API keys and secrets from code and configuration.

Stand up a synthetic data sandbox. Validate agent workflows, scopes, prompts and policies on synthetic or masked data first. Promote to real data only after controls, logs and egress policies pass.

Conduct an agent incident tabletop drill. Practice responses to a leaked credential, a prompt injection or a tool escalation. Prove you can revoke access, rotate credentials and isolate an agent in minutes.

The bottom line

You cannot manage an agentic, AI-driven future with human-era identity tools. The organizations that will win recognize identity as the central nervous system for AI operations. Make identity the control plane, move authorization to runtime, bind data access to purpose and prove value on synthetic data before touching the real thing. Do that, and you can scale to a million agents without scaling your breach risk.

Michelle Buckner is a former NASA Information System Security Officer (ISSO).

Woman weds AI, Gemini 3.0, mind-reading AI, mass humanoids deployed, and more...

Aalto University researchers have developed a method to execute AI tensor operations using just one pass of light. By encoding data directly into light waves, they enable calculations to occur naturally and simultaneously. The approach works passively, without electronics, and could soon be integrated into photonic chips. If adopted, it promises dramatically faster and more energy-efficient AI systems.

#agentic ai #agents #automation #data science #llm

Learn how to take a manual process and optimize it using AI
The post How to Automate Workflows with AI appeared first on Towards Data Science.

#machine learning #deep learning #deep neural networks #explainable ai #neural network

What high-resolution NN training dynamics taught me about feature formation
The post I Measured Neural Network Training Every 5 Steps for 10,000 Iterations appeared first on Towards Data Science.

#gear #gear / how to and advice

One of the Mac’s most popular productivity apps is incorporating generative artificial intelligence in a way that keeps it offline, private, and customizable.

Social GPT, ElevenLabs Music, crab wheelchair, self-teaching agents, and more...

#ai

Researchers at Google Cloud and UCLA have proposed a new reinforcement learning framework that significantly improves the ability of language models to learn very challenging multi-step reasoning tasks. Supervised Reinforcement Learning (SRL) reformulates problem-solving as a sequence of logical “actions,” providing rich learning signals during the training process.

This approach enables smaller models to learn complex problems that were previously out of reach for other common training techniques. Experiments show that SRL not only excels on math reasoning benchmarks but also generalizes effectively to agentic software engineering tasks. SRL is a versatile training framework that can elevate smaller and less expensive models to higher reasoning abilities.

The limits of current LLM reasoning training

Recent advances in training large language models (LLMs) for reasoning have largely been driven by reinforcement learning with verifiable rewards (RLVR), a method where a model is rewarded based on the correctness of its final answer. By repeatedly trying to solve problems and getting feedback on the final outcome, the model gradually learns effective problem-solving strategies. However, the success of this outcome-based approach depends on the model's ability to discover a correct solution within a limited number of attempts, or "rollouts." Since each rollout is computationally expensive, models can't try indefinitely. This method hits a wall when problems are so difficult that the model rarely, if ever, finds the right answer within its budget.

This creates a critical learning bottleneck. In many multi-step reasoning problems, a model might correctly solve several steps but get derailed by a single mistake, leading to an incorrect answer. With RLVR, this entire effort receives a negative reward, and the model learns nothing from its partially correct work. It’s an all-or-nothing approach that fails to provide granular feedback and provides sparse rewards.

An alternative method is supervised fine-tuning (SFT), where the model learns from examples containing the full reasoning process laid out by experts. While SFT can instill reasoning abilities, it often leads to overfitting (the model simply learns to imitate the trajectories in the training data instead of learning to generalize to problems beyond the examples it has seen). This issue is made worse by the fact that high-quality, human-created training data is both scarce and expensive to produce.

As the paper notes, these limitations leave "a critical gap for training small open-source models to effectively learn difficult problems."

How supervised reinforcement learning works

SRL introduces a framework that reformulates problem-solving as a "sequential decision-making process," striking a balance between pure outcome-based RL and pure imitation learning. Instead of optimizing only for the final answer or forcing the model to imitate an expert's entire thought process, SRL teaches the model to reproduce a sequence of key actions that form the backbone of expert reasoning. This allows the model to learn to take actions similar to an expert while developing its own internal reasoning style.

In the SRL framework, expert demonstrations are broken down into a series of intermediate, concrete actions, each representing a meaningful step. For a math problem, an action might be an algebraic manipulation. For a software engineering agent, it could be a command executed in a code repository.
To generate training data, SRL uses a powerful teacher model to create solution trajectories, which are then used to train a smaller model. According to I-Hung Hsu, a research scientist at Google and co-author of the paper, this middle-ground approach is key to its effectiveness in real-world scenarios. "SRL sits in the middle: It captures the structured flexibility of real-world problem solving, where there are multiple valid strategies but also clear notions of what ‘good reasoning’ looks like at each step," Hsu told VentureBeat. "This makes SRL suitable for domains like data science automation or probably supply chain optimization — tasks that reward sound intermediate reasoning rather than mere final answers."

During training, the model first generates an "inner monologue" (its internal reasoning process, enclosed in tags) before committing to an action. At each step, SRL provides a reward based on the similarity between the model's predicted action and the expert's action. This step-wise reward system provides dense, fine-grained feedback, allowing the model to learn and improve even if its overall solution isn't perfect. This solves the sparse reward problem RLVR faces.
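To illustrate the step-wise reward idea, the toy sketch below scores each predicted action against the corresponding expert action with a simple string-similarity measure, producing one reward per step instead of a single end-of-trajectory signal. The similarity metric and the action format are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of SRL-style dense, step-wise rewards (the paper's exact
# similarity metric and action format are not reproduced here).
from difflib import SequenceMatcher

def step_reward(predicted_action: str, expert_action: str) -> float:
    """Reward one step by how closely the predicted action matches the expert's."""
    return SequenceMatcher(None, predicted_action, expert_action).ratio()

def trajectory_rewards(predicted_actions, expert_actions):
    """Dense feedback: one reward per step, even if the final answer is wrong."""
    return [step_reward(p, e) for p, e in zip(predicted_actions, expert_actions)]

expert = ["factor the quadratic", "set each factor to zero", "solve x = 2 or x = 3"]
model  = ["factor the quadratic", "divide both sides by x", "solve x = 2 or x = 3"]
print(trajectory_rewards(model, expert))  # partial credit for the correct steps
```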
SRL in action

The researchers' experiments show that SRL significantly outperforms strong baselines in both challenging mathematical reasoning and agentic software engineering benchmarks. They also observed that SRL encourages more flexible and sophisticated reasoning patterns in models, such as interleaved planning and self-verification, which improve solution quality without just making the outputs longer.

For enterprise leaders, performance gains are only valuable if they don't come with runaway costs. Hsu clarifies that SRL-trained models are more efficient in their reasoning. "The gains come from better reasoning quality and structure, not from verbosity," he said. "In terms of efficiency, SRL-trained models are roughly on par with the base model in token usage... while SRL isn’t designed to reduce inference cost, it achieves stronger reasoning performance without increasing it."

For the math tests, the team fine-tuned Qwen2.5-7B-Instruct on a dataset of 1,000 difficult math questions. They compared its performance against models trained with SFT and RLVR (using the GRPO algorithm common in models like DeepSeek-R1) on four competition-level math benchmarks. The SRL-trained model achieved a substantial 3.0% average performance boost over other methods.

The team extended SRL to agentic software engineering, a domain critical for enterprise automation. They trained a coding-specialized model, Qwen2.5-Coder-7B-Instruct, on 5,000 expert trajectories of agents interacting with a coding environment. The SRL-trained model was benchmarked against the original base model and SWE-Gym-7B, a strong baseline fine-tuned with SFT. SRL achieved a 14.8% task resolution rate, representing a 74% relative improvement over the SFT-based model. This shows SRL's ability to train more competent AI agents for complex, real-world programming tasks.

A new standard for high-stakes AI?

The paper's strongest results came from combining methods: first using SRL to teach foundational reasoning, then using RLVR to refine that skill. In their experiments, when the researchers used SRL for pre-training and applied RLVR in post-training, they observed a 3.7% average increase, demonstrating a powerful curriculum learning strategy. This raises the question of whether this could become a new blueprint for building specialized AI.

"We view SRL as a strong foundation," Hsu said. "In a sense, SRL provides a curriculum — teaching models to think and act step by step — before we refine those behaviors with outcome-based reinforcement learning. This SRL-first approach not only stabilizes the later RL stage but also makes reasoning more interpretable and generalizable, which is critical for high-stakes applications."

Looking ahead, Hsu acknowledges that scaling this pipeline still faces challenges, particularly the high cost and complexity of end-to-end RLVR for agentic tasks. However, he is optimistic about the path forward. "While high-quality expert trajectories remain important," he concluded, "we think the next big leap will come from automating their generation and filtering — leveraging strong teacher models or even self-improving student models to bootstrap new data."

#amazon bedrock #amazon bedrock agentcore #amazon machine learning #healthcare #life sciences #technical how-to

In this post, we demonstrate how to build a production-ready biomedical research agent by integrating Biomni's specialized tools with Amazon Bedrock AgentCore Gateway, enabling researchers to access over 30 biomedical databases through a secure, scalable infrastructure. The implementation showcases how to transform research prototypes into enterprise-grade systems with persistent memory, semantic tool discovery, and comprehensive observability for scientific reproducibility.

#amazon bedrock #amazon nova #technical how-to #ai/ml #generative ai

Graphical user interfaces have carried the torch for decades, but today’s users increasingly expect to talk to their applications. In this post we show how we added a true voice-first experience to a reference application—the Smart Todo App—turning routine task management into a fluid, hands-free conversation.

#advanced (300) #amazon bedrock #amazon bedrock agentcore #customer solutions #generative ai #learning levels #amazon sagemaker studio classic

Generative AI is transforming the way businesses interact with their customers and revolutionizing conversational interfaces for complex IT operations. Druva, a leading provider of data security solutions, is at the forefront of this transformation. In collaboration with Amazon Web Services (AWS), Druva is developing a cutting-edge generative AI-powered multi-agent copilot that aims to redefine the customer experience in data security and cyber resilience.

#artificial intelligence #author spotlights #data science #machine learning #product management

Janna Lipenkova on AI strategy, AI products, and how domain knowledge can change the entire shape of an AI solution.
The post “The success of an AI product depends on how intuitively users can interact with its capabilities” appeared first on Towards Data Science.

#ai

It was originally found in leaked code and publicized by AI influencers on X, but OpenAI has made it official: ChatGPT now offers Group Chats, allowing multiple users to join the same, single ChatGPT conversation and send messages to each other and the underlying large language model (LLM), online and via its mobile apps. Imagine adding ChatGPT as another member of your existing group chats, allowing you to text it as you would one of your friends or family members and have them respond as well, and you'll have an idea of the intriguing power and potential of this feature.

However, the feature is only available as a limited pilot for now to ChatGPT users in Japan, New Zealand, South Korea, and Taiwan (all tiers, including free usage). “Group chats are just the beginning of ChatGPT becoming a shared space to collaborate and interact with others,” OpenAI wrote in its announcement.

This development builds on internal experimentation at OpenAI, where technical staffer Keyan Zhang said in a post on X that OpenAI's team initially considered multiplayer ChatGPT to be “a wild, out-of-distribution idea.” According to Zhang, the model’s performance in those early tests demonstrated far more potential than existing interfaces typically allow.

The move follows OpenAI investor yet competitor Microsoft's update of its Copilot AI assistant to allow group chats last month, as well as Anthropic's introduction of shareable context and chat histories from its Claude AI models through its Projects feature introduced in summer 2024, though this is not a simultaneous, real-time group chat in the same way.

Collaborative functionality integrated into ChatGPT

Group chats function as shared conversational spaces where users can plan events, brainstorm ideas, or collaborate on projects with the added support of ChatGPT. These conversations are distinct from individual chats and are excluded from ChatGPT’s memory system—meaning no data from these group threads is used to train or personalize future interactions.

Users can initiate a group chat by selecting the people icon in a new or existing conversation. Adding others creates a copy of the original thread, preserving the source dialogue. Participants can join via a shareable link and are prompted to create a profile with a name, username, and photo. The feature supports 1 to 20 participants per group.

Each group chat is listed in a new section of the ChatGPT interface, and users can manage settings like naming the group, adding or removing participants, or muting notifications.

Powered by GPT-5.1 with expanded tools

The new group chat feature runs on GPT-5.1 Auto, a backend setting that chooses the optimal model based on the user’s subscription tier and the prompt. Functionality such as search, image generation, file upload, and dictation is available inside group conversations.

Importantly, the system applies rate limits only when ChatGPT is producing responses. Direct messages between human users in the group do not count toward any plan’s message cap.

OpenAI has added new social features to ChatGPT in support of this group dynamic. The model can react with emojis, interpret conversational context to decide when to respond, and personalize generated content using members’ profile photos—such as inserting user likenesses into images when asked.

Privacy by default, controls for younger users

OpenAI emphasized that privacy and user control are integral to group chat design.
The feature operates independently of the user’s personalized ChatGPT memory, and no new memories are created from these interactions. Participation requires an invitation link, and members are always able to see who is in a chat or leave at any time.

Users under the age of 18 are automatically shielded from sensitive content in group chats. Parents or guardians can disable group chat access altogether via built-in parental controls. Group creators retain special permissions, including immunity from being removed by others. All other participants can be added or removed by group members.

A testbed for shared AI experiences

OpenAI frames group chats as an early step toward richer, multi-user applications of AI, hinting at broader ambitions for ChatGPT as a shared workspace. The company expects to expand access over time and refine the feature based on how early users engage with it.

Keyan Zhang’s post suggests that the underlying model capabilities are far ahead of the interfaces users currently interact with. This pilot, in OpenAI’s view, offers a new “container” where more of the model’s latent capacity can be surfaced. “Our models have a lot more room to shine than today’s experiences show, and the current containers only use a fraction of their capabilities,” Zhang said.

With this initial pilot focused on a limited set of markets, OpenAI is likely monitoring both usage patterns and cultural fit as it plans for broader deployment. For now, the group chat experiment offers a new way for users to interact with ChatGPT—and with each other—in real time, using a conversational interface that blends productivity and personalization.

Developer access: Still unclear

OpenAI has not provided any indication that Group Chats will be accessible via the API or SDK. The current rollout is framed strictly within the ChatGPT product environment, with no mention of tool calls, developer hooks, or integration support for programmatic use. This absence of signaling leaves it unclear whether the company views group interaction as a future developer primitive or as a contained UX feature for end users only.

For enterprise teams exploring how to replicate multi-user collaboration with generative models, any current implementation would require custom orchestration—such as managing multi-party context and prompts across separate API calls, and handling session state and response merging externally. Until OpenAI provides formal support, Group Chats remain a closed interface feature rather than a developer-accessible capability.
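As a rough sketch of what that custom orchestration could look like today, the snippet below keeps a shared transcript, tags each human speaker inside the message content, and calls the standard Chat Completions endpoint for the assistant's turns. The model name, speaker-tagging convention and turn-taking logic are assumptions for illustration, not an official multi-user API.

```python
# Rough sketch of emulating a group chat with separate API calls: a shared
# transcript, speaker names tagged in the content, and the application deciding
# when the model should reply. Assumes the openai package and an API key.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
transcript = [{"role": "system",
               "content": "You are one participant in a group chat. Reply briefly and address people by name."}]

def add_human_message(speaker: str, text: str) -> None:
    # The Chat Completions API has no per-user identity, so speakers are tagged in-line.
    transcript.append({"role": "user", "content": f"{speaker}: {text}"})

def assistant_turn() -> str:
    # One model call over the full shared context; the app merges the reply back in.
    response = client.chat.completions.create(model="gpt-4o", messages=transcript)
    reply = response.choices[0].message.content
    transcript.append({"role": "assistant", "content": reply})
    return reply

add_human_message("Ana", "Can we plan the offsite for the first week of March?")
add_human_message("Raj", "Works for me. Can you draft a two-day agenda?")
print(assistant_turn())
```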
Implications for enterprise AI and data leaders

For enterprise teams already leveraging AI platforms—or preparing to—OpenAI’s group chat feature introduces a new layer of multi-user collaboration that could shift how generative models are deployed across workflows. While the pilot is limited to users in Japan, New Zealand, South Korea, and Taiwan, its design and roadmap offer key signals for AI engineers, orchestration specialists, and data leads globally.

AI engineers managing large language model (LLM) deployments can now begin to conceptualize real-time, multi-user interfaces not just as support tools, but as collaborative environments for research, content generation, and ideation. This adds another front in model tuning: not just how models respond to individuals, but how they behave in live group settings with context shifts and varied user intentions.

For AI orchestration leads, the ability to integrate ChatGPT into collaborative flows without exposing private memory or requiring custom builds may reduce friction in piloting generative AI in cross-functional teams. These group sessions could serve as lightweight alternatives to internal tools for brainstorming, prototyping, or knowledge sharing—useful for teams constrained by infrastructure, budget, or time.

Enterprise data managers may also find use cases in structured group chat sessions for data annotation, taxonomy validation, or internal training support. The system’s lack of memory persistence adds a level of data isolation that aligns with standard security and compliance practices—though global rollout will be key to validating regional data handling standards.

As group chat capabilities evolve, decision makers should monitor how shared usage patterns might inform future model behaviors, auditing needs, and governance structures. In the long term, features like these will influence not just how organizations interact with generative AI, but how they design team-level interfaces around it.

#machine learning #artificial intelligence #deep dives #interview #job search

A comprehensive guide to Meta, Apple, Reddit, Amazon, Google, and Snap ML design interviews
The post How to Crack Machine Learning System-Design Interviews appeared first on Towards Data Science.

Google Opal is a no-code, experimental tool from Google Labs. It is designed to enable users to build and share AI-powered micro-applications using natural language.

#large language models #artificial intelligence #editors pick #machine learning #music #python

This is how to build an AI-powered Song Explainer using Python and OpenAI
The post Music, Lyrics, and Agentic AI: Building a Smart Song Explainer using Python and OpenAI appeared first on Towards Data Science.

#ai

OpenAI researchers are experimenting with a new approach to designing neural networks, with the aim of making AI models easier to understand, debug, and govern. Sparse models can provide enterprises with a better understanding of how these models make decisions. Understanding how models choose to respond, a big selling point of reasoning models for enterprises, can provide a level of trust for organizations when they turn to AI models for insights. The method calls for OpenAI scientists and researchers to evaluate models not by analyzing post-training performance, but by building in interpretability, or understanding, through sparse circuits.

OpenAI notes that much of the opacity of AI models stems from how most models are designed, so to gain a better understanding of model behavior, researchers must create workarounds. “Neural networks power today’s most capable AI systems, but they remain difficult to understand,” OpenAI wrote in a blog post. “We don’t write these models with explicit step-by-step instructions. Instead, they learn by adjusting billions of internal connections or weights until they master a task. We design the rules of training, but not the specific behaviors that emerge, and the result is a dense web of connections that no human can easily decipher.”

To enhance interpretability, OpenAI examined an architecture that trains untangled neural networks, making them simpler to understand. The team trained language models with a similar architecture to existing models, such as GPT-2, using the same training schema. The result: improved interpretability.

The path toward interpretability

Understanding how models work, giving us insight into how they're making their determinations, is important because these models have a real-world impact, OpenAI says. The company defines interpretability as “methods that help us understand why a model produced a given output.” There are several ways to achieve interpretability: chain-of-thought interpretability, which reasoning models often leverage, and mechanistic interpretability, which involves reverse-engineering a model’s mathematical structure.

OpenAI focused on improving mechanistic interpretability, which it said “has so far been less immediately useful, but in principle, could offer a more complete explanation of the model’s behavior.” “By seeking to explain model behavior at the most granular level, mechanistic interpretability can make fewer assumptions and give us more confidence. But the path from low-level details to explanations of complex behaviors is much longer and more difficult,” according to OpenAI.

Better interpretability allows for better oversight and gives early warning signs if the model’s behavior no longer aligns with policy. OpenAI noted that improving mechanistic interpretability “is a very ambitious bet,” but research on sparse networks has improved this.

How to untangle a model

To untangle the mess of connections a model makes, OpenAI first cut most of these connections. Since transformer models like GPT-2 have thousands of connections, the team had to “zero out” these circuits. Each unit will only talk to a select number of others, so the connections become more orderly.

Next, the team ran “circuit tracing” on tasks to create groupings of interpretable circuits. The last task involved pruning the model “to obtain the smallest circuit which achieves a target loss on the target distribution,” according to OpenAI. It targeted a loss of 0.15 to isolate the exact nodes and weights responsible for behaviors.
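The "zero out most connections" idea can be illustrated with a toy weight-sparsity sketch: keep only the k largest-magnitude incoming weights for each unit so that every neuron reads from just a handful of others. This is not OpenAI's training recipe; the layer size and the value of k are arbitrary assumptions.

```python
# Toy illustration of weight sparsity (not OpenAI's actual training method):
# zero out all but the k largest-magnitude incoming weights for each unit.
import torch

def sparsify_(weight: torch.Tensor, k: int) -> torch.Tensor:
    """In place: keep the top-k entries per row by magnitude, zero the rest."""
    topk = weight.abs().topk(k, dim=1).indices
    mask = torch.zeros_like(weight, dtype=torch.bool)
    mask.scatter_(1, topk, True)
    weight.mul_(mask)
    return weight

layer = torch.nn.Linear(768, 768, bias=False)
with torch.no_grad():
    sparsify_(layer.weight, k=8)            # each output unit now reads from only 8 inputs
print((layer.weight != 0).float().mean())   # fraction of surviving connections (about 8/768)
```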
“We show that pruning our weight-sparse models yields roughly 16-fold smaller circuits on our tasks than pruning dense models of comparable pretraining loss. We are also able to construct arbitrarily accurate circuits at the cost of more edges. This shows that circuits for simple behaviors are substantially more disentangled and localizable in weight-sparse models than dense models,” the report said.

Small models become easier to train

Although OpenAI managed to create sparse models that are easier to understand, these remain significantly smaller than most foundation models used by enterprises. Enterprises increasingly use small models, but frontier models, such as its flagship GPT-5.1, will still benefit from improved interpretability down the line.

Other model developers also aim to understand how their AI models think. Anthropic, which has been researching interpretability for some time, recently revealed that it had “hacked” Claude’s brain — and Claude noticed. Meta also is working to find out how reasoning models make their decisions.

As more enterprises turn to AI models to help make consequential decisions for their business, and eventually customers, research into understanding how models think would give the clarity many organizations need to trust models more.

Automated cyberattacks, brainrot meets coding, anti-creep glasses, and more...

Researchers have created a prediction method that comes startlingly close to real-world results. It works by aiming for strong alignment with actual values rather than simply reducing mistakes. Tests on medical and health data showed it often outperforms classic approaches. The discovery could reshape how scientists make reliable forecasts.

#ai

Mere hours after OpenAI updated its flagship foundation model GPT-5 to GPT-5.1, promising reduced token usage overall and a more pleasant personality with more preset options, Chinese search giant Baidu unveiled its next-generation foundation model, ERNIE 5.0, alongside a suite of AI product upgrades and strategic international expansions. The goal: to position itself as a global contender in the increasingly competitive enterprise AI market.

Announced at the company's Baidu World 2025 event, ERNIE 5.0 is a proprietary, natively omni-modal model designed to jointly process and generate content across text, images, audio, and video. Unlike Baidu’s recently released ERNIE-4.5-VL-28B-A3B-Thinking, which is open source under an enterprise-friendly and permissive Apache 2.0 license, ERNIE 5.0 is a proprietary model and is available only via Baidu’s ERNIE Bot website (I needed to select it manually from the model picker dropdown) and the Qianfan cloud platform application programming interface (API) for enterprise customers.

Alongside the model launch, Baidu introduced major updates to its digital human platform, no-code tools, and general-purpose AI agents — all targeted at expanding its AI footprint beyond China. The company also introduced ERNIE 5.0 Preview 1022, a variant optimized for text-intensive tasks, alongside the general preview model that balances across modalities.

Baidu emphasized that ERNIE 5.0 represents a shift in how intelligence is deployed at scale, with CEO Robin Li stating: “When you internalize AI, it becomes a native capability and transforms intelligence from a cost into a source of productivity.”

Where ERNIE 5.0 outshines GPT-5 and Gemini 2.5 Pro

ERNIE 5.0’s benchmark results suggest that Baidu has achieved parity—or near-parity—with the top Western foundation models across a wide spectrum of tasks. In public benchmark slides shared during the Baidu World 2025 event, ERNIE 5.0 Preview outperformed or matched OpenAI’s GPT-5-High and Google’s Gemini 2.5 Pro in multimodal reasoning, document understanding, and image-based QA, while also demonstrating strong language modeling and code execution abilities. The company emphasized its ability to handle joint inputs and outputs across modalities, rather than relying on post-hoc modality fusion, which it framed as a technical differentiator.

On visual tasks, ERNIE 5.0 achieved leading scores on OCRBench, DocVQA, and ChartQA, three benchmarks that test document recognition, comprehension, and structured data reasoning. Baidu claims the model beat both GPT-5-High and Gemini 2.5 Pro on these document and chart-based benchmarks, areas it describes as core to enterprise applications like automated document processing and financial analysis.

In image generation, ERNIE 5.0 tied or exceeded Google’s Veo3 across categories including semantic alignment and image quality, according to Baidu’s internal GenEval-based evaluation. Baidu claimed that the model’s multimodal integration allows it to generate and interpret visual content with greater contextual awareness than models relying on modality-specific encoders.

For audio and speech tasks, ERNIE 5.0 demonstrated competitive results on MM-AU and TUT2017 audio understanding benchmarks, as well as question answering from spoken language inputs.
Its audio performance, while not as heavily emphasized as vision or text, suggests a broad capability footprint intended to support full-spectrum multimodal applications.

In language tasks, the model showed strong results on instruction following, factual question answering, and mathematical reasoning—core areas that define the enterprise utility of large language models. The Preview 1022 variant of ERNIE 5.0, tailored for textual performance, showed even stronger language-specific results in early developer access. While Baidu does not claim broad superiority in general language reasoning, its internal evaluations suggest that ERNIE 5.0 Preview 1022 closes the gap with top-tier English-language models and outperforms them in Chinese-language performance.

While Baidu did not release full benchmark details or raw scores publicly, its performance positioning suggests a deliberate attempt to frame ERNIE 5.0 not as a niche multimodal system but as a flagship model competitive with the largest closed models in general-purpose reasoning. Where Baidu claims a clear lead is in structured document understanding, visual chart reasoning, and integration of multiple modalities into a single, native modeling architecture. Independent verification of these results remains pending, but the breadth of claimed capabilities positions ERNIE 5.0 as a serious alternative in the multimodal foundation model landscape.

Enterprise Pricing Strategy

ERNIE 5.0 is positioned at the premium end of Baidu’s model pricing structure. The company has released specific pricing for API usage on its Qianfan platform, aligning the cost with other top-tier offerings from Chinese competitors like Alibaba.

Qianfan pricing per 1K tokens (input / output):
ERNIE 5.0: $0.00085 (¥0.006) / $0.0034 (¥0.024)
ERNIE 4.5 Turbo (ex.): $0.00011 (¥0.0008) / $0.00045 (¥0.0032)
Qwen3 (Coder ex.): $0.00085 (¥0.006) / $0.0034 (¥0.024)

The contrast in cost between ERNIE 5.0 and earlier models such as ERNIE 4.5 Turbo underscores Baidu’s strategy to differentiate between high-volume, low-cost models and high-capability models designed for complex tasks and multimodal reasoning. Compared to other U.S. alternatives, it remains mid-range in pricing:

Pricing per 1M tokens (input / output):
GPT-5.1: $1.25 / $10.00 (OpenAI)
ERNIE 5.0: $0.85 / $3.40 (Qianfan)
ERNIE 4.5 Turbo (ex.): $0.11 / $0.45 (Qianfan)
Claude Opus 4.1: $15.00 / $75.00 (Anthropic)
Gemini 2.5 Pro: $1.25 (≤200k) or $2.50 (>200k) / $10.00 (≤200k) or $15.00 (>200k) (Google Vertex AI pricing)
Grok 4 (grok-4-0709): $3.00 / $15.00 (xAI API)
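For a rough sense of what these list prices mean per request, the snippet below prices a hypothetical call with 8,000 input tokens and 1,000 output tokens at the per-million rates quoted above; the request size is an arbitrary assumption.

```python
# Back-of-envelope cost comparison using the per-1M-token list prices above.
# The 8,000-in / 1,000-out request size is an illustrative assumption.
PRICES_PER_M = {            # (input, output) in USD per 1M tokens
    "ERNIE 5.0": (0.85, 3.40),
    "GPT-5.1": (1.25, 10.00),
    "Gemini 2.5 Pro (<=200k)": (1.25, 10.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    inp, out = PRICES_PER_M[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

for model in PRICES_PER_M:
    print(f"{model}: ${request_cost(model, 8_000, 1_000):.4f} per request")
# ERNIE 5.0: $0.0102, GPT-5.1: $0.0200, Gemini 2.5 Pro: $0.0200
```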
Global Expansion: Products and Platforms

In tandem with the model release, Baidu is expanding internationally:

GenFlow 3.0, now with 20M+ users, is the company’s largest general-purpose AI agent and features enhanced memory and multimodal task handling.

Famou, a self-evolving agent capable of dynamically solving complex problems, is now commercially available via invite.

MeDo, the international version of Baidu’s no-code builder Miaoda, is live globally via medo.dev.

Oreate, a productivity workspace with document, slide, image, video, and podcast support, has reached over 1.2M users worldwide.

Baidu’s digital human platform, already rolled out in Brazil, is also part of the global push. According to company data, 83% of livestreamers during this year’s “Double 11” shopping event in China used Baidu’s digital human tech, contributing to a 91% increase in GMV. Meanwhile, Baidu’s autonomous ride-hailing service Apollo Go has surpassed 17 million rides, operating driverless fleets in 22 cities and claiming the title of the world’s largest robotaxi network.

Open-Source Vision-Language Model Garners Industry Attention

Two days before the flagship ERNIE 5.0 event, Baidu also released an open-source multimodal model under the Apache 2.0 license: ERNIE-4.5-VL-28B-A3B-Thinking. As reported by my colleague Michael Nuñez at VentureBeat, the model activates just 3 billion parameters while maintaining a total of 28 billion, using a Mixture-of-Experts (MoE) architecture for efficient inference. Key technical innovations include:

“Thinking with Images,” which enables dynamic zoom-based visual analysis

Support for chart interpretation, document understanding, visual grounding, and temporal awareness in video

Runtime on a single 80GB GPU, making it accessible to mid-sized organizations

Full compatibility with Transformers, vLLM, and Baidu’s FastDeploy toolkits

This release adds pressure on closed-source competitors. With Apache 2.0 licensing, ERNIE-4.5-VL-28B-A3B-Thinking becomes a viable foundation model for commercial applications without licensing restrictions — something few high-performing models in this class offer.

Community Feedback and Baidu’s Response

Following the launch of ERNIE 5.0, developer and AI evaluator Lisan al Gaib (@scaling01) posted a mixed review on X. While initially impressed by the model’s benchmark performance, they reported a persistent issue where ERNIE 5.0 would repeatedly invoke tools — even when explicitly instructed not to — during SVG generation tasks. “ERNIE 5.0 benchmarks looked insane until I tested it… unfortunately it’s RL braindamaged or they have a serious issue with their chat platform / system prompt,” Lisan wrote.

In a matter of hours, Baidu’s developer-focused support account, @ErnieforDevs, responded: “Thanks for the feedback! It’s a known bug — certain syntax can consistently trigger it. We’re working on a fix. You can try rephrasing or changing the prompt to avoid it for now.”

The quick turnaround reflects Baidu’s increasing emphasis on developer communication, especially as it courts international users through both proprietary and open-source offerings.

Outlook for Baidu and its ERNIE foundational LLM family

Baidu’s ERNIE 5.0 marks a strategic escalation in the global foundation model race. With performance claims that put it on par with the most advanced systems from OpenAI and Google, and a mix of premium pricing and open-access alternatives, Baidu is signaling its ambition to become not just a domestic AI leader, but a credible global infrastructure provider.

At a time when enterprise AI users are increasingly demanding multimodal performance, flexible licensing, and deployment efficiency, Baidu’s two-track approach—premium hosted APIs and open-source releases—may broaden its appeal across both corporate and developer communities.

Whether the company’s performance claims hold up under third-party testing remains to be seen. But in a landscape shaped by rising costs, model complexity, and compute bottlenecks, ERNIE 5.0 and its supporting ecosystem give Baidu a competitive position in the next wave of AI deployment.

#ai #automation

Artificial intelligence agents powered by the world's most advanced language models routinely fail to complete even straightforward professional tasks on their own, according to groundbreaking research released Thursday by Upwork, the largest online work marketplace. But the same study reveals a more promising path forward: When AI agents collaborate with human experts, project completion rates surge by up to 70%, suggesting the future of work may not pit humans against machines but rather pair them together in powerful new ways.

The findings, drawn from more than 300 real client projects posted to Upwork's platform, mark the first systematic evaluation of how human expertise amplifies AI agent performance in actual professional work — not synthetic tests or academic simulations. The research challenges both the hype around fully autonomous AI agents and fears that such technology will imminently replace knowledge workers.

"AI agents aren't that agentic, meaning they aren't that good," Andrew Rabinovich, Upwork's chief technology officer and head of AI and machine learning, said in an exclusive interview with VentureBeat. "However, when paired with expert human professionals, project completion rates improve dramatically, supporting our firm belief that the future of work will be defined by humans and AI collaborating to get more work done, with human intuition and domain expertise playing a critical role."

How AI agents performed on 300+ real freelance jobs—and why they struggled

Upwork's Human+Agent Productivity Index (HAPI) evaluated how three leading AI systems — Gemini 2.5 Pro, OpenAI's GPT-5, and Claude Sonnet 4 — performed on actual jobs posted by paying clients across categories including writing, data science, web development, engineering, sales, and translation. Critically, Upwork deliberately selected simple, well-defined projects where AI agents stood a reasonable chance of success. These jobs, priced under $500, represent less than 6% of Upwork's total gross services volume — a tiny fraction of the platform's overall business and an acknowledgment of current AI limitations.

"The reality is that although we study AI, and I've been doing this for 25 years, and we see significant breakthroughs, the reality is that these agents aren't that agentic," Rabinovich told VentureBeat. "So if we go up the value chain, the problems become so much more difficult, then we don't think they can solve them at all, even to scratch the surface. So we specifically chose simpler tasks that would give an agent some kind of traction."

Even on these deliberately simplified tasks, AI agents working independently struggled. But when expert freelancers provided feedback — spending an average of just 20 minutes per review cycle — the agents' performance improved substantially with each iteration.

20 minutes of human feedback boosted AI completion rates up to 70%

The research reveals stark differences in how AI agents perform with and without human guidance across different types of work. For data science and analytics projects, Claude Sonnet 4 achieved a 64% completion rate working alone but jumped to 93% after receiving feedback from a human expert. In sales and marketing work, Gemini 2.5 Pro's completion rate rose from 17% independently to 31% with human input.
OpenAI's GPT-5 showed similarly dramatic improvements in engineering and architecture tasks, climbing from 30% to 50% completion. The pattern held across virtually all categories, with agents responding particularly well to human feedback on qualitative, creative work requiring editorial judgment — areas like writing, translation, and marketing — where completion rates increased by up to 17 percentage points per feedback cycle.

The finding challenges a fundamental assumption in the AI industry: that agent benchmarks conducted in isolation accurately predict real-world performance. "While we show that in the tasks that we have selected for agents to perform in isolation, they perform similarly to the previous results that we've seen published openly, what we've shown is that in collaboration with humans, the performance of these agents improves surprisingly well," Rabinovich said. "It's not just a one-turn back and forth, but the more feedback the human provides, the better the agent gets at performing."

Why ChatGPT can ace the SAT but can't count the R's in 'strawberry'

The research arrives as the AI industry grapples with a measurement crisis. Traditional benchmarks — standardized tests that AI models can master, sometimes scoring perfectly on SAT exams or mathematics olympiads — have proven poor predictors of real-world capability. "With advances of large language models, what we're now seeing is that these static, academic datasets are completely saturated," Rabinovich said. "So you could get a perfect score in the SAT test or LSAT or any of the math olympiads, and then you would ask ChatGPT how many R's there are in the word strawberry, and it would get it wrong."

This phenomenon — where AI systems ace formal tests but stumble on trivial real-world questions — has led to growing skepticism about AI capabilities, even as companies race to deploy autonomous agents. Several recent benchmarks from other firms have tested AI agents on Upwork jobs, but those evaluations measured only isolated performance, not the collaborative potential that Upwork's research reveals. "We wanted to evaluate the quality of these agents on actual real work with economic value associated with it, and not only see how well these agents do, but also see how these agents do in collaboration with humans, because we sort of knew already that in isolation, they're not that advanced," Rabinovich explained.

For Upwork, which connects roughly 800,000 active clients posting more than 3 million jobs annually to a global pool of freelancers, the research serves a strategic business purpose: establishing quality standards for AI agents before allowing them to compete or collaborate with human workers on its platform.

The economics of human-AI teamwork: Why paying for expert feedback still saves money

Despite requiring multiple rounds of human feedback — each lasting about 20 minutes — the time investment remains "orders of magnitude different between a human doing the work alone, versus a human doing the work with an AI agent," Rabinovich said. Where a project might take a freelancer days to complete independently, the agent-plus-human approach can deliver results in hours through iterative cycles of automated work and expert refinement.

The economic implications extend beyond simple time savings. Upwork recently reported that gross services volume from AI-related work grew 53% year-over-year in the third quarter of 2025, one of the strongest growth drivers for the company.
But executives have been careful to frame AI not as a replacement for freelancers but as an enhancement to their capabilities. "AI was a huge overhang for our valuation," Erica Gessert, Upwork's CFO, told CFO Brew in October. "There was this belief that all work was going to go away. AI was going to take it, and especially work that's done by people like freelancers, because they are impermanent. Actually, the opposite is true."

The company's strategy centers on enabling freelancers to handle more complex, higher-value work by offloading routine tasks to AI. "Freelancers actually prefer to have tools that automate the manual labor and repetitive part of their work, and really focus on the creative and conceptual part of the process," Rabinovich said. Rather than replacing jobs, he argues, AI will transform them: "Simpler tasks will be automated by agents, but the jobs will become much more complex in the number of tasks, so the amount of work and therefore earnings for freelancers will actually only go up."

AI coding agents excel, but creative writing and translation still need humans

The research reveals a clear pattern in agent capabilities. AI systems perform best on "deterministic and verifiable" tasks with objectively correct answers, like solving math problems or writing basic code. "Most coding tasks are very similar to each other," Rabinovich noted. "That's why coding agents are becoming so good."

In Upwork's tests, web development, mobile app development, and data science projects — especially those involving structured, computational work — saw the highest standalone agent completion rates. Claude Sonnet 4 completed 68% of web development jobs and 64% of data science projects without human help, while Gemini 2.5 Pro achieved 74% on certain technical tasks.

But qualitative work proved far more challenging. When asked to create website layouts, write marketing copy, or translate content with appropriate cultural nuance, agents floundered without expert guidance. "When you ask it to write you a poem, the quality of the poem is extremely subjective," Rabinovich said. "Since the rubrics for evaluation were provided by humans, there's some level of variability in representation."

Writing, translation, and sales and marketing projects showed the most dramatic improvements from human feedback. For writing work, completion rates increased by up to 17 percentage points after expert review. Engineering and architecture projects requiring creative problem-solving — like civil engineering or architectural design — improved by as much as 23 percentage points with human oversight. This pattern suggests AI agents excel at pattern matching and replication but struggle with creativity, judgment, and context — precisely the skills that define higher-value professional work.

Inside the research: How Upwork tested AI agents with peer-reviewed scientific methods

Upwork partnered with elite freelancers on its platform to evaluate every deliverable produced by AI agents, both independently and after each cycle of human feedback. These evaluators created detailed rubrics defining whether projects met core requirements specified in job descriptions, then scored outputs across multiple iterations. Importantly, evaluators focused only on objective completion criteria, excluding subjective factors like stylistic preferences or quality judgments that might emerge in actual client relationships.
"Rubric-based completion rates should not be viewed as a measure of whether an agent would be paid in a real marketplace setting," the research notes, "but as an indicator of its ability to fulfill explicitly defined requests."This distinction matters: An AI agent might technically complete all specified requirements yet still produce work a client rejects as inadequate. Conversely, subjective client satisfaction — the true measure of marketplace success — remains beyond current measurement capabilities.The research underwent double-blind peer review and was accepted to NeurIPS, the premier academic conference for AI research, where Upwork will present full results in early December. The company plans to publish a complete methodology and make the benchmark available to the research community, updating the task pool regularly to prevent overfitting as agents improve."The idea is for this benchmark to be a living and breathing platform where agents can come in and evaluate themselves on all categories of work, and the tasks that will be offered on the platform will always update, so that these agents don't overfit and basically memorize the tasks at hand," Rabinovich said.Upwork's AI strategy: Building Uma, a 'meta-agent' that manages human and AI workersThe research directly informs Upwork's product roadmap as the company positions itself for what executives call "the age of AI and beyond." Rather than building its own AI agents to complete specific tasks, Upwork is developing Uma, a "meta orchestration agent" that coordinates between human workers, AI systems, and clients."Today, Upwork is a marketplace where clients look for freelancers to get work done, and then talent comes to Upwork to find work," Rabinovich explained. "This is getting expanded into a domain where clients come to Upwork, communicate with Uma, this meta-orchestration agent, and then Uma identifies the necessary talent to get the job done, gets the tasks outcomes completed, and then delivers that to the client."In this vision, clients would interact primarily with Uma rather than directly hiring freelancers. The AI system would analyze project requirements, determine which tasks require human expertise versus AI execution, coordinate the workflow, and ensure quality — acting as an intelligent project manager rather than a replacement worker."We don't want to build agents that actually complete the tasks, but we are building this meta orchestration agent that figures out what human and agent talent is necessary in order to complete the tasks," Rabinovich said. "Uma evaluates the work to be delivered to the client, orchestrates the interaction between humans and agents, and is able to learn from all the interactions that happen on the platform how to break jobs into tasks so that they get completed in a timely and effective manner."The company recently announced plans to open its first international office in Lisbon, Portugal, by the fourth quarter of 2026, with a focus on AI infrastructure development and technical hiring. The expansion follows Upwork's record-breaking third quarter, driven partly by AI-powered product innovation and strong demand for workers with AI skills.OpenAI, Anthropic, and Google race to build autonomous agents—but reality lags hypeUpwork's findings arrive amid escalating competition in the AI agent space. 
Upwork's AI strategy: Building Uma, a 'meta-agent' that manages human and AI workers

The research directly informs Upwork's product roadmap as the company positions itself for what executives call "the age of AI and beyond." Rather than building its own AI agents to complete specific tasks, Upwork is developing Uma, a "meta orchestration agent" that coordinates between human workers, AI systems, and clients.

"Today, Upwork is a marketplace where clients look for freelancers to get work done, and then talent comes to Upwork to find work," Rabinovich explained. "This is getting expanded into a domain where clients come to Upwork, communicate with Uma, this meta-orchestration agent, and then Uma identifies the necessary talent to get the job done, gets the tasks outcomes completed, and then delivers that to the client."

In this vision, clients would interact primarily with Uma rather than directly hiring freelancers. The AI system would analyze project requirements, determine which tasks require human expertise versus AI execution, coordinate the workflow, and ensure quality — acting as an intelligent project manager rather than a replacement worker. "We don't want to build agents that actually complete the tasks, but we are building this meta orchestration agent that figures out what human and agent talent is necessary in order to complete the tasks," Rabinovich said. "Uma evaluates the work to be delivered to the client, orchestrates the interaction between humans and agents, and is able to learn from all the interactions that happen on the platform how to break jobs into tasks so that they get completed in a timely and effective manner."

The company recently announced plans to open its first international office in Lisbon, Portugal, by the fourth quarter of 2026, with a focus on AI infrastructure development and technical hiring. The expansion follows Upwork's record-breaking third quarter, driven partly by AI-powered product innovation and strong demand for workers with AI skills.

OpenAI, Anthropic, and Google race to build autonomous agents—but reality lags hype

Upwork's findings arrive amid escalating competition in the AI agent space. OpenAI, Anthropic, Google, and numerous startups are racing to develop autonomous agents capable of complex multi-step tasks, from booking travel to analyzing financial data to writing software.

But recent high-profile stumbles have tempered initial enthusiasm. AI agents frequently misunderstand instructions, make logical errors, or produce confidently wrong results — a phenomenon researchers call "hallucination." The gap between controlled demonstration videos and reliable real-world performance remains vast. "There have been some evaluations that came from OpenAI and other platforms where real Upwork tasks were considered for completion by agents, and across the board, the reported results were not very optimistic, in the sense that they showed that agents—even the best ones, meaning powered by most advanced LLMs — can't really compete with humans that well, because the completion rates are pretty low," Rabinovich said.

Rather than waiting for AI to fully mature — a timeline that remains uncertain — Upwork is betting on a hybrid approach that leverages AI's strengths (speed, scalability, pattern recognition) while retaining human strengths (judgment, creativity, contextual understanding).

This philosophy extends to learning and improvement. Current AI models train primarily on static datasets scraped from the internet, supplemented by human preference feedback. But most professional work is qualitative, making it difficult for AI systems to know whether their outputs are actually good without expert evaluation. "Unless you have this collaboration between the human and the machine, where the human is kind of the teacher and the machine is the student trying to discover new solutions, none of this will be possible," Rabinovich said. "Upwork is very uniquely positioned to create such an environment because if you try to do this with, say, self-driving cars, and you tell Waymo cars to explore new ways of getting to the airport, like avoiding traffic signs, then a bunch of bad things will happen. In doing work on Upwork, if it creates a wrong website, it doesn't cost very much, and there's no negative side effects. But the opportunity to learn is absolutely tremendous."

Will AI take your job? The evidence suggests a more complicated answer

While much public discourse around AI focuses on job displacement, Rabinovich argues the historical pattern suggests otherwise — though the transition may prove disruptive. "The narrative in the public is that AI is eliminating jobs, whether it's writing, translation, coding or other digital work, but no one really talks about the exponential amount of new types of work that it will create," he said. "When we invented electricity and steam engines and things like that, they certainly replaced certain jobs, but the amount of new jobs that were introduced is exponentially more, and we think the same is going to happen here."

The research identifies emerging job categories focused on AI oversight: designing effective human-machine workflows, providing high-quality feedback to improve agent performance, and verifying that AI-generated work meets quality standards.
The research identifies emerging job categories focused on AI oversight: designing effective human-machine workflows, providing high-quality feedback to improve agent performance, and verifying that AI-generated work meets quality standards. These skills — prompt engineering, agent supervision, output verification — barely existed two years ago but now command premium rates on platforms like Upwork.

"New types of skills from humans are becoming necessary in the form of how to design the interaction between humans and machines, how to guide agents to make them better, and ultimately, how to verify that whatever agentic proposals are being made are actually correct, because that's what's necessary in order to advance the state of AI," Rabinovich said.

The question remains whether this transition — from doing tasks to overseeing them — will create opportunities as quickly as it disrupts existing roles. For freelancers on Upwork, the answer may already be emerging in their bank accounts: The platform saw AI-related work grow 53% year-over-year, even as fears of AI-driven unemployment dominated headlines.
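To illustrate the rubric-based completion metric described in this article, here is a minimal, hypothetical Python sketch of how explicitly defined requirements might be checked and aggregated into a completion rate. The rubric items, check functions and weights are invented for illustration; Upwork has not published this code or these criteria.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricItem:
    """One explicitly defined requirement from a task brief (hypothetical)."""
    description: str
    check: Callable[[str], bool]  # returns True if the deliverable satisfies it
    weight: float = 1.0

def completion_rate(deliverable: str, rubric: list[RubricItem]) -> float:
    """Weighted share of rubric items the deliverable satisfies.

    Note: this measures fulfillment of explicit requirements only; it says
    nothing about whether a real client would accept or pay for the work.
    """
    total = sum(item.weight for item in rubric)
    passed = sum(item.weight for item in rubric if item.check(deliverable))
    return passed / total if total else 0.0

# Illustrative usage with made-up requirements for a writing task
rubric = [
    RubricItem("Mentions the product name", lambda d: "Acme Widget" in d),
    RubricItem("Is at least 300 words long", lambda d: len(d.split()) >= 300),
    RubricItem("Includes a call to action", lambda d: "sign up" in d.lower()),
]
print(f"Completion rate: {completion_rate('draft text ...', rubric):.0%}")
```

The gap the researchers flag is precisely what this sketch cannot capture: a deliverable can score 100% against explicit checks and still be rejected by a real client.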

#artificial intelligence #app #intelligent machines #summary #subscriber-only stories

ChatGPT maker OpenAI has built an experimental large language model that is far easier to understand than typical models. That’s a big deal, because today’s LLMs are black boxes: Nobody fully understands how they do what they do. Building a model that is more transparent sheds light on how LLMs work in general, helping researchers…

#ai #data infrastructure #data management #enterprise

LinkedIn is launching its new AI-powered people search this week, after what seems like a very long wait for what should have been a natural offering for generative AI.

It comes a full three years after the launch of ChatGPT and six months after LinkedIn launched its AI job search offering. For technical leaders, this timeline illustrates a key enterprise lesson: Deploying generative AI in real enterprise settings is challenging, especially at a scale of 1.3 billion users. It's a slow, brutal process of pragmatic optimization.

The following account is based on several exclusive interviews with the LinkedIn product and engineering team behind the launch.

First, here's how the product works: A user can now type a natural language query like, "Who is knowledgeable about curing cancer?" into LinkedIn's search bar.

LinkedIn's old search, based on keywords, would have been stumped. It would have looked only for references to "cancer". If a user wanted to get sophisticated, they would have had to run separate, rigid keyword searches for "cancer" and then "oncology" and manually try to piece the results together.

The new AI-powered system, however, understands the intent of the search because the LLM under the hood grasps semantic meaning. It recognizes, for example, that "cancer" is conceptually related to "oncology" and, even less directly, to "genomics research." As a result, it surfaces a far more relevant list of people, including oncology leaders and researchers, even if their profiles don't use the exact word "cancer."

The system also balances this relevance with usefulness. Instead of just showing the world's top oncologist (who might be an unreachable third-degree connection), it will also weigh who in your immediate network — like a first-degree connection — is "pretty relevant" and can serve as a crucial bridge to that expert.

Arguably, though, the more important lesson for enterprise practitioners is the "cookbook" LinkedIn has developed: a replicable, multi-stage pipeline of distillation, co-design, and relentless optimization. LinkedIn had to perfect this on one product before attempting it on another.

"Don't try to do too much all at once," writes Wenjing Zhang, LinkedIn's VP of Engineering, in a post about the product launch; she also spoke with VentureBeat in an interview last week. She notes that an earlier "sprawling ambition" to build a unified system for all of LinkedIn's products "stalled progress."

Instead, LinkedIn focused on winning one vertical first. The success of its previously launched AI Job Search — which led to job seekers without a four-year degree being 10% more likely to get hired, according to VP of Product Engineering Erran Berger — provided the blueprint.

Now, the company is applying that blueprint to a far larger challenge. "It's one thing to be able to do this across tens of millions of jobs," Berger told VentureBeat. "It's another thing to do this across north of a billion members."

For enterprise AI builders, LinkedIn's journey provides a technical playbook for what it actually takes to move from a successful pilot to a billion-user-scale product.

The new challenge: a 1.3 billion-member graph

The job search product created a robust recipe that the new people search product could build upon, Berger explained. The recipe started with a "golden data set" of just a few hundred to a thousand real query-profile pairs, meticulously scored against a detailed 20- to 30-page "product policy" document.
To scale this for training, LinkedIn used this small golden set to prompt a large foundation model to generate a massive volume of synthetic training data. This synthetic data was used to train a 7-billion-parameter "Product Policy" model — a high-fidelity judge of relevance that was too slow for live production but perfect for teaching smaller models.

However, the team hit a wall early on. For six to nine months, they struggled to train a single model that could balance strict policy adherence (relevance) against user engagement signals. The "aha moment" came when they realized they needed to break the problem down. They distilled the 7B policy model into a 1.7B teacher model focused solely on relevance. They then paired it with separate teacher models trained to predict specific member actions, such as job applications for the jobs product, or connecting and following for people search. This "multi-teacher" ensemble produced soft probability scores that the final student model learned to mimic via a KL divergence loss.

The resulting architecture operates as a two-stage pipeline. First, a larger 8B-parameter model handles broad retrieval, casting a wide net to pull candidates from the graph. Then, the highly distilled student model takes over for fine-grained ranking. While the job search product successfully deployed a 0.6B (600-million) parameter student, the new people search product required even more aggressive compression. As Zhang notes, the team pruned their new student model from 440M down to just 220M parameters, achieving the necessary speed for 1.3 billion users with less than 1% relevance loss.

But applying this to people search broke the old architecture. The new problem included not just ranking but also retrieval. "A billion records," Berger said, is a "different beast."

The team's prior retrieval stack was built on CPUs. To handle the new scale and the latency demands of a "snappy" search experience, the team had to move its indexing to GPU-based infrastructure. This was a foundational architectural shift that the job search product did not require.

Organizationally, LinkedIn benefited from multiple approaches. For a time, LinkedIn had two separate teams — job search and people search — attempting to solve the problem in parallel. But once the job search team achieved its breakthrough using the policy-driven distillation method, Berger and his leadership team intervened. They brought over the architects of the job search win — product lead Rohan Rajiv and engineering lead Wenjing Zhang — to transplant their 'cookbook' directly to the new domain.

Distilling for a 10x throughput gain

With the retrieval problem solved, the team faced the ranking and efficiency challenge. This is where the cookbook was adapted with new, aggressive optimization techniques.

Zhang's technical post (I'll insert the link once it goes live) provides the specific details our audience of AI engineers will appreciate. One of the more significant optimizations was input size. To feed the model, the team trained another LLM with reinforcement learning (RL) for a single purpose: to summarize the input context. This "summarizer" model was able to reduce the model's input size by 20-fold with minimal information loss.
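To make the multi-teacher distillation step described above more concrete, here is a minimal PyTorch-style sketch of a student learning to mimic a blend of soft teacher scores via a KL divergence loss. The temperature, mixing weight and tensor shapes are illustrative assumptions; this is the generic distillation pattern, not LinkedIn's actual training code.

```python
import torch
import torch.nn.functional as F

def multi_teacher_distill_loss(
    student_logits: torch.Tensor,     # [batch, num_candidates] from the small ranker
    relevance_logits: torch.Tensor,   # soft scores from the relevance teacher
    engagement_logits: torch.Tensor,  # soft scores from the engagement teacher
    relevance_weight: float = 0.7,    # illustrative mixing weight (assumption)
    temperature: float = 2.0,
) -> torch.Tensor:
    """KL(blended teachers || student) over candidate scores, as in standard distillation."""
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    target_p = (
        relevance_weight * F.softmax(relevance_logits / t, dim=-1)
        + (1.0 - relevance_weight) * F.softmax(engagement_logits / t, dim=-1)
    )
    # batchmean KL, scaled by T^2 as is conventional for distillation losses
    return F.kl_div(student_logp, target_p, reduction="batchmean") * (t * t)

# Illustrative usage: random scores for a batch of 4 queries with 8 candidates each
batch, k = 4, 8
student_logits = torch.randn(batch, k, requires_grad=True)
loss = multi_teacher_distill_loss(student_logits, torch.randn(batch, k), torch.randn(batch, k))
loss.backward()
```

The appeal of the pattern is that each teacher can be trained and validated on its own objective (policy relevance, engagement), while the production student only has to match their blended soft scores at a fraction of the size.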
The combined result of the 220M-parameter model and the 20x input reduction? A 10x increase in ranking throughput, allowing the team to serve the model efficiently to its massive user base.

Pragmatism over hype: building tools, not agents

Throughout our discussions, Berger was adamant about something else that might catch people's attention: The real value for enterprises today lies in perfecting recommender systems, not in chasing "agentic hype." He also declined to discuss the specific models the company used for the searches, suggesting it almost doesn't matter; the company selects whichever model it finds most efficient for the task.

The new AI-powered people search is a manifestation of Berger's philosophy that it's best to optimize the recommender system first. The architecture includes a new "intelligent query routing layer," as Berger explained, that is itself LLM-powered. This router pragmatically decides whether a user's query — like "trust expert" — should go to the new semantic, natural-language stack or to the old, reliable lexical search.

This entire, complex system is designed to be a "tool" that a future agent will use, not the agent itself.

"Agentic products are only as good as the tools that they use to accomplish tasks for people," Berger said. "You can have the world's best reasoning model, and if you're trying to use an agent to do people search but the people search engine is not very good, you're not going to be able to deliver."

Now that people search is available, Berger suggested the company will one day offer agents that use it, though he didn't provide details on timing. He also said the recipe used for job and people search will be spread across the company's other products.

For enterprises building their own AI roadmaps, LinkedIn's playbook is clear:

Be pragmatic: Don't try to boil the ocean. Win one vertical, even if it takes 18 months.

Codify the "cookbook": Turn that win into a repeatable process (policy docs, distillation pipelines, co-design).

Optimize relentlessly: The real 10x gains come after the initial model, in pruning, distillation, and creative optimizations like an RL-trained summarizer.

LinkedIn's journey shows that for real-world enterprise AI, emphasis on specific models or cool agentic systems should take a back seat. The durable, strategic advantage comes from mastering the pipeline — the 'AI-native' cookbook of co-design, distillation, and ruthless optimization.

(Editor's note: We will be publishing a full-length podcast with LinkedIn's Erran Berger, which will dive deeper into these technical details, on the VentureBeat podcast feed soon.)
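The "intelligent query routing layer" described above suggests a simple pattern worth sketching: cheap heuristics handle the obvious cases, and a small LLM (or classifier) is consulted only for ambiguous queries. Everything below, including the heuristics and the stubbed-in model call, is a hypothetical illustration rather than LinkedIn's implementation.

```python
from dataclasses import dataclass

@dataclass
class RouteDecision:
    use_semantic: bool
    reason: str

def route_query(query: str, classify_with_llm) -> RouteDecision:
    """Decide whether a query goes to the semantic stack or lexical search.

    `classify_with_llm` is any callable that maps a prompt to "semantic" or
    "lexical" (for example, a small hosted model). Heuristics short-circuit
    the cheap, obvious cases so the router model is only called when needed.
    """
    q = query.strip()
    # Obvious lexical cases: emails or exact quoted strings
    if "@" in q or (q.startswith('"') and q.endswith('"')):
        return RouteDecision(False, "exact-match style query")
    # Longer natural-language questions are usually semantic
    if len(q.split()) >= 4 or q.endswith("?"):
        return RouteDecision(True, "natural-language intent")
    # Ambiguous short queries like "trust expert": ask the router model
    label = classify_with_llm(
        f"Classify this people-search query as 'semantic' or 'lexical': {q}"
    )
    return RouteDecision(label.strip().lower() == "semantic", "LLM-routed")

# Illustrative usage with a stubbed-out router model
print(route_query("trust expert", classify_with_llm=lambda prompt: "semantic"))
```

The design point is the same one Berger makes about tools: the router keeps the expensive semantic stack out of the hot path for queries the legacy lexical index already handles well.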

#artificial intelligence #app #intelligent machines #summary

Google DeepMind has built a new video-game-playing agent called SIMA 2 that can navigate and solve problems in a wide range of 3D virtual worlds. The company claims it’s a big step toward more general-purpose agents and better real-world robots. Google DeepMind first demoed SIMA (which stands for “scalable instructable multiworld agent”) last year. But…

Introducing SIMA 2, a Gemini-powered AI agent that can think, understand, and take actions in interactive environments.

#ai & ml #commentary

The following originally appeared on Asimov’s Addendum and is being republished here with the author’s permission. The other day, I was looking for parking information at Dulles International Airport, and was delighted with the conciseness and accuracy of Google’s AI overview. It was much more convenient than being told that the information could be found […]

As a machine learning engineer, you probably enjoy working on interesting tasks like experimenting with model architectures, fine-tuning hyperparameters, and analyzing results.

#ai

Alembic Technologies has raised $145 million in Series B and growth funding at a valuation 15 times higher than its previous round, betting that the next competitive advantage in artificial intelligence will come not from better language models but from proprietary data and causal reasoning.

The San Francisco-based startup, which builds AI systems that identify cause-and-effect relationships rather than mere correlations, is using a significant portion of the capital to deploy what it claims is one of the fastest privately owned supercomputers ever built — an Nvidia NVL72 superPOD that will power its enterprise-grade causal AI models.

The investment, led by Prysm Capital and Accenture with participation from Silver Lake Waterman, Liquid 2 Ventures, NextEquity, Friends & Family Capital and WndrCo, positions Alembic among a select group of well-funded AI laboratories transforming how corporations make multimillion-dollar decisions.

The funding round and the company's strategic direction reflect a broader shift taking place in enterprise AI as the performance gap between competing large language models narrows. While startups and tech giants have poured billions into building ever-larger chatbots, Alembic is pursuing a different thesis: that the real value in AI will accrue to systems that can process private corporate data to answer questions that generic models cannot.

"As powerful artificial intelligence models increasingly converge in capability, the key competitive advantage shifts to proprietary data," said Tomás Puig, Alembic's founder and chief executive, in an interview with VentureBeat. "Getting a real edge isn't about using the best LLM; it's leveraging the unique information rivals can't access."

Puig illustrated the problem facing enterprise executives: "Imagine I run a CPG company and I install the latest ChatGPT. I ask, 'Hey, ChatGPT, give me a strategy for how to increase my revenue share in the northeast.' Then your competitor down the road asks the exact same question. How much trouble are you in when they get the exact same answer?"

How a broke startup on Mac Pros discovered a breakthrough that changed everything

The dramatic valuation increase — from roughly $50 million at the Series A to approximately $645 million now, according to people familiar with the matter — reflects a fundamental transformation in Alembic's technology and market positioning since its previous funding round.

When the company raised its Series A in early 2024, it was primarily a signal processing and correlation analytics company focused on marketing measurement. "Causal did not exist as a technology for us till after the Series A," Puig told VentureBeat. The company was so resource-constrained that it couldn't even run simulations to test whether its causal models would work.

The breakthrough came after the Series A, when the company finally had enough capital to test its theories. "We were so broke that we couldn't even run the simulation to see if it worked," Puig recalled. When they did run the tests — initially on an "army of Mac Pros" because they didn't yet have GPU infrastructure — they discovered something unexpected: their causal model worked not just for marketing analytics but across virtually any business domain with time-series data.
"We started adding capabilities as customers requested them, which was just sensible — iterative," Puig explained. "We found out the model works across a huge majority of data universally. What we thought might be a model for a specific vertical ended up being a full, generalized foundational model."

That discovery transformed Alembic from a marketing technology vendor into a company building what Puig describes as "the entire central nervous system of the enterprise across all verticals — not just sales, marketing, supply chain, finance, and beyond."

Why cause-and-effect AI matters more than correlation for enterprise decision-making

Causal AI is a fundamentally different approach from the correlation-based analytics that dominate most business intelligence tools and even many AI systems. Where traditional analytics might show that social media engagement correlates with sales increases, causal AI can determine whether the social media activity actually caused the sales lift — or whether both were driven by some third factor, like a viral news event.

The distinction matters enormously for executives making budget allocation decisions. "Most businesses are not short on data," Puig said. "They are short on answers."

For Alembic's customers, which now include Delta Air Lines, Mars, Nvidia and several Fortune 500 companies across financial services, technology and consumer packaged goods, the platform can answer previously unanswerable questions about marketing effectiveness, operational efficiency and strategic investments.

"Alembic's ability to connect marketing exposure directly to business outcomes — with speed, precision and granularity — is what made this relationship so transformative for us," said Alicia Tillman, chief marketing officer at Delta Air Lines. "Unlike traditional measurement tools, Alembic gave us a unified view across channels and campaigns, unlocking insights we simply couldn't access before."

The airline used Alembic to quantify the revenue lift from its Team USA Olympics sponsorship within days of activation, directly linking brand activities to ticket sales — a type of measurement that has eluded marketers for decades. Traditional attribution models either ignore brand-building entirely or assign it vague "awareness" metrics that don't translate to financial impact.

"It's very transformative," Puig said of the customer impact. "What's interesting is that executives themselves are the users of our software and our outputs. It's not a tool used by a single campaign manager."

Inside the two-story liquid-cooled supercomputer that literally melted GPUs

Alembic's decision to invest heavily in owned computing infrastructure rather than rely on cloud providers stems from both the technical demands of its causal models and the extreme data sensitivity requirements of its enterprise customers.

The company is deploying an Nvidia NVL72 superPOD — a massive liquid-cooled system equipped with Nvidia's most advanced Blackwell graphics processing units — in partnership with data center operator Equinix in San Jose, Calif. According to Puig, Nvidia informed Alembic that it is the only non-Fortune 500 company in the world to operate such a system.

The need for this level of compute stems from how Alembic's models work. Unlike large language models that are trained once on historical data and then deployed, Alembic's system uses "online and evolving" models built on spiking neural networks — brain-inspired architectures that continuously learn as new data arrives.

"It creates itself as you feed it data, like human evolution," Puig explained. "The model is singular, but it ends up creating a different brain for every single company."
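For readers unfamiliar with the term, a spiking neural network is built from neurons that integrate input over time and fire discrete spikes, which is part of what makes continuous, event-driven learning natural for this class of model. The snippet below is a generic, textbook leaky integrate-and-fire neuron in Python, included only as background; it is not a description of Alembic's proprietary architecture.

```python
import numpy as np

def lif_spikes(inputs, tau=20.0, v_thresh=1.0, v_reset=0.0, dt=1.0):
    """Simulate one leaky integrate-and-fire neuron over an input current trace.

    The membrane potential leaks toward zero with time constant `tau`,
    integrates incoming current, and emits a spike (1) whenever it crosses
    `v_thresh`, after which it resets.
    """
    v = 0.0
    spikes = []
    for current in inputs:
        v += dt * (-v / tau + current)  # leak plus input integration
        if v >= v_thresh:
            spikes.append(1)
            v = v_reset
        else:
            spikes.append(0)
    return np.array(spikes)

# Illustrative usage: a noisy constant input produces an irregular spike train
rng = np.random.default_rng(0)
current_trace = 0.08 + 0.05 * rng.standard_normal(200)
print("spike count:", lif_spikes(current_trace).sum())
```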
This continuous learning happens at massive scale. When a customer brings in data, Alembic's system automatically permutates through billions of possible combinations of how that data could be analyzed — testing every conceivable way to slice metrics and dimensions to find the strongest causal signals. That level of computation requires what Puig calls "F1 car" infrastructure rather than the "production Porsche" offered by cloud providers.

The company writes custom CUDA code and low-level GPU kernels optimized specifically for causal inference workloads — optimizations that aren't possible on standard cloud configurations. The approach has proven so demanding that Alembic famously once melted down its GPUs by pushing them beyond their thermal limits. "We literally just drive these circuits so hard that we need the F1 car version and we have to have access to it," Puig said.

The move to liquid-cooled systems addresses that problem, but it also enables Alembic to run workloads that would cost orders of magnitude more on cloud platforms. "We did the math — if we were to buy just one subsection of our compute from AWS, it would be $62 million a year," Puig said. Owning the infrastructure costs "a fraction of that."

The supercomputer strategy serves another crucial purpose: data sovereignty. Many of Alembic's customers — particularly in financial services, consumer packaged goods and regulated industries — have contractual prohibitions against putting sensitive data on Amazon Web Services, Microsoft Azure or Google Cloud.

"CPG companies do not want any data to exist on Amazon, ever," Puig said. "They simply won't allow it. Some customers refuse to use Microsoft, others avoid different providers. And certain banks and financial institutions are legally prohibited from using cloud platforms at all."

By operating its own infrastructure in neutral data centers, Alembic can serve customers who would never consider cloud-based analytics — a competitive moat that would be difficult for hyperscale cloud providers to replicate.

How Jensen Huang read a news article and changed Alembic's destiny

Alembic's relationship with Nvidia illustrates both the startup's technical ambitions and how the chip giant supports promising AI companies. Nvidia is Alembic's founding enterprise customer, exclusive supercomputing partner and a key technical collaborator — though notably not an investor.

The relationship began in an unlikely way. After Alembic announced its Series A funding in early 2024, Nvidia co-founder and CEO Jensen Huang read the VentureBeat coverage and emailed his staff suggesting they explore the company, according to Puig. Because Alembic didn't yet have a contact form on its website, an Nvidia director reached out via LinkedIn.

The partnership nearly foundered on a basic constraint: computing capacity. After Alembic delivered its first causal analysis — which took weeks to generate on an array of Mac Pros — Nvidia asked whether Alembic could produce weekly reports. "I said no, because it took weeks on this army of machines," Puig recalled.

When Alembic said it could do so with GPUs but couldn't secure the necessary compute — cloud providers at the time required committee approvals and offered two- to six-week lead times with no guarantees — Nvidia intervened directly. The chip maker arranged for Equinix to provide a private cage in Northern Virginia with sufficient power capacity and helped Alembic source its first H100 GPU cluster.
"We couldn't get the compute in the configuration we needed anywhere else."The partnership has since deepened. Alembic uses Nvidia's AI Enterprise software suite, including specialized libraries like cuGraph for graph processing and TensorRT for high-speed inference. The tight integration, Puig said, allows "our research teams to leverage multi-exaflop-level compute and Nvidia's algorithmic software stack. This integration is one of our secret weapons: we spend more time on breakthrough research and mathematics and less time on repetitive low-level engineering."Nvidia's support extended beyond technology. When Alembic kept destroying GPUs under extreme workloads — pushing chips so hard that thermal stress cracked circuit boards — Nvidia fast-tracked the startup's access to next-generation liquid-cooled systems. "The funny reason we got [the NVL72]," Puig said, "is because when we melted the chips, Nvidia was literally annoyed with how often they had to service our warranty."From Olympics sponsorships to viral candy moments: How Fortune 500s measure what was unmeasurableAlembic's customer roster has expanded rapidly as enterprises seek ways to measure AI and marketing investments that traditional analytics cannot capture. The company now works with Delta Air Lines, Mars, multiple Fortune 500 technology and financial services firms, and Texas A&M University's athletics program.The use cases span far beyond Alembic's original marketing focus. Mars used the platform to measure the sales impact of changing candy shapes for themed promotions. A Fortune 500 technology company expanded its sales pipeline by 37% using Alembic's attribution models. Financial services firms are using it to connect CEO public appearances and co-marketing expenditures to actual fund flows."Alembic helped us move past impression counts to show what actually drove net-new investment," said the head of co-marketing at a Fortune 200 financial services company. "For the first time, we could see how our CEO in the public eye and our co-marketing dollars with exchanges translated into real fund flows."For Mars, the ability to measure previously unmeasurable activities has transformed decision-making. "We are using math to liberate creativity," said Gülen Bengi, lead global chief marketing officer for Mars and global chief growth officer for Mars Snacking. "Our fans and communities create billions of organic conversations and content about our brands. When a viral moment happens, we normally know it's directionally positive, but we can't attribute the sales uplift or its place in the customer journey. Alembic's Causal AI is a breakthrough, allowing us to move beyond correlation to see exactly how that organic conversation created a sequence that directly impacted sales."The platform can predict revenue, close rates and customer acquisition up to two years in advance with 95% confidence, according to Puig. "What they were doing before was they actually literally did not know about certain things," he said, describing how customers previously estimated the value of stadium naming rights or major sponsorships without ever measuring actual dollar impact. 
"Now you can go and be like it had this effect on this much P&L, and this is where it's flowing, and you can know within days or near real time."Why Google, Meta and Nielsen can't easily replicate what Alembic builtAlembic operates in a competitive landscape that includes traditional marketing measurement vendors like Nielsen, analytics platforms from Google and Meta, and emerging AI-powered analytics startups. But Puig argues the company has built structural advantages that would be difficult to replicate.First, the company's causal models rely on proprietary mathematics developed over years and protected by patents. "You would have to start from scratch," Puig said. "This is not like an LLM that uses a transformer that has a paper, and you could attempt to recreate. You'd actually have to go and recreate the methodology from scratch."Second, the massive computing requirements create a natural barrier. Alembic operates at "foundational model levels of compute, not like even something you would run from [AWS] Sagemaker," Puig said. "We're talking about hundreds of millions of dollars a year" in equivalent cloud costs.Third, the data sovereignty requirements of enterprise customers create opportunities for neutral third parties that hyperscale cloud providers struggle to address. As one venture capital investor noted, enterprises increasingly worry about putting strategic data into systems owned by potential competitors.Finally, Alembic's ability to work with messy, fragmented data reflects years of engineering that preceded its causal AI breakthrough. "The first four [or] five years of the company's life was building that giant signal processor that dealt with messy data," Puig said. "We would not be able to do it if we had not taken all that time."Why Alembic's contrarian bet on private data could reshape enterprise AIThe $145 million funding round validates a contrarian bet in an AI landscape dominated by the race to build ever-larger language models. While OpenAI, Anthropic and others compete on whose chatbot can write better code or answer more trivia questions, Alembic is building infrastructure for a different kind of intelligence — one that understands cause and effect in the messy, proprietary data that defines each company's unique competitive position.The company's evolution from a bootstrapped startup running simulations on Mac Pros to operating one of the world's fastest private supercomputers mirrors the broader maturation of enterprise AI. As the technology moves from experimentation to mission-critical deployment, companies need more than general-purpose models trained on public data. They need systems that can process their private information to answer questions their competitors cannot.Puig's thesis — that private data becomes the key differentiator as public models converge — resonates with how other technologies evolved. Search engines commoditized access to public information, making proprietary data more valuable. Cloud computing made infrastructure a utility, elevating the importance of what you build on top of it. If large language models similarly converge in capability, the competitive advantage flows to whoever can best extract intelligence from data others cannot access.The company is already testing its technology beyond marketing analytics. Pilots are underway in robotics, where causal models could help autonomous systems understand how actions lead to outcomes. 
The company is already testing its technology beyond marketing analytics. Pilots are underway in robotics, where causal models could help autonomous systems understand how actions lead to outcomes. New product lines are launching, including a GPU-accelerated database that customers are buying separately. The ambition, Puig said, is to become "the central nervous system" of the enterprise — the layer that connects cause and effect across every business function.

Whether Alembic can deliver on that vision remains to be seen. The company operates in complex enterprise environments where sales cycles are long and integration challenges are significant. Competitors aren't standing still, and the technical moats that protect it today may erode as causal AI techniques become better understood.

But for now, Alembic occupies a unique position. It has marquee customers achieving measurable results. It has infrastructure that would cost hundreds of millions to replicate on cloud platforms. It has proprietary mathematics refined over years of dealing with messy enterprise data. And it has $145 million to scale what Puig describes as a fundamental shift from correlation to causation.

In his interview with VentureBeat, Puig drew a parallel to quantitative hedge funds that use mathematics to gain trading advantages that general-purpose AI cannot match. "ChatGPT still can't equal Renaissance Technologies," he said, referring to the secretive firm that has generated historic returns through quantitative models.

The comparison captures Alembic's core insight: that in a world where everyone has access to the same general-purpose AI, sustainable advantage comes from specialized systems that understand the cause-and-effect relationships hiding in your data. It's a bet that the future of enterprise AI looks less like a universal chatbot and more like a private intelligence engine — one that, to Puig's original point, prevents your competitor from getting the same answer when they ask the same question.
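To make the correlation-versus-causation distinction discussed above concrete, here is a small, self-contained sketch of classic confounder adjustment: a hypothetical viral event lifts both social-media engagement and sales, so a naive regression of sales on engagement overstates engagement's effect, while including the confounder recovers something close to the true value. This is a textbook illustration on synthetic data, not Alembic's method.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical data-generating process: a viral event (the confounder) lifts
# both social engagement and sales; engagement itself has a small true effect.
viral_event = rng.binomial(1, 0.1, n).astype(float)
engagement = 2.0 * viral_event + rng.normal(0, 1, n)
sales = 0.3 * engagement + 5.0 * viral_event + rng.normal(0, 1, n)

# Naive, correlation-style estimate: regress sales on engagement alone
naive_slope = np.polyfit(engagement, sales, 1)[0]

# Confounder-adjusted estimate: ordinary least squares with the event included
X = np.column_stack([np.ones(n), engagement, viral_event])
adjusted_slope = np.linalg.lstsq(X, sales, rcond=None)[0][1]

print(f"true effect: 0.30, naive: {naive_slope:.2f}, adjusted: {adjusted_slope:.2f}")
```

Run as written, the naive estimate lands well above the true 0.30 coefficient while the adjusted one lands near it, which is the basic failure mode causal methods are meant to avoid when confounders are observed.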

#gemini #pixel #google in europe #ai

Advances in AI have opened up the possibilities for greater personalisation in the fashion world. Exploring this intersection between technology and high fashion, we’re …
