In this blog post, we demonstrate an end-to-end approach for building and evaluating a KIE solution using Amazon Nova models available through Amazon Bedrock. This end-to-end approach encompasses three critical phases: data readiness (understanding and preparing your documents), solution development (implementing extraction logic with appropriate models), and performance measurement (evaluating accuracy, efficiency, and cost-effectiveness). We illustrate this comprehensive approach using the FATURA dataset—a collection of diverse invoice documents that serves as a representative proxy for real-world enterprise data.
With the new cluster creation experience, you can create your SageMaker HyperPod clusters, including the required prerequisite AWS resources, in one click, with prescriptive default values automatically applied. In this post, we explore the new cluster creation experience for Amazon SageMaker HyperPod.
Develop simple yet powerful business intelligence tools tailored to meet your company's specific needs.
Egor Howell on breaking into ML without a CS degree, surviving 80+ interviews, and what to do if you feel stuck in your career.
The post Writing Is Thinking appeared first on Towards Data Science.
This introductory article discussed what data science is (and isn't), its relationship with math, statistic, and computer science, and why its high importance and impact nowadays... all in simple terms.
Extreme gradient boosting ( XGBoost ) is one of the most prominent machine learning techniques used not only for experimentation and analysis but also in deployed predictive solutions in industry.
AI-powered "vibe coding" promises rapid development but creates unprecedented security risks for data applications handling sensitive information.
As concerns grow about AI chatbots leading users into delusional spirals, prominent spiritual influencers are capitalizing on an emerging form of techno-spirituality.
Online gambling is a massive industry. The AI boom keeps booming. It was only a matter of time before people tried to put them together.
Everywhere I look, I see AI clones. On X and LinkedIn, “thought leaders” and influencers offer their followers a chance to ask questions of their digital replicas. OnlyFans creators are having AI models of themselves chat, for a price, with followers. “Virtual human” salespeople in China are reportedly outselling real humans. Digital clones—AI models that…
Declan would never have found out his therapist was using ChatGPT had it not been for a technical mishap. The connection was patchy during one of their online sessions, so Declan suggested they turn off their video feeds. Instead, his therapist began inadvertently sharing his screen. “Suddenly, I was watching him use ChatGPT,” says Declan,…
ChatGPT glasses win, .ai island gold, AI schools, spotting fake AI, and more...
Is over-specialization ending and are data generalists on the rise?
The post The Generalist: The New All-Around Type of Data Professional? appeared first on Towards Data Science.
WIRED talks to the director of the Chilean National Center for Artificial Intelligence about Latam-GPT, the large-language model that aims to address the region’s specific needs and change the current technological dynamic.
On this episode of Uncanny Valley, we look back at the week's biggest stories—from the researchers leaving Meta's new superintelligence lab, to the dark money group funding Democratic influencers.
What exactly does word2vec learn, and how? Answering this question amounts to understanding representation learning in a minimal yet interesting language modeling task. Despite the fact that word2vec is a well-known precursor to modern language models, for many years, researchers lacked a quantitative and predictive theory describing its learning process. In our new paper, we finally provide such a theory. We prove that there are realistic, practical regimes in which the learning problem reduces to unweighted least-squares matrix factorization. We solve the gradient flow dynamics in closed form; the final learned representations are simply given by PCA.
Learning dynamics of word2vec. When trained from small initialization, word2vec learns in discrete, sequential steps. Left: rank-incrementing learning steps in the weight matrix, each decreasing the loss. Right: three time slices of the latent embedding space showing how embedding vectors expand into subspaces of increasing dimension at each learning step, continuing until model capacity is saturated.
Before elaborating on this result, let’s motivate the problem. word2vec is a well-known algorithm for learning dense vector representations of words. These embedding vectors are trained using a contrastive algorithm; at the end of training, the semantic relation between any two words is captured by the angle between the corresponding embeddings. In fact, the learned embeddings empirically exhibit striking linear structure in their geometry: linear subspaces in the latent space often encode interpretable concepts such as gender, verb tense, or dialect. This so-called linear representation hypothesis has recently garnered a lot of attention since LLMs exhibit this behavior as well, enabling semantic inspection of internal representations and providing for novel model steering techniques. In word2vec, it is precisely these linear directions that enable the learned embeddings to complete analogies (e.g., “man : woman :: king : queen”) via embedding vector addition.
Maybe this shouldn’t be too surprising: after all, the word2vec algorithm simply iterates through a text corpus and trains a two-layer linear network to model statistical regularities in natural language using self-supervised gradient descent. In this framing, it’s clear that word2vec is a minimal neural language model. Understanding word2vec is thus a prerequisite to understanding feature learning in more sophisticated language modeling tasks.
The Result
With this motivation in mind, let’s describe the main result. Concretely, suppose we initialize all the embedding vectors randomly and very close to the origin, so that they’re effectively zero-dimensional. Then (under some mild approximations) the embeddings collectively learn one “concept” (i.e., orthogonal linear subspace) at a time in a sequence of discrete learning steps.
It’s like when diving head-first into learning a new branch of math. At first, all the jargon is muddled — what’s the difference between a function and a functional? What about a linear operator vs. a matrix? Slowly, through exposure to new settings of interest, the words separate from each other in the mind and their true meanings become clearer.
As a consequence, each new realized linear concept effectively increments the rank of the embedding matrix, giving each word embedding more space to better express itself and its meaning. Since these linear subspaces do not rotate once they’re learned, these are effectively the model’s learned features. Our theory allows us to compute each of these features a priori in closed form – they are simply the eigenvectors of a particular target matrix which is defined solely in terms of measurable corpus statistics and algorithmic hyperparameters.
What are the features?
The answer is remarkably straightforward: the latent features are simply the top eigenvectors of the following matrix:
\[M^{\star}_{ij} = \frac{P(i,j) - P(i)P(j)}{\frac{1}{2}(P(i,j) + P(i)P(j))}\]
where $i$ and $j$ index the words in the vocabulary, $P(i,j)$ is the co-occurrence probability for words $i$ and $j$, and $P(i)$ is the unigram probability for word $i$ (i.e., the marginal of $P(i,j)$).
Constructing and diagonalizing this matrix from the Wikipedia statistics, one finds that the top eigenvector selects words associated with celebrity biographies, the second eigenvector selects words associated with government and municipal administration, the third is associated with geographical and cartographical descriptors, and so on.
The takeaway is this: during training, word2vec finds a sequence of optimal low-rank approximations of $M^{\star}$. It’s effectively equivalent to running PCA on $M^{\star}$.
The following plots illustrate this behavior.
Learning dynamics comparison showing discrete, sequential learning steps.
On the left, the key empirical observation is that word2vec (plus our mild approximations) learns in a sequence of essentially discrete steps. Each step increments the effective rank of the embeddings, resulting in a stepwise decrease in the loss. On the right, we show three time slices of the latent embedding space, demonstrating how the embeddings expand along a new orthogonal direction at each learning step. Furthermore, by inspecting the words that most strongly align with these singular directions, we observe that each discrete “piece of knowledge” corresponds to an interpretable topic-level concept. These learning dynamics are solvable in closed form, and we see an excellent match between the theory and numerical experiment.
What are the mild approximations? They are: 1) quartic approximation of the objective function around the origin; 2) a particular constraint on the algorithmic hyperparameters; 3) sufficiently small initial embedding weights; and 4) vanishingly small gradient descent steps. Thankfully, these conditions are not too strong, and in fact they’re quite similar to the setting described in the original word2vec paper.
Importantly, none of the approximations involve the data distribution! Indeed, a huge strength of the theory is that it makes no distributional assumptions. As a result, the theory predicts exactly what features are learned in terms of the corpus statistics and the algorithmic hyperparameters. This is particularly useful, since fine-grained descriptions of learning dynamics in the distribution-agnostic setting are rare and hard to obtain; to our knowledge, this is the first one for a practical natural language task.
As for the approximations we do make, we empirically show that our theoretical result still provides a faithful description of the original word2vec. As a coarse indicator of the agreement between our approximate setting and true word2vec, we can compare the empirical scores on the standard analogy completion benchmark: word2vec achieves 68% accuracy, the approximate model we study achieves 66%, and the standard classical alternative (known as PPMI) only gets 51%. Check out our paper to see plots with detailed comparisons.
To demonstrate the usefulness of the result, we apply our theory to study the emergence of abstract linear representations (corresponding to binary concepts such as masculine/feminine or past/future). We find that over the course of learning, word2vec builds these linear representations in a sequence of noisy learning steps, and their geometry is well-described by a spiked random matrix model. Early in training, semantic signal dominates; however, later in training, noise may begin to dominate, causing a degradation of the model’s ability to resolve the linear representation. See our paper for more details.
All in all, this result gives one of the first complete closed-form theories of feature learning in a minimal yet relevant natural language task. In this sense, we believe our work is an important step forward in the broader project of obtaining realistic analytical solutions describing the performance of practical machine learning algorithms.
Learn more about our work: Link to full paper
This post originally appeared on Dhruva Karkada’s blog.
Photo revival, fast food flop, AI startup playbook, unsafe AI, and more...
Exploring ways to make voice assistants more personal
The post How to Develop a Bilingual Voice Assistant appeared first on Towards Data Science.
For ML practitioners, the natural expectation is that a new ML model that shows promising results offline will also succeed in production. But often, that’s not the case. ML models that outperform on test data can underperform for real production users. This discrepancy between offline and online metrics is often a big challenge in applied […]
The post Beyond AUC and RMSE: How to Align Offline Metrics with Real-World KPIs appeared first on Analytics Vidhya.
August 2025: logging, lab notebooks, overnight runs
The post The Machine Learning Lessons I’ve Learned This Month appeared first on Towards Data Science.
The physical meaning of matrix inversion, related formulas, and how inversion behaves on several special types of matrices.
The post Understanding Matrices | Part 4: Matrix Inverse appeared first on Towards Data Science.
How to build a fully functional, hands-free voice assistant on a Raspberry Pi
The post Crafting a Custom Voice Assistant with Perplexity appeared first on Towards Data Science.
Meta’s bold move, major AI breach, AI’s secret life, TIME100 AI, AGI hype, and more...
In today’s dynamic business environment, a company’s approach to customer experience can significantly impact its brand perception. One poor interaction, such as a missed delivery or an unhelpful agent, and the relationship often doesn’t recover. Industry data puts it into perspective: Nearly 32% of consumers abandon a brand after just one bad experience. The stakes […]
The post How Dentsu Uses Generative Machine Learning to Transform Customer Service? appeared first on Analytics Vidhya.
Software spending now makes up 40% of cybersecurity budgets, with investment expected to grow as CISOs prioritize real-time AI defenses.
M2N2 is a model merging technique that creates powerful multi-skilled agents without the high cost and data needs of retraining.
This is the inside story of Intuit's transformation journey with AI — including a grueling nine-month pivot to "burn the boats" and reinvent how the 40-year-old finance giant builds its products.
To kick off the second episode in Season 8 of the Made by Google podcast, host Rachid Finge asks Pixel Product Manager Stephanie Scott to describe the Pixel 10 phones in…
AI cybercrime and ransomware, GPT-Realtime, AI NPCs, Anthropic deadline, and more...
We’re excited to announce new security capabilities in Datadog Cloud Security that can help you detect and remediate Amazon Bedrock misconfigurations before they become security incidents. This integration helps organizations embed robust security controls and secure their use of the powerful capabilities of Amazon Bedrock by offering three critical advantages: holistic AI security by integrating AI security into your broader cloud security strategy, real-time risk detection through identifying potential AI-related security issues as they emerge, and simplified compliance to help meet evolving AI regulations with pre-built detections.