One thing that frustrates me, looking at the technical news on AI, is how little attention we give to one of the most impactful yet underappreciated innovations of this era: embeddings.

Over the last year, the conversation has been almost entirely about agents. Systems that can write code, plan multi-step workflows and reason through complex problems. This evolution has genuinely changed what’s possible with language models. But here’s the paradox: while agents dominate the headlines, the infrastructure powering most production AI systems has remained in the background. Semantic search, memory systems, recommendation engines, clustering…embeddings are the quiet foundation behind all of them.

Embeddings are why your photo gallery can find every picture of your dog, how Netflix knows what you’ll want to watch next and how Spotify builds playlists that feel like they’re reading your mind. We’ve been using this technology for years, long before ChatGPT. The same techniques powering these everyday systems are also what makes modern agents actually work. Whether it’s searching through documentation, deciding which code files to load into context or maintaining long-term memory across conversations, embeddings are doing the heavy lifting. The reasoning everyone celebrates happens on top of this infrastructure.

This article is about giving embeddings the credit they deserve. Not as a competitor to agents but as a complementary building block that’s deterministic and evaluable. Understanding when to reach for embeddings versus agents (and how they work together) is what separates effective AI engineering from hype-driven experimentation.


What Embeddings Actually Are

Look at this visualization: these are embeddings! Each small square represents how a model encoded a handwritten digit. The image you see is the actual digit, but its position in this space is determined by the model’s internal representation of what that digit means.

MNIST Embedding Space

Notice the clustering? Different digits naturally form distinct groups, each shown in a different color. This isn’t random. The model represents semantically similar digits closer together in this space. Look closely at the boundaries between clusters: you’ll see that 3s, 5s and 8s occupy neighboring regions because they share curved shapes. The geometry encodes semantics: distance in this space measures similarity. You’ll also spot a few outliers, digits that landed in the “wrong” cluster. A poorly written 3 that looks like an 8 might end up in the 8 cluster. The model isn’t perfect, but these mistakes reveal something important: it’s responding to actual visual similarity, not memorized rules. When humans struggle to read a digit, the model often struggles too.

When we say two embeddings are “close,” we’re using geometric distance to measure semantic similarity. The most common metric is cosine similarity, which measures the cosine of the angle between two vectors and ignores their length. It produces a score between -1 (opposite directions) and 1 (same direction), with 0 meaning the vectors are orthogonal, which is usually read as unrelated. Other measures exist, such as Euclidean distance and the dot product, but cosine similarity dominates in production systems for its reliability and speed.
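To make the metrics concrete, here is a minimal sketch in plain Python. The 2-D vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions, but the arithmetic is the same.

```python
import math

def dot(a, b):
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b, in the range [-1, 1]."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy vectors (hypothetical values, chosen so the results are easy to check)
a = [1.0, 0.0]
same_direction = [3.0, 0.0]   # points the same way as a, but is longer
orthogonal = [0.0, 2.0]
opposite = [-1.0, 0.0]

print(cosine_similarity(a, same_direction))   # 1.0
print(cosine_similarity(a, orthogonal))       # 0.0
print(cosine_similarity(a, opposite))         # -1.0
print(euclidean_distance(a, same_direction))  # 2.0
```

Note how the first pair scores a perfect 1.0 on cosine similarity while still being 2.0 apart in Euclidean terms: cosine similarity only looks at direction, not magnitude.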

From Similarity to Relationships

The example that made embeddings famous came from natural language. Word2vec learned to represent words as points in space where you could do more than just measure similarity.

Word2vec Embeddings

The model was trained by learning to predict which words appear near each other in sentences. If you see “king” in a text, you’re likely to see words like “throne” or “reign” nearby. The core insight: words that appear in similar contexts should have similar vectors. The model learned that “king” and “queen” both appear in royal contexts but with different gender-related words around them. It encoded this pattern as geometry: the offset from “man” to “king” is roughly the same, in both length and direction, as the offset from “woman” to “queen”. Subtract “man” from “king” and you get a vector pointing toward royalty without gender. Add “woman” and you land near “queen”.

Here’s what makes this powerful: nobody programmed these relationships. The neural network discovered them purely from reading text. The relationships we understand intuitively (gender, royalty, size) became mathematical operations (vector addition and subtraction) in the space.
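The arithmetic can be sketched with a handful of hand-made vectors. The values below are not real Word2vec output: they are hypothetical 2-D vectors where the first axis loosely stands for royalty and the second for gender, chosen only to show the mechanics.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors: [royalty, gender]
toy_vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.5, 0.0],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c]."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))  # skip the inputs
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman", toy_vectors))  # queen
```

With a trained model the result is only approximately “queen”, which is why the lookup is a nearest-neighbor search rather than an exact match.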

Putting It All Together

Now that we’ve built intuition, here’s a more precise definition: embeddings are a representation learning technique that maps complex, high-dimensional data into a vector space, where semantic relationships in the original data are preserved as geometric relationships.

Breaking this down

  • Representation learning: The model learns how to represent data, not just how to classify it.
  • High-dimensional data: A 1024×1024 image has over 1 million pixels of information.
  • Vector space: Each input becomes a numerical vector (typically 256 to 4096 dimensions).
  • Preserves relationships: Similar inputs map to nearby vectors, dissimilar inputs to distant ones.

How are embeddings created? During training, a neural network learns to map inputs to the desired outputs, passing them through several intermediate layers. The embedding is the internal representation extracted from one of these intermediate layers. Remember the hidden layers from the introductory series? Those activations are embeddings. Every neural network produces them as an internal representation of the input; we just need to extract and use them. The key is choosing the right layer: typically one that has compressed the input into a more compact form (256 to 4096 dimensions) while preserving the semantic information needed for the task.
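As a rough illustration of “the embedding is a hidden activation”, here is a tiny two-layer network in plain Python. The weights are hypothetical, hand-picked numbers; a real network learns them during training and has far more units.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def linear(inputs, weights, biases):
    """One dense layer: weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

# Hypothetical weights: 4 inputs -> 2 hidden units -> 1 output
W_HIDDEN = [[0.5, -0.2, 0.1, 0.0],
            [0.0, 0.3, -0.4, 0.2]]
B_HIDDEN = [0.0, 0.1]
W_OUT = [[1.0, -1.0]]
B_OUT = [0.0]

def forward(inputs):
    """Return (prediction, embedding); the embedding is the hidden activation."""
    hidden = relu(linear(inputs, W_HIDDEN, B_HIDDEN))
    output = linear(hidden, W_OUT, B_OUT)
    return output, hidden

prediction, embedding = forward([1.0, 2.0, 3.0, 4.0])
print(len(embedding))  # 2: the 4-dimensional input compressed to 2 dimensions
```

The network’s job is to produce `prediction`, but the reusable artifact is `embedding`: the compact intermediate vector computed along the way.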

The visualizations you saw above are 2D projections of much higher-dimensional spaces, compressed down so we can see them.


Building with Embeddings

Understanding what embeddings are is one thing. Knowing how to actually use them is another. The good news: you don’t need to build everything from scratch. A mature ecosystem of models and infrastructure already exists, most of it open source.

Embeddings Across Domains

Embeddings aren’t limited to text or images. They work for any data type where you can define similarity. The key question is: how much effort do you want to invest?

| Approach | Effort Level | When to Use | Example |
| --- | --- | --- | --- |
| Pre-trained models | Zero setup, ready to use | Your data resembles public training data | CLIP for images, sentence-transformers for text |
| Fine-tuned models | Moderate effort, train on your data | Domain-specific patterns matter | Security-focused text embeddings, medical image analysis |
| Train from scratch | High effort, custom architecture | Your data is truly unique | Stripe’s payment embeddings, network traffic analysis |

Text Embeddings

For text, start with sentence-transformers. These models are built on the transformer architecture, like GPT, but are specialized for creating embeddings rather than generating text. Unlike generative models that predict the next token in a sequence, sentence-transformers process the entire input sequence and produce a single fixed-size vector that captures its meaning. Use cases include semantic search (find documents by meaning, not keywords), document deduplication (group similar reports) and intent detection (classify user queries by purpose).

Most of the models are pre-trained, meaning someone else already did the heavy lifting of training on massive datasets. You download the model and use it immediately. You can find hundreds of them on HuggingFace, a repository of open-source models where you can browse by task, language and performance. Most work out of the box for general text.
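A minimal sketch of what “download and use it immediately” can look like with the sentence-transformers library. The model name `all-MiniLM-L6-v2` is an assumption (a commonly used general-purpose model, not one named in this article), and the sample documents are invented. The library import sits inside the function so the ranking helper can be used without it.

```python
def embed_texts(texts, model_name="all-MiniLM-L6-v2"):
    """Encode texts into unit-length vectors.

    Requires `pip install sentence-transformers`; weights are downloaded on
    first use. The model name is an assumption, not taken from the article.
    """
    from sentence_transformers import SentenceTransformer  # lazy, heavy import
    model = SentenceTransformer(model_name)
    return model.encode(texts, normalize_embeddings=True).tolist()

def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar.

    For unit-length vectors the dot product equals cosine similarity.
    """
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Example usage (commented out because it downloads a model):
# docs = ["Phishing campaign targets bank customers",
#         "New ransomware strain encrypts network shares",
#         "Recipe for sourdough bread"]
# vectors = embed_texts(docs + ["credential theft via fake login pages"])
# print(rank_by_similarity(vectors[-1], vectors[:-1]))
```

Semantic search, deduplication and intent detection all reduce to this pattern: embed once, then compare vectors.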

For security-specific text (threat reports, vulnerability descriptions, malware analysis), fine-tuning can improve results. Fine-tuning means taking a pre-trained model and continuing its training on your domain data. The model adapts to your specific patterns and vocabulary.

Consider code similarity detection. Even using an embedding model designed for code, you’ll hit limitations with obfuscated malware. Two functions that do the same thing can look completely different after obfuscation: variable names changed, control flow restructured, strings encrypted. A pre-trained code embedding model, which will have had a very low percentage of malicious samples in its dataset, is likely to rely heavily on textual patterns, which can create limitations in the representation of obfuscated samples. Fine-tuning on malware samples teaches the model to look past surface-level differences and focus on behavioral patterns that survive obfuscation.

Image Embeddings

For images, start with CNN-based models (Convolutional Neural Networks) like ResNet. CNNs process images as grids of pixels, scanning for visual patterns (edges, shapes, textures) and progressively compressing this information into a compact vector representation based purely on visual patterns. Use cases include reverse image search, visual similarity detection and content moderation. A security researcher could use them to cluster malware samples by UI screenshots, find similar app icons or group phishing pages by visual design.

For more advanced use cases, CLIP (Contrastive Language-Image Pre-training) adds text understanding. It’s a multi-modal model that learned to connect images with their text descriptions, letting you search images using natural language queries or find visual content that matches a concept described in words. This opens up possibilities like “find all screenshots that look like banking apps” without manually defining what “banking app” means visually.

Custom Domain Data

Sometimes your data doesn’t fit public models. Stripe’s Payments Foundation Model is a perfect example. They built a transformer-based network trained on tens of billions of transactions to learn embeddings for every payment, treating transaction sequences like sentences in language. The model improved their detection rate for card-testing attacks on large users from 59% to 97% overnight.

What makes this interesting: payments are like language in some ways (structural patterns similar to syntax and semantics, temporal sequences) but extremely unlike language in others (contextual sparsity, fewer organizing principles like grammatical rules). The model distills each transaction’s key signals into a single, versatile embedding that can be reused across different tasks (e.g. detection of different fraud methods).

This “train from scratch” approach makes sense when:

  • Your data type has no public analog (proprietary logs, specialized sensors, internal business metrics)
  • Domain patterns are highly specific (financial transactions, network flows, industrial telemetry)
  • You have enough data and compute to train effectively (billions of examples, significant infrastructure)

For most security researchers, pre-trained models are the starting point. Fine-tuning is the next level when you have domain data. Training from scratch typically requires a data science team and significant infrastructure investment.

Vector Search Infrastructure

Once you have embeddings, you need to search them efficiently. Sounds simple, right? It’s not. Finding the 10 most similar items among millions of vectors requires comparing your query against every stored vector: a brute force approach that doesn’t scale. With millions of entries, you need specialized data structures and in-memory databases to achieve good performance.

The breakthrough came from graph-based algorithms, particularly HNSW (Hierarchical Navigable Small World). HNSW works by building a multi-layer graph where each node represents a vector and edges connect similar vectors. The “hierarchical” part means it has multiple layers: upper layers provide long-range connections for quick navigation to the right neighborhood, while lower layers have dense local connections for precise results. When you query, the algorithm starts at the top layer and greedily follows edges toward your target, dropping down layers as it gets closer. This makes approximate nearest neighbor search dramatically faster than brute force comparison.
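The greedy walk at the heart of this idea can be sketched on a single layer. This is a deliberate simplification, not HNSW itself: there is no hierarchy and no candidate list, and the graph is built by brute force over made-up grid points. It only shows how following edges toward the query avoids touching every vector.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_knn_graph(points, k):
    """Connect every point to its k nearest neighbors (brute force build)."""
    graph = {}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        graph[i] = others[:k]
    return graph

def greedy_search(points, graph, query, entry):
    """Walk the graph, always moving to the neighbor closest to the query.

    Stops at a local minimum, so the answer is approximate in general.
    Returns (index of the point found, number of distance computations).
    """
    current, current_dist, computed = entry, dist(points[entry], query), 1
    while True:
        best, best_dist = current, current_dist
        for neighbor in graph[current]:
            d = dist(points[neighbor], query)
            computed += 1
            if d < best_dist:
                best, best_dist = neighbor, d
        if best == current:
            return current, computed
        current, current_dist = best, best_dist

# Toy data: 100 points on a 10x10 grid
points = [(float(x), float(y)) for x in range(10) for y in range(10)]
graph = build_knn_graph(points, k=4)
found, computed = greedy_search(points, graph, query=(7.2, 6.9), entry=0)
print(points[found], computed)
```

On this toy grid the walk reaches the true nearest point while computing fewer distances than a full scan of all 100 points. The upper layers in HNSW exist to shorten that walk further by taking long jumps first.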

In practice, approximate results are perfectly acceptable for most use cases. Finding the top 10 most similar items out of millions doesn’t require mathematical perfection: if you get 9 out of 10 correct matches, or the results are in slightly different order, the system still works.
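“9 out of 10 correct” has a name: recall@k. A small sketch, assuming the exact answer comes from a brute force scan and using invented 2-D data, shows how it is computed.

```python
import math

def top_k_exact(query, vectors, k):
    """Brute force: score every stored vector and keep the k closest."""
    order = sorted(range(len(vectors)),
                   key=lambda i: math.dist(query, vectors[i]))
    return order[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate search returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Toy data: 100 points on a line; the 10 nearest to the origin are ids 0..9
vectors = [(float(i), 0.0) for i in range(100)]
exact_ids = top_k_exact((0.0, 0.0), vectors, k=10)

# Hypothetical output of an approximate index: one true neighbor (id 9)
# is missing and the order differs slightly.
approx_ids = [1, 0, 2, 3, 4, 5, 6, 7, 8, 42]

print(recall_at_k(approx_ids, exact_ids))  # 0.9
```

Running the exact scan on a sample of queries and tracking this number is a common way to check that an approximate index is still good enough.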

The good news: you don’t need to implement this yourself. Multiple production-ready open-source vector databases exist that handle the complexity:

  • Redis: Exposes HNSW as a native data structure, allowing you to add vectors, remove them and query for similar items just like any other Redis data type.
  • Milvus: Purpose-built for vector search, designed to scale to billions of vectors.
  • Qdrant: Rust-based, optimized for combining similarity search with filtering.
  • ChromaDB: Python-native, excellent for prototyping and smaller projects.

The infrastructure exists, it’s mature and it’s accessible. You can go from zero to a working vector search system in an afternoon. The hard part isn’t the tooling, it’s choosing the right embeddings for your use case and understanding what patterns matter in your data.


Why Agents Need Embeddings

Embeddings and agents solve different problems. Understanding when each is appropriate comes down to constraints and requirements.

The Context Window Problem

Context windows are finite. Even the most advanced language models have hard limits, typically 256K to 1M tokens. In some cases you can’t load an entire codebase, document repository or dataset into context every time you need information.

Embeddings solve this by enabling precise retrieval. An AI agent analyzing code can’t load your entire repository into context, but it can query an embedding-based index to retrieve only the 3-5 most relevant files for the current task. This prevents both context window saturation (trying to fit too much) and context pollution (irrelevant information degrading performance).

This is why modern agent systems are built on embedding infrastructure:

  • RAG (Retrieval-Augmented Generation): Embeddings retrieve relevant documents, agents reason over them.
  • Long-term memory: Embeddings store and retrieve past interactions efficiently.
  • Context optimization: Embeddings decide what goes into the limited context window.
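The “decide what goes into the context window” step can be sketched as a small selection function. The scores, token counts and file names below are hypothetical; in a real system the similarity scores would come from the embedding index and the token counts from the model’s tokenizer.

```python
def select_context(scored_chunks, token_budget, top_k=5, min_score=0.5):
    """Pick the most similar chunks that fit within the token budget.

    scored_chunks: list of (similarity, token_count, text) tuples.
    Returns (selected texts, tokens used).
    """
    selected, used = [], 0
    ranked = sorted(scored_chunks, key=lambda chunk: chunk[0], reverse=True)
    for score, tokens, text in ranked:
        if len(selected) == top_k or score < min_score:
            break  # enough chunks, or the rest are too dissimilar to help
        if used + tokens <= token_budget:
            selected.append(text)
            used += tokens
    return selected, used

# Hypothetical retrieval results for one agent task
chunks = [
    (0.91, 1200, "auth/login.py"),
    (0.88, 3000, "auth/session.py"),
    (0.35, 500, "README.md"),
    (0.86, 900, "auth/tokens.py"),
]
selected, used = select_context(chunks, token_budget=2500, top_k=3)
print(selected, used)  # ['auth/login.py', 'auth/tokens.py'] 2100
```

The budget check guards against saturation and the `min_score` cutoff guards against pollution: the low-scoring README is dropped even though it would be small.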

Why Embeddings Excel in Automated Systems

Beyond solving context limitations, embeddings have properties that make them ideal for production systems:

  • Determinism: For a fixed model version, the same input always produces the same embedding. No temperature parameter, no prompt sensitivity, no “it worked yesterday but not today”. This predictability is critical for systems running thousands of operations per day.
  • Evaluability: You can build comprehensive test suites. Measure recall, precision, track drift over time, set hard thresholds. When your fraud detection system’s recall drops from 0.95 to 0.89, you know something changed.
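A minimal sketch of such a test: threshold the similarity scores, compare against labeled ground truth, and fail loudly when a metric drops. The scores, labels and thresholds are invented for illustration.

```python
def precision_recall(similarities, labels, threshold):
    """Flag items with similarity >= threshold and compare with ground truth."""
    predicted = [s >= threshold for s in similarities]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum(l and not p for p, l in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical evaluation set: similarity to known-bad samples, plus the
# true label for each item (True = actually malicious)
similarities = [0.95, 0.90, 0.85, 0.82, 0.40, 0.30]
labels = [True, True, True, False, True, False]

precision, recall = precision_recall(similarities, labels, threshold=0.8)
print(precision, recall)  # 0.75 0.75

MIN_RECALL = 0.70  # hypothetical hard threshold for a regression test
assert recall >= MIN_RECALL, "recall regressed below the agreed threshold"
```

Because the embeddings do not change between runs, a failing assertion here points at a real change (new model, new data, drift) rather than sampling noise.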

These properties make embeddings the backbone of automated pipelines: fraud detection, content moderation, recommendation engines. The relationship between embeddings and agents is complementary. An agent analyzing malware can’t load 5,000 reference samples into context, but it can query an embedding index for the top 5 similar samples and reason about what those similarities mean.

Limitations to Understand: Embeddings can’t reason, explain decisions or handle multi-step logic. They identify similarity but can’t tell you why two things are similar. Training data biases carry over, and out-of-distribution inputs produce bad vectors without warning.

This section walks through a working PoC that demonstrates what building on embedding infrastructure actually looks like: the resources required, the challenges you’ll face and what you get in return.

Introduction

The demo is a visual similarity search tool for Android malware icons, built on 1.5K APKs obtained from MalwareBazaar. It extracts launcher icons from the APK files, encodes them with CLIP embeddings and enables search by visual similarity or natural language queries:

  • The extraction pipeline scans a folder of APKs (organized by malware family in subfolders), extracts the launcher icon from each sample and generates CLIP embeddings. These embeddings get stored in ChromaDB along with metadata: SHA-256 hash, family label, APK path and a base64-encoded thumbnail of the icon itself.
  • The search interface provides three modes. Visual clustering shows all indexed samples projected into 2D space using UMAP, color-coded by family. Text search lets you query with natural language descriptions like “payment card icon”. Image search accepts either an uploaded icon or a new APK file and returns the most visually similar samples from the index.
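The two components above can be sketched roughly as follows. This is not the demo’s actual code: it assumes CLIP is loaded through sentence-transformers under the name `clip-ViT-B-32`, uses an in-memory ChromaDB client, takes icon paths that were already extracted, and leaves out the base64 thumbnail. All function names are hypothetical.

```python
import hashlib
from pathlib import Path

def build_metadata(apk_path):
    """Metadata stored next to each embedding: hash, family label, path."""
    apk_path = Path(apk_path)
    return {
        "sha256": hashlib.sha256(apk_path.read_bytes()).hexdigest(),
        "family": apk_path.parent.name,  # subfolder name = malware family
        "apk_path": str(apk_path),
    }

def index_icons(icon_paths_by_apk, collection_name="malware_icons"):
    """Embed icons with CLIP and store them in a ChromaDB collection.

    icon_paths_by_apk: dict mapping APK path -> already extracted icon path.
    Requires `pip install sentence-transformers chromadb pillow`.
    """
    import chromadb
    from PIL import Image
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("clip-ViT-B-32")  # assumed model name
    collection = chromadb.Client().get_or_create_collection(collection_name)
    for apk_path, icon_path in icon_paths_by_apk.items():
        meta = build_metadata(apk_path)
        vector = model.encode(Image.open(icon_path)).tolist()
        collection.add(ids=[meta["sha256"]], embeddings=[vector],
                       metadatas=[meta])
    return model, collection

def search_by_text(model, collection, text, n_results=5):
    """Text and images share one CLIP space, so a text vector can query icons."""
    query = model.encode(text).tolist()
    return collection.query(query_embeddings=[query], n_results=n_results)
```

Image search follows the same shape as `search_by_text`, with the query vector coming from an encoded icon instead of a string.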

Malware Icon Cluster View

Infrastructure and Challenges

The model is CLIP ViT-B/32, a vision transformer pre-trained by OpenAI. The first time you load the model, approximately 350 MB of weights are downloaded, after which everything works locally. The model runs fine on a CPU: encoding 1,500 icons takes 10–20 seconds. A GPU would speed this up, but it is not required. The full pipeline (extracting icons from APKs, generating embeddings and indexing in ChromaDB) processes 1,500 samples in under a minute.

The hard part isn’t the ML, it’s everything else:

  • Icon extraction is messier than it looks: Android APKs contain icons at multiple resolutions. Some use adaptive icons defined in XML, which can’t be rendered without the full Android framework. The tool needs heuristics: prefer higher-density PNGs, fall back to lower resolutions if needed, skip XML-only icons entirely. This is pure data engineering.
  • UMAP projection is expensive: Reducing 512-dimensional embeddings to 2D for visualization requires running UMAP across the entire dataset. With 1,500 samples this takes a few seconds. With 50,000 it becomes noticeable. The solution: cache the projection and only recompute when the indexed count changes. This is a UX tradeoff, not an ML problem.
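Both workarounds are small pieces of plain engineering, sketched below under simplifying assumptions. The icon picker assumes the launcher icon is called `ic_launcher` and lives under `res/mipmap-*` or `res/drawable-*`; a real tool would read the icon name from the manifest. The cache takes any projection function, so UMAP itself is not needed to run the example.

```python
import re

# Android density buckets, lowest to highest resolution
DENSITY_RANK = {"ldpi": 0, "mdpi": 1, "hdpi": 2,
                "xhdpi": 3, "xxhdpi": 4, "xxxhdpi": 5}

def pick_icon(entry_names, icon_name="ic_launcher"):
    """Choose the highest-density PNG for icon_name, or None if only XML exists."""
    pattern = re.compile(
        r"^res/(?:mipmap|drawable)-(\w+?)(?:-v\d+)?/"
        + re.escape(icon_name) + r"\.png$"
    )
    best, best_rank = None, -1
    for name in entry_names:
        match = pattern.match(name)
        if match and DENSITY_RANK.get(match.group(1), -1) > best_rank:
            best, best_rank = name, DENSITY_RANK[match.group(1)]
    return best

class ProjectionCache:
    """Recompute the 2-D projection only when the indexed count changes."""

    def __init__(self, project):
        self._project = project  # any function: embeddings -> 2-D coordinates
        self._count = None
        self._coords = None

    def get(self, embeddings):
        if self._count != len(embeddings):
            self._coords = self._project(embeddings)
            self._count = len(embeddings)
        return self._coords

# Entry names as they might appear in an APK (which is a zip archive)
names = [
    "res/mipmap-mdpi/ic_launcher.png",
    "res/mipmap-xxhdpi-v4/ic_launcher.png",
    "res/mipmap-anydpi-v26/ic_launcher.xml",  # adaptive icon, skipped
    "classes.dex",
]
print(pick_icon(names))  # res/mipmap-xxhdpi-v4/ic_launcher.png

# With umap-learn installed, the projection function could be:
# project = lambda e: umap.UMAP(n_components=2).fit_transform(e)
```

Keying the cache on the count alone is the tradeoff described above: it is cheap to check, at the cost of missing changes that keep the count the same.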

Obtained Results

Visual similarity survives repackaging. The screenshot shows a settings gear icon from the Tispy malware family queried against the index. All four nearest neighbors return with distance 0.0000 (perfect matches). These are visually identical icons from different APK samples of the same malware family. Even slight pixel variations (compression artifacts, color shifts, minor modifications) change the icon’s hash completely, but the CLIP embedding captures the semantic similarity. This is where hash-based detection breaks but embedding-based similarity thrives.
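The contrast between hashing and similarity is easy to reproduce. The sketch below uses raw pixel values as a crude stand-in for an embedding (the demo uses CLIP vectors, not pixels) and made-up image data, but the effect is the same: a one-pixel change flips the hash while the vectors stay almost identical.

```python
import hashlib
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Stand-in for an icon: 64 grayscale pixel values (made-up data)
original = bytes(range(100, 164))
tweaked = bytearray(original)
tweaked[0] += 1                # change a single pixel by one level
tweaked = bytes(tweaked)

hash_a = hashlib.sha256(original).hexdigest()
hash_b = hashlib.sha256(tweaked).hexdigest()
similarity = cosine(list(original), list(tweaked))

print(hash_a == hash_b)    # False
print(similarity > 0.999)  # True
```

A cryptographic hash is designed so that any change produces an unrelated digest, which is exactly why it cannot express “almost the same”.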

Malware Icon Similar

Text search works surprisingly well. CLIP’s multi-modal training means “payment card” finds relevant samples without any fine-tuning. The model learned general visual-language associations from internet images (photos, illustrations, memes) and those associations transfer to malware icons. A credit card icon is a credit card icon whether it’s in a legitimate banking app or a trojan. This is the power of transfer learning.

Malware Icon Text Search

Conclusion

Embeddings are a foundational tool that unlocks a class of reliable, automated AI systems. They’re deterministic, evaluable and production-ready: properties that make them ideal for malware classifiers, fraud detection and any pipeline that needs to run millions of operations without human oversight. For security researchers, the path forward is clear: start with pre-trained models and open-source vector databases. You can build powerful similarity-based tools with minimal ML expertise. Fine-tuning adapts these models to your domain when accuracy matters, and the infrastructure to deploy them at scale already exists.

The best AI systems don’t choose between embeddings and agents, they use both where each excels. Embeddings handle large-scale similarity search, persistent memory and deterministic retrieval. Agents provide reasoning, explanation and decision-making on top of those results. Understanding this complementary relationship is what separates effective AI engineering from hype-driven experimentation.


References