Vectors: The Unsung Heroes Powering Machine Learning

Have you ever wondered how Netflix recommends your next binge-worthy show, how Google understands your search queries, or how your phone recognizes faces in photos? The magic behind these seemingly complex feats of artificial intelligence often boils down to a surprisingly simple yet incredibly powerful mathematical concept: vectors.

In the world of Machine Learning (ML), vectors are the fundamental language. They allow computers to process, understand, and learn from data that is inherently non-numerical, like text, images, sounds, and even abstract concepts.

Vector 101: What Exactly is a Vector?

At its core, a vector is simply an ordered list of numbers.

You can visualize a vector as:

A point in space: Each number in the list corresponds to a coordinate along a specific dimension. For example, `[3, 4]` represents a point that is 3 units along the x-axis and 4 units along the y-axis in a 2-dimensional plane.

An arrow from the origin: This arrow points from the origin `(0,0)` to the point defined by the vector’s elements.

The number of elements in a vector determines its dimensionality. A vector with 5 numbers is a 5-dimensional vector.

Example:

Consider a simple 2-dimensional vector: `V = [3, 4]`

This vector can represent many things:

A location: 3 steps East, 4 steps North.

A feature set: If we’re describing a fruit, `3` could be its sweetness score and `4` its crunchiness score.

Every vector is uniquely defined by two fundamental properties:

1. Magnitude (Length or Size):

This is a single, non-negative number that tells you the vector’s “length” or “strength,” regardless of its orientation. It’s calculated using the Pythagorean theorem.

For `V = [3, 4]`, the magnitude (denoted as `||V||`) is:

`||V|| = sqrt(3^2 + 4^2) = sqrt(9 + 16) = sqrt(25) = 5`

So, the vector `[3, 4]` has a length of 5 units.

2. Direction (Orientation):

This specifies “which way” the vector is pointing in space. It’s often described by the angle it makes with a reference axis (usually the positive x-axis).

For `V = [3, 4]`, the direction (angle `θ`) can be found using trigonometry:

`tan(θ) = opposite / adjacent = 4 / 3`

`θ = arctan(4/3) ≈ 53.13 degrees`

So, the vector `[3, 4]` points at approximately 53.13 degrees counter-clockwise from the positive x-axis.

Together, magnitude and direction completely define a vector.
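
To make this concrete, here is a minimal Python sketch (using NumPy, which is assumed here purely as a convenient choice) that computes the magnitude and direction of `[3, 4]`:

```python
import numpy as np

v = np.array([3, 4])

# Magnitude: the Euclidean length, sqrt(3^2 + 4^2) = 5.
magnitude = np.linalg.norm(v)

# Direction: the angle with the positive x-axis, converted to degrees.
angle = np.degrees(np.arctan2(v[1], v[0]))

print(magnitude)  # 5.0
print(angle)      # ~53.13
```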

Vectorization

Machine learning models are built on mathematics; they can only process numbers. This means that any real-world data – be it text, images, audio, or even complex user behavior – must first be converted into a numerical format. This crucial process is called vectorization or embedding.

The goal of vectorization is to represent data in a way that captures its inherent meaning and relationships, allowing ML algorithms to find patterns and make predictions.

How Text is Turned into Vectors (Text Embeddings)

Representing words, phrases, or entire documents as numerical vectors is a cornerstone of Natural Language Processing (NLP).

1. One-Hot Encoding (Basic, but Limited):

Concept: Each unique word in a vocabulary is assigned a unique position in a vector. A word’s vector will have a `1` at its assigned position and `0`s everywhere else.

Example: If our vocabulary is `[“cat”, “dog”, “run”]`:

“cat” -> `[1, 0, 0]`

“dog” -> `[0, 1, 0]`

Limitations:

No Semantic Relationship: “cat” and “dog” end up just as “distant” from each other as “cat” and “run,” which fails to reflect their semantic similarity.

High Dimensionality: For large vocabularies (millions of words), vectors become extremely long and sparse, leading to computational inefficiency.
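
Here is a minimal sketch of one-hot encoding in plain Python, using the three-word vocabulary from the example above (the `one_hot` helper is illustrative, not a standard API):

```python
vocabulary = ["cat", "dog", "run"]

def one_hot(word, vocab):
    # Start with a vector of zeros, one slot per vocabulary word,
    # then set the slot for this word to 1.
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

print(one_hot("cat", vocabulary))  # [1, 0, 0]
print(one_hot("dog", vocabulary))  # [0, 1, 0]
```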

2. Word Embeddings (e.g., Word2Vec, GloVe):

Concept: These models learn dense, lower-dimensional vectors for each word by analyzing how words appear in context within massive text corpora. The idea is that words appearing in similar contexts tend to have similar meanings.

How they work: Algorithms like Word2Vec predict a word based on its neighbors, or predict neighbors based on a word. Through this process, each word learns a unique vector.

Example: After training, the vector for “king” might be `[0.2, 0.5, -0.1, …]` and “queen” might be `[0.25, 0.48, -0.12, …]`. Notice their numerical closeness. A famous demonstration is that `vector("king") - vector("man") + vector("woman")` often results in a vector very close to `vector("queen")`, showcasing the ability of these embeddings to capture semantic relationships.

Benefit: Captures semantic relationships and significantly reduces dimensionality compared to one-hot encoding.
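
If you want to try the analogy yourself, one possible sketch uses the `gensim` library with a small set of pretrained GloVe vectors; the specific model name and the similarity score shown are illustrative assumptions, not results from this post:

```python
import gensim.downloader as api

# Download a small set of pretrained GloVe vectors (50 dimensions per word).
vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman ~= queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g. [('queen', 0.85...)]; the exact score depends on the pretrained model
```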

3. Contextual Embeddings (e.g., BERT, GPT, Large Language Models – LLMs):

Concept: These are the cutting-edge. Unlike static word embeddings where “bank” always has the same vector, contextual embeddings generate a vector for a word based on its *specific context* in a sentence.

How they work: Large neural networks (often Transformer-based) are pre-trained on vast amounts of text. When you input a sentence, they output a sequence of vectors, one for each word (or sub-word unit), reflecting its meaning in that specific context.

Example: The vector for “bank” in “river bank” would be different from “bank” in “money bank.”

Benefit: Captures nuanced meaning, handles polysemy (words with multiple meanings), and is highly effective for complex NLP tasks like sentiment analysis, translation, and question answering.
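
A rough sketch of how this looks in practice, assuming the Hugging Face `transformers` library and the `bert-base-uncased` model (neither is prescribed above; they are just common choices):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Run the sentence through BERT and pull out the contextual
    # vector for the token "bank" (+1 skips the [CLS] token).
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    index = tokenizer.tokenize(sentence).index("bank") + 1
    return outputs.last_hidden_state[0, index]

river = bank_vector("the boat drifted toward the river bank")
money = bank_vector("she deposited the money at the bank")

# The two "bank" vectors differ, so their cosine similarity is well below 1.
print(torch.cosine_similarity(river, money, dim=0))
```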

The Density Spectrum

The “density” of a vector refers to the proportion of its non-zero elements. This distinction is crucial for understanding storage, computation, and how information is encoded.

Sparse Vectors

Definition: A vector where most of its elements are zero. Only a small fraction of the elements have non-zero values.

Characteristics:

High Dimensionality: Often have a very large number of dimensions.

Memory Efficiency (with specialized storage): Instead of storing all the zeros, sparse vectors are typically stored by only recording the non-zero values and their corresponding indices. This saves significant memory.

Origin: Commonly arise from categorical data or count-based representations.

Examples:

One-Hot Encoding: If you have a vocabulary of 10,000 words, a one-hot vector for a single word will have 9,999 zeros and only one `1`. This is extremely sparse.

Bag-of-Words (BoW): Representing a document as a vector where each dimension corresponds to a word in a large vocabulary, and the value is the count of that word in the document. For any given document, only a small subset of the total vocabulary will appear, resulting in a very sparse vector.

User-Item Interaction Matrix: In recommendation systems, a matrix showing which users interacted with which items is often very sparse, as most users only interact with a small fraction of available items.
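
As a small illustration of how sparse Bag-of-Words vectors arise in practice, here is a sketch using scikit-learn’s `CountVectorizer` (the two toy documents are made up for this example):

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the cat chased the dog",
    "the dog chased the ball",
]

# Bag-of-Words: one dimension per vocabulary word, values are word counts.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(documents)  # returned as a SciPy sparse matrix

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.nnz)        # only the non-zero entries are actually stored
print(bow.toarray())  # the full dense view, practical only for tiny examples
```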

Dense Vectors

Definition: A vector where most or all of its elements are non-zero. Every dimension typically carries some meaningful information.

Characteristics:

Lower Dimensionality (relatively): Often have a fixed, relatively smaller number of dimensions compared to sparse vectors representing the same data.

All Values Stored: All values are explicitly stored, as they are all considered significant.

Origin: Typically the result of learning processes, especially from neural networks (e.g., word embeddings, image embeddings).

Examples:

Word Embeddings (Word2Vec, GloVe, BERT): These are dense vectors where each of the hundreds of dimensions contributes to the word’s semantic meaning.
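
Continuing the earlier `gensim` sketch (again an illustrative choice, assuming the `glove-wiki-gigaword-50` vectors), you can see that a learned word embedding has meaningful values in essentially every dimension:

```python
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")
king = vectors["king"]         # a 50-dimensional dense vector

print(king.shape)              # (50,)
print(np.count_nonzero(king))  # typically 50: every dimension carries information
```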

Vector Operations: The Language of ML Calculations

Vectors are not just static data containers; we can perform mathematical operations on them. These operations are the backbone of how ML algorithms process data, learn patterns, and make decisions.

Let’s assume we have two vectors, `A = [a1, a2, a3]` and `B = [b1, b2, b3]`.

1. Vector Addition

What it is: Adding two vectors means adding their corresponding components.

Formula: `A + B = [a1+b1, a2+b2, a3+b3]`

Simple Example: If `A = [1, 2]` and `B = [3, 4]`, then `A + B = [1+3, 2+4] = [4, 6]`

Why it’s useful in ML:

Combining Information: If you have different feature vectors for an entity (e.g., a user’s movie preferences and their music preferences), adding them might create a combined preference vector.

Representing Cumulative Effects: In simulations or sequential models, adding a “change” vector to a current state vector yields a new state.

Neural Networks: Vector addition is a fundamental operation in neural networks, where bias vectors are added to the weighted sums of inputs.
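
A minimal NumPy sketch of vector addition, including the bias-addition step mentioned above (the numbers are arbitrary):

```python
import numpy as np

a = np.array([1, 2])
b = np.array([3, 4])

# Element-wise addition of corresponding components.
print(a + b)  # [4 6]

# In a neural-network layer, a bias vector is added to the weighted sum of the inputs.
weighted_sum = np.array([0.7, -1.2, 0.3])
bias = np.array([0.1, 0.5, -0.3])
print(weighted_sum + bias)  # approximately [0.8, -0.7, 0.0]
```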

2. Vector Subtraction

What it is: Subtracting one vector from another means subtracting their corresponding components.

Formula: `A - B = [a1-b1, a2-b2, a3-b3]`

Simple Example: If `A = [5, 7]` and `B = [2, 3]`, then `A - B = [5-2, 7-3] = [3, 4]`

Why it’s useful in ML:

Quantifying Differences/Changes: Measuring the difference between two states or entities. For instance, the difference between a user’s sentiment before and after an event.

Error Calculation: In optimization algorithms (like gradient descent), the “error” is often calculated as the difference between a predicted output vector and the actual target vector.

Semantic Relationships: As seen with word embeddings, `vector("king") - vector("man") + vector("woman")` can reveal semantic analogies.
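
A quick NumPy sketch of subtraction as error calculation (the predicted and actual values are made up):

```python
import numpy as np

predicted = np.array([0.9, 0.1, 0.4])
actual = np.array([1.0, 0.0, 0.5])

# The error vector: the component-wise difference between prediction and target.
error = predicted - actual
print(error)  # approximately [-0.1  0.1 -0.1]
```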

3. Scalar Multiplication

What it is: Multiplying a vector by a single number (a “scalar”) means multiplying each component of the vector by that scalar. This scales the vector’s magnitude without changing its direction.

Formula: `c * A = [c*a1, c*a2, c*a3]` (where `c` is the scalar)

Simple Example: If `A = [2, 3]` and `c = 5`, then `5 * A = [5*2, 5*3] = [10, 15]`

Why it’s useful in ML:

Scaling Features: Normalizing or scaling feature values to bring them into a consistent range.

Adjusting Importance: In some models, you might scale a feature vector to increase or decrease its overall influence.

Learning Rates: In optimization algorithms, the learning rate (a scalar) is multiplied by the gradient vector to determine the step size for updating model parameters.
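
A short NumPy sketch of scalar multiplication as a gradient-descent update (the weights, gradient, and learning rate are arbitrary):

```python
import numpy as np

weights = np.array([0.5, -0.2, 1.0])
gradient = np.array([0.1, 0.4, -0.3])
learning_rate = 0.01  # a scalar

# Scalar multiplication scales every component of the gradient;
# subtracting the result nudges the weights a small step "downhill".
weights = weights - learning_rate * gradient
print(weights)  # approximately [0.499 -0.204 1.003]
```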

4. Dot Product (Scalar Product)

What it is: The dot product of two vectors is a single scalar number. It’s calculated by multiplying corresponding components and then summing those products.

Formula: `A · B = (a1*b1) + (a2*b2) + (a3*b3)`

The dot product captures both how strongly two vectors point in the same direction and how large their magnitudes are.

Simple Example: If `A = [1, 2, 3]` and `B = [4, 5, 6]`

`A · B = (1*4) + (2*5) + (3*6) = 4 + 10 + 18 = 32`

Why it’s useful in ML:

Similarity/Relevance (with magnitude): A large positive dot product often indicates that two vectors are pointing in a similar direction and have significant magnitudes.

Neural Networks: The dot product is the core operation in the “weighted sum” part of a neuron. Input feature vectors are multiplied by weight vectors (a dot product) to determine the neuron’s activation. This is also how matrix multiplication, central to neural networks, is defined.

Recommendation Systems (when activity matters): If user and item vectors are based on explicit ratings or interaction counts, the dot product can reflect both preference alignment and the user’s overall activity or the item’s popularity.
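
A small NumPy sketch of the dot product, including the neuron-style weighted sum described above (all values are arbitrary):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Multiply corresponding components and sum them: (1*4) + (2*5) + (3*6) = 32.
print(np.dot(a, b))  # 32

# The same operation is the "weighted sum" inside a neuron.
inputs = np.array([0.5, 0.25, 1.0])
weights = np.array([0.5, -1.0, 0.75])
bias = 0.125
print(np.dot(inputs, weights) + bias)  # 0.875, the value fed to the activation function
```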

What Next

Vectors are the silent workhorses of machine learning. From the simple concept of magnitude and direction to the nuanced differences between sparse and dense representations, and the powerful operations that enable similarity calculations, a solid grasp of vectors is truly foundational for anyone looking to understand or build intelligent systems. In the next blog, we will see how vector similarities can be calculated and explore their use cases.
