CS 440/ECE 448

Margaret Fleck

Margaret Fleck

The word2vec (aka skip-gram) algorithm is a newer algorithm that does normalization and dimensionality reduction in one step. That is, it learns how to embed words into an n-dimensional feature space. The algorithm was introduced in a couple papers by Mikolov et al in 2013. However, it's pretty much impossible to understand the algorithm from those papers, so the following derivation is from later papers by Yoav Goldberg and Omer Levy. Two similar embedding algorithms (which we won't cover) are the CBOW algorithm from word2vec and the GloVe algorithm from Stanford.

In broad outline word2vec has three steps

- Gather pairs (w,c) where w is our focus word and c is a nearby context word,
- Randomly embed all words into \(\mathbb{R}^k\)
- Iteratively adjust the embedding so that pairs (w,c) end up with similar coordinates.

A pair (w,c) is considered "similar" if the dot product \(w\cdot c\) is large.

For example, our initial random embedding might put "elephant" near "matchbox" and far from "rhino." Our iterative adjustment would gradually move the embedding for "rhino" nearer to the embedding for "elephant."

Two obvious parameters to select. The dimension of the embedding space would depend on how detailed a representation the end user wants. The size of the context window would be tuned for good performance.

Problem: a great solution for the above optimization problem is to map all words to the same embedding vector. That defeats the purpose of having word embeddings.

Suppose we had negative examples (w,c'), in which c' is not a likely context word for w. Then we could make the optimization move w and c' away from each other. Revised problem: our data doesn't provide negative examples.

Solution: Train against random noise, i.e. randomly generate "bad" context words for each focus word.

- Called "negative sampling" (not a very evocative term)
- For a fixed focus word w, negative context words are picked with a probability based on how often words occur in the training data.
- Depends on good example pairs being relatively sparse in the space of all word pairs (probably correct)

The negative training data will be corrupted by containing some good examples, but this corruption should be a small percentage of the negative training data.

Another mathematical problem happens when we consider a word w with itself as context. The dot product \(w \cdot w\) ls large, by definition. But, for most words, w is not a common context for itself, e.g. "dog dog" is not very common compared to "dog." So setting up the optimization process in the obvious way causes the algorithm to want to move w away from itself, which it obviously cannot do.

To avoid this degeneracy, word2vec builds two embeddings of each word w, one for w seen as a focus word and one for w used as a context word. The two embeddings are closely connected, but not identical. The final representation for w will be a concatenation of the two embedding vectors.

Our classifier tries to predict yes/no from pairs (w,c), where "yes" means c is a good context word for w.

We'd like to make this into a probabilistic model. So we run dot product through a sigmoid to produce a "probability" that (w,c) is a good word-context pair. These numbers probably aren't the actual probabilities, but we're about to treat them as if they are. That is, we approximate

\( P(\text{good} | w,c) \approx \sigma(w\cdot c). \)

By the magic of exponentials (see below), this means that the probability that (w,c) is not a good word-context pair is

\( P(\text{bad} | w,c) \approx \sigma(-w\cdot c). \)

Now we switch to log probabilities, so we can use addition rather than multiplication. So we'll be looking at e.g.

\(\log( P(\text{good} | w,c)) \approx \log(\sigma(w\cdot c)) \)

Suppose that D is the positive training pairs and D' is the set of negative training pairs. So our iterative refinement algorithm adjusts the embeddings (both context and word) so as to maximize

\( \sum_{(w,c) \in D} \ \log(\sigma(w\cdot c)) + \sum_{(w,c) \in D'} \ \log(\sigma(-w\cdot c)) \)

That is, each time we read a new focus word w from the training data, we

- find its context words,
- generate some negative context words, and
- adjust the embeddings of w and the context words in the direction that increase the above sum.

Ok, so how did we get from \( P(\text{good} | w,c) = \sigma(w\cdot c) \) to \( P(\text{bad} | w,c) = \sigma(-w\cdot c) \)?

"Good" and "bad" are supposed to be opposites. So \( P(\text{bad} | w,c) = \sigma(-w\cdot c) \) should be equal to \(1 - P(\text{good}| w,c) = 1- \sigma(w\cdot c) \). I claim that \(1 - \sigma(w\cdot c) = \sigma(-w\cdot c) \).

This claim actually has nothing to do with the dot products. As a general thing \(1 - \sigma(x) = \sigma(-x) \). Here's the math.

Recall the definition of the sigmoid function: \( \sigma(x) = \frac{1}{1+e^{-x}} \).

\( \eqalign{ 1-\sigma(x) &=& 1 - \frac{1}{1+e^{-x}} = \frac{e^{-x}}{1+ e^{-x}} \ \ \text{ (add the fractions)} \\ &=& \frac{1}{1/e^{-x}+ 1} \ \ (\text{divide top and bottom by } e^{-x}) \\ &=& \frac{1}{e^x + 1} \ \ \text{(what does a negative exponent mean?)} \\ &=& \frac{1}{1 + e^x} = \sigma(-x)} \)