CS 440/ECE 448
Margaret Fleck

Vector Semantics 5

To get good results from word2vec, the basic algorithm needs to be tweaked a bit. Similar tweaks are frequently needed by other algorithms of this type.

Building the set of training examples

In the basic algorithm, we consider the input (focus) words one by one. For each focus word, we extract all words within +/- k positions as positive context words. We also randomly generate a set of negative context words. This produces a set of positive pairs (w,c) and a set of negative pairs (w,c') that are used to update the embeddings of w, c, and c'.
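The positive-pair extraction can be sketched as follows. This is a minimal illustration, not word2vec's actual implementation; `positive_pairs` is a hypothetical helper name.

```python
def positive_pairs(tokens, k):
    """Collect (focus, context) pairs: for each focus word,
    every word within +/- k positions is a positive context word."""
    pairs = []
    for i, w in enumerate(tokens):
        lo = max(0, i - k)
        hi = min(len(tokens), i + k + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((w, tokens[j]))
    return pairs

print(positive_pairs(["the", "cat", "sat"], 1))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```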

Tweak 1: Word2vec uses more negative training pairs than positive pairs, by a factor of between 2 and 20 (depending on the amount of training data available).

You might think that the positive and negative pairs should be roughly balanced. However, that apparently doesn't work. One reason may be that the positive context words are definite indications of similarity, whereas the negative words are random choices that may be more neutral than actively negative.
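A sketch of drawing the extra negatives, assuming a hypothetical helper `negative_pairs` and a fixed ratio per positive pair. For simplicity this draws context words uniformly from the vocabulary; as described below, word2vec actually samples from smoothed unigram counts.

```python
import random

def negative_pairs(focus, vocab, ratio, rng=random.Random(0)):
    """For one focus word, draw `ratio` random negative context
    words (uniformly here; word2vec uses smoothed unigram counts)."""
    return [(focus, rng.choice(vocab)) for _ in range(ratio)]

vocab = ["apple", "run", "blue", "seven", "cloud"]
negs = negative_pairs("cat", vocab, ratio=5)
print(len(negs))  # 5 negatives for this one positive pair
```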

Tweak 2: Positive training examples are weighted by 1/m, where m is the distance between the focus and context word. That is, adjacent context words count for more than words with a bit of separation.

The closer two words are, the more likely it is that their relationship is strong. This is a common heuristic in similar algorithms.
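The 1/m weighting is a one-liner; here is a small sketch (the function name `pair_weight` is made up for illustration):

```python
def pair_weight(focus_pos, context_pos):
    """Weight a positive (focus, context) pair by 1/m, where m is
    the distance (in positions) between the two words."""
    m = abs(focus_pos - context_pos)
    return 1.0 / m

print(pair_weight(10, 11))  # adjacent words: weight 1.0
print(pair_weight(10, 14))  # four positions apart: weight 0.25
```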

Smoothing negative context counts

For a fixed focus word w, negative context words are picked with a probability based on how often words occur in the training data. However, if we compute \(P(c) = \text{count}(c)/N\) (where \(N\) is the total number of words in the data), rare words aren't picked often enough as context words. So instead we replace each raw count count(c) with \( (\text{count}(c))^\alpha\). The probabilities used for selecting negative training examples are computed from these smoothed counts.

\(\alpha\) is usually set to 0.75. But to see how this brings up the probabilities of rare words compared to the common ones, it's a bit easier if you look at \( \alpha = 0.5 \), i.e. we're computing the square root of the input. In the table below, you can see that large probabilities stay large, but very small ones are increased by quite a lot. After this transformation, you need to normalize the numbers so that the probabilities add up to one again.

x        \(x^{0.75}\)   \(\sqrt{x}\)
0.99     0.992          0.995
0.9      0.924          0.949
0.1      0.178          0.316
0.01     0.032          0.1
0.0001   0.001          0.01
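The smoothing-and-renormalizing step can be sketched like this (a minimal version with a hypothetical `smoothed_probs` helper; the word counts are made-up example data):

```python
def smoothed_probs(counts, alpha=0.75):
    """Raise each raw count to the power alpha, then renormalize
    so the probabilities sum to one again."""
    smoothed = {w: c ** alpha for w, c in counts.items()}
    total = sum(smoothed.values())
    return {w: s / total for w, s in smoothed.items()}

probs = smoothed_probs({"the": 9000, "cat": 900, "yttrium": 9})
# the rare word "yttrium" gets a larger share than its
# raw-count probability 9/9909
```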

This trick can also be used on PMI values (e.g. if using the methods from the previous lecture).

Deletion, subsampling

Ah, but apparently the designers were still unhappy with the treatment of very common and very rare words. So, when word2vec first reads the input training data, it modifies it roughly as follows:

- Very rare words are deleted.
- Instances of very common words are randomly deleted ("subsampled"), with a deletion probability that grows with the word's frequency.

This improves the balance between rare and common words. Also, deleting a word brings its neighbors closer together, which improves the effectiveness of our context windows.
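The frequent-word subsampling rule can be sketched as below, using the discard probability \(1 - \sqrt{t/f(w)}\) from Mikolov et al.'s paper, where \(f(w)\) is the word's relative frequency and \(t\) is a small threshold (commonly around \(10^{-5}\)); the helper name `keep_prob` is made up for illustration.

```python
import math

def keep_prob(freq, t=1e-5):
    """Probability of keeping one occurrence of a word whose relative
    corpus frequency is `freq`: sqrt(t/freq), capped at 1.
    Equivalently, discard with probability 1 - sqrt(t/freq)."""
    return min(1.0, math.sqrt(t / freq))

print(keep_prob(1e-6))  # rare word: always kept (1.0)
print(keep_prob(0.05))  # very common word: kept only ~1.4% of the time
```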


The 2014 version of word2vec used about 1 billion words to train embeddings for the basic task.

For the word analogy tasks, they used an embedding with 1000 dimensions and about 33 billion words of training data. Performance on word analogies is about 66%.

By comparison: children hear about 2-10 million words per year. Assuming the high end of that range of estimates, they've heard about 170 million words by the time they take the SAT. So the algorithm is performing well, but still seems to be underperforming given the amount of data it's consuming.

A more recent embedding method, BERT-large, is trained using a 24-layer network with 340M parameters. This somewhat improves performance, but the training apparently can't be reproduced on a standard GPU. Again, a direction for future research is to figure out why good performance seems to require so much training data and compute power.

Some follow-on papers

Original Mikolov et al. papers:

Goldberg and Levy papers (easier and more explicit)