This tool works better on a big screen! Try it on your computer or iPad in landscape mode.
Word2Vec Explorer
How LLMs think - The Magic of Embeddings
Word2Vec embeddings adapted from the Gensim Python library and Mikolov, Chen, Corrado, and Dean (2013)
Embeddings are numerical representations of words. Imagine giving each word in English two scores - one to represent how "alive" the word is, and one to represent how loud it is - and arranging these words on a piece of graph paper with the x-axis representing the first score and the y-axis the second. The resulting position of every word on the graph would tell you something about the nature of the word.
Embeddings do this with many scores - not just two. This allows them to capture even more of the meaning in each word. Of course, with so many dimensions we can no longer position those words on a simple x/y axis - but the scores can still be used in other ways. This page will allow you to explore one such embedding, called Word2Vec. Click through the tabs above in sequence to explore it.
Word2Vec is an embedding with 300 dimensions - in other words, each word is represented by 300 scores. These scores are created using enormous amounts of text - the training data. Words that often occur in the same context or near each other in the training data are given scores that ensure they are close to each other. Words that rarely occur in the same context are placed further apart.
The table below lists the resulting 300 scores for each word, and includes 100,000 of the most common words in the English language. You can search for any word to find its embedding.
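If you'd like to poke at these scores outside this page, here is a minimal sketch in Python using the Gensim library mentioned above. It assumes the pretrained Google News vectors ("word2vec-google-news-300" in Gensim's downloader), which are the same kind of 300-dimensional Word2Vec vectors shown here; this is an illustration, not the code behind this page.

import gensim.downloader as api

# Load the pretrained 300-dimensional Word2Vec vectors (a large download on first use).
model = api.load("word2vec-google-news-300")

vector = model["king"]   # a NumPy array of 300 numbers - the word's embedding
print(vector.shape)      # (300,)
print(vector[:5])        # the first five of the word's 300 scores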
These numbers don't "mean" anything by themselves - think of them as representing the "essence" or "vibe" or "meaning" of a word. In the next tab, "Word Similarity", we will see how to probe these meanings.
Embedding Matrix
0 Words
300 Dimensions
Press enter to search
Word
When words are described by two scores, we can place them on an x/y axis to visualize them. With 300 dimensions, that's more difficult. But we can still calculate distances between words using a technique called cosine similarity - it's a little hard to conceptualize in so many dimensions, but it's very similar to what you would do in two or three dimensions.
In the tool below, you can type any word - the embedding for the word will be looked up automatically, the distance between your word and every other word in the data will be calculated, and the closest words will be displayed.
Start with the examples below, and then try it with a few words of your own - you'll see that using only 300 numbers, the embedding seems to have a sense of what the words mean, at least in terms of how they relate to each other. In the next tab, "Word Arithmetic", we'll see how the embedding is also aware of concepts, not just individual words.
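For the curious, here is a rough sketch of that calculation in Python. It assumes the model variable loaded in the earlier sketch, and shows both a hand-rolled cosine similarity and Gensim's built-in equivalent.

import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vectors' lengths:
    # close to 1 means a similar direction (similar words), close to 0 means unrelated.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(model["cat"], model["dog"]))     # relatively high
print(cosine_similarity(model["cat"], model["galaxy"]))  # noticeably lower
print(model.similarity("cat", "dog"))                    # Gensim's built-in version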
Look Up Word
Try an example:
Word
Most Similar Words
30 closest words based on distance in 300 dimensions (cosine similarity)
Enter a word above to find similar words
One of the benefits of representing words numerically is that they can be manipulated like numbers. For example, we can add words, subtract them, and so on. This tool will allow you to experiment with these operations.
Try the examples below, starting with king - man + woman. As humans, we understand these words and so we realize the answer should be queen. With embeddings, a computer can also calculate this sum by simply summing the individual numbers in the embeddings. The tool below does this for you, and the resulting 300 numbers (in the "sum" row) act as the "essence" of the result. We can then find the closest words to this "essence".
The examples demonstrate how well Word2Vec captures concepts like center, cuisine, and basic grammar. Once you've looked at those examples, click "Clear" and try a few sums of your own.
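As a rough sketch of what happens behind the scenes (again assuming the model loaded in the first sketch - this is an illustration, not this page's actual code):

# Add and subtract the 300 scores component by component to get the "sum" row.
target = model["king"] - model["man"] + model["woman"]

# Find the vocabulary words closest to that sum (by cosine similarity).
for word, score in model.similar_by_vector(target, topn=5):
    print(word, round(score, 3))

# Gensim's most_similar does similar arithmetic (on normalized vectors)
# and also filters out the input words, so "queen" tends to surface near the top:
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))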
Note that you will sometimes need to look lower down the list to find the word you're looking for, and that the model will sometimes just fail to capture the relationship you're looking for; Word2Vec is a 2013 model after all, and therefore quite primitive!
In the next and final tab, we'll consider a different way of analyzing word relationships in Word2Vec.
Input Words
Try an example:
Word
Nearest Words to the resulting sum
30 closest words based on distance in 300 dimensions (cosine similarity)
Using a method called multidimensional scaling, we can take words in our 300-dimensional space and visualize them on a two-dimensional graph. This is very similar to the way our three-dimensional globe is visualized on the pages of a two-dimensional atlas: the results aren't perfect, but they give us a pretty good idea of what's happening in those embeddings.
Start by trying the examples below by clicking on a button, and then scrolling down to see what those words look like in embedding space - the richness of the insights contained in this embedding space should quickly become apparent! Click "Clear" and try it yourself with words of your own - remember Word2Vec is a small and simple embedding, so you shouldn't expect too much of it! Nevertheless, it is fascinating how much this simple, 300-dimensional embedding can capture.
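Here is a sketch of that kind of projection in Python, assuming the model from the earlier sketches and using scikit-learn's MDS implementation - a simplification of what this page does, not its actual code.

import numpy as np
from sklearn.manifold import MDS

words = ["king", "queen", "man", "woman", "apple", "banana", "car", "truck"]
vectors = np.array([model[w] for w in words])

# Squash the 300-dimensional vectors down to 2 dimensions while trying
# to preserve the distances between the words as well as possible.
coords = MDS(n_components=2, random_state=0).fit_transform(vectors)

for word, (x, y) in zip(words, coords):
    print(f"{word:>8}: ({x:6.2f}, {y:6.2f})")
# Related words (king/queen, apple/banana) should land near each other on the 2-D plot.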
Hopefully you now have a feel for how large language models seem to understand language so well. Embeddings, however, are only half the story - the other half is the transformer architecture, which makes it possible to embed entire sentences instead of individual words. If you want to learn more about this, I recommend this excellent explainer.