What are Embeddings?

Recently, embedding models have been gaining considerable attention in areas such as natural language processing (NLP), recommender systems, and knowledge graph completion. This is due to their high accuracy on classification problems, their scalability to large datasets, and their ability to capture semantic relationships. In NLP, for example, these models represent words in a continuous vector space and map semantically related words to nearby points. We call these vector representations “embeddings” because they are not directly observed in the data; rather, they are learned from the training data. Word embeddings that capture semantic relationships can improve performance in many NLP applications, such as machine translation, information retrieval, and question-answering systems.

Let me explain some details about word embeddings. There are many ways to represent a word. One of the most popular is one-hot encoding, where each word has the value “1” at a unique position and the value “0” at all other positions in a vector whose length equals the size of the vocabulary. This approach is simple, but these vectors reveal no relationships between words, and they suffer from sparsity and inefficient memory usage. Suppose you have 1,000,000 words in your vocabulary, each represented as a 1,000,000-dimensional vector. You would have to deal with a huge sparse matrix of size 1,000,000 × 1,000,000 to train your models. Such a sparse representation is simply not practical for a vocabulary this large.
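
To make this concrete, here is a minimal sketch of one-hot encoding over a toy vocabulary. The vocabulary and words are illustrative only; the point is that any two distinct one-hot vectors are orthogonal, so they carry no information about how related two words are.

```python
import numpy as np

# Illustrative toy vocabulary (not from any real dataset).
vocabulary = ["hotel", "motel", "tell", "speak", "chicago", "illinois"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word: str) -> np.ndarray:
    """Return a vector with a 1 at the word's index and 0 everywhere else."""
    vec = np.zeros(len(vocabulary))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("hotel"))  # [1. 0. 0. 0. 0. 0.]
print(one_hot("motel"))  # [0. 1. 0. 0. 0. 0.]

# The dot product of any two different one-hot vectors is 0,
# so this representation encodes no similarity between words.
print(one_hot("hotel") @ one_hot("motel"))  # 0.0
```

With a real vocabulary of a million words, each of these vectors would have a million entries, almost all of them zero, which is exactly the sparsity problem described above.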

Figure 1 Example of Word Embeddings in Two Dimensions

One of the most popular alternatives is to represent a word as a dense vector with an arbitrary number of dimensions (e.g., 200). This is called a ‘vector embedding.’ In the embedding space, we aim to discover or preserve semantic relationships. For example, “hotel” and “motel” would have similar embeddings; “tell” and “speak” would be close to each other because they have similar meanings; and “Chicago” and “Illinois” could be close because Chicago is a major city in Illinois. Such embeddings in two dimensions might look like those shown in Figure 1. Word embeddings are generally learned with neural network models: an optimization algorithm (e.g., stochastic gradient descent) gradually moves similar or semantically related words close to one another in the embedding space. Once a model has learned word embeddings, they can be reused efficiently in a variety of NLP applications. In practice, instead of training word embeddings from scratch, we typically download pre-trained word embeddings and use them in our NLP applications.
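
As a rough sketch of what “reusing pre-trained embeddings” looks like in practice, the snippet below loads publicly available GloVe vectors through the gensim downloader and compares words by cosine similarity. The package and the specific model name ("glove-wiki-gigaword-50") are assumptions about the reader's environment, not part of the text above; any set of pre-trained vectors would illustrate the same idea.

```python
import gensim.downloader as api

# Download and load 50-dimensional GloVe vectors
# (fetched over the network on the first run).
vectors = api.load("glove-wiki-gigaword-50")

# Semantically related words get a high cosine similarity...
print(vectors.similarity("hotel", "motel"))

# ...while unrelated words score much lower.
print(vectors.similarity("hotel", "speak"))

# Nearest neighbours in the embedding space are typically related words.
print(vectors.most_similar("chicago", topn=5))
```

Because these vectors were trained once on a large corpus, downstream applications can simply look them up instead of learning word representations from scratch.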