Getting to Know Embeddings

Overview

It is simply dimensionality reduction!

We train a simple neural network with a single hidden layer; what we actually want are the hidden-layer weights, and these weights are the word vectors.

Tricks:

  • Subsampling: reduce the number of words used for training.
  • Negative sampling: let each training sample update only a small fraction of the model weights, which speeds up training.

Introduction

Word embedding is the umbrella term in natural language processing (NLP) for a family of language-modeling and representation-learning techniques. Conceptually, it means embedding a high-dimensional space, whose dimensionality equals the vocabulary size, into a continuous vector space of much lower dimension, so that each word or phrase is mapped to a vector of real numbers.

One of the benefits of using dense and low-dimensional vectors is computational: the majority of neural network toolkits do not play well with very high-dimensional, sparse vectors. … The main benefit of the dense representations is generalization power: if we believe some features may provide similar clues, it is worthwhile to provide a representation that is able to capture these similarities.

Algorithms

1. Embedding Layer

It requires that the documents are cleaned and that each word is one-hot encoded. The size of the vector space is specified as part of the model, such as 50, 100, or 300 dimensions.

This approach of learning an embedding layer requires a lot of training data and can be slow, but will learn an embedding both targeted to the specific text data and the NLP task.

Every word needs its own one-hot vector, which is computationally expensive, and the relationships between words are not represented.

As the following picture shows, the word “girl” contributes nothing to the training of the other words in the first layer.

2. Word2Vec

paper: Linguistic Regularities in Continuous Space Word Representations, 2013.

It is good at capturing syntactic and semantic regularities in language.

Two different learning models were introduced that can be used as part of the Word2Vec approach to learn the word embedding.

  • Continuous Bag-of-Words, or CBOW, model: learns the embedding for a word from the known surrounding context words.
  • Continuous Skip-Gram model: learns the embedding by predicting the surrounding words from the current word (a toy sketch contrasting the two follows below).
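
As a toy sketch of the difference (mine, not code from the papers): for the fragment “the quick brown fox” with center word “brown” and a window of one word on each side, the two models frame the prediction task in opposite directions.

```python
# Toy sketch: how CBOW and Skip-Gram frame the prediction task for one position.
sentence = ["the", "quick", "brown", "fox"]
center_index = 2                                     # the center word "brown"
context = [sentence[center_index - 1], sentence[center_index + 1]]
center = sentence[center_index]

# CBOW: predict the center word from its context words.
cbow_example = (context, center)                     # (["quick", "fox"], "brown")

# Skip-Gram: predict each context word from the center word.
skipgram_examples = [(center, c) for c in context]   # [("brown", "quick"), ("brown", "fox")]

print(cbow_example)
print(skipgram_examples)
```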

3. GloVe

paper: GloVe: Global Vectors for Word Representation, 2014.

It combines global statistics (e.g., as used by Latent Semantic Analysis (LSA)) with local context-window learning (as in word2vec), which makes it more effective.

GloVe stands for Global Vectors for Word Representation. It is an extension of word2vec and can learn word vectors more efficiently.

Classical vector space model representations of words were developed using matrix factorization techniques such as Latent Semantic Analysis (LSA) that do a good job of using global text statistics but are not as good as the learned methods like word2vec at capturing meaning and demonstrating it on tasks like calculating analogies (e.g. the classic king − man + woman ≈ queen example).

GloVe is an approach to marry both the global statistics of matrix factorization techniques like LSA with the local context-based learning in word2vec.

Rather than using a window to define local context, GloVe constructs an explicit word-context or word co-occurrence matrix using statistics across the whole text corpus. The result is a learning model that may result in generally better word embeddings.

How to Use Word Embeddings

1. Learn an Embedding

You may choose to learn a word embedding for your problem.

This will require a large amount of text data to ensure that useful embeddings are learned, such as millions or billions of words.

You have two main options when training your word embedding:

  1. Learn it Standalone, where a model is trained to learn the embedding, which is saved and used as a part of another model for your task later. This is a good approach if you would like to use the same embedding in multiple models.
  2. Learn Jointly, where the embedding is learned as part of a large task-specific model. This is a good approach if you only intend to use the embedding on one task.

2. Reuse an Embedding

It’s common for researchers to use pre-trained word embeddings. For example, both word2vec and GloVe word embeddings are available for free download.

These can be used on your project instead of training your own embeddings from scratch.

You can use a pre-trained embedding as-is, or continue updating it on your own data.
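
As one hedged example of reuse: the gensim library can load the published word2vec binary directly. The file name below is the commonly distributed Google News archive and is assumed to have already been downloaded; adjust the path to your setup.

```python
# Sketch of reusing a pre-trained embedding with gensim (assumes gensim is
# installed and the Google News vectors have already been downloaded).
from gensim.models import KeyedVectors

path = "GoogleNews-vectors-negative300.bin"          # hypothetical local path
vectors = KeyedVectors.load_word2vec_format(path, binary=True)

print(vectors["king"].shape)                         # a 300-dimensional vector
print(vectors.most_similar("king", topn=5))          # nearest neighbours by cosine similarity
```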

Articles

  • Word embedding on Wikipedia
  • Word2vec on Wikipedia
  • GloVe on Wikipedia
  • An overview of word embeddings and their connection to distributional semantic models, 2016.
  • Deep Learning, NLP, and Representations, 2014.

Papers

  • Distributional structure, 1956.
  • A Neural Probabilistic Language Model, 2003.
  • A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning, 2008.
  • Continuous space language models, 2007.
  • Efficient Estimation of Word Representations in Vector Space, 2013.
  • Distributed Representations of Words and Phrases and their Compositionality, 2013.
  • GloVe: Global Vectors for Word Representation, 2014.

Projects

  • word2vec on Google Code
  • GloVe: Global Vectors for Word Representation

The Word2Vec Algorithm

It is simply dimensionality reduction!

We train a simple neural network with a single hidden layer; what we actually want are the hidden-layer weights, and these weights are the word vectors.

This trick shows up in many other forms as well.

Another place you may have seen this trick is in unsupervised feature learning, where you train an auto-encoder to compress an input vector in the hidden layer, and decompress it back to the original in the output layer. After training it, you strip off the output layer (the decompression step) and just use the hidden layer–it’s a trick for learning good image features without having labeled training data.

1. Fake Task

Fake task: given a word, output the probability of every word in the vocabulary appearing near it.

We train a neural network by feeding it word pairs, i.e. combinations of words that fall inside the same context window.

We’re going to train the neural network to do the following. Given a specific word in the middle of a sentence (the input word), look at the words nearby and pick one at random. The network is going to tell us the probability for every word in our vocabulary of being the “nearby word” that we chose.

When I say “nearby”, there is actually a “window size” parameter to the algorithm. A typical window size might be 5, meaning 5 words behind and 5 words ahead (10 in total).

We’ll train the neural network to do this by feeding it word pairs found in our training documents. The below example shows some of the training samples (word pairs) we would take from the sentence “The quick brown fox jumps over the lazy dog.” I’ve used a small window size of 2 just for the example. The word highlighted in blue is the input word.
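
Below is a small sketch (my own, mirroring the description above rather than the original word2vec preprocessing) of how those (input word, nearby word) pairs would be generated with a window size of 2:

```python
# Sketch: generate skip-gram training pairs (input_word, nearby_word) with a
# window of 2 words on each side, mirroring the description above.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
for pair in skipgram_pairs(sentence)[:6]:
    print(pair)   # ('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ...
```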

2. Model Details

Suppose the vocabulary has 10,000 words and the hidden layer has 300 neurons.

First represent each word as a one-hot vector, then build a network whose input is that one-hot vector and whose output is another 10,000-dimensional vector, where each element is the probability that the corresponding word appears near the input word.

Training is done with word pairs.

(1 × 10,000) · (10,000 × 300) → (1 × 300), then (1 × 300) · (300 × 10,000) → (1 × 10,000)

Each output neuron (one per word in our vocabulary!) will produce an output between 0 and 1, and the sum of all these output values will add up to 1.

There is no activation function on the hidden layer neurons, but the output neurons use softmax. We’ll come back to this later.

When training this network on word pairs, the input is a one-hot vector representing the input word and the training output is also a one-hot vector representing the output word. But when you evaluate the trained network on an input word, the output vector will actually be a probability distribution (i.e., a bunch of floating point values, not a one-hot vector).
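
A minimal NumPy sketch of the architecture just described (my own illustration, with a tiny vocabulary standing in for the 10,000 words; not the actual word2vec code):

```python
import numpy as np

# Tiny stand-in for the 10,000-word vocabulary and 300 hidden units described above.
vocab_size, hidden_size = 8, 4

rng = np.random.default_rng(0)
W_in = rng.normal(size=(vocab_size, hidden_size))    # hidden-layer weights (the word vectors)
W_out = rng.normal(size=(hidden_size, vocab_size))   # output-layer weights

x = np.zeros(vocab_size)
x[3] = 1.0                                           # one-hot input for word index 3

h = x @ W_in                                         # (1 x V)(V x H): linear, no activation
scores = h @ W_out                                   # (1 x H)(H x V): one score per word
probs = np.exp(scores) / np.exp(scores).sum()        # softmax: values in (0, 1) summing to 1

print(probs.round(3), probs.sum())
```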

3. The Hidden Layer

The hidden layer is a 10,000 × 300 weight matrix, and each row is one of the 300-dimensional word vectors we are after.

Multiplying a word's one-hot vector by this matrix yields the compressed word vector, which is simply the corresponding row of the weight matrix.

If you look at the rows of the weight matrix, these are actually what will be our word vectors.

So the end goal of all of this is really just to learn this hidden layer weight matrix – the output layer we’ll just toss when we’re done!

Let’s get back, though, to working through the definition of this model that we’re going to train.

Now, you might be asking yourself–“That one-hot vector is almost all zeros… what’s the effect of that?” If you multiply a 1 x 10,000 one-hot vector by a 10,000 x 300 matrix, it will effectively just select the matrix row corresponding to the “1”. Here’s a small example to give you a visual.
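
To make that concrete, here is a tiny NumPy check (my own sketch): multiplying a one-hot vector by the weight matrix simply returns the matching row.

```python
import numpy as np

# A one-hot vector times the weight matrix is just a row lookup.
W = np.arange(12, dtype=float).reshape(4, 3)   # pretend 4-word vocab, 3-dim embeddings

one_hot = np.zeros(4)
one_hot[2] = 1.0

print(one_hot @ W)   # [6. 7. 8.]
print(W[2])          # identical: the embedding of word index 2
```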

4. Output Layer

The output layer is a softmax regression: each output neuron produces a probability between 0 and 1, and all the outputs sum to 1.

Below, the word vector for “ants” from the previous layer is dotted with the output-layer weights for “car”; the result is the probability that “car” appears near “ants”.

The 1 x 300 word vector for “ants” then gets fed to the output layer. The output layer is a softmax regression classifier. There’s an in-depth tutorial on Softmax Regression here, but the gist of it is that each output neuron (one per word in our vocabulary!) will produce an output between 0 and 1, and the sum of all these output values will add up to 1.

Specifically, each output neuron has a weight vector which it multiplies against the word vector from the hidden layer, then it applies the function exp(x) to the result. Finally, in order to get the outputs to sum up to 1, we divide this result by the sum of the results from all 10,000 output nodes.
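
Written out explicitly (my notation, following the description above rather than any particular source), with $v_{\text{ants}}$ the hidden-layer vector for “ants” and $u_w$ the output-layer weight vector for word $w$:

$$P(\text{car} \mid \text{ants}) = \frac{\exp(u_{\text{car}} \cdot v_{\text{ants}})}{\sum_{w=1}^{10{,}000} \exp(u_w \cdot v_{\text{ants}})}$$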

The last picture is an illustration of calculating the output of the output neuron for the word “car”.

Note that neural network does not know anything about the offset of the output word relative to the input word. It does not learn a different set of probabilities for the word before the input versus the word after. To understand the implication, let’s say that in our training corpus, every single occurrence of the word ‘York’ is preceded by the word ‘New’. That is, at least according to the training data, there is a 100% probability that ‘New’ will be in the vicinity of ‘York’. However, if we take the 10 words in the vicinity of ‘York’ and randomly pick one of them, the probability of it being ‘New’ is not 100%; you may have picked one of the other words in the vicinity.

5. Negative Sampling

  • Subsample frequent words to reduce the number of training samples.
  • Negative sampling: each training sample updates only a small fraction of the model weights.

You may have noticed something: it’s a huge network! The hidden layer and the output layer each have 10,000 × 300 = 3 million weights.

The authors of Word2Vec addressed these issues in their second paper with the following two innovations:

  1. Subsampling frequent words to decrease the number of training examples.
  2. Modifying the optimization objective with a technique they called “Negative Sampling”, which causes each training sample to update only a small percentage of the model’s weights.

Subsampling Frequent Words

‘The’ appears almost everywhere and contributes almost nothing to the understanding of other words.

We define a probability $P(w_i)$, the probability of keeping the word $w_i$ when it appears in a window, where $z(w_i)$ is the fraction of the corpus made up of that word:

$$P(w_i) = \left(\sqrt{\frac{z(w_i)}{0.001}} + 1\right) \cdot \frac{0.001}{z(w_i)}$$

When a word's frequency is high, its keep probability (sampling rate) is low; when its frequency falls below a certain threshold, its keep probability is high.

If we have a window size of 10, and we remove a specific instance of “the” from our text:

  1. As we train on the remaining words, “the” will not appear in any of their context windows.
  2. We’ll have 10 fewer training samples where “the” is the input word.

Define the sampling rate:

$$P(w_i) = \left(\sqrt{\frac{z(w_i)}{0.001}} + 1\right) \cdot \frac{0.001}{z(w_i)}$$

No single word should be a very large percentage of the corpus, so we want to look at pretty small values on the x-axis.

Here are some interesting points in this function (again this is using the default sample value of 0.001).

  • $P(w_i) = 1.0$ (100% chance of being kept) when $z(w_i) \le 0.0026$.
  • $P(w_i) = 0.5$ (50% chance of being kept) when $z(w_i) = 0.00746$.
  • $P(w_i) = 0.033$ (3.3% chance of being kept) when $z(w_i) = 1.0$, which would mean the entire corpus consisted of the single word $w_i$ and is of course absurd.
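
A quick numerical check of those points (my own sketch, using the formula above with the default sample value of 0.001):

```python
import math

# Keep probability from the subsampling formula above (default sample value 0.001).
def keep_prob(z, sample=0.001):
    return (math.sqrt(z / sample) + 1) * sample / z

for z in (0.0026, 0.00746, 1.0):
    print(z, round(keep_prob(z), 3))
# 0.0026  -> ~1.0   (values above 1 simply mean the word is always kept)
# 0.00746 -> ~0.5
# 1.0     -> ~0.033
```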

Negative sampling

With millions of parameters, every single training sample adjusts all of them. Negative sampling lets each sample adjust only a small subset of the parameters.

Previously, for a given output word, only that word was the positive sample (target 1) and every other word was a negative sample (target 0), so all of the parameters had to be updated, because every output neuron had a target.

With negative sampling, we randomly pick just a few other words as negative samples, so only the output neurons for the positive word and the chosen negative words need to be updated. The number of parameters to update shrinks dramatically.

Training a neural network means taking a training example and adjusting all of the neuron weights slightly so that it predicts that training sample more accurately. In other words, each training sample will tweak all of the weights in the neural network.

As we discussed above, the size of our word vocabulary means that our skip-gram neural network has a tremendous number of weights, all of which would be updated slightly by every one of our billions of training samples!

Negative sampling addresses this by having each training sample only modify a small percentage of the weights, rather than all of them. Here’s how it works.

When training the network on the word pair (“fox”, “quick”), recall that the “label” or “correct output” of the network is a one-hot vector. That is, we want the output neuron corresponding to “quick” to output a 1, and all of the other thousands of output neurons to output a 0.

With negative sampling, we are instead going to randomly select just a small number of “negative” words (let’s say 5) to update the weights for. (In this context, a “negative” word is one for which we want the network to output a 0 for). We will also still update the weights for our “positive” word (which is the word “quick” in our current example).

The paper says that selecting 5-20 words works well for smaller datasets, and you can get away with only 2-5 words for large datasets.

Recall that the output layer of our model has a weight matrix that’s 300 x 10,000. So we will just be updating the weights for our positive word (“quick”), plus the weights for 5 other words that we want to output 0. That’s a total of 6 output neurons, and 1,800 weight values total. That’s only 0.06% of the 3M weights in the output layer!

In the hidden layer, only the weights for the input word are updated (this is true whether you’re using Negative Sampling or not), because the input one-hot vector has only a single element equal to 1, so only that word's row receives a gradient.
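
Below is a compact NumPy sketch of a single negative-sampling update, following the description above. It is my own simplified version of the gradient step, not the original C implementation; `W_in` and `W_out` play the role of the hidden-layer and output-layer matrices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(W_in, W_out, in_idx, pos_idx, neg_indices, lr=0.025):
    """One skip-gram negative-sampling step: only the input word's row of W_in
    and the positive/negative columns of W_out are touched."""
    v = W_in[in_idx]                        # hidden-layer vector of the input word
    grad_v = np.zeros_like(v)
    for idx, label in [(pos_idx, 1.0)] + [(i, 0.0) for i in neg_indices]:
        u = W_out[:, idx]                   # output weights for this word
        g = sigmoid(np.dot(u, v)) - label   # prediction error
        grad_v += g * u
        W_out[:, idx] -= lr * g * v         # update only this output neuron
    W_in[in_idx] -= lr * grad_v             # update only the input word's row

rng = np.random.default_rng(0)
W_in = rng.normal(size=(10, 5))             # toy 10-word vocabulary, 5-dim vectors
W_out = rng.normal(size=(5, 10))
sgns_update(W_in, W_out, in_idx=3, pos_idx=7, neg_indices=[1, 4, 8])
```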

Selecting Negative Samples

Intuitively, each word should be picked with probability equal to its frequency, but the authors state that a modified formula works better. From the modified formula we can see that it raises the selection probability of infrequent words and lowers that of frequent words:

$$P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=0}^{n} f(w_j)^{3/4}}$$
The “negative samples” (that is, the 5 output words that we’ll train to output 0) are selected using a “unigram distribution”, where more frequent words are more likely to be selected as negative samples.

For instance, suppose you had your entire training corpus as a list of words, and you chose your 5 negative samples by picking randomly from the list. In this case, the probability of picking the word “couch” would be equal to the number of times “couch” appears in the corpus, divided by the total number of words in the corpus. This is expressed by the following equation:

$$P(w_i) = \frac{f(w_i)}{\sum_{j=0}^{n} f(w_j)}$$
The authors state in their paper that they tried a number of variations on this equation, and the one which performed best was to raise the word counts to the 3/4 power:
$$P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=0}^{n} f(w_j)^{3/4}}$$
If you play with some sample values, you’ll find that, compared to the simpler equation, this one has the tendency to increase the probability for less frequent words and decrease the probability for more frequent words.

The way this selection is implemented in the C code is interesting. They have a large array with 100M elements (which they refer to as the unigram table). They fill this table with the index of each word in the vocabulary multiple times, and the number of times a word’s index appears in the table is given by P(wi) * table_size. Then, to actually select a negative sample, you just generate a random integer between 0 and 100M, and use the word at that index in the table. Since the higher probability words occur more times in the table, you’re more likely to pick those.
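
Here is a hedged Python sketch of that table-based approach (with a much smaller table than the real 100M-element one, and simplified bookkeeping, but the same idea):

```python
import random

# Build a (scaled-down) unigram table: each word appears in the table roughly
# P(w_i) * table_size times, with P(w_i) proportional to count^(3/4).
def build_unigram_table(counts, table_size=1_000_000):
    weights = {w: c ** 0.75 for w, c in counts.items()}
    total = sum(weights.values())
    table = []
    for w, wt in weights.items():
        table.extend([w] * int(round(wt / total * table_size)))
    return table

counts = {"the": 50_000, "couch": 40, "fox": 300}
table = build_unigram_table(counts)

# Drawing a negative sample is just a uniform pick from the table.
negatives = [random.choice(table) for _ in range(5)]
print(negatives)
```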

6. Word Pairs and “Phrases”

For example, “New York” split into its two words means something different; treating such phrases as single tokens greatly increases the vocabulary size. The published model ended up with a vocabulary of 3 million words.

The second word2vec paper also includes one more innovation worth discussing. The authors pointed out that a word pair like “Boston Globe” (a newspaper) has a much different meaning than the individual words “Boston” and “Globe”. So it makes sense to treat “Boston Globe”, wherever it occurs in the text, as a single word with its own word vector representation.

You can see the results in their published model, which was trained on 100 billion words from a Google News dataset. The addition of phrases to the model swelled the vocabulary size to 3 million words!

If you’re interested in their resulting vocabulary, I poked around it a bit and published a post on it here. You can also just browse their vocabulary here.

Phrase detection is covered in the “Learning Phrases” section of their paper. They shared their implementation in word2phrase.c–I’ve shared a commented (but otherwise unaltered) copy of this code here.

I don’t think their phrase detection approach is a key contribution of their paper, but I’ll share a little about it anyway since it’s pretty straightforward.

Each pass of their tool only looks at combinations of 2 words, but you can run it multiple times to get longer phrases. So, the first pass will pick up the phrase “New_York”, and then running it again will pick up “New_York_City” as a combination of “New_York” and “City”.

The tool counts the number of times each combination of two words appears in the training text, and then these counts are used in an equation to determine which word combinations to turn into phrases. The equation is designed to make phrases out of words which occur together often relative to the number of individual occurrences. It also favors phrases made of infrequent words in order to avoid making phrases out of common words like “and the” or “this is”.
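
For reference, the scoring function described in the paper's “Learning Phrases” section has (up to my transcription) the form below, where $\delta$ is a discounting coefficient that prevents phrases being formed from very infrequent word pairs:

$$\text{score}(w_i, w_j) = \frac{\text{count}(w_i w_j) - \delta}{\text{count}(w_i) \times \text{count}(w_j)}$$

Word pairs whose score exceeds a chosen threshold are merged into a single token such as “New_York”.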

You can see more details about their equation in my code comments here.

One thought I had for an alternate phrase recognition strategy would be to use the titles of all Wikipedia articles as your vocabulary.
