## Skip-gram

One of the most successful techniques developed for creating word embeddings is the *Skip-gram* model (SG) [2]. Given a sentence of text, the basic idea behind SG is to use each word as input to predict the occurrence of the words around it. In doing so, the coefficients (or weights) associated with each word become its vector representation. The number of words to predict before and after the input word is controlled by a window size parameter.
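To make the windowing concrete, here is a minimal sketch (not from [2]; the function name `skipgram_pairs` is a placeholder) of how a sentence is turned into (input word, context word) training pairs for a given window size:

```python
def skipgram_pairs(sentence, window=2):
    """Generate (input, context) training pairs for the skip-gram model.

    For each position s, every word within `window` positions before or
    after it (clipped at the sentence boundaries) becomes a target.
    """
    pairs = []
    for s, word in enumerate(sentence):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= s + j < len(sentence):
                pairs.append((word, sentence[s + j]))
    return pairs

sentence = "the quick brown fox jumps".split()
print(skipgram_pairs(sentence, window=1))
# e.g. [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ...]
```

Note that a word near the start or end of the sentence simply has fewer targets, so interior words contribute more pairs than boundary words.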

In more detail, let \(S\) represent a sentence consisting of a sequence of words \(x_1, …, x_{|S|}\). The optimisation criterion for SG is given by \[\max\bigg\{\frac{1}{|S|}\sum^{|S|}_{s=1}\sum_{-l\leq j \leq l,\, j \neq 0}\log\big(p(x_{s+j}|x_s)\big)\bigg\},\] where \(l\) is the window size [2]. Therefore, given an input word \(x_s\), the objective is to maximise the log probability of all the words in the window around it, \(x_{s+j},\ -l \leq j \leq l,\ j \neq 0\), for every \(x_s \in S\). Choosing an embedding dimension of (say) three corresponds to three derived features per word, such that the word \(x_1\) is represented by the vector \(x^*_1 = (w_{11}, w_{12}, w_{13})\), *i.e.* the weights selected by the one-hot encoding of \(x_1\). The derived features \(x^*_1, …, x^*_{|S|}\) can then in turn be used in a simple linear model to obtain scores, from which the probabilities are estimated using the softmax function, *i.e.* \[p(x_{s+j}|x_s) = \frac{\exp\big(x^{(O)}_{s+j} \cdot x^{(I)}_{s}\big)}{\sum^V_{v=1}\exp\big(x^{(O)}_v \cdot x^{(I)}_{s}\big)}.\] Here \(a \cdot b\) represents the dot product between the vectors \(a\) and \(b\), and \(V\) is the total vocabulary size of the corpus. The superscript \(I\) refers to the input (embedding) vector of \(x\), whereas the superscript \(O\) refers to the output vector used to score \(x\) as a context word. The optimal weights \(w_{vp},\ v=1, …, V,\ p=1,2,3\) are learned using *back-propagation* and represent the components of each embedding vector.
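The softmax step above can be sketched in a few lines of NumPy. This is an illustration only, with a toy vocabulary and random weights; the names `W_in` and `W_out` (the input and output weight matrices) are assumptions, not notation from [2]:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                       # vocabulary size, embedding dimension
W_in = rng.normal(size=(V, d))    # input embeddings (one row per word)
W_out = rng.normal(size=(V, d))   # output vectors (one row per word)

def context_probs(input_idx):
    """p(x_v | x_input) for every word v in the vocabulary.

    Dot the input word's embedding with every output vector,
    then normalise the exponentiated scores (softmax).
    """
    scores = W_out @ W_in[input_idx]
    scores -= scores.max()        # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()

p = context_probs(2)
print(p, p.sum())  # a probability distribution over the vocabulary
```

Note that the denominator sums over the entire vocabulary, which is exactly why the softmax becomes expensive for large corpora and motivates approximations such as negative sampling.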

In fact, there are two equivalent perspectives on the formulation of the SG model. On the one hand, SG can be thought of as a three-layer neural network consisting of an input layer, a single (linear) hidden layer, and an output layer with softmax activation. On the other hand, SG can simply be seen as a multinomial logistic regression model with a linear basis expansion of the inputs, where the basis coefficients are learned. For the remainder of this post, embedding models will be viewed as neural network models.