Revision as of 18:59, 8 May 2017
== General Information on word embeddings ==
Word embeddings associate words with vectors in a high-dimensional space. Words that are close together in that space are more likely to occur in close proximity in a text than words that are far apart. See this article for details: [https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/]
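"Closeness" between word vectors is usually measured with cosine similarity. A minimal sketch with made-up toy vectors (real embeddings have hundreds of dimensions, and the numbers below are purely illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction (very similar), 0.0 = orthogonal (unrelated)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 4-dimensional embeddings for three words
cat = np.array([0.7, 0.5, 0.1, 0.0])
dog = np.array([0.6, 0.6, 0.2, 0.1])
car = np.array([0.0, 0.1, 0.9, 0.8])

print(cosine_similarity(cat, dog))  # high: related words
print(cosine_similarity(cat, car))  # low: unrelated words
```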
The whole process goes through a number of stages:
=== 1. The text corpus ===
This is the raw data used for learning. It determines the language, the topics that are covered, and the semantics. Typical sources are Wikipedia and news articles.
=== 2. The tokens ===
The corpus is split into words. These might be processed further, e.g. to clean up junk, or to take inflections (e.g. verb forms) into account.
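A minimal tokenization sketch (real pipelines may additionally strip stop words or lemmatize inflected forms, e.g. "running" -> "run"):

```python
import re

def tokenize(text):
    # lowercase and keep only runs of letters; punctuation becomes a separator
    return re.findall(r"[a-z]+", text.lower())

tokenize("The cats were running, the dog ran!")
# -> ['the', 'cats', 'were', 'running', 'the', 'dog', 'ran']
```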
=== 3. Contexts ===
The words are grouped into contexts. These might be all words in a sentence, a certain number of words in a neighborhood, or words that somehow relate to each other grammatically (as in "dependency based word embeddings" [https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/]).
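The "certain number of words in a neighborhood" variant is the sliding-window context used by word2vec. A sketch of how (target, context) pairs are generated from a token list:

```python
def context_pairs(tokens, window=2):
    # yield (target, context) pairs for every word within `window`
    # positions to the left or right of the target
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

context_pairs(["the", "cat", "sat"], window=1)
# -> [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```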
=== 4. The algorithm ===
Different algorithms can be used to map the relationship between words and their contexts into a vector space. The main contenders are:
* '''Word2vec''' by Google, uses neural networks.
* '''fastText''' by Facebook, based on word2vec. Splits words into smaller parts in order to capture syntactic relations (like apparent ---> apparently). Explained here: [https://rare-technologies.com/fasttext-and-gensim-word-embeddings/]
* '''GloVe''' by the natural language processing group at Stanford [https://nlp.stanford.edu/projects/glove/]. Uses more conventional math instead of neural network "black magic" [https://www.quora.com/How-is-GloVe-different-from-word2vec]. Seems to perform just slightly less well than Word2vec and fastText.
== Installation + getting started ==
== Word2vec ==
Included in the ''gensim'' package.
To install, just type
 pip install gensim
into a command window.
Here are some of the things you can do with the model: [6]
Here is a bit of background information and an explanation of how to train your own models: [7].
== fastText ==
Made by Facebook, based on word2vec. Better at capturing syntactic relations (like apparent ---> apparently), see here: [https://rare-technologies.com/fasttext-and-gensim-word-embeddings/]
Pretrained model files are HUGE - this will be a problem on computers with less than 16 GB of memory.
=== Installation + getting started ===
Included in the ''gensim'' package.
To install, just type
 pip install gensim
into a command window.
Documentation is here: [9]
== GloVe ==
Invented by the natural language processing group at Stanford [https://nlp.stanford.edu/projects/glove/]. Uses more conventional math instead of neural network "black magic" [https://www.quora.com/How-is-GloVe-different-from-word2vec]. Seems to perform just slightly less well than Word2vec and fastText.
== Pre-trained models ==
* https://github.com/Kyubyong/wordvectors: Word2Vec and fastText, multiple languages, no English, trained on Wikipedia
* https://github.com/3Top/word2vec-api: mostly GloVe, some word2vec, English, trained on news, Wikipedia, Twitter
* https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md: fastText, all imaginable languages, trained on Wikipedia
* https://radimrehurek.com/gensim/scripts/glove2word2vec.html: convert between GloVe and Word2Vec format
* https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/: an interesting approach that gives similarities between syntactically equivalent words