Revision as of 18:59, 8 May 2017
== General Information on word embeddings ==
Word embeddings associate words with vectors in a high-dimensional space. Words that are close together in that space are more likely to occur in close proximity in a text than words that are far apart. See this article for details: [https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/]
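"Closeness" between word vectors is usually measured with cosine similarity. A minimal sketch with made-up toy vectors (real embeddings have hundreds of dimensions, and the numbers below are purely illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction (very similar), 0.0 = orthogonal (unrelated)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 4-dimensional embeddings for three words
cat = np.array([0.7, 0.5, 0.1, 0.0])
dog = np.array([0.6, 0.6, 0.2, 0.1])
car = np.array([0.0, 0.1, 0.9, 0.8])

print(cosine_similarity(cat, dog))  # high: related words
print(cosine_similarity(cat, car))  # low: unrelated words
```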
The whole process goes through a number of stages:
=== 1. The text corpus ===
This is the raw data used for learning. It determines the language, the topics that are covered, and the semantics. Typical sources are Wikipedia and news articles.
=== 2. The tokens ===
The corpus is split into words. These might be processed further, e.g. to clean up junk, or to take inflections (e.g. verb forms) into account.
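A minimal tokenization sketch (real pipelines may additionally strip stop words or lemmatize inflected forms, e.g. "running" -> "run"):

```python
import re

def tokenize(text):
    # lowercase and keep only runs of letters; punctuation becomes a separator
    return re.findall(r"[a-z]+", text.lower())

tokenize("The cats were running, the dog ran!")
# -> ['the', 'cats', 'were', 'running', 'the', 'dog', 'ran']
```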
=== 3. Contexts ===
The words are grouped into contexts. These might be all words in a sentence, a certain number of words in a neighborhood, or words that somehow relate to each other grammatically (as in "dependency based word embeddings" [https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/]).
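The "certain number of words in a neighborhood" variant is the sliding-window context used by word2vec. A sketch of how (target, context) pairs are generated from a token list:

```python
def context_pairs(tokens, window=2):
    # yield (target, context) pairs for every word within `window`
    # positions to the left or right of the target
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

context_pairs(["the", "cat", "sat"], window=1)
# -> [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```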
=== 4. The algorithm ===
Different algorithms can be used to map the relationship between words and their contexts into a vector space. The main contenders are:
* '''Word2vec''' by Google, uses neural networks.
* '''fastText''' by Facebook, based on word2vec. Splits words into smaller parts in order to capture syntactic relations (like apparent ---> apparently). Explained here: [https://rare-technologies.com/fasttext-and-gensim-word-embeddings/]
* '''GloVe''' by the natural language processing group at Stanford [https://nlp.stanford.edu/projects/glove/]. Uses more conventional math instead of neural network "black magic" [https://www.quora.com/How-is-GloVe-different-from-word2vec]. Seems to perform just slightly less well than Word2vec and fastText.
== Installation + getting started ==
== Word2vec ==
Included in the ''gensim'' package.
To install, just type
 pip install gensim
into a command window.
Here are some of the things you can do with the model: [6]
Here is a bit of background information and an explanation of how to train your own models: [7].
== fastText ==
Made by Facebook, based on word2vec. Better at capturing syntactic relations (like apparent ---> apparently), see here: [https://rare-technologies.com/fasttext-and-gensim-word-embeddings/]
Pretrained model files are HUGE - this will be a problem on computers with less than 16 GB of memory.
=== Installation + getting started ===
Included in the ''gensim'' package.
To install, just type
 pip install gensim
into a command window.
Documentation is here: [9]
== GloVe ==
Invented by the natural language processing group at Stanford [https://nlp.stanford.edu/projects/glove/]. Uses more conventional math instead of neural network "black magic" [https://www.quora.com/How-is-GloVe-different-from-word2vec]. Seems to perform just slightly less well than Word2vec and fastText.
== Pre-trained models ==
* https://github.com/Kyubyong/wordvectors: Word2Vec and fastText, multiple languages, no English, trained on Wikipedia
* https://github.com/3Top/word2vec-api: mostly GloVe, some word2vec, English, trained on news, Wikipedia, Twitter
* https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md: fastText, all imaginable languages, trained on Wikipedia
* https://radimrehurek.com/gensim/scripts/glove2word2vec.html: convert between GloVe and Word2Vec format
* https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/: an interesting approach that gives similarities between syntactically equivalent words