Latest revision as of 13:10, 9 May 2017
General Information on word embeddings
Word embeddings associate words with vectors in a high-dimensional space. Words that are close together in that space are more likely to occur in close proximity in a text than words that are far apart. See this article for details: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
The whole process goes through a number of stages:
1. Text corpus
This is the raw data used for learning. It determines the language, the topics that are covered and the semantics. Typical sources are Wikipedia and news articles.
2. Tokens
The corpus is split into words. These might be processed further, e.g. to clean up junk or to take inflections (e.g. verb forms) into account.
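A minimal sketch of the tokenizing step, assuming a very simple rule (lowercase everything, keep only letters, digits and apostrophes). The function name `tokenize` and the regular expression are illustrative choices, not what any particular embedding tool does internally:

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    # Keep runs of letters, digits and apostrophes; everything else is a separator.
    return re.findall(r"[a-z0-9']+", text.lower())


tokens = tokenize("The cat sat on the mat. The dog didn't!")
print(tokens)
```

Real pipelines usually do more than this, for example handling inflections, but the output has the same shape: a flat list of tokens.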
3. Contexts
The words are grouped into contexts. These might be all words in a sentence, a certain number of words in a neighborhood (aka "bag of words") or words that somehow relate to each other grammatically (as in "dependency based word embeddings", https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/).
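The "certain number of words in a neighborhood" variant can be sketched as follows. The function name `window_contexts` and the `window` parameter are made up for illustration; the window is symmetric, which is only one of several possible choices:

```python
def window_contexts(tokens, window=2):
    """Yield (word, context_words) pairs using a symmetric window around each word."""
    for i, word in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after
        yield word, left + right


tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = list(window_contexts(tokens, window=1))
for word, context in pairs:
    print(word, context)
```

Words at the start and end of the token list simply get a shorter context.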
4. The algorithm
Different algorithms can be used to map the relationship between words and their contexts into a vector space. The main contenders are:
- Word2vec by Google, which uses a shallow neural network.
- fastText by Facebook, based on word2vec. It splits words into smaller parts (character n-grams) in order to capture syntactic relations (like apparent --> apparently). Explained here: https://rare-technologies.com/fasttext-and-gensim-word-embeddings/. Needs a lot of memory.
- GloVe by the Natural Language Processing Group at Stanford (https://nlp.stanford.edu/projects/glove/). Uses more conventional math (it is fitted to global word co-occurrence counts) instead of neural network "black magic" (https://www.quora.com/How-is-GloVe-different-from-word2vec).
The different algorithms seem to perform quite similarly, and results depend on the benchmark and the training data. Word2vec seems to be a little less memory hungry, though.
5. Keyed Vectors
Here comes the good news: all of the algorithms provide a table of words and their positions in vector space. So all you need is that table!
fastText is special in being able to also match words that it hasn't seen before, but we probably don't even need that.
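To make "all you need is that table" concrete, here is a small sketch using a plain dictionary as the table. The three words and their 3-dimensional vectors are made up for illustration (real models typically have hundreds of dimensions), and cosine similarity is assumed as the closeness measure, which is the common choice:

```python
import math

# Toy table with invented vectors; a real table comes from a trained model.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, table):
    """Return the other word in the table whose vector is closest to `word`."""
    candidates = (other for other in table if other != word)
    return max(candidates, key=lambda other: cosine_similarity(table[word], table[other]))


print(most_similar("king", vectors))  # queen
```

Everything else (analogies, clustering, similarity search) is built from lookups and vector arithmetic on this kind of table.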
Pre-trained models
Here is a collection of word->vector tables ("models") that other people have created from big corpora. This is what you probably want:
- https://github.com/3Top/word2vec-api: mostly GloVe, some word2vec; English; trained on news, Wikipedia and Twitter; a good mix
- https://github.com/Kyubyong/wordvectors: word2vec and fastText; multiple languages, no English; trained on Wikipedia
- https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md: fastText; all imaginable languages; trained on Wikipedia; HUGE files
- https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/: an interesting approach that gives similarities between syntactically equivalent words
To convert GloVe tables to word2vec tables, the following script can be used: https://radimrehurek.com/gensim/scripts/glove2word2vec.html
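As far as the text formats go, the difference is small: a word2vec text file starts with a header line giving the number of words and the number of dimensions, while a GloVe text file has no header. The sketch below illustrates that idea on a list of lines; `glove_to_word2vec_lines` is a hypothetical helper written for this page, not the gensim script itself, and for real files the linked script is the safer choice:

```python
def glove_to_word2vec_lines(glove_lines):
    """Prepend the '<vocab_size> <dimensions>' header that word2vec text files start with."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    if not rows:
        raise ValueError("no vectors found in input")
    # Each row is 'word v1 v2 ... vN', so the dimension count is the field count minus one.
    dimensions = len(rows[0].split(" ")) - 1
    return [f"{len(rows)} {dimensions}"] + rows


glove = ["cat 0.1 0.2 0.3\n", "dog 0.4 0.5 0.6\n"]
converted = glove_to_word2vec_lines(glove)
print("\n".join(converted))
```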
Installation + getting started:
Word2vec
Included in the gensim package.
To install, just type
pip install gensim
into a command window.
Here are some of the things you can do with the model: [7]
Here is a bit of background information and an explanation of how to train your own models: [8].
fastText
Made by Facebook, based on word2vec. Better at capturing syntactic relations (like apparent --> apparently), see here: [9]
Pretrained model files are HUGE - this will be a problem on computers with less than 16 GB of memory.
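A rough way to see why the files are so big: the raw table alone needs about (number of words) x (dimensions) x (bytes per number). The figures below (2 million words, 300 dimensions, 4-byte floats) are assumptions for illustration, not the measurements of any specific model, and loading a model usually needs more than the raw table:

```python
def estimate_table_bytes(vocab_size, dimensions, bytes_per_value=4):
    """Lower-bound size of a dense word->vector table, ignoring the words themselves."""
    return vocab_size * dimensions * bytes_per_value


# Hypothetical model: 2 million words, 300 dimensions, 32-bit floats.
size_bytes = estimate_table_bytes(2_000_000, 300)
print(f"{size_bytes / 1024 ** 3:.2f} GiB")
```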
Installation + getting started:
Included in the gensim package.
To install, just type
pip install gensim
into a command window.
Documentation is here: [10]
GloVe
Developed by the Natural Language Processing Group at Stanford (https://nlp.stanford.edu/projects/glove/). Uses more conventional math instead of neural network "black magic" (https://www.quora.com/How-is-GloVe-different-from-word2vec). Seems to perform just slightly worse than word2vec and fastText.