The whole process goes through a number of stages:
=== 1. Text corpus ===
This is the raw data used for learning. It determines the language, the topics that are covered and the semantics.
Typical sources are Wikipedia and news articles.
=== 2. Tokens ===
The corpus is split into words. These might be processed further, e.g. to clean up junk or to take inflections (such as verb forms) into account.
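Just as an illustration (not part of any of the toolkits below), a minimal Python sketch of this step could look like the following; folding inflections together would usually need an extra lemmatizer, e.g. from NLTK or spaCy:
<syntaxhighlight lang="python">
import re

def tokenize(text):
    # Lowercase and keep only letter sequences, dropping punctuation and other junk.
    return re.findall(r"[a-z]+", text.lower())

corpus = "The quick brown fox jumps over the lazy dog. Foxes jump!"
print(tokenize(corpus))
# ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'foxes', 'jump']
</syntaxhighlight>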
=== 3. Contexts ===
The words are grouped into contexts. These might be all words in a sentence, a certain number of words in a neighborhood (a.k.a. "bag of words") or words that somehow relate to each other grammatically (as in "dependency based word embeddings" [https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/]).
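A small toy sketch (again not from any library) of the "window of neighboring words" variant; sentence-based or dependency-based contexts would replace this function:
<syntaxhighlight lang="python">
def context_windows(tokens, window=2):
    # Yield (target word, surrounding words) pairs - a simple "bag of words" context.
    for i, word in enumerate(tokens):
        neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield word, neighbors

for target, context in context_windows(["the", "quick", "brown", "fox", "jumps"]):
    print(target, context)
# the ['quick', 'brown']
# quick ['the', 'brown', 'fox']
# brown ['the', 'quick', 'fox', 'jumps']
# ...
</syntaxhighlight>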
=== 4. The algorithm ===
Different algorithms can be used to map the relationship between words and their context into a vector space. The main contenders are...
* '''Word2vec''' by Google, uses Neural Networks
* '''fastText''' by Facebook, based on word2vec. Splits words into smaller particles in order to capture syntactic relations (like apparent --> apparently). Explained here: [https://rare-technologies.com/fasttext-and-gensim-word-embeddings/]. Needs a lot of memory.
* '''GloVe''' by the Natural Language Processing Group at Stanford [https://nlp.stanford.edu/projects/glove/]. Uses more conventional math instead of Neural Network "Black Magic" [https://www.quora.com/How-is-GloVe-different-from-word2vec].
The different algorithms seem to perform quite similarly, and results depend on the benchmark and training data. Word2Vec seems to be a little less memory hungry, though.
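For orientation, training a Word2vec model on tokenized contexts with the gensim library could look roughly like this (assuming gensim 4.x; in 3.x the <code>vector_size</code> parameter was called <code>size</code>):
<syntaxhighlight lang="python">
from gensim.models import Word2Vec

# Toy corpus: one token list per context/sentence (see steps 2 and 3).
sentences = [
    ["word", "embeddings", "map", "words", "to", "vectors"],
    ["word2vec", "learns", "word", "vectors", "from", "contexts"],
]

# vector_size, window and min_count are the main knobs to tune.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

# The trained word -> vector table lives in model.wv (the "keyed vectors", see step 5).
print(model.wv.most_similar("word", topn=3))
</syntaxhighlight>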
=== 5. Keyed Vectors ===
Here comes the '''good news''': All of the algorithms provide a table with words and their positions in vector space... So '''all you need is that table'''!
fastText is special in being able to also match words that it hasn't seen before... but we probably don't even need that...
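To make that concrete, here is a toy illustration (made-up numbers, real models have 100-300 dimensions) of such a table plus the usual cosine similarity used to compare entries:
<syntaxhighlight lang="python">
import numpy as np

# A tiny made-up word -> vector table - the kind of thing every algorithm spits out.
table = {
    "king":  np.array([0.8, 0.1, 0.3]),
    "queen": np.array([0.7, 0.2, 0.4]),
    "apple": np.array([0.1, 0.9, 0.0]),
}

def cosine(a, b):
    # Standard cosine similarity: close to 1.0 means "points in the same direction".
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(table["king"], table["queen"]))  # relatively high: related words
print(cosine(table["king"], table["apple"]))  # lower: unrelated words
</syntaxhighlight>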
==== Pre-trained models ====
Here is a collection of word->vector tables ("models") that other people have created from big corpora. This is what you probably want (see the loading example below the list):
* [https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models github.com/3Top/word2vec-api]: mostly GloVe, some word2vec, English, trained on news, Wikipedia, Twitter, '''a good mix'''
* [https://github.com/Kyubyong/wordvectors github.com/Kyubyong/wordvectors]: Word2Vec and FastText, '''multiple languages''', no English, trained on Wikipedia
* [https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md github.com/facebookresearch/fastText pretrained-vectors]: fastText, all imaginable languages, trained on Wikipedia, HUGE files
* [https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/ dependency-based word embeddings]: an interesting approach that gives similarities between syntactically equivalent words
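Loading one of these tables with gensim's <code>KeyedVectors</code> could look like this; the file name is just an example for the Google News word2vec download from the first link, use <code>binary=False</code> for plain text files:
<syntaxhighlight lang="python">
from gensim.models import KeyedVectors

# Example file name: the Google News word2vec model; adjust the path to your download.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

print(wv.most_similar("computer", topn=5))   # nearest neighbors in vector space
print(wv.similarity("car", "truck"))         # cosine similarity of two words
</syntaxhighlight>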
In order to convert GloVe tables to the Word2Vec format, the following script can be used:
[https://radimrehurek.com/gensim/scripts/glove2word2vec.html]
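A short usage sketch; the file names are placeholders for a downloaded GloVe text file (newer gensim versions should also be able to load the GloVe file directly with <code>no_header=True</code>):
<syntaxhighlight lang="python">
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# Placeholder file names for a downloaded GloVe text file and the converted output.
glove2word2vec("glove.6B.100d.txt", "glove.6B.100d.word2vec.txt")

# The converted file only gains a header line and then loads like any word2vec table.
wv = KeyedVectors.load_word2vec_format("glove.6B.100d.word2vec.txt", binary=False)
print(wv.most_similar("computer", topn=3))
</syntaxhighlight>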
== Installation + getting started: ==
==GloVe==
Invented by the Natural Language Processing Group at Stanford [https://nlp.stanford.edu/projects/glove/]. Uses more conventional math instead of Neural Network "Black Magic" [https://www.quora.com/How-is-GloVe-different-from-word2vec]. Seems to perform just slightly less well than Word2vec and fastText.
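For reference, the "more conventional math" boils down to a weighted least-squares fit over the global co-occurrence counts <math>X_{ij}</math> (this is the objective from the GloVe paper linked above; <math>f</math> is a weighting function that caps the influence of very frequent word pairs):
<math>J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2</math>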