To analyze the relationships between words, the content of the subreddits was extracted with the ''Pushshift'' API: all posts and comments were crawled and then merged into one file per subreddit.
[[File:VRI-LEK-RawJSONData.png|none|1000px|Raw JSON data]]
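The crawl-and-merge step can be sketched as follows. This is a minimal illustration, not the project's actual script: the field names (<code>subreddit</code>, <code>body</code>, <code>selftext</code>) follow the Pushshift JSON schema, but the helper function and the in-memory grouping are assumptions.

```python
import json

def merge_subreddit_dumps(records):
    """Group crawled posts/comments by subreddit into one text list each.

    In the project this grouping would be written out as one file per
    subreddit; here it is kept in memory for brevity.
    """
    merged = {}
    for rec in records:
        # Comments carry their text in "body", posts in "selftext".
        text = rec.get("body") or rec.get("selftext", "")
        merged.setdefault(rec["subreddit"], []).append(text)
    return merged

# Two fake records standing in for crawled Pushshift JSON objects:
records = [
    {"subreddit": "python", "body": "word2vec is neat"},
    {"subreddit": "python", "selftext": "a post"},
]
merged = merge_subreddit_dumps(records)
```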
To generate the word clouds, the text of the posts and comments is extracted from the JSON files and analyzed with the natural language processing technique ''word2vec''.
[[File:VRI-LEK-Epochs.PNG|none|1000px|Learning Process]]
The resulting vector space has a very high dimensionality and therefore cannot be visualized directly. To reduce it to three dimensions, ''t-distributed stochastic neighbor embedding'' (t-SNE) is used, which keeps words that are close in the high-dimensional space close together in the embedding.
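The reduction step can be sketched with the scikit-learn t-SNE implementation. The random matrix below stands in for the trained word vectors, and the low perplexity is only an accommodation for the tiny sample, both assumptions for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

# 20 fake "word vectors" of dimension 50, standing in for word2vec output.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(20, 50))

# Embed into 3 dimensions; perplexity must be smaller than the sample count.
embedded = TSNE(n_components=3, perplexity=5,
                random_state=0).fit_transform(vectors)
```

Each row of <code>embedded</code> is the 3D position of one word, which can then be plotted as the word cloud.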