Pre-trained word2vec models on mostly Swedish tweets

Get started with AI and Deep Learning with no prior code experience

Human language is at the heart of the current revolution in AI and deep learning. You need a lot of it, though, to make the machines produce anything interesting. I happen to have a lot, and to save you the hassle of collecting the data and training the models (which can take several days), I've prepared it all for you in this post.

Google has been generous and released several smart tools for analyzing language with AI. Thanks to the developments of the last couple of years, almost anyone can join the AI revolution, even someone who is totally new to programming!

The very cool word2vec algorithm, together with Python, is a good place to start. All you need is Python installed on your computer. It comes pre-installed on all Apple computers; otherwise I suggest you download and install the 64-bit version of Python 3.5 from here. All in all, this should be possible to do from scratch in 10 to 15 minutes.

After downloading the models, copy and paste the code below into a terminal running Python and you're ready to start exploring language patterns on Twitter. Such patterns can be used, for example, to analyze groups of people, as Jonathon Morgan did masterfully in a recent study of the alt-right on Twitter.

1. Download the models

Here are the statistical language models I've trained. I'm releasing them under a Creative Commons Zero license, which means you can do whatever you fancy with them.

They're based on the data set I've listed here. You can see how the preprocessing (removing retweets, filtering out common words, etc.) was done here.

I suggest that you start by downloading the following three files that together make up a complete model:

  • big_model_150_mincount_30_no_stops
  • big_model_150_mincount_30_no_stops.syn0.npy
  • big_model_150_mincount_30_no_stops.syn1neg.npy

Below are some stats from the training of the model named big_model_150_mincount_30_no_stops. The name means 150 dimensions, trained on words occurring at least 30 times in the corpus, with stopwords removed.

INFO : collected 7115378 word types from a corpus of 441303111 raw words and 47609469 sentences

INFO : min_count=30 retains 495569 unique words (drops 6619809)

INFO : min_count leaves 415541707 word corpus (94% of original 441303111)

INFO : deleting the raw counts dictionary of 7115378 items

INFO : sample=0.001 downsamples 20 most-common words

INFO : downsampling leaves estimated 338502774 word corpus (81.5% of prior 415541707)

INFO : estimated required memory for 495569 words and 150 dimensions: 842467300 bytes

INFO : resetting layer weights

INFO : training model with 4 workers on 495569 vocabulary and 150 features, using sg=0 hs=0 sample=0.001 negative=5

INFO : expecting 47609469 sentences, matching count from corpus used for vocabulary survey
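
The numbers in the log can be cross-checked with a bit of arithmetic. The sketch below recomputes the share of words kept by min_count, the share left after downsampling, and the memory estimate. The 500 bytes of bookkeeping per vocabulary word and the 4 bytes per weight are assumptions about how gensim arrives at its estimate, not something the log states; they do reproduce the logged figure.

```python
# Figures copied from the training log above.
word_types = 7115378
raw_words = 441303111
kept_types = 495569
dropped_types = 6619809
kept_words = 415541707
downsampled_words = 338502774
dimensions = 150

# Assumptions (not in the log): per-word vocabulary overhead and
# 4-byte float32 weights in two matrices (syn0 and syn1neg).
BYTES_PER_VOCAB_ENTRY = 500
BYTES_PER_WEIGHT = 4

# Share of the corpus that survives each filtering step.
kept_share = 100 * kept_words / raw_words
downsampled_share = 100 * downsampled_words / kept_words

# Memory: vocabulary bookkeeping plus the two weight matrices.
vocab_bytes = kept_types * BYTES_PER_VOCAB_ENTRY
matrix_bytes = kept_types * dimensions * BYTES_PER_WEIGHT
estimated_bytes = vocab_bytes + 2 * matrix_bytes

print(f"kept after min_count: {kept_share:.0f}%")
print(f"kept after downsampling: {downsampled_share:.1f}%")
print(f"estimated memory: {estimated_bytes} bytes")
```

The two weight matrices in the estimate correspond to the two .npy files that ship alongside the model file.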

2. Install gensim

To load a model such as big_model you need to install Gensim. Do that by copy-pasting the following command into your terminal (not the Python interpreter) and waiting until it has finished.

In [ ]:
pip install gensim
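
If you are unsure whether the installation worked, a small check from inside Python can tell you. This is a minimal sketch using only the standard library; the helper name is_installed is made up for this example.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can find a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None


# Prints True once gensim has been installed for this Python interpreter.
print("gensim installed:", is_installed("gensim"))
```

If it prints False, pip probably installed the package for a different Python version than the one you are running.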

3. Load the model in your Python terminal

In your terminal, go to the directory where you downloaded the files. If you want help with that and understand Swedish, I've prepared a beginner's tutorial that covers the basics of navigating between folders in the terminal. Start the Python interpreter by typing 'python' and pressing enter. Then paste the following code into your terminal, press enter and wait while the model is loaded.

In [5]:
import gensim
# Adjust the path so that it points to the model file you downloaded.
model = gensim.models.Word2Vec.load("../../word2vec_on_tweets/models/model_ngrams_150_mincount_30")

Start exploring people's language on Twitter!

Now it's really easy to start exploring the model. Note that the files ending with syn0.npy and syn1neg.npy must be in the same directory as the model file you load.
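
If loading fails, it is worth checking that all three files really are in the same folder. Here is a small sketch of such a check; the function name missing_model_files is hypothetical, and the model name is the one from the download list, so swap in whichever model you downloaded.

```python
from pathlib import Path

# Suffixes of the companion files that must sit next to the model file.
COMPANION_SUFFIXES = (".syn0.npy", ".syn1neg.npy")


def missing_model_files(directory, model_name):
    """Return the names of the expected model files that are not in directory."""
    base = Path(directory) / model_name
    expected = [base] + [Path(str(base) + suffix) for suffix in COMPANION_SUFFIXES]
    return [path.name for path in expected if not path.is_file()]


# Check the current directory for the model from the download list.
missing = missing_model_files(".", "big_model_150_mincount_30_no_stops")
print("Missing files:", missing or "none")
```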

If you would like to see which words are most similar to "Sverige" (Sweden), you can use the method most_similar():

In [3]:
[('sverige', 0.8774566054344177),
 ('Tyskland', 0.792809247970581),
 ('landet', 0.7679420709609985),
 ('Norge', 0.7538720369338989),
 ('Europa', 0.7496431469917297),
 ('världen', 0.7442996501922607),
 ('Ryssland', 0.7381924986839294),
 ('Sv', 0.7376805543899536),
 ('Grekland', 0.7372180819511414),
 ('vårt_land', 0.7331665754318237)]
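
The cell above shows only the output. A call along the following lines should produce a list like it; the helper top_neighbours is made up for this example. Older gensim versions expose most_similar() on the model itself, while newer ones keep it on model.wv, so the sketch tries both.

```python
def top_neighbours(model, word, n=10):
    """Return the n words closest to word as (word, similarity) pairs."""
    # Newer gensim versions keep the word vectors on model.wv;
    # older ones expose most_similar() directly on the model.
    vectors = getattr(model, "wv", model)
    return vectors.most_similar(positive=[word], topn=n)
```

With the model loaded as above, top_neighbours(model, "Sverige") returns the ten nearest words. The scores are cosine similarities, so values closer to 1 mean the words appear in more similar contexts.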

And to see which word in a list doesn't belong with the others, you can use another fun method:

In [6]:
model.doesnt_match("dator data artificiell_intelligence teologi".split())
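
As far as I understand it, doesnt_match() averages the vectors of the words you pass in and returns the word whose vector points furthest away from that average. The sketch below illustrates the idea in plain Python. It is a simplification rather than gensim's actual code, and the two-dimensional vectors are invented for the example (real ones have 150 dimensions).

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def odd_one_out(vectors):
    """Return the word whose vector is least similar to the mean vector."""
    normalised = {word: unit(vec) for word, vec in vectors.items()}
    size = len(next(iter(normalised.values())))
    # Average the unit vectors component by component, then rescale.
    mean = unit([
        sum(vec[i] for vec in normalised.values()) / len(normalised)
        for i in range(size)
    ])

    def closeness(word):
        # Dot product of two unit vectors is their cosine similarity.
        return sum(a * b for a, b in zip(normalised[word], mean))

    return min(normalised, key=closeness)


# Made-up toy vectors: three "computer" words and one outlier.
toy_vectors = {
    "dator": [0.9, 0.1],
    "data": [0.8, 0.2],
    "artificiell_intelligence": [0.85, 0.15],
    "teologi": [0.1, 0.9],
}
print(odd_one_out(toy_vectors))
```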

There are quite a lot of English words as well, plus some Norwegian, Finnish, etc., even though they may be too few in the original data set to give quality results. Try for yourself!

In [9]:
[('Obama', 0.8103233575820923),
 ('President_Obama', 0.7765438556671143),
 ('Donald_Trump', 0.7678619027137756),
 ('Mitt_Romney', 0.7673009037971497),
 ('Vladimir_Putin', 0.75397789478302),
 ('obama', 0.7518642544746399),
 ('Marco_Rubio', 0.7498303055763245),
 ('Biden', 0.7391507029533386),
 ('Boris_Johnson', 0.7385252714157104),
 ('president_Obama', 0.7371423244476318)]

Happy exploration! :-)

