The big picture of public discourse on Twitter by clustering metadata

Authors: Mattias Östmar, mattiasostmar (a) gmail.com, Mikael Huss, mikael.huss (a) gmail.com

Summary: We identify communities in the Swedish twitterverse by analyzing a large network of millions of reciprocal mentions in a sample of 312,292,997 tweets from 435,792 twitter accounts in 2015 and show that politically meaningful communities among others can be detected without having to read or search for specific words or phrases.

Background

Inspired by Hampus Brynolf's Twittercensus, we wanted to perform a large-scale analysis of the Swedish Twitterverse, but from a different perspective where we focus on discussions rather than follower statistics.

All images are licensed under Creative Commons CC-BY (mention the source) and the data is released under Creative Commons Zero which means you can freely download and use it for your own purposes, no matter what. The underlaying tweets are restricted by Twitters Developer Agreement and Policy and cannot be shared due to their restrictions, which are mainly there to protect the privacy of all Twitter users.

Method

Code pipeline

A pipeline connecting the different code parts for reproducing this experiment is available at github.com/hussius/bigpicture-twitterdiscouse.

The Dataset

The dataset was created by continously polling Twitter's REST API for recent tweets from a fixed set of Twitter accounts during 2015. The API gives out tweets from before the polling starts as well, but Twitter does not document how those are selected. A more in depth description of how the dataset was created and what it looks like can be found at mindalyzer.com.

Graph construction

From the full dataset of tweets, the tweets originating from 2015 was filtered out and a network of reciprocal mentions was created by parsing out any at-mentions (e.g. '@mattiasostmar') in them. Retweets of others people's tweets have not been counted, even though they might contain mentions of other users. We look at reciprocal mention graphs, where a link between two users means that both have mentioned each other on Twitter at least once in the dataset (i.e. addressed the other user with that user's Twitter handle, as happens by default when you reply to a Tweet, for instance). We take this as a proxy for a discussion happening between those two users. The mention graphs were generated using the NetworkX package for Python. We naturally model the graph as undirected (as both users sharing a link are interacting with each other, there is no notion of directionality) and unweighted. One could easily imagine a weighted version of the mention graph where the weight would represent the total number of reciprocal mentions between the users, but we did not feel that this was needed to achieve interesting results.

The final graph consisted of 377.545 nodes (Twitter accounts) and 15.862.275 edges (reciprocal mentions connecting Twitter accounts). The average number of reciprocal mentions for nodes in the graph was 42. The code for the graph creation can be found here and you can also download the pickled graph in NetworkX-format (104,5MB, license:CCO).

The visualizations of the graphs were done in Gephi using the Fruchterman Reingold layout algoritm and thereafter adjusting the nodes with the Noverlap algorithm and finally the labels where adjusted with the algoritm Label adjust. Node size were set based on the 'importance' measure that comes out of the Infomap algoritm.

Community detection

In order to find communities in the mention graph (in other words, to cluster the mention graph), we use Infomap, an information-theory based approach to multi-level community detection that has been used for e.g. mapping of biogeographical regions such as Edler, Etal 2015 and scientific publications such as Rosvall, Bergström 2010, among many examples. This algorithm, which can be used both for directed and undirected, weighted and unweighted networks, allows for multi-level community detection, but here we only show results from a single partition into communities. (We also tried a multi-level decomposition, but did not feel that this added to the analysis presented here.)

"The Infomap algorithm returned a bunch of clusters along with a score for each user indicating how "central" that person was in the network, measured by a form of PageRank, which is the measure Google introduced to rank web pages. Roughly speaking, a person involved in a lot of discussions with other users who are in turn highly ranked would get high scores by this measure. For some clusters, it was enough to have a quick glance at the top ranked users to get a sense of what type of discourse that defines that cluster. To be able to look at them all we performed language analysis of each cluster's users tweets to see what words were the most distinguishing. That way we also had words to judge the quaility of the clusters from.

What did we find?

We took the top 20 communities in terms of size, collected the tweets during 2015 from each member in those clusters, and created a textual corpus out of that (more specifically, a Dictionary using the Gensim package for Python). Then, for each community, we tried to find the most over-represented words used by people in that community by calculating the TF-IDF (term frequency-inverse document frequency) for each word in each community, and looking at the top 10 words for each community.

When looking at these overrepresented words, it was really easy to assign "themes" to our clusters. For instance, communities representing Norwegian and Finnish users (who presumably sometimes tweet in Swedish) were trivial to identify. It was also easy to spot a community dedicated to discussing the state of Swedish schools, another one devoted to the popular Swedish band The Fooo Conspiracy, and an immigration-critical cluster. In fact we have defined dozens of thematically distinct communities and continue to find new ones.

A "military defense" community

Number of nodes 1224
Number of edges 14254
Data (GEXF graph) Download (license: CC0)

One of the communities we found, which tends to discuss military defense issues and "prepping", is shown in a graph below. This corresponds almost eerily well to a set of Swedish Twitter users highlighted in large Swedish daily Svenska Dagbladet Försvarstwittrarna som blivit maktfaktor i debatten. In fact, of their list of the top 10 defense bloggers, we find each and every one of them in our top 18. Remember that our analysis uses no pre-existing knowledge of what we are looking for: the defense cluster just fell out of the mention graph decomposition.

Top 10 accounts
1. Cornubot
2. wisemanswisdoms
3. patrikoksanen
4. annikanc
5. hallonsa
6. waterconflict
7. mikaelgrev
8. Spesam
9. JohanneH
10. Jagarchefen

Top distinguishing words (measured by TF-IDF:

#svfm
#säkpol
#fofrk
russian
#föpol
#svpol
ukraine
ryska
#ukraine
ryska
russia
nato
putin

The graph below shows the top 104 accounts in the cluster ranked by Infomap algorithm's importance measure. You can also download a zoomable pdf.

Defence cluster top 104

Graph of "general pundit" community

Number of nodes 1100 (most important of 7332)
Number of edges 38684 (most important of 92332
Data (GEXF) Download (license: CC0)

The largest cluster is a bit harder to summarize easily than many of the other ones, but we think of it as a "pundit cluster" with influential political commentators, for example political journalists and politicians from many different parties. The most influential user in this community according to our analysis is @sakine, Sakine Madon, who was also the most influential Twitter user in Mattias eigenvector centrality based analysis of the whole mention graph (i.e. not divided into communities).

Accounts
1. Sakine
2. oisincantwell
3. RebeccaWUvell
4. fvirtanen
5. niklasorrenius
OhlyLars
7. Edward_Blom
8. danielswedin
9. Ivarpi
10. detljuvalivet

Top distinguishing words (measured by TF-IDF:

#svpol
nya
tycker
borde
läs
bättre
löfven
svensksåld
regeringen
sveriges
#eupol
jobb

The graph below shows the top 106 accounts in the cluster ranked by Infomap algorithm's importance measure. You can also download a zoomable pdf.

pundits cluster top 106

Graph of "immigration" community

Number of nodes 2308)
Number of edges 33546
Data (GEXF) Download (license: CC0)

One of the larger clusters consists of accounts clearly focused on immigration issues judging by the most distinguishing words. An observation is that while all the larger Swedish political parties official Twitter accounts are located within the "general pundit" community, Sverigedemokraterna (The Sweden Democrats) that was born out of immigration critical sentiments is the only one of them located in this commuity. This suggests that they have (or at least had in the period up until 2015) an outsider position in the public discourse on Twitter that might or might not reflect such a position in the general public political discourse in Sweden. There is much debate and worry about filter bubbles formed by algorithms that selects what people get to see. Research such as Credibility and trust of information in online environments suggests that the social filtering of content is a strong factor for influence. Strong ties such as being part of a conversation graph such as this would most likely be an important factor in shaping of your world views.

Accounts
1. Jon_Brenelli
2. perraponken
3. sjunnedotcom
4. RolandXSweden
5. inkonsekvenshen
6. AnnaTSL
7. TommyFunebo
8. doppler3ffect
9. Stassministern
10. rogsahl

Top distiguishing words (TF-IDF):

#motgift
#natpol
#migpol
#dkpol
7-klövern
#svpbs
#artbymisen
#tcot
sanandaji
arnstad
massinvandring
#sdu14
#amazoncart
#ringp1
rlm
riktpunkt.nu:
#pkbor
tino
#pklogik
#nordiskungdom

immigration cluster top 102

Future work

Since we have the pipeline ready, we can easily redo it for 2016 when the data are in hand. Possibly this will reveal dynamical changes in what gets discussed on Twitter, and may give indications on how people are moving between different communities. It could also be interesting to experiment with a weighted version of the graph, or to examine a hierarchical decomposition of the graph into multiple levels.

Pre-trained word2vec models on mostly Swedish tweets

Get started with AI and Deep Learning with no prior code experience

Human language is at the heart of the current revolution in AI and deep learning. You need a lot of it though, in order to make the machines produce anything interesting. I happen to have a lot and to save you the hassle with collection and training (that can take several days) I've prepared it all for you in this post.

Google has been generous and released several smart tools to analyze language for AI. With all the developments that have been made over the last couple of years it's actually possible for almost everyone to join the AI-revolution, even for someone that is totally new to programming!

The very cool algorithm word2vec and Python is a good place to start. All you need is Python installed on your computer (pre-installed in all Apple computers, otherwise I suggest you download and install Python 3.5 in 64-bit version from here. All-in-all this should be possible to do in 10 to 15 minutes from scratch.

Copy paste the code below into your terminal running Python after downloading the models and you're good to go to start exploring language patterns on Twitter, which e.g. can be used to analyze groups of people masterly done by Jontahon Morgan in a recent study on the alt-right on Twitter.

1. Download the models

Here are the statistical language models I've trained and now release under a Creative Commons Zero license which means you can do whatever you fancy with them.

They're based on the data set I've listed here. You can see how the preprocessing (removing retweets, filtering out commons words etc) was done here.

I suggest that you start by downloading the following three files that together make up a complete model:

  • big_model_150_mincount_30_no_stops
  • big_model_150_mincount_30_no_stops.syn0.npy
  • big_model_150_mincount_30_no_stops.syn1neg.npy

Below are some stats from the training of the model named big_model_150_mincount_30_no_stops - meaning 150 dimensions on words occuring at least 30 times in the corpus and with stopwords removed.

INFO : collected 7115378 word types from a corpus of 441303111 raw words and 47609469 sentences

INFO : min_count=30 retains 495569 unique words (drops 6619809)

INFO : min_count leaves 415541707 word corpus (94% of original 441303111)

INFO : deleting the raw counts dictionary of 7115378 items

INFO : sample=0.001 downsamples 20 most-common words

INFO : downsampling leaves estimated 338502774 word corpus (81.5% of prior 415541707)

INFO : estimated required memory for 495569 words and 150 dimensions: 842467300 bytes

INFO : resetting layer weights

INFO : training model with 4 workers on 495569 vocabulary and 150 features, using sg=0 hs=0 sample=0.001 negative=5

INFO : expecting 47609469 sentences, matching count from corpus used for vocabulary survey

2. Install gensim

To load e.g. the big_model you need to install Gensim. You do that by simply copy-pasting the following code in the terminal and wait till it has finished.

In [ ]:
pip install genism 

3. Load the model in your Python terminal

In your terminal go to the directory where you downloaded the files. If you want help on how to do that and understand Swedish I've prepared a beginners tutorial which covers the basics of navigating between folders in the terminal. Start the Python interpreter in your terminal by writing 'python' and pressing enter. Then paste the following code into your terminal and press enter and wait while the model is being loded.

In [5]:
import gensim
model = gensim.models.Word2Vec.load("../../word2vec_on_tweets/models/model_ngrams_150_mincount_30")

Start exploring peoples language on Twitter!

Now it's really easy to start exploring the model. Note that you also need the files ending with syn0.npy and syn1neg.npy in the same directory that you load the model from.

If you would like to see what words are most similar to "Sverige" (Sweden) you can use the method most_similar()

In [3]:
model.most_similar("Sverige")
Out[3]:
[('sverige', 0.8774566054344177),
 ('Tyskland', 0.792809247970581),
 ('landet', 0.7679420709609985),
 ('Norge', 0.7538720369338989),
 ('Europa', 0.7496431469917297),
 ('världen', 0.7442996501922607),
 ('Ryssland', 0.7381924986839294),
 ('Sv', 0.7376805543899536),
 ('Grekland', 0.7372180819511414),
 ('vårt_land', 0.7331665754318237)]

And to see what what isn't usually used together with the other you can use another fun method:

In [6]:
model.doesnt_match("dator data artificiell_intelligence teologi".split())
Out[6]:
'teologi'

There are quite a lot of English words also, plus some Norwegian, Finish etc even though they might be a bit to few in the original data set to give quality results. Try for your self!

In [9]:
model.most_similar("Barack_Obama")
Out[9]:
[('Obama', 0.8103233575820923),
 ('President_Obama', 0.7765438556671143),
 ('Donald_Trump', 0.7678619027137756),
 ('Mitt_Romney', 0.7673009037971497),
 ('Vladimir_Putin', 0.75397789478302),
 ('obama', 0.7518642544746399),
 ('Marco_Rubio', 0.7498303055763245),
 ('Biden', 0.7391507029533386),
 ('Boris_Johnson', 0.7385252714157104),
 ('president_Obama', 0.7371423244476318)]

Happy exploration! :-)

Myers-Briggs and mood training data corpora

Both the Myers-Briggs corpus used on typealyzer.com and the mood corpus used at uClassify are now available on my Github account. They both concist of example texts for each dimension I selected manually from the web based on pretty extensive reading about C.G. Jungs type theory, Myers-Briggs and Naomi Quencks book about how stress plays out differently in people depending on their personality type.

I've since I made these corpora explored several other psychological theories, but more importantly learned how to code in order to improve the process for creating psychological text analysis. You can find out more about what I'm up to right now by watching these slides.

Trustworthiness on Twitter: Sustainability Engagement Mapping 2015

Hösten 2015 presenterade jag ett projekt jag gjort på uppdrag av Whispr Group i samarbete med Sustainable Brand Insight baserat på insamlad Twitterdata. Utmaningen var att poängsätta twitterkonton utifrån hur hög trovärdighet de har som påverkare inom hållbarhet och CSR. Här är videon från presentationen av Sustainability Engagement Mapping 2015 som projektet döptes till. I presentationen berättar jag mer om urvalet av data och hur poängsättningen gjordes.