Swedish Twitter data 2015

To get a data set large and diverse enough for experiments with personality and language style, I have used the Twitter REST API to collect tweets from a list of Twitter screen names since 1 January 2015. The list of screen names was a mash-up of the following sources:

  • Top active users on Twittoppen.se in late 2014.
  • A list of about 50,000 of the most active users according to Hampus Brynolf's (thank you!) Twittercensus in 2012.
  • The most frequently occurring account names in a search on Twitter's streaming API, run over a couple of months, for a list of common words containing the Swedish characters Å, Ä and Ö, such as "också", "från" and "över".

That way I found screen names to continuously query through Twitter's REST API and later classify according to which languages they actually use over time.
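The third source and the final mash-up can be sketched roughly as follows. This is a minimal illustration, not the code actually used for the data set: the function names `frequent_authors` and `merge_sources`, the `(screen_name, text)` tuple format, and the simple whitespace matching are all assumptions made for the example.

```python
from collections import Counter

# Common Swedish words containing Å, Ä or Ö (the three examples named in the text).
SWEDISH_WORDS = {"också", "från", "över"}


def frequent_authors(tweets, words, top_n):
    """Return the top_n screen names that most often used any of `words`.

    `tweets` is an iterable of (screen_name, text) pairs; this format is a
    hypothetical stand-in for whatever the streaming collection produced.
    """
    counts = Counter()
    for screen_name, text in tweets:
        tokens = set(text.lower().split())
        if tokens & words:
            # Twitter screen names are case-insensitive, so normalise them.
            counts[screen_name.lower()] += 1
    return [name for name, _ in counts.most_common(top_n)]


def merge_sources(*sources):
    """Merge several lists of screen names, dropping duplicates but keeping order."""
    seen = set()
    merged = []
    for source in sources:
        for name in source:
            key = name.lstrip("@").lower()
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


stream = [
    ("anna", "Jag kommer från Umeå"),
    ("bob", "Hello world"),
    ("Anna", "Det är också bra"),
    ("carl", "över bron"),
]
from_stream = frequent_authors(stream, SWEDISH_WORDS, top_n=2)
all_names = merge_sources(["@Anna", "dave"], from_stream)
print(all_names)
```

Normalising to lower case before merging avoids querying the same account twice when it appears with different capitalisation in different sources.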

The Twitter REST API only provides a small sample of tweets for each user queried and does not state what percentage of a user's tweets it returns. Nor does Twitter describe whether, and if so how, the returned tweets are selected. It is therefore impossible to say whether the tweets are statistically representative of each user.

In [20]:
import pickle

# Load the per-account statistics and sum the tweet counts over all accounts.
with open("../files/users_stats_protocol2.pickle", "rb") as f:
    accounts = pickle.load(f)
accounts.tweets.sum()
Out[20]:
312292997

There are a total of 312,292,997 tweets in the data set. Below we can see that it contains a total of 435,792 Twitter accounts.

In [14]:
len(accounts)
Out[14]:
435792

The table below shows some statistics for the full data set. I have also used the Python library Polyglot to classify all the tweets from each account in aggregate, to determine how many accounts actually write in Swedish. Polyglot returns three language classification results: the primary, secondary and tertiary language found. That means that someone like me, who tweets in Swedish and English, can have sv (Swedish) as the value of the primary language (column "lang1") and en (English) as the value of column "lang2".

In [7]:
accounts.describe()
Out[7]:
         fsize   lang1   lang2   lang3  rt_count  tokens  tweets
count   435792  435792  435792  435792    435792  435792  435792
unique  265964      96     151     168      3981   45727    5123
top       1716      sv      en      un         0      48     200
freq        19  288650  252758  278669     55609     203    8245
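The per-account classification described above could look something like the sketch below. It is only an illustration under stated assumptions: the helper names `polyglot_codes` and `classify_account` are hypothetical, and padding missing results with "un" is my own choice for the example. Polyglot is imported lazily inside its helper, so the rest of the sketch runs even where Polyglot and its dependencies are not installed.

```python
from typing import Callable, Dict, List


def polyglot_codes(text: str) -> List[str]:
    """Return the language codes Polyglot detects in `text`, most prominent first.

    Requires the polyglot package and its language detection dependencies.
    """
    from polyglot.detect import Detector

    # quiet=True suppresses the exception Polyglot raises on unreliable detection.
    return [language.code for language in Detector(text, quiet=True).languages]


def classify_account(
    tweets: List[str], detect: Callable[[str], List[str]]
) -> Dict[str, str]:
    """Classify all tweets from one account in aggregate.

    Returns the columns lang1, lang2 and lang3; missing results are padded
    with "un" (unknown), which is an assumption made for this sketch.
    """
    text = "\n".join(tweets)
    codes = (list(detect(text)) + ["un"] * 3)[:3]
    return dict(zip(("lang1", "lang2", "lang3"), codes))


# Stand-in detector so the example runs without Polyglot installed.
def fake_detect(text: str) -> List[str]:
    return ["sv", "en"]


result = classify_account(["Hej allihopa!", "Hello everyone!"], fake_detect)
print(result)
```

With Polyglot installed, `classify_account(tweets, polyglot_codes)` would use the real detector instead of the stand-in. Classifying the concatenated text rather than each tweet gives the detector more material to work with, since single tweets are short.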

If we look closer at the language classification we can see the distribution of primary languages across the accounts: 288,650 accounts have been classified as using Swedish as the primary language and 112,499 as using English. "NaN" in this context means that no language could be identified.

In [19]:
accounts.lang1.value_counts()[:10]
Out[19]:
sv     288650
en     112499
no       7600
NaN      5279
es       5178
tr       2204
fi       2173
da       1865
fr       1297
de       1128
Name: lang1, dtype: int64
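To put these counts in proportion, they can be turned into shares of all accounts. The sketch below copies the five largest counts and the account total from the outputs above into plain Python; the names `TOTAL_ACCOUNTS`, `top_lang1` and `share_percent` are made up for the example.

```python
# Total number of accounts, from len(accounts) above.
TOTAL_ACCOUNTS = 435792

# The five largest primary-language counts, copied from the output above.
top_lang1 = {"sv": 288650, "en": 112499, "no": 7600, "NaN": 5279, "es": 5178}


def share_percent(count, total):
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100 * count / total, 1)


shares = {lang: share_percent(n, TOTAL_ACCOUNTS) for lang, n in top_lang1.items()}
for lang, pct in shares.items():
    print(f"{lang}: {pct}%")
```

By this calculation roughly 66.2 per cent of the accounts have Swedish as primary language and about 25.8 per cent have English. Working on the DataFrame directly, pandas' `value_counts(normalize=True)` should give the corresponding fractions without the manual division.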

We can take a closer look at my own account to understand what the overview can tell us.

In [11]:
accounts.loc["mattiasostmar"]
Out[11]:
fsize       1218180
lang1            sv
lang2            en
lang3            un
rt_count        421
tokens        65782
tweets         4052
Name: mattiasostmar, dtype: object

From the stats we can see that the data for the user account "mattiasostmar" is 1.2 MB in size, that the primary language is Swedish, that the secondary language is English, and that no specific third language was found ("un" = unknown).

Furthermore, 421 tweets are retweets of others and 4,052 tweets are original tweets by the author. The text of the original tweets (retweets excluded) consists of 65,782 words ("tokens"), not counting hashtags and URLs.
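How the columns rt_count, tweets and tokens relate to each other can be illustrated with a small sketch. It does not reproduce the code behind the data set: treating any text that starts with "RT @" as a retweet, splitting on whitespace, and the regular expressions for hashtags and URLs are all simplifying assumptions, and the function names are hypothetical.

```python
import re

# Simplified patterns, assumed for this example only.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")


def is_retweet(text):
    """Treat tweets starting with 'RT @' as retweets (an assumption)."""
    return text.startswith("RT @")


def count_tokens(text):
    """Count whitespace-separated tokens after removing URLs and hashtags."""
    # URLs are removed first so that a '#fragment' inside a URL is not left behind.
    cleaned = HASHTAG_RE.sub(" ", URL_RE.sub(" ", text))
    return len(cleaned.split())


def account_stats(tweets):
    """Return rt_count, tweets (originals only) and tokens for one account."""
    originals = [t for t in tweets if not is_retweet(t)]
    return {
        "rt_count": len(tweets) - len(originals),
        "tweets": len(originals),
        "tokens": sum(count_tokens(t) for t in originals),
    }


sample = [
    "RT @someone: hej hej",
    "Jag gillar #python och pandas https://example.com",
    "God morgon!",
]
stats = account_stats(sample)
print(stats)
```

For the three sample tweets this gives one retweet, two original tweets and six tokens, since the hashtag and the URL in the second tweet are not counted.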