To get a dataset large and diverse enough for experiments with personality and language style, I've used the Twitter REST API to collect tweets from a list of Twitter screen names since 1 January 2015. The list of user names was a mash-up of the following sources:
- Top active users on Twittoppen.se in late 2014.
- A list of the roughly 50,000 most active users according to Hampus Brynolf's (thank you!) Twittercensus in 2012.
- The account names occurring most frequently in a search on Twitter's streaming API, over a period of a couple of months, for common Swedish words containing the characters Å, Ä or Ö, such as "också", "från" and "över".
That way I found screen names to continuously query through Twitter's REST API and later classify according to which languages they actually use over time.
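The third source, ranking accounts by how often they show up in a stream filtered on common Swedish words, could be sketched roughly like this. It is a minimal illustration and not the collection code actually used: the `tweets` list stands in for messages received from the streaming API, and the dictionary keys `screen_name` and `text` as well as the function name `rank_screen_names` are assumptions made for the example.

```python
from collections import Counter

# The three example words named in the text; a real list would be longer.
SWEDISH_WORDS = {"också", "från", "över"}


def rank_screen_names(tweets, words=SWEDISH_WORDS):
    """Count, per account, the tweets containing at least one of the words."""
    counts = Counter()
    for tweet in tweets:
        # Naive whitespace tokenisation; punctuation handling is left out.
        tokens = set(tweet["text"].lower().split())
        if tokens & words:
            counts[tweet["screen_name"]] += 1
    # Most frequent accounts first.
    return counts.most_common()


# Hypothetical sample standing in for streamed tweets.
tweets = [
    {"screen_name": "anna", "text": "Jag kommer också från Umeå"},
    {"screen_name": "bob", "text": "Hello from London"},
    {"screen_name": "anna", "text": "Vi går över bron"},
    {"screen_name": "carl", "text": "Hälsningar från Lund"},
]

ranking = rank_screen_names(tweets)
print(ranking)
```

The accounts at the top of such a ranking are the ones most likely to tweet in Swedish, which is why they are useful candidates for further collection.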
The Twitter REST API only provides a small sample of tweets for each user queried, and it does not state what percentage of a user's tweets it returns. Nor does Twitter describe whether, and if so how, the returned tweets are selected. It is therefore impossible to say that the tweets are statistically representative of each user.
    import pickle

    # Load the per-account statistics and sum the tweet counts.
    with open("../files/users_stats_protocol2.pickle", "rb") as f:
        accounts = pickle.load(f)

    accounts.tweets.sum()
There are a total of 312,292,997 tweets in the dataset. Below we can see that there are a total of 435,792 Twitter accounts in the dataset.
The table below shows some statistics for the full dataset. I then used the Python library Polyglot to classify all the tweets from each account in aggregate, to determine how many accounts actually write in Swedish. Polyglot returns three language classification results: the primary, secondary and tertiary language found. That means that someone like me, who tweets in both Swedish and English, can have sv (Swedish) as the primary language (column "lang1") and en (English) as the secondary language (column "lang2").
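How a ranked detection result ends up in the three columns can be illustrated with a small sketch. This is not the code used for the dataset: the helper `to_lang_columns` and the confidence values are hypothetical, and padding missing slots with "un" is an assumption. With Polyglot, the ranked list would presumably be built from `Detector(text).languages`, but that wiring is left out so the example runs without the library.

```python
def to_lang_columns(ranked, slots=3, unknown="un"):
    """Map ranked (language_code, confidence) pairs to lang1..langN columns."""
    # Keep at most `slots` codes, in the order the detector ranked them.
    codes = [code for code, _confidence in ranked[:slots]]
    # Pad with the "unknown" marker when fewer languages were detected.
    codes += [unknown] * (slots - len(codes))
    return {f"lang{i}": code for i, code in enumerate(codes, start=1)}


# Hypothetical detector output for an account tweeting in Swedish and English.
ranked = [("sv", 71.0), ("en", 27.0)]
columns = to_lang_columns(ranked)
print(columns)
```

Classifying all of an account's tweets as one text, as described above, gives the detector far more material to work with than a single short tweet would.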
If we look closer at the language classification, we can see the distribution of primary languages among the users. 288,650 accounts have been classified as using Swedish as the primary language, and 112,499 as using English. "NaN" in this context means that no language was identified.
    sv     288650
    en     112499
    no       7600
    NaN      5279
    es       5178
    tr       2204
    fi       2173
    da       1865
    fr       1297
    de       1128
    Name: lang1, dtype: int64
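To put these counts in proportion, they can be divided by the 435,792 accounts in the dataset. The sketch below only uses figures reported in this post; the variable names are made up for the example.

```python
# Total number of accounts in the dataset, as reported above.
TOTAL_ACCOUNTS = 435_792

# The five most common primary-language labels from the output above.
primary_language_counts = {
    "sv": 288_650,
    "en": 112_499,
    "no": 7_600,
    "NaN": 5_279,
    "es": 5_178,
}

# Share of all accounts per primary language, in per cent.
shares = {
    lang: round(100 * count / TOTAL_ACCOUNTS, 1)
    for lang, count in primary_language_counts.items()
}

for lang, share in shares.items():
    print(f"{lang}: {share} %")
```

Roughly two thirds of the accounts are classified as primarily Swedish and about a quarter as primarily English.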
We can take a closer look at my own account to understand what data the overview can give us.
    fsize       1218180
    lang1            sv
    lang2            en
    lang3            un
    rt_count        421
    tokens        65782
    tweets         4052
    Name: mattiasostmar, dtype: object
From the stats we can see that the data for the user account "mattiasostmar" is 1.2 MB in size, that the primary language is Swedish, that the secondary language is English, and that no specific third language was found ("un" = unknown).
Furthermore, there are 421 tweets that are retweets of others and 4,052 tweets that are original by the author. The text in the original tweets (retweets not counted) consists of 65,782 words ("tokens"), excluding hashtags and URLs.
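A rough sketch of how counts like `rt_count`, `tweets` and `tokens` could be derived from raw tweet texts is shown below. The actual tokenisation and retweet detection behind the dataset are not described here, so everything in the sketch is an assumption: retweets are recognised by a leading "RT @", tokens are split on whitespace, and the function names and sample texts are hypothetical.

```python
import re

# Simple patterns for the things excluded from the token count.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")


def count_tokens(text):
    """Count whitespace-separated tokens after removing URLs and hashtags."""
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    return len(text.split())


def account_stats(tweets):
    """Summarise tweet texts; a leading 'RT @' is assumed to mark a retweet."""
    retweets = [t for t in tweets if t.startswith("RT @")]
    originals = [t for t in tweets if not t.startswith("RT @")]
    return {
        "rt_count": len(retweets),
        "tweets": len(originals),
        # Tokens are counted in original tweets only.
        "tokens": sum(count_tokens(t) for t in originals),
    }


# Hypothetical sample texts.
sample = [
    "Läser om #nlp och personlighet https://example.com/a",
    "RT @someone: a retweeted message",
    "Tweeting in English too",
]
stats = account_stats(sample)
print(stats)
```

Leaving retweets out of the token count keeps the text attributable to the account holder, which matters when the goal is to study an individual's language style.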