TL;DR: Over time, I’ve recognized some potential weaknesses in my original logistic-regression model, and some ways to create and test a much more robust classifier. To suit my new requirements, I turned to Deep Learning. Somewhat surprisingly, I found that Recurrent Neural Networks are insufficient. Rather, I developed what I call the N3CNN: three parallel Convolutional Neural Networks max-pooled into fully-connected layers. Input tweets are normalized by converting to lowercase ASCII, removing the hashtag and username text, and stripping out all URLs and emoji. The tweet is then embedded as a tensor using one-hot encoding of the normalized-text unigrams, as well as the day-of-week and time-of-day at which it was posted. The model was trained on 6,770 @realDonaldTrump tweets made between the day Mr. Trump started his campaign and the day he took office. Testing on an additional ground-truthed set of 15,357 tweets, the classifier has an accuracy of 98% in identifying whether a given tweet is consistent with those written by Mr. Trump or by his staff. A McNemar test shows that this new model unambiguously outperforms the old logistic-regression one. I believe that, in no longer allowing classification to be influenced by the presence of URLs, emoji, or capitalized wording, this N3CNN model represents the new state-of-the-art in classifying the author of @realDonaldTrump tweets.
Why try to identify the author of Donald Trump’s tweets?
In late December of 2016, I stumbled across this tweet:
I don’t know about you, but I found it really odd that Mr. Trump would be thanking himself like that. So I did a little Google searching and found a very extensive article written by data scientist David Robinson called “Text analysis of Mr. Trump’s tweets confirms he writes only the (angrier) Android half.”
It turns out there’s a lot more information buried inside a tweet than you see on the screen. Fortunately, anyone can access this “metadata” through the Twitter API. One of the pieces of hidden data is which program was used to send the tweet. Some examples are predictable, like “Twitter for Websites,” “Twitter for iPhone,” or “Twitter for Android.” Some are not, like “TwitLonger” or “Periscope.”
Dr. Robinson showed that there are undeniable differences between the tweets sent from “Twitter for Android” and all the other sources (mostly iPhone and from a web browser). For example, the ones from the Android phone are a lot angrier and more negative. Robinson used a very sophisticated method of testing for 10 different moods in addition to sentiment. One of his conclusions is that the Android tweets used “about 40-80% more words related to disgust, sadness, fear, anger, and other ‘negative’ sentiments than the iPhone account does.” Since Mr. Trump had been photographed many times using an Android device, the conclusion was that he was writing those tweets, but that his staff were writing the others. In other words, someone else wrote that self-congratulatory tweet above.
The Atlantic’s Andrew McGill went a little further into the weeds after Dr. Robinson’s article came out, showing that as the campaign went into high gear, fewer tweets came from the Android phone:
This agreed with reports that Mr. Trump’s campaign staff were clamping down on how often he sent out tweets himself, and were instead writing for him on the @realDonaldTrump account.
The nerd in me was left with one burning question: can we formally demonstrate there really is more than one author of the @realDonaldTrump tweets? I decided to test this, using machine learning, under the assumption that if there really are multiple authors, a classification model that is trained on only a portion of the tweets will be able to distinguish the author on the rest of them at much better than the 50% level.
The First Model showed it can be done.
You can read all the gory details about my original model on this page, but here’s a short synopsis. Any brief perusal of @realDonaldTrump tweets from 2015-16 will show that many tweets from each device had their own unique signatures. For example, many of those from the Android phone had repeated punctuation marks (e.g. “…”, “!!!”), or very-short two-word sentences at the end of a tweet (e.g. “So sad!”). On the other hand, links and URLs were almost invariably present only in the iPhone tweets. So as features, I counted the number of sentences per tweet, the number of words per sentence, the number of different kinds of punctuation marks, number of URLs per tweet, etc. Following Dr. Robinson’s work, I also used the results of sentiment analysis as a feature.
For linguistic features, I used TF-IDF to build a dictionary of word n-grams (from 2 to 6). I also made the assumption that while the content (vocabulary) of tweets will change over time, the syntactic structure of an author will not. So I converted each tweet to its parts of speech, including named-object recognition, and built TF-IDF word n-grams out of these as well. In the end, I trained two classifiers – one that used the actual word n-grams, and the other that used the parts-of-speech n-grams, as well as all the other features I listed above.
There are 6,770 tweets from @realDonaldTrump between the day Mr. Trump launched his campaign until the day he was sworn into office. Of these, 2/3 (4,404) were posted with an Android device. Assuming these were written by Mr. Trump himself, and the rest were authored by his staff, I trained Logistic Regression classifiers to distinguish the two classes. I used 80% for training, and 20% for validation, and achieved an accuracy of about 96% on the validation set. In its final form, this model went live the day Mr. Trump was sworn into office.
After mid March 2017, @realDonaldTrump tweets were no longer made from any Android devices. I went back and used the 256 Android and iPhone tweets made between the inauguration and March 8, 2017 as a special testing set. My classifier reported 98% accuracy on these data. The fact that classification was this accurate strongly suggests that the initial hypothesis is correct, that is, Mr. Trump wrote the Android tweets and other people (i.e. his staff) wrote the rest.
If it works, why start over?
There are a number of things I haven’t been entirely happy with since deploying that first machine-learning model. Here are some of them, in no particular order.
- The part-of-speech and named-entity taggers have trouble with capitalized words. Over time, Mr. Trump has increasingly used capitalization to show emphasis, which makes the taggers report proper nouns, people, places, etc. far more often than they should. As such, I have less confidence in the reliability of the model that uses grammatical structure.
- Both models place so much emphasis on the presence of a URL or emoji that any tweet containing one of these is almost invariably marked as written by Mr. Trump’s staff. I did the experiment of taking an angry-sounding Android tweet and adding a URL, and indeed, the probability that he wrote it dropped from over 98% to under 2%. A common argument is that he never used to post links or emoji in his tweets, so therefore he never will. I don’t believe that is a realistic assumption.
- Twitter went from 140 to 280 characters in November of 2017. With extra space, the number of sentences and their respective lengths would doubtlessly change. Therefore these features became much less reliable.
- Word n-grams can’t properly handle typos, or the newer use of capitalization for emphasis, and may be feeding blank features into the model that change its output.
- Part of the beauty of machine learning is that the machine does the learning on its own. I know, that sounds tautological. When we feed in engineered features, we may be biasing the model to primarily use those rather than finding its own correlations and patterns. Given how I trained the models to use many such engineered features, I’ve become less confident that the classifiers really learned the linguistic patterns of the various authors.
- Logistic regression is a linear combination of features, and unless you specifically engineer non-linear combinations, it can’t find them.
Requirements for a new model
- URLs and emoji should be removed, so the model digs into the linguistics.
- All text should be lowercase, since we have no way to predict how the authors’ use of capitalization may change over time.
- Hashtags and usernames should be normalized to just use the “#” and “@” symbols, since the particular names in use will change over time.
- From my analysis of time stamps and authorship, I know that Mr. Trump has had periods of increased and decreased activity depending on the day of the week and time of day, thus these should be features.
- Word features may be problematic. As noted above, typos show up as unknown features. But also, I’m uncomfortable with a model that can only work on a limited vocabulary that I set up at one fixed point in time. Some other word feature should therefore be used instead.
- The model needs to find its own non-linear combinations of features, rather than me trying to pre-engineer them.
It’s time for Deep Learning.
Actually, convolutional neural networks. But I’m not there quite yet.
When I developed my initial model, I made the conscious decision not to use a neural network (NN), since my prejudice was that you need huge amounts of data to train them. But in thinking about all the considerations I listed above, I decided that if I were going to train up a new model, neural networks really were the only way to proceed. So I started to do some investigating to see what research is out there on using small corpora.
By the way, a Twitter corpus isn’t necessarily smaller than standard text. Yes, unlike studying essays and book chapters, tweets are limited to 280 characters (it was 140 before Nov 2017). But a dataset with 1,000 tweets per user can have up to 90 single-spaced pages of text equivalent.
It turns out that studying tweets (or “microblogs” more generally) has become quite the cottage industry in the last few years. As just one example, consider the PAN Data Competitions in author profiling, in many of which teams compete to identify authorship traits like gender or regional language dialect in Twitter corpora with only 100 tweets per author. Granted, the corpora contain hundreds of authors, but this gave me hope and inspiration, considering that my training corpus has under 7,000 tweets (effectively more, as I’ll show later).
The technical particulars
In the following discussion, I’m using Python 3.6 and all standard libraries, plus TensorFlow 1.9.0 and Keras 2.2.0. All models were trained with the standard keras.Model.fit(), using binary cross-entropy as the loss function, and a custom implementation of AdamW with periodic warm restarts (using cosine reweighting) as the optimizer. I highly recommend investigating this optimizer; combined with the ELU activation function, my models reached near-peak performance in under 8 training epochs.
My “training set” is the 6,770 Android and iPhone tweets made from the day Mr. Trump launched his campaign (15 Jun 2015) until his inauguration (21 Jan 2017). This set has twice as many Android tweets as iPhone, so I did a randomized stratified split, using 80% as the training set and the remaining 20% for validation. Usernames and hashtags are replaced by “@” and “#” symbols; URLs and emoji are removed, and all text is converted to lowercase.
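As a concrete illustration, a minimal sketch of this normalization step might look like the following (the regex patterns here are my own simplified assumptions, not the exact production code):

```python
import re

def normalize_tweet(text):
    """Normalize a tweet as described above: strip URLs, reduce
    usernames/hashtags to their symbols, drop emoji, lowercase."""
    text = re.sub(r"https?://\S+", "", text)         # remove URLs
    text = re.sub(r"@\w+", "@", text)                # keep only the "@" symbol
    text = re.sub(r"#\w+", "#", text)                # keep only the "#" symbol
    text = "".join(c for c in text if ord(c) < 128)  # drop emoji / non-ASCII
    text = re.sub(r"\s+", " ", text).strip()         # tidy whitespace
    return text.lower()
```

For example, normalize_tweet("Thank you @foo! #MAGA https://t.co/xyz") reduces to "thank you @! #".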
For final testing, I chose a much more extensive corpus than the 256 Android and iPhone tweets from Mr. Trump’s first 2 months in office. Instead, I created (what I hope is) a ground-truthed set, so I could avoid the possibility of Mr. Trump writing an iPhone tweet. For the ground-truth set of tweets Mr. Trump is (almost) guaranteed to have written himself, I use all 9,967 Android tweets sent before his campaign started. To build a set of tweets guaranteed not to have been written by Mr. Trump, I turned to his Director of Social Media, Dan Scavino Jr. Mr. Scavino is the confirmed author of some of the @realDonaldTrump tweets, so I submit that some reasonable percentage of the @realDonaldTrump iPhone tweets are written with Mr. Scavino’s own unique style. This makes his own tweets a fair surrogate for a ground-truthed set of @realDonaldTrump iPhone tweets written by his various staff members. Using the twitter-scraping code developed by the curator of the Trump Twitter Archive, I grabbed all of Mr. Scavino’s tweets from his personal (@DanScavino) and White House (@Scavino45) accounts. There were a total of 5,390 tweets from the day Mr. Scavino joined Mr. Trump’s staff (01 Feb 2016) through the time I did this analysis. Together, these formed a final ground-truth testing corpus of 15,357 tweets.
I’m not a fan of quoting “accuracy” of a model, since this is a highly biased measurement for an unbalanced dataset. Consider that if a model predicted every tweet in my data came from an Android phone, the accuracy would be 67%. What does an 87% accuracy mean, in this case? Instead, I’ve adopted the Matthews Correlation Coefficient (MCC), which provides a much more balanced metric for unbalanced data. Unlike accuracy, an MCC of 0 indicates the classification is no better than random, while an MCC of +1 indicates a perfect match. My minimum accuracy goal is to have no more than 2% of Android tweets classified as coming from an iPhone, and no more than 10% of @realDonaldTrump iPhone tweets classified as Android. This latter choice is because we really don’t know whether Mr. Trump wrote any iPhone tweets, so I want to give that class a larger margin of error. For my unbalanced dataset, this goal translates into a minimum MCC requirement. For those of you who like to think in terms of precision/recall/F1 score, here is the associated table.
| author | precision | recall | f1-score |
|---|---|---|---|
| Staff | 0.97 | 0.85 | 0.90 |
| Mr. Trump | 0.91 | 0.98 | 0.94 |
| avg / total | 0.93 | 0.93 | 0.93 |
When using the larger ground-truth testing corpus, I’d like no more than a 2% error rate for classifying both authors, which corresponds to an MCC of 0.956. That is my ultimate “accuracy” goal.
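As a sanity check on that goal, the MCC can be computed directly from confusion-matrix counts. Here is a short sketch, applied to a hypothetical classifier that reaches exactly 98% recall on both classes of the ground-truth set (9,967 Trump tweets and 5,390 staff tweets):

```python
def mcc(tp, fn, tn, fp):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return num / den if den else 0.0

# 98% recall on both classes of the ground-truth testing set
tp, fn = 9967 * 0.98, 9967 * 0.02   # Mr. Trump's tweets
tn, fp = 5390 * 0.98, 5390 * 0.02   # staff tweets
goal = mcc(tp, fn, tn, fp)          # works out to roughly 0.956
```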
In the following discussion, I will use the sentence “This is a SHORT tweet.” for demonstration. This is what it would look like as an @realDonaldTrump tweet (N.B. this isn’t real; I mocked it up for this post):
What didn’t work: the RNN
Anxious to start, I was also a little overconfident when I decided that a Recurrent Neural Network (RNN) should be all that was needed, since it would learn linguistic patterns by studying the sequential ordering and long-distance relationships of the features in each tweet. There is plenty of literature out there that relies on RNNs in just this way. For a very readable introduction for the layperson, I found you this Bachelor’s Thesis by Filip Lagerholm.
The generic architecture is to recode words as vector representations (e.g. word2vec, GLoVe), and pass the words through various flavors of RNN layers. This is where I always got stuck, because I wanted to avoid a limited dictionary of word-based features, even if I retrained the dictionary on all the words in all 34,000+ tweets in the @realDonaldTrump corpus. Then one day I realized I could try to use character n-grams as the features, instead of words. Consider character bigrams, which are sequential pairs of characters, such that “This is” becomes [“Th”,”hi”,”is”,”s “,” i”,”is”]. Counting all ASCII characters plus a null feature to indicate the end of a tweet, there are 9,206 two-character bigrams possible. This is a finite “vocabulary” out of which all tweets can be built, and which can accommodate anything a new tweet throws at it.
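Extracting these character bigrams takes only a couple of lines of Python (the null end-of-tweet marker here is a literal "\0", which is an implementation assumption):

```python
def char_ngrams(text, n=2):
    """Overlapping character n-grams, with a null character
    appended to mark the end of the tweet."""
    padded = text + "\0"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# char_ngrams("This is") → ["Th", "hi", "is", "s ", " i", "is", "s\x00"]
```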
I first used gensim’s word2vec to create custom 32- and 50-dimensional vector representations of all bigrams in the @realDonaldTrump corpus. Then I built a “vanilla” classifier that embedded a tweet using this dictionary (with padding to 288 characters and masking); fed the data through a bidirectional GRU with 25% dropout; and then generated the prediction with a one-node dense “classification” layer using sigmoid activation.
The results were disappointing, regardless of the word2vec mapping or the number of GRU units. Trained on the dataset without stripping out URLs or emoji and leaving uppercase alone, I could reach a validation accuracy of 92.6% and an MCC of 0.839. However, when I adopt the more robust standard of removing URLs and emoji and using just lowercased text, the best I could achieve was an accuracy of 89.3% and an MCC of 0.764. That’s just unacceptable.
I won’t bore you with the details but I tried very studiously (but thoroughly unsuccessfully) to make an RNN work. Stacking them, adding/removing units, adding/removing dropout, stacking dense NN layers afterward, using unigrams or trigrams…the most improvement I could squeeze out was 1%. Then I remembered that I wanted to use date and time as features also, so I snuck those in by concatenating their floating point values (for our sample tweet, 10 Aug 2015 is a Monday so day-of-week is 1/7, and 2:48 PM rounds up to 15/24) with the RNN output before passing dense layers. Again, no meaningful improvement.
What did work: The CNN.
You know what they say happens when you assume, right? Well, I assumed “everyone knows” that the Convolutional Neural Network (CNN) is for processing images, so why even bother to use it for text analysis? Whoops.
While poking through computational linguistics journals, I came across two papers that opened my eyes to the power of CNNs to study text. Shrestha et al. (2017) served as the basis for the model I ended up adopting, but I think Zhang & Wallace (2017) provides the single-best figure to understand how it works (below). For a more detailed explanation (complete with an animated explanation of convolution) please check out Denny Britz’s very readable blog post.
The basic idea is that convolution filters of different widths will pull out the equivalent of n-gram features as they move their way down the words in a text. In the example above, filters with widths 2, 3, and 4 will create their own bigram, trigram, and 4-gram features as they are sequentially applied (2nd column) down the sentence matrix (1st column). The most important feature from each filter position along the text is extracted (Max Pooling; 3rd and 4th columns), eventually yielding a single vector that represents the most significant features across pairs, triples, and 4-tuples of words (5th column). These are then passed through a 2-node dense layer (last column) to yield a binary score of 0 or 1, corresponding to the two classes.
Genius, no? Dr. Shrestha and her team made a few crucial changes to this basic model: they replaced the words with character n-grams, and instead of using just a few separate filters of a given length, they used a depth m of output filters in the hundreds. Their sketch of their so-named “n-gram CNN” is shown below.
In their extensive analysis, they showed that using 500 filters each of widths 3, 4, and 5, over unigrams or bigrams, outperforms every other standard type of classifier (including RNNs) that they tried, when identifying the author of a new tweet. Interestingly, they also note, “Despite the competitive performance of neural representation techniques in several NLP tasks, there is a lack of understanding about exactly what these models are learning, or how the parameters relate to the input data.” I must admit, while I understand how this method generates n-gram features (the CNN) and selects the most important ones (the Max Pooling), I’m neither clear how it considers short- and long-distance relationships (i.e. whether “This is” is important with respect to “SHORT tweet,” or whether “SHORT” is important with respect to “tweet.”) nor why this works better than RNNs.
While I can’t answer the latter, I made a modification of my own to address the former: I stacked multiple fully-connected dense (NN) layers to connect the different features that come out of the convolutional layers. As shown below, I also sneak in the day and time features by concatenating them with the Max-Pooling features before feeding them into the dense network. The whole process is sketched out below (click to enlarge). Since every model seems to need a fancy name, I’ll refer to this as the N3CNN (N-gram Nested Network using CNN).
“Wait!!” you say, “what about putting an RNN after the CNN layers to find that long-term behavior?” Tried it. Despite success reported in the literature (e.g. Xiao & Cho, 2016), it didn’t work for me either.
N3CNN Model Training and Testing
As is standard practice, I tuned the model by testing a variety of “hyperparameters.” This included a variety of lengths of character n-grams (unigrams, bigrams, and trigrams); the number, widths, and depths of convolutional filters; the number and sizes of stacked dense layers; the activation functions; and the placement and amount of dropout within the N3CNN framework.
In the end, I found that regardless of the other options, using unigrams with one-hot encoding provides by far the best classification performance. For the layperson, one-hot means “a” might be represented by the vector [1,0,0,0,…], “b” by [0,1,0,0,…], and so on. Although I was strongly biased toward using bigrams at the outset, this behavior makes sense. There are only 69 lowercase letters and non-letter ASCII characters, which makes one-hot encoding possible from a memory perspective. By moving away from word2vec feature vectors (for which “a” might be [0.2,0.4,0.1,0.6,…], “b” [0.3,0.3,0.2,0.9,…], etc.), each character takes its own unique place in its own unique layer of the character-to-vector (tensor, really) stack. The convolutional filters no longer need to try to distinguish between close floating-point values of different features over many layers, and can function far more efficiently.
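Those 69 characters are exactly the printable ASCII set minus the uppercase letters (26 lowercase letters, 10 digits, 32 punctuation marks, and the space), so the one-hot tensor encoding can be sketched as follows (the padding length of 288 matches the discussion below; treat the details as illustrative):

```python
import string
import numpy as np

# the 69-character alphabet: lowercase letters, digits, punctuation, space
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " "

def one_hot(text, alphabet=ALPHABET, max_len=288):
    """Encode a normalized tweet as a (max_len, 69) one-hot matrix."""
    mat = np.zeros((max_len, len(alphabet)), dtype=np.float32)
    for i, ch in enumerate(text[:max_len]):
        j = alphabet.find(ch)
        if j >= 0:          # characters outside the alphabet are skipped
            mat[i, j] = 1.0
    return mat
```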
In perusing many of the reports from the PAN Data competitions, authors who have used some variant of the n-gram CNN have found the filter-width combination of [3,4,5] with bigrams to work best, in agreement with Dr. Shrestha’s report. However, in addition to using unigrams, I found that convolutional widths of [2,3,4] gave the best performance here. Surprisingly, this worked even better than using four layers with respective widths of [2,3,4,5]. Go figure. Also, the particular size of each dense layer is less impactful than using more than one.
In trimming down to find the smallest number of trainable parameters that provided high performance, I settled on 288 filters per convolutional layer, sequential dense layers of 96 and 32 neurons, and 25% dropout applied after each dense layer. Note that the number 288 is not entirely random. Twitter now allows up to 280 characters per tweet, not counting URLs and usernames. Adding a marker that a tweet ended, and a single digit for both the day and time features, yields a maximum of 283 features per tweet. Since GPU performance (for training) works better on multiples of 8, I rounded up to 288.
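To make the architecture concrete, here is a minimal sketch of the N3CNN in the modern tf.keras API. My actual implementation used Keras 2.2 with the custom AdamW warm-restart optimizer; plain Adam stands in for it here, so treat this as an illustrative skeleton rather than the deployed code:

```python
import tensorflow as tf
from tensorflow.keras import Model, layers

def build_n3cnn(seq_len=288, n_chars=69, n_filters=288):
    """Three parallel Conv1D 'n-gram' branches, max-pooled, then merged
    with the day/time features and passed through stacked dense layers."""
    chars = layers.Input(shape=(seq_len, n_chars), name="onehot_chars")
    day_time = layers.Input(shape=(2,), name="day_time")

    pooled = []
    for width in (2, 3, 4):                       # the three filter widths
        conv = layers.Conv1D(n_filters, width, activation="elu")(chars)
        pooled.append(layers.GlobalMaxPooling1D()(conv))

    x = layers.concatenate(pooled + [day_time])   # sneak in day and time
    x = layers.Dense(96, activation="elu")(x)
    x = layers.Dropout(0.25)(x)
    x = layers.Dense(32, activation="elu")(x)
    x = layers.Dropout(0.25)(x)
    out = layers.Dense(1, activation="sigmoid")(x)

    model = Model([chars, day_time], out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```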
For training, I used the standard batch fitting of 128 tweets at a time, for a total of 128 epochs. For cosine warm restarting of the learning rate, I used a fixed learning rate for each epoch. The period started with 1 epoch and doubled after each sequential cycle. During training, I monitored the MCC value for the validation set, and saved the model whenever this value increased. I trained each set of model parameters at least 15 times using randomized training and validation sets (cf. cross validation), to ensure the uniformity of performance. With Google’s Colab GPU-enabled testing environment, all the training and trimming only took a few days. For anyone interested, here is the model and the training routines I used.
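The per-epoch learning-rate schedule with doubling restart periods can be sketched as a simple function of the epoch number (the lr_max and lr_min values are placeholders, not the values I tuned):

```python
import math

def warm_restart_lr(epoch, lr_max=1e-3, lr_min=1e-5, period0=1):
    """Cosine-annealed learning rate with warm restarts; the cycle
    length starts at period0 epochs and doubles after each restart."""
    period, start = period0, 0
    while epoch >= start + period:    # walk forward to the current cycle
        start += period
        period *= 2
    frac = (epoch - start) / period   # position within the cycle, in [0, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * frac))
```

With period0=1, restarts land on epochs 0, 1, 3, 7, 15, and so on; a Keras LearningRateScheduler callback can apply this function once per epoch.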
Using the optimal hyperparameters described above, the best N3CNN model reached an MCC value of 0.985 on the training data, and 0.892 on the validation set. This corresponds to an accuracy on the validation data of 94.7%. Of course, the proof is really in the putting (pardon the pun) of the model to the test on the specially-reserved ground-truth testing set. But before showing that, a word about reporting any statistic from a classifier.
The input data is binary: the tweet was either written on an Android device or not, or for my purposes, written by Mr. Trump or a member of his staff. But the N3CNN model outputs a floating-point “score” between 0 and 1. The easy way to check accuracy (using any metric) is to round that score, so that anything under 0.5 becomes a 0, and everything else a 1. Obviously, that throws away a lot of information. But the score doesn’t represent a true probability. One must construct a function that converts from model score to empirical probability. In the best of all possible worlds, the score will equal the probability. In the worst case, the relationship will be incredibly messy, which will tell its own interesting story.
Here is the conversion function for the predictions of 22,465 tweets, constructed from the trained model predictions for all Android and iPhone tweets from @realDonaldTrump through 09 March 2017, and all tweets from Mr. Scavino since he joined Mr. Trump’s staff in 2016.
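The figure shows the function I actually fit; to illustrate how such a score-to-probability mapping can be built, one can bin the model scores and take the empirical fraction of Trump-written tweets in each bin. A simplified sketch (the bin count is arbitrary):

```python
import numpy as np

def empirical_probability(scores, labels, n_bins=20):
    """For each score bin, the empirical probability is the fraction of
    tweets in that bin whose true label is 1 (written by Mr. Trump)."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    probs = np.full(n_bins, np.nan)               # NaN marks empty bins
    for i in range(n_bins):
        mask = (scores >= edges[i]) & (scores < edges[i + 1])
        if mask.any():
            probs[i] = labels[mask].mean()
    return centers, probs
```

A smooth curve fit through the (center, probability) pairs then serves as the conversion function.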
Using this conversion function, I will label any tweet with more than 50% probability as being written by Mr. Trump, and under 50% as being written by his staff. On the 15,357 tweets in the ground-truth test dataset, the N3CNN predictions have an MCC of 0.961, which is higher than my optimal testing goal of 0.956. This corresponds to a traditionally-defined accuracy of 98.2%. Here’s the corresponding Precision/Recall/F1 table too:
| author | precision | recall | f1-score | support |
|---|---|---|---|---|
| Staff | 0.98 | 0.97 | 0.97 | 5390 |
| Mr. Trump | 0.98 | 0.99 | 0.99 | 9967 |
| avg / total | 0.98 | 0.98 | 0.98 | 15357 |
More specifically, this model has a false positive rate (classifying a tweet written by a staffer as coming from Mr. Trump) of 1.78%, and a false negative rate (classifying a tweet written by Mr. Trump as coming from a staffer) of 1.76%. Since these are essentially the same, I conclude that the formal accuracy is an appropriate measure to report to users, that is, this model has an average accuracy of 98% when identifying the author of a new, unseen tweet.
Note also that this testing set includes 1,863 tweets from Mr. Scavino that are between 140 and 280 characters. Of these, only 1.8% were mis-classified as being written by Mr. Trump, which is actually less than the 3.2% false-positive rate for the other 3,527 tweets that are 140 characters and under. This shows that the model is robust and accurate for tweets of all lengths.
The new state of the art?
As I noted earlier, there are many features that journalists and enthusiasts use to identify the author of @realDonaldTrump tweets. Again, the presence of a URL (images, press release, video, etc.) or emoji has been taken to mean that a staffer wrote the tweet, while seemingly random capitalization is used to conclude that Mr. Trump wrote it. By stripping out these features, my N3CNN model focuses on the linguistic structure of the authorship, which I argue is much more robust under the assumption that Mr. Trump has already or will start using URLs, or that his staff are trying to impersonate him with capitalization and exclamation points.
But is this model really any better than the logistic-regression classifier this website has been using since its inception? After all, that logistic-regression classifier yielded an accuracy of 98.2% on the smaller test set of 256 tweets. A simple apples-to-apples comparison is to compute how my original logistic-regression model fares on the ground-truth testing set used for N3CNN. On that set, the original model has an MCC of 0.901, an accuracy of 0.953, and this precision/recall/F1 table:
| author | precision | recall | f1-score | support |
|---|---|---|---|---|
| Staff | 0.92 | 0.96 | 0.94 | 5390 |
| Mr. Trump | 0.97 | 0.95 | 0.96 | 9967 |
| avg / total | 0.95 | 0.95 | 0.95 | 15357 |
With a false-positive rate of 4.4%, and a false-negative rate of 4.8%, the logistic-regression model seems to under-perform the new N3CNN model. As a statistical measure, I performed McNemar’s test, which establishes the likelihood that the results of the old and new classifiers are statistically different. Here is the contingency table:
| | old classifier correct | old classifier wrong |
|---|---|---|
| new classifier correct | 14468 | 617 |
| new classifier wrong | 188 | 84 |
This yields a chi-squared value of 228, which means these two tests are statistically different at the 99% confidence level, with a p-value so tiny my computer just wants to call it zero.
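McNemar’s test uses only the two discordant cells of that contingency table (617 and 188), so the statistic is easy to verify by hand:

```python
def mcnemar_chi2(b, c, correction=True):
    """McNemar's chi-squared statistic from the discordant cells:
    b = new correct / old wrong, c = new wrong / old correct."""
    d = abs(b - c) - (1 if correction else 0)  # optional continuity correction
    return d * d / (b + c)

chi2 = mcnemar_chi2(617, 188)   # roughly 228
```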
As of the deployment date of this new model (August 2018), I have not found any other published classification model (short of manual analyses by forensic linguists) that works this way. Given the significant improvement over an already robust classifier, I believe this new model likely represents the current state-of-the-art in machine-learning author identification of @realDonaldTrump tweets.
However, please know that this doesn’t mean that Deep Learning and neural networks are necessarily better than other traditional machine-learning methods. The winning team from the PAN 2017 Author Profiling Competition used SVC to outperform everyone else, Deep Learning included. As is always the case in Data Science, the best method is application specific.
NOTICE: Any use of this database must be properly acknowledged, e.g. by referencing the website URL.
This database has been built with open-source software, obtained and used via the Apache-2.0 and MIT licensing agreements. All twitter data was obtained according to the Twitter Developer Agreement. All analyses presented in this website, and methods used or created to present these media, are licensed as follows.
Copyright 2017 DidTrumpTweetIt.com
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.