The Original Model: How Machine Learning Predicted Who Wrote The Tweets

Note: This is the original description of the original machine-learning model that was in use until August 14, 2018. For the updated deep-learning model now in use, click here.

The short answer

I start with the well-accepted assumption that @realDonaldTrump tweets coming from an Android phone were written by Donald Trump himself, and that during the election, those posted from an iPhone came from someone else (i.e. his staff). I then used a variety of features of each tweet to train logistic regression classifiers (a kind of machine-learning model) to identify whether those features corresponded to an Android or an iPhone tweet. The models are 98.8% accurate in correctly identifying tweets from an Android phone, using a validation set of over 400 tweets that were not used to build or test the models.

I have reported the results of this model for every applicable tweet in the complete @realDonaldTrump and @POTUS archive. This site also checks for new tweets every 4 minutes, passes them through these models, and reports the results. For every tweet except retweets and quotes, the archive reports the probability that the tweet is consistent with Mr. Trump’s own writing. To the best of my knowledge, this was the first machine-learning model implemented specifically to identify whether or not Mr. Trump wrote a tweet.

The long answer

The Machine Learning model

In case you aren’t already in the know, this is going to be a discussion of “machine learning,” which is just a fancy way of saying that a computer program makes predictions based on information you have given to it. A common example is trying to predict home prices. First, you gather information about a large number of homes, such as the square footage, number of rooms, and zip code. These are called “features,” and the purpose of machine learning is to find a mathematical model that can predict home prices as accurately as possible, using some number of these features. Finding that model is called “training,” because we use the known prices of houses and those houses’ features to build the model. To test whether the model does a good job, we then run it on another set of houses, entering their features and comparing the predicted prices to the actual ones.
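If you like seeing things in code, the home-price example above boils down to just a few lines. This is a minimal sketch with made-up numbers, not data from any real housing dataset:

```python
# A minimal sketch of the train/predict/test cycle described above,
# using a toy home-price dataset (all numbers are made up for illustration).
from sklearn.linear_model import LinearRegression

# Features: [square footage, number of rooms]; targets: known sale prices.
X_train = [[1400, 3], [1600, 3], [1700, 4], [1875, 4], [1100, 2]]
y_train = [245000, 312000, 279000, 308000, 199000]

model = LinearRegression().fit(X_train, y_train)  # "training"

# "Testing": predict prices for houses the model has not seen,
# then compare the predictions against the actual sale prices.
X_test = [[1500, 3], [2000, 4]]
predictions = model.predict(X_test)
```

The only difference between this and the tweet problem is that the "price" we predict is a category (Android or iPhone) rather than a number, so we use a classifier instead of a regressor.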

OK, let’s dive in.

The Two Sources of @realDonaldTrump

There’s a lot more information buried inside a tweet than you see on the screen. But anyone can access this “meta-data” through the Twitter API. One of the pieces of hidden data is which program was used to send the tweet. Some examples are predictable, like “Twitter for Websites,” “Twitter for iPhone,” and “Twitter for Android.” Some are not, like “TwitLonger” or “Periscope.”
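The Twitter API delivers this as a `source` field on each tweet, formatted as a small HTML anchor tag. A sketch of pulling the application name out of it (the sample tweet below is illustrative, not a real API response):

```python
# A sketch of extracting the sending application from a tweet's metadata.
# The Twitter API returns the "source" field as a small HTML anchor tag;
# the sample value below is illustrative.
import re

tweet = {
    "text": "MAKE AMERICA GREAT AGAIN!",
    "source": '<a href="http://twitter.com/download/android" '
              'rel="nofollow">Twitter for Android</a>',
}

def source_app(tweet):
    """Strip the anchor tag and return just the application name."""
    match = re.search(r">([^<]+)</a>", tweet["source"])
    return match.group(1) if match else tweet["source"]

app = source_app(tweet)  # "Twitter for Android"
```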

Back in mid-2016, people started proposing that there were two different people tweeting from the @realDonaldTrump account: Mr. Trump and staffers. Data scientist David Robinson wrote a very extensive article called “Text analysis of Mr. Trump’s tweets confirms he writes only the (angrier) Android half,” in which he shows that there are undeniable differences between the tweets sent from “Twitter for Android” and all the other sources (mostly iPhone and from a web browser).

For example, the ones from the Android phone are a lot angrier and more negative. Robinson used a very sophisticated method of testing for 10 different moods in addition to sentiment. One of his conclusions is that the Android tweets used “about 40-80% more words related to disgust, sadness, fear, anger, and other ‘negative’ sentiments than the iPhone account does.”

The Atlantic went a little further into the weeds after Robinson’s article came out, showing that as the campaign went into high gear, fewer tweets came from the Android phone.

This agreed with reports that his campaign staff were clamping down on how often he sent out tweets himself.

I stumbled on all this quite by accident, when reading an @realDonaldTrump tweet that talked about himself in the 3rd person.

But these reports left me with one burning question: can we formally demonstrate that there really is more than one author of the @realDonaldTrump tweets? I decided to test this using machine learning.

Assumptions and data

This project is based on three fundamental assumptions.

  1. Mr. Trump has been using some variant of an Android phone for years. So I assume that any tweet from an Android phone was made by Mr. Trump himself. You can read my subsequent analysis on this assumption here.
  2. Given the result of the sentiment analyses described above, I also assume that tweets posted with iPhones during the Presidential campaign were written by someone else (i.e. staff members).
  3. If there really are different authors of the @realDonaldTrump tweet corpus, then a model trained to distinguish tweets written on the iPhone and Android will correctly classify a new sample that it has not encountered before.

In the complete database of all @realDonaldTrump tweets, 10,790 are either quotes from another person or retweets. Of the remaining corpus as of April 16, 2017, 14,544 tweets were posted using Twitter for Android, 2,490 using Twitter for iPhone, and 13,502 using other methods. I decided to restrict the corpus to tweets made during the Presidential campaign (Jun 16, 2015 to Nov 8, 2016) and to compare only those posted with an Android or an iPhone.

This yields a database of about 6,400 tweets, of which some 4,200 are from an Android (I assume Mr. Trump wrote these) and the remaining 2,200 are from an iPhone (I assume someone else wrote them).

Feature selection

There are a number of possible features to explore in tweets: how often a tweet includes an emoji or URL, the choice and frequency of punctuation, the number and length of sentences, and the use of common words or phrases. I decided to explore all of these.

Sentiment Analysis

First, I tagged each tweet by its sentiment, using the VADER Sentiment Analysis of Social Media Text algorithm. This is an off-the-shelf (but still state-of-the-art) natural-language-processing code that estimates the sentiment present in a tweet. That’s really just a fancy way of saying computers have been trained to do the work, so that no human subjectivity is introduced into the results.

Sentiment analysis is a well-established form of rating the tone or feeling of a text, from 100% negative to 100% positive. For example:

  • “You’re all so wonderfully loving and gracious.” has a highly positive sentiment.
  • “You’re all dressed in clothes.” rates neutral.
  • “You’re all stupid idiotic losers and haters.” has a highly negative sentiment.

VADER stands for “Valence Aware Dictionary for sEntiment Reasoning.” The package is available here, and its rationale is explained in the accompanying paper. In short, it has been trained to extract sentiment from texts in social media like Twitter. It has been verified against 11 different benchmarks, and is far more accurate than human test subjects.

When a tweet is fed into the VADER code, it returns four measurements: the percentages of words that are positive, neutral, and negative, and then a weighted average, which we use as the overall sentiment measure for this website. For the examples above, VADER returns:

  • “You’re all so wonderfully loving and gracious.” returns 0% negative, 24% neutral, and 76% positive, for a compound score of 93% positive.
  • “You’re all dressed in clothes.” returns 0% negative, 100% neutral, and 0% positive, yielding a perfectly neutral score.
  • “You’re all stupid idiotic losers and haters.” returns 82% negative, 12% neutral, and 0% positive, yielding a compound score of 93% negative.

I included all four of these as features for each tweet, since David Robinson found that the iPhone and Android tweets often showed significantly different sentiments.

Style characteristics

A simple method to identify authors is to consider the length of sentences, the use of different punctuation marks, and, since this is Twitter, how often each tweet uses URLs, @usernames, and #hashtags. I used each of these as features.
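These style features are just counts, so they are easy to compute. A rough sketch (the exact feature set and regexes here are my illustration, not the site’s production code):

```python
# A sketch of simple style features: sentence counts and lengths, plus
# counts of exclamation points, URLs, @usernames, and #hashtags.
import re

def style_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "n_sentences": len(sentences),
        "mean_sentence_len": sum(len(s.split()) for s in sentences)
                             / max(len(sentences), 1),
        "n_exclamations": text.count("!"),
        "n_urls": len(re.findall(r"https?://\S+", text)),
        "n_mentions": len(re.findall(r"@\w+", text)),
        "n_hashtags": len(re.findall(r"#\w+", text)),
    }

features = style_features("Thank you @IvankaTrump! #MAGA https://example.com")
```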

Word choice

A more detailed set of features can be built by considering the frequency of words and phrases in each corpus. I constructed these using the Tf-Idf (term-frequency inverse-document-frequency) method as implemented in Python’s sklearn. Here are the most common words in each corpus:

  • Android: “I”, “The”, “great”, “just”, “people”
  • iPhone: “Thank”, “Trump”, “I”, “MakeAmericaGreatAgain”, “Join”, “great”

And here are some of the most common groups of 2 tokens (also called a bi-gram):

  • Android: “. I”, “of the”, “. @username”, “I will”, “on @username”
  • iPhone: “! #hashtag”, “Thank you”, “Trump <number>”, “url url”, “<number> url”

where <number> indicates that a number was written, url that a URL was used, and likewise for @username and #hashtag. A “token,” by the way, is any individual “thing” in a sentence, i.e. a word or punctuation mark. For features, I ran Tf-Idf on the full corpus for all n-grams of 1 token or greater. To capture as much of the stylistic differences as possible, I chose not to stem words (i.e. remove endings like “s” and “ly”), not to remove stop words (common and nearly meaningless words like “the” and “if”), and not to remove capitalization. I did, however, convert all times, emoji, URLs, and numbers to common tokens. From here on, I call these the “text features.”
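The choices above map directly onto options in sklearn’s `TfidfVectorizer`. A sketch (I cap the n-grams at trigrams here for brevity; the toy corpus is illustrative, with the special tokens already substituted):

```python
# A sketch of the Tf-Idf text features with the choices described above:
# no stemming, keep stop words, preserve capitalization, include n-grams.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Thank you! #hashtag url",          # illustrative, pre-substituted tweets
    "I will be on @username tonight.",
]

vectorizer = TfidfVectorizer(
    lowercase=False,      # keep capitalization
    stop_words=None,      # keep common words like "the" and "if"
    ngram_range=(1, 3),   # unigrams through trigrams
)
X = vectorizer.fit_transform(corpus)  # sparse (n_tweets, n_features) matrix
```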

Grammatical structure

People, places, and organizations often show up in tweets, but these will change over time. To accommodate this, I also transformed each tweet into a sentence of just the parts of speech, using the pos_tag feature in python’s Natural Language Toolkit. As an example, the Android tweet “Sanders says he wants to run against me because he doesn’t want to run against me. He would be so easy to beat!” becomes “PERSON VBZ PRP VBZ TO VB IN PRP IN PRP VBZ RB VB TO VB IN PRP. PRP MD VB RB JJ TO VB!” Each code corresponds to a different part of speech (VB is verb, for example) as listed in the Penn Part of Speech Tags.

The most common tri-gram (three consecutive parts of speech) for the Android phone is “DT JJ NN”, or Determiner Adjective Noun, while for the iPhone, it is “NNP NNP NNP,” or three singular proper nouns in a row. Transforming each tweet into its parts of speech captures the syntactic patterning, rather than the actual word choice. I constructed all n-grams of 2 or more parts of speech and punctuation as features. From here on, I call these the “NER features” because I also used the Stanford Named Entity Recognition converter (nltk.tag.stanford) to identify names, places, and organizations.
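Once each tweet has been transformed into its tag sequence (via `nltk.pos_tag` plus the named-entity substitution), building the part-of-speech n-grams is a simple sliding window. A sketch, reusing the tag sequence from the example above:

```python
# Building part-of-speech n-grams from an already-tagged tweet. The tag
# sequence below is the example from the text; in practice the tags come
# from nltk.pos_tag plus named-entity substitution.
tags = ("PERSON VBZ PRP VBZ TO VB IN PRP IN PRP "
        "VBZ RB VB TO VB IN PRP .").split()

def pos_ngrams(tags, n):
    """All consecutive runs of n part-of-speech tags, joined into strings."""
    return [" ".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]

trigrams = pos_ngrams(tags, 3)  # "PERSON VBZ PRP", "VBZ PRP VBZ", ...
```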

Exploration of Machine Learning models

I followed the standard procedure of segregating 20% of the corpus as a testing set and the other 80% for training, keeping the ratio of Android to iPhone tweets the same in both sets (since there are roughly three times more Android tweets in the corpus). I explored a variety of off-the-shelf classifiers in sklearn, including Gaussian and Multinomial Naive Bayes, Decision Trees, SVM, K-Nearest Neighbors, Logistic Regression, and the ensemble methods AdaBoost and Random Forests. I also trained each model using either the Tf-Idf text features or the parts-of-speech features, but not both, so I could compare whether one was actually more accurate, given my hypothesis that word choice changes over time but parts-of-speech structure does not.
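Keeping the class ratio the same in both sets is called a stratified split, and sklearn does it in one call. A sketch with stand-in data (the 3:1 ratio mimics the real corpus):

```python
# A sketch of the stratified 80/20 split: the Android/iPhone ratio is
# preserved in both the training and testing sets.
from sklearn.model_selection import train_test_split

# Illustrative stand-ins: 0 = Android, 1 = iPhone, roughly 3:1 as in the corpus.
tweets = [f"tweet {i}" for i in range(120)]
labels = [0] * 90 + [1] * 30

X_train, X_test, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.2, stratify=labels, random_state=0
)
```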

After extensive tuning, the SVM model produced the best results, with logistic regression a very close second. I’ve always been a fan of logistic regression: it may not be the fanciest of classifiers, but it’s a tried-and-true workhorse and doesn’t have many hyper-parameters to tune. It also generates prediction probabilities directly from fitting, whereas SVM requires extra steps that are less robust. I therefore selected logistic regression as the classifier for actually predicting the author of tweets. Here are the accuracy results.

  • Logistic Regression using text features yielded an accuracy of 96.5%
  • Logistic Regression using NER features yielded an accuracy of 95.5%
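The direct-probability advantage mentioned above is what `predict_proba` provides. A toy sketch of the text-features pipeline (the four training tweets and labels are invented for illustration):

```python
# Why logistic regression: predict_proba yields class probabilities
# directly from the fit, with no extra calibration step. Toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["MAKE AMERICA GREAT AGAIN!", "Join me tonight!",
               "Crooked Hillary", "Thank you #MAGA"]
train_labels = [1, 0, 1, 0]  # 1 = Android (Trump), 0 = iPhone (staff)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

proba = clf.predict_proba(["Thank you!"])[0]  # [P(iPhone), P(Android)]
```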

Comparing precision and recall for these models is complicated, since I believe Mr. Trump wrote all the Android tweets, but he may have also written (or dictated) some of the iPhone tweets. I therefore expected classification to perform better on Android than iPhone tweets, and this was indeed the case, as shown below:

  Text features: precision recall f1-score
        Android    0.91     0.98    0.94
         iPhone    0.95     0.82    0.88
        Average    0.92     0.92    0.92 

   NER features: precision recall f1-score
        Android    0.89     0.98    0.93
         iPhone    0.94     0.76    0.84
        Average    0.91     0.91    0.90 

We also see that the models are extremely accurate at correctly identifying Android tweets, which was the goal here. For those who are interested, here are the ROC and Precision-Recall curves for the two feature sets.

Combining the model predictions

Normally, one selects the best-performing model as the model for prediction. However, I still suspect that as time goes on, the people, places, and things being discussed by Mr. Trump and his staff will change, but the basic structure of his writing will not. As one concrete example, “Crooked Hillary” is the 6th most common word bigram in the Android training corpus, yet it didn’t show up before the election and is unlikely to ever show up again. As such, I decided to combine the predictions of both the text and NER models by combining their classification probabilities. Both models share the same 25 features of sentiment and basic structure, after which each uses its own feature set. The predictions are therefore not entirely independent, but treating them as if they were, the probabilities from both models for a given tweet can be combined through a straightforward application of Bayesian inference.

Using the ratio of recalls for Android and iPhone tweets from the testing set as priors, the posterior probability that a given tweet was written by Mr. Trump can be computed by appropriately multiplying the classification probabilities from each model, and normalizing the result:

Posterior Bayesian inference from classifier results
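One standard way to carry out this combination (my sketch of the form implied by the text, under the independence assumption; the exact published formula may differ) is to multiply each model’s likelihood ratio into the prior odds and convert back to a probability:

```python
# A sketch of combining the two classifiers' probabilities under the
# independence assumption. p_text and p_ner are each model's probability
# that Mr. Trump wrote the tweet; prior is the baseline probability that
# a tweet is his (the text describes deriving it from testing-set recalls).
def combine(p_text, p_ner, prior):
    # Posterior odds = prior odds times each model's likelihood ratio;
    # each p_i is converted to a likelihood ratio by dividing out the prior.
    odds = (p_text * p_ner * (1 - prior)) / ((1 - p_text) * (1 - p_ner) * prior)
    return odds / (1 + odds)  # normalize odds back to a probability

p = combine(0.9, 0.9, 0.5)  # two confident, agreeing models reinforce each other
```

A sanity check on the form: if both models report exactly the prior, the posterior equals the prior, i.e. the models contribute no evidence either way.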

After each tweet in the archive, the final posterior probability (i.e. the combination of the two classifiers) is given.


Since the election, there have been 424 @realDonaldTrump tweets posted using an Android device. Using these as a validation set, the models have an accuracy of 98.8%.


The regression model returns the probability that Mr. Trump wrote a tweet, which can range from the tiniest fraction of a percent all the way up to 99.99999%. For readability, I capped these at 1% and 99% (which is also the inherent uncertainty based on validation).

I hope these descriptions of the data and methods have been helpful. Please feel free to contact me with any questions or comments at didtrumptweetit (at)

Copyright 2017

NOTICE: You may use any part of this website, including the archive, machine-learning model ideas, and model predictions, provided you give proper attribution to this website.

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file or any data within this website except in compliance with the License. You may obtain a copy of the License at

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.