Coronavirus Twitter Analysis

#Coronavirus Twitter Analysis: Separating Signal from Bots

Over the weekend we sifted through the noise of 16 million tweets to find 36 different communities discussing #Coronavirus. Only three of them contained high-quality sources and discussion. From there we selected 60 of the most influential Twitter accounts, which we’re following to stay up to date on how experts are processing this situation.

View the full Twitter list here: #Coronavirus Top Sources

The 2019-2020 Coronavirus outbreak is a rapidly evolving global health issue. After a sustained period of steady and rising markets, the rapid growth of coronavirus infections and the high level of uncertainty have had significant effects on global financial markets, economics, politics and the lives of many, many individuals. We will avoid discussing the personal, public health and political implications in this piece and limit our focus to financial markets analysis and the flow of information.

See our methodology and the embedded list below.

UPDATE: March 3rd

From this analysis we have identified more experts who are manually curating their own lists, such as Summer Steenburg (@scubagirl007) in a March 2nd set of tweets. We have aggregated her list here: https://twitter.com/i/lists/1234915219903832065

But Why Analyze 16 Million Tweets?

Without a clear understanding of what information you’re receiving, as a financial advisor or investor, how can you make good financial decisions? If you’re using Twitter, there is good info, bad info, noise and disinformation. In this piece we look at how to categorize that information and keep you informed of the emerging narratives.

The Atlantic has written a great piece about these challenges: How to Misinform Yourself About the Coronavirus

We first described our approach to using Twitter for financial decisions almost three months ago when we analyzed #FinTwit and determined the communities and micro-influencers within the investing, crypto, wealth management and other subgroups.

When we use Social Media in our investing process we follow two principles:

  1. Identify ‘Primary Source Accounts’ such as @CDCGov to be the first to reach core information
  2. Identify quality ‘Micro-Influencer’ accounts to help curate the opinions and possibilities of what should happen. DO NOT use Twitter dialogue or hashtag frequency to estimate probabilities!

In some sense this outbreak is the perfect Global Macro Risk Event; it is the type of shock that top Global Macro Investors train their whole careers for:

  • Discretionary Demand Shocks – people reducing their travel and entertainment spending
  • Risk Aversion Shock – people’s elevated risk aversion causing them to ‘hunker down’ and put off major investments rather than smooth their consumption over time
  • Supply Shocks – Major industrial regions in China and other places shutting down
  • Political Dynamics – US Election Cycle and strained US / Chinese relations
  • Conspiracy Theories – Deeply ingrained suspicion, despite mixed hard evidence, that Chinese data is systematically manipulated

Our focus here is to help investors curate the firehose of information coming at them on social media. Unfortunately, Twitter’s feed is designed to give you ‘interesting’ and ‘engaging’ content, not the most informative content. Moreover, Twitter tends to keep your information narrow (‘filter bubble’) and does not adequately combat disinformation campaigns by users looking to promote controversy rather than genuine discussion.

Methodology

We expanded on the methodology we introduced in the piece #FINTWIT Micro-Influencer and Community Analysis and brought in additional steps to help infer the sentiment and account type of each Twitter user.

Our approach is fundamentally different from most Twitter analysis pieces. Most pieces we have found look to understand which accounts are the Influencers; these are the accounts with the loudest megaphones and whose followers like to ‘like’ and ‘retweet’ their content the most. This is useful if you are marketing and ‘need to get the word out’ about your product. For us as investors, we care about the quality of the information and the nuance of the debate; there is no nuance with megaphones.

Data Accumulation:

  1. Start with a small list of Primary Source Accounts and targeted hashtags
  2. Download tweets from these accounts and popular tweets with the given hashtags
  3. Continue downloading tweets and biographical information from users mentioned in and replying to these tweets.
  4. Repeat step 3 until a sufficient number of accounts and tweets are downloaded.
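
The accumulation steps above amount to a breadth-first "snowball" crawl outward from the seed accounts. Below is a minimal sketch of that idea. It assumes a hypothetical `fetch_related` callback that downloads one account's tweets and returns the handles mentioned in or replying to them; the function names and the `max_accounts` stopping rule are illustrative assumptions, not the actual pipeline.

```python
from collections import deque
from typing import Callable, Iterable, List


def snowball_crawl(
    seeds: Iterable[str],
    fetch_related: Callable[[str], Iterable[str]],
    max_accounts: int,
) -> List[str]:
    """Expand outward from seed accounts in breadth-first order.

    fetch_related is a hypothetical callback: given a handle, it downloads
    that account's tweets and returns the handles mentioned in or replying
    to them. Crawling stops once max_accounts accounts have been processed.
    """
    queue = deque()
    seen = set()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)

    crawled = []
    while queue and len(crawled) < max_accounts:
        account = queue.popleft()
        crawled.append(account)
        for other in fetch_related(account):
            if other not in seen:  # never queue the same account twice
                seen.add(other)
                queue.append(other)
    return crawled
```

Passing the download step in as a callback keeps the crawl logic separate from any particular Twitter client, so the same loop can be exercised against a small in-memory dictionary before it is pointed at real data.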

Data Processing:

  1. Conversation Detection: Aggregate tweets into threads and detect whether conversations exist in a given thread. A conversation is a thread where “User A replies to User B who replies to User A”.
  2. Topic Detection: Use standard Natural Language Processing pipelines and unsupervised learning algorithms to detect the likely topic of a given tweet. We manually label each topic to best describe the types of tweets contained therein. We then select only the topics most relevant to the subject at hand: Coronavirus.
  3. Community Detection: Using a community detection algorithm called Leiden, we automatically group users and tweets into coherent groups. We manually label each group to best describe their users and conversations.
  4. Influencer Detection: Using page-rank and other influencer algorithms, we can highlight the users who are most responsible for passing information throughout the community and between communities.
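
The conversation rule in step 1 ("User A replies to User B who replies to User A") can be sketched as a check for reciprocal reply pairs within a thread. The `Reply` record and its field names below are hypothetical; the sketch assumes each reply has already been assigned to a thread and that we know who it was written by and who it answered.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Reply(NamedTuple):
    thread_id: str   # id of the thread's original tweet (assumed available)
    author: str      # handle of the user who wrote the reply
    replied_to: str  # handle of the user being replied to


def find_conversations(replies: Iterable[Reply]) -> Set[str]:
    """Return ids of threads in which two distinct users reply to each other."""
    pairs_by_thread: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    conversations: Set[str] = set()
    for reply in replies:
        if reply.author == reply.replied_to:
            continue  # self-replies (tweetstorms) are not conversations
        pairs = pairs_by_thread[reply.thread_id]
        pairs.add((reply.author, reply.replied_to))
        # A conversation exists once the reverse direction has also been seen.
        if (reply.replied_to, reply.author) in pairs:
            conversations.add(reply.thread_id)
    return conversations
```

The check is independent of the order in which replies arrive, which matters when tweets are downloaded in batches rather than chronologically.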

Bot Detection:

We are following some excellent research coming from the Autism community to detect disinformation within the Vaccine / Anti-Vax Twittersphere. For example: Weaponized Health Communication: Twitter Bots and Russian Trolls Amplify the Vaccine Debate. Broniatowski et al. classify Twitter accounts into one of these types:

  • Trolls: Amplify both sides of the argument using techniques called Astroturfing to simulate grass-roots movement. The arguments used in these tweets rely less on fact and more on subjects like God, Country.
  • Content Polluters: Only tweet about the side of the debate with less evidence such as Anti-Vax to create a ‘false equivalence’ and create debate where there is none.
  • Non-bots: A genuine human account with an independent agenda and little automated activity.

To classify these accounts we conducted the following analysis:

  1. Sentiment Analysis: The more polarized an account’s tweets, the less informative they are for our purposes.
  2. Rhetorical complexity: An analysis of the types of arguments being made and the sophistication of the argument
  3. Third Party Bot Risk Score: Botometer is used to create low, medium or high bot risk flags
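
As a rough illustration of step 3, a third-party bot score can be bucketed into low, medium and high risk flags. The cut-offs below (0.3 and 0.6) are hypothetical values chosen only for the example, and the sketch assumes the score has already been retrieved and scaled to the range 0 to 1; it does not show any call to Botometer itself.

```python
def bot_risk_flag(score: float, medium_at: float = 0.3, high_at: float = 0.6) -> str:
    """Map a bot score in [0, 1] to a 'low', 'medium' or 'high' risk flag.

    The default thresholds are illustrative assumptions, not values
    prescribed by Botometer or used in the analysis above.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= high_at:
        return "high"
    if score >= medium_at:
        return "medium"
    return "low"
```

In practice the thresholds would need to be tuned against accounts whose status is already known.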

Intended Results

I try to write down the goals of a project before I spend hours of time and gigabytes of disk space on it. This project will be successful if I can:

  1. Create Twitter lists of the most informative and genuine accounts
  2. Flag and Block all suspected malicious bot accounts (both Trolls and Content Polluters)
  3. Recognize the arguments being made by disinformation (bot) accounts
  4. Track where there is consensus and where there is debate among these informative accounts
  5. Curate a list of primary and secondary websites to monitor for raw information and faster updates

Data

We started with just the core account @CDCGov and pulled in the tweets since Jan 1, 2020. We looked at its three most popular hashtags:

After a couple thousand accounts had run, it was clear that these three other tickers should be added as they were often cited along with the seeds above:

Version 1 Dataset (As of 2020-03-01)

Metric                                   Number
Twitter Accounts                         16,932
Tweets                                   16,412,608
Tickers Mentioned (total / unique)       168,902 / 8,197
Hashtags Mentioned (total / unique)      10.4 million / 824,343
Other Users Mentioned (total / unique)   29.6 million / 2.1 million
Tweet Threads detected*                  3,823,967
Conversations detected**                 19,858

* A Tweet Thread is a set of tweets that starts with an original tweet and can contain many replies, retweets, replies to the replies, etc.

** A Conversation occurs when at least two users reply to each other within the same thread. As you can see, less than 1% of tweets are part of the back-and-forth conversation we care about here.
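
The "less than 1%" figure can be checked directly from the dataset table. The short calculation below uses only the counts reported above; the helper name `pct` is ours.

```python
# Counts taken from the Version 1 dataset table.
TWEETS = 16_412_608
THREADS = 3_823_967
CONVERSATIONS = 19_858


def pct(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


share_of_threads = pct(CONVERSATIONS, THREADS)  # conversations per 100 threads
share_of_tweets = pct(CONVERSATIONS, TWEETS)    # conversations per 100 tweets
print(f"{share_of_threads:.2f}% of threads, {share_of_tweets:.2f}% of tweets")
```

Conversations make up roughly half a percent of detected threads, and an even smaller share when measured against the full tweet count.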

A deeper look at the data

We looked through these millions of tweets to get a sense of what topics are being discussed and to quickly check that we haven’t veered too far from the #IDTwitter and #Coronavirus discussion.

Most common #hashtags

Rank  Hashtag               Count
1.    #coronavirus          639,779
2.    #COVID19              307,280
3.    #2019nCoV             148,225
4.    #China                105,090
5.    #Iran                 79,310
6.    #CoronavirusOutbreak  74,435
7.    #Wuhan                61,628
8.    #SARSCoV2             41,422
9.    #AI                   42,684
10.   #Trump                43,535
11.   #COVIDー19             34,961
12.   #BREAKING             41,239
13.   #COVID2019            32,851
14.   #DemDebate            27,069
15.   #nCoV2019             26,838
16.   #US                   23,287
17.   #auspol               22,756
18.   #MAGA                 23,719
19.   #HongKong             19,091
20.   #pandemic             19,073

Most Discussed Users

Our “relative mention frequency” is the number of ‘@mentions’ divided by the sum of Twitter followers and Twitter friends. We dropped corporate accounts and only included users with at least 6,500 mentions in our dataset.
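
The definition above translates into a one-line ratio plus a minimum-mentions filter. The sketch below assumes a hypothetical `Account` record holding the three counts; the field names and the handling of accounts with no followers or friends (they are skipped) are our assumptions.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Account(NamedTuple):
    handle: str
    mentions: int   # number of @mentions of this account in the dataset
    followers: int
    friends: int    # accounts this user follows


def rank_by_relative_mentions(
    accounts: Iterable[Account], min_mentions: int = 6500
) -> List[Tuple[str, float]]:
    """Rank accounts by mentions / (followers + friends), highest first."""
    scored = []
    for account in accounts:
        audience = account.followers + account.friends
        if account.mentions < min_mentions or audience == 0:
            continue  # below the mention cut-off, or ratio undefined
        scored.append((account.handle, account.mentions / audience))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Dividing by audience size is what lets a small but heavily discussed account outrank a celebrity account that is mentioned often simply because it is large.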

What is not too surprising here, and why we’re doing this analysis, is the heavy mix of political figures with infectious disease experts. There are also several overly political accounts high up in the ranks, which on a cursory inspection seem to be offering little insight and a lot of emotion regarding this outbreak.

Rank  User              Relative Mention Freq
1.    @realDonaldTrump  383,011
2.    @BernieSanders    78,345
3.    @WHO              65,322
4.    @JoeBiden         48,722
5.    @ewarren          48,455
6.    @POTUS            43,484
7.    @CNN              41,021
8.    @MikeBloomberg    37,458
9.    @BNODesk          36,975
10.   @SpeakerPelosi    35,718
11.   @CDCgov           31,847
12.   @YouTube          31,036
13.   @nytimes          28,727
14.   @GOP              28,610
15.   @PersevereEver    28,280
16.   @annableigh       26,736
17.   @TrumpSugar       26,275
18.   @DrTedros         26,000
19.   @PeteButtigieg    25,822
20.   @DonaldJTrumpJr   24,470

Topics:

After filtering the 16 million tweets down to only the 19,858 conversations, we are able to start better understanding the types of information being communicated. Within this universe of 16,932 Twitter accounts we identified 6 topics into which the tweets fit nicely. Below are word clouds where the size of each word is proportional to the relative significance of words / phrases / hashtags in each of the 6 topics.

You can see the variety of topics being discussed. Some are factual, others emotional.

#Coronavirus Subcommunities:

This is where it gets more interesting.

In this step we didn’t look at the content of the tweets; we just looked at ‘who talked to whom’. We ignored @mentions, likes and retweets and only looked at tweets that contained a response from one user to another. From this exercise we located 16,932 accounts that we monitored since Jan 1st 2020. We looked at the communication patterns and ran a clustering algorithm (called Leiden) to identify the 26 main subcommunities with 7,199 accounts.
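
The ‘who talked to whom’ structure can be sketched as a weighted, undirected reply graph. The tweet fields used below (`author`, `in_reply_to`, `is_retweet`) are hypothetical names for whatever the downloaded data provides, and the choice to count every reply between a pair as one unit of edge weight is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def build_reply_graph(tweets: Iterable[dict]) -> Dict[Tuple[str, str], int]:
    """Count replies between each pair of users, ignoring retweets.

    Each tweet is a dict with hypothetical keys 'author', 'in_reply_to'
    (handle replied to, or None) and 'is_retweet'. Mentions and likes are
    never consulted, so they cannot create edges.
    """
    weights: Counter = Counter()
    for tweet in tweets:
        if tweet.get("is_retweet"):
            continue
        author = tweet["author"]
        target = tweet.get("in_reply_to")
        if not target or target == author:
            continue  # not a reply, or a reply to oneself
        edge = tuple(sorted((author, target)))  # undirected: (a, b) == (b, a)
        weights[edge] += 1
    return dict(weights)
```

An edge list of this shape is the kind of input that a Leiden implementation (for example the `leidenalg` package working on an `igraph` graph) would then partition into communities.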

Finding the influencers of the communities we care about

The following communities appear, from the word clouds alone, to be more factual and less emotional. We label them based on the largest word in the cloud: #IDTwitter, #2019nCoV, #coronavirus

Of the 7,199 accounts in the top 26 communities, there are 1,689 accounts in these three clusters. We run page-rank on these to identify the top 20 influencers from each partition. We’ve created the list here:

RankUserBotoMeter Score
1.@HelenBranswell3%
2.@KrutikaKuppalli24%
3.@aetiology6%
4.@Trinitydraco111%
5.@angie_rasmussen9%
6.@BNODesk12%
7.@ComradeMarx122%
8.@LoisMcEwan5%
9.@epsilon314125%
10.@rocza8%
11.@OCDrises13%
12.@PndmcSrvvrsUSA24%
13.@howroute9%
14.@florian_krammer5%
15.@MtLionMama12%
16.@2020WriteIn11%
17.@apbeaton3%
18.@John1559707827%
19.@firefoxx6610%
20.@BillHanage8%
21.@WinnieDynasty5%
22.@_b_meyer5%
23.@COVIDRisk14%
24.@k_wuttt18%
25.@mmPharmD19%
26.@StephaniaBecker24%
27.@ETSshow17%
28.@RainbowtearsLj5%
29.@BogochIsaac8%
30.@Trumpery454%
31.@AmastrisDratwka5%
32.@KindrachukJason16%
33.@avatorl6%
34.@FungalDoc11%
35.@hopeseekr6%
36.@R_H_Ebright6%
37.@arghavan_salles19%
38.@ABetterYouToda19%
39.@DrEricDing7%
40.@still_a_nerd8%
41.@arambaut11%
42.@jselanikio6%
43.@FordPrefect74712%
44.@TIMGOLDFINCH7%
45.@JasonPJG17%
46.@piehead99%
47.@K_B799%
48.@chigrl4%
49.@me_nondependent9%
50.@fredwalton2163%
51.@evdefender7%
52.@GerberKawasaki5%
53.@mostcertainty12%
54.@Frances_Coppola3%
55.@l_lucullus8%
56.@mindedmusically24%
57.@LawrenceLepard12%
58.@ErikSTownsend4%
59.@teasri3%
60.@JasonEBurack3%

Note: The BotoMeter score is our quick way of sanity checking the quality of the user account until we can implement our own domain-specific bot meter.

DISCLAIMER: This is an automated list and is not an endorsement of any particular account. This list was compiled in an automated fashion and may contain fake or malicious accounts. This list is subject to change at any time.

To follow this full list click here:

https://twitter.com/i/lists/1234583647770030080
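
The page-rank step used to pick the top influencers in each cluster can be sketched with a plain power iteration. This is a generic textbook version, not necessarily the implementation behind the list above; the damping factor of 0.85, the fixed iteration count, and the convention that an edge points from the replier to the account being replied to are all assumptions.

```python
from typing import Dict, List


def pagerank(
    out_links: Dict[str, List[str]], damping: float = 0.85, iterations: int = 50
) -> Dict[str, float]:
    """Compute PageRank scores by power iteration.

    out_links maps each user to the users they replied to. Users with no
    outgoing edges spread their score evenly over all users.
    """
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    if n == 0:
        return {}

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not out_links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node, targets in out_links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def top_influencers(out_links: Dict[str, List[str]], k: int = 20) -> List[str]:
    """Return the k highest-ranked users, best first."""
    rank = pagerank(out_links)
    return sorted(rank, key=rank.get, reverse=True)[:k]
```

Running `top_influencers` once per cluster with `k=20` would yield 60 accounts for three clusters, matching the size of the list above.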

Next Steps

We have currently identified some of the most influential accounts but that is only step 1 of 5 we outlined in our Intended Results section above. Look for more updates coming soon!

Extra Credit: Understanding Market Impact and Timing

Google Trends: #coronavirus vs $vix in 2020. A highly coincident relationship. Typically internet search trends are coincident or slightly lagging price moves. See #ebola and #deepwaterhorizon below

[Image: Google Trends chart of #coronavirus searches vs $vix, 2020]

The #Ebola outbreak and scare in 2014 is a useful analogy here to see how searches for #ebolavirus correlated with $vix. You can see the spikes in risk were less significant but the relationship clearly exists.

[Image: Google Trends chart of #ebolavirus searches vs $vix, 2014]

Looking back a bit farther, consider the #deepwaterhorizon disaster in 2010. This obviously doesn’t have the ‘pandemic’ risk of #covid19, but anyone on a trading desk back then remembers the real-time feed of the leaking well and the feeling of existential risk in the markets.

[Image: Google Trends chart for #deepwaterhorizon, 2010]

The links to the Google Trends data can be found here:

Thank you

Leave a comment below or message us on Twitter @PORTFORMER and @SEANKRUZEL. We’d love to hear what you think!

Sean Kruzel

Based in Boston, Sean is a portfolio manager, MIT grad, and founder of Portformer. When he’s not investing or programming, he’s probably sharpening his skis for the next icy, New England snowstorm. Check out his full bio here: https://www.portformer.com/our-team
