COVID-19 Fake News and Health Misinformation Detection on Twitter
Project Overview
During the COVID-19 pandemic, social networks have seen a substantial amount of false news, including discussion of numerous false remedies for COVID-19 that are extremely dangerous to human health. The Director-General of the World Health Organization has called this an "infodemic" because of the volume of misinformation, disinformation, and "false news" relating to COVID-19. With the enormous amount of COVID-19 news on the Internet, many people find it difficult to assess what is true. Riots and panic buying have also occurred as a result of the spread of "false news".
In this thesis project, I aim to build an automated COVID-19 misinformation detection system and to investigate the value of social network structure compared with a text-based classification approach. I implemented a variety of techniques to detect fake news and misinformation in tweets related to COVID-19. The research objective is to classify each tweet as either true or fake using various text feature representation techniques and graph structure, and to compare and evaluate their performance. The project comprises two parts: text-based and graph structure-based fake news detection. In the first part, I apply five different classification algorithms to several text representations, including BoW, TF-IDF and Word2Vec embeddings. In the second part, I represent the data as a graph and learn feature representations for its nodes using the Node2Vec embedding algorithm; these representations can then be used for the downstream classification task.
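As a rough illustration of the text-based part, the sketch below tokenises a few tweets, removes stop-words, and builds BoW and TF-IDF vectors using only the Python standard library. It is not the thesis code: the stop-word list, the example tweets, the function names, and the plain `log(N / df)` weighting are all assumptions made for this example (libraries commonly use smoothed variants).

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "for"}


def tokenise(text):
    """Lowercase the text, split it into alphanumeric tokens, drop stop-words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of token lists."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for tok, count in Counter(doc).items():
            vec[index[tok]] = count
        vectors.append(vec)
    return vocab, vectors


def tf_idf(count_vectors):
    """Weight raw counts by idf = log(N / df), an unsmoothed textbook variant."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0]) if count_vectors else 0
    df = [sum(1 for vec in count_vectors if vec[j] > 0) for j in range(n_terms)]
    return [
        [vec[j] * math.log(n_docs / df[j]) if df[j] else 0.0 for j in range(n_terms)]
        for vec in count_vectors
    ]


# Made-up example tweets, not taken from the thesis dataset.
tweets = [
    "Garlic cures the virus",
    "Vaccines reduce the risk of severe illness",
    "The virus spreads in crowded rooms",
]
docs = [tokenise(t) for t in tweets]
vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)
```

Each row of `counts` or `weights` could then serve as the feature vector for one tweet in a standard classifier. Terms that appear in many tweets (here "virus") receive a lower TF-IDF weight than terms unique to one tweet.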
The main achievements of the project are as follows:
- I learnt how to clean and prepare large-scale data in JSON format and convert it into a format suitable for analysis.
- I wrote a total of 18898 lines of source code (in Python), implementing all the experiments analysed in this thesis.
- I learnt how to build an NLP pipeline, including word tokenisation and stop-word removal for the tweet text.
- I wrote Python functions that encode different text feature representations, including BoW, TF-IDF and Word2Vec.
- I investigated and trained many classification models with different combinations of feature representations.
- I represented the data as a network structure (graph) and modelled it for use in detecting false information.
- I applied continuous feature representations of the graph's nodes to the classification task.
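To give a sense of how node representations can be learnt from a graph, here is a minimal sketch of the biased second-order random walks that Node2Vec uses to sample node sequences. It is an illustration under stated assumptions, not the project's implementation: the user/tweet node names, the edge list, and the `p` and `q` values are hypothetical, and the step that turns walks into embeddings (training a skip-gram model on them) is omitted.

```python
import random


def build_graph(edges):
    """Build undirected adjacency sets from (u, v) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def biased_walk(graph, start, length, p=1.0, q=1.0, rng=random):
    """Sample one Node2Vec-style walk of up to `length` nodes.

    p is the return parameter (weight 1/p for stepping back to the previous
    node); q is the in-out parameter (weight 1/q for moving to a node that
    is not adjacent to the previous node).
    """
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbours = sorted(graph[cur])  # sorted for reproducibility
        if not neighbours:
            break
        if len(walk) == 1:
            # First step has no previous node, so sample uniformly.
            walk.append(rng.choice(neighbours))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbours:
            if nxt == prev:
                weights.append(1.0 / p)   # return to the previous node
            elif nxt in graph[prev]:
                weights.append(1.0)       # stay close to the previous node
            else:
                weights.append(1.0 / q)   # move further away
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


# Hypothetical toy graph: users linked to the tweets they posted or shared.
edges = [
    ("user_a", "tweet_1"), ("user_a", "tweet_2"),
    ("user_b", "tweet_2"), ("user_b", "tweet_3"),
    ("user_c", "tweet_3"), ("user_c", "tweet_1"),
]
graph = build_graph(edges)
rng = random.Random(0)  # fixed seed so the sampled walks are repeatable
walks = [biased_walk(graph, node, length=5, p=0.5, q=2.0, rng=rng)
         for node in sorted(graph)]
```

In the full Node2Vec algorithm, many such walks per node are treated like sentences and passed to a skip-gram model, which yields one continuous vector per node for the downstream classifier.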
Research Question
This project studies the value of both text and social network structure for an automated misinformation detection system. The main research question explored in this thesis is: is misinformation better detected using text, or can the social network graph structure itself be a better detector?
Project Presentation (10 minutes)
Sample - Thesis Report
View a Sample of the Thesis Document in PDF Format