Characteristics of Datasets for Fake News Detection to Mitigate Domain Bias

  • Linshuo Yang, Kyushu University
Keywords: BERT, cross-domain, fake news detection


“Fake news”, news that intentionally contains false information, has become common and often causes social disruption. Automatic detection of fake news has been studied extensively. Classification accuracy is improving, but a major challenge for practical application remains: models do not work well on news from unknown fields, called “domains”, because of bias caused by differences in words and phrases among domains. Because the news articles to be classified may belong to unknown domains, mitigating this domain bias is crucial for improving the accuracy of cross-domain fake news detection. As a preliminary experiment, we trained a classifier on news articles whose noun phrases were masked, since noun phrases are considered a major source of the bias. Contrary to expectations, however, masking did not improve accuracy. From this preliminary experiment, we derived the hypothesis that pairs of fake and real news on the same topic can mitigate the domain bias. Through comparative experiments, we show that accuracy is higher when a classifier is trained on paired news articles than when it is trained on unpaired ones. This result strongly suggests that a fake news dataset consisting of paired news could be effective for cross-domain detection.
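The two data preparations mentioned in the abstract, masking noun phrases and separating paired from unpaired articles, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the "[MASK]" token, the record fields ("text", "label", "topic"), and the rule that a topic counts as paired when it has both a fake and a real article are all assumptions made for illustration. Noun phrases are supplied by hand here; in practice they would come from a syntactic parser.

```python
from collections import defaultdict

MASK_TOKEN = "[MASK]"  # assumed mask token; the abstract does not name one


def mask_phrases(text, phrases, mask_token=MASK_TOKEN):
    """Replace every occurrence of each given phrase with the mask token.

    Longer phrases are replaced first so that a phrase nested inside a
    longer one does not split it. Matching is case-sensitive.
    """
    for phrase in sorted(phrases, key=len, reverse=True):
        text = text.replace(phrase, mask_token)
    return text


def split_paired_unpaired(articles):
    """Split articles by whether their topic has both a fake and a real item.

    Hypothetical pairing rule: an article is "paired" when its topic
    contains at least one fake and at least one real article.
    """
    labels_by_topic = defaultdict(set)
    for article in articles:
        labels_by_topic[article["topic"]].add(article["label"])

    paired, unpaired = [], []
    for article in articles:
        if labels_by_topic[article["topic"]] == {"fake", "real"}:
            paired.append(article)
        else:
            unpaired.append(article)
    return paired, unpaired


# Toy records, invented for illustration only.
articles = [
    {"text": "The senator visited the new hospital.", "label": "real", "topic": "t1"},
    {"text": "The senator secretly closed the new hospital.", "label": "fake", "topic": "t1"},
    {"text": "A miracle fruit cures every disease.", "label": "fake", "topic": "t2"},
]

masked = mask_phrases(articles[0]["text"], ["The senator", "the new hospital"])
paired, unpaired = split_paired_unpaired(articles)
```

In this toy input, the two articles on topic "t1" form a fake and real pair, while the single article on topic "t2" is unpaired. A classifier such as BERT would then be trained separately on each subset to compare cross-domain accuracy.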


Technical Papers (Information and Communication Technology)