Wikipedia:Wikipedia Signpost/2020-06-28/Recent research

Recent research

Wikipedia and COVID-19; automated Wikipedia-based fact-checking

By Matthew Sumpter, Martin Gerlach, and Tilman Bayer

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

Facebook research about automated Wikipedia-based fact-checking using language models

Reviewed by Matthew Sumpter

A group of researchers at Facebook investigated whether computational language models inherently contain knowledge from the source they were trained on.^[1] To test this hypothesis, they conducted preliminary experiments using the FEVER dataset, which "consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from".^[2]

The typical fact-checking pipeline involves storing an external knowledge base, retrieving evidence from it that is relevant to the given claim, and then verifying the claim based on the retrieved evidence. The authors replace this traditional pipeline with a language model, specifically BERT. This approach is computationally cheaper, utilizes widely available technology, and increases the versatility of language models.

Exploratory experiments involved selecting random claims from the FEVER dataset and masking them. For example, the claim Thomas Jefferson founded the University of Virginia after retiring would be masked as Thomas Jefferson founded the University of <MASK> after retiring. The masked claim was given to the language model to predict the masked word, and human reviewers judged whether each prediction supported or refuted the claim. For a sample set of 50 claims, the average accuracy was 55%, which encouraged the researchers to try a computational approach. The full experiments used named-entity recognition (NER) to identify and mask the last named entity in each claim, based on the observation that factuality hinges on the correctness of the two entities and the relationship between them, not on how the claim is phrased. The masked claims were then passed to the BERT language model to predict the missing word, and each claim was automatically classified as support, refute, or not enough information (NEI).
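The masking step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the entity list is assumed to come from some NER step, fill_mask is a hypothetical stand-in for a masked language model such as BERT, and the exact-match comparison is a simplification rather than the classification procedure used in the paper.

```python
from typing import Callable, Sequence, Tuple

MASK = "<MASK>"


def mask_last_entity(claim: str, entities: Sequence[str]) -> Tuple[str, str]:
    """Replace the entity that ends latest in the claim with MASK.

    Returns the masked claim and the entity that was removed.
    The entity list is assumed to be supplied by an NER step.
    """
    best = None  # (end_index, start_index, entity)
    for entity in entities:
        start = claim.rfind(entity)
        if start == -1:
            continue  # entity not present in the claim
        end = start + len(entity)
        if best is None or end > best[0]:
            best = (end, start, entity)
    if best is None:
        raise ValueError("no listed entity occurs in the claim")
    end, start, entity = best
    return claim[:start] + MASK + claim[end:], entity


def prediction_matches(
    claim: str, entities: Sequence[str], fill_mask: Callable[[str], str]
) -> bool:
    """Mask the last entity, ask the model to fill it, compare with the original."""
    masked, gold = mask_last_entity(claim, entities)
    predicted = fill_mask(masked)
    return predicted.strip().lower() == gold.strip().lower()


def toy_fill_mask(masked_claim: str) -> str:
    """Hypothetical stand-in for a masked language model: a fixed lookup table."""
    answers = {
        "Thomas Jefferson founded the University of <MASK> after retiring": "Virginia",
        "Tim Roth was born in <MASK>": "London",
    }
    return answers.get(masked_claim, "")


claim = "Thomas Jefferson founded the University of Virginia after retiring"
masked_claim, removed = mask_last_entity(claim, ["Thomas Jefferson", "Virginia"])
```

A plain match or mismatch of this kind cannot by itself separate a refuted claim from one with not enough information, so a real system would need a further classification step on top of the predicted token.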

Results from these experiments had an accuracy of 49%, which is comparable to the FEVER baseline of 48.8% (although it falls short of the traditional state-of-the-art model's 68.21%). Additionally, the average F1 score of 0.58 for identifying the support class indicates the model was unable to distinguish between the refute and NEI classes. An analysis of the predicted tokens also reveals some insights about the nature of using a language model for fact-checking: a claim such as Tim Roth was born in <MASK> should have predicted 1961, but predicted London. BERT is trained on Wikipedia, and is therefore subject to its stylistic patterns (on Wikipedia, dates of birth are typically found in parentheses, whereas locations are likely to be presented in a claim format). This indicates that the pretraining of the model determines the way in which it should be 'queried' for information. While their results are not comparable to the state of the art, the researchers conclude that their approach has strong potential for improvement, and can lead to stronger and more efficient solutions for generating evidence and masking claims.
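To make the difference between overall accuracy and per-class F1 concrete, here is a small sketch that computes both from a list of gold and predicted labels. The labels below are invented for illustration and are not taken from the paper; the function names are likewise hypothetical.

```python
from typing import Dict, Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of claims whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def per_class_f1(gold: Sequence[str], predicted: Sequence[str]) -> Dict[str, float]:
    """F1 (harmonic mean of precision and recall) computed separately per label."""
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


# Invented toy labels: the classifier leans heavily towards SUPPORTS.
gold = ["SUPPORTS", "SUPPORTS", "REFUTES", "REFUTES", "NEI", "NEI"]
predicted = ["SUPPORTS", "SUPPORTS", "SUPPORTS", "NEI", "REFUTES", "SUPPORTS"]

toy_accuracy = accuracy(gold, predicted)
toy_f1 = per_class_f1(gold, predicted)
```

In this toy case the support class gets a moderate F1 while the refute and NEI classes score zero, which is the kind of imbalance that a single accuracy figure hides.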

How Wikipedia keeps up with COVID-19 research

Reviewed by Martin Gerlach

COVID-19 research in Wikipedia^[3] by Giovanni Colavizza from the University of Amsterdam (available as a preprint on bioRxiv) investigates how editors on Wikipedia find, select, and integrate scientific information on COVID-19 into Wikipedia articles. Given the surge of new scientific publications on COVID-19 – since the beginning of 2020, more than 20,000 new scientific articles have been published around the topic – how do editors keep up with the amount of information, while at the same time ensuring high quality?

To answer this question, the author assembles a corpus representing research on COVID-19 from several publicly available resources such as PubMed, bioRxiv, WHO, etc., comprising more than 60,000 publications in total. To determine whether these publications have been integrated into Wikipedia, the author uses data from Altmetric, which matches citations in Wikipedia articles with known identifiers of publications such as DOIs.
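The idea of matching by identifier can be sketched as a set intersection over normalized DOIs. This is only an illustration under stated assumptions: it is not Altmetric's matching procedure or the author's code, the function names are hypothetical, and the DOIs below are made-up placeholders.

```python
from typing import Iterable

# Common ways a DOI is written in citations; stripped before comparison.
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Lower-case a DOI and strip a leading resolver URL or 'doi:' prefix."""
    doi = raw.strip().lower()
    for prefix in _DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def cited_fraction(corpus_dois: Iterable[str], wikipedia_dois: Iterable[str]) -> float:
    """Share of corpus publications whose DOI appears among Wikipedia citations."""
    corpus = {normalize_doi(d) for d in corpus_dois if d.strip()}
    cited = {normalize_doi(d) for d in wikipedia_dois if d.strip()}
    if not corpus:
        return 0.0
    return len(corpus & cited) / len(corpus)


# Made-up placeholder identifiers, not real publications.
corpus = [
    "10.1234/example.a",
    "doi:10.1234/EXAMPLE.B",
    "https://doi.org/10.1234/example.c",
    "10.1234/example.d",
]
wikipedia = ["https://doi.org/10.1234/Example.B", "10.9999/not-in-corpus"]

fraction = cited_fraction(corpus, wikipedia)
```

Normalizing before comparison matters because the same DOI can appear with different capitalization or as a full resolver URL; publications without any identifier would be missed by this kind of matching.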

Using this approach, the study draws a detailed picture of the editorial work around COVID-19 in Wikipedia. First, editors seem to have been able to cope with the rapidly growing literature. Slightly more than 3% of publications are cited at least once in Wikipedia. Despite a more than 10-fold increase in the number of publications in 2020, this fraction has remained remarkably stable for recent publications in comparison to those from, say, 20 years ago. Second, editors are citing a largely representative sample of the literature in terms of topic diversity. Clustering publications into 7 different topics using an LDA topic model reveals that the coverage of topics in Wikipedia reflects the overall imbalance in the scientific literature (with most research on coronaviruses as well as public health and epidemics). Third, editors seem to follow the same inclusion standards for publications in 2020 as before (see also WP:MEDRS), relying on research that is not only visible and impactful (e.g. mentioned in news and blog posts) but also appears in peer-reviewed and specialized journals (e.g. The Lancet), and avoiding pre-prints. This is revealed through different regression models.

One of the main limitations of this study is that it only considers the citations to scientific publications in Wikipedia articles. Directions for future work therefore include taking into account the content of Wikipedia articles, studying the edit and discussion histories of the respective pages, and comparing coverage with expert reviews on COVID-19.

Briefly

See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.
The Wikimedia Foundation and the Community Data Science Collective have published datasets about COVID-19 and Wikidata. See also "Open data and COVID-19: Wikipedia as an informational resource during the pandemic" in the April Signpost issue.

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

Compiled by Tilman Bayer

"A Quantitative Portrait of Wikipedia's High-Tempo Collaborations during the 2020 Coronavirus Pandemic"

From the abstract:^[4]

"Using 973,940 revisions from 134,337 editors to 4,238 articles, this study examines the dynamics of the English Wikipedia’s response to the coronavirus pandemic through the first five months of 2020 as a 'quantitative portrait' describing the emergent collaborative behavior at three levels of analysis: article revision, editor contributions, and network dynamics. Across multiple data sources, quantitative methods, and levels of analysis, we find four consistent themes characterizing Wikipedia’s unique large-scale, high-tempo, and temporary online collaborations: external events as drivers of activity, spillovers of activity, complex patterns of editor engagement, and the shadows of the future."

COVID-19 mobility restrictions increased interest in health and entertainment topics on Wikipedia

From the abstract:^[5]

"We study how the coronavirus disease 2019 (COVID-19) pandemic, alongside the severe mobility restrictions that ensued, has impacted information access on Wikipedia [...]. A longitudinal analysis that combines pageview statistics for 12 Wikipedia language editions with mobility reports published by Apple and Google reveals a massive increase in access volume, accompanied by a stark shift in topical interests. Health- and entertainment- related topics are found to have gained, and sports- and transportation- related topics, to have lost attention. Interestingly, while the interest in health-related topics was transient, that in entertainment topics is lingering and even increasing. These changes began at the time when mobility was restricted and are most pronounced for language editions associated with countries, in which the most severe mobility restrictions were implemented, indicating that the interest shift might be caused by people's spending more time at home."

"Collective response to the media coverage of COVID-19 Pandemic on Reddit and Wikipedia"

From the abstract:^[6]

"Our results show that public attention, quantified as users activity on Reddit and active searches on Wikipedia pages, is mainly driven by media coverage and declines rapidly, while news exposure and COVID-19 incidence remain high. Furthermore, by using an unsupervised, dynamical topic modeling approach, we show that while the attention dedicated to different topics by media and online users are in good accordance, interesting deviations emerge in their temporal patterns."

"A protocol for adding knowledge to Wikidata, a case report"

From the abstract:^[7]

"Pandemics, even more than other medical problems, require swift integration of knowledge. When caused by a new virus, understanding the underlying biology may help finding solutions. In a setting where there are a large number of loosely related projects and initiatives, we need common ground, also known as a “commons”. Wikidata, a public knowledge graph aligned with Wikipedia, is such a commons and uses unique identifiers to link knowledge in other knowledge bases [...] we describe the process of aligning resources on the genomes and proteomes of the SARS-CoV-2 virus and related viruses as well as how Shape Expressions can be defined for Wikidata to model the knowledge, helping others studying the SARS-CoV-2 pandemic. How this model can be used to make data between various resources interoperable, is demonstrated by integrating data from NCBI Taxonomy, NCBI Genes, UniProt, and WikiPathways. Based on that model, a set of automated applications or bots were written for regular updates of these sources in Wikidata and added to a platform for automatically running these updates."

"Swat: A system for detecting salient Wikipedia entities in texts"

From the abstract:^[8]

"... SWAT, a system that identifies the salient Wikipedia entities occurring in an input document. SWAT consists of several modules that are able to detect and classify on-the-fly Wikipedia entities as salient or not, based on a large number of syntactic, semantic and latent features properly extracted via a supervised process which has been trained over millions of examples drawn from the New York Times corpus. [...]SWAT improves known solutions over all publicly available datasets. We release SWAT via an API that we describe and comment in the paper in order to ease its use in other software."

"WAC: A Corpus of Wikipedia Conversations for Online Abuse Detection"

From the abstract:^[9]

"... we propose an original framework, based on the Wikipedia Comment corpus, with comment-level abuse annotations of different types. The major contribution concerns the reconstruction of conversations, by comparison to existing corpora, which focus only on isolated messages (i.e. taken out of their conversational context). This large corpus of more than 380k annotated messages opens perspectives for online abuse detection and especially for context-based approaches. We also propose, in addition to this corpus, a complete benchmarking platform to stimulate and fairly compare scientific works around the problem of content abuse detection, trying to avoid the recurring problem of result replication. Finally, we apply two classification methods to our dataset to demonstrate its potential."

"Multi-class Multilingual Classification of Wikipedia Articles Using Extended Named Entity Tag Set"

From the abstract:^[10]

"we introduce the Shinra 5-Language Categorization Dataset (SHINRA-5LDS), a large multi-lingual and multi-labeled set of annotated Wikipedia articles in Japanese, English, French, German, and Farsi using Extended Named Entity (ENE) tag set."

"Computational Fact Validation from Knowledge Graph using Structured and Unstructured Information"

From the abstract:^[11]

"Given a Knowledge Graph, a knowledge corpus, and a fact (triple statement), the goal of fact-checking is to decide whether the fact or knowledge is correct or not. Existing approaches extensively used several structural features of the input Knowledge Graph to address the mentioned problem. [...] Our approach considers finding evidence from Wikipedia and structured information from Wikidata, which helps in determining the validity of the input facts. [...] The similarity of input fact with elements of relevant Wikipedia pages has been used as unstructured features. The experiments with a dataset consisting of nine relations of Wikidata has established the advantage of combining unstructured features with structured features for the given task."

References

^ Lee, Nayeon; Li, Belinda Z.; Wang, Sinong; Yih, Wen-tau; Ma, Hao; Khabsa, Madian (7 June 2020). "Language Models as Fact Checkers?". arXiv:2006.04102 [cs.CL]. Accepted in FEVER Workshop (ACL2020)
^ Thorne, James; Vlachos, Andreas; Christodoulopoulos, Christos; Mittal, Arpit (June 2018). "FEVER: a Large-scale Dataset for Fact Extraction and VERification". Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics. pp. 809–819. arXiv:1803.05355. doi:10.18653/v1/N18-1074. S2CID 4711425.
^ Colavizza, Giovanni (2020-05-12). "COVID-19 research in Wikipedia". bioRxiv 10.1101/2020.05.10.087643.
^ Keegan, Brian C.; Tan, Chenhao (2020-06-15). "A Quantitative Portrait of Wikipedia's High-Tempo Collaborations during the 2020 Coronavirus Pandemic". arXiv:2006.08899 [cs.SI].
^ Ribeiro, Manoel Horta; Gligorić, Kristina; Peyrard, Maxime; Lemmerich, Florian; Strohmaier, Markus; West, Robert (2020-05-19). "Sudden Attention Shifts on Wikipedia Following COVID-19 Mobility Restrictions". arXiv:2005.08505 [cs.CY].
^ Gozzi, Nicolò; Tizzani, Michele; Starnini, Michele; Ciulla, Fabio; Paolotti, Daniela; Panisson, André; Perra, Nicola (2020-06-08). "Collective response to the media coverage of COVID-19 Pandemic on Reddit and Wikipedia". arXiv:2006.06446 [cs.SI].
^ Waagmeester, Andra; Willighagen, Egon L.; Su, Andrew I.; Kutmon, Martina; Gayo, Jose Emilio Labra; Fernández-Álvarez, Daniel; Groom, Quentin; Schaap, Peter J.; Verhagen, Lisa M.; Koehorst, Jasper J. (2020-06-04). "A protocol for adding knowledge to Wikidata, a case report". bioRxiv 10.1101/2020.04.05.026336v2.
^ Ponza, Marco; Ferragina, Paolo; Piccinno, Francesco (2019). "Swat: A system for detecting salient Wikipedia entities in texts". Computational Intelligence. 35 (4): 858–890. arXiv:1804.03580. doi:10.1111/coin.12216. ISSN 1467-8640. S2CID 4748467. Eprint version: Ponza, Marco; Ferragina, Paolo; Piccinno, Francesco (2019-05-16). "SWAT: A System for Detecting Salient Wikipedia Entities in Texts". arXiv:1804.03580 [cs.IR].
^ Cecillon, Noé; Labatut, Vincent; Dufour, Richard; Linares, Georges (2020-03-13). "WAC: A Corpus of Wikipedia Conversations for Online Abuse Detection". arXiv:2003.06190 [cs.CL].
^ Shavarani, Hassan S.; Sekine, Satoshi (2020-03-05). "Multi-class Multilingual Classification of Wikipedia Articles Using Extended Named Entity Tag Set". arXiv:1909.06502 [cs.CL].
^ Khandelwal, Saransh; Kumar, Dhananjay (2020-01-05). "Computational Fact Validation from Knowledge Graph using Structured and Unstructured Information". Proceedings of the 7th ACM IKDD CoDS and 25th COMAD. CoDS COMAD 2020. Hyderabad, India: Association for Computing Machinery. pp. 204–208. doi:10.1145/3371158.3371187. ISBN 9781450377386.

Discuss this story

"a claim such as Tim Roth was born in <MASK> should have predicted "1961", but predicted "London"." - At risk of stating the obvious... what's the problem with that autocompletion? There are clearly multiple possible valid completions here, even if some of them are facts no normal human would bring up. But both city of birth and birthdate are valid things in normal conversation as well as syntactically. SnowFire (talk) 17:49, 1 July 2020 (UTC)

CCing Matthew Sumpter who wrote this review and has read the actual paper, but my understanding is that this ambiguity becomes a problem if one wants to use the method to e.g. fact-check the claim "Tim Roth was born in 1944". Regards, HaeB (talk) 02:21, 2 July 2020 (UTC)

Yes, User:HaeB, this is a correct interpretation - while the fact is valid, it is not useful for the task. Matthew Sumpter (talk) 20:40, 24 August 2020 (UTC)

Facebook research about automated Wikipedia-based fact-checking using language models

Forgive me, but I couldn’t help shake my head when I read this. Facebook is the leading source of coronavirus misinformation in the world. Perhaps these "researchers" should focus their efforts on their employer. Viriditas (talk) 22:35, 2 July 2020 (UTC)
