CheckUser and editing patterns

From Devwiki

HaeB

Sockpuppets (multiple accounts operated by the same person), or rather their abuse, are a permanent problem in Wikimedia projects. As a partial remedy, the CheckUser tool was introduced, which allows a few trusted users to examine the IP data of logged-in editors.

An analysis of editing patterns (i.e. of publicly available data) almost always complements the CheckUser data, although it is rarely done in the systematic way known from forensic linguistics and stylometry.

I will highlight some basic concepts from statistics which are implicitly used in analyzing CheckUser results and editing patterns, and also point to some well-known statistical fallacies which have to be avoided.

I will also discuss the significant privacy concerns which are associated with this topic.

This talk will start out by briefly describing the problem of the abuse of sockpuppets (multiple accounts operated by the same person) in wiki communities. It will then lay out the basic features of the CheckUser tool, which was introduced into MediaWiki to enable a few trusted users to examine the IP data of logged-in editors in order to prove such abuse. Using some anonymized real-world examples, I will explain how the tool works and describe some of the knowledge necessary to analyze its results (WHOIS data, dynamic vs. static IPs, etc.).
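One small piece of that background knowledge can be sketched in code. With dynamic IPs, two edits by the same person rarely share the exact address but often share an address range. The following is a minimal sketch in Python, assuming IPv4 and an arbitrarily chosen /24 range; the function name `same_range` and the prefix length are illustrative assumptions, not part of the CheckUser tool, and the addresses are reserved documentation addresses rather than editor data.

```python
import ipaddress

def same_range(ip_a: str, ip_b: str, prefix_len: int = 24) -> bool:
    """Return True if ip_b lies in the /prefix_len network that contains ip_a.

    Hypothetical helper for illustration only; the /24 default is an
    assumption, since real ISP allocation sizes vary widely.
    """
    # strict=False lets us pass a host address and get its enclosing network.
    network = ipaddress.ip_network(f"{ip_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(ip_b) in network

# Documentation-only addresses (RFC 5737), not real editor data.
print(same_range("192.0.2.17", "192.0.2.200"))   # True
print(same_range("192.0.2.17", "198.51.100.4"))  # False
```

A shared range is only weak evidence on its own: a large provider may assign addresses from the same range to many unrelated customers.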

Existing results from forensic linguistics and stylometry (research fields with a long history) suggest that public information from a user's edit history can identify sockpuppets with a high degree of accuracy in many cases. In fact, such public data has long been used on Wikipedia as sockpuppet evidence in a manual, non-systematic way, complemented by the CheckUser results. Wikipedians are beginning to analyze editing patterns more systematically using more sophisticated tools, and I will argue that there is potential for many more such data-mining tools, which raise significant privacy concerns.
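To give a flavour of what a stylometric comparison can look like, here is a deliberately simplified sketch in Python. It assumes a tiny, hand-picked list of English function words and compares two texts by the cosine similarity of their relative frequencies. The word list and the names `profile` and `cosine_similarity` are illustrative assumptions; this is not a tool used on Wikipedia, and published stylometric methods use far richer features.

```python
import math
import re
from collections import Counter

# Hypothetical, very short feature list; chosen only for illustration.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "it", "however", "also"]

def profile(text: str) -> list[float]:
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Made-up sample sentences standing in for two accounts' edit summaries.
text_a = "However, it is also true that the source is in the archive."
text_b = "It is also the case that the article is in need of sources, however."
print(cosine_similarity(profile(text_a), profile(text_b)))
```

A high score between two short samples means little by itself; the statistical caveats discussed in the talk apply to any such measure.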

I will try to give a brief overview of the statistical concepts which are (mostly unconsciously) used by Wikipedians when examining cases of suspected sockpuppetry by means of CheckUser data and editing pattern analysis. I will also mention some well-known statistical fallacies which have to be avoided in this process (prosecutor's fallacy, defendant's fallacy, selection bias).
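The prosecutor's fallacy can be made concrete with Bayes' theorem. The sketch below, in Python, uses entirely made-up numbers: it assumes that 1 in 1000 pairs of accounts under consideration really belong to the same person, that a "match" (say, a shared IP range) shows up for 90% of true pairs, and that it shows up by chance for 1% of unrelated pairs. The function name and all three figures are assumptions chosen for illustration.

```python
def posterior_same_person(prior: float,
                          p_match_if_same: float,
                          p_match_if_different: float) -> float:
    """P(same person | match), computed with Bayes' theorem."""
    numerator = p_match_if_same * prior
    evidence = numerator + p_match_if_different * (1.0 - prior)
    return numerator / evidence

# Made-up illustrative numbers, not measurements from any wiki.
posterior = posterior_same_person(prior=0.001,
                                  p_match_if_same=0.9,
                                  p_match_if_different=0.01)
print(round(posterior, 3))  # 0.083
```

The fallacy would be to read the 1% chance of a coincidental match as a 99% probability of sockpuppetry. Under these assumed numbers, the probability that a matching pair is the same person is only about 8%, because unrelated pairs vastly outnumber true ones.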

The last part will describe some more of the community issues and privacy concerns that CheckUser creates, with emphasis on the privacy policy of the Wikimedia Foundation (currently being rewritten) and the point of view of German/EU privacy laws.

Average familiarity with wikis and Wikipedia's community processes (such as account blocking), combined with a basic understanding of IP addresses, should suffice to understand most of the talk. Some basic knowledge of statistics will be helpful for the brief statistics part.