Comment by shaggorama on 18/10/2014 at 12:36 UTC*

11 upvotes, 1 direct replies (showing 1)

View submission: mod tool: sockpuppet detector

I don't have the time to build this for you, but I have thought about making something similar myself and can give you a few metrics that would be useful. This way, if you get in contact with someone actually motivated to make this (really wouldn't even be that hard): you can make a more concrete request.

Same phrases

Collapse all of a particular users comments into a single "super document." Convert this document into vector representation by counting the occurrence of each word in the document, removing words that appear in a list of "stop words"[1] (such as 'the', 'an', 'or', etc). Scale word occurence relative normal word usage on reddit by collecting a random corpus of comments from r/all/new (a "background" corpus to help you understand what normal word usage on reddit looks like) and using the TF-IDF[2] transformation for your "document vectors." Then calculate the cosine[3] between the two vectors as your distance score. Values close to 0 indicate more similar users. Calibrate this test by performing it against randomly paired users selected from r/all/new to identify the typical distribution random cosine similarities on reddit (i.e. to determine a meaningful "these users are way too similar" cutoff).

1: http://www.ranks.nl/stopwords

2: http://en.wikipedia.org/wiki/Tf%E2%80%93idf

3: http://en.wikipedia.org/wiki/Cosine_similarity

Same subs frequented

For each comment you collect from a given user, identify which sub it came from. Do this for both users. Determine which user has the smaller number of unique subreddits visited. call this U1. Calculate a modified jaccard similarity[4] for the two users subreddits as (number of unique subreddits the tow users have in common)/(number of unique subreddits commented in by U1)

4: http://en.wikipedia.org/wiki/Jaccard_index

Replies to themselves

For each comment from each user, extract the "parent_id" attribute which identifies the comment they were responding to. Also extract the id of each comment/submission (which will need to have the appropriate "kind" prefix appended to it) created by each user. Calculate the intersection of user1's parent_ids with user2's comment/submission ids. Do this for both users separately, and report both the raw counts and as a percentage of that user's comments.

Timing

For a given user, extract the "created_date" timestamps of all their comments/submissions. extract the hour component from the timestamp and calculate an activity profile for the user. It will look something like this[5] (this plot is broken down by day of the week, but I don't think you need to get this granular). Do the same thing for both users and overly their plots. If you just want a numeric score, scale their profiles so each data point is a "percent of overall activity" instead of a raw count of comments/submissions posted that hour, and then calculate the mean squared error between the two users activity profiles. A lower error means they are active at very similar times. I don't think this is necessarily a good approach and you're probably better off doing this comparison via visual inspection.

5: http://unsupervisedlearning.files.wordpress.com/2012/09/vectorized-daily-acitivty-graphic-rdc.png

Similar arguments supported.

This is a really tough one. Like, a *really* tough one. I think there are a few simpler approaches that can give you the gist of this. a) construct a report on the top N most frequently used words by each user, ignoring stop words. b) Use text summarization to extract the N sentences most representative of all of each users comments. There are many free tools available for automating text summarization, but if you or your bot creator want to do it from scratch, here's a tutorial[6] for an easy approach, and here's an article[7] going into more detail. These approaches won't give you a score, but they will help you understand what these users tend to talk about.

6: http://engineering.flipboard.com/2014/10/summarization/

7: http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html

Likelihood of appearing in same submission

You didn't ask for this one, but I think it's important. Use the same approach as I suggested for comparing subreddit occurrence and extend that to submission ids for the comments you collect (and also each user's submissions). Additionally, given that there does exist overlap in the two users posting to the same submission, calculate the smallest time delta between the two users activity on submissions in which they both appear for all submission in which they appear together. Flag all of these submissions for more detailed investigation and calculate the mean shortest delta. You should also do something similar for the "replies to themselves" analysis: calculate the mean time it took for one user to respond to the other, given that they respond to each other.

"% chance" rating

Again, this is tough. The problem is that to really calibrate this score, you need known cases of sockpuppetry. But we can use outlier analysis as a proxy. For each of the above analyses that spits out a score, concatenate all the scores into a vector. Grab random users from r/all/new and calculate a score vector for each random pair of users so you have a distribution of these score vectors. Calculate the mean and estimate the covariance matrix for this distribution. Call these your "baseline" statistics. Now, when you have a pair of users you are suspicious of, calculate their "score" vector as above and calculate the mahalnobis distance[8] of the score vector relative to your baseline distribution to give you a score of how much of an outlier this pair is relative to what you observe at random. Pro-tip: augment your baseline by continuously scraping random pairs of users and building up your dataset. Scraping users will probably be a slow process, but the more data the better. So when you're not using your tool to investigate suspicious activity, set it to scrape random users so you can build up your baseline data. For any random user you pull down, you can permute their data against all of the other random users you've scraped data for (NB:*random* users. Don't add your "suspicious" users to this data set).