Catching spammy comments with Markov chains

codemojo
4 years ago

I read @jtoomim's ASERT as the new DAA upgrade proposal, and though I'm not tech-savvy enough to comment on the proposal itself, I did find it extremely annoying that the article attracted loads of comments which can, by almost any measure, be classified as spam.

While thinking about how one could deal with this annoyance, I remembered the good ol' Markov chains.

Here's how it works. Given a list of sample comments, for each comment C:

  • manually assign score S to C

  • extract words from C

  • for each pair of consecutive words W1, W2 in C, update the chain so that W1 points to W2; this connection holds the score S12 + S, where S12 is the score that previous comments may already have assigned to this pair of words
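
The steps above can be sketched roughly as follows. This is not the code from the gist: the class name `SpamChain`, the methods `extract_words` and `add_comment`, the tokenizer regex, and the three sample comments with their scores are all assumptions made for illustration. Only the `chain` attribute and the use of `None` as the "start of comment" key follow the session shown in this article.

```python
import re
from collections import defaultdict


class SpamChain:
    """Hypothetical sketch: each word-pair link accumulates comment scores."""

    def __init__(self):
        # chain[w1][w2] -> accumulated score; the key None marks "start of comment"
        self.chain = defaultdict(lambda: defaultdict(int))

    @staticmethod
    def extract_words(comment):
        # Assumed tokenizer: lowercase alphanumeric runs, so "don't" -> ['don', 't']
        return re.findall(r"[a-z0-9]+", comment.lower())

    def add_comment(self, comment, score):
        previous = None
        for word in self.extract_words(comment):
            # S12 + S: add this comment's score to whatever the pair already holds
            self.chain[previous][word] += score
            previous = word


m = SpamChain()
m.add_comment("Wow, thanks for sharing!", -4)   # made-up comments and scores
m.add_comment("Thanks for the article", -3)
m.add_comment("Hi, what does the drift look like?", 1)

print(dict(m.chain[None]))      # {'wow': -4, 'thanks': -3, 'hi': 1}
print(dict(m.chain["thanks"]))  # {'for': -7}
```

The pair ("thanks", "for") appears in two of the made-up comments, so its link ends up holding the sum of both scores.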

To clarify a bit, here are the scores for the very first word in a comment:

In [1]: m.chain[None]
Out[1]:
{'hi': 1,
 'wow': -4,
 'thanks': -4,
 'such': -1,
 'thank': -2,
 [...]
}

So, if a comment starts with the word "hi", 1 is added to its total score, but if it starts with "wow", the score is reduced by 4. By now you can guess that this "anti-spam" system can easily be defeated... well, yes, but I'm assuming a casual spammer won't have the know-how to bypass the trap.

So, we built the chain from the comments on the article mentioned above. Now, let's try to evaluate some made-up comments.

In [2]: m.evaluate_words(['exponential', 'moving', 'average', 'ema', 'sounds', 
                          'like', 'a', 'decent', 'solution'])
                None  --[  -0.6 ]--> exponential
         exponential  --[  1.0  ]--> moving
             average  --[  1.0  ]--> ema
                 ema  --[  1.0  ]--> sounds
                like  --[   1   ]--> a
                   a  --[  0.8  ]--> decent
Out[2]: 4.2

If there's no connection between W1 and W2, we take the average score of all the pairs that have W1 as the first word. So if "like" points only to "a" and "this", and we have to evaluate the unseen pair ("like", "some"), its score is the average of ("like" --[ 1 ]--> "a") and ("like" --[ 0 ]--> "this"), which gives ("like" --[ 0.5 ]--> "some").
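
A minimal sketch of this fallback rule, assuming the chain is a plain nested dict. The helper `pair_score`, this standalone `evaluate_words` function, and the tiny hand-made chain are illustrative, not the gist's code. Giving 0 to a pair whose first word is not in the chain at all is also an assumption, though the session output above seems consistent with it, since some pairs are not printed.

```python
def pair_score(chain, w1, w2):
    """Score of the link w1 -> w2, falling back to the mean of w1's known links."""
    links = chain.get(w1)
    if not links:
        return 0  # assumption: a first word never seen in training adds nothing
    if w2 in links:
        return links[w2]
    return sum(links.values()) / len(links)


def evaluate_words(chain, words):
    """Sum the pair scores along the comment, starting from None."""
    total = 0
    previous = None  # None stands for "start of comment"
    for word in words:
        total += pair_score(chain, previous, word)
        previous = word
    return total


# Hand-made chain mirroring the "like" example above
chain = {
    None: {"hi": 1, "wow": -4},
    "like": {"a": 1, "this": 0},
}

print(pair_score(chain, "like", "a"))                 # 1
print(pair_score(chain, "like", "some"))              # 0.5
print(evaluate_words(chain, ["hi", "like", "some"]))  # 1.5
```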

In [3]: m.evaluate_words(['thanks', 'exponential', 'moving', 'average', 'sounds',
                          'like', 'a', 'cool', 'solution',
                          'thanks', 'for', 'sharing'])
                None  --[   -4  ]--> thanks
              thanks  --[  -1.5 ]--> exponential
         exponential  --[  1.0  ]--> moving
             average  --[  1.0  ]--> sounds
                like  --[   1   ]--> a
                   a  --[  0.8  ]--> cool
                cool  --[  1.0  ]--> solution
            solution  --[  1.0  ]--> thanks
              thanks  --[   -4  ]--> for
                 for  --[   -7  ]--> sharing
Out[3]: -10.7
In [4]: m.evaluate_words(['dude', 'this', 'was', 'hard', 'to', 'follow'])
Out[4]: 3.7
In [5]: m.evaluate_words(['bch', 'price', 'will', 'soar', 'good', 'luck'])
Out[5]: -4.7
In [6]: m.evaluate_words(['i', 'don', 't', 'know', 'much', 'about', 'the',
                          'technical', 'side', 'but', 'this', 'was', 'informative'])
Out[6]: 12.7
In [7]: m.evaluate_words(['good', 'luck', 'with', 'your', 'endeavour', 'sir'])
Out[7]: -4.8
In [8]: m.evaluate_words(['good', 'article', 'shows', 'obviously', 
                          'bch', 'will', 'be', 'crucial', 'without', 'daa'])
Out[8]: -4.1
In [9]: m.evaluate_words(['variations', 'of', 'what', 'do', 'you', 'mean',
                          'change', 'in', 'block', 'interval', 'or', 'hashrate'])
Out[9]: 17.3
In [10]: m.evaluate_words(['the', 'problem', 'is', 'not', 'one', 'and', 'exact'])
Out[10]: 14.9

The complete Python code is here: markov1.py gist


Comments

This is pretty neat. Look into Naive Bayes classifiers, they're much more suitable for the purpose. For us, of course, it's not an option, since... well, sometimes I leave a comment like "wow, nice!" too... not meaning to be a spammer at all :) Also, the idea really fails when you arrive at a comment like: "এটা খুব তথ্যপূর্ণ ছিল স্যার" or even "先生,这非常有用" (where even finding a word boundary is a problem of its own) :) But nice. If we were hiring, I'd probably invite you for an interview. Though we're not even close to making any profit :)

But the markup of the article is brilliant! Exactly what we wanted the articles about code to look like! Bravo!

4 years ago

I wrote some articles on yours.org back in the days when it was active, and I remember spam posts and comments being the biggest issue for me. I see the same pattern here on read.cash - people probably having multiple accounts and upvoting each other, writing nonsense comments, etc. It would be really cool if there was a community here on read.cash, for example "technical", which would have some sort of PageRank and spam filter implemented, so that only comments adding value would be shown under an article.

4 years ago

The problem with that would be that every user would start filtering it and new guys would automatically be in a "mute" mode. But if you let newbies talk then you still have spam comments :) Kind of like a catch-22.

It's actually not as bad as it was about a month ago when we first implemented points and were giving them just for a comment... that was terrible, 100% of comments were pure spam.

4 years ago