Catching spammy comments with Markov chains

I read @jtoomim's ASERT as the new DAA upgrade proposal and, though I'm not tech-savvy enough to comment on the proposal itself, I found it extremely annoying that the article drew loads of comments that can, by almost any measure, be classified as spam.

As I was thinking about how one could try to solve this annoyance, I remembered the good ol' Markov chains.

Here's how it works. Given a list of sample comments, for each comment C:

1. manually assign a score S to C;

2. extract the words from C;

3. for each consecutive pair of words (W1, W2) in C, update the chain so that the edge from W1 to W2 holds the score S12 + S, where S12 is whatever score previous comments have already accumulated on that pair.
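The steps above can be sketched in a few lines of Python. This is a minimal reconstruction of the idea; the class and method names (`Markov`, `train`) are my own guesses, not necessarily what the linked gist uses:

```python
from collections import defaultdict

class Markov:
    def __init__(self):
        # chain[w1][w2] holds the accumulated score of the edge w1 -> w2;
        # the key None marks "start of comment".
        self.chain = defaultdict(lambda: defaultdict(int))

    def train(self, words, score):
        # Walk consecutive word pairs and add this comment's score S
        # to each edge, on top of whatever earlier comments left there (S12).
        prev = None
        for word in words:
            self.chain[prev][word] += score
            prev = word

m = Markov()
m.train(['hi', 'nice', 'read'], 1)   # a comment we manually scored +1
m.train(['wow', 'nice'], -4)         # a comment we manually scored -4
```

After these two calls, `m.chain[None]` maps each possible first word to its accumulated score, which is exactly the structure inspected below.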

To clarify a bit, here are the scores for the very first word in a comment:

In [1]: m.chain[None]
Out[1]:
{'hi': 1,
 'wow': -4,
 'thanks': -4,
 'such': -1,
 'thank': -2,
 [...]
}

So, if a comment starts with the word "hi", it adds 1 to its total score, but if it starts with "wow", the score is reduced by 4. By now you can guess that this "anti-spam" system can easily be defeated... well, yes, but I'm assuming a casual spammer won't have the know-how to bypass the trap.

So, we build the chain from the comments on the article mentioned above. Now let's evaluate some made-up comments.

In [2]: m.evaluate_words(['exponential', 'moving', 'average', 'ema', 'sounds',
                          'like', 'a', 'decent', 'solution'])
None --[ -0.6 ]--> exponential
exponential --[ 1.0 ]--> moving
average --[ 1.0 ]--> ema
ema --[ 1.0 ]--> sounds
like --[ 1 ]--> a
a --[ 0.8 ]--> decent
Out[2]: 4.2

If there's no connection between W1 and W2, we take the average score of all the pairs having W1 as the first word. So if "like" points only to "a" and "this", and we have to evaluate the unseen pair "like" --> "some", its score will be the average of ("like" --[ 1 ]--> "a") and ("like" --[ 0 ]--> "this"), i.e. 0.5.
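That averaging fallback can be sketched like this. It's a standalone version under my own assumptions about the data structure; the author's `evaluate_words` also prints each transition, which this sketch omits:

```python
def evaluate_words(chain, words):
    # Sum edge scores along the comment; when w1 has known edges but
    # none to w2, fall back to the average of w1's edge scores.
    total = 0.0
    prev = None  # None marks "start of comment"
    for word in words:
        edges = chain.get(prev, {})
        if word in edges:
            total += edges[word]
        elif edges:
            total += sum(edges.values()) / len(edges)
        # a w1 that was never seen at all contributes nothing
        prev = word
    return total

# The "like" example from the text: known edges score 1 and 0, so the
# unseen pair like -> some falls back to their average, 0.5.
chain = {'like': {'a': 1, 'this': 0}}
```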

In [3]: m.evaluate_words(['thanks', 'exponential', 'moving', 'average', 'sounds',
                          'like', 'a', 'cool', 'solution',
                          'thanks', 'for', 'sharing'])
None --[ -4 ]--> thanks
thanks --[ -1.5 ]--> exponential
exponential --[ 1.0 ]--> moving
average --[ 1.0 ]--> sounds
like --[ 1 ]--> a
a --[ 0.8 ]--> cool
cool --[ 1.0 ]--> solution
solution --[ 1.0 ]--> thanks
thanks --[ -4 ]--> for
for --[ -7 ]--> sharing
Out[3]: -10.7

In [4]: m.evaluate_words(['dude', 'this', 'was', 'hard', 'to', 'follow'])
Out[4]: 3.7

In [5]: m.evaluate_words(['bch', 'price', 'will', 'soar', 'good', 'luck'])
Out[5]: -4.7

In [6]: m.evaluate_words(['i', 'don', 't', 'know', 'much', 'about', 'the',
                          'technical', 'side', 'but', 'this', 'was', 'informative'])
Out[6]: 12.7

In [7]: m.evaluate_words(['good', 'luck', 'with', 'your', 'endeavour', 'sir'])
Out[7]: -4.8

In [8]: m.evaluate_words(['good', 'article', 'shows', 'obviously',
                          'bch', 'will', 'be', 'crucial', 'without', 'daa'])
Out[8]: -4.1

In [9]: m.evaluate_words(['variations', 'of', 'what', 'do', 'you', 'mean',
                          'change', 'in', 'block', 'interval', 'or', 'hashrate'])
Out[9]: 17.3

In [10]: m.evaluate_words(['the', 'problem', 'is', 'not', 'one', 'and', 'exact'])
Out[10]: 14.9

The complete Python code is here: markov1.py gist
