I read @jtoomim's ASERT proposal for the new DAA upgrade, and though I'm not tech-savvy enough to comment on the proposal itself, I found it extremely annoying that the article had loads of comments which could, by almost any measure, be classified as spam.
As I was thinking about how one could solve this annoyance, I remembered good ol' Markov chains.
Here's how it works. Given a list of sample comments, for each comment C:
manually assign score S to C
extract words from C
update the chain so that word W1 points to the next word in the sentence, W2, and store on that W1 --> W2 connection the score S12 + S, where S12 is the score already assigned to this pair of words by previous comments, or 0 if the pair is new (see the sketch right after this list)
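Here's a minimal sketch of the chain-building step, assuming a MarkovChain class with a nested chain dict like the one queried below; the real implementation is in the markov1.py gist, so the names here are just assumptions:

from collections import defaultdict

class MarkovChain:
    def __init__(self):
        # chain[W1][W2] holds the accumulated score of the pair (W1, W2);
        # the first word of a comment is paired with the key None
        self.chain = defaultdict(dict)

    def add_comment(self, words, score):
        # walk the comment pairwise: (None, W1), (W1, W2), (W2, W3), ...
        for w1, w2 in zip([None] + words, words):
            # S12 + S: add this comment's score S to whatever score S12
            # earlier comments already gave the pair
            self.chain[w1][w2] = self.chain[w1].get(w2, 0) + score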
To clarify a bit, here are the scores for the very first word in a comment:
In [1]: m.chain[None]
Out[1]:
{'hi': 1,
 'wow': -4,
 'thanks': -4,
 'such': -1,
 'thank': -2,
 [...]
}
So, a comment that starts with the word "hi" adds 1 to its total score, while one that starts with "wow" loses 4. By now you can guess that this "anti-spam" system can easily be defeated... well, yes, but I'm assuming a casual spammer won't have the know-how to bypass the trap.
So, we built the chain from the comments on the article mentioned above. Now, let's try to evaluate some made-up comments.
In [2]: m.evaluate_words(['exponential', 'moving', 'average', 'ema', 'sounds',
'like', 'a', 'decent', 'solution'])
None --[ -0.6 ]--> exponential
exponential --[ 1.0 ]--> moving
average --[ 1.0 ]--> ema
ema --[ 1.0 ]--> sounds
like --[ 1 ]--> a
a --[ 0.8 ]--> decent
Out[2]: 4.2
If there's no connection between W1 and W2, we take the average score of all the pairs that have W1 as their first word. So if "like" points only to "a" with score 1 and to "this" with score 0, evaluating the unseen pair ("like", "some") yields 0.5, the average of ("like" --[ 1 ]--> "a") and ("like" --[ 0 ]--> "this").
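Sketched as a standalone function over the nested chain dict from the earlier sketch (in the session above it's a method, m.evaluate_words, and the real version may differ in details such as which pairs it prints):

def evaluate_words(chain, words):
    total = 0
    for w1, w2 in zip([None] + words, words):
        successors = chain.get(w1, {})
        if w2 in successors:
            score = successors[w2]
        elif successors:
            # unseen pair: fall back to the average score of all
            # pairs that have w1 as their first word
            score = sum(successors.values()) / len(successors)
        else:
            # w1 was never seen at all, so the pair contributes nothing
            continue
        print(w1, '--[', score, ']-->', w2)
        total += score
    return total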
In [3]: m.evaluate_words(['thanks', 'exponential', 'moving', 'average', 'sounds',
'like', 'a', 'cool', 'solution',
'thanks', 'for', 'sharing'])
None --[ -4 ]--> thanks
thanks --[ -1.5 ]--> exponential
exponential --[ 1.0 ]--> moving
average --[ 1.0 ]--> sounds
like --[ 1 ]--> a
a --[ 0.8 ]--> cool
cool --[ 1.0 ]--> solution
solution --[ 1.0 ]--> thanks
thanks --[ -4 ]--> for
for --[ -7 ]--> sharing
Out[3]: -10.7
In [4]: m.evaluate_words(['dude', 'this', 'was', 'hard', 'to', 'follow'])
Out[4]: 3.7
In [5]: m.evaluate_words(['bch', 'price', 'will', 'soar', 'good', 'luck'])
Out[5]: -4.7
In [6]: m.evaluate_words(['i', 'don', 't', 'know', 'much', 'about', 'the',
'technical', 'side', 'but', 'this', 'was', 'informative'])
Out[6]: 12.7
In [7]: m.evaluate_words(['good', 'luck', 'with', 'your', 'endeavour', 'sir'])
Out[7]: -4.8
In [8]: m.evaluate_words(['good', 'article', 'shows', 'obviously',
'bch', 'will', 'be', 'crucial', 'without', 'daa'])
Out[8]: -4.1
In [9]: m.evaluate_words(['variations', 'of', 'what', 'do', 'you', 'mean',
'change', 'in', 'block', 'interval', 'or', 'hashrate'])
Out[9]: 17.3
In [10]: m.evaluate_words(['the', 'problem', 'is', 'not', 'one', 'and', 'exact'])
Out[10]: 14.9
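Putting the sketches together, a hypothetical end-to-end use could look like this; the training comments and the threshold are made up here and would be tuned against hand-scored samples:

m = MarkovChain()
m.add_comment('exponential moving average sounds reasonable'.split(), 2)
m.add_comment('thanks for sharing sir good luck'.split(), -5)

SPAM_THRESHOLD = 0  # assumption: anything scoring below this is flagged
score = evaluate_words(m.chain, 'thanks for sharing'.split())
print('spam' if score < SPAM_THRESHOLD else 'ok', score)  # prints: spam -15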
The complete Python code is here: markov1.py gist
This is pretty neat. Look into Naive Bayes classifiers; they're much more suitable for the purpose. For us, of course, it's not an option, since... well, sometimes I leave a comment like "wow, nice!" too... not meaning to be a spammer at all :) Also, the idea really falls apart when you arrive at a comment like "এটা খুব তথ্যপূর্ণ ছিল স্যার" ("this was very informative, sir") or even "先生,这非常有用" ("sir, this is very useful"), where even finding a word boundary is a problem of its own :) But nice. If we were hiring, I'd probably invite you for an interview. Though we're not even close to having any profit :)
But the markup of the article is brilliant! Exactly what we wanted articles about code to look like! Bravo!