I read @jtoomim's ASERT as the new DAA upgrade proposal and, though I'm not tech savvy enough to comment on the proposal itself, I did find it extremely annoying that there were loads of comments on that article which can be classified, by almost any measure, as spam.
As I was thinking about how one could try to solve this annoyance, I remembered good ol' Markov chains.
Here's how it works. Given a list of sample comments, for each comment C:
manually assign score S to C
extract words from C
update the chain so that word W1 points to the next word, W2, in the sentence; this connection between W1 and W2 holds the score S12 + S, where S12 is the score that previous comments may have already assigned to this pair of words (see the sketch right after this list)
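To make this a bit more concrete, here is a minimal sketch of how such a chain could be built (the names extract_words and train are just illustrative; the actual implementation is in the gist linked at the end of the post):

import re
from collections import defaultdict

def extract_words(text):
    # Illustrative word extraction: lowercase and split on anything that is
    # not a letter, so "don't" becomes ['don', 't'] as in the examples below.
    return [w for w in re.split(r'[^a-z]+', text.lower()) if w]

class Markov:
    def __init__(self):
        # chain[w1][w2] accumulates the score of the word pair (w1, w2);
        # the key None stands for "start of comment".
        self.chain = defaultdict(lambda: defaultdict(float))

    def train(self, words, score):
        # Add the comment's manually assigned score S to every consecutive
        # pair of words, including the (None, first word) pair.
        for w1, w2 in zip([None] + words, words):
            self.chain[w1][w2] += score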
To clarify a bit, here are the scores for the very first word in a comment:
In [1]: m.chain[None]
Out[1]:
{'hi': 1,
'wow': -4,
'thanks': -4,
'such': -1,
'thank': -2,
[...]
}
So, if a comment starts with the word "hi", it adds 1 to its total score, but if it starts with "wow", the score is reduced by 4. By now you can guess that this "anti-spam" system can easily be defeated... well, yes, but I'm assuming a casual spammer won't have the know-how to bypass the trap.
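Building the chain is then just a loop over the manually scored samples, something along these lines (the samples list below is made up purely to show the shape of the data):

# Made-up (comment text, manually assigned score) pairs, only to show the data shape.
samples = [('thanks for sharing', -4),
           ('sounds like a decent solution', 1)]

m = Markov()
for text, score in samples:
    m.train(extract_words(text), score)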
So, we built the chain with the comments from the mentioned article. Now, let's try to evaluate some made-up comments.
In [2]: m.evaluate_words(['exponential', 'moving', 'average', 'ema', 'sounds',
'like', 'a', 'decent', 'solution'])
None --[ -0.6 ]--> exponential
exponential --[ 1.0 ]--> moving
average --[ 1.0 ]--> ema
ema --[ 1.0 ]--> sounds
like --[ 1 ]--> a
a --[ 0.8 ]--> decent
Out[2]: 4.2
If there's no connection between W1 and W2, we take the average score of all the pairs having W1 as the first word. So if "like" points to "a" and "this", and we have to evaluate the pair ("like" --[ 0.5 ]--> "some"), the score will be the average of ("like" --[ 1 ]--> "a") and ("like" --[ 0 ]--> "this" ).
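In code, the evaluation with that averaging fallback looks roughly like the method below (a sketch on the Markov class from before, not the exact code from the gist):

    def evaluate_words(self, words):
        # Sum the pair scores over the whole comment. For a pair (w1, w2) that
        # was never trained, fall back to the average score of all pairs that
        # start with w1; if w1 never started any pair, skip it entirely.
        total = 0.0
        for w1, w2 in zip([None] + words, words):
            nexts = self.chain.get(w1)
            if not nexts:
                continue
            score = nexts[w2] if w2 in nexts else sum(nexts.values()) / len(nexts)
            print(f'{w1} --[ {round(score, 1)} ]--> {w2}')
            total += score
        return round(total, 1)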
In [3]: m.evaluate_words(['thanks', 'exponential', 'moving', 'average', 'sounds',
'like', 'a', 'cool', 'solution',
'thanks', 'for', 'sharing'])
None --[ -4 ]--> thanks
thanks --[ -1.5 ]--> exponential
exponential --[ 1.0 ]--> moving
average --[ 1.0 ]--> sounds
like --[ 1 ]--> a
a --[ 0.8 ]--> cool
cool --[ 1.0 ]--> solution
solution --[ 1.0 ]--> thanks
thanks --[ -4 ]--> for
for --[ -7 ]--> sharing
Out[3]: -10.7
In [4]: m.evaluate_words(['dude', 'this', 'was', 'hard', 'to', 'follow'])
Out[4]: 3.7
In [5]: m.evaluate_words(['bch', 'price', 'will', 'soar', 'good', 'luck'])
Out[5]: -4.7
In [6]: m.evaluate_words(['i', 'don', 't', 'know', 'much', 'about', 'the',
'technical', 'side', 'but', 'this', 'was', 'informative'])
Out[6]: 12.7
In [7]: m.evaluate_words(['good', 'luck', 'with', 'your', 'endeavour', 'sir'])
Out[7]: -4.8
In [8]: m.evaluate_words(['good', 'article', 'shows', 'obviously',
'bch', 'will', 'be', 'crucial', 'without', 'daa'])
Out[8]: -4.1
In [9]: m.evaluate_words(['variations', 'of', 'what', 'do', 'you', 'mean',
'change', 'in', 'block', 'interval', 'or', 'hashrate'])
Out[9]: 17.3
In [10]: m.evaluate_words(['the', 'problem', 'is', 'not', 'one', 'and', 'exact'])
Out[10]: 14.9
The complete Python code is here: markov1.py gist