Posted on January 6, 2008
I’ve been experimenting with ways to nip comment SPAM in the bud. Ever since noticing that Google sent me more robots than humans, I’ve been really frustrated.
This is the scenario. A bot starts up and queries Google to see who shows up in the top 20 results for the keyword ‘fud’. It checks whether each result is a blog, a forum, or anything else that can be spammed (e.g. a site with a guestbook) and has a high page rank, then goes to work trying to plant links. As these bots get better built (and more distributed), this becomes a major problem.
Joe Q. Spammer is smart. He’s got access to a few hundred servers spread out all over the globe, most of which are accessed illegally through weak scripts hosted on the respective servers. Joe wants to send out another SPAM campaign, so he submits his keyword and a list of URLs to plant in links. The workload is then divided up amongst all of the servers that Joe controls, which alter the text and alternate the links to make it very difficult to see that the SPAM is originating from one place.
Why am I so irritated about search engines not doing much (at least not obviously doing much) about this? Many of you have a blog, or a forum. I have to watch a few thousand blogs and forums scattered across quite a few servers. When I look at what the CPU time is actually being spent on, I find that a large percentage of it is wasted on blocking, queuing or entertaining SPAM bots. The single biggest culprit for this phenomenon is Google Page Rank.
Imagine, if you would, walking down the street late at night in a really bad neighborhood with $100,000 in cash on you. That’s scary. Now imagine a Google employee walking beside you holding a neon sign that says “HE HAS A HUNDRED GRAND IN CASH!!!” … pointed right at you. As long as Google continues to make its Page Rank available for anyone to see, sites with high ranks will always be singled out for abuse. In essence, the Google Page Rank indicator makes sites victims of their own success.
It is pretty obvious that Google is not going to hide the PR data for sites. Sure, they could, and just use their ranking internally to produce their results, but why would they?
The only sensible means of cutting down on this (short of banning search engines completely) is at the application level. This is next to impossible since the bots don’t indicate Google as a referrer. The more distributed these SPAM nets become, the less effective things like Akismet are going to be. The only way to see patterns that indicate varied SPAM that is likely from a single origin is through piecing it together from months’ worth of logs. Who has time for that?
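To make that piecing together a bit more concrete, here is a rough Python sketch of one way to look for a single origin behind varied SPAM: group comments by the domain they link to, then flag any domain that was pushed from several different IPs. The record shape (an IP paired with the comment text), the function names and the threshold are all assumptions made for illustration; this is not a real log format and not how Akismet works.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical record shape: (source_ip, comment_text). Not a real log format.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def group_by_target_domain(comments):
    """Map each linked domain to the set of source IPs that posted it."""
    sources = defaultdict(set)
    for ip, text in comments:
        for url in URL_RE.findall(text):
            try:
                # Drop trailing sentence punctuation before parsing.
                host = urlparse(url.rstrip(".,;)")).hostname
            except ValueError:
                continue  # malformed URL, skip it
            if host:
                sources[host.lower()].add(ip)
    return sources


def likely_campaigns(comments, min_sources=3):
    """Domains pushed from at least min_sources distinct IPs, busiest first."""
    grouped = group_by_target_domain(comments)
    hits = [(domain, sorted(ips)) for domain, ips in grouped.items()
            if len(ips) >= min_sources]
    return sorted(hits, key=lambda item: (-len(item[1]), item[0]))
```

The altered text and rotating IPs hide the sender, but the links being planted still have to point somewhere, so the target domain is one of the few things a campaign cannot easily vary. A spammer rotating through many domains would get past this, which is why it is only a starting point.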
The only real solution is going to come in the form of yet another blacklist. Akismet really isn’t a blacklist; it just catches a large percentage of junk comments. In an ideal world, SPAM would not even get that far.
I’m kicking around the idea of a plug-in that reports positive bounces from Akismet to a central place. No, not (yet another) SPAM stopping service. This would be a plug-in that a group of friends could install and configure to share IP/Host data from positive bounces in order to form firewall rules. I’ve already developed an abstraction layer that would allow WordPress to safely talk to netfilter/iptables, and I think I’m finally frustrated enough to push it into something usable. Snort-style string matching support also comes to mind.
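As a minimal sketch of the "shared reports in, firewall rules out" step, the Python below takes pairs of (reporting blog, offending IP) and only emits an iptables rule once enough different blogs agree. Everything here is an assumption for illustration: the report format, the function name, the threshold and the choice to block TCP port 80. It is not the WordPress plug-in or the abstraction layer mentioned above, and it only builds command strings; it never runs them.

```python
import ipaddress
from collections import defaultdict


def firewall_rules(reports, min_reporters=2, port=80):
    """Build iptables commands for IPv4 addresses flagged by enough blogs.

    reports: iterable of (reporting_site, ip_string) pairs, a hypothetical
    format for what each friend's plug-in would share.
    """
    reporters = defaultdict(set)
    for site, raw_ip in reports:
        try:
            # Parsing rejects anything that is not a plain IPv4 address,
            # so arbitrary text never ends up inside a firewall command.
            ip = ipaddress.IPv4Address(raw_ip.strip())
        except ValueError:
            continue
        reporters[ip].add(site)
    return [
        f"iptables -A INPUT -s {ip} -p tcp --dport {port} -j DROP"
        for ip in sorted(reporters)
        if len(reporters[ip]) >= min_reporters
    ]
```

Requiring more than one reporter is a deliberate choice in this sketch: a single misconfigured or compromised blog in the group should not be able to get an address blocked everywhere on its own.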
On shared hosting? Why not share your SPAM data with all of the other blogs hosted on your server, and produce firewall rules that your host can import? It seems like a good idea.
I’ll post back with more results next month. I refuse to accept that the only way to cut down on my SPAM and maintain a decent page rank is to seriously hinder what search engines can see.
There are fundamental issues with one entity controlling and harvesting this kind of data (offered as a service, even freely) – most of you will remember a ton of drama surrounding Spamhaus. Something distributed to allow people to turn their sites into cooperative honey pots would be ideal.
Most good things are born out of frustration. I’ll post back as I progress.