An update, 1/28/2011: It turns out there was a legitimate problem on the site after all! Specifically, it had been hacked by spammers at some point and filled with invisible links & keywords. Skip down to the comments for the details, and be aware that the rant that follows is based on a fairly complete lack of information! (Said information being over a year in coming, but I have to say that hearing it from the man in charge himself pretty solidly exceeds my expectations… and makes me hang my head in shame a little for even thinking of defecting to Bing.)
So, as of sometime within the last few weeks or so, I am a convicted Google Spammer! As you may guess from this tone, I have not actually done or considered any such thing. But as tends to happen when you take all humans out of the loop and put all faith in the smartness of an infallible Algorithm, you throw your fair share of babies out with the bathwater, and today, one of those babies is me. I discovered this (not like I would have gotten an email or anything) when I happened to Google the name of my product to get to its main page (much faster than typing the full URL, which is longish), and it wasn’t there. With some further digging, I found that my entire domain and all subdomains had been blacklisted from Google’s index.
That particular level of digging leads to a tool called Google Webmaster Tools, which, after proving myself the owner of the site, returns this:
This site may be in violation of Google’s quality guidelines. More Details
Pages on your site may not appear in Google search results pages due to violations of the Google webmaster guidelines. Please review our webmaster guidelines and modify your site so that it meets those guidelines. Once your site meets our guidelines, you can request reconsideration and we’ll evaluate your site.
Submit a reconsideration request
These webmaster guidelines cover “quality” (i.e. don’t spam the index)–including such helpful first-grade spam no-nos as stuffing pages with invisible keywords or bogus META tags, serving special fake pages to known search bots, and other stuff that might have worked on Lycos in 1996–as well as various style-guide suggestions, admonishing webmasters to be on their best grammar, and even going as far as discussing size and placement of images on the page. Is a webmaster really expected to perform feng shui to stay in Google’s good graces?
Anyway, I’m at quite a loss to explain why I would be banned from Google, as getting an entire high-ranking site removed from a search index seems like something that would require some pretty big-ticket shenanigans. Of course, this is The Algorithm we’re talking about; there seems to be no indication that a live human was involved in this decision*.
Of course, any of the usual SEO tricks would fit the bill. But I don’t engage in any of that (for here, the boards or the main site), and the only “optimization” I do to this blog’s traffic is to post something interesting once in a while. (Really, since everything on this server is ad-less and free anyway, the only thing More Traffic can get me is a bigger bandwidth bill.) I do know that Google will display warnings / block content if it detects a site has been compromised, but a thorough dig through the server-side files indicates this is not the case, either.
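For the curious, here is roughly what one slice of such a dig could look like: a minimal Python sketch that flags links hidden with inline CSS. It is illustrative only, not the check that was actually run on this server. The `HiddenLinkFinder` class, the `HIDING_HINTS` list and the sample page are all made up for this example, and anything hidden via external stylesheets, scripts, or same-color-as-background text would sail right past it.

```python
from html.parser import HTMLParser

# Inline-style fragments that commonly hide content (illustrative, far from complete).
HIDING_HINTS = ("display:none", "visibility:hidden", "font-size:0")

# Elements that never get a closing tag, so they must not be tracked as "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HiddenLinkFinder(HTMLParser):
    """Collect hrefs of <a> tags that sit inside (or are) inline-hidden elements."""

    def __init__(self):
        super().__init__()
        self.stack = []         # (tag, hidden_by_own_style) for each open element
        self.hidden_links = []  # hrefs found inside hidden elements

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        hidden = any(hint in style for hint in HIDING_HINTS)
        inside_hidden = any(flag for _, flag in self.stack)
        if tag == "a" and (hidden or inside_hidden):
            self.hidden_links.append(attrs.get("href"))
        if tag not in VOID_TAGS:
            self.stack.append((tag, hidden))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements in between.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


# Hypothetical page: one ordinary link, one tucked inside a hidden <div>.
page = (
    '<p>Hello <a href="/about">about</a></p>'
    '<div style="display: none"><a href="http://spam.example/pills">pills</a></div>'
)
finder = HiddenLinkFinder()
finder.feed(page)
print(finder.hidden_links)
```

A clean result from something like this proves very little, of course; it only rules out the laziest way of hiding links.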
So what’s left are some straws to grasp at:
- Legitimate incoming links from adult-ish sites and adult-ish search queries for my side project
- “Bad Words” or links appearing in message board posts (forget this piddly blog; what the rest of the free world knows cexx.org for is one of the oldest continuously running spyware help forums in existence. Posts here–especially diagnostic logfiles posted by affected users–contain references to bad programs/sites, links to same, and repetitive content (e.g. Windows registry trees) that occasionally generate false positives. A human would easily see that we are helping people rather than spamming the world; The Algorithm may not do as well.)
- Occasional comment spams slipping through in UGC (“User-generated content”, the current buzzword for “stuff the site owners didn’t write themselves”, such as blog comments, message board posts, every video on YouTube, etc.) As you can see from the numbers showing at the bottom of this page, comment spam is as fundamental to the Internet as the threat of rickroll, and the filter’s doing pretty well at blocking them. If transient v1@gra comment spams were grounds for being delisted, half the internet wouldn’t show up. (Then again, for those of us who have not used Lycos et al lately, how would we google-addicts know if half the internet wasn’t showing up in our searches…?)
- My content being scraped and appearing on third-party spamblogs (yes, it happens. I–of all people!–have fired off a couple DMCA takedown demands in the last couple years, but really, for splogs on splog-friendly Korean ISPs this has about as much effect as firing off complaints for every email spam you receive. These automated scrapers usually end up scraping from someone with deeper pockets and much better arm-twisting power at some point, and the problem (for lil old me) solves itself.)
- Old pages/posts, dead links, occasional bad grammer or speling mistakes?
- Maybe Google are still mad at me for exposing a huge bug in their search some years ago? (In theory, this would make it trivial for someone to determine whether they ranked higher than a competitor, or see how a specific tweak to their keywords/etc. affected their ranking. But since I’m pretty sure no humans were actually involved in this, I kinda have to rule this hypothesis out…)
- Statistically anomalous distribution in keyword content of sites that link to mine?
- Statistically anomalous distribution of topics I discuss, tag clouds, etc. (or as mentioned earlier, help forum posts)
- Someone I’ve pissed off in the past robo-submitting my URL to the automated “report a spammer” page?**
- Googlebombs or other shenanigans performed (maliciously or not) by third-party sites?
- Wild Conspiracy Theories (paid off by a malware company? Malware authors have been trying to block their victims from being able to reach help forums such as cexx.org’s for years; maybe they’ve found a way to up the ante. Or maybe Sergey bought my Trance Vibrator and didn’t like it.)
The possibility that any site could be delisted by the actions of third-party sites (e.g. competitors) is simply disturbing. As unlikely as I’d hope it to be, Google’s complete secrecy regarding its delisting criteria (even I, after proving myself the legal owner, can’t get boo about what’s going on with my own site) makes such a scenario impossible to rule out. For what it’s worth, Google does explicitly mention links to “bad neighborhoods” in this Guidelines page, and some sites by and for the SEO people (who presumably know their stuff, this being their entire business model) seem to think this does apply to incoming links as well.
That is unacceptable.
If I haven’t gotten to the bottom of it soon, my only choice might just be to block Google from the site outright (why pay for the bandwidth their crawler uses if we are being excluded from the results?) and personally wean myself from Google search, for whatever that is worth as a personal stance. Is my best alternative really “Bing, and it’s done”?? Google, you really put me between a rock and a hard place.
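For what it’s worth, the “block Google outright” option would most likely start with a robots.txt rule. Below is a minimal sketch using Python’s standard `urllib.robotparser` to sanity-check such a rule before putting it live. The rules and the `www.example.org` URLs are assumptions made up for illustration, not what this server actually serves, and robots.txt is only a polite request that well-behaved crawlers choose to honor.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: turn Googlebot away, leave every other crawler alone.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URL; substitute a real page from the site being checked.
url = "http://www.example.org/forum/index.html"
googlebot_allowed = parser.can_fetch("Googlebot", url)
other_bot_allowed = parser.can_fetch("SomeOtherBot", url)

print("Googlebot allowed:", googlebot_allowed)
print("Other crawlers allowed:", other_bot_allowed)
```

Checking the rules offline like this is cheap insurance against accidentally locking out every crawler with a misplaced `Disallow`.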
* Nor, based on analysis of the server logs, in the “Reconsideration request” you can submit via the Webmaster tools thingy. Unless there are human reviewers lurking in an underground bunker somewhere disguised as residential cable customers from Peoria, or a vast distributed network of speed-readers who are each assigned one line to read, the speed of the hit-streams identifiable as coming from Google during the time of said review easily beats my personal best, and, much like the MTV Music Awards, shows no evidence of human intelligence.
** From said page: “If you believe that another site is abusing Google’s quality guidelines, please report that site at https://www.google.com/webmasters/tools/spamreport. Google prefers developing scalable and automated solutions to problems, so we attempt to minimize hand-to-hand spam fighting. The spam reports we receive are used to create scalable algorithms that recognize and block future spam attempts.” Great, The Algorithm is now in charge of deciding who the humans on the Web are.
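The log analysis mentioned in the first footnote can be approximated in a few lines of Python. This is a rough sketch rather than the procedure actually used: it assumes Apache-style “combined” access logs, treats any request whose User-Agent contains “Googlebot” as coming from Google (a string anyone can fake), and runs on invented sample lines with documentation-range IP addresses. The `crawler_gaps` helper is a made-up name.

```python
import re
from datetime import datetime

# Apache/NCSA "combined" log format is assumed; adjust the pattern for other formats.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def crawler_gaps(lines, ua_marker="Googlebot"):
    """Return seconds between consecutive requests whose User-Agent contains ua_marker."""
    times = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and ua_marker in match.group("ua"):
            times.append(datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z"))
    times.sort()
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


# Invented sample entries, NOT real data from this server.
BOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE_LOG = [
    f'192.0.2.10 - - [01/Jan/2010:10:15:01 -0500] "GET /a.html HTTP/1.1" 200 5120 "-" "{BOT_UA}"',
    f'192.0.2.10 - - [01/Jan/2010:10:15:02 -0500] "GET /b.html HTTP/1.1" 200 7301 "-" "{BOT_UA}"',
    '198.51.100.7 - - [01/Jan/2010:10:15:03 -0500] "GET /a.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 6.1)"',
    f'192.0.2.10 - - [01/Jan/2010:10:15:04 -0500] "GET /c.html HTTP/1.1" 200 2048 "-" "{BOT_UA}"',
]

gaps = crawler_gaps(SAMPLE_LOG)
print(gaps)  # [1.0, 2.0]
```

Gaps of a second or two between full-page fetches, sustained across a whole site, are hard to square with a person reading each page, which is the whole point of the footnote.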