After a little tweaking, my rule is growing, and proving extremely effective:
# bad websites: domains which regularly or overwhelmingly feature spam
SecRule REQUEST_HEADERS:REFERER “http://[^/]*(yijiezi|yourhcg|lukejaten|squidoo|answerbag|jvlai|chaohuis|cledit|bait|lukejaten)” “t:lowercase,deny,nolog,status:500”# porn and gambling: they make much cash out of random visitors
SecRule REQUEST_HEADERS:REFERER “http://[^/]*(holdem|poker|casino|porn|girlz|pussy|penis|babe|exposed|sex)” “t:lowercase,deny,nolog,status:500”# fake / illegal designer clothing and luxury goods
SecRule REQUEST_HEADERS:REFERER “http://[^/]*(shop|store|cheap|gossip|handbag|money|deluxe|sunglass|chanel|replica|buy|sale|furniture)” “t:lowercase,deny,nolog,status:500”# celebrity gossip and trying to make money out of children, I guess
SecRule REQUEST_HEADERS:REFERER “http://[^/]*(miley|bieber|pokemon)” “t:lowercase,deny,nolog,status:500”# side-effects of Republican America?
SecRule REQUEST_HEADERS:REFERER “http://[^/]*(health|dental|pills|treatment|seller)” “t:lowercase,deny,nolog,status:500”# side-effects of weakly-regulated investment markets?
SecRule REQUEST_HEADERS:REFERER “http://[^/]*(forex|realty|invest|loans)” “t:lowercase,deny,nolog,status:500”# the people that created this problem
SecRule REQUEST_HEADERS:REFERER “http://[^/]*(seo)” “t:lowercase,deny,nolog,status:500”# webhosting and bodybuilding: apparently, these industries are as commoditized as porn and gambling – LOL
SecRule REQUEST_HEADERS:REFERER “http://[^/]*(download|hosting|videos|bodybuilding|bodybuild)” “t:lowercase,deny,nolog,status:500”
Incidentally, I looked into using wordlists for this, but they don’t work. The most effective anti-spam is to look at the domain-names – these sites are trying to get good rankings for their domains, not for specific pages. Apart from the spam-friendly sites, where it’s a combination of both.
So .. sadly … we need the regexp so that we can target the domain-name specifically. If ModSecurity were better (documented) I’m sure it could easily do that. I’m suspicious it *does* do that, but with their shotgun approach to documentation, it could take days or weeks to discover it if so :).