Reducing Website Spam

Official Construct Post
Tom
  • 1 Dec, 2011

User WGfunStorm recently asked on our forum how we keep the forum spam-free without using a CAPTCHA. I've spent a bit of time trying to make this website as spam-free as possible, and at the moment it seems to be working pretty well. Here are a few of the techniques we use.

Rename default pages

One of the simplest and most effective ways to reduce spam signups is to rename your registration and login pages.

We use Web Wiz Forums for our website at the moment. It's a relatively well-known piece of ASP forum software. The default registration and login pages are register.asp and login_user.asp. If you write spam software, the easiest approach is probably to find out the default registration and login pages for popular forum software and target those.

Moving these pages to different locations with different names seems to have had a drastic effect on the number of spam registrations we receive. Our new pages are now scirra.com/register and scirra.com/login.

Although this seems rather easy to circumvent with a bit of thought, spam software authors generally don't bother. This is probably for a number of reasons:

  • Not many sites do this, so it's not worth the time
  • Sites that take these sorts of measures probably clean up spam quickly so they aren't very good targets
  • There is no shortage of easy targets out there

We used to be on PHPBB3, and spam software seems to know this. We get hundreds of 404s for old PHPBB registration pages, as well as lots of 404s for the old WebWiz forum registration and login pages. It's important to note this is no fault of the forum software writers: even if they engineered a way of tackling this, the spam software would adapt. Some of our top 404s over roughly a two-month period are:

404 URL                                         Count
/forum/signup                                   493
/forum/ucp.php?mode=register                    263
/register/forum                                 165
/forum/ucp.php?mode=login                       141
/phpbb3/ucp.php?mode=register&sid=c69...        116
/phpbb3/ucp.php?mode=register&sid=725...        113
/phpbb3/ucp.php?mode=register&sid=a87b8...      112
/forum/viewtopic.php/profile.php?mode=register  107
/phpbb3/ucp.php?mode=register&coppa=0           98
/phpbb3/ucp.php?mode=register&sid=93...         88
And lots more...

As you can see, there are also some URLs in the table which don't exist and never have. This is just spam software probing or guessing common URLs.
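A table like the one above can be produced by tallying 404 responses per URL. The following is only a minimal sketch: the `(path, status)` input shape and the sample data are assumptions made for illustration, not the format of our server logs or the tool we actually used.

```python
from collections import Counter


def top_404s(requests, limit=10):
    """Return the most frequently requested paths that produced a 404.

    `requests` is an iterable of (path, status) pairs. This shape is an
    assumption for the sketch; adapt the parsing to your own log format.
    """
    counts = Counter(path for path, status in requests if status == 404)
    return counts.most_common(limit)


# Made-up sample data, for illustration only.
sample = [
    ("/forum/signup", 404),
    ("/forum/ucp.php?mode=register", 404),
    ("/forum/signup", 404),
    ("/register", 200),
]

for path, count in top_404s(sample):
    print(path, count)
```

Paths that show up here but never existed on your site are a good hint that bots are guessing default URLs.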

When I first started noticing these 404s, I made the mistake of creating 301 redirects to the new page URLs, as a good webmaster should! However, for these particular pages it was naive and had bad consequences. It was basically akin to flipping a switch on for spammers: once the old URLs redirected to the new ones, we got dozens of spammers daily, wasting a lot of our time! It's a good idea to check whether any legitimate sites link to your old registration and login pages. This is rare, and if an important source does, it can be dealt with individually using the HTTP Referer header. The best and easiest way, though, is just to let the old URLs 404 and contact individual sites to update their links if any do exist.

Renaming your common entry point URLs to something different seems to block a lot of spam users. I don't think choosing an obscure name would offer any extra benefit; just being different is the most important thing.
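As a rough sketch of the idea (not our actual ASP setup), the old entry points can simply answer 404, with a redirect reserved for the rare legitimate referrer. The legacy paths and the trusted host below are hypothetical placeholders.

```python
from urllib.parse import urlsplit

# Hypothetical values for illustration; substitute your own old and new URLs.
LEGACY_PATHS = {
    "/forum/register.asp": "/register",
    "/forum/login_user.asp": "/login",
}
TRUSTED_REFERER_HOSTS = {"partner.example.com"}


def handle_legacy_request(path, referer=None):
    """Decide how to answer a request for an old entry-point URL.

    Returns (status, location), or None if the path is not a legacy
    entry point and should go through normal routing.
    """
    new_path = LEGACY_PATHS.get(path)
    if new_path is None:
        return None
    host = urlsplit(referer).hostname if referer else None
    if host in TRUSTED_REFERER_HOSTS:
        # An important, known source still links to the old URL.
        return (301, new_path)
    # Everyone else, including bots probing default URLs, gets a 404.
    return (404, None)


print(handle_legacy_request("/forum/register.asp"))
print(handle_legacy_request("/forum/register.asp",
                            "https://partner.example.com/links"))
```

Bear in mind that the Referer header is supplied by the client and can be faked, so the exception list is best kept short.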

Setting up Honeypots

Putting honeypots on our entry scripts seems to be an effective measure as well. Some bots do find these pages (either because they are more intelligently written or because they function differently). A honeypot is a juicy-looking target which is actually a trap, like the common jam-in-a-bottle wasp trap.

If you look at our registration page you will see it's pretty trim, as it is designed to make registering as accessible as possible.

If you view the source you will see that there are some hidden fields:

<input type="hidden" name="Username" value="">

The field called Username is actually a hidden unused field that users can't type anything into. The actual username field has an obscure name.

As spam software will probably be scouring the HTML for fields, it will sometimes come across the Username field and automatically fill it in. This means that when the form is submitted with this Username field containing any value, we can reject the registration. Actual users won't be filling this input box with any values!

This method is also very effective, blocking a few registrations a day. There are other ways of doing honeypots - they all rely on the fact that the spam bot isn't smart enough to realise the actual username field is a different one, or that the field is hidden. To work those things out is actually pretty darned difficult so I don't imagine there are many spam bots out there that do this.
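The server-side check for this kind of honeypot is tiny. Here is a minimal sketch, assuming the submitted form arrives as a dictionary of strings; the real username field name `u_7f3a` is a made-up placeholder, and this is not the code our forum actually runs.

```python
HONEYPOT_FIELD = "Username"      # decoy field name from the hidden input above
REAL_USERNAME_FIELD = "u_7f3a"   # hypothetical obscure name for the real field


def is_probable_bot(form):
    """True if the hidden decoy field came back with any value in it."""
    return bool(form.get(HONEYPOT_FIELD))


def register(form):
    """Reject honeypot hits, then check the real username field."""
    if is_probable_bot(form):
        return "rejected"
    if not form.get(REAL_USERNAME_FIELD, "").strip():
        return "missing username"
    return "accepted"


# A human never sees the decoy, so it comes back empty.
print(register({"Username": "", "u_7f3a": "alice"}))
# A bot that fills in every field it finds trips the trap.
print(register({"Username": "cheap-pills", "u_7f3a": "cheap-pills"}))
```

Rejecting quietly, without telling the client which field gave it away, avoids handing spam software an easy hint.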

Following up Spammers

On occasion I have visited some URLs spammers have posted before deleting them. I contact the site support/owners and ask them why they are spamming our website.

Most of the time they are oblivious to it. Some of the time they feign ignorance. The ones who are oblivious, after a bit more questioning, turn out to have hired 'SEO Experts' to help improve their website rankings. These 'experts' then start up their various pieces of spam software and sit back, often charging the site owners a lot of money for that service.

The SEO industry is full of spammers and ignorance. There are GOOD and HONEST SEO people out there, but they are rare and to find them you need to know what you are looking for in the first place which is a skill in itself. When buying SEO, always understand exactly what you are buying. If you're hiring in the dark, you're probably helping to support the spam industry.

Other times the site owners just tell me to get over it and remove the links if it bothers me. This is frustrating for webmasters, as it shows no empathy for how much time folk like us have to spend daily cleaning up other people's spam! It can be laborious and frustrating. It's also an important job: a clean forum and website leaves a good impression on new visitors.

Awesome Moderators and Users

A small amount of manual work is still necessary to clean up the small amount of spam that slips through. Also, some spam is posted by actual hired people rather than automatic software, and it's never going to be easy to automatically prevent that kind of "manual" spam. Fortunately, since it's expensive to do this, the volume of manual spam is small. We're also lucky on this site to have an excellent group of moderators and users! The moderators spend time helping us deal with any spam that does get through and for that we are very grateful!

The same applies to our users who report spam when it gets through - a big thank you as well! All of this allows us to promptly clean up whenever something gets through.

The Problem with CAPTCHAs

CAPTCHAs are those boxes on websites that verify you're a human being by asking you to type in some words you see, or to answer a question that shows you're probably a human.

Uhhhhhh.....

The above image is of course usually the exception, but it illustrates the point well. CAPTCHAs sometimes go wrong, and assuming your users can complete them reliably can be costly. They take time to fill out and can be annoying. All these factors will lose you signups.

Not only this, but some websites I've visited and attempted to register on get you stuck in an endless washing machine of re-entering information. You squint and carefully enter the CAPTCHA. It's wrong! You re-enter it correctly and resubmit. You need to enter your password again! You enter your password again. Please re-enter the CAPTCHA code! No! I can't be bothered anymore! Using a CAPTCHA on your website has to be executed very carefully, as common implementations like this will lose you a lot of registrations.

General accessibility is another important point regarding CAPTCHAs that has been debated heavily. It's really best not to have them if possible. Also, some of them are so reliably solved by software that they provide no protection at all! That makes them a good way to frustrate all your users for no advantage.

Final Words

With the simple honeypot and renamed entry point pages we get one or two spammers a day now. This is easily manageable with manual anti-spam and worth the efforts of prevention. These days it's also a much more effective prevention than a CAPTCHA.

A lot of spam prevention on a website is staying ahead of the pack. Most people can't be bothered or don't know how to implement some spam prevention techniques. This means that for a site that does, the spammers will generally move on to the easier and juicier targets. Cleaning up any spam that makes it through promptly is one way of staying ahead of the pack.

Some spammers are paid humans, or even backed by CAPTCHA human farms in poorer countries. For this kind of spam there really is no way to block it easily. The only thing we can do is discourage the behaviour by making it not cost effective. The way to do that is by cleaning the spam as soon as it appears!
