Methods to Fight Comment Spam

Posted on November 16, 2013

With the increase in spam over the last few years, preventing comment spam is getting even more difficult. Bots now run JavaScript, learn which fields to populate, and more. But if you combine enough of the methods below, you can fight back and be almost spam free.

Method 1: Moderate Comments

With this method, you read every comment you receive before it gets posted to your site or blog. Most blogging systems have a moderation page where you manually approve each comment, but people end up disabling it or allowing only registered users to post comments. Bots can create fake user accounts and spam from those, and disabling moderation altogether will lead to even more spam. When a spam bot notices that a site it spammed previously displayed its comment, it adds that site to a list to spam again in the future, because it works.

The only impact users will notice is that their comments don’t appear right away. This can be somewhat frustrating, but most users are aware of moderation by now since nearly every site uses it. To reduce the impact even further, you can mark specific users as “Approved” so that all of their comments skip moderation by default. I suggest approving users who have 5 to 10 good comments and not a single bad one. Also make sure the user accounts are not brand new: set a waiting period of a month before users can post without being moderated.
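The approval rule above could be sketched roughly as follows. This is only an illustration: the `Commenter` record, its field names, and the constants are hypothetical, and the thresholds simply take the lower end of the 5 to 10 range and treat a month as 30 days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed thresholds, taken from the suggestion in the text.
MIN_GOOD_COMMENTS = 5                  # lower end of the "5 to 10" range
MIN_ACCOUNT_AGE = timedelta(days=30)   # roughly "a month"


@dataclass
class Commenter:
    """Hypothetical user record; a real system would load this from its database."""
    created_at: datetime
    approved_comments: int
    rejected_comments: int


def can_skip_moderation(user: Commenter, now: datetime) -> bool:
    """Return True when the user's comments may be published without review."""
    old_enough = now - user.created_at >= MIN_ACCOUNT_AGE
    enough_good = user.approved_comments >= MIN_GOOD_COMMENTS
    never_bad = user.rejected_comments == 0
    return old_enough and enough_good and never_bad
```

Passing `now` in explicitly keeps the rule easy to test. A single rejected comment is enough to send the user back through moderation.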

The success rate for this method is 100%, unless you approve a comment by mistake. Your moderators and administrators will read through all the comments manually and will know if it’s a genuine comment or spam.

Method 2: Hidden Field - Honey-pot

This method, which only sometimes works, adds an input field of type “text” to the form and styles it with display:none;. Most bots fill in every field they find rather than risk leaving a required field empty and having to submit the comment again. Because bots fill the field in and real users never see it, you can treat any submission where the field is not empty as coming from a spam bot.

Spam bots are getting better at detecting this method though, so it’s not 100% spam proof, but I did notice it cutting my spam by more than half on a few of my sites. You can make it more difficult for bots to detect the trap by putting the style in an external style-sheet instead of inline.
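On the server side, the honey-pot check comes down to one test on the submitted form data. Here is a minimal sketch, assuming the form arrives as a dictionary of strings and the hidden input is named `website_url` (a made-up name; pick whatever looks tempting to a bot).

```python
from typing import Mapping

# Hypothetical name of the input that is hidden from real visitors with CSS.
HONEYPOT_FIELD = "website_url"


def is_honeypot_tripped(form: Mapping[str, str]) -> bool:
    """Return True when the hidden field contains any non-whitespace text."""
    return form.get(HONEYPOT_FIELD, "").strip() != ""
```

A missing field is treated the same as an empty one, so a visitor whose browser never sends the hidden input is not punished.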

This detection is great for the real visitors, since it doesn’t affect them at all. They won’t see any extra inputs, they don’t need any extra browser requirements (other than CSS, which all browsers support unless text only), and your visitors will see less spam on your site.

Method 3: JavaScript Hidden Input

While this method has a fairly high success rate, relying on it can be frustrating for your visitors since it requires them to have JavaScript enabled. Other than that requirement, the user won’t even notice it. I suggest still accepting comments that fail this check, but flagging them as spam for you to moderate at a later date.

This method works much like Method 2, except that we use JavaScript to create a hidden text field inside the comment form. If the field doesn’t exist in the submission, the client isn’t running JavaScript. The lack of JavaScript can be a sign of a spam bot, or it could just be a user who doesn’t like having JavaScript enabled.
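Since a missing field is only a hint, the server-side check can sort comments into queues instead of rejecting them. The sketch below assumes the script adds a field called `js_check` and that comments carry a status string; both names are invented for illustration.

```python
from typing import Mapping

# Hypothetical name of the field that the page's JavaScript adds to the form.
JS_FIELD = "js_check"


def moderation_status(form: Mapping[str, str]) -> str:
    """Choose a queue for the comment instead of rejecting it outright."""
    if JS_FIELD in form:
        return "pending"        # JavaScript ran, so use the normal queue
    return "flagged_spam"       # no JavaScript: keep it, but flag it for later review
```

Either way the comment is stored, so a real visitor with JavaScript disabled can still be approved by hand.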

Spam bots can run JavaScript, but most of them don’t, so that they can quickly run through a huge list of sites. When a spam bot has JavaScript enabled, it has to load each page in full and then run the JavaScript one site at a time. If your site is running a known CMS such as WordPress, the bots are less likely to load the JavaScript because they know exactly where to send the HTTP POST request for their comment to appear. For unknown sites and scripts, the spam bots need to parse your site to figure out which page and data are required to submit spam on your website.

Method 4: Cookie Timeout

Spam bots don’t care about your content, only about how they can spread their links all over your site to make an extra dollar or two. Because of this, they don’t spend much time on your site, only enough to parse the comment form and submit the spam. We combat this by setting a cookie when the page loads, with the current time (epoch or Unix time-stamp) as its value. When your code processes the comment, make sure the cookie is set and that at least 3 seconds have passed since the time stored in it. Also set a timeout period, so that if the cookie is more than 2 hours old the comment is also marked as spam.
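The timing check described above might look something like this. It is a sketch under a few assumptions: the cookie value is a plain Unix time-stamp stored as text, and the function and constant names are made up for the example.

```python
import time
from typing import Optional

MIN_AGE_SECONDS = 3             # faster than this and it is probably a bot
MAX_AGE_SECONDS = 2 * 60 * 60   # older than 2 hours is also treated as spam


def cookie_timing_ok(cookie_value: Optional[str], now: Optional[float] = None) -> bool:
    """Return True when the cookie exists and its age falls inside the allowed window."""
    if now is None:
        now = time.time()
    if cookie_value is None:
        return False            # cookie missing: bot, or cookies disabled
    try:
        loaded_at = float(cookie_value)
    except ValueError:
        return False            # the value is not a time-stamp at all
    age = now - loaded_at
    return MIN_AGE_SECONDS <= age <= MAX_AGE_SECONDS
```

A plain time-stamp can be forged by a bot that knows the trick, so a stricter version would sign the value on the server. That is left out here to keep the example short.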

Using this method alone has resulted in huge success. I don’t suggest just using one method, but if you had to pick an automated method to combat spam, this is the choice I would make for now. It’s fairly simple to implement into existing systems and works really well.

All browsers accept cookies by default unless they are disabled, so this method works for all visitors without them even knowing about it. I accept that some users may have cookies disabled and simply don’t insert their comments into my database. Cookies are supported in all browsers, including text browsers, making this method better than relying on JavaScript as in Method 3.

Method 5: External Cookie

This method is identical to the one above, except that you set the cookie from your style sheet or another external resource you load (images, JavaScript, etc). It is far better than setting the cookie on the same page, but requires extra work to get working properly. The reason it is so much better is that most spam bots don’t load external resources, especially images. When you are spamming thousands of sites, why waste the bandwidth on loading every single image? Spam bots can load images, JavaScript, and style-sheets, but most of them load only JavaScript, to fight back against some of these methods.
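One way to picture the external-resource variant is a small function that builds the response headers for the style sheet. This is only a sketch: the cookie name `page_loaded_at` and the function name are assumptions, and how the headers get attached depends on your server or framework.

```python
import time
from http.cookies import SimpleCookie
from typing import List, Optional, Tuple

# Hypothetical cookie name; the comment handler would read the same name back.
COOKIE_NAME = "page_loaded_at"


def stylesheet_headers(now: Optional[float] = None) -> List[Tuple[str, str]]:
    """Build response headers for the style sheet, including the time-stamp cookie."""
    if now is None:
        now = time.time()
    cookie = SimpleCookie()
    cookie[COOKIE_NAME] = str(int(now))
    cookie[COOKIE_NAME]["path"] = "/"   # make the cookie visible to the comment handler
    return [
        ("Content-Type", "text/css"),
        # Assumed design choice: without this, a cached style sheet would never refresh the cookie.
        ("Cache-Control", "no-store"),
        ("Set-Cookie", cookie[COOKIE_NAME].OutputString()),
    ]
```

Turning off caching for the style sheet is part of the extra work this method needs. If the browser serves the file from its cache, no new cookie is set and real visitors start failing the check once the old one expires.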

It has all the same benefits and downsides as the other cookie method, such as no user interaction required, visitors don’t notice it, but it does require a browser that saves cookies.

Method 6: IFrame Comment Form

If you place your comment form inside an <iframe>, a spam bot is less likely to find the form. This method won’t work for WordPress sites and other sites running popular Content Management Systems, since the form is well known and spam bots don’t need to see it.

This method is less desirable since incorporating <iframe>s into a site is usually bad practice, so only use it as a last resort. I have never had to use this method, thankfully, thanks to the other methods available.

Users are hardly impacted by this change on your site, but browsers can have <iframe>s disabled. Modifying your code to move the comment form into an <iframe> may be difficult, depending on how your comment system is currently set up.

Method 7: External Comment System

A lot of sites now use external comment systems, such as Facebook, Disqus or IntenseDebate. External comment systems are websites that store the comments for you and then load them onto your website using JavaScript. Users who don’t have JavaScript won’t be able to comment as easily using some of these services, and with Facebook, they will need an account.

Another downside is that search engines won’t pick up your comments when they crawl your site. This may or may not affect your rankings much, but most people prefer comments to be visible to search engines so that the page looks active.

Method 8: Registered Users Only

Allowing visitors to post only if they have an account is an old method, and a pretty good one, if you don’t mind taking a huge hit to the discussion on your site. Spammers don’t want to spend the time creating an account, checking for the activation email, logging in and then finally spamming. Most users don’t want to take the time to do all that just to post a comment either.

The success rate for this method is around 60% for known CMSs and about 90% for unknown website scripts. That is lower than you probably want, given such a huge hit to your site’s discussion and how much is required from a user to post a comment. I don’t recommend this method at all, but it may be something you have to enable.

Method 9: Captcha

The well-known Captcha method should be mentioned in this post, but it is another method I like to avoid at all costs. Making users and visitors take extra steps to join the discussion on your site is what you want to avoid.

Captchas are images that display random text in a funky and groovy font. In front of and behind the text are random shapes, lines, colors and more. Most OCR tools are unable to process the image, so spammers have to send it to a Captcha solving service where thousands of people solve Captchas as their daily job.

You will remove a lot of spam from your site by adding Captchas, but as computers get more powerful and better OCR techniques are created, Captchas will soon be too difficult for humans to read and easy for bots. In their current state, I have trouble reading Captchas and enter them incorrectly about a third of the time.

Method 10: Comment Rules

Creating a set of rules for comment text and marking matching comments as spam is very worthwhile. I mark as spam all comments that include multiple links, random text, a mix of characters from different languages, and more. A lot of the spam on my site contains both Chinese and English characters along with multiple links. This was easy enough to make a rule for, and it cut back on spam. The likelihood of a real comment containing that is not very high, but I still keep those comments, flagged as spam, to moderate in the future so I don’t have to deal with them daily.
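Two of the rules mentioned above, multiple links and mixed Chinese and English text, can be approximated with a few regular expressions. The threshold and the character ranges below are assumptions for the sake of the example, not the exact rules used on this site.

```python
import re

LINK_RE = re.compile(r"https?://", re.IGNORECASE)
CJK_RE = re.compile(r"[\u4e00-\u9fff]")   # common Chinese characters
LATIN_RE = re.compile(r"[A-Za-z]")

MAX_LINKS = 1   # assumed threshold: more than one link counts as "multiple links"


def looks_like_spam(text: str) -> bool:
    """Flag comments with multiple links or a mix of Chinese and Latin letters."""
    too_many_links = len(LINK_RE.findall(text)) > MAX_LINKS
    # Letters inside a URL count as Latin here, which keeps the rule deliberately blunt.
    mixed_scripts = bool(CJK_RE.search(text)) and bool(LATIN_RE.search(text))
    return too_many_links or mixed_scripts
```

Because flagged comments are kept for later review instead of being deleted, a blunt rule like this costs little when it catches a genuine comment.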

Users won’t have to enter any extra data into your site, but their comment may get placed into a spam folder for future review.

Method 11: IP Banning and IP Databases

This method is suggested by some, but probably won’t help all that much unless a single IP has some sort of vendetta against your site. When a spam comment appears on your site, you ban that IP address from posting future comments. Since spammers can easily switch IP addresses using proxies, you will end up banning thousands of addresses. And if one of your real users happens to share a banned IP address, they won’t be able to comment.
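A local ban list like the one described above can be checked with Python’s standard ipaddress module. In this sketch the ban list is just a list of strings, each either a single address or a CIDR range, and the function name is made up for the example.

```python
import ipaddress
from typing import Iterable


def is_banned(client_ip: str, banned_entries: Iterable[str]) -> bool:
    """Return True when client_ip matches a banned address or falls inside a banned range."""
    address = ipaddress.ip_address(client_ip)
    # A single address such as "203.0.113.7" is parsed as a network containing one host.
    return any(address in ipaddress.ip_network(entry) for entry in banned_entries)
```

For example, with a ban list of `["203.0.113.7", "198.51.100.0/24"]`, the address `198.51.100.42` is rejected because it falls inside the banned range. Banning whole ranges also raises the odds of locking out a real user.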

There are also services and sites that list known spam IP addresses, where you can look up a specific address and decide whether to allow it. These services are better than building your own list because they are used by many people. People report IP addresses as spammers, and when one of them visits your site, you know it spammed x sites in the last x hours or so.

Now, get to work on fixing up your comment section so you can work on blogging instead of hitting delete on thousands of comments.

