Understanding Robots.txt and Duplicate Content issues with Drupal Comments

A guide by Knooq to the Robots.txt file and its use within the Drupal content management system, particularly in relation to the duplicate content created by comments in Drupal.

What is Robots.txt?

The Robots.txt file is a set of instructions to the search engines, telling them which URLs, folders and files you do and do not want them to crawl and ultimately index.

Preventing the indexation of duplicate URLs arising from comments is an important aspect of managing your Drupal website. Most sites do not want comments paginated onto separate pages, so you should ensure that the comments-per-page limit in Drupal is set sufficiently high, say 100, so that there is little risk of comments spilling onto a second page. To do this, adjust your 'Comment Settings'.

Keeping comment URLs out of the search engines' indexes can be achieved using Robots.txt. It is important to note, however, that the instructions contained in the Robots.txt file are merely a request to the crawlers and search bots not to crawl the URLs and directories specified there. In this regard, you can only expect the reputable search spiders and bots to adhere to it. Even so, not all search engines treat the instructions in the Robots.txt file the same way, and there are options for specifying instructions for individual search engines. By default, the syntax of the Robots.txt file uses the instruction User-agent: * to specify that the instructions apply to all user agents.
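For example, a minimal Robots.txt might look like the following (the paths shown here are purely illustrative): the first block applies to all crawlers, while the second gives Googlebot its own instructions. Note that most crawlers obey only the most specific User-agent group that matches them, so a bot-specific block is read instead of, not in addition to, the * block.

User-agent: *
Disallow: /admin/

# Instructions specific to Googlebot (illustrative path only)
User-agent: Googlebot
Disallow: /comment/reply/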

Treatment of Robots.txt by Search Engines

The statements contained in the Robots.txt file are sometimes referred to as rules. However, they are more closely aligned with a 'polite request', because they have nothing whatsoever to do with the security of your directories and files. With terms such as rules and search bot instructions flying around, you could be forgiven for thinking that the Robots.txt file affords some sort of protection to your web directories and files.

This is a misconception: the way to protect vulnerable files and directories is with server-side security. With this in mind, you can begin to look at the Robots.txt file for what it is, i.e. a 'do not disturb' sign on an unlocked door. Its main purpose is to help you and the search engines better understand the URL and folder structure of your website, which in turn leads to a more appropriate and cleaner indexation on the web.

In this respect, Google states that it won't crawl these URLs, but that links to them and their corresponding anchor text may still show up in search results. We have seen this with comments on our own web design blog; however, as expected, those links didn't carry a caption (normally the meta description) as other search results do, and were just plain links which were, in some cases, obscured in the search results.

This can be a bit confusing and make it hard to tell whether the instructions in the Robots.txt file are being adhered to. The best way to check is in Webmaster Tools, under the crawl section, which details the URLs that are 'restricted by robots'.

Duplicate URLs Arising from Drupal Comments

To start with, we took a fresh installation of Drupal 7 and added a node, /node/1. We enabled comments by anonymous users, as people don't generally like having to register in order to leave a brief comment.

With comments activated, we made a few comments on /node/1 as well as a few replies to those comments. Then we used a link-scanning utility to scan all of the URLs on the site. As expected, it picked up the new node /node/1, but it also found the following URLs:

ourtestsite/comment/1
ourtestsite/comment/2
ourtestsite/comment/3

What the link scanner didn't spot, but which we know exist, are the following URLs:

ourtestsite/comment/reply/1
ourtestsite/comment/reply/1/1
ourtestsite/comment/reply/1/2
ourtestsite/comment/reply/1/3

These URLs are a bit difficult to come across, as they are not listed under 'Content' as a content type, nor are they listed under 'URL Aliases', despite having a relatively clean URL format. So you need to know what to look for.

You may come across this issue whenever you have comments enabled in Drupal. For example, say we have a node at /node/123 with comments enabled. When comments are made, an additional URL is generated at /comment/reply/123. Unfortunately, this new URL is a page in its own right and contains a full copy of the content of the original page /node/123 (minus the <h1></h1> header, which is replaced by an <h2></h2> link back to the node). Unless restricted by Robots.txt, it gets picked up and could be construed as duplicate content by the search bots.
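To illustrate with the hypothetical /node/123 above, a single Disallow rule (the same one Drupal ships with, as shown in the next section) is enough to keep the duplicate reply page out of the crawl while leaving the node itself crawlable:

User-agent: *
# /node/123 stays crawlable; /comment/reply/123 is excluded
Disallow: /comment/reply/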

The URLs produced above are a necessary part of the functionality of Drupal's Comment module; however, they are unwanted from an SEO perspective, as they may lead to duplicate content issues.

Fortunately, the default installation of Drupal ships with a Robots.txt file set up to deal with this, and the default file includes the following 'Disallow' rules:

User-agent: *
Crawl-delay: 10
....
# Paths (clean URLs)
Disallow: /comment/reply/
...
# Paths (no clean URLs)
Disallow: /?q=comment/reply/
.......
 
 
note: abridged summary of the full robots.txt file

This in itself should be enough to capture URLs under the '/comment/reply/' path (though not the '/comment/1'-style URLs seen above). However, it can be difficult to tell how each bot interprets these instructions, and getting to grips with which search bots enforce which standards can be confusing and time consuming. For example, Google supports the Allow instruction as well as the Disallow instruction, and the two can be used together for more fine-grained control. The problem here, however, is that the Allow instruction is not part of the original Robots.txt standard and may confuse other crawlers/bots.
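As a sketch of how the two can be combined for Googlebot (the '/comment/goodcomments/' path here is entirely hypothetical), a more specific Allow rule can carve an exception out of a broader Disallow; Google generally resolves conflicts in favour of the most specific, i.e. longest, matching rule:

User-agent: Googlebot
# Block everything under /comment/ ...
Disallow: /comment/
# ... but allow this one (hypothetical) sub-path back in
Allow: /comment/goodcomments/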

Pattern Matching with Robots.txt

Another feature that Googlebot supports is 'pattern matching', as seen in the example below:

User-agent: Googlebot
Disallow: /comment*/

By using a pattern that includes the asterisk (*) in the path, we are telling Googlebot to ignore any URL whose path starts with '/comment', followed by any characters and then a slash, so '/comment/1', '/comment/reply/1' and the Comment Notify paths would all match. This can be very useful when dealing with comments in Drupal, or with any other URLs. We can therefore include the following in our Robots.txt file:

User-agent: *
Crawl-delay: 10
....
# Paths (clean URLs)
Disallow: /comment/
Disallow: /comment*/
Disallow: /comment/reply/
Disallow: /comment/reply*/
Disallow: /comment_notify/
Disallow: /comment_notify*/
...
# Paths (no clean URLs)
Disallow: /?q=comment/
Disallow: /?q=comment*/
Disallow: /?q=comment/reply/
Disallow: /?q=comment/reply*/
Disallow: /?q=comment_notify/
Disallow: /?q=comment_notify*/
.......
 
 
note: abridged summary of the full robots.txt file

Although this might seem superfluous given that Drupal already ships with '/comment/reply/' disallowed, it is only superfluous if you know exactly how each search bot treats the '/comment/reply/' instruction. These couple of extra lines may be followed or ignored, just like all the rest of the instructions. In any event, we felt it was better to provide pattern matching for these tricky comment URLs. You can decide, after monitoring your site's indexation, whether you think these extra lines help or not.

Note that we have included the instructions above in an attempt to disallow the URLs comment/1, comment/2, etc. that were also showing up. We have also disallowed URLs relating to the Comment Notify module.

You will most likely have come across the Robots.txt file during your time using Drupal, and if you are using the Webmaster Tools utilities for Google or Bing, or indeed Yahoo's Site Explorer, you are likely to have seen it mentioned.

In Google Webmaster Tools, there is a useful section dealing with the Robots.txt file, which allows you to build your own rules and test them by entering various URLs.

For more on the syntax and markup of the Robots.txt file, you should visit the Robots.Txt.org documentation and help pages.

As always, feel free to comment and discuss below.
