Monday, 17 June 2013

Preventing crawling of web sites

Robots.txt is a standard way of telling web crawlers not to crawl a page. It controls crawling, not indexing.
As a result, it is still possible for a website blocked by a robots.txt file to appear in search results.
This usually happens if the site is linked to from other sites, with a description of it in the anchor text. In such a scenario, the site will still not be crawled, but search engines can deduce that it is a useful result for certain queries from the number of links pointing to it and the text of those links.
In these cases, there will be no snippet of content from the site in the search results since it still can't be crawled.
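
As a rough illustration of how a well-behaved crawler reads such a file, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt contents, the "ExampleBot" user agent and the example.com URL are all made up for the example. Note that it only answers "may I crawl this URL?" and says nothing about whether the URL ends up in an index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks every crawler to stay away from the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" and the URL are placeholders, not a real crawler or site.
    print(is_crawl_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/page.html"))
```

A crawler that honours the file will skip the page, which is exactly why it never sees the page content and cannot show a snippet.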

An alternative is to allow the site to be crawled but add a noindex meta tag to the pages. Search engines can then crawl the site, but they will drop each page from the index once they see its noindex tag.
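
To make the tag concrete, the sketch below embeds a sample page containing a robots meta tag with the noindex directive and checks for it with Python's standard html.parser module. The page markup and the names RobotsMetaParser and has_noindex are invented for this example; real search engines do their own parsing, so treat this only as a way to sanity-check your own pages.

```python
from html.parser import HTMLParser

# Hypothetical page that allows crawling but asks not to be indexed.
PAGE = """\
<html>
  <head>
    <title>Private page</title>
    <meta name="robots" content="noindex">
  </head>
  <body>Not for search results.</body>
</html>
"""


class RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="robots"> tag contains the noindex directive."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values may be None, so default to "".
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list such as "noindex, nofollow".
        directives = [d.strip().lower() for d in attributes.get("content", "").split(",")]
        if "noindex" in directives:
            self.noindex = True


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


if __name__ == "__main__":
    print(has_noindex(PAGE))
```

Keep in mind that the tag only works if the page is not also blocked in robots.txt, because a crawler that never fetches the page never sees the tag.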

You can also use the URL removal tools that search engines offer to site owners to remove pages or whole sites from their indexes.

