Introduction to the Robots.txt File
Robots.txt is a plain text (not HTML) file you put on your site to tell
search robots which pages you would like them not to visit.
Robots.txt is by no means mandatory for search engines, but generally
search engines obey what they are asked not to do. It is important to
clarify that robots.txt is not a way of preventing search engines
from crawling your site (i.e. it is not a firewall, or a kind of
password protection); putting up a robots.txt file is
something like putting a note “Please, do not enter” on
an unlocked door: you cannot prevent thieves from coming
in, but the good guys will not open the door and enter. That is why we
say that if you have really sensitive data, it is naive to
rely on robots.txt to protect it from being indexed and displayed in
search results.
The location of robots.txt on a website is very important. It must be in the root directory, because otherwise user agents (search engines) will
not be able to find it: they do not search the whole site for
a file named robots.txt. Instead, they look only in the main
directory (i.e. http://mysite.com/robots.txt),
and if they don't find it there, they simply assume that the site
does not have a robots.txt file and therefore they index everything
they find along the way. So, if you don't put robots.txt in the proper
place, do not be surprised when search engines index your whole site.
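A crawler derives the robots.txt location mechanically from any page URL: it keeps only the scheme and host and appends /robots.txt. Here is a quick sketch in Python (the mysite.com URLs are just this article's placeholder domain):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    # Keep only the scheme and host of the given page, then append
    # /robots.txt -- crawlers never look in subdirectories for it.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://mysite.com/stuff/page.html"))
# -> http://mysite.com/robots.txt
```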
Create Your Robots.txt file
So let's get moving. Create a regular text file called "robots.txt", and make
sure it is named exactly that. This file must be uploaded to the root
accessible directory of your site, not a subdirectory (i.e.
http://www.mysite.com but NOT
http://www.mysite.com/stuff/). Only
by following both of the above rules will search engines interpret the
instructions contained in the file. Deviate from this, and "robots.txt" becomes
nothing more than a regular text file, like Cinderella after midnight.
Now that you know what to name your text file and where to upload it, you
need to learn what to actually put in it to send commands to search
engines that follow this protocol (formally, the "Robots Exclusion
Protocol"). The format is simple enough for most intents and purposes: a
User-agent: line to identify the crawler in question, followed by one or
more Disallow: lines telling it not to crawl certain parts of
your site.
1) Here's a basic "robots.txt":
User-agent: *
Disallow: /
With the above declared, all robots (indicated by "*") are instructed
not to crawl any of your pages (indicated by "/"). Most likely not what you
want, but you get the idea.
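If you have Python handy, you can verify how such a file is interpreted with the standard-library urllib.robotparser module. This is just a sanity check against a hypothetical site, not part of the protocol itself:

```python
import urllib.robotparser

rules = """\
User-agent: *
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Every crawler is barred from every path on the (hypothetical) site:
print(rp.can_fetch("AnyBot", "http://mysite.com/index.html"))  # -> False
```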
2) Let's get a little more discriminatory now. While every
webmaster loves Google, you may not want Google's image bot crawling your
site's images and making them searchable
online, if only to save bandwidth. The declaration below will do the
trick:
User-agent: Googlebot-Image
Disallow: /
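Checked the same way with Python's urllib.robotparser (a sketch; the image path is made up), only the image bot is affected:

```python
import urllib.robotparser

rules = """\
User-agent: Googlebot-Image
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Only Google's image crawler is blocked; other crawlers are unaffected.
print(rp.can_fetch("Googlebot-Image", "http://mysite.com/photos/cat.jpg"))  # -> False
print(rp.can_fetch("Googlebot", "http://mysite.com/photos/cat.jpg"))        # -> True
```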
3) The following disallows all search engines and robots from
crawling select directories and pages:
User-agent: *
Disallow: /cgi-bin/
Disallow: /privatedir/
Disallow: /tutorials/blank.htm
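A quick check with Python's urllib.robotparser (the tested paths are hypothetical) shows that only URLs under the listed directories, plus the one named page, are blocked:

```python
import urllib.robotparser

rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /privatedir/
Disallow: /tutorials/blank.htm
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("AnyBot", "http://mysite.com/cgi-bin/form.cgi"))     # -> False
print(rp.can_fetch("AnyBot", "http://mysite.com/tutorials/blank.htm"))  # -> False
print(rp.can_fetch("AnyBot", "http://mysite.com/tutorials/intro.htm"))  # -> True
```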
4) You can target multiple robots individually in "robots.txt".
Take a look at the example below:
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/
This is interesting: here we declare that crawlers in general should not
crawl any part of our site, EXCEPT for Googlebot, which is allowed to
crawl everything apart from /cgi-bin/ and /privatedir/.
In other words, the most specific matching record wins; a robot does not
combine the * record with its own.
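This precedence is easy to confirm with urllib.robotparser: the Googlebot record overrides the * record rather than adding to it (bot names other than Googlebot are made up):

```python
import urllib.robotparser

rules = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot follows only its own record, not the blanket ban:
print(rp.can_fetch("Googlebot", "http://mysite.com/index.html"))      # -> True
print(rp.can_fetch("Googlebot", "http://mysite.com/cgi-bin/run.pl"))  # -> False
# Everyone else falls back to the * record:
print(rp.can_fetch("SomeOtherBot", "http://mysite.com/index.html"))   # -> False
```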
5) There is a way to use Disallow: to essentially turn it into
"Allow all", and that is by not entering a value after the colon (:):
User-agent: *
Disallow: /
User-agent: ia_archiver
Disallow:
Here I'm saying that all crawlers should be prohibited from crawling our
site, except for ia_archiver (Alexa's crawler),
which is allowed.
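Running this through urllib.robotparser confirms that the empty Disallow: reads as "allow everything" for ia_archiver, while the * record still blocks everyone else:

```python
import urllib.robotparser

rules = """\
User-agent: *
Disallow: /

User-agent: ia_archiver
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("ia_archiver", "http://mysite.com/index.html"))  # -> True
print(rp.can_fetch("AnyBot", "http://mysite.com/index.html"))       # -> False
```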
6) Finally, some crawlers, most notably Google, now support an additional
field called "Allow:". As its name implies, "Allow:" lets you
explicitly dictate which files/folders can be crawled. However, this field is
not part of the original "robots.txt" protocol, so my recommendation is to
use it only if absolutely needed, as it might confuse some less capable
crawlers.
Per Google's FAQs for
webmasters, the below is the preferred way to disallow all crawlers from
your site EXCEPT Google:
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
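Python's urllib.robotparser happens to understand Allow: as well, so the same sanity check works here; whether a given crawler honors Allow: is, as noted above, up to that crawler:

```python
import urllib.robotparser

rules = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "http://mysite.com/anything.html"))  # -> True
print(rp.can_fetch("AnyBot", "http://mysite.com/anything.html"))     # -> False
```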