Working with the robots.txt file
Jagdeep. S. Pannu, SEORank
What is the robots.txt file?
Working with the robots.txt file;
Advantages of the robots.txt file;
Disadvantages of the robots.txt file;
Optimization of the robots.txt file;
Using the robots.txt file;
Related reading.
What is the robots.txt file?
The robots.txt file is a plain ASCII text file that gives search engine
robots specific instructions about content that they are not allowed to
index. These instructions are a deciding factor in how a search engine
indexes your website’s pages. The universal address of the robots.txt
file is:
www.example.com/robots.txt
This is the first file that a robot visits. It picks up instructions for
indexing the site content and follows them. The file uses two text
fields. Let’s study this robots.txt example:
User-agent: *
Disallow:
The User-agent field names the robot to which the access policy in the
Disallow field applies. The Disallow field specifies the URLs that the
named robots may not access; left empty, as above, it blocks nothing.
An example that blocks everything:
User-agent: *
Disallow: /
Here “*” means all robots and “/” means all URLs. This is read as, “No
access for any search engine to any URL.” Since every URL path begins
with “/”, a Disallow value of “/” with nothing after it bans access to
all URLs. If partial access is to be given, only the banned URL is
specified in the Disallow field. Let’s consider this example:
# Research access for Googlebot.
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /concepts/new/
Here we see that both fields have been repeated. Separate commands can be
given for different user agents on different lines. The above commands
mean that all user agents are denied access to /concepts/new/ except
Googlebot, which has full access. Characters following # are ignored up
to the end of the line, as they are treated as comments.
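The Googlebot example above can be checked mechanically. What follows is a minimal sketch, assuming Python and its standard-library urllib.robotparser module, neither of which the article itself uses. The robot name OtherBot and the page URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example from the article, one string per line of the file.
ROBOTS_LINES = [
    "# Research access for Googlebot.",
    "User-agent: Googlebot",
    "Disallow:",
    "User-agent: *",
    "Disallow: /concepts/new/",
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)  # parsed in memory; no network request is made

# "OtherBot" and the URLs below are hypothetical, chosen only for the demo.
googlebot_ok = parser.can_fetch(
    "Googlebot", "http://www.example.com/concepts/new/")
otherbot_blocked = not parser.can_fetch(
    "OtherBot", "http://www.example.com/concepts/new/")
otherbot_elsewhere = parser.can_fetch(
    "OtherBot", "http://www.example.com/index.html")

print(googlebot_ok, otherbot_blocked, otherbot_elsewhere)
```

This only shows how one parser reads the file; individual robots may interpret edge cases differently.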
Working with the robots.txt file
1. The robots.txt file is always named in all lowercase (e.g. Robots.txt
or robots.Txt is incorrect).
2. Wildcards are not supported in either field. Only * can be used, in
the User-agent field, because it is a special character denoting “all”.
Googlebot is the only robot that now supports some wildcard file
extensions.
Ref:
http://www.google.com/webmasters/faq.html#12
3. The robots.txt file is an exclusion file meant for search engine robot
reference and not obligatory for a website to function. An empty or absent
file simply means that all robots are welcome to index any part of the
website.
4. Only one file can be maintained per domain.
5. Website owners who do not have administrative rights sometimes cannot
create a robots.txt file. In such situations, the Robots Meta Tag can be
configured to serve the same purpose. Keep in mind that questions have
lately been raised about how robots handle the Robots Meta Tag, and some
robots might skip it altogether. The protocol makes it obligatory for all
robots to start with the robots.txt file, which makes that file the
default starting point for all robots.
6. Each user agent must be specified on its own line, and a Disallow
field must not carry more than one path per line. There is no limit to
the number of lines, though: both the User-agent and Disallow fields can
be repeated with different values any number of times. Blank lines must
not appear within a single record, that is, between a User-agent line and
the Disallow lines that follow it.
7. Use lowercase for robots.txt file content wherever you can. Please
also note that filenames on Unix systems are case sensitive, so directory
and file names for Unix-hosted domains must be written in exactly the
case used on the server.
You can check your robots.txt file with the robots.txt Validator from
Searchengineworld.
Advantages of the robots.txt file
The protocol demands that all search engine robots start with the
robots.txt file. This is the default entry point for robots if the file
is present. Specific instructions can be placed in this file to help
index your site on the web. Major search engines will never violate the
Standard for Robots Exclusion.
1. The robots.txt file can be used to keep out unwanted robots like email
retrievers, image strippers etc.
2. The robots.txt file can be used to specify the directories on your
server that you don’t want robots to access and/or index e.g. temporary,
cgi, and private/back-end directories.
3. An absent robots.txt file could generate a 404 error and redirect the
robot to your default 404 error page. Careful research showed that sites
with no robots.txt file and a customized 404 error page would serve that
page to robots instead. The robot is bound to treat it as the robots.txt
file, which can confuse its indexing.
4. The robots.txt file is used to direct select robots to the relevant
pages to be indexed. This is especially handy where the site has
multilingual content or where the robot is searching for only specific
content.
5. The robots.txt file was also needed to stop robots from deluging
servers with rapid-fire requests or re-indexing the same files
repeatedly. If you have duplicate content on your site for any reason,
you can keep it from being indexed. This will help you avoid any
duplicate content penalties.
Disadvantages of the robots.txt file
Careless handling of directory names and filenames can lead hackers to
snoop around your site by studying the robots.txt file, since you may
sometimes list filenames and directories that hold confidential content.
This is not a serious issue, as effective security checks on the content
in question can take care of it. For example, if your traffic log is on
your site at a URL such as
www.example.com/stats/index.htm
and you do not want robots to index it, you would have to add a command
to your robots.txt file. As an example:
User-agent: *
Disallow: /stats/
However, it is easy for a snooper to guess what you are trying to hide:
simply typing the URL www.example.com/stats into a browser would give
access to it. This calls for one of the following remedies:
1. Change file names:
Change the stats filename from index.htm to something different, such
as stats-new.htm, so that your stats URL now becomes
www.example.com/stats/stats-new.htm
Then place a simple text file containing the text “Sorry, you are not
authorized to view this page” in your /stats/ directory, saved as
index.htm.
This way the snooper cannot guess your actual filename and get to your
banned content.
2. Use login passwords:
Password-protect the sensitive content listed in your robots.txt
file.
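To see how little effort the snooping described above takes, here is a small sketch in Python (an assumption; the article names no language). The helper name listed_paths is hypothetical. It collects every Disallow value, which is the same list a curious visitor gets by opening the file in a browser.

```python
def listed_paths(robots_text):
    """Return every non-empty Disallow value found in a robots.txt body."""
    paths = []
    for raw_line in robots_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# The record from the traffic-log example above.
example = "User-agent: *\nDisallow: /stats/\n"
print(listed_paths(example))
```

Anything this function returns should be treated as publicly known, which is why the remedies above rely on renaming or password protection rather than on the robots.txt file itself.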
Optimization of the robots.txt file
The right commands: Use correct commands. The most common error is
putting the value meant for the “User-agent” field in the “Disallow”
field, and vice versa.
Please also note that there is no “Allow” command in the standard
robots.txt protocol. Content not blocked in the “Disallow” field is
considered allowed. Currently, only two fields are recognized: the
“User-agent” field and the “Disallow” field. Experts are considering the
addition of more robot-recognizable commands to make the robots.txt file
more Webmaster and robot friendly.
Note: Google is the only search engine that is experimenting with certain
new robots.txt commands. There are indications that Google now recognizes
the "Allow" command. Please refer to:
http://www.google.com/webmasters/faq.html#12
Bad syntax: Do not put multiple URLs in one Disallow line in the
robots.txt file. Use a new Disallow line for every directory that you
want to block access to. Incorrect example:
User-agent: *
Disallow: /concepts/ /links/ /images/
Correct example:
User-agent: *
Disallow: /concepts/
Disallow: /links/
Disallow: /images/
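A mistake like the incorrect example above is easy to catch automatically. The following is a rough sketch in Python (an assumption, not something the article uses); the function name multi_path_disallow_lines is hypothetical, and the check is just "whitespace inside a Disallow value".

```python
def multi_path_disallow_lines(robots_text):
    """Return 1-based line numbers of Disallow lines holding several paths."""
    bad = []
    for number, raw_line in enumerate(robots_text.splitlines(), start=1):
        line = raw_line.split("#", 1)[0].strip()  # ignore comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            if len(value.split()) > 1:  # more than one path on the line
                bad.append(number)
    return bad


incorrect = "User-agent: *\nDisallow: /concepts/ /links/ /images/\n"
correct = (
    "User-agent: *\n"
    "Disallow: /concepts/\n"
    "Disallow: /links/\n"
    "Disallow: /images/\n"
)
print(multi_path_disallow_lines(incorrect))
print(multi_path_disallow_lines(correct))
```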
Files and directories: If a specific file has to be disallowed, end the
path with the file extension and no trailing forward slash. Study the
following example:
For file:
User-agent: *
Disallow: /hilltop.html
For Directory:
User-agent: *
Disallow: /concepts/
Remember that if you have to block access to all files in a directory,
you don’t have to specify each and every file in robots.txt. You can
simply block the directory as shown above. Another common error is
leaving out the slashes altogether. This would convey a very different
message than intended.
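The difference between file and directory rules comes from the fact that, under the original exclusion standard, a Disallow value is matched as a prefix of the URL path. The sketch below illustrates this with Python's standard-library urllib.robotparser (an assumption; the article does not use Python). The helper is_allowed, the robot name AnyBot, and the path /concepts.html are hypothetical. The last two checks show one variant of the slash mistake, a missing trailing slash.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(disallow_value, path, agent="AnyBot"):
    """Check one path against a single 'User-agent: *' record."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", "Disallow: " + disallow_value])
    return parser.can_fetch(agent, path)


# A file rule blocks that file and leaves other files alone.
file_blocked = not is_allowed("/hilltop.html", "/hilltop.html")
other_file_ok = is_allowed("/hilltop.html", "/index.html")

# A directory rule blocks everything beneath the directory.
dir_blocked = not is_allowed("/concepts/", "/concepts/new/a.html")

# Without the trailing slash the rule is a bare prefix, so it also
# catches a sibling file that merely starts with the same characters.
sibling_blocked = not is_allowed("/concepts", "/concepts.html")
sibling_ok_with_slash = is_allowed("/concepts/", "/concepts.html")

print(file_blocked, other_file_ok, dir_blocked,
      sibling_blocked, sibling_ok_with_slash)
```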
The right location: No robot will access a badly placed
robots.txt file. Make sure that the location is
www.example.com/robots.txt.
Capitalization: Never capitalize your syntax commands. Directory names
and filenames are case sensitive on Unix platforms. The only capitals
used per the standard are in “User-agent” and “Disallow”.
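The effect of path case can be demonstrated with a short sketch, assuming Python's standard-library urllib.robotparser (not part of the article). The directory name /Concepts/ and the robot name AnyBot are hypothetical. The parser compares paths exactly as written, so a rule in the wrong case does not protect the directory.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
# Hypothetical server directory that is really named "Concepts".
parser.parse(["User-agent: *", "Disallow: /Concepts/"])

# The rule matches only the exact case in which it was written.
exact_case_blocked = not parser.can_fetch("AnyBot", "/Concepts/page.html")
other_case_blocked = not parser.can_fetch("AnyBot", "/concepts/page.html")

print(exact_case_blocked, other_case_blocked)
```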
Correct order: If you want to block access for all robots except one or
more specific ones, then the specific ones should be mentioned first.
Let’s study this robots.txt example:
User-agent: *
Disallow: /
User-agent: MSNBot
Disallow:
In the above case, MSNBot would simply leave the site without indexing
after reading the first record. The correct syntax is:
User-agent: MSNBot
Disallow:
User-agent: *
Disallow: /
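The recommended ordering can be sanity-checked with a sketch like the one below, assuming Python's standard-library urllib.robotparser (not something the article uses). OtherBot and /index.html are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Specific robot first, catch-all record last, as recommended above.
RECOMMENDED = [
    "User-agent: MSNBot",
    "Disallow:",
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(RECOMMENDED)

msnbot_allowed = parser.can_fetch("MSNBot", "/index.html")
others_allowed = parser.can_fetch("OtherBot", "/index.html")

print(msnbot_allowed, others_allowed)
```

Note that this particular parser treats the "*" record as a fallback wherever it appears in the file, so it cannot reproduce the failure described above. The ordering advice is aimed at robots that read the file from the top down and act on the first record that applies.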
The robots.txt file: Presence - Not having a robots.txt file at all
could generate a 404 error for search engine robots, which could
redirect the robot to the default 404 error page or your customized 404
error page. If this happens seamlessly, it is up to the robot to decide
whether the target file is a robots.txt file or an HTML file. Typically
it would not cause many problems, but you may not want to risk it. It is
always better to put the standard robots.txt file in the root directory
than not to have one at all.
The standard robots.txt file for allowing all robots to index all pages is:
User-agent: *
Disallow:
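One way to notice that a custom error page is being served in place of robots.txt is to test whether the response body has the shape of a robots.txt file at all. The following is a rough heuristic sketched in Python (an assumption; the article prescribes no such check). The function name looks_like_robots_txt and the sample error page are hypothetical, and the body is assumed to have been fetched already by some other means.

```python
def looks_like_robots_txt(body):
    """Heuristic: every meaningful line must have the 'Field: value' shape."""
    for raw_line in body.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue  # blank lines and comment lines are fine
        field, sep, _value = line.partition(":")
        # Field names such as "User-agent" contain only letters and hyphens.
        if not sep or not field.strip().replace("-", "").isalpha():
            return False
    return True


standard_file = "User-agent: *\nDisallow:\n"
custom_404_page = "<html><body><h1>Page not found</h1></body></html>"

print(looks_like_robots_txt(standard_file))
print(looks_like_robots_txt(custom_404_page))
```

An empty body passes the check, which matches the earlier point that an empty file welcomes all robots.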
Using # carefully in the robots.txt file: Adding a comment with “#” on
the same line as a syntax command is not a good idea. Some robots might
misinterpret the line, although it is acceptable as per the robots
exclusion standard. Putting comments on their own lines is always
preferred.
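If a file already has comments trailing its commands, they can be moved onto their own lines mechanically. Here is a minimal sketch in Python (an assumption; the function name hoist_comments and the sample comment text are hypothetical). Each trailing comment is placed on the line above its command; under the exclusion standard a comment-only line does not end a record.

```python
def hoist_comments(robots_text):
    """Move each trailing '# comment' onto its own line above the command."""
    out = []
    for raw_line in robots_text.splitlines():
        command, sep, comment = raw_line.partition("#")
        if sep and command.strip():
            # A command followed by a comment: split the two apart.
            out.append("#" + comment.rstrip())
            out.append(command.rstrip())
        else:
            # A plain command, a comment-only line, or a blank line.
            out.append(raw_line.rstrip())
    return "\n".join(out) + "\n"


before = "User-agent: *\nDisallow: /stats/ # traffic logs\n"
print(hoist_comments(before), end="")
```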
Using the robots.txt file
1. Robots are configured to read text. Too much graphic content could
render your pages invisible to the search engine. Use the robots.txt file
to block irrelevant and graphic-only content.
2. Indiscriminate access to all files, it is believed, can dilute the
relevance of your site content once it has been indexed by robots. This
could seriously affect your site’s ranking with search engines. Use the
robots.txt file to direct robots to content relevant to your site’s theme
by blocking the irrelevant files or directories.
3. The file can be used for multilingual websites to direct robots to
relevant content for relevant topics for different languages. It
ultimately helps the search engines to present relevant results for
specific languages. It also helps the search engine in its advanced search
options where language is a variable.
4. Some robots could cause severe server loading problems by rapid firing
too many requests at peak hours. This could affect your business. By
excluding some robots that might be irrelevant to your site, in the
robots.txt file, this problem can be taken care of. It is really not a
good idea to let malevolent robots use up precious bandwidth to harvest
your emails, images etc.
5. Use the robots.txt file to block out folders with sensitive
information, text content, demo areas or content yet to be approved by
your editors before it goes live.
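The uses listed above can be combined into one file generated from a simple policy table. The sketch below is written in Python (an assumption, not the article's method); the function name build_robots_txt, the robot name EmailHarvester, and the directory names are all hypothetical. It follows the rules given earlier: one path per Disallow line, specific robots before the catch-all record, and no blank lines inside a record.

```python
def build_robots_txt(policies):
    """Render {agent: [paths]} as robots.txt text, catch-all record last."""
    # Specific robots sort first; the "*" record is moved to the end.
    agents = sorted(policies, key=lambda agent: agent == "*")
    records = []
    for agent in agents:
        lines = ["User-agent: " + agent]
        # An empty list becomes a bare "Disallow:", i.e. full access.
        for path in policies[agent] or [""]:
            lines.append(("Disallow: " + path).rstrip())
        records.append("\n".join(lines))
    # Records are separated from each other by one blank line.
    return "\n\n".join(records) + "\n"


# "EmailHarvester" and the directory names are made up for illustration.
policies = {
    "*": ["/cgi-bin/", "/temp/", "/drafts/"],
    "EmailHarvester": ["/"],
}
print(build_robots_txt(policies), end="")
```

Keep in mind that a robot built to harvest addresses may simply ignore the file, so this addresses only robots that respect the exclusion standard.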
The robots.txt file is an effective tool to address certain issues
regarding website ranking. Used in conjunction with other SEO strategies,
it can significantly enhance a website’s presence on the net.
Article last updated: 11th March 2004
Related Reading
A Standard for Robots Exclusion
Guide to The Robots Exclusion Protocol
W3C Recommendations
Meta Tags Optimization for Search Engines
(c) Copyright 2004 Jagdeep.S. Pannu, SEORank
This Article is Copyright protected. If you would like to have this article republished on your site, please contact the author here:
SEO Articles Feedback. We just require all due credits carried; and text, hyperlinks and headers unaltered. This article must not be used in unsolicited mail.