
Robots.txt Info

Discussion in 'White Hat SEO' started by barsha, Jun 25, 2008.

  1. barsha

    Here is some info that is going into the FAQ section of Stomper, but I wanted to get it in here too.

    Understanding the Power of the Robots.txt File
    Whether you are a web veteran or a rookie, I am confident you have heard of the robots.txt file. You have probably heard myths, conflicting information, and advice on how to use it. You may also have heard advice to abandon it entirely. Who is right?

    I'm here to tell you.
    First things first: the robots.txt file was designed to tell bots how to behave on your site, that is, what information they may retrieve and what they may not. It's a simple text file that is very easy to create once you understand the proper format. The system it follows is called the Robots Exclusion Standard.

    An example of a robots.txt file can be found at: http://www.webmarketingnow.com/robots.txt

    An important point to remember is to create your robots.txt file in Notepad or another plain text editor. DO NOT, under any circumstances, create your robots.txt file in an HTML editor like Dreamweaver, GoLive or FrontPage. FTP clients will usually convert the line endings to Unix format for you, but there are occasions when that fails. Do not take the chance; create it in Notepad instead.

    The User-agent line specifies the robot. For example:

    User-agent: googlebot

    You may also use the wildcard character '*' to specify all robots. For example:

    User-agent: *

    You can find user agent names in your own logs by checking for requests to robots.txt. Most major search engines have names for their spiders.

    Here is a partial list:

    msnbot (the MSN robot)
    Slurp (the Yahoo! robot, recently renamed "Yahoo! Slurp")
    Mediapartners-Google (the Google AdSense robot)
    Xenu Link Sleuth (a link checker)
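    You can pull those names straight out of a raw access log. A minimal sketch, assuming the common Apache/Nginx "combined" log format (where the user agent is the last quoted field) and a log file named access.log:

```shell
# Count requests for robots.txt per user agent, assuming the
# "combined" log format where the user agent is field 6 when
# splitting each line on double quotes.
grep 'GET /robots.txt' access.log \
  | awk -F'"' '{print $6}' \
  | sort | uniq -c | sort -rn
```

    Each line of output is a hit count followed by the spider's user-agent string, most active spider first.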


    The second part of a robots.txt file consists of Disallow: directive lines. The presence of a Disallow statement doesn't mean the bot(s) are barred from the entire site; these lines specify particular files and/or directories. For example, if you want to instruct spiders not to download private.htm, you would enter:

    Disallow: /private.htm

    Note the leading slash: Disallow paths should start from the root of your site.

    You can also specify directories:

    Disallow: /cgi-bin/

    This will block spiders from your cgi-bin directory. Some webmasters are nervous about listing a directory to exclude in robots.txt, since that gives hackers a reason to attempt to get into that folder. You can exclude a folder without giving out its full name. For example, if the directory you want to exclude is "secret", you could add:

    Disallow: /sec

    This disallows spiders from indexing any folder whose name begins with "sec", so look at your directory structure before implementing this; it would also disallow a folder named "secondary".
    There is also a wildcard nature to the Disallow directive. The standard dictates that /temp would disallow /temp.html and /temp/index.html (both the file temp and files in the temp directory will not be indexed).
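    You can see this prefix behavior for yourself with the robots.txt parser in Python's standard library, which is a handy way to sanity-check rules before you upload the file (example.com is just a placeholder domain):

```python
# Check the prefix matching described above with Python's
# standard-library robots.txt parser.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /sec",
]

parser = RobotFileParser()
parser.parse(rules)

# Both "secret" and "secondary" match the /sec prefix:
print(parser.can_fetch("*", "http://example.com/secret/"))     # False
print(parser.can_fetch("*", "http://example.com/secondary/"))  # False
print(parser.can_fetch("*", "http://example.com/public/"))     # True
```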

    If you leave the Disallow line blank, it indicates that ALL files may be retrieved. At least one Disallow line must be present for each User-agent directive for the file to be in the correct format. If you get this wrong, the file will not be compliant; chances are the bots will misread it or simply ignore the entire file, and Yahoo! has been known to do this. A completely empty robots.txt file is treated the same as if it were not present. Also, over 80% of people who complain that bots are not obeying their robots.txt file have syntax errors in it, so the file isn't being read at all.
    Any line in the robots.txt that begins with "#" is considered to be a comment line and is ignored. The standard allows for comments at the end of directive lines, but this is really bad formatting style and I don't recommend it. Example:

    Disallow: temp # Disallowing access to the temp folder.

    Some spiders will not interpret the above line correctly and instead will attempt to disallow 'temp#comment'.

    Instead, format the line as follows:

    #Disallowing access to the temp folder

    Disallow: /temp/

    That makes for a cleaner looking robots.txt file.

    The following allows all robots to visit all files, because the wildcard '*' specifies all robots and the blank Disallow line excludes nothing:

    User-agent: *
    Disallow:

    Want to keep all robots out? Use this one:

    User-agent: *
    Disallow: /

    Want to keep out just one bot? Let's deny Ask:

    User-agent: Teoma
    Disallow: /

    Keep bots out of your cgi-bin and images folders:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /images/

    How about just keeping out the Google Images Bot, but allowing other image bots free roam of your site?:

    User-agent: Googlebot-Image
    Disallow: /images/

    Note: the above code only blocks images stored inside the /images/ folder. If you have images elsewhere throughout your site, the image bot will still get them.
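    You can confirm that the rule above only binds Googlebot-Image by feeding it to Python's standard-library parser as a stand-in for the real bots (example.com is a placeholder, and Slurp is just one example of another image-fetching agent):

```python
# Verify that the Disallow only applies to the named user agent:
# Googlebot-Image is blocked from /images/, everyone else is not.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: Googlebot-Image",
    "Disallow: /images/",
]

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("Googlebot-Image", "http://example.com/images/logo.gif"))  # False
print(parser.can_fetch("Slurp", "http://example.com/images/logo.gif"))            # True
```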

    This one bans Email Harvester from all files on the server:

    User-agent: emailharvester
    Disallow: /

    This one keeps googlebot from getting at the cloaking.htm file:

    User-agent: googlebot
    Disallow: /cloaking.htm

    If you create a page that is perfect for Yahoo!, but you don't want Google to see it:

    User-Agent: Googlebot
    Disallow: /yahoo-page.html

    Before a "light bulb" goes off in your head at the above examples and you realize you can do User-Agent-based cloaking: don't go down that road. That is known as "poor man's cloaking." It may work for a little while, but you will get nailed hard, and getting the domain back into the index is a long, painful process. It just isn't worth it.

    Common Questions about the Robots.txt File

    Q: Why should I use it when I can use the meta-robots tag instead?

    A: First of all, the meta-robots tag is not consistently honored by the search engines, and in the testing I have done it is often not read at all. All the major engines and most of the minor ones look for robots.txt and do their best to obey it; the same is not true of the meta-robots tag. Also, if you do use the meta-robots tag, don't use the "index,follow" parameter: that is what a search bot does by default. It would be like having a sign above your desk that says, "Breathe. Blink eyes." You don't need to be told to do that, and neither do the bots.

    Q: What if I don't use the robots.txt file? What is the worst that can happen?

    A: According to my testing, when a site that has been online for 12 months or longer employs the robots.txt file and doesn't make any other changes, the site is indexed an average of 14% deeper than it was before.

    Q: Where do I place the robots.txt file?

    A: The file should be placed in the root directory of your server. In other words, in the same place as your index.html file for your home page.

    Q: What are some things that I would want to exclude from the robots?

    A: Here are a few examples:

    Any folder that is "off limits" to the public eye and that you have not (for whatever reason) password protected
    Print-friendly versions of pages (to avoid the duplicate content filter)
    Images, to protect them and to avoid spidering problems
    CGI-BIN (programming code)
    Review your web logs, find spiders you don't want coming to your site, and deny them. I always look at the data transferred, and I draw the line at 10,000 Kb or more per month; anything less than that is not worth your time. The following is a dump from one of our servers over a 30-day period:

    Spider                           Hits     Data Transferred (Kb)
    MSNBot                           12,473   259,687
    Yahoo                            10,548   193,983
    GoogleBot                         5,768   138,447
    Ask Jeeves robot                  4,623   113,023
    LinksManager Link Checker Bot     1,356    31,698
    Xenu link checker                 1,061    20,209
    Alexa                               740    17,711
    wisenut robot                       578    10,317
    What would I deny in the list above? Honestly, I would deny Ask Jeeves: the bot uses a ton of resources, and the amount of referral traffic from Ask is so low it isn't a fair trade. You would also want to deny the LinksManager bot, which is not just a resource hog but will fill your inbox with spam link requests from garbage sites looking for reciprocal links. I would also deny WiseNut; it was a good idea that never quite made it.

    Q: I exclude bots from indexing my site in the robots.txt file, but they come and crawl anyway. What am I doing wrong?

    A: Make sure you validate your robots.txt file; I prefer the validator from Search Engine World. Alternatively, you may have encountered an "evil" bot that wants to harvest your content or your email addresses for spam. Evil bots disobey the robots.txt file on purpose; to block them, you will need to use your .htaccess file (on an Apache server) instead.
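    Syntax errors of the kind mentioned above are easy to catch even with a crude script. Here is a minimal homegrown check, assuming only the common directives; it is a sketch, not a substitute for a full validator:

```python
# Flag the two most common robots.txt mistakes: a missing ':'
# separator and a misspelled directive name.
KNOWN = {"user-agent", "disallow", "allow", "crawl-delay", "sitemap"}

def check_robots(text):
    problems = []
    for n, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()  # comments are ignored
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {n}: missing ':' separator")
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field not in KNOWN:
            problems.append(f"line {n}: unknown directive '{field}'")
    return problems

# A typo like "Disalow" is exactly the kind of error that makes
# bots ignore the file:
print(check_robots("User-agent: *\nDisalow: /temp/"))
# ["line 2: unknown directive 'disalow'"]
```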

    Creating your robots.txt file is not complicated and will take less than seven minutes if you follow these steps:

    1. Copy and paste the robots.txt file from our site at http://www.webmarketingnow.com/robots.txt into Notepad
    2. FTP to your web server and write down the folders you want to exclude
    3. Modify the Disallow lines in the robots.txt to reflect the folders you targeted
    4. Save the file
    5. Upload it to your server
    6. Validate the file
    7. If you need to make changes, do so and then repeat steps 4-6

    In about two weeks, you will begin to see improved spidering, a greater depth of indexing, and maybe even a rise in your rankings. Good luck!

    Best Regards,

    Jerry West
  2. tkdman99

    Thanks! I was looking for this info today!
  3. goonieguhu

    Great summary!

    A few more thoughts from a recent experience I had. I was having problems with Google indexing too much (my print pages, even my plugin files), so I went searching for a sample robots.txt file.

    I added the following:
    User-agent: *
    Disallow: /wp-content/cache/
    Disallow: /wp-content/themes/
    Disallow: /wp-content/plugins/
    Disallow: /wp-admin/
    Disallow: /wp-includes/
    Disallow: /wp-login.php
    Disallow: /wp-register.php
    Disallow: /page/
    Disallow: /feed/
    Disallow: /*/feed/$
    Disallow: /*/feed/rss/$
    Disallow: /*/trackback/$
    Disallow: /?

    Here's the problem that arose: by disallowing the robots from my RSS feed, I shot myself in the foot for indexing through Google Blog Search, which specifically uses the feed for indexing. It took me a couple of days to figure out why all of a sudden none of my new posts were being indexed, and then several more days after that for it to recover. So be careful what you include.

    One more thing I learned from this screwup is how to use the robots.txt tools in Google Webmaster Tools. Go to Tools, then "Analyze robots.txt" or "Generate robots.txt", and play around with them. I especially like the Analyze tool, where you can enter a specific URL and test whether a robot would be allowed to crawl it.