
Why is Googlebot completely ignoring robots.txt?

Discussion in 'White Hat SEO' started by mikie46, Feb 2, 2010.

  1. mikie46

    mikie46 Jr. VIP Jr. VIP

    Joined:
    Aug 6, 2008
    Messages:
    1,454
    Likes Received:
    1,102
    So I added a rule to my robots.txt which basically says all files in my /support/ directory are off limits.

    Today I noticed Googlebot ignoring this request. First it reads my robots.txt:

    66.249.71.175 - - [01/Feb/2010:19:14:54 -0800] "GET /robots.txt HTTP/1.1" 200 5539 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

    Then it visits the knowledgebase articles in the directory I told it not to visit!!

    66.249.71.175 - - [01/Feb/2010:19:14:55 -0800] "GET /support/index.php?_m=knowledgebase&_a=viewarticle&kbarticleid=160&nav=0%2C4%2C8 HTTP/1.1" 200 44787 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

    WTF is up with that, and since when does Google not understand:

    Disallow: /support/
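
    One way to rule out a problem with the rule itself is to test it offline. The following is a minimal sketch using Python's standard `urllib.robotparser`. It assumes the rule sits under a `User-agent: *` group, which the post does not show, and `is_allowed` is a hypothetical helper name. It only checks how the rule matches a path, not how Googlebot actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: the thread only shows the Disallow line,
# so the "User-agent: *" group is an assumption made for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /support/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network fetch is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

    Calling `is_allowed(ROBOTS_TXT, "Googlebot", "/support/index.php?_m=knowledgebase")` should report the URL as blocked under these assumptions. If the real file also contains a separate `User-agent: Googlebot` group, that group takes precedence over `*`, so it is worth testing against the full file rather than this excerpt.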
     
  2. mtravel13

    mtravel13 Registered Member

    Joined:
    Sep 2, 2009
    Messages:
    81
    Likes Received:
    17
    Occupation:
    web designer
    Location:
    internet
    That's strange!
    Is it just visiting the pages, or caching them too?
    It might be that it takes a while for the new robots.txt rules to propagate through Google's data centers, and right now all this bot is doing is carrying that information over to Google's servers.
    One more interesting thing I found out was this:
    "Verifying Googlebot

    You can verify that a bot accessing your server really is Googlebot by using a reverse DNS lookup, verifying that the name is in the googlebot.com domain, and then doing a forward DNS lookup using that googlebot name. This is useful if you're concerned that spammers or other troublemakers are accessing your site while claiming to be Googlebot.

    "

    Are you sure it was Googlebot and not some disguised bot trying to crawl your pages?
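
    The reverse-then-forward DNS check quoted above can be sketched roughly as follows, using Python's standard `socket` module. The function name `is_real_googlebot` and its injectable `reverse_lookup` / `forward_lookup` parameters are hypothetical, added only so the logic can be tried without touching the network. The domain check follows the quoted text (googlebot.com only).

```python
import socket


def is_real_googlebot(ip, reverse_lookup=socket.gethostbyaddr,
                      forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve ip, check the domain, then forward-resolve to confirm."""
    # Step 1: reverse DNS lookup (IP -> hostname).
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return False

    # Step 2: the name must be in the googlebot.com domain.
    if not hostname.endswith(".googlebot.com"):
        return False

    # Step 3: forward DNS lookup (hostname -> IPs) must include the original IP.
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    return ip in addresses
```

    With the default resolvers, `is_real_googlebot("66.249.71.175")` would perform live DNS queries for the address from the log lines. The result depends on current DNS records, so treat it as a sanity check rather than proof.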
     
  3. mikie46

    mikie46 Jr. VIP Jr. VIP

    Joined:
    Aug 6, 2008
    Messages:
    1,454
    Likes Received:
    1,102
    The IP is definitely Google. I've seen the same IP scattered throughout my server logs indexing other files and directories. Not sure why it's not abiding by the rules.

    Also, bots that ignore robots.txt usually don't read it. They just go ahead and index every directory. If it's not Googlebot, it usually won't read robots.txt.
     
    Last edited: Feb 2, 2010
  4. Dangazzm

    Dangazzm Regular Member

    Joined:
    Jan 9, 2010
    Messages:
    230
    Likes Received:
    21
    I've never looked this closely at how robots.txt and Googlebot work, but could what the poster above said be possible?

    That it will take the information back to be indexed and will know in the future NOT to crawl your directory? I would wait until the next time it comes back and see if THIS is how it works.