Checking for Google crawlers (user agent etc.)

Discussion in 'Other Languages' started by tb303, Sep 4, 2015.

  1. tb303

    tb303 Power Member

    Joined:
    Dec 18, 2011
    Messages:
    734
    Likes Received:
    388
    I'm testing a couple of indexing things, so I'm looking to simply log when Googlebot turns up to crawl a page for the first time. It doesn't need to be fancy; it's just a simple bit of PHP.

    But I want to be sure I'm not missing anything with it.

    Would you say it's enough to simply check if the user agent contains "googlebot", or should I be checking IPs as well?

    I know cloakers have been checking IPs for a long time, but I'm only interested in logging when the crawler turns up for the first time. I'd really like to be 100% sure though.

    Anyone got any thoughts on this?

    reference:
    https://support.google.com/webmasters/answer/1061943?hl=en
     
  2. MrBlue

    MrBlue Senior Member

    Joined:
    Dec 18, 2009
    Messages:
    969
    Likes Received:
    678
    Occupation:
    Web/Bot Developer
    User agents can easily be spoofed. If you want to be 100% sure it's Googlebot, you'll need to run a reverse DNS lookup on the IP and then a forward lookup on the returned hostname.
    See here:
    https://support.google.com/webmasters/answer/80553?hl=en

    Based on that article, here is a quick and very dirty way to verify whether it's actually Googlebot using PHP:
    Code:
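    <?php
    // Quick and dirty Googlebot check: reverse DNS, then forward-confirm.
    $ip = $_SERVER['REMOTE_ADDR'];
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $is_googlebot = false;
    if (strpos($ua, 'Google') !== false) {
        // Reverse lookup, e.g. "... domain name pointer crawl-66-249-66-1.googlebot.com."
        $rev_dns = shell_exec('host ' . escapeshellarg($ip));
        $parts = $rev_dns ? explode('domain name pointer ', $rev_dns) : array();
        if (isset($parts[1])) {
            // Keep only the first hostname and drop the trailing dot.
            $dn = rtrim(trim(strtok($parts[1], "\n")), '.');
            // Accept only hosts under googlebot.com or google.com,
            // not any name that merely contains "googlebot".
            if (preg_match('/\.(googlebot|google)\.com$/i', $dn)) {
                // Forward lookup must map the hostname back to the same IP.
                $fwd_dns = shell_exec('host ' . escapeshellarg($dn));
                $parts = $fwd_dns ? explode('has address ', $fwd_dns) : array();
                $is_googlebot = isset($parts[1]) && trim(strtok($parts[1], "\n")) === $ip;
            }
        }
    }
    echo $is_googlebot ? 'googlebot' : 'not googlebot';
    ?>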
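
    The snippet above only prints a result. For the original goal of logging the first time a page is crawled, a minimal sketch could look like the following. The function name log_first_crawl() and the tab-separated log format are made up for illustration, and the read-then-append step is not atomic, so two simultaneous first hits could both be written.

```php
<?php
/**
 * Append a line to $logFile the first time $url is seen.
 * Returns true if a new entry was written, false if the URL was already logged.
 * Hypothetical helper: the name and the "time<TAB>url" format are assumptions.
 */
function log_first_crawl($logFile, $url, $timestamp)
{
    $seen = is_file($logFile)
        ? file($logFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
        : array();
    if ($seen === false) {
        $seen = array();
    }
    foreach ($seen as $line) {
        $fields = explode("\t", $line, 2);
        if (isset($fields[1]) && $fields[1] === $url) {
            return false; // this URL already has a first-crawl entry
        }
    }
    // gmdate() keeps the timestamp in UTC regardless of server timezone settings.
    $entry = gmdate('c', $timestamp) . "\t" . $url . "\n";
    return file_put_contents($logFile, $entry, FILE_APPEND | LOCK_EX) !== false;
}

// Possible call site, once the visitor has been verified as Googlebot:
// log_first_crawl(__DIR__ . '/googlebot.log', $_SERVER['REQUEST_URI'], time());
```

    For a handful of test pages a flat file is probably enough; with many URLs a database table with a unique key on the URL would avoid rescanning the log on every hit.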
    
     
    Last edited: Sep 5, 2015
  3. tb303

    tb303 Power Member

    Hey, thanks for that. That is a lot better than what I was doing.
     
  4. pasdoy

    pasdoy Power Member

    Joined:
    Jul 17, 2008
    Messages:
    764
    Likes Received:
    241
    my eyes bleed, aye
     
  5. itz_styx

    itz_styx Jr. VIP

    Joined:
    May 8, 2012
    Messages:
    361
    Likes Received:
    122
    Occupation:
    CEO / Admin / Developer
    Location:
    /dev/mem
    How about just using gethostbyaddr() instead of executing the "host" binary? ;)
    shell_exec() and other functions like system() are blocked on many servers for security reasons.
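
    A rough sketch of that idea, using gethostbyaddr() for the reverse lookup and gethostbynamel() for the forward confirmation. The function name is_verified_googlebot() and the two resolver parameters are hypothetical; the resolvers are only there so the logic can be tried without real DNS, and they default to PHP's built-ins. The accepted suffixes (googlebot.com and google.com) are taken from Google's verification article linked earlier in the thread and may not cover every Google crawler.

```php
<?php
/**
 * Forward-confirmed reverse DNS check for Googlebot without shell_exec().
 * $reverse and $forward are hypothetical injection points for testing;
 * by default they are PHP's gethostbyaddr() and gethostbynamel().
 */
function is_verified_googlebot($ip, $reverse = 'gethostbyaddr', $forward = 'gethostbynamel')
{
    // gethostbyaddr() returns false on malformed input
    // and the unmodified IP when the lookup fails.
    $host = call_user_func($reverse, $ip);
    if ($host === false || $host === $ip) {
        return false;
    }

    // Normalise and require the name to END in an accepted Google domain.
    $host = rtrim(strtolower($host), '.');
    if (!preg_match('/\.(googlebot|google)\.com$/', $host)) {
        return false;
    }

    // gethostbynamel() returns an array of IPv4 addresses, or false on failure.
    $addrs = call_user_func($forward, $host);
    return is_array($addrs) && in_array($ip, $addrs, true);
}

// Possible call site:
// if (is_verified_googlebot($_SERVER['REMOTE_ADDR'])) { /* log the visit */ }
```

    Note that gethostbynamel() only returns IPv4 addresses, so a crawler connecting over IPv6 would fail this check as written.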
     
  6. MrBlue

    MrBlue Senior Member

    This is a good point. Most crappy shared hosts do indeed block shell_exec() and system().
    gethostbyaddr() works as well, but fails if the domain name contains Unicode.

     
    Last edited: Sep 20, 2015
  7. itz_styx

    itz_styx Jr. VIP

    Yes, but you can use idn_to_ascii() to convert the domain first :)

     
  8. MrBlue

    MrBlue Senior Member

    How would that work? The input is an IP address.

     
  9. itz_styx

    itz_styx Jr. VIP

    Oh lol, yeah, it's the other way around: IP to domain, not domain to IP.
    Sorry, my bad, I posted too quickly without thinking properly :)