
How to check if URL has been indexed by Google?

Discussion in 'PHP & Perl' started by dtang4, Dec 24, 2011.

  1. dtang4

    dtang4 Regular Member

    Joined:
    Apr 7, 2010
    Messages:
    291
    Likes Received:
    43
    I would like to write a script in PHP to check if URLs have been indexed by Google. I would like to do this in bulk, so using Google's API won't work as is.

    I notice Nuclear Link Indexing has this capability built in and am wondering how this is done.

    Thanks.
     
  2. MrBlue

    MrBlue Senior Member

    Joined:
    Dec 18, 2009
    Messages:
    950
    Likes Received:
    662
    Occupation:
    Web/Bot Developer
    Well, I would just scrape Google's search results for each URL, using queries like "website.com/pageurl.html" or "site:website.com".
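
    Something like this is the basic idea (a rough sketch only; the "did not match any documents" marker is an assumption about Google's results page, and you'll still want the proxy/delay advice further down the thread):

    <?php
    // Rough sketch: returns true if Google shows at least one result for a
    // "site:" query on the URL. The "did not match any documents" check is an
    // assumption about the results page text and may need adjusting.
    function isIndexed($url)
    {
        $ch = curl_init('https://www.google.com/search?q=' . urlencode('site:' . $url));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_USERAGENT,
            'Mozilla/5.0 (Windows NT 6.1; rv:9.0) Gecko/20100101 Firefox/9.0');
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        $html = curl_exec($ch);
        curl_close($ch);

        if ($html === false) {
            return false; // request failed, treat as not confirmed indexed
        }
        return stripos($html, 'did not match any documents') === false;
    }

    var_dump(isIndexed('website.com/pageurl.html'));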
     
  3. dtang4

    dtang4 Regular Member

    Joined:
    Apr 7, 2010
    Messages:
    291
    Likes Received:
    43
    Scraping using something like cURL?

    I've tried that in the past, but I found Google has pretty good anti-scraping measures built in and I was immediately hit with a captcha.
     
  4. MrBlue

    MrBlue Senior Member

    Joined:
    Dec 18, 2009
    Messages:
    950
    Likes Received:
    662
    Occupation:
    Web/Bot Developer
    Yes, Google will throw up captchas pretty quickly. Make sure you use proxies and send accurate user-agent headers. Also, Google's mobile search doesn't implement strict bot detection. ;)
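
    For reference, the proxy and the user-agent are just two cURL options (sketch; the proxy address format and the UA string below are placeholders):

    <?php
    // Sketch: fetch a Google results page through a proxy with a realistic UA.
    // $proxy ("ip:port") and the user-agent string are placeholders to swap out.
    function fetchSerp($query, $proxy)
    {
        $ch = curl_init('https://www.google.com/search?q=' . urlencode($query));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_PROXY, $proxy); // e.g. "1.2.3.4:8080"
        curl_setopt($ch, CURLOPT_USERAGENT,
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7');
        curl_setopt($ch, CURLOPT_TIMEOUT, 20);
        $html = curl_exec($ch);
        curl_close($ch);
        return $html;
    }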
     
    • Thanks x 2
  5. xpwizard

    xpwizard Junior Member

    Joined:
    Nov 6, 2010
    Messages:
    198
    Likes Received:
    122
    The captcha is easy to deal with.
    But you could also find a list of datacenter IPs and scrape via those IPs instead of going directly through "google.com". Just set it up so that if a captcha appears, you switch to another datacenter IP.
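
    In code, that rotation boils down to: detect the block page, move to the next IP or proxy, retry (sketch; the proxy list and the "/sorry/" captcha marker are assumptions you would adjust):

    <?php
    // Sketch: try each proxy/datacenter IP in turn until one returns a normal
    // results page. The $proxies list and the "/sorry/" block-page marker are
    // assumptions.
    function fetchWithRotation($query, array $proxies)
    {
        foreach ($proxies as $proxy) {
            $ch = curl_init('https://www.google.com/search?q=' . urlencode($query));
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_PROXY, $proxy);
            curl_setopt($ch, CURLOPT_TIMEOUT, 20);
            $html = curl_exec($ch);
            curl_close($ch);

            // If the response failed or points at the captcha/block page,
            // move on to the next IP.
            if ($html !== false && stripos($html, '/sorry/') === false) {
                return $html;
            }
        }
        return false; // every IP got blocked
    }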
     
  6. hypefrenzy

    hypefrenzy Junior Member

    Joined:
    Dec 12, 2011
    Messages:
    170
    Likes Received:
    19
    scraping and manipulating the data is easy. i pretty much have that down using php. the main obstacle is getting past the captcha. that's what i'm looking into now. seems doable
     
  7. Xooor

    Xooor Newbie

    Joined:
    Aug 14, 2011
    Messages:
    18
    Likes Received:
    17
    I have coded a few different tools to scrape Google. Based on that experience, I think there are 3 main situations and sets of solutions:

    - Low Volume Scraping (fewer than a hundred queries per day or so, spaced out in time): you shouldn't need any proxy or captcha-breaking system
    - Medium Volume Scraping (a few thousand queries per day, randomly spaced in time): proxies should do the trick. Buy a few cheap proxies and balance the traffic over them; that should be enough to keep captchas from appearing, so you won't need a captcha-breaking system
    - High Volume: use both proxies and a captcha-breaking system (the proxies will reduce how often you have to call the breaker, but since the volume per proxy is still high, you'll still need one)

    I'm not sure which situation you are in. So far I've done fine with just proxies; I prefer not having to deal with captcha breaking, since it's more error-prone, complex and slower. So my suggestion would be to avoid needing to break the captchas if possible.
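
    For the medium-volume case the balancing can be as simple as picking a random proxy per query and sleeping a bit between requests (sketch; the proxy list, file name and delay range are placeholders, and fetchSerp() is the cURL helper sketched earlier in the thread):

    <?php
    // Sketch: spread "site:" checks over a small proxy pool with random delays.
    // Proxy addresses, urls.txt and the 10-30 second range are placeholders.
    $proxies = array('1.2.3.4:8080', '5.6.7.8:8080', '9.10.11.12:8080');
    $urls    = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

    foreach ($urls as $url) {
        $proxy = $proxies[array_rand($proxies)];    // pick a proxy at random
        $html  = fetchSerp('site:' . $url, $proxy); // helper from earlier post

        $indexed = $html !== false
            && stripos($html, 'did not match any documents') === false;
        echo ($indexed ? 'INDEXED     ' : 'NOT INDEXED ') . $url . PHP_EOL;

        sleep(rand(10, 30)); // space the queries out in time
    }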

    Regards.
     
  8. liquidone

    liquidone Newbie

    Joined:
    Dec 26, 2009
    Messages:
    31
    Likes Received:
    12
    Why reinvent the wheel? Just use Scrapebox. Load up your proxies and URLs and away you go.

    However, it only checks the info: operator. Sometimes I like to check both the site: and info: operators.
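
    If you do roll your own instead, checking both operators is just two queries per URL (trivial sketch; the URL is a placeholder, and each query would go through a scraping check like the one earlier in the thread):

    <?php
    // Sketch: build both query variants for one URL; feed each to your scraper.
    $url     = 'website.com/pageurl.html'; // placeholder URL
    $queries = array('site:' . $url, 'info:' . $url);

    foreach ($queries as $q) {
        echo 'https://www.google.com/search?q=' . urlencode($q) . PHP_EOL;
    }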
     
  9. hypefrenzy

    hypefrenzy Junior Member

    Joined:
    Dec 12, 2011
    Messages:
    170
    Likes Received:
    19
    my use for scraping is evolving while i'm putting together the big picture. i don't expect to run into any issues with captcha due to volume because i'll be steadily scanning through google results unnoticed for the most part. the main thing is i want my bot to know how to get past captcha when it does come up

    i've used the g search api and it works great but it is limiting with only 100 searches per day and a maximum of 10 results per search. i actually just started looking into captcha killin and know i'll nail it down with my own personal touch once i do a little more research
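
    For anyone curious about the API route, the call looks roughly like this, assuming the Custom Search JSON API is what's meant by "the g search api" (the key and cx values are placeholders):

    <?php
    // Sketch: check a URL via the Google Custom Search JSON API (assumed to be
    // the API mentioned above). $apiKey and $cx are placeholders; the free tier
    // is limited as described in the post.
    function apiIndexedCheck($url, $apiKey, $cx)
    {
        $endpoint = 'https://www.googleapis.com/customsearch/v1?' . http_build_query(array(
            'key' => $apiKey,
            'cx'  => $cx,
            'q'   => 'site:' . $url,
        ));

        $json = file_get_contents($endpoint);
        if ($json === false) {
            return false;
        }
        $data = json_decode($json, true);

        // "items" is only present in the response when there is at least one result.
        return !empty($data['items']);
    }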

     
  10. hypefrenzy

    hypefrenzy Junior Member

    Joined:
    Dec 12, 2011
    Messages:
    170
    Likes Received:
    19
    for me, there are three reasons:

    1) flexibility - i can program my bots to behave any way i need. i can control how they find results, how the results are processed, the logic for filtering the data, and how the data is organized

    2) stubbornness - i honestly just like doing things my own way and it only makes me better when i have to work for the results. buying a program is great for quick results but i'm building a long term system

    3) adaptation - if the rules change i have full control to modify my strategies on the fly

     
  11. Tensegrity

    Tensegrity Jr. VIP Premium Member

    Joined:
    Apr 22, 2009
    Messages:
    1,824
    Likes Received:
    969
    It's not about recreating something that already exists; it's about creating something that works for your environment. Obviously Scrapebox is not something you can automate on the fly with PHP. So what if you're in a situation where you need to check whether a URL is indexed before handling it in some way with your automated cron'd PHP script?
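
    A cron'd check is basically the loop below (sketch; isIndexed() is assumed to be a scraping check like the one earlier in the thread, and the file names and paths are placeholders):

    <?php
    // check_indexed.php - sketch of a cron-driven bulk check.
    require 'indexer_check.php'; // wherever isIndexed() lives (assumed)

    $urls = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $out  = fopen('indexed.txt', 'a');

    foreach ($urls as $url) {
        if (isIndexed($url)) {
            fwrite($out, $url . PHP_EOL); // only indexed URLs get handled further
        }
        sleep(rand(10, 30)); // keep the query rate low between checks
    }
    fclose($out);

    // example crontab entry (assumption - adjust the path and schedule):
    // 0 * * * * /usr/bin/php /path/to/check_indexed.php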
     
  12. Tensegrity

    Tensegrity Jr. VIP Premium Member

    Joined:
    Apr 22, 2009
    Messages:
    1,824
    Likes Received:
    969
    I like the way you think.