Scraping millions of Google search results

Discussion in 'Black Hat SEO Tools' started by ryangineer, Apr 3, 2017.

  1. ryangineer

    ryangineer Newbie

    Joined:
    Mar 13, 2017
    Messages:
    26
    Likes Received:
    2
    Gender:
    Male
    Hey fellow BH'ers!
    I have a question and would love your help! I created a simple Node.js bot that searches Google for exactly what I need and scrapes the information I'm looking for (publicly available YouTube and Twitter accounts, etc.). Now that I have the bot set up, what do I need to do from here?

    Given there are millions of page results for the various searches, what is the right way to go about running my bot? I'm guessing I would need proxies? How do I avoid hangups?

    Would love to just learn all the right things here: is there a good thread or guide I can read on getting the proper setup?

    Thanks so much in advance. Means a lot!
     
  2. jamie3000

    jamie3000 Supreme Member

    Joined:
    Jun 30, 2014
    Messages:
    1,310
    Likes Received:
    586
    Occupation:
    Finance coder looking for semi-retirement
    Location:
    uk
    I've written quite a few scrapers before: Google, Bing, Yahoo, etc. The biggest problem is proxies. You'll need a lot for Google or you'll hit captchas, but you can fire these off to Death By Captcha or another captcha-solving service. Rotating user agents helps. Also, I found that using a headless browser and randomising activity can help.
     
  3. ryangineer

    ryangineer Newbie

    Oh awesome. You wouldn't happen to be available to chat over Skype/Telegram, would you? Wouldn't mind picking your brain. I got the scraping program to work perfectly; now I just need to get it to run at scale.

    Would you recommend using Yahoo or Bing instead? Are they much more lax?
     
  4. Javardo69

    Javardo69 Junior Member

    Joined:
    Jul 19, 2014
    Messages:
    102
    Likes Received:
    6
    Use a rotating proxy service, and make your program avoid proxies that are dead, have hit a captcha, or are not returning search results for some reason. To scale up, it's all about the number of proxies you can afford to purchase.
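
A minimal sketch of the "avoid bad proxies" idea above, assuming each request's outcome can be reduced to a single ok/not-ok flag. The `ProxyPool` class, its method names, and the `max_failures` threshold are hypothetical and not taken from any particular tool; the sketch does no networking itself.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ProxyPool:
    """Tracks consecutive failures per proxy and skips the ones that keep failing."""

    proxies: list[str]
    max_failures: int = 3  # hypothetical threshold, tune to your own setup
    failures: dict[str, int] = field(default_factory=dict)

    def healthy(self) -> list[str]:
        # A proxy stays usable until it reaches max_failures consecutive failures.
        return [p for p in self.proxies if self.failures.get(p, 0) < self.max_failures]

    def pick(self) -> str:
        candidates = self.healthy()
        if not candidates:
            raise RuntimeError("no healthy proxies left")
        return random.choice(candidates)

    def report(self, proxy: str, ok: bool) -> None:
        # A dead proxy, a captcha page, or an empty result page each count
        # as one failure; any success resets the counter.
        if ok:
            self.failures[proxy] = 0
        else:
            self.failures[proxy] = self.failures.get(proxy, 0) + 1
```

The caller would `pick()` a proxy before each request and `report()` the outcome afterwards. Counting consecutive failures rather than total failures means one bad response does not permanently retire an otherwise good proxy.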
     
  5. 710fla

    710fla Jr. VIP

    Joined:
    Aug 25, 2015
    Messages:
    650
    Likes Received:
    172
    Bing is more lax on proxy bans.

    I use StormProxies rotating proxies to scrape Google. You can switch IP addresses on every HTTP request or every couple of minutes.

    I use GSA Proxy Scraper for proxies to scrape Bing.
     
  6. Skyebug77

    Skyebug77 Jr. VIP

    Joined:
    Mar 22, 2012
    Messages:
    1,931
    Likes Received:
    1,354
    Occupation:
    Marketing
    Location:
    Portland,Or

    DBC can't solve Google's reCAPTCHA. You would need manual captcha solving for that.

    Yes, Yahoo, Bing, AOL, etc. are a lot easier to scrape. Yahoo's engine accepts footprints similar to Google's, so it works well. I scraped hundreds of thousands of results from Yahoo with only a single IP.
     
  7. marathipustake

    marathipustake BANNED

    Joined:
    Apr 5, 2017
    Messages:
    20
    Likes Received:
    0
    Gender:
    Male
    Bing copies Google's results, so try Bing.
     
  8. Skyebug77

    Skyebug77 Jr. VIP

    No, it does not. And Bing's results are spammed with porn.

    Also, just because the search results page says millions, the engine will not actually let you retrieve that many, as there are not really as many UNIQUE results per keyword as it claims.
     
  9. jamie3000

    jamie3000 Supreme Member

    I've found that when scraping Bing, instead of straight up banning your IP, they'll sometimes just restrict it to 3 pages of results. Something to watch out for.
     
    Last edited: Apr 5, 2017
  10. jamie3000

    jamie3000 Supreme Member

    It says on the Death By Captcha homepage that they do reCAPTCHA through their API; I can't vouch for this though.
     
  11. Skyebug77

    Skyebug77 Jr. VIP

    Maybe it could solve earlier versions of it, or audio, etc. But I have had absolutely no luck finding a provider that can reliably solve the latest No CAPTCHA reCAPTCHA on Google. Check out how insane the tech behind it is.



    If you can find a service that can do it reliably, please let me know; I would be interested.
     
  12. ryangineer

    ryangineer Newbie

    You guys are great. Thank you so much. Would you recommend scraping public proxies, as available through ScrapeBox, or definitely going with private proxies rotated through ScrapeBox? What would be the ideal proxy setup?
     
  13. littlewebdragon

    littlewebdragon Jr. VIP

    Joined:
    Dec 30, 2007
    Messages:
    1,671
    Likes Received:
    824
    One thing that I've recently noticed, and was able to test (thanks to my proxy provider), is that Google sometimes blocks an entire /24 IP range for 24h in certain cases. And that's awful. :confused:
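
To illustrate what a /24 range block implies for a proxy list, here is a small sketch that groups IPv4 addresses by their /24 network, assuming the proxies are plain IPv4 strings. The function name `group_by_slash24` and the sample addresses (taken from documentation-reserved ranges) are made up for illustration.

```python
import ipaddress
from collections import defaultdict


def group_by_slash24(proxy_ips: list[str]) -> dict[str, list[str]]:
    """Map each /24 network (as a string) to the proxy IPs that fall inside it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for ip in proxy_ips:
        # strict=False lets us pass a host address and get its enclosing network.
        network = ipaddress.ip_network(f"{ip}/24", strict=False)
        groups[str(network)].append(ip)
    return dict(groups)


if __name__ == "__main__":
    sample = ["203.0.113.7", "203.0.113.200", "198.51.100.4"]
    for network, members in group_by_slash24(sample).items():
        print(network, members)
```

If every proxy in a list falls into the same one or two groups, a single range block would take most of the list out at once, which is why the spread across ranges can matter more than the raw proxy count.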
     
  14. Javardo69

    Javardo69 Junior Member

    I've tested 2Captcha and it works, but it's quite expensive in the long run ($3 for 1000 captchas). I'm looking for alternatives.
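
As a rough way to see how the $3 per 1000 captchas quoted above adds up, here is a back-of-the-envelope sketch. The function name `captcha_cost` and the captcha rate are assumptions; the real rate depends entirely on the proxies and request pattern and is not given anywhere in this thread.

```python
def captcha_cost(n_requests: int, captcha_rate: float, price_per_1000: float = 3.0) -> float:
    """Estimated solving cost in dollars.

    captcha_rate is the ASSUMED fraction of requests that end in a captcha
    (0.0 to 1.0); price_per_1000 defaults to the $3 figure quoted in the post.
    """
    if not 0.0 <= captcha_rate <= 1.0:
        raise ValueError("captcha_rate must be between 0 and 1")
    captchas = n_requests * captcha_rate
    return captchas / 1000 * price_per_1000


if __name__ == "__main__":
    # Hypothetical scenario: one million requests, 10% of them hit a captcha.
    print(f"${captcha_cost(1_000_000, 0.10):,.2f}")
```

Under those assumed numbers, 100,000 captchas at $3 per 1000 comes to about $300, so the cost scales directly with how often the proxies trigger a captcha.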
     
  15. AryabhattZero

    AryabhattZero Power Member

    Joined:
    Jul 28, 2015
    Messages:
    763
    Likes Received:
    296
    You can use 2Captcha.
    I find them better than DBC.