
Scraping millions of Google search results

Discussion in 'General Programming Chat' started by ryangineer, Apr 3, 2017.

  1. ryangineer

    ryangineer Newbie

    Joined:
    Mar 13, 2017
    Messages:
    26
    Likes Received:
    2
    Gender:
    Male
    Hey fellow BH'ers!

    I have a question and would love your help! I created a simple Node JS bot that searches exactly what I need from Google and can scrape the information I'm looking for. Now that I have the bot set up - what do I need to do from here? I'm new to search engines.
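    (Roughly, the bot does this kind of thing; simplified sketch, not my exact code, and the link parsing depends on Google's current markup:)

    const axios = require('axios');
    const cheerio = require('cheerio');

    // fetch one page of results and pull out the outbound links
    async function search(query) {
      const url = 'https://www.google.com/search?q=' + encodeURIComponent(query) + '&num=100';
      const { data } = await axios.get(url, {
        headers: { 'User-Agent': 'Mozilla/5.0' }, // bare requests with no user agent get blocked almost immediately
      });
      const $ = cheerio.load(data);
      const links = [];
      $('a').each((_, el) => {
        const href = $(el).attr('href') || '';
        // the plain-HTML version wraps result links as /url?q=...; this changes often, so treat it as a guess
        const match = href.match(/^\/url\?q=([^&]+)/);
        if (match) links.push(decodeURIComponent(match[1]));
      });
      return links;
    }

    search('site:youtube.com "channel"').then(console.log);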

    Given there are millions of page results for the various searches, what is the right way to go about running my bot? I'm guessing I would need proxies? How do I avoid hangups?

    Would love to just learn all the right things here: is there a good thread or guide I can read to get the proper setup?

    Thanks so much in advance. Means a lot!
     
  2. I know SEO

    I know SEO Marketplace Mod Moderator

    Joined:
    Nov 29, 2012
    Messages:
    16,526
    Likes Received:
    6,140
    What are you trying to do exactly? Why are you scraping google?
     
  3. ryangineer

    ryangineer Newbie

    Joined:
    Mar 13, 2017
    Messages:
    26
    Likes Received:
    2
    Gender:
    Male
    I am scraping Google for YouTube channels, Twitter accounts, etc.: public information and links to other channels, so that I have an internal database of potential users for a product I am developing. My search criteria give me a list of these accounts; now I am looking to scrape the results and pull them into my database. But I'm trying to figure out how best to do that, as I'm sure I will run up against some call/search limitation.
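
    The plan for each batch of result links is basically to filter out the account URLs, dedupe them, and store them. Something like this (sketch only, with a JSON file standing in for the real database, and the URL patterns are just examples):

    const fs = require('fs');

    // rough patterns for the account URLs I care about; the real list would be longer
    const ACCOUNT_PATTERNS = [
      /youtube\.com\/(channel|user|c)\//i,
      /twitter\.com\/[A-Za-z0-9_]+\/?$/i,
    ];

    function extractAccounts(resultUrls) {
      const found = new Set(); // Set so duplicates within one batch collapse
      for (const url of resultUrls) {
        if (ACCOUNT_PATTERNS.some((re) => re.test(url))) found.add(url);
      }
      return [...found];
    }

    function saveAccounts(urls, file = 'accounts.json') {
      const existing = fs.existsSync(file) ? JSON.parse(fs.readFileSync(file, 'utf8')) : [];
      const merged = [...new Set([...existing, ...urls])]; // dedupe against what is already stored
      fs.writeFileSync(file, JSON.stringify(merged, null, 2));
    }

    saveAccounts(extractAccounts(['https://twitter.com/example', 'https://example.com/blog']));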
     
  4. Crawlie

    Crawlie Registered Member

    Joined:
    Jan 2, 2017
    Messages:
    52
    Likes Received:
    10
    Gender:
    Male
    Of course you need proxies, otherwise Google blacklists your IP pretty fast. Rotate them so each request goes through a new proxy. You can also use a timer to randomize your bot's request timing. Set the max results per page in the Google query, which I think is 100, then you can loop through pages. After a certain depth Google gives irrelevant results, so it's pointless to check them all.

    Use special search operators wisely, like site:tumblr.com.
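
    Rough idea of the loop in Node (untested sketch; plug in your own proxy list and result parsing):

    const axios = require('axios');

    // placeholder proxies; rotate so every request goes out through a different one
    const proxies = [
      { host: '1.2.3.4', port: 8080 },
      { host: '5.6.7.8', port: 8080 },
    ];
    let i = 0;
    const nextProxy = () => proxies[i++ % proxies.length];

    const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

    async function scrapeQuery(query, maxPages = 3) { // past a few pages the results go off-topic anyway
      const pages = [];
      for (let page = 0; page < maxPages; page++) {
        const url = 'https://www.google.com/search'
          + '?q=' + encodeURIComponent(query)
          + '&num=100'              // 100 per page is the cap as far as I know
          + '&start=' + page * 100; // paging offset
        const { data } = await axios.get(url, {
          proxy: nextProxy(),
          headers: { 'User-Agent': 'Mozilla/5.0' },
        });
        pages.push(data); // hand the raw HTML off to your parser here
        await sleep(5000 + Math.random() * 10000); // randomized delay instead of a fixed interval
      }
      return pages;
    }

    scrapeQuery('site:tumblr.com "some niche"').then((p) => console.log(p.length, 'pages fetched'));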