Thinking of creating a Google scraper

Discussion in 'Programming' started by 2makemoney, Nov 29, 2019.

  1. 2makemoney

    2makemoney Regular Member

    Joined:
    Oct 10, 2011
    Messages:
    454
    Likes Received:
    99
    Gender:
    Male
    Location:
    USA
    I am thinking about building a Google scraper to learn programming. I am curious what features people would want in a Google scraper.

    1. Software (local install, .exe) or web-based?
    2. Do people want other search engines too, like Bing?
     
  2. VSYNC

    VSYNC Registered Member

    Joined:
    Nov 20, 2019
    Messages:
    77
    Likes Received:
    21
    Gender:
    Male
    Occupation:
    Invisible threads are the strongest ties~Nietzsche
    Location:
    How did it get so late so soon?~Dr. Seuss
    GL, what kind of scraper?

    You can make high-quality scraping programs with Python, Beautiful Soup 4, and lxml. I'd advise that as a good way to go if you're just learning about web scraping.
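A minimal sketch of the first thing a scraper does, pulling links out of a fetched page. It deliberately uses the standard library's `html.parser` as a stand-in so it runs without installing anything; `LinkCollector`, `extract_links`, and `SAMPLE_HTML` are names made up for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page content, standing in for a fetched response body.
SAMPLE_HTML = """
<html><body>
  <a href="https://example.com/one">One</a>
  <p>No link here</p>
  <a href="https://example.com/two">Two</a>
  <a name="anchor-without-href">Skip me</a>
</body></html>
"""

links = extract_links(SAMPLE_HTML)
```

With Beautiful Soup installed, roughly the same result should come from `BeautifulSoup(html, "lxml").find_all("a", href=True)` and reading each tag's `href`, which is shorter and more forgiving of broken markup.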
     
  3. 2makemoney

    2makemoney Regular Member

    I would like to replicate ScrapeBox's results scraper without the need for proxies. Maybe take in a list of keywords and footprints and return a list of deduped links. Not sure if there is a use case for such a tool.

    edit: added more details
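The keyword-plus-footprint idea can be sketched without touching the network at all. The snippet below is only an illustration under assumptions: the query format (footprint followed by the quoted keyword) and the normalization rules used for deduping are choices made here, not how ScrapeBox does it, and all function names are hypothetical.

```python
from itertools import product
from urllib.parse import urlsplit, urlunsplit


def build_queries(keywords, footprints):
    """Pair every footprint with every keyword, e.g. 'inurl:blog "dog training"'."""
    return [f'{fp} "{kw}"' for fp, kw in product(footprints, keywords)]


def normalize(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def dedupe(urls):
    """Keep the first occurrence of each normalized URL, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

Usage would look like `build_queries(["dog training"], ["inurl:blog", "intitle:forum"])` to get the search strings, then `dedupe(all_scraped_links)` on whatever the result pages return. Deduping by domain instead of by full URL would be a one-line change in `normalize`.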
     
  4. MetDark

    MetDark Newbie

    Joined:
    Oct 27, 2017
    Messages:
    21
    Likes Received:
    11
    Gender:
    Male
    1. Web-based, to cater for different OSes and devices. Less complicated?

    2. Don't know.
    But I think people want great search engines.

    A Google scraper is not a search engine, is it?
    Unless we are talking about DuckDuckGo "It emphasizes returning the best results, rather than the most results, generating those results from over 400 individual sources, including crowdsourced sites such as Wikipedia, and other search engines like Bing, Yahoo!, and Yandex."
     
  5. 2makemoney

    2makemoney Regular Member

    Great point about catering to different OS. I really like that idea.

    Never heard of DuckDuckGo. I will look that up. Thanks
     
  6. theRevolt

    theRevolt Jr. VIP Jr. VIP

    Joined:
    Jul 29, 2009
    Messages:
    2,262
    Likes Received:
    980
    Sorry to burst your bubble, but scraping Google with no proxies will get you blocked for anything more than a couple of lookups, which you may as well do manually.

    Find something more useful if you want someone to use it, or just go ahead if the main focus is just to learn by doing...
     
  7. rafark

    rafark Senior Member

    Joined:
    Jan 15, 2013
    Messages:
    1,068
    Likes Received:
    657
    Gender:
    Male
    Occupation:
    Moderador
    Location:
    North America
    You'll need proxies. If you want to learn scraping, start with a site that doesn't require proxies.
     
  8. 2makemoney

    2makemoney Regular Member

    Interesting. Would it be useful to find ways of bypassing the proxy issue?
     
  9. BassTrackerBoats

    BassTrackerBoats Super Moderator Staff Member Moderator Jr. VIP

    Joined:
    Mar 10, 2010
    Messages:
    28,398
    Likes Received:
    52,401
    Occupation:
    Generic Human Being
    Location:
    As Close to Heaven as One Can Get!
    That is an age-old question. We all (an absolute, I know, so kick me for saying that) use proxies when we scrape, unless we set things up to rest for silly lengths of time between scrapes, and that makes it almost useless.
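For anyone who wants to see what "resting between scrapes" looks like in practice, here is a rough sketch of a throttle with a randomized delay. `Throttle` and its parameters are invented for this example, and the delay range anyone would actually need is a guess, not a known safe value.

```python
import random
import time


class Throttle:
    """Waits a randomized interval between consecutive calls to wait()."""

    def __init__(self, min_delay, max_delay, sleep=time.sleep, clock=time.monotonic):
        if min_delay < 0 or max_delay < min_delay:
            raise ValueError("need 0 <= min_delay <= max_delay")
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._sleep = sleep      # injectable so the class can be tested quickly
        self._clock = clock
        self._next_allowed = None  # earliest time the next request may start

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        slept = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            slept = self._next_allowed - now
            self._sleep(slept)
            now = self._clock()
        # Pick a fresh random gap each time so requests are not evenly spaced.
        self._next_allowed = now + random.uniform(self.min_delay, self.max_delay)
        return slept
```

Calling `throttle.wait()` before each request is all it takes. The arithmetic shows why this gets impractical: assuming an average gap of 45 seconds, 1,000 queries would take 45,000 seconds, or 12.5 hours, from a single IP.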
     
  10. Novita Rizki

    Novita Rizki Newbie

    Joined:
    Nov 22, 2019
    Messages:
    32
    Likes Received:
    11
    Occupation:
    Marketer
    Location:
    $5mil villa
    Crowdsourcing the IP list from users' mobile devices would be an idea (if possible), similar to how some free VPN providers collect IPs from their users.
     
  11. 2makemoney

    2makemoney Regular Member

    That is actually a good idea. Maybe abstract the proxies from the users.
     
  12. t-machine

    t-machine Newbie

    Joined:
    Nov 17, 2019
    Messages:
    44
    Likes Received:
    21
    Even with proxies, you also have to deal with reCAPTCHA. Google is a serious PITA for that: not just the search engine, but even more niche stuff like Google Scholar or Google Flights. Everything is so locked down that if you want to get around the captcha, you need to simulate browser behavior, introduce action delays, accept cookie headers, etc. I absolutely hate Google in that regard. IMO the effort is not worth it as a business model, but who am I to stop you.
     
  13. itz_styx

    itz_styx Jr. VIP Jr. VIP

    Joined:
    May 8, 2012
    Messages:
    1,219
    Likes Received:
    648
    Occupation:
    CEO / Admin / Developer
    Location:
    /dev/mem
    As long as you use proxies, it's fairly easy to scrape Google results. No need for browser emulation; you can use pure HTTP requests.
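A bare-bones sketch of the "pure HTTP requests through a proxy" approach, using only Python's standard library. Nothing is sent here; the code just builds the URL and an opener. The function names, the proxy address, and the user-agent string are placeholders for illustration, and whether a plain request gets a usable results page is not something this sketch can promise.

```python
import urllib.request
from urllib.parse import urlencode


def build_search_url(query, base="https://www.google.com/search"):
    """Encode the query string into a search URL using the `q` parameter."""
    return f"{base}?{urlencode({'q': query})}"


def make_proxy_opener(proxy_url, user_agent):
    """Return an opener that routes HTTP and HTTPS traffic through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    # Replace the default "Python-urllib/x.y" user agent.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


url = build_search_url('inurl:blog "dog training"')
# Placeholder proxy and user agent; substitute real values before sending anything.
opener = make_proxy_opener("http://127.0.0.1:8080", "Mozilla/5.0 (placeholder)")
```

Fetching would then be `opener.open(url, timeout=10).read()`, with the returned HTML handed to whatever parser is in use.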
     
  14. VSYNC

    VSYNC Registered Member

    +1 Selenium
     
  15. turelink

    turelink Junior Member Premium Member

    Joined:
    Jul 26, 2015
    Messages:
    194
    Likes Received:
    44
    #1 Rotating proxies (residential IPs are better than datacenter IPs)
    #2 A headless browser driven by Selenium or Puppeteer
    #3 A good web scraper with a captcha solver, or develop a captcha solver of your own.
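Point #1 is the easiest of the three to sketch. Below is one possible shape for a rotating pool that retires a proxy after repeated blocks or captchas. `ProxyPool`, its method names, and the `max_failures` threshold are all assumptions made for this example rather than any tool's real API.

```python
from collections import deque


class ProxyPool:
    """Round-robin proxy rotation that retires proxies after repeated failures."""

    def __init__(self, proxies, max_failures=3):
        self._queue = deque(proxies)
        self._failures = {p: 0 for p in proxies}
        self._max_failures = max_failures

    def next(self):
        """Return the next proxy and move it to the back of the queue."""
        if not self._queue:
            raise RuntimeError("no working proxies left")
        proxy = self._queue[0]
        self._queue.rotate(-1)
        return proxy

    def report_failure(self, proxy):
        """Count a block or captcha; drop the proxy once it hits the limit."""
        self._failures[proxy] += 1
        if self._failures[proxy] >= self._max_failures and proxy in self._queue:
            self._queue.remove(proxy)

    def report_success(self, proxy):
        """Reset the failure count after a clean response."""
        self._failures[proxy] = 0

    def __len__(self):
        return len(self._queue)
```

The scraping loop would call `pool.next()` before each request, then `report_success` or `report_failure` depending on whether the response looked like a results page or a block page.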
     
  16. FNTK

    FNTK Jr. VIP Jr. VIP

    Joined:
    Apr 7, 2014
    Messages:
    428
    Likes Received:
    146
    This seems promising.