
URL Scraping 101

Discussion in 'Black Hat SEO Tools' started by tixpf, May 4, 2014.

  1. tixpf (Regular Member; Joined: Dec 1, 2013; Messages: 295; Likes Received: 114)

    I've finally decided to open a thread devoted to this topic alone. When I ask about it in ongoing discussions, I rarely get the answers I'm hoping for.
    In my opinion, there isn't nearly enough information available about URL scraping to make a well-informed, rational decision. I've searched this forum, the GSA forum, and many others, but couldn't get to the bottom of it.



    1. What are the minimum requirements for (proper) scraping?
      1. Proxies (Private/Public/How Many/..)?
      2. Internet Connection Speed requirements (Home Line/VPS/Dedicated/..)?
    2. What are the major differences between GScraper and Scrapebox?
      1. Different minimum amount of proxies for each tool?
      2. Better results with GS/SB?
      3. Proxy consumption - I've read that GScraper burns through proxies like there's no tomorrow, while SB seems to treat them much less aggressively?
      4. Is one more suited for beginners/low-medium scraping needs than the other?



    These are the main questions I'd like answered. I have a few more, but they're mostly follow-up questions.
    Thanks in advance, guys.
     
  2. vivid- (Newbie; Joined: Aug 30, 2010; Messages: 3; Likes Received: 1)

    I've built quite a few scrapers in multiple languages. Here are my thoughts:


    What are the minimum requirements for (proper) scraping?

    1. Proxies (Private/Public/How Many/..)?
    For roughly 99% of websites, shared public proxies will be fine; you need one per "identity", or login. If you don't need to log in to do your scraping (e.g. scraping Wikipedia), you can probably send up to 60 requests per IP without getting blocked, but to be safe I'd limit it to 20 requests per IP or fewer.
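To make the per-IP limit concrete, here is a minimal sketch of a proxy pool that stops handing out a proxy once it has served 20 requests. The class name `ProxyPool`, the round-robin order, and the example addresses are assumptions for illustration; the post doesn't say over what time window the limit applies, so this sketch simply treats it as a fixed budget per proxy.

```python
class ProxyPool:
    """Hand out proxies round-robin, retiring each after a fixed request budget.

    Hypothetical helper: the default cap of 20 follows the "to be safe" figure
    from the post; the lifetime-budget interpretation is an assumption.
    """

    def __init__(self, proxies, max_requests_per_proxy=20):
        if max_requests_per_proxy < 1:
            raise ValueError("max_requests_per_proxy must be at least 1")
        self.max_requests = max_requests_per_proxy
        # Requests served so far, keyed by proxy address (duplicates collapse).
        self.used = {proxy: 0 for proxy in proxies}
        self._order = list(self.used)
        self._next = 0

    def acquire(self):
        """Return the next proxy that still has budget left, and charge it."""
        for _ in range(len(self._order)):
            proxy = self._order[self._next]
            self._next = (self._next + 1) % len(self._order)
            if self.used[proxy] < self.max_requests:
                self.used[proxy] += 1
                return proxy
        raise RuntimeError("all proxies have used up their request budget")
```

You'd call `pool.acquire()` before each request and pass the returned address to whatever HTTP client you use; once it raises `RuntimeError`, it's time to load fresh proxies.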

    2. Internet Connection Speed requirements (Home Line/VPS/Dedicated/..)?
    This mostly depends on your code. A slow connection isn't a problem in itself, because connection speed doesn't indicate whether a request comes from a scraper or from a human. Still, try to get a fast connection: with a slow one you'll run into timeouts that break your code, and these types of errors are very hard to handle in code.
    I've used IPRentals, HideMyAss VPN, StrongVPN, Proxy51, etc., and I'd say HideMyAss has been my favorite. It's very fast and reliable, and it feels like they are "clean". I've written AppleScripts that you can include in your apps to integrate with HideMyAss' VPN. PM me if you want them.
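For the timeout problem specifically, one common way to soften it (not a complete fix) is to wrap each request in a retry loop with a growing delay. This is a rough sketch using only the Python standard library; `fetch` and `with_retries` are hypothetical helper names, and the attempt count and delays are arbitrary assumptions you'd tune for your own connection.

```python
import time
import urllib.request


def fetch(url, timeout=10.0):
    """Fetch a URL body; timeouts and connection errors surface as OSError."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on OSError (which includes timeouts) retry with backoff.

    The delay doubles after each failed attempt: base_delay, 2x, 4x, ...
    The last error is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except OSError as err:
            last_error = err
            # Only wait if another attempt is coming.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

Usage would look something like `body = with_retries(lambda: fetch("http://example.com/"))`. Retrying won't rescue a connection that is down, but it keeps a single slow response from killing the whole run.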
    What are the major differences between GScraper and Scrapebox?
    I've only used Scrapebox minimally, so I can't comment on any of these.

    1. Different minimum amount of proxies for each tool?
    2. Better results with GS/SB?
    3. Proxy consumption - I've read that GScraper burns through proxies like there's no tomorrow, while SB seems to treat them much less aggressively?
    4. Is one more suited for beginners/low-medium scraping needs than the other?