URL Scraping 101

tixpf

Regular Member
Dec 1, 2013
I've finally decided to open a thread on this topic and this topic alone. Whenever I ask about it in ongoing discussions, I never get the answers I'm hoping for.
In my opinion there isn't nearly enough information about URL scraping out there to make a well-informed decision. I've searched this forum, the GSA forum, and plenty of others, but couldn't get to the bottom of it.



  1. What are the minimum requirements for (proper) scraping?
    1. Proxies (Private/Public/How Many/..)?
    2. Internet Connection Speed requirements (Home Line/VPS/Dedicated/..)?
  2. What are the major differences between GScraper and Scrapebox?
    1. Different minimum amount of proxies for each tool?
    2. Better results with GS/SB?
    3. Proxy consumption - I've read that GScraper burns through proxies like there's no tomorrow, while SB seems to treat them much less aggressively?
    4. Is one more suited for beginners/low-medium scraping needs than the other?



These are the main questions I'd like answered. I have a few more, but they're follow-up questions.
Thanks in advance, guys.
 
I've built quite a few scrapers in multiple languages. Here are my thoughts:


What are the minimum requirements for (proper) scraping?

  1. Proxies (Private/Public/How Many/..)?
For ~99% of websites, shared public proxies will be fine, and you need one per "identity", or login. If you don't need to log in to do your scraping (e.g. scraping Wikipedia or something), you can probably send up to 60 requests per IP without getting "blocked", but to be safe I'd limit it to 20 requests per IP or fewer.
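
To make the per-IP limit concrete, here is a minimal Python sketch of a proxy pool that retires each proxy once it has used up its request budget. The class name `ProxyPool`, its parameters, and the example addresses are all made up for illustration, and the default of 20 simply mirrors the cautious cap suggested above. Treat it as one possible approach, not a rule that any particular site enforces.

```python
from collections import deque


class ProxyPool:
    """Hand out proxies round-robin and retire each after a fixed request budget."""

    def __init__(self, proxies, max_requests_per_proxy=20):
        # Assumption: the budget is a flat count per proxy, with no time window.
        self.max_requests = max_requests_per_proxy
        # Each entry is [proxy_address, requests_used_so_far].
        self._pool = deque([p, 0] for p in proxies)

    def acquire(self):
        """Return the next proxy that still has budget, or raise if none are left."""
        if not self._pool:
            raise RuntimeError("all proxies have used up their request budget")
        entry = self._pool.popleft()
        entry[1] += 1
        if entry[1] < self.max_requests:
            # Still has budget left: put it at the back of the queue.
            self._pool.append(entry)
        return entry[0]


# Hypothetical proxy addresses (documentation-only IP range).
pool = ProxyPool(["203.0.113.10:8080", "203.0.113.11:8080"], max_requests_per_proxy=2)
order = [pool.acquire() for _ in range(4)]
print(order)
```

Each value returned by `acquire()` would then be passed to whatever proxy setting your HTTP client offers. Once the pool raises, it's time to load fresh proxies.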

  2. Internet Connection Speed requirements (Home Line/VPS/Dedicated/..)?
This just depends on your code. A slow connection isn't a problem in itself, because connection speed is not an indicator of whether a request is coming from a scraper or a human. However, try to get a fast connection, because with a slow one you will run into timeouts that break your code, and these types of errors are very hard to handle in code. I've used IPRentals, HideMyAss VPN, StrongVPN, Proxy51, etc., and I would say HideMyAss has been my favorite. It's very fast and reliable, and it feels like they are "clean". I have written AppleScripts that you can include in your apps to integrate them with HideMyAss' VPN. PM me if you want them.
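
One common way to keep timeouts from breaking a scraper is to set an explicit timeout on every request and retry a few times with a growing pause. Below is a minimal Python sketch using only the standard library. The function name `fetch_with_retries`, its parameters, and the `opener` hook (there so the function can be tried without a network) are assumptions for illustration, and the retry counts are arbitrary starting points.

```python
import socket
import time
import urllib.error
import urllib.request


def fetch_with_retries(url, attempts=3, timeout=10.0, backoff=1.0,
                       opener=urllib.request.urlopen):
    """Fetch url, retrying on timeouts and connection errors.

    Returns the response body as bytes, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except (socket.timeout, urllib.error.URLError, ConnectionError):
            # socket.timeout covers read timeouts; URLError wraps most
            # connection failures. Note that HTTP error responses (HTTPError)
            # are a subclass of URLError, so they are retried here as well.
            if attempt == attempts:
                return None
            time.sleep(backoff * attempt)  # wait a little longer each time
    return None
```

A call such as `fetch_with_retries("https://example.com/", attempts=5, timeout=15)` returns `None` instead of raising when the connection keeps timing out, so the rest of the scraper can skip that URL and move on.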
What are the major differences between GScraper and Scrapebox?
I've only used Scrapebox minimally, so I can't comment on any of these:

  1. Different minimum amount of proxies for each tool?
  2. Better results with GS/SB?
  3. Proxy consumption - I've read that GScraper burns through proxies like there's no tomorrow, while SB seems to treat them much less aggressively?
  4. Is one more suited for beginners/low-medium scraping needs than the other?