[Questions] Proxy usage for a specific business use case

Casper_T

Newbie
Joined
Aug 3, 2016
Messages
21
Reaction score
8
Hello there,

I am seeking suggestions on how to properly select and use proxies for my
use case:

We run large scrapes of US eCommerce stores (tens of millions of requests per month) using both headless browsers
and plain requests.


Several questions I would like suggestions on:

1. Proxy type. Since we are scraping eCommerce products and not doing any kind of account creation, are shared datacenter proxies enough for us? Or, thinking long-term, should we look for dedicated residential IPs? We are trying to balance quality and price. Right now we are using static residential proxies at around $3.49 per IP (ISP IP addresses, although I believe they are shared); we couldn't find a better price for this quality on the market.
2. Billing type. Bandwidth-based, or static with unlimited bandwidth? Correct me if I am wrong, but for our use case (scraping millions of pages with headless Chromium) we would generate ridiculous amounts of traffic monthly, so the only remaining option is to rent unlimited-bandwidth IPs? And am I correct that this usually comes at the cost of limited proxy speed (limited concurrent requests)? I already see signs of this with our current provider: we are getting many timeouts when doing headless scrapes.

3. Abstraction layer. As we plan to manage a pool of IPs from different proxy providers, we will need some sort of abstraction layer that lets us easily combine proxies from various providers and track/manage/replace them. Maybe there are existing SaaS or library/framework solutions for this specific purpose? I believe implementing a custom solution would be reinventing the wheel.

4. Testing proxy liveness and speed. We will also need to proactively test the quality of the proxies we use. Are there any guidelines on how we could programmatically and periodically test the liveness and speed of the proxies we rent and use? I have heard that simple PING requests do not work with many proxy providers, so does that mean we need a custom-written service that visits a specific page through the proxy and checks whether it loads?

5. Testing proxy quality. Maybe you know some sites or tools that can detect the majority of datacenter proxies and would therefore be a perfect place to test whether a given proxy is actually good quality (i.e., truly residential)? I would be very thankful for such examples.

Thank you in advance for your help!
 
Residential proxies such as Luminati and Soax are the best choice.
I wouldn't disagree, but in our use case, where we will have a huge amount of monthly bandwidth, $7-12/GB is simply too much.
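
To put rough numbers on it: assuming a ballpark of ~2 MB transferred per headless page load, 10 million pages is about 20 TB a month, and at $7/GB that would be on the order of $140,000/month. The exact figures will vary with the sites, but that order of magnitude is why per-GB billing doesn't work for us.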
 
1. Proxy type. Since we are scraping eCommerce products and not doing any kind of account creation, are shared datacenter proxies enough for us? Or, thinking long-term, should we look for dedicated residential IPs? We are trying to balance quality and price. Right now we are using static residential proxies at around $3.49 per IP (ISP IP addresses, although I believe they are shared); we couldn't find a better price for this quality on the market.

This will highly depend on the websites you're scraping and how sophisticated their development teams are. If you're able to run extended cookie-loading and warming of IPs, then datacenter proxies might work, but probably not for long: anti-automation methods and algorithms are only improving, and it might be worth investing in residential/mobile so you don't find yourself in a tough spot if all your IPs get banned.

Another route is scraper APIs, which might actually be the best fit for you. ScrapingBee is the first one that comes to mind, but there are many others.
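
To give a feel for the integration side, here is a minimal sketch of calling such an API from Python. The API key and target URL are placeholders, and the endpoint/parameter names follow ScrapingBee's public docs as I remember them, so verify against their current documentation:

import requests

# Hedged sketch: fetch a product page through a scraper API instead of
# managing proxies and headless browsers yourself. "YOUR_API_KEY" and
# the target URL are placeholders.
resp = requests.get(
    "https://app.scrapingbee.com/api/v1/",
    params={
        "api_key": "YOUR_API_KEY",
        "url": "https://example.com/product/123",
        "render_js": "true",  # let the service run a headless browser
    },
    timeout=60,
)
print(resp.status_code, len(resp.text))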


2. Billing type. Bandwidth-based, or static with unlimited bandwidth? Correct me if I am wrong, but for our use case (scraping millions of pages with headless Chromium) we would generate ridiculous amounts of traffic monthly, so the only remaining option is to rent unlimited-bandwidth IPs? And am I correct that this usually comes at the cost of limited proxy speed (limited concurrent requests)? I already see signs of this with our current provider: we are getting many timeouts when doing headless scrapes.

Do you need hundreds of IPs/threads to run simultaneously? Go for bandwidth-based billing providers such as BrightData or Oxylabs.

Do you need large amounts of data but few IPs/threads simultaneously? Go for thread-based billing providers such as Shifter.

3. Abstraction layer. As we plan to manage a pool of IPs from different proxy providers, we will need some sort of abstraction layer that lets us easily combine proxies from various providers and track/manage/replace them. Maybe there are existing SaaS or library/framework solutions for this specific purpose? I believe implementing a custom solution would be reinventing the wheel.

What you're talking about is a load balancer. There are hundreds of open-source projects that can do this.
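
If you want something lighter than a full load balancer, a provider-agnostic pool is also only a few dozen lines of Python. A minimal round-robin sketch (the proxy URLs are placeholders, not real endpoints):

import itertools
import requests

# Round-robin pool mixing proxies from several providers.
PROXIES = [
    "http://user:pass@proxy1.provider-a.example:8080",
    "http://user:pass@proxy2.provider-a.example:8080",
    "http://user:pass@gw.provider-b.example:9000",
]
_rotation = itertools.cycle(PROXIES)

def fetch(url, retries=3):
    """Try the next proxy in the pool on each attempt."""
    last_error = None
    for _ in range(retries):
        proxy = next(_rotation)
        try:
            return requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=15,
            )
        except requests.RequestException as exc:
            last_error = exc  # dead or slow proxy; rotate and retry
    raise last_error

In practice you'd add health tracking so repeatedly failing proxies get ejected, which is exactly what a proper load balancer gives you out of the box.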


4. Testing proxy liveness and speed. We will also need to proactively test the quality of the proxies we use. Are there any guidelines on how we could programmatically and periodically test the liveness and speed of the proxies we rent and use? I have heard that simple PING requests do not work with many proxy providers, so does that mean we need a custom-written service that visits a specific page through the proxy and checks whether it loads?

Most load balancers should have some functionality to monitor uptime & speed. If you're talking about bandwidth, then that's something you'll need to create manually using something such as the speedtest.net library.
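
If you do roll your own probe, it really is just an HTTP fetch through the proxy with a stopwatch around it. A minimal sketch (httpbin.org/ip is just one convenient echo service; any stable page you're allowed to hit works):

import time
import requests

def check_proxy(proxy_url, test_url="https://httpbin.org/ip"):
    """Fetch a known page through the proxy and time it.

    httpbin echoes the caller's IP back as JSON, so this also tells
    you which exit IP the target actually saw.
    """
    proxies = {"http": proxy_url, "https": proxy_url}
    start = time.monotonic()
    try:
        resp = requests.get(test_url, proxies=proxies, timeout=10)
        return {
            "alive": resp.ok,
            "status": resp.status_code,
            "latency_s": round(time.monotonic() - start, 2),
            "exit_ip": resp.json().get("origin"),
        }
    except requests.RequestException as exc:
        return {"alive": False, "error": str(exc)}

Run something like this from a scheduler every few minutes per proxy and eject anything that fails several checks in a row.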

5. Testing proxy quality. Maybe you know some sites or tools that can detect the majority of datacenter proxies and would therefore be a perfect place to test whether a given proxy is actually good quality (i.e., truly residential)? I would be very thankful for such examples.

ipqs.com is a great option, but still, different services will have their own measures of IP quality.
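
For the programmatic side, IPQualityScore exposes a JSON lookup you can wire into the probe above. The endpoint path and field names below are from memory of their public docs, so treat them as assumptions and verify:

import requests

def score_exit_ip(ip, api_key):
    """Ask IPQualityScore whether an IP looks like a proxy/datacenter IP.

    NOTE: the endpoint path and the "proxy"/"fraud_score" fields are
    based on IPQS's public JSON API as I remember it; double-check
    their current documentation before relying on this.
    """
    url = f"https://ipqualityscore.com/api/json/ip/{api_key}/{ip}"
    data = requests.get(url, timeout=10).json()
    return {"flagged_as_proxy": data.get("proxy"),
            "fraud_score": data.get("fraud_score")}

Combined with the liveness check above (which reports the exit IP each target sees), you can score every proxy in your pool automatically.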

No matter what stack of tools you use, make sure you understand this will be a very complex and expensive project that will take months to perfect.

Good luck.
 
Thank you so much for your insights!

We looked at several scraping API providers (ScrapingBee included), but at our load they become too expensive, so we decided to build an in-house scraper.

Am I right that by "load balancer" you mean generic applications like Apache or Nginx that can be used as a reverse proxy?
 
We looked at several scraping API providers (ScrapingBee included), but at our load they become too expensive, so we decided to build an in-house scraper.

Make sure you understand that an in-house scraper at this scale will not be easy to build and maintain, unless you're only looking to scrape a few websites.

Am I right that by "load balancer" you mean generic applications like Apache or Nginx that can be used as a reverse proxy?
Indeed, yes.
 

1. Decide what type of proxy you need
2. Choose a reliable seller
3. Consider the price
4. Check out user reviews
5. Take a look at the proxy provider's dashboard

 
I wouldn't disagree, but in our use case, where we will have a huge amount of monthly bandwidth, $7-12/GB is simply too much.

Sometimes it's better to have too much than too little.
 