VPS web requests taking 120 seconds

metalfingersdoom

Junior Member
Joined
Feb 1, 2022
Messages
169
Reaction score
41
Hey everyone, I bought a VPS for the sole purpose of scraping the web and I'm facing an issue. I have a HUGE list of websites I have to go over (5 million) and I'm trying to run 300 concurrent threads for the job via proxies that I bought. The problem is that any time I try to go concurrent at all (meaning anything a tad above 20-30 threads), every request takes an incredible amount of time, about 90 seconds, and it climbs to that gradually.

I thought this was because of the proxy provider, but after trying just a VPN, and then nothing at all (simply making requests with nothing in between), I noticed the same trend. What do you think the problem is? I want to SCRAPE!!!!!!!!
 
@Alexion We spoke about scraping a couple of months ago. Do you have any comments on this?
 
Maybe this happens because there's no timeout set. Some websites/links are dead and the request takes a long time to fail. It could also be that the proxies have a request limit, so every link sits in the queue waiting to be processed. In that case: more proxies, more threads.
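A rough illustration of the dead-link point, assuming a plain requests-based scraper (the OP's actual stack isn't known at this point in the thread): without a timeout, one unresponsive host can pin a worker for minutes; with one, it fails fast and the thread moves on.
Code:
import concurrent.futures
import requests

URLS = ["http://example.com", "http://example.org"]  # placeholder list

def fetch(url):
    try:
        # Without timeout=, a dead host can hold this worker until the OS gives up;
        # with it, the call raises after ~5 seconds and the worker is freed.
        r = requests.get(url, timeout=5)
        return url, r.status_code
    except requests.RequestException as exc:
        return url, repr(exc)

with concurrent.futures.ThreadPoolExecutor(max_workers=30) as pool:
    for url, result in pool.map(fetch, URLS):
        print(url, result)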
 
Care to share what tech stack you are using?

The only thing that comes to my mind in your scenario is that you are using a bot built on ubot studio + exbrowser, and this behavior is normal because that's how the selenium wrapper was built.
 
Care to share what tech stack you are using?

The only thing that comes to my mind in your scenario is that you are using a bot built on ubot studio + exbrowser, and this behavior is normal because that's how the selenium wrapper was built.
I'm using scrapy by itself and nothing else, haven't tried anything else. At this point it either has to do with the VPS or scrapy, can't think of a third option.
 
I'm using scrapy by itself and nothing else, haven't tried anything else. At this point it either has to do with the VPS or scrapy, can't think of a third option.
If you think it has to do with the VPS, open up a terminal and type:
Code:
scrapy bench

That will spawn a local HTTP server that scrapy will crawl. If there's a problem with the hardware not being able to handle concurrent requests for some reason, you'll see it get stuck here.

What do you think the problem is
Probably any of these:

1. The website you are trying to scrape is simply rate limiting you.
2. You have set up AutoThrottle and forgotten about it (see the settings sketch below).
3. Your proxies take too long to connect.
4. Your code is getting stuck in a deadlock somewhere.

To me, 1 seems the most likely. Can't be sure without running a thorough benchmark.
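If 2 or the concurrency settings in general are the suspect, it's worth dumping what the crawler is actually running with. A minimal settings.py sketch of the relevant entries; the values are only examples, not recommendations:
Code:
# settings.py -- example values only
AUTOTHROTTLE_ENABLED = False        # make sure throttling isn't silently capping you
CONCURRENT_REQUESTS = 300           # global concurrency cap
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # the per-domain cap still applies on top of it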

Also, here's a thing you can try. Scrapy has a CLOSESPIDER_TIMEOUT extension. Set it to 5 or 10 seconds. That will stop the spider from crawling a request if it takes more than the time you set to respond.

Docs here: https://docs.scrapy.org/en/latest/topics/extensions.html#std-setting-CLOSESPIDER_TIMEOUT
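For reference, a minimal settings.py sketch of the timeout knobs (example values only). One caveat worth flagging from the docs linked above: CLOSESPIDER_TIMEOUT closes the whole spider once it has been open for that many seconds, while the per-request limit is DOWNLOAD_TIMEOUT, so "never wait more than 5-10 seconds on any single site" is more likely the latter setting:
Code:
# settings.py -- example values only
DOWNLOAD_TIMEOUT = 10        # per-request: give up on any single response after 10 s
CLOSESPIDER_TIMEOUT = 3600   # whole crawl: shut the spider down after an hour (0 disables it)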
 
@Alexion We spoke about scraping a couple of months ago. Do you have any comments on this?
Your only problem is the number of DNS requests being performed per second.

You have a few options.

1) I would change the nameservers you are currently using. Google's would be my first bet, but there are others that don't rate limit, too. Google them, they are easy to find.

2) A more straightforward approach would be to perform the DNS requests yourself at socket level. An example would be dnspython if you're using Python. Every language has its own libraries, look them up. (There's a rough sketch a few paragraphs down.)

3) The third option, which I often use when I'm dealing with millions of domains, is to separate the tasks.
I would build a resolver first that outputs lines like "domain.com 1.2.3.4", and then a second scraper that makes the requests at socket level and passes the hostname in the Host header.

4) Depending on your VPS, you might not even have 300 threads running. In Python, one thread "eats up" about 8MB of RAM.
Another issue is the number of domains hosted on the same IP. If a lot of domains are hosted with namecheap (which they often are), you will get rate limited by namecheap without even knowing.

This is a fact and has happened to me multiple times.
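Here's a rough sketch of points 1-3 above, assuming dnspython; the nameserver IPs, filenames and thread count are placeholders rather than recommendations:
Code:
# resolver.py -- pre-resolve a domain list into "domain.com 1.2.3.4" lines
import concurrent.futures
import dns.resolver   # pip install dnspython

resolver = dns.resolver.Resolver()
resolver.nameservers = ["8.8.8.8", "1.1.1.1"]   # point 1: pick nameservers that won't rate limit you
resolver.lifetime = 5                            # don't wait more than 5 s on any single lookup

def resolve(domain):
    try:
        # point 2: do the DNS query yourself
        for rdata in resolver.resolve(domain, "A"):
            return f"{domain} {rdata.address}"
    except Exception:
        pass
    return None                                  # dead or unresolvable domain, skip it

with open("domains.txt") as f:
    domains = [line.strip() for line in f if line.strip()]

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool, \
        open("resolved.txt", "w") as out:
    for line in pool.map(resolve, domains):
        if line:                                 # point 3: only resolvable domains reach the scraper
            out.write(line + "\n")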

Set your timeout to a maximum of 5 seconds. Trust me, it's not low.

I know scrapy has its authority and popularity, but you might want to consider a different approach to gathering the URL structure. Even BeautifulSoup would do.

Only resolve the hostname once. If you need to crawl 1,000 pages from one domain, you would otherwise perform 1,000 DNS requests, which is not good. Resolve the host once, and use the IP from then on.

Yes, you have a DNS cache, but when crawling at this volume, the cache gets overwritten fast.
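To illustrate the resolve-once idea and the Host-header trick from point 3, a rough sketch assuming plain HTTP and the requests library (HTTPS is messier because of SNI and certificate checks, so this stays HTTP-only):
Code:
# fetch.py -- reuse the pre-resolved IPs instead of hitting DNS for every page
import requests

def load_mapping(path="resolved.txt"):
    mapping = {}
    with open(path) as f:
        for line in f:
            domain, ip = line.split()
            mapping[domain] = ip
    return mapping

def fetch(domain, path, mapping):
    ip = mapping[domain]                       # resolved once, reused for every page on the domain
    url = f"http://{ip}{path}"
    # The server picks which site to serve from the Host header,
    # so pass the original hostname even though we connect by IP.
    return requests.get(url, headers={"Host": domain}, timeout=5)

mapping = load_mapping()
resp = fetch("example.com", "/", mapping)      # example.com is just a placeholder
print(resp.status_code)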
 
Your only problem is the number of DNS requests being performed per second.

Yeah, so apparently I was a dumbass and missed a very simple thing. I read the documentation for scrapy and there was lone small page named broad crawls that had the most critical aspect of multi-domain scraping: apparently the DNS resolver has an event loop with a limited number of events that it supports (20), and you have to raise it manually. I cranked it to 300 and now I get about 2 million websites a day, which is what my webpage can support at max, and I'm completely fine with that.
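For anyone landing here later: the page in question is Scrapy's "Broad Crawls" docs, and the setting being described is most likely REACTOR_THREADPOOL_MAXSIZE, since DNS lookups run through Twisted's thread pool, which is small by default. A minimal sketch of the kind of settings that page covers; the values just mirror what the OP described plus a few examples:
Code:
# settings.py -- broad-crawl style settings (example values)
REACTOR_THREADPOOL_MAXSIZE = 300   # DNS lookups run in this pool; the default is tiny
CONCURRENT_REQUESTS = 300          # match the pool so requests aren't queued behind DNS
DNS_TIMEOUT = 10                   # don't wait forever on a dead nameserver
RETRY_ENABLED = False              # broad crawls usually drop retries
COOKIES_ENABLED = False            # and cookies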
 
Yeah, so apparently I was a dumbass and missed a very simple thing. I read the documentation for scrapy and there was lone small page named broad crawls that had the most critical aspect of multi-domain scraping: apparently the DNS resolver has an event loop with a limited number of events that it supports (20), and you have to raise it manually. I cranked it to 300 and now I get about 2 million websites a day, which is what my webpage can support at max, and I'm completely fine with that.
lone --> one
webpage --> vps

wrote it late at night
 