Using Scrapebox to get millions of emails

MihailoDev

Newbie
Dec 12, 2016
Hi.

Can you recommend some settings/proxies to scrape millions of emails and phone numbers daily?

The guy who previously worked on Scrapebox was getting around 5 to 7 million phone numbers daily.

How can I replicate these results with phones and emails?

Are there any proxy providers that could help? Can I set scraping to 2,000 simultaneous threads?

Is there anyone who has these kinds of results?

Thanks,
Mihailo
 
Yes, most proxy providers will give you unlimited threads as long as you're paying per GB. I think that will be the best option for you.
 
Getting to 2,000 threads is possible, but it will likely push or exceed the limits of your machine. You would need the CPU and memory to process 2,000 web pages at the same time, which, if you think about it, is a lot. You're also going to want at least 200 private proxies for this, maybe more.

You will probably need to use Google Public DNS or similar, as even a data center DNS is going to be weak for this task, unless you're paying hundreds of dollars per month for the server.

Also, just a question: if you get 5 million phone numbers daily and do that for 30 days straight, that's 150 million phone numbers. Can you use that many?

I ask because I have fielded this question many times over the years, and in the majority of cases, when you widen the time window you are looking at to a month or more, the amount most people need daily goes way down. Maybe you have a plan, and that's great, but if you look at this over a 3-month window, can you use 450 million phone numbers?
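To make the arithmetic above concrete, here is a minimal sketch that just multiplies a constant daily rate by the number of days. The function name `total_numbers` is made up for illustration, and treating 3 months as 90 days is an assumption.

```python
def total_numbers(per_day: int, days: int) -> int:
    """Total phone numbers accumulated at a constant daily rate."""
    return per_day * days


PER_DAY = 5_000_000  # daily figure quoted in this thread

# 30 days is one month; 90 days is an assumed length for a 3-month window.
for days in (30, 90):
    print(f"{days} days: {total_numbers(PER_DAY, days):,} numbers")
```

Run the other way around, the same numbers show why the daily requirement usually shrinks: divide what you can actually use in a quarter by 90.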

Anyway there may be some helpful resources in my signature.
 
Hi loopline. Thanks for your answer.

The company I'm setting up SB (Scrapebox) for sends 4 million voicemails daily.
We have a machine with 24 GB of RAM and a 3.70 GHz CPU (2 processors).

What would be the maximum number of scraping threads we could achieve with this?

What proxies would you use? Rotating ones?

Could you recommend a provider and their package?

If you were to use Scrapebox to its maximum:

Where would you get the VM?
How many proxies would you use and from where?
How many threads would you set it to?

Thanks a lot loopline.
Mihailo
 
For 2000 threads you'll need a powerful machine.

You can use a proxy for every thread but it's gonna be expensive.

Do you want to scrape the search results or is it another source?
You'll also need huge rotations of proxies.
 
I can't say for sure what the max scraping threads would be, but you could probably get near 2000, give or take. Windows can handle more than 200, sometimes, but sometimes it starts to have issues when you exceed that. If you want to scale, it's ideal to do as Google does: many smaller machines rather than fewer larger ones.

If you have the machine already, max it out and go from there. But if you are adding more, you may find it better to run multiple smaller VPSes with 8 GB to 12 GB of RAM each.

It's not really possible to say for sure how many threads you can get, but it's easy enough to find out: keep pushing the limit until it breaks, back it off some, and then you know. You can run an unlimited number of instances of Scrapebox on a single machine in any case, so no worries there.

Rotating proxies are generally slow, as they are being used by multiple people, and the source could be a generic public proxy that's in the pool, or other slow sources. If you want speed, I would just use private proxies or shared private proxies. That way it's data center based, so you get speed and still have lots of IPs.

Unless you're hitting a LOT of URLs from a few domains, the quantity of proxies will only matter in relation to total volume/speed, and not quantity of IPs. Google is not going to see the websites you're scraping from anyway.


I assume you already have a list of URLs/domains and you just need to scrape the phone numbers from them?

For a provider, I would use my private proxy. I'll PM you a discount code and a list of providers.

I would get the machine from Solid SEO VPS, which is what I have used for years. Any place can work; they are just already aware of Scrapebox, so it's all good.


How many proxies would I use? That's a good question. Ideally you want to use 10 connections per proxy. Sometimes you can use more, but it's a good place to start. So 2,000 connections is 200 proxies, and so on. I would pick a number, say 150 or 200, and try it. Then, if all is well, you can always add more proxies.
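As a quick worked version of that rule of thumb, here is a minimal sketch that divides the connection count by 10 connections per proxy and rounds up. The function name `proxies_needed` is made up for illustration, and the default of 10 is only the starting point suggested here, not a hard limit.

```python
def proxies_needed(connections: int, per_proxy: int = 10) -> int:
    """Proxies required so that no proxy carries more than `per_proxy` connections."""
    if connections < 0 or per_proxy <= 0:
        raise ValueError("connections must be >= 0 and per_proxy must be > 0")
    # Ceiling division: a partly used proxy still counts as a whole proxy.
    return -(-connections // per_proxy)


print(proxies_needed(2000))  # 2,000 connections at 10 per proxy
print(proxies_needed(1500))  # a smaller starting point
```

If a provider tolerates more connections per proxy, raising `per_proxy` lowers the count accordingly.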


I would aim to get to 2000 threads, but you will need multiple instances.


At the end of the day, the name of the game here is to push the envelope of the server and see where it chokes, then back off and you're good. That's how I have always done it anyway. It's kind of like pushing every button and seeing what happens, and then you know how stuff works.

Mind you, that breaks stuff sometimes, and sometimes you have to clean up a mess, LOL. But I am a limit pusher, so it's what I do.
 
Have you seen the quality of the emails from the email plugin? I was not so impressed, many on free accounts and whatnot. I like their YellowPages scraper better in general. Though I have switched to Cloud Scraper more or less for now.
 
You could also use Fiverr or Microworkers for that. Just a thought, sorry if it's not the kind of reply you were expecting.
 
Personally, I would approach this differently. If you are scraping business telephone numbers, I would simply extract the Common Crawl domain dataset (over 100 million domains), then crawl the pages one level deep from the root of each domain, scraping telephone numbers as you go.

You’ll have the phone number for most businesses online.

You also won’t be limited by proxies.
 
The emails from the email plugin are just emails off websites. So if they are low quality, then it's just that the websites loaded in had low-quality emails on them. But you can set up filters or load in only what you need to get what you want.
 
Please explain?

Also, if crawling webpages you wouldn't use proxies? How many simultaneous threads would you use? Thanks :)
 