Which Software Is Best for Checking External Links in a Huge Domain Database

iisark

Junior Member
Dec 27, 2009
Hi Guys,

I need advice on which software is best for checking external links in a huge domain database.
Let's say I have a list of 2,000,000 domains. I need software to:
1. crawl all of those domains
2. find all external links
3. save the data in a file.

Say one of the domains in the list is example.com.
The software needs to:
1. crawl example.com and all of its internal pages (two levels deep), e.g. example.com/contact, example.com/about-us ...
2. find all external links on those pages, e.g. wordpress.com/how-to, google.com/news ...
3. save all of those external link URLs to a file
4. move to the next domain in the database (see the sketch below).
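
Roughly, I am picturing a loop like the following. This is just a minimal sketch in Python to make the requirements concrete; it assumes the `requests` and `beautifulsoup4` libraries, and the file names and depth limit are placeholders, not anything final.

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

MAX_DEPTH = 2  # homepage is depth 0, so this goes two levels of internal pages deep

def crawl_domain(domain, out):
    """Crawl one domain and write every external link it finds to `out`."""
    seen = set()
    queue = [("http://" + domain, 0)]
    while queue:
        url, depth = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # dead host, timeout, etc.; skip and keep going
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            parts = urlparse(link)
            if parts.scheme not in ("http", "https"):
                continue  # skip mailto:, javascript:, and similar
            if parts.netloc == domain or parts.netloc.endswith("." + domain):
                if depth < MAX_DEPTH:
                    queue.append((link, depth + 1))  # internal page: crawl it too
            else:
                out.write(link + "\n")  # external link: save it

with open("domains.txt") as domains, open("external_links.txt", "w") as links:
    for line in domains:
        if line.strip():
            crawl_domain(line.strip(), links)  # then move to the next domain
```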

Can you recommend software for this job?
I'm planning to buy ScrapeBox + the expired domain finder plugin, but I'm not sure it's what I need.
Also, what time and resources are needed to finish such a big task?
 
Time varies based on resources, and as for resources, a decent $100 VPS, or probably even less, would be more than plenty.

As for being able to do it, I'm a little lost as to whether your second 1-4 is a repeat of your first 1-3, or if you want both in succession.

ScrapeBox can do it with the link extractor, but you won't be able to load in 2 million domains to start. If you took 2 million domains and got all the external links alone, you might be looking at 500 million URLs, and that's conservative; it could be in the billions, and Windows can't handle more than 134 million lines in a file.

So you would need to break it into chunks and work through it (a sketch of that is below). If your ultimate goal is just to find expired domains, then just use the expired domain finder, load it in chunks, and let it run.
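
Splitting the master list is easy to script yourself. Here is a minimal sketch in Python, assuming a plain-text master list with one domain per line; the 100,000 chunk size is arbitrary, so pick whatever your tooling handles comfortably:

```python
CHUNK_SIZE = 100_000  # arbitrary; adjust to what your machine copes with

with open("domains.txt") as src:
    chunk, part = [], 0
    for line in src:
        chunk.append(line)
        if len(chunk) == CHUNK_SIZE:
            with open(f"chunk_{part:03d}.txt", "w") as out:
                out.writelines(chunk)
            chunk, part = [], part + 1
    if chunk:  # write out whatever is left over
        with open(f"chunk_{part:03d}.txt", "w") as out:
            out.writelines(chunk)
```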

Any way you slice it, 2 million domains is going to take quite a long time, as in probably weeks, to work through.
 

Hi loopline, the second 1-4 is just there to explain better what I need. Actually, I'm already in the process of doing it, but it is very hard to crawl a few billion URLs.
 
Yes, crawling a few billion URLs will take a minute. :)
 
Yes, and a $100 VPS can't do the trick. We are using a 6×3.0 GHz server with 16 MB cache and it's still slow.
Windows wasn't designed for high thread counts. It's better to use several small machines than one big machine; even Google uses many small machines. That might not be practical, depending on the scope of the project, but it's more efficient anyway. One way to split the work is sketched below.
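
One simple way to divide the list across several boxes is to hash each domain and give each machine its share. A minimal sketch in Python, assuming each worker is started with its own index and the total worker count; the script and file names are placeholders:

```python
import hashlib
import sys

# Hypothetical usage: python shard.py <worker_id> <num_workers>
worker_id, num_workers = int(sys.argv[1]), int(sys.argv[2])

with open("domains.txt") as src, open(f"worker_{worker_id}.txt", "w") as out:
    for line in src:
        domain = line.strip()
        if not domain:
            continue
        # Hashing gives a stable assignment even if the list is regenerated
        # or reordered later.
        digest = hashlib.md5(domain.encode()).hexdigest()
        if int(digest, 16) % num_workers == worker_id:
            out.write(domain + "\n")
```

Run it once per machine (e.g. `python shard.py 0 6` on the first of six boxes, `python shard.py 1 6` on the second, and so on), then feed each machine its own worker_N.txt.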
 
Hi loopline,

Thanks for your suggestion. We may try this as well.
 