Which Software Is Best for Checking External Links in a Huge Domain Database

Discussion in 'Domain Names & Parking' started by iisark, Oct 5, 2016.

  1. iisark

    iisark Jr. VIP

    Joined:
    Dec 27, 2009
    Messages:
    109
    Likes Received:
    28
    Hi Guys,

    I need advice on which software is best for checking external links in a huge domain database.
    Let's say I have a list of 2,000,000 domains. I need software to:
    1. crawl all those domains
    2. find all external links
    3. save the data in a file.

    For example, if one of the domains in the list is example.com, the software needs to:
    1. Crawl example.com and all of its internal pages (2 levels deep), e.g. example.com/contact, example.com/about-us ...
    2. Find all external links on these pages, e.g. wordpress.com/how-to, google.com/news ...
    3. Save all of these external link URLs in a file.
    4. Move to the next domain in the database.
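
The four steps above can be sketched in Python using only the standard library. This is a minimal illustration, not how ScrapeBox or any other tool mentioned in this thread works internally. The names `LinkParser`, `host_of`, `fetch` and `external_links` are made up for this example, and it assumes that subdomains (e.g. blog.example.com) count as external and that every site answers on plain http.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def host_of(url):
    """Lower-cased host name without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def fetch(url):
    """Download a page; return '' on any failure so the crawl keeps going."""
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        return ""


def external_links(domain, max_depth=2, fetch=fetch):
    """Breadth-first crawl of one domain; returns the set of external URLs."""
    start = "http://" + domain + "/"  # assumption: the site answers on http
    site = host_of(start)
    seen = {start}
    queue = deque([(start, 0)])
    external = set()
    while queue:
        url, depth = queue.popleft()
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.hrefs:
            link = urljoin(url, href).split("#")[0]  # absolute URL, no fragment
            if urlparse(link).scheme not in ("http", "https"):
                continue  # skip mailto:, javascript:, etc.
            if host_of(link) != site:
                external.add(link)
            elif depth < max_depth and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return external
```

To cover steps 3 and 4, one could loop over the domain list, call `external_links(domain)` for each entry, and append the results to an output file one URL per line. A single-threaded loop like this would be far too slow for 2,000,000 domains; it only shows the logic.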

    Can you recommend software to do this job?
    I'm planning to buy ScrapeBox plus the expired domain finder plugin, but I'm not sure it is what I need.
    Also, how much time and what resources are needed to finish such a big task?
     
  2. loopline

    loopline Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,806
    Likes Received:
    2,027
    Gender:
    Male
    Time varies based on resources. As for resources, a decent $100 VPS, or probably even less, would be more than plenty.

    As for being able to do it, I'm a little lost as to whether your second list (1-4) is a repeat of your first (1-3), or whether you want both in succession.

    ScrapeBox can do it with the link extractor, but you won't be able to load in 2 million domains to start. If you took 2 million domains and extracted just the external links, you might be looking at 500 million URLs, and that's conservative; it could be in the billions, and Windows can't handle more than 134 million lines in a file.

    So you would need to break it into chunks and work through them. If your ultimate goal is just to look for expired domains, then just use the expired domain finder, load the list in chunks, and let it run.
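
As a rough illustration of breaking the list into chunks, the Python sketch below splits a large domain list into smaller files that can be loaded one at a time. The function names, the `chunk_0001.txt` naming scheme and the chunk size of 100,000 are assumptions made for this example; with 2 million domains that size would give 20 files.

```python
from itertools import islice


def chunks(lines, size):
    """Yield lists of up to `size` non-empty, stripped lines."""
    cleaned = (line.strip() for line in lines if line.strip())
    while True:
        chunk = list(islice(cleaned, size))
        if not chunk:
            return
        yield chunk


def split_file(path, size=100_000, prefix="chunk"):
    """Write the lines of `path` out as prefix_0001.txt, prefix_0002.txt, ..."""
    names = []
    with open(path, encoding="utf-8") as src:
        # The source file is read lazily, so it never has to fit in memory.
        for n, chunk in enumerate(chunks(src, size), start=1):
            name = f"{prefix}_{n:04d}.txt"
            with open(name, "w", encoding="utf-8") as out:
                out.write("\n".join(chunk) + "\n")
            names.append(name)
    return names
```

Calling `split_file("domains.txt")` (a hypothetical file name) returns the list of chunk files it wrote, which can then be processed one after another.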

    Any way you slice it, 2 million domains is going to take quite a long time to work through, probably weeks.
     
  3. samcram

    samcram Jr. VIP

    Joined:
    Sep 10, 2014
    Messages:
    138
    Likes Received:
    36
    Occupation:
    SEO
    Location:
    Moscow
    You can do it with A-Parser, but A-Parser is hard to configure.
     
  4. iisark

    iisark Jr. VIP

    Joined:
    Dec 27, 2009
    Messages:
    109
    Likes Received:
    28
    Hi loopline, the second 1-4 is just to better explain what I need. Actually, I'm in the process of doing it, but it is very hard to crawl a few billion URLs.
     
  5. loopline

    loopline Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,806
    Likes Received:
    2,027
    Gender:
    Male
    Yes, crawling a few billion URLs will take a minute. :)
     
  6. iisark

    iisark Jr. VIP

    Joined:
    Dec 27, 2009
    Messages:
    109
    Likes Received:
    28
    Yes, and a $100 VPS can't do the trick. We are using a 6×3.0 GHz server with 16 MB cache and it is still slow.
     
  7. loopline

    loopline Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,806
    Likes Received:
    2,027
    Gender:
    Male
    Windows wasn't designed for high thread counts. It's better to use several small machines than one big machine; even Google uses many small machines. That might not be practical, depending on the scope of the project, but it's more efficient anyway.
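
One possible way to spread the work over several small machines is to assign each domain to a machine by hashing its name, so every machine can work out its own share from the same list without any coordination. The Python sketch below shows the idea; `shard_of` and `my_domains` are hypothetical names, and MD5 is used here only as a stable hash that gives the same result on every machine, not for security.

```python
import hashlib


def shard_of(domain, num_machines):
    """Map a domain to a machine index in the range 0..num_machines-1."""
    digest = hashlib.md5(domain.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines


def my_domains(domains, machine_index, num_machines):
    """Return the domains that the machine with this index should crawl."""
    return [d for d in domains if shard_of(d, num_machines) == machine_index]
```

With four machines, for example, machine 0 would run `my_domains(all_domains, 0, 4)` and machine 3 would run `my_domains(all_domains, 3, 4)`. Every domain lands on exactly one machine, and the split is usually close to even.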
     
  8. iisark

    iisark Jr. VIP

    Joined:
    Dec 27, 2009
    Messages:
    109
    Likes Received:
    28
    Hi loopline,

    Thanks for your suggestion. We may try this as well.
     
  9. loopline

    loopline Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,806
    Likes Received:
    2,027
    Gender:
    Male