Tool to get all URLs from root domain (large list)

Discussion in 'Black Hat SEO Tools' started by jb2008, Sep 22, 2011.

  1. jb2008

    jb2008 Senior Member

    Joined:
    Jul 15, 2010
    Messages:
    1,158
    Likes Received:
    972
    Occupation:
    Scraping, Harvesting in the Corn Fields
    Location:
    On my VPS servers
    I have tried things like Gwebsitecrawler and all the SB addons, but nothing does the task of getting all the URLs of a site (I've got a list of 120,000 root domains whose internal URLs I need to get).

    This is basically an alternative to the G0ogle site: command. I need to do this because G indexes only a small fraction of an entire site, so I am missing about 90% of viable targets.

    Does anyone have any ideas?
     
  2. macdonjo3

    macdonjo3 Jr. VIP Jr. VIP Premium Member

    Joined:
    Nov 8, 2009
    Messages:
    5,563
    Likes Received:
    4,317
    Location:
    Toronto
    Home Page:
    Maybe you should buy an SBOX coaching program to learn all of its functionality. If you don't know this, then you don't know a lot of the secrets. :p
     
  3. scriptomania

    scriptomania Junior Member

    Joined:
    Dec 28, 2010
    Messages:
    127
    Likes Received:
    249
    Occupation:
    A full time pirate at sea
    Location:
    The European capital of politics
    Hey OP,

    I can code a very simple Python script for you (no worries, I'm not gonna charge) that would do what you're asking, if you're willing to wait about 24 hours (as I have a ton of shit to do...).

    Drop me a PM.

    Cheers
     
    • Thanks Thanks x 1
  4. jake3340

    jake3340 Jr. VIP Jr. VIP Premium Member

    Joined:
    Nov 20, 2008
    Messages:
    1,368
    Likes Received:
    414
    Location:
    Pluto
  5. scriptomania

    scriptomania Junior Member

    Joined:
    Dec 28, 2010
    Messages:
    127
    Likes Received:
    249
    Occupation:
    A full time pirate at sea
    Location:
    The European capital of politics
    Ok,

    I'll do you guys a favour and code it up as soon as I get some time. You'd basically just need a crawler, and that's all. Give me 24-48 hours. Would you like the source code, or should I make the tool publicly available through my gapps account?
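
    Roughly, it would look something like this -- a first sketch in Python 3, stdlib only, so don't hold me to the details:

    Code:
    # Sketch of a same-domain crawler: start at the homepage and follow
    # internal links breadth-first, collecting every URL seen on the way.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl_site(root, max_pages=10000):
        domain = urlparse(root).netloc
        seen, queue = {root}, deque([root])
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            try:
                page = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            except Exception:
                continue  # dead page, bad encoding, whatever -- skip it
            parser = LinkParser()
            parser.feed(page)
            for href in parser.links:
                absolute = urljoin(url, href).split("#")[0]
                if urlparse(absolute).netloc == domain and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return seen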

    Cheers
     
  6. dannistone

    dannistone Regular Member

    Joined:
    Aug 7, 2011
    Messages:
    230
    Likes Received:
    103
    Location:
    The Balcans
    You could use Xenu's Link Sleuth and automate it in some way...
     
  7. jb2008

    jb2008 Senior Member

    Joined:
    Jul 15, 2010
    Messages:
    1,158
    Likes Received:
    972
    Occupation:
    Scraping, Harvesting in the Corn Fields
    Location:
    On my VPS servers
    @macdonjo, wtf are you talking about? SB can harvest, so you could use the site: operator (which, as I explained, doesn't return anywhere near the number of URLs that actually exist), or if a site has an XML sitemap you can use the sitemap scraper. But many, many sites don't have XML sitemaps, so please enlighten me...

    @scriptomania, well, if you're able to do it successfully, I'd be willing to pay. I'm a man of modest means, but I don't expect custom-made scripts for free. If I pay, we'd keep it private, but alternatively you could choose to release it to the public. It depends on what you want to do.

    Yes, it's basically a crawler. I need it to go through a list of root domains and return all the URLs on the site for each domain. So it's like a lot of sitemap generators, but automated across multiple domains, rather than generating all the URLs for just one root domain (site).

    It's very useful because, for whatever reason, search engines leave a massive number of URLs unindexed, many of which are viable posting targets.
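
    To be concrete: assuming a single-site crawl_site() function like the sketch scriptomania posted above, the outer loop over my domain list is the trivial part -- something like this (one output file per domain, so a crash halfway through doesn't lose everything):

    Code:
    import os
    from urllib.parse import urlparse

    def crawl_domain_list(list_file, out_dir="crawled"):
        os.makedirs(out_dir, exist_ok=True)
        with open(list_file) as f:
            domains = [line.strip() for line in f if line.strip()]
        for domain in domains:
            root = domain if domain.startswith("http") else "http://" + domain
            urls = crawl_site(root)  # single-site crawler sketched above
            name = urlparse(root).netloc.replace(":", "_") + ".txt"
            with open(os.path.join(out_dir, name), "w") as out:
                out.write("\n".join(sorted(urls)) + "\n")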
     
    • Thanks Thanks x 1
  8. luccha

    luccha Regular Member

    Joined:
    Apr 18, 2009
    Messages:
    317
    Likes Received:
    93
    Occupation:
    Cron
    Location:
    On Earth
    I use WebDataExtractor. It scans the site quickly and gets all URLs.
     
    • Thanks Thanks x 1
  9. jason2009

    jason2009 Senior Member

    Joined:
    Apr 23, 2010
    Messages:
    1,005
    Likes Received:
    206
    Occupation:
    Student
    Location:
    Earth
    Links can be harvested by SB, but Google only returns 1,000 URLs per query.
    @scriptomania, waiting to see what you will make for us ;)
     
  10. 4don4i

    4don4i Newbie

    Joined:
    Sep 23, 2011
    Messages:
    22
    Likes Received:
    1
    I use WebDataExtractor too... It scans the site quickly and gets all URLs... it's easy.
     
  11. jb2008

    jb2008 Senior Member

    Joined:
    Jul 15, 2010
    Messages:
    1,158
    Likes Received:
    972
    Occupation:
    Scraping, Harvesting in the Corn Fields
    Location:
    On my VPS servers
    I looked at WDE and it can only do 1 site at a time.

    I have a list of 100,000+ root domains that I need to have all URLs for.
     
    Last edited: Sep 23, 2011
  12. fred7

    fred7 Regular Member

    Joined:
    Jun 6, 2009
    Messages:
    266
    Likes Received:
    76
    Occupation:
    college student & part time trader
    Location:
    Blackhatworld
    My 2 cents here: just use Scrapebox's link checker. Spend a little time experimenting with it and I'm sure you'll figure it out. :)
     
  13. luccha

    luccha Regular Member

    Joined:
    Apr 18, 2009
    Messages:
    317
    Likes Received:
    93
    Occupation:
    Cron
    Location:
    On Earth
    It can work from a list of URLs. The URLs are picked up from a text file, and it scans each of them one by one. It can harvest URLs, emails, and phone & fax numbers (I never tried that option), with lots of advanced settings. Moreover, it's the fastest harvester I have ever used.
     
  14. FuryKyle

    FuryKyle Jr. VIP Jr. VIP Premium Member

    Joined:
    Nov 19, 2010
    Messages:
    2,397
    Likes Received:
    1,369
  15. claymc

    claymc Newbie

    Joined:
    Oct 19, 2010
    Messages:
    9
    Likes Received:
    0
    100,000+ root domains, and you need it to spider all of those? So if each only had 10 pages, you'd get a result of 1M URLs?
     
  16. jb2008

    jb2008 Senior Member

    Joined:
    Jul 15, 2010
    Messages:
    1,158
    Likes Received:
    972
    Occupation:
    Scraping, Harvesting in the Corn Fields
    Location:
    On my VPS servers
    @luccha, but it won't go through each whole site and list all the URLs for every site. If you load a list of URLs, it just gets data for each individual page.

    @fred, and the SB link checker? It does have its uses, and I suppose you are suggesting checking for (internal-only) links on the page, but the problem is that it won't spider the whole site. Unless EVERY page is linked from the home page, it won't return much. I tried this and wasn't getting many URLs at all. It's a nice idea, but in practice it doesn't work.

    @claymc, yes, I actually have 2 lists of 100,000 root domains, so 200k total. I need to spider all of these, extract all the URLs, and then manually filter in notepad++ to hopefully weed out a lot of unpostable pages. I would probably end up with around 200,000,000 URLs in one text file, so I would have no choice but to split it into about 50 parts and edit each one individually. (Ouch, but it can be done in a couple of days.)

    Since there seems to be no solution for what at first appeared to me to be a simple problem, I have just paid someone on vworker to build me a crawler. It's more than I was hoping to pay, but I need this.
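
    For the splitting and filtering step, by the way, notepad++ may not even be needed -- a few lines of Python can stream through the merged file without ever loading it into RAM (a rough sketch; the SKIP list is just an example of what I'd filter out):

    Code:
    # Split one huge URL list into N roughly equal parts, streaming line
    # by line so a 200M-line file never has to fit in memory, dropping
    # obviously unpostable URLs (images, stylesheets, scripts) on the way.
    SKIP = (".jpg", ".jpeg", ".png", ".gif", ".css", ".js", ".ico")

    def split_url_file(path, parts=50):
        outs = [open("%s.part%02d" % (path, i), "w") for i in range(parts)]
        try:
            with open(path) as f:
                kept = 0
                for line in f:
                    if line.strip().lower().endswith(SKIP):
                        continue
                    outs[kept % parts].write(line)
                    kept += 1
        finally:
            for out in outs:
                out.close()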
     
    Last edited: Sep 24, 2011
  17. wannabie

    wannabie Elite Member

    Joined:
    Mar 11, 2009
    Messages:
    3,807
    Likes Received:
    2,954
    Occupation:
    SEO and Marketing, Surprisingly
    Location:
    Your bedroom window
    Home Page:
    Why not just create a sitemap, then export the URLs?

    Or use Xenu's Link Sleuth?
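
    For sites that already publish one, pulling /sitemap.xml per domain is the cheap version of this -- a quick sketch (as noted above, plenty of sites won't have one, and a sitemap index that points at sub-sitemaps would need a second pass):

    Code:
    # Fetch /sitemap.xml for a domain and pull out the <loc> entries.
    import xml.etree.ElementTree as ET
    from urllib.request import urlopen

    def sitemap_urls(domain):
        try:
            data = urlopen("http://%s/sitemap.xml" % domain, timeout=10).read()
            tree = ET.fromstring(data)
        except Exception:
            return []  # no sitemap, bad XML, timeout: fall back to crawling
        # namespaced tags look like
        # {http://www.sitemaps.org/schemas/sitemap/0.9}loc
        return [el.text.strip() for el in tree.iter()
                if isinstance(el.tag, str) and el.tag.endswith("loc") and el.text]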
     
  18. banel

    banel Regular Member

    Joined:
    Mar 30, 2010
    Messages:
    287
    Likes Received:
    16
    Can you suggest a sitemap generator that can work with a list of domains?
     
  19. wannabie

    wannabie Elite Member

    Joined:
    Mar 11, 2009
    Messages:
    3,807
    Likes Received:
    2,954
    Occupation:
    SEO and Marketing, Surprisingly
    Location:
    Your bedroom window
    Home Page:
    How many domains?
     
  20. banel

    banel Regular Member

    Joined:
    Mar 30, 2010
    Messages:
    287
    Likes Received:
    16
    Well, it doesn't matter, but I don't want something where you have to insert every damn link...