
How can I use ScrapeBox to harvest all URLs on a root domain?

Discussion in 'Black Hat SEO Tools' started by cjshort, Jan 13, 2015.

  1. cjshort

    cjshort Registered Member

    Joined:
    Dec 23, 2014
    Messages:
    87
    Likes Received:
    5
    Occupation:
    Web Dev
    Location:
    United Kingdom
    I have 600 domains, I need to crawl them all for their URLS and then scan each page for an email so that I can build an email list from them all.

    Any recommendations on how I can do this?
     
  2. bartosimpsonio

    bartosimpsonio Jr. VIP Premium Member

    Joined:
    Mar 21, 2013
    Messages:
    12,489
    Likes Received:
    11,190
    Occupation:
    CHEAP
    Location:
    DATASETS
    This is basically what SB is for....

    There is an email harvester plugin for SB and tutorials specifically for this purpose.
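
    If you end up doing that page-scan step by hand instead of with the plugin, a minimal Python sketch of the email scan might look like this (the regex is a simplification and will miss obfuscated addresses such as name [at] domain [dot] com):

    import re
    from urllib.request import urlopen

    # Naive pattern: good enough for plain-text addresses on a page.
    EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

    def emails_on_page(url):
        html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        return set(EMAIL_RE.findall(html))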
     
  3. cjshort

    cjshort Registered Member

    Joined:
    Dec 23, 2014
    Messages:
    87
    Likes Received:
    5
    Occupation:
    Web Dev
    Location:
    United Kingdom

    Yes, I know how to harvest the emails. I'm just not sure how to harvest all the URLs of a site from its root domain so I can then search all those URLs for emails. Any ideas?
     
  4. ch8878

    ch8878 Elite Member

    Joined:
    Mar 21, 2009
    Messages:
    2,242
    Likes Received:
    429
    Gender:
    Male
    Occupation:
    Gamer
    Location:
    Youtube
    If the site has a sitemap, just use the sitemap scraper.
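
    Outside ScrapeBox, the sitemap route is only a few lines of Python. A sketch, assuming the sitemap sits at the conventional /sitemap.xml path and is a plain urlset rather than a gzipped sitemap index:

    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    def sitemap_urls(domain):
        # Assumes the conventional location; real sites may list their
        # sitemap in robots.txt or use a sitemap index instead.
        xml = urllib.request.urlopen(f"https://{domain}/sitemap.xml", timeout=10).read()
        return [loc.text for loc in ET.fromstring(xml).iter(SITEMAP_NS + "loc")]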
     
  5. cjshort

    cjshort Registered Member

    Joined:
    Dec 23, 2014
    Messages:
    87
    Likes Received:
    5
    Occupation:
    Web Dev
    Location:
    United Kingdom
    The problem is that when you have 600+ sites, a lot of them do not have sitemaps, so it's missed business :(
     
  6. BassTrackerBoats

    BassTrackerBoats Super Moderator Staff Member Jr. VIP

    Joined:
    Mar 10, 2010
    Messages:
    16,762
    Likes Received:
    30,776
    Occupation:
    Selling CPA Sites
    Location:
    Not England
    Screaming Frog can harvest all the URLs from a domain. I'm sure a small script could harvest the URLs automatically for you so you do not have to re-enter domains time and time again, and then you can use SB.
     
  7. Sephirot_90

    Sephirot_90 Junior Member

    Joined:
    May 20, 2011
    Messages:
    100
    Likes Received:
    39
    Location:
    Belgrade
    You can do this in three ways:

    1. Just add site: before every URL you need to crawl and put the list into the harvester. This will find all the URLs that are indexed in the search engine.
    2. Use the link extractor and set it to scrape internal links. Insert the 600+ URLs and it will find some number of URLs from those sites. Now insert the URLs you got and repeat until no new internal URLs turn up (see the sketch after this list).
    3. Use the sitemap scraper, which is also a plugin in ScrapeBox.

    With these three ways you can find all the links you need. Happy scraping!
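
    A minimal sketch of what approach 2 is doing under the hood: extract_internal_links below is a hypothetical stand-in for the link extractor plugin, and the loop keeps feeding newly found URLs back in until nothing new turns up. Error handling and non-HTML responses are ignored for brevity:

    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.hrefs = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.hrefs += [v for k, v in attrs if k == "href" and v]

    def extract_internal_links(url):
        # Stand-in for the ScrapeBox link extractor: fetch one page,
        # keep only links that stay on the same host.
        parser = LinkParser()
        parser.feed(urlopen(url, timeout=10).read().decode("utf-8", "ignore"))
        host = urlparse(url).netloc
        links = {urljoin(url, h) for h in parser.hrefs}
        return {link for link in links if urlparse(link).netloc == host}

    def all_internal_urls(start_url):
        # Approach 2 as a loop: re-feed newly found URLs until stable.
        seen, todo = set(), {start_url}
        while todo:
            url = todo.pop()
            seen.add(url)
            todo |= extract_internal_links(url) - seen
        return seen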
     
  8. cjshort

    cjshort Registered Member

    Joined:
    Dec 23, 2014
    Messages:
    87
    Likes Received:
    5
    Occupation:
    Web Dev
    Location:
    United Kingdom
    Perfect, thanks boss.

    Regarding option 1, where do I put the URL with site: in front? In the keywords or the footprint?
     
  9. J-S-T

    J-S-T Jr. VIP

    Joined:
    Jul 27, 2013
    Messages:
    1,252
    Likes Received:
    625
    Gender:
    Male
    Location:
    Fb and BHW
    Enter all 600 URLs in the keywords box.

    Then create a .txt file, type site: in it, and save the file.

    Now click "M" (Merge) in ScrapeBox and select the .txt file you created.

    Now you should have it like this:

    site:url1
    site:url2
    site:url3

    Start harvesting.

    Option 2, using the link extractor plugin, is much better for you, I guess.
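
    The same merge step can be scripted if you'd rather skip the Merge button. A sketch, where urls.txt and queries.txt are hypothetical file names with one URL per line:

    # Prefix every URL with "site:" -- what the Merge step produces.
    with open("urls.txt", encoding="utf-8") as src, \
         open("queries.txt", "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                dst.write("site:" + line.strip() + "\n")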
     
  10. Sephirot_90

    Sephirot_90 Junior Member

    Joined:
    May 20, 2011
    Messages:
    100
    Likes Received:
    39
    Location:
    Belgrade
    I agree. It's faster, you will get URLs that aren't indexed, and you don't need proxies.
     
  11. TheVegan

    TheVegan Junior Member

    Joined:
    Mar 6, 2013
    Messages:
    179
    Likes Received:
    33
    Occupation:
    blackhat
    Location:
    Prague
    I can do this in about 15 lines of Python code, so no need for ScrapeBox hahaha
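
    Roughly what such a script could look like, as a sketch: it only scans each domain's front page with a naive email regex, domains.txt is an assumed file with one domain per line, and a real run would first collect every internal URL as described earlier in the thread:

    import re
    import urllib.request

    EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

    def page_emails(url):
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except OSError:
            return set()  # dead domain, timeout, etc.
        return set(EMAIL_RE.findall(html))

    emails = set()
    with open("domains.txt", encoding="utf-8") as f:
        for domain in (line.strip() for line in f if line.strip()):
            emails |= page_emails(f"http://{domain}")
    print("\n".join(sorted(emails)))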