
HELP! How to scrape external URLs from a website?

Discussion in 'Black Hat SEO' started by SuperLinks, Feb 3, 2015.

  1. SuperLinks

    SuperLinks Elite Member

    Joined:
    Jul 14, 2008
    Messages:
    2,903
    Likes Received:
    847
    Location:
    New York
    I'm looking to see what the best tools or methods are for scraping the external URLs on a website. I've tried Xenu Link Sleuth, but I'm running into problems crawling sites on my VPS: I'm running out of memory because Xenu also documents all the internal URLs and information about them, making it a memory hog rather than an effective tool for this.

    I've also tried Scrapebox, scraping the indexed URLs and then using the external URL tool, but that method only works for indexed domains and indexed URLs, both of which can be a problem for this task.

    Is anyone aware of a tool or method to efficiently scrape the external URLs from a website?
     
  2. rogerke

    rogerke Regular Member

    Joined:
    Oct 5, 2014
    Messages:
    264
    Likes Received:
    145
    I'm not sure I understand your question. First you put a seed list into Xenu but hit memory issues because it also extracts all the status codes/internal links. Then you used Scrapebox, but can only use it for indexed URLs? The way I understand it, you can only use indexed URLs because you scraped Google to find those seed URLs, but what's the difference from the Xenu seed list then?
    Either you scraped Google to find a seed list (same problem with indexed domains as Scrapebox) or you used external links from a previous Xenu run (still the same, because you can use/extract that seed list in SB as well).
     
    Last edited: Feb 3, 2015
  3. sneakysneakums

    sneakysneakums Newbie

    Joined:
    Dec 4, 2014
    Messages:
    5
    Likes Received:
    2
    Have you tried Screaming Frog? I'm not 100% sure, but there might be a way to configure it so that it crawls external links only.
     
  4. FBGuru

    FBGuru Senior Member

    Joined:
    Sep 22, 2013
    Messages:
    928
    Likes Received:
    1,172
    Location:
    Personality Type : ESTP
    Crawl only the internal links with Xenu and run those links through Scrapebox's External Link Extractor add-on.

    Go to Options and untick all the checkboxes under the "Report" field, which should greatly reduce the memory usage.

     
    • Thanks Thanks x 1
  5. SuperLinks

    SuperLinks Elite Member

    Joined:
    Jul 14, 2008
    Messages:
    2,903
    Likes Received:
    847
    Location:
    New York
    Sorry for the confusion. I'm looking for alternative methods to those mentioned above for scraping external URLs on a lightweight VPS.

    Scraping a site using Xenu doesn't seem to work for the reasons listed above: it pulls all the internal/external URLs, status codes, image file paths, etc., and that takes up a lot of unnecessary memory.

    Scraping via Scrapebox only seems to work if I already have the full list of URLs to extract from to begin with. Unless I'm missing something, the only ways Scrapebox can generate a list of URLs are the sitemap scraper and SERP scraping, each of which has its own limitations.

    Is there a tool out there in which I can input a domain, and it crawls the domain extracting only the external URLs?
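If no off-the-shelf tool fits, the crawl-and-filter step itself is simple enough to script. Here's a rough sketch using only the Python standard library (the start URL, depth limit, and function names are placeholders, not from any tool mentioned in the thread); the point is that it only queues internal URLs and only *stores* external ones, so memory stays proportional to the external link set:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html, base_url):
    """Return absolute URLs for every anchor on the page."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]

def crawl_externals(start_url, max_pages=100):
    """Breadth-first crawl of one site, keeping only external URLs."""
    site = urlparse(start_url).netloc
    queue, seen, externals = [start_url], {start_url}, set()
    while queue and len(seen) <= max_pages:
        url = queue.pop(0)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip pages that fail to load
        for link in extract_links(html, url):
            if urlparse(link).netloc == site:
                if link not in seen:       # internal: crawl it,
                    seen.add(link)         # but store nothing about it
                    queue.append(link)
            else:
                externals.add(link)        # external: this is all we keep
    return externals
```

This is a sketch, not a polished crawler; a real run would want politeness delays, robots.txt handling, and proxies for larger sites.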
     
  6. SuperLinks

    SuperLinks Elite Member

    Joined:
    Jul 14, 2008
    Messages:
    2,903
    Likes Received:
    847
    Location:
    New York
    Thanks! I've actually got my Xenu settings exactly like that already. I was hoping it wasn't a two-step process, since while crawling the site to scrape internal URLs I'm already retrieving the external URLs. So I'd run into the same problem I mentioned above.
     
  7. FBGuru

    FBGuru Senior Member

    Joined:
    Sep 22, 2013
    Messages:
    928
    Likes Received:
    1,172
    Location:
    Personality Type : ESTP
    The only other alternative is Screaming Frog, but it will lag even more than Xenu.
     
  8. rogerke

    rogerke Regular Member

    Joined:
    Oct 5, 2014
    Messages:
    264
    Likes Received:
    145
    No problem, I think I get it now. So instead of only extracting the external links from a list of URLs, you want to crawl the entire domain for external links?

    If so, what FB "Guru" said might be an option, but depending on the size of the domain that might cause memory issues as well.
     
  9. loopline

    loopline Jr. VIP Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,876
    Likes Received:
    2,059
    Gender:
    Male
    Home Page:
    Well, if you have the Automator in Scrapebox you could make it a one-step process.

    You could scrape the indexed URLs, load them into the link extractor, and extract internal links; export what's found, load it back in, and extract internal links again. Repeat this however many times you feel it takes. Then merge everything together, remove duplicates, load the result back into the link extractor, and extract external links. One Automator job file could do everything: you load your seed URLs and come back to the external links when it's done.

    Alternatively, if the starting domains have sitemaps, you could use the Sitemap add-on to load them in and extract the sitemap. You could then merge those with the scraped indexed domains, run them through the internal link extractor a couple of times, and then run the external link extractor.
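The expand-dedupe-then-extract sequence described above can be expressed as a small loop. A sketch of the same logic (`fetch_links` stands in for whatever link extractor you actually use, and all names here are placeholders rather than Scrapebox APIs):

```python
from urllib.parse import urlparse

def expand_then_extract(seed_urls, fetch_links, passes=3):
    """Repeatedly extract internal links from the current list, merge and
    dedupe, then pull external links from the final merged set."""
    site = urlparse(seed_urls[0]).netloc
    pages = set(seed_urls)
    for _ in range(passes):            # "do this however many times it takes"
        found = set()
        for url in pages:
            found.update(fetch_links(url))
        # keep only same-site links; merging into a set removes duplicates
        pages |= {u for u in found if urlparse(u).netloc == site}
    externals = set()
    for url in pages:
        externals.update(u for u in fetch_links(url)
                         if urlparse(u).netloc != site)
    return externals
```

With a real `fetch_links` that downloads and parses each page, the number of passes controls how deep into the site the internal expansion reaches.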
     
  10. amithbhawani

    amithbhawani BANNED BANNED

    Joined:
    Jan 15, 2015
    Messages:
    104
    Likes Received:
    5
    There is a tool called Web Data Extractor; I am using it myself.
     
  11. SuperLinks

    SuperLinks Elite Member

    Joined:
    Jul 14, 2008
    Messages:
    2,903
    Likes Received:
    847
    Location:
    New York
    Thanks for the advice; I was hoping for a tool that had this as an option. I'll give Screaming Frog a try at some point soon since a friend has a license.
     
  12. gateshead

    gateshead Newbie

    Joined:
    Feb 4, 2015
    Messages:
    9
    Likes Received:
    1
    If you can get a sitemap that you can upload to Scrapebox, things become easier.
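For sites that publish one, the sitemap hands you the full URL list in a single request, skipping the crawl entirely. A minimal sketch of parsing a standard sitemap.xml with the Python standard library (the sitemap URL is a placeholder):

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# namespace used by the sitemaps.org protocol
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Return the <loc> entries from a standard sitemap.xml document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

def fetch_sitemap(url="http://example.com/sitemap.xml"):
    """Download and parse a sitemap (URL is a placeholder)."""
    return sitemap_urls(urlopen(url, timeout=10).read())
```

The resulting list can then feed whichever external-link extractor you prefer. Note that sitemap index files nest further sitemaps, so large sites may need one more level of fetching.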
     
  13. Zombie Pop

    Zombie Pop Regular Member

    Joined:
    Dec 18, 2013
    Messages:
    360
    Likes Received:
    123
    I'm with loopline. Scrapebox's link extractor is what I would use, since I own Scrapebox. Here's my way of doing this: I scrape Google for all indexed pages of the website in question (site:sitetobescraped.com), then run those pages through the link extractor with it set to external links only.

    If this is a giant site, you're going to need a lot of public proxies.

    Edit: you can also use Scrapebox's sitemap extractor, and scrape search engines other than Google for possibly more results on the same domain.
     
    Last edited: Feb 4, 2015