HELP! How to scrape external URLs from a website?

SuperLinks
I'm looking for the best tools or methods for scraping the external URLs on a website. I've tried using Xenu Link Sleuth, but I'm running into problems crawling sites on my VPS: I run out of memory because Xenu also documents all internal URLs and information about them, making it a memory hog rather than an effective tool for this.

I've also tried Scrapebox by scraping the indexed URLs and then using the external URL tool; however, this method only works for indexed domains and indexed URLs, both of which can be a problem for this task.

Is anyone aware of a tool or method to scrape the external URLs from a website efficiently?
 
I'm not sure I understand your question. First you put a seed list into Xenu, but got memory issues because it also extracts all status codes and internal links. Then you used Scrapebox, but can only use it for indexed URLs? The way I understand it, you can only use indexed URLs because you scraped Google to find those seed URLs, but what's the difference with the Xenu seed list then?
Either you scraped Google to find a seed list (same problem with indexed domains as Scrapebox) or you used external links from a previous Xenu run (still the same, because you can use/extract that seed list in SB as well).
 
Have you tried Screaming Frog? I'm not 100% sure, but there might be a way to configure it so that it crawls external links only.
 
Crawl only the internal links with Xenu and run those links through Scrapebox's External Link Extractor add-on.

Go to Options and untick all the checkboxes under the "Report" field, which should greatly reduce the memory usage.

[screenshots]
 
Sorry for the confusion. I'm looking for alternative methods, beyond those mentioned above, for scraping external URLs on a lightweight VPS.

Scraping a site with Xenu doesn't seem to work for the reasons listed above: it pulls all internal/external URLs, status codes, image file paths, etc., and that takes up a lot of unnecessary memory.

Scraping via Scrapebox only seems to work if I already have the full list of URLs to extract from to begin with. Unless I'm missing something, the only ways Scrapebox can generate a list of URLs are via the sitemap scraper or via SERP scraping, both of which have limitations of their own.

Is there a tool out there where I can input a domain and it crawls the domain, extracting only the external URLs?
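
In other words, what I'm after is roughly this bit of Python logic (just an illustrative sketch of what I mean, assuming the requests and beautifulsoup4 packages; the start URL and page cap are placeholders):

Code:
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl_external(start_url, max_pages=500):
    """Crawl one domain, keeping only the external URLs in memory."""
    domain = urlparse(start_url).netloc
    to_visit = [start_url]
    seen = {start_url}   # internal URLs already queued
    external = set()     # external URLs found so far
    fetched = 0
    while to_visit and fetched < max_pages:
        url = to_visit.pop()
        fetched += 1
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            netloc = urlparse(link).netloc
            if netloc == domain:
                if link not in seen:   # stay on-site, follow later
                    seen.add(link)
                    to_visit.append(link)
            elif netloc:               # off-site link; skips mailto:/javascript:
                external.add(link)
    return external

for link in sorted(crawl_external("https://example.com/")):
    print(link)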
 
Thanks! I've actually got my Xenu settings exactly like that already. I was hoping to avoid a two-step process, since while crawling the site for internal URLs I'm already retrieving the external URLs; I'd run into the same problem I mentioned above.
 
The only other alternative is Screaming Frog, but it will lag even more than Xenu.
 
No problem, I think I've got it now. So rather than only extracting the external links from a list of URLs, you want to crawl an entire domain for external links?

If so, what FB "Guru" said might be an option, but depending on the size of the domain that might cause memory issues as well.
 
Well, if you have the Automator in Scrapebox you could make it a one-step process.

You could scrape the indexed URLs, load them into the Link Extractor, and extract internal links; export the found URLs, load them back in, and extract internal links again, repeating however many times you feel it takes. Then merge everything together, remove duplicates, load the list back into the Link Extractor, and extract external links. One Automator job file could do all of it: you load your seed URLs and come back to the external links when it's done.

Alternatively, you could use the Sitemap add-on: if the starting domains have sitemaps, you can load them in and extract their URLs. You could then merge those with the scraped indexed URLs, run them through the internal link extractor a couple of times, and then run the external link extractor.
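
If you'd rather script that loop than run the Automator, the same idea looks roughly like this in Python (just a sketch, assuming the requests and beautifulsoup4 packages; the seed list and round count are placeholders you'd tune):

Code:
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def links_on(url):
    """Return every absolute link found on one page."""
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        return set()
    soup = BeautifulSoup(resp.text, "html.parser")
    return {urljoin(url, a["href"]) for a in soup.find_all("a", href=True)}

def is_internal(link, url):
    return urlparse(link).netloc == urlparse(url).netloc

def automator_style(seeds, rounds=3):
    pages = set(seeds)   # merged, deduplicated list so far
    new = set(seeds)     # the "export found, load back" batch
    for _ in range(rounds):                 # internal-extract rounds
        found = set()
        for url in new:
            found |= {l for l in links_on(url) if is_internal(l, url)}
        new = found - pages                 # remove duplicates
        pages |= new
    external = set()
    for url in pages:                       # final external-extract pass
        external |= {l for l in links_on(url)
                     if urlparse(l).netloc and not is_internal(l, url)}
    return external

print(automator_style(["https://example.com/"]))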
 
Thanks for the advice; I was hoping for a tool that had this as a built-in option. I'll give Screaming Frog a try at some point soon, since a friend has a license.
 
If you can get a sitemap that you can upload to Scrapebox, things become easier.
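
Most sites keep one at the conventional /sitemap.xml path, and pulling the page URLs out of it is only a few lines of Python (a rough sketch assuming the requests package; the domain is a placeholder, and a sitemap-index file would need one more fetch per child sitemap):

Code:
import xml.etree.ElementTree as ET

import requests

def sitemap_urls(domain):
    """Fetch /sitemap.xml and return the URLs listed in its <loc> tags."""
    resp = requests.get(f"https://{domain}/sitemap.xml", timeout=10)
    root = ET.fromstring(resp.content)
    # Sitemap files carry an XML namespace, so match <loc> by tag suffix.
    return [el.text.strip() for el in root.iter()
            if el.tag.endswith("loc") and el.text]

print(sitemap_urls("example.com"))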
 
I'm with loopline. The Scrapebox Link Extractor is what I would use, since I own Scrapebox. Here's my way of doing this: I scrape Google for all indexed pages of the website in question (site:sitetobescraped.com), then run those pages through the Link Extractor with it set to external links only.

If this is a giant site, you're going to need a lot of public proxies.

Edit: you can also use Scrapebox's sitemap extractor, and scrape search engines other than Google, which may turn up more results on the same domain.
 