If you're serious about your SEO analysis, there's a project called Common Crawl that offers a 100TB web archive of around 6 billion web pages, downloadable from Amazon S3. It used to be a pain to work with, but they've added a tool at http://urlsearch.commoncrawl.org/ that lets you list all the indexed URLs for a site, view the page source, and download a JSON file of the URLs and their pointers into the archive. From there it's fairly easy to do analysis that would otherwise require running your own spider. Common Crawl publishes other interesting data too: the metadata files include HTTP response codes for each URL, which makes it straightforward to research what's redirecting where, provided you have the tools (think Hadoop) and the skill to process that much data; and the text-only sets approximate what a search engine would extract as the content of a page. If you're already using Common Crawl data or providing a service based on it, please consider supporting their efforts. Karma and all that.
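To give a feel for the workflow, here's a minimal sketch of going from the URL list to actual page bytes. It assumes the JSON download is one record per line, with fields named url, filename, offset, and length pointing into the archive on S3, and it uses a hypothetical /download endpoint path; check the file you actually get from urlsearch, since the real field names and endpoint may differ.

```python
# A minimal sketch, not a drop-in script: the endpoint path and the JSON field
# names (url, filename, offset, length) are assumptions about the response
# format -- inspect the JSON you actually download and adjust accordingly.
import gzip
import json
import requests

def lookup_urls(domain):
    """Query the URL search tool for everything indexed under a domain (assumed API shape)."""
    resp = requests.get(
        "http://urlsearch.commoncrawl.org/download",   # hypothetical endpoint path
        params={"q": domain},
        timeout=60,
    )
    resp.raise_for_status()
    # Assume one JSON object per line, each carrying a pointer into the archive.
    return [json.loads(line) for line in resp.text.splitlines() if line.strip()]

def fetch_page(record):
    """Pull one archived page from S3 using its byte-range pointer."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    resp = requests.get(
        "https://commoncrawl.s3.amazonaws.com/" + record["filename"],
        headers={"Range": f"bytes={start}-{end}"},     # only download this one record
        timeout=60,
    )
    resp.raise_for_status()
    return gzip.decompress(resp.content)               # archive records are stored as gzip members

if __name__ == "__main__":
    for rec in lookup_urls("example.com")[:5]:
        raw = fetch_page(rec)
        print(rec.get("url"), len(raw), "bytes")
```

The nice part of this pattern is the Range header: you never download a whole multi-gigabyte archive file, just the slice holding the page you care about, which is what makes ad-hoc analysis practical without your own crawler.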