archive.org scraper needed - been searching everywhere for days

mg3hockey

Hello all,

I recently picked up about 15 domains with legit PR3-PR7 to link to my money sites, but putting the sites back together with their original content from archive.org is a real pain.

I have been searching for days for a script that would do this for me but have found nothing. I have tried 4-5 different tools I googled, from httrack.com to webscraping.com to Warrick, and nothing seems to work.

All I need is to download all the pages the Wayback Machine has archived in its most recent capture of the site, so that I can upload them to a server.

Obviously I also need the header and footer links and any mention of archive.org removed. If anyone can point me in the right direction, that would be awesome.

Michael
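
For the cleanup step described above (removing the Wayback header/footer and the archive.org links), here is a minimal Python sketch. It assumes the saved pages contain the toolbar wrapped in "BEGIN/END WAYBACK TOOLBAR INSERT" HTML comments and links rewritten to the /web/<timestamp>/ form; check a few saved pages first, since the markup may differ. The name strip_wayback is just an illustration, not part of any existing tool.

```python
import re

# Assumption: the injected toolbar sits between these two HTML comments.
TOOLBAR_RE = re.compile(
    r"<!--\s*BEGIN WAYBACK TOOLBAR INSERT\s*-->.*?"
    r"<!--\s*END WAYBACK TOOLBAR INSERT\s*-->",
    re.DOTALL | re.IGNORECASE,
)

# Assumption: links were rewritten to forms such as
#   https://web.archive.org/web/20100315000000/http://example.com/page.html
#   /web/20100315000000im_/http://example.com/logo.png
# The lookahead keeps the original URL and drops only the archive prefix.
ARCHIVE_PREFIX_RE = re.compile(
    r"(?:https?://web\.archive\.org)?/web/\d{4,14}(?:[a-z]{2}_)?/(?=https?://)",
    re.IGNORECASE,
)


def strip_wayback(html: str) -> str:
    """Remove the toolbar block and turn archive links back into original URLs."""
    html = TOOLBAR_RE.sub("", html)
    return ARCHIVE_PREFIX_RE.sub("", html)
```

Run it over every saved .html file before uploading. It does not touch anything else archive.org may have added (extra scripts or trailing comments), so a quick search for "archive.org" in the output is still worth doing.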
 
I'm not sure how advanced you are, but you can (or at least used to be able to) download the source code archive.org uses for its engine, and at one time it had an API to pull data straight from archive.org. This was a few years back, so I'm not sure about today, but I had a copy of it set up on a server a few years ago and it was pretty straightforward and a great way to use their data.
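
To give an idea of what pulling data through an API could look like, here is a rough Python sketch against the Wayback Machine CDX search endpoint. This is not necessarily the API mentioned above; the endpoint, its parameters and the "id_" URL suffix (which, as far as I know, returns the capture as originally archived) are assumptions to verify against the current archive.org documentation. All function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public CDX search endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(domain: str) -> str:
    """Build a query listing successful captures of every URL under the domain."""
    params = {
        "url": f"{domain}/*",
        "output": "json",
        "fl": "timestamp,original",
        "filter": "statuscode:200",
    }
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def latest_captures(rows):
    """Map each original URL to its newest timestamp.

    rows is the decoded JSON: a header row followed by [timestamp, original].
    Timestamps are fixed-width digit strings, so string comparison is enough.
    """
    latest = {}
    for timestamp, original in rows[1:]:
        if original not in latest or timestamp > latest[original]:
            latest[original] = timestamp
    return latest


def raw_capture_url(timestamp: str, original: str) -> str:
    """URL of one capture; "id_" is assumed to request the unmodified page."""
    return f"https://web.archive.org/web/{timestamp}id_/{original}"


def fetch_latest(domain: str):
    """Yield (original_url, page_bytes) for the newest capture of each URL."""
    with urllib.request.urlopen(cdx_query_url(domain), timeout=30) as resp:
        rows = json.load(resp)
    for original, timestamp in latest_captures(rows).items():
        with urllib.request.urlopen(raw_capture_url(timestamp, original), timeout=30) as page:
            yield original, page.read()
```

Usage would be something like looping over fetch_latest("example.com") and writing each result to disk. For a real run, add a delay between requests and some error handling, since a big site can have thousands of captures.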
 
Do you speak Russian? I came across a Russian tool the other day that I think can do that.
 
Hmm, didn't think about the API, will have to look into this. Any other replies welcome!!
 
Use Web Archive Downloader (google it and download it from CNET). It is far from perfect, but it browses the web archive and downloads all pages for the years you select.
 
^ That is a great tool for downloading sites that are currently live, but I need a tool to download sites that are down/expired, since I am purchasing expired domains.
 
Hm, no, it crawls the Wayback Machine at the Internet Archive and downloads past content of the site, not live ones.
 
Hmm, that's strange, because I downloaded it, selected years 2004-2010, tried to download things and got "could not connect to remote server" errors.

So either archive.org is not accepting their API requests or something is not working in between.
 
Actually it works fine on one of my computers but not so well (same as yours) on this one, so I would give it another try. Sorry, but that's the only software I have found, and I have used it successfully for 10+ domains.
 