archive.org scraper needed - been searching everywhere for days

Discussion in 'Black Hat SEO' started by mg3hockey, May 13, 2013.

  1. mg3hockey

    mg3hockey Newbie

    Joined:
    May 21, 2012
    Messages:
    18
    Likes Received:
    3
    Hello all,

    I recently picked up about 15 domains with legit PR3-PR7 to link to my money sites, but putting the sites back together with their original content from archive.org is a complete bitch.

    I have been searching for days for a script that would do this for me but have found 0 results. I have tried 4-5 different Google Code projects, from httrack.com to webscraping.com to Warrick, and nothing seems to work.

    All I need is to download every page the Wayback Machine has archived in its most recent cache of the site, in a form I can upload to a server.

    Obviously I also need the archive.org header and footer links, and any mention of archive.org, removed. If anyone can point me in the right direction, that would be awesome.

    Michael
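
A minimal sketch of the cleanup step described above, assuming the archived page is already saved locally as HTML. It relies on two things the Wayback Machine is commonly observed to do: wrap its injected banner in `BEGIN/END WAYBACK TOOLBAR INSERT` comments, and prefix rewritten links with `web.archive.org/web/<timestamp>/`. The function name `strip_wayback` is hypothetical, and other injected scripts or changes in the archive's markup are not handled here.

```python
import re

# The injected banner sits between these two HTML comments (assumption
# based on how archived pages are commonly served; may change over time).
TOOLBAR_RE = re.compile(
    r"<!-- BEGIN WAYBACK TOOLBAR INSERT -->.*?<!-- END WAYBACK TOOLBAR INSERT -->",
    re.DOTALL,
)

# Absolute rewritten links: //web.archive.org/web/<timestamp>[modifier]/<original URL>
ABSOLUTE_PREFIX_RE = re.compile(
    r"(?:https?:)?//web\.archive\.org/web/\d{1,14}(?:[a-z]{2}_)?/",
    re.IGNORECASE,
)

# Relative rewritten links inside quotes or url(...): /web/<timestamp>[modifier]/http...
RELATIVE_PREFIX_RE = re.compile(
    r"(?<=[\"'(])/web/\d{1,14}(?:[a-z]{2}_)?/(?=https?://)",
    re.IGNORECASE,
)


def strip_wayback(html: str) -> str:
    """Remove the Wayback banner and restore the original link targets."""
    html = TOOLBAR_RE.sub("", html)
    html = ABSOLUTE_PREFIX_RE.sub("", html)
    html = RELATIVE_PREFIX_RE.sub("", html)
    return html


sample = (
    "<body><!-- BEGIN WAYBACK TOOLBAR INSERT --><div id=\"wm-ipp\">bar</div>"
    "<!-- END WAYBACK TOOLBAR INSERT -->"
    "<a href=\"https://web.archive.org/web/20130513000000/http://example.com/about\">About</a>"
    "<img src=\"/web/20130513000000im_/http://example.com/logo.png\"></body>"
)
cleaned = strip_wayback(sample)
```

Regular expressions are enough for this narrow job because only fixed markers and URL prefixes are removed; anything that needs real DOM awareness would be better served by an HTML parser.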
     
  2. mikeydell

    mikeydell Senior Member

    Joined:
    Dec 16, 2012
    Messages:
    870
    Likes Received:
    499
    I'm not sure how advanced you are, but you can (or at least used to be able to) download the source code archive.org uses for its engine, and at one time it had an API for pulling data straight from archive.org. That was a few years back, so I'm not sure about today, but I had a copy set up on a server a few years ago; it was pretty straightforward and a great way to use their data.
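
For anyone exploring the API route, here is a rough sketch assuming the Wayback Machine's CDX server endpoint and its `url`, `matchType`, `output`, `filter` and `fl` query parameters. The helper names (`build_cdx_url`, `latest_captures`, `raw_snapshot_url`) are hypothetical, and the actual HTTP fetch is left out so the logic can be checked offline. It builds the listing query, picks the most recent capture for each page, and forms a snapshot URL with the `id_` modifier, which is understood to return the page as originally archived, without the injected banner.

```python
from urllib.parse import urlencode

# Assumed CDX listing endpoint of the Wayback Machine.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain: str) -> str:
    """Build a query listing successful captures for a whole domain."""
    params = {
        "url": domain,
        "matchType": "domain",       # include every page under the domain
        "output": "json",            # rows as JSON arrays, header row first
        "filter": "statuscode:200",  # skip redirects and error captures
        "fl": "timestamp,original",  # only the fields used below
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def latest_captures(rows):
    """Map each original URL to its newest capture timestamp.

    `rows` is the parsed JSON response; the first row names the columns.
    Timestamps are 14-digit strings, so string comparison orders them.
    """
    if not rows:
        return {}
    header, *records = rows
    ts_i = header.index("timestamp")
    url_i = header.index("original")
    latest = {}
    for record in records:
        ts, url = record[ts_i], record[url_i]
        if url not in latest or ts > latest[url]:
            latest[url] = ts
    return latest


def raw_snapshot_url(timestamp: str, original: str) -> str:
    """URL of the unmodified archived copy (note the `id_` modifier)."""
    return f"http://web.archive.org/web/{timestamp}id_/{original}"
```

In use, the JSON returned for `build_cdx_url("example.com")` would be parsed and passed to `latest_captures`, and each resulting pair fed to `raw_snapshot_url` to fetch the page, ideally with a pause between requests.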
     
  3. mak3r

    mak3r Senior Member

    Joined:
    Aug 4, 2011
    Messages:
    891
    Likes Received:
    330
    Do you speak Russian? I came across a Russian tool the other day that I suppose can do that.
     
  4. mg3hockey

    mg3hockey Newbie

    Joined:
    May 21, 2012
    Messages:
    18
    Likes Received:
    3
    Hmm, didn't think about the API. Will have to look into this. Any other replies welcome!
     
  5. Montgomery76

    Montgomery76 Registered Member

    Joined:
    Apr 17, 2013
    Messages:
    52
    Likes Received:
    13
    Use Web Archive Downloader (google it and download it from CNET). It is far from perfect, but it browses the web archive and downloads all pages for the years you select.
     
  6. mg3hockey

    mg3hockey Newbie

    Joined:
    May 21, 2012
    Messages:
    18
    Likes Received:
    3
    ^ That is a great tool for downloading sites that are currently live, but I need a tool to download sites that are down/expired, since I am purchasing expired domains.
     
  7. Montgomery76

    Montgomery76 Registered Member

    Joined:
    Apr 17, 2013
    Messages:
    52
    Likes Received:
    13
    Hm, no, it crawls the Wayback Machine at the Internet Archive and downloads past content of the site, not live sites.
     
  8. mg3hockey

    mg3hockey Newbie

    Joined:
    May 21, 2012
    Messages:
    18
    Likes Received:
    3
    Hmm, that's strange, because I downloaded it, selected years 2004-2010, tried to download things, and got "could not connect to remote server" errors.

    So either archive.org is not accepting their API requests or something is not working in between.
     
  9. Montgomery76

    Montgomery76 Registered Member

    Joined:
    Apr 17, 2013
    Messages:
    52
    Likes Received:
    13
    Actually, it works fine on one of my computers but not so well (same as yours) on this one, so I would give it another try. Sorry, but that's the only software I have found, and I have used it successfully for 10+ domains.