Download entire sites from Wayback Machine

Discussion in 'Black Hat SEO' started by Knocks, Apr 1, 2015.

  1. Knocks

    Knocks Newbie

    Joined:
    May 14, 2012
    Messages:
    16
    Likes Received:
    2
    I have found two services that will download an archived website for you from Wayback Machine for a $15 fee. That's a decent price if you only have to do this once, but if you have to do it on a regular basis, it would be a better deal to figure out how to do that yourself.

    So far I have tried HTTrack, with varied success. It works OK for small sites, although you have to remove some garbage code from the downloaded files manually. The problem with HTTrack is that for large sites, Wayback Machine doesn't always have complete snapshots for every single date. It crawls different pages on different dates, and when you download these files with HTTrack, you end up getting multiple partial snapshots of the same site, with lots of dupes or sometimes files with duplicate names or slightly different content, and it's virtually impossible to put it back together. An added drawback is that HTTrack doesn't strip Wayback Machine code (like the javascript) or metadata, which also creates extra work.
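The link rewriting, at least, can be undone mechanically. Archived pages point at URLs of the form `/web/<timestamp>/<original URL>`, sometimes with a two-letter modifier such as `im_` after the timestamp. Below is a minimal Python sketch that strips that prefix from downloaded HTML. The function name and the regular expression are my own illustration, not part of HTTrack or any official tool, and the sketch does not remove the injected toolbar script.

```python
import re

# Matches archived-copy prefixes such as
#   https://web.archive.org/web/20150401000000/http://example.com/
#   /web/20150401000000im_/http://example.com/logo.png
# The lookahead keeps the original URL that follows the prefix.
WAYBACK_PREFIX = re.compile(
    r"(?:https?:)?(?://web\.archive\.org)?/web/\d{1,14}(?:[a-z]{2}_)?/(?=https?://)"
)


def strip_wayback_prefix(html: str) -> str:
    """Return html with Wayback Machine URL prefixes removed (hypothetical helper)."""
    return WAYBACK_PREFIX.sub("", html)
```

Run it over each saved file before comparing snapshots, so that pages which differ only in their archive timestamps end up identical.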

    If you want a more technical explanation of why HTTrack is not fit for this job, you can check a discussion on SuperUser called "Trouble using wget or httrack to mirror archived website" (can't post links, sorry, but you can Google it).

    So has anyone figured out a working way to download a complete version of a large site from Wayback Machine? I figured if anyone has figured it out, it would be here at BHW.
     
  2. SEO Power

    SEO Power Elite Member

    Joined:
    Jul 14, 2014
    Messages:
    2,642
    Likes Received:
    683
    Occupation:
    Self employed
    Location:
    Houston, TX
    Have you read "Bring The Fresh" by Kelly Felix? In his course, he mentioned a service that copies sites over from archive.org to new domains. I think it's waybackdownloader.com.
     
  3. Knocks

    Knocks Newbie

    Joined:
    May 14, 2012
    Messages:
    16
    Likes Received:
    2
    I don't think you read even one sentence of my post...
     
  4. Hawkster

    Hawkster Jr. VIP Jr. VIP

    Joined:
    Jun 22, 2013
    Messages:
    3,504
    Likes Received:
    3,721
    Gender:
    Male
    Occupation:
    Listen to everyone - Follow no-one
    Location:
    UK
    I thought the Wayback Machine only archived a small number of pages. I've never seen an archive that included an entire site's content.
     
  5. cthulchu

    cthulchu Newbie

    Joined:
    Aug 19, 2013
    Messages:
    16
    Likes Received:
    5
    Occupation:
    SEO specialist
    Location:
    127.0.0.1
    that's right.

    The question is legit. I have the same issues. I copy big websites manually (I hire freelancers from India and the Philippines). Basically, they run HTTrack multiple times, going from the most recent snapshots to the older ones, adding files without replacing anything already downloaded. However, I think if you can get a nice snapshot, you should use the $15 services. The only reason I use freelancers is that I need the sites to be on certain engines (WP mostly) so I can manage and maintain them with my scripts.
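The "newest first, add without replacement" merge can be scripted once each snapshot is in its own folder. This is a minimal Python sketch of that idea, assuming the folders are passed newest first. The function name and directory layout are hypothetical, not anything HTTrack produces by default.

```python
import shutil
from pathlib import Path


def merge_snapshots(snapshot_dirs, target):
    """Copy files from snapshot_dirs (ordered newest first) into target.

    A file is copied only if no newer snapshot already supplied the same
    relative path, so existing files are never replaced.
    Returns the number of files copied.
    """
    target = Path(target)
    copied = 0
    for snap in snapshot_dirs:
        snap = Path(snap)
        for src in sorted(snap.rglob("*")):
            if not src.is_file():
                continue
            dest = target / src.relative_to(snap)
            if dest.exists():
                continue  # a newer snapshot already provided this path
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)
            copied += 1
    return copied
```

For example, `merge_snapshots(["snap_2015", "snap_2014", "snap_2013"], "merged")` would fill `merged` with the 2015 files first and use older folders only for paths that are still missing.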
     
    • Thanks Thanks x 2
  6. mofoparrot

    mofoparrot Junior Member

    Joined:
    Jan 2, 2011
    Messages:
    133
    Likes Received:
    24
  7. mandylim

    mandylim Newbie

    Joined:
    Jul 17, 2010
    Messages:
    15
    Likes Received:
    1
    try flipping it after that to get more than that $15?
     
  8. Weblycos

    Weblycos Senior Member

    Joined:
    Feb 12, 2013
    Messages:
    967
    Likes Received:
    195
    Occupation:
    full time IM
    http://archivescraper.net - How does it actually work? I think if something like this works, it would really help you make $$$.
     
  9. CoolAmp

    CoolAmp Registered Member

    Joined:
    Apr 22, 2009
    Messages:
    85
    Likes Received:
    23
    I don't know about scraping the Wayback Machine, but if the site you want to scrape is still up and running, you can use Ubuntu and wget to download it easily.
     
  10. mofoparrot

    mofoparrot Junior Member

    Joined:
    Jan 2, 2011
    Messages:
    133
    Likes Received:
    24


    You put in the URL of the site you want scraped. It shows you a calendar, you select the day you want to scrape, and it shows you an iframe of the site so you can make sure it's the right one. Then it goes into the queue and shows up in the history when it's done scraping.
     
  11. mchabchoub

    mchabchoub Newbie

    Joined:
    Jan 9, 2016
    Messages:
    15
    Likes Received:
    2
    I used this service to download an entire website from the Wayback Machine (archiveorg): waybackmachinedownloadercom
    It has a test option if you only need to download the front page. I don't know why, but the email with the download link ended up in my spam folder (I emailed them to fix it).
     
  12. sm754

    sm754 Registered Member

    Joined:
    Mar 21, 2012
    Messages:
    93
    Likes Received:
    38
    Occupation:
    Farmer
    Location:
    Azerbaijan
    Wayback is a godsend. Unfortunately, domain squatters like to do this really annoying thing, where they buy an old domain and stick a catch-all robots.txt on it, which prompts Wayback to hide all the archived content. So a lot of good content tends to be lost to the tides this way (sniff)
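To see what a catch-all robots.txt does, here is a small Python sketch using the standard library's `urllib.robotparser`. It is only an illustration: the `ia_archiver` user agent is my assumption about the name the archive's crawler has gone by, and `is_blocked` is a hypothetical helper, not how Wayback itself makes the decision.

```python
from urllib.robotparser import RobotFileParser

# A catch-all robots.txt of the kind described above: every crawler,
# every path.
CATCH_ALL = """\
User-agent: *
Disallow: /
"""


def is_blocked(robots_txt: str, url: str, agent: str = "ia_archiver") -> bool:
    """Return True if robots_txt forbids agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(is_blocked(CATCH_ALL, "http://example.com/old-article.html"))
```

Two lines in the file are enough to disallow every path on the domain for every crawler that honors robots.txt.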
     
  13. Hubertissick

    Hubertissick Registered Member

    Joined:
    Dec 29, 2015
    Messages:
    59
    Likes Received:
    3
    Location:
    sicklishy
    Hi buddy!
    I'm also interested in these services.
    Could you describe in detail how to get them?
     
  14. Sristy

    Sristy Jr. VIP Jr. VIP Premium Member

    Joined:
    Aug 17, 2010
    Messages:
    1,824
    Likes Received:
    489
    Gender:
    Female
    Location:
    In My Blog Network
    • Thanks Thanks x 1
  15. mchabchoub

    mchabchoub Newbie

    Joined:
    Jan 9, 2016
    Messages:
    15
    Likes Received:
    2
    For waybackmachinedownloadercom, you enter an archive URL and your email. It scrapes the HTML, JS, image and CSS files (including "background-image" images and "import url" CSS files), fixes external links to be SEO friendly, then follows internal links and does the same thing down to a depth of 10. Unfortunately, the email with the download link landed in my spam folder (I emailed them about this).
    But the HTML website I got was perfect.
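The "background-image" and "import url" part is the bit plain HTML scrapers tend to miss, because those files are referenced from inside the CSS rather than from the page. Here is a rough Python sketch of how such references could be pulled out of a stylesheet. It is my own guess at the general approach, not the service's actual code, and the regular expressions only cover the common `url(...)` and `@import "..."` forms.

```python
import re

# url(...) covers background-image and "@import url(...)" alike.
CSS_URL = re.compile(r"""url\(\s*['"]?([^'")]+?)['"]?\s*\)""", re.IGNORECASE)
# Bare string form: @import "file.css";
CSS_IMPORT = re.compile(r"""@import\s+['"]([^'"]+)['"]""", re.IGNORECASE)


def css_references(css: str) -> list:
    """Return files referenced by a stylesheet, in order, without duplicates."""
    refs = CSS_URL.findall(css) + CSS_IMPORT.findall(css)
    return list(dict.fromkeys(refs))  # dict keys keep first-seen order
```

Each returned path still has to be resolved against the stylesheet's own URL before it can be downloaded.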
     
  16. cnick79

    cnick79 Jr. VIP Jr. VIP

    Joined:
    Jun 10, 2010
    Messages:
    689
    Likes Received:
    369
    Location:
    Google's SandBox
    If you need a cheap downloading service, have a look at the BST in my signature or visit websiterestore.com. You enter a domain and a date on which the domain was last archived, and we restore every archived page linked to that domain at that date, including images, CSS, docs, videos, etc. We also rebuild URLs so the links work. We don't restore pages that resulted in a 302 on the Wayback Machine. Our restore process is unique, and we aren't limited by how "deep" the links go.
     
  17. mofoparrot

    mofoparrot Junior Member

    Joined:
    Jan 2, 2011
    Messages:
    133
    Likes Received:
    24
    Seems that they updated their search function, so now it takes a lot less time to find older sites.
     
  18. jasperq

    jasperq Newbie

    Joined:
    May 26, 2014
    Messages:
    7
    Likes Received:
    2
    I'm getting decent results with wayback-machine-downloader. If you have a VPS (or a Linux box), install Ruby 2.3 and then run:

    gem install wayback_machine_downloader

    It accepts date ranges and a limit on concurrent requests. It dumps the website to disk and strips out the Wayback navigation guff. It grabs pages, images, styles, JavaScript, and other files. The code's on GitHub: hartator/wayback-machine-downloader
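If you want to drive it from a script, a small Python wrapper can build the command line. This is a sketch under assumptions: the `--from`, `--to`, `--concurrency` and `--directory` flags are what I remember from the project's README, so check `wayback_machine_downloader --help` on your installed version first. The `build_command` helper is hypothetical.

```python
import shlex


def build_command(url, from_ts=None, to_ts=None, concurrency=None, directory=None):
    """Build an argv list for wayback_machine_downloader (flags assumed, see above)."""
    cmd = ["wayback_machine_downloader", url]
    if from_ts:
        cmd += ["--from", str(from_ts)]  # earliest snapshot timestamp
    if to_ts:
        cmd += ["--to", str(to_ts)]  # latest snapshot timestamp
    if concurrency:
        cmd += ["--concurrency", str(concurrency)]  # parallel downloads
    if directory:
        cmd += ["--directory", str(directory)]  # output folder
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_command("http://example.com", from_ts="20150101",
                                   to_ts="20150401", concurrency=5)))
```

To actually run it, pass the list to `subprocess.run(cmd, check=True)`.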
     
  19. clerity

    clerity Registered Member

    Joined:
    Feb 21, 2016
    Messages:
    55
    Likes Received:
    14
    not working for me...
     
  20. mishrajee

    mishrajee Newbie

    Joined:
    Sep 3, 2017
    Messages:
    4
    Likes Received:
    0
    Gender:
    Female
    Some useful tips in the thread, been wanting to do this, just wasn't sure of which tool would be best. Will have to try them out.