Easiest way to scrape site from archive.org?

Discussion in 'Black Hat SEO' started by jon_xx_x, May 31, 2011.

  1. jon_xx_x

    jon_xx_x Jr. VIP

    Joined:
    Nov 15, 2008
    Messages:
    3,116
    Likes Received:
    1,460
    Is there an easy way to take a complete site from archive.org?
    I want to grab all the pages and be able to upload them as easily and quickly as possible. Are there any tools, or does each page have to be done one by one?
     
  2. jon_xx_x

    jon_xx_x Jr. VIP

    Joined:
    Nov 15, 2008
    Messages:
    3,116
    Likes Received:
    1,460
    Bump, someone must know!
     
  3. wkirk

    wkirk Junior Member

    Joined:
    Apr 3, 2011
    Messages:
    139
    Likes Received:
    67
    This method should work:

    Code:
    wget --mirror URL
    It will create a local mirror from the URL specified.
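
    For an archive.org copy specifically, the bare command usually needs a few extra flags. A rough, untested sketch (the example.com URL and the timestamp are just placeholders for whatever snapshot you are grabbing):

    Code:
    wget --mirror --page-requisites --adjust-extension --convert-links --wait=1 "http://web.archive.org/web/20110101000000/http://example.com/"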

    Yet you still need to check the files and replace/delete all links pointing to the original source.

    I guess this would be the easiest method.
    Haven't checked any of those "site rip" tools.
     
  4. jon_xx_x

    jon_xx_x Jr. VIP

    Joined:
    Nov 15, 2008
    Messages:
    3,116
    Likes Received:
    1,460
    Sorry, where do I use that? And can I use it with an archive.org URL for a specific site?
     
  5. wkirk

    wkirk Junior Member

    Joined:
    Apr 3, 2011
    Messages:
    139
    Likes Received:
    67
    wget is a standard *nix application.

    Just type the above into the command line.

    or type "wget" into google and grab a version for windows.
     
  6. wkirk

    wkirk Junior Member

    Joined:
    Apr 3, 2011
    Messages:
    139
    Likes Received:
    67
    Yes you can, it's a very flexible tool.

    Just google for "wget --mirror" and check some tutorials.

    PS: if it's all Greek to you, there's a program called 'httrack':
    Code:
    http://www.httrack.com/
    It should also fit the purpose.
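
    Basic usage from the command line is something like this (a rough sketch; the URL and the ./mirror output folder are placeholders):

    Code:
    httrack "http://www.example.com/" -O ./mirror -v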
     
  7. hpv222

    hpv222 Power Member

    Joined:
    Feb 8, 2010
    Messages:
    736
    Likes Received:
    274
    I know better than everybody else - you are on the right track, but you should be adopting a proactive approach instead of trying to scrape archive.org.

    Scrape thousands of live sites and download the content to your HD; six months down the road half of them will be dead, and in another six months 70% of them will be gone, leaving you with tons of unique content that you can use for your own sites. The directories are a very good place to start site hunting, especially if you are looking for niche sites.

    BTW, httrack is the tool for the job when scraping. It can be tricky, but with some persistence you will get it right.

    Now, pretend I never typed this, and this thread must die........
     
    • Thanks x 1
  8. fiesta

    fiesta Junior Member

    Joined:
    Jan 5, 2011
    Messages:
    172
    Likes Received:
    40
    You can download the pages, but they will all have the archive.org header and links, which you will have to edit out manually, one by one.
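
    A rough way to automate at least the header part (assuming the Wayback toolbar is still wrapped in its usual BEGIN/END comment markers, which may vary between snapshots) is a sed range delete over the saved files on *nix:

    Code:
    sed -i '/BEGIN WAYBACK TOOLBAR INSERT/,/END WAYBACK TOOLBAR INSERT/d' *.html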
     
  9. jon_xx_x

    jon_xx_x Jr. VIP

    Joined:
    Nov 15, 2008
    Messages:
    3,116
    Likes Received:
    1,460
    Yeah I did that, and there were like 10 pages. I don't want to do that for bigger sites; it takes up too much time.
    I'll try these other methods.
    Thanks for the replies.
     
  10. nagennaskar

    nagennaskar Newbie

    Joined:
    Sep 20, 2014
    Messages:
    1
    Likes Received:
    0
    I am a noob at scraping and would like a simple system that would let me enter keywords and have dead/archived content automatically appear on my website, ready to be posted - or at least something as close to that as possible. What are the possibilities? Can some member enlighten me?
     
  11. gutterleech

    gutterleech Regular Member

    Joined:
    Sep 30, 2010
    Messages:
    326
    Likes Received:
    252
    Occupation:
    founder
    Location:
    Third World
    HTTrack Website Copier
     
  12. cashcorp

    cashcorp Regular Member

    Joined:
    Feb 8, 2008
    Messages:
    430
    Likes Received:
    270

    This guy nailed it. Now, to make the next part easier, download Notepad++ and use the "Replace in Files" function to strip out the archive.org HTML and convert all the URLs to your domain.
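
    In regex search mode, something along these lines is a reasonable starting point (two passes; example.com and yourdomain.com are placeholders, and the exact Wayback URL shape can vary between snapshots):

    Code:
    Pass 1 - strip the Wayback prefix from the links:
      Find what:     https?://web\.archive\.org/web/\d+/
      Replace with:  (leave empty)

    Pass 2 - point the original absolute links at your domain:
      Find what:     https?://(www\.)?example\.com/
      Replace with:  http://www.yourdomain.com/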
     
  13. Asif WILSON Khan

    Asif WILSON Khan Executive VIP Premium Member

    Joined:
    Nov 10, 2012
    Messages:
    10,132
    Likes Received:
    28,596
    Gender:
    Male
    Occupation:
    Fun Lovin' Criminal
    Location:
    London
    This thread is from 2011. Pretty sure OP solved the problem by now!
     
    • Thanks x 1