downloading an entire website...

Max Nathan

Hello everyone! :-) I hope this is the right section for this post.

Has anyone here tried downloading entire websites, especially REALLY LARGE ones? What I have in mind is Discogs (dot) com. I want to save everything to my hard drive and be able to access the whole site even when offline. Is that possible? I've downloaded websites before using HTTrack Website Copier, but nothing on the scale of Discogs. Does anyone have an idea how massive a site I'd be looking at here?

Thanks! :-)
 
Space, as well as time, would be a big factor. Depending on the size of the website you are trying to download, you might need several terabytes of space, a lot of processing power and a boatload of patience.
 

Wow, terabytes! Now I'm suddenly feeling discouraged, heh...

Is there a way to find out a website's size (Discogs, in particular) beforehand, so I'd know whether what I have in mind is even feasible?
 
Google "how to null website" you'll get an idea
With HTTrack you can limit the number of consequent URLs, block certain files, not follow the robots.txt rules etc
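
As a rough illustration of those options (the output folder and depth here are just placeholders, not recommendations), a command along these lines caps the mirror depth, skips common image files, and ignores robots.txt:

httrack "https://www.discogs.com/" -O ./discogs-mirror -r3 -s0 "-*.jpg" "-*.png" "-*.gif"

Here -O sets the output folder, -r3 limits the link depth, -s0 tells HTTrack not to follow robots.txt, and the "-*.ext" patterns are exclusion filters.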
 
There's a lot of backend PHP that goes into the website, and that part you can't download.

Bottom line is, you can't, really.

You could also build a bot to download/crawl through this data:

....
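
Very roughly, such a bot could be a short Python script. This is only a sketch, assuming the requests and beautifulsoup4 packages are installed; the start URL, page cap, and delay are placeholders, and nothing here is Discogs-specific:

import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START_URL = "https://www.example.com/"  # placeholder starting point
MAX_PAGES = 100                         # keep the sketch tiny
DELAY_SECONDS = 1.0                     # be polite between requests

def crawl(start_url, max_pages):
    """Breadth-first crawl of one domain, keeping only the page text."""
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = set()
    pages = {}  # url -> extracted text

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)

        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            continue  # skip pages that fail to load

        soup = BeautifulSoup(response.text, "html.parser")
        pages[url] = soup.get_text(separator=" ", strip=True)

        # Only follow links that stay on the same domain.
        for link in soup.find_all("a", href=True):
            next_url = urljoin(url, link["href"])
            if urlparse(next_url).netloc == domain and next_url not in seen:
                queue.append(next_url)

        time.sleep(DELAY_SECONDS)

    return pages

if __name__ == "__main__":
    results = crawl(START_URL, MAX_PAGES)
    print(f"Fetched text from {len(results)} pages")

For a site the size of Discogs you would also need retries, on-disk storage instead of an in-memory dict, and a close look at the site's terms of use.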

I'm really only after the text on the site and wouldn't mind skipping all the images and video/audio files. Would that make a difference? My wish is to have my own offline database of just basic track information: artist, accurate titles, release year, released versions, and genre.

Too bad I have zero technical inclination, so I don't think I could go as far as building bots.
 
With the example of discogs:
If you want the whole Discogs database without the images, it's mostly a matter of time. It shouldn't be that hard.
If you want the images too, then you need some serious TB storage.
 

Yes, that's really what I want: no images, just a pure text database. Now I'm wondering how many GB that would take.
 
Sure, happens all the time, especially with sites where the information is just general knowledge, such as song lyrics. It's called scraping. Look for scraper tools that grab the info.

Basically, someone will scrape all the written content from other sites, then automate the process of creating similar pages and adding the borrowed content to those pages.
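
As a small illustration of what a scraper does with the text once it has it, here is a Python sketch that parses already-saved release pages into a local SQLite file, which would give you exactly the kind of offline text database mentioned above. The .artist/.title/.year/.genre selectors are invented for the example; any real site's markup will use different ones:

import sqlite3
from bs4 import BeautifulSoup

def pick(soup, selector):
    """Return the text of the first element matching a CSS selector, or ''."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node else ""

def parse_release(html):
    """Pull artist/title/year/genre out of one saved page (made-up selectors)."""
    soup = BeautifulSoup(html, "html.parser")
    return (pick(soup, ".artist"), pick(soup, ".title"),
            pick(soup, ".year"), pick(soup, ".genre"))

def store_pages(html_pages, db_path="releases.db"):
    """Write the extracted fields for each page into a small SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS releases "
        "(artist TEXT, title TEXT, year TEXT, genre TEXT)"
    )
    with conn:
        conn.executemany(
            "INSERT INTO releases (artist, title, year, genre) VALUES (?, ?, ?, ?)",
            (parse_release(html) for html in html_pages),
        )
    conn.close()

The extracted text itself is tiny compared to the raw pages and images, so most of the storage cost sits in the crawl, not in the final table.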
 