Footprints for scraping Tumblr accounts

ensema

Junior Member
Jul 6, 2012
I've been scraping inactive Tumblr blogs with Scrapebox for the past few days. I've got 10 x PR1, 10 x PR2 and 1 x PR3 so far. I've been using the following method, which I found somewhere on here...

1) Get a keyword list, or scrape one with Scrapebox
2) Use footprint site:tumblr.com/post/
3) Scrape
4) Trim to root
5) De-Duplicate
6) Run results through vanity checker addon
7) Export available blogs
8) Check PR
9) Register blogs with PR1+
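
For anyone doing steps 4 and 5 outside Scrapebox, here is a rough Python sketch of "trim to root" plus de-duplication. The function names (trim_to_root, unique_blogs) are made up for illustration, and it assumes the scrape is a plain list of URLs, one per entry; it is not how Scrapebox itself does it.

```python
from urllib.parse import urlsplit

def trim_to_root(url):
    """Reduce a scraped URL to its hostname, e.g. 'name.tumblr.com'."""
    # Scraped lists sometimes lack a scheme; add one so urlsplit finds the host.
    if "://" not in url:
        url = "http://" + url
    return urlsplit(url).hostname  # lowercased by urlsplit, None if missing

def unique_blogs(urls):
    """Trim every URL to its root and keep each Tumblr subdomain once, in order."""
    seen = set()
    blogs = []
    for url in urls:
        host = trim_to_root(url.strip())
        # Skip empty lines, non-Tumblr hosts and Tumblr's own www pages.
        if not host or not host.endswith(".tumblr.com") or host == "www.tumblr.com":
            continue
        if host not in seen:
            seen.add(host)
            blogs.append(host)
    return blogs
```

Feed it the raw scrape and pass the returned list on to the checker of your choice.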

Now if you have done this you will have noticed Tumblr's standard 404 page. It has the same little bit of text every time...

There's nothing here.

Whatever you were looking for doesn't currently exist at this address. Unless you were looking for this error page, in which case: Congrats! You totally found it.


So, just as you would use the following footprint to search for WordPress comments:

site:.edu "You can leave a response, or trackback"

Can we not use:

site:tumblr.com "There's nothing here"


Or a footprint with any other part of the standard text on their 404 page.

I've tried it and it didn't return only available blogs, so I'm either dumb, missing something, unlucky, or all three.
 
First of all, use
-site:www.tumblr.com site:tumblr.com
Second thing is, not all blogs are available to register.

1. Scrape
2. Remove dup domains
3. Alive check
4. Save only Dead blogs
5. Check PR / PA
6. Remove the ones with low PR / PA
7. Try to register
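
Steps 3 and 4 (alive check, keep only the dead blogs) could look roughly like this in Python. This is only a sketch under assumptions: it treats a blog as "dead" when its front page answers with HTTP 404, the helper names (http_status, keep_dead) are invented for the example, and it says nothing about whether the name can actually be registered.

```python
import urllib.error
import urllib.request

def http_status(host, timeout=10.0):
    """Return the HTTP status of a blog's front page, or None on a network error."""
    req = urllib.request.Request(f"https://{host}/", method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the status code is on the exception.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, timeout, refused connection and so on.
        return None

def keep_dead(hosts, fetch_status=http_status):
    """Keep only hosts whose front page returns 404 (assumed to mean 'dead')."""
    return [host for host in hosts if fetch_status(host) == 404]
```

The fetch_status parameter is there so a different checker, or a dictionary of cached results, can be swapped in without touching the filter.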

Keep in mind that not all blogs will keep their PR. Most of the time you have to rebuild the content from archive.org, and most of the blogs have been spammed.
 
you are all three i guess

Thanks

Appreciate the pointers, dude. I understand some aren't available to register, but why use the alive check over the vanity check? I've tried to do the archive.org bit where possible too.

I'm still wondering if there is a way to use what's on the standard 404 page to pick up only dead blogs?
 
The alive check is 100x faster than the vanity check. It works as a filter, so you won't waste so much time.
 
OK, cool... I'm rerunning everything I scraped in the past few days through the alive checker now. The first few times I used the alive checker it was CRAAAZY slow; now it's like the thing's on speed.
 
You are right, the footprint you suggested doesn't work. You don't have to search for any 404 text; just search for any word and hope to get lucky.
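
If you want to build the search list by hand, "search for any word" just means pairing each keyword with the footprint posted earlier in the thread. A minimal Python sketch, assuming the keywords are a plain list of strings (build_queries is a made-up helper name, and the resulting strings are simply what you would paste into the scraper):

```python
# Footprint taken from earlier in this thread.
FOOTPRINT = "-site:www.tumblr.com site:tumblr.com"

def build_queries(keywords, footprint=FOOTPRINT):
    """Combine the footprint with each non-empty keyword, one query per keyword."""
    queries = []
    for keyword in keywords:
        keyword = keyword.strip()
        if keyword:  # skip blank lines in the keyword list
            queries.append(f"{footprint} {keyword}")
    return queries
```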
 
I ran my scraped data through the alive checker and then the vanity checker and got a slightly different set of results. Doing it this way I found a few extra PR2 & PR1 blogs. So I was wrong, the alive checker is worth it.
