Scraping ezinearticles?

cooooookies

I am scraping ezinearticles with a boatload of proxies and my own bots. Still, I am not satisfied, since they ban really fast. I need f***ing more articles.

Has anybody researched on what basis they ban? Time between subsequent requests? Some missing header info?

I just checked the WordPress Bad Behavior plugin, and it reveals a lot of methods for banning unwelcome guests. Many combinations of headers and user agents are flagged as bots.

Any recommendations?
 
It is much harder than I thought. 25 consecutive article requests led to a ban of my clean IP (with a good user agent and other headers).

I also tried Amazon EC2 IPs: all banned. They are on the Spamhaus PBL, and not a single request got through.
 
When your IP has been banned, you need to wait 30 minutes for it to recover. Anyway, you can use a well-chosen delay between actions. I'm doing that in my own private bot, and it successfully scraped 400 articles without getting my IP banned.

PS: I don't need any proxy, I just use my own IP.
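
For anyone wondering what a "delay for each action" could look like in practice, here is a minimal Python sketch. The poster did not share code or timings, so everything here is an assumption: the function name `fetch_with_delay`, the `fetch` callable you pass in, and the default `base_seconds` and `jitter_seconds` values are all made up for illustration.

```python
import random
import time
from typing import Callable, Iterable, List


def fetch_with_delay(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    base_seconds: float = 10.0,    # assumed value, not from the thread
    jitter_seconds: float = 5.0,   # assumed value, not from the thread
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call fetch(url) for each URL, pausing between consecutive calls."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait the base delay plus a random extra, so requests are
            # spaced out and not perfectly periodic.
            sleep(base_seconds + random.uniform(0.0, jitter_seconds))
        pages.append(fetch(url))
    return pages
```

The `sleep` parameter is only there so the pacing can be checked without actually waiting. In real use you would leave it at the default and pass your own download function as `fetch`.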
 
Thanks, good advice. I want to scrape 150k, so I will need some patience or good working proxies.

What are your timings? How many requests per minute?
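
To put "some patience" into numbers, here is a rough back-of-the-envelope sketch in Python. It assumes one request per article and a steady rate from a single IP with no bans. The rates in the loop are hypothetical, since nobody in the thread has posted real timings. At one request per minute, 150k articles works out to roughly 104 days.

```python
def days_to_scrape(total_articles: int, requests_per_minute: float) -> float:
    """Days needed at a steady rate, assuming one request per article."""
    minutes = total_articles / requests_per_minute
    return minutes / (60 * 24)


for rate in (1, 2, 6):  # hypothetical rates in requests per minute
    print(f"{rate}/min -> {days_to_scrape(150_000, rate):.1f} days")
```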
 
OK, checked that again. Indeed, Amazon EC2 IPs are on the Spamhaus PBL.

I had 50 IPs there, and the small number of unlisted IPs was usable for scraping ezinearticles.
 
Do the articles have to be from Ezinearticles? There are some automatic content generators like Kontent Machine that will scrape from a variety of sources (including Ezine). I can get away with scraping tons of content with only ~20 proxies since it scrapes from a variety of sources.
 
Btw (off topic) is the Spamhaus database the place to go to check if your IP is 'blacklisted', or are there better ones?
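
On the mechanics of checking a listing: DNS blocklists like Spamhaus are queried by reversing the IPv4 octets, appending the list's zone, and doing a normal DNS lookup. Below is a minimal Python sketch of that. The function names are made up for illustration. The `zen.spamhaus.org` zone is Spamhaus's combined list, where the answers `127.0.0.10` and `127.0.0.11` indicate a PBL listing. Spamhaus may not answer properly when the query goes through a large public resolver, so treat an empty result with some caution.

```python
import ipaddress
import socket
from typing import List


def dnsbl_query_name(ip: str, zone: str = "zen.spamhaus.org") -> str:
    """Build the DNSBL lookup name: reversed IPv4 octets plus the zone."""
    address = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(address).split(".")))
    return f"{reversed_octets}.{zone}"


def dnsbl_return_codes(ip: str, zone: str = "zen.spamhaus.org") -> List[str]:
    """Return the 127.0.0.x answers for a listed IP, or [] if none."""
    try:
        _, _, codes = socket.gethostbyname_ex(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # No record found: the IP is not listed, or the lookup failed.
        return []
    return codes
```

For example, `dnsbl_query_name("192.0.2.10")` gives `10.2.0.192.zen.spamhaus.org`, and `dnsbl_return_codes` then looks that name up over DNS.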
 
I'm using Guzzle (can't post the URL), a sort of cURL wrapper for PHP, to scrape websites, but I'm having trouble with eZine.

The first request is successful and fetches one page, using these headers:

Host: URL TO EZINE HERE, moderation system doesnt allow me to post :|
Connection: keep-alive
Cache-Control: no-cache
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Pragma: no-cache
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8,sv;q=0.6



After the first request, this image is shown:
EzineArticles Submission - Submit Your Best Quality Original Articles For Massive Exposure, Ezin.jpg


I can't figure out how they are blocking the requests. Does anyone know what's missing, and how they detect this after only one request?
 