
Scraping ezinearticles?

Discussion in 'Black Hat SEO Tools' started by cooooookies, Feb 3, 2013.

  1. cooooookies

    cooooookies Senior Member

    Joined:
    Oct 6, 2008
    Messages:
    1,008
    Likes Received:
    216
    I am scraping ezinearticles with a boatload of proxies and my own bots. Still, I am not satisfied, since they ban really fast. I need f***ing more articles.

    Has anybody researched what they base their bans on? The time between subsequent requests? Some missing header info?

    I just checked the WP Bad Behavior plugin, which reveals a lot of methods for banning unwelcome guests. It flags many combinations of headers and user agents as bots.

    Any recommendations?
     
  2. cooooookies

    cooooookies Senior Member

    Joined:
    Oct 6, 2008
    Messages:
    1,008
    Likes Received:
    216
    It is much harder than I thought. 25 consecutive article requests led to a ban of my clean IP (with a good user agent and other headers).

    I also tried Amazon EC2 IPs: all banned. They are on the Spamhaus PBL, so not a single request was possible.
     
  3. aishahriar

    aishahriar BANNED BANNED

    Joined:
    Jan 7, 2010
    Messages:
    310
    Likes Received:
    336
    When your IP has been banned, you need to wait 30 minutes for it to recover. Anyway, you can use a suitable delay between actions. I'm doing that in my own private bot, and it successfully scraped 400 articles without getting my IP banned.

    PS: I don't need any proxies; I just use my own IP.
     
  4. cooooookies

    cooooookies Senior Member

    Joined:
    Oct 6, 2008
    Messages:
    1,008
    Likes Received:
    216
    Thanks, good advice. I want to scrape 150k, so I will need some patience or good working proxies.

    What are your timings? How many requests per minute?
     
  5. cooooookies

    cooooookies Senior Member

    Joined:
    Oct 6, 2008
    Messages:
    1,008
    Likes Received:
    216
    OK, checked that again. Indeed, Amazon EC2 IPs are on the Spamhaus PBL.

    I had 50 IPs there, and the small number of unlisted ones were usable for scraping ezinearticles.
     
  6. myownhero

    myownhero Power Member Premium Member

    Joined:
    Mar 13, 2012
    Messages:
    763
    Likes Received:
    713
    Occupation:
    SEO Analyst / Link Builder
    Location:
    United States
    Do the articles have to be from Ezinearticles? There are some automatic content generators like Kontent Machine that will scrape from a variety of sources (including Ezine). I can get away with scraping tons of content with only ~20 proxies since it scrapes from a variety of sources.
     
  7. aishahriar

    aishahriar BANNED BANNED

    Joined:
    Jan 7, 2010
    Messages:
    310
    Likes Received:
    336
    Give it a maximum of 20 requests per minute, and I think all should be fine.
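For illustration, a limit of 20 requests per minute can be enforced by spacing calls at least 3 seconds apart. The sketch below is only one way to do that and is not aishahriar's bot: the names `MinIntervalThrottle` and `hours_needed` are made up for this example, and the injectable `clock` and `sleep` parameters are an assumption added so the logic can be checked without really waiting.

```python
import time


class MinIntervalThrottle:
    """Spaces calls evenly so that at most `max_per_minute` happen per minute."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        # 20 per minute -> one call every 3.0 seconds.
        self.interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        # No call has been made yet, so the first one is allowed immediately.
        self._next_allowed = float("-inf")

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the following call one interval after this one.
        self._next_allowed = max(now, self._next_allowed) + self.interval
        return delay


def hours_needed(total_requests, max_per_minute):
    """Rough wall-clock estimate for a job run at a fixed request rate."""
    return total_requests / max_per_minute / 60.0
```

Calling `wait()` before each request keeps a single client at or below the limit. At this rate, `hours_needed(150_000, 20)` gives 125 hours, a bit over five days for the 150k articles mentioned earlier in the thread.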
     
  8. ShabbySquire

    ShabbySquire Power Member

    Joined:
    Nov 30, 2011
    Messages:
    574
    Likes Received:
    122
    Location:
    UK
    Btw (off topic) is the Spamhaus database the place to go to check if your IP is 'blacklisted', or are there better ones?
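For context, a blocklist check like this is an ordinary DNS lookup: the IPv4 octets are reversed, the list's zone is appended, and an answer in 127.0.0.0/8 means the address is listed. The sketch below is a rough illustration, not an official client. It assumes the combined `zen.spamhaus.org` zone and the PBL return codes 127.0.0.10 and 127.0.0.11; the function names are made up for this example. Spamhaus may refuse queries sent through large public resolvers, which is why the 127.255.255.x range is treated as an error rather than a listing.

```python
import ipaddress
import socket

# Assumed return codes for PBL entries in the combined "zen" zone.
PBL_CODES = {"127.0.0.10", "127.0.0.11"}


def dnsbl_query_name(ip, zone="zen.spamhaus.org"):
    """Build the lookup name: reversed IPv4 octets followed by the zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def lookup(ip, zone="zen.spamhaus.org"):
    """Return the list of A records for the DNSBL name (empty if none)."""
    try:
        _, _, answers = socket.gethostbyname_ex(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # No such name: the IP is not listed (or the lookup itself failed).
        return []
    return answers


def classify(answers):
    """Turn raw DNSBL answers into a short human-readable verdict."""
    if not answers:
        return "not listed"
    if any(a.startswith("127.255.255.") for a in answers):
        return "query error"  # e.g. query refused, not a real listing
    if PBL_CODES & set(answers):
        return "listed (PBL)"
    return "listed (other)"
```

A check would then be `classify(lookup("192.0.2.10"))`, where 192.0.2.10 is a documentation address used purely as a placeholder. Other blocklists work the same way with a different zone name.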
     
  9. nihilnovi

    nihilnovi Newbie

    Joined:
    Mar 2, 2013
    Messages:
    14
    Likes Received:
    0
    I'm using Guzzle (can't post the URL), a sort of cURL wrapper for PHP, to scrape websites, but I'm having trouble with eZine.

    The first request is successful and fetches 1 page, using these headers:

    Host: URL TO EZINE HERE, moderation system doesnt allow me to post :|
    Connection: keep-alive
    Cache-Control: no-cache
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Pragma: no-cache
    User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36
    Accept-Encoding: gzip,deflate,sdch
    Accept-Language: en-US,en;q=0.8,sv;q=0.6



    After the first request, this image is shown:
    [Attached image: EzineArticles Submission - Submit Your Best Quality Original Articles For Massive Exposure, Ezin.jpg]


    I can't figure out how they are blocking the requests. Does anyone know what's missing and how they detect this after only 1 request?