
[TIP] How to not get banned while scraping google (important for SB users)

Discussion in 'Black Hat SEO' started by dracony2, Apr 6, 2011.

  1. dracony2

    dracony2 Regular Member

    Joined:
    Mar 5, 2010
    Messages:
    349
    Likes Received:
    321
    Location:
    Jupiter
    Here is a little secret I decided to share =)
    When scraping Google I used public proxies (lots of them), call me a cheapskate. There have been a lot of threads saying that private proxies are best for scraping etc., but let's face it, they cost a lot. Public proxies cost next to nothing.
    But here is the problem: they get banned a lot, because people use them for scraping.
    So how do you scrape without getting banned? Even better, how do you scrape via semi-banned proxies?

    The fact is that Google never actually bans an IP completely; it bans certain requests from it, e.g. footprints. How do I know this?
    I have my own scraping tools that I sometimes share on this forum, and while writing them I came across a strange thing. My scripts couldn't scrape and were asked to enter a captcha, while I could google normally using a browser. So I set out to find out why this was happening.

    I tried a LOT of stuff: headers, cookies, requests to some hidden endpoints, etc.
    Here are the quickest tips for you guys:

    Tip 1 (ScrapeBox does this automatically, but maybe this will be of use to someone):
    Replace ' ' with +.
    By default URLs encode spaces as %20, and Google notices that. If you look at any Google URL you will see only +; Google rewrites your query to have + instead of %20. So do the same; a quick sketch follows below.
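    Here is a minimal sketch of what I mean in Python (just an illustration, standard library only; the query string is made up):

    Code:
    from urllib.parse import quote, quote_plus

    query = 'blue widget reviews'  # example query

    print(quote(query))       # blue%20widget%20reviews  <- the encoding that stands out
    print(quote_plus(query))  # blue+widget+reviews      <- what Google itself uses

    # build the search URL with +-encoded spaces
    url = 'http://www.google.com/search?q=' + quote_plus(query)
    print(url)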

    Tip 2
    This is real gold.

    The problem with getting banned is your FOOTPRINT.
    When searching with footprints that contain stuff like inurl: and "powered by wordpress" I got banned almost instantly. Later, even when using a browser, such queries would produce a captcha.
    The problem with ScrapeBox is that it gives up on a keyword after a certain number of tries. So if you use footprints, usually around 30% of keywords will fail, because Google will ban you for some time, and during that time ScrapeBox will waste keywords.

    I think Google also "bans" footprints. I mean that if it spots some suspicious footprints, it may ban the IP.

    Instead of "powered by" and inurl:, concentrate on other text on the page.
    For example, when I searched for vBulletin forums I didn't use inurl:'member.php' "powered by vbulletin" but instead:
    "Home+page"+"find+all+posts+by+this+user"+"about+me" etc.

    If you rewrite your footprints to be less "generic", ScrapeBox will lose far fewer keywords (see the sketch below).
    I tried this and my SB had 0 keyword loss with 5000 keywords over public proxies.
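    A sketch of the rewrite in Python (the safer footprint is the vBulletin example above; the keywords are made up):

    Code:
    from urllib.parse import quote_plus

    # operator-heavy footprint: triggers captchas fast
    risky = 'inurl:member.php "powered by vbulletin" {kw}'

    # plain on-page text that vBulletin forums tend to contain
    safer = '"home page" "find all posts by this user" "about me" {kw}'

    for kw in ['gardening', 'woodworking']:  # example keywords
        print('http://www.google.com/search?q=' + quote_plus(safer.format(kw=kw)))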

    Hope this helps.
     
    • Thanks x 8
  2. dannyhw

    dannyhw Senior Member

    Joined:
    Jul 16, 2008
    Messages:
    979
    Likes Received:
    465
    Occupation:
    Software Engineer
    Location:
    New York City Burbs
    Yeah, certain footprints will trigger a captcha, and so can going too fast. Without a proxy and with one socket, my own scripts have been able to do millions of searches in a row and never hit the captcha. Even with 2 sockets and no footprint I've been hit.

    If you can find good sources of public proxies, like 1,000 at a time, you can max out your connection, footprints or no footprints.
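    To illustrate the one-socket pacing (a rough sketch in Python using the requests library; the proxy addresses and the delay are made up, tune them yourself):

    Code:
    import itertools
    import time
    import requests

    proxies = ['203.0.113.10:8080', '198.51.100.7:3128']  # your public proxy list
    rotation = itertools.cycle(proxies)

    def search(query):
        # one request at a time, one proxy per request, with a pause
        proxy = next(rotation)
        resp = requests.get(
            'http://www.google.com/search',
            params={'q': query},
            proxies={'http': 'http://' + proxy},
            timeout=10,
        )
        time.sleep(5)  # a single "socket" plus a pause: slow, but avoids captchas
        return resp.text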
     
  3. dracony2

    dracony2 Regular Member

    Joined:
    Mar 5, 2010
    Messages:
    349
    Likes Received:
    321
    Location:
    Jupiter
    True, but still, zero skipped keywords is better than a small number of them skipped =)
     
  4. cyberzilla

    cyberzilla Elite Member Premium Member

    Joined:
    Nov 15, 2009
    Messages:
    2,205
    Likes Received:
    3,366
    Location:
    zeta reticuli
    Private proxies are best for posting, not for scraping, and you are correct about the footprints. I mostly avoid using advanced Google operators and common footprints.
     
  5. Maruk

    Maruk Power Member

    Joined:
    Jun 15, 2009
    Messages:
    562
    Likes Received:
    898
    Home Page:
    Regarding the footprints, it is actually the search operators that will get you banned right quick!
     
  6. dracony2

    dracony2 Regular Member

    Joined:
    Mar 5, 2010
    Messages:
    349
    Likes Received:
    321
    Location:
    Jupiter
    actually "powered by" stuff got me banned pretty quick too.
     
  7. Maruk

    Maruk Power Member

    Joined:
    Jun 15, 2009
    Messages:
    562
    Likes Received:
    898
    Home Page:
    Yeah, you might be right; I just know for sure that the site: operator gets you banned real quick.
    It wouldn't seem strange for Google to ban such widely searched strings, so you might be right, buddy.
     
  8. seolease

    seolease Newbie

    Joined:
    Feb 15, 2011
    Messages:
    27
    Likes Received:
    5
    Location:
    127.0.0.1
    Great tips! Will try to do some scraping without the operators, might indeed look a bit more natural :)
     
  9. qu4rk

    qu4rk Junior Member

    Joined:
    Apr 6, 2011
    Messages:
    145
    Likes Received:
    16
    So here comes the newbie question: how do you search for inurl or intitle without using the operator? Or do I just need to put a certain amount of time between searches?
     
  10. seoguru13

    seoguru13 Senior Member

    Joined:
    Jan 9, 2011
    Messages:
    1,070
    Likes Received:
    692
    Occupation:
    Digital Marketing Consultant
    Location:
    India
    Home Page:
    Seems interesting, man. Will give it a try. Thanks.