
advanced google operators and scrapebox

Discussion in 'Black Hat SEO Tools' started by utuxia, May 15, 2012.

  1. utuxia

    utuxia BANNED BANNED

    Joined:
    Feb 14, 2011
    Messages:
    673
    Likes Received:
    111
    I don't understand how you scrape with so many advanced operators (I see footprint lists of 2,000+).

    With a proxy list of 30, I get an IP ban after about one run when using advanced operators like "site:" or "inurl:".

    How are these guys building 100k-long lists using advanced operators?
     
  2. partymarty4870

    partymarty4870 Elite Member

    Joined:
    Jul 7, 2010
    Messages:
    2,034
    Likes Received:
    1,690
    Location:
    I come from a land downunder
    Are you using private or shared proxies?

    Because 10 private proxies are worth about 100 shared ones for me.
     
  3. utuxia

    utuxia BANNED BANNED

    Joined:
    Feb 14, 2011
    Messages:
    673
    Likes Received:
    111
    I'm using shared private proxies. I pay for them, but they are shared with others who are also paying. They have been pretty reliable so far, with no real problems except for advanced operators in Google. I run about 3 connections when harvesting Google and still get a 302 IP ban.

    The ban will go away if I wait 10 minutes, but then after a few hundred queries it's back.
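    A minimal sketch of that work-around (the proxy addresses, query handling, and timings below are just placeholders): when Google answers with a 302 redirect to its sorry page, rest that proxy for roughly ten minutes before reusing it.

    Code:
    import time
    import requests

    # Placeholder shared proxies; substitute your own list.
    PROXIES = ["http://user:pass@198.51.100.10:8080",
               "http://user:pass@198.51.100.11:8080"]

    BAN_REST = 10 * 60          # the ban seems to clear after ~10 minutes
    resting = {}                # proxy -> time it may be used again

    def harvest(query, page=0):
        """Try each proxy that is not resting; rest a proxy when Google 302s it."""
        params = {"q": query, "start": page * 10}
        for proxy in PROXIES:
            if time.time() < resting.get(proxy, 0):
                continue                      # still resting, skip it
            r = requests.get("https://www.google.com/search",
                             params=params,
                             proxies={"http": proxy, "https": proxy},
                             allow_redirects=False, timeout=15)
            if r.status_code == 302:          # redirect to the sorry/captcha page = ban
                resting[proxy] = time.time() + BAN_REST
                continue
            return r.text                     # raw result page, ready for parsing
        return None                           # every proxy is currently resting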
     
  4. jb2008

    jb2008 Senior Member

    Joined:
    Jul 15, 2010
    Messages:
    1,158
    Likes Received:
    972
    Occupation:
    Scraping, Harvesting in the Corn Fields
    Location:
    On my VPS servers
    Advanced operators always get quick bans because they are so often used by SEOs for scraping.
     
  5. utuxia

    utuxia BANNED BANNED

    Joined:
    Feb 14, 2011
    Messages:
    673
    Likes Received:
    111
    Yeah, I figured as much... but how are people using them to build large lists of .edu and .gov sites?
    I can only get about 1,000 unique domains before I get an IP ban on site: or inurl:.
     
  6. jb2008

    jb2008 Senior Member

    Joined:
    Jul 15, 2010
    Messages:
    1,158
    Likes Received:
    972
    Occupation:
    Scraping, Harvesting in the Corn Fields
    Location:
    On my VPS servers
    They might have huge lists of proxies (I am talking 5,000+ working public proxies). They might not even be using special operators, since even with 5k proxies you will get banned fast. These days, inurl: and the like are just a no-go; proxies are too valuable to ruin with a guaranteed ban like that. They might have a VPN account that automatically changes IP every X minutes. They might scrape sitemaps. They might simply be using large keyword lists and good footprints that don't rely on special operators or "Powered by" footprints that get banned fast. Think outside the box and don't be a dumbass!
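    For the huge-public-proxy-list route, a minimal sketch (the proxies.txt file name and query handling are assumptions): cycle through the whole pool so each proxy only sees an occasional request, and retire proxies as they die.

    Code:
    import itertools
    import requests

    # proxies.txt: one ip:port per line, scraped and tested elsewhere (assumed file)
    with open("proxies.txt") as f:
        pool = [line.strip() for line in f if line.strip()]

    proxy_cycle = itertools.cycle(pool)
    dead = set()

    def fetch(query):
        """Spread queries across the whole pool; drop proxies that stop working."""
        for _ in range(len(pool)):
            proxy = next(proxy_cycle)
            if proxy in dead:
                continue
            try:
                r = requests.get("https://www.google.com/search",
                                 params={"q": query},
                                 proxies={"http": "http://" + proxy,
                                          "https": "http://" + proxy},
                                 timeout=10)
                if r.ok:
                    return r.text
            except requests.RequestException:
                dead.add(proxy)       # public proxies die constantly
        return None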
     
  7. audioguy

    audioguy Power Member

    Joined:
    Jun 12, 2010
    Messages:
    609
    Likes Received:
    224
    Location:
    Anywhere in the world building WP sites.
    Public proxies are the answer.

    Don't use private proxies, even shared ones, for scraping; that's how their IPs end up getting banned.
     
  8. mazgalici

    mazgalici Supreme Member

    Joined:
    Jan 2, 2009
    Messages:
    1,489
    Likes Received:
    881
    Home Page:
    Advanced operators will get you banned faster because regular users don't use them, so they are a footprint for scrapers.
     
  9. utuxia

    utuxia BANNED BANNED

    Joined:
    Feb 14, 2011
    Messages:
    673
    Likes Received:
    111
    Hate to be a dumbass again, but how else would you find .edu and .gov sites without using site:.edu?
     
  10. Scritty

    Scritty Elite Member Premium Member

    Joined:
    May 1, 2010
    Messages:
    2,807
    Likes Received:
    4,496
    Occupation:
    Affiliate Marketer
    Location:
    UK
    Home Page:
    100 proxies - 10 threads - 3k-5k keywords and letting it run overnight for between half a million and a million results.

    Also consider Yahoo. These days you get about 85% of the results and far fewer IP bans (999 errors in their case). The days of one proxy per thread are long gone; you need several to rotate through (see the sketch below). Ten proxies per thread may be overkill, but I have had consistent bans at four or five. Ten threads still gives you almost 100 URLs a second on a decent connection.

    Don't mix hundreds of different footprints in one scrape; you'll only have to sort them out afterwards in most cases. You'll run quicker and smoother, with fewer bans, with a maximum of 2 footprints for the same platform (don't mix platforms). If you've got the time, run one.
    If that is a PITA for your home PC, get a VPS and run it on there.
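    A rough sketch of that proxy-to-thread ratio (the IPs, keyword list, and counts below are placeholders, not an actual setup): many more proxies than threads, with every worker borrowing the next proxy from a shared rotation instead of owning one.

    Code:
    import itertools
    import threading
    from concurrent.futures import ThreadPoolExecutor

    import requests

    THREADS = 10
    # Placeholder proxies and a tiny keyword list; in practice ~100 proxies
    # and 3k-5k keywords, as described above.
    proxies = ["http://203.0.113.%d:3128" % i for i in range(1, 101)]
    keywords = ["ab workouts", "protein shake", "study tips"]

    proxy_cycle = itertools.cycle(proxies)
    lock = threading.Lock()

    def next_proxy():
        with lock:                    # itertools.cycle is not thread-safe on its own
            return next(proxy_cycle)

    def harvest(keyword):
        proxy = next_proxy()
        try:
            r = requests.get("https://www.google.com/search",
                             params={"q": keyword},
                             proxies={"http": proxy, "https": proxy},
                             timeout=15)
            return r.text if r.ok else None
        except requests.RequestException:
            return None

    with ThreadPoolExecutor(max_workers=THREADS) as pool:
        pages = list(pool.map(harvest, keywords))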

    Scritty
     
    • Thanks Thanks x 2
  11. utuxia

    utuxia BANNED BANNED

    Joined:
    Feb 14, 2011
    Messages:
    673
    Likes Received:
    111
    I see. I have 30 proxies, but was using about 30 footprints in a scrape, with advanced operators. I'm checking out proxy multiply now.
     
  12. turoc

    turoc Junior Member

    Joined:
    Oct 28, 2009
    Messages:
    117
    Likes Received:
    33
    If you're using private proxies to scrape, set your max harvesting connections to 20% of your total available proxies, i.e. if you have 10 proxies, set max connections to 2. It sounds like too few, but you'd be amazed at how effective it is, and your proxies stay alive forever. I have been running this way for a couple of weeks now and my proxies are all fine, and I am harvesting result lists of 500k+ at a time, with an average speed of anywhere between 16 and 50 URLs/s, depending on the footprint and the keywords.

    Also, stick to one footprint at a time, and use good keyword lists ...
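    That 20% rule as a tiny sketch (the variable names are placeholders):

    Code:
    # Rule of thumb above: max harvesting connections = 20% of available proxies,
    # but never less than one.
    proxies = ["203.0.113.1:3128"] * 10   # stand-in for your private proxy list
    max_connections = max(1, len(proxies) // 5)
    print(max_connections)                # 10 proxies -> 2 connections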
     
  13. utuxia

    utuxia BANNED BANNED

    Joined:
    Feb 14, 2011
    Messages:
    673
    Likes Received:
    111
    Well, even at 10% I'm getting 302s in Google. I had to drop it to 1 connection, but at least I'm getting through now.
     
  14. williamk

    williamk BANNED BANNED

    Joined:
    Oct 29, 2009
    Messages:
    1,030
    Likes Received:
    184
    You are always going to have a hard time scraping with inurl: no matter how many private proxies you have; Google is just really strict about those commands. I know the inurl: command is really useful for finding specific niche sites (after all, what is more targeted than a site that has your niche keyword in its domain?), but that's the hard truth. It's a shame that other search engines don't have that command, eh?
     
    Last edited: May 17, 2012
  15. utuxia

    utuxia BANNED BANNED

    Joined:
    Feb 14, 2011
    Messages:
    673
    Likes Received:
    111
    Bing has it.
    The bigger problem I see is that SB converts "site:.edu ab workouts" into "site:.edu+ab+workouts", which yields a lot fewer results. Same for Google and Bing.
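    For anyone building the query URLs themselves rather than letting ScrapeBox do it, the difference comes down to how the space is encoded; a quick sketch with Python's standard library (whether Google treats "+" differently from "%20" is the open question above):

    Code:
    from urllib.parse import quote, quote_plus

    query = 'site:.edu ab workouts'

    print(quote_plus(query))  # site%3A.edu+ab+workouts     (spaces become "+", as SB appears to send)
    print(quote(query))       # site%3A.edu%20ab%20workouts (spaces become "%20")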
     
  16. RodMar

    RodMar BANNED BANNED

    Joined:
    May 12, 2012
    Messages:
    436
    Likes Received:
    65
    What I do is scrape a huge list of public proxies for scraping Google. I use private proxies for posting, and since I need a fair amount of scraping every day, a large pool of public proxies works well for me. After all, each of those proxies will eventually get banned.
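    A minimal sketch of vetting a scraped public-proxy list before pointing it at Google (the file name, test query, and timeout are arbitrary choices): keep only proxies that can fetch a result page without being redirected.

    Code:
    import requests

    def is_usable(proxy, timeout=8):
        """True if the proxy fetches a Google result page without a redirect/ban."""
        try:
            r = requests.get("https://www.google.com/search",
                             params={"q": "test"},
                             proxies={"http": "http://" + proxy,
                                      "https": "http://" + proxy},
                             allow_redirects=False, timeout=timeout)
            return r.status_code == 200
        except requests.RequestException:
            return False

    with open("public_proxies.txt") as f:      # assumed file: one ip:port per line
        candidates = [line.strip() for line in f if line.strip()]

    usable = [p for p in candidates if is_usable(p)]
    print("%d of %d proxies are usable" % (len(usable), len(candidates)))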
     
  17. GoldenGlovez

    GoldenGlovez Moderator Staff Member Moderator Jr. VIP

    Joined:
    Mar 23, 2011
    Messages:
    701
    Likes Received:
    1,713
    Location:
    Guangdong, China
    Home Page:
    The idea that you cannot scrape with private proxies is commonly repeated but in reality untrue. Set a 1:5 connection-to-proxy ratio (or use an RND setting of 3-5 seconds) and you can harvest day and night for months on end without a ban for MOST footprints.

    Make sure to keep your footprints as light as possible and if you want to scrape multiple footprints, use the MERGE feature to hit each keyword with a separate footprint (not all at once).
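    Outside of ScrapeBox, the MERGE idea boils down to pairing every keyword with one footprint per pass rather than mixing them all together; a rough sketch with example footprints and keywords, plus the 3-5 second RND delay mentioned above:

    Code:
    import random
    import time

    footprints = ['"Leave a Reply"', '"Add new comment"']   # example footprints only
    keywords = ["ab workouts", "protein shake"]

    for footprint in footprints:              # one footprint per pass, not all at once
        for keyword in keywords:
            query = footprint + " " + keyword
            print(query)                      # the harvest request would go here
            time.sleep(random.uniform(3, 5))  # 3-5 second random delay between queries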
     
  18. utuxia

    utuxia BANNED BANNED

    Joined:
    Feb 14, 2011
    Messages:
    673
    Likes Received:
    111
    Regular scraping is fine; it's the advanced operators that get a ban very quickly. For example, "site:.edu" is an advanced operator. Even with 1 connection and 30 proxies, I can only scrape a few keywords before it gets banned.