
*TIPS* Harvesting URLs with G's Palm Firmly Clenched Around Your Ballsack

Discussion in 'Black Hat SEO' started by jb2008, May 22, 2011.

    jb2008 Senior Member

    Yes, dracony, you are right. Although I didn't quite believe it months ago, G does in fact ban footprints. It's essentially restricting what you can and can't search for, like some kind of political regime. Apart from blatantly criminal terms, G shouldn't restrict what people can or can't see. It's meant to be objective.

    Anyway, you will find the following footprints restricted:

    1. powered by ANYTHING - a standard, simple footprint and the bread and butter of the beginner scraper. If you know the name of the platform but don't have any sites running it to pull better footprints from, this is a good first port of call. But it's no good for big harvests because A. it's almost NEVER the optimal footprint and B. G will burn your proxies down to worthless cinders.

    2. any special operator: inurl:, intext:, inanchor:, etc. Thankfully there is one very useful operator which I don't think is as palm-clenched-around-balls strict as the others yet, but if I mention it here G will ban that too ;)


    3. any url string, especially in quotes. This was once an extremely useful tool, but it has come to light over the past 3-6 MONTHS that G has finally caught on to it and put the brakes on. They really pissed me off with that one. So, for example, all the public profile scraping footprints like "action=profile" or (inventing this one) "/forum/member.php?u=" are GONE. Sure, you can do a few searches with them, but NOTHING substantial. The whole forum links thing is this: xrumer has such a low success rate on *RAW* scrapes (even the best XAS_AI makes very little difference) that it is ALL about numbers. Say you can only scrape a few thousand or so per IP and you've got 200 working proxies (public proxy numbers are diminishing by the day; ask proxygo, who is very opinionated on the matter). Let's say you get 500k URLs before your proxies are burnt out. Remove duplicate domains and you've got 50,000. Take a very good raw scrape profile rate of 5% and you've got about 2,500, which is not a lot at all. It's not an outright ban, but in practical terms for the BH SEO guy, it's getting close.
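To make the numbers above concrete, here is a rough sketch of the arithmetic in Python. The function name `estimate_yield` is made up for illustration, the 2,500 URLs per proxy is an assumed stand-in for "a few thousand or so", and the 10% unique-domain ratio is simply what 500k going down to 50,000 implies. Treat it as a back-of-envelope estimate, not a measurement.

```python
def estimate_yield(urls_per_proxy, proxies, unique_domain_ratio, success_rate):
    """Back-of-envelope estimate of what a raw scrape actually leaves you with.

    All inputs are assumptions chosen to match the example figures in the post.
    """
    # Total URLs harvested before every proxy is burnt out.
    raw_urls = urls_per_proxy * proxies
    # What is left after removing duplicate domains.
    unique_domains = round(raw_urls * unique_domain_ratio)
    # Profiles that actually stick at the given raw success rate.
    successes = round(unique_domains * success_rate)
    return raw_urls, unique_domains, successes


# 200 proxies, ~2,500 URLs each (assumed), 10% unique domains, 5% success rate.
raw, unique, ok = estimate_yield(2_500, 200, 0.10, 0.05)
print(raw, unique, ok)
```

Halving either the per-proxy count or the number of working proxies halves the final figure, which is why the per-IP limits hurt so much at this scale.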

    ---------------------------

    My tips to solve these ball-clenching problems are the following:

    1. Don't skimp on your public proxies. Get a good MULTIPLE SOURCE supply of public proxies. But public proxies are free, I hear you scream! Anyone can scrape tens or hundreds of thousands of shitty dead proxies, but it takes a long time to find the right sources and even longer to test all of them each and every day. In the end, it's better to leave it to somebody else. I'm not an affiliate for proxygo, but his service really is kickass and indispensable for the money I paid. Don't stop there, though: research ways of maximizing your total daily amounts, because working public proxies are going down across the board.


    2. Don't harvest with more threads than you have proxies. Try to keep it at a proxies-to-threads ratio of at LEAST 2:1. If you're not that pressed for time, also use SB's delay function or hrefer's antiban function to keep the proxies going for longer. Faster is almost never better for large harvests, because once G realises you are automating queries, down comes the ban hammer and the proxy is gone. You may think that public proxies are infinite, but as time goes by the tools become more popular while the supply is not increasing to meet that demand. There is a squeeze on at the moment and it will only get worse.


    3. Be inventive with your footprints. Powered by, special operators and so on once worked well, and they still can for small harvests and for Yahoo!, but for G you can't do that anymore. Think like a human, not like a bot. I can't say more than that here because G will bring down the ban hammer on everything we know that is right and good in this world.
     
    Last edited: May 23, 2011