
Mask as a bot/g0&gle? Really need help, business could be lost

Discussion in 'Black Hat SEO' started by wannabie, Jun 29, 2010.

Thread Status:
Not open for further replies.
  1. wannabie

    wannabie Elite Member

    Joined:
    Mar 11, 2009
    Messages:
    3,807
    Likes Received:
    2,954
    Occupation:
    Seo and Marketing Suprisingly
    Location:
    Your bedroom window
    Home Page:
    Really need help here - I was using proxies to crawl a website for some information about records/music, blah blah, but the site has started to block my private proxies. I need to buy some time, so I'm thinking: what if I mask my crawlers/scraper as an AdSense/search engine bot?

    Does anyone know how to set the user agent or headers to mask as a bot?

    My website, with 3 million pages of images and text, has come to a standstill and I need something to help!

    It's only information about records/vinyl.
     
  2. botdevs

    botdevs Newbie

    Joined:
    May 15, 2010
    Messages:
    18
    Likes Received:
    8
    Occupation:
    Developer
    Location:
    California
    Home Page:
    What crawler/scraper are you using?
     
  3. kelvinator44

    kelvinator44 Newbie

    Joined:
    Jun 11, 2009
    Messages:
    24
    Likes Received:
    2
    Well, you can mask your scraper by having it use the Googlebot user agent.
    See here for more info: hxxp://siteware.ch/webresources/useragents/spiders/google.html
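
    For what it's worth, here is a minimal sketch of what that looks like with a .NET HttpWebRequest (the UA string below is the standard Googlebot one; as others point out further down, a faked agent is still easy to detect, so take this as illustration only):

    Imports System.Net

    Module GooglebotUaExample
        Sub Main()
            ' Sketch: send a request with a Googlebot user agent string.
            ' Note: the site can still check the source IP, so this alone won't hide a scraper.
            Dim request As HttpWebRequest = DirectCast(WebRequest.Create("http://www.example.com/"), HttpWebRequest)
            request.UserAgent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
            Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
                Console.WriteLine(CInt(response.StatusCode))
            End Using
        End Sub
    End Module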
     
  4. wannabie

    wannabie Elite Member

    Joined:
    Mar 11, 2009
    Messages:
    3,807
    Likes Received:
    2,954
    Occupation:
    Seo and Marketing Suprisingly
    Location:
    Your bedroom window
    Home Page:

    Custom - it's just that we scrape a lot. The annoying thing is that the site does allow scraping, just 5k requests a day - we do that in an hour, easy!

    We have done it for a couple of years, no problems at all - we got "grassed" on by someone who used to work for the business, after stealing from us!

    Just need a way that they wouldn't notice too much. The only reason they noticed is that they are looking for proxy connections/same IPs, blah blah.

    They won't block a g-0gle bot, surely?

    Just need a few days to find a way around it.
     
  5. Numa68

    Numa68 Registered Member

    Joined:
    May 21, 2009
    Messages:
    78
    Likes Received:
    28
    Occupation:
    I break things
    Location:
    North Carolina
    I don't think simply changing your user agent will help. They are probably blocking your proxy IP addresses, and I'll bet the trick is to rotate through a lot of different proxies to keep each one under the 5k/day scraping limit. Something as simple as improper cookie management in your scraper might be giving you away, too.

    If I saw in my logs thousands of hits from Gbot coming from an IP block that didn't belong to the Big G, I would probably block the whole netblock.
     
    • Thanks Thanks x 1
  6. wannabie

    wannabie Elite Member

    Joined:
    Mar 11, 2009
    Messages:
    3,807
    Likes Received:
    2,954
    Occupation:
    Seo and Marketing Suprisingly
    Location:
    Your bedroom window
    Home Page:

    So what would you do to get around it?

    We tested with some private proxies and got a few hours' worth, but then they shut/blocked them.
     
  7. wannabie

    wannabie Elite Member

    Joined:
    Mar 11, 2009
    Messages:
    3,807
    Likes Received:
    2,954
    Occupation:
    Seo and Marketing Suprisingly
    Location:
    Your bedroom window
    Home Page:
    You have not read a whole sentence on this thread, have you?

    I'm not looking to earn from AdSense - god, does anyone still chase the AdSense dream anymore?
     
    Last edited by a moderator: Jun 30, 2010
  8. Numa68

    Numa68 Registered Member

    Joined:
    May 21, 2009
    Messages:
    78
    Likes Received:
    28
    Occupation:
    I break things
    Location:
    North Carolina
    A lot of your scraping strategy will depend on how you are scraping. Are you using PHP/cURL? iMacros? Or maybe a dedicated spidering/scraping program?

    If they've placed a hard limit of 5000 requests per day per IP, then you had better make sure that you don't do more than 208 hits an hour. To be safe you may want to make sure you're 10% below that, say 180 or so.

    Their server may also have an additional limit on the number of requests per IP within a shorter window than a day, using something like LimitIPConn for Apache. So even though you can do roughly 200 requests/hour/IP, it doesn't mean you can fire off those 180-200 requests in 5 minutes.
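
    To make that concrete, a rough sketch of pacing (assuming the 5k/day figure above; FetchPage is a placeholder for whatever scrape routine you already have):

    Imports System.Collections.Generic
    Imports System.Threading

    Module PacingExample
        Private ReadOnly Rng As New Random()

        ' Sketch: stay under ~180 requests/hour per IP (per the assumed 5k/day limit) by
        ' sleeping 20-35 seconds between requests, so there are no short bursts either.
        Sub CrawlPaced(ByVal urls As List(Of String))
            For Each url As String In urls
                ' FetchPage(url) ' placeholder for your existing scrape routine
                Thread.Sleep(Rng.Next(20000, 35000)) ' 20-35 s gap = roughly 100-180 hits/hour
            Next
        End Sub
    End Module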

    There may also be some Javascript checks that are failing if you are using PHP or other scripting languages, just to test whether you are a real browser or not. The best way to get around that, as inconvenient as it may be, is to use a real browser: iMacros, or something scripted in .NET that uses the IE library - anything that can process JS.

    Also, you may be having issues with your referrer. If you're doing direct hits and passing no referrer this may cause problems. And again, cookie management may play a huge role in your scraping problems. Even if you're using proxies, if the number of hits that have been accumulated in your cookiefile(s) hit the limit, then boom... you're done for.

    When you get blocked, do you get an error message?
     
    • Thanks Thanks x 1
  9. wannabie

    wannabie Elite Member

    Joined:
    Mar 11, 2009
    Messages:
    3,807
    Likes Received:
    2,954
    Occupation:
    Seo and Marketing Suprisingly
    Location:
    Your bedroom window
    Home Page:
    We are using a .NET HttpWebRequest with a proxy set on it. I have looked at the cookie objects but can't really work out how to use them effectively to stop us getting banned. Basically we are just getting 500 errors after the proxy has been live for about an hour.

    Any help appreciated!
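
    If it helps with the diagnosis: with HttpWebRequest a 500 comes back as a WebException, and you can pull the real status code and body out of it to see what the server actually sends when it blocks you. A rough sketch:

    Imports System.IO
    Imports System.Net

    Module BlockDiagnostics
        ' Sketch: when a request fails, inspect the WebException's response so you can see
        ' whether it is a rate limit, a ban page, or a genuine server error.
        Function TryScrape(ByVal request As HttpWebRequest) As String
            Try
                Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
                    Using reader As New StreamReader(response.GetResponseStream())
                        Return reader.ReadToEnd()
                    End Using
                End Using
            Catch ex As WebException
                Dim blocked As HttpWebResponse = TryCast(ex.Response, HttpWebResponse)
                If blocked IsNot Nothing Then
                    Console.WriteLine("Status: " & CInt(blocked.StatusCode) & " " & blocked.StatusDescription)
                    Using reader As New StreamReader(blocked.GetResponseStream())
                        Console.WriteLine(reader.ReadToEnd()) ' the body often explains the block
                    End Using
                End If
                Return Nothing
            End Try
        End Function
    End Module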
     
    Last edited: Jun 30, 2010
  10. Numa68

    Numa68 Registered Member

    Joined:
    May 21, 2009
    Messages:
    78
    Likes Received:
    28
    Occupation:
    I break things
    Location:
    North Carolina
    Weird about the 500s as that is an indication of an internal server error. Usually if you're getting blocked you'll get a code in the 400s, nothing at all, or an offsite redirect. I wonder if when you trace back a URL that is giving a 500 error and try to load that in a browser on a different IP, if you'll get a 500 there too. Maybe a malformed request, but it's odd for sure. I wonder if your proxy isn't munging something up?

    With the cookies, I would triple check and make sure that they are being purged after switching IPs. You want to give the appearance of distinct, unique sessions and any cookies lying around from previous scrape cycles can give you away.
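
    In .NET terms that roughly means giving your requests a CookieContainer and throwing the container away whenever you switch proxies - a minimal sketch, with the proxy-switching part left as a placeholder since it depends on your code:

    Imports System.Net

    Module CookieRotation
        ' Sketch: one CookieContainer per proxy "session"; discard it whenever the IP
        ' changes so every proxy looks like a fresh visitor.
        Private Cookies As New CookieContainer()

        Sub StartNewProxySession()
            Cookies = New CookieContainer() ' drop every cookie from the previous cycle
            ' ...switch your requests over to the new proxy here (depends on your setup)...
        End Sub

        Function BuildRequest(ByVal url As String) As HttpWebRequest
            Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
            request.CookieContainer = Cookies ' cookies persist only inside the current session
            Return request
        End Function
    End Module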
     
  11. snakemebite

    snakemebite Registered Member

    Joined:
    Jun 20, 2010
    Messages:
    78
    Likes Received:
    24
    Thanks for the great share.
     
    • Thanks Thanks x 1
  12. wannabie

    wannabie Elite Member

    Joined:
    Mar 11, 2009
    Messages:
    3,807
    Likes Received:
    2,954
    Occupation:
    Seo and Marketing Suprisingly
    Location:
    Your bedroom window
    Home Page:
    Thanks for the reply. It is a 500 error, and we are blocked, as the company told us. I tried using the same proxy IP on a different PC and browser - still the same. The thing is, the company is not clued up, so I can't work out how they are blocking so quickly; it must be automated.

    In regards to cookies, this is the code:
    http://msdn.microsoft.com/en-us/library/system.net.webresponse%28v=VS.71%29.aspx


    The only cookie control I can see is:

    http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.cookiecontainer.aspx

    but I can't see how we would use it. The site we are trying to hit is **d*isc*og*s.com if
    that helps at all (remove *).
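
    For the CookieContainer question specifically, the wiring is only a couple of lines on top of an HttpWebRequest - a minimal sketch (see also the per-proxy reset idea a few posts up):

    Imports System.Net

    Module CookieWiring
        ' Minimal sketch: share one CookieContainer across requests so Set-Cookie headers
        ' are stored and sent back automatically, the way a browser would.
        Private ReadOnly Jar As New CookieContainer()

        Function BuildRequest(ByVal url As String) As HttpWebRequest
            Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
            request.CookieContainer = Jar
            Return request
        End Function

        Sub ShowCookies(ByVal response As HttpWebResponse)
            ' With a container attached, response.Cookies lists what the site just set.
            For Each c As Cookie In response.Cookies
                Console.WriteLine(c.Name & "=" & c.Value)
            Next
        End Sub
    End Module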
     
  13. Numa68

    Numa68 Registered Member

    Joined:
    May 21, 2009
    Messages:
    78
    Likes Received:
    28
    Occupation:
    I break things
    Location:
    North Carolina
    Well, after looking at their API info, I have to ask... are you shuffling different API keys?

    Edit:
    Also meant to ask if you are tracking the "requests=" in the "resp" section of the XML?
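
    For reference, if the API really does report usage in a "resp" element as described above (I'm going from the post, not the docs), tracking it would just be a matter of reading that attribute from each response - a sketch, with the element/attribute names being assumptions to check:

    Imports System.Xml

    Module QuotaCheck
        ' Sketch, assuming (per the post above) the XML response carries something like
        ' <resp requests="1234" ...>; adjust the names to whatever the API actually returns.
        Function RequestsUsed(ByVal responseXml As String) As Integer
            Dim doc As New XmlDocument()
            doc.LoadXml(responseXml)
            Dim resp As XmlElement = TryCast(doc.SelectSingleNode("//resp"), XmlElement)
            If resp IsNot Nothing AndAlso resp.HasAttribute("requests") Then
                Return Integer.Parse(resp.GetAttribute("requests"))
            End If
            Return -1 ' attribute not found
        End Function
    End Module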
     
    Last edited: Jun 30, 2010
  14. catman08

    catman08 Junior Member

    Joined:
    Jan 11, 2008
    Messages:
    171
    Likes Received:
    109
    Occupation:
    IM
    Location:
    Europe
    Hehehe ... well, I have seen both sides.

    Here are a couple of things:

    Usually it goes like this: you create a scraping strategy - they take a countermeasure - you do something again - they do something again, etc.

    That's what I have experienced - as said, on both ends.

    There are ways to hide your scraper though:

    1.) Proxies (no public ones or Tor exits that are used by everyone - you need private ones).
    Private proxies you have purchased and that follow 3 rules:
    a.) do not run on a standard proxy port
    b.) do not add something like X-Forwarded-For etc. to the header
    c.) don't have a webserver running where a SOCKS connection could be established too ;-)

    2.) Use the power of VPN tunnels (of course make sure you change the tunnel frequently)

    3.) Ever thought of translating from English to English using one of the most famous engines out there ;-) *hint *hint

    That having been said ... here is what will not work:
    1.) Just changing the user agent to Google or Yahoo - it's pretty easy to spot faked agents by doing a reverse DNS look-up (a sketch of that check follows below)
    2.) If your crawler does not support cookies you've lost - make sure it does
    3.) The cookies must of course be cleaned automatically from time to time - say every few page requests - and especially when the IP is changed
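
    To illustrate why the fake Googlebot agent is so easy to catch, this is roughly all the site has to do on its end (standard forward-confirmed reverse DNS, sketched in .NET here only because that's the language already used in this thread):

    Imports System.Net

    Module GooglebotCheck
        ' Sketch of the reverse-DNS check described above, from the site's point of view:
        ' resolve the client IP to a hostname, require googlebot.com/google.com, then
        ' resolve that hostname forward again and confirm it maps back to the same IP.
        Function IsRealGooglebot(ByVal clientIp As String) As Boolean
            Try
                Dim host As String = Dns.GetHostEntry(clientIp).HostName
                If Not (host.EndsWith(".googlebot.com") OrElse host.EndsWith(".google.com")) Then
                    Return False
                End If
                For Each addr As IPAddress In Dns.GetHostEntry(host).AddressList
                    If addr.ToString() = clientIp Then Return True ' forward-confirmed
                Next
            Catch ex As Exception
                ' DNS failure: treat as not verified
            End Try
            Return False
        End Function
    End Module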

    There are more things you need to consider - but those are the basics. Believe me, I have created software for both ends ;-)

    Hope that helped.

    cheers
    catman
     
    • Thanks Thanks x 2
  15. catman08

    catman08 Junior Member

    Joined:
    Jan 11, 2008
    Messages:
    171
    Likes Received:
    109
    Occupation:
    IM
    Location:
    Europe
    P.S. Regarding the error code - you can force the server to respond with a 500 code instead of a 400 to confuse the scraper creators. ;-)
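
    Purely as an illustration of that trick (sketched as an ASP.NET Global.asax handler in VB only to stay in this thread's language - the real site could be doing it in anything, and the IP here is just an example):

    Imports System.Collections.Generic
    Imports System.Web

    ' Sketch: answer blocked IPs with a 500 so scrapers can't tell a ban from a genuine
    ' server fault.
    Public Class Global_asax
        Inherits HttpApplication

        Private Shared ReadOnly BlockedIps As New HashSet(Of String) From {"203.0.113.5"} ' example IP

        Sub Application_BeginRequest(ByVal sender As Object, ByVal e As EventArgs)
            If BlockedIps.Contains(Request.UserHostAddress) Then
                Response.StatusCode = 500 ' looks like an internal error, is actually a block
                Response.End()
            End If
        End Sub
    End Class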
     
  16. wannabie

    wannabie Elite Member

    Joined:
    Mar 11, 2009
    Messages:
    3,807
    Likes Received:
    2,954
    Occupation:
    Seo and Marketing Suprisingly
    Location:
    Your bedroom window
    Home Page:
    Reply to Numa68:
    We are not using the API as it doesn't give the sales data we need.

    Reply to catman:

    1. Got private proxies:
    a) using port 60655
    b) not doing this
    c) no
    2. No idea about this???
    3. They ban you really quickly - we looked at this

    1. We are using the following code to crawl (we have tried hundreds of variations):

    Public Shared Function NewScrape(ByVal url As String, ByVal proxyUrl As String) As StringBuilder
        Dim retval As New StringBuilder()
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
        request.Accept = "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"
        request.ProtocolVersion = HttpVersion.Version10
        request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)"
        request.ContentType = "application/x-www-form-urlencoded"

        ' Route the request through the proxy supplied by the caller.
        Dim myProxy As New WebProxy(New Uri(proxyUrl))
        myProxy.Credentials = New NetworkCredential() ' fill in proxy credentials if required
        request.Proxy = myProxy

        ' Note: no CookieContainer is set, so no cookies are stored or sent back.
        Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
            Using dataStream As Stream = response.GetResponseStream()
                Using reader As New StreamReader(dataStream)
                    retval.Append(reader.ReadToEnd())
                End Using
            End Using
        End Using
        Return retval
    End Function

    3. How on earth do you clean the cookies? They come from the server and we are not using a browser - sorry, you have lost me here.

    Just as a side note, if anyone solves this we will give you £50.00

    Thanks.
     
  17. wannabie

    wannabie Elite Member

    Joined:
    Mar 11, 2009
    Messages:
    3,807
    Likes Received:
    2,954
    Occupation:
    Seo and Marketing Suprisingly
    Location:
    Your bedroom window
    Home Page:
    Just did a quick test: we were blocked in less than 30 mins. Something's not right.
     
  18. catman08

    catman08 Junior Member

    Joined:
    Jan 11, 2008
    Messages:
    171
    Likes Received:
    109
    Occupation:
    IM
    Location:
    Europe
    Hi dansomers,

    @ 3. "They ban you really quick, we looked at this"
    - It depends on how you do it and what translation engine you use ;-) Better said, you need to crawl through the translation engines by putting them in the middle, so you are kind of crawling the site indirectly, as sketched below.
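
    A rough sketch of that idea (the translate.google.com URL format here is an assumption - verify the current parameters before relying on it):

    Imports System.IO
    Imports System.Net

    Module TranslateProxyFetch
        ' Sketch of "scraping through a translation engine": ask the translator to render the
        ' target page (en -> en) and scrape the translator's output instead of hitting the
        ' site directly, so the fetch comes from the translator's IPs.
        Function FetchViaTranslator(ByVal targetUrl As String) As String
            Dim proxyUrl As String = "http://translate.google.com/translate?sl=en&tl=en&u=" & Uri.EscapeDataString(targetUrl)
            Dim request As HttpWebRequest = DirectCast(WebRequest.Create(proxyUrl), HttpWebRequest)
            request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0"
            Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
                Using reader As New StreamReader(response.GetResponseStream())
                    ' Note: the returned HTML is the translator's wrapper; the original markup
                    ' may be rewritten, so the parser will probably need adjusting.
                    Return reader.ReadToEnd()
                End Using
            End Using
        End Function
    End Module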

    @ "3. How on earth do you clean the cookies ..."

    Well, the cookies are stored in your browser, or better said in whatever directory you tell your bot to store them in. For instance, cURL stores cookies as single lines in a predefined text file; all you have to do then is clean that file, or just delete it and let the bot create a new one. In the case of a real browser, just delete its cookie files.

    However, it totally depends on your crawling pattern and less on the algorithm. To crawl at that volume you need to change the pattern constantly and always appear to be a valid user:

    - change between valid user agents from time to time
    - never crawl at the same times and days
    - never crawl at the same speed
    - change IPs very frequently
    - add random delays between each page crawl (2-15 seconds), like a normal user would browse
    - never crawl several pages in the same second
    - also make page hits to the other files on the page, like images, CSS files etc. (most crawlers can be detected by checking whether they only fetch the HTML file but never load the CSS files that a browser would load)
    - never use the same port (proxies)
    - delete cookies before an IP change
    - make sure your proxies do not add X-Forwarded-For etc. to the header
    - crawl at times when the most traffic hits the page anyway, to hide your crawler in the normal traffic ... etc.

    There is a lot you can and must dynamically adjust - a rough sketch of the delay/asset-fetch part follows below.
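
    A very rough sketch of two items from the list above (FetchPage and ExtractAssetUrls are placeholders, not working code for your exact setup):

    Imports System.Collections.Generic
    Imports System.Threading

    Module HumanishCrawl
        Private ReadOnly Rng As New Random()

        ' Sketch: random 2-15 second pauses between pages, plus fetching the page's
        ' CSS/images so the hit pattern looks more like a browser than a bare crawler.
        Sub CrawlLikeAUser(ByVal pageUrls As List(Of String))
            For Each pageUrl As String In pageUrls
                Dim html As String = FetchPage(pageUrl)        ' your existing scrape routine
                For Each assetUrl As String In ExtractAssetUrls(html)
                    FetchPage(assetUrl)                        ' pull CSS/JS/images like a browser would
                Next
                Thread.Sleep(Rng.Next(2000, 15000))            ' 2-15 s pause, as suggested above
            Next
        End Sub

        Function FetchPage(ByVal url As String) As String
            Return "" ' placeholder
        End Function

        Function ExtractAssetUrls(ByVal html As String) As List(Of String)
            Return New List(Of String)() ' placeholder
        End Function
    End Module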

    hope that helps

    P.S. I would suggest spending more time adjusting the code to create a highly dynamic crawler that stores each crawled page as an HTML file on your server. Later you can just use a parser that extracts the relevant information from all stored files at once. (Just use simple parsing patterns to find the CSS files that could be loaded, if you don't know them yet - if you know them, just hard-code them into the crawler.)

    P.P.S. Thanks for the offer of 50 pounds - but I think for a business-saving solution you've got to pay much, much more. Where I am, 50 pounds is not even an hour's work. ;-) If I were to create a specific stealth crawler that does your stuff, it would take a couple of hours.
     
  19. catman08

    catman08 Junior Member

    Joined:
    Jan 11, 2008
    Messages:
    171
    Likes Received:
    109
    Occupation:
    IM
    Location:
    Europe
    P.S. Hehehehe - if my tips solved your issue, feel free to transfer the 50 pounds to my PayPal :-D
     
  20. wannabie

    wannabie Elite Member

    Joined:
    Mar 11, 2009
    Messages:
    3,807
    Likes Received:
    2,954
    Occupation:
    Seo and Marketing Suprisingly
    Location:
    Your bedroom window
    Home Page:
    @ catman - thanks for your reply, will try those out today.

    Does anyone have a working example of scraping through a translation engine?
     