
How do I ensure anonymity and stay under the radar (scraping via proxy)?

Discussion in 'Proxies' started by jesse_pinkman, Feb 14, 2013.

  1. jesse_pinkman

    jesse_pinkman Newbie

    Joined:
    Feb 14, 2013
    Messages:
    13
    Likes Received:
    1
    I've written a C# application that retrieves financial information from a public site. It'll make an HTTP request once every three seconds, twelve hours per day. That's a total of 14,400 hits per day.


    Using a set of ten proxies from BuyProxies.org should give me the anonymity I'm looking for (correct?).
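
    In case a sketch helps: here's roughly what the loop looks like (a minimal sketch, not my production code; the proxy addresses and target URL are placeholders):

    Code:
    using System.IO;
    using System.Net;
    using System.Threading;

    class ScraperSketch
    {
        // Placeholder proxies: substitute the ten from your provider.
        static readonly string[] Proxies =
        {
            "http://203.0.113.10:8080",
            "http://203.0.113.11:8080",
            // ... eight more ...
        };

        static int _next;

        static string Fetch(string url)
        {
            // Round-robin: each request goes out through the next proxy in the list.
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.Proxy = new WebProxy(Proxies[_next++ % Proxies.Length]);

            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
                return reader.ReadToEnd();
        }

        static void Main()
        {
            while (true)
            {
                string html = Fetch("http://example.com/quote/XYZ"); // placeholder URL
                // ... parse the price out of html ...
                Thread.Sleep(3000); // one request every three seconds
            }
        }
    }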


    My concern is that the web site operators will learn that they're being scraped and implement protective measures. I'm taking precautions of my own to stay "under the radar" and would like your opinion on whether my precautions are reasonable or flawed.


    (1) My goal is to generate traffic patterns (across ten proxy IPs) that increase their site's daily traffic by no more than 1%. The assumption is that a 1% daily increase won't raise any red flags on their end. Is 1% reasonable?


    (2) For my 1% figure to make any sense, I'd need to obtain a reasonable approximation of the total number of hits their site is ALREADY getting. How do I go about doing that? Here's what I've done so far:


    I've gone to Alexa.com and determined that each day, 0.01170% of the "global Internet population" views the site in question. Then, from InternetWorldStats.com, I determined that the "global Internet population" is 2,405,518,376. This means the site is getting 281,446 users per day. Alexa says that each user executes an average of 6.4 page views per session, bringing the estimated total page views per day to 1,801,252.


    Does the logic make sense? If I'm executing 14,400 HTTP queries per day, their overall daily hit count will increase by only 0.80%. That's a modest increase, but is it modest enough not to raise a red flag?
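
    Spelling the arithmetic out (same figures as above):

    Code:
    double internetPopulation = 2405518376.0;  // InternetWorldStats.com
    double dailyReach         = 0.0001170;     // 0.01170% of users (Alexa)
    double pageViewsPerUser   = 6.4;           // Alexa average per session

    double usersPerDay = internetPopulation * dailyReach;  // ~281,446
    double siteViews   = usersPerDay * pageViewsPerUser;   // ~1,801,252 per day
    double myShare     = 14400.0 / siteViews * 100.0;      // ~0.80%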


    This is my first shot at screen scraping, so sorry if my questions seem naive.


    Appreciate your thoughts. Thanks!
     
  2. cornerpath

    cornerpath Newbie

    Joined:
    Sep 29, 2012
    Messages:
    16
    Likes Received:
    0
    I don't know, but I'll get someone here
     
  3. bpmik

    bpmik Newbie

    Joined:
    Feb 4, 2013
    Messages:
    45
    Likes Received:
    8
    Do you really need 14,000+ hits per day? That could raise red flags, especially if the admin uses a script to compare visitor IPs to known proxy IPs.

    I am interested in building scrapers too. I set my delays to a random range: sometimes 2 seconds, other times 12, and other times I go to other sites and come back 4 minutes later (as if I were multi-tab browsing, went off on a tangent, then came back at a later point in time). Keep your Referer headers in order, your user agent string too, and try to appear as a normal user, which is hard if you NEED so much data. Again I ask: do you need all this data every day (14,000+ queries), or can you gather it more slowly?
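
    Something along these lines (a rough sketch; the exact ranges and percentages are just what I'd pick, so tune to taste):

    Code:
    static readonly Random Rng = new Random();

    static int NextDelayMs()
    {
        // Mostly short, human-ish pauses (2-12 seconds)...
        int ms = Rng.Next(2000, 12000);

        // ...but occasionally wander off for a few minutes, as if the
        // "user" opened another tab and came back later.
        if (Rng.Next(100) < 5)             // roughly 5% of requests
            ms = Rng.Next(120000, 300000); // 2 to 5 minutes

        return ms;
    }

    // between requests:
    Thread.Sleep(NextDelayMs());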

    If it's 14,000+ fresh pieces of data every day and that's what you gotta do, I don't see how to do it without red flags from a qualified server admin. If they are nobodies, sure, nobody will probably notice unless you get unlucky with some uber-admin, but if it's a big public company like the NYSE or something, I would imagine they have at least a full-time server admin on staff who knows how to write scripts and such; it's rather trivial to write a lookup script to compare IPs against known proxy lists.

    I am way more paranoid than most. But if I were an admin and I didn't want my site scraped, you bet that would be my policy.
     
  4. jesse_pinkman

    jesse_pinkman Newbie

    Joined:
    Feb 14, 2013
    Messages:
    13
    Likes Received:
    1
    Hi bpmik,


    Thanks for the response. Yes, 14,000 queries is a lot. Perhaps I can find a way to reduce the total quantity.


    The reason the query count is so high is that I'm dealing with live prices that can become stale quickly. The way their site works, each security requires its own page hit.


    The user agent string sent with each request mimics Safari.


    You make an interesting point about the referrer. I've been leaving it blank. Perhaps I should stick something in there.
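
    Maybe something like this on each request (the Safari string is a real one; the referrer URL is just an example of a plausible internal page, not the actual site):

    Code:
    var request = (HttpWebRequest)WebRequest.Create(url);

    // Safari 6 on OS X.
    request.UserAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) " +
                        "AppleWebKit/536.26.17 (KHTML, like Gecko) " +
                        "Version/6.0.2 Safari/536.26.17";

    // A real visitor usually arrives from somewhere on the same site,
    // so an internal page is more plausible than an empty referrer.
    request.Referer = "http://example.com/markets"; // placeholder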


    I can't help but think that I'd keep a lower profile by using more proxies. I'm using ten proxies in a round-robin fashion right now. So 14,000 queries spread over ten proxies is 1,400 queries per IP address. I can buy 100 proxies from my provider. If I did that, it would only be 140 queries per IP.


    So I guess the overall strategy is: Reduce queries, increase proxy count...
     
  5. bpmik

    bpmik Newbie

    Joined:
    Feb 4, 2013
    Messages:
    45
    Likes Received:
    8
    Probably fine. Running 24/7 is a wee bit above my threshold, but maybe I just need more balls. I would watch your proxies, in case any get banned or pages suddenly stop coming. Check HTTP status codes on each request, make sure everything is fine, and if it's not, throw an exception, log it, and review it (maybe even halt the script in the early days until you can manually review it all).
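
    i.e. something like this around each request (a rough sketch; the log file and helper are just illustrative):

    Code:
    static void Log(string proxy, string message)
    {
        File.AppendAllText("scrape.log",
            DateTime.Now + "  " + proxy + "  " + message + "\r\n");
    }

    // per request:
    try
    {
        // HttpWebRequest throws WebException for 4xx/5xx, so anything
        // reaching here should be a 2xx (or an auto-followed redirect).
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            if (response.StatusCode != HttpStatusCode.OK)
                Log(proxy, "Unexpected status: " + response.StatusCode);
            // ... read and parse the body ...
        }
    }
    catch (WebException ex)
    {
        // Timeouts, refused connections, 403/503 from a ban, etc.
        Log(proxy, ex.Message);
        throw; // halt and review manually while the script is young
    }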

    That's just my paranoia talking, for what it's worth :)

    Hope everything goes well. Realistically, spread across 100 IPs, that's a lot of cover.
     
  6. faster

    faster Jr. VIP Premium Member

    Joined:
    Jan 3, 2011
    Messages:
    1,730
    Likes Received:
    184
    To ensure anonymity, I recommend (a) using anonymous proxies from trusted sources, (b) clearing your browser cookies after each session or using a free tool like CCleaner, and (c) disabling Flash and other add-ons.

    To stay under the radar, I recommend (a) ensuring your proxies are private, (b) ensuring your activity does not violate terms of service for websites you visit, and (c) coming across as normal browsing activity.
     
  7. SEO20

    SEO20 Elite Member

    Joined:
    Mar 25, 2009
    Messages:
    2,017
    Likes Received:
    2,259
    You need more proxies than that for that number of requests. You need to rotate browser headers AND think about your plan again: do I really need this many requests? Can this be done any simpler? Remember: as simple as possible, but not simpler ;-)
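
    For example, keep a small pool of real browser strings (keep them current) and pick one per request:

    Code:
    static readonly string[] UserAgents =
    {
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 "
            + "(KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/536.26.17 "
            + "(KHTML, like Gecko) Version/6.0.2 Safari/536.26.17"
    };

    static readonly Random Rng = new Random();

    // per request:
    request.UserAgent = UserAgents[Rng.Next(UserAgents.Length)];

    One caveat: a single IP that changes browsers on every request looks odd too, so it reads more naturally if each proxy sticks with one user agent, like one consistent visitor per IP.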