
Java + Selenium: Intelligent Bot Detection Algorithms..

Discussion in 'Other Languages' started by agag2, Jul 17, 2015.

  1. agag2

    agag2 Supreme Member

    Joined:
    Feb 17, 2009
    Messages:
    1,309
    Likes Received:
    254
    Hi

    I am trying to scrape data from WhitePages.com and 411.com however my program is always blocked by both websites after 1 query (sometimes even less).

    When I browse WhitePages.com manually I still get blocked. However, if I switch my browser (from Firefox to Chrome) I am able to search normally. But switching the user agent on Firefox (to chrome) does not work - I am blocked even when I search manually.

    Can anyone explain how they know I am not "actually" using Chrome?

    On 411.com I am blocked from the very first attempt by Distil Networks. Does anyone know how to get around it?

    P.S

    I am setting delays and randomizing things but to no avail.

    Thanks!
     
  2. jazzc

    jazzc Moderator Staff Member Moderator Jr. VIP

    Joined:
    Jan 27, 2009
    Messages:
    2,642
    Likes Received:
    11,355
    Occupation:
    Pusillanimous Knitter
    Location:
    Buenos Aires
  3. pasdoy

    pasdoy Power Member

    Joined:
    Jul 17, 2008
    Messages:
    790
    Likes Received:
    245
    Do you use proxies? I see they claim to have clients that I scrape multiple times a day.
     
  4. agag2

    agag2 Supreme Member

    Joined:
    Feb 17, 2009
    Messages:
    1,309
    Likes Received:
    254

    I am being limited within 1 search so using proxies would only extend it by 1 search per proxy.

    Manually I am able to get in a couple of searches. My question is how they detect automation..
     
  5. 9to5destroyer

    9to5destroyer Jr. VIP Jr. VIP

    Joined:
    Nov 14, 2011
    Messages:
    360
    Likes Received:
    207
    Are you able to use Fiddler to see how the HTTP requests differ between your normal browser and the automated one?

    A lot of the business listing websites check browser size, installed plugins, fonts, etc. I know from doing this with pure web requests that it's a pain to replicate, but it shouldn't be an issue when using a real browser. It could be something as daft as browser size.

    thanks
    9to5
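
A minimal sketch of the comparison suggested above, assuming you have already copied the request headers of a manual request and an automated request out of Fiddler. The class name `HeaderDiff` and the sample header values in `main` are made up for illustration; nothing here is taken from the sites discussed.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class HeaderDiff {
    /** Returns "manual value | automated value" for every header that differs. */
    static Map<String, String> diff(Map<String, String> manual, Map<String, String> automated) {
        // HTTP header names are case-insensitive, so normalise them before comparing.
        Map<String, String> left = normalise(manual);
        Map<String, String> right = normalise(automated);
        Set<String> names = new TreeSet<>(left.keySet());
        names.addAll(right.keySet());

        Map<String, String> differences = new TreeMap<>();
        for (String name : names) {
            String a = left.get(name);
            String b = right.get(name);
            if (!Objects.equals(a, b)) {
                differences.put(name,
                        (a == null ? "<missing>" : a) + " | " + (b == null ? "<missing>" : b));
            }
        }
        return differences;
    }

    private static Map<String, String> normalise(Map<String, String> headers) {
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, String> e : headers.entrySet()) {
            result.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical captures; paste your own values from Fiddler here.
        Map<String, String> manual = Map.of(
                "User-Agent", "ExampleBrowser/1.0",
                "Accept-Language", "en-US,en;q=0.5");
        Map<String, String> automated = Map.of(
                "User-Agent", "ExampleBrowser/1.0");
        diff(manual, automated).forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```

Headers are only one part of what a site can see; differences in viewport size, fonts, or plugins would not show up in this comparison.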
     
  6. shaggy93

    shaggy93 Senior Member

    Joined:
    Dec 23, 2010
    Messages:
    1,030
    Likes Received:
    449
    Location:
    0.0.0.0
    I have analyzed both websites; both of them run on nginx, which they can customize to block bad bots.

    When you request these websites, they use AJAX to pull the pages/results.

    Add this to your request headers -> X-Requested-With: XMLHttpRequest

    Besides that, you can try Selenium as well as PhantomJS (a headless browser).
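
A rough sketch of adding that header from Java, assuming Java 11+ and its built-in `java.net.http` client. The class name `AjaxRequestSketch` and the example.com URL are placeholders, and the request is only built here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class AjaxRequestSketch {
    /** Builds a GET request carrying the header that browsers' AJAX libraries commonly send. */
    static HttpRequest build(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("X-Requested-With", "XMLHttpRequest")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Placeholder URL; substitute the endpoint you observed in Fiddler.
        HttpRequest request = build("https://example.com/search?who=test");
        // Print the headers that would go out with the request.
        request.headers().map().forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```

To actually send it you would pass the request to `HttpClient.newHttpClient().send(...)`; whether this header alone changes how either site responds is not something the thread confirms.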
     
  7. agag2

    agag2 Supreme Member

    Joined:
    Feb 17, 2009
    Messages:
    1,309
    Likes Received:
    254
    I am aware of Chrome driver however I don't believe this will resolve the issue. I think if I start scraping with Chrome they will block Chrome too. Somehow they are detecting I am scraping but I don't know how.

    I've tried adding sophisticated delays (for example, delays between sending chars, delays between fields, before submitting etc etc) but it still does not work.

    The biggest question is that when I do it manually it works 5X more effectively..

    Thanks
     
  8. weedsmoker

    weedsmoker Junior Member

    Joined:
    May 2, 2011
    Messages:
    190
    Likes Received:
    79
    Got the same issue with webdriver/selenium and google. Something's off with selenium's default profile; also send keys is not good, and the focus, move, and hover events are not proper. I use win32 sendkeys for that. So I used a customized ff to reduce the browser footprint, running it from a vm and controlling it with mozrepl. Marionette can also be used, or another browser (chrome) with similar plugins/addons. Check browser viewport size, monitor resolution (easier with a vm), fonts, and plugins (ditch java); you can juggle multiple popular versions of the flash plugin or ditch it. Use only a popular os, os version, browser, browser version, etc. Reduce your uniqueness to disappear into the crowd :)
     
  9. agag2

    agag2 Supreme Member

    Joined:
    Feb 17, 2009
    Messages:
    1,309
    Likes Received:
    254

    I've already been using a custom profile (not the default one) but it does not seem to solve the problem..

    I also believe "sendKeys" may be causing the problem but I'm not sure why.. what is "win32 sendKeys" in Selenium?

    I've been trying to reduce footprint too... I know about "browser profiling" concept..

    Never heard of mozrepl though, will take a look.

    have you ever been able to scrape 411 or whitepages without getting blocked?
     
  10. weedsmoker

    weedsmoker Junior Member

    Joined:
    May 2, 2011
    Messages:
    190
    Likes Received:
    79
    I've used the python sendkeysctype lib; it's a set of python bindings for sending native key events:
    Code:
    https://msdn.microsoft.com/en-us/library/ms646310%28v=vs.85%29.aspx
    There are bindings for other languages as far as I know, but I don't know if there is a java one. Maybe try calling user32.dll directly, or something like this:
    Code:
    http://docs.oracle.com/javase/6/docs/api/java/awt/Robot.html
    You can send key presses to any application, but it has to be focused, if I'm not wrong.
    I didn't try to automate those sites, but I had similar problems automating the creation of google accounts. G detected selenium/webdriver and also phantomjs, which had the same issues sending key presses. I tested entering the data into the fields manually, and the problem was gone. Then I switched to python sendkeys, but also ditched selenium because of other issues.
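
For the Java side, a minimal sketch of what typing through the `java.awt.Robot` class linked above could look like. The class name `RobotTypingSketch`, the helper names, and the delay values are assumptions for illustration; only letters, digits, and spaces are handled, and the target window has to be focused by hand.

```java
import java.awt.AWTException;
import java.awt.Robot;
import java.awt.event.KeyEvent;

final class RobotTypingSketch {
    /** Maps a lower-case letter, digit, or space to its AWT virtual key code; -1 if unsupported. */
    static int keyCodeFor(char c) {
        if (c >= 'a' && c <= 'z') return KeyEvent.VK_A + (c - 'a');
        if (c >= '0' && c <= '9') return KeyEvent.VK_0 + (c - '0');
        if (c == ' ') return KeyEvent.VK_SPACE;
        return -1;
    }

    /** Sends OS-level key press/release pairs to whichever window currently has focus. */
    static void type(Robot robot, String text, int delayMs) {
        for (char c : text.toCharArray()) {
            boolean upper = Character.isUpperCase(c);
            int code = keyCodeFor(Character.toLowerCase(c));
            if (code < 0) continue; // characters outside the small supported set are skipped
            if (upper) robot.keyPress(KeyEvent.VK_SHIFT);
            robot.keyPress(code);
            robot.keyRelease(code);
            if (upper) robot.keyRelease(KeyEvent.VK_SHIFT);
            robot.delay(delayMs); // pause between characters
        }
    }

    public static void main(String[] args) throws AWTException {
        Robot robot = new Robot(); // throws AWTException in a headless environment
        robot.delay(3000);         // time to click into the target input field manually
        type(robot, "John Smith", 120);
    }
}
```

Unlike Selenium's `sendKeys`, this goes through the operating system's input queue, so it needs a real display and will type into whatever application happens to be focused.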
     
  11. agag2

    agag2 Supreme Member

    Joined:
    Feb 17, 2009
    Messages:
    1,309
    Likes Received:
    254
    Thanks for insight.

    I tried using the Robot class for other operations (scrolling etc) but not for sending keys in browser..

    From your post, I get that Selenium does not "send keys" like a normal user would, but the "sendkeysctype" lib does. However, I don't think there is something like this for Java (as far as I can see, aside from the Robot class, which I need to investigate).

    Do you know about the technical differences between how Selenium sends keys and how a normal user does? Or a reference on the topic?

    Thanks
     
  12. weedsmoker

    weedsmoker Junior Member

    Joined:
    May 2, 2011
    Messages:
    190
    Likes Received:
    79
    I think selenium simulates key presses ("send keys") as javascript synthesized key press events. You'll need proper focus/blur and optionally click/hover events associated with these key presses to simulate native behavior. I never had much luck with that on some sites (google or facebook, for example). Also, selenium won't let you focus or send keys to elements which are not focusable.
    For native send keys check this; maybe something similar can be done with java:
    Code:
    http://www.pinvoke.net/default.aspx/user32.sendinput
    I quickly tested 411.com by searching a few names. I didn't even use send keys; I executed js, set the #who element's value to the search name, and sent a click event to the search button. No problems, no blockage???
    Maybe something else causing a problem?
     
    Last edited: Jul 29, 2015
  13. agag2

    agag2 Supreme Member

    Joined:
    Feb 17, 2009
    Messages:
    1,309
    Likes Received:
    254
    I get from your post that native behavior has two properties:

    1. Element focus
    2. Click event

    And selenium does not simulate this, nor does js simulate it.

    Did you try automating whitepages? I can give you the script I used to check it out..

    Is it possible for Google to track us by detecting mouse movements (with Selenium there are no mouse movements, but a real user needs to move the mouse)?
     
  14. weedsmoker

    weedsmoker Junior Member

    Joined:
    May 2, 2011
    Messages:
    190
    Likes Received:
    79
    Checked whitepages, same shit as 411 :p, no problemo :), also without sending keys, just changing the input's value attribute and clicking the search button with the javascript click() method.
    Google tracks mouse movements and key presses; it's obvious in their source code. They don't let you easily screw around with an input's value attribute and there are lots of hidden fields, so it's easier to just send keys than replicate all that. I think you don't need to simulate mouse events, because lots of users navigate form fields and submit forms with the keyboard only. It would be bad for many users if cursor movement were mandatory; that's probably the reason why they don't require it. I think selenium has a mouse move action, but that's also javascript synthesized.
     
  15. agag2

    agag2 Supreme Member

    Joined:
    Feb 17, 2009
    Messages:
    1,309
    Likes Received:
    254
    What do you mean "without changing input's value attribute and clicking.. with JS click method"? Is it part of Python library?


    Wow.. you really seem to know your game.. I just read about Google tracking user mouse movements on SO - http://stackoverflow.com/questions/6667544/why-does-google-1-record-my-mouse-movements - what do you think are ways to circumvent this ?

    Although some users do not use mouse (like tablet and iPhone users) Google could know you are on desktop from user agent or if you used mouse once they can conclude you are on desktop so it's possible they track only desktop users.
     
  16. weedsmoker

    weedsmoker Junior Member

    Joined:
    May 2, 2011
    Messages:
    190
    Likes Received:
    79
    No, no, I didn't make myself clear. I mean just executing javascript for example on 411.com page:
    HTML:
    document.querySelector("#who").value = "YOUR SEARCH TERM";
    then firing off a click() event on the search button to execute the search. I think selenium has an evaluate method for executing javascript code (in the Java bindings it is JavascriptExecutor.executeScript).

    I think that they track you regardless. I mimicked mostly laptop users (screen resolutions) and didn't have problems with skipping mouse movement/simulation. Maybe something's changed lately; I haven't worked on google automation in the past few months.
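
A rough sketch of the same two steps driven from Java. To keep it runnable without Selenium on the classpath, it talks to a small `ScriptRunner` interface (my own name) that has the same shape as Selenium's `JavascriptExecutor.executeScript(String, Object...)`. The `#who` selector comes from the post above; `#search-button` is a made-up placeholder, since the thread never gives the button's selector.

```java
import java.util.Arrays;

final class JsFillSketch {
    /** Same shape as Selenium's JavascriptExecutor.executeScript(String, Object...). */
    interface ScriptRunner {
        Object run(String script, Object... args);
    }

    // Values are passed as script arguments instead of being concatenated into the JS source.
    static final String SET_VALUE = "document.querySelector(arguments[0]).value = arguments[1];";
    static final String CLICK = "document.querySelector(arguments[0]).click();";

    /** Sets the input's value, then clicks the button, both via injected JavaScript. */
    static void search(ScriptRunner js, String inputSelector, String buttonSelector, String term) {
        js.run(SET_VALUE, inputSelector, term);
        js.run(CLICK, buttonSelector);
    }

    public static void main(String[] args) {
        // Stand-in runner that only prints what would be executed.
        // With Selenium: ScriptRunner js = ((JavascriptExecutor) driver)::executeScript;
        ScriptRunner printer = (script, scriptArgs) -> {
            System.out.println(script + " " + Arrays.toString(scriptArgs));
            return null;
        };
        search(printer, "#who", "#search-button", "YOUR SEARCH TERM");
    }
}
```

Setting `value` this way does not fire the key events that real typing would, which is the trade-off discussed earlier in the thread.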
     
  17. rootjazz

    rootjazz Jr. VIP Jr. VIP

    Joined:
    Dec 21, 2012
    Messages:
    693
    Likes Received:
    340
    Occupation:
    Developer
    Location:
    UK
    Home Page:
  18. MrBlue

    MrBlue Senior Member

    Joined:
    Dec 18, 2009
    Messages:
    974
    Likes Received:
    680
    Occupation:
    Web/Bot Developer
    Been hammering both those sites without issue for the past two years using CasperJS. I'm not even setting a proper user-agent or using proxies!
     
  19. elisha13

    elisha13 Newbie

    Joined:
    Apr 19, 2013
    Messages:
    3
    Likes Received:
    0
    Would you mind sharing your bot with us? :)
     
  20. thecreep

    thecreep Newbie

    Joined:
    Aug 4, 2013
    Messages:
    46
    Likes Received:
    7
    Have you tried Ubot? I used it for scraping from sites like that. The only problem is that it is quite slow, because you have to simulate a real user clicking through the site.