
Python + Selenium Leaving Footprints?

Discussion in 'Programming' started by TiagoS, Jun 10, 2017.

  1. TiagoS

    TiagoS Jr. VIP Jr. VIP

    Joined:
    Jul 5, 2014
    Messages:
    338
    Likes Received:
    158
    Hey guys! I'm developing a bot that scrapes prices from a specific website and I'm running into a problem. Somehow, the website is able to tell that I'm using Selenium with the Chrome webdriver. When I scrape through my bot, it shows fake, higher prices than when I search manually. (I've searched around and it is confirmed that the website does that; other people have mentioned it.) Here is what I have tried so far:

    - Clearing cookies/cache
    - Random delays
    - Changing user agent
    - Using proxies

    I'm looking for ideas about what I might be doing wrong. What kind of footprints am I leaving? What's even worse is that sometimes it works and sometimes it does not (it works 4 out of 10 times).

    I am even willing to pay if someone manages to come up with a solution (PM me if you can).
     
  2. GoDesain

    GoDesain Regular Member

    Joined:
    Feb 26, 2011
    Messages:
    480
    Likes Received:
    190
    That's new to me. Can you share the URL with me?
    I'll try to check.
    About clearing cookies and cache (or start in private mode):
    Code:
    from selenium import webdriver
    # Disable Firefox's disk, memory, and offline caches for this profile.
    profile = webdriver.FirefoxProfile()
    profile.set_preference("browser.cache.disk.enable", False)
    profile.set_preference("browser.cache.memory.enable", False)
    profile.set_preference("browser.cache.offline.enable", False)
    profile.set_preference("network.http.use-cache", False)
    Random delay:
    Code:
    from random import randint
    from time import sleep
    sleep(randint(10,100))
    Random useragent:
    Code:
    from fake_useragent import UserAgent
    from selenium import webdriver
    profile = webdriver.FirefoxProfile()
    profile.set_preference("general.useragent.override", UserAgent().random)
    profile.update_preferences()
    Proxy:
    Code:
    from selenium import webdriver
    profile = webdriver.FirefoxProfile()
    profile.set_preference('network.proxy.type', 1)  # 1 = manual proxy configuration
    profile.set_preference('network.proxy.http', "xxx.xxx.xxx.xxx")
    profile.set_preference('network.proxy.http_port', 3128)
    profile.update_preferences()
     
    • Thanks Thanks x 3
  3. LostLife

    LostLife Regular Member

    Joined:
    May 12, 2017
    Messages:
    265
    Likes Received:
    288
    Gender:
    Male
    Occupation:
    Software Engineer
    Are you sure the website is not changing prices randomly? Please check manually 20-30 times first. Please also make sure the cookies are actually being deleted. Can you share the URL, please?
     
  4. living2xl

    living2xl Jr. VIP Jr. VIP

    Joined:
    Dec 9, 2011
    Messages:
    1,714
    Likes Received:
    401
    Occupation:
    Sippin dat juice - Shout it louder!
    Location:
    Not sleeping!
    Home Page:
    Why not add a filter that scrapes the same URL several times and discards any scrape whose price is higher than x times the median?
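A minimal sketch of that kind of filter, assuming the repeated scrapes of one URL have already been collected into a list of numbers. The function name `filter_price_outliers`, the `max_ratio` threshold, and the sample prices are made up for illustration.

```python
from statistics import median


def filter_price_outliers(prices, max_ratio=1.5):
    """Keep only prices that are at most max_ratio times the batch median."""
    if not prices:
        return []
    mid = median(prices)
    return [p for p in prices if p <= max_ratio * mid]


# Hypothetical prices from six scrapes of the same URL.
samples = [19.99, 20.49, 34.99, 19.99, 35.50, 20.10]
kept = filter_price_outliers(samples)
print(kept)
```

One caveat: if the inflated price shows up in more than half of the scrapes (the original post says the bot only gets good results about 4 times out of 10), the median itself will be an inflated price and this filter will keep the wrong values. In that case comparing against the minimum of the batch may be a safer baseline.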
     
  5. Google Prince

    Google Prince Jr. VIP Jr. VIP

    Joined:
    Dec 24, 2015
    Messages:
    165
    Likes Received:
    96
    Location:
    Google's Search Engine
    It's most likely ChromeDriver, which is known to leave footprints.

    Change the webdriver to Firefox; that might help.
     
  6. jamie3000

    jamie3000 Supreme Member

    Joined:
    Jun 30, 2014
    Messages:
    1,400
    Likes Received:
    636
    Occupation:
    Finance coder looking for semi-retirement
    Location:
    uk
    Distil Networks (https://www.distilnetworks.com) is able to detect Selenium; I'm not sure if they ever elaborated on how. I think it was through some sort of JS probing.
     
  7. jamie3000

    jamie3000 Supreme Member

    Joined:
    Jun 30, 2014
    Messages:
    1,400
    Likes Received:
    636
    Occupation:
    Finance coder looking for semi-retirement
    Location:
    uk
  8. akio_daichi

    akio_daichi Newbie

    Joined:
    Jul 4, 2017
    Messages:
    2
    Likes Received:
    0
    Gender:
    Male
    Can you share the site's URL? I don't think this site is using Distil Networks' services, because they block you immediately.
     
  9. pasdoy

    pasdoy Power Member

    Joined:
    Jul 17, 2008
    Messages:
    782
    Likes Received:
    245
  10. pasdoy

    pasdoy Power Member

    Joined:
    Jul 17, 2008
    Messages:
    782
    Likes Received:
    245
  11. Tozzy

    Tozzy Jr. VIP Jr. VIP

    Joined:
    Nov 26, 2015
    Messages:
    420
    Likes Received:
    122
    Gender:
    Male
    Location:
    World
    Home Page:
    It is not so hard to detect an automated browser, even with plain JS and no rocket science.
    Who knows how this or that website does it, but one can instantly think of 10 or 20 ways to do it.
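To make "plain JS" concrete, here is a rough sketch of a self-check you could run from the bot side. It asks the page for a few standard `navigator` properties that a site could inspect just as easily. Which signals any particular site actually looks at is unknown; the function name `probe_automation_signals` and the choice of checks are assumptions for illustration. The only Selenium call used is `driver.execute_script`.

```python
# JavaScript evaluated in the page. All of these are standard navigator properties.
PROBE_JS = """
return {
    webdriver: navigator.webdriver === true,
    pluginCount: navigator.plugins.length,
    languageCount: (navigator.languages || []).length,
    userAgent: navigator.userAgent
};
"""


def probe_automation_signals(driver):
    """Run PROBE_JS via driver.execute_script and list values that look automated."""
    info = driver.execute_script(PROBE_JS)
    flags = []
    if info.get("webdriver"):
        flags.append("navigator.webdriver is true")
    if info.get("pluginCount", 0) == 0:
        flags.append("no plugins reported")
    if info.get("languageCount", 0) == 0:
        flags.append("no languages reported")
    if "HeadlessChrome" in info.get("userAgent", ""):
        flags.append("user agent mentions HeadlessChrome")
    return flags


# Usage with a real Selenium driver (not run here):
#   driver.get(url)
#   print(probe_automation_signals(driver))
```

An empty list only means these particular checks came back clean; it says nothing about the other ways a site might fingerprint the browser.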
     
  12. B.Shahin

    B.Shahin Junior Member

    Joined:
    Dec 10, 2016
    Messages:
    107
    Likes Received:
    17
    Gender:
    Male
    This is the first time I've seen something like this.
    I am using Selenium for FB, and it has been working for months with no errors.
    I am also using ChromeDriver!
    So I'm just following here to see if there is something that needs to be added!
     
  13. jamie3000

    jamie3000 Supreme Member

    Joined:
    Jun 30, 2014
    Messages:
    1,400
    Likes Received:
    636
    Occupation:
    Finance coder looking for semi-retirement
    Location:
    uk
    You can always recompile PhantomJS and other headless browsers from source if you can find out which JS enumeration is used for detection.
     
    • Thanks Thanks x 1
  14. MrPenguin

    MrPenguin Jr. VIP Jr. VIP

    Joined:
    Mar 10, 2010
    Messages:
    114
    Likes Received:
    17
    Use Firefox. Install the Random User Agent extension. It will randomize a lot of different settings. Should work.
     
  15. rootjazz

    rootjazz Jr. VIP Jr. VIP

    Joined:
    Dec 21, 2012
    Messages:
    685
    Likes Received:
    329
    Occupation:
    Developer
    Location:
    UK
    Home Page:
    Lots of things could be at play.

    A website can know (if it chooses to) what normal browsing behaviour looks like, within some standard deviations: time spent reading, time between clicks, and so on.
    Also, JS events will not be triggered. No mouse action could be a flag against you, and just programmatically clicking on things would not give a realistic event pattern. You should be able to check this in a browser debug tool.

    I've had to automate some sites where I had to control the actual mouse in order to get around the detections, and then move the mouse in a realistic way (you can look at curve generation / Bezier curves).
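For the curve generation part, a minimal sketch of a cubic Bezier path between two screen points, with randomly offset control points so that no two paths are identical. The function name `bezier_path` and the `steps`/`spread` parameters are illustrative assumptions, and actually feeding the points to the mouse is left out.

```python
import random


def bezier_path(start, end, steps=30, spread=100, rng=None):
    """Return steps + 1 (x, y) points along a cubic Bezier curve from start to end."""
    rng = rng or random.Random()
    (x0, y0), (x3, y3) = start, end
    # Two control points near the straight line, pushed off it by a random
    # offset so the pointer follows a curve instead of a ruler-straight path.
    x1 = x0 + (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y1 = y0 + (y3 - y0) / 3 + rng.uniform(-spread, spread)
    x2 = x0 + 2 * (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y2 = y0 + 2 * (y3 - y0) / 3 + rng.uniform(-spread, spread)
    points = []
    for i in range(steps + 1):
        t = i / steps
        u = 1 - t
        # Standard cubic Bezier formula, evaluated separately for x and y.
        x = u**3 * x0 + 3 * u**2 * t * x1 + 3 * u * t**2 * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u**2 * t * y1 + 3 * u * t**2 * y2 + t**3 * y3
        points.append((x, y))
    return points


# Seeded only so the example is repeatable; a real run would not fix the seed.
path = bezier_path((10, 10), (400, 300), rng=random.Random(42))
```

Each point would then be passed, with small random pauses in between, to whatever is moving the pointer. Evenly spaced `t` values give a fairly uniform speed, so a more convincing version would probably also ease in and out.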
     
    • Thanks Thanks x 2
  16. rolax

    rolax Registered Member

    Joined:
    Dec 26, 2016
    Messages:
    62
    Likes Received:
    9
    Did you just have some movement of the mouse, or did the mouse actually navigate to the pixel location and hover over the buttons you wanted to click?
     
  17. itz_styx

    itz_styx Jr. VIP Jr. VIP

    Joined:
    May 8, 2012
    Messages:
    620
    Likes Received:
    289
    Occupation:
    CEO / Admin / Developer
    Location:
    /dev/mem
    Home Page:
    There is a good fingerprinting method that I won't name here (no need for whitehats to get ideas). That same method also allows detecting what type of browser you are using, e.g. Selenium.
     
  18. BloodyNinja

    BloodyNinja Power Member

    Joined:
    Oct 28, 2013
    Messages:
    600
    Likes Received:
    567
    Location:
    Deeptown
    I will only say here that no one has a good solution to this yet.
    There are some known or less known half-measures but a robust solution doesn't exist.
    Most people in this topic have no idea about the problem.
     
  19. Lothric

    Lothric Junior Member

    Joined:
    Apr 25, 2017
    Messages:
    113
    Likes Received:
    11
    This UserAgent().random doesn't look good enough. I just tested it and it spits out garbage like Chrome 29, 41... That doesn't make sense to me. If it uses real-world stats, it should return versions like 59, or at least 57. Am I missing something here?
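One possible workaround, sketched under the assumption that you can draw several candidate strings from whatever generator you use: parse the Chrome major version out of each user agent and reject anything older than a cutoff. The helper names, the cutoff of 57, and the sample strings below are illustrative only.

```python
import re

# Matches the major version in a "Chrome/59.0.3071.115" style token.
CHROME_VERSION_RE = re.compile(r"Chrome/(\d+)\.")


def chrome_major_version(user_agent):
    """Return the Chrome major version found in the string, or None."""
    match = CHROME_VERSION_RE.search(user_agent)
    return int(match.group(1)) if match else None


def pick_recent_chrome_ua(candidates, min_version=57):
    """Return the first candidate with Chrome major version >= min_version, else None."""
    for ua in candidates:
        version = chrome_major_version(ua)
        if version is not None and version >= min_version:
            return ua
    return None


# Sample strings standing in for the output of a random user agent generator.
candidates = [
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.62 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
]
chosen = pick_recent_chrome_ua(candidates)
```

Keep in mind that the user agent should also match the browser that is really running; claiming Chrome 59 while driving Firefox is a footprint of its own.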