How to download a captcha which is randomly generated each time there's a GET request?

Discussion in 'General Programming Chat' started by Deusdies, Oct 5, 2012.

  1. Deusdies

    Deusdies Regular Member

    Joined:
    May 22, 2009
    Messages:
    261
    Likes Received:
    190
    You all know what I'm working on (hint: it's in my sig).

    I've encountered a certain platform which uses an in-house built captcha solution. Thought it was fairly simple at first. However, it turns out it's not really.

    The way my app works is that it uses an internal headless browser which downloads the entire web page, and then you can select forms and fill them out. However, when I want to download an image (the captcha), I have to issue another request, download the image, and then display it to the user in a prompt.

    The only problem is that this platform's server (ASP script) generates a new image every time you visit that link (and it's the same link for img src). So, basically, my script visits the page (issuing GET requests) and one of those GET requests is for the captcha link. It doesn't download the file, but it still sends a GET. However the next time I issue a GET in order to download the image, I get a completely different image. And yes, the cookies are the same. Even when I open the captcha link in Chrome and hit refresh, it generates a new image.

    How to solve this? Anyone have any idea? I'm coding in Python, but a C#/Java reply would also help.
     
  2. CodingAndStuff

    CodingAndStuff Regular Member

    Joined:
    May 6, 2012
    Messages:
    236
    Likes Received:
    84
    Occupation:
    Swagstronaut
    Location:
    You can't have my bots. Sorry :'(
    What "browser" are you using? The MSIE activex control? A WebKit frame? Either way you should be able to cache the image and then access it locally once it's been downloaded.

    If that isn't an option, you could always run your own little delegation system internally and when the request goes through to the captcha, fork it off to another thread and save the bytes from that stream to a file and read from that.

    After re-reading your post, the comment "The way my app works is that it uses an internal headless browser" leads me to believe you're using httplib/httplib2, which usually means your client receives live updates from the page (including all assets). Just toss in a conditional to check for the captcha URI (the first instance of it), and save that file.
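A minimal sketch of that conditional, assuming a Python 3 client that hands every downloaded asset to a callback. `CaptchaCatcher`, `on_response`, and the `/captcha.asp` path are hypothetical names made up for illustration; nothing here is mechanize's or httplib's own API.

```python
from urllib.parse import urlparse


class CaptchaCatcher:
    """Keeps the bytes of the first response whose URL path is the captcha path."""

    def __init__(self, captcha_path):
        self.captcha_path = captcha_path  # hypothetical, e.g. "/captcha.asp"
        self.image_bytes = None

    def on_response(self, url, body):
        # Call this for every asset the client downloads.
        # Only the first captcha hit is kept; later hits are ignored.
        if self.image_bytes is None and urlparse(url).path == self.captcha_path:
            self.image_bytes = body
        return body

    def save(self, filename):
        if self.image_bytes is None:
            raise RuntimeError("no captcha response seen yet")
        with open(filename, "wb") as fh:
            fh.write(self.image_bytes)
```

The point is that the image shown to the user comes from the same GET the page load already made, so no second request is ever sent to the captcha URL.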
     
  3. Deusdies

    I'm actually using mechanize module which is somewhat similar to httplib. So what you're saying basically is that I should dig into the httplib (mechanize) code and put in a conditional there that will save the image if it encounters one?
     
  4. CodingAndStuff

    That could work :p
     
  5. Deusdies

    Thanks man, going to try that. Although mechanize's source is quite confusing...
     
  6. matessim

    matessim Junior Member

    Joined:
    Nov 22, 2008
    Messages:
    164
    Likes Received:
    72
    Occupation:
    Being funny and kind to puppies
    Location:
    UT 2003
    Can you PM me the addr? I want to tackle this myself.
     
  7. weedsmoker

    weedsmoker Junior Member

    Joined:
    May 2, 2011
    Messages:
    190
    Likes Received:
    79
    I posted a thread about this issue some time ago. All you need to do is save the cookies when grabbing the captcha (you can use urllib2 for this, no need for the slower mechanize lib), then open the form in mechanize, load the saved cookies, and submit the form.
    Check thread with example: http://www.blackhatworld.com/blackh...68176-automating-captcha-protected-forms.html

    Edit:
    Now I see that you're using a headless browser, so you don't need mechanize at all; just use urllib2 for fetching the captcha and saving the cookies.
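A rough sketch of the save-and-reload cookie flow, assuming Python 3, where `urllib.request` and `http.cookiejar` are the successors of urllib2 and cookielib. The function names and file paths are hypothetical; this is not the code from the linked thread.

```python
import http.cookiejar
import urllib.request


def fetch_captcha(captcha_url, image_file, cookie_file):
    """GET the captcha once, write the image to disk, and save the cookies."""
    jar = http.cookiejar.LWPCookieJar(cookie_file)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    with opener.open(captcha_url) as resp:
        data = resp.read()
    with open(image_file, "wb") as fh:
        fh.write(data)
    # ignore_discard=True also keeps session cookies, which have no expiry date.
    jar.save(ignore_discard=True, ignore_expires=True)
    return data


def opener_with_saved_cookies(cookie_file):
    """Build an opener that sends the cookies saved by fetch_captcha()."""
    jar = http.cookiejar.LWPCookieJar(cookie_file)
    jar.load(ignore_discard=True, ignore_expires=True)
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
```

The form would then be submitted through the second opener, so the server sees the same session that produced the saved image. With mechanize, a jar loaded this way can presumably be attached through `Browser.set_cookiejar`, but check that against the mechanize version you have.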

    Offtopic:
    Which headless browser are you using, Qt maybe?
     
    Last edited: Nov 19, 2012
  8. Deusdies

    mechanize handles cookies automatically. Anyway, I solved the problem by overriding mechanize's open() method and telling it to save the file if it encounters one.
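Roughly what such an override could look like. This is a sketch, not the code actually used in the thread: it uses Python 3's `urllib.request` instead of mechanize so that it runs without third-party packages, and `Fetcher`, `CaptchaSavingFetcher`, and `captcha_marker` are hypothetical names.

```python
import http.cookiejar
import pathlib
import urllib.request


class Fetcher:
    """Stand-in for a browser object: one cookie jar, one open() method."""

    def __init__(self):
        self.jar = http.cookiejar.CookieJar()
        self._opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.jar))

    def open(self, url):
        with self._opener.open(url) as resp:
            return resp.read()


class CaptchaSavingFetcher(Fetcher):
    """Overrides open() so a captcha response is written to disk as it arrives."""

    def __init__(self, captcha_marker, save_path):
        super().__init__()
        self.captcha_marker = captcha_marker  # substring that identifies the captcha URL
        self.save_path = save_path

    def open(self, url):
        body = super().open(url)  # the only GET sent for this URL
        if self.captcha_marker in url:
            pathlib.Path(self.save_path).write_bytes(body)
        return body
```

Because the file is written from the same response that `open()` returns, the image on disk is the one the server tied to the current session, and no second GET is issued.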
     
  9. geniusgullu

    geniusgullu Junior Member

    Joined:
    Nov 25, 2009
    Messages:
    185
    Likes Received:
    25
    Occupation:
    Student
    Location:
    Every Where
    It's a dynamic captcha, so the captcha request should carry the referrer plus the PHP session ID (get this from the GET request for the page).

    If you pass them correctly, then you get the same captcha.
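A small sketch of passing both values explicitly, assuming Python 3's standard library. `session_pair`, `build_captcha_request`, and the `PHPSESSID` cookie name in the usage below are illustrative assumptions, and whether the server then returns the same image depends entirely on how that server generates it.

```python
import urllib.request
from http.cookies import SimpleCookie


def session_pair(set_cookie_header, name):
    """Pull 'name=value' out of the Set-Cookie header sent with the page."""
    cookie = SimpleCookie()
    cookie.load(set_cookie_header)
    return "%s=%s" % (name, cookie[name].value)


def build_captcha_request(captcha_url, page_url, cookie_pair):
    """Build a GET for the captcha that carries the Referer and session cookie."""
    req = urllib.request.Request(captcha_url)
    req.add_header("Referer", page_url)  # the page that embeds the captcha
    req.add_header("Cookie", cookie_pair)
    return req
```

Usage would look like `build_captcha_request(captcha_url, page_url, session_pair(header, "PHPSESSID"))`, with the result passed to `urllib.request.urlopen`.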
     
  10. Deusdies

    Yep, I was passing the referrers as well.

    @weedsmoker: mechanize is basically a wrapper around urllib/urllib2. I'm using mechanize as the headless browser; I don't think there's a Qt-powered headless browser.
     
  11. geniusgullu

    Can you PM me the site you're working on?
     
  12. weedsmoker

    I'm glad you solved the problem.

    I know it's a wrapper, and it's a bit slower than urllib, so I don't use it unless I have to. And yes, there is a WebKit (Qt) headless browser if you need JavaScript support, which is a PITA with mechanize.
     
  13. Deusdies

    Are you talking about Ghost.py? I tried it once and it ran horribly. If not, please tell me what that headless browser powered by Qt is :D

    Yeah, mechanize doesn't support JavaScript, which is too bad, but I used PyV8 to evaluate any JS and then plug it into mechanize's response :)
     
  14. sockpuppet

    sockpuppet Junior Member

    Joined:
    Nov 7, 2011
    Messages:
    155
    Likes Received:
    145
    AFAIK the only real headless Qt WebKit is from phantomjs.org; they took the Qt WebKit source and stripped all the GUI stuff.
     
  15. Deusdies

    That's not a Python browser though; its API is JavaScript (from what I can tell).