
Dirty trick by Google? Can't figure out a Python solution

Discussion in 'General Programming Chat' started by msimurin, Feb 14, 2010.

  1. msimurin

    msimurin Regular Member

    Joined:
    Sep 21, 2009
    Messages:
    243
    Likes Received:
    92
    I am trying to make a Python script for the Google keyword tool. The issue comes with getting the captcha image.

    What I am trying to do is get the src/URL of the image, pass it to urllib, retrieve and save the image, then pass it to decaptcher for decoding...

    The problem is that the captcha URL on the Google keyword suggestion tool is always the same, like www.google.com/captchas ... so if I pass this URL to urllib to fetch and download the image, it loads another image and not the original one that is needed for account creation.

    So, I can pass the captcha URL to urllib, but what is the point if it is the same for every captcha image? Damn, I even confuse myself writing this, sorry for the convoluted explanation...
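
    For reference, a minimal sketch of that naive flow (the URL is the captcha endpoint that shows up later in this thread, and the hand-off to the solving service is left out); because the second request carries no session cookies, Google serves a brand-new captcha instead of the one tied to the form:

    Code:
    import urllib

    # Naive approach: fetch the captcha by its (static) URL and save it.
    captcha_url = 'https://adwords.google.com/select/KeywordToolCaptcha'

    # This request is made outside the browser session, with no cookies
    # attached, so the server generates a *new* captcha instead of returning
    # the one shown alongside the keyword form.
    urllib.urlretrieve(captcha_url, 'Captcha.jpg')

    # 'Captcha.jpg' could now be handed to the solving service, but its
    # answer will never match the captcha the form is actually expecting.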
     
  2. trophaeum

    trophaeum Senior Member

    Joined:
    Dec 21, 2007
    Messages:
    1,189
    Likes Received:
    706
    You need to keep the cookies intact and have multiple cookiejars on the go.
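
    A rough sketch of that idea with urllib2/cookielib (one cookiejar per session, reused for every request in that session; nothing here is specific to the keyword tool):

    Code:
    import cookielib
    import urllib2

    def make_session():
        # One CookieJar per session; every request made through this opener
        # reads and writes the same jar, so the captcha you fetch and the
        # form you submit stay tied together.
        jar = cookielib.CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
        return jar, opener

    # "Multiple cookiejars on the go": each parallel session gets its own jar.
    sessions = [make_session() for _ in range(3)]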
     
    • Thanks Thanks x 1
  3. BozoClown

    BozoClown Junior Member

    Joined:
    Jan 4, 2009
    Messages:
    150
    Likes Received:
    106
    I'm not familiar with Python, but I wrote a keyword tool for myself in Qt/C++. I don't know exactly what urllib does, but I'm assuming one part of your program notices that a captcha needs to be entered, while you want urllib to handle fetching the captcha image. If the two parts have different cookiejars, then when you send a reply from one part, that authentication does not carry over to the other part.

    If my assumption is right, you should look into having the two parts share the same cookiejar. This is what I did for my program. Chances are that if it is object oriented, all parts use the same cookiejar class, and there is a way to make them share the same instance of that object.

    In Qt/C++ I would just have the cookiejar member objects in different classes point to one common object. Pick out the idea and see how it applies to Python.
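
    In Python terms that would look roughly like this (a sketch, assuming mechanize plus urllib2 as used elsewhere in this thread): both objects are handed the very same cookiejar instance.

    Code:
    import cookielib
    import mechanize
    import urllib2

    # One cookiejar instance...
    cj = cookielib.CookieJar()

    # ...used by the part that drives the pages and forms (mechanize)...
    br = mechanize.Browser()
    br.set_cookiejar(cj)

    # ...and by the part that fetches the captcha image (urllib2).
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

    # Both now read and write the same cookies, so the captcha the opener
    # downloads belongs to the same session as the form mechanize submits.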
     
    • Thanks Thanks x 1
  4. msimurin

    msimurin Regular Member

    Joined:
    Sep 21, 2009
    Messages:
    243
    Likes Received:
    92
    Well, thanks for the help guys, but I don't really understand what both of you meant, or rather how to implement it. The logic my program follows is this:

    - get the URL of the captcha image - example: www.google.com/captcha1212.jpg

    - pass this URL to urllib and download it to my PC

    - use that downloaded image with the decaptcher API, decode it and get the result

    So, the problem is that what I'm used to doing is not possible, because the captcha URL is always the same :( I can pass it and retrieve an image, but it will be a newly loaded captcha, because the captcha URL is always the same (something like www.google.com/captchaloader ), understand?

    I can't see how a cookiejar solves this. Still, thanks for your replies.
     
    Last edited: Feb 15, 2010
  5. iglow

    iglow Elite Member

    Joined:
    Feb 20, 2009
    Messages:
    2,081
    Likes Received:
    856
    Home Page:
  6. Lusches

    Lusches Newbie

    Joined:
    Oct 22, 2009
    Messages:
    12
    Likes Received:
    0
    I think the problem is that every captcha is generated dynamically on the server, so every captcha is different.
    Decaptcha would end up trying to solve a different captcha, because a new one is created on every request.
    Maybe you can download the captcha once and then send it to Decaptcha via a POST request.
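
    Something along those lines, sketched with a cookie-aware urllib2 opener (the URL is the one used later in the thread; the actual POST to the solving service depends on their API and is left out):

    Code:
    import cookielib
    import urllib2

    # Cookie-aware opener, same idea as the shared-jar suggestions above.
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

    captcha_url = 'https://adwords.google.com/select/KeywordToolCaptcha'

    # Download the exact bytes served for this session and keep them on disk,
    # so nothing has to re-request (and thereby regenerate) the captcha.
    image_bytes = opener.open(captcha_url).read()
    with open('Captcha.jpg', 'wb') as f:
        f.write(image_bytes)

    # 'Captcha.jpg' can then be POSTed to the solving service.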
     
  7. msimurin

    msimurin Regular Member

    Joined:
    Sep 21, 2009
    Messages:
    243
    Likes Received:
    92
    I can send an image to decaptcha and get the result back, that's not the problem. The problem is retrieving the current captcha image: if I send the captcha URL to urllib to retrieve and download it, it will fetch a new image, since the captcha URL loads a new image every time.
     
  8. aмillionaírе

    aмillionaírе Jr. VIP Premium Member

    Joined:
    Apr 20, 2008
    Messages:
    532
    Likes Received:
    358
    Use Python's ImageGrab to get a screenshot of the image.

    Code:
    http://www.pythonware.com/library/pil/handbook/imagegrab.htm
    Perl also has one made especially for this purpose:

    "Image::Grab is a simple way to get images with URLs that change constantly."

    Code:
    http://mah.everybody.org/hacks/perl/Image-Grab/
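
    For the Python route, a minimal sketch with PIL's ImageGrab (the bounding box is a made-up example and has to match where the captcha actually renders on screen; at the time ImageGrab only worked on Windows):

    Code:
    from PIL import ImageGrab

    # Grab the region of the screen where the captcha is displayed.
    # (left, top, right, bottom) -- adjust to the real on-screen position.
    captcha_box = (100, 200, 400, 280)

    screenshot = ImageGrab.grab(bbox=captcha_box)
    screenshot.save('Captcha.jpg', 'JPEG')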
     
    • Thanks Thanks x 1
  9. msimurin

    msimurin Regular Member

    Joined:
    Sep 21, 2009
    Messages:
    243
    Likes Received:
    92
    Not really a popular section of the forums, but I must say the guys who hang around the programming area here are great. Thanks for all the suggestions, whether they helped or not; I am going to try the lib you suggested, amillionaire.
     
  10. Hydrogen

    Hydrogen Newbie

    Joined:
    Dec 30, 2009
    Messages:
    39
    Likes Received:
    23
    Occupation:
    Co-Owner of AdvertMarketing
    Home Page:
    The images 'have' to come from somewhere. A few things may be happening though:

    1) they are dynamically generating the captcha image each time
    2) they are masking the real URL somehow; Google is known for their tricky code

    Either way, install the LiveHttpHeaders plugin for Firefox and examine the HTTP headers exchanged between the webpage and your browser, and you will get a better idea of what is happening. Once you know exactly what is happening, you can code around it.
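
    Roughly the same check can be done from Python itself, if that is easier than the Firefox plugin (a sketch; urllib2's debug handlers print every request and response header to stdout):

    Code:
    import urllib2

    # debuglevel=1 makes urllib2 dump the raw request and response headers,
    # which is roughly what LiveHttpHeaders shows inside Firefox.
    opener = urllib2.build_opener(
        urllib2.HTTPHandler(debuglevel=1),
        urllib2.HTTPSHandler(debuglevel=1),
    )

    response = opener.open('https://adwords.google.com/select/KeywordToolExternal')
    print response.info()   # parsed response headers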
     
  11. BozoClown

    BozoClown Junior Member

    Joined:
    Jan 4, 2009
    Messages:
    150
    Likes Received:
    106
    It is a cookiejar problem. I had to tackle this same problem with my program. You have to make sure that all the URL fetching & submitting objects keep their cookies in the same spot, i.e. the same cookiejar.

    Plus, you may find it really doesn't matter that urllib gets a different captcha than the one in your browser, provided the one it fetched is the one whose solution you end up sending.

    Another option: if you want to avoid re-requesting the captcha URL (which causes the captcha to update), save the image locally so that urllib doesn't have to fetch anything again.

    If I remember correctly, each time you fetch that captcha image two cookies come along, whereas fetching the keyword tool URL sets about five. Of those two cookies, one is for the captcha timeout and the other has to do with the captcha answer <-- if I remember correctly. When you reply to the captcha, it is matched against the answer cookie. This pairing should not be broken between fetching the captcha, answering/breaking it, and submitting the answer.
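
    One way to check that is simply to dump the jar after each request and see which cookies were added (a sketch; the URLs are the ones from the code later in this thread, and the captcha endpoint may only set its cookies when a captcha is actually being served):

    Code:
    import cookielib
    import urllib2

    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

    # Fetch the tool page, then the captcha, and compare what lands in the jar.
    opener.open('https://adwords.google.com/select/KeywordToolExternal')
    print 'after tool page:', [c.name for c in cj]

    opener.open('https://adwords.google.com/select/KeywordToolCaptcha')
    print 'after captcha:  ', [c.name for c in cj]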
     
  12. pyronaut

    pyronaut Executive VIP

    Joined:
    Dec 9, 2008
    Messages:
    1,229
    Likes Received:
    1,422
    Here is the simplest way of saying it.

    Each time the URL is called with a different cookiejar (or no cookiejar), you get given a different image.

    SO EITHER

    Whatever you're using to grab the image needs to be given the same cookiejar that Python is using, so they are all on the same page.

    OR

    You grab the image straight from Python, so that you don't need to pass cookiejars around (well, you will once you submit the captcha).
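
    The second option, sketched with mechanize (the same browser object that drives the page also pulls the image, so there is no second cookiejar to keep in sync):

    Code:
    import cookielib
    import mechanize

    cj = cookielib.CookieJar()
    br = mechanize.Browser()
    br.set_cookiejar(cj)
    br.set_handle_robots(False)  # don't let robots.txt block the fetch

    br.open('https://adwords.google.com/select/KeywordToolExternal')

    # Grab the captcha straight through the same browser instance.
    br.retrieve('https://adwords.google.com/select/KeywordToolCaptcha', 'Captcha.jpg')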
     
  13. BozoClown

    BozoClown Junior Member

    Joined:
    Jan 4, 2009
    Messages:
    150
    Likes Received:
    106
    You sir, should write the law. Well said.
     
  14. msimurin

    msimurin Regular Member

    Joined:
    Sep 21, 2009
    Messages:
    243
    Likes Received:
    92
    Passing the cookiejar for every browser instance was possible... I tried simply everything I could think of.

    First I used mechanize, which failed to read the forms with the standard command, so I had to dig up the obscure global_form() call, which is uncommon but somehow does read the form on this page (99% of other sites behave differently, but I guess Google messed the site up with JavaScript). After that I had no choice but to use another library anyway, because mechanize returns the response of the form submit not for itself but as an object for ANOTHER LIBRARY (urllib2), so I passed the cookiejar from the mechanize browser over to it, and bam, more problems...

    There is simply no way to tell whether I am getting keywords inside the mechanize browser or not, because for some reason the Google keyword tool doesn't show them in the page source, not even in Mozilla. Yet Firebug can display them while the page source can't?!?! Ajax bullshit? So how am I going to tell if mechanize recognizes the page update with the keywords in it, and how the hell am I going to tell whether the form was submitted successfully, with this bullshit mechanize? (It shows a new URL that looks like the submitted-form result and all, but then it returns the same page source as before the seed keyword was submitted.) Yes, mechanize is good for simple sites, but for something as complicated as the Google keyword tool it sucks... Then again, even the simplest library like urllib works for simple sites, so why the hell should I use mechanize at all...

    Dunno, I think my days with Python are over. I am moving to C++; I'll break my head a bit over code readability, memory management and other stupid things, but hey, at least you get a proper utility with C++.

    DOH, my head hurts after hours and hours of messing with this...

    Here is the code btw

    Code:
    
    import urllib2
    import mechanize
    import lxml.html   # imported but not used below
    import decaptcher
    import cookielib

    # --- browser setup ---
    br = mechanize.Browser()
    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(False)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

    # dump every request/response while debugging
    br.set_debug_http(True)
    br.set_debug_redirects(True)
    br.set_debug_responses(True)

    # one cookiejar, attached to the mechanize browser
    cj = cookielib.CookieJar()
    br.set_cookiejar(cj)

    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

    # open the keyword tool and grab the (global) form
    site = br.open('https://adwords.google.com/select/KeywordToolExternal')
    print site.read()

    form = br.global_form()
    form.set_value('viagra for the lol', name='keywords')

    # fetch the captcha through the same browser/cookiejar and save it
    br.retrieve('https://adwords.google.com/select/KeywordToolCaptcha', 'Captcha.jpg')

    # send the saved image to decaptcher
    decaptcher.USERNAME = 'id'
    decaptcher.PASSWORD = 'password'

    # What's my balance?
    print decaptcher.balance()

    # Solve an image
    out = decaptcher.solve('Captcha.jpg')

    if int(out[0]) != decaptcher.ERROR_OK:
        print "Error"
    else:
        # Do something with the image here.
        # Say the image was badly recognized...
        decaptcher.bad(out[1], out[2])

    # last element of the result is the captcha text
    entercaptcha = out.pop()

    # put the solved text into the captcha field of the form
    form.set_value(entercaptcha, id='kpVariationsTool-captchaAnswer')
    print form

    # build a urllib2 opener that shares the SAME cookiejar as mechanize
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    urllib2.install_opener(opener)
    opener.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

    # submit the form via urllib2 (form.click() returns a urllib2.Request)
    request2 = form.click()
    try:
        response2 = opener.open(request2)
    except urllib2.HTTPError, response2:
        # HTTPError doubles as a response object, so keep it and carry on
        pass

    print response2.geturl()
    print response2.info()  # headers
    print response2.read()  # body
    
    
    
     
    Last edited: Feb 16, 2010
  15. BozoClown

    BozoClown Junior Member

    Joined:
    Jan 4, 2009
    Messages:
    150
    Likes Received:
    106
    Try using PyQt. It should give you the extra power of the Qt framework.
     
    • Thanks Thanks x 1
  16. msimurin

    msimurin Regular Member

    Joined:
    Sep 21, 2009
    Messages:
    243
    Likes Received:
    92
    Thanks for the heads-up, but PyQt seems to be typical Microsoft (Nokia) BS: no way to push your product commercially unless you pay them, and who knows how much? hah

    Well, thanks anyway. PyQt seems like a good thing, but it's just one more reason to feel that Python needs more support and lacks utility when you run into license blocks like this.

    And yeah, Qt C++ doesn't have license fees :)
     
  17. aмillionaírе

    aмillionaírе Jr. VIP Premium Member

    Joined:
    Apr 20, 2008
    Messages:
    532
    Likes Received:
    358
    Code:
    Status=OK - 200
    Set-Cookie=I=QgSQ3CYBAAA=.whA9OwLCRzoM5oLrdIuC6Q==.oDa9NPD9aN6KRCnmygiXZA==; Path=/select; Secure; HttpOnly
    Content-Disposition=inline
    Content-Type=image/jpeg; charset=UTF-8
    Transfer-Encoding=chunked
    Date=Wed, 17 Feb 2010 15:03:30 GMT
    Expires=Wed, 17 Feb 2010 15:03:30 GMT
    Cache-Control=private, max-age=0
    X-Content-Type-Options=nosniff
    X-Frame-Options=SAMEORIGIN
    Server=GFE/2.0
    X-XSS-Protection=0
    
    Code:
    Host=adwords.google.com
    User-Agent=Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.4) Gecko/20091028 Ubuntu/9.10 (karmic) Firefox/3.5.4
    Accept=text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Language=en-us,en;q=0.5
    Accept-Encoding=gzip,deflate
    Accept-Charset=ISO-8859-1,utf-8;q=0.7,*;q=0.7
    Keep-Alive=300
    Connection=keep-alive
    Cookie=I=R8KP3CYBAAA=.whA9OwLCRzoM5oLrdIuC6Q==.nDv5xl9Yn11Gw4gc1Lseew==; PREF=ID=2c4e5a77afb8f693:U=a9598266472b0f48:FF=4:LD=en:NR=10:TM=1266121117:LM=1266212118:GM=1:S=o3OI0VcpuXkM1s5o; NID=31=EYAjEQoWrpYcGfdwGIxWjOJ6BlckCMfA6Qzs0LzIiKpO2JTDuV-KSh8nco8YmqOZLaOUVmfvF28Lz_ET05J2wn7chP0Ju8vjElTrKj72k6H8LZtKvEAn-8sk2QI2-mgV; HSID=A-ON5a88M72sbWpTH; SSID=AeSuXcJbv8TuLq2i-; rememberme=true; SID=DQAAAJYAAACVPJo8zkjbGdo6L0utH2zaPwRM7r-ZZGJW3voifbhRqSiKDiPHRJ-qbHffaEPBDYGT6sONt_V3A5xbVipakh0XIjgsSkPFtNNpJ9gJoMNvYJGlcLvB71NWYj0zqGky9ChNAAQh6oMyEl6swgT8z-FgfRg93aDF58Kl8fmtUPGloQVCyvOc2WNK3a9cX1pNRQ-EpT8HVBeBB8G3AlYBqMYx; S=awfe=birn1E7pVvVUNZ1cdIPy4w:awfe-efe=birn1E7pVvVUNZ1cdIPy4w; S_awfe=FCXGzzw-1jzrQL_lAMdqgQ
    Cache-Control=max-age=0
    
    This is what I get from Tamper Data. Google is smart but still has to comply with normal procedures. As you can see, the captcha response expires as soon as it is served (Date equals Expires, and Cache-Control is private, max-age=0), so it cannot simply be reloaded later; try to work around that. If it's not in the source code, it's in the headers. Use the same cookiejar for all of your transactions with the server, for each session.
     
    Last edited: Feb 17, 2010
  18. msimurin

    msimurin Regular Member

    Joined:
    Sep 21, 2009
    Messages:
    243
    Likes Received:
    92
    Hey m8, I actually tried everything I could think of with this; I just don't feel masochistic enough to continue. I will donate $20, though, to anyone who gets the script to the point where it has an object from which it can scrape the keyword results and the monthly search numbers for each keyword...

    I am serious: if you want a free year of hosting or a few domains, just PM me or post the solution here and I'll load it onto your PayPal.