Need some help saving a captcha image [VB.NET]

Discussion in 'Visual Basic .NET' started by voidale, Mar 12, 2010.

  1. voidale

    voidale Jr. VIP Premium Member
    Joined: Sep 29, 2008
    Messages: 583
    Likes Received: 176
    Well, I got the Decaptcher API working, but I can't save the captcha image. I tried to load it into PictureBox1 with either of these snippets:
    Code:
    PictureBox1.Load(WebBrowser1.Document.GetElementById("captcha").Parent.Parent.GetElementsByTagName("img")(0).GetAttribute("src"))
    or
    Code:
    For Each ImgElement As HtmlElement In WebBrowser1.Document.Images
        Dim b = ImgElement.GetAttribute("SRC")
        If b.Contains("/captcha/") Then
            PictureBox1.Load(b)
        End If
    Next
    Both snippets get the src of the image, but PictureBox1 shows a different captcha than the one I see in WebBrowser1
    (going to the src in FF or IE works and shows the same captcha).
    Anyway, I think PictureBox1.Load is what messes it up. Is there a way to download the image straight from WebBrowser1? How can I make this work?

    I just need to have the captcha in PictureBox1 (then I save it),
    or just save it directly (c:\blabla\captcha.jpg).
     
    Last edited: Mar 12, 2010
  2. mline

    mline Newbie
    Joined: Jan 30, 2010
    Messages: 49
    Likes Received: 18
    You need to pass the cookies
     
  3. voidale

    how?:rolleyes:
     
  4. alex1

    alex1 Junior Member
    Joined: May 23, 2009
    Messages: 123
    Likes Received: 110
    Occupation: Software Developer
    Location: Toronto, Canada
    My guess is that passing the cookies will not help. Each request to the web server returns its own image: a new image on every request.

    Here is my suggestion:
    - my guess is your crawler is based on MSHTML (WebBrowser obj in .NET is using MSHTML internally)
    - so go to your IE Settings and DISABLE IMAGE DOWNLOADING
    - leave the rest of your code the way it is.

    So now the very first request for the SRC will come from PictureBox1.Load(b), and you will get the same image as you would get in your browser.


    They MAY send cookies together with the image and may expect you to send those cookies back. If my suggestion above does not work, you will have to use a proxy server, log all the communication, and see what kind of cookies and requests are travelling back and forth.

    HTH
     
  5. voidale

    I tried disabling images, but it didn't work.
    The proxy thing sounds hard and I'm just a beginner with VB.NET.
    Is there an easier way, like screen capture? I also heard about this command (WebBrowser1.Document.Images(0)) but I'm not sure how to use it. I tried checking the Temp folder for the captcha image, but found nothing ;/
    lol, why is it so hard to get this captcha image and so easy to get Yahoo's ;[
     
  6. alex1

    Download Fiddler from Microsoft (a proxy server for web developers) and learn how to use it, since you have a real task at hand.

    Run Fiddler and then visit your page with the captcha as a regular user would. Type in and submit the captcha.

    Then switch to Fiddler and carefully examine the HTTP requests and responses.

    You need to see how IE requests the captcha image when a real human user visits the page. Things to look for:
    - what is the exact HTTP request for the captcha (Referer header? any cookies?)
    - what is the server response (besides the image binary data, are there any new cookies being sent in the response headers, etc.)

    Well, if a proxy is difficult for you, screen capturing is even more difficult.
     
  7. voidale

    Thanks, going to do it now, will report back ;)
    Edit: yeah, got the cookie.
    Here is the raw request for the captcha URL:
    HTML:
    GET http://www.website.com/validator/456/1268451159.gif HTTP/1.1
    Accept: */*
    Referer: http://www.website.com/file/files/456/3338/
    Accept-Language: en-US
    User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET CLR 1.1.4322; .NET CLR 4.0.20506)
    Accept-Encoding: gzip, deflate
    Host: www.website.com
    Connection: Keep-Alive
    Cookie: uid=5564236; uhsh=%25A8%25CDfG%2501%2529%25A3%2527%25E4%25D6%2560%25E1%2592%250D%2529%25A7%258D%25A8_75ff9de3c93f220264a5d49ad7842a11b; lkni=1430774893; UserInteraction4=KonaBase; vc23835752=7653443762118; __utma=140781479.284486472.1268445511.1268445511.1268451162.2; __utmb=140781479.3.10.1268451162; __utmz=140781479.1268445511.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmc=140781479
    
    But honestly I have no idea what to do with all of this information. I only know how to use WebBrowser1, not HttpRequest (I've been using VB.NET for like 7-8 days now).
    P.S. Can I use something like "WebBrowser.Document.Cookie"?
     
    Last edited: Mar 13, 2010
  8. alex1

    Even though it is possible, you do NOT need to set "WebBrowser.Document.Cookie" (it is actually called differently; whatever JavaScript methods are used to set cookies, you could use them too through MSHTML. You could inject JavaScript into pages, and much more. But that is beside the point).

    I see 2 possible solutions:

    1. use WebRequest
    ==============

    Disable images in IE, and then craft a WebRequest like the one you posted from the proxy log, with the referrer, cookies, etc. It is all easy; the only thing you have to add is the request headers containing the cookies. You also need to READ the real cookie values from the WebBrowser, something like this:

    Dim cookieString As String = CType(webBrowserObj.Document.DomDocument, mshtml.IHTMLDocument2).cookie

    If you create the right WebRequest, they will have no way of telling whether it comes from IE or from WebRequest. They will send you a binary response stream, you save it into a file, and there you go. The file type (JPG, GIF, or whatever) you could hardcode for this site, or you could read the Content-Type header on the response, which holds a string like "image/jpeg", and convert it into the proper file extension. Those are called MIME types and they are standardized.
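
    For what it's worth, option 1 could be sketched roughly like this. It is untested against the site in this thread, and the names SaveCaptcha and ExtensionFor are made up for illustration. It assumes the cookie string you pass in is the one the browser would send (HttpOnly cookies are not visible through the document's cookie property, so it may be incomplete), and that this is the first request for that image URL.

```vbnet
Imports System.IO
Imports System.Net

Module CaptchaDownload

    ' Hypothetical helper: turns a Content-Type value such as "image/gif"
    ' into a file extension. Unknown types fall back to ".bin".
    Function ExtensionFor(ByVal contentType As String) As String
        If String.IsNullOrEmpty(contentType) Then Return ".bin"
        ' Drop any parameters after ";" and normalise the case.
        Dim mime As String = contentType.Split(";"c)(0).Trim().ToLowerInvariant()
        Select Case mime
            Case "image/gif"
                Return ".gif"
            Case "image/png"
                Return ".png"
            Case "image/jpeg"
                Return ".jpg"
            Case Else
                Return ".bin"
        End Select
    End Function

    ' Requests imageUrl with the given Referer and Cookie headers, writes the
    ' response body to pathWithoutExtension plus an extension derived from the
    ' Content-Type header, and returns the full path that was written.
    Function SaveCaptcha(ByVal imageUrl As String, ByVal referer As String, _
                         ByVal cookieHeader As String, _
                         ByVal pathWithoutExtension As String) As String
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(imageUrl), HttpWebRequest)
        request.Referer = referer
        If Not String.IsNullOrEmpty(cookieHeader) Then
            ' Send the browser's cookies as one raw Cookie header.
            request.Headers.Add(HttpRequestHeader.Cookie, cookieHeader)
        End If

        Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
            Dim target As String = pathWithoutExtension & ExtensionFor(response.ContentType)
            Using source As Stream = response.GetResponseStream()
                Using output As FileStream = File.Create(target)
                    ' Copy the binary stream to disk in 4 KB chunks.
                    Dim buffer(4095) As Byte
                    Dim read As Integer = source.Read(buffer, 0, buffer.Length)
                    While read > 0
                        output.Write(buffer, 0, read)
                        read = source.Read(buffer, 0, buffer.Length)
                    End While
                End Using
            End Using
            Return target
        End Using
    End Function

End Module
```

    From the form you would call it with something like SaveCaptcha(b, WebBrowser1.Url.ToString(), WebBrowser1.Document.Cookie, "c:\blabla\captcha"), where b is the SRC found by the loop in the first post, and then load the returned path into PictureBox1. If the site also checks the User-Agent, you may need to set request.UserAgent to the value from the proxy log.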

    2. IHTMLElementRender
    ==================

    You would need to ENABLE images in IE. Then you could try to use the IHTMLElementRender interface on the captcha image DOM node (the one you are taking the SRC from). It should render the image into a System Bitmap object, which can be saved to a file from there. I have not done it personally, so you could search for IHTMLElementRender and try to make it work (it could be tough if you are new to this whole thing). I suggest you post a little project on a freelancer site and have someone code the IHTMLElementRender solution for you. It is the more universal approach: you don't care about anything else, and you will be able to render an image from ANY DOM NODE (including <OBJECT>??? aka Flash???)

    Good luck with your proj
     
  9. alex1

    hxxp://www.developerfusion.com/code/4712/generate-an-image-of-a-web-page/

    Just read that page; they have a complete code sample in VB and C#. They make a picture of the whole page, though, and you only need your image node.

    So you could change this line of code:
    Dim element As IHTMLElement = CType(document.body, IHTMLElement)

    to this:
    Dim element As IHTMLElement = CType(ImgElement.DomElement, IHTMLElement)

    (from your OP I assume ImgElement is the name of your variable that holds the IMG DOM node with the captcha; it is an HtmlElement, so the cast has to go through its DomElement)

    You should be all set then.
    In fact, I could use that code for my own purposes as well :)
     
    Last edited: Mar 13, 2010
  10. voidale

    I got working code:
    Code:
    Dim doc As MSHTML.IHTMLDocument2 = DirectCast(WebBrowser1.Document.DomDocument, MSHTML.IHTMLDocument2)
    Dim sobj As MSHTML.IHTMLSelectionObject = doc.selection
    Dim body As MSHTML.HTMLBody = TryCast(doc.body, MSHTML.HTMLBody)
    sobj.empty()
    Dim range As MSHTML.IHTMLControlRange = TryCast(body.createControlRange(), MSHTML.IHTMLControlRange)
    Dim img As MSHTML.IHTMLControlElement = DirectCast(WebBrowser1.Document.Images(145).DomElement, MSHTML.IHTMLControlElement)

    range.add(img)
    range.[select]()
    range.execCommand("Copy", False, Nothing)

    Dim bimg As New Bitmap(Clipboard.GetImage())
    PictureBox1.Image = bimg
    There is just one issue with this code: in "WebBrowser1.Document.Images(145)", 145 is the index of the captcha image on the page I'm currently on, but the index is different on every page. Is there a way to detect this index in code? I found this code elsewhere, so I have no idea how to edit it. All it does is use WebBrowser1 itself to get the captcha image. Anyone?
     
    Last edited: Mar 13, 2010
  11. alex1



    To get the IMG element with the captcha without hardcoding its index (i.e. 145), just use your own code from the very first post, which searches for the image element whose SRC contains /captcha/. Something like this:


    For Each ImgElement As HtmlElement In WebBrowser1.Document.Images
        Dim b = ImgElement.GetAttribute("SRC")
        If b.Contains("/captcha/") Then
            ' ImgElement is the managed wrapper; DomElement is the underlying MSHTML object
            Dim img As MSHTML.IHTMLControlElement = DirectCast(ImgElement.DomElement, MSHTML.IHTMLControlElement)
            ' ... use img here (range.add(img) etc. from your clipboard code)
            Exit For
        End If
    Next
     
  12. voidale

    Yeah, I tried that. It gave me about 3 of these: "Runtime errors might occur when converting 'string' to MSHTML.IHTMLControlElement"
     
  13. alex1

    Try again, because the ImgElement variable from your code is NOT a String, but rather an HtmlElement:

    For Each ImgElement As HtmlElement In WebBrowser1.Document.Images ...

    so I cannot see how the compiler would think it is a String. I suggest you double-check that part of your code and make sure you are casting the element, not the SRC string.

    You are almost there.

    By the way, I thanked you for posting the Clipboard code because it is much simpler than the other solutions. Let us know if it works, though. I may want to try it next time I deal with the Decaptcher API.
     
  14. voidale

    Well, thanks guys, I got it all working now ;>
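
    For anyone landing here later, the pieces from this thread fit together roughly as below. This is only a sketch, not the exact code that ended up working. FindImage and CopyImageViaClipboard are made-up names, and it assumes a reference to the Microsoft.mshtml interop assembly, a call from the form's UI thread (the clipboard needs STA), and a "/captcha/" marker in the SRC as in the first post. Note that it overwrites whatever is on the clipboard.

```vbnet
Imports System.Drawing
Imports System.Windows.Forms

Module CaptchaClipboard

    ' Returns the first <img> whose SRC contains marker, or Nothing if none matches.
    Function FindImage(ByVal doc As HtmlDocument, ByVal marker As String) As HtmlElement
        For Each candidate As HtmlElement In doc.Images
            Dim src As String = candidate.GetAttribute("src")
            If Not String.IsNullOrEmpty(src) AndAlso src.Contains(marker) Then
                Return candidate
            End If
        Next
        Return Nothing
    End Function

    ' Selects the image in the live page, copies it to the clipboard and
    ' returns it as a Bitmap. No new HTTP request is made, so the picture is
    ' the one the browser is already showing. Returns Nothing if the copy fails.
    Function CopyImageViaClipboard(ByVal doc As HtmlDocument, ByVal imgElement As HtmlElement) As Bitmap
        Dim dom As mshtml.IHTMLDocument2 = DirectCast(doc.DomDocument, mshtml.IHTMLDocument2)
        Dim body As mshtml.HTMLBody = TryCast(dom.body, mshtml.HTMLBody)
        If body Is Nothing Then Return Nothing

        ' Clear any existing selection before building the control range.
        dom.selection.empty()
        Dim range As mshtml.IHTMLControlRange = TryCast(body.createControlRange(), mshtml.IHTMLControlRange)
        If range Is Nothing Then Return Nothing

        ' DomElement is the underlying MSHTML object behind the managed HtmlElement.
        range.add(DirectCast(imgElement.DomElement, mshtml.IHTMLControlElement))
        range.[select]()
        range.execCommand("Copy", False, Nothing)

        Dim copied As Image = Clipboard.GetImage()
        If copied Is Nothing Then Return Nothing
        Return New Bitmap(copied)
    End Function

End Module
```

    In the form, something like Dim el = FindImage(WebBrowser1.Document, "/captcha/") followed by PictureBox1.Image = CopyImageViaClipboard(WebBrowser1.Document, el) (after checking that el is not Nothing) replaces the hardcoded Images(145). The resulting Bitmap can then be written out with its Save method, e.g. to c:\blabla\captcha.jpg.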