
Let's talk captcha breaking (GOCR,TESSERACT,etc)

Discussion in 'Black Hat SEO Tools' started by cody41, Jun 19, 2011.

  1. cody41

    cody41 Power Member

    Joined:
    Jun 18, 2009
    Messages:
    682
    Likes Received:
    274
    Location:
    Texas
    Anyone else here as interested in this technology as I am? I've been on a crusade recently to try to edit my own engines and wanted to compare with anyone else here who may be doing the same. I know Ignite SEO uses an exposed GOCR engine to crack some captchas internally.

    I've started down the path of playing with imagemagick and gocr to get a good baseline of images trained up. I'm on the lookout now for a program that can assist me with "automating" the training so to speak.

    Next up is Scrapeboard. That program uses Tesseract for some captchas, but it only uses the default eng.traineddata for breaking.

    Has anyone successfully merged the arc.traineddata file for anti recaptcha breaking (think jdownloader for that one) with the default tesseract engine?
     
  2. jb2008

    jb2008 Senior Member

    Joined:
    Jul 15, 2010
    Messages:
    1,158
    Likes Received:
    972
    Occupation:
    Scraping, Harvesting in the Corn Fields
    Location:
    On my VPS servers
    I'm interested in it but it is a massive project and beyond my capabilities.

    Can you code CAPTCHA DLLs for Xrumer?
     
  3. bulgariangypsy

    bulgariangypsy Newbie

    Joined:
    Jan 21, 2011
    Messages:
    39
    Likes Received:
    13
    Hey fella,

    Funnily enough CAPTCHA breaking is all I've been doing for the last week.

    Started with the simpler ones, which are unwarped, separate words with lots of noise. These have become trivial to break (99% of the time), such that I can break these better than my gf.

    I started work on the PoF captcha this morning (the one with the triangles and circles). It's been pretty successful, in that I can isolate the correct letters and minimize the noise, but Tesseract has a pretty low crack rate: maybe 1 in 10 get broken. I'm gonna try to write an extra part so that it does the run about 10 times and combines the best answers. However, using a human cracking service would be 100%.

    I'm gonna look into reCAPTCHA as it'd be awesome to try and get a good break rate, but that's a toughy.

    Do you wanna have a chat on skype? Pm for details.
     
  4. SkullTraill

    SkullTraill Junior Member

    Joined:
    Mar 1, 2011
    Messages:
    132
    Likes Received:
    35
    Maybe I didn't quite understand you, but haven't you heard of DeCaptcha?
     
  5. bulgariangypsy

    bulgariangypsy Newbie


    It's fun programming to do it yourself. A very good feeling when it works.
     
  6. bulgariangypsy

    bulgariangypsy Newbie

    [IMG]

    There's two of 'em. PoF on the right.
     
  7. cody41

    cody41 Power Member

    I'm 15 steps ahead of you. I'm already training up an engine to crack a series of captchas.

    I think it might be useful to approach this from a crowdsourcing perspective. Maybe post up trained DBs to share amongst ourselves?
     
  8. indianbill007

    indianbill007 Jr. VIP

    Joined:
    Jan 8, 2010
    Messages:
    4,817
    Likes Received:
    4,053
    Occupation:
    Making Money when the world is sleeping
    Location:
    Menlo Park - Next to Zuck
    I am interested in participating in this. I have successfully trained Tesseract to break Hotmail and PoF with 50% accuracy, and I am cracking reCAPTCHA with 15% accuracy.

    My code is in C#, calling Tesseract via P/Invoke. Let me know how we can move further.
     
  9. bulgariangypsy

    bulgariangypsy Newbie

    I'd be interested to know how you're doing your PoF captcha...

    This is what I'm doing:
    - Turn everything under the triangles white
    - Remove the top part of the image to get rid of the remaining circles
    - Remove any noise generated
    - Sharpen the black parts, based on rgb values of less than (125,125,125)
    - Find each separate letter and crop it
    - Paste into a new image
    - Run through Tesseract

    I get the prog to do this 10 times, and from what it seems it breaks it at least 60% of the time. Possibly up to 80%. Just need to finish it off so that it selects the most likely answer. I'm using Python and PIL for coding.
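
    For the "select the most likely answer" part, one simple option is a plain majority vote over the text returned by each run. This is only a sketch under assumptions: the function name most_likely_answer, the min_votes cutoff, and the sample strings are made up for illustration and are not taken from anyone's actual code in this thread.

```python
from collections import Counter


def most_likely_answer(guesses, min_votes=2):
    """Return the most frequent non-empty guess, or None if nothing
    reaches min_votes (hypothetical cutoff, tune as needed)."""
    # Drop empty results and surrounding whitespace from each run.
    cleaned = [g.strip() for g in guesses if g and g.strip()]
    if not cleaned:
        return None
    # most_common(1) gives the (text, count) pair with the highest count.
    answer, votes = Counter(cleaned).most_common(1)[0]
    return answer if votes >= min_votes else None


# Made-up outputs from five repeated OCR runs on the same image.
runs = ["x7kq", "x7kq", "x7kg", "", "x7kq"]
print(most_likely_answer(runs))
```

    Voting on the whole string is the simplest choice. Voting per character position could do better when every run gets a different letter wrong, but it only works if all runs return strings of the same length.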

    I reckon we should get something together, non?
     
  10. cody41

    cody41 Power Member

    Ok, so a couple of people are using tesseract, am I the only one going it alone with imagemagick and gocr?
     
  11. Autumn

    Autumn Elite Member

    Joined:
    Nov 18, 2010
    Messages:
    2,197
    Likes Received:
    3,041
    Occupation:
    I figure out ways to make money online and then au
    Location:
    Spamville
    Welcome to 2007...

    Personally I think that in 2011, coding captcha breakers is a poor use of a person's time given the extremely low cost and availability of captcha decoding services.

    It's very time intensive, and given that all it takes is a minor change in the captcha to break your captcha breaking solution, I think outsourcing at a low cost is a better use of your resources. Best to save your programming time for things that can't be cheaply or reliably outsourced, e.g. coding up the signup and submission scripts for uncommon CMSs.

    bluehatseo has a few good posts about cleaning up captchas with imagemagick and gocr and also some basic neural net stuff.
     
  12. cody41

    cody41 Power Member

    Yeah, I know it's passé to break your own captchas, but at the same time I'm all for good challenges. PLUS, with the amount of tools that I use now, and considering how much I go through decaptcher in any given week, let alone a month, anything that I can do on my end to save some money would be more than helpful.

    I've been to most of the posted online areas where people discuss breaking captchas, what tools, algorithms and whatnot. I know I'm not the only one going down this road ;)
     
  13. kalrudra

    kalrudra BANNED

    Joined:
    Oct 29, 2010
    Messages:
    271
    Likes Received:
    300
    I have created my own engine to break Google's internal captcha (not reCAPTCHA). Its accuracy is more than 70%.

    I have used Tesseract logic + artificial intelligence + a neural network + a 1 GB captcha image database.
     
    Last edited: Jun 20, 2011
  14. cody41

    cody41 Power Member

    Good point about the database usage. Anyone else using a database to load captchas into? Also, Kalrudra, did you source your own 1 GB captcha set? Or was this available elsewhere to use?
     
  15. goodbuyer

    goodbuyer Junior Member

    Joined:
    Aug 13, 2010
    Messages:
    118
    Likes Received:
    18
    How do you train Tesseract? Does it work that way? I have been using neural nets and support vector machines, but it's a lot of work to train. Does Tesseract do better? What algorithms are you using for cleaning noise? I know every captcha has its own way, but is there a list of them?

    We can exchange knowledge
     
  16. lattenlui

    lattenlui Newbie

    Joined:
    Mar 1, 2009
    Messages:
    26
    Likes Received:
    3
    Gender:
    Male
    How can I use the captcha solver of jdownloader for my own projects?
    I'm using tesseract too, but I don't know how to train it for recaptcha. Does anybody have some prelearned data or database of captchas?
     
  17. jazzc

    jazzc Moderator Staff Member Jr. VIP

    Joined:
    Jan 27, 2009
    Messages:
    2,468
    Likes Received:
    10,155
    A challenge! Sounds very interesting. :) Are there any tutorials on .net for this kind of thing? I wouldn't know how to start... :eek:
     
  18. lolomgwut

    lolomgwut Newbie

    Joined:
    May 28, 2011
    Messages:
    2
    Likes Received:
    0
    I have extensive experience with tesseract, wonderful tool. Did you port it over from c++ or what?
     
  19. lwelch45

    lwelch45 Junior Member

    Joined:
    Mar 24, 2010
    Messages:
    135
    Likes Received:
    38
    I have been working extensively with different types of OCRs, some I've built and others like Tesseract. The only downside to Tesseract in my case is that it's coded in C++ and I'm a VB.NET/C# coder. I have a few image processing classes, but right now I'm trying to translate the antirecaptcha image processing script to VB.NET.
     
  20. lwelch45

    lwelch45 Junior Member

    I've been told that ABBYY is really good.