General OCR Tut (Vector Space Method)

Discussion in 'General Programming Chat' started by Gophering, Mar 22, 2013.

  1. Gophering

    Gophering Junior Member Premium Member

    Joined:
    Mar 21, 2013
    Messages:
    115
    Likes Received:
    279
    Occupation:
    Unemployed
    Location:
    EU
    OK, so I've been meaning to write an OCR tut for quite some time (long before coming across BHW). After checking out another thread in the "C, C++, C#" subforum, titled "C# - OCR - Tutorial" (great thread btw), I've decided to finally get this going and write down a theoretical/practical guide.

    This might turn out to be somewhat long, as we'll need to cover a couple of things here, so I'll probably break it down into multiple posts. All the information will be kept in this thread. I might reference external sites or post code snippets to pastebin, etc., but generally speaking you should be good just by following this thread.

    Alright, let's get started!

    Requirements

    Now, what's needed in order to follow this tut? Firstly, I won't be using any external OCR/OCR-like engines (so no tesseract, etc.); we'll be building our own. Secondly, I'll be using Golang[1] (that Google language) to illustrate the methods involved. Why not C# or C or C++ or "insert language here", you ask? Well, quite honestly, this can be accomplished in any more or less mature language. I chose Go because its standard library already covers image decoding and encoding, so we won't waste time writing boilerplate image processing code. Furthermore, the syntax is very readable, especially if you are coming from a C background. Finally, I really want people to learn something from this rather than just copy-paste it into their projects. So read the code, study the material and you should be good to go.

    If you've never worked with Go before, feel free to follow the excellent official tutorial[2]. Just familiarize yourself with the syntax so that you can follow along.

    Code:
    [1] golang [.] org
    [2] tour [.] golang [.] org
    
    The Process

    Next, let's have a look at the overall process involved.
    We'll obviously need a captcha to crack first. It's best to start out slow and simple, so don't jump straight into cracking Recaptcha; you won't succeed. I'll provide a generic b&w captcha with some distortion/obfuscation further down, but feel free to use your own.

    Generally speaking, any OCR process involves preparing the image (simplification, cleaning), separating the interesting content and recognizing said content. So in our case we need to: get our captcha image, reduce its colour range to make things a bit easier, clean it, extract the text, break the text down into standalone letters, classify the letters, get a confidence rating for each letter, and print out the result.

    Note that there's usually no need to aim for a very high accuracy rate. A 20% rate is already enough, because a wrong guess usually carries no penalty: we can simply try again. So for our purpose we'll consider a 20% accuracy rate a success. Of course, there are various things we can do to improve upon this, which I'll discuss later down the line.
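    To put the 20% figure into numbers, here is a minimal sketch. It assumes every attempt is independent and succeeds with the same probability p, which is an assumption of mine (a real site may throttle or block repeated failures). The function names expectedAttempts and successWithin are made up for this illustration.

```go
package main

import (
	"fmt"
	"math"
)

// expectedAttempts returns the average number of tries until the first
// success when every try succeeds independently with probability p.
func expectedAttempts(p float64) float64 {
	return 1 / p
}

// successWithin returns the chance of at least one success in n tries.
func successWithin(p float64, n int) float64 {
	return 1 - math.Pow(1-p, float64(n))
}

func main() {
	p := 0.20 // the accuracy rate we consider "good enough"
	fmt.Printf("expected attempts: %.1f\n", expectedAttempts(p))
	for _, n := range []int{1, 5, 10, 20} {
		fmt.Printf("at least one success in %2d tries: %.3f\n", n, successWithin(p, n))
	}
}
```

    Under these assumptions a 20% solver needs 5 attempts on average, and ten attempts already give roughly an 89% chance of at least one correct answer.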

    Classification

    How do we teach a machine to extract and, more importantly, recognize characters in an image? Commonly, neural networks[1] are employed to solve this kind of problem. Without going into too many details and ultimately boring you to death (lol), a neural network is an adaptive algorithm designed to solve a computational problem without the programmer having to know the right solution in advance. In other words, a number of solutions are derived from a constant adjustment/learning process.

    A neural network can be represented like so:

    input -> magic -> output

    Obviously the magic part is where the machine learning happens and a number of solutions with varying degrees of accuracy are computed. Again, this is a very superficial explanation and I encourage you to read [1].

    Interestingly enough, however, we won't be using neural networks to solve our problem here. While neural networks would probably be much better suited to solving more complicated captchas, we should be able to achieve a fairly high accuracy rate by utilizing a vector space search engine[2]. In other words, we already know the right process for solving this problem, so we'll utilize a somewhat "hacky"/fast solution instead of trying to write a more generic alternative.

    Code:
    [1] wikipedia [.] org/wiki/Artificial_neural_network
    [2] la2600 [.] org/talks/files/20040102/Vector_Space_Search_Engine_Theory.pdf
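    The thread has not yet shown how the vector space comparison is done, so treat the following as a sketch under my own assumptions rather than the author's code. It assumes each letter image is flattened into a vector of pixel values and that two vectors are compared by cosine similarity, the usual measure in vector space search. The names magnitude and cosine and the toy 3x3 glyphs are invented for this example.

```go
package main

import (
	"fmt"
	"math"
)

// magnitude returns the Euclidean length of v.
func magnitude(v []float64) float64 {
	sum := 0.0
	for _, x := range v {
		sum += x * x
	}
	return math.Sqrt(sum)
}

// cosine returns the cosine of the angle between a and b. For pixel
// vectors (no negative values) 1 means identical direction and 0 means
// no black pixel in common. It returns 0 if the lengths differ or if
// either vector is all zeros.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	ma, mb := magnitude(a), magnitude(b)
	if ma == 0 || mb == 0 {
		return 0
	}
	dot := 0.0
	for i := range a {
		dot += a[i] * b[i]
	}
	return dot / (ma * mb)
}

func main() {
	// Toy 3x3 "glyphs" flattened row by row; 1 = black pixel, 0 = white.
	vertical := []float64{0, 1, 0, 0, 1, 0, 0, 1, 0}
	horizontal := []float64{0, 0, 0, 1, 1, 1, 0, 0, 0}
	noisyVertical := []float64{0, 1, 0, 0, 1, 0, 0, 1, 1}

	fmt.Printf("vertical vs noisy vertical: %.3f\n", cosine(vertical, noisyVertical))
	fmt.Printf("vertical vs horizontal:     %.3f\n", cosine(vertical, horizontal))
}
```

    The noisy vertical bar scores much closer to the clean vertical bar than the horizontal bar does. That is the property a classifier of this kind relies on: the known letter with the highest score wins, and the score itself doubles as a confidence rating.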
    
     
  2. Gophering

    Gophering Junior Member Premium Member

    More will follow during my lunch break. For now, familiarize yourself with Golang everybody. Cheers and any comments are always welcome.
     
  3. Gophering

    Gophering Junior Member Premium Member

    First Steps

    Alright, as promised, here's the next part! Keep in mind that I don't have anything prewritten really, so I'll be coding this up in Go while I'm writing, and I'm not completely sure how deep into it we'll make it today... Regardless, let's get coding.

    So firstly, let's get ourselves a somewhat non-trivial captcha. By non-trivial I mean something that's not just plain b&w, as that would be very trivial to solve, and I'd like to show off as much of the techniques/thought process involved as possible.

    Let's get started with this captcha here: captcha.gif
    The captcha can also be accessed via [1]

    Let's analyze the image real quick. Keep in mind that our goal is a clean extraction of the interesting content (the letters) from the image. We notice that the letters are not interconnected: each letter is separated by white space, which is good. We also notice that we are dealing with a monochrome image here, which is not a bad thing either. Finally, we notice that the whole image is obfuscated by little dots or some such, which might be a problem later down the line.

    Let's get coding. Firstly, we will need several helper functions for image reading/writing and so on.

    Code:
    package main
    
    
    import (
        "image"
        "os"
        "log"
        "image/png"
        _ "image/gif"
    )
    
    
    //Opens a given image based on path
    func loadImage(path string) image.Image {
        file, err := os.Open(path)
        if err != nil {
            log.Fatal(err)
        }
        defer file.Close()
        img, _, err := image.Decode(file)
        if err != nil {
            log.Fatal(err)
        }
    
    
        return img
    }
    
    
    //Saves a given image to path. Returns error if unsuccessful.
    func saveImage(path string, img image.Image) error {
        out_file, err := os.Create(path)
        if err != nil {
            return err
        }
        defer out_file.Close()
    
    
        err = png.Encode(out_file, img)
        if err != nil {
            return err
        }
        return nil
    }
    
    
    func main (){
        
    }
    
    OK, so what we've done above is pretty uneventful: we declared our main package, imported a couple of image processing and other packages from Go's standard library, and declared two functions, loadImage and saveImage. The former opens an image for us; the latter saves it to disk as a PNG. The blank import of "image/gif" is there only to register the GIF decoder, so that image.Decode can read captcha.gif.

    Let's go on. Although the image is monochrome already, it would be nice to work with black and white only. So let's write a function which will return a new b&w copy of the original image. Note that it uses color.White and color.Black, so "image/color" has to be added to the import list.

    Code:
    //Creates a new B&W copy of our original image and returns a pointer.
    func turnBW (img image.Image) *image.Gray {
        bounds := img.Bounds() //Get the image bounds. This way we can easily calculate dimensions.
        newImg := image.NewGray(bounds) //Create a new gray scale image with the same dimensions as the original image.
        for y := bounds.Min.Y; y < bounds.Max.Y; y++ { //Range from min Height to max Height
            for x := bounds.Min.X; x < bounds.Max.X; x++ { //Range from min Width to max Width
                r, g, b, _ := img.At(x, y).RGBA() //Get the R, G, B values of the pixel (16 bits per channel, 0..65535)
                if r + 256 * g + b >= 16000000 { //Close to white? (pure white scores 16908030)
                    newImg.Set(x, y, color.White)
                } else {
                    newImg.Set(x, y, color.Black)
                }
            }
        }
        return newImg
    }
    
    What we do here is again fairly trivial. Firstly, we get the dimensions of the original image by calling the Bounds method. Then we create a new grayscale image with the same dimensions. Finally, we range over each pixel in the original image, get its RGB colour value and, depending on whether it is close to white or not, set the matching pixel in our new grayscale image to white or black.

    Let's modify our main function and run the code. Since saveImage always encodes PNG data, the output file gets a .png name:

    Code:
    func main (){
        img := loadImage("captcha.gif")
        err := saveImage("processed.png", turnBW(img))
        if err != nil {
            log.Fatal(err)
        }
    }
    
    and the result of running our code: Ae0261r.png

    Next, we'll need to figure out how to clean the background. More on this in the next post.

    Code:
    [1] snaphost [.] xom/captcha/WebForm.aspx?id=QAP85CQKE9T9&rad=1363954778626 (replace "x" with "c" here. Sorry BHW preventing me from posting proper URLs...)
    
     
  4. Gophering

    Gophering Junior Member Premium Member

    Chopping

    As mentioned in my previous post, we need to figure out a way to remove all those artifacts from the background of the image in order to make the text extraction itself as smooth and accurate as possible. There's a slight complication, however: we cannot rely on colour differences anymore. We are working with a monochrome image, and we cannot simply remove everything that is black, since this would also remove our text. So we'll need to figure out another approach.

    We need to design a so-called chopping algorithm. In other words, we need an algorithm which detects runs of connected target pixels (in our case black pixels), leaves the long ones in the image and chops everything else. We'll use a threshold of 2 pixels: any run that is 2 pixels long or shorter, measured first along each row and then along each column, gets painted white.
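    Before looking at the full image version, the idea can be tried on a single row. This is a minimal sketch of the same run-length chopping on a []bool (true = black), not the function used in the program; chopRow and render are names I made up for it.

```go
package main

import "fmt"

// chopRow whitens every run of black pixels whose length is maxRun or
// less. true = black, false = white. The row is modified in place.
func chopRow(row []bool, maxRun int) {
	for x := 0; x < len(row); x++ {
		if !row[x] {
			continue // white pixel, nothing to do
		}
		// Measure the run of black pixels starting at x.
		run := 0
		for c := x; c < len(row) && row[c]; c++ {
			run++
		}
		// Short runs are noise: paint them white.
		if run <= maxRun {
			for c := 0; c < run; c++ {
				row[x+c] = false
			}
		}
		// Jump past the run; the loop's x++ then skips the white
		// pixel that ended it.
		x += run
	}
}

// render draws a row with '#' for black and '.' for white.
func render(row []bool) string {
	out := make([]byte, len(row))
	for i, black := range row {
		if black {
			out[i] = '#'
		} else {
			out[i] = '.'
		}
	}
	return string(out)
}

func main() {
	row := []bool{true, false, true, true, false, true, true, true, true, false, true}
	fmt.Println(render(row)) // #.##.####.#
	chopRow(row, 2)
	fmt.Println(render(row)) // .....####..
}
```

    Only the run of four survives. Running the same pass over every row and then over every column removes specks that are thin in either direction, at the cost of also eating letter strokes that are 2 pixels wide or thinner.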

    Here's our complete program once again with modifications to the main function as well as the addition of the chopping function (simply called chop):

    Code:
    package main
    
    
    import (
        "image"
        "os"
        "log"
        "image/color"
        "image/png"
        _ "image/gif"
    )
    
    
    //Creates a new B&W copy of our original image and returns a pointer.
    func turnBW (img image.Image) *image.Gray {
        bounds := img.Bounds() //Get the image bounds. This way we can easily calculate dimensions.
        newImg := image.NewGray(bounds) //Create a new gray scale image
        for y := bounds.Min.Y; y < bounds.Max.Y; y++ { //Range from min Height to max Height
            for x := bounds.Min.X; x < bounds.Max.X; x++ { //Range from min Width to max Width
                r, g, b, _ := img.At(x, y).RGBA() //Get the R, G, B values of the pixel (16 bits per channel, 0..65535)
                if r + 256 * g + b >= 16000000 { //Close to white? (pure white scores 16908030)
                    newImg.Set(x, y, color.White)
                } else {
                    newImg.Set(x, y, color.Black)
                }
            }
        }
        return newImg
    }
    
    //Chops everything below a certain pixel threshold
    func chop (img *image.Gray) *image.Gray {
        bounds := img.Bounds() //Get image dimensions again
        //Here we loop across each row first
        for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
            for x := bounds.Min.X; x < bounds.Max.X; x++ {
                if img.GrayAt(x, y).Y == 255 {continue} //If we encounter a white pixel, simply skip and continue looping (comparing At() with color.White never matches, the types differ)
                buffer := 0 //setup a buffer of target (black) pixels
                for c := x; c < bounds.Max.X; c++ { //range from found target pixel to image width
                    r, g, b, _ := img.At(c, y).RGBA()
                    if r + 256 * g + b < 16000000 { //if not white 
                        buffer ++ //increase buffer
                    } else {
                        break
                    }
                }
    
    
                if buffer <= 2 { //if the run is no longer than our threshold (which is 2), replace it all with white, effectively chopping it
                    for c := 0; c < buffer; c++ {
                        img.Set(x + c, y, color.White)
                    }
                }
    
    
                x += buffer //skipping as we've already modified the above fragment
            }
        }
    
    
        //Here we just repeat everything as above, only looping through the columns instead of the rows.
        for x := bounds.Min.X; x < bounds.Max.X; x++ {
            for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
                if img.GrayAt(x, y).Y == 255 {continue} //White pixel, skip
                buffer := 0
                for c := y; c < bounds.Max.Y; c++ {
                    r, g, b, _ := img.At(x, c).RGBA()
                    if r + 256 * g + b < 16000000 {
                        buffer ++
                    } else {
                        break
                    }
                }
    
    
                if buffer <= 2 {
                    for c := 0; c < buffer; c++ {
                        img.Set(x, y + c, color.White)
                    }
                }
    
    
                y += buffer
            }
        }
        return img
    }
    
    
    //Opens a given image based on path
    func loadImage(path string) image.Image {
        file, err := os.Open(path)
        if err != nil {
            log.Fatal(err)
        }
        defer file.Close()
        img, _, err := image.Decode(file)
        if err != nil {
            log.Fatal(err)
        }
    
    
        return img
    }
    
    
    //Saves a given image to path. Returns error if unsuccessful.
    func saveImage(path string, img image.Image) error {
        out_file, err := os.Create(path)
        if err != nil {
            return err
        }
        defer out_file.Close()
    
    
        err = png.Encode(out_file, img)
        if err != nil {
            return err
        }
        return nil
    }
    
    
    func main (){
        img := loadImage("captcha.gif")
        err := saveImage("processed.png", chop(turnBW(img))) //saveImage writes PNG data, hence the .png name
        if err != nil {
            log.Fatal(err)
        }
    }
    
    After running our program, we get the following output image: bsr4ueG.png

    Perfect! The image is clean! The hardest part is now behind us, now we just need to extract the individual letters and classify them. More on this in the next post.
     
  5. Gophering

    Gophering Junior Member Premium Member

    Well, that's it for today. Unfortunately I've got a bunch of work to finish over here...
    Next, we'll really get to the meat of this whole thread: extracting and classifying letters.

    Any comments or questions are of course welcome.

    I'll try to post the next part either tonight or early tomorrow.
     
  6. zenoGlitch

    zenoGlitch Executive VIP Jr. VIP Premium Member

    Joined:
    Jun 25, 2009
    Messages:
    963
    Likes Received:
    1,511
    Location:
    Thailand
    You're going to fit in just fine here ... welcome to BHW. +14 rep
     
  7. 9to5destroyer

    9to5destroyer Jr. VIP Jr. VIP Premium Member

    Joined:
    Nov 14, 2011
    Messages:
    355
    Likes Received:
    205
    thanks great share i'll be watching this thread closely
     
  8. youngguy

    youngguy Senior Member

    Joined:
    Apr 11, 2009
    Messages:
    1,053
    Likes Received:
    1,560
    Location:
    Hell
    AMAZING!!! Welcome here buddy!
     
  9. Gophering

    Gophering Junior Member Premium Member

    Thanks for the positive feedback everyone, really appreciate it!

    -> zenoGlitch: Glad to be here, thanks for the rep too.

    Extracting Letters

    Alright, so we are getting quite close to the interesting bits. We are definitely on the right track, since extracting relevant content from all the noise is half of the battle. Now, with some (bad) captchas this might be enough to build our corpus data (more on this later), as some captcha scripts don't bother changing individual letters and serve sets of static letters instead, so a corpus of whole sets can be established. Our captcha script, however, (to the best of my knowledge) generates individual letters, so we'll need to dig deeper.

    So next on our agenda: writing letter extraction functions. As you might remember, I've mentioned that letters are conveniently separated by white space. Since we have already cleaned up the noise, we can use this property of the captcha to our advantage.

    Why don't we try scanning the image column by column from left to right, noting where each letter starts and finishes (are we within a set of black pixels?), and storing those positions in a separate array, thus ultimately extracting the letters? Here's our letter extraction function:

    Code:
    //Extracts individual letters
    func extractLetters (img *image.Gray) [][]int {
        var (
            letters = make([][]int, 0) //Our array of array which will hold start/end positions
            inside = false //Are we inside the letter?
            gotletter = false //Did we find the letter?
            start = 0 //Starting position
            end = 0 //Ending position
        )
        bounds := img.Bounds() //Getting image bounds once again
        //Now we walk through the image column by column, from left to right
        for x := bounds.Min.X; x < bounds.Max.X; x++ {
            for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
                r, g, b, _ := img.At(x, y).RGBA()
                if r + 256 * g + b == 0 { //Hit a black pixel?
                    inside = true //We are inside the letter
                }
            }
            if gotletter == false && inside == true { //We've just hit a letter for the first time
                gotletter = true
                start = x //Declare starting position
            }
            if gotletter == true && inside == false { //We've just reached some white pixels again
                gotletter = false
                end = x //Declare the ending position
                /*
                    Here we need to do some additional checks before proceeding further.
                    As you might have noticed, our letters are somewhat patchy due to the image cleaning algorithm.
                    So here, we check if we've actually hit a real letter (something bigger than 4 pixels) and not some random 
                    artifact which might have been left in there...
                */
                if end - start >= 4 {
                    /*
                        Here we do a reverse check of sorts. A letter might be sliced up into two individual parts
                        due to its choppy nature. We want to avoid this, so we check if the last known ending point lies very close
                        (1 <= n <= 2 pixels close) to the current start. If so, we just move the last known ending point to the current one,
                        otherwise we append a new start/end set.
                    */
                    last := len(letters) - 1
                    if last >= 0 && start - letters[last][1] >= 1 && start - letters[last][1] <= 2 {
                        letters[last][1] = end
                    } else {
                        letters = append(letters, []int{start, end})
                    }
                }
            }
            inside=false //Restart
        }
        return letters
    }
    
    Make sure you read through the comments here. We are doing some additional checks in order to determine if we are really extracting a letter or some random, left-over artifact. This function can be fine-tuned further (one could check dimensions of individual sets and so on), but for now this will serve us well as it is.
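    The core of the extraction can be tried without any image at all. The sketch below works on a precomputed column profile (true = the column contains at least one black pixel) and returns the spans it finds. It is a simplification, not a copy of extractLetters: it leaves out the merge step for split letters, and unlike extractLetters it also closes a letter that touches the right edge of the image. The name spans and the profile string are made up for this example.

```go
package main

import "fmt"

// spans returns [start, end) column ranges where cols is true.
// cols[x] is true when column x contains at least one black pixel.
// Ranges narrower than minWidth are dropped as noise.
func spans(cols []bool, minWidth int) [][2]int {
	var out [][2]int
	start := -1 // -1 means "not inside a letter"
	// Run one step past the last column so that a letter touching
	// the right edge is closed as well.
	for x := 0; x <= len(cols); x++ {
		black := x < len(cols) && cols[x]
		switch {
		case black && start < 0:
			start = x // first inked column of a new letter
		case !black && start >= 0:
			if x-start >= minWidth {
				out = append(out, [2]int{start, x})
			}
			start = -1
		}
	}
	return out
}

func main() {
	// '#' marks a column containing ink, '.' an empty column.
	profile := "..#####..#...######"
	cols := make([]bool, len(profile))
	for i := range profile {
		cols[i] = profile[i] == '#'
	}
	fmt.Println(spans(cols, 4)) // [[2 7] [13 19]]
}
```

    The single inked column at position 9 is narrower than 4 pixels and gets dropped, which is the same artifact filter that extractLetters applies.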

    Let's write an additional function to extract sub-images from our main image based on our freshly extracted start/end coordinates, and also adjust our main function slightly.

    Code:
    package main
    
    
    import (
        "image"
        "os"
        "log"
        "image/color"
        "image/png"
        _ "image/gif"
        "strconv"
    )
    
    
    //Creates a new B&W copy of our original image and returns a pointer.
    func turnBW (img image.Image) *image.Gray {
        bounds := img.Bounds() //Get the image bounds. This way we can easily calculate dimensions.
        newImg := image.NewGray(bounds) //Create a new gray scale image
        for y := bounds.Min.Y; y < bounds.Max.Y; y++ { //Range from min Height to max Height
            for x := bounds.Min.X; x < bounds.Max.X; x++ { //Range from min Width to max Width
                r, g, b, _ := img.At(x, y).RGBA() //Get the R, G, B values of the pixel (16 bits per channel, 0..65535)
                if r + 256 * g + b >= 16000000 { //Close to white? (pure white scores 16908030)
                    newImg.Set(x, y, color.White)
                } else {
                    newImg.Set(x, y, color.Black)
                }
            }
        }
        return newImg
    }
    
    
    //Chops everything below a certain pixel threshold
    func chop (img *image.Gray) *image.Gray {
        bounds := img.Bounds() //Get image dimensions again
        //Here we loop across each row first
        for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
            for x := bounds.Min.X; x < bounds.Max.X; x++ {
                if img.GrayAt(x, y).Y == 255 {continue} //If we encounter a white pixel, simply skip and continue looping (comparing At() with color.White never matches, the types differ)
                buffer := 0 //setup a buffer of target (black) pixels
                for c := x; c < bounds.Max.X; c++ { //range from found target pixel to image width
                    r, g, b, _ := img.At(c, y).RGBA()
                    if r + 256 * g + b < 16000000 { //if not white 
                        buffer ++ //increase buffer
                    } else {
                        break
                    }
                }
    
    
                if buffer <= 2 { //if the run is no longer than our threshold (which is 2), replace it all with white, effectively chopping it
                    for c := 0; c < buffer; c++ {
                        img.Set(x + c, y, color.White)
                    }
                }
    
    
                x += buffer //skipping as we've already modified the above fragment
            }
        }
    
    
        //Here we just repeat everything as above, only looping through the columns instead of the rows.
        for x := bounds.Min.X; x < bounds.Max.X; x++ {
            for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
                if img.GrayAt(x, y).Y == 255 {continue} //White pixel, skip
                buffer := 0
                for c := y; c < bounds.Max.Y; c++ {
                    r, g, b, _ := img.At(x, c).RGBA()
                    if r + 256 * g + b < 16000000 {
                        buffer ++
                    } else {
                        break
                    }
                }
    
    
                if buffer <= 2 {
                    for c := 0; c < buffer; c++ {
                        img.Set(x, y + c, color.White)
                    }
                }
    
    
                y += buffer
            }
        }
        return img
    }
    
    
    //Opens a given image based on path
    func loadImage(path string) image.Image {
        file, err := os.Open(path)
        if err != nil {
            log.Fatal(err)
        }
        defer file.Close()
        img, _, err := image.Decode(file)
        if err != nil {
            log.Fatal(err)
        }
    
    
        return img
    }
    
    
    //Saves a given image to path. Returns error if unsuccessful.
    func saveImage(path string, img image.Image) error {
        out_file, err := os.Create(path)
        if err != nil {
            return err
        }
        defer out_file.Close()
    
    
        err = png.Encode(out_file, img)
        if err != nil {
            return err
        }
        return nil
    }
    
    
    //Extracts individual letters
    func extractLetters (img *image.Gray) [][]int {
        var (
            letters = make([][]int, 0) //Our array of array which will hold start/end positions
            inside = false //Are we inside the letter?
            gotletter = false //Did we find the letter?
            start = 0 //Starting position
            end = 0 //Ending position
        )
        bounds := img.Bounds() //Getting image bounds once again
        //Now we walk through the image column by column, from left to right
        for x := bounds.Min.X; x < bounds.Max.X; x++ {
            for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
                r, g, b, _ := img.At(x, y).RGBA()
                if r + 256 * g + b == 0 { //Hit a black pixel?
                    inside = true //We are inside the letter
                }
            }
            if gotletter == false && inside == true { //We've just hit a letter for the first time
                gotletter = true
                start = x //Declare starting position
            }
            if gotletter == true && inside == false { //We've just reached some white pixels again
                gotletter = false
                end = x //Declare the ending position
                /*
                    Here we need to do some additional checks before proceeding further.
                    As you might have noticed, our letters are somewhat patchy due to the image cleaning algorithm.
                    So here, we check if we've actually hit a real letter (something bigger than 4 pixels) and not some random 
                    artifact which might have been left in there...
                */
                if end - start >= 4 {
                    /*
                        Here we do a reverse check of sorts. A letter might be sliced up into two individual parts
                        due to its choppy nature. We want to avoid this, so we check if the last known ending point lies very close
                        (1 <= n <= 2 pixels close) to the current start. If so, we just move the last known ending point to the current one,
                        otherwise we append a new start/end set.
                    */
                    last := len(letters) - 1
                    if last >= 0 && start - letters[last][1] >= 1 && start - letters[last][1] <= 2 {
                        letters[last][1] = end
                    } else {
                        letters = append(letters, []int{start, end})
                    }
                }
            }
            inside=false //Restart
        }
        return letters
    }
    
    
    //Get letter accepts a start/end slice as well as the original image and returns a pointer to a new sub-image object
    func getLetter (pos []int, img *image.Gray) *image.Gray {
        bounds := img.Bounds() //Get the bounds again
        newBounds := image.Rect(pos[0], bounds.Min.Y, pos[1], bounds.Max.Y) //Create new dimensions based on our position data (full height of the source image)
        newImg := image.NewGray(newBounds) //Create new image object to store sub-image in
        for y := newBounds.Min.Y; y < newBounds.Max.Y; y++ { //Loop over pixels
            for x := newBounds.Min.X; x < newBounds.Max.X; x++ {
                r, g, b, _ := img.At(x, y).RGBA() //Populate pixels based on new image bounds
                if r + 256 * g + b == 0 {
                    newImg.Set(x, y, color.Black)
                } else {
                    newImg.Set(x, y, color.White)
                }
            }
        }
        return newImg
    }
    
    
    func main (){
        img := loadImage("captcha.gif") //Our original image
        processed := chop(turnBW(img)) //Our processed image
        for i, pos := range extractLetters(processed) { //Extract letter positions and save each letter into a new image
            err := saveImage("letter_" + strconv.Itoa(i) + ".png" , getLetter(pos, processed)) //saveImage writes PNG data, hence the .png name
            if err != nil {
                log.Fatal(err)
            }
        }
    }
    
    Let's finally run our code and see what happens.
    Results: fndyYLr.jpg

    Great, we got the individual letters! We can now finally start building our corpus as well as our vector space search engine!
    A ton of code for today, but I'll definitely pick this up tomorrow. For now, if you are generally interested in this type of stuff, I can't recommend "Artificial Intelligence With Common Lisp"[1] enough. Skip all the lispy stuff if that's not your thing and concentrate on the actual methodology. It's a great read!

    Code:
    [1]books [.] google [.] xom/books?id=eIbBm7wvTjcC&redir_esc=y (once again "x" = "c")
    
     
  10. Gophering

    Gophering Junior Member Premium Member

    Not sure if my last post is quite clear enough. So if you got any questions guys, I'll be around. Thanks again for all the positive comments, glad I can contribute to this amazing community.
     
  11. ReALeST

    ReALeST Power Member

    Joined:
    May 16, 2012
    Messages:
    584
    Likes Received:
    399
    This is interesting stuff...only came across an OCR tut...thnks alot dude!:) +reped
     
  12. keizer

    keizer Regular Member

    Joined:
    Oct 22, 2008
    Messages:
    373
    Likes Received:
    395
    Hmmm, do you think you will be capable to solve recaptcha? Thanks for this tutorial?
     
  13. Gophering

    Gophering Junior Member Premium Member

    Well, the short answer is "unfortunately no"...
    Recaptcha is notoriously hard to solve and can be difficult, even extremely confusing, for non-machine subjects (humans) too. While a short-term solution (and I mean really, really short term) might be doable, a long-term solution is, I'd say, currently quite out of reach.

    One important thing to realize about recaptcha in general is that the captchas presented are specifically those which an OCR engine (probably one of the better OCR engines lol) couldn't solve already. Recaptcha is digitizing books behind the curtain and serves failed OCR candidates as your captcha image. So in essence you'd need to work towards advancing OCR/AI/neural network technology in general to "break" recaptcha.
     
  14. YouFeelMeDawg?

    YouFeelMeDawg? BANNED BANNED

    Joined:
    Aug 10, 2011
    Messages:
    266
    Likes Received:
    371
    WTF did I just read?
    I am still in disbelief that this AMAZING post is here, and I thought the general programming section was dead.
    Thanks for this thread for reals, I actually did learn something useful in bhw and it has been a while since last time.
    + Rep, when is thread number 2 going to come?
    Seriously we need more of these captcha breaking threads in here, we badly need them.
     
  15. Gophering

    Gophering Junior Member Premium Member

    Haha, well I'm glad I could be of some help.
    Thread deux will come along soon enough. I need to finish this one first, mate, and then we can proceed.
    A general neural network example is needed so we'll get to that in thread two.

    Also, I'd like to demonstrate how much easier it is to crack Flash-based/interactive captchas (e.g. "Win the game"), unless they are very well made. So there's enough content for a couple of threads in there.

    Cheers and thanks for the kind words.
     
  16. keizer

    keizer Regular Member

    Buddy, you are showing the captcha makers how to patch up their loopholes. Am I wrong? ;)
     
  17. Gophering

    Gophering Junior Member Premium Member

    Ha, not quite sure if I am really.
    This is not new stuff by any means, and the methodology is available in several books (well, not applied to captcha breaking of course, as far as I know). Most importantly, we haven't gotten into neural networks yet, which are very interesting by their very nature. So I hope to post that up when time allows.
     
  18. zenoGlitch

    zenoGlitch Executive VIP Jr. VIP Premium Member

    Joined:
    Jun 25, 2009
    Messages:
    963
    Likes Received:
    1,511
    Location:
    Thailand
    Looking forward to it, you're on a roll.

    It's a cat and mouse game Keizer. Posting info here isn't going to change that. ;)
     
  19. Gophering

    Gophering Junior Member Premium Member

    Building Corpus Data

    Alright, so unfortunately I'm a bit busy today, so we probably won't be able to get into vector space search until tomorrow or maybe tonight. However, we can at least start building our corpus data.

    Now, what is corpus data exactly? In simple terms, another word for "corpus" would be "training set": a collection of samples which we can use to "train" our program. Here we could really go crazy and combine our procedures with machine learning, neural networks, etc. However, the best solution is usually the simplest one, so in our case this is really not needed.

    So what should be our approach? Well, why don't we utilize our program's existing capabilities and build our corpus data this way? What we really need is a lot of captchas, which we can then automatically break down into individual letters. The only manual labour that's left to do then is to classify each individual letter. Also, keep in mind that the larger the corpus, the better the result; however, once again, let's start out modest and expand if needed.

    We simply need a downloader function, which would make around 100 requests (for now) to the captcha site and get us our first 100 captchas. I won't post the code for this over here since this is very trivial to do, but basically you should end up with something like this:
    JtXN8o3.png
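
    For anyone who wants a starting point for the downloader anyway, here is a minimal sketch. It is not the code used for the screenshot above, and it rests on a few assumptions: the captcha URL is passed on the command line (the real address is up to you), the site returns a fresh GIF on every plain GET request, and the names captchaPath, fetchCaptcha and downloadCaptchas as well as the captcha_000.gif naming scheme are made up for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"time"
)

// captchaPath returns the file name used for the i-th downloaded captcha.
// The naming scheme is an assumption made for this sketch.
func captchaPath(dir string, i int) string {
	return filepath.Join(dir, fmt.Sprintf("captcha_%03d.gif", i))
}

// fetchCaptcha downloads one image from url and writes it to dst.
func fetchCaptcha(client *http.Client, url, dst string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, resp.Body); err != nil {
		out.Close()
		return err
	}
	return out.Close()
}

// downloadCaptchas requests url n times and saves each response into dir.
// It assumes the server hands out a new captcha on every request.
func downloadCaptchas(url, dir string, n int) error {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	client := &http.Client{Timeout: 15 * time.Second}
	for i := 0; i < n; i++ {
		if err := fetchCaptcha(client, url, captchaPath(dir, i)); err != nil {
			return fmt.Errorf("captcha %d: %w", i, err)
		}
		time.Sleep(500 * time.Millisecond) // short pause between requests
	}
	return nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: downloader <captcha-url>")
		return
	}
	if err := downloadCaptchas(os.Args[1], "./captchas", 100); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

    Sequential file names keep the captchas in download order, which makes it easier to spot a failed or duplicated request later on. If the site needs cookies or a session, the plain client.Get call would have to be adapted.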

    Next we need to feed all those to our main program and break them all down into individual letters. A function to do this may look like this:

    Code:
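    // buildCorpus splits every captcha (.gif) found in dir into individual letter images.
    // Needs the imports: fmt, io/ioutil, log, path/filepath, regexp.
    func buildCorpus(dir string) {
        files, err := ioutil.ReadDir(dir)
        if err != nil {
            log.Fatal(err)
        }
        r := regexp.MustCompile(`\.gif$`) // match the file extension only
        for h, file := range files {
            if !r.MatchString(file.Name()) {
                continue // skip anything that is not a captcha image
            }
            fmt.Println("Processing captcha", h)
            img := loadImage(filepath.Join(dir, file.Name())) // Our original image, read from dir instead of a hardcoded path
            processed := chop(turnBW(img))                     // Our processed image
            // Extract letter positions and save each letter into a new image
            for _, pos := range extractLetters(processed) {
                err := saveImage("./letters/"+randomString(10)+".gif", getLetter(pos, processed))
                if err != nil {
                    log.Fatal(err)
                }
            }
        }
    }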
    
    In the end we should end up with something like this:
    DWWjFiW.png
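
    A side note on the randomString helper that buildCorpus uses for the file names: it isn't shown in this post, so in case you don't have one handy, a stand-in could look roughly like the sketch below. The alphabet and the choice of crypto/rand are assumptions on my part; anything that produces names which are unlikely to collide will do.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// alphabet holds the characters allowed in generated names (an assumption for this sketch).
const alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"

// randomString returns n random characters taken from alphabet.
// crypto/rand needs no seeding, so two runs never produce the same sequence of names.
func randomString(n int) string {
	buf := make([]byte, n)
	if _, err := rand.Read(buf); err != nil {
		panic(err) // the OS entropy source is unavailable, nothing sensible to do
	}
	for i, b := range buf {
		// The modulo introduces a tiny bias, which is irrelevant for file names.
		buf[i] = alphabet[int(b)%len(alphabet)]
	}
	return string(buf)
}

func main() {
	fmt.Println(randomString(10) + ".gif")
}
```

    With 36 possible characters and a length of 10, the chance of two letters ending up with the same file name is negligible for a corpus of this size.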

    So now comes the laborious part. Create a new directory and call it something like "corpus". Next, go through each letter image and move it into the subdirectory for its character inside the corpus folder (one subdirectory per character). Something like this:

    q95MR4s.png

    Repeat this until you have collected all the numbers (probably 0-9) as well as all the letters.
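
    Once the folders are filled, the directory names themselves act as the labels. As a rough sketch of how the program could read that layout back in later (the names loadCorpusIndex and isLetterImage are hypothetical, and the "./corpus" path simply follows the suggestion above), something like this would do:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// isLetterImage reports whether name looks like one of our letter images.
func isLetterImage(name string) bool {
	return strings.HasSuffix(strings.ToLower(name), ".gif")
}

// loadCorpusIndex maps each label (subdirectory name) to the image paths inside it.
func loadCorpusIndex(corpusDir string) (map[string][]string, error) {
	index := make(map[string][]string)
	entries, err := os.ReadDir(corpusDir) // os.ReadDir needs Go 1.16 or newer
	if err != nil {
		return nil, err
	}
	for _, entry := range entries {
		if !entry.IsDir() {
			continue // files left in the corpus root are not classified yet
		}
		label := entry.Name()
		files, err := os.ReadDir(filepath.Join(corpusDir, label))
		if err != nil {
			return nil, err
		}
		for _, f := range files {
			if !f.IsDir() && isLetterImage(f.Name()) {
				index[label] = append(index[label], filepath.Join(corpusDir, label, f.Name()))
			}
		}
	}
	return index, nil
}

func main() {
	index, err := loadCorpusIndex("./corpus")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	labels := make([]string, 0, len(index))
	for label := range index {
		labels = append(labels, label)
	}
	sort.Strings(labels)
	for _, label := range labels {
		fmt.Printf("%s: %d samples\n", label, len(index[label]))
	}
}
```

    Printing the number of samples per label is a quick way to spot characters that are still missing or underrepresented before moving on.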

    That's it for now. We are almost done; next we'll be defining our vector space search engine as well as teaching the program correct classifications.
     
  20. jazzc

    jazzc Moderator Staff Member Moderator Jr. VIP

    Joined:
    Jan 27, 2009
    Messages:
    2,468
    Likes Received:
    10,143
    To build a big corpus and also avoid the manual work, one can send the letter images to a decaptcha service. :)
     