General OCR Tut (Vector Space Method)

Gophering · Mar 22, 2013

Ok, so I've been meaning to write an OCR tut for quite some time (long before coming across BHW). After checking out another thread in the "C, C++, C#" subforum, titled "C# - OCR - Tutorial" (great thread btw) I've decided to finally get this going and write down a theoretical/practical guide.

This might turn out to be somewhat long, as we'll need to cover a couple of things here, so I'll probably break this down into multiple posts. All the information will be kept in this thread. I might make references to external sites as well as post code snippets to pastebin, etc., but generally speaking, you should be good just by following this thread.

Alright, let's get started!

Requirements

Now, what's needed in order to follow this tut? Firstly, I won't be using any external OCR/OCR-like engines (so no tesseract, etc.), we'll be building our own. Secondly, I'll be using Golang[1] (that Google language) to illustrate the methods involved. Why not C# or C or C++ or "insert language here" you ask? Well, quite honestly this can be accomplished in any more or less mature language. I chose Go because this way we won't waste time writing boilerplate image processing code, etc. Further on, the syntax is very readable, especially if you are coming from a C background. Finally, I really want people to learn something from this rather than just copy-paste it into their projects. So, read the code, study the material and you should be good to go.

If you've never worked with Go before, feel free to follow the excellent official tutorial[2]. Just familiarize yourself with the syntax so that you can follow along.

Code:

[1] golang [.] org
[2] tour [.] golang [.] org

The Process

Next, let's have a look at the overall process involved.
We'll obviously need a captcha to crack first. It's best to start out slow and simple in that regard, so don't jump into cracking Recaptcha, you won't succeed. I'll provide a generic b&w captcha with some distortion/obfuscation later down the line, feel free to use your own though.

Generally speaking, any OCR process involves the preparation of the image (simplification, cleaning), the separation of interesting content and the recognition of said content. So, in our case we'd need to get our captcha image, reduce its colour range to make things a bit easier, clean it, extract the text, break down the text into standalone letters, classify the letters, get a confidence rating on each letter and print out the result.

Note that, in terms of accuracy, there's usually no need to go for a very high rate. A 20% accuracy rate is already enough, as there's usually no penalty for a wrong guess. So for our purpose we'll consider a 20% accuracy rate a success. Of course, there are various things we can do to improve upon this, which I'll discuss later down the line.

Classification

How do we teach a machine to extract and more importantly recognize characters in an image? Commonly neural networks[1] are employed to solve this kind of problem. Without going into too many details and ultimately boring you to death (lol), a neural network is an adaptive algorithm designed to solve a computational problem without the programmer being specifically aware of the right solution involved. In other words, a number of solutions are derived from a constant adjustment/learning process.

A neural network can be represented like so:

input -> magic -> output

Obviously the magic part is where the machine learning happens and a number of solutions with varying degrees of accuracy are computed. Again, this is a very superficial explanation and I encourage you to read [1].

Interestingly enough however, we won't be using neural networks to solve our problem here. While neural networks will probably be much better suited towards solving more complicated captchas we should be able to achieve a fairly high accuracy rate by utilizing a vector space search engine[2]. In other words, we already know the right process towards solving this problem, so we'll utilize a somewhat "hacky"/fast solution instead of trying to write a more generic alternative.
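To give a rough idea of what "vector space" means in practice: an image can be flattened into a vector of pixel values, and two vectors can be compared by the cosine of the angle between them. The snippet below is only a minimal sketch of that comparison under my own assumptions (the function name cosine and the toy vectors are made up for illustration), not the engine we'll end up with.

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors.
// 1 means they point in the same direction, 0 means nothing in common.
// It returns 0 if the lengths differ or if either vector is all zeros.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	sample := []float64{1, 0, 1, 1} // toy "pixel" vector, 1 = black
	known := []float64{1, 0, 1, 0}  // toy vector of an already classified letter
	fmt.Printf("%.3f\n", cosine(sample, known))
}
```

The higher the score against a known sample, the more likely the unknown letter is the same character.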

Code:

[1] wikipedia [.] org/wiki/Artificial_neural_network
[2] la2600 [.] org/talks/files/20040102/Vector_Space_Search_Engine_Theory.pdf

Gophering · Mar 22, 2013

More will follow during my lunch break. For now, familiarize yourself with Golang everybody. Cheers and any comments are always welcome.

Gophering · Mar 22, 2013

First Steps

Alright, as promised, here's the next part! Keep in mind that I don't have anything prewritten really, so I'll be coding this up in Go while I'm writing this, so I'm not completely sure how deep into it we'll really make it today... Regardless, let's get coding.

So firstly, let's get ourselves a somewhat non-trivial captcha. By non-trivial I mean something not just b&w, as that would be very trivial to solve and I'd like to show off as many of the techniques and as much of the thought process involved as possible.

Let's get started with this captcha here:

The captcha can also be accessed via [1]

Let's analyze the image real quick. Keep in mind that our goal is a clean extraction of the interesting content (the letters) from the image. We notice that the letters are not interconnected: each letter is separated by white space, which is good. We also notice that we are dealing with a monochrome image here, which is not a bad thing either. Finally, we notice that the whole image is obfuscated by little dots or some such, which might be a problem later down the line.

Let's get coding. Firstly, we will need several helper functions for image reading/writing and so on.

Code:

package main


import (
    "image"
    "os"
    "log"
    "image/png"
    _ "image/gif"
)


//Opens a given image based on path
func loadImage(path string) image.Image {
    file, err := os.Open(path)
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()
    img, _, err := image.Decode(file)
    if err != nil {
        log.Fatal(err)
    }


    return img
}


//Saves a given image to path. Returns error if unsuccessful.
func saveImage(path string, img image.Image) error {
    out_file, err := os.Create(path)
    if err != nil {
        return err
    }
    defer out_file.Close()


    err = png.Encode(out_file, img)
    if err != nil {
        return err
    }
    return nil
}


func main (){
    
}

Ok, so what we've done above is pretty uneventful. We declared our main package, imported a couple of image processing as well as other packages from Go's standard library and declared two functions, loadImage and saveImage. The former opens an image for us, the latter saves it to disk.

Let's go on. Although the image is monochrome already, it would be nice to work with black and white only. So let's write a function which will return a new b&w copy of the original image.

Code:

//Creates a new B&W copy of our original image and returns a pointer. Needs "image/color" in the import list.
func turnBW (img image.Image) *image.Gray {
    bounds := img.Bounds() //Get the image bounds. This way we can easily calculate dimensions.
    newImg := image.NewGray(bounds) //Create a new gray scale image with the same dimensions as the original image.
    for y := bounds.Min.Y; y < bounds.Max.Y; y++ { //Range from min Height to max Height
        for x := bounds.Min.X; x < bounds.Max.X; x++ { //Range from min Width to max Width
            r, g, b, _ := img.At(x, y).RGBA() //Get the colour channels of the pixel (each ranges from 0 to 65535)
            if r + 256 * g + b >= 16000000 { //If bright (close to white); pure white gives 16908030
                newImg.Set(x, y, color.White)
            } else {
                newImg.Set(x, y, color.Black)
            }
        }
    }
    return newImg
}

What we do here is again fairly trivial. Firstly, we get the dimensions of the original image by calling the Bounds method. Then we create a new grayscale image with the same dimensions. Finally, we range over each pixel in the original image, get its colour channels, and depending on whether it is bright or dark, set the matching pixel in our new grayscale image to white or black.
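A side note on the brightness test: the r + 256 * g + b sum does the job for this captcha, but it is dominated by the green channel. A more conventional alternative is to convert the pixel to gray first and compare against a cutoff. The sketch below shows that variant using the standard library's color.GrayModel; the name isDark and the cutoff parameter are my own assumptions, not something the code above uses.

```go
package main

import (
	"fmt"
	"image/color"
)

// isDark reports whether an 8-bit RGB pixel is darker than cutoff (0-255)
// once it has been converted to gray by the standard library's GrayModel.
func isDark(r, g, b, cutoff uint8) bool {
	gray := color.GrayModel.Convert(color.RGBA{R: r, G: g, B: b, A: 255}).(color.Gray)
	return gray.Y < cutoff
}

func main() {
	fmt.Println(isDark(0, 0, 0, 128))       // black pixel
	fmt.Println(isDark(255, 255, 255, 128)) // white pixel
}
```

With a helper like this, every channel contributes according to its perceived brightness, so a coloured captcha would not need a hand-tuned magic number.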

Let's modify our main function and run the code:

Code:

func main (){
    img := loadImage("captcha.gif")
    err := saveImage("processed.png", turnBW(img)) //saveImage writes PNG data, so use a .png name
    if err != nil {
        log.Fatal(err)
    }
}

and the result of running our code:

Next, we'll need to figure out how to clean the background. More on this in the next post.

Code:

[1] snaphost [.] xom/captcha/WebForm.aspx?id=QAP85CQKE9T9&rad=1363954778626 (replace "x" with "c" here. Sorry BHW preventing me from posting proper URLs...)

Gophering · Mar 22, 2013

Chopping

As mentioned in my previous post, we need to figure out a way to remove all those artifacts from the background of the image in order to make text extraction itself as smooth and accurate as possible. There's however a slight complication: we cannot rely on colour differences anymore. We are working with a monochrome image, and we cannot simply remove everything that is black since this would also remove our text. So we'll need to figure out another approach.

We need to design a so-called chopping algorithm. In other words, we need an algorithm which detects runs of consecutive target pixels (in our case black pixels), leaves the long runs in the image and chops everything else. We can set a threshold of 2 pixels: any run of 2 pixels or fewer, measured along a row or along a column, will be chopped away.
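To make the idea concrete before the real thing, here is a minimal sketch of the same run-chopping logic on a single row, where true stands for a black pixel. The name chopRuns and the bool representation are assumptions made for this illustration only.

```go
package main

import "fmt"

// chopRuns clears every run of consecutive true (black) cells whose length
// is at most maxRun and leaves longer runs untouched. It edits row in place.
func chopRuns(row []bool, maxRun int) {
	for i := 0; i < len(row); {
		if !row[i] {
			i++
			continue
		}
		j := i
		for j < len(row) && row[j] { // find the end of the current run
			j++
		}
		if j-i <= maxRun { // short run: treat it as noise
			for k := i; k < j; k++ {
				row[k] = false
			}
		}
		i = j
	}
}

func main() {
	row := []bool{true, false, true, true, false, true, true, true, true}
	chopRuns(row, 2)
	fmt.Println(row)
}
```

The full version does exactly this for every row of the image and then once more for every column.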

Here's our complete program once again with modifications to the main function as well as the addition of the chopping function (simply called chop):

Code:

package main


import (
    "image"
    "os"
    "log"
    "image/color"
    "image/png"
    _ "image/gif"
)


//Creates a new B&W copy of our original image and returns a pointer.
func turnBW (img image.Image) *image.Gray {
    bounds := img.Bounds() //Get the image bounds. This way we can easily calculate dimensions.
    newImg := image.NewGray(bounds) //Create a new gray scale image
    for y := bounds.Min.Y; y < bounds.Max.Y; y++ { //Range from min Height to max Height
        for x := bounds.Min.X; x < bounds.Max.X; x++ { //Range from min Width to max Width
            r, g, b, _ := img.At(x, y).RGBA() //Get the colour channels of the pixel (each ranges from 0 to 65535)
            if r + 256 * g + b >= 16000000 { //If bright (close to white); pure white gives 16908030
                newImg.Set(x, y, color.White)
            } else {
                newImg.Set(x, y, color.Black)
            }
        }
    }
    return newImg
}

//Chops everything below a certain pixel threshold
func chop (img *image.Gray) *image.Gray {
    bounds := img.Bounds() //Get image dimensions again
    //Here we loop across each row first
    for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
        for x := bounds.Min.X; x < bounds.Max.X; x++ {
            if img.GrayAt(x, y).Y == 0xff {continue} //If we encounter a white pixel, simply skip and continue looping (img.At(x, y) == color.White never matches on a Gray image)
            buffer := 0 //setup a buffer of target (black) pixels
            for c := x; c < bounds.Max.X; c++ { //range from found target pixel to image width
                r, g, b, _ := img.At(c, y).RGBA()
                if r + 256 * g + b < 16000000 { //if not white 
                    buffer ++ //increase buffer
                } else {
                    break
                }
            }


            if buffer <= 2 { //if the run is no longer than our threshold (which is 2) replace it all with white, effectively chopping it
                for c := 0; c < buffer; c++ {
                    img.Set(x + c, y, color.White)
                }
            }


            x += buffer //skipping as we've already modified the above fragment
        }
    }


    //Here we just repeat everything as above, only looping through the columns instead of the rows.
    for x := bounds.Min.X; x < bounds.Max.X; x++ {
        for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
            if img.GrayAt(x, y).Y == 0xff {continue}
            buffer := 0
            for c := y; c < bounds.Max.Y; c++ {
                r, g, b, _ := img.At(x, c).RGBA()
                if r + 256 * g + b < 16000000 {
                    buffer ++
                } else {
                    break
                }
            }


            if buffer <= 2 {
                for c := 0; c < buffer; c++ {
                    img.Set(x, y + c, color.White)
                }
            }


            y += buffer
        }
    }
    return img
}


//Opens a given image based on path
func loadImage(path string) image.Image {
    file, err := os.Open(path)
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()
    img, _, err := image.Decode(file)
    if err != nil {
        log.Fatal(err)
    }


    return img
}


//Saves a given image to path. Returns error if unsuccessful.
func saveImage(path string, img image.Image) error {
    out_file, err := os.Create(path)
    if err != nil {
        return err
    }
    defer out_file.Close()


    err = png.Encode(out_file, img)
    if err != nil {
        return err
    }
    return nil
}


func main (){
    img := loadImage("captcha.gif")
    err := saveImage("processed.png", chop(turnBW(img))) //saveImage writes PNG data, so use a .png name
    if err != nil {
        log.Fatal(err)
    }
}

After running our program, we get the following output image:

Perfect! The image is clean! The hardest part is now behind us, now we just need to extract the individual letters and classify them. More on this in the next post.

Gophering · Mar 22, 2013

Well, that's it for today. Unfortunately I've got a bunch of work to finish over here...
Next, we'll really get to the meat of this whole thread: extracting and classifying letters.

Any comments or questions are of course welcome.

I'll try to post the next part either tonight or early tomorrow.

zenoGlitch · Mar 22, 2013

You're going to fit in just fine here ... welcome to BHW. +14 rep

9to5destroyer · Mar 22, 2013

thanks great share i'll be watching this thread closely

youngguy · Mar 22, 2013

AMAZING!!! Welcome here buddy!

Gophering · Mar 22, 2013

Thanks for the positive feedback everyone, really appreciate it!

-> zenoGlitch: Glad to be here, thanks for the rep too.

Extracting Letters

Alright, so we are getting quite close to the interesting bits. We are definitely on the right track, since extracting relevant content from all the noise is half of the battle. Now, with some (bad) captchas this might be enough to build our corpus data (more on this later) as some captcha scripts don't bother changing individual letters and serve sets of static letters instead, thus a corpus of whole sets can be established. Our captcha script however (to the best of my knowledge) generates individual letters so we'll need to dig deeper.

So next on our agenda: writing letter extraction functions. As you might remember, I've mentioned that letters are conveniently separated by white space. Since we have already cleaned up the noise we can use the above property of the captcha to our advantage.

Why don't we try scanning the image column by column from left to right, noting where each letter starts and finishes (are we within a set of black pixels?), and storing those positions in a separate array, thus ultimately extracting the letters? Here's our letter extraction function:

Code:

//Extracts individual letters
func extractLetters (img *image.Gray) [][]int {
    var (
        letters = make([][]int, 0) //Our array of array which will hold start/end positions
        inside = false //Are we inside the letter?
        gotletter = false //Did we find the letter?
        start = 0 //Starting position
        end = 0 //Ending position
    )
    bounds := img.Bounds() //Getting image bounds once again
    //Now we need to scan the image column by column, from left to right
    for x := bounds.Min.X; x < bounds.Max.X; x++ {
        for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
            r, g, b, _ := img.At(x, y).RGBA()
            if r + 256 * g + b == 0 { //Hit a black pixel?
                inside = true //We are inside the letter
            }
        }
        if gotletter == false && inside == true { //We've just hit a letter for the first time
            gotletter = true
            start = x //Declare starting position
        }
        if gotletter == true && inside == false { //We've just reached some white pixels again
            gotletter = false
            end = x //Declare the ending position
            /*
                Here we need to do some additional checks before proceeding further.
                As you might have noticed, our letters are somewhat patchy due to the image cleaning algorithm.
                So here, we check if we've actually hit a real letter (something bigger than 4 pixels) and not some random 
                artifact which might have been left in there...
            */
            if end - start >= 4 {
                /*
                    Here we do a reverse check of sorts. A letter might be sliced up into two individual parts
                    due to its choppy nature. We want to avoid this, so we check if the last known ending point lies very close 
                    (1 <= n <= 2 pixels close) to the current start, if so then we just change the last known ending point to the current one
                    else we append a new start/end set.
                */
                switch {
                case len(letters) > 0:
                    diff := start - letters[len(letters) - 1][1]
                    if diff >= 1 && diff <= 2 {
                        letters[len(letters) - 1][1] = end
                        break;
                    } 
                    fallthrough;
                default:
                    letters = append(letters, []int{start, end})
                }
            }
        }
        inside=false //Restart
    }
    return letters
}

Make sure you read through the comments here. We are doing some additional checks in order to determine if we are really extracting a letter or some random, left-over artifact. This function can be fine-tuned further (one could check dimensions of individual sets and so on), but for now this will serve us well as it is.
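If the function above looks dense, the core of it boils down to finding runs of "columns that contain black" and dropping the runs that are too narrow. Here is a stripped-down sketch of just that part, working on one flag per column. The name segments and the flag slice are assumptions for illustration, and the merging of nearby fragments is deliberately left out.

```go
package main

import "fmt"

// segments takes one flag per image column (true = the column contains at
// least one black pixel) and returns [start, end) pairs for every run of
// true columns that is at least minWidth wide.
func segments(cols []bool, minWidth int) [][2]int {
	var out [][2]int
	start := -1 // -1 means we are currently not inside a run
	for x := 0; x <= len(cols); x++ {
		black := x < len(cols) && cols[x] // the extra step past the end closes a run at the edge
		switch {
		case black && start < 0:
			start = x
		case !black && start >= 0:
			if x-start >= minWidth {
				out = append(out, [2]int{start, x})
			}
			start = -1
		}
	}
	return out
}

func main() {
	cols := []bool{false, true, true, true, true, false, true, false, true, true, true, true, true}
	fmt.Println(segments(cols, 4))
}
```

Note that this sketch also closes a run that touches the right edge of the image, which is one of the fine-tuning points worth considering.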

Let's write an additional function to extract sub-images from our main image based on our freshly extracted start/end coordinates and also adjust our main function slightly.

Code:

package main


import (
    "image"
    "os"
    "log"
    "image/color"
    "image/png"
    _ "image/gif"
    "strconv"
)


//Creates a new B&W copy of our original image and returns a pointer.
func turnBW (img image.Image) *image.Gray {
    bounds := img.Bounds() //Get the image bounds. This way we can easily calculate dimensions.
    newImg := image.NewGray(bounds) //Create a new gray scale image
    for y := bounds.Min.Y; y < bounds.Max.Y; y++ { //Range from min Height to max Height
        for x := bounds.Min.X; x < bounds.Max.X; x++ { //Range from min Width to max Width
            r, g, b, _ := img.At(x, y).RGBA() //Get the colour channels of the pixel (each ranges from 0 to 65535)
            if r + 256 * g + b >= 16000000 { //If bright (close to white); pure white gives 16908030
                newImg.Set(x, y, color.White)
            } else {
                newImg.Set(x, y, color.Black)
            }
        }
    }
    return newImg
}


//Chops everything below a certain pixel threshold
func chop (img *image.Gray) *image.Gray {
    bounds := img.Bounds() //Get image dimensions again
    //Here we loop across each row first
    for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
        for x := bounds.Min.X; x < bounds.Max.X; x++ {
            if img.GrayAt(x, y).Y == 0xff {continue} //If we encounter a white pixel, simply skip and continue looping (img.At(x, y) == color.White never matches on a Gray image)
            buffer := 0 //setup a buffer of target (black) pixels
            for c := x; c < bounds.Max.X; c++ { //range from found target pixel to image width
                r, g, b, _ := img.At(c, y).RGBA()
                if r + 256 * g + b < 16000000 { //if not white 
                    buffer ++ //increase buffer
                } else {
                    break
                }
            }


            if buffer <= 2 { //if the run is no longer than our threshold (which is 2) replace it all with white, effectively chopping it
                for c := 0; c < buffer; c++ {
                    img.Set(x + c, y, color.White)
                }
            }


            x += buffer //skipping as we've already modified the above fragment
        }
    }


    //Here we just repeat everything as above, only looping through the columns instead of the rows.
    for x := bounds.Min.X; x < bounds.Max.X; x++ {
        for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
            if img.GrayAt(x, y).Y == 0xff {continue}
            buffer := 0
            for c := y; c < bounds.Max.Y; c++ {
                r, g, b, _ := img.At(x, c).RGBA()
                if r + 256 * g + b < 16000000 {
                    buffer ++
                } else {
                    break
                }
            }


            if buffer <= 2 {
                for c := 0; c < buffer; c++ {
                    img.Set(x, y + c, color.White)
                }
            }


            y += buffer
        }
    }
    return img
}


//Opens a given image based on path
func loadImage(path string) image.Image {
    file, err := os.Open(path)
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()
    img, _, err := image.Decode(file)
    if err != nil {
        log.Fatal(err)
    }


    return img
}


//Saves a given image to path. Returns error if unsuccessful.
func saveImage(path string, img image.Image) error {
    out_file, err := os.Create(path)
    if err != nil {
        return err
    }
    defer out_file.Close()


    err = png.Encode(out_file, img)
    if err != nil {
        return err
    }
    return nil
}


//Extracts individual letters
func extractLetters (img *image.Gray) [][]int {
    var (
        letters = make([][]int, 0) //Our array of array which will hold start/end positions
        inside = false //Are we inside the letter?
        gotletter = false //Did we find the letter?
        start = 0 //Starting position
        end = 0 //Ending position
    )
    bounds := img.Bounds() //Getting image bounds once again
    //Now we need to scan the image column by column, from left to right
    for x := bounds.Min.X; x < bounds.Max.X; x++ {
        for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
            r, g, b, _ := img.At(x, y).RGBA()
            if r + 256 * g + b == 0 { //Hit a black pixel?
                inside = true //We are inside the letter
            }
        }
        if gotletter == false && inside == true { //We've just hit a letter for the first time
            gotletter = true
            start = x //Declare starting position
        }
        if gotletter == true && inside == false { //We've just reached some white pixels again
            gotletter = false
            end = x //Declare the ending position
            /*
                Here we need to do some additional checks before proceeding further.
                As you might have noticed, our letters are somewhat patchy due to the image cleaning algorithm.
                So here, we check if we've actually hit a real letter (something bigger than 4 pixels) and not some random 
                artifact which might have been left in there...
            */
            if end - start >= 4 {
                /*
                    Here we do a reverse check of sorts. A letter might be sliced up into two individual parts
                    due to its choppy nature. We want to avoid this, so we check if the last known ending point lies very close 
                    (1 <= n <= 2 pixels close) to the current start, if so then we just change the last known ending point to the current one
                    else we append a new start/end set.
                */
                switch {
                case len(letters) > 0:
                    diff := start - letters[len(letters) - 1][1]
                    if diff >= 1 && diff <= 2 {
                        letters[len(letters) - 1][1] = end
                        break;
                    } 
                    fallthrough;
                default:
                    letters = append(letters, []int{start, end})
                }
            }
        }
        inside=false //Restart
    }
    return letters
}


//Get letter accepts a start/end slice as well as the original image and returns a pointer to a new sub-image object
func getLetter (pos []int, img *image.Gray) *image.Gray {
    bounds := img.Bounds() //Get the bounds again
    newBounds := image.Rect(pos[0], bounds.Min.Y, pos[1], bounds.Max.Y) //Create new dimensions based on our position data
    newImg := image.NewGray(newBounds) //Create new image object to store sub-image in
    for y := newBounds.Min.Y; y < newBounds.Max.Y; y++ { //Loop over pixels
        for x := newBounds.Min.X; x < newBounds.Max.X; x++ {
            r, g, b, _ := img.At(x, y).RGBA() //Populate pixels based on new image bounds
            if r + 256 * g + b == 0 {
                newImg.Set(x, y, color.Black)
            } else {
                newImg.Set(x, y, color.White)
            }
        }
    }
    return newImg
}


func main (){
    img := loadImage("captcha.gif") //Our original image
    processed := chop(turnBW(img)) //Our processed image
    for i, pos := range extractLetters(processed) { //Extract letter positions and save each letter into a new image
        err := saveImage("letter_" + strconv.Itoa(i) + ".png" , getLetter(pos, processed)) //saveImage writes PNG data, so use a .png name
        if err != nil {
            log.Fatal(err)
        }
    }
}
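As a side note, *image.Gray in the standard library also has a SubImage method, which crops without copying any pixels. Below is a minimal sketch of how a crop by columns could look with it. The names cropColumns and cropSize are assumptions of mine and are not used anywhere else in this thread.

```go
package main

import (
	"fmt"
	"image"
)

// cropColumns returns the part of img between columns x0 (inclusive) and
// x1 (exclusive) at full height. The result shares its pixels with img.
func cropColumns(img *image.Gray, x0, x1 int) *image.Gray {
	b := img.Bounds()
	return img.SubImage(image.Rect(x0, b.Min.Y, x1, b.Max.Y)).(*image.Gray)
}

// cropSize is a tiny demo helper: it crops a blank w x h image and
// returns the width and height of the crop.
func cropSize(w, h, x0, x1 int) (int, int) {
	c := cropColumns(image.NewGray(image.Rect(0, 0, w, h)), x0, x1)
	return c.Bounds().Dx(), c.Bounds().Dy()
}

func main() {
	w, h := cropSize(100, 30, 10, 25)
	fmt.Println(w, h)
}
```

Keep in mind that the crop keeps the coordinates of the original image and that changing a pixel in the crop changes the original too, so the copying approach in getLetter is the safer choice if you plan to modify the letters afterwards.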

Let's finally run our code and see what happens.
Results:

Great, we got the individual letters! We can now finally start building our corpus as well as our vector space search engine!
A ton of code for today, but I'll definitely pick this up tomorrow. For now, if you are generally interested in this type of stuff I can't recommend "Artificial Intelligence With Common Lisp"[1] enough. Skip all the lispy stuff if that's not your thing and concentrate on the actual methodology. It's a great read!

Code:

[1]books [.] google [.] xom/books?id=eIbBm7wvTjcC&redir_esc=y (once again "x" = "c")

Gophering · Mar 22, 2013

Not sure if my last post is quite clear enough. So if you got any questions guys, I'll be around. Thanks again for all the positive comments, glad I can contribute to this amazing community.

ReALeST · Mar 22, 2013

This is interesting stuff...only came across an OCR tut...thnks alot dude!

+reped

keizer · Mar 22, 2013

Hmmm, do you think you will be capable to solve recaptcha? Thanks for this tutorial?

Gophering · Mar 22, 2013

keizer said:
Hmmm, do you think you will be capable to solve recaptcha? Thanks for this tutorial?

Well, the short answer is "unfortunately no"...
Recaptcha is notoriously hard to solve and can be difficult, even extremely confusing, for non-machine subjects (humans) too. While a short-term solution (and I mean really, really short term) might be doable, a long-term solution is, I'd say, currently quite out of reach.

One important thing to realize about Recaptcha in general is that the captchas presented are specifically those which an OCR engine (probably one of the better OCR engines lol) couldn't solve already. Recaptcha is digitizing books behind the curtain and serves failed OCR candidates as your captcha image. So in essence you'd need to work towards advancing OCR/AI/Neural Network technology in general to "break" Recaptcha.

YouFeelMeDawg? · Mar 22, 2013

WTF did I just read?
I am still in disbelief that this AMAZING post is here, and I thought the general programming section was dead.
Thanks for this thread for reals, I actually did learn something useful in bhw and it has been a while since last time.
+ Rep, when is thread number 2 going to come?
Seriously we need more of these captcha breaking threads in here, we badly need them.

Gophering · Mar 22, 2013

Haha, well I'm glad I could be of some help.
Thread deux will come along soon enough. I need to finish this one first mate and then we can proceed.
A general neural network example is needed so we'll get to that in thread two.

Also, I'd like to demonstrate how much easier it is to crack flash based/interactive captchas (e.g. "Win the game"), unless they are very well made. So there's enough content for a couple of threads in there.

Cheers and thanks for the kind words.

keizer · Mar 22, 2013

Buddy you are showing the path to patch up the loopholes of captcha makers. Am I wrong?

Gophering · Mar 22, 2013

keizer said:
Buddy you are showing the path to patch up the loopholes of captcha makers. Am I wrong?

Ha, not quite sure if I am really.
This is not new stuff by any means and the methodology is available in several books (well not applied to captcha breaking of course, as far as I know). Most importantly we haven't gotten into neural networks yet, which are very interesting by their very nature. So I hope to post that up when time allows.

zenoGlitch · Mar 23, 2013

Gophering said:
Ha, not quite sure if I am really.
This is not new stuff by any means and the methodology is available in several books (well not applied to captcha breaking of course, as far as I know). Most importantly we haven't gotten into neural networks yet, which are very interesting by their very nature. So I hope to post that up when time allows.

Looking forward to it, you're on a roll.

It's a cat and mouse game Keizer. Posting info here isn't going to change that.

Gophering · Mar 23, 2013

Building Corpus Data

Alright, so unfortunately I'm a bit busy today, so we probably won't be able to get into vector space search until tomorrow or maybe tonight. However, we can at least start building our corpus data.

Now, what is corpus data exactly? In simple terms, another word for "corpus" would be "training set": a collection of sets/samples which we can use to "train" our program. Here we could really go crazy and combine our procedures with machine learning, neural networks, etc. However, the best type of solution is the easiest one, so in our case this is really not needed.

So what should be our approach? Well, why don't we utilize our program's existing capabilities and build our corpus data this way? What we really need is a lot of captchas, which we can then automatically break down into individual letters. The only manual labour that's left to do then is to categorize/classify each individual letter. Also, keep in mind that the larger the corpus the better the result. However, once again, let's start out modest and expand if needed.

We simply need a downloader function, which would make around 100 requests (for now) to the captcha site and get us our first 100 captchas. I won't post the code for this over here since this is very trivial to do, but basically you should end up with something like this:

Next we need to feed all those to our main program and break them all down into individual letters. A function to do this may look like this:

Code:

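//Builds corpus data. Needs "fmt", "io/ioutil", "path/filepath" and "regexp" in the import list.
//The ./letters directory must already exist. randomString is assumed to return a random
//string of the given length, used here as a unique file name.
func buildCorpus(dir string){
    files, err := ioutil.ReadDir(dir)
    if err != nil {
        log.Fatal(err)
    }
    r := regexp.MustCompile(`\.gif$`) //Only pick up the downloaded captchas
    for h, file := range files {
        if !r.MatchString(file.Name()) {
            continue
        }
        fmt.Println("Processing captcha", h)
        img := loadImage(filepath.Join(dir, file.Name())) //Our original image
        processed := chop(turnBW(img)) //Our processed image
        for _, pos := range extractLetters(processed) { //Extract letter positions and save each letter into a new image
            err := saveImage("./letters/" + randomString(10) + ".png", getLetter(pos, processed)) //saveImage writes PNG data
            if err != nil {
                log.Fatal(err)
            }
        }
    }
}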
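The randomString helper called above is not part of any listing in this thread. One possible way to write it, purely as an assumption on my part, is to pull random bytes from crypto/rand and hex-encode them:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// randomString returns n random lowercase hex characters.
// It panics if the system's random source fails.
func randomString(n int) string {
	buf := make([]byte, (n+1)/2) // every byte yields two hex characters
	if _, err := rand.Read(buf); err != nil {
		panic(err)
	}
	return hex.EncodeToString(buf)[:n]
}

func main() {
	fmt.Println(randomString(10))
}
```

Any scheme that gives each letter image a unique file name would work just as well, a simple counter included.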

In the end we should end up with something like this:

So now comes the laborious part. Create a new directory and call it something like "corpus", next go through each letter and put it in its own sub directory inside the corpus folder. Something like this:

Repeat this until you have collected all the numbers (probably 0-9) as well as all the letters.
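Once the letters are sorted into sub directories, the directory name can double as the label of every sample inside it. The sketch below shows that idea on plain path strings, assuming a layout of corpus/<character>/<file>. The names labelOf and groupByLabel as well as the example paths are hypothetical.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// labelOf returns the name of the directory that directly contains path,
// e.g. "corpus/a/x1.png" gives "a".
func labelOf(path string) string {
	return filepath.Base(filepath.Dir(path))
}

// groupByLabel maps each label to the sample files stored under it.
func groupByLabel(paths []string) map[string][]string {
	out := make(map[string][]string)
	for _, p := range paths {
		label := labelOf(p)
		out[label] = append(out[label], p)
	}
	return out
}

func main() {
	paths := []string{"corpus/a/x1.png", "corpus/a/x2.png", "corpus/7/y1.png"}
	for label, files := range groupByLabel(paths) {
		fmt.Println(label, len(files))
	}
}
```

This way the manual sorting is the only labelling step needed, no separate index file has to be maintained.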

That's it for now. We are almost done. Next, we'll be defining our vector space search engine as well as teaching the program correct classifications.

jazzc · Mar 23, 2013

To build a big corpus and also avoid the manual work, one can send the letter images to a decaptcha service.
