A few months later, I had a system that, compared to Ocropus, was easier to use and more forgiving of bad inputs, but also slower and slightly less accurate under optimal conditions. At this point I realized that for accuracy to improve, I'd have to take a much more systematic approach to development. For some time I'd been in a cycle of looking at a particular example and changing the code and parameters to improve quality for that example, only to find later that these changes made quality much worse in most other cases.
Moreover, I wanted to add a Hidden Markov Model stage to the system and realized that instead of dealing with single letters, it should be working with frequency distributions of letters. The result of these two hurdles, I'm afraid to say, was that I put the project on indefinite hiatus.
Since there is at least some interest in Longan from people who have stumbled across it, I wouldn't be averse to restarting development at some point soon, though I will have to think very hard about how it slots into my other commitments!
Finally, for interest's sake, here is a high-level overview of how Longan currently works:
- A histogram is generated from the source image and used to find a likely black/white threshold value for letter separation.
- Next, Longan attempts to fix rotation in the source image by finding the rotation angle that produces the cleanest horizontal lines (a rough sketch of this step and the thresholding above follows this list).
- Individual islets of black are extracted as letters.
- Letters are organized into lines and words.
- Letters are assigned a single most likely identity, e.g. "r", and a likelihood score. (More on that below.)
- Low-scoring letters are experimentally separated into multiple parts and re-identified.
- Unusually small low-scoring letters are eliminated as speckles/dirt.
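To make the first two steps a bit more concrete, here is a rough Python sketch of the histogram-based thresholding and the rotation search. It illustrates the general idea only, not Longan's actual code: the Otsu-style threshold, the angle range and the "cleanest lines" scoring are my own assumptions.

```python
# Sketch of the threshold + deskew idea: pick a black/white threshold from the
# grey-level histogram, then search for the rotation angle whose horizontal
# projection profile is "cleanest" (sharpest contrast between text rows and
# the gaps between them).
import numpy as np
from PIL import Image

def estimate_threshold(grey: np.ndarray) -> int:
    """Otsu-style threshold from the image histogram (an assumed stand-in
    for whatever heuristic Longan actually uses)."""
    hist, _ = np.histogram(grey, bins=256, range=(0, 256))
    total = grey.size
    mean_all = np.dot(np.arange(256), hist) / total
    best_t, best_var = 0, 0.0
    cum, cum_mean = 0, 0.0
    for t in range(256):
        cum += hist[t]
        cum_mean += t * hist[t]
        if cum == 0 or cum == total:
            continue
        w0 = cum / total
        mu0 = cum_mean / cum
        mu1 = (mean_all * total - cum_mean) / (total - cum)
        var_between = w0 * (1.0 - w0) * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def estimate_skew(ink: np.ndarray, angles=np.arange(-5, 5.1, 0.25)) -> float:
    """ink: boolean array, True where the thresholded pixel is ink.
    Returns the angle whose row-projection profile has the highest variance,
    i.e. the sharpest separation between text lines and inter-line gaps."""
    img = Image.fromarray((ink * 255).astype(np.uint8))
    best_angle, best_score = 0.0, -1.0
    for angle in angles:
        rotated = np.array(img.rotate(angle, fillcolor=0)) > 0
        profile = rotated.sum(axis=1)   # ink pixels per row
        score = float(np.var(profile))
        if score > best_score:
            best_score, best_angle = score, angle
    return best_angle
```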
Letter identification is based on convolutional neural networks, using a network architecture very close to the original LeNet-5. Fonts are handled in groups of similar styles: serif, sans-serif and monospace. Each font group has a primary neural network that does most of the identifying work, plus secondary heuristics to tease apart similar letter classes such as "ijlI". A secondary heuristic can be another neural network, a simple test for the number of islets in the letter (for i/I distinctions), a decision tree, or a k-nearest-neighbours classifier.
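For reference, here is a minimal PyTorch sketch of a LeNet-5-style classifier of the kind described above. The 32x32 input size, the class count and the layer widths are assumptions for illustration, not a description of Longan's actual networks.

```python
# A LeNet-5-style letter classifier: two convolution/pooling stages followed
# by three fully connected layers, producing one score per letter class.
import torch
import torch.nn as nn

class LeNetStyleClassifier(nn.Module):
    def __init__(self, num_classes: int = 70):   # e.g. letters, digits, punctuation
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),       # one score per letter class
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: a batch of 1-channel 32x32 letter images, shape (N, 1, 32, 32)
        return self.classifier(self.features(x))
```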
One of the big tasks in creating any machine-learning system is getting enough training data. Longan takes a self-contained and lazy approach: training data is generated by rendering letters on the computer and distorting them in ways similar to the variations found in printed letters, and these synthetic letters are used to train the neural networks and other identifiers. In practice, you write a single JSON config file that lists the font groups and the required secondary heuristics and feed it to the training program, which spits out a .zip file of neural network weights after a few hours of crunching.
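As an illustration of the render-and-distort idea, here is a rough Python/Pillow sketch. The font path, image size and distortion ranges are invented for the example and are not Longan's actual parameters.

```python
# Render a glyph with a real font, then apply small random distortions
# (shift, rotation, blur, noise) so the classifier sees something closer
# to a printed and scanned letter than to a clean rendered one.
import random
import numpy as np
from PIL import Image, ImageDraw, ImageFont, ImageFilter

def render_training_sample(char: str, font_path: str, size: int = 32) -> np.ndarray:
    font = ImageFont.truetype(font_path, size=size - 8)
    img = Image.new("L", (size, size), color=255)      # white background
    draw = ImageDraw.Draw(img)
    # Roughly centre the glyph, with a small random offset.
    dx, dy = random.randint(-2, 2), random.randint(-2, 2)
    draw.text((4 + dx, 2 + dy), char, fill=0, font=font)
    # Small random rotation: printed letters are rarely perfectly upright.
    img = img.rotate(random.uniform(-3.0, 3.0), fillcolor=255)
    # Slight blur to mimic ink spread / scanner softness.
    img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.0, 1.0)))
    # Additive pixel noise to mimic paper texture and dirt.
    arr = np.array(img, dtype=np.float32)
    arr += np.random.normal(0.0, 8.0, arr.shape)
    return np.clip(arr, 0, 255).astype(np.uint8)

# Usage: generate many samples per character and font group, e.g.
# sample = render_training_sample("r", "fonts/DejaVuSerif.ttf")  # hypothetical font path
```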
Finally, what would be next on the list for Longan?
- Delete old code.
- Create a standard test suite covering a range of documents, which can be used to track quality improvements.
- Change the neural network and identifier system to output letter probabilities, and use probabilities throughout.
- Investigate whether running the neural network on the GPU is significantly faster than the CPU on a modern machine. Last time I tried this, there were no real speed gains, but GPUs have become a lot faster in the intervening years.
- Develop an optional HMM phase to further improve output quality (a toy sketch of the idea follows this list).
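To illustrate what that HMM phase could look like, here is a toy Viterbi decoder that treats per-letter probability distributions as emission scores and uses character-bigram transition probabilities. Nothing like this exists in Longan yet; the probability tables are assumed inputs, and the whole thing is just a sketch of the idea.

```python
# Viterbi decoding over a sequence of letter probability distributions:
# pick the letter sequence that best balances what the classifier saw
# against what sequences of letters are plausible in real text.
import numpy as np

def viterbi(emissions: np.ndarray, transitions: np.ndarray, prior: np.ndarray) -> list[int]:
    """emissions: (T, K) P(observed image | letter k) at each position.
    transitions: (K, K) P(next letter | current letter), e.g. from a text corpus.
    prior: (K,) P(first letter). Returns the most likely sequence of letter indices."""
    T, K = emissions.shape
    log_e = np.log(emissions + 1e-12)
    log_t = np.log(transitions + 1e-12)
    score = np.log(prior + 1e-12) + log_e[0]      # best log-prob ending in each letter
    back = np.zeros((T, K), dtype=int)            # backpointers for path recovery
    for t in range(1, T):
        cand = score[:, None] + log_t             # (K, K): from letter i to letter j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_e[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```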
That's all for now. If you have questions about Longan, drop me a line or comment below.