Handwriting Recognition with LSTMs and ofxCaffe

Categories:
computer vision, source code, technology
Tags:
caffe, deep learning, handwriting, learning, lstm, machine learning, open-source, recognition, recurrent neural network, training

Long Short-Term Memory (LSTM) is a Recurrent Neural Network (RNN) architecture designed to better model temporal sequences (e.g. audio, sentences, video) and long-range dependencies than conventional RNNs [1]. There is a lot of excitement in the machine learning community around LSTMs (and DeepMind's counterpart, “Neural Turing Machines” [2], or Facebook's “Memory Networks” [3]), as they overcome a fundamental limitation of conventional RNNs and are able to achieve state-of-the-art benchmark performance on a number of tasks [4,5]:

  • Text-to-speech synthesis (Fan et al., Microsoft, Interspeech 2014)
  • Language identification (Gonzalez-Dominguez et al., Google, Interspeech 2014)
  • Large vocabulary speech recognition (Sak et al., Google, Interspeech 2014)
  • Prosody contour prediction (Fernandez et al., IBM, Interspeech 2014)
  • Medium vocabulary speech recognition (Geiger et al., Interspeech 2014)
  • English to French translation (Sutskever et al., Google, NIPS 2014)
  • Audio onset detection (Marchi et al., ICASSP 2014)
  • Social signal classification (Brueckner & Schuller, ICASSP 2014)
  • Arabic handwriting recognition (Bluche et al., DAS 2014)
  • TIMIT phoneme recognition (Graves et al., ICASSP 2013)
  • Optical character recognition (Breuel et al., ICDAR 2013)
  • Image caption generation (Vinyals et al., Google, 2014)
  • Video to textual description (Donahue et al., 2014)

The current dynamic state of a conventional RNN can be thought of as a short-term memory. The idea behind LSTMs is to keep these dynamics around for longer (i.e. as more data is seen, or as the sequence progresses). LSTMs handle this by using a combination of linear and logistic units to control the information stored in a memory cell. A “keep” unit controls the memory cell's recurrent connection, which at each time step rewrites the stored information back into itself so that it persists; this recurrent connection is essentially a linear activation with a weight of 1. Then there is a “write” unit, controlling whether the inputs of the network are fed into the memory cell, and a “read” unit, determining whether the information in the memory cell is sent forward. A minimal sketch of this gating is shown below.
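
To make the gating concrete, here is a minimal sketch of a single LSTM memory cell in C++, using the “keep”/“write”/“read” terminology from above (more commonly called the forget, input, and output gates). The weights here are illustrative placeholders, not values from the Caffe implementation or any trained network.

```cpp
#include <cmath>

// A minimal sketch of one LSTM memory cell (a single unit), following the
// "keep" (forget), "write" (input), and "read" (output) gate description above.
struct LSTMCell {
    // illustrative gate weights for the input x and the previous hidden state h
    float wk_x = 0.5f, wk_h = 0.5f, bk = 0.0f;   // keep  (forget) gate
    float ww_x = 0.5f, ww_h = 0.5f, bw = 0.0f;   // write (input)  gate
    float wr_x = 0.5f, wr_h = 0.5f, br = 0.0f;   // read  (output) gate
    float wc_x = 1.0f, wc_h = 1.0f, bc = 0.0f;   // candidate memory content

    float c = 0.0f;   // memory cell state: linear, self-recurrent with weight 1
    float h = 0.0f;   // hidden state / output sent forward

    static float sigmoid(float z) { return 1.0f / (1.0f + std::exp(-z)); }

    // One time step: the logistic gates decide how much to keep, write, and read.
    float step(float x) {
        float keep  = sigmoid(wk_x * x + wk_h * h + bk);
        float write = sigmoid(ww_x * x + ww_h * h + bw);
        float read  = sigmoid(wr_x * x + wr_h * h + br);
        float cand  = std::tanh(wc_x * x + wc_h * h + bc);

        c = keep * c + write * cand;   // linear self-connection preserves memory
        h = read * std::tanh(c);       // gated output
        return h;
    }
};
```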

Luckily I didn’t have to implement the architecture myself, and instead worked with a branch of Caffe that already includes an LSTM implementation. Building on this implementation, I’ve tried replicating a seminal use of LSTMs: recognizing temporal sequences depicting handwriting in an online fashion [6]. The original paper also included an input feature indicating when the pen tip was down, which I have not used in this example. The video demonstrates a training phase, where I give a few examples of different letters: ‘A’, ‘B’, ‘C’, and ‘D’. The network is then optimized over an arbitrarily chosen 2,000 iterations, and around 2:10 a few test cases are shown, where the detected letter is drawn with its opacity set by how far into the letter it is detected as being. A sketch of how a drawn stroke might be encoded as an input sequence follows below.
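
For illustration, here is a hypothetical sketch of how a drawn stroke might be encoded as a sequence of normalized (dx, dy) pen offsets, in the spirit of the online handwriting features in [6] (minus the pen-up/pen-down feature mentioned above). The `Point` struct and `strokeToSequence` function are illustrative names only; the actual feature encoding used in ofxCaffe may differ.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <vector>

struct Point { float x, y; };

// Convert a drawn stroke (sampled pen positions) into a sequence of
// per-time-step offsets, scaled to roughly [-1, 1] for network input.
std::vector<std::array<float, 2>> strokeToSequence(const std::vector<Point>& stroke) {
    std::vector<std::array<float, 2>> seq;
    if (stroke.size() < 2) return seq;

    // find the largest offset magnitude so deltas can be normalized
    float maxMag = 1e-6f;
    for (size_t i = 1; i < stroke.size(); ++i) {
        float dx = stroke[i].x - stroke[i - 1].x;
        float dy = stroke[i].y - stroke[i - 1].y;
        maxMag = std::max({maxMag, std::fabs(dx), std::fabs(dy)});
    }

    // each time step becomes a normalized (dx, dy) pen offset
    for (size_t i = 1; i < stroke.size(); ++i) {
        float dx = (stroke[i].x - stroke[i - 1].x) / maxMag;
        float dy = (stroke[i].y - stroke[i - 1].y) / maxMag;
        seq.push_back({dx, dy});
    }
    return seq;
}
```

Each such sequence, paired with a letter label, would then serve as one training example for the recurrent network.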

The video above uses the framework I described briefly in my previous post on real-time object detection with ofxCaffe, which is also open-source. Get the code here: ofxCaffe on GitHub.

  • [1]. S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
  • [2]. A. Graves, G. Wayne, and I. Danihelka. Neural Turing Machines. 2014. http://arxiv.org/abs/1410.5401
  • [3]. J. Weston, A. Bordes, and S. Chopra. Memory Networks. 2014. http://arxiv.org/abs/1410.3916
  • [4]. J. Schmidhuber. Deep Learning in Neural Networks: An Overview. Neural Networks, Volume 61, January 2015.
  • [5]. J. Schmidhuber. Recurrent Neural Networks. http://people.idsia.ch/~juergen/rnn.html. Retrieved on Feb 5, 2015.
  • [6]. A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009.