{"id":1836,"date":"2015-02-06T04:41:49","date_gmt":"2015-02-06T04:41:49","guid":{"rendered":"http:\/\/pkmital.com\/home\/?p=1836"},"modified":"2023-07-24T13:26:03","modified_gmt":"2023-07-24T20:26:03","slug":"handwriting-recognition-with-lstms-and-ofxcaffe","status":"publish","type":"post","link":"https:\/\/pkmital.com\/home\/handwriting-recognition-with-lstms-and-ofxcaffe\/","title":{"rendered":"Handwriting Recognition with LSTMs and ofxCaffe"},"content":{"rendered":"\n<figure class=\"wp-block-embed is-type-video is-provider-vimeo wp-block-embed-vimeo\"><div class=\"wp-block-embed__wrapper\">\n<!DOCTYPE html PUBLIC \"-\/\/W3C\/\/DTD HTML 4.0 Transitional\/\/EN\" \"http:\/\/www.w3.org\/TR\/REC-html40\/loose.dtd\">\n<html><body><iframe loading=\"lazy\" title=\"Deep Recurrent Neural Network: Handwriting Recogntion Test\" src=\"https:\/\/player.vimeo.com\/video\/118878865?dnt=1&amp;app_id=122963&amp;autoplay=0&amp;loop=1&amp;autopause=0&amp;muted=1\" width=\"412\" height=\"480\" frameborder=\"0\" allow=\"autoplay; fullscreen; picture-in-picture; clipboard-write\"><\/iframe><\/body><\/html>\n\n<\/div><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/en.wikipedia.org\/wiki\/Long_short_term_memory\">Long Short Term Memory (LSTM)<\/a> is a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Recurrent_neural_network\">Recurrent Neural Network (RNN)<\/a> architecture designed to better model temporal sequences (e.g. audio, sentences, video) and long range dependencies than conventional RNNs <a href=\"#id1\">[1]<\/a>. 
There is a lot of excitement in the machine learning community about LSTMs (and DeepMind&#8217;s counterpart, &#8220;Neural Turing Machines&#8221; <a href=\"#id3\">[3]<\/a>, or Facebook&#8217;s &#8220;Memory Networks&#8221; <a href=\"#id4\">[4]<\/a>), as they overcome a fundamental limitation of conventional RNNs and achieve state-of-the-art benchmark performance on a number of tasks [<a href=\"#id5\">5<\/a>,<a href=\"#id6\">6<\/a>]:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Text-to-speech synthesis (Fan et al., Microsoft, Interspeech 2014)<\/li>\n\n\n\n<li>Language identification (Gonzalez-Dominguez et al., Google, Interspeech 2014)<\/li>\n\n\n\n<li>Large vocabulary speech recognition (Sak et al., Google, Interspeech 2014)<\/li>\n\n\n\n<li>Prosody contour prediction (Fernandez et al., IBM, Interspeech 2014)<\/li>\n\n\n\n<li>Medium vocabulary speech recognition (Geiger et al., Interspeech 2014)<\/li>\n\n\n\n<li>English to French translation (Sutskever et al., Google, NIPS 2014)<\/li>\n\n\n\n<li>Audio onset detection (Marchi et al., ICASSP 2014)<\/li>\n\n\n\n<li>Social signal classification (Brueckner &amp; Schuller, ICASSP 2014)<\/li>\n\n\n\n<li>Arabic handwriting recognition (Bluche et al., DAS 2014)<\/li>\n\n\n\n<li>TIMIT phoneme recognition (Graves et al., ICASSP 2013)<\/li>\n\n\n\n<li>Optical character recognition (Breuel et al., ICDAR 2013)<\/li>\n\n\n\n<li>Image caption generation (Vinyals et al., Google, 2014)<\/li>\n\n\n\n<li>Video to textual description (Donahue et al., 2014)<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The current dynamic state of a conventional RNN can be thought of as a short-term memory. The idea behind LSTMs is to retain this state for a longer time (i.e. as more data is seen, or as the sequence progresses). LSTMs do this by using a combination of linear and logistic units to control the information stored in a memory cell. 
At each time step, the memory cell&#8217;s &#8220;keep&#8221; recurrent connection rewrites the stored information back into the cell, so that the information persists. This is essentially a self-connection with a linear activation function and a weight of 1. Then there is a &#8220;write&#8221; unit, controlling whether the inputs of the network are fed into the memory cell, and a &#8220;read&#8221; unit, determining whether the information in the memory cell is sent forward.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Luckily, I didn&#8217;t have to implement the architecture myself and instead worked with a <a href=\"http:\/\/github.com\/junhyukoh\/caffe-lstm\">branch<\/a> of <a href=\"http:\/\/caffe.berkeleyvision.org\/\">Caffe<\/a>. Building on this implementation, I&#8217;ve tried replicating a seminal use of LSTMs: recognizing temporal sequences depicting handwriting in an online fashion <a href=\"#id2\">[2]<\/a>. The original paper also used a feature indicating when the pen tip was down, which I have not included in this example. The video demonstrates a training phase, where I give a few examples of different letters, &#8216;A&#8217;, &#8216;B&#8217;, &#8216;C&#8217;, and &#8216;D&#8217;. The network is then optimized for an arbitrarily chosen 2000 iterations. Around 2:10, a few test cases follow, in which the detected letter is drawn with an opacity based on how far into the letter the network estimates the stroke to be.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The video above uses the framework I described briefly in my previous post on <a href=\"https:\/\/pkmital.com\/home\/2015\/01\/04\/real-time-object-recognition-with-ofxcaffe\/\">real-time object recognition with ofxCaffe<\/a> and is also open-source. Get the code here: <a href=\"https:\/\/github.com\/pkmital\/ofxCaffe\">ofxCaffe on GitHub<\/a>.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>[1]. S. Hochreiter and J. Schmidhuber. Long short-term memory. 
Neural Computation, 9(8):1735\u20131780, 1997.<\/li>\n\n\n\n<li>[2]. A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009.<\/li>\n\n\n\n<li>[3]. A. Graves, G. Wayne, and I. Danihelka. Neural Turing Machines. 2014. http:\/\/arxiv.org\/abs\/1410.5401<\/li>\n\n\n\n<li>[4]. J. Weston, A. Bordes, and S. Chopra. Memory Networks. 2014. http:\/\/arxiv.org\/abs\/1410.3916<\/li>\n\n\n\n<li>[5]. J. Schmidhuber. Deep Learning in Neural Networks: An Overview. Neural Networks, Volume 61, January 2015<\/li>\n\n\n\n<li>[6]. J. Schmidhuber. Recurrent Neural Networks. http:\/\/people.idsia.ch\/~juergen\/rnn.html. Retrieved on Feb 5, 2015.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Long Short Term Memory (LSTM) is a Recurrent Neural Network (RNN) architecture designed to better model temporal sequences (e.g. 
audio, sentences, video) and long range dependencies than conventional RNNs [1].&hellip;<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8,14,16],"tags":[69,109,172,210,217,219,261,296,297,348],"class_list":["post-1836","post","type-post","status-publish","format-standard","hentry","category-computer-vision","category-source-code","category-technology","tag-caffe","tag-deep-learning","tag-handwriting","tag-learning","tag-lstm","tag-machine-learning","tag-open-source","tag-recognition","tag-recurrent-neural-network","tag-training"],"acf":[],"_links":{"self":[{"href":"https:\/\/pkmital.com\/home\/wp-json\/wp\/v2\/posts\/1836","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pkmital.com\/home\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/pkmital.com\/home\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/pkmital.com\/home\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/pkmital.com\/home\/wp-json\/wp\/v2\/comments?post=1836"}],"version-history":[{"count":1,"href":"https:\/\/pkmital.com\/home\/wp-json\/wp\/v2\/posts\/1836\/revisions"}],"predecessor-version":[{"id":2348,"href":"https:\/\/pkmital.com\/home\/wp-json\/wp\/v2\/posts\/1836\/revisions\/2348"}],"wp:attachment":[{"href":"https:\/\/pkmital.com\/home\/wp-json\/wp\/v2\/media?parent=1836"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/pkmital.com\/home\/wp-json\/wp\/v2\/categories?post=1836"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/pkmital.com\/home\/wp-json\/wp\/v2\/tags?post=1836"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}