Imagine a magic algorithm that can create captions that accurately describe an image. The Google authors of "Show and Tell: A Neural Image Caption Generator" claim to have created a machine-learning algorithm that approaches human accuracy. If true, the value is clear: conventional text-based search methods can return relevant images as well as text, and machine-translation services can handle Chinese characters as well. For consumers, the implications are manifold, as shoppers can use text search to find products more accurately. Meanwhile, search providers can learn tremendous amounts about consumers from the images they post. (For more information on this latter benefit, see our earlier article "Monetizing Image Recognition By Looking at the Background".)
The authors (Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan) note their model is "based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image". Recurrent neural network architectures are computationally expensive to train, as the network must be iterated to let information flow through both the feed-forward and recurrent (feedback) connections. In today's technology world, massively parallel, energy-efficient GPUs or Intel Xeon Phi coprocessors are generally required to provide the floating-point capability for training.
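The architecture the authors describe pairs a convolutional image encoder (the computer-vision half) with an LSTM language decoder (the machine-translation half) that emits a caption one word at a time. The following is a minimal PyTorch sketch of that encoder-decoder loop, not the authors' implementation: it swaps in torchvision's ResNet-18 for the paper's CNN, and the vocabulary, embedding, and hidden sizes are made-up placeholder values.

```python
import torch
import torch.nn as nn
import torchvision.models as models

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 10_000, 512, 512   # illustrative sizes

class CaptionGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        cnn = models.resnet18()                          # stand-in image encoder
        cnn.fc = nn.Linear(cnn.fc.in_features, EMBED_DIM)
        self.encoder = cnn
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.lstm = nn.LSTMCell(EMBED_DIM, HIDDEN_DIM)
        self.to_vocab = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

    def generate(self, image, end_token=2, max_len=20):
        # Encode the image once and feed it to the LSTM as if it were the
        # first "word"; afterwards, each predicted word is fed back in.
        h = torch.zeros(1, HIDDEN_DIM)
        c = torch.zeros(1, HIDDEN_DIM)
        x = self.encoder(image.unsqueeze(0))             # (1, EMBED_DIM)
        words = []
        for _ in range(max_len):
            h, c = self.lstm(x, (h, c))
            token = self.to_vocab(h).argmax(dim=-1).item()   # greedy decoding
            if token == end_token:
                break
            words.append(token)
            x = self.embed(torch.tensor([token]))        # feed prediction back in
        return words

model = CaptionGenerator()
print(model.generate(torch.randn(3, 224, 224)))          # untrained, so tokens are random
```

Training such a network is where the GPU or coprocessor demand comes from: every caption requires the recurrent decoder to be unrolled over each word, forward and backward, across millions of image-sentence pairs.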

LSTM: the memory block contains a cell c which is controlled by three gates. In blue we show the recurrent connections – the output m at time t − 1 is fed back to the memory at time t via the three gates; the cell value is fed back via the forget gate; the predicted word at time t − 1 is fed back in addition to the memory output m at time t into the Softmax for word prediction. (Image courtesy Arxiv.org)
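For readers who want the gate mechanics spelled out, here is a plain NumPy sketch of one LSTM step following the formulation in the caption above (input, forget, and output gates controlling a cell c, with the memory output m fed back). The weight matrices are random placeholders standing in for learned parameters, and the vector size is arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D = 8                                        # illustrative vector size
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((D, 2 * D)) * 0.1
     for k in ("i", "f", "o", "c")}          # one matrix per gate plus the cell input

def lstm_step(x_t, m_prev, c_prev):
    z = np.concatenate([x_t, m_prev])        # current input plus previous output m
    i = sigmoid(W["i"] @ z)                  # input gate: how much new content to write
    f = sigmoid(W["f"] @ z)                  # forget gate: how much old cell state to keep
    o = sigmoid(W["o"] @ z)                  # output gate: how much of the cell to expose
    c = f * c_prev + i * np.tanh(W["c"] @ z) # new cell state
    m = o * c                                # memory output, passed on to the Softmax
    return m, c

m, c = np.zeros(D), np.zeros(D)
for t in range(3):                           # run a few steps on random "word" vectors
    m, c = lstm_step(rng.standard_normal(D), m, c)
print(m)
```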
The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets demonstrate the accuracy of the model and the fluency of the language it learns solely from image descriptions.
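Concretely, maximizing the likelihood of the target description means minimizing the summed negative log-probability the model assigns to each correct word of the reference caption. The toy sketch below uses invented probabilities; "word_probs" stands in for the Softmax outputs of a decoder like the one sketched earlier.

```python
import numpy as np

def caption_nll(word_probs, target_word_ids):
    """word_probs: (T, vocab) Softmax outputs, one row per time step.
    target_word_ids: the T reference words of the caption."""
    per_word = [-np.log(word_probs[t, w]) for t, w in enumerate(target_word_ids)]
    return sum(per_word)    # gradient descent on this pushes the likelihood up

# Toy example: a 3-word caption over a 5-word vocabulary.
probs = np.full((3, 5), 0.1)
probs[np.arange(3), [2, 0, 4]] = 0.6        # model puts 0.6 on each correct word
print(caption_nll(probs, [2, 0, 4]))        # ≈ 3 * -ln(0.6) ≈ 1.53
```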
Qualitative and quantitative measures indicate that the trained models are frequently quite accurate. For instance, the current state-of-the-art BLEU score (the higher the better) on the Pascal dataset is 25; the trained model yields 59, while human performance is around 69. Similar improvements in BLEU score occur on the Flickr30k dataset (from 55 to 66) and on SBU (from 19 to 27).
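For context, BLEU measures n-gram overlap between a generated sentence and one or more human reference sentences. The toy example below uses NLTK's sentence-level BLEU-1 purely to illustrate the mechanics; the paper's numbers are computed over whole test sets, so they are not directly comparable.

```python
from nltk.translate.bleu_score import sentence_bleu

references = [["a", "dog", "runs", "on", "the", "beach"],
              ["a", "dog", "is", "running", "along", "a", "beach"]]
candidate = ["the", "dog", "runs", "fast", "on", "the", "beach"]

# weights=(1, 0, 0, 0) scores unigram overlap only (BLEU-1):
# 5 of the 7 candidate words appear in a reference, so this prints ≈ 0.71.
print(sentence_bleu(references, candidate, weights=(1, 0, 0, 0)))
```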
It will be interesting to see how this technology evolves at major search sites like Google and at companies utilizing the IBM SyNAPSE chip.