# A constrained recursion algorithm for tree-structured LSTM with mini-batch SGD

### Abstract

Tree-structured LSTM is a promising approach to modeling long-distance interactions over hierarchies with syntactic information. Compared with chain-structured LSTM, it also offers better modularity of the learning process. However, hyperparameter tuning for tree-structured LSTM remains a challenge. In particular, the mini-batch size used in SGD (Stochastic Gradient Descent) is one of the most important factors determining the quality of an LSTM's predictions. For more sophisticated tuning of mini-batch SGD, we propose a constrained recursion algorithm for tree-structured LSTM. Our algorithm generates one LSTM tree for each mini-batch, which lets us evaluate the mini-batch-size hyperparameter more accurately than with chain-structured LSTM. In addition, the constrained recursion algorithm can traverse the LSTM trees and update the weights over several trees with a breadth-first search. In our experiments, we measured validation loss and elapsed time while varying the mini-batch size. We measured the stability of the learning process with small batch sizes, and the instability due to overfitting with large batch sizes, more precisely than is possible with chain-structured LSTM.
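The breadth-first traversal described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the node class and function names are hypothetical, and the gating computations of a real tree-structured LSTM are omitted. It shows only the structural idea, that nodes at the same depth across every tree in a mini-batch are grouped into one level, so a single update step can cover that level across all trees.

```python
from collections import deque

class LSTMTreeNode:
    """A node in a (hypothetical) tree-structured LSTM; children feed the parent bottom-up."""
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

def bfs_levels(roots):
    """Traverse several LSTM trees breadth-first, grouping nodes level by level.

    Nodes at the same depth across all trees of a mini-batch are collected
    together, so one weight update can process a whole level of every tree.
    """
    levels = []
    frontier = deque(roots)
    while frontier:
        levels.append([node.name for node in frontier])
        next_frontier = deque()
        for node in frontier:
            next_frontier.extend(node.children)
        frontier = next_frontier
    return levels

# Two toy trees standing in for the LSTM trees of one mini-batch.
t1 = LSTMTreeNode("a", [LSTMTreeNode("b"), LSTMTreeNode("c", [LSTMTreeNode("d")])])
t2 = LSTMTreeNode("x", [LSTMTreeNode("y")])

print(bfs_levels([t1, t2]))  # → [['a', 'x'], ['b', 'c', 'y'], ['d']]
```

Grouping by level rather than recursing tree by tree is what makes the per-batch update uniform: each inner list can be handed to one batched weight update regardless of how many trees the mini-batch contains.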
