Hand Detection in Egocentric Video and Investigation Toward Fine-Grained Cooking Activities Recognition

Keywords: Egocentric Video, Fine-grained Cooking Activities Recognition, Fully Connected Multi-layer Neural Network, Hand Detection


The analysis of egocentric videos is an active topic in computer vision. In this paper, we focus on recognizing cooking activities in egocentric videos. Recognizing these activities automatically and precisely requires solving three problems: detecting hand regions in egocentric videos, representing hand motion, and classifying the cooking activities. To address them, we propose a new method for cooking activity recognition in egocentric videos. Its characteristic points are: 1) hand regions are accurately detected against a cluttered background using color, texture, and location information; 2) temporal hand features are extracted from sequential frames with a thinning algorithm; and 3) a fully-connected multi-layer neural network recognizes the activities from the extracted features. Toward our goal of fine-grained cooking activity recognition, we investigated the performance of our method on our benchmark, which includes 12 fine-grained cooking activities in five coarse categories. The experimental results show that our method recognizes cooking activities with an accuracy of 45.2%.
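The classification stage named in point 3) can be illustrated with a minimal forward pass through a fully-connected multi-layer network. This is only a sketch under stated assumptions, not the configuration used in this work: the feature dimension, hidden layer sizes, ReLU activation, softmax output, and random weights are all hypothetical. Only the number of output classes (12) comes from the benchmark described above.

```python
import math
import random


def dense(x, weights, biases):
    """One fully-connected layer: y_j = sum_i w[j][i] * x_i + b_j."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b
            for row, b in zip(weights, biases)]


def relu(v):
    """Element-wise ReLU (an assumed activation, not confirmed by the abstract)."""
    return [max(0.0, a) for a in v]


def softmax(v):
    """Numerically stable softmax that turns scores into class probabilities."""
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random weights standing in for trained parameters; biases start at zero."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers):
    """Apply ReLU hidden layers, then a softmax output layer."""
    h = x
    for weights, biases in layers[:-1]:
        h = relu(dense(h, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(h, weights, biases))


rng = random.Random(0)
FEATURE_DIM = 32   # hypothetical length of the temporal hand feature vector
HIDDEN = 16        # hypothetical hidden layer width
NUM_CLASSES = 12   # fine-grained cooking activities in the benchmark

layers = [
    init_layer(FEATURE_DIM, HIDDEN, rng),
    init_layer(HIDDEN, HIDDEN, rng),
    init_layer(HIDDEN, NUM_CLASSES, rng),
]

# A random vector stands in for features extracted from a video clip.
features = [rng.random() for _ in range(FEATURE_DIM)]
probs = forward(features, layers)
predicted = max(range(NUM_CLASSES), key=probs.__getitem__)
print("predicted class index:", predicted)
```

With trained weights in place of the random ones, the index of the largest probability would be taken as the recognized activity.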


Technical Papers (Information and Communication Technology)