Combining Features for RGB-D Object Recognition

Wasif Khan, Department of Electrical Engineering, Kasetsart University, Bangkhen Campus, Bangkok, Thailand. Wasifuetm@yahoo.com
Ekachai Phaisangittisagul, Department of Electrical Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand. email@example.com
Luqman Ali, Department of Electrical Engineering, Kasetsart University, Bangkhen Campus, Bangkok, Thailand. Luqman.firstname.lastname@example.org
Duangrat Gansawat, Human Computer Communication Research Unit, NECTEC, Pathumthani, Thailand. Duangrat.email@example.com
Itsuo Kumazawa, Imaging Science and Engineering Laboratory, Tokyo Institute of Technology, Japan. Kumazawa@isl.titech.ac.jp

Abstract— Object category and instance recognition have received much attention in this era of modern technologies. Advanced image sensing devices such as the RGB-D (Kinect-style) camera provide high-resolution, synchronized color and depth videos. Various feature extraction schemes have been introduced to improve classification performance, and extracting useful features from both color and depth images has recently gained much attention in this research area. In this paper, we propose an approach that creates new features by combining several feature extraction techniques: color autocorrelogram, wavelet moments, local binary pattern (LBP), and principal component analysis (PCA) for RGB-D data. Experiments on a benchmark dataset show that the new features obtained by the proposed method, used with a k-nearest neighbor (k-NN) classifier, provide promising classification results.

Keywords— k-Nearest Neighbors (k-NN), Local Binary Pattern (LBP), Principal Component Analysis (PCA), RGB-D Object Recognition.

I. INTRODUCTION

Object recognition plays a vital role in machine learning, computer vision, and robotics.
It can be defined as the task of assigning an object in a digital image or video to a pre-defined class using the object's features or appearance. Unlike for the human eye, this task is relatively difficult and challenging due to changes in viewpoint, lighting intensity, etc. The main goal is to improve the recognition performance of the system. Researchers have attempted to exploit various other object features, such as spatial geometry information and shape. Despite the resulting gains, limitations remain: 2D RGB images carry only a limited amount of information and lack cues such as the spatial geometry and shape of the object, and such features may be unreliable due to the many variations present in RGB images. To overcome this problem, depth cameras such as the Microsoft Kinect were recently introduced. This type of camera provides high-quality depth maps along with RGB images: an infrared (IR) sensor captures depth information while, at the same time, an RGB camera takes the color images. RGB-D images therefore provide much more information than RGB images alone, and the depth images supply sufficient information about the spatial geometry and shape of the objects in a scene. Since depth and RGB images contain different information about an object, the challenging and interesting part is how to combine them so that better features can be created. A major problem in machine learning and computer vision applications is improving object recognition and classification performance, and several works [1-7] have addressed it. A high-performance object recognition system can be achieved through feature extraction schemes and different classification algorithms.

978-1-5090-4666-9/17/$31.00 ©2017 IEEE
Feature extraction is important in object recognition because an image is too large to be processed directly by a classifier and also contains redundant information; hence, useful features are extracted from both the depth and RGB images to overcome this problem. In image processing, various feature extraction schemes are used for both RGB and depth images, such as HMP, kernel descriptors, histograms of oriented gradients (HOG), PCA, and convolutional neural networks (CNN). The extracted features are then passed to a classifier to measure the classification performance of the object recognition system. Various classifiers can be used, for example, linear support vector machines (SVM), the deep regularized reconstruction independent component analysis network (R2ICA), convolutional-recursive neural networks (CNN-RNN), and k-NN. This paper aims not only to reduce the feature extraction and classification time but also to improve object recognition performance by using different feature extraction schemes, namely PCA, wavelet transforms, color autocorrelogram, and LBP, for both depth and RGB images. The accuracy of object recognition depends on the quality of the training data as well as on the learning algorithm. The algorithms presented in previous papers [1-6] are slower and more complex than our approach.

The remainder of the paper is organized as follows. Section II reviews related work on object recognition systems. Section III describes the proposed methodology, followed by the experimental results in Section IV. Conclusions and discussion are drawn in Section V.

II. RELATED WORKS

Lai et al. introduced the largest hierarchical multi-view object dataset. The dataset was captured with an RGB-D camera, and the authors also reported object recognition performance using state-of-the-art features such as spin images and SIFT.
Three state-of-the-art classifiers, a linear support vector machine (LinSVM), a random forest (RF), and a Gaussian-kernel support vector machine (KSVM), were used for category and instance recognition. Bo et al. later used five depth kernel descriptors and their combination for depth maps and 3D point clouds; before that, kernel descriptors had been applied only to RGB images. They reported better results than previous papers. Bo et al. also proposed hierarchical matching pursuit (HMP), which builds a feature hierarchy layer by layer using an efficient matching pursuit encoder. HMP has three main parts: spatial pyramid pooling, contrast normalization, and batch (tree) orthogonal matching pursuit. Although it outperformed all previous approaches, it still has two main limitations. To overcome them, later work uses sparse coding to learn hierarchical feature representations from raw RGB-D data in an unsupervised way. In that work, the authors proposed two innovations to make the approach suitable for RGB-D based object recognition. First, feature learning was applied to both color and depth images, because their previous work used only grayscale images, which is less effective for object recognition; including color, appearance details, and the depth channel can also improve the robustness of a recognition system. Second, they extracted features not only from the top of the feature hierarchy but also from the lower layers. Cheng et al. proposed semi-supervised learning, using unsupervised convolutional-recursive neural networks (CNN-RNN) to learn a set of basis features for RGB and depth images respectively. CNN-RNN is faster than HMP and does not need additional features such as surface normals.

III. PROPOSED METHODOLOGY

The proposed method consists of the following steps, as shown in Fig. 1:
• Data preprocessing
• Feature extraction
• Classification

A.
Data preprocessing

The database used in this paper is the Washington RGB-D object dataset. It consists of approximately 45,000 RGB-D images of 300 different objects (instances) divided into 51 categories. This dataset was chosen because it contains both textured and texture-less objects, and its RGB-D frames show large changes under different lighting conditions due to variations in luminance. Examples of textured objects are soda cans and food boxes, to name a few; texture-less objects include fruits and bowls, as shown in Fig. 2(a) and 2(b) respectively. Results are evaluated at two levels: category level and instance level. For example, recognizing a ball is category recognition, but recognizing it as a specific "football", "basketball", etc. is instance-level recognition. First, all images in the dataset are resized to 384×256 pixels, and filters such as a bilateral filter are applied to denoise them.

B. Feature extraction

Feature extraction starts by applying the color autocorrelogram to the RGB images. The color autocorrelogram is used in this paper because these features are easy to extract, their size is small compared to other features, and they capture the spatial correlation of colors and describe its global distribution. A color correlogram expresses how the spatial correlation of pairs of colors varies with distance. After this, the image is converted to grayscale in order to compute wavelet moments by applying the wavelet transform. Since the features extracted so far are not sufficient for object classification, additional features are required; for this purpose, LBP features are extracted. LBP features have recently received much attention in applications such as image retrieval and image segmentation, to name a few, because they capture the fine details present in an image.
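A minimal NumPy sketch of the basic 3×3 LBP operator and its histogram feature may help make this concrete. The neighbor ordering, the ≥ threshold, and the function names here are my own choices, not the paper's exact implementation:

```python
import numpy as np

def lbp_3x3(img):
    """Basic LBP: threshold each pixel's 8 neighbors against the pixel
    itself and pack the results into an 8-bit code. Border pixels are skipped."""
    img = img.astype(np.int32)
    c = img[1:-1, 1:-1]                    # the central pixels
    # 8 neighbor offsets, clockwise from the top-left corner
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy: img.shape[0] - 1 + dy, 1 + dx: img.shape[1] - 1 + dx]
        code |= (nb >= c).astype(np.int32) << (7 - bit)
    return code

# the LBP feature is commonly the normalized histogram of the codes
gray = np.random.randint(0, 256, size=(256, 384))   # toy grayscale image
codes = lbp_3x3(gray)
hist, _ = np.histogram(codes, bins=256, range=(0, 256))
lbp_feature = hist / hist.sum()            # 256-bin LBP histogram
print(lbp_feature.shape)  # (256,)
```

In practice, a library implementation such as scikit-image's `local_binary_pattern` would typically be used instead; the loop above only illustrates the thresholding idea.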
In LBP feature extraction, the operator labels the pixels of an image by thresholding the 3×3 neighborhood of each pixel against the central pixel value and expressing the result as a binary number, as shown in Fig. 3. The LBP operator can also be used with other neighborhood sizes, such as 4 or 16 neighbors. Among the best qualities of these features are their tolerance to monotonic illumination variation and their computational simplicity. All the features extracted from the RGB images are concatenated and stored in vectors. A Mat file is then generated by assigning labels to the instances and categories present in the dataset; this file is used later in the classification process.

To extract features from the depth images, PCA is used. The step-by-step procedure is shown in Fig. 4. First, all images are resized to an equal dimension of 384×256. After resizing, each depth image is stored as a column vector of 98,304 dimensions, and all the depth images are arranged in a data matrix D in which each column represents a single depth image. The purpose of forming the data matrix is to reduce the column dimension from 98,304 to L = 99 variables for each observation. First, the mean of the data matrix is computed and stored in a vector. The centered matrix (CM) is then formed by subtracting the mean from the data matrix. A covariance matrix is computed from the centered matrix as shown in equation (1); its eigenvector matrix V yields the eigenfaces, feature vector, and final vector in equations (2), (3), and (4) respectively:

Cov(CM) = CM^T · CM    (1)
Eigenfaces = CM × V    (2)
Feature vector = (eig_1, eig_2, …, eig_L),  L = 99    (3)
Final vector = Feature vector^T × CM    (4)

Fig. 1. The proposed method.
Fig. 2(a): Textured objects. Fig. 2(b): Texture-less objects.
Fig. 3. LBP feature extraction operation: a 3×3 neighborhood (center 191, with neighbors 196, 192, 189, 193, 186, 189, 189, 188) is thresholded against the center to give the binary pattern (10000011)₂.

The feature vector obtained from the depth images is used for the classification of the depth images.
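The PCA procedure of equations (1)-(4) can be sketched in NumPy. This is a minimal sketch under my reading of the equations (the small-covariance "eigenface" trick); the toy data and the reduced L = 10 are assumptions made so the example runs on only 20 images:

```python
import numpy as np

def pca_depth_features(D, L=99):
    """Project depth images (columns of D) onto the top-L eigenfaces.

    D: (P, N) data matrix, one flattened depth image per column.
    Returns an (L, N) matrix of PCA features, one column per image.
    """
    mean = D.mean(axis=1, keepdims=True)   # mean depth image
    CM = D - mean                          # centered matrix
    cov = CM.T @ CM                        # small (N, N) covariance -- eq. (1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]      # sort by descending eigenvalue
    eigvecs = eigvecs[:, order]
    eigenfaces = CM @ eigvecs              # eq. (2), one eigenface per column
    feature_vector = eigenfaces[:, :L]     # keep the top-L eigenfaces -- eq. (3)
    return feature_vector.T @ CM           # eq. (4), the final (L, N) features

# toy example: 20 "depth images" of 384*256 = 98,304 pixels each
rng = np.random.default_rng(0)
D = rng.random((98304, 20))
features = pca_depth_features(D, L=10)     # L = 10 here; the paper uses L = 99
print(features.shape)  # (10, 20)
```

With N images, the small-covariance trick keeps the eigendecomposition at N×N instead of 98,304×98,304, which is what makes this approach tractable.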
In the end, the RGB and depth feature vectors are concatenated and stored in the Mat file used to evaluate the classification accuracy on the RGB-D object dataset.

C. Classification

In this paper, the k-nearest neighbor (k-NN) classifier is used for classification. The k-NN algorithm assigns a class label to a query sample based on the majority vote among its k nearest neighbors. The Mat file obtained from feature extraction, in which labels are assigned to each category and instance, is given to the classifier. The three separate files prepared from the RGB, depth, and RGB-D features are given to the classifier, which produces results for both category and instance recognition. Besides the accuracy of the classifier, the time and processing speed of classification are two further criteria to be observed.

IV. EXPERIMENTAL RESULTS

The test results for the k-NN classifier are shown in Tables 1-4, while Table 5 compares our approach with the state-of-the-art HMP method. Each table shows results for a different split of the data: a 90-10 split means 90% of the data is used for classifier training and 10% for classifier testing, and similarly for the 80-20, 70-30, and 50-50 splits.
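The split-and-classify procedure can be sketched with scikit-learn. This is a minimal sketch on synthetic features; the feature dimensionality, the number of classes, and the value of k are assumptions, not the paper's exact settings:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# stand-in for the concatenated RGB-D feature vectors and category labels
rng = np.random.default_rng(0)
X = rng.random((500, 64))                  # 500 samples, 64-dim features
y = rng.integers(0, 5, size=500)           # 5 hypothetical categories

# a 90-10 split: 90% of the data for training, 10% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0, stratify=y)

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5 is an assumption
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)       # fraction of correctly classified samples
print(f"category-level accuracy: {accuracy:.3f}")
```

Repeating this with `test_size` set to 0.20, 0.30, and 0.50 would reproduce the other fold configurations reported below.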
Table 1: k-NN classifier experimental results for the 90-10 training/testing split
                      RGB    Depth   RGB-D
Category level        94.1   88.1    94.5
Instance level        86.0   66.3    86.8

Table 2: k-NN classifier experimental results for the 80-20 training/testing split
                      RGB    Depth   RGB-D
Category level        92.6   87.8    93.0
Instance level        85.1   62.7    85.8

Table 3: k-NN classifier experimental results for the 70-30 training/testing split
                      RGB    Depth   RGB-D
Category level        91.2   85.4    91.9
Instance level        83.8   58.1    84.5

Table 4: k-NN classifier experimental results for the 50-50 training/testing split
                      RGB    Depth   RGB-D
Category level        88.9   83.5    89.3
Instance level        80.3   56.2    80.7

Fig. 4. Block diagram of PCA.

Table 5 compares our approach with HMP on the 70-30 split. The experiments show that our approach is more suitable than HMP: it outperforms the state-of-the-art HMP on this dataset division for both RGB-D category and RGB-D instance recognition, especially in the case of depth images. The accuracy of our system for RGB-D instance recognition is 84.5, while that of HMP is 81.2; our accuracy is likewise better for RGB-D category recognition, as shown in Table 5. The table also shows that the test time for one image was 0.51 s for HMP versus 0.37 s for this work.

Table 5: Comparison of our approach and HMP on the 70-30 training/testing split
               Category level            Instance level            Time (sec)
               RGB    Depth   RGB-D      RGB    Depth   RGB-D
HMP            89.3   77.2    90.9       78.9   56.4    81.2       0.51
This work      91.2   85.4    91.9       83.8   58.1    84.5       0.37

V. DISCUSSIONS AND CONCLUSIONS

In this paper, we proposed a combination of different features to improve RGB-D object classification performance.
From the above results, we conclude that the combination of such features provides better results in less time than the state-of-the-art HMP approach. The processing time is lower, and the speed higher, than in [1-5], which makes this system practical for low-cost service robots. Motivated by the fact that classifiers such as CNNs are very time-consuming, a simple classifier, k-NN, is used for classification. The results also reveal that this approach gives about 8% better results in depth category recognition and almost 2% better results in depth instance recognition compared to HMP on the 70-30 training/testing split, while for category recognition the proposed system likewise provides promising results.

VI. ACKNOWLEDGMENT

This research is financially supported by the Thailand Advanced Institute of Science and Technology (TAIST), the National Science and Technology Development Agency (NSTDA), Tokyo Institute of Technology, and Kasetsart University (KU) under the TAIST Tokyo Tech program.

VII. REFERENCES

Bo, Liefeng, Xiaofeng Ren, and Dieter Fox. "Unsupervised feature learning for RGB-D based object recognition." In Experimental Robotics, pp. 387-402. Springer International Publishing, 2013.
Bo, Liefeng, Kevin Lai, Xiaofeng Ren, and Dieter Fox. "Object recognition with hierarchical kernel descriptors." In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1729-1736. IEEE, 2011.
Jhuo, I-Hong, Shenghua Gao, Liansheng Zhuang, D. T. Lee, and Yi Ma. "Unsupervised feature learning for RGB-D image classification." In Asian Conference on Computer Vision, pp. 276-289. Springer International Publishing, 2014.
Lai, Kevin, Liefeng Bo, Xiaofeng Ren, and Dieter Fox. "A large-scale hierarchical multi-view RGB-D object dataset." In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pp. 1817-1824. IEEE, 2011.
Bo, Liefeng, Xiaofeng Ren, and Dieter Fox.
"Depth kernel descriptors for object recognition." In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 821-826. IEEE, 2011.
Bo, Liefeng, Xiaofeng Ren, and Dieter Fox. "Hierarchical matching pursuit for image classification: Architecture and fast algorithms." In Advances in Neural Information Processing Systems, pp. 2115-2123. 2011.
Cheng, Y., X. Zhao, K. Huang, and T. Tan. "Semi-supervised learning for RGB-D object recognition." In 2014 22nd International Conference on Pattern Recognition, Stockholm, 2014, pp. 2377-2382.
Ali, L., A. Sarwar, A. Jan, M. I. Khattak, and M. Shafi. "Identification of seamless connection in merged images using evolutionary artificial neural network (EANN)." University of Engineering and Technology Taxila, Technical Journal 20, no. 2 (2015): 34.
Huang, Jing, S. Ravi Kumar, Mandar Mitra, Wei-Jing Zhu, and Ramin Zabih. "Image indexing using color correlograms." In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, pp. 762-768. IEEE, 1997.
Gupta, Saurabh, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. "Learning rich features from RGB-D images for object detection and segmentation." In European Conference on Computer Vision, pp. 345-360. Springer International Publishing, 2014.
Dalal, Navneet, and Bill Triggs. "Histograms of oriented gradients for human detection." In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1, pp. 886-893. IEEE, 2005.
Sarode, Nivedita S., and A. M. Patil. "Iris recognition using LBP with classifiers KNN and NB."