Application of Clustering Methods: Regularized Markov Clustering (R-MCL) for Analyzing Dengue Virus Similarity

D. Lestari, D. Raharjo, A. Bustamam a), B. Abdillah, and W. Widhianto

Department of Mathematics, Faculty of Mathematics and Natural Sciences (FMIPA), Universitas Indonesia, Depok 16424, Indonesia

a) Corresponding author: alhadi@sci.ui.ac.id

Citation: AIP Conference Proceedings 1862, 030130 (2017); https://doi.org/10.1063/1.4991234. Published by the American Institute of Physics.

Abstract. Dengue virus consists of 10 different constituent proteins and is classified into 4 major serotypes (DEN 1 to DEN 4).
This study clusters 30 dengue virus protein sequences taken from the Virus Pathogen Database and Analysis Resource (ViPR) using the Regularized Markov Clustering (R-MCL) algorithm and then analyzes the result. Implemented in Python 3.4, the R-MCL algorithm produces 8 clusters, several of which have more than one centroid. The number of centroids indicates the density of interaction within a cluster. Protein interactions that are connected in a network form a protein complex that serves as a unit of a specific biological process. The analysis shows that R-MCL produces clusters of the dengue virus family based on the similar roles of their constituent proteins, regardless of serotype.

Keywords: Bioinformatics; Clustering; Dengue Virus; Protein Interactions; R-MCL; Sequence Alignment.

INTRODUCTION

Dengue virus, the cause of dengue, is a dangerous virus that can mutate quickly. The virus has four main serotypes: DEN-1, DEN-2, DEN-3 and DEN-4. These serotypes are found in several areas of Indonesia. DEN-3 is the dominant serotype that causes severe cases. Different serotypes also cause different symptoms in patients. Dengue virus has 10 constituent proteins with different roles: three structural proteins (C, M, E) and seven non-structural proteins (NS1, NS2a, NS2b, NS3, NS4a, NS4b, NS5) [1].

Considering the size of the dengue virus sequences, a clustering technique is needed to analyze them efficiently. Clustering is a method of data analysis whose purpose is to group data with similar characteristics into the same group [2]. One of the clustering methods is Markov Clustering (MCL). Recently, MCL, which was originally developed for the general problem of graph clustering, has been adopted in a wide range of applications, including bioinformatics.
The algorithm has also been reviewed intensively and has been shown to be robust and reliable compared to many other clustering algorithms [3]. In this study, we use an improvement of Markov clustering called Regularized Markov Clustering (R-MCL). Each iteration of this method has two primary processes, regularize and inflate [4]. The process of R-MCL is listed below.

    Input  : matrix M, r = inflate parameter
    Output : matrix M, cluster entries
    1. M := M + I              // self-loop on graph
    2. M := normalize(M)       // this initial Markov matrix is kept as MG
    repeat
        3. M := regularize(M*MG)
        4. M := inflate(M, r)
        5. M := prune(M)
    until M convergent

The purpose of this study is to cluster dengue virus protein sequences with the R-MCL method and to find the relationships among the sequences from the clustering result.

International Symposium on Current Progress in Mathematics and Sciences 2016 (ISCPMS 2016). AIP Conf. Proc. 1862, 030130-1–030130-7; doi: 10.1063/1.4991234. Published by AIP Publishing. 978-0-7354-1536-2/$30.00

METHODS

In general, the clustering method in this research is divided into two steps. The first step creates the adjacency matrix that serves as input data. The second step clusters the dengue virus protein sequences using the R-MCL method. The whole procedure is illustrated in Fig. 1.

Forming the adjacency matrix is the first step that must be done to cluster the dengue virus protein sequences. The adjacency matrix is obtained by running BLAST sequence alignment on the protein sequences and then applying the formula in Fig. 1. Each dengue virus has different protein sequences, written in FASTA format. The data in this research are 30 dengue virus protein sequences collected in Indonesia from 2010 to 2014 and taken from the Virus Pathogen Database and Analysis Resource (ViPR) [5]. The data are listed in Table 1. The codes 1 to 30 in Table 1 serve as short names for the dengue virus protein sequences.
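The first step, turning pairwise BLAST E-values into an adjacency matrix, can be sketched in Python as follows. This is a minimal sketch rather than the authors' program. It assumes that each edge weight is -log10(E-value), following the caption of Fig. 2 (the paper does not state the base of the logarithm), that weights are clipped to a hypothetical cap of 100, and that the matrix is kept symmetric. The function names and the three example E-values are illustrative and are not taken from the paper's data.

```python
import math

import numpy as np


def evalue_to_weight(e_value, cap=100.0):
    """Map a BLAST E-value to an edge weight -log10(E), clipped to [0, cap].

    The cap and the base-10 logarithm are assumptions of this sketch.
    An E-value of 0 (a perfect hit) is mapped to the cap.
    """
    if e_value <= 0.0:
        return cap
    return min(cap, max(0.0, -math.log10(e_value)))


def adjacency_from_hits(n, hits, cap=100.0):
    """Build an n x n symmetric adjacency matrix from {(i, j): E-value}."""
    adj = np.zeros((n, n))
    np.fill_diagonal(adj, cap)  # every sequence matches itself best
    for (i, j), e_value in hits.items():
        weight = evalue_to_weight(e_value, cap)
        # Keep the stronger weight if a pair is reported in both directions.
        adj[i, j] = adj[j, i] = max(adj[i, j], weight)
    return adj


# Hypothetical E-values for three sequences (not the paper's data).
hits = {(0, 1): 1e-50, (0, 2): 1e-45, (1, 2): 5.0}
adj = adjacency_from_hits(3, hits)
print(adj)
```

A small E-value (strong similarity) therefore yields a heavy edge, while an E-value above 1 yields no edge at all, which matches the statement that a higher E-value means lower homology.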
Sequence alignment of the dengue virus protein sequences is done using BLAST from the National Center for Biotechnology Information (NCBI). The BLAST results summarize the similarity between the dengue virus protein sequences. The E-value is an estimate that measures the statistical significance of the match between two sequences. A higher E-value indicates a lower level of homology between the sequences, and vice versa. The formation of the adjacency matrix from the E-values is shown in Fig. 2.

FIGURE 1. Clustering Flowchart

TABLE 1. Thirty Dengue Virus Protein Sequences Data

Code  Identity                           Protein  Serotype
1     gi:VIPR_ALG4_573592343_7574_10270  NS5      DEN 1
2     gi:VIPR_ALG4_573592343_2420_3475   NS1      DEN 1
3     gi:VIPR_ALG4_573592343_4520_6376   NS3      DEN 1
4     gi:VIPR_ALG4_573592327_710_934     M        DEN 1
5     gi:VIPR_ALG4_573592327_95_394      C        DEN 1
6     gi:VIPR_ALG4_573592327_4520_6376   NS3      DEN 1
7     gi:VIPR_ALG4_573592333_2420_3475   NS1      DEN 1
8     gi:VIPR_ALG4_573592333_7574_10270  NS5      DEN 1
9     gi:VIPR_ALG4_573592333_95_394      C        DEN 1
10    gi:VIPR_ALG4_573592333_710_934     M        DEN 1
11    gi:VIPR_ALG4_573592405_937_2421    E        DEN 2
12    gi:VIPR_ALG4_573592405_712_936     M        DEN 2
13    gi:VIPR_ALG4_573592405_97_396      C        DEN 2
14    gi:VIPR_ALG4_573592407_712_936     M        DEN 2
15    gi:VIPR_ALG4_573592407_97_396      C        DEN 2
16    gi:VIPR_ALG4_573592407_3478_4131   NS2a     DEN 2
17    gi:VIPR_ALG4_573592409_97_396      C        DEN 2
18    gi:VIPR_ALG4_573592409_712_936     M        DEN 2
19    gi:VIPR_ALG4_573592409_937_2421    E        DEN 2
20    gi:VIPR_ALG4_573592409_4132_4521   NS2b     DEN 2
21    gi:VIPR_ALG4_573592433_437_934     preM     DEN 3
22    gi:VIPR_ALG4_573592433_710_934     M        DEN 3
23    gi:VIPR_ALG4_573592433_3470_4123   NS2a     DEN 3
24    gi:VIPR_ALG4_573592433_4124_4513   NS2b     DEN 3
25    gi:VIPR_ALG4_573592433_6371_6751   NS4a     DEN 3
26    gi:VIPR_ALG4_573592433_6821_7564   NS4b     DEN 3
27    gi:VIPR_ALG4_573592435_710_934     M        DEN 3
28    gi:VIPR_ALG4_573592435_3470_4123   NS2a     DEN 3
29    gi:VIPR_ALG4_573592435_6371_6751   NS4a     DEN 3
30    gi:VIPR_ALG4_573592435_6821_7564   NS4b     DEN 3

(a) [graph]

(b)
    100   50   45    0   30    0    0
     50  100    0    0   50    0    0
     45    0  100    0   40    0    0
      0    0    0  100   55    0   45
     30   50   40   55  100   30   60
      0    0    0    0   30  100    0
      0    0    0   45   60    0  100

FIGURE 2. Adjacency Matrix Form [5], (a) graph whose edges are E-values, (b) adjacency matrix whose entries are -log(E-value)

After the adjacency matrix has been formed, the second step is to carry out the R-MCL method. The first phase of R-MCL takes the adjacency matrix as the input matrix M. The initial Markov matrix $M_G$ is then formed by normalizing the adjacency matrix M [4], using Equation 1.

$$ (M_G)_{ij} = \frac{M_{ij}}{\sum_{k=1}^{n} M_{kj}}; \quad i, j = 1, 2, \dots, n \tag{1} $$

The second phase of R-MCL is the regularize process. Its purpose is to bring out protein interactions that have not yet emerged between proteins that are likely to interact. The regularize process multiplies the Markov matrix of the current iteration by the initial Markov matrix $M_G$. In the first iteration the Markov matrix is the initial matrix itself ($M = M_G$); the product $M \cdot M_G$ is the regularized result matrix, and it is stored as the Markov matrix input M for the next step.

The third phase is inflate, which raises each element of the regularized result matrix M to the power of the inflation factor r (see Equation 2). The purpose is to strengthen the stronger interactions and weaken the weaker ones while still preserving the initial graph topology. The inflated matrix is then immediately normalized again, so that the inflate step applies two operations at once. The inflated matrix $\Gamma_r M$, that is, the matrix M inflated with parameter r, is defined in Equation 2 [4]:

$$ (\Gamma_r M)_{ij} = \frac{(M_{ij})^{r}}{\sum_{k=1}^{n} (M_{kj})^{r}}; \quad i, j = 1, 2, \dots, n \tag{2} $$

The default value of r is usually 2, which makes the elements of the matrix less uniform. For r between 0 and 1 the elements become more uniform, and r = 1 leaves the matrix unchanged.
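Equations 1 and 2 can be illustrated on the adjacency matrix of Fig. 2(b). The following is a minimal Python sketch, assuming column-wise normalization; the function names are illustrative and this is not the authors' code. It prints the first column of the Markov matrix after inflation with three values of r.

```python
import numpy as np


def normalize_columns(m):
    """Equation 1: divide every entry by the sum of its column."""
    m = np.asarray(m, dtype=float)
    return m / m.sum(axis=0, keepdims=True)


def inflate(m, r):
    """Equation 2: raise every entry to the power r, then renormalize."""
    return normalize_columns(np.power(m, r))


# Adjacency matrix of Fig. 2(b); entries are -log(E-value).
fig2b = np.array([
    [100,  50,  45,   0,  30,   0,   0],
    [ 50, 100,   0,   0,  50,   0,   0],
    [ 45,   0, 100,   0,  40,   0,   0],
    [  0,   0,   0, 100,  55,   0,  45],
    [ 30,  50,  40,  55, 100,  30,  60],
    [  0,   0,   0,   0,  30, 100,   0],
    [  0,   0,   0,  45,  60,   0, 100],
], dtype=float)

markov = normalize_columns(fig2b)

# Compare the first column for a flattening, a neutral, and the default r.
for r in (0.5, 1.0, 2.0):
    print(r, np.round(inflate(markov, r)[:, 0], 3))
```

With r = 2 the largest entry of the column grows and the smaller ones shrink, with r = 1 the column is unchanged, and with r = 0.5 the nonzero entries move closer together.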
Negative values of r are not allowed because they would turn large elements into small ones and small elements into large ones. The steps above make up one iteration; if the inflated matrix has not yet converged, the iteration continues. Each iteration applies the regularize and inflate processes, which eventually produce an idempotent matrix. The iteration stops at the idempotent condition, that is, when the global chaos is less than the minimum threshold e, whose default is e = 10^{-3} [6]. With $\Gamma_{ij}$ denoting the entries of the inflated matrix, the chaos is defined in Equation 3.

$$ \text{chaos}_j = \max_{i}\left(\Gamma_{ij}\right) - \sum_{i=1}^{n} \Gamma_{ij}^{2}, \quad j = 1, 2, \dots, n; \qquad \text{global\_chaos} = \max_{j}\left(\text{chaos}_j\right), \quad j = 1, 2, \dots, n \tag{3} $$

RESULTS AND DISCUSSION

The input matrix is the adjacency matrix of the 30 dengue virus protein sequences, obtained from the E-values of the BLAST alignment results. The adjacency matrix has size 30 x 30, matching the number of dengue virus protein sequences. Alignment is performed by taking one of the thirty sequences as the global consensus for the other sequences. The sequence that serves as the global consensus identifies a column of the adjacency matrix, while its alignment results against the other sequences become the elements of that column. This study uses Code 1 as the first global consensus, and the results of this alignment become the elements of the first column of the adjacency matrix. The sequence with Code 2 is the second global consensus, and the results of its alignment become the elements of the second column. This continues until the sequence with Code 30 serves as the thirtieth global consensus. Based on the E-values of the alignment of the 30 dengue virus protein sequences, the interactions among the sequences are shown in Fig. 3.

FIGURE 3. Graph of 30 sequences interactions based on their adjacency matrix using BLAST -log(E-value)

The grouping of the 30 dengue virus protein sequences using R-MCL consists of 8 clusters, as shown in Fig. 4.
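For readers who want to reproduce this kind of grouping, the iteration described in the Methods section (regularize, inflate, prune, and the global chaos stopping rule) can be sketched in Python as follows. This is a minimal sketch and not the authors' Python 3.4 program. It assumes column-wise normalization, an adjacency matrix that already carries self-loops on its diagonal, a hypothetical pruning cutoff of 1e-6, and a simple cluster read-out that assigns each node to the row holding the largest entry of its column. All function and parameter names are illustrative.

```python
import numpy as np


def normalize(m):
    """Scale every column so that it sums to 1 (Equation 1)."""
    return m / m.sum(axis=0, keepdims=True)


def inflate(m, r):
    """Raise every entry to the power r, then renormalize (Equation 2)."""
    return normalize(np.power(m, r))


def global_chaos(m):
    """Largest per-column value of max(entry) - sum(entry ** 2) (Equation 3)."""
    return float(np.max(m.max(axis=0) - np.square(m).sum(axis=0)))


def rmcl(adjacency, r=2.0, e=1e-3, max_iter=200, prune_below=1e-6):
    """Iterate regularize, inflate, prune until the global chaos is below e."""
    m_g = normalize(np.asarray(adjacency, dtype=float))  # initial Markov matrix
    m = m_g.copy()
    for _ in range(max_iter):
        m = inflate(m @ m_g, r)       # regularize with M_G, then inflate
        m[m < prune_below] = 0.0      # prune tiny entries (assumed cutoff)
        m = normalize(m)              # restore column sums after pruning
        if global_chaos(m) < e:
            break
    return m


def clusters_from(m):
    """Group nodes by the row that holds the largest entry of their column."""
    groups = {}
    for node, attractor in enumerate(m.argmax(axis=0)):
        groups.setdefault(int(attractor), []).append(node)
    return sorted(groups.values())


# Two disconnected pairs of nodes: each pair should form its own cluster.
blocks = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 0],
                   [0, 0, 1, 1],
                   [0, 0, 1, 1]], dtype=float)
print(clusters_from(rmcl(blocks)))  # [[0, 1], [2, 3]]
```

In this toy example both nodes of a pair keep equal weight in the converged matrix, which illustrates how a cluster can end up with more than one center.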
We found that the R-MCL grouping does not differ much from, and preserves, the topology of the initial graph in Fig. 3. This is because the regularize process involves the initial graph topology in each iteration. The interactions of the protein sequences can generate more than one group center, where a group center is a protein sequence that might play an important role in cellular function. The more centers a group has, the denser the group.

The clustering results are also compared with the BLASTclust method. Clustering with BLASTclust produces 15 groups, as shown in Fig. 5. The results of clustering with R-MCL and with BLASTclust are compared in Table 2.

The grouping in Table 2 can be analyzed in two ways, namely by constituent protein and by serotype of the dengue virus protein sequences. Based on constituent protein, R-MCL places the structural proteins M and E in one group since both proteins have similar roles. BLASTclust, however, still distinguishes the structural proteins M and E, perhaps because their roles still differ slightly. Based on serotype, the R-MCL method does not take the serotype into account. Meanwhile, BLASTclust does take the serotype into account, for example by separating the non-structural protein NS2a of serotypes DEN 2 and DEN 3 into group 8 and group 3, respectively.

FIGURE 4. Clustering result using R-MCL

FIGURE 5. Clustering result using BLASTclust

TABLE 2. Clustering Results: Comparison between R-MCL and BLASTclust

R-MCL (Groups 1 to 8)

Group  Member  Protein constituent  Serotype
1      4       M                    DEN 1
1      10      M                    DEN 1
1      11      E                    DEN 2
1      12      M                    DEN 2
1      14      M                    DEN 2
1      18      M                    DEN 2
1      19      E                    DEN 2
1      21      preM                 DEN 3
1      22      M                    DEN 3
1      27      M                    DEN 3
2      5       C                    DEN 1
2      9       C                    DEN 1
2      13      C                    DEN 2
2      15      C                    DEN 2
2      17      C                    DEN 2
3      16      NS2a                 DEN 2
3      20      NS2b                 DEN 2
3      23      NS2a                 DEN 3
3      24      NS2b                 DEN 3
3      28      NS2a                 DEN 3
4      1       NS5                  DEN 1
4      8       NS5                  DEN 1
5      2       NS1                  DEN 1
5      7       NS1                  DEN 1
6      3       NS3                  DEN 1
6      6       NS3                  DEN 1
7      25      NS4a                 DEN 3
7      29      NS4a                 DEN 3
8      26      NS4b                 DEN 3
8      30      NS4b                 DEN 3

BLASTclust (Groups 1 to 15, members in the order listed in the table)

Member  Protein constituent  Serotype
12      M                    DEN 2
14      M                    DEN 2
18      M                    DEN 2
13      C                    DEN 2
15      C                    DEN 2
17      C                    DEN 2
4       M                    DEN 1
10      M                    DEN 1
22      M                    DEN 3
27      M                    DEN 3
5       C                    DEN 1
9       C                    DEN 1
25      NS4a                 DEN 3
29      NS4a                 DEN 3
23      NS2a                 DEN 3
28      NS2a                 DEN 3
26      NS4b                 DEN 3
30      NS4b                 DEN 3
2       NS1                  DEN 1
7       NS1                  DEN 1
11      E                    DEN 2
19      E                    DEN 2
3       NS3                  DEN 1
6       NS3                  DEN 1
1       NS5                  DEN 1
8       NS5                  DEN 1
24      NS2b                 DEN 3
20      NS2b                 DEN 2
21      preM                 DEN 3
16      NS2a                 DEN 2

CONCLUSIONS

This study concludes that the R-MCL algorithm produces 8 groups from the 30 dengue virus protein sequences, with each group having one or more centers. The more centers a group has, the higher the group density. High density can be seen from the number of interactions that take place within the group. A dense network tends to form a protein complex that serves as a unit of a specific biological process. The clustering results also show that the kinship of dengue virus produced by R-MCL consists of groups with a similar role, regardless of constituent protein and serotype. Furthermore, in terms of network topology, the R-MCL results are more likely to preserve the original network topology than other clustering results such as those of BLASTclust. Finally, in this study we found that BLASTclust is more likely than R-MCL to produce a larger number of clusters, with more detailed biological function in each group.
FURTHER RESEARCH

Since the R-MCL clustering of dengue virus protein sequences yields groups of proteins with the same role, the grouping results can be used for further study by other related fields.

REFERENCES

1. World Health Organization, Dengue Guidelines for Diagnosis, Treatment, Prevention and Control (WHO Press, Geneva, 2009), available at: www.who.int/tdr/publications/documents/dengue-diagnosis.pdf
2. X. Li and S. Kiong, in Biological Data Mining in Protein Interaction Networks, edited by X. L. Li (IGI Global, Pennsylvania, 2009).
3. A. Bustamam, K. Burrage, and N. A. Hamilton, Proceedings of the Second International Conference on Advances in Computing, Control, and Telecommunication Technologies (ACT), Jakarta, 2010 (IEEE, New Jersey, 2010), pp. 173–175.
4. V. Satuluri, Proceedings of the First ACM Conference on Bioinformatics and Computational Biology, New York, 2010 (ACM, New York, 2010), pp. 247–256.
5. Virus Pathogen Database and Analysis Resource, available at: http://www.viprbrc.org
6. A. Bustamam, K. Burrage, and N. A. Hamilton, IEEE/ACM Trans. Comput. Biol. Bioinf. 9, 679–692 (2012).
