Application of clustering methods: Regularized Markov clustering (R-MCL) for
analyzing dengue virus similarity
D. Lestari, D. Raharjo, A. Bustamam, B. Abdillah, and W. Widhianto
Citation: AIP Conference Proceedings 1862, 030130 (2017);
View online: https://doi.org/10.1063/1.4991234
View Table of Contents: http://aip.scitation.org/toc/apc/1862/1
Published by the American Institute of Physics
Application of Clustering Methods: Regularized Markov
Clustering (R-MCL) for Analyzing Dengue Virus Similarity
D. Lestari, D. Raharjo, A. Bustamam a), B. Abdillah, and W. Widhianto
Department of Mathematics, Faculty of Mathematics and Natural Sciences (FMIPA),
Universitas Indonesia, Depok 16424, Indonesia
a) Corresponding author: alhadi@sci.ui.ac.id
Abstract. Dengue virus consists of 10 different constituent proteins and is classified into 4 major serotypes
(DEN 1 - DEN 4). This study clusters 30 protein sequences of dengue virus taken from the Virus Pathogen Database
and Analysis Resource (ViPR) using the Regularized Markov Clustering (R-MCL) algorithm and analyzes the result.
Implemented in Python 3.4, the R-MCL algorithm produces 8 clusters, several of which have more than one centroid.
The number of centroids indicates the density of interactions within a cluster. Protein interactions that are
connected in a network form a protein complex that serves as a unit of a specific biological process. The analysis
shows that R-MCL clustering produces clusters of the dengue virus family based on the similar roles of their
constituent proteins, regardless of serotype.
Keywords: Bioinformatics; Clustering; Dengue Virus; Protein Interactions; R-MCL; Sequence Alignment.
INTRODUCTION
Dengue virus, the cause of dengue, is a dangerous virus that can mutate quickly. The virus has four main
serotypes: DEN-1, DEN-2, DEN-3 and DEN-4. These serotypes are found in several areas of Indonesia. Serotype DEN-3
is the dominant serotype and causes severe cases. Different serotypes also cause different symptoms in patients.
Dengue virus is built from 10 proteins with different roles: three structural proteins (C, M, E) and seven
non-structural proteins (NS1, NS2a, NS2b, NS3, NS4a, NS4b, NS5) [1]. Considering the size of dengue virus
sequences, a clustering technique is needed to analyze them efficiently. Clustering is a data analysis method
whose purpose is to group data with similar characteristics into the same group [2]. One such method is Markov
Clustering (MCL). MCL, which was originally developed for the general problem of graph clustering, has recently
been adopted in a wide range of applications, including bioinformatics. The algorithm has also been reviewed
intensively and has been shown to be robust and reliable compared to many other clustering algorithms [3]. In
this study, we use an improvement of Markov clustering called Regularized Markov Clustering (R-MCL). Each
iteration of this method has two primary processes, regularize and inflate [4]. The R-MCL procedure is listed
below.
Input  : matrix M, inflate parameter r
Output : matrix M, cluster entries
1. M := M + I                  // add self-loops to the graph
2. M := normalize(M); M_G := M // M_G is the initial Markov matrix
repeat
3. M := regularize(M * M_G)
4. M := inflate(M, r)
5. M := prune(M)
until M converges
International Symposium on Current Progress in Mathematics and Sciences 2016 (ISCPMS 2016)
AIP Conf. Proc. 1862, 030130-1–030130-7; doi: 10.1063/1.4991234
Published by AIP Publishing. 978-0-7354-1536-2/$30.00
The purpose of this study is to cluster dengue virus protein sequences with the R-MCL method and to identify
relationships among the sequences from the clustering result.
METHODS
In general, the clustering method in this research is divided into two steps. The first step is forming the
adjacency matrix from the input data. The second step is clustering the dengue virus protein sequences using the
R-MCL method. The whole procedure is illustrated in Fig. 1.
Forming the adjacency matrix is the first step that must be done to cluster dengue virus protein sequences.
The adjacency matrix is obtained by running BLAST sequence alignment on the protein sequences and applying the
formula in Fig. 1 to the alignment results.
Each dengue virus has different protein sequences, written in FASTA format. The data in this research are
30 dengue virus protein sequences collected in Indonesia from 2010 to 2014 and taken from the Virus Pathogen
Database and Analysis Resource (ViPR) [5]. The data are listed in Table 1. Codes 1 to 30 in Table 1 serve as
short names for the dengue virus protein sequences.
Sequence alignment of the dengue virus protein sequences is done using BLAST from the National Center for
Biotechnology Information (NCBI). The BLAST alignment results summarize the similarity of the dengue virus
protein sequences. The E-value is an estimate of the statistical significance of the match between two
sequences. A higher E-value indicates a lower level of homology between the sequences, and vice versa. The
formation of the adjacency matrix from the E-values is shown in Fig. 2.
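The conversion from E-values to edge weights can be sketched as follows. This is a minimal illustration rather than the authors' program: it assumes the pairwise E-values are already available as a dictionary, that the logarithm is base 10, and that an E-value of 0 is capped at a hypothetical maximum weight (`max_weight`). The function name and the sample E-values are made up for the example.

```python
import math

import numpy as np


def adjacency_from_evalues(n, evalues, max_weight=100.0):
    """Build a symmetric n x n weight matrix from pairwise BLAST E-values.

    evalues maps (i, j) index pairs to E-values; pairs without a hit keep weight 0.
    """
    A = np.zeros((n, n))
    for (i, j), e in evalues.items():
        # An E-value of 0 would give an infinite weight, so cap it (assumption).
        w = max_weight if e <= 0.0 else min(-math.log10(e), max_weight)
        w = max(w, 0.0)  # E-values above 1 would give a negative weight; clamp to 0
        A[i, j] = A[j, i] = max(A[i, j], w)
    return A


# Made-up E-values for three sequences, for illustration only.
evalues = {(0, 1): 1e-50, (1, 2): 1e-30, (0, 2): 5.0}
A = adjacency_from_evalues(3, evalues)
```

A small E-value (strong homology) becomes a large weight, while the pair with E-value 5.0 ends up with weight 0 and therefore no edge.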
FIGURE 1. Clustering Flowchart
TABLE 1. Thirty Dengue Virus Protein Sequences Data
Code  Identity                            Protein  Serotype
1     gi:VIPR_ALG4_573592343_7574_10270   NS5      DEN 1
2     gi:VIPR_ALG4_573592343_2420_3475    NS1      DEN 1
3     gi:VIPR_ALG4_573592343_4520_6376    NS3      DEN 1
4     gi:VIPR_ALG4_573592327_710_934      M        DEN 1
5     gi:VIPR_ALG4_573592327_95_394       C        DEN 1
6     gi:VIPR_ALG4_573592327_4520_6376    NS3      DEN 1
7     gi:VIPR_ALG4_573592333_2420_3475    NS1      DEN 1
8     gi:VIPR_ALG4_573592333_7574_10270   NS5      DEN 1
9     gi:VIPR_ALG4_573592333_95_394       C        DEN 1
10    gi:VIPR_ALG4_573592333_710_934      M        DEN 1
11    gi:VIPR_ALG4_573592405_937_2421     E        DEN 2
12    gi:VIPR_ALG4_573592405_712_936      M        DEN 2
13    gi:VIPR_ALG4_573592405_97_396       C        DEN 2
14    gi:VIPR_ALG4_573592407_712_936      M        DEN 2
15    gi:VIPR_ALG4_573592407_97_396       C        DEN 2
16    gi:VIPR_ALG4_573592407_3478_4131    NS2a     DEN 2
17    gi:VIPR_ALG4_573592409_97_396       C        DEN 2
18    gi:VIPR_ALG4_573592409_712_936      M        DEN 2
19    gi:VIPR_ALG4_573592409_937_2421     E        DEN 2
20    gi:VIPR_ALG4_573592409_4132_4521    NS2b     DEN 2
21    gi:VIPR_ALG4_573592433_437_934      preM     DEN 3
22    gi:VIPR_ALG4_573592433_710_934      M        DEN 3
23    gi:VIPR_ALG4_573592433_3470_4123    NS2a     DEN 3
24    gi:VIPR_ALG4_573592433_4124_4513    NS2b     DEN 3
25    gi:VIPR_ALG4_573592433_6371_6751    NS4a     DEN 3
26    gi:VIPR_ALG4_573592433_6821_7564    NS4b     DEN 3
27    gi:VIPR_ALG4_573592435_710_934      M        DEN 3
28    gi:VIPR_ALG4_573592435_3470_4123    NS2a     DEN 3
29    gi:VIPR_ALG4_573592435_6371_6751    NS4a     DEN 3
30    gi:VIPR_ALG4_573592435_6821_7564    NS4b     DEN 3
(b)
100   50   45    0   30    0    0
 50  100    0    0   50    0    0
 45    0  100    0   40    0    0
  0    0    0  100   55    0   45
 30   50   40   55  100   30   60
  0    0    0    0   30  100    0
  0    0    0   45   60    0  100
FIGURE 2. Adjacency Matrix Form [5]: (a) graph whose edges are labeled with E-values,
(b) adjacency matrix whose entries are -log(E-value)
After forming the adjacency matrix, the second step is to run the R-MCL method. The first phase of the R-MCL
method takes the adjacency matrix as the input matrix M. The Markov matrix input is then formed by normalizing
the adjacency matrix M [4], using Equation 1.
\[ M_{ij} = \frac{M_{ij}}{\sum_{k=1}^{n} M_{kj}}, \qquad i, j = 1, 2, \ldots, n \tag{1} \]
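As an illustration of the normalization step, the sketch below applies column normalization to the small example matrix of Fig. 2(b). It assumes the normalization runs over columns, as is usual for Markov clustering, and that no column sums to zero. The function name `normalize_columns` is chosen for this example only.

```python
import numpy as np

# Example adjacency matrix, entries as given in Fig. 2(b).
A = np.array([
    [100, 50, 45, 0, 30, 0, 0],
    [50, 100, 0, 0, 50, 0, 0],
    [45, 0, 100, 0, 40, 0, 0],
    [0, 0, 0, 100, 55, 0, 45],
    [30, 50, 40, 55, 100, 30, 60],
    [0, 0, 0, 0, 30, 100, 0],
    [0, 0, 0, 45, 60, 0, 100],
], dtype=float)


def normalize_columns(M):
    """Divide every column by its sum so that each column sums to 1."""
    col_sums = M.sum(axis=0, keepdims=True)
    return M / col_sums


M_G = normalize_columns(A)  # initial Markov matrix
```

The first column of A sums to 225, so its diagonal entry becomes 100/225 after normalization.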
The second phase of R-MCL is the regularize process. Its purpose is to bring out protein interactions that have
not yet emerged between proteins that have the possibility to interact. The regularize process multiplies the
Markov matrix input of the current iteration (M) by the initial Markov matrix input (M_G). The first iteration
starts by multiplying the Markov matrix input with itself (M = M_G). Every iteration then performs M := M * M_G;
the product is the regularized matrix, which is carried forward as the Markov matrix input of the next iteration.
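A small numerical sketch may help to show how the regularize step brings out interactions that are absent from the initial graph. The 3-node path graph below is illustrative data, not taken from the paper, and `regularize` is a name chosen for this example.

```python
import numpy as np


def regularize(M, M_G):
    """One regularize step: multiply the current matrix by the initial Markov matrix."""
    return M @ M_G


# Path graph 0 - 1 - 2 with self-loops (illustrative data).
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])
M_G = A / A.sum(axis=0, keepdims=True)  # column-normalized initial matrix

M = regularize(M_G, M_G)  # first iteration: M starts as M_G
```

Nodes 0 and 2 are not adjacent, so `M_G[2, 0]` is 0, but after one regularize step `M[2, 0]` is 1/6: flow now reaches node 2 from node 0 through node 1. The columns of `M` still sum to 1.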
The third phase is inflate, which raises each element of the regularized matrix to the power of the inflation
factor r (see Equation 2). The purpose is to strengthen the stronger interactions and weaken the weaker ones
while preserving the initial graph topology. The result is immediately normalized again, so the inflate step
applies two operations at once. The inflated matrix (\Gamma_r M) is the matrix inflated with parameter r, as
defined in Equation 2 [4]:
\[ (\Gamma_r M)_{ij} = \frac{(M_{ij})^{r}}{\sum_{k=1}^{n} (M_{kj})^{r}}, \qquad i, j = 1, 2, \ldots, n \tag{2} \]
The default value of r is usually 2. For r > 1, the elements in each column become less uniform. For r between
0 and 1, the elements become more uniform, and r = 1 leaves them unchanged. A negative r is not allowed because
it would turn large elements into small ones and small elements into large ones. These steps make up one
iteration; if the inflated matrix has not yet converged, the iteration continues.
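The effect of the inflation factor can be checked on a single column. The sketch below is only an illustration of Equation 2 under the assumption of column-wise normalization; the column values are made up.

```python
import numpy as np


def inflate(M, r=2.0):
    """Raise every entry to the power r, then renormalize each column."""
    P = np.power(M, r)
    return P / P.sum(axis=0, keepdims=True)


col = np.array([[0.5], [0.3], [0.2]])  # one stochastic column
sharper = inflate(col, r=2.0)  # largest entry grows: 0.25 / 0.38
flatter = inflate(col, r=0.5)  # entries move toward each other
same = inflate(col, r=1.0)     # column is unchanged
```

With r = 2 the largest entry rises from 0.5 to 0.25/0.38, about 0.66, while with r = 0.5 it drops below 0.5, so the column becomes more uniform.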
Each iteration consists of the regularize and inflate processes, and the iterations drive the matrix toward an
idempotent matrix. Iteration stops at the idempotent condition, that is, when the global chaos falls below the
minimum threshold e; the default is e = 10^{-3} [6].
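\[ \mathrm{chaos}_j = \max_{i} (\Gamma_r M)_{ij} - \sum_{i=1}^{n} (\Gamma_r M)_{ij}^{2}, \qquad i = 1, 2, \ldots, n \]
\[ \mathrm{global\_chaos} = \max_{j} \mathrm{chaos}_j, \qquad j = 1, 2, \ldots, n \tag{3} \]
FIGURE 3. Graph of 30 sequences interactions based on their Adjacency Matrix using BLAST -log(E-value)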
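Putting the phases together, the whole procedure can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the program used in this study: normalization is column-wise, the chaos of a column is its largest entry minus the sum of its squared entries, and the pruning threshold `prune_below` and the iteration cap `max_iter` are hypothetical parameters introduced for the example.

```python
import numpy as np


def normalize_columns(M):
    """Divide every column by its sum."""
    return M / M.sum(axis=0, keepdims=True)


def inflate(M, r):
    """Element-wise power followed by column normalization."""
    return normalize_columns(np.power(M, r))


def global_chaos(M):
    """Largest column chaos: max entry minus sum of squared entries."""
    return float(np.max(M.max(axis=0) - np.sum(M ** 2, axis=0)))


def r_mcl(A, r=2.0, eps=1e-3, prune_below=1e-6, max_iter=200):
    """Sketch of R-MCL on a non-negative adjacency matrix A."""
    M_G = normalize_columns(A + np.eye(A.shape[0]))  # self-loops, then normalize
    M = M_G.copy()
    for _ in range(max_iter):
        M = M @ M_G                # regularize with the initial matrix
        M = inflate(M, r)          # inflate
        M[M < prune_below] = 0.0   # prune tiny entries (threshold is an assumption)
        M = normalize_columns(M)
        if global_chaos(M) < eps:  # idempotent condition
            break
    return M


# Two disconnected triangles (illustrative data): nodes 0-2 and nodes 3-5.
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
np.fill_diagonal(A, 0.0)
M = r_mcl(A)
```

For this toy graph the result keeps the two blocks separate: no flow passes between nodes 0-2 and nodes 3-5, and every column still sums to 1.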
RESULTS AND DISCUSSION
The input matrix is the adjacency matrix of the 30 dengue virus protein sequences, obtained from the E-values of
the BLAST alignment results. The adjacency matrix has size 30 x 30, matching the number of dengue virus protein
sequences. Alignment is performed by taking one of the thirty dengue virus protein sequences as the global
consensus for the other sequences. The sequence that serves as the global consensus determines the column of the
adjacency matrix, while its alignment results with the other sequences become the elements of that column.
This study uses Code 1 as the first global consensus; the results of this alignment become the elements of the
first column of the adjacency matrix. The sequence with Code 2 is the second global consensus, and the results
of its alignment become the elements of the second column. And so forth, until the sequence with Code 30 serves
as the thirtieth global consensus.
Based on the E-values from the alignment of the 30 dengue virus protein sequences, the interactions of the
sequences can be drawn as in Fig. 3. Grouping the 30 dengue virus protein sequences using R-MCL yields 8
clusters, as shown in Fig. 4. We found that the grouping produced by R-MCL is not much different from, and
preserves the topology of, the initial graph in Fig. 3. This is because the regularize process involves the
initial graph topology in each iteration. The interaction of the protein sequences can generate more than one
group center, where a group center is a protein sequence that might play an important role in cellular function.
The more centers a group has, the denser the group. The clustering results are also compared with the BLASTclust
method. Clustering with BLASTclust produces 15 groups, as shown in Fig. 5. The results of clustering using R-MCL
and BLASTclust are compared in Table 2.
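One way to read groups and their centers from a converged matrix is sketched below. It assumes, following the usual interpretation of Markov clustering, that the rows with nonzero entries in a column are the centers of that column's node, and that nodes sharing the same set of centers form one group. The function name, the tolerance, and the hand-written matrix are illustrative and are not output of this study.

```python
import numpy as np


def extract_clusters(M, tol=1e-6):
    """Group nodes (columns) by the set of rows (centers) that attract them."""
    clusters = {}
    for j in range(M.shape[1]):
        centers = tuple(np.flatnonzero(M[:, j] > tol).tolist())
        clusters.setdefault(centers, []).append(j)
    return clusters


# Hand-written converged matrix for illustration: nodes 0-2 share the centers
# 0 and 1, while nodes 3-4 share the single center 3.
M = np.array([
    [0.5, 0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
])
clusters = extract_clusters(M)
```

Here the first group has two centers and the second has one, which mirrors the observation that a group may have more than one center.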
The grouping results in Table 2 can be analyzed in two ways, namely by protein constituent and by serotype of
the dengue virus protein sequences. By protein constituent, the first method, R-MCL clustering, places the
structural proteins M and E into one group since both proteins have a similar role. The second method,
BLASTclust, still distinguishes the structural proteins M and E, perhaps because their roles still differ
slightly. By serotype, the R-MCL method does not take serotype into account. Meanwhile, the BLASTclust method
does take serotype into account, for example by separating the non-structural protein NS2a of serotypes DEN 2
and DEN 3 into different groups, group 8 and group 3 respectively.
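The serotype observation for R-MCL can be checked with a short script. The serotypes follow Table 1 (codes 1 to 10 are DEN 1, 11 to 20 are DEN 2, 21 to 30 are DEN 3). The group membership is read from Table 2, with the group boundaries inferred from the order of the members, which is an assumption of this sketch.

```python
# Serotype per sequence code, following Table 1.
serotype = {c: "DEN 1" for c in range(1, 11)}
serotype.update({c: "DEN 2" for c in range(11, 21)})
serotype.update({c: "DEN 3" for c in range(21, 31)})

# R-MCL groups as read from Table 2 (boundaries inferred from the member order).
rmcl_groups = {
    1: [4, 10, 11, 12, 14, 18, 19, 21, 22, 27],
    2: [5, 9, 13, 15, 17],
    3: [16, 20, 23, 24, 28],
    4: [1, 8],
    5: [2, 7],
    6: [3, 6],
    7: [25, 29],
    8: [26, 30],
}

# Set of serotypes found in each group.
mixed = {g: sorted({serotype[m] for m in members})
         for g, members in rmcl_groups.items()}
n_mixed = sum(len(s) > 1 for s in mixed.values())
```

Under this reading, three of the eight R-MCL groups contain more than one serotype, and group 1 contains all three, which is consistent with R-MCL grouping by protein role rather than by serotype.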
FIGURE 4. Clustering result using R-MCL
FIGURE 5. Clustering result using BLASTclust
TABLE 2. Clustering Results: Comparison between R-MCL and BLASTclust

R-MCL
Group  Member  Protein constituent  Serotype
1      4       M                    DEN 1
       10      M                    DEN 1
       11      E                    DEN 2
       12      M                    DEN 2
       14      M                    DEN 2
       18      M                    DEN 2
       19      E                    DEN 2
       21      preM                 DEN 3
       22      M                    DEN 3
       27      M                    DEN 3
2      5       C                    DEN 1
       9       C                    DEN 1
       13      C                    DEN 2
       15      C                    DEN 2
       17      C                    DEN 2
3      16      NS2a                 DEN 2
       20      NS2b                 DEN 2
       23      NS2a                 DEN 3
       24      NS2b                 DEN 3
       28      NS2a                 DEN 3
4      1       NS5                  DEN 1
       8       NS5                  DEN 1
5      2       NS1                  DEN 1
       7       NS1                  DEN 1
6      3       NS3                  DEN 1
       6       NS3                  DEN 1
7      25      NS4a                 DEN 3
       29      NS4a                 DEN 3
8      26      NS4b                 DEN 3
       30      NS4b                 DEN 3

BLASTclust (groups 1 to 15; members listed in table order, group boundaries not marked)
Member  Protein constituent  Serotype
12      M                    DEN 2
14      M                    DEN 2
18      M                    DEN 2
13      C                    DEN 2
15      C                    DEN 2
17      C                    DEN 2
4       M                    DEN 1
10      M                    DEN 1
22      M                    DEN 3
27      M                    DEN 3
5       C                    DEN 1
9       C                    DEN 1
25      NS4a                 DEN 3
29      NS4a                 DEN 3
23      NS2a                 DEN 3
28      NS2a                 DEN 3
26      NS4b                 DEN 3
30      NS4b                 DEN 3
2       NS1                  DEN 1
7       NS1                  DEN 1
11      E                    DEN 2
19      E                    DEN 2
3       NS3                  DEN 1
6       NS3                  DEN 1
1       NS5                  DEN 1
8       NS5                  DEN 1
24      NS2b                 DEN 3
20      NS2b                 DEN 2
21      preM                 DEN 3
16      NS2a                 DEN 2
CONCLUSIONS
This study concludes that the R-MCL algorithm produces 8 groups from the 30 dengue virus protein sequences,
with each group having one or more centers. The more centers a group has, the higher the group density. High
density is reflected in the number of interactions that take place within the group. A dense network tends to
form a protein complex that serves as a unit of a specific biological process. The clustering results also show
that the kinship groups of dengue virus produced by R-MCL are groups whose constituent proteins have a similar
role, regardless of serotype. Furthermore, in terms of network topology, the R-MCL results are more likely to
preserve the original network topology than other clustering results such as those of BLASTclust. Finally, we
found that BLASTclust is more likely than R-MCL to produce a larger number of clusters, with more detailed
biological function in each cluster.
FURTHER RESEARCH
Since clustering dengue virus protein sequences with R-MCL yields groups of proteins with the same role, the
grouping results can be used for further study in other related fields.
REFERENCES
1. World Health Organization, Dengue Guidelines for Diagnosis, Treatment, Prevention and Control (WHO Press,
   Geneva, 2009), available at: www.who.int/tdr/publications/documents/dengue-diagnosis.pdf
2. X. Li and S. Kiong, in Biological Data Mining in Protein Interaction Networks, edited by X. L. Li (IGI Global,
   Pennsylvania, 2009).
3. A. Bustamam, K. Burrage, and N. A. Hamilton, Proceedings of the Second International Conference on
   Advances in Computing, Control, and Telecommunication Technologies (ACT), Jakarta, 2010 (IEEE, New
   Jersey, 2010), pp. 173–175.
4. V. Satuluri, Proceedings of the First ACM Conference on Bioinformatics and Computational Biology, New
   York, 2010 (ACM, New York, 2010), pp. 247–256.
5. Virus Pathogen Database and Analysis Resource, available at: http://www.viprbrc.org
6. A. Bustamam, K. Burrage, and N. A. Hamilton, IEEE/ACM Trans. Comput. Biol. Bioinf. 9, 679–692 (2012).