Chemometrics and Intelligent Laboratory Systems xxx (2017) 1–11
Correlation and uncorrelation between hyphenated chromatographic data
with chemometric tools
Ze-ying Wu a, *, Zhongda Zeng b, c, **, Zidan Xiao d, Daniel Kam-Wah Mok b, e, ***, Hoiyan Chan d
a School of Mathematics, Physics and Chemical Engineering, Changzhou Institute of Technology, Changzhou, China
b Chemometrics and Herbal Medicine Laboratory, Department of Applied Biology and Chemical Technology, Hong Kong Polytechnic University, Hung Hom, Hong Kong, China
c Dalian ChemDataSolution Technology Co. Ltd., Dalian, China
d School of Chemical and Biological Engineering, Changsha University of Science & Technology, Changsha, China
e State Key Laboratory of Chinese Medicine and Molecular Pharmacology, Shenzhen, China
Keywords: Hyphenated chromatography; Canonical correlation analysis; Orthogonal projection
Correlation studies of chromatographic data are common strategies used to relate complex mixtures of chemical
components. The congruence coefficient and correlation coefficient are commonly used to indicate the similarity
or correlation between different hyphenated chromatograms. However, these indices typically reduce chromatograms to a single dimension, and information in the other dimensions is not fully utilized. In this work, a new
technique is developed to identify possible relationships among related high-dimensional data sets using powerful
chemometric tools. First, principal component analysis is used to reduce experimental noise by reconstructing the
original data sets. Then, canonical correlation analysis is utilized to obtain the canonical vectors of both data sets
for comparison, which makes identification of the possible relationships between the data sets easier. An
orthogonal projection operation is then applied to identify both common and different information between the
matrix spaces spanned by the canonical vectors. Finally, the correlation and uncorrelation indices are defined
from both the chromatographic and spectral directions on the basis of the Euclidean distance of all the elements of
the final projection matrices. The new indices are more representative because they are generated via the complete employment of the entire data information that is embedded in hyphenated chromatography. In contrast to
the conventional coefficients, the indices proposed in this study provide improved performance in a simulated
HPLC-DAD data set and 12 real GC-MS data sets of ginseng, a widely used herbal medicine. The effects of various
potential factors on the results are investigated.
1. Introduction
Chromatography has been a mainstay technique for separating complex mixtures that contain hundreds or more chemical components,
including herbal medicines [1], foods [2], and biological fluids [3]. The
two widely used one-dimensional separations, gas and liquid chromatography, offer possible approaches to solve most analytical problems.
However, the need for greater resolution power continues to drive advances in chromatography to enable the study of increasingly complex
samples. The number of chromatograms for pattern recognition in a
typical real investigation has increased from 50 to 1000 during the past
few decades [4]. To achieve better separations, multidimensional and/or
hyphenated chromatography have been introduced to extract new information from sophisticated systems. Multidimensional chromatography attains this goal through the utilization of two orthogonal
separation mechanisms to increase the resolution power and peak capacity, whereas hyphenated chromatography provides spectral information related to the eluted components for possible qualitative
identification [5]. Notably, the separation resolution is indirectly
improved in hyphenated chromatography [6]. To date, most real
* Corresponding author. School of Mathematics, Physics and Chemical Engineering, Changzhou Institute of Technology, 666 Liaohe Road, 213032, Changzhou, China.
** Corresponding author. Chemometrics and Herbal Medicine Laboratory, Department of Applied Biology and Chemical Technology, Hong Kong Polytechnic University, Hung Hom,
Hong Kong, China.
*** Corresponding author. Chemometrics and Herbal Medicine Laboratory, Department of Applied Biology and Chemical Technology, Hong Kong Polytechnic University, Hung Hom,
Hong Kong, China.
E-mail addresses: (Z.-y. Wu), (Z. Zeng), (D. Kam-Wah Mok).
Received 23 June 2017; Received in revised form 3 September 2017; Accepted 10 October 2017
Available online xxxx
0169-7439/© 2017 Elsevier B.V. All rights reserved.
Please cite this article in press as: Z.-y. Wu, et al., Correlation and uncorrelation between hyphenated chromatographic data with chemometric tools,
Chemometrics and Intelligent Laboratory Systems (2017),
correlation of two hyphenated chromatographic data are the congruence
coefficient and the correlation coefficient [11,12]. Both coefficients are
simple and intuitive through the use of the total or mean chromatographic profiles of the entire two-dimensional data sets. However, their
effectiveness should be examined because a reduction in the data dimensions will unavoidably result in the loss of much useful information.
This phenomenon is analogous to evaluating the similarities of two
three-dimensional spaces (3D objects) by comparing the shapes of their
projections on a 2D plane, thus leading to an inaccurate assessment. In
some cases, these coefficients can yield very misleading results [13,14].
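The information loss can be illustrated with a minimal Python sketch. The helper names `congruence` and `correlation` and the toy two-component data are illustrative assumptions and are not taken from this study. Two bilinear data sets share their elution profiles but carry swapped spectra, so their total chromatographic profiles coincide and both conventional coefficients equal one although the matrices differ.

```python
import numpy as np

def congruence(p, q):
    """Congruence coefficient: cosine of the angle between two profiles."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

def correlation(p, q):
    """Pearson correlation coefficient between two profiles."""
    return float(np.corrcoef(p, q)[0, 1])

# Hypothetical two-component data sets X = C S^T
# (rows: retention times, columns: spectral channels).
t = np.linspace(0.0, 1.0, 50)
C = np.column_stack([np.exp(-((t - 0.3) / 0.05) ** 2),
                     np.exp(-((t - 0.7) / 0.05) ** 2)])
S = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])      # 3 channels x 2 components; both spectra sum to 1.5
X = C @ S.T                     # component 1 carries spectrum 1
Y = C @ S[:, ::-1].T            # same elution profiles, spectra swapped

total_x = X.sum(axis=1)         # total chromatographic profiles
total_y = Y.sum(axis=1)
```

Here `congruence(total_x, total_y)` and `correlation(total_x, total_y)` both evaluate to one within rounding, whereas `np.allclose(X, Y)` is false.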
Similarity is a general concept with very wide applications, and it is
not limited to comparisons of chromatographic profiles or other instrumental data. This concept is also used in numerous research fields,
including gene and protein sequence searches, molecular-structure
comparisons, image or fingerprint matching, and explorations of many
types of data sets, as in psychology [15–21]. Through similarity studies,
relevant information can be extracted from data sets or available databases. Statistical methods, clustering algorithms, and distance measurements have all been developed to assess the correlation of gene and
protein expression data with their structures or functions [15,21–23].
Techniques such as kernel-approach, information-theory, and support
vector machine (SVM)-based strategies have been employed to represent
the similarities of other types of data, including facial images, chemical
(sub-)structures, and fingerprints for identification [16–20]. Most of
these data have unique features, and the utilized strategies must be
restricted to solve particular problems.
Hyphenated chromatography has its own unique properties. The
response of the spectral dimension is often proportional to the concentration of the analyte of interest. For absorption measurements that obey the Beer-Lambert law, the data are bilinear in both the chromatographic and spectral dimensions [24]. This property is the key to determining pure peaks from
overlapped chromatographic clusters and to extracting qualitative and
quantitative information of target chemical components using chemometric deconvolution algorithms [25]. However, in correlation and
Fig. 1. Illustration showing the correlation and uncorrelation between the two data
matrices X and Y using the orthogonal projection technique.
complicated mixtures, such as those studied in metabonomics and proteomics, have been analyzed using this approach. The hyphenated
chromatography instrumentation includes gas or liquid chromatography
hyphenated with mass spectrometry (GC-MS and LC-MSn), and many
others [7–9].
After the hyphenated chromatographic data containing all the
detected components have been obtained, data evaluation and/or information extraction become the next critical step for sample interpretation. Scientists, including chromatography experts, chemical analysts,
and chemometricians, have conducted extensive research on this subject,
especially with regard to data evaluation and pretreatment, deconvolution, pattern recognition, and other data-processing programs [10]. In
such investigations, the study of correlation and uncorrelation between
different but related chromatographic data is fundamental. Conventionally, the most widely employed standards for assessing the
Fig. 2. Three-dimensional figure of three different simulated HPLC-DAD data sets (A, B, and C), each possessing the same total chromatographic profile (D).
Table 1
The parameters for simulation of chromatographic profile (position/half-width/intensity) and UV spectra (a/ν0/γ) to construct X and Y. They were used to demonstrate the strategy
proposed in this work.
Fig. 3. Simulated data sets X and Y used to demonstrate the developed strategy and the determination of the number of principal components (PCs). (A) and (B), Three-dimensional figures
of the two simulated data sets X and Y. (C) and (D), Determination of the number of PCs for data sets X and Y using the residual index and expressed variance.
number of principal components can then be retained to construct the
new data for analysis using principal component analysis (PCA) [26]. The
interferences of noise can thus be substantially reduced to acceptable
levels. The numbers of principal components (PCs) and unexpressed residuals are determined by inspecting the changes in the contiguous singular values obtained from singular value decomposition (SVD) and their
potential effects on the results. Next, canonical correlation analysis (CCA)
is used to extract the correlation of different but related chromatographic
data matrices, depending on the properties of the canonical correlation
variates (CCVs) [27–29]. Finally, two distance indices that represent the
sameness and difference of X and Y for correlation and uncorrelation
evaluation, respectively, are produced using the orthogonal projection
(OP) technique [30–32]. The data in the CCV space are projected into
uncorrelation studies, the existence of experimental noise, including
homoscedastic and heteroscedastic noise, affects the performance of
most methods. Thus, new strategies that are more tolerant of noise and
that are effective in evaluating the similarities of the chromatograms of
complex samples such as herbal medicines and biological fluids, should
be developed.
In this work, a new technique is developed to assess the correlation
and uncorrelation of hyphenated chromatograms with the help of chemometric tools. The new method emphasizes the effective use of all the
data to achieve a faithful representation of the components, similarities,
and correlation among the data sets. The objective is to develop a more
reliable parameter that represents the relationship between the data sets.
Suppose we have two data sets X and Y from our studies; a certain
Fig. 4. Effects of homoscedastic and heteroscedastic noise and a large chromatographic profile on the correlation index. (A) and (B), Effects of different levels of homoscedastic and
heteroscedastic noise on correlation index. (C) and (D), Effects of different enlargement times of the fifth component in data set X on the correlation index. The chromatography of the fifth
component in Y is replaced by the component in the same location in X (C), and both the chromatographic and spectral profiles of the fifth component in Y are replaced by the corresponding component in X (D).
on the correlation and uncorrelation indices. The results demonstrate
that the new method is highly effective in evaluating the relationships
between chromatograms from hyphenated chromatography during investigations of herbal medicine fingerprints and metabolite profiles in
2. Theory
Suppose two chromatograms from hyphenated chromatography, X
and Y, are full or partial chromatograms. The sizes of the two matrices for
X and Y are m1 × n1 and m2 × n2, respectively, meaning that the data
contain m1 and m2 chromatographic measurement points and n1 and n2
spectral channels. Each row represents a spectrum at a specific retention
time, and each column corresponds to a chromatographic profile of a
certain wavelength or m/z ratio.
2.1. Canonical correlation analysis (CCA)
Fig. 5. Effects of different numbers of PCs employed to obtain correlation and uncorrelation indices (noise level, 0.0002).
CCA was first proposed by Hotelling in 1936 [27]. It is a multivariate
statistical method for assessing and further correlating linear relationships between two multidimensional variables, such as data sets X and Y
in this study. In particular, it can be used as an exploratory tool to study
multiple variables with relation to a given analytical category. The
strategy of this method is to find two basis vectors a and b for target
matrices X and Y, respectively, such that the correlation between their
projections onto the corresponding vectors is mutually maximized. Unlike the PCA method, which attempts to identify the basis vectors of
maximum variance, CCA exploits the correlation instead and is highly
adaptable to spectral data. First, let the set of paired samples of (x, y) be Z =
((x1, y1), (x2, y2), …, (xi, yi), …, (xn, yn)). CCA seeks to obtain
the canonical weights a and b in Eqs. (1) and (2), respectively, and to
maximize the correlation between the canonical variates U and V:
two parts: one part is parallel to the CCV matrix of the other data to
describe similarity, and the other part is orthogonal to describe the
dissimilarity. Thus, all the data points and information in data X and Y
are utilized to compute the correlation and uncorrelation indices.
Ginseng is a widely used herb in western and Asian countries. Its
reported biological activities include improving psychological and immune functions, reducing blood sugar and cholesterol levels, enhancing
strength, and promoting relaxation, among others [31,33]. We used GC-MS data of ginseng samples to examine the proposed strategy. The
samples included four different medicinal parts of three types of ginseng.
In addition, some HPLC-DAD data sets are simulated to study the effects
of certain experimental and instrumental factors, such as noise types and
levels, data background and shifts, and unique chromatographic peaks,
Fig. 6. Effects of different degrees of data shifts and background levels on the correlation index. (A) and (B), Effects of a shift of the data points of data sets X (A) and Y (B) on the
correlation index. (C), Effects of different background levels on the correlation index. Integrated linear shift of the whole data sets X and Y. (D), Piecewise linear shift of the data sets X and
Y (integrated non-linear background with noise level of 0.0002).
$$\rho = \max_{U,V} \operatorname{corr}(U, V) = \max_{a,b} \frac{E[\langle a, x\rangle \langle b, y\rangle]}{\sqrt{E[\langle a, x\rangle^{2}]\, E[\langle b, y\rangle^{2}]}} = \max_{a,b} \frac{a^{T} C_{xy}\, b}{\sqrt{a^{T} C_{xx}\, a \; b^{T} C_{yy}\, b}} \tag{3}$$
where ρ is the correlation coefficient and E is used to represent expectation. Matrices Cxx, Cyy, and Cxy are the within- and between-sets
covariance of x and y and construct the total block covariance matrix
C, as shown in Eq. (4):
$$C = \begin{bmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{bmatrix} \tag{4}$$
Mathematically, values of a and b that maximize the correlation can
be transformed into the following two eigenvalue equations:
Fig. 7. Total ion chromatograms (TICs) of different parts of C-, S-, and W-ginseng. Curves
1 to 4, 5 to 8, and 9 to 12 represent parts A, B, C, and D of C-ginseng, S-ginseng, and W-ginseng, respectively.
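$$U \equiv U(a) = (\langle a, x_1\rangle, \langle a, x_2\rangle, \ldots, \langle a, x_i\rangle, \ldots, \langle a, x_n\rangle) = a^{T} X \tag{1}$$
$$V \equiv V(b) = (\langle b, y_1\rangle, \langle b, y_2\rangle, \ldots, \langle b, y_i\rangle, \ldots, \langle b, y_n\rangle) = b^{T} Y \tag{2}$$
$$C_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx}\, a = \rho^{2} a \tag{5}$$
$$C_{yy}^{-1} C_{yx} C_{xx}^{-1} C_{xy}\, b = \rho^{2} b \tag{6}$$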
where the eigenvalues ρ² are squared canonical correlation coefficients
and eigenvectors a and b are basis vectors for obtaining the canonical
variates U and V, respectively. The dimensions of U and V depend on the
minimum ranks of the target data sets X and Y, respectively.
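As a hedged illustration of how canonical weights and correlations can be computed in practice, the following Python sketch uses Cholesky whitening followed by an SVD, which is algebraically equivalent to solving the two eigenvalue equations for a and b. The function name `cca`, the row-wise sample layout, and the requirement that both within-set covariance matrices be positive definite are assumptions of this sketch and are not prescribed by the text.

```python
import numpy as np

def cca(X, Y):
    """Canonical correlations and weights for paired observations.

    X: (n, p) and Y: (n, q), one observation per row (layout assumed here).
    Returns rho (k,), A (p, k), B (q, k) with k = min(p, q).
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / (n - 1)           # within-set covariance of x
    Cyy = Yc.T @ Yc / (n - 1)           # within-set covariance of y
    Cxy = Xc.T @ Yc / (n - 1)           # between-sets covariance
    Lx = np.linalg.cholesky(Cxx)        # Cxx = Lx Lx^T
    Ly = np.linalg.cholesky(Cyy)        # Cyy = Ly Ly^T
    K = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T   # whitened cross-covariance
    P, rho, Qt = np.linalg.svd(K, full_matrices=False)
    A = np.linalg.solve(Lx.T, P)        # columns a satisfy a^T Cxx a = 1
    B = np.linalg.solve(Ly.T, Qt.T)     # columns b satisfy b^T Cyy b = 1
    return rho, A, B
```

The canonical variates are then obtained as the centered data multiplied by `A` and `B`, and the correlation between the first pair of variates equals `rho[0]`. Rank-deficient data would have to be reduced first, for instance by retaining a limited number of principal components.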
The superscript letter T denotes the transposition of the vector or
matrix. Notably, the data matrices possess the same lengths in the
calculation. The maximization of the correlation can be defined with the
following equation:
2.2. Orthogonal projection (OP)
The technique of OP has been extended to studies of deconvolution,
modeling, quantitative structure-activity relationships/quantitative
structure-property relationships (QSAR/QSPR), and other topics in
Fig. 8. Determination of the number of principal components in part A of C-ginseng and its effect on the correlation index between parts A and B of C-ginseng. (A) and (B), Determination
of the number of PCs using the residual index (A) and expressed variance (B). (C) and (D), Effects of different numbers of PCs employed to obtain the correlation and uncorrelation indices.
(C), The numbers of PCs of parts A and B of C-ginseng were changed from 1 to 300. The correlation and uncorrelation indices change from 0.570 to 0.547 and 0.430 to 0.453, respectively,
when the number of PCs is increased from 250 to 300. (D), The number of PCs of part B of C-ginseng changes from 1 to 300, whereas the number of PCs of part A remains fixed.
onto each other to determine the orthogonal and overlapping information, as shown in Eqs. (7) and (8):
Table 2
The number of principal components (PCs) used to analyze parts A, B, C and D of the three
types of ginseng, namely, C-, S-, and W-ginseng, and percentage of expressed variance.
Number of PCs
Expressed variance
where the superscript −1 denotes the inverse of the matrix.
Matrices PU←V and PV←U represent the orthogonal parts of V from U and
those of U from V, respectively.
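The general idea of splitting one matrix into a part inside, and a part orthogonal to, the space spanned by another can be sketched with the textbook least-squares projector. This is a generic illustration under stated assumptions: the function name `split_by_projection` is hypothetical, the columns of V are assumed to be linearly independent, and the code is not claimed to reproduce the exact form of Eqs. (7) and (8).

```python
import numpy as np

def split_by_projection(U, V):
    """Split the columns of U into a part inside span(V) and a part orthogonal to it."""
    H = V @ np.linalg.solve(V.T @ V, V.T)   # orthogonal projector onto the column space of V
    parallel = H @ U                        # information shared with span(V)
    orthogonal = U - parallel               # information not expressible by V
    return parallel, orthogonal
```

Because the two parts are orthogonal, their squared Frobenius norms add up to that of U, so the shared and unshared fractions of U sum to one.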
2.3. Definition of the correlation and uncorrelation indices
In Fig. 1, the relationships between U and V and the projection
matrices are presented to illustrate the strategy for defining the correlation and uncorrelation indices. The shaded section summarizes the
common information between U and V, and PU←V and PV←U are projection matrices containing the differences. The correlation index ci of the
original hyphenated data sets X and Y can then be written as Eq. (9):
chemometrics. The main idea of orthogonal projection posits that common information cannot be found between two orthogonal spaces, but
instead, can be found only between overlapping ones. Mutual information can be utilized to describe the correlation; however, OP must be used
for uncorrelation evaluation. Thus, identifying orthogonal subspaces of
two data matrices should be useful for investigating the similarities and
dissimilarities within the data. As previously mentioned, the canonical
variates U and V correlate the matrices X and Y, respectively. To define
the useful indices and to explore the relationships between the data using
all possible data points, the two matrices U and V can then be projected
ci = …    (9)
where rX and rY are the chemical ranks of data X and Y, respectively. The
ratio rX/rY is a weighting factor that takes into account the differences
Table 3
Correlation indices among parts A, B, C and D of the three types of ginseng shown in Fig. 7.
example, the inclusion of 99% expressed variance in the data correlation
study should usually be sufficient, unlike the requirement of the
completely accurate determination of chemical rank to deconvolute the
overlapping chromatographic clusters. The effect of the numbers of PCs
on the correlation indices is further investigated in the next section, and
such approximations are shown to produce acceptable results. In contrast
to the conventional similarity index, the new indices are highly effective
at assessing the relationships between complicated data sets.
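A threshold on the expressed variance can be turned into a number of PCs as in the following minimal sketch. The function name `n_components_for_variance` is hypothetical, and the expressed variance is assumed here to be the cumulative share of the squared singular values of the uncentered data matrix.

```python
import numpy as np

def n_components_for_variance(X, threshold=0.99):
    """Smallest number of PCs whose cumulative expressed variance reaches `threshold`."""
    s = np.linalg.svd(X, compute_uv=False)          # singular values, in descending order
    expressed = np.cumsum(s ** 2) / np.sum(s ** 2)  # cumulative share of the sum of squares
    r = int(np.searchsorted(expressed, threshold)) + 1
    return min(r, len(s))                           # guard against rounding at threshold = 1
```

For a matrix with singular values 3, 1, and 0.1, two PCs express about 99.9% of the sum of squares, so a 99% threshold selects two components.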
between the number of chemical components in X and Y. Scalars qij and
wij are elements of Q and W, as defined in Eqs. (10) and (11),
Q = U PU←V = V PV←U    (10)
The uncorrelation index ui can then be easily obtained according to
the expression shown in Eq. (12):
$$ui = 1 - ci \tag{12}$$
3. Experimental
3.1. Simulated data sets
The indices ci and ui are the results obtained from both the chromatographic and spectral dimensions of the hyphenated chromatography
data. Fig. 1 illustrates that the values of ci and ui must be between 0 and 1.
The two boundary values will be achieved mathematically with no or full
overlapping between the correlation variates U and V. As an example, in
the case of the ci index, the mutual projection matrices PU←V and PV←U
will include all and no information of U and V, respectively, under such
situations. The advantage of the present indices is clear through
comprehensive consideration of the high-dimensional chromatograms
and the successive power of the chemometric tools. Further, the rationality can be validated in terms of the two different computation strategies shown in Eq. (10).
In this work, synthetic HPLC-DAD data were employed to investigate
the performance of the proposed method. All synthetic high performance
liquid chromatography-diode array detection (HPLC-DAD) chromatographic profiles of pure chemical components were generated using the
Gaussian function shown in Eq. (18):
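$$c_h = h_i \left( \frac{1}{\sqrt{2\pi}\, w_i} \right) \exp\!\left( -\frac{(t_i - p_i)^{2}}{2 w_i^{2}} \right) \tag{18}$$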
Noise always exists in instrumental data such as chromatography
profiles, which leads to error in the correlation and uncorrelation analysis using the proposed method. In this work, PCA is utilized to reconstruct the target data sets X and Y as follows:
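$$Y = U_2 S_2 V_2^{T} \tag{14}$$
$$Y_r = U_{2,r} S_{2,r} V_{2,r}^{T} \tag{16}$$
$$s_j(\nu) = \sum_{i=1}^{M_j} \frac{a_{ij}\, \gamma_{ij}^{2}}{(\nu - \nu^{0}_{ij})^{2} + \gamma_{ij}^{2}} \tag{19}$$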
where the parameters Mj, aij, ν0ij, and γij are the number of absorbance
spectral bands of the jth component, the maximal absorbance of the ith
band, the frequency of the band center, and the bandwidth at half-maximum, respectively.
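A synthetic bilinear data set of this kind can be assembled as in the sketch below. It assumes a normalized Gaussian elution profile scaled by the peak height parameter and Lorentzian bands whose maximum equals a; the function names and all numerical values are illustrative placeholders and are not the parameters of Table 1.

```python
import numpy as np

def gaussian_profile(t, height, half_width, position):
    """Elution profile of one component (normalized Gaussian scaled by `height`)."""
    return height / (np.sqrt(2.0 * np.pi) * half_width) * np.exp(
        -(t - position) ** 2 / (2.0 * half_width ** 2))

def lorentzian_spectrum(nu, bands):
    """Sum of Lorentzian bands; `bands` is a list of (a, nu0, gamma) tuples."""
    nu = np.asarray(nu, dtype=float)
    s = np.zeros_like(nu)
    for a, nu0, gamma in bands:
        s += a * gamma ** 2 / ((nu - nu0) ** 2 + gamma ** 2)
    return s

t = np.linspace(0.0, 10.0, 200)        # retention time axis (arbitrary units)
nu = np.linspace(200.0, 400.0, 101)    # spectral axis (arbitrary units)
C = np.column_stack([gaussian_profile(t, 1.0, 0.3, 4.0),
                     gaussian_profile(t, 0.5, 0.3, 5.0)])
S = np.column_stack([lorentzian_spectrum(nu, [(1.0, 250.0, 15.0)]),
                     lorentzian_spectrum(nu, [(0.8, 300.0, 20.0), (0.3, 350.0, 10.0)])])
X = C @ S.T                            # rows: spectra at each retention time
```

Each row of `X` is a spectrum and each column a chromatographic profile, matching the matrix layout used in the theory section.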
3.2. Sample extraction of chemical components in ginseng
Samples of ginseng were prepared as follows. First, they were ground
and crushed prior to the experiment. Hexane was then added, and the
samples were ultrasonically extracted for 1 h at room temperature. After
centrifugation, the supernatant was used for GC-MS analysis.
3.3. Instruments/analytical conditions
where the six matrices U1,r, S1,r, V1,r^T and U2,r, S2,r, V2,r^T correspond to the
matrices in Eqs. (13) and (14), respectively, with the first rX and rY PCs
included. The values of rX and rY are determined with regard to the
change ratios of the residuals after different numbers of PCs have been
extracted, as given in Eq. (17):
rci = …    (17)
where U1 and U2 are column orthogonal matrices called loadings, V1 and
V2 are row orthogonal matrices called scores, and S1 and S2 are diagonal
matrices of the associated singular values. If the numbers of the principal
components (PCs) of X and Y can be obtained, then the two new matrices
Xr and Yr contain only the information of the chemical components,
$$X_r = U_{1,r} S_{1,r} V_{1,r}^{T} \tag{15}$$
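The reconstruction step amounts to a truncated SVD. A minimal sketch follows, in which the function name `reconstruct` and the way the rank r is supplied are assumptions of the example:

```python
import numpy as np

def reconstruct(X, r):
    """Rebuild X from its first r singular triplets (rank-r approximation)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]   # equals U_r S_r V_r^T
```

For a low-rank signal contaminated by small random noise, the rank-r reconstruction is typically closer to the noise-free signal than the raw data are, which is the reason for applying it before the correlation analysis.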
where parameters hi, wi, and pi represent the height, half-width, and
position of the ith chromatographic peak, respectively. The variable ti is
the retention time region of the target component. The UV spectrum of
the corresponding component was simulated using a mixture of Lorentz
distributions given in Eq. (19):
2.4. Reconstruction of the original chromatographic data using principal
component analysis (PCA)
$$X = U_1 S_1 V_1^{T} \tag{13}$$
A Shimadzu QP-2010 GC-MS spectrometer (Tokyo, Japan), equipped
with an Agilent DB-5MS capillary GC column (30 m × 0.25 mm,
0.25 μm), was utilized for the analysis of the constituents in ginseng. The
oven temperature was set at 100 °C initially, ramped to 170 °C at a rate of
1.5 °C/min, then to 190 °C at 8.0 °C/min, and was finally increased to
240 °C at a rate of 2.0 °C/min. The inlet temperature was set at 270 °C
with a split ratio of 2:1. Helium, at a constant flow rate of 1.3 mL/min,
was used as the carrier gas. The full spectrum was recorded in the range of
m/z 1–380. The temperatures of the EI ionization source and the interface
were maintained at 200 °C and 250 °C, respectively.
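The duration of the oven program follows directly from the stated ramps. The short calculation below assumes that no isothermal hold times are used, since none are stated:

```python
# (start in °C, end in °C, rate in °C/min) for each ramp quoted in the text;
# hold times are not stated and are assumed to be zero.
ramps = [(100.0, 170.0, 1.5), (170.0, 190.0, 8.0), (190.0, 240.0, 2.0)]

durations = [(end - start) / rate for start, end, rate in ramps]
total_minutes = sum(durations)   # about 74.2 min under the no-hold assumption
```

The first ramp dominates the run time at roughly 46.7 min, followed by 2.5 min and 25 min for the second and third ramps.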
where Resi_i and Resi_{i−1} are the sums of the squares of the elements in the
residual matrices with the first i or i − 1 PCs excluded. Because of the significant differences in the chemical components and the noise-to-residual
calculation in Eq. (17), an apparent increase in rc (rc = [rc1, rc2, …,
rci, …, rX or rY]) indicates the successful determination of the numbers
of PCs, specifically, rX and rY. This strategy is useful for the analysis of
target sub-matrices that are extracted from the whole chromatogram.
However, the accurate determination of the sub-matrices within complex
data is more difficult. In such cases, a threshold on the percentage of
expressed variance can be set to assist the estimation of rX and rY. For
3.4. Implementation
All computer programs were coded in MATLAB 6.5.0, and all computations were performed on an Intel (R) Core (TM) 2 CPU 6300
(1.86 GHz and 1.87 GHz) PC with 2.0 GB of memory.
4. Results and discussion
Fig. 4(D). When the present indices are used as standards, curves 2 to 4
indicate that the variation of both the chromatograms and the spectra can
be acceptably expressed in data Y.
4.1. Simulated data sets
4.1.3. Effects of the number of PCs on correlation and uncorrelation indices
The numbers of PCs determined in data X and Y substantially affect
the correlation and uncorrelation indices. In fact, numerous other strategies have been developed in the last few decades to estimate the ranks
of hyphenated data. If necessary, more advanced methods can be used to
reconstruct the data matrix. In Fig. 5, four curves are presented to show
the changes of the indices when various numbers of PCs are used.
Notably, the uncorrelation index (curve 4) possessed a minimum that
corresponds to the correct value in the abscissa. Curves 1 to 3 of the
correlation indices are acquired from the chromatographic and spectral
directions, and their average shows the variation using different numbers
of PCs. The correlation indices show a stepwise decrease when the
number of PCs is greater than 10. This approach works very well for a
portion of the data extracted from the full chromatogram when the rank
is small; however, most conventional methods, including the present
strategy, may fail to accurately estimate the numbers of PCs for chromatograms that contain tens or hundreds of chemical components.
However, the effects of the number of PCs used on the correlation index
should be insignificant when the system contains more chemical components, and acceptable results were obtained using a threshold of 90%
expressed variance, which will be discussed with respect to the ginseng
data sets.
In Fig. 2, three typical HPLC-DAD chromatograms show the effectiveness of the new methods with regard to the similarity evaluation.
Fig. 2(A) to (C) show completely different three-dimensional shapes and
structures, but they have the same total chromatograms, as shown in
Fig. 2(D). Thus, both the conventional congruence coefficient and the
correlation coefficient between these profiles will be equal to one, and
neither coefficient can distinguish the profiles according to their total or
mean ion chromatograms. The information obtained from matrix data is
obviously richer than the information obtained from scalar or vector
data. This concept is the fundamental advantage with respect to data
evaluation in the present study.
Next, most of the factors affecting the correlation and uncorrelation
indices were examined, including experimental noise levels, the existence of strong chromatographic peaks, the numbers of PCs employed,
data shifts, and background levels. These five factors could substantially
affect the indices for the similarity evaluation.
The parameters in Eq. (18) and Eq. (19) of the two simulated data sets
X and Y are presented in Table 1. Each data set comprises 10 components.
The two three-dimensional data sets are shown in Fig. 3(A) and (B), and
the corresponding rc curves that were used to determine the rank r in the
data are shown in Fig. 3(C) and (D). The numbers of PCs for data sets X
and Y were both 10. The top figures in Fig. 3(C) and (D) provide the
expressed variance using the determined numbers of PCs from the bottom graphs.
4.1.4. Effects of data shifts and background levels on correlation and
uncorrelation indices
In metabolomics studies, the pretreatment of high-throughput chromatographic profiles is a critical step for the interpretation of complex
mixture data sets and for further discoveries of biological processes. Most
existing methods encounter difficulty when attempting to automatically
extract all the rich information, such as retention time shifts and background interferences, from the data. Such problems also arise during the
process of hyphenated chromatography data evaluation. The removal of
such interference and recovery of the original relationships between the
data sets would be very useful. Fig. 6(A) and (B) were obtained when the
parameter pi in Eq. (18) was gradually changed to generate retention
time shifts in the data sets.
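The sensitivity of the conventional correlation coefficient to retention time shifts can be reproduced with a single Gaussian peak. In the sketch below, the peak parameters and shift values are illustrative assumptions rather than the settings used for Fig. 6:

```python
import numpy as np

def gaussian_profile(t, height, half_width, position):
    """Normalized Gaussian elution profile scaled by `height`."""
    return height / (np.sqrt(2.0 * np.pi) * half_width) * np.exp(
        -(t - position) ** 2 / (2.0 * half_width ** 2))

t = np.linspace(0.0, 10.0, 501)
reference = gaussian_profile(t, 1.0, 0.2, 5.0)

# Shift the peak position step by step and correlate with the unshifted profile.
shifts = [0.0, 0.1, 0.2, 0.4, 0.8]
corr = [float(np.corrcoef(reference, gaussian_profile(t, 1.0, 0.2, 5.0 + d))[0, 1])
        for d in shifts]
```

The values in `corr` start at one and fall steadily as the shift grows, becoming slightly negative once the two peaks no longer overlap.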
Fig. 6(C) and (D) present the corresponding correlation indices for
different integrated and local linear backgrounds in terms of the slope
shown in the x-axis. Curves 1 to 5 represent the same information as
previously described. The similarities between the results obtained using
the conventional correlation coefficient and the congruence coefficient
decrease significantly in Fig. 6(A) and (B) to even less than zero. Yet, as
shown in Fig. 6(C) and (D), the same indices increase to greater than 0.8
when the background slopes are increased. Such results clearly cannot
reveal the actual situation or the variation between the two data sets.
This defect is mainly attributable to the limited information from hyphenated chromatography that is used to obtain the two coefficients.
However, the gradual and robust variation of curves 1, 2, and 3 in all four
sub-graphs indicates that the present correlation index is effective in
representing the relationships between the data sets. The simultaneous
analysis from the chromatographic and spectral dimensions produces
excellent results.
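The inflating effect of a shared background on the conventional correlation coefficient can be sketched in the same way. Two non-overlapping peaks, which are essentially uncorrelated, receive the same linear baseline; the slopes and peak parameters are assumed values chosen only for illustration:

```python
import numpy as np

def gaussian_profile(t, height, half_width, position):
    """Normalized Gaussian elution profile scaled by `height`."""
    return height / (np.sqrt(2.0 * np.pi) * half_width) * np.exp(
        -(t - position) ** 2 / (2.0 * half_width ** 2))

t = np.linspace(0.0, 10.0, 501)
p = gaussian_profile(t, 1.0, 0.2, 3.0)   # profile of the first sample
q = gaussian_profile(t, 1.0, 0.2, 7.0)   # unrelated profile of the second sample

# Add the same linear background (slope * t) to both profiles.
slopes = [0.0, 0.25, 0.5, 1.0]
corr = [float(np.corrcoef(p + s * t, q + s * t)[0, 1]) for s in slopes]
```

Without background the coefficient is close to zero, whereas the largest slope drives it well above 0.8 although the underlying peaks have nothing in common.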
4.1.1. Effects of homoscedastic and heteroscedastic noise on correlation and
uncorrelation indices
The existence of noise presents a major challenge when processing
instrumental data with chemometric tools. The effects of different levels
of homoscedastic and heteroscedastic noise on the correlation indices are
presented in Fig. 4(A) and (B). The uncorrelation index can be acquired
simply via the correlation index using Eq. (12). The noise produced no
significant effects on curves 1 and 5, which represent the correlation
coefficient and the congruence coefficient, respectively. However, the
three-dimensional data of data X and Y completely changed as the noise
increased from 0.002% to 0.2% and from 0.02% to 2% of the maximum
signal for homoscedastic and heteroscedastic noise, respectively. Curves
2 and 4 in Fig. 4 represent the results of the correlation indices from the
chromatographic and spectral dimensions using the present method, and
curve 3 is their average for the final evaluation. In general, these three
curves decrease smoothly and robustly as the noise levels
increase. The conventional indices apparently cannot show the
variations of hyphenated chromatograms when the noise levels are
increased. However, the indices developed in this study are effective at
assessing the correlation and uncorrelation of these data sets. The ability
of the new indices to deal with the heteroscedastic noise further demonstrates their effectiveness.
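The two kinds of noise can be sketched as follows. The noise models are assumptions, because this section does not specify them: homoscedastic noise is taken as Gaussian with a constant standard deviation equal to a fraction of the maximum signal, and heteroscedastic noise as Gaussian with a standard deviation proportional to each element. The simulated matrix `X`, the intermediate noise levels, and the helper names are hypothetical.

```python
import numpy as np

def add_homoscedastic_noise(X, level, rng):
    """Assumed model: constant std = level * max(X) for every element."""
    sigma = level * np.max(X)
    return X + rng.normal(0.0, 1.0, size=X.shape) * sigma

def add_heteroscedastic_noise(X, level, rng):
    """Assumed model: std proportional to the signal, std_ij = level * |X_ij|."""
    return X + rng.normal(0.0, 1.0, size=X.shape) * (level * np.abs(X))

def tic_correlation(X1, X2):
    """Correlation coefficient between the total ion chromatograms (row sums)."""
    return float(np.corrcoef(X1.sum(axis=1), X2.sum(axis=1))[0, 1])

# Hypothetical two-way data: 200 retention times x 50 spectral channels, 3 components.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 200)[:, None]
C = np.exp(-0.5 * ((t - np.array([3.0, 5.0, 7.0])) / 0.4) ** 2)  # elution profiles
S = rng.random((50, 3))                                           # spectra
X = C @ S.T

# Noise ranges quoted in the text, expressed as fractions of the signal.
for level in (0.00002, 0.0002, 0.002):   # 0.002% to 0.2%
    Xn = add_homoscedastic_noise(X, level, rng)
    print("homoscedastic  ", level, tic_correlation(X, Xn))
for level in (0.0002, 0.002, 0.02):      # 0.02% to 2%
    Xn = add_heteroscedastic_noise(X, level, rng)
    print("heteroscedastic", level, tic_correlation(X, Xn))
```

Summing over the spectral dimension averages out most of the element-wise noise, which is one plausible reason why a coefficient computed on the TIC alone barely responds to it.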
4.1.2. Effects of large chromatographic peaks on correlation and uncorrelation indices
The conventional correlation coefficient or congruence
coefficient, when used as a similarity measure of two hyphenated chromatograms, is completely controlled by several major components, even though
tens, hundreds, or even more chemical components may be present in the
mixture. This limits the effectiveness of the similarity evaluation because its scope is narrow. The spectral information is also not
considered in the conventional similarity evaluation. Curves 1 and 5 in
Fig. 4(C) and (D) represent the same information as those in Fig. 3. To
study the effects of large chromatographic peaks on the index, the fifth
component in Y is replaced by the fifth component used in X. Fig. 4(C)
presents the results with only the chromatographic profile displaced,
although the spectrum is also changed to correspond to the results in
4.2. Ginseng data sets
The growing conditions of ginseng, including its geographic location, soil conditions, and exposure to sunlight, are the major factors that
affect the composition of its small-molecule metabolites and, hence,
the activity of the herb. Three types of ginseng were studied in this
work. Wild ginseng, grown without human intervention, was collected
from the wild. Wildly cultivated ginseng was cultivated in the wild
under conditions that resembled the natural growing environment of
the herb, whereas cultivated ginseng was grown on a farm. The three
samples are abbreviated as W-, S-, and C-ginseng. Furthermore, different
medicinal parts of ginseng are used to treat different diseases or different
stages of the same disease in traditional practice. Thus, the correlation
and uncorrelation of the hyphenated chromatograms of the four
different parts of ginseng were investigated. The four parts of ginseng
include the rhizome, the main root, the side root, and the fiber root, which
are referred to as A, B, C, and D, respectively. The 12 data sets yield 66 pairwise results, which are sufficient to validate the present
method, and these results provide a whole picture of the relationships of
these mixtures.
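The count of 66 follows from pairing 12 data sets (three types of ginseng, four parts each) without repetition, that is, 12 × 11 / 2 = 66. A short check, in which the labels such as "W-A" are hypothetical names used only for illustration:

```python
from itertools import combinations

types = ["W", "S", "C"]        # wild, wildly cultivated, cultivated
parts = ["A", "B", "C", "D"]   # rhizome, main root, side root, fiber root

# Hypothetical labels for the 12 data sets, e.g. "W-A" for the rhizome of W-ginseng.
samples = [f"{t}-{p}" for t in types for p in parts]
pairs = list(combinations(samples, 2))

print(len(samples), len(pairs))  # 12 66
```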
In Fig. 7, 12 total ion chromatograms (TICs) of the four parts of W-, S-,
and C-ginseng are presented. This information is the only definitive
factor for data evaluation using the conventional indices. As an example,
the number of PCs of part A of C-ginseng is shown in Fig. 8(A), and the
expressed variance using the corresponding number of PCs is shown in
Fig. 8(B). The number of PCs for complicated mixtures is difficult to
determine. In this case, the number of PCs accounts for 99.47% of the
variance, which should yield reasonably good conclusions. Furthermore,
the effect of the number of PCs on the correlation and uncorrelation
indices is presented in Fig. 8(C) and (D) for parts A and B of C-ginseng.
We obtained the two sub-figures by varying the numbers of PCs from 1 to
300 in Fig. 8(C) and by fixing the number of PCs in part A while changing
the number of PCs in part B in Fig. 8(D). Curves 1 to 3 are the correlation indices
acquired from the chromatographic and spectral dimensions and their average. Curve 4 is the corresponding uncorrelation index.
According to the results presented in Fig. 8(C), the number of PCs affects
the correlation and uncorrelation indices when it is less than 200.
However, the indices of the chromatographic and spectral directions
change slightly from 0.570 to 0.547 and from 0.430 to 0.453, respectively,
as the number of PCs increases from 250 to 300. Apparently, the
approximate number of PCs is reasonable for data evaluation. The trend
of the correlation and uncorrelation indices even reverses in Fig. 8(D) at
approximately the estimated number of PCs. Table 2 shows the determined number of PCs and the corresponding expressed variance. All
correlation and uncorrelation indices are provided in Table 3. The results
are symmetric by definition of the correlation or uncorrelation indices.
The six values in each grid include the correlation indices of the chromatographic and spectral dimensions and their average, the uncorrelation index, and the congruence and correlation coefficients. The values of
the average correlation index and uncorrelation index are indicated in
boldface. Most values of the two conventional indices are clearly greater than
0.9, and they are larger than the first four indices for the same type of
ginseng. However, such high similarity makes the sameness difficult to evaluate,
because the conventional indices suggest a correlation greater than that of
the three-dimensional data. In fact, the differences are evident in both
the chromatographic and spectral dimensions. Such confusion arises
because not all of the information in the data, including the small differences in the
chemical components, was employed. The present indices have
strong advantages in this respect. With the help of the indices proposed in
this study, we found that the same type of ginseng has a higher correlation in every dimension, which is consistent with the common
understanding. The spectral dimension of the different types of ginseng
indicates that the correlation between C- and W-ginseng is greater than
that between C- and S-ginseng, although the average correlation index
produces contrary results as a whole. Furthermore, relationships exist
among the different parts of ginseng. For example, according to the final
column in Table 3, the correlation in the spectral dimension generally
decreases among the medicinal parts of the different ginseng from W- to
C- to S-ginseng. The decreasing order of correlation changes to W- to S- to
C-ginseng in the chromatographic dimension. Obviously, the results
obtained using the present method have high objectivity and effectiveness, in contrast to the conventional indices.
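The role of the number of PCs can be illustrated with a minimal sketch. It is not the authors' implementation: the PCA is assumed to be an uncentered singular value decomposition, the 0.999 threshold is a hypothetical choice, and the canonical correlations are computed by a generic QR-plus-SVD route on mean-centered columns rather than through the indices defined in this work. All function names and the simulated matrices `X` and `Y` are assumptions.

```python
import numpy as np

def explained_variance_curve(X):
    """Cumulative fraction of the total sum of squares captured by the first k PCs."""
    s = np.linalg.svd(X, compute_uv=False)
    return np.cumsum(s ** 2) / np.sum(s ** 2)

def n_pcs_for_variance(X, threshold):
    """Smallest number of PCs whose cumulative explained variance reaches the threshold."""
    curve = explained_variance_curve(X)
    k = int(np.searchsorted(curve, threshold)) + 1
    return min(k, len(curve))

def pca_reconstruct(X, k):
    """Rebuild X from its first k PCs to suppress noise."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def canonical_correlations(A, B):
    """Canonical correlations between the mean-centered columns of A and B.

    Assumes both centered matrices have full column rank."""
    Qa, _ = np.linalg.qr(A - A.mean(axis=0))
    Qb, _ = np.linalg.qr(B - B.mean(axis=0))
    return np.clip(np.linalg.svd(Qa.T @ Qb, compute_uv=False), 0.0, 1.0)

# Hypothetical data: X and Y share two of their three components.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 200)[:, None]
C = np.exp(-0.5 * ((t - np.array([2.0, 4.0, 6.0, 8.0])) / 0.4) ** 2)  # 200 x 4
S = rng.random((60, 4))                                                # 60 x 4
X = C[:, :3] @ S[:, :3].T + 1e-3 * rng.normal(size=(200, 60))
Y = C[:, 1:] @ S[:, 1:].T + 1e-3 * rng.normal(size=(200, 60))

kx = n_pcs_for_variance(X, 0.999)
ky = n_pcs_for_variance(Y, 0.999)
Ux = np.linalg.svd(X, full_matrices=False)[0][:, :kx]  # chromatographic scores of X
Uy = np.linalg.svd(Y, full_matrices=False)[0][:, :ky]  # chromatographic scores of Y
print(kx, ky, canonical_correlations(Ux, Uy))
```

Repeating the last step over a range of `kx` and `ky` values gives curves of the kind discussed for Fig. 8(C) and (D), although the actual indices in this work are defined by the equations given earlier.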
5. Conclusions
The aim of the present study was to properly evaluate the correlation
and uncorrelation of hyphenated chromatographic data. This
objective is important for uncovering the relationships among
complicated experimental data and the processes and characteristics buried in
complicated mixtures. Unlike the conventional indices, which utilize few
data points from the whole high-dimensional space, the indices proposed
in this work are more effective for data evaluation because all of the
chemical information in the data matrices is considered. Powerful chemometric tools, including the PCA, CCA, and OP techniques, were
employed in the evaluation. On the basis of their performance with
simulated and experimental data sets, the new indices are more suitable
for the investigation of complex samples, such as in the chromatographic
fingerprinting of herbal medicines and in metabolite profiling. In fact, such
a comprehensive chemometric strategy should have promising applications in numerous other fields, such as image comparison and the identification of human fingerprints.
Acknowledgements
This work was financially supported by Changzhou Institute of
Technology (Project YN1612 and Young Scholars Program).
[1] H.-P. Song, S.-Q. Wu, L.-W. Qi, F. Long, L.-F. Jiang, K. Liu, H. Zeng, Z.-M. Xu, P. Li,
H. Yang, A strategy for screening active lead compounds and functional compound
combinations from herbal medicines based on pharmacophore filtering and
knockout/knockin chromatography, J. Chromatogr. A 1456 (2016) 176–186.
[2] Q.-x. Sun, Q. Chen, B.-h. Kong, F.-j. Dong, Q. Liu, Review on application of highperformance liquid chromatography technology in detection of biogenic amines in
foods, Food Ind. (2014) 193–198.
[3] B.M. Hounoum, H. Blasco, P. Emond, S. Mavel, Liquid chromatography-highresolution mass spectrometry-based cell metabolomics: experimental design,
recommendations, and applications, TrAC, Trends Anal. Chem. 75 (2016) 118–128.
[4] R.G. Breteton, Applied Chemometrics for Scientists, J. Wiley and Sons, Chichester,
UK, 2006.
[5] P. Dugo, F. Cacciola, T. Kumm, G. Dugo, L. Mondello, Comprehensive
multidimensional liquid chromatography: theory and applications, J. Chromatogr.
A 1184 (2008) 353–368.
[6] F. Gong, Y.Z. Liang, Q.S. Xu, F.T. Chau, K.M. Ng, Evaluation of separation quality in
two-dimensional hyphenated chromatography, Anal. Chim. Acta 450 (2001)
[7] N. Ferreiros, Recent advances in LC-MS/MS analysis of Delta(9)tetrahydrocannabinol and its metabolites in biological matrices, Bioanalysis 5
(2013) 2713–2731.
[8] C.H. Weinert, B. Egert, S.E. Kulling, On the applicability of comprehensive twodimensional gas chromatography combined with a fast-scanning quadrupole mass
spectrometer for untargeted large-scale metabolomics, J. Chromatogr. A 1405
(2015) 156–167.
[9] X. Zhou, Y. Wang, Y. Yun, Z. Xia, H. Lu, J. Luo, Y. Liang, A potential tool for
diagnosis of male infertility: plasma metabolomics based on GC-MS, Talanta 147
(2016) 82–89.
[10] J. Trygg, E. Holmes, T. Lundstedt, Chemometrics in metabonomics, J. Proteome
Res. 6 (2007) 469–479.
[11] R.-t. Tian, P.-s. Xie, H.-p. Liu, Evaluation of traditional Chinese herbal medicine:
Chaihu (Bupleuri Radix) by both high-performance liquid chromatographic and
high-performance thin-layer chromatographic fingerprint and chemometric
analysis, J. Chromatogr. A 1216 (2009) 2150–2155.
[12] J.B.G. Souza, N. Re-Poppi, J.L. Raposo Jr., Characterization of pyroligneous acid
used in agriculture by gas chromatography-mass spectrometry, J. Braz. Chem. Soc.
23 (2012) 610–617.
[13] N. Kane, K. Aznag, A. El Oirrak, M. Kaddioui, Binary data comparison using
similarity indices and principal components analysis, Int. Arab. J. Inf. Technol. 13
(2016) 232–237.
[14] Y. Wu, S. Lv, C. Wang, X. Gao, J. Li, Q. Meng, Comparative analysis of volatiles
difference of Yunnan sun-dried Pu-erh green tea from different tea mountains:
Jingmai and Wuliang mountain by chemical fingerprint similarity combined with
principal component analysis and cluster analysis, Chem. Cent. J. 10 (2016) 1–11.
[15] D.J. Lipman, W.R. Pearson, Rapid and sensitive protein similarity searches, Science
227 (1985) 1435–1441.
[16] T.R. Hagadone, Molecular substructure similarity searching - efficient retrieval in 2dimensional structure databases, J. Chem. Inf. Comput. Sci. 32 (1992) 515–521.
[17] A.C. Good, S.S. So, W.G. Richards, Structure-activity-relationships from molecular
similarity-matrices, J. Med. Chem. 36 (1993) 433–438.
Z.-y. Wu et al.
Chemometrics and Intelligent Laboratory Systems xxx (2017) 1–11
[26] E.R. Malinowski, Factor Analysis in Chemistry, third ed., John Wiley, New York,
[27] H. Hotelling, Relations between two sets of variates, Biometrika 28 (1936)
[28] S.H. Lee, S. Choi, Two-dimensional canonical correlation analysis, IEEE Signal Proc.
Let. 14 (2007) 735–738.
[29] D.R. Hardoon, J. Shawe-Taylor, Convergence analysis of kernel Canonical
Correlation Analysis: theory and practice, Mach. Learn 74 (2009) 23–38.
[30] J.C. Harsanyi, C.I. Chang, Hyperspectral image classification and dimensionality
reduction - an orthogonal subspace projection approach, IEEE T Geosci. Remote 32
(1994) 779–785.
[31] B.K.H. Tan, J. Vanitha, Immunomodulatory and antimicrobial effects of some
traditional Chinese medicinal herbs: a review, Curr. Med. Chem. 11 (2004)
[32] X. Liang, Removal of hidden neurons in multilayer perceptrons by orthogonal
projection and weight crosswise propagation, Neural Comput. Appl. 16 (2007)
[33] E.J. Cho, X.L. Piao, M.H. Jang, S.Y. Park, S.W. Kwon, J.H. Park, The effect of
steaming on the free amino acid contents and antioxidant activity of ginseng,,
Planta Med. 74 (2008), 1174–1174.
[18] J. Batista, J. Bajorath, Chemical database mining through entropy-based molecular
similarity assessment of randomly generated structural fragment populations,
J. Chem. Inf. Model 47 (2007) 59–68.
[19] M. Rupp, E. Proschak, G. Schneider, Kernel approach to molecular similarity based
con iterative graph similarity, J. Chem. Inf. Model 47 (2007) 2280–2286.
[20] F. Su, X. Xie, J. Feng, A. Cai, Fingerprint matching using SVM-based similarity
measure, Chin. J. Electron 16 (2007) 459–463.
[21] L. Si, D. Yu, D. Kihara, Y. Fang, Combining gene sequence similarity and textual
information for gene function annotation in the literature, Inf. Retr. 11 (2008)
[22] L. Yin, C.-H. Huang, J. Ni, Clustering of gene expression data: performance and
similarity analysis, BMC Bioinf. 7 (2006) 1–11.
[23] S.Y. Kim, J.W. Lee, Ensemble clustering method based on the resampling similarity
measure for gene expression data, Stat. Methods Med. Res. 16 (2007) 539–564.
[24] Y.Z. Liang, O.M. Kvalheim, R. Manne, White, gray and black multicomponent
systems - a classification of mixture problems and methods for their quantitative analysis, Chemom. Intell. Lab. Syst. 18 (1993) 235–250.
[25] Z.D. Zeng, Y.Z. Liang, Y.L. Wang, X.R. Li, L.M. Liang, Q.S. Xu, C.X. Zhao, B.Y. Li,
F.T. Chau, Alternative moving window factor analysis for comparison analysis
between complex chromatographic data, J. Chromatogr. A 1107 (2006) 273–285.