Chemometrics and Intelligent Laboratory Systems xxx (2017) 1–11
journal homepage: www.elsevier.com/locate/chemometrics

Correlation and uncorrelation between hyphenated chromatographic data with chemometric tools

Ze-ying Wu a,*, Zhongda Zeng b,c,**, Zidan Xiao d, Daniel Kam-Wah Mok b,e,***, Hoiyan Chan d

a School of Mathematics, Physics and Chemical Engineering, Changzhou Institute of Technology, Changzhou, China
b Chemometrics and Herbal Medicine Laboratory, Department of Applied Biology and Chemical Technology, Hong Kong Polytechnic University, Hung Hom, Hong Kong, China
c Dalian ChemDataSolution Technology Co. Ltd., Dalian, China
d School of Chemical and Biological Engineering, Changsha University of Science & Technology, Changsha, China
e State Key Laboratory of Chinese Medicine and Molecular Pharmacology, Shenzhen, China

ARTICLE INFO

Keywords: Hyphenated chromatography; Chemometrics; Canonical correlation analysis; Orthogonal projection; Ginseng

ABSTRACT

Correlation studies of chromatographic data are common strategies used to relate complex mixtures of chemical components. The congruence coefficient and correlation coefficient are commonly used to indicate the similarity or correlation between different hyphenated chromatograms. However, these indices typically reduce chromatograms to a single dimension, and information in the other dimensions is not fully utilized. In this work, a new technique is developed to identify possible relationships among related high-dimensional data sets using powerful chemometric tools. First, principal component analysis is used to reduce experimental noise by reconstructing the original data sets. Then, canonical correlation analysis is utilized to obtain the canonical vectors of both data sets for comparison, which makes identification of the possible relationships between the data sets easier.
An orthogonal projection operation is then applied to identify both common and different information between the matrix spaces spanned by the canonical vectors. Finally, the correlation and uncorrelation indices are defined from both the chromatographic and spectral directions on the basis of the Euclidean distance of all the elements of the final projection matrices. The new indices are more representative because they use all of the information embedded in hyphenated chromatographic data. In contrast to the conventional coefficients, the indices proposed in this study provide improved performance in a simulated HPLC-DAD data set and 12 real GC-MS data sets of ginseng, a widely used herbal medicine. The effects of various potential factors on the results are investigated.

1. Introduction

Chromatography has been a mainstay technique for separating complex mixtures that contain hundreds or more chemical components, including herbal medicines [1], foods [2], and biological fluids [3]. The two widely used one-dimensional separations, gas and liquid chromatography, offer possible approaches to solve most analytical problems. However, the need for greater resolution power continues to drive advances in chromatography to enable the study of increasingly complex samples. The number of chromatograms for pattern recognition in a typical real investigation has increased from 50 to 1000 during the past few decades [4]. To achieve better separations, multidimensional and/or hyphenated chromatography have been introduced to extract new information from sophisticated systems. Multidimensional chromatography attains this goal through the utilization of two orthogonal separation mechanisms to increase the resolution power and peak capacity, whereas hyphenated chromatography provides spectral information related to the eluted components for possible qualitative identification [5].
Notably, the separation resolution is indirectly improved in hyphenated chromatography [6]. To date, most real

* Corresponding author. School of Mathematics, Physics and Chemical Engineering, Changzhou Institute of Technology, 666 Liaohe Road, 213032, Changzhou, China.
** Corresponding author. Chemometrics and Herbal Medicine Laboratory, Department of Applied Biology and Chemical Technology, Hong Kong Polytechnic University, Hung Hom, Hong Kong, China.
*** Corresponding author. Chemometrics and Herbal Medicine Laboratory, Department of Applied Biology and Chemical Technology, Hong Kong Polytechnic University, Hung Hom, Hong Kong, China.
E-mail addresses: wuzy@czu.cn (Z.-y. Wu), adawin.tsang@gmail.com (Z. Zeng), daniel.mok@polyu.edu.hk (D. Kam-Wah Mok).
https://doi.org/10.1016/j.chemolab.2017.10.005
Received 23 June 2017; Received in revised form 3 September 2017; Accepted 10 October 2017
Available online xxxx
0169-7439/© 2017 Elsevier B.V. All rights reserved.

correlation of two hyphenated chromatographic data sets are the congruence coefficient and the correlation coefficient [11,12]. Both coefficients are simple and intuitive because they use only the total or mean chromatographic profiles of the entire two-dimensional data sets. However, their effectiveness should be examined because a reduction in the data dimensions will unavoidably result in the loss of much useful information. This phenomenon is analogous to evaluating the similarities of two three-dimensional objects by comparing the shapes of their projections on a 2D plane, thus leading to an inaccurate assessment.
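The information loss described above can be illustrated with a minimal sketch. It assumes the standard textbook definitions of the two coefficients (uncentred cosine for the congruence coefficient, Pearson correlation for the correlation coefficient), which the text does not restate; the matrices `A` and `B` and all function names are hypothetical toy choices, not data or code from the study.

```python
import math

def total_profile(data):
    """Sum each row (one spectrum per retention time) over all channels."""
    return [sum(row) for row in data]

def congruence(x, y):
    """Congruence coefficient: uncentred cosine between two profiles."""
    num = sum(a * b for a, b in zip(x, y))
    return num / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def correlation(x, y):
    """Pearson correlation: cosine between the mean-centred profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return congruence([a - mx for a in x], [b - my for b in y])

# Toy matrices (rows: retention times, columns: spectral channels).
# Every spectrum differs, yet the row sums are identical.
A = [[1.0, 0.0, 0.0], [0.0, 3.0, 0.0], [2.0, 0.0, 0.0]]
B = [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0], [0.0, 2.0, 0.0]]

tic_a, tic_b = total_profile(A), total_profile(B)
cong = congruence(tic_a, tic_b)
corr = correlation(tic_a, tic_b)
```

Both coefficients report perfect similarity here although the two matrices share no spectrum, which is the failure mode that motivates using the full two-dimensional data.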
In some cases, these coefficients can yield very misleading results [13,14].

Similarity is a general concept with very wide applications, and it is not limited to comparisons of chromatographic profiles or other instrumental data. This concept is also used in numerous research fields, including gene and protein sequence searches, molecular-structure comparisons, image or fingerprint matching, and explorations of many types of data sets, as in psychology [15–21]. Through similarity studies, relevant information can be extracted from data sets or available databases. Statistical methods, clustering algorithms, and distance measurements have all been developed to assess the correlation of gene and protein expression data with their structures or functions [15,21–23]. Techniques such as kernel-approach, information-theory, and support vector machine (SVM)-based strategies have been employed to represent the similarities of other types of data, including facial images, chemical (sub-)structures, and fingerprints for identification [16–20]. Most of these data have unique features, and the strategies used are necessarily tailored to particular problems.

Hyphenated chromatography has its own unique properties. The response in the spectral dimension is often proportional to the concentration of the analyte of interest. For absorption measurements that follow the Beer-Lambert law, the data are bilinear in the chromatographic and spectral dimensions [24]. This property is the key to determining pure peaks from overlapped chromatographic clusters and to extracting qualitative and quantitative information of target chemical components using chemometric deconvolution algorithms [25]. However, in correlation and

Fig. 1. Illustration showing the correlation and uncorrelation between the two data matrices X and Y using the orthogonal projection technique.

complicated mixtures, such as those studied in metabonomics and proteomics, have been analyzed using this approach.
The hyphenated chromatography instrumentation includes gas or liquid chromatography hyphenated with mass spectrometry (GC-MS and LC-MSn), and many others [7–9]. After the hyphenated chromatographic data containing all the detected components have been obtained, data evaluation and/or information extraction becomes the next critical step for sample interpretation. Scientists, including chromatography experts, chemical analysts, and chemometricians, have conducted extensive research on this subject, especially with regard to data evaluation and pretreatment, deconvolution, pattern recognition, and other data-processing programs [10]. In such investigations, the study of correlation and uncorrelation between different but related chromatographic data is fundamental. Conventionally, the most widely employed standards for assessing the

Fig. 2. Three-dimensional figure of three different simulated HPLC-DAD data sets (A, B, and C), each possessing the same total chromatographic profile (D).

Table 1
Parameters used to simulate the chromatographic profiles (position/half-width/intensity) and UV spectra (a/ν0/γ) that construct X and Y. They were used to demonstrate the strategy proposed in this work.

Data set X
No.  Chromatography  a          ν0            γ
1    40/3/12         12/4/12    90/30/30      50/10/50
2    62/5/3          16/20/16   30/60/90      20/40/20
3    98/6/10         20/8/4     150/150/60    20/40/30
4    138/9/10        4/20/12    30/30/150     40/50/50
5    164/12/6        16/16/20   30/60/150     40/30/20
6    196/12/5        12/4/12    90/90/60      50/30/30
7    224/14/7        16/16/16   30/30/120     50/30/40
8    260/17/9        4/4/12     150/120/120   10/40/20
9    290/20/12       16/4/8     90/30/90      10/50/20
10   320/22/7        12/20/8    120/30/120    30/50/10

Data set Y
No.  Chromatography  a          ν0            γ
1    36/4/3          8/16/16    150/30/30     50/50/40
2    74/5/5          4/16/12    150/60/90     10/20/20
3    100/11/10       16/4/8     90/150/90     50/20/20
4    134/13/6        20/16/16   150/30/30     50/50/10
5    160/13/8        12/12/12   30/90/120     30/50/10
6    192/14/11       20/16/12   30/150/90     20/10/40
7    220/15/10       4/16/12    90/150/120    40/50/10
8    242/16/3        4/8/4      120/90/120    10/10/30
9    280/17/8        20/12/12   30/120/120    20/30/50
10   304/18/10       16/4/4     120/30/150    30/50/50

Fig. 3. Simulated data sets X and Y used to demonstrate the developed strategy and the determination of the number of principal components (PCs). (A) and (B), Three-dimensional figures of the two simulated data sets X and Y. (C) and (D), Determination of the number of PCs for data sets X and Y using the residual index and expressed variance.

number of principal components can then be retained to construct the new data for analysis using principal component analysis (PCA) [26]. Interference from noise can thus be substantially reduced to acceptable levels. The numbers of principal components (PCs) and unexpressed residuals are determined by inspecting the changes in the contiguous singular values obtained from singular value decomposition (SVD) and their potential effects on the results. Next, canonical correlation analysis (CCA) is used to extract the correlation of different but related chromatographic data matrices, depending on the properties of the canonical correlation variates (CCVs) [27–29].
Finally, two distance indices that represent the sameness and difference of X and Y for correlation and uncorrelation evaluation, respectively, are produced using the orthogonal projection (OP) technique [30–32]. The data in the CCV space are projected into

uncorrelation studies, the existence of experimental noise, including homoscedastic and heteroscedastic noise, affects the performance of most methods. Thus, new strategies that are more tolerant of noise and that are effective in evaluating the similarities of the chromatograms of complex samples, such as herbal medicines and biological fluids, should be developed. In this work, a new technique is developed to assess the correlation and uncorrelation of hyphenated chromatograms with the help of chemometric tools. The new method emphasizes the effective use of all the data to achieve a faithful representation of the components, similarities, and correlation among the data sets. The objective is to develop a more reliable parameter that represents the relationship between the data sets. Suppose we have two data sets X and Y from our studies; a certain

Fig. 4. Effects of homoscedastic and heteroscedastic noise and a large chromatographic profile on the correlation index. (A) and (B), Effects of different levels of homoscedastic and heteroscedastic noise on the correlation index. (C) and (D), Effects of different enlargement factors of the fifth component in data set X on the correlation index. The chromatography of the fifth component in Y is replaced by the component in the same location in X (C), and both the chromatographic and spectral profiles of the fifth component in Y are replaced by the corresponding component in X (D).

on the correlation and uncorrelation indices.
The results demonstrate that the new method is highly effective in evaluating the relationships between chromatograms from hyphenated chromatography during investigations of herbal medicine fingerprints and metabolite profiles in metabonomics.

2. Theory

Suppose X and Y are two full or partial chromatograms from hyphenated chromatography. The sizes of the two matrices X and Y are m1 × n1 and m2 × n2, respectively, meaning that the data contain m1 and m2 chromatographic measurement points and n1 and n2 spectral channels. Each row represents a spectrum at a specific retention time, and each column corresponds to a chromatographic profile at a certain wavelength or m/z ratio.

2.1. Canonical correlation analysis (CCA)

CCA was first proposed by Hotelling in 1936 [27]. It is a multivariate statistical method for assessing and further correlating linear relationships between two multidimensional variables, such as data sets X and Y in this study. In particular, it can be used as an exploratory tool to study multiple variables with relation to a given analytical category. The strategy of this method is to find two basis vectors a and b for target matrices X and Y, respectively, such that the correlation between their projections onto the corresponding vectors is mutually maximized. Unlike the PCA method, which attempts to identify the basis vectors of maximum variance, CCA exploits the correlation instead and is highly adaptable to spectral data. First, let the set of paired samples of (x, y) be Z = ((x1, y1), (x2, y2), …, (xi, yi), …, (xn, yn)). CCA seeks to obtain the canonical weights a and b in Eqs. (1) and (2), respectively, and to maximize the correlation between the canonical variates U and V:

$$U \equiv U(a) = (\langle a, x_1\rangle, \langle a, x_2\rangle, \ldots, \langle a, x_i\rangle, \ldots, \langle a, x_n\rangle) = a^{T} X \tag{1}$$

$$V \equiv V(b) = (\langle b, y_1\rangle, \langle b, y_2\rangle, \ldots, \langle b, y_i\rangle, \ldots, \langle b, y_n\rangle) = b^{T} Y \tag{2}$$

The superscript letter T denotes the transposition of the vector or matrix. Notably, the data matrices possess the same lengths in the calculation. The maximization of the correlation can be defined with the following equation:

$$\rho = \max_{a,b} \operatorname{corr}(U, V) = \max_{a,b} \frac{\langle U, V\rangle}{\lVert U\rVert \, \lVert V\rVert} = \max_{a,b} \frac{E[\langle a, x\rangle \langle b, y\rangle]}{\sqrt{E[\langle a, x\rangle^{2}] \, E[\langle b, y\rangle^{2}]}} = \max_{a,b} \frac{a^{T} C_{xy} b}{\sqrt{a^{T} C_{xx} a \; b^{T} C_{yy} b}} \tag{3}$$

where ρ is the correlation coefficient and E denotes expectation. Matrices Cxx and Cyy are the within-set covariance matrices of x and y, Cxy is the between-sets covariance matrix, and together they construct the total block covariance matrix C, as shown in Eq. (4):

$$C = \begin{bmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{bmatrix} = E\left[ \begin{pmatrix} x \\ y \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}^{T} \right] \tag{4}$$

Mathematically, values of a and b that maximize the correlation can be transformed into the following two eigenvalue equations:

$$C_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx} \, a = \rho^{2} a \tag{5}$$

$$C_{yy}^{-1} C_{yx} C_{xx}^{-1} C_{xy} \, b = \rho^{2} b \tag{6}$$

where the eigenvalues ρ² are squared canonical correlation coefficients and eigenvectors a and b are basis vectors for obtaining the canonical variates U and V, respectively. The dimensions of U and V depend on the minimum ranks of the target data sets X and Y, respectively.

two parts: one part is parallel to the CCV matrix of the other data to describe similarity, and the other part is orthogonal to describe the dissimilarity. Thus, all the data points and information in data X and Y are utilized to compute the correlation and uncorrelation indices. Ginseng is a widely used herb in western and Asian countries. Its reported biological activities include improving psychological and immune functions, reducing blood sugar and cholesterol levels, enhancing strength, and promoting relaxation, among others [31,33]. We used GC-MS data of ginseng samples to examine the proposed strategy. The samples included four different medicinal parts of three types of ginseng. In addition, some HPLC-DAD data sets are simulated to study the effects of certain experimental and instrumental factors, such as noise types and levels, data background and shifts, and unique chromatographic peaks,

Fig. 5. Effects of different numbers of PCs employed to obtain correlation and uncorrelation indices (noise level, 0.0002).

Fig. 6. Effects of different degrees of data shifts and background levels on the correlation index. (A) and (B), Effects of a shift of the data points of data sets X (A) and Y (B) on the correlation index. (C), Effects of different background levels on the correlation index. Integrated linear shift of the whole data sets X and Y. (D), Piecewise linear shift of the data sets X and Y (integrated non-linear background with noise level of 0.0002).

Fig. 7. Total ion chromatograms (TICs) of different parts of C-, S-, and W-ginseng. Curves 1 to 4, 5 to 8, and 9 to 12 represent parts A, B, C, and D of C-ginseng, S-ginseng, and W-ginseng, respectively.

2.2. Orthogonal projection (OP)

The technique of OP has been extended to studies of deconvolution, modeling, quantitative structure-activity relationships/quantitative structure-property relationships (QSAR/QSPR), and other topics in chemometrics. The main idea of orthogonal projection posits that common information cannot be found between two orthogonal spaces, but instead, can be found only between overlapping ones. Mutual information can be utilized to describe the correlation; however, OP must be used for uncorrelation evaluation. Thus, identifying orthogonal subspaces of two data matrices should be useful for investigating the similarities and dissimilarities within the data. As previously mentioned, the canonical variates U and V correlate the matrices X and Y, respectively. To define the useful indices and to explore the relationships between the data using all possible data points, the two matrices U and V can then be projected onto each other to determine the orthogonal and overlapping information, as shown in Eqs. (7) and (8):

$$P_{U \leftarrow V} = \left[ I - U (U^{T} U)^{-1} U^{T} \right] V \tag{7}$$

$$P_{V \leftarrow U} = \left[ I - V (V^{T} V)^{-1} V^{T} \right] U \tag{8}$$

where the superscript −1 denotes the inverse of the matrix. Matrices P_{U←V} and P_{V←U} represent the orthogonal parts of V from U and those of U from V, respectively.

Fig. 8. Determination of the number of principal components in part A of C-ginseng and its effect on the correlation index between parts A and B of C-ginseng. (A) and (B), Determination of the number of PCs using the residual index (A) and expressed variance (B). (C) and (D), Effects of different numbers of PCs employed to obtain the correlation and uncorrelation indices. (C), The numbers of PCs of parts A and B of C-ginseng were changed from 1 to 300. The correlation and uncorrelation indices change from 0.570 to 0.547 and 0.430 to 0.453, respectively, when the number of PCs is increased from 250 to 300. (D), The number of PCs of part B of C-ginseng changes from 1 to 300, whereas the number of PCs of part A remains fixed.

Table 2
The number of principal components (PCs) used to analyze parts A, B, C and D of the three types of ginseng, namely, C-, S-, and W-ginseng, and percentage of expressed variance.

Type        Part  Number of PCs  Expressed variance (%)
C-ginseng   A     246            99.47
C-ginseng   B     243            99.42
C-ginseng   C     246            99.49
C-ginseng   D     245            99.34
S-ginseng   A     247            99.58
S-ginseng   B     235            99.45
S-ginseng   C     247            99.61
S-ginseng   D     238            99.61
W-ginseng   A     245            99.37
W-ginseng   B     250            99.36
W-ginseng   C     260            99.35
W-ginseng   D     262            99.39

2.3. Definition of the correlation and uncorrelation indices

In Fig. 1, the relationships between U and V and the projection matrices are presented to illustrate the strategy for defining the correlation and uncorrelation indices. The shaded section summarizes the common information between U and V, and P_{U←V} and P_{V←U} are projection matrices containing the differences. The correlation index ci of the original hyphenated data sets X and Y can then be written as Eq. (9):

$$ci = \frac{r_X}{r_Y} \cdot \frac{\sqrt{\sum q_{ij}^{2}}}{\sqrt{\sum w_{ij}^{2}}} \tag{9}$$

where rX and rY are the chemical ranks of data X and Y, respectively. The ratio rX/rY is a weighting factor that takes into account the differences between the number of chemical components in X and Y. Scalars qij and wij are elements of Q and W, as defined in Eqs. (10) and (11), respectively:

$$Q = U - P_{U \leftarrow V} = V - P_{V \leftarrow U} \tag{10}$$

$$W = U + V \tag{11}$$

The uncorrelation index ui can then be easily obtained according to the expression shown in Eq. (12):

$$ui = 1 - ci \tag{12}$$

Table 3
Correlation indices among parts A, B, C and D of the three types of ginseng shown in Fig. 7.

example, the inclusion of 99% expressed variance in the data correlation study should usually be sufficient, unlike the requirement of the completely accurate determination of chemical rank to deconvolute the overlapping chromatographic clusters. The effect of the numbers of PCs on the correlation indices is further investigated in the next section, and such approximations are shown to produce acceptable results. In contrast to the conventional similarity index, the new indices are highly effective at assessing the relationships between complicated data sets.

3. Experimental

3.1. Simulated data sets

The indices ci and ui are the results obtained from both the chromatographic and spectral dimensions of the hyphenated chromatography data. Fig. 1 illustrates that the values of ci and ui must be between 0 and 1. The two boundary values will be achieved mathematically with no or full overlapping between the correlation variates U and V.
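The two computational steps behind these indices, the CCA eigenvalue problem of Eq. (5) and the orthogonal projection of Eq. (7), can be sketched together as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the leading eigenvalue of Eq. (5) is found by plain power iteration (an assumption chosen to keep the sketch dependency-free), rows are treated as paired observations, and all function names and toy matrices are hypothetical. The projection is read as P = [I − U(UᵀU)⁻¹Uᵀ]V, and the two boundary cases (full and no overlap) are checked on toy data.

```python
import math

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inverse(A):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(A)
    M = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [v - f * w for v, w in zip(M[r], M[c])]
    return [row[n:] for row in M]

def covariance(X, Y):
    """Sample cross-covariance; rows are paired observations, columns variables."""
    n = len(X)
    def centre(Z):
        means = [sum(col) / n for col in zip(*Z)]
        return [[v - m for v, m in zip(row, means)] for row in Z]
    return [[v / (n - 1) for v in row]
            for row in matmul(transpose(centre(X)), centre(Y))]

def leading_canonical_correlation(X, Y, iters=500):
    """Largest rho from Eq. (5): Cxx^-1 Cxy Cyy^-1 Cyx a = rho^2 a."""
    Cxx, Cyy, Cxy = covariance(X, X), covariance(Y, Y), covariance(X, Y)
    M = matmul(matmul(inverse(Cxx), Cxy), matmul(inverse(Cyy), transpose(Cxy)))
    a, rho2 = [1.0] * len(M), 0.0
    for _ in range(iters):                      # power iteration (assumed solver)
        b = [sum(m * v for m, v in zip(row, a)) for row in M]
        norm = math.sqrt(sum(v * v for v in b))
        if norm == 0.0:
            return 0.0
        rho2 = sum(x * y for x, y in zip(a, b)) / sum(x * x for x in a)
        a = [v / norm for v in b]
    return math.sqrt(max(rho2, 0.0))

def orthogonal_part(U, V):
    """Eq. (7): [I - U (U^T U)^-1 U^T] V, the part of V orthogonal to span(U)."""
    Ut = transpose(U)
    coeff = matmul(inverse(matmul(Ut, U)), matmul(Ut, V))
    fitted = matmul(U, coeff)
    return [[v - f for v, f in zip(rv, rf)] for rv, rf in zip(V, fitted)]

def frobenius(A):
    return math.sqrt(sum(v * v for row in A for v in row))

# Toy check of Eq. (5): for one variable per set, rho is the absolute Pearson r.
rho = leading_canonical_correlation([[1.0], [2.0], [3.0], [4.0]],
                                    [[2.0], [1.0], [4.0], [3.0]])

# Toy check of Eq. (7): columns of U span the x-y plane.
U = [[1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]
res_inside = frobenius(orthogonal_part(U, [[2.0], [-3.0], [0.0]]))   # full overlap
res_outside = frobenius(orthogonal_part(U, [[0.0], [0.0], [5.0]]))   # no overlap
```

With full overlap the orthogonal residual vanishes, and with no overlap it retains all of V, matching the two boundary cases described in the text. Power iteration can stall if the start vector happens to be orthogonal to the leading eigenvector, so a library eigen-solver would be preferable for real data.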
As an example, in the case of the ci index, the mutual projection matrices P_{U←V} and P_{V←U} will include all and no information of U and V, respectively, under such situations. The advantage of the present indices is clear through comprehensive consideration of the high-dimensional chromatograms and the successive power of the chemometric tools. Further, the rationality can be validated in terms of the two different computation strategies shown in Eq. (10).

2.4. Reconstruction of the original chromatographic data using principal component analysis (PCA)

Noise always exists in instrumental data such as chromatographic profiles, which leads to error in the correlation and uncorrelation analysis using the proposed method. In this work, PCA is utilized to reconstruct the target data sets X and Y as follows:

$$X = U_{1} S_{1} V_{1}^{T} \tag{13}$$

$$Y = U_{2} S_{2} V_{2}^{T} \tag{14}$$

where U1 and U2 are column orthogonal matrices called loadings, V1 and V2 are row orthogonal matrices called scores, and S1 and S2 are diagonal matrices of the associated singular values. If the numbers of principal components (PCs) of X and Y can be obtained, then the two new matrices Xr and Yr contain only the information of the chemical components,

$$X_{r} = U_{1,r} S_{1,r} V_{1,r}^{T} \tag{15}$$

$$Y_{r} = U_{2,r} S_{2,r} V_{2,r}^{T} \tag{16}$$

where the six matrices U1,r, S1,r, V1,rᵀ and U2,r, S2,r, V2,rᵀ correspond to the matrices in Eqs. (13) and (14), respectively, with the first rX and rY PCs included. The values of rX and rY are determined with regard to the change ratios of the residuals after different numbers of PCs have been extracted, as given in Eq. (17):

$$rc_{i} = \frac{Resi_{i} - Resi_{i-1}}{Resi_{i}} \tag{17}$$

where Resi_i and Resi_{i−1} are the sums of the squares of the elements in the residual matrices with the first i or i − 1 PCs excluded. Because of the significant differences in the chemical components and the noise-to-residual calculation in Eq. (17), an apparent increase in rc (rc = [rc1, rc2, …, rci, …, rX or rY]) indicates the successful determination of the numbers of PCs, specifically, rX and rY. This strategy is useful for the analysis of target sub-matrices that are extracted from the whole chromatogram. However, the accurate determination of the sub-matrices within complex data is more difficult. In such cases, a threshold on the percentage of expressed variance can be set to assist the estimation of rX and rY. For

In this work, synthetic HPLC-DAD data were employed to investigate the performance of the proposed method. All synthetic high performance liquid chromatography-diode array detection (HPLC-DAD) chromatographic profiles of pure chemical components were generated using the Gaussian function shown in Eq. (18):

$$c_{h} = h_{i} \left( \frac{1}{\sqrt{2\pi}\, w_{i}} \, e^{-\frac{(t_{i} - p_{i})^{2}}{2 w_{i}^{2}}} \right) \tag{18}$$

where parameters hi, wi, and pi represent the height, half-width, and position of the ith chromatographic peak, respectively. The variable ti is the retention time region of the target component. The UV spectrum of the corresponding component was simulated using a mixture of Lorentz distributions given in Eq. (19):

$$s_{j}(\nu) = \sum_{i}^{M_{j}} \left( \frac{a_{ij}}{(\nu - \nu^{0}_{ij})^{2} + \gamma_{ij}^{2}} \right) \tag{19}$$

where the parameters Mj, aij, ν0ij, and γij are the number of absorbance spectral bands of the jth component, the maximal absorbance of the ith band, the frequency of the band center, and the band-width at half-maximum, respectively.

3.2. Sample extraction of chemical components in ginseng

Samples of ginseng were prepared as follows. First, they were ground and crushed prior to the experiment. Hexane was then added, and the samples were ultrasonically extracted for 1 h at room temperature. After centrifugation, the supernatant was used for GC-MS analysis.

3.3. Instruments/analytical conditions

A Shimadzu QP-2010 GC-MS spectrometer (Tokyo, Japan), equipped with an Agilent DB-5MS capillary GC column (30 m × 0.25 mm, 0.25 μm), was utilized for the analysis of the constituents in ginseng. The oven temperature was set at 100 °C initially, ramped to 170 °C at a rate of 1.5 °C/min, then to 190 °C at 8.0 °C/min, and was finally increased to 240 °C at a rate of 2.0 °C/min. The inlet temperature was set at 270 °C with a split ratio of 2:1. Helium, at a constant flow rate of 1.3 ml/min, was used as carrier gas. The full spectrum was recorded in the range of m/z 1–380. The temperatures of the EI ionization source and the interface were maintained at 200 °C and 250 °C, respectively.

3.4. Implementation

All computer programs were coded in MATLAB 6.5.0, and all computations were performed on an Intel (R) Core (TM) 2 CPU 6300 (1.86 GHz and 1.87 GHz) PC with 2.0 GB of memory.

4. Results and discussion

4.1. Simulated data sets

Fig. 4(D). When the present indices are used as standards, curves 2 to 4 indicate that the variation of both the chromatograms and the spectra can be acceptably expressed in data Y.

4.1.3. Effects of the number of PCs on correlation and uncorrelation indices

The numbers of PCs determined in data X and Y substantially affect the correlation and uncorrelation indices. In fact, numerous other strategies have been developed in the last few decades to estimate the ranks of hyphenated data. If necessary, more advanced methods can be used to reconstruct the data matrix. In Fig. 5, four curves are presented to show the changes of the indices when various numbers of PCs are used. Notably, the uncorrelation index (curve 4) possessed a minimum that corresponds to the correct value in the abscissa. Curves 1 to 3 of the correlation indices are acquired from the chromatographic and spectral directions, and their average shows the variation using different numbers of PCs. The correlation indices show a stepwise decrease when the number of PCs is greater than 10.
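Because the indices depend on how many PCs are retained, it may help to see the truncated reconstruction of Eqs. (15) and (16) combined with an expressed-variance threshold in a minimal sketch. This is not the authors' MATLAB code: a dependency-free power iteration with deflation on XᵀX stands in for a library SVD (a swapped-in technique), and the function name, the threshold values, and the toy rank-2 matrix are all hypothetical.

```python
import math

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def truncated_reconstruction(X, threshold=0.99, iters=500):
    """Return (r, Xr, expressed): the smallest number of PCs whose cumulative
    expressed variance reaches `threshold`, and the rank-r reconstruction."""
    m, n = len(X), len(X[0])
    Xt = [list(col) for col in zip(*X)]
    G = [[sum(a * b for a, b in zip(ci, cj)) for cj in Xt] for ci in Xt]  # X^T X
    total = sum(G[i][i] for i in range(n))      # squared Frobenius norm of X
    Xr = [[0.0] * n for _ in range(m)]
    explained, r = 0.0, 0
    while r < n and explained / total < threshold:
        v = [1.0 / math.sqrt(n)] * n
        for _ in range(iters):                  # power iteration on deflated G
            w = matvec(G, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-15:
                break
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, matvec(G, v)))   # squared singular value
        score = matvec(X, v)                    # left vector times singular value
        for i in range(m):
            for j in range(n):
                Xr[i][j] += score[i] * v[j]
        for i in range(n):                      # deflate: remove the extracted PC
            for j in range(n):
                G[i][j] -= lam * v[i] * v[j]
        explained += lam
        r += 1
    return r, Xr, explained / total

# Toy rank-2 "bilinear" matrix: two profiles times two spectra (hypothetical).
X = [[1.0, 0.5, 1.0], [2.0, -0.5, 2.0], [3.0, 0.5, 3.0], [4.0, -0.5, 4.0]]
r99, X_rec, expressed = truncated_reconstruction(X, threshold=0.99)
r95, _, _ = truncated_reconstruction(X, threshold=0.95)
```

In this toy case a 95% threshold keeps one PC while a 99% threshold keeps both, which mirrors how the choice of threshold changes the number of PCs entering the index calculation.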
This approach works very well for a portion of the data extracted from the full chromatogram when the rank is small; however, most conventional methods, including the present strategy, may fail to accurately estimate the numbers of PCs for chromatograms that contain tens or hundreds of chemical components. Nevertheless, the effects of the number of PCs used on the correlation index should be insignificant when the system contains more chemical components, and acceptable results were obtained using a threshold of 90% expressed variance, which will be discussed with respect to the ginseng data sets.

In Fig. 2, three typical HPLC-DAD chromatograms show the effectiveness of the new methods with regard to the similarity evaluation. Fig. 2(A) to (C) show completely different three-dimensional shapes and structures, but they have the same total chromatograms, as shown in Fig. 2(D). Thus, both the conventional congruence coefficient and the correlation coefficient between these profiles will be equal to one, and neither coefficient can distinguish the profiles according to their total or mean ion chromatograms. The information obtained from matrix data is obviously richer than the information obtained from scalar or vector data. This concept is the fundamental advantage with respect to data evaluation in the present study.

Next, most of the factors affecting the correlation and uncorrelation indices were examined, including experimental noise levels, the existence of strong chromatographic peaks, the numbers of PCs employed, data shifts, and background levels. These five factors could substantially affect the indices for the similarity evaluation. The parameters in Eq. (18) and Eq. (19) of the two simulated data sets X and Y are presented in Table 1. Each data set comprises 10 components. The two three-dimensional data sets are shown in Fig. 3(A) and (B), and the corresponding rc curves that were used to determine the rank r in the data are shown in Fig. 3(C) and (D).
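The construction of such a simulated data set can be sketched as follows, using the first two components of data set X from Table 1. The sketch assumes the Gaussian form of Eq. (18) and reads Eq. (19) as a sum of bands a/((ν − ν0)² + γ²); the retention-time and spectral axes (`times`, `channels`) are hypothetical, since the text does not state them, and the function names are illustrative only.

```python
import math

def gaussian_peak(t, height, half_width, position):
    """Eq. (18): Gaussian chromatographic profile evaluated at retention time t."""
    return height / (math.sqrt(2 * math.pi) * half_width) * math.exp(
        -((t - position) ** 2) / (2 * half_width ** 2))

def lorentz_spectrum(v, bands):
    """Eq. (19) as read here: sum of Lorentz bands given as (a, v0, gamma)."""
    return sum(a / ((v - v0) ** 2 + gamma ** 2) for a, v0, gamma in bands)

# First two components of data set X in Table 1.
# "peak" is (position, half-width, intensity); "bands" are (a, v0, gamma).
components = [
    {"peak": (40, 3, 12), "bands": [(12, 90, 50), (4, 30, 10), (12, 30, 50)]},
    {"peak": (62, 5, 3), "bands": [(16, 30, 20), (20, 60, 40), (16, 90, 20)]},
]

times = list(range(1, 101))       # assumed retention-time axis
channels = list(range(1, 121))    # assumed spectral axis

profiles = [[gaussian_peak(t, h, w, p) for t in times]
            for (p, w, h) in (c["peak"] for c in components)]
spectra = [[lorentz_spectrum(v, c["bands"]) for v in channels] for c in components]

# Bilinear data matrix: rows are retention times, columns are spectral channels.
data = [[sum(c[i] * s[k] for c, s in zip(profiles, spectra))
         for k in range(len(channels))] for i in range(len(times))]
```

Noise, retention-time shifts, or background could then be added to `data` to mimic the factors examined in this section; those steps are left out because their exact form is not specified here.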
The numbers of PCs for data sets X and Y were both 10. The top figures in Fig. 3(C) and (D) provide the expressed variance using the determined numbers of PCs from the bottom graphs.

4.1.4. Effects of data shifts and background levels on correlation and uncorrelation indices

In metabolomics studies, the pretreatment of high-throughput chromatographic profiles is a critical step for the interpretation of complex mixture data sets and for further discoveries of biological processes. Most existing methods encounter difficulty when attempting to automatically extract all the rich information from data affected by retention time shifts and background interferences. Such problems also arise during the process of hyphenated chromatography data evaluation. Removing such interference and recovering the original relationships between the data sets would be very useful. Fig. 6(A) and (B) were obtained when the parameter pi in Eq. (18) was gradually changed to generate retention time shifts in the data sets. Fig. 6(C) and (D) present the corresponding correlation indices for different integrated and local linear backgrounds in terms of the slope shown on the x-axis. Curves 1 to 5 represent the same information as previously described. The similarities between the results obtained using the conventional correlation coefficient and the congruence coefficient decrease significantly in Fig. 6(A) and (B) to even less than zero. Yet, as shown in Fig. 6(C) and (D), the same indices increase to greater than 0.8 when the background slopes are increased. Such results clearly cannot reveal the actual situation or the variation between the two data sets. This defect is mainly attributable to the limited information from hyphenated chromatography that is used to obtain the two coefficients. However, the gradual and robust variation of curves 1, 2, and 3 in all four sub-graphs indicates that the present correlation index is effective in representing the relationships between the data sets.
The simultaneous analysis from the chromatographic and spectral dimensions produces excellent results.

4.1.1. Effects of homoscedastic and heteroscedastic noise on correlation and uncorrelation indices

The existence of noise presents a major challenge when processing instrumental data with chemometric tools. The effects of different levels of homoscedastic and heteroscedastic noise on the correlation indices are presented in Fig. 4(A) and (B). The uncorrelation index can be acquired simply via the correlation index using Eq. (12). The noise produced no significant effects on curves 1 and 5, which represent the correlation coefficient and the congruence coefficient, respectively. However, the three-dimensional data of X and Y completely changed as the noise increased from 0.002% to 0.2% and from 0.02% to 2% of the maximum signal for homoscedastic and heteroscedastic noise, respectively. Curves 2 and 4 in Fig. 4 represent the results of the correlation indices from the chromatographic and spectral dimensions using the present method, and curve 3 is their average for the final evaluation. In general, these three curves decrease smoothly and robustly as the noise levels increase. The conventional indices apparently cannot show the variations of hyphenated chromatograms when the noise levels are increased. However, the indices developed in this study are effective at assessing the correlation and uncorrelation of these data sets. The ability of the new indices to deal with the heteroscedastic noise further demonstrates their effectiveness.

4.1.2. Effects of large chromatographic peaks on correlation and uncorrelation indices

When the conventional correlation coefficient or congruence coefficient is used as a similarity measurement of two hyphenated chromatograms, the result is dominated by a few major components, even though tens, hundreds, or even more chemical components may be present in the mixture.
This limits the effectiveness of the similarity evaluation because its scope is restricted to those few components. The spectral information is also not considered in the conventional similarity evaluation. Curves 1 and 5 in Fig. 4(C) and (D) represent the same information as those in Fig. 3. To study the effects of large chromatographic peaks on the index, the fifth component in Y is replaced by the fifth component used in X. Fig. 4(C) presents the results with only the chromatographic profile displaced, although the spectrum is also changed to correspond to the results in

4.2. Ginseng data sets

The growing conditions of ginseng, including its geographic location, soil conditions, and exposure to sunlight, are the major factors that affect the composition of its small-molecule metabolites and, hence, the activity of the herb. Three types of ginseng were studied in this work. Wild ginseng, grown without human intervention, was collected from the wild. Wildly cultivated ginseng was cultivated in the wild under conditions that resembled the natural growing environment of the herb, whereas cultivated ginseng was grown on a farm. The three types are abbreviated as W-, S-, and C-ginseng, respectively. Furthermore, different medicinal parts of ginseng are used to treat different diseases or different stages of the same disease in traditional practice. Thus, the correlation and uncorrelation of the hyphenated chromatograms of the four different parts of ginseng were investigated. The four parts of ginseng are the rhizome, the main root, the side root, and the fiber root, which are referred to as A, B, C, and D, respectively. The 12 data sets (four parts of each of the three ginseng types) give 66 pairwise comparisons, which are sufficient to validate the present method and provide a complete picture of the relationships among these mixtures. In Fig. 7, 12 total ion chromatograms (TICs) of the four parts of W-, S-, and C-ginseng are presented.
The TICs are the only information used for data evaluation with the conventional indices. As an example, the number of PCs of part A of C-ginseng is shown in Fig. 8(A), and the expressed variance using the corresponding number of PCs is shown in Fig. 8(B). The number of PCs for complicated mixtures is difficult to determine. In this case, the selected PCs account for 99.47% of the variance, which should yield reasonably good conclusions. Furthermore, the effect of the number of PCs on the correlation and uncorrelation indices is presented in Fig. 8(C) and (D) for parts A and B of C-ginseng. The two sub-figures were obtained by varying the numbers of PCs from 1 to 300 in Fig. 8(C), and by fixing the number of PCs in part A while changing the number of PCs in part B in Fig. 8(D). Curves 1 to 3 are the correlation indices acquired from the chromatographic and spectral dimensions, in addition to their average. Curve 4 is the corresponding uncorrelation index. According to the results presented in Fig. 8(C), the number of PCs affects the correlation and uncorrelation indices when it is less than 200. However, the indices of the chromatographic and spectral directions change only slightly, from 0.570 to 0.547 and from 0.430 to 0.453, respectively, as the number of PCs increases from 250 to 300. An approximate number of PCs is therefore apparently adequate for data evaluation. In Fig. 8(D), the trend of the correlation and uncorrelation indices even reverses at approximately the estimated number of PCs. Table 2 shows the determined number of PCs and the corresponding expressed variance. All correlation and uncorrelation indices are provided in Table 3. The results are symmetric by definition of the correlation and uncorrelation indices. The six values in each grid cell include the correlation indices of the chromatographic and spectral dimensions and their average, the uncorrelation index, and the congruence and correlation coefficients.
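For reference, the two conventional coefficients listed last in each cell are commonly defined as follows for two profile vectors of length n, such as two TICs. These are the standard textbook forms; the symbols x, y, r_cong, and r_corr are introduced here for illustration and may differ from the notation used in the equations of this work.

```latex
r_{\mathrm{cong}}(\mathbf{x},\mathbf{y})
  = \frac{\sum_{i=1}^{n} x_i y_i}
         {\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}},
\qquad
r_{\mathrm{corr}}(\mathbf{x},\mathbf{y})
  = \frac{\sum_{i=1}^{n} (x_i-\bar{x})(y_i-\bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i-\bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n} (y_i-\bar{y})^{2}}}
```

The only difference between the two is the mean-centering, so both depend on a single intensity vector per sample and carry no spectral information.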
The values of the average correlation index and the uncorrelation index are indicated in bold face. Most values of the two conventional indices are clearly greater than 0.9, and they are larger than the first four indices for the same type of ginseng. However, such high similarity values make it difficult to judge whether samples are the same, because the conventional indices report a correlation greater than that of the three-dimensional data. In fact, the differences are evident from both the chromatographic and spectral dimensions. Such confusion arises because the conventional indices do not use all of the information in the data or the small differences in the chemical components. The present indices have strong advantages in this respect. With the help of the indices proposed in this study, we found that the same type of ginseng has a higher correlation in every dimension, which is consistent with common understanding. The spectral dimension of the different types of ginseng indicates that the correlation between C- and W-ginseng is greater than that between C- and S-ginseng, although the average correlation index as a whole produces the contrary result. Furthermore, relationships exist among the different parts of ginseng. For example, according to the final column in Table 3, the correlation in the spectral dimension generally decreases among the medicinal parts of the different ginseng types from W- to C- to S-ginseng. The decreasing order of correlation changes to W- to S- to C-ginseng in the chromatographic dimension. The results obtained using the present method are therefore more objective and effective than those obtained using the conventional indices.

5. Conclusions

The aim of the present study was to properly evaluate the correlation and uncorrelation indices of hyphenated chromatographic data. This objective is important for uncovering the relationships between complicated experimental data and the processes and characteristics hidden in complicated mixtures.
Unlike the conventional indices, which use only a few data points from the whole high-dimensional space, the indices proposed in this work are more effective for data evaluation because all of the chemical information in the data matrices is considered. Powerful chemometric tools, including principal component analysis (PCA), canonical correlation analysis (CCA), and orthogonal projection (OP), were employed in the evaluation. On the basis of their performance with simulated and experimental data sets, the new indices are more suitable for the investigation of complex samples, such as the chromatographic fingerprinting of herbal medicines and metabolite profiling. Such a comprehensive chemometric strategy should also have promising applications in numerous other fields, such as image comparison and the identification of human fingerprints.

Acknowledgements

This work was financially supported by Changzhou Institute of Technology (Project YN1612 and Young Scholars Program).