LETTER  Communicated by Jose Miguel Hernandez Lobato

Temporal Causal Inference with Time Lag

Sizhen Du
dusizhen@126.com
Guojie Song
gjsong@pku.edu.cn
Key Laboratory of Machine Perception, Ministry of Education, Peking University, Beijing 100871, China

Lei Han
leihan.cs@gmail.com
SystemsBio Lab, Department of Basic Science, Mississippi State University, MS 39759, U.S.A.

Haikun Hong
honghaikun@hotmail.com
Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China

Accurate causal inference among time series helps to better understand the interactive scheme behind the temporal variables. For time series analysis, an unavoidable issue is the existence of time lag among different temporal variables. That is, past evidence takes some time to cause a future effect rather than producing an immediate response. To model this process, existing approaches commonly adopt a prefixed time window to define the lag. However, in many real-world applications this parameter may vary among different time series, and it is hard to predefine with a fixed value. In this letter, we propose to learn the causal relations as well as the lag among different time series simultaneously from data. Specifically, we develop a probabilistic decomposed slab-and-spike (DSS) model that performs the inference by applying a pair of decomposed spike-and-slab variables to the model coefficients, where the first variable estimates the causal relationship and the second captures the lag information among different temporal variables. For parameter inference, we propose an efficient expectation propagation (EP) algorithm to solve the DSS model. Experimental results on both synthetic and real-world problems demonstrate the effectiveness of the proposed method. The revealed time lag is well validated by the domain knowledge within the real-world applications.

S.D. is now at Peking University.
Neural Computation 29, 1–21 (2017) doi:10.1162/NECO_a_01028 © Massachusetts Institute of Technology

1 Introduction

Temporal causal inference aims to reveal the causal relationship among temporal variables and time series. Much effort has gone into learning accurate temporal causal inference models for various applications, such as gene regulatory network discovery (Friedman, 2004; Liu, Niculescu-Mizil, Lozano, & Lu, 2011), social network analysis (Goldenberg & Moore, 2004), indicator analysis in business (Arnold, Liu, & Abe, 2007), climate data study (Chen, Liu, Liu, & Carbonell, 2010), and traffic analysis (Han, Song, Cong, & Xie, 2012). For temporal causal modeling, an important factor is the time lag, which indicates how long one must look into the past to make accurate predictions of a current event. A real-world example is provided in Figure 1. Figure 1a shows the structure of a highway network; the red circle indicates a destination, and the blue circles denote the starting points. The traffic flow originating from the blue circles goes through the network, at the cost of some travel time, and finally arrives at the destination. The travel time introduces a time lag between the time series recorded at the source points and the time series recorded at the destination. This lag information is captured in Figures 1b to 1d. Figure 1b records the number of vehicles from the source point A arriving at the destination between 10:00 and 10:15. We observe that these vehicles depart from A earlier: 9:15 to 10:00. Treating 15 minutes as a unit, the time lag from A to the destination is thus a value in [1, 3] units. Similarly, Figures 1c and 1d record the lag information from starting points B and C to the destination, respectively.
Commonly, the lag between a pair of time series can be described as a value in a range [L_min, L_max] (e.g., the values in Figure 1). Hence, an accurate estimation of this range for each pair of time series brings a better understanding of the causal relations and more accurate prediction performance. Most existing causal inference methods treat L_min and L_max as constants rather than variables before learning. For example, previous approaches (Bahadori, Liu, & Xing, 2013; Cheng, Bahadori, & Liu, 2014; Lebre, Becq, Devaux, Stumpf, & Lelandais, 2010) set L_min = 1, which assumes that the current event is affected by the most recent history, and choose L_max either by cross-validation or as a predefined value. As the previous example shows, such a setting prevents the model from detecting the true time lags and causal relationships among temporal variables. In this letter, we propose to learn the causal relationship and the time lag among different temporal variables at the same time. Specifically, we propose a probabilistic decomposed slab-and-spike (DSS) model in which we separate the inference by decomposing the slab-and-spike variable into a product of two latent variables: one designed to estimate the causal relationship among different temporal variables, and the other to capture the optimal time lag among these variables in the form of a range [L_min, L_max]. For parameter inference, we propose an efficient expectation propagation algorithm to solve the DSS model. Experimental evaluations are performed on both synthetic and real-world problems, and the results demonstrate the effectiveness of the DSS model and show that the recovered time lag matches the domain knowledge well. The rest of the letter is organized as follows.

Figure 1: (a) A traffic network. (b–d) Lag intervals for different starting points.
Related work is discussed in section 2. We describe our model in section 3 and give the detailed inference in section 4. We show experimental results on synthetic and real-world data sets in section 5. We summarize the letter and conclude with future work in section 6.

2 Related Work

There is a rich literature on temporal causal learning. Since the seminal work on Granger causality (Granger, 1969), where a statistical hypothesis test was proposed to determine whether one time series is useful in predicting another, many approaches have been developed for inferring the causal relationship among temporal variables. For example, the exhaustive graphical Granger method (Arnold, Liu, & Abe, 2007) applied the Granger causality test to each pair of temporal variables and established a causal relation graph for the variables; the vector autoregressive (VAR) model (Gilbert, 1995; Hatemi-J, 2004) fit a linear regression by placing the current values of all the temporal variables as the response while treating the past values as covariates; and the LASSO Granger method (Arnold et al., 2007; Qiu, Liu, Subrahmanya, & Li, 2012) employed the ℓ1 regularization to learn sparse causal relations among the variables. The SIN method (Arnold et al., 2007; Drton & Perlman, 2008) adopted a notion of the gaussian graphical model to test the causal relationship among temporal variables, and a greedy extension of this idea to deal with large-scale graphs was proposed in Han et al. (2012). Causal inference models utilizing hidden Markov random field regression were discussed in Liu, Kalagnanam, and Johnsen (2009) and Liu, Niculescu-Mizil, Lozano, and Lu (2010, 2011).
Other causal inference models that analyze irregular time series (Qiu et al., 2012; Bahadori & Liu, 2012), evolving relational graphs (Chen et al., 2010), and the switch of roles between cause and effect (Cheng et al., 2014) have also been studied recently. Despite this rich literature, when considering the time lag, most of these methods fix its value as a constant before learning, leading to inaccurate estimates of the causal relations. Very few studies have tried to learn the time lag from data. Dhurandhar (2010) proposed estimating the maximum lag for a grouped graphical Granger model. Knapp and Carter (1976) developed a maximum likelihood estimator to determine the time delay in correlation analysis. Unfortunately, these works focus only on inferring the lag; they cannot infer the causal relationship at the same time. We aim to learn the causal relationship and the lag among temporal variables simultaneously from data.

3 The Decomposed Slab-and-Spike Model

Throughout the letter, we use roman letters to denote scalars, boldface lowercase letters for vectors, and boldface uppercase letters for matrices. Denote the target time series as y = (y_1, ..., y_n) ∈ R^n, where n is the total length of the considered time. The input data are X with P features, where X_{i,j} denotes the ith sample of X_j. Assume the considered P time series elapse over W time steps. The input data are converted to a new matrix X^lag with size n × d, where d = P × W and n is the number of samples. The tth row of X^lag is [X_{1,t−W:t−1}, ..., X_{P,t−W:t−1}]. Now, we consider a linear regression model,

    y = X^lag β + ε,    (3.1)

where β ∈ R^{d×1} is the regression coefficient and ε ∼ N(0, I σ_0^2) is gaussian noise. The causal inference problem is to find the optimal coefficient β.
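As a concrete illustration of this construction, the lagged design matrix X^lag can be sketched as follows. This is a minimal NumPy sketch: the function name and the ordering of the W lagged values inside each per-series block are our own assumptions, not taken from the letter.

```python
import numpy as np

def build_lag_matrix(X, W):
    """Convert P raw time series into the lagged design matrix X^lag.

    X : array of shape (T, P), column j holding the time series X_j.
    W : number of past time steps considered for every series.

    Row t of the result gathers, for each series, the W values that
    precede the response time, so the output has shape (T - W, P * W),
    with d = P * W columns, as in equation 3.1.
    """
    T, P = X.shape
    n = T - W
    X_lag = np.empty((n, P * W))
    for t in range(n):
        # One block of W consecutive past values per series.
        X_lag[t] = np.concatenate([X[t:t + W, j] for j in range(P)])
    return X_lag
```

A least-squares fit of β on `build_lag_matrix(X, W)` against `y` then corresponds to the unregularized version of equation 3.1.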
We propose a probabilistic model for the coefficient β. The likelihood function for y is defined as

    p(y | X, β) = ∏_{i=1}^{n} N(y_i | X_{i,:} β, σ_0^2),    (3.2)

where N(·) denotes a normal distribution, here with mean X_{i,:} β and variance σ_0^2. Since we aim to learn the causal relationship and the time lag between the causes and the target simultaneously, we introduce two binary latent variables, ω ∈ {0, 1}^P and s ∈ {0, 1}^{d×1}, where ω controls the causal relationship between the causes and the target and s contains the lag information. Specifically, ω_i = 1 indicates that the variable X_i is possibly causally related to y; otherwise, if ω_i = 0, X_i is not responsible for y. On the other hand, s determines whether a specific time point lies in the lag interval [L^i_min, L^i_max] of the ith variable. Let il denote the lth time point in the time window [1, W] of variable X_i: il = (i − 1) * W + l. That is, when s_il = 1, the lth time point is regarded as belonging to the lag interval of variable X_i; hence, X_{t−l,i} could be a cause of y_t, that is, β_il ≠ 0. Otherwise, when s_il = 0, the lth time point is outside the lag interval and X_{t−l,i} is removed from the model: β_il = 0. To capture the causalities and the lag information in one unified model, we use these two latent variables to control the regression coefficient β. The model coefficient should satisfy β_il ≠ 0 if ω_i s_il = 1 and β_il = 0 if ω_i s_il = 0. Therefore, β admits a sparse structure determined by the two latent variables ω and s. To achieve this, we apply spike-and-slab priors, which have been used in the probabilistic sparse learning literature (Chipman, 1996; Hernández-Lobato, Hernández-Lobato, & Suárez, 2015; Hernández-Lobato, Hernández-Lobato, & Dupont, 2013).
Specifically, we devise a probabilistic model based on the spike-and-slab priors by using the two binary variables to control the sparsity of β:

    P(β_il | ω_i, s_il) = ω_i s_il N(β_il | 0, v_0) + (1 − ω_i s_il) δ(β_il),    (3.3)

where v_0 is the variance of a gaussian distribution (the slab) and δ(·) is a Dirac delta function, also known as a point probability mass (the spike), centered at 0. The value of v_0 controls the shrinkage of the coefficients that are different from zero. If v_0 is large, the β_il are barely regularized and hence will be different from zero. Conversely, if v_0 is small, the coefficients β_il will strongly shrink toward zero. According to equation 3.3, it is easy to see that the model coefficient β depends on two latent variables, which imply the causal relationship and the lag information, respectively. Therefore, we call our method the decomposed slab-and-spike (DSS) model. For the latent variable ω, we apply a multivariate Bernoulli prior, which is commonly used for binary variables:

    p(ω) = Bern(ω | p_0) = ∏_{i=1}^{P} p_{0,i}^{ω_i} (1 − p_{0,i})^{(1−ω_i)},    (3.4)

where p_{0,i} is the probability that ω_i is nonzero and p_0 = (p_{0,1}, ..., p_{0,P}). Now we focus on the binary variable s. The role of s_il is to indicate whether the lth time point should be included in the lag interval of X_i. Thus we apply a Bernoulli prior for s:

    P(s_il | g_il) = Bern(s_il | g_il),    (3.5)

where g_il is the probability of the time point l lying in the lag interval of X_i. Recall that we consider the lag in the form [L_min, L_max]; therefore, we need to encode the two parameters L^i_min and L^i_max in the latent variable s. To achieve this, we introduce two new variables, a = (a_1, ..., a_P) ∈ R^P and L = (L_1, ..., L_P) ∈ R^P, and reformulate L_min and L_max as a_i = L^i_min and L_i = L^i_max − L^i_min + 1.
Now the lag interval can be represented as [a_i, a_i + L_i − 1]. We assume that if l belongs to the lag interval of X_i, the probability g_il > 0.5; otherwise, g_il < 0.5. Since g_il does not follow a fixed form within the interval, we assume it follows a uniform distribution:

    g_il | a_i, L_i ∼ U(0.5, 1)  if a_i ≤ l ≤ a_i + L_i − 1,
                      U(0, 0.5)  otherwise.    (3.6)

Without loss of generality, we apply normal priors for the variables a and L:

    p(a) = N(μ_a, Σ_a),    (3.7)
    p(L) = N(μ_L, Σ_L),    (3.8)

where μ_a and μ_L are the mean vectors and Σ_a and Σ_L denote the covariance matrices, which are diagonal. Now, by marginalizing out s and ω, we have

    p(β_il) = p_{0,i} g_il N(β_il | 0, v_0) + [p_{0,i}(1 − g_il) + (1 − p_{0,i}) g_il + (1 − p_{0,i})(1 − g_il)] δ(β_il).    (3.9)

Based on equation 3.9, Figures 2a and 2b show example plots of the probability density p(β_il) under different choices of hyperparameters.

Figure 2: Probability density for β_il under different choices of p_{0,i} and g_il. (a) g_il = 0.5, v_0 = 1, with p_{0,i} taking values 0.5, 1, and 0. (b) p_{0,i} = 0.5, v_0 = 1, with g_il taking values 0.5, 1, and 0.

The spike-and-slab prior is a mixture of a gaussian distribution (the slab) and a point probability mass (the spike), displayed as an upward-pointing arrow whose height is the probability mass of the prior at zero. From the figure, we observe that the proposed spike-and-slab prior clearly discriminates between the coefficients modeled by the slab, whose probability density is barely changed, and the coefficients described by the spike, whose posterior has significantly different peaks around zero. In many real-world applications, we may have a priori knowledge about the causal relationship and the time lag.
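The generative prior defined by equations 3.3 to 3.6 can be sketched as a sampling routine. This is a sketch under our own conventions (function name, fixed seed, and returning all latent draws are assumptions); it is meant only to make the factorization concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dss_prior(p0, a, L, W, v0=1.0):
    """Draw (omega, g, s, beta) from the decomposed slab-and-spike prior.

    p0   : length-P vector, P(omega_i = 1)                    (equation 3.4)
    a, L : integer vectors giving each lag interval [a_i, a_i + L_i - 1]
    g_il ~ U(0.5, 1) inside the interval, U(0, 0.5) outside   (equation 3.6)
    s_il ~ Bern(g_il)                                         (equation 3.5)
    beta_il ~ N(0, v0) if omega_i * s_il = 1, else exactly 0  (equation 3.3)
    """
    P = len(p0)
    omega = rng.binomial(1, p0)            # causal indicators
    l = np.arange(1, W + 1)                # time points 1..W
    inside = (l[None, :] >= a[:, None]) & (l[None, :] <= (a + L - 1)[:, None])
    g = np.where(inside,
                 rng.uniform(0.5, 1.0, (P, W)),
                 rng.uniform(0.0, 0.5, (P, W)))
    s = rng.binomial(1, g)                 # lag-interval indicators
    slab = rng.normal(0.0, np.sqrt(v0), (P, W))
    beta = omega[:, None] * s * slab       # the spike yields exact zeros
    return omega, g, s, beta
```

Note how the product ω_i s_il gates the slab: a coefficient is nonzero only when its series is causal and the time point falls in the sampled lag interval.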
For example, in traffic and climate analysis, time series collected from spatially close locations are more likely to be causally related, and the time lag between them is relatively small compared with that between distant locations (see the example in Figure 1). Hence, we may choose the hyperparameters, such as p_0, μ_a, and μ_L, to match this knowledge and facilitate the learning, as we do in our experiments.

3.1 Prediction and Dependence. Given the observed data X and y, we make inferences about the potential values of β, s, and ω using Bayes' theorem:

    p(β, ω, s, a, L | X, y) = p(y | β, X) p(β | s, ω) p(s | g) p(g | a, L) p(ω) p(a) p(L) / p(y | X),    (3.10)

where p(y | X) is a normalization constant known as the model evidence. This posterior distribution and the likelihood, equation 3.2, can be combined to compute a predictive distribution for the target y_new associated with a new observation X_new:

    p(y_new | X_new, y, X) = Σ_{s,ω} ∫ p(y_new | β, X_new) p(β, ω, s, a, L | y, X) dβ da dL.    (3.11)

The posterior distribution of ω defines the probability of causal features:

    p(ω | X, y) = Σ_s ∫ p(β, ω, s, a, L | y, X) dβ da dL.    (3.12)

A practical difficulty is that the exact computation of equations 3.10 to 3.12 is intractable for typical learning problems: all of these expressions involve a number of summations that grows exponentially. Thus, they must be approximated in practice. As an efficient alternative, we employ expectation propagation (EP), a method for fast, approximate inference with Bayesian models.

4 Parameter Inference

In this section, we propose an efficient parameter inference algorithm to solve the DSS model. EP is a deterministic method for carrying out approximate Bayesian inference. EP approximates the posterior distribution of the parameters of interest using a simpler parametric distribution Q.
The form of Q is chosen so that the integrals required to calculate expected values and marginal distributions with respect to Q can be obtained analytically in closed form. EP fixes the parameters of Q to approximately minimize the Kullback-Leibler divergence between the exact posterior and Q (Hernández-Lobato et al., 2013). The joint posterior distribution of all the latent variables can be written as

    p(β, ω, s, a, L | X, y) ∝ p(y | X, β) · p(β | s, ω) · p(s | g(a, L)) · p(ω) · p(a) · p(L),    (4.1)

where the six factors are denoted f_1, ..., f_6 in the order written. EP approximates each exact factor f_k by a simpler factor f̃_k, such that

    ∏_{k=1}^{6} f_k(β, ω, s, a, L) ≈ ∏_{k=1}^{6} f̃_k(β, ω, s, a, L).    (4.2)

All approximate factors f̃_k are chosen to belong to the same family of exponential distributions, and Q = (1/Z) ∏_k f̃_k, where Z is a normalization term.

4.1 Approximation of the Posterior. In equation 4.1, f_4, f_5, and f_6 are given by equations 3.4, 3.7, and 3.8, respectively. f̃_1 is a multivariate gaussian distribution with mean m̃_1 and covariance matrix Ṽ_1 defined by

    Ṽ_1^{-1} = (1/σ_0^2) (X^lag)^T X^lag,    Ṽ_1^{-1} m̃_1 = (1/σ_0^2) (X^lag)^T y.

f̃_4 is set equal to f_4 and is not iteratively reestimated during the EP algorithm. The corresponding parameter of f̃_4 is

    p̃_{4,i} = σ^{-1}(p_{0,i}),    (4.3)

where σ^{-1}(·) is the logit function,

    σ^{-1}(x) = log(x / (1 − x)).    (4.4)

We derive the approximations of f_2 and f_3, denoted f̃_2 and f̃_3, using the EP method. To take advantage of the structure of the true posterior distribution, we assume that f̃_2 and f̃_3 decompose into d independent terms.
Specifically, let f̃_2 and f̃_3 be

    f̃_2 = Z̃_2 ∏_{i=1}^{P} ∏_{l=1}^{W} exp(−(β_il − m̃_{2,il})^2 / (2 ṽ_{2,il})) · Bern(ω_i | σ(p̃_{2,i})) · Bern(s_il | σ(g(ã_{2,i}, L̃_{2,i}))),    (4.5)

    f̃_3(s, a, L) = Z̃_3 · N(L | μ̃_{3,L}, Σ̃_{3,L}) · N(a | μ̃_{3,a}, Σ̃_{3,a}) · ∏_{i=1}^{P} ∏_{l=1}^{W} Bern(s_il | σ(g(ã_{3,i}, L̃_{3,i}))),    (4.6)

where σ(·) denotes the sigmoid function, σ(x) = 1 / (1 + exp(−x)), and g(a, L) denotes the probability g with parameters a and L. Each component f̃_{2,il} is the product of a univariate gaussian distribution and two Bernoulli distributions, and f̃_{3,il} is the product of two univariate gaussian distributions and a Bernoulli distribution. Based on the approximations in equations 4.5 and 4.6, we obtain a joint variational approximation of the posterior:

    Q(β, ω, s, a, L) = N(β | m, V) N(L | μ_L, Σ_L) N(a | μ_a, Σ_a) ∏_{i=1}^{P} ∏_{l=1}^{W} Bern(ω_i | σ(p_{0,i})) Bern(s_il | σ(g(a_i, L_i))).    (4.7)

According to the quotient rules for gaussian and Bernoulli distributions (Hernández-Lobato et al., 2013), the joint parameters are given by

    V = (Ṽ_1^{-1} + Ṽ_2^{-1})^{-1},    m = V (Ṽ_1^{-1} m̃_1 + Ṽ_2^{-1} m̃_2),    (4.8)
    Σ_L = (Σ̃_{3,L}^{-1} + Σ̃_{6,L}^{-1})^{-1},    μ_L = Σ_L (Σ̃_{3,L}^{-1} μ̃_{3,L} + Σ̃_{6,L}^{-1} μ̃_{6,L}),    (4.9)
    Σ_a = (Σ̃_{3,a}^{-1} + Σ̃_{5,a}^{-1})^{-1},    μ_a = Σ_a (Σ̃_{3,a}^{-1} μ̃_{3,a} + Σ̃_{5,a}^{-1} μ̃_{5,a}),    (4.10)

where Ṽ_2 = diag(ṽ_{2,11}, ..., ṽ_{2,PW}) and ṽ_{2,il} denotes the variance of β_il in the approximate factor f̃_2.

4.2 EP Update Operations. Next, we perform the expectation propagation updates. Taking f̃_2 as an example, we first compute the parameters of the marginal distributions of β_il, s_il, and ω_i under the cavity distribution Q^{\2,il} ∝ Q / f̃_{2,il}. The parameters of Q^{\2,il} are obtained from the rules for the quotient between gaussian distributions and the quotient between Bernoulli distributions (Hernández-Lobato, 2009).
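For the univariate case, the two quotient rules used to form the cavity distribution can be sketched as follows. This is a generic exponential-family sketch (natural parameters subtract), following the standard EP recipe rather than code from the letter; the function names are ours.

```python
import math

def gaussian_cavity(m, v, m_t, v_t):
    """Cavity parameters for the quotient N(m, v) / N(m_t, v_t).

    For gaussians, precisions (1/v) and natural means (m/v) subtract.
    The result is a valid gaussian only when v_t > v.
    """
    prec = 1.0 / v - 1.0 / v_t
    v_cav = 1.0 / prec
    m_cav = v_cav * (m / v - m_t / v_t)
    return m_cav, v_cav

def bernoulli_cavity(p, p_t):
    """Cavity probability for the quotient Bern(p) / Bern(p_t).

    For Bernoulli distributions, the log-odds subtract.
    """
    logit = math.log(p / (1 - p)) - math.log(p_t / (1 - p_t))
    return 1.0 / (1.0 + math.exp(-logit))
```

In the EP loop, these cavities replace Q / f̃_{2,il} factor by factor; the refreshed factor is then obtained by moment matching against f_{2,il} · Q^{\2,il}.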
Once Q^{\2,il} is computed, we can find the corresponding approximation f̃_{2,il} that minimizes the KL divergence between f_{2,il} · Q^{\2,il} and f̃_{2,il} · Q^{\2,il}. Once all the approximate factors f̃_{2,il} have been updated in parallel, Q is recomputed as the product of all the approximate factors. A similar calculation process can be found in Hernández-Lobato et al. (2013) and Hernández-Lobato (2009). The computations for f̃_3 can be derived similarly. The procedure is depicted in algorithm 1.

5 Experiments

In this section, we conduct experiments on both synthetic and real-world data sets. We compare the proposed DSS model with state-of-the-art temporal causal inference methods, including the Granger LASSO (L(G); Arnold et al., 2007), the standard LASSO (L(S); Lee, Lee, Abbeel, & Ng, 2006), the SIN Granger method (Drton & Perlman, 2008), the TLHL model (Zhou et al., 2015), and the VAR model (Gilbert, 1995). Specifically, the standard LASSO does not consider a lag interval; it looks back only one time unit. Granger LASSO and SIN Granger are given a fixed time window as the lag interval. The TLHL model takes lags as adaptive parameters, but it assumes the regression coefficients follow an exponential form. The VAR model generalizes the univariate AR model to multiple time series. For the LASSO-based methods, we choose the regularization parameters from a candidate set by cross-validation, following Arnold et al. (2007).

5.1 Synthetic Data

5.1.1 Setting. We first evaluate the methods on synthetic data. We set n = 1000, W = 20, and P = 20. The predictors x_1, x_2, ..., x_20 are generated from the standard normal distribution. To simulate the causal relationship, we set the response variable y to be affected by x_1, x_2, ..., x_8, that is, ω_1 = 1, ..., ω_8 = 1.
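This generative setup can be sketched as follows. The sketch makes assumptions: the true lag intervals are actually specified by the probability matrix g in Figure 3a, so the intervals below are placeholders, and the coefficient and noise conventions follow the description in the next paragraph.

```python
import numpy as np

rng = np.random.default_rng(42)
n, W, P = 1000, 20, 20                 # settings from section 5.1.1

# Predictors: P standard-normal time series (length n + W for the burn-in).
X = rng.standard_normal((n + W, P))

# Causal structure: only x_1, ..., x_8 affect y, i.e. omega_1 = ... = omega_8 = 1.
omega = np.zeros(P, dtype=int)
omega[:8] = 1

# Placeholder lag intervals [L_min, L_max]; the paper's intervals come
# from the matrix g in Figure 3a.
L_min = np.full(P, 3)
L_max = np.full(P, 7)

# beta_il drawn around 5 inside the interval of a causal series, 0 elsewhere.
beta = np.zeros((P, W))
for i in range(P):
    if omega[i] == 1:
        beta[i, L_min[i] - 1:L_max[i]] = rng.normal(5.0, 1.0, L_max[i] - L_min[i] + 1)

# y_t = sum_i sum_l beta_il * X_{t-l,i} + gaussian noise with sigma_0 = 1.
y = np.empty(n)
for t in range(n):
    lagged = np.stack([X[t + W - np.arange(1, W + 1), i] for i in range(P)])
    y[t] = np.sum(beta * lagged) + rng.standard_normal()
```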
The corresponding probability matrix g is set as in Figure 3a. For the model coefficient β, we generate each β_il from N(μ_il, σ_0), where μ_il = 5 for l ∈ [L^i_min, L^i_max] and μ_il = 0 otherwise. The variance σ_0 is always set to 1. The response y is finally generated as

    y_t = Σ_{i=1}^{P} Σ_{l=1}^{W} β_il X_{t−l,i} + ε_t,    ε_t ∼ N(0, σ_0).

Figure 3: (a) The true probability matrix g. (b) The probability matrix ĝ learned by the DSS model.

We select λ in the standard LASSO by cross-validation: from the candidate set {0.1, 0.01, 0.001, 0.0001, 0.00001}, we obtain λ = 0.1. The selection of λ in the Granger LASSO is done in the same way.

5.1.2 Evaluation. To evaluate the lag inference performance of the proposed method, we introduce some measurements. In our model, for each temporal variable X_i, we can determine the lag interval via the probability g_il, that is, the probability of the time point l lying in [L^i_min, L^i_max]. When g_{i,l−1} < 0.5 and g_il > 0.5, the time point l is regarded as L^i_min; when g_il > 0.5 and g_{i,l+1} < 0.5, the time point l is regarded as L^i_max. Given the minimum and maximum lag values, we can construct the true parameter g. Let ĝ denote the probability estimated by the DSS model, and let g denote the true probability based on the fixed minimum and maximum lags. If ĝ_il > 0.5, l lies in the lag interval of X_i. The precision Pr and recall Re for recovering the true lag intervals are then defined as

    Pr = |{(l, i) ∈ W × P : (ĝ_il > 0.5) ∧ (g_il = 1)}| / |{(l, i) ∈ W × P : ĝ_il > 0.5}|,
    Re = |{(l, i) ∈ W × P : (ĝ_il > 0.5) ∧ (g_il = 1)}| / |{(l, i) ∈ W × P : g_il = 1}|.

Based on the precision and recall, the F1 score is defined as F1 = 2 Pr · Re / (Pr + Re). We report the average F1 score. For measuring the causal recovery performance, we also use the F1 score. Let ω denote the true causal vector and ω̂ the learned one. Then the precision and recall for recovering the true causalities are

    Pr = |{i ∈ [1, P] : (ω̂_i = 1) ∧ (ω_i = 1)}| / |{i ∈ [1, P] : ω̂_i = 1}|,
    Re = |{i ∈ [1, P] : (ω̂_i = 1) ∧ (ω_i = 1)}| / |{i ∈ [1, P] : ω_i = 1}|.

Moreover, we use the mean absolute percentage error (MAPE),

    MAPE = (1/n) Σ_{t=1}^{n} |(y_t − ŷ_t) / y_t|,

to compare the prediction performance of the different methods.

5.1.3 Results. We first evaluate the estimated parameters of the DSS model. Figures 3a and 3b show the true probability g and the ĝ learned by the DSS model, respectively. Comparing the two panels, we see that the learned probability matrix ĝ is close to the true matrix, especially for the time points where the probability equals 1. Thus, the learned ĝ clearly recovers the true probability, which demonstrates that the proposed DSS model can accurately estimate the lag intervals for different covariates. Moreover, we can obtain the causal relations by inspecting the probability p_0, the probability that the indicator ω is different from zero. Figure 4 reports the learned probabilities, which match the ground truth well: the first eight variables are the causal factors of the response y. Figure 5 shows the effect of the parameter W, where we vary W from 1 to 25 and evaluate the F1 scores of the obtained lags. When W is larger than the maximum lag defined for all the features (i.e., 20), the F1 score reaches a high value around 0.8 and generally remains stable.
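For reference, the evaluation measures of section 5.1.2 can be computed as in this sketch (the function names are ours; the in-interval threshold of 0.5 follows the text):

```python
import numpy as np

def lag_f1(g_hat, g_true):
    """Precision, recall, and F1 for lag-interval recovery.

    A point (l, i) is predicted in-interval when g_hat > 0.5;
    the ground truth marks in-interval points with g_true = 1.
    """
    pred = g_hat > 0.5
    true = g_true == 1
    tp = np.sum(pred & true)
    pr = tp / max(np.sum(pred), 1)
    re = tp / max(np.sum(true), 1)
    f1 = 2 * pr * re / (pr + re) if pr + re > 0 else 0.0
    return pr, re, f1

def mape(y, y_hat):
    """Mean absolute percentage error of predictions y_hat against y."""
    return np.mean(np.abs((y - y_hat) / y))
```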
This observation is reasonable since the considered time window W should be larger than the existing maximum lag, such that we can accurately discover all the lags in the variables. The results also imply that we do not need a very large W.

Figure 4: The learned probability p(ω) from the DSS model.

Figure 5: The F1 score of lag estimation with varying W.

Next, we compare the DSS model with the other methods in terms of learning accuracy. The F1 scores of the causal relationship recovery are provided in Figure 6. From the results, we conclude that the DSS model is much more stable than all the other methods. Because the standard LASSO with a fixed time lag considers only one lag value, lag fluctuations cannot be captured; thus, its performance is worse. Finally, based on the recovered dependency relationships, we evaluate the prediction performance of the various models using the MAPE. Figure 7 provides the prediction results: our model achieves the lowest prediction errors.

Figure 6: The F1 score of causal relation recovery.

Figure 7: The MAPE of different methods.

5.2 Traffic Data. In this section, we evaluate the proposed method on a real-world traffic data set studied by Han et al. (2012).

5.2.1 Setting. The features in this traffic data set are observations collected from sensors located at stations in a traffic network. The observation at each station is the number of vehicles that pass in 15 minutes. The data were collected over 92 days from 236 traffic stations, so 236 time series were observed. The data values are normalized to [0, 1].
The task is to infer the causal relationships among the different time series for two busy periods in each day: 7:00 a.m. to noon and midnight to 7:00 a.m. We select one month's data, with 600 samples, as the training data. In this traffic data set, there is prior knowledge on the lag intervals: when a vehicle passes a traffic station, the time is recorded, so we can compute the average travel time between any two locations. This travel time can be viewed as a rough guess of the lag, and we use it to define the hyperparameters in the DSS model. Specifically, we set the hyperparameter μ_a to the corresponding average travel time.

5.2.2 Results and Discussion. Table 1 shows the MAPE, (1/n) Σ_{t=1}^{n} |y_t − ŷ_t| / |y_t|, for all the methods.

Table 1: MAPE of Various Methods on the Traffic Data Set.

    Method:  L(G)   L(S)   VAR    TLHL   DSS
    MAPE:    0.22   0.23   0.35   0.30   0.12

Figure 8: Causal relations learned by the DSS model. The red circle indicates the target station and the blue circles the source stations.

From the results, we see that the DSS method outperforms the other methods. This demonstrates that learning both the causal relations and the lag intervals simultaneously from data can significantly improve prediction performance compared with methods that fix the length of the lag. We also show the detailed causal relations that the DSS model learned from 7:00 a.m. to noon and from midnight to 5:00 a.m. in Figures 8a and 8b, respectively. In the two figures, the red circle indicates the target station (destination) and the blue circles the source stations (starting points). We observe that from 7:00 a.m. to noon, the causally related source stations are mostly located close to the target station. However, from midnight to 5:00 a.m.
(the late night), the causally related source stations change dramatically: these source stations are generally distant from the target station. This interesting phenomenon is well explained by our domain experts. A large number of trucks travel through this traffic network late at night, and most of them travel a long distance for business reasons. In contrast, during the day, most traffic consists of private cars, which usually travel a short distance. This knowledge explains the results in Figures 8a and 8b and demonstrates the effectiveness of the proposed method.

Figure 9: The lag intervals learned by DSS and the corresponding mean travel time. (a) The lag intervals of the dependency relationships for 7:00 a.m. to noon. (b) The lag intervals of the dependency relationships for midnight to 5:00 a.m. The black curve denotes the travel time from the dependent station to the target station.

The lag intervals learned by the DSS model are shown in Figure 9. Most of the ground-truth travel time values lie in the learned lag intervals, which implies that the inferred intervals are accurate. For example, from Figure 9a, we see that the lag interval of station 10 is [9, 15] while that of station 1 is [2, 8], consistent with the fact that station 10 is much farther from the target station than station 1.

5.3 Air Quality Data. In this section, we apply our method to air quality data analysis. The problem of air quality has attracted increasing interest in recent years (Intergovernmental Panel on Climate Change, 2007).
In this application, one of the most important challenges is to discover the dispersion path of air pollution (Seinfeld & Pandis, 2016), and learning the spatial causal relationships is very helpful for interpreting and understanding the spreading pattern of the pollution. We test all the methods on a real-world air quality data set to evaluate their effectiveness.

5.3.1 Data. The air quality data are collected from 14 air quality monitoring stations at different locations in Beijing, China. Each station measures hourly PM2.5 data from December 1, 2014, to March 31, 2015. We extract the period from February 20, 2015, at 7:00 a.m. to February 22 at 7:00 p.m. from the entire time series, since the air is seriously polluted during this interval. We chose the time series from four sensors located at four important areas as the target time series and then determined their causal factors from all the time series collected from all the sensors separately.

Figure 10: Three snapshots of the pollution process.

Figure 11: The spatial dependencies and lag intervals of four sensors learned by the DSS model.

5.3.2 Results and Discussion. Figure 11 shows the recovered causal relationships and the estimated lag intervals for the four sensors. We find that the estimated causal relationships are consistent with the direction of the pollution diffusion as revealed in Figure 10. From the results, we infer that this pollution incident originates southwest of Beijing. As our domain experts acknowledged, a number of chemical factories are located in these areas, and this evidence provides a reasonable explanation for our results. In addition, we also compare the prediction error with the other methods. Table 2 shows the MAPE of all methods on this data set. From the table, we see that our model obtains the lowest prediction error.
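The MAPE values reported in Tables 1 and 2 are mean absolute percentage errors, (1/n) Σ_{t=1}^{n} |y_t − ŷ_t| / |y_t|. A minimal sketch of this metric, using made-up vehicle counts rather than the actual traffic or PM2.5 series:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error: (1/n) * sum(|y_t - yhat_t| / |y_t|)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

# Hypothetical counts of vehicles per 15-minute interval (not real data).
y_true = [100.0, 80.0, 120.0]
y_pred = [90.0, 88.0, 120.0]
print(round(mape(y_true, y_pred), 4))  # prints 0.0667
```

Note that MAPE is undefined when some y_t = 0, so zero-count intervals must be excluded or handled before applying it.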
Thus, we can conclude that causal inference with time lag helps to provide more accurate predictions of future effects.

Table 2: MAPE of Various Methods on the Air Quality Data Set.

Method   VAR    L(G)   L(S)   TLHL   DSS
MAPE     0.29   0.64   0.51   0.29   0.24

6 Conclusion

Time lag is an important parameter in causal relationship learning. In this letter, we proposed a probabilistic decomposed slab-and-spike (DSS) model to simultaneously estimate the causal relationship and the time lag interval within a unified framework. We developed an expectation propagation algorithm for the parameter inference. We evaluated our model on one synthetic data set and two real-world data sets. Experimental results demonstrate the effectiveness of the proposed model in learning the causal relationship as well as the time lag. Importantly, the results on the real data sets are well interpreted by the domain experts. It will be interesting to extend the static DSS model to learn a dynamic model, where the causal relationships and lags vary over the entire time series. We are also interested in learning the causal relationship and the time lag between the covariates and multiple responses at the same time.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61572041) and the Beijing Natural Science Foundation (4152023).

References

Arnold, A., Liu, Y., & Abe, N. (2007). Temporal causal modeling with graphical Granger methods. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 66–75). New York: ACM.

Bahadori, M. T., & Liu, Y. (2012). Granger causality analysis in irregular time series. In Proceedings of the 2012 SIAM International Conference on Data Mining (pp. 660–671). Philadelphia: Society for Industrial and Applied Mathematics.

Bahadori, M. T., Liu, Y., & Xing, E. P.
(2013). Fast structure learning in generalized stochastic processes with latent factors. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 284–292). New York: ACM.

Chen, X., Liu, Y., Liu, H., & Carbonell, J. G. (2010, July). Learning spatial-temporal varying graphs with applications to climate data analysis. In Proceedings of the 24th AAAI Conference on Artificial Intelligence. Cambridge, MA: AAAI Press.

Cheng, D., Bahadori, M. T., & Liu, Y. (2014). FBLG: A simple and effective approach for temporal dependence discovery from time series data. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 382–391). New York: ACM.

Chipman, H. (1996). Bayesian variable selection with related predictors. Canadian Journal of Statistics, 24(1), 17–36.

Dhurandhar, A. (2010, December). Learning maximum lag for grouped graphical Granger models. In Proceedings of the 2010 IEEE International Conference on Data Mining Workshops (pp. 217–224). Piscataway, NJ: IEEE.

Drton, M., & Perlman, M. D. (2008). A SINful approach to Gaussian graphical model selection. Journal of Statistical Planning and Inference, 138(4), 1179–1200.

Friedman, N. (2004). Inferring cellular networks using probabilistic graphical models. Science, 303(5659), 799–805.

Gilbert, P. D. (1995). Combining VAR estimation and state space model reduction for simple good predictions. Journal of Forecasting, 14(3), 229–250.

Goldenberg, A., & Moore, A. (2004). Tractable learning of large Bayes net structures from sparse data. In Proceedings of the Twenty-First International Conference on Machine Learning (p. 44). New York: ACM.

Granger, C. W. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, 37, 424–438.
Han, L., Song, G., Cong, G., & Xie, K. (2012). Overlapping decomposition for causal graphical modeling. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 114–122). New York: ACM.

Hatemi-J, A. (2004). Multivariate tests for autocorrelation in the stable and unstable VAR models. Economic Modelling, 21(4), 661–683.

Hernández-Lobato, D. (2009). Prediction based on averages over automatically induced learners: Ensemble methods and Bayesian techniques. Ph.D. diss., Universidad Autónoma de Madrid.

Hernández-Lobato, D., Hernández-Lobato, J. M., & Dupont, P. (2013). Generalized spike-and-slab priors for Bayesian group feature selection using expectation propagation. Journal of Machine Learning Research, 14(1), 1891–1945.

Hernández-Lobato, J. M., Hernández-Lobato, D., & Suárez, A. (2015). Expectation propagation in linear regression models with spike-and-slab priors. Machine Learning, 99(3), 437–487.

Intergovernmental Panel on Climate Change. (2007). Fourth Assessment Report: Climate Change 2007 (Working Paper 1325). http://wmo.insomnation.com/sites/default/files/documents/meetings/session20/doc2.pdf

Knapp, C., & Carter, G. (1976). The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(4), 320–327.

Lèbre, S., Becq, J., Devaux, F., Stumpf, M. P., & Lelandais, G. (2010). Statistical inference of the time-varying structure of gene-regulation networks. BMC Systems Biology, 4(1), 130.

Lee, S. I., Lee, H., Abbeel, P., & Ng, A. Y. (2006). Efficient l1 regularized logistic regression. In AAAI (Vol. 6, pp. 401–408).

Liu, Y., Kalagnanam, J. R., & Johnsen, O. (2009). Learning dynamic temporal graphs for oil-production equipment monitoring system.
In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1225–1234). New York: ACM.

Liu, Y., Niculescu-Mizil, A., Lozano, A. C., & Lu, Y. (2010). Learning temporal causal graphs for relational time-series analysis. In Proceedings of the 27th International Conference on Machine Learning (pp. 687–694). New York: ACM.

Liu, Y., Niculescu-Mizil, A., Lozano, A., & Lu, Y. (2011). Temporal graphical models for cross-species gene regulatory network discovery. Journal of Bioinformatics and Computational Biology, 9(2), 231–250.

Qiu, H., Liu, Y., Subrahmanya, N. A., & Li, W. (2012, December). Granger causality for time-series anomaly detection. In Proceedings of the IEEE 12th International Conference on Data Mining (pp. 1074–1079). Piscataway, NJ: IEEE.

Seinfeld, J. H., & Pandis, S. N. (2016). Atmospheric chemistry and physics: From air pollution to climate change. Hoboken, NJ: Wiley.

Zhou, X., Hong, H., Xing, X., Huang, W., Bian, K., & Xie, K. (2015, June). Mining dependencies considering time lag in spatio-temporal traffic data. In Proceedings of the International Conference on Web-Age Information Management (pp. 285–296). New York: Springer.

Received December 30, 2016; accepted July 28, 2017.
