NECO_a_01028-Du MITjats-NECO.cls October 3, 2017 20:51

LETTER
Communicated by José Miguel Hernández-Lobato

Temporal Causal Inference with Time Lag
Sizhen Du
dusizhen@126.com
Guojie Song
gjsong@pku.edu.cn
Key Laboratory of Machine Perception, Ministry of Education,
Peking University, Beijing 100871, China

Lei Han
leihan.cs@gmail.com
SystemsBio Lab, Department of Basic Science, Mississippi State University,
MS 39759, U.S.A.

Haikun Hong
honghaikun@hotmail.com
Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China
Accurate causal inference among time series helps to better understand the interactive scheme behind the temporal variables. For time series analysis, an unavoidable issue is the existence of time lag among different temporal variables. That is, past evidence takes some time to cause a future effect rather than producing an immediate response. To model this process, existing approaches commonly adopt a prefixed time window to define the lag. However, in many real-world applications, this parameter may vary among different time series, and it is hard to predefine it with a fixed value. In this letter, we propose to learn the causal relations as well as the lag among different time series simultaneously from data. Specifically, we develop a probabilistic decomposed slab-and-spike (DSS) model that performs the inference by applying a pair of decomposed spike-and-slab variables to the model coefficients, where the first variable estimates the causal relationship and the second captures the lag information among different temporal variables. For parameter inference, we propose an efficient expectation propagation (EP) algorithm to solve the DSS model. Experimental results on both synthetic and real-world problems demonstrate the effectiveness of the proposed method, and the revealed time lag is well validated by domain knowledge in the real-world applications.
S.D. is now at Peking University.
Neural Computation 29, 1–21 (2017)
doi:10.1162/NECO_a_01028
© Massachusetts Institute of Technology
1 Introduction
Temporal causal inference aims to reveal the causal relationship among
temporal variables and time series. Much effort has gone into learning an
accurate temporal causal inference model for various applications, such as
gene regulatory network discovery (Friedman, 2004; Liu, Niculescu-Mizil,
Lozano, & Lu, 2011), social network analysis (Goldenberg & Moore, 2004),
indicator analysis in business (Arnold, Liu, & Abe, 2007), climate data study
(Chen, Liu, Liu, & Carbonell, 2010), and traffic analysis (Han, Song, Cong,
& Xie, 2012).
For temporal causal modeling, an important factor is the time lag, which
indicates how long one must look into the past to make accurate predictions of a current event. A real-world example is provided in Figure 1. Figure 1a shows the structure of a highway network; the red circle indicates a
destination, and the blue circles denote the starting points. The traffic flow
originating from the blue circles goes through the network, with a cost of
some travel time, and finally arrives at the destination. Here, the travel time
brings a time lag between the time series recorded at the source points and
the time series recorded at the destination. This lag information is captured
in Figures 1b to 1d. Figure 1b records the number of vehicles from the source
point A arriving at the destination between 10:00 and 10:15. We observe that
these vehicles depart from A earlier: 9:15 to 10:00. By treating 15 minutes as a
unit, the time lag from A to the destination is thus a value in [1, 3] in terms of
units. Similarly, Figures 1c and 1d record the lag information from starting
points B and C to the destination, respectively. Commonly, the lag between
a pair of time series can be described as a value in a range [Lmin , Lmax ] (e.g.,
the values in Figure 1). Hence, an accurate estimation of the range for each
pair of time series can bring a better understanding of the causal relations
and more accurate prediction performance.
Most of the existing causal inference methods treat Lmin and Lmax as constants instead of variables before learning. For example, previous approaches (Bahadori, Liu, & Xing, 2013; Cheng, Bahadori, & Liu, 2014; Lebre, Becq, Devaux, Stumpf, & Lelandais, 2010) set Lmin = 1, which assumes that the current event is affected by the most recent history, and choose Lmax either by cross-validation or as a predefined value. As we can see from the previous example, such a setting prevents the model from detecting the true time lags and causal relationships among temporal variables.
In this letter, we propose to learn the causal relationship and the time lag
among different temporal variables at the same time. Specifically, we propose a probabilistic decomposed slab-and-spike (DSS) model in which we
separate the inference by decomposing the slab-and-spike variable into a
product of two latent variables—one designed for estimating the causal relationship among different temporal variables and another layer to capture
the optimal time lag among these variables in the form of range [Lmin , Lmax ].
For parameter inference, we propose an efficient expectation propagation
Figure 1: (a) A traffic network. (b–d) Lag intervals for different starting points.
algorithm to solve the DSS model. Experimental evaluations are performed
on both synthetic and real-world problems, and the results demonstrate the
effectiveness of the DSS model and that the recovered time lag can well
match the domain knowledge.
The rest of the letter is organized as follows. Related work is discussed
in section 2. We describe our model in section 3; then we give the detailed inference in section 4. We show experimental results on two data sets
in section 5. We summarize the letter and conclude with future work in
section 6.
2 Related Work
There has been a rich literature on temporal causal learning. Since the seminal work of Granger causality (Granger, 1969), where a statistical hypothesis
test was proposed to determine whether one time series was useful in predicting another one, many approaches have been developed for inferring
the causal relationship among temporal variables. For example, the exhaustive graphical Granger method (Arnold, Liu, & Abe, 2007) applied the
Granger causality test to each pair of the temporal variables and established
a causal relation graph for the variables; the vector autoregressive (VAR)
model (Gilbert, 1995; Hatemi-J, 2004) fit a linear regression by placing the
current value of all the temporal variables as the response while treating
the past values as covariates; the LASSO Granger method (Arnold et al.,
2007; Qiu, Liu, Subrahmanya, & Li, 2012) employed the 1 regularization to
learn sparse causal relations among the variables. The SIN method (Arnold
et al., 2007; Drton & Perlman, 2008) adopted a notion of the gaussian graphical model to test the causal relationship among temporal variables, and a
greedy extension of this idea to deal with large scale graphs was proposed in
Han et al. (2012). Some causal inference models utilizing the hidden Markov
random field regression were discussed in Liu, Kalagnanam, and Johnsen (2009) and Liu, Niculescu-Mizil, Lozano, and Lu (2010, 2011). Other causal inference models that aim to analyze irregular time series (Qiu et al., 2012; Bahadori & Liu, 2012), evolving relational graphs (Chen et al., 2010),
and the switch of roles between cause and effect (Cheng et al., 2014) have
been studied recently as well. Although the literature is rich, most of these methods fix the time lag as a constant before learning, leading to inaccurate estimates of the causal relations.
Very few studies have tried to learn the time lag from data. Dhurandhar (2010) proposed estimating the maximum lag for a grouped graphical Granger model. Knapp and Carter (1976) developed a maximum likelihood estimator to determine the time delay in correlation analysis. Unfortunately, these works focus only on inferring the lag; they cannot infer the causal relationship at the same time. We aim to learn the causal relationship and the lag among temporal variables simultaneously from data.
3 The Decomposed Slab-and-Spike Model
Throughout the letter, we use roman letters to denote scalars, boldface and
lowercase letters for vectors, and boldface and uppercase letters for matrices. Denote the target time series as y = (y1 , . . . , yn ) ∈ Rn , where n is the
total length of the considered time. The input data are X with P features
where Xi,j denotes the ith sample of Xj. Assume the considered P time series elapse over W time steps. The input data are converted to a new matrix X^lag of size n × d, where d = P × W and n is the number of samples. The tth row of X^lag is [X_{1,t−W:t−1}, . . . , X_{P,t−W:t−1}]. Now, we consider a linear regression model,

y = X^lag β + ε,   (3.1)
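To make the construction of X^lag concrete, here is a minimal Python/NumPy sketch that builds the lagged design matrix under the convention above; build_lag_matrix is our illustrative helper, not code from the letter.

```python
import numpy as np

def build_lag_matrix(X, W):
    """Build X_lag: row t holds [X_{1,t-W:t-1}, ..., X_{P,t-W:t-1}],
    that is, for each of the P features, its W most recent past values."""
    n_raw, P = X.shape
    rows = []
    for t in range(W, n_raw):
        # concatenate the length-W history of each feature in turn
        rows.append(np.concatenate([X[t - W:t, i] for i in range(P)]))
    return np.asarray(rows)

# toy check: P = 2 series, window W = 3
X = np.arange(20, dtype=float).reshape(10, 2)
X_lag = build_lag_matrix(X, W=3)
print(X_lag.shape)  # (7, 6): n = 10 - 3 usable samples, d = P * W = 6
```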
where β ∈ R^{d×1} is the regression coefficient and ε ∼ N(0, Iσ0²) is the gaussian noise.
The causal inference problem is to find the optimal coefficient vector β. We propose a probabilistic model for the coefficient β. The likelihood function for y is defined as

p(y | X, β) = ∏_{i=1}^{n} N(yi | Xi,: β, σ0²),   (3.2)

where N(·) denotes a normal distribution with mean Xi,: β and variance σ0².
Since we aim to learn the causal relationship and the time lag between
the causes and the target simultaneously, we introduce two binary latent
variables ω ∈ {0, 1}P and s ∈ {0, 1}d×1 , where ω controls the causal relationship between the causes and the target and s contains the lag information.
Specifically, ωi = 1 indicates that the variable Xi is possibly causally related
to y; otherwise, if ωi = 0, Xi is not responsible for y.
On the other hand, s determines whether a specific time point lies in the lag interval [Limin, Limax] of the ith variable. Let il denote the lth time point in the time window [1, W] of variable Xi: il = (i − 1) ∗ W + l. That is, when sil = 1, the lth time point is regarded as belonging to the lag interval of variable Xi; hence, Xt−l,i could be the cause of yt, that is, βil ≠ 0. Otherwise, when sil = 0, the lth time point is outside the lag interval and Xt−l,i is removed from the model: βil = 0.
In order to capture the causalities and lag information in one unified model, we use the corresponding two hyperparameters to control the regression coefficient β. The model coefficient β should satisfy βil ≠ 0 if ωi sil = 1 and βil = 0 if ωi sil = 0. Therefore, the model coefficient β admits a sparse structure determined by the two latent variables ω and s. To achieve
this, we propose to apply the slab-and-spike priors, which have been used
in the previous probabilistic sparse learning literature (Chipman, 1996;
Hernández-Lobato, Hernández-Lobato, & Suárez, 2015; Hernández-Lobato, Hernández-Lobato, & Dupont, 2013). Specifically, we devise a probabilistic
model based on the spike-and-slab priors by using two binary variables
controlling the sparsity of β as
P(βil | ωi, sil) = ωi sil N(βil | 0, v0) + (1 − ωi sil) δ(βil),
(3.3)
where v0 is the variance of a gaussian distribution (the slab) and δ(·) is a Dirac delta function, also known as a point probability mass (the spike), centered at 0. The value of v0 controls the shrinkage of the coefficients that are different from zero. If v0 is large, the βil's are barely regularized and hence can remain far from zero. Conversely, if v0 is small, the coefficients βil will strongly shrink toward zero. According to equation 3.3, it is easy to
see that the model coefficient β depends on two latent variables, which imply the causal relationship and the lag information, respectively. Therefore, we call our method the decomposed slab-and-spike (DSS) model.
For the latent variable ω, we apply a multivariate Bernoulli prior, which is commonly used for binary variables, as

p(ω) = Bern(ω | p0) = ∏_{i=1}^{P} p_{0,i}^{ωi} (1 − p_{0,i})^{(1−ωi)},   (3.4)
where p_{0,i} is the probability that ωi is nonzero and p0 = (p_{0,1}, . . . , p_{0,P})ᵀ.
Now we focus on the binary variable s. The function of the hyperparameter sil is to indicate whether the lth time point should be included in the lag interval of Xi. Thus we apply a Bernoulli prior for s:

P(sil | gil) = Bern(sil | gil),   (3.5)
where gil is the probability of the time point l lying in the lag interval of Xi .
Recall that we consider the lag with form [Lmin , Lmax ]; therefore, we need
to encode the two parameters Limin and Limax in the latent variable s. To
achieve this, we introduce two new variables, a = (a1 , . . . , aP ) ∈ RP and
L = (L1 , . . . , LP ) ∈ RP , and reformulate Lmin and Lmax as ai = Limin and Li =
Limax − Limin + 1. Now the lag interval can be represented as [ai , ai + Li − 1].
We assume that if l belongs to the lag interval of Xi , the probability gil >
0.5. Otherwise, gil < 0.5. Here the probability gil in the lag interval does not
follow a fixed form. Thus, we assume gil follows the uniform distribution:
gil | ai, Li ∼ U(0.5, 1) if ai ≤ l ≤ ai + Li − 1, and gil | ai, Li ∼ U(0, 0.5) otherwise.   (3.6)
Without loss of generality, we apply normal priors for the variables a and L as follows:

p(a) = N(μa, Σa),   (3.7)

p(L) = N(μL, ΣL),   (3.8)

where μa and μL are the mean vectors and Σa and ΣL denote the covariance matrices, which are diagonal.
Now, by marginalizing out s and ω, we have

p(βil) = p_{0,i} gil N(βil | 0, v0) + [p_{0,i}(1 − gil) + (1 − p_{0,i})gil + (1 − p_{0,i})(1 − gil)] δ(βil).   (3.9)
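The marginal prior in equation 3.9 is easy to simulate: a coefficient is drawn from the slab N(0, v0) with probability p0,i · gil and is exactly zero otherwise. A minimal sketch (the hyperparameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_beta(p0_i, g_il, v0, size):
    """Draw from the marginal prior of equation 3.9: with probability
    p0_i * g_il the coefficient comes from the slab N(0, v0); otherwise
    it is exactly zero (the spike)."""
    slab = rng.binomial(1, p0_i * g_il, size=size)         # plays ω_i s_il
    return slab * rng.normal(0.0, np.sqrt(v0), size=size)  # spike -> 0

draws = sample_beta(p0_i=0.5, g_il=0.5, v0=1.0, size=10_000)
print((draws == 0).mean())  # close to 1 - 0.5 * 0.5 = 0.75
```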
Based on equation 3.9, Figures 2a and 2b show example plots of the probability density p(βil) under different choices of hyperparameters.

Figure 2: Probability density for βil under different choices of p0,i and gil. (a) gil = 0.5, v0 = 1; p0,i takes values 0.5, 1, and 0, respectively. (b) p0,i = 0.5, v0 = 1; gil takes values 0.5, 1, and 0, respectively.

The spike-and-slab prior is a mixture of a gaussian distribution (the slab) and a point probability mass (the spike), and the spike is displayed by an upward-pointing arrow. The height of this arrow is the probability mass of the prior
at zero. From the figure, we observe that the proposed spike-and-slab prior
can clearly discriminate the coefficients modeled by the slab, whose probability density is barely changed, and the coefficients described by the spike,
whose posterior has significantly different peaks around zero.
In many real-world applications, we may be aware of some a priori
knowledge about the causal relationship and the time lag. For example, in
traffic and climate analysis, the time series collected from spatially close locations are more likely to be causal related, and the time lag between them
is relatively small compared with those from distant locations (see the example in Figure 1). Hence, we may choose the hyperparameters, such as p0 ,
μa , and μL , to match this knowledge to facilitate the learning, as we will do
in our experiments.
3.1 Prediction and Dependence. Given the observed data X and y, we make inferences about the potential values of β, s, and ω using Bayes' theorem:

p(β, ω, s, a, L | X, y) = p(y | β, X) p(β | s, ω) p(s | g) p(g | a, L) p(ω) p(a) p(L) / p(y | X),   (3.10)
where p(y|X ) is a normalization constant, known as the model evidence.
This posterior distribution and the likelihood, equation 3.2, can be combined to compute a predictive distribution for the target ynew associated with a new observation Xnew:

p(ynew | Xnew, y, X) = Σ_{s,ω} ∫ p(ynew | β, Xnew) p(β, ω, s, a, L | y, X) dβ da dL.   (3.11)
The posterior distribution of ω defines the probability of causal features:

p(ω | X, y) = Σ_s ∫ p(β, ω, s, a, L | y, X) dβ da dL.   (3.12)
A practical difficulty is that the exact computation of equations 3.10 to 3.12 is intractable for typical learning problems: all of these expressions involve summations with a number of terms that grows exponentially. Thus, they must be approximated in practice. As an efficient alternative, we employ expectation propagation (EP), a method for fast, approximate inference in Bayesian models.
4 Parameter Inference
In this section, we propose an efficient parameter inference algorithm to
solve the DSS model.
EP is a deterministic method for carrying out approximate Bayesian inference. EP approximates the posterior distribution of the parameters of interest using a simpler parametric distribution Q. The form of Q is chosen so
that the integrals required to calculate expected values and marginal distributions with respect to Q can be obtained analytically in closed form. EP
fixes the parameters of Q to approximately minimize the Kullback-Leibler
divergence between the exact posterior and Q (Hernández-Lobato et al., 2013).
The joint posterior distribution of all the latent variables can be written
as
p(β, ω, s, a, L | X, y) ∝ f1 f2 f3 f4 f5 f6,   (4.1)

where f1 = p(y | X, β), f2 = p(β | s, ω), f3 = p(s | g(a, L)), f4 = p(ω), f5 = p(a), and f6 = p(L).
EP approximates each exact factor fk by a simpler factor f̃k such that

∏_{k=1}^{6} fk(β, ω, s, a, L) ≈ ∏_{k=1}^{6} f̃k(β, ω, s, a, L).   (4.2)
All approximate factors f̃k are considered to belong to the same family of exponential distributions, and Q = (1/Z) ∏_k f̃k, where Z is a normalization term.
4.1 Approximation of the Posterior. In equation 4.1, f4, f5, and f6 can be represented by equations 3.4, 3.7, and 3.8, respectively. f̃1 refers to a multivariate gaussian distribution with mean m̃1 and covariance matrix Ṽ1 defined by

Ṽ1^{−1} = (1/σ0²) (X^lag)ᵀ X^lag,   Ṽ1^{−1} m̃1 = (1/σ0²) (X^lag)ᵀ y.
f̃4 is set to be equivalent to f4, and it is not iteratively reestimated through the EP algorithm. The corresponding parameters of f̃4 are

p̃4,i = σ^{−1}(p_{0,i}),   (4.3)

where σ^{−1}(·) is the logit function defined as

σ^{−1}(x) = log(x / (1 − x)).   (4.4)
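Equations 4.3 and 4.4 simply move the prior probabilities into the natural-parameter space of the Bernoulli distribution; the logit is the inverse of the sigmoid σ(·) used in equations 4.5 to 4.7. A quick numerical check:

```python
import math

def sigmoid(x):
    # σ(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    # σ^{-1}(p) = log(p / (1 - p)), equation 4.4
    return math.log(p / (1.0 - p))

# round trip: logit and sigmoid are inverses on (0, 1)
for p in (0.1, 0.5, 0.9):
    assert abs(sigmoid(logit(p)) - p) < 1e-12
print(logit(0.5))  # 0.0: a 50/50 prior maps to a zero natural parameter
```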
We propose to derive approximations of f2 and f3, denoted f̃2 and f̃3, respectively, using the EP method. In order to take advantage of the structure of the true posterior distribution, we assume that f̃2 and f̃3 can be decomposed into d independent terms. Specifically, let f̃2 and f̃3 be

f̃2 = Z̃2 ∏_{i=1}^{P} ∏_{l=1}^{W} exp(−(βil − m̃2,il)² / (2ṽ2,il)) · Bern(ωi | σ(p̃2,i)) · Bern(sil | σ(g(ã2,i, L̃2,i))),   (4.5)

f̃3(s, a, L) = Z̃3 · N(L | μ̃3,L, Σ̃3,L) · N(a | μ̃3,a, Σ̃3,a) · ∏_{i=1}^{P} ∏_{l=1}^{W} Bern(sil | σ(g(ã3,i, L̃3,i))),   (4.6)
where σ(·) denotes the sigmoid function σ(x) = 1/(1 + exp(−x)) and g(a, L) denotes the probability g with parameters a and L. Each component f̃2,il is the product of a univariate gaussian distribution and two Bernoulli distributions, and f̃3,il is the product of two univariate gaussian distributions and a Bernoulli distribution. Based on the approximations in equations 4.5 and 4.6, we obtain a joint variational approximation of the posterior:

Q(β, ω, s, a, L) = N(β | m, V) N(L | μL, ΣL) N(a | μa, Σa) · ∏_{i=1}^{P} ∏_{l=1}^{W} [Bern(ωi | σ(p_{0,i})) Bern(sil | σ(g(ai, Li)))].   (4.7)
According to the quotient rules for gaussian and Bernoulli distributions (Hernández-Lobato et al., 2013), the joint parameters are defined as

V = (Ṽ1^{−1} + Λ̃^{−1})^{−1},   m = V (Ṽ1^{−1} m̃1 + Λ̃^{−1} m̃2),   (4.8)

ΣL = (Σ̃3,L^{−1} + Σ̃6,L^{−1})^{−1},   μL = ΣL (Σ̃3,L^{−1} μ̃3,L + Σ̃6,L^{−1} μ̃6,L),   (4.9)

Σa = (Σ̃3,a^{−1} + Σ̃5,a^{−1})^{−1},   μa = Σa (Σ̃3,a^{−1} μ̃3,a + Σ̃5,a^{−1} μ̃5,a),   (4.10)

where Λ̃ = diag(ṽ2,11, . . . , ṽ2,PW), and ṽ2,il denotes the variance of βil in the approximate factor f̃2.
4.2 EP Update Operations. Next, we perform the expectation propagation. Taking f̃2 as an example, we first compute the corresponding parameters of the marginal distributions of βil, sil, and ωi under the cavity distribution Q^{\2,il} ∝ Q / f̃2,il. The parameters of Q^{\2,il} are obtained from the rules for the quotient between gaussian distributions and the quotient between Bernoulli distributions (Hernández-Lobato, 2009). Once Q^{\2,il} is computed, we can find the corresponding approximation f̃2,il that minimizes the KL divergence between f2,il · Q^{\2,il} and f̃2,il · Q^{\2,il}. Once all the approximate factors f̃2,il have been updated in parallel, Q is recomputed as the product of all the approximate factors. A similar calculation process can be found in Hernández-Lobato et al. (2013) and Hernández-Lobato (2009). Computations for f̃3 can be derived similarly. The procedure is depicted in algorithm 1.
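Algorithm 1 is not reproduced here, but the heart of each EP update is a one-dimensional moment-matching step. As a flavor of that computation (a simplified sketch under our own parameterization, not the letter's full update), the following matches the first two moments of a tilted distribution formed by a spike-and-slab prior times a gaussian likelihood term:

```python
import numpy as np

def norm_pdf(x, m, v):
    # density of N(x | m, v), with v the variance
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def moment_match_spike_slab(pi, v0, mu_lik, v_lik):
    """Mean/variance of the tilted distribution
    p(b) ∝ [pi N(b | 0, v0) + (1 - pi) δ(b)] N(b | mu_lik, v_lik)."""
    v_s = 1.0 / (1.0 / v0 + 1.0 / v_lik)    # slab x likelihood: variance
    m_s = v_s * mu_lik / v_lik              # ... and mean
    z_slab = pi * norm_pdf(mu_lik, 0.0, v0 + v_lik)      # slab evidence
    z_spike = (1.0 - pi) * norm_pdf(0.0, mu_lik, v_lik)  # spike evidence
    r = z_slab / (z_slab + z_spike)         # posterior slab responsibility
    mean = r * m_s                          # spike contributes 0
    var = r * (v_s + m_s ** 2) - mean ** 2  # law of total variance
    return r, mean, var

r, mean, var = moment_match_spike_slab(pi=0.5, v0=1.0, mu_lik=2.0, v_lik=0.5)
print(r > 0.5)  # True: the data pull posterior mass toward the slab
```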
5 Experiments
In this section, we report empirical results on both synthetic and real-world data sets. We compare the proposed DSS model with state-of-the-art temporal causal inference methods, including Granger LASSO (L(G); Arnold et al., 2007), standard LASSO (L(S); Lee, Lee, Abbeel, & Ng, 2006), the SIN Granger method (Drton & Perlman, 2008), the TLHL model (Zhou et al., 2015), and the VAR model (Gilbert, 1995). Specifically, standard LASSO does not consider a time lag interval; it looks back only one time unit. Granger LASSO and SIN Granger are given a fixed time window as the lag interval. The TLHL model takes lags as adaptive parameters,
but it assumes the regression coefficients follow an exponential form. The VAR
model generalizes the univariate AR model to multiple time series. For the
LASSO-based methods, we choose their regularization parameters from a
candidate set by cross-validation according to Arnold et al. (2007).
5.1 Synthetic Data
5.1.1 Setting. We first evaluate the methods on synthetic data. We set n = 1000, W = 20, and P = 20. The predictors x1, x2, . . . , x20 are generated from the standard normal distribution. In order to simulate the causal relationship, we set the response variable y to be affected by x1, x2, . . . , x8; that is, ω1 = · · · = ω8 = 1. The corresponding probability matrix g is set as in Figure 3a.
For the model coefficient β, we generate each βil from N(μil, σ0), where we set μil = 5 for l ∈ [Limin, Limax] and μil = 0 otherwise. The variance σ0 is always set to 1. The response y is finally generated as

yt = Σ_{i=1}^{P} Σ_{l=1}^{W} βil Xt−l,i + εt,   εt ∼ N(0, σ0²).

Figure 3: (a) The true probability matrix g. (b) The probability matrix ĝ learned by the DSS model.
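Under the settings of section 5.1.1, the synthetic data can be generated as in the following sketch; the per-feature lag intervals are illustrative assumptions, since the true g matrix is shown only graphically in Figure 3a.

```python
import numpy as np

rng = np.random.default_rng(42)
n, W, P = 1000, 20, 20               # settings from section 5.1.1
causal = range(8)                    # x1..x8 drive y (ω_i = 1)

# Illustrative lag intervals [L_i^min, L_i^max] per causal feature
# (assumed for this sketch; the paper's g matrix is only plotted).
lags = {i: (2 + i, 6 + i) for i in causal}

X = rng.standard_normal((n + W, P))  # predictors ~ N(0, 1)
beta = np.zeros((P, W))
for i, (lo, hi) in lags.items():
    # μ_il = 5 inside the lag interval, σ0 = 1
    beta[i, lo - 1:hi] = rng.normal(5.0, 1.0, size=hi - lo + 1)

# y_t = Σ_i Σ_l β_il X_{t-l,i} + ε_t; reversing the window aligns
# lag l with row t - l of X
y = np.array([(beta * X[t - W:t][::-1].T).sum() + rng.normal(0.0, 1.0)
              for t in range(W, n + W)])
print(y.shape)  # (1000,)
```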
We select the λ in the standard LASSO based on cross-validation. In
this experiment, we select the optimal λ from the candidate set {0.1, 0.01,
0.001, 0.0001, 0.00001} and get λ = 0.1 finally. The selection of λ in the
Granger LASSO is the same as that in the standard LASSO.
5.1.2 Evaluation. In order to evaluate the lag inference performance of the proposed method, we introduce some measurements. In our model, for each temporal variable Xi, we can determine the lag interval via the probability gil, that is, the probability of the time point l lying in [Limin, Limax]. When g_{i(l−1)} < 0.5 and gil > 0.5, the time point l is regarded as Limin; when gil > 0.5 and g_{i(l+1)} < 0.5, the time point l is regarded as Limax. Given the minimum and maximum lag values, we can construct the true parameter g. Let ĝ denote the probability estimated by the DSS model, and let g denote the true probability based on the fixed minimum and maximum lags. If ĝil > 0.5, l lies in the lag interval of Xi. The precision Pr and recall Re for recovering the true lag intervals are then defined as follows:
Pr = |{(l, i) ∈ W × P : (ĝil > 0.5) & (gil = 1)}| / |{(l, i) ∈ W × P : ĝil > 0.5}|,

Re = |{(l, i) ∈ W × P : (ĝil > 0.5) & (gil = 1)}| / |{(l, i) ∈ W × P : gil = 1}|.
Based on the precision and recall, the F1 score is defined as F1 = 2 · Pr · Re / (Pr + Re). We report the average F1 score to evaluate the proposed method. For measuring the causal recovery performance, we also use the F1 score. Let ω denote the true causal result vector and ω̂ denote the learned causal result vector. Then the precision Pr and recall Re for recovering the true causalities are defined as follows:

Pr = |{i ∈ [1, P] : (ω̂i = 1) & (ωi = 1)}| / |{i ∈ [1, P] : ω̂i = 1}|,

Re = |{i ∈ [1, P] : (ω̂i = 1) & (ωi = 1)}| / |{i ∈ [1, P] : ωi = 1}|.

Moreover, we use the mean absolute percentage error (MAPE), defined as

MAPE = (1/n) Σ_{t=1}^{n} |(yt − ŷt) / yt|,
Un
c
to evaluate the prediction performance of the different methods.
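The metrics of this section can be written compactly as follows; the function names are ours, and the toy matrices are illustrative.

```python
import numpy as np

def lag_precision_recall_f1(g_hat, g_true):
    """Precision/recall/F1 for lag recovery (section 5.1.2): cell (i, l)
    is predicted in-interval when g_hat > 0.5 and is truly in-interval
    when g_true == 1."""
    pred, true = g_hat > 0.5, g_true == 1
    tp = np.sum(pred & true)
    pr = tp / max(np.sum(pred), 1)
    re = tp / max(np.sum(true), 1)
    f1 = 2 * pr * re / (pr + re) if pr + re > 0 else 0.0
    return pr, re, f1

def mape(y, y_hat):
    # mean absolute percentage error
    return np.mean(np.abs((y - y_hat) / y))

g_true = np.array([[1, 1, 0], [0, 0, 0]])
g_hat = np.array([[0.9, 0.6, 0.4], [0.2, 0.7, 0.1]])
print(lag_precision_recall_f1(g_hat, g_true))  # precision 2/3, recall 1, F1 0.8
```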
5.1.3 Results. We evaluate the estimated parameters in the DSS model.
Figures 3a and 3b show the true probability g and the learned ĝ by the DSS
model, respectively. Comparing the two panels, we can see that the learned
probability matrix ĝ is close to the true matrix, especially for the time points
where the probability equals 1. Thus, we can conclude that the learned probability ĝ clearly recovers the true probability, which demonstrates that the
proposed DSS model can accurately estimate the lag intervals for different
covariates.
Moreover, we can obtain the causal relations by observing the probability p0, which is the probability that the indicator ω is different from zero.
Figure 4 reports the learned probabilities. From the results, we observe that
the results well match the ground truth that the first eight variables are the
causal factors of the response y.
Figure 5 shows the effect of the parameter W, where we vary W from 1 to
25 and evaluate the F1 measures of the obtained lags. It is clear that when W
is larger than the maximum lag defined for all the features (i.e., 20), the F1
score can reach a high value around 0.8 and generally remains stable. This
observation is reasonable since the considered time window W should be
larger than the existing maximum lag, such that we can accurately discover
Figure 4: The learned probability p(ω) from the DSS model.
Figure 5: The F1 score of lag estimation with varying W.
all the lags in the variables. Moreover, the results also imply that we do not
need a very large W.
Next, we compare the DSS model with the other methods in terms of learning accuracy. The F1 score of the causal relationship recovery is provided in Figure 6. From the result, we can conclude that the DSS model is much more stable than all the other methods. Because standard LASSO with a fixed time lag considers only one lag value, it cannot capture lag fluctuations; thus its performance is worse.
Finally, based on the recovered dependency relationships, we can evaluate the models in terms of prediction performance, measured by MAPE. Figure 7 provides the prediction results. We see that our model achieves the lowest prediction errors in MAPE.
5.2 Traffic Data. In this section, we evaluate the proposed methods on
a real-world traffic data set studied by Han et al. (2012).
Figure 6: The F1 score of causal relation recovery.
Figure 7: The MAPE of different methods.
5.2.1 Setting. The features in this traffic data set are observations collected from sensors located on stations in a traffic network. The observation of each station is the number of passed vehicles over 15 minutes. The
data are collected from records over 92 days, and there are 236 traffic stations, so 236 time series were observed in this data set. The data values are
normalized to [0,1]. The task here is to infer the causal relationship among
different time series from two busy periods in each day: 7:00 a.m. to noon
and midnight to 7:00 a.m. We select one month’s data, with 600 samples as
the training data.
In this traffic data set, there exists prior knowledge on the lag intervals. When a vehicle passes a traffic station, the time is recorded; as a consequence, we can compute the average travel time between any two locations. This travel time can be viewed as a rough guess of the lag, and we use it to define the hyperparameters in the DSS model. Specifically, we set the hyperparameter μa to the corresponding average travel time.
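For instance, setting the prior mean μa from average travel times only requires converting minutes into the 15-minute units of the series; a sketch (the travel times below are made up):

```python
# Illustrative only: map average travel times (minutes) between each
# source station and the target into the prior mean μ_a, measured in
# the 15-minute units used by the traffic data.
travel_minutes = [12.0, 31.0, 44.0, 70.0]
mu_a = [round(t / 15.0) for t in travel_minutes]
print(mu_a)  # [1, 2, 3, 5]
```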
5.2.2 Results and Discussion.

Table 1: MAPE of Various Methods on the Traffic Data Set.

        L(G)   L(S)   TLHL   VAR    DSS
MAPE    0.22   0.23   0.3    0.35   0.12

Table 1 shows the mean absolute percentage error ((1/n) Σ_{t=1}^{n} |yt − ŷt| / |yt|) for all the methods. From the results, we
Figure 8: Causal relations learned by the DSS model. The red circle indicates
the target station and the blue circle the source station.
see that the DSS method outperforms the other methods. This demonstrates
that learning both the causal relations and the lag intervals simultaneously
from data can significantly improve prediction performance compared to
the methods that fix the length of the lag.
We also show the detailed causal relations that the DSS model learned
from 7:00 a.m. to noon and midnight to 5:00 a.m. in Figures 8a and 8b, respectively. In the two figures, the red circle indicates the target station (destination) and the blue circle the source station (starting point). We observe
that the causally related source stations are mostly located close to the target station from 7:00 a.m. to noon. However, from midnight to 5:00 a.m. (late at night), the causally related source stations change dramatically: these source stations are generally distant from the target station.
This interesting phenomenon can be well explained by our domain experts. A large number of trucks travel through this traffic network late at night every day, and most of the trucks travel a long distance for business
reasons. In contrast, during the day, most traffic is composed of private cars,
which usually travel a very short distance. This knowledge well explains
the results in Figures 8a and 8b and demonstrates the effectiveness of the
proposed method.
Figure 9: The lag intervals learned by DSS and the corresponding mean travel
time. (a) The lag intervals of dependency relationships for 7:00 a.m. to noon.
(b) The lag intervals of dependency relationships for midnight to 5:00 a.m. The
black curve denotes the travel time from the dependent station to the target
station.
The lag intervals that the DSS model learned are shown in Figure 9. Most of the ground-truth travel time values lie within the learned lag intervals, which implies that the inferred lag intervals are very accurate. For example, from Figure 9a, we can see that the lag interval of station 10 is [9, 15] while that of station 1 is [2, 8], which is consistent with the fact that station 10 is much farther from the target station than station 1.
5.3 Air Quality Data. In this section, we apply our method to air quality data analysis. The problem of air quality has attracted increasing interest in recent years (Intergovernmental Panel on Climate Change, 2007). In this application, one of the most important challenges is to discover the dispersion path of air pollution (Seinfeld & Pandis, 2016), and learning the spatial causal relationships is very helpful for interpreting and understanding the spreading pattern of the pollution. We test all the methods on a real-world air quality data set to evaluate their effectiveness.
5.3.1 Data. The air quality data are collected from 14 air quality monitoring stations at different locations in Beijing, China. Each station measures hourly PM2.5 from December 1, 2014, to March 31, 2015. We extract the period from February 20, 2015, at 7:00 a.m. to February 22 at 7:00 p.m. from the entire time series, since the air quality is seriously polluted during
this time interval. We chose the time series from four sensors located at four important areas as the target time series and then determined their causal factors from all the time series collected from all the sensors separately.

Figure 10: The three snapshots of the pollution process.

Figure 11: The spatial dependencies and lag intervals of four sensors learned by the DSS model.
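The episode extraction described in section 5.3.1 can be sketched as follows, using only the standard library; the PM2.5 values are random placeholders, and only the dates follow the text above.

```python
from datetime import datetime, timedelta
import random

# Hourly PM2.5 readings as (timestamp, value) pairs covering the full
# collection period (December 1, 2014, to March 31, 2015).
start = datetime(2014, 12, 1, 0)
n_hours = int((datetime(2015, 3, 31, 23) - start).total_seconds() // 3600) + 1
series = [(start + timedelta(hours=i), random.uniform(0, 300))
          for i in range(n_hours)]

# Keep only the heavily polluted episode used in the experiment.
lo = datetime(2015, 2, 20, 7)   # February 20, 7:00 a.m.
hi = datetime(2015, 2, 22, 19)  # February 22, 7:00 p.m.
episode = [(t, v) for t, v in series if lo <= t <= hi]
print(len(episode))  # 61 hourly observations
```

The window spans 60 hours inclusive of both endpoints, so 61 hourly observations remain for each station.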
5.3.2 Results and Discussion. Figure 11 shows the recovered causal relationships and the estimated lag intervals for the four sensors. We find that the estimated causal relationships are consistent with the direction of the pollution diffusion revealed in Figure 10. From the results, we can infer that this pollution incident originated southwest of Beijing. As our domain experts acknowledged, a number of chemical factories are located in these areas, and this evidence provides a reasonable explanation for our results.
In addition, we compare the prediction error with the other methods. Table 2 shows the MAPE of all methods on this data set. From the table, we see that our model obtains the lowest prediction error. Thus, we conclude that causal inference with time lag helps to provide more accurate prediction of the future effects.
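For reference, MAPE (mean absolute percentage error) averages the relative absolute errors over all points; lower is better. A minimal sketch, with made-up values rather than readings from the data set:

```python
def mape(y_true, y_pred):
    """Mean of |y - yhat| / |y| over all points (lower is better)."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative values only: errors of 10%, 10%, and 0% average to ~0.0667.
print(round(mape([100.0, 200.0, 400.0], [90.0, 220.0, 400.0]), 4))  # 0.0667
```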
Table 2: The MAPE of Various Methods on the Air Quality Data Set.

Method   L(s)   L(G)   VAR    TLHL   DSS
MAPE     0.29   0.64   0.51   0.29   0.24

6 Conclusion
Time lag is an important parameter in causal relationship learning. In this letter, we proposed a probabilistic decomposed slab-and-spike model to simultaneously estimate the causal relationship and the time lag interval within a unified framework. We developed an expectation propagation algorithm for the parameter inference and evaluated our model on one synthetic data set and two real-world data sets. Experimental results demonstrate the effectiveness of the proposed model in learning the causal relationship as well as the time lag. Importantly, the results on the real data sets are well interpreted by the domain experts.
It will be interesting to extend the static DSS model to learn a dynamic
model, where the causal relationship and lags are time varying over the
entire time series. We are also interested in learning the causal relationship
and the time lag between the covariates and multiple responses at the same
time.
Acknowledgments
This work was supported by the National Natural Science Foundation of
China (61572041) and the Beijing Natural Science Foundation (4152023).
References
Arnold, A., Liu, Y., & Abe, N. (2007). Temporal causal modeling with graphical
Granger methods. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 66–75). New York: ACM.
Bahadori, M. T., & Liu, Y. (2012). Granger causality analysis in irregular time series.
In Proceedings of the 2012 SIAM International Conference on Data Mining (pp. 660–
671). Philadelphia: Society for Industrial and Applied Mathematics.
Bahadori, M. T., Liu, Y., & Xing, E. P. (2013). Fast structure learning in generalized
stochastic processes with latent factors. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 284–292). New
York: ACM.
Chen, X., Liu, Y., Liu, H., & Carbonell, J. G. (2010, July). Learning spatial-temporal
varying graphs with applications to climate data analysis. In Proceedings of the
24th AAAI Conference on Artificial Intelligence. Cambridge, MA: AAAI Press.
Cheng, D., Bahadori, M. T., & Liu, Y. (2014). FBLG: A simple and effective approach
for temporal dependence discovery from time series data. In Proceedings of the 20th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(pp. 382–391). New York: ACM.
Chipman, H. (1996). Bayesian variable selection with related predictors. Canadian
Journal of Statistics, 24(1), 17–36.
Dhurandhar, A. (2010, December). Learning maximum lag for grouped graphical Granger models. In Proceedings of the 2010 IEEE International Conference on Data
Mining Workshops (pp. 217–224). Piscataway, NJ: IEEE.
Drton, M., & Perlman, M. D. (2008). A SINful approach to Gaussian graphical model
selection. Journal of Statistical Planning and Inference, 138(4), 1179–1200.
Friedman, N. (2004). Inferring cellular networks using probabilistic graphical models. Science, 303(5659), 799–805.
Gilbert, P. D. (1995). Combining VAR estimation and state space model reduction for
simple good predictions. Journal of Forecasting, 14(3), 229–250.
Goldenberg, A., & Moore, A. (2004). Tractable learning of large Bayes net structures
from sparse data. In Proceedings of the Twenty-First International Conference on Machine learning (p. 44). New York: ACM.
Granger, C. W. (1969). Investigating causal relations by econometric models and
cross-spectral methods. Econometrica: Journal of the Econometric Society, 37, 424–
438.
Han, L., Song, G., Cong, G., & Xie, K. (2012). Overlapping decomposition for
causal graphical modeling. In Proceedings of the 18th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (pp. 114–122). New York:
ACM.
Hatemi-J, A. (2004). Multivariate tests for autocorrelation in the stable and unstable
VAR models. Economic Modelling, 21(4), 661–683.
Hernández-Lobato, D. (2009). Prediction based on averages over automatically induced learners: Ensemble methods and Bayesian techniques. Ph.D. diss., Universidad Autónoma de Madrid.
Hernández-Lobato, D., Hernández-Lobato, J. M., & Dupont, P. (2013). Generalized spike-and-slab priors for Bayesian group feature selection using expectation propagation. Journal of Machine Learning Research, 14(1), 1891–1945.
Hernández-Lobato, J. M., Hernández-Lobato, D., & Suárez, A. (2015). Expectation propagation in linear regression models with spike-and-slab priors. Machine Learning, 99(3), 437–487.
Intergovernmental Panel on Climate Change. (2007). Fourth Assessment Report:
Climate Change 2007 (Working Paper 1325). http://wmo.insomnation.com/sites
/default/files/documents/meetings/session20/doc2.pdf
Knapp, C., & Carter, G. (1976). The generalized correlation method for estimation
of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(4),
320–327.
Lebre, S., Becq, J., Devaux, F., Stumpf, M. P., & Lelandais, G. (2010). Statistical inference of the time-varying structure of gene-regulation networks. BMC Systems
Biology, 4(1), 130.
Lee, S. I., Lee, H., Abbeel, P., & Ng, A. Y. (2006). Efficient L1 regularized logistic regression. In Proceedings of the 21st National Conference on Artificial Intelligence (pp. 401–408). Menlo Park, CA: AAAI Press.
Liu, Y., Kalagnanam, J. R., & Johnsen, O. (2009). Learning dynamic temporal graphs
for oil-production equipment monitoring system. In Proceedings of the 15th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining (pp.
1225–1234). New York: ACM.
Liu, Y., Niculescu-Mizil, A., Lozano, A. C., & Lu, Y. (2010). Learning temporal causal
graphs for relational time-series analysis. In Proceedings of the 27th International
Conference on Machine Learning (pp. 687–694). New York: ACM.
Liu, Y., Niculescu-Mizil, A., Lozano, A., & Lu, Y. (2011). Temporal graphical models
for cross-species gene regulatory network discovery. Journal of Bioinformatics and
Computational Biology, 9(2), 231–250.
Qiu, H., Liu, Y., Subrahmanya, N. A., & Li, W. (2012, December). Granger causality
for time-series anomaly detection. In IEEE 12th International Conference on Data
Mining (pp. 1074–1079). Piscataway, NJ: IEEE.
Seinfeld, J. H., & Pandis, S. N. (2016). Atmospheric chemistry and physics: From air pollution to climate change. Hoboken, NJ: Wiley.
Zhou, X., Hong, H., Xing, X., Huang, W., Bian, K., & Xie, K. (2015, June). Mining
dependencies considering time lag in spatio-temporal traffic data. In Proceedings
of the International Conference on Web-Age Information Management (pp. 285–296).
New York: Springer.
Received December 30, 2016; accepted July 28, 2017.