Stopping Criterion for Active Learning with Model Stability
YEXUN ZHANG, Shanghai Jiao Tong University
WENBIN CAI, Microsoft
WENQUAN WANG and YA ZHANG, Shanghai Jiao Tong University
Active learning selectively labels the most informative instances, aiming to reduce the cost of data annotation.
While much effort has been devoted to active sampling functions, relatively limited attention has been paid
to when the learning process should stop. In this article, we focus on the stopping criterion of active learning
and propose a model stability–based criterion, that is, when a model does not change with inclusion of additional training instances. The challenge lies in how to measure the model change without labeling additional
instances and training new models. Inspired by the stochastic gradient update rule, we use the gradient of
the loss function at each candidate example to measure its effect on model change. We propose to stop active
learning when the model change brought by any of the remaining unlabeled examples is lower than a given
threshold. We apply the proposed stopping criterion to two popular classifiers: logistic regression (LR) and
support vector machines (SVMs). In addition, we theoretically analyze the stability and generalization ability
of the model obtained by our stopping criterion. Substantial experiments on various UCI benchmark datasets
and ImageNet datasets have demonstrated that the proposed approach is highly effective.
CCS Concepts: • Computing methodologies → Active learning settings;
Additional Key Words and Phrases: Active learning, logistic regression, model stability, stopping criterion,
support vector machines
ACM Reference format:
Yexun Zhang, Wenbin Cai, Wenquan Wang, and Ya Zhang. 2017. Stopping Criterion for Active Learning with
Model Stability. ACM Trans. Intell. Syst. Technol. 9, 2, Article 19 (October 2017), 26 pages.
https://doi.org/10.1145/3125645
1 INTRODUCTION
In recent years, active learning has been successfully applied in many machine-learning problems
in which the data may be abundant but labels are absent or expensive to obtain [16, 27, 29]. Most
studies on active learning focus on designing sampling algorithms, such as uncertainty sampling
(US) [15] and query-by-committee (QBC) [23]. However, another important issue of the interactive
learning process is the stopping criterion, that is, deciding when to stop active learning, which is a
relatively less researched area. Here, we use an illustrative example to show the importance of the
stopping criterion.

Fig. 1. A typical active learning curve.

Figure 1 presents a typical active learning curve. The X-axis denotes the number
of actively selected instances and the Y-axis stands for classification accuracy. As plotted in the
figure, we can observe that the performance increases gradually in the initial stages. As active
learning proceeds, the learning curve starts to saturate, indicating that the model performance
cannot be further improved. Thus, in this particular example, it is desirable to stop active learning
at point B. In contrast, stopping at point A is too early and leads to a lower performance while
stopping at point C is too late and wastes annotation. To ensure model quality as well as save
labeling cost, it is highly desirable to define an appropriate stopping criterion to cease learning.
In principle, active learning should terminate when the desired performance has been reached.
However, this ideal approach requires an extra test set to evaluate the performance, which has to
be annotated in advance, involving a great deal of extra human effort. Thus, the key challenge is
how to stop active learning without a test set.
So far, there are limited studies on stopping criteria that require no test set [21, 28, 31, 35]. They
mainly fall into two classes: model-specific stopping criteria and learning strategy-specific stopping criteria. The model-specific stopping criterion is restricted by the base model employed. For
example, the support vector machines (SVMs)–based stopping criterion [21] is particularly designed for SVMs, and the confidence-based stopping criterion [31, 35] is mainly for probabilistic
learning models. The learning strategy-specific stopping criterion has its own limitations because
it is based on specific features of the active learning strategy. For example, the QBC-based stopping criterion [28] requires an ensemble to calculate the disagreement on each unlabeled example,
and it is not readily applicable to other sampling strategies. In order to adapt to various practical
applications, a general stopping criterion strategy unconstrained by specific models and specific
active learning strategies is extremely important.
In this article, we propose a new stopping criterion framework based on model stability, measured by the capability of each candidate example to change the model. The main idea is that active
learning should stop when the model will not change with more training data.
With this idea, we stop active learning when the changing ability of all remaining unlabeled
examples is less than a given threshold, that is, the model is stable. The model change is quantified
as the difference between the current model parameters and the new parameters obtained with
the accumulated training set. Inspired by the stochastic gradient update rule, in which the model
parameters are updated repeatedly according to the negative gradient of the objective function,
we use the gradient of the loss function at each candidate example to measure its capability to
change the model. In this study, we apply the stability-based stopping criterion to two popular
base models: logistic regression (LR) and SVMs. In addition, we theoretically analyze the model’s
properties, including stability and generalizability, when active learning stops according to our
method. Furthermore, we derive extensions of our method for several other kinds of models,
including kernelized models and nongradient-based models, to show its broad applicability.
Extensive experiments on various datasets demonstrate that the proposed approach outperforms
state-of-the-art methods in most cases.
The main contributions of this article are summarized as follows.
(1) We propose a new stopping criterion framework based on model stability, which considers
the capability of each candidate example to change the model. It can be applied to a wide
spectrum of learners.
(2) Under the framework, we derive the stopping criterion algorithms for two popular classifiers: LR and SVMs.
(3) We theoretically analyze the proposed stopping criterion to provide a theoretical support
for its performance.
(4) We further extend the proposed method to kernelized models and nongradient-based
models to show its applicability in a wide range of applications.
The rest of this article is structured as follows. Section 2 reviews the related work. The general framework of the stability-based stopping criterion is introduced in Section 3. The proposed
method for classification models is presented in Section 4. Section 5 theoretically analyzes the
method. Section 6 presents the experiments and interprets the results. The extensions of our proposed method are introduced in Section 7. We present our conclusions in Section 8.
2 RELATED WORK
In this section, we first quickly review several typical active learning strategies, and then summarize existing studies on stopping criteria for active learning.
2.1 Active Learning Methodologies
Most active-learning research has focused on designing sampling functions for choosing the informative examples. Perhaps the most widely used is US, which chooses the instance whose label
the current classifier is most uncertain about [15]. The margin-based active-learning method [30]
is mostly used when taking SVMs as the base models, which selects instances closest to the separating hyperplane. Small [26] extended the margin-based method to structured output spaces and
pipeline models, which enlarged the scope of problems for which active learning can be applied.
Wang [32] introduced the furthest nearest-neighbor (FNN) method, which generalizes the margin-based uncertainty to a multiclass case. Another typical strategy is QBC [23], which generates a
committee of models and selects instances having the greatest disagreement among the committee members [1]. In recent years, expected model change (EMC) has been adopted in designing
sampling functions, which considers the ability of examples to update the current model [22]. A
comprehensive active-learning survey can be found in [22].
2.2 Stopping Criterion for Active Learning
In this section, we summarize existing work on stopping criteria for active learning as follows.
(1) Stopping criterion with test set: The idea behind this kind of method is intuitive and
natural, that is, active learning should stop when the desired performance is reached [17]. However,
an extra labeled test set is needed to evaluate the model’s performance, involving an extra cost of
labeling.
(2) Stopping criterion with no test set: The main idea of this type of method is to evaluate the
model’s performance without using the labeled test set. Bloodgood [3] proposed the predictionbased stopping criterion, which detects model predictions on data that do not have to be labeled,
called the stop set. However, the performance of this method greatly depends on the selected stop
set, and how to determine an appropriate stop set needs further justification. Laws
and Schutze [14] presented a gradient-based stopping criterion, which stops active learning when
the model has reached maximum performance or the uncertainty cannot be reduced further. However, it is nontrivial to reliably estimate the gradient of the performance curve during the active-learning process in real-world applications. Recently, there have been several other stopping criteria, which can be further grouped into two classes: model-specific stopping criteria and learning
strategy-specific stopping criteria.
The model-specific stopping criterion contains the following methods. Taking SVMs as the base
model, Schohn and Cohn [21] suggested that active learning should stop when there are no unlabeled instances lying in the margin. This strategy is specific to SVMs and is not applicable for
other base models. Another model-specific method is the confidence-based stopping criterion,
which stops active learning when the classifier has enough confidence in its classification with
regard to the remaining unlabeled data. Vlachos [31] proposed a stopping criterion to stop active learning when the model confidence consistently drops, in which a separate outside dataset
is used to estimate the classifier’s confidence. The key issue is how to identify a consistent drop
in the confidence of a classifier during the active-learning process; otherwise, a local maximum
of the confidence curve would be found. Zhu et al. [35] proposed several alternatives to estimate
confidence, such as Max-confidence, Overall-uncertainty, Min-error and Selected-accuracy. However, these confidence-based stopping criteria are specific to a set of probabilistic learning models,
which limits their applicability.
The learning strategy-specific stopping criterion is mainly designed for a specific active learning
strategy. With the QBC framework, Thomanek et al. [28] utilized the disagreement rate among the
member classifiers to design the stopping criterion. However, this method requires an ensemble
to calculate the disagreement on each unlabeled example; thus, it is not readily applicable to other
sampling strategies.
Considering the limitations of existing stopping criteria, we propose a general stopping criterion
framework that is not constrained to specific models or specific active-learning strategies. This
work extends our previous work [33] by providing a solid theoretical justification. Next,
we introduce the framework of the stability-based stopping criterion.
3 THE FRAMEWORK OF STABILITY-BASED STOPPING CRITERION
In machine learning, the ultimate goal is to learn a model with good generalization performance
on future unseen data. The model’s generalization capability is changed if and only if the model
is changed; thus, examples that cannot change the model are actually useless for active learning.
Therefore, we believe that model stability is a reasonable indicator for stopping active learning.
Model stability can be measured by the model parameter change caused by querying and adding
one more candidate example $x^+$ into the training set. We define the model parameter change as
\[ C_\theta(x^+) = \|\theta - \theta^+\|, \qquad (1) \]
where $\theta$ and $\theta^+$ denote the current model parameters and the updated parameters learned from the expanded training set, respectively.
Then, our model stability–based stopping criterion (denoted as SC_MS hereafter) will stop the
active-learning process when the following condition is satisfied:
\[ C_\theta(x^+) \le \eta, \quad \forall x^+ \in \mathcal{U}, \qquad (2) \]
where $\eta$ is a predefined threshold and $\mathcal{U}$ stands for the unlabeled pool set. The main problem here
is how to calculate the parameter change $C_\theta(x^+)$ (i.e., $\|\theta - \theta^+\|$). In the following, we present our
method with a stochastic gradient update rule.
We start with the well-known Empirical Risk Minimization (ERM) principle. Given a training set $D = \{(x_i, y_i)\}_{i=1}^{m}$, where $x_i \in \mathbb{R}^d$ is a $d$-dimensional feature vector and $y_i$ is a class label drawn i.i.d. from a certain underlying distribution, the empirical risk on the training set is
\[ \hat{R}(D) = \sum_{i=1}^{m} L_\theta[f(x_i), y_i], \qquad (3) \]
where $f(x_i)$ is the predicted label and $L_\theta[\cdot]$ denotes a given loss function parameterized by $\theta$. To minimize the empirical error, a widely used search method is the stochastic gradient update rule, which updates the parameter $\theta$ repeatedly according to the negative gradient of the loss function with regard to each training example:
\[ \theta \leftarrow \theta - \alpha \frac{\partial L_\theta(x_i)}{\partial \theta}, \quad i = 1, 2, \ldots, m, \qquad (4) \]
where α is the learning rate.
Here, we consider the stochastic gradient update rule in active-learning cases. If the candidate example $x^+$ is added to the training set with a given class label $y^+$, the empirical error on the expanded training set $D^+ = D \cup \{(x^+, y^+)\}$ becomes
\[ \hat{R}(D^+) = \sum_{i=1}^{m} L_\theta[f(x_i), y_i] + L_\theta[f(x^+), y^+]. \qquad (5) \]
Consequently, the parameter is changed due to the inclusion of the new example $(x^+, y^+)$. According to the update rule in Equation (4), we can approximate the parameter change by the derivative of the loss function at the candidate example $(x^+, y^+)$. Let $G_\theta(x^+) = \frac{\partial L_\theta(x^+)}{\partial \theta}$; then, we have that
\[ C_\theta(x^+) = \|\theta - \theta^+\| = \|\alpha\, G_\theta(x^+)\|. \qquad (6) \]
In practice, since the true class label $y^+$ of the example $x^+$ is unknown in advance, we cannot calculate the derivative $G_\theta(x^+)$ in Equation (6) directly. Instead, we use the expectation over all possible labels $y^+ \in Y$ to estimate the true parameter change:
\[ C_\theta(x^+) = \mathbb{E}_{y^+}\|\alpha\, G_\theta(x^+)\| = \sum_{y^+ \in Y} p(y^+ \mid x^+) \left\| \alpha \frac{\partial L_\theta(x^+)}{\partial \theta} \right\|, \qquad (7) \]
where $p(y^+ \mid x^+)$ is the conditional probability of label $y^+$ given the instance $x^+$, which can be estimated by the current model.
Thus, we stop active learning when the following inequality is satisfied:
\[ \mathbb{E}_{y^+}\|G_\theta(x^+)\| \le \eta, \quad \forall x^+ \in \mathcal{U}, \qquad (8) \]
where $\alpha$ can be seen as a constant and it is integrated into $\eta$.
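To make the framework concrete, the following is a minimal sketch of the check in Equation (8); the function and argument names (should_stop, grad_fn, prob_fn) are illustrative rather than part of the original method, and any model that exposes a per-example loss gradient and a label posterior could be plugged in:

```python
import numpy as np

def should_stop(grad_fn, prob_fn, X_pool, labels, eta):
    """Generic SC_MS check (Equation (8)): stop when the expected gradient norm of
    the loss at every remaining unlabeled example is at most eta.

    grad_fn(x, y) -> gradient of the loss at (x, y) w.r.t. the model parameters
    prob_fn(x, y) -> current estimate of p(y | x) under the model
    """
    for x in X_pool:
        expected_change = sum(prob_fn(x, y) * np.linalg.norm(grad_fn(x, y))
                              for y in labels)
        if expected_change > eta:
            return False  # at least one candidate can still change the model
    return True
```

Sections 4.3 and 4.4 instantiate this interface for LR and SVMs, where the expectation admits a closed form.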
4 STABILITY-BASED STOPPING CRITERION FOR CLASSIFICATION
In this section, we apply the proposed method to binary classification tasks. We first introduce
the two classifiers used in this study, LR and SVMs, and the two active sampling methods, US
and EMC. Then, the details of SC MS for the two base models are derived. Finally, we analyze the
computation complexity of our method.
4.1 Classification Models
4.1.1 Logistic Regression. Logistic regression [2], which can be viewed as arising from a Bernoulli model, is usually formulated as $f(x) = \frac{1}{1 + \exp(-\theta^T x)}$, where $\theta$ is the parameter vector characterizing the model. Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{m}$ with $y_i \in Y_{LR} = \{0, 1\}$, the conditional probability of assigning a class label $y \in Y_{LR}$ to the data point $x$ is
\[ p(y \mid x; \theta) = f(x)^y (1 - f(x))^{1-y}. \qquad (9) \]
As a result, building an LR model amounts to maximizing the likelihood $\mathcal{L}_\theta(D) = \prod_{i=1}^{m} p(y_i \mid x_i; \theta)$. Equivalently, we minimize the following logistic loss:
\[ L_\theta(D) = -\log \mathcal{L}_\theta(D) = \sum_{i=1}^{m} \left[ -y_i \log f(x_i) - (1 - y_i) \log(1 - f(x_i)) \right], \qquad (10) \]
which is a commonly adopted convention.
4.1.2 Support Vector Machines. For SVMs [6], it is common to denote the class labels by $Y_{SVM} = \{-1, 1\}$. The linear SVM model can be represented by a hyperplane $f(x) = w^T x + b = 0$, where $w$ is the weight vector parameterizing the classifier. For simplicity, we omit the bias term $b$ throughout this study.
Constructing an SVM classifier amounts to solving the Quadratic Programming (QP) problem
\[ \min_{w} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \; y_i w^T x_i \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, m, \qquad (11) \]
where $\xi_i$ is a slack variable and $C$ is the penalty parameter.
4.2 Active Learning Methods
In this section, we introduce the active sampling methods used in our study: US and EMC. Our
stopping criterion can be combined with any other active-learning strategies.
4.2.1 Uncertainty Sampling. In a US [15] framework, an active learner queries instances about
which it is least certain, which is often straightforward for probabilistic learning models. For binary
classification, US simply selects the instance whose posterior probability is nearest to 0.5:
\[ x^*_{US} = \arg\min_{x \in \mathcal{U}} p(y = \hat{y} \mid x; \theta), \qquad (12) \]
where $x^*_{US}$ is the instance selected by US, $\mathcal{U}$ is the unlabeled pool set, $\theta$ is the model parameter vector, and $\hat{y}$ is the predicted label of $x$.
4.2.2 Expected Model Change. EMC [22] selects the instance that will impart the greatest
change to the current model if it is labeled and added into the training set D. The sampling function of EMC is defined as
\[ x^*_{EMC} = \arg\max_{x \in \mathcal{U}} \sum_{k} p(y_k \mid x; \theta) \left\| \frac{\partial l_\theta((x, y_k))}{\partial \theta} \right\|, \qquad (13) \]
where $x^*_{EMC}$ is the instance queried by EMC, $l_\theta$ is the loss function with model $\theta$, and $y_k$ is the $k$th possible label of $x$.
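As an illustration, here is a minimal sketch of the two sampling rules for the binary case, assuming a scikit-learn-style classifier that exposes predict_proba; the helper names are hypothetical and not part of the original method:

```python
import numpy as np

def select_us(clf, X_pool):
    # Eq. (12): query the instance whose posterior is closest to 0.5,
    # i.e., the one whose predicted label has the lowest probability.
    p = clf.predict_proba(X_pool)[:, 1]
    return int(np.argmin(np.abs(p - 0.5)))

def select_emc(clf, X_pool):
    # Eq. (13) for binary LR: the label-expected gradient norm of the logistic
    # loss reduces to 2 * p * (1 - p) * ||x||; query its maximizer.
    p = clf.predict_proba(X_pool)[:, 1]
    scores = 2.0 * p * (1.0 - p) * np.linalg.norm(X_pool, axis=1)
    return int(np.argmax(scores))
```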
Discussion. Our proposed model stability–based stopping criterion SC_MS is related to the EMC active-learning algorithm, as both algorithms measure the model-changing ability of examples. However, they differ in motivation and generality: (1) EMC aims to choose the examples that maximize the model parameter change for learning, while SC_MS focuses on stopping the active-learning process when the changing ability of all unlabeled instances falls below a threshold. (2) EMC is restricted to an LR model, but we view this measurement as a general framework and generalize it to several important kinds of learning models, including SVMs, kernelized models, and nongradient models.
4.3 Stopping Criterion for Logistic Regression
Assume that a candidate example $x^+$ with a given label $y^+$ is added to the training set; then the likelihood on the expanded training set $D^+ = D \cup \{(x^+, y^+)\}$ becomes $\mathcal{L}_\theta(D^+) = \prod_{i=1}^{m} p(y_i \mid x_i; \theta)\, p(y^+ \mid x^+; \theta)$. Thus, the logistic loss becomes
\[ L_\theta(D^+) = -\log \mathcal{L}_\theta(D) - \log \mathcal{L}_\theta(x^+) = \sum_{i=1}^{m} \left[ -y_i \log f(x_i) - (1 - y_i)\log(1 - f(x_i)) \right] - y^+ \log f(x^+) - (1 - y^+)\log(1 - f(x^+)). \qquad (14) \]
The derivative of the logistic loss $L_\theta(x^+)$ at the candidate example $(x^+, y^+)$ is calculated as
\[ G_\theta(x^+) = \frac{\partial L_\theta(x^+)}{\partial \theta} = (f(x^+) - y^+)\, x^+. \qquad (15) \]
Since the true class label $y^+$ is unknown before querying, we employ the current classifier to estimate the prediction distribution over $y^+ \in Y_{LR}$, and then use the expectation to approximate the true parameter change. Therefore, for LR, the expected model change is
\[ \mathbb{E}_{y^+}\|G_\theta(x^+)\| = \sum_{y^+ \in Y_{LR}} p(y^+ \mid x^+; \theta)\, \|(f(x^+) - y^+)\, x^+\|, \qquad (16) \]
and active learning ceases when Equation (8) is satisfied.
The SC_MS criterion for LR can be interpreted as follows. The parameter change is proportional to the error term $(f(x^+) - y^+)$. Hence, if the classifier already predicts accurately, the changing ability of the remaining instances will be small, indicating that the model is stable. The corresponding pseudocode of SC_MS for LR is presented in Algorithm 1.
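For concreteness, here is a minimal NumPy sketch of this check for binary LR, directly mirroring Equations (15), (16), and (8); it ignores the intercept and any regularization, and the function name is illustrative. Algorithm 1 below gives the corresponding pseudocode within the full active-learning loop.

```python
import numpy as np

def scms_stop_lr(theta, X_pool, eta):
    """SC_MS check for binary logistic regression (Equations (15), (16), and (8)).

    theta  : current LR weight vector, shape (d,)
    X_pool : unlabeled pool, shape (N, d)
    eta    : stopping threshold
    """
    f = 1.0 / (1.0 + np.exp(-X_pool @ theta))      # p(y = 1 | x; theta)
    x_norm = np.linalg.norm(X_pool, axis=1)        # ||(f - y) x|| = |f - y| * ||x||
    # Expectation over y in {0, 1}: p(y=1)*|f - 1| + p(y=0)*|f - 0|, times ||x||
    expected_change = (f * np.abs(f - 1.0) + (1.0 - f) * np.abs(f)) * x_norm
    return bool(np.all(expected_change <= eta))
```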
ALGORITHM 1: SC_MS for Logistic Regression
Require: The initial labeled data set D, the unlabeled pool set U, the LR classifier θ trained with D, and the predefined threshold η.
1: for each x+ ∈ U do
2:   for each possible y+ ∈ Y_LR do
3:     Calculate p(y+ | x+; θ) with the current classifier θ.
4:     Calculate the derivative of the loss L_θ(x+) with regard to θ via Equation (15).
5:   end for
6:   Estimate the model parameter change via Equation (16).
7: end for
8: if E_{y+}||G_θ(x+)|| ≤ η is satisfied for all x+ ∈ U then
9:   Stop the active-learning process.
10: else
11:   Query the most uncertain instances by Equation (12) if using US as the sampling algorithm.
12:   Query the instances changing the model most by Equation (13) if using EMC as the sampling algorithm.
13:   Update D, U, and θ, and then go to step 1.
14: end if

4.4 Stopping Criterion for Support Vector Machines
In this section, we derive the SC_MS for SVMs. However, due to the constraints in the above QP optimization problem, we cannot directly apply the stochastic gradient update rule to SVMs. To address this problem, we rewrite the QP problem as an unconstrained problem by rearranging the constraints and substituting $C = \frac{1}{\lambda}$:
\[ \min_{w} \; \frac{\lambda}{2}\|w\|^2 + \sum_{i=1}^{m} [1 - y_i w^T x_i]_+, \qquad (17) \]
where the subscript "+" indicates the positive part, the first term is the regularizer, and the second term represents the standard Hinge loss.
In the literature, several stochastic gradient–based learning algorithms have been studied for solving SVMs' optimization problems [13, 25]. Similarly, we consider the update rule in active learning. If a candidate point $x^+$ is added into the training set with a given label $y^+$, the objective function on the enlarged training set $D^+ = D \cup \{(x^+, y^+)\}$ becomes
\[ \min_{w} \; \frac{\lambda}{2}\|w\|^2 + \sum_{i=1}^{m} [1 - y_i w^T x_i]_+ + [1 - y^+ w^T x^+]_+. \qquad (18) \]
We estimate the effect of adding the new candidate point on the training loss to approximate the parameter change; hence, the derivative of the loss at the example $(x^+, y^+)$ is
\[ G_w(x^+) = \frac{\partial L_w(x^+)}{\partial w} = \begin{cases} -y^+ x^+, & \text{if } y^+ w^T x^+ < 1, \\ 0, & \text{otherwise.} \end{cases} \qquad (19) \]
This shows that the SVM updates its weights only on examples satisfying $y^+ w^T x^+ < 1$. Again, since the true label $y^+ \in Y_{SVM}$ is unknown in advance, we rewrite the inequality constraint $y^+ w^T x^+ < 1$ as $|w^T x^+| < 1$. Meanwhile, we take the expectation over all possible labels $y^+ \in Y_{SVM}$ to approximate the true parameter change.
Thus, for $x^+ \in \mathcal{U}$ such that $|w^T x^+| < 1$, the expected model change can be formulated as
\[ \mathbb{E}_{y^+}\|G_w(x^+)\| = \sum_{y^+ \in Y_{SVM}} p(y^+ \mid x^+; w)\, \|-y^+ x^+\| = \|x^+\|, \qquad (20) \]
and for any other $x^+ \in \mathcal{U}$, $\mathbb{E}_{y^+}\|G_w(x^+)\| = 0$. We stop active learning when Equation (8) is satisfied.
An intuitive interpretation is that only the examples within the margin, that is, those with $|w^T x^+| < 1$, have the ability to change the SVM. Thus, an SVM classifier is sufficiently stable when no unlabeled examples lie in the margin, which in turn yields good generalization performance. Also, our SC_MS for SVMs can be regarded as a generalization of previous work [21], which stops active learning when there are no unlabeled points within the margin. The corresponding pseudocode of SC_MS for SVMs is presented in Algorithm 2.
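Analogously, here is a minimal NumPy sketch of the SVM check in Equations (20) and (8), assuming a linear model without a bias term as in the derivation above; the function name is illustrative. Algorithm 2 below gives the full pseudocode.

```python
import numpy as np

def scms_stop_svm(w, X_pool, eta):
    """SC_MS check for a linear SVM without bias (Equations (20) and (8)).

    w      : current SVM weight vector, shape (d,)
    X_pool : unlabeled pool, shape (N, d)
    eta    : stopping threshold
    """
    margin = np.abs(X_pool @ w)                    # |w^T x+|
    inside = margin < 1.0                          # only in-margin points can change the model
    expected_change = np.where(inside, np.linalg.norm(X_pool, axis=1), 0.0)
    return bool(np.all(expected_change <= eta))
```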
ALGORITHM 2: SC_MS for Support Vector Machines
Require: The initial labeled data set D, the unlabeled pool set U, the current SVM classifier w trained with D, and the predefined threshold η.
1: for each x+ ∈ U do
2:   if |w^T x+| < 1 then
3:     Estimate the model parameter change via Equation (20).
4:   else
5:     The model parameter change is 0.
6:   end if
7: end for
8: if E_{y+}||G_w(x+)|| ≤ η is satisfied for all x+ ∈ U then
9:   Stop the active-learning process.
10: else
11:   Query the most uncertain instances by Equation (12) if using US as the sampling algorithm.
12:   Query the instances changing the model most by Equation (13) if using EMC as the sampling algorithm.
13:   Update D, U, and w, and then go to step 1.
14: end if

4.5 Computation Complexity
Since we focus on the stopping criterion, we only consider the computation time taken by our SC_MS. Suppose that there are N examples of d dimensions in the pool set U.
For LR, the computation required by our SC_MS mainly comes from computing the model parameter change, which takes $O(Nd)$ time.
For SVMs, the computation has two parts: calculating $|w^T x^+|$, which takes $O(Nd)$ time, and computing the model parameter change $\|x^+\|$, which takes $O(kd)$ time if there are $k$ instances satisfying $|w^T x^+| < 1$. The total time needed is thus $O((N + k)d)$.
Therefore, for both LR and SVMs, the computational complexity of our SC_MS is linear in the size of the unlabeled pool set, which is promising in practice.
5 THEORETICAL ANALYSIS
In this section, we theoretically analyze the proposed method concentrating on the following two
questions:
(1) How stable is the model obtained when active learning stops according to our SC_MS?
(2) How well does the model obtained when active learning stops according to our SC_MS generalize?
In the following, we try to answer these questions, but first, we provide some notations and
useful tools.
5.1 Notations
We have a space of points $X$ and a set of labels $Y$. Given a dataset $Q = D \cup \mathcal{U}$ drawn from an unknown distribution $P$, since the dataset $Q$ is usually very large in practice, we can regard it as an approximation of the distribution $P$. Suppose that $(n - m)$ instances are selected during active learning by a given sampling strategy; then, when active learning stops, we have a labeled set $S$ of size $n$ in $Z = X \times Y$,
\[ S = D \cup \{z_{m+1}, \ldots, z_n\} = \{z_1, \ldots, z_m, z_{m+1}, \ldots, z_n\}, \qquad (21) \]
and an unlabeled pool set $\mathcal{U}^- = \mathcal{U} \setminus \{z_{m+1}, \ldots, z_n\}$ of size $N - (n - m)$, where $N$ is the size of $\mathcal{U}$ and $z_i = (x_i, y_i)$. As suggested by [5], we consider the randomness of the initial labeled set $D$;
then if we obtain a training set $S$ of size $n$ when active learning stops according to our SC_MS, the threshold $\eta_S$ will differ for different initial sets $D$. We denote the maximum value of all these $\eta_S$ by $\eta_{max}$, namely, $\eta_{max} = \max_S \eta_S$.
To support the later derivations, we further define the following two sets:
(1) By adding one more instance $z^+$ randomly selected from $\mathcal{U}^-$ into the training set $S$:
\[ S^+ = \{z_1, \ldots, z_{i-1}, z_i, z_{i+1}, \ldots, z_n, z^+\}, \qquad (22) \]
(2) By replacing the $i$th element of $S$ with $z^+$:
\[ S^i = \{z_1, \ldots, z_{i-1}, z^+, z_{i+1}, \ldots, z_n\}, \qquad (23) \]
which are useful in our analysis. Since unlabeled data can usually be collected easily in practical applications, we assume that $\mathcal{U}$ is extremely large, so that after removing $S$ from $Q$, the remaining set $\mathcal{U}^-$ can still be seen as an approximation of $P$.
Let $l(\theta_S, z)$ be the loss function of the model $\theta_S$ learned from the set $S$, and suppose that $0 \le l(\theta_S, z) \le M$ always holds for all $z \in Z$ and all sets $S$, where $M$ is a constant. Let $R = \mathbb{E}_z[l(\theta_S, z)]$ be the generalization error of $\theta_S$ with regard to distribution $P$; we abbreviate $z \sim P$ to $z$ hereafter. Since $P$ is an unknown distribution and we cannot compute $R$ directly, we estimate $R$ by the empirical error $R_{emp} = \frac{1}{n}\sum_{i=1}^{n} l(\theta_S, z_i)$. Furthermore, as active learning selects examples by specific strategies, the examples in $S$ are not i.i.d. with regard to the underlying distribution $P$. However, they can approximately be seen as i.i.d. with regard to an unknown distribution $A$. We define the generalization error of $\theta_S$ on the distribution $A$ as $R_A = \mathbb{E}_{z \sim A}[l(\theta_S, z)]$.
5.2 Main Tools
Here are the main tools we will use in our later analysis.
Lemma 5.1 (Lipschitz Condition [5]). A function $f$ is said to satisfy the Lipschitz condition if there exists a constant $t$ such that
\[ |f(x_1, y) - f(x_2, y)| \le t\, |x_1 - x_2|. \qquad (24) \]
If $\frac{\partial f(x, y)}{\partial x}$ exists and is bounded for all $x$, then $f(x, y)$ meets the Lipschitz condition.
Lemma 5.2. Let $\theta_{S^+}$ and $\theta_{S^i}$ be the models estimated with $S^+$ and $S^i$, respectively. Then, according to our SC_MS, the following equations hold:
\[ \|\theta_{S^+} - \theta_S\| = \left\| \frac{\partial l(\theta_S, z^+)}{\partial \theta} \right\| \le \eta_S \quad \text{and} \quad \|\theta_{S^+} - \theta_{S^i}\| = \left\| \frac{\partial l(\theta_{S^i}, z_i)}{\partial \theta} \right\| \le \eta_{max}. \qquad (25) \]
The second equation in the above lemma holds because $S^+$ can be seen as the dataset constructed by adding $z_i$ into $S^i$, which can be deemed as obtained by our SC_MS with some initial set $D$.
Lemma 5.3. Let $l$ satisfy the Lipschitz condition. Then, for all $i \in \{1, 2, \ldots, n\}$, we have that
\[ |l(\theta_{S^+}, z) - l(\theta_S, z)| \le t\, \|\theta_{S^+} - \theta_S\| \le t\eta_S, \qquad (26) \]
and
\[ |l(\theta_{S^+}, z) - l(\theta_{S^i}, z)| \le t\, \|\theta_{S^+} - \theta_{S^i}\| \le t\eta_{max}. \qquad (27) \]
Lemma 5.4. For any symmetric learning algorithm and all $i \in \{1, 2, \ldots, n\}$, we have that
\[ \mathbb{E}_S[R - R_{emp}] = \mathbb{E}_{S, z^+ \sim A}[l(\theta_S, z^+) - l(\theta_{S^i}, z^+)] + \nu, \qquad (28) \]
where $\mathbb{E}_S[R] = \mathbb{E}_S[R_A] + \nu$ and $\nu$ is a bias term equal to zero when the distribution $A$ is the same as $P$.
Lemma 5.5. For any learning algorithm and loss function $l$ such that $0 \le l(\theta_S, z) \le M$ holds for all $z \in Z$ and all sets $S$, the following inequality holds for all $i \in \{1, 2, \ldots, n\}$:
\[ \mathbb{E}_S[(R - R_{emp})^2] \le \frac{M^2}{2n} + 3M\, \mathbb{E}_{S, z^+ \sim A}[|l(\theta_S, z_i) - l(\theta_{S^i}, z_i)|] + 2\nu\, \mathbb{E}_{S, z^+ \sim A}[l(\theta_S, z^+) - l(\theta_{S^i}, z^+)] + \nu^2. \qquad (29) \]
The proofs of the above two lemmas are similar to [5], replacing $\mathbb{E}_S[R]$ with $\mathbb{E}_S[R_A] + \nu$; thus, we omit the details here. For simplicity, we will drop $\nu$ in the following, which does not affect the behavior of the analysis.
Lemma 5.6 (Efron-Stein [4], [10]). Let $S$ and $S^i$ be defined as above. Let $F: Z^n \to \mathbb{R}$ be any measurable function; then, we have the following inequality for the variance of $F(S)$:
\[ \mathrm{Var}(F(S)) = \mathbb{E}_S[(F(S) - \mathbb{E}_S[F(S)])^2] \le \frac{1}{2} \sum_{i=1}^{n} \mathbb{E}_{S, z^+}[(F(S) - F(S^i))^2]. \qquad (30) \]
Lemma 5.7 (McDiarmid [4], [19]). Let $S$ and $S^i$ be defined as above. Let $F: Z^n \to \mathbb{R}$ be any measurable function for which there exist constants $c_i$ ($i = 1, 2, \ldots, n$) such that $\sup_{S \in Z^n, z^+ \in Z} |F(S) - F(S^i)| \le c_i$. Then,
\[ P_S[F(S) - \mathbb{E}_S[F(S)] \ge \epsilon] \le \exp\!\left( \frac{-2\epsilon^2}{\sum_{i=1}^{n} c_i^2} \right). \qquad (31) \]
5.3 Stability Analysis
Stability means how much the model can be changed if we add one more instance into the training
set after active learning stops. Next, we analyze the model stability by deriving an upper bound
for the model change, which considers the randomness of the initial labeled set. The derived upper
bound indicates that the model will reach a stable status with the increase of the training set size.
Theorem 5.8. Suppose the loss function $l$ satisfies $0 \le l(\theta_S, z) \le M$ for all $z \in Z$ and all sets $S$, as well as the Lipschitz condition. Then, for any initial labeled set $D$, when active learning stops with a training set $S$ of size $n$ according to our SC_MS, if we add one more instance into the training set, the model parameter change has the following upper bound:
\[ \|\theta_{S^+} - \theta_S\| \le O\!\left( \frac{M + \sqrt{M^2 + n\tau}}{nt} \right), \qquad (32) \]
where $\tau \ge 0$ is a slack term, and $\theta_S$ and $\theta_{S^+}$ are the models estimated with $S$ and $S^+$, respectively.
Proof. By Lemma 5.3, we can get the following inequality:
\[ |R - R^i| \le \mathbb{E}_z[|l(\theta_S, z) - l(\theta_{S^i}, z)|] \le \mathbb{E}_z[|l(\theta_S, z) - l(\theta_{S^+}, z)| + |l(\theta_{S^+}, z) - l(\theta_{S^i}, z)|] \le t(\eta_S + \eta_{max}), \qquad (33) \]
where $R^i$ is the generalization error of the model $\theta_{S^i}$ with regard to $P$.
Similarly, we can obtain that
\[ |R_{emp} - R^i_{emp}| \le \sum_{j \ne i} \frac{|l(\theta_S, z_j) - l(\theta_{S^i}, z_j)|}{n} + \frac{|l(\theta_S, z_i) - l(\theta_{S^i}, z^+)|}{n} \le \sum_{j \ne i} \frac{|l(\theta_S, z_j) - l(\theta_{S^+}, z_j)| + |l(\theta_{S^+}, z_j) - l(\theta_{S^i}, z_j)|}{n} + \frac{M}{n} \le t(\eta_S + \eta_{max}) + \frac{M}{n}, \qquad (34) \]
where $R^i_{emp}$ is the empirical risk of the model $\theta_{S^i}$.
Combining Equation (33) and Equation (34), we have that
\[ |R - R_{emp} - (R^i - R^i_{emp})| \le |R - R^i| + |R_{emp} - R^i_{emp}| \le 2t(\eta_S + \eta_{max}) + \frac{M}{n}. \qquad (35) \]
According to Lemma 5.6, let $F(S) = R - R_{emp}$; then, we can bound the variance of $R - R_{emp}$:
\[ \mathrm{Var}(R - R_{emp}) \le \frac{1}{2} \sum_{i=1}^{n} \mathbb{E}_{S, z^+}\!\left[ \left( 2t(\eta_S + \eta_{max}) + \frac{M}{n} \right)^2 \right] \le 8nt^2\eta_{max}^2 + \frac{M^2}{2n} + 4Mt\eta_{max}, \qquad (36) \]
where the last step is due to the fact that $\eta_S \le \eta_{max}$ holds for all $S \in Z^n$, and there exists $\tau \ge 0$ such that
\[ \mathrm{Var}(R - R_{emp}) = 8nt^2\eta_{max}^2 + \frac{M^2}{2n} + 4Mt\eta_{max} - \tau. \qquad (37) \]
By Lemma 5.5, we have the following inequality:
\[ \mathbb{E}_S[(R - R_{emp})^2] \le \frac{M^2}{2n} + 3M\, \mathbb{E}_{S, z^+ \sim A}[t(\eta_S + \eta_{max})] \le \frac{M^2}{2n} + 6Mt\eta_{max}. \qquad (38) \]
Since $\mathrm{Var}(R - R_{emp}) \le \mathbb{E}_S[(R - R_{emp})^2]$, we can get the following inequality:
\[ 8nt^2\eta_{max}^2 + \frac{M^2}{2n} + 4Mt\eta_{max} - \tau \le \frac{M^2}{2n} + 6Mt\eta_{max}. \qquad (39) \]
Treating Equation (39) as a quadratic inequality in $\eta_{max}$, that is, $8nt^2\eta_{max}^2 - 2Mt\eta_{max} - \tau \le 0$, and solving for its positive root gives $\eta_{max} \le \frac{M + \sqrt{M^2 + 8n\tau}}{8nt}$. Furthermore, according to our SC_MS and Equation (25), we have the following upper bound for the model change:
\[ \|\theta_{S^+} - \theta_S\| \le \frac{M + \sqrt{M^2 + 8n\tau}}{8nt} = O\!\left( \frac{M + \sqrt{M^2 + n\tau}}{nt} \right). \qquad (40) \]
Based on the analysis provided above, we obtain the following two conclusions. (1) Given an initial training set $D$, when active learning stops with a training set $S$ of size $n$ according to our SC_MS, the model can change by at most $O\!\left(\frac{M + \sqrt{M^2 + n\tau}}{nt}\right)$ when one more instance is added into the training set. (2) The upper bound decreases with $n$, indicating that the model becomes increasingly stable as the training set grows. In addition, the bound also guarantees good generalization performance, as proved in the next section.
5.4 Generalization Error Analysis
Generalization ability refers to the model's performance on future unseen data. In this section, we give an exponential bound for the generalization error of the model suggested by our SC_MS, in the spirit of Bousquet [5] and Boucheron [4].
Theorem 5.9. Suppose the loss function $l$ satisfies $0 \le l(\theta_S, z) \le M$ for all $z \in Z$ and all sets $S$, as well as the Lipschitz condition. For any initial labeled set $D$, when active learning stops with a training set $S$ of size $n$ according to our SC_MS, the following generalization error bound holds with probability at least $1 - \delta$:
\[ R \le R_{emp} + O\!\left( \left(M + \sqrt{M^2 + n\tau}\right)\left( \frac{1}{n} + \sqrt{\frac{1}{n}\log\frac{1}{\delta}} \right) \right), \qquad (41) \]
where $\tau \ge 0$ is a slack term.
Proof. According to Equation (35), $F(S) = R - R_{emp}$ meets the condition of Lemma 5.7 with $c_i = 4t\eta_{max} + \frac{M}{n}$; then,
\[ P_S\!\left( R - R_{emp} - \mathbb{E}_S[R - R_{emp}] \ge \epsilon \right) \le \delta, \qquad (42) \]
where $\delta = \exp\!\left( -\frac{2n\epsilon^2}{(4nt\eta_{max} + M)^2} \right)$. By Lemma 5.4, we can obtain that $\mathbb{E}_S[R - R_{emp}] \le 2t\eta_{max}$. Thus, the following inequality holds true:
\[ P_S(R - R_{emp} - 2t\eta_{max} \ge \epsilon) \le \delta, \qquad (43) \]
which is equivalent to
\[ P_S(R - R_{emp} - 2t\eta_{max} \le \epsilon) \ge 1 - \delta. \qquad (44) \]
According to $\delta$, we can solve for $\epsilon$, obtaining $\epsilon = (4nt\eta_{max} + M)\sqrt{\frac{1}{2n}\log\frac{1}{\delta}}$, and the following inequality holds true with probability at least $1 - \delta$:
\[ R \le R_{emp} + 2t\eta_{max} + (4nt\eta_{max} + M)\sqrt{\frac{1}{2n}\log\frac{1}{\delta}} \le R_{emp} + \frac{M + \sqrt{M^2 + 8n\tau}}{4n} + \frac{3M + \sqrt{M^2 + 8n\tau}}{2}\sqrt{\frac{1}{2n}\log\frac{1}{\delta}} = R_{emp} + O\!\left( \left(M + \sqrt{M^2 + n\tau}\right)\left( \frac{1}{n} + \sqrt{\frac{1}{n}\log\frac{1}{\delta}} \right) \right). \qquad (45) \]
Based on the above theoretical justification, the following two conclusions are derived. (1) The
above theorem provides a generalization error bound for the model obtained when active learning
stops with the training set S. (2) Optimizing model stability can guarantee the generalization
ability, as the upper bound of model change is a constituent term of the generalization error bound
and the two bounds are positively related.
6 EXPERIMENTS
In this section, we carry out extensive experiments to validate the effectiveness of our proposed SC_MS.
6.1 Experiments on UCI Data Sets
6.1.1 Experimental Setup. To validate the effectiveness of the proposed strategy, we conduct
experiments on nine benchmark datasets of various domains from the UCI machine-learning
repository¹: Biodeg, Firm-Teacher-Clave-Direction-Classification, Ionosphere, Spambase, WDBC, and Letter. Since Letter is a multiclass dataset, we select four pairs of letters (i.e., D-vs-P, E-vs-F, M-vs-N, U-vs-V) that are relatively difficult to distinguish and build a binary-class dataset for each pair. Similarly, we select the reverse clave and forward clave classes from Firm-Teacher-Clave-Direction-Classification and build the dataset Clave.
Each dataset is randomly divided into three disjoint subsets: the initial labeled set D (5%) to train
the initial model, the unlabeled pool set U (75%) to select the most informative examples, and the
test set T (20%) to evaluate different stopping criteria. The features are normalized as suggested
by [8].
Two learning models are used as the base learner: LR and SVMs. The active-learning algorithms
used here are US and EMC. In this study, 1% of all examples are selected in each learning iteration. To avoid random fluctuation, each experiment is repeated 20 times with random initialization, and the averaged results are reported. In each experiment, all stopping criteria use the same initialization for fair comparison.
For the parameter η, we find its value by simulating the active-learning process using the labeled set. Since the size of the initial labeled set is limited, we enlarge it by assigning pseudolabels to instances in the pool set whose labels are highly certain under the current classifier, which is inspired by [18]. Then, we separate the enlarged labeled set into D′, U′, and T′, with the same proportions as in the real learning process, and simulate the active-learning process 20 times with random initialization. Finally, we choose the threshold with the best averaged performance from the candidate set [0.2, 0.5, 1.0, 1.5, 2.0, 2.5].
6.1.2 Implementation. An initial model is built with the seed training set. Then, we compute the stopping-criterion score for the unlabeled examples and check the stopping condition. If the criterion is satisfied, we cease the active-learning process. Otherwise, an informativeness score is computed for each unlabeled instance according to the active-learning algorithm in use (US or EMC in this study). Then, the top-k instances with the highest scores are selected and added into the training set with their true labels. In this study, we use the greedy batch mode and k is set to 1% of the whole dataset. The above process is repeated until the stopping criterion is satisfied or the whole unlabeled set has been added into the training set.
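To make this procedure concrete, the following is a self-contained sketch of one such batch-mode loop, using scikit-learn's LogisticRegression as the base learner, US as the sampling algorithm, and the binary-LR stopping score from Section 4.3; the synthetic data, threshold, and batch size are placeholders rather than the settings used in the experiments:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def expected_model_change(clf, X_pool):
    # Binary LR form of Eq. (16): E_y ||G(x)|| = 2 * p(1|x) * (1 - p(1|x)) * ||x||
    p = clf.predict_proba(X_pool)[:, 1]
    return 2.0 * p * (1.0 - p) * np.linalg.norm(X_pool, axis=1)

# Toy data standing in for the initial labeled set D and the pool U.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
labeled = list(range(50))
pool = list(range(50, 1000))

eta, k = 0.5, 10                                   # illustrative threshold and batch size
clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

while pool:
    if np.all(expected_model_change(clf, X[pool]) <= eta):
        break                                      # SC_MS satisfied: the model is stable
    # Uncertainty sampling: pick the k pool instances whose posterior is closest to 0.5.
    uncertainty = np.abs(clf.predict_proba(X[pool])[:, 1] - 0.5)
    batch = [pool[i] for i in np.argsort(uncertainty)[:k]]
    labeled += batch
    pool = [i for i in pool if i not in batch]
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
```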
6.1.3 Comparison Methods. To test the effectiveness of our model stability–based stopping
criterion (MS), we compare it with the following six state-of-the-art competitors: (a) stop set
prediction-based stopping criterion (SP), (b) gradient-based stopping criterion (GRAD), (c) overall-uncertainty (OU), (d) max-confidence (MC), (e) min-error (ME), and (f) QUIRE. The experimental settings for these competitors are as follows.
(1) For SP, Bloodgood [3] uses the Kappa statistic to measure the agreement among the most recently learned models. We follow their settings in our reimplementation of this method.
(2) For GRAD, we use a window of size 100 and a threshold of 0.00005, as suggested by
Laws [14].
(3) For OU, MC, and ME, Zhu [35] shows that a start threshold of 0.1 with the threshold
update technique performs best. Thus, we follow their setting in our implementation.
1 http://archive.ics.uci.edu/ml/.
(4) For QUIRE [12], we stop the active-learning process when the scores of informativeness
and representativeness for all unlabeled examples are lower than a threshold that is automatically set following the principle of cross-validation. Since it is only applicable for
SVMs, we compare it only with SVMs.
Because some of these methods are specific to probabilistic models, we utilize the sigmoid function to transform SVMs’ outputs to posterior probabilities as suggested by [20].
6.1.4 Evaluation Metrics. In active-learning scenarios, when the model first reaches the highest
performance, which is the highest classification accuracy in this study, it is suggested that the
process can stop. We refer to this point as the Highest Performance Point (HPP). With the HPP,
we use the following two metrics [35] to evaluate each stopping criterion.
(a) The difference between the training set size at the HPP (denoted by $Size_{HPP}$) and the size at the stopping point suggested by a certain stopping criterion (denoted by $Size_{SC}$):
\[ \Delta_{Size} = Size_{SC} - Size_{HPP}. \qquad (46) \]
The closer $\Delta_{Size}$ is to zero, the better the stopping criterion.
(b) The difference between the highest performance (denoted by $Acc_{HPP}$) and the performance obtained at the stopping point with a stopping criterion (denoted by $Acc_{SC}$):
\[ \Delta_{Acc} = Acc_{HPP} - Acc_{SC}. \qquad (47) \]
The smaller $\Delta_{Acc}$, the better the stopping criterion.
Combining the above two metrics, for the stopping criterion that obtains the highest accuracy
among the competitors, the smaller the ΔSize , the better the stopping criterion.
6.1.5 Experimental Results with Uncertainty Sampling Strategy. The comparison results of ΔSize
and ΔAcc for these seven criteria are presented in Table 1 and Table 2 for LR and SVMs, taking US
as the active-learning method. The values having the best performance are highlighted in bold.
“Initial” denotes the initial settings before selecting instances and “All” shows the results after
labeling and adding the whole unlabeled set U into the training set.
Before we compare the results, it is necessary to stress the following three facts. First, all results
that we present are averages; therefore, sometimes two methods could have very similar ΔSize
but wildly different average classification accuracy. Second, some stopping points would not lie
on their active-learning curves since the results are averages. Third, a model trained on more data may have a lower classification accuracy, an effect called "less is more" in active learning that was first reported by Schohn in [21].
As shown in these tables, MS is observed to perform the best among these methods in most
cases, demonstrating the effectiveness of MS. SP performs poorly in most cases. GRAD achieves a
small ΔAcc on some datasets but also a relatively large ΔSize , which may be because the uncertainty
or performance is not stable until later iterations for these datasets. OU and MC perform well in terms of ΔAcc but poorly in terms of ΔSize, and the relatively large ΔSize indicates that OU and MC tend to stop active learning at later iterations with the goal of achieving high performance. ME performs well in terms of ΔSize but poorly in terms of ΔAcc, and its ΔSize is below zero in many cases, indicating that it stops before some informative instances are selected, which results
in low performance. QUIRE performs well on some datasets but not consistently.
MS does not perform as well in some cases. A possible reason is that the model change is only an approximation, that is, the true change is approximated by an expectation over all possible labels, which may hurt its performance in some cases.
In addition, we present two intuitive examples of active-learning curves with stopping points based on the US sampling strategy in Figure 2. As observed, MS stops at the point that achieves high accuracy with relatively fewer training examples, demonstrating that MS can stop active learning at an appropriate point.
Table 1. Results of ΔSize and ΔAcc on UCI Datasets for LR with US Strategy

Dataset | Metric | Initial | HPP | All | MS | SP | GRAD | OU | MC | ME
Biodeg | Size / ΔSize | 52.0 | 317.1 | 843.0 | −84.2 | 187.0 | 27.8 | 360.3 | 437.3 | 134.8
Biodeg | Acc / ΔAcc | 68.87% | 87.5% | 84.72% | 12.31% | 1.89% | 3.23% | 2.64% | 2.81% | 1.67%
Clave | Size / ΔSize | 430.0 | 1694.2 | 6880.0 | −541.8 | −713.8 | 3285.2 | −619.2 | 550.4 | −1264.2
Clave | Acc / ΔAcc | 95.71% | 96.68% | 96.54% | 0.14% | 0.18% | 0.18% | 0.15% | 0.16% | 0.97%
Ionosphere | Size / ΔSize | 17.0 | 123.4 | 280.0 | 52.6 | −8.6 | 68.0 | 141.4 | 148.5 | 31.0
Ionosphere | Acc / ΔAcc | 70.21% | 89.86% | 87.11% | 2.32% | 3.38% | 2.68% | 2.89% | 3.1% | 2.46%
Spambase | Size / ΔSize | 230.0 | 867.9 | 3680.0 | 122.7 | 318.9 | 953.7 | 1576.3 | 2014.8 | 555.1
Spambase | Acc / ΔAcc | 77.63% | 92.53% | 91.44% | 0.37% | 0.57% | 1.64% | 0.88% | 0.93% | 0.78%
WDBC | Size / ΔSize | 28.0 | 83.5 | 454.0 | 39.0 | 54.0 | 370.5 | 127.5 | 220.5 | 42.0
WDBC | Acc / ΔAcc | 88.7% | 96.96% | 95.0% | 1.3% | 1.74% | 1.96% | 1.3% | 1.96% | 1.74%
D-vs-P | Size / ΔSize | 80.0 | 174.4 | 1286.0 | 51.2 | 36.8 | 1111.0 | 147.2 | 356.8 | 4.8
D-vs-P | Acc / ΔAcc | 95.31% | 98.85% | 98.7% | 0.09% | 0.19% | 0.16% | 0.09% | 0.09% | 0.34%
E-vs-F | Size / ΔSize | 77.0 | 224.0 | 1234.0 | 0.0 | 6.0 | 903.0 | 66.0 | 276.0 | −36.0
E-vs-F | Acc / ΔAcc | 94.56% | 97.73% | 97.48% | 0.19% | 0.32% | 0.26% | 0.39% | 0.32% | 0.45%
M-vs-N | Size / ΔSize | 78.0 | 247.6 | 1259.0 | 24.0 | −67.2 | 1011.4 | 328.0 | 558.4 | −30.4
M-vs-N | Acc / ΔAcc | 93.58% | 97.18% | 96.84% | 0.32% | 0.6% | 0.35% | 0.44% | 0.41% | 0.66%
U-vs-V | Size / ΔSize | 78.0 | 158.0 | 1260.0 | 42.0 | 44.0 | 1102.0 | 206.0 | 410.0 | 28.0
U-vs-V | Acc / ΔAcc | 95.62% | 98.86% | 98.62% | 0.12% | 0.12% | 0.24% | 0.16% | 0.2% | 0.12%

Note: The thresholds of MS are 0.5, 1.0, 0.5, 0.5, 1.0, 0.5, 0.5, 1.0, and 0.5, respectively.
Table 2. Results of ΔSize and ΔAcc on UCI Datasets for SVMs with US Strategy

Dataset | Metric | Initial | HPP | All | MS | SP | GRAD | OU | MC | ME | QUIRE
Biodeg | Size / ΔSize | 52.0 | 393.0 | 843.0 | 26.4 | 26.4 | 324.8 | 140.8 | 299.2 | −44.0 | −211.0
Biodeg | Acc / ΔAcc | 73.87% | 88.3% | 86.7% | 1.04% | 1.04% | 1.98% | 1.51% | 1.32% | 1.7% | 2.45%
Clave | Size / ΔSize | 430.0 | 1204.0 | 6880.0 | −172.0 | −103.2 | 3113.2 | −258.0 | 894.4 | −774.0 | 34.0
Clave | Acc / ΔAcc | 96.15% | 96.76% | 96.69% | 0.06% | 0.08% | 0.08% | 0.12% | 0.08% | 0.6% | 0.08%
Ionosphere | Size / ΔSize | 17.0 | 133.8 | 280.0 | 0.8 | −8.0 | 87.0 | 73.2 | 118.0 | −3.2 | −101.0
Ionosphere | Acc / ΔAcc | 67.75% | 88.87% | 87.46% | 2.82% | 3.66% | 1.83% | 2.96% | 1.83% | 2.96% | 10.28%
Spambase | Size / ΔSize | 230.0 | 1104.0 | 3680.0 | 92.0 | 276.0 | 2438.0 | 368.0 | 1380.0 | −92.0 | −322.0
Spambase | Acc / ΔAcc | 86.21% | 91.97% | 90.88% | 0.33% | 0.54% | 1.09% | 0.33% | 0.54% | 0.98% | 1.41%
WDBC | Size / ΔSize | 28.0 | 57.4 | 454.0 | 70.2 | 51.6 | 325.8 | 38.4 | 79.2 | 34.8 | 37.0
WDBC | Acc / ΔAcc | 93.04% | 98.17% | 97.39% | 0.61% | 0.78% | 0.78% | 0.78% | 0.7% | 0.78% | 1.13%
D-vs-P | Size / ΔSize | 80.0 | 160.0 | 1286.0 | 152.0 | 76.8 | 692.8 | 44.8 | 144.0 | −27.2 | 99.0
D-vs-P | Acc / ΔAcc | 96.55% | 98.91% | 98.54% | 0.31% | 0.4% | 0.34% | 0.4% | 0.31% | 1.49% | 0.37%
E-vs-F | Size / ΔSize | 77.0 | 192.5 | 1234.0 | 102.0 | 87.0 | 472.5 | 51.0 | 130.5 | 52.5 | 90.0
E-vs-F | Acc / ΔAcc | 94.24% | 98.09% | 97.48% | 0.29% | 0.52% | 0.65% | 0.49% | 0.42% | 0.71% | 0.58%
M-vs-N | Size / ΔSize | 78.0 | 193.2 | 1259.0 | 192.0 | 97.6 | 852.8 | 83.2 | 291.2 | −28.8 | 53.0
M-vs-N | Acc / ΔAcc | 94.84% | 97.69% | 97.31% | 0.32% | 0.51% | 0.38% | 0.47% | 0.41% | 1.9% | 0.32%
U-vs-V | Size / ΔSize | 78.0 | 146.8 | 1260.0 | 102.4 | 38.4 | 528.0 | 24.0 | 100.8 | −68.8 | 88.0
U-vs-V | Acc / ΔAcc | 96.85% | 99.43% | 99.31% | 0.06% | 0.19% | 0.09% | 0.16% | 0.16% | 2.59% | 0.22%

Note: The thresholds of MS are 1.0, 1.0, 2.0, 1.0, 2.0, 2.0, 2.0, 2.0, and 2.0, respectively.
Fig. 2. Examples of active-learning curves with stopping points based on US sampling strategy.
Fig. 3. Examples of active-learning curves with stopping points based on EMC sampling strategy.
In summary, our MS method outperforms the state-of-the-art criteria on most datasets for both LR and SVMs with US, indicating the effectiveness of the proposed method. We attribute this to SP and GRAD being based on performance: they suggest a stopping point when they observe a stable performance on the stop set or on a window of recently selected instances. However, before some important informative unlabeled instances are added into the training set, the capability of the model will not change, that is, the model's performance on the stop set and on recently selected instances will appear stable. The OU method calculates the average uncertainty value of all remaining unlabeled examples. However, even if some important unlabeled instances have high uncertainty, with a large pool set the average value will still be very small. MC considers the maximum uncertainty of instances in the pool set, which can easily be influenced by outliers. The performance of ME is inconsistent: it works well on some datasets but performs poorly on others, likely because the expected error estimation is highly correlated with the characteristics of the datasets. QUIRE is based on both informativeness and representativeness, which are hard to balance; therefore, its performance cannot be guaranteed.
6.1.6 Experimental Results with Expected Model Change Strategy. Table 3 and Table 4 present
the results for LR and SVMs taking EMC as the active-learning method. Similarly, MS outperforms
the baselines most of the time, which demonstrates the effectiveness of our proposed method. In
addition, we present two intuitive examples of active learning curves with stopping points based
on the EMC strategy in Figure 3. Similar results can be observed.
Table 3. Results of ΔSize and ΔAcc on UCI Datasets for LR with EMC Strategy

Dataset | Metric | Initial | HPP | All | MS | SP | GRAD | OU | MC | ME
Biodeg | Size / ΔSize | 52.0 | 387.0 | 843.0 | 149.0 | 124.0 | 227.0 | 305.0 | 388.0 | 107.0
Biodeg | Acc / ΔAcc | 72.78% | 86.66% | 84.3% | 1.42% | 1.75% | 2.56% | 2.16% | 2.22% | 1.89%
Clave | Size / ΔSize | 430.0 | 1221.0 | 6880.0 | −149.0 | −235.0 | 4208.0 | −120.0 | 1350.0 | −791.0
Clave | Acc / ΔAcc | 95.84% | 96.65% | 96.47% | 0.17% | 0.19% | 0.19% | 0.19% | 0.18% | 0.8%
Ionosphere | Size / ΔSize | 17.0 | 154.0 | 280.0 | 30.0 | −27.0 | 126.0 | 111.0 | 118.0 | 14.0
Ionosphere | Acc / ΔAcc | 69.08% | 90.56% | 88.38% | 1.55% | 5.28% | 2.18% | 2.39% | 2.32% | 2.11%
Spambase | Size / ΔSize | 230.0 | 2052.0 | 3680.0 | −1311.0 | −741.0 | 1559.0 | 920.0 | 1615.0 | −207.0
Spambase | Acc / ΔAcc | 79.83% | 91.83% | 91.26% | 3.29% | 1.03% | 0.6% | 0.61% | 0.58% | 0.24%
WDBC | Size / ΔSize | 28.0 | 79.0 | 454.0 | 69.0 | 69.0 | 375.0 | 150.0 | 246.0 | 57.0
WDBC | Acc / ΔAcc | 85.65% | 97.39% | 95.22% | 1.3% | 1.3% | 2.17% | 1.74% | 2.17% | 1.74%
D-vs-P | Size / ΔSize | 80.0 | 170.0 | 1286.0 | −3.0 | 29.0 | 1116.0 | 154.0 | 381.0 | 64.0
D-vs-P | Acc / ΔAcc | 95.96% | 98.82% | 98.39% | 0.12% | 0.37% | 0.43% | 0.37% | 0.43% | 0.25%
E-vs-F | Size / ΔSize | 77.0 | 190.0 | 1234.0 | 9.0 | 41.0 | 954.0 | 113.0 | 344.0 | 27.0
E-vs-F | Acc / ΔAcc | 95.6% | 98.25% | 97.83% | 0.26% | 0.58% | 0.42% | 0.39% | 0.36% | 0.65%
M-vs-N | Size / ΔSize | 78.0 | 318.0 | 1259.0 | −12.0 | −80.0 | 941.0 | 256.0 | 512.0 | −48.0
M-vs-N | Acc / ΔAcc | 93.59% | 96.84% | 96.6% | 0.24% | 0.79% | 0.24% | 0.32% | 0.32% | 0.55%
U-vs-V | Size / ΔSize | 78.0 | 177.0 | 1260.0 | 13.0 | 26.0 | 1083.0 | 195.0 | 416.0 | 32.0
U-vs-V | Acc / ΔAcc | 94.45% | 98.86% | 98.68% | 0.13% | 0.19% | 0.19% | 0.19% | 0.19% | 0.25%

Note: The thresholds of MS are 0.5, 1.0, 1.0, 0.2, 0.5, 1.0, 1.0, 0.5, and 0.5, respectively.
Table 4. Results of ΔSize and ΔAcc on UCI Datasets for SVMs with EMC Strategy

Dataset | Metric | Initial | HPP | All | MS | SP | GRAD | OU | MC | ME | QUIRE
Biodeg | Size / ΔSize | 52.0 | 470.0 | 843.0 | −6.0 | −23.0 | −137.0 | 373.0 | 373.0 | 317.0 | −351.0
Biodeg | Acc / ΔAcc | 75.16% | 86.91% | 85.85% | 0.55% | 1.38% | 2.87% | 1.06% | 1.06% | 0.98% | 8.41%
Clave | Size / ΔSize | 430.0 | 2648.0 | 6880.0 | −1437.0 | −1573.0 | −1358.0 | 104.0 | 4232.0 | −2218.0 | 762.0
Clave | Acc / ΔAcc | 96.06% | 96.73% | 96.54% | 0.12% | 0.16% | 0.16% | 0.17% | 0.19% | 0.66% | 0.15%
Ionosphere | Size / ΔSize | 17.0 | 107.0 | 280.0 | 59.0 | 69.0 | 52.0 | 173.0 | 173.0 | 139.0 | −67.0
Ionosphere | Acc / ΔAcc | 70.25% | 89.17% | 87.15% | 1.58% | 1.85% | 2.55% | 2.02% | 2.02% | 2.29% | 11.0%
Spambase | Size / ΔSize | 230.0 | 2484.0 | 3680.0 | −718.0 | −1187.0 | −340.0 | 1196.0 | 1196.0 | −386.0 | −1829.0
Spambase | Acc / ΔAcc | 85.17% | 90.92% | 90.51% | 0.22% | 1.8% | 1.59% | 0.41% | 0.41% | 0.24% | 4.33%
WDBC | Size / ΔSize | 28.0 | 83.0 | 454.0 | 4.0 | 15.0 | 157.0 | 134.0 | 361.0 | 371.0 | 9.0
WDBC | Acc / ΔAcc | 92.05% | 98.01% | 96.27% | 0.37% | 1.12% | 1.12% | 1.37% | 1.49% | 1.74% | 0.62%
D-vs-P | Size / ΔSize | 80.0 | 158.0 | 1286.0 | 72.0 | 16.0 | 192.0 | 98.0 | 979.0 | −78.0 | 118.0
D-vs-P | Acc / ΔAcc | 96.77% | 98.63% | 98.48% | 0.06% | 0.31% | 0.09% | 0.12% | 0.16% | 1.86% | 0.16%
E-vs-F | Size / ΔSize | 77.0 | 316.0 | 1234.0 | −46.0 | −56.0 | 26.0 | 19.0 | 907.0 | 918.0 | 166.0
E-vs-F | Acc / ΔAcc | 94.15% | 98.06% | 97.39% | 0.35% | 0.86% | 0.71% | 0.5% | 0.65% | 0.67% | 0.54%
M-vs-N | Size / ΔSize | 78.0 | 272.0 | 1259.0 | 6.0 | −10.0 | 119.0 | 170.0 | 978.0 | 65.0 | −30.0
M-vs-N | Acc / ΔAcc | 94.83% | 97.58% | 97.12% | 0.59% | 0.57% | 0.68% | 0.51% | 0.44% | 1.69% | 1.04%
U-vs-V | Size / ΔSize | 78.0 | 180.0 | 1260.0 | 29.0 | −27.0 | 246.0 | 684.0 | 338.0 | 489.0 | 109.0
U-vs-V | Acc / ΔAcc | 97.1% | 99.31% | 99.15% | 0.09% | 0.38% | 0.19% | 0.16% | 0.16% | 1.23% | 0.09%

Note: The thresholds of MS are 1.5, 2.5, 1.0, 0.2, 1.5, 2.0, 2.0, 2.0, and 1.5, respectively.
6.1.7 Efficiency Comparison. Indeed, the time required by the whole active-learning process consists of two parts: the time cost of the active-learning algorithm and the time cost of the stopping judgment. To validate the efficiency, we compare the averaged CPU runtime when the entire active-learning process stops. Due to space limitations, we only present the results for SVMs with the US strategy; similar observations can be obtained in the other settings.
Table 5. CPU Runtime (in Seconds) of the Entire Active Learning Process for SVMs with US Strategy

Dataset | Exam. × Fea. (U) | MS | SP | GRAD | OU | MC | ME | QUIRE
Biodeg | 1003 × 41 | 21.655 | 21.846 | 32.005 | 22.788 | 30.728 | 13.218 | 121.985
Clave | 8170 × 16 | 40.035 | 30.44 | 232.971 | 20.34 | 84.902 | 0.006 | 1565.004
Ionosphere | 334 × 34 | 5.507 | 5.632 | 7.351 | 6.846 | 8.452 | 4.355 | 9.893
Spambase | 4371 × 58 | 138.743 | 106.336 | 364.799 | 103.159 | 233.793 | 55.686 | 955.127
WDBC | 541 × 30 | 4.561 | 3.996 | 10.935 | 2.523 | 3.947 | 2.369 | 53.857
D-vs-P | 1528 × 16 | 8.773 | 5.817 | 19.734 | 3.42 | 6.194 | 1.415 | 164.752
E-vs-F | 1466 × 16 | 8.153 | 7.506 | 15.621 | 4.55 | 6.794 | 4.547 | 185.895
M-vs-N | 1497 × 16 | 12.081 | 8.042 | 26.008 | 5.625 | 11.744 | 2.329 | 156.94
U-vs-V | 1499 × 16 | 6.467 | 3.971 | 14.84 | 2.526 | 4.629 | 0.001 | 146.474
Table 5 shows the results together with the size of the pool set U. As shown in Table 5, we can see that ME needs the least CPU time to stop active learning in most cases due to fewer learning iterations. MS, SP, GRAD, OU, and MC have similar time costs. QUIRE is relatively inefficient and costs much more time than the other six methods because it requires many complicated matrix computations during the stopping judgment, which is time-consuming.
6.2 Experiments on Image Categorization
6.2.1 Experimental Setup. In this section, we apply our method to a large-scale image dataset
for image categorization: ImageNet2 . A detailed description can be found in [9].
For diversity, we construct image classification datasets at three levels of the mammal subtree, representing three perceived difficulty levels. We randomly select seven categories of domestic dogs and create a total of fourteen datasets at the following three levels: (1) one kind of dog versus the remaining dogs: Basenji, Chihuahua, English setter, German police dog, Standard poodle, Vizsla, and Yorkshire terrier; (2) dogs versus canids (i.e., wild dog, wolf, and fox): Wild dog, Wolf, and Fox; (3) dogs versus other mammals (i.e., rabbit, cat, panda, and elephant): Rabbit, Cat, Panda, and Elephant. Example images from ImageNet are presented in Figure 4.
We then construct a balanced binary-class dataset for each versus pair by random sampling.
Images are represented by the publicly available bag-of-words features [9]. We select the top 50
features using Principal Component Analysis (PCA), a dimensionality reduction step commonly applied to vision data [11]. The other settings follow the experiments on the UCI datasets.
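As a rough illustration of this preprocessing step, the following sketch projects bag-of-words descriptors onto their top 50 principal components using scikit-learn; the array shapes and variable names are illustrative stand-ins, not values taken from the experiments above.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_features(bow_features, n_components=50):
    # Project bag-of-words image descriptors onto their top principal components.
    pca = PCA(n_components=n_components)
    return pca.fit_transform(bow_features)

# Stand-in example: 2,000 images, each described by 1,000 visual-word counts.
X = np.random.rand(2000, 1000)
X_reduced = reduce_features(X)   # shape: (2000, 50)
```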
6.2.2 Experimental Results with Uncertainty Sampling Strategy. Table 6 and Table 7 present the
results for the ImageNet datasets with US as the sampling method.
MS is still observed to perform the best, which is consistent with the results on the UCI datasets. SP sometimes performs well on ΔSize but poorly on ΔAcc: it stops either too early or too late, which may be attributed to the stop set. In some cases, the performance on the stop set appears stable before important instances have been added, so SP stops early; in other cases, the performance on the stop set does not stabilize until much later iterations, so SP stops late.
Fig. 4. Examples of images from the ImageNet: dog, wild dog, wolf, fox, rabbit, cat, panda, and elephant.
Table 6. Results of ΔSize and ΔAcc on ImageNet Datasets for LR with US Strategy

| Dataset | | Initial | HPP | All | | MS | SP | GRAD | OU | MC | ME |
| Basenji | Size | 184.0 | 1999.9 | 2951.0 | ΔSize | 806.6 | 951.1 | −1534.7 | 951.1 | 951.1 | 951.1 |
| | Acc | 50.69% | 62.54% | 61.27% | ΔAcc | 1.06% | 1.27% | 4.52% | 1.27% | 1.27% | 1.27% |
| Chihuahua | Size | 175.0 | 2198.0 | 2800.0 | ΔSize | 465.5 | 602.0 | −1683.5 | 602.0 | 602.0 | 584.5 |
| | Acc | 51.86% | 65.4% | 64.6% | ΔAcc | 0.7% | 0.8% | 4.36% | 0.8% | 0.8% | 0.87% |
| English setter | Size | 242.0 | 2765.5 | 3881.0 | ΔSize | 1015.6 | 1115.5 | −2048.2 | 1115.5 | 1115.5 | 1103.8 |
| | Acc | 53.24% | 63.04% | 62.23% | ΔAcc | 0.72% | 0.8% | 5.03% | 0.8% | 0.8% | 0.82% |
| German police dog | Size | 174.0 | 2255.8 | 2785.0 | ΔSize | 348.2 | 529.2 | −1813.0 | 529.2 | 529.2 | 523.7 |
| | Acc | 51.31% | 62.82% | 61.8% | ΔAcc | 1.07% | 1.02% | 5.87% | 1.02% | 1.02% | 0.98% |
| Standard poodle | Size | 195.0 | 2386.8 | 3129.0 | ΔSize | 308.1 | 641.1 | −1868.1 | 735.0 | 741.3 | 331.5 |
| | Acc | 58.52% | 71.92% | 71.2% | ΔAcc | 0.63% | 0.7% | 4.71% | 0.74% | 0.73% | 0.74% |
| Vizsla | Size | 233.0 | 2601.8 | 3734.0 | ΔSize | 799.0 | 1014.8 | −1997.5 | 1132.2 | 1132.2 | 1071.6 |
| | Acc | 53.92% | 66.56% | 65.57% | ΔAcc | 0.91% | 1.08% | 6.26% | 1.0% | 1.0% | 1.02% |
| Yorkshire terrier | Size | 304.0 | 3044.4 | 4874.0 | ΔSize | 919.0 | 639.5 | −2229.2 | 1629.7 | 1766.3 | 662.2 |
| | Acc | 66.6% | 73.58% | 72.76% | ΔAcc | 0.82% | 0.81% | 3.12% | 0.75% | 0.8% | 0.77% |
| Wild dog | Size | 242.0 | 2357.5 | 3875.0 | ΔSize | 1054.4 | 497.9 | −1717.1 | 1517.5 | 1517.5 | 999.7 |
| | Acc | 62.84% | 71.72% | 70.68% | ΔAcc | 0.8% | 0.78% | 3.46% | 1.04% | 1.04% | 0.77% |
| Wolf | Size | 561.0 | 5825.0 | 8986.0 | ΔSize | 1736.0 | −545.1 | −2023.4 | 3136.0 | 3140.2 | 1396.3 |
| | Acc | 69.05% | 73.7% | 73.09% | ΔAcc | 0.34% | 0.43% | 1.31% | 0.54% | 0.55% | 0.37% |
| Fox | Size | 478.0 | 5402.8 | 7654.0 | ΔSize | 931.2 | −744.0 | −1784.4 | 2251.2 | 2251.2 | 998.4 |
| | Acc | 67.48% | 73.72% | 73.32% | ΔAcc | 0.32% | 0.64% | 1.2% | 0.4% | 0.4% | 0.36% |
| Rabbit | Size | 349.0 | 4444.0 | 5593.0 | ΔSize | 266.0 | −308.0 | −2626.2 | 1149.0 | 1149.0 | 70.0 |
| | Acc | 63.37% | 74.13% | 73.41% | ΔAcc | 0.56% | 0.72% | 3.17% | 0.72% | 0.72% | 0.69% |
| Cat | Size | 732.0 | 5045.7 | 11722.0 | ΔSize | 5871.4 | 4043.2 | −2926.0 | 6676.3 | 6676.3 | 5694.0 |
| | Acc | 62.94% | 67.32% | 66.14% | ΔAcc | 0.97% | 0.74% | 1.45% | 1.18% | 1.18% | 0.95% |
| Panda | Size | 351.0 | 3862.7 | 5628.0 | ΔSize | 79.3 | −714.0 | −1283.7 | 1397.7 | 1612.3 | −443.3 |
| | Acc | 71.65% | 81.0% | 80.58% | ΔAcc | 0.43% | 0.62% | 1.77% | 0.43% | 0.38% | 0.48% |
| Elephant | Size | 392.0 | 3994.4 | 6282.0 | ΔSize | 79.0 | −948.0 | 686.5 | 1769.6 | 2125.1 | −434.5 |
| | Acc | 73.42% | 82.72% | 82.2% | ΔAcc | 0.46% | 0.59% | 1.35% | 0.48% | 0.48% | 0.48% |

Note: The thresholds of MS are 1.0, 0.5, 1.0, 1.0, 1.0, 1.0, 0.5, 1.0, 2.0, 2.0, 0.5, 2.0, 0.5, and 0.5, respectively.
Table 7. Results of ΔSize and ΔAcc on ImageNet Datasets for SVMs with US Strategy

| Dataset | | Initial | HPP | All | | MS | SP | GRAD | OU | MC | ME | QUIRE |
| Basenji | Size | 184.0 | 1659.2 | 2951.0 | ΔSize | 799.9 | 1261.0 | −1005.7 | 1291.8 | 1291.8 | 1289.2 | −956.0 |
| | Acc | 50.86% | 62.62% | 60.88% | ΔAcc | 1.62% | 1.84% | 4.1% | 1.73% | 1.73% | 1.73% | 4.08% |
| Chihuahua | Size | 175.0 | 1669.5 | 2800.0 | ΔSize | 533.8 | 960.8 | −1030.8 | 1130.5 | 1130.5 | 1125.3 | −845.0 |
| | Acc | 49.03% | 65.56% | 64.39% | ΔAcc | 1.16% | 1.19% | 4.44% | 1.18% | 1.18% | 1.29% | 2.87% |
| English setter | Size | 242.0 | 2458.0 | 3881.0 | ΔSize | 834.7 | 1389.5 | −1566.8 | 1423.0 | 1423.0 | 1412.3 | −1509.0 |
| | Acc | 50.72% | 62.16% | 60.91% | ΔAcc | 1.15% | 1.35% | 4.15% | 1.25% | 1.25% | 1.26% | 3.68% |
| German police dog | Size | 174.0 | 2048.3 | 2785.0 | ΔSize | 306.3 | 657.1 | −1445.2 | 736.8 | 736.8 | 728.4 | −1307.0 |
| | Acc | 49.69% | 63.34% | 62.27% | ΔAcc | 1.32% | 1.43% | 5.02% | 1.07% | 1.07% | 1.06% | 5.15% |
| Standard poodle | Size | 195.0 | 1822.0 | 3129.0 | ΔSize | 477.5 | 687.2 | −1058.8 | 1298.3 | 1303.1 | 600.3 | −1028.0 |
| | Acc | 63.39% | 72.2% | 71.01% | ΔAcc | 1.06% | 1.07% | 3.27% | 1.1% | 1.12% | 1.16% | 2.73% |
| Vizsla | Size | 233.0 | 2222.0 | 3734.0 | ΔSize | 667.4 | 1056.3 | −1324.7 | 1512.0 | 1512.0 | 1429.3 | −1279.0 |
| | Acc | 55.9% | 65.81% | 64.68% | ΔAcc | 0.97% | 1.02% | 3.3% | 1.13% | 1.13% | 1.09% | 3.17% |
| Yorkshire terrier | Size | 304.0 | 2642.7 | 4874.0 | ΔSize | 832.7 | 1121.8 | −821.2 | 2128.3 | 2198.8 | 683.3 | −1098.0 |
| | Acc | 67.98% | 73.77% | 72.89% | ΔAcc | 0.87% | 0.87% | 2.12% | 0.83% | 0.89% | 0.9% | 1.71% |
| Wild dog | Size | 242.0 | 2009.5 | 3875.0 | ΔSize | 686.7 | 633.9 | −763.6 | 1865.5 | 1865.5 | 1121.1 | −1042.0 |
| | Acc | 66.44% | 72.47% | 71.37% | ΔAcc | 0.86% | 0.87% | 1.95% | 1.09% | 1.09% | 0.88% | 2.08% |
| Wolf | Size | 561.0 | 5970.6 | 8986.0 | ΔSize | −302.4 | −638.4 | −323.5 | 2923.2 | 2992.9 | 616.0 | −2251.0 |
| | Acc | 70.36% | 73.63% | 73.01% | ΔAcc | 0.55% | 0.6% | 1.12% | 0.55% | 0.61% | 0.45% | 0.87% |
| Fox | Size | 478.0 | 3222.0 | 7654.0 | ΔSize | 1563.2 | 1582.4 | 2028.8 | 4412.8 | 4432.0 | 2593.6 | −307.0 |
| | Acc | 69.77% | 73.81% | 73.09% | ΔAcc | 0.75% | 0.76% | 1.09% | 0.72% | 0.72% | 0.74% | 0.82% |
| Rabbit | Size | 349.0 | 2682.1 | 5593.0 | ΔSize | 1020.6 | 1162.7 | −347.1 | 2904.5 | 2910.9 | 1261.4 | −834.0 |
| | Acc | 65.98% | 73.54% | 72.44% | ΔAcc | 0.91% | 0.93% | 1.92% | 1.11% | 1.1% | 0.93% | 1.67% |
| Cat | Size | 732.0 | 5438.9 | 11722.0 | ΔSize | 2382.9 | 2810.6 | −934.2 | 6283.1 | 6283.1 | 5128.8 | −811.0 |
| | Acc | 63.74% | 67.25% | 65.81% | ΔAcc | 0.6% | 0.7% | 1.38% | 1.45% | 1.45% | 1.17% | 0.81% |
| Panda | Size | 351.0 | 2830.4 | 5628.0 | ΔSize | 509.6 | 587.3 | 1018.3 | 2104.9 | 2508.8 | 107.8 | −832.0 |
| | Acc | 75.99% | 81.03% | 80.44% | ΔAcc | 0.58% | 0.65% | 1.09% | 0.53% | 0.55% | 0.65% | 1.12% |
| Elephant | Size | 392.0 | 2834.0 | 6282.0 | ΔSize | 1286.8 | 686.4 | 2428.0 | 2329.6 | 3024.8 | 212.4 | −799.0 |
| | Acc | 75.78% | 81.55% | 80.95% | ΔAcc | 0.65% | 0.93% | 0.78% | 0.67% | 0.67% | 0.95% | 1.23% |

Note: The thresholds of MS are all 2.0.
GRAD stops with a small training set and low accuracy most of the time, indicating that it stops early and misses important instances; this is especially evident on the ImageNet datasets. A likely reason is that, in such image categorization tasks, a large number of instances are highly uncertain under the linear classifier in the early iterations, so the uncertainty over the window of recently selected instances appears stable and GRAD suggests stopping too early. OU and MC sometimes obtain the smallest ΔAcc but mostly a large ΔSize, showing that they pursue high accuracy and therefore stop with a large training set. The performance of ME is inconsistent across datasets, likely because its expected error estimation depends strongly on the characteristics of each dataset. QUIRE performs poorly most of the time, possibly because it is difficult to balance informativeness and representativeness, so its performance cannot be guaranteed.
6.2.3 Experimental Results with Expected Model Change Strategy. Table 8 and Table 9 present
the results for LR and SVMs with EMC as the sampling method.
Table 8. Results of ΔSize and ΔAcc on ImageNet Datasets for LR with EMC Strategy

| Dataset | | Initial | HPP | All | | MS | SP | GRAD | OU | MC | ME |
| Basenji | Size | 184.0 | 2312.0 | 2951.0 | ΔSize | 502.0 | 640.0 | −1233.0 | 640.0 | 640.0 | 640.0 |
| | Acc | 51.03% | 61.87% | 60.87% | ΔAcc | 0.84% | 1.0% | 3.1% | 1.0% | 1.0% | 1.0% |
| Chihuahua | Size | 175.0 | 1792.0 | 2800.0 | ΔSize | 830.0 | 1008.0 | −525.0 | 1008.0 | 1008.0 | 994.0 |
| | Acc | 53.1% | 65.26% | 64.04% | ΔAcc | 1.19% | 1.21% | 2.0% | 1.21% | 1.21% | 1.39% |
| English setter | Size | 242.0 | 3123.0 | 3881.0 | ΔSize | 594.0 | 758.0 | −1252.0 | 758.0 | 758.0 | 747.0 |
| | Acc | 50.91% | 62.45% | 61.19% | ΔAcc | 1.03% | 1.26% | 3.3% | 1.26% | 1.26% | 1.2% |
| German police dog | Size | 174.0 | 2336.0 | 2785.0 | ΔSize | 384.0 | 449.0 | −1553.0 | 449.0 | 449.0 | 447.0 |
| | Acc | 50.36% | 62.81% | 61.89% | ΔAcc | 0.72% | 0.92% | 4.25% | 0.92% | 0.92% | 0.99% |
| Standard poodle | Size | 195.0 | 2824.0 | 3129.0 | ΔSize | −94.0 | 111.0 | −1872.0 | 300.0 | 305.0 | −47.0 |
| | Acc | 55.81% | 73.33% | 72.64% | ΔAcc | 0.54% | 0.59% | 4.52% | 0.56% | 0.69% | 0.72% |
| Vizsla | Size | 233.0 | 2501.0 | 3734.0 | ΔSize | 1107.0 | 1095.0 | 47.0 | 1233.0 | 1233.0 | 1206.0 |
| | Acc | 55.42% | 65.81% | 65.15% | ΔAcc | 0.56% | 0.7% | 1.06% | 0.66% | 0.66% | 0.62% |
| Yorkshire terrier | Size | 304.0 | 2598.0 | 4874.0 | ΔSize | 1427.0 | 872.0 | −757.0 | 2092.0 | 2214.0 | 1171.0 |
| | Acc | 67.08% | 73.22% | 72.46% | ΔAcc | 0.57% | 0.66% | 2.75% | 0.57% | 0.66% | 0.7% |
| Wild dog | Size | 242.0 | 2558.0 | 3875.0 | ΔSize | 948.0 | 361.0 | 178.0 | 1317.0 | 1317.0 | 868.0 |
| | Acc | 62.72% | 71.81% | 70.84% | ΔAcc | 0.76% | 1.08% | 1.31% | 0.96% | 0.96% | 0.76% |
| Wolf | Size | 561.0 | 6990.0 | 8986.0 | ΔSize | 358.0 | −1859.0 | 826.0 | 1971.0 | 1981.0 | 426.0 |
| | Acc | 68.97% | 73.59% | 73.19% | ΔAcc | 0.36% | 0.52% | 0.46% | 0.39% | 0.36% | 0.38% |
| Fox | Size | 478.0 | 5105.0 | 7654.0 | ΔSize | 1382.0 | −499.0 | −922.0 | 2549.0 | 2549.0 | 1373.0 |
| | Acc | 68.36% | 74.21% | 73.74% | ΔAcc | 0.36% | 0.38% | 1.19% | 0.47% | 0.47% | 0.42% |
| Rabbit | Size | 349.0 | 3988.0 | 5593.0 | ΔSize | 862.0 | 92.0 | −905.0 | 1605.0 | 1605.0 | 638.0 |
| | Acc | 63.87% | 73.81% | 73.05% | ΔAcc | 0.61% | 0.79% | 2.14% | 0.76% | 0.76% | 0.71% |
| Cat | Size | 732.0 | 7347.0 | 11722.0 | ΔSize | 3704.0 | 1316.0 | 1537.0 | 4375.0 | 4375.0 | 3410.0 |
| | Acc | 64.01% | 67.5% | 66.85% | ΔAcc | 0.76% | 0.55% | 0.65% | 0.65% | 0.65% | 0.77% |
| Panda | Size | 351.0 | 4388.0 | 5628.0 | ΔSize | −389.0 | −1229.0 | −122.0 | 879.0 | 1097.0 | −902.0 |
| | Acc | 71.71% | 81.02% | 80.69% | ΔAcc | 0.28% | 0.6% | 0.86% | 0.32% | 0.3% | 0.44% |
| Elephant | Size | 392.0 | 4073.0 | 6282.0 | ΔSize | 158.0 | −924.0 | 1770.0 | 1754.0 | 2054.0 | −395.0 |
| | Acc | 74.11% | 82.88% | 82.42% | ΔAcc | 0.34% | 0.51% | 0.55% | 0.43% | 0.47% | 0.42% |

Note: The thresholds of MS are 1.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, and 1.0, respectively.
As shown in the tables, similar conclusions hold: MS outperforms the baselines in most cases, demonstrating the effectiveness of our proposed stopping criterion. As in the previous experiments, the other methods do not perform well most of the time; we omit the detailed discussion due to space limitations.
7 EXTENSIONS
In the above, the proposed SC_MS is applied to linear models trained in a stochastic gradient descent manner. In fact, SC_MS is flexible and can be generalized to many other models. In this section, we extend SC_MS to kernelized models and to nongradient-based models. Since SC_MS stops active learning when C(x+) ≤ η for all x+ ∈ U, the key step in each extension is measuring the model change C(x+); the generic stopping loop itself is sketched below.
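To make the role of C(x+) concrete, the following sketch shows a generic pool-based active learning loop driven by the SC_MS test; train, query, label, and model_change are placeholder callables supplied by the caller, not functions defined in this article.

```python
def active_learn_with_ms(labeled, unlabeled, train, query, label, model_change, eta):
    """Generic pool-based loop that stops once no remaining example can change the model by more than eta."""
    model = train(labeled)
    while unlabeled:
        # SC_MS stopping test: C(x+) <= eta for every x+ still in the pool.
        if all(model_change(model, x) <= eta for x in unlabeled):
            break
        x = query(model, unlabeled)            # pick the next instance (e.g., US or EMC)
        unlabeled.remove(x)
        labeled.append((x, label(x)))          # oracle annotation
        model = train(labeled)                 # retrain with the enlarged training set
    return model
```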
7.1 Stopping Criterion for Kernelized Models
In this section, we discuss extensions of our method to kernelized models, including kernel LR and kernel SVM.
Table 9. Results of ΔSize and ΔAcc on ImageNet Datasets for SVMs with EMC Strategy

| Dataset | | Initial | HPP | All | | MS | SP | GRAD | OU | MC | ME | QUIRE |
| Basenji | Size | 184.0 | 2001.0 | 2951.0 | ΔSize | 828.0 | 940.0 | −895.0 | 950.0 | 950.0 | 950.0 | −1103.0 |
| | Acc | 50.04% | 61.73% | 60.54% | ΔAcc | 0.83% | 1.23% | 3.1% | 1.19% | 1.19% | 1.19% | 3.98% |
| Chihuahua | Size | 175.0 | 2167.0 | 2800.0 | ΔSize | 420.0 | 620.0 | −805.0 | 634.0 | 634.0 | 634.0 | −1715.0 |
| | Acc | 53.49% | 64.69% | 63.76% | ΔAcc | 0.7% | 0.87% | 2.29% | 0.93% | 0.93% | 0.93% | 5.96% |
| English setter | Size | 242.0 | 2920.0 | 3881.0 | ΔSize | 923.0 | 961.0 | −1152.0 | 961.0 | 961.0 | 961.0 | −2311.0 |
| | Acc | 49.28% | 62.06% | 61.34% | ΔAcc | 0.69% | 0.72% | 2.5% | 0.72% | 0.72% | 0.72% | 6.24% |
| German police dog | Size | 174.0 | 2337.0 | 2785.0 | ΔSize | 368.0 | 448.0 | −1195.0 | 448.0 | 448.0 | 448.0 | −1649.0 |
| | Acc | 50.62% | 62.44% | 61.54% | ΔAcc | 0.72% | 0.9% | 4.71% | 0.9% | 0.9% | 0.9% | 5.77% |
| Standard poodle | Size | 195.0 | 2289.0 | 3129.0 | ΔSize | 335.0 | 513.0 | −635.0 | 840.0 | 840.0 | 838.0 | −1509.0 |
| | Acc | 64.6% | 71.69% | 70.98% | ΔAcc | 0.64% | 0.66% | 2.46% | 0.7% | 0.7% | 0.7% | 3.78% |
| Vizsla | Size | 233.0 | 3006.0 | 3734.0 | ΔSize | 423.0 | 630.0 | −2059.0 | 728.0 | 728.0 | 728.0 | −2186.0 |
| | Acc | 54.61% | 65.48% | 64.78% | ΔAcc | 0.61% | 0.63% | 3.46% | 0.71% | 0.71% | 0.71% | 4.23% |
| Yorkshire terrier | Size | 304.0 | 3952.0 | 4874.0 | ΔSize | −6.0 | 43.0 | −479.0 | 922.0 | 922.0 | 573.0 | −2788.0 |
| | Acc | 68.39% | 73.82% | 73.3% | ΔAcc | 0.29% | 0.3% | 1.79% | 0.52% | 0.52% | 0.49% | 4.25% |
| Wild dog | Size | 242.0 | 2690.0 | 3875.0 | ΔSize | 638.0 | 182.0 | 599.0 | 1185.0 | 1185.0 | 1185.0 | −1858.0 |
| | Acc | 66.96% | 71.81% | 70.86% | ΔAcc | 0.72% | 0.82% | 1.11% | 0.95% | 0.95% | 0.95% | 3.33% |
| Wolf | Size | 561.0 | 5900.0 | 8986.0 | ΔSize | 1997.0 | −373.0 | 1170.0 | 3086.0 | 3086.0 | 3086.0 | −3677.0 |
| | Acc | 70.34% | 73.79% | 73.42% | ΔAcc | 0.29% | 0.39% | 0.76% | 0.37% | 0.37% | 0.37% | 1.85% |
| Fox | Size | 478.0 | 6248.0 | 7654.0 | ΔSize | 336.0 | −1258.0 | 1406.0 | 1406.0 | 1406.0 | 1406.0 | −4195.0 |
| | Acc | 69.32% | 73.36% | 73.12% | ΔAcc | 0.35% | 0.52% | 0.24% | 0.24% | 0.24% | 0.24% | 1.46% |
| Rabbit | Size | 349.0 | 4612.0 | 5593.0 | ΔSize | 336.0 | −42.0 | −389.0 | 981.0 | 981.0 | 981.0 | −3220.0 |
| | Acc | 66.35% | 72.75% | 72.22% | ΔAcc | 0.41% | 0.44% | 1.22% | 0.54% | 0.54% | 0.54% | 3.27% |
| Cat | Size | 732.0 | 9748.0 | 11722.0 | ΔSize | 1801.0 | −588.0 | −2021.0 | 1974.0 | 1974.0 | 1974.0 | −6468.0 |
| | Acc | 64.17% | 66.47% | 65.77% | ΔAcc | 0.63% | 0.64% | 1.33% | 0.7% | 0.7% | 0.7% | 2.47% |
| Panda | Size | 351.0 | 3872.0 | 5628.0 | ΔSize | 238.0 | −266.0 | 912.0 | 1756.0 | 1756.0 | 1015.0 | −3010.0 |
| | Acc | 75.82% | 80.56% | 80.2% | ΔAcc | 0.24% | 0.67% | 0.46% | 0.36% | 0.36% | 0.25% | 4.27% |
| Elephant | Size | 392.0 | 4872.0 | 6282.0 | ΔSize | −429.0 | −813.0 | 904.0 | 1410.0 | 1410.0 | −446.0 | −3702.0 |
| | Acc | 76.65% | 82.14% | 81.8% | ΔAcc | 0.27% | 0.5% | 0.52% | 0.34% | 0.34% | 0.25% | 2.07% |

Note: The thresholds of MS are all 1.5.
kernel function K(·, ·). Here, we consider two commonly used kernels: the radial basis function (RBF) kernel $K_r(z, x) = \exp\left(-\frac{\|z - x\|^2}{2\sigma^2}\right)$ and the polynomial (PN) kernel $K_p(z, x) = (z^T x + 1)^p$.
7.1.1 SC_MS for Kernel Logistic Regression. With the kernel trick, the kernel LR model [34] can be formulated as $f(x) = \frac{1}{1 + \exp(-\theta_\phi^T \phi(x))}$, where $\theta_\phi$ is the model parameter vector in the Hilbert feature space and $\phi(\cdot)$ is the mapping function. Based on the analysis in [24], we have $\theta_\phi \approx \phi(\theta)$; thus, the kernel LR model can be rewritten as
$$f(x) = \frac{1}{1 + \exp(-K(\theta, x))}. \tag{48}$$
Given the dataset D, training the kernel LR is to minimize the following logistic loss:
$$L_\theta(D) = \sum_{i=1}^{m} \left[ -y_i K(\theta, x_i) + \log\left(1 + \exp(K(\theta, x_i))\right) \right], \tag{49}$$
which can be solved by the stochastic gradient update rule as well. Therefore, we can estimate the model change by the gradient of the logistic loss $L_\theta(x^+)$ with regard to the parameter vector $\theta$. Then, for kernel LR, the expected model change can be denoted by
$$C_\theta(x^+) = E_{y^+}\left\|G_\theta(x^+)\right\| = E_{y^+}\left\|\frac{\partial L_\theta(x^+)}{\partial \theta}\right\| = 2 f(x^+)\left(1 - f(x^+)\right)\left\|\frac{\partial K(\theta, x^+)}{\partial \theta}\right\|, \tag{50}$$
where for the RBF kernel
$$\frac{\partial K_r(\theta, x^+)}{\partial \theta} = -\exp\left(-\frac{\|\theta - x^+\|^2}{2\sigma^2}\right)\frac{\theta - x^+}{\sigma^2}, \tag{51}$$
and for the PN kernel
$$\frac{\partial K_p(\theta, x^+)}{\partial \theta} = p\,(\theta^T x^+ + 1)^{p-1}\, x^+. \tag{52}$$
According to Equation (2), SC_MS will stop active learning when $C_\theta(x^+) \le \eta$, $\forall x^+ \in U$.
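As a minimal sketch of the computation above for the RBF kernel (treating θ as a vector in the input space, as in Equation (48)), one might implement the model-change score and the stopping test as follows; the function and variable names are ours, not from the article.

```python
import numpy as np

def rbf_kernel(theta, x, sigma):
    # K_r(theta, x) = exp(-||theta - x||^2 / (2 sigma^2))
    return np.exp(-np.sum((theta - x) ** 2) / (2.0 * sigma ** 2))

def rbf_kernel_grad(theta, x, sigma):
    # Equation (51): d K_r(theta, x) / d theta
    return -rbf_kernel(theta, x, sigma) * (theta - x) / sigma ** 2

def model_change_klr(theta, x_plus, sigma):
    # Equation (50): C_theta(x+) = 2 f(x+) (1 - f(x+)) ||d K(theta, x+) / d theta||
    f = 1.0 / (1.0 + np.exp(-rbf_kernel(theta, x_plus, sigma)))
    return 2.0 * f * (1.0 - f) * np.linalg.norm(rbf_kernel_grad(theta, x_plus, sigma))

def should_stop_klr(theta, pool, sigma, eta):
    # SC_MS stopping test: every remaining unlabeled x+ changes the model by at most eta.
    return all(model_change_klr(theta, x, sigma) <= eta for x in pool)
```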
7.1.2 SC_MS for Kernel SVM. The kernel SVM model can be represented by a hyperplane $f(x) = w_\phi^T \phi(x)$, where $w_\phi \approx \phi(w)$ is the model parameter in the Hilbert feature space as defined in [24]. Therefore, kernel SVM can be rewritten as $f(x) = K(w, x)$ with the loss function
$$L_w(D) = \frac{\lambda}{2} K(w, w) + \sum_{i=1}^{m} \left[1 - y_i K(w, x_i)\right]_+, \tag{53}$$
which can also be solved by the stochastic gradient update rule. As before, after adding $x^+$, the model change can be measured by the gradient of the loss $L_w(x^+)$ with regard to $w$.
Then, for an instance $x^+ \in U$ that satisfies $|K(w, x^+)| < 1$, the expectation of the model change can be formulated as
$$C_w(x^+) = E_{y^+}\left\|G_w(x^+)\right\| = E_{y^+}\left\|\frac{\partial L_w(x^+)}{\partial w}\right\| = \left\|\frac{\partial K(w, x^+)}{\partial w}\right\|, \tag{54}$$
where the gradient of the kernel function $K(w, x^+)$ with regard to $w$ for the RBF and PN kernels is analogous to Equation (51) and Equation (52); for all other $x^+ \in U$, we have $C_w(x^+) = 0$.
Similarly, taking kernel SVM as the base model, we stop active learning when $C_w(x^+) \le \eta$ is satisfied for all $x^+ \in U$. Detailed experimental results can be found at https://github.com/zhyxun/Stopping-criterion-for-active-learning.
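A corresponding sketch for kernel SVM with the PN kernel is given below; again, the names are illustrative, and the gradient is taken only for candidates that fall inside the margin, as in Equation (54).

```python
import numpy as np

def poly_kernel(w, x, p):
    # K_p(w, x) = (w^T x + 1)^p
    return (np.dot(w, x) + 1.0) ** p

def poly_kernel_grad(w, x, p):
    # Analogue of Equation (52): d K_p(w, x) / d w = p (w^T x + 1)^(p-1) x
    return p * (np.dot(w, x) + 1.0) ** (p - 1) * x

def model_change_ksvm(w, x_plus, p):
    # Equation (54): the hinge loss contributes a gradient only when |K(w, x+)| < 1.
    if abs(poly_kernel(w, x_plus, p)) >= 1.0:
        return 0.0
    return np.linalg.norm(poly_kernel_grad(w, x_plus, p))

def should_stop_ksvm(w, pool, p, eta):
    return all(model_change_ksvm(w, x, p) <= eta for x in pool)
```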
7.2 Stopping Criterion for Nongradient-Based Models
As the essential step of SC_MS is to estimate the model change, the criterion is naturally applicable to gradient-based models. Applying SC_MS to nongradient-based models is more involved: the key challenge is to obtain a closed form for the model change. Here, we extend SC_MS to nongradient-based models, taking Gaussian process regression (GPR) as an example.
Given the training set D, GPR can be defined as
$$f(x; \alpha) = \sum_{i=1}^{m} \alpha_i \kappa(x_i, x), \tag{55}$$
where $\kappa(\cdot, \cdot)$ is a given kernel function and $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_m]^T$ is the model parameter vector, which can be obtained by $\alpha = (K + \sigma_m^2 I)^{-1} Y$, where $K$ is the kernel matrix of the set D, $I$ is the identity matrix, $\sigma_m^2$ is the assumed noise variance, and $Y$ is the label vector.
As suggested by Cai [7], after adding a new instance $x^+$ into the training set, the updated model parameter for GPR can be computed in closed form:
$$\alpha^+ = \begin{bmatrix} \alpha \\ 0 \end{bmatrix} + \frac{\kappa^T \alpha - y^+}{\sigma_{f^+}^2 + \sigma_m^2} \begin{bmatrix} (K + \sigma_m^2 I)^{-1} \kappa \\ -1 \end{bmatrix}, \tag{56}$$
where $\kappa$ is the vector of kernel values between the training set and $x^+$ (i.e., the $i$th entry is $\kappa(x_i, x^+)$) and $\sigma_{f^+}^2$ is the predictive variance of $x^+$.
Therefore, the expectation of the model change for GPR can be formulated as
$$C_{GPR}(x^+) = E_{y^+}\left\|\alpha^+ - [\alpha^T, 0]^T\right\| = \int_{Y} \left\|\alpha^+ - [\alpha^T, 0]^T\right\| P(y^+ \mid x^+)\, dy^+. \tag{57}$$
According to SC_MS, taking GPR as the base model, we stop active learning once the condition $C_{GPR}(x^+) \le \eta$, $\forall x^+ \in U$ is satisfied. Detailed experimental results are presented at https://github.com/zhyxun/Stopping-criterion-for-active-learning.
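The sketch below evaluates the parameter change of Equation (56) for one candidate and approximates the expectation in Equation (57) by averaging over a set of candidate labels y+. The article does not specify how the integral is evaluated, so this Monte Carlo averaging, like the variable names, is an assumption made purely for illustration.

```python
import numpy as np

def gpr_model_change(alpha, K, kappa, var_f_plus, noise_var, y_samples):
    """Approximate C_GPR(x+) of Equation (57) for one candidate x+.

    alpha      : current GPR parameters, shape (m,)
    K          : kernel matrix of the training set, shape (m, m)
    kappa      : kernel values kappa(x_i, x+), shape (m,)
    var_f_plus : predictive variance sigma_{f+}^2 of x+
    noise_var  : assumed noise variance sigma_m^2
    y_samples  : candidate labels y+ over which the expectation is averaged (our assumption)
    """
    m = len(alpha)
    A = np.linalg.solve(K + noise_var * np.eye(m), kappa)   # (K + sigma_m^2 I)^{-1} kappa
    changes = []
    for y_plus in y_samples:
        scale = (kappa @ alpha - y_plus) / (var_f_plus + noise_var)
        delta = scale * np.concatenate([A, [-1.0]])          # alpha+ minus [alpha; 0], Equation (56)
        changes.append(np.linalg.norm(delta))
    return float(np.mean(changes))
```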
8 CONCLUSIONS
In this article, we propose a novel model stability–based stopping criterion, which considers the
potential capability of each unlabeled example to change the model once added to the training
set. The main idea is that active learning should stop when the model will not change with more
training data. With this idea, we stop active learning when the changing ability of all remaining
unlabeled examples is less than a given threshold. Inspired by the stochastic gradient update rule,
we use the gradient of the loss at each candidate example to measure its capability to change the
model. We apply the method to two popular classifiers: logistic regression and support vector machines. We then theoretically analyze the stability and generalization ability of the model obtained by our method, providing theoretical support for the criterion. Substantial experiments on various datasets demonstrate the effectiveness of the proposed method. Finally, we generalize the criterion to several other kinds of models, including kernelized models and nongradient-based models, to show its broad applicability.
In this study, we focus on binary classification. Our proposed model stability–based stopping criterion can also be generalized to multiclass classification problems, which we leave for future work. In addition, the model change is currently estimated by taking the expectation over all possible labels; we will investigate more reliable ways of estimating the model change in the future.
ACKNOWLEDGMENTS
The work is partially supported by the High Technology Research and Development Program of
China 2015AA015801, NSFC 61521062, and STCSM 12DZ2272600.
REFERENCES
[1] N. Abe and H. Mamitsuka. 1998. Query learning strategies using boosting and bagging. In Proceedings of the 15th
International Conference on Machine Learning. 1–10.
[2] I. S. Andrew and H. U. Lyle. 2007. Active learning for logistic regression: An evaluation. Machine Learning 68, 3,
235–265.
[3] M. Bloodgood and K. Vijay-Shanker. 2009. A method for stopping active learning based on stabilizing predictions
and the need for user-adjustable stopping. In Proceedings of the 13th Conference on Computational Natural Language
Learning Association for Computational Linguistics. 39–47.
[4] S. Boucheron, G. Lugosi, and O. Bousquet. 2013. Concentration Inequalities. Oxford, Oxford University Press.
[5] O. Bousquet and A. Elisseeff. 2002. Stability and generalization. Journal of Machine Learning Research 2, 3, 499–526.
[6] C. Burges. 1998. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery
2, 2, 121–167.
[7] W. Cai, M. Zhang, and Y. Zhang. 2017. Batch mode active learning for regression with expected model change. IEEE
Transactions on Neural Networks and Learning Systems 28, 7, 1668–1681.
[8] W. Cai, Y. Zhang, and J. Zhou. 2013. Maximizing expected model change for active learning in regression. In Proceedings of the 13th International Conference on Data Mining. 51–60.
[9] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei. 2009. Imagenet: A large-scale hierarchical image database. In
Proceedings of the Conference on Computer Vision and Pattern Recognition. 248–255.
[10] B. Efron and C. Stein. 1981. The jackknife estimate of variance. The Annals of Statistics 9, 3, 586–596.
[11] T. M. Hospedales, S. Gong, and T. Xiang. 2013. Finding rare classes: Active learning with generative and discriminative
models. IEEE Transactions on Knowledge and Data Engineering 25, 2, 374–386.
[12] S. Huang, R. Jin, and Z. Zhou. 2014. Active learning by querying informative and representative examples. IEEE
Transactions on Pattern Analysis and Machine Intelligence 36, 10, 1936–1949.
[13] J. Kivinen, A. Smola, and R. Williamson. 2004. Online learning with kernels. IEEE Transactions on Signal Processing
52, 8, 2165–2176.
[14] F. Laws and H. Schutze. 2008. Stopping criteria for active learning of named entity recognition. In Proceedings of the
22nd International Conference on Computational Linguistics. 465–472.
[15] D. D. Lewis and W. A. Gale. 1994. A sequential algorithm for training text classifiers. In Proceedings of the 17th
International ACM SIGIR Conference on Research and Development in Information Retrieval. 3–12.
[16] L. Li, X. Jin, S. Pan, and J. Sun. 2012. Multi-domain active learning for text classification. In Proceedings of the 18th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1086–1094.
[17] M. Li and I. K. Sethi. 2006. Confidence-based active learning. IEEE Transactions on Pattern Analysis and Machine
Intelligence 28, 8, 1251–1261.
[18] A. Liu, G. Jun, and J. Ghosh. 2009. A self-training approach to cost sensitive uncertainty sampling. Machine Learning
76, 2, 257–270.
[19] C. McDiarmid. 1989. On the method of bounded differences. Surveys in Combinatorics 141, 1, 148–188.
[20] J. Platt. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods.
Advances in Large Margin Classifiers 10, 3, 61–74.
[21] G. Schohn and D. Cohn. 2000. Less is more: Active learning with support vector machines. In Proceedings of the 17th
International Conference on Machine Learning. 839–846.
[22] B. Settles. 2010. Active learning literature survey. University of Wisconsin–Madison 39, 2, 127–131.
[23] H. S. Seung, M. Opper, and H. Sompolinsky. 1992. Query by committee. In Proceedings of the 5th Workshop on Computational Learning Theory. 287–294.
[24] G. Sharma, F. Jurie, and P. Perez. 2014. Learning non-linear SVM in input space for image classification. HAL Archives
1–14.
[25] S. Shwartz, Y. Singer, and N. Srebro. 2007. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of
the 24th International Conference on Machine Learning. 807–814.
[26] K. Small and D. Roth. 2010. Margin-based active learning for structured predictions. International Journal of Machine
Learning and Cybernetics 1, 1-4, 3–25.
[27] M. Tang, X. Luo, and S. Roukos. 2002. Active learning for statistical natural language parsing. In Proceedings of the
40th Annual Meeting on Association for Computational Linguistics. 120–127.
[28] K. Tomanek, J. Wermter, and U. Hahn. 2007. An approach to text corpus construction which cuts annotation costs
and maintains reusability of annotated data. In Proceedings of Joint Meeting of the Conference on Empirical Methods
on Natural Language Processing and the Conference on Natural Language Learning. 486–495.
[29] S. Tong and E. Chang. 2001. Support vector machine active learning for image retrieval. In Proceedings of the 9th ACM
International Conference on Multimedia. 107–118.
[30] S. Tong and D. Koller. 2002. Support vector machine active learning with applications to text classification. Journal
of Machine Learning Research 2, 1, 45–66.
[31] A. Vlachos. 2008. A stopping criterion for active learning. Computer Speech and Language 22, 3, 295–312.
[32] J. Wang, E. Sung, and W. Yau. 2011. Active learning for solving the incomplete data problem in facial age classification
by the furthest nearest-neighbor criterion. IEEE Transactions on Image Processing 20, 7, 2049–2062.
[33] W. Wang, W. Cai, and Y. Zhang. 2014. Stability-based stopping criterion for active learning. In Proceedings of the 14th
International Conference on Data Mining. 1019–1024.
[34] J. Zhu and T. Hastie. 2001. Kernel logistic regression and the import vector machine. In Proceedings of the 14th Advances in Neural Information Processing Systems. 1081–1088.
[35] J. Zhu, H. Wang, and E. Hovy. 2010. Confidence-based stopping criteria for active learning for data annotation. ACM
Transactions on Speech and Language Processing 6, 3, 1–24.
Received March 2017; revised June 2017; accepted July 2017