Stopping Criterion for Active Learning with Model Stability

YEXUN ZHANG, Shanghai Jiao Tong University
WENBIN CAI, Microsoft
WENQUAN WANG and YA ZHANG, Shanghai Jiao Tong University

Active learning selectively labels the most informative instances, aiming to reduce the cost of data annotation. While much effort has been devoted to active sampling functions, relatively limited attention has been paid to when the learning process should stop. In this article, we focus on the stopping criterion of active learning and propose a model stability–based criterion, that is, to stop when the model no longer changes with the inclusion of additional training instances. The challenge lies in how to measure the model change without labeling additional instances and training new models. Inspired by the stochastic gradient update rule, we use the gradient of the loss function at each candidate example to measure its effect on model change. We propose to stop active learning when the model change brought by any of the remaining unlabeled examples is lower than a given threshold. We apply the proposed stopping criterion to two popular classifiers: logistic regression (LR) and support vector machines (SVMs). In addition, we theoretically analyze the stability and generalization ability of the model obtained by our stopping criterion. Substantial experiments on various UCI benchmark datasets and ImageNet datasets have demonstrated that the proposed approach is highly effective.

CCS Concepts: • Computing methodologies → Active learning settings;

Additional Key Words and Phrases: Active learning, logistic regression, model stability, stopping criterion, support vector machines

ACM Reference format:
Yexun Zhang, Wenbin Cai, Wenquan Wang, and Ya Zhang. 2017. Stopping Criterion for Active Learning with Model Stability. ACM Trans. Intell. Syst. Technol. 9, 2, Article 19 (October 2017), 26 pages.
https://doi.org/10.1145/3125645

A preliminary version of this article appeared as a paper in Proceedings of the International Conference on Data Mining (ICDM'14) [33].
Authors' addresses: Y. Zhang, W. Wang, and Y. Zhang (corresponding author), Room 303A, No. 5 SEIEE Building, the Cooperative Medianet Innovation Center, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; emails: {zhyxun, wangwenquan, ya_zhang}@sjtu.edu.cn. W. Cai, the Applications & Services Group East Asia, Microsoft, Beijing 100080, China; email: wenbca@microsoft.com.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2017 ACM 2157-6904/2017/10-ART19 $15.00
ACM Transactions on Intelligent Systems and Technology, Vol. 9, No. 2, Article 19. Publication date: October 2017.

1 INTRODUCTION

In recent years, active learning has been successfully applied to many machine-learning problems in which data may be abundant but labels are absent or expensive to obtain [16, 27, 29]. Most studies on active learning focus on designing sampling algorithms, such as uncertainty sampling (US) [15] and query-by-committee (QBC) [23]. However, another important issue of the interactive learning process, and a relatively less researched one, is the stopping criterion, that is, deciding when to stop active learning. Here, we use an illustrative example to show the importance of the
stopping criterion.

Fig. 1. A typical active learning curve.

Figure 1 presents a typical active learning curve. The X-axis denotes the number of actively selected instances, and the Y-axis stands for classification accuracy. As plotted in the figure, the performance increases gradually in the initial stages. As active learning proceeds, the learning curve starts to saturate, indicating that the model performance cannot be improved further. Thus, in this particular example, it is desirable to stop active learning at point B. In contrast, stopping at point A is too early and leads to lower performance, while stopping at point C is too late and wastes annotation effort. To ensure model quality as well as save labeling cost, it is highly desirable to define an appropriate stopping criterion to cease learning.

In principle, active learning should terminate when the desired performance has been reached. However, this ideal approach requires an extra test set to evaluate the performance, which has to be annotated in advance, involving a great deal of extra human effort. Thus, the key challenge is how to stop active learning without a test set. So far, there are limited studies on stopping criteria that require no test set [21, 28, 31, 35]. They mainly fall into two classes: model-specific stopping criteria and learning strategy-specific stopping criteria. The model-specific stopping criterion is restricted by the base model employed. For example, the support vector machines (SVMs)–based stopping criterion [21] is particularly designed for SVMs, and the confidence-based stopping criterion [31, 35] is mainly for probabilistic learning models. The learning strategy-specific stopping criterion has its own limitations because it is based on specific features of the active-learning strategy.
For example, the QBC-based stopping criterion [28] requires an ensemble to calculate the disagreement on each unlabeled example, and it is not readily applicable to other sampling strategies. To adapt to various practical applications, a general stopping criterion unconstrained by specific models and specific active-learning strategies is extremely important.

In this article, we propose a new stopping-criterion framework based on model stability, measured by the capability of each candidate example to change the model. The main idea is that active learning should stop when the model will not change with more training data. With this idea, we stop active learning when the changing ability of all remaining unlabeled examples is less than a given threshold, that is, when the model is stable. The model change is quantified as the difference between the current model parameters and the new parameters obtained with the accumulated training set. Inspired by the stochastic gradient update rule, in which the model parameters are updated repeatedly according to the negative gradient of the objective function, we use the gradient of the loss function at each candidate example to measure its capability to change the model. In this study, we apply the stability-based stopping criterion to two popular base models: logistic regression (LR) and SVMs. In addition, we theoretically analyze the model's properties, including stability and generalizability, when active learning stops according to our method. Furthermore, we derive extensions of our method to several other kinds of models, including kernelized models and nongradient-based models, to show its wide applicability.
Extensive experiments on various datasets demonstrate that the proposed approach outperforms state-of-the-art methods in most cases. The main contributions of this article are summarized as follows.

(1) We propose a new stopping-criterion framework based on model stability, which considers the capability of each candidate example to change the model. It can be applied to a wide spectrum of learners.
(2) Under the framework, we derive the stopping-criterion algorithms for two popular classifiers: LR and SVMs.
(3) We theoretically analyze the proposed stopping criterion to provide theoretical support for its performance.
(4) We further extend the proposed method to kernelized models and nongradient-based models to show its applicability in a wide range of applications.

The rest of this article is structured as follows. Section 2 reviews the related work. The general framework of the stability-based stopping criterion is introduced in Section 3. The proposed method for classification models is presented in Section 4. Section 5 theoretically analyzes the method. Section 6 presents the experiments and interprets the results. The extensions of our proposed method are introduced in Section 7. We present our conclusions in Section 8.

2 RELATED WORK

In this section, we first quickly review several typical active-learning strategies and then summarize existing studies on stopping criteria for active learning.

2.1 Active Learning Methodologies

Most active-learning research has focused on designing sampling functions for choosing informative examples. Perhaps the most widely used is US, which chooses the instance whose label the current classifier is most uncertain about [15]. The margin-based active-learning method [30], mostly used with SVMs as the base model, selects instances closest to the separating hyperplane.
Small [26] extended the margin-based method to structured output spaces and pipeline models, which enlarged the scope of problems to which active learning can be applied. Wang [32] introduced the furthest nearest-neighbor (FNN) method, which generalizes margin-based uncertainty to the multiclass case. Another typical strategy is QBC [23], which generates a committee of models and selects the instances having the greatest disagreement among the committee members [1]. In recent years, expected model change (EMC), which considers the ability of examples to update the current model, has been adopted in designing sampling functions [22]. A comprehensive active-learning survey can be found in [22].

2.2 Stopping Criterion for Active Learning

In this section, we summarize existing work on stopping criteria for active learning as follows.

(1) Stopping criterion with a test set: The idea behind this kind of method is intuitive and natural, that is, active learning should stop when the desired performance is reached [17]. However, an extra labeled test set is needed to evaluate the model's performance, incurring an extra labeling cost.

(2) Stopping criterion without a test set: The main idea of this type of method is to evaluate the model's performance without using a labeled test set. Bloodgood [3] proposed the prediction-based stopping criterion, which examines model predictions on data that do not have to be labeled, called the stop set. However, the performance of this method greatly depends on the selected stop set, and how to determine an appropriate stop set needs further justification. Laws and Schütze [14] presented a gradient-based stopping criterion, which stops active learning when the model has reached maximum performance or the uncertainty cannot be reduced further.
However, in real-world applications it is nontrivial to reliably estimate the gradient of the performance curve during the active-learning process.

Recently, several other stopping criteria have been proposed, which can be further grouped into two classes: model-specific stopping criteria and learning strategy-specific stopping criteria. The model-specific stopping criteria include the following methods. Taking SVMs as the base model, Schohn and Cohn [21] suggested that active learning should stop when no unlabeled instances lie in the margin. This strategy is specific to SVMs and is not applicable to other base models. Another model-specific method is the confidence-based stopping criterion, which stops active learning when the classifier is sufficiently confident in its classification of the remaining unlabeled data. Vlachos [31] proposed a stopping criterion that stops active learning when the model confidence consistently drops, using a separate outside dataset to estimate the classifier's confidence. The key issue is how to identify a consistent drop in the confidence of a classifier during the active-learning process; otherwise, a local maximum of the confidence curve may be found. Zhu et al. [35] proposed several alternatives for estimating confidence, such as Max-confidence, Overall-uncertainty, Min-error, and Selected-accuracy. However, these confidence-based stopping criteria are specific to a set of probabilistic learning models, which limits their applicability.

The learning strategy-specific stopping criteria are designed for specific active-learning strategies. Within the QBC framework, Tomanek et al. [28] utilized the disagreement rate among the member classifiers to design a stopping criterion. However, this method requires an ensemble to calculate the disagreement on each unlabeled example; thus, it is not readily applicable to other sampling strategies.
Considering the limitations of existing stopping criteria, we propose a general stopping-criterion framework unconstrained by specific models, specific active-learning strategies, and so on. This work extends our previous work [33] by providing solid theoretical justification. Next, we introduce the framework of the stability-based stopping criterion.

3 THE FRAMEWORK OF STABILITY-BASED STOPPING CRITERION

In machine learning, the ultimate goal is to learn a model with good generalization performance on future unseen data. The model's generalization capability changes if and only if the model changes; thus, examples that cannot change the model are actually useless for active learning. Therefore, we believe that model stability is a reasonable indicator for stopping active learning.

Model stability can be measured by the model parameter change caused by querying one more candidate example x^+ and adding it to the training set. We define the model parameter change as

    C_\theta(x^+) = \|\theta - \theta^+\|,    (1)

where \theta and \theta^+ denote the current model parameter and the updated parameter learned from the expanded training set, respectively. Then, our model stability–based stopping criterion (denoted SC_MS hereafter) stops the active-learning process when the following condition is satisfied:

    C_\theta(x^+) \le \eta, \quad \forall x^+ \in U,    (2)

where \eta is a predefined threshold and U stands for the unlabeled pool set. The main problem here is how to calculate the parameter change C_\theta(x^+) (i.e., \|\theta - \theta^+\|). In the following, we present our method based on the stochastic gradient update rule.

We start with the well-known Empirical Risk Minimization (ERM) principle. Given a training set D = \{(x_i, y_i)\}_{i=1}^{m}, where x_i \in R^d is a d-dimensional feature vector and y_i is a class label drawn i.i.d.
from a certain underlying distribution, the empirical risk on the training set is

    \hat{R}(D) = \sum_{i=1}^{m} L_\theta[f(x_i), y_i],    (3)

where f(x_i) is the predicted label and L_\theta[\cdot] denotes a given loss function parameterized by \theta. To minimize the empirical error, a widely used search method is the stochastic gradient update rule, which updates the parameter \theta repeatedly according to the negative gradient of the loss function with regard to each training example:

    \theta \leftarrow \theta - \alpha \frac{\partial L_\theta(x_i)}{\partial \theta}, \quad i = 1, 2, \ldots, m,    (4)

where \alpha is the learning rate.

Here, we consider the stochastic gradient update rule in the active-learning case. If the candidate example x^+ is added to the training set with a given class label y^+, the empirical error on the expanded training set D^+ = D \cup \{(x^+, y^+)\} becomes

    \hat{R}(D^+) = \sum_{i=1}^{m} L_\theta[f(x_i), y_i] + L_\theta[f(x^+), y^+].    (5)

Consequently, the parameter changes due to the inclusion of the new example (x^+, y^+). According to the update rule in Equation (4), we can approximate the parameter change by the derivative of the loss function at the candidate example (x^+, y^+). Let G_\theta(x^+) = \frac{\partial L_\theta(x^+)}{\partial \theta}; then we have

    C_\theta(x^+) = \|\theta - \theta^+\| = \|\alpha G_\theta(x^+)\|.    (6)

In practice, since the true class label y^+ of the example x^+ is unknown in advance, we cannot calculate the derivative G_\theta(x^+) in Equation (6) directly. Instead, we use the expectation over all possible labels y^+ \in Y to estimate the true parameter change:

    C_\theta(x^+) = E_{y^+} \|\alpha G_\theta(x^+)\| = \sum_{y^+ \in Y} p(y^+ | x^+) \left\| \alpha \frac{\partial L_\theta(x^+)}{\partial \theta} \right\|,    (7)

where p(y^+ | x^+) is the conditional probability of label y^+ given the instance x^+, which can be estimated by the current model. Thus, we stop active learning when the following inequality is satisfied:

    E_{y^+} \|G_\theta(x^+)\| \le \eta, \quad \forall x^+ \in U,    (8)

where \alpha is treated as a constant and integrated into \eta.

4 STABILITY-BASED STOPPING CRITERION FOR CLASSIFICATION

In this section, we apply the proposed method to binary classification tasks.
We first introduce the two classifiers used in this study, LR and SVMs, and the two active sampling methods, US and EMC. Then, the details of SC_MS for the two base models are derived. Finally, we analyze the computational complexity of our method.

4.1 Classification Models

4.1.1 Logistic Regression. Logistic regression [2], which can be viewed as arising from a Bernoulli model, is usually formulated as f(x) = \frac{1}{1 + \exp(-\theta^T x)}, where \theta is the parameter vector characterizing the model. Given a dataset D = \{(x_i, y_i)\}_{i=1}^{m} with y_i \in Y_{LR} = \{0, 1\}, the conditional probability of assigning a class label y \in Y_{LR} to the data point x is

    p(y | x; \theta) = f(x)^y (1 - f(x))^{1-y}.    (9)

As a result, building an LR model amounts to maximizing the likelihood L_\theta(D) = \prod_{i=1}^{m} p(y_i | x_i; \theta). Equivalently, we minimize the following logistic loss:

    \mathcal{L}_\theta(D) = -\log L_\theta(D) = \sum_{i=1}^{m} -y_i \log f(x_i) - (1 - y_i) \log(1 - f(x_i)),    (10)

which is a commonly adopted convention.

4.1.2 Support Vector Machines. For SVMs [6], it is common to denote the class labels by Y_{SVM} = \{-1, 1\}. The linear SVM model can be represented by a hyperplane f(x) = w^T x + b = 0, where w is the weight vector parameterizing the classifier. For simplicity, we omit the bias term b throughout this study. Constructing an SVM classifier amounts to solving the Quadratic Programming (QP) problem

    \min_w \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i
    \quad \text{s.t.} \; y_i w^T x_i \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, m,    (11)

where \xi_i is a slack variable and C is the penalty parameter.

4.2 Active Learning Methods

In this section, we introduce the active sampling methods used in our study: US and EMC. Note that our stopping criterion can be combined with any other active-learning strategy.

4.2.1 Uncertainty Sampling.
In a US [15] framework, an active learner queries the instances about which it is least certain, which is often straightforward for probabilistic learning models. For binary classification, US simply selects the instance whose posterior probability is nearest to 0.5:

    x^*_{US} = \arg\min_{x \in U} \; p(y = \hat{y} | x; \theta),    (12)

where x^*_{US} is the instance selected by US, U is the unlabeled pool set, \theta is the model parameter vector, and \hat{y} is the predicted label of x.

4.2.2 Expected Model Change. EMC [22] selects the instance that would impart the greatest change to the current model if it were labeled and added to the training set D. The sampling function of EMC is defined as

    x^*_{EMC} = \arg\max_{x \in U} \; \sum_{k} p(y_k | x; \theta) \left\| \frac{\partial l_\theta((x, y_k))}{\partial \theta} \right\|,    (13)

where x^*_{EMC} is the instance queried by EMC, l_\theta is the loss function of the model \theta, and y_k is the kth possible label of x.

Discussion. Our proposed model stability–based stopping criterion SC_MS is related to the EMC active-learning algorithm, as both measure the model-changing ability of examples. However, they differ in motivation and generality: (1) EMC aims to choose the examples maximizing the model parameter change for learning, while SC_MS focuses on stopping the active-learning process when the changing ability of all unlabeled instances is less than a threshold. (2) EMC is restricted to an LR model, whereas we view this measurement as a general framework and generalize it to several important kinds of learning models, including SVMs, kernelized models, and nongradient models.
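Before specializing the criterion to particular classifiers, the generic stopping check of Equations (7)–(8) can be sketched in a few lines of Python. This is a minimal sketch of our own, not the authors' implementation: the function name `should_stop` and the callbacks `predict_proba` and `grad_loss` are hypothetical stand-ins for the model-specific probability estimate p(y^+ | x^+) and per-example loss gradient G_\theta(x^+).

```python
import numpy as np

def should_stop(pool, theta, predict_proba, grad_loss, labels, eta):
    """Model stability-based stopping check (Equation (8)).

    pool:          iterable of unlabeled feature vectors x+
    predict_proba: (x, y, theta) -> p(y | x; theta), from the current model
    grad_loss:     (x, y, theta) -> gradient of the loss at (x, y)
    labels:        the set of possible labels Y
    eta:           predefined stopping threshold
    Returns True when every remaining example's expected change is at most eta.
    """
    for x in pool:
        # Expected gradient norm over all possible labels (Equation (7)).
        expected_change = sum(
            predict_proba(x, y, theta) * np.linalg.norm(grad_loss(x, y, theta))
            for y in labels
        )
        if expected_change > eta:
            return False  # some example can still change the model
    return True
```

Any sampling strategy (US or EMC) can be used to grow the labeled set between successive calls of this check; only the two callbacks depend on the base model.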
4.3 Stopping Criterion for Logistic Regression

Assume that a candidate example x^+ with a given label y^+ is added to the training set. The likelihood on the expanded training set D^+ = D \cup \{(x^+, y^+)\} becomes L_\theta(D^+) = \prod_{i=1}^{m} p(y_i | x_i; \theta) \, p(y^+ | x^+; \theta). Thus, the logistic loss becomes

    \mathcal{L}_\theta(D^+) = -\log L_\theta(D) - \log L_\theta(x^+)
        = \sum_{i=1}^{m} -y_i \log f(x_i) - (1 - y_i) \log(1 - f(x_i))
          - y^+ \log f(x^+) - (1 - y^+) \log(1 - f(x^+)).    (14)

The derivative of the logistic loss \mathcal{L}_\theta(x^+) at the candidate example (x^+, y^+) is

    G_\theta(x^+) = \frac{\partial \mathcal{L}_\theta(x^+)}{\partial \theta} = (f(x^+) - y^+) x^+.    (15)

Since the true class label y^+ is unknown before querying, we employ the current classifier to estimate the label distribution over y^+ \in Y_{LR}, and then use the expectation to approximate the true parameter change. Therefore, for LR, the expected model change is

    E_{y^+} \|G_\theta(x^+)\| = \sum_{y^+ \in Y_{LR}} p(y^+ | x^+; \theta) \, \|(f(x^+) - y^+) x^+\|,    (16)

and active learning ceases when Equation (8) is satisfied.

The SC_MS for LR can be interpreted as follows. The parameter change is proportional to the error term (f(x^+) - y^+). Hence, if the classifier is good enough to predict accurately, the changing ability of the instances will be small, indicating that the model is stable. The corresponding pseudocode of SC_MS for LR is presented in Algorithm 1.

4.4 Stopping Criterion for Support Vector Machines

In this section, we derive the SC_MS for SVMs. Due to the constraints in the above QP optimization problem, we cannot directly apply the stochastic gradient update rule to SVMs. To address this problem, we rewrite the QP problem as an unconstrained problem by rearranging the constraints and substituting C = \frac{1}{\lambda}:

    \min_w \; \frac{\lambda}{2} \|w\|^2 + \sum_{i=1}^{m} [1 - y_i w^T x_i]_+,    (17)

where the subscript "+" indicates the positive part, the first term is the regularizer, and the second term represents the standard hinge loss.
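The LR change measure of Equations (15)–(16) is cheap to evaluate in closed form. The following minimal NumPy sketch (our own illustration, not the authors' code; the function name `lr_expected_change` is ours) computes it for one candidate example:

```python
import numpy as np

def lr_expected_change(x, theta):
    """Expected model change of a candidate x under logistic regression.

    Computes E_{y+} ||G_theta(x+)|| = sum_y p(y | x; theta) * ||(f(x) - y) x||
    over the labels Y_LR = {0, 1} (Equations (15)-(16)).
    """
    f = 1.0 / (1.0 + np.exp(-theta @ x))   # f(x) = sigmoid(theta^T x)
    change = 0.0
    for y in (0, 1):
        p = f if y == 1 else 1.0 - f       # p(y | x; theta) from the current model
        change += p * np.linalg.norm((f - y) * x)
    return change
```

For binary LR the sum collapses to 2 f(x)(1 - f(x)) \|x\|, which matches the interpretation in the text: the expected change vanishes exactly when the classifier is confident about x.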
ALGORITHM 1: SC_MS for Logistic Regression
Require: The initial labeled set D, the unlabeled pool set U, the LR classifier \theta trained on D, and the predefined threshold \eta.
 1: for each x^+ \in U do
 2:   for each possible y^+ \in Y_LR do
 3:     Calculate p(y^+ | x^+; \theta) with the current classifier \theta.
 4:     Calculate the derivative of the loss \mathcal{L}_\theta(x^+) with regard to \theta via Equation (15).
 5:   end for
 6:   Estimate the model parameter change via Equation (16).
 7: end for
 8: if E_{y^+} \|G_\theta(x^+)\| \le \eta is satisfied for all x^+ \in U then
 9:   Stop the active-learning process.
10: else
11:   Query the most uncertain instances by Equation (12) if using US as the sampling algorithm.
12:   Query the instances changing the model most by Equation (13) if using EMC as the sampling algorithm.
13:   Update D, U, and \theta, and then go to step 1.
14: end if

In the literature, several stochastic gradient–based learning algorithms have been studied for solving the SVM optimization problem [13, 25]. Similarly, we consider the update rule in active learning. If a candidate point x^+ is added to the training set with a given label y^+, the objective function on the enlarged training set D^+ = D \cup \{(x^+, y^+)\} becomes

    \min_w \; \frac{\lambda}{2} \|w\|^2 + \sum_{i=1}^{m} [1 - y_i w^T x_i]_+ + [1 - y^+ w^T x^+]_+.    (18)

We estimate the effect of adding the new candidate point on the training loss to approximate the parameter change; hence, the derivative of the loss at the example (x^+, y^+) is

    G_w(x^+) = \frac{\partial L_w(x^+)}{\partial w}
             = \begin{cases} -y^+ x^+, & \text{if } y^+ w^T x^+ < 1, \\ 0, & \text{otherwise.} \end{cases}    (19)

This shows that the SVM updates its weights only on examples satisfying y^+ w^T x^+ < 1. Again, since the true label y^+ \in Y_{SVM} is unknown in advance, we rewrite the inequality condition y^+ w^T x^+ < 1 as |w^T x^+| < 1.
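Combining the subgradient in Equation (19) with this rewritten condition, the per-example change measure reduces to a simple margin test. A minimal sketch of our own (not the authors' code; the name `svm_expected_change` is ours):

```python
import numpy as np

def svm_expected_change(x, w):
    """Expected model change of a candidate x under a linear SVM.

    By Equation (19), the hinge-loss gradient at (x, y) is -y*x when
    y * w^T x < 1 and 0 otherwise. Taking |w^T x| < 1 as the active
    condition and the expectation over y in {-1, +1}, the expected
    change is ||x|| for points inside the margin and 0 for all others.
    """
    if abs(w @ x) < 1.0:   # inside the margin: the point can change the model
        return float(np.linalg.norm(x))
    return 0.0             # outside the margin: no effect on the model
```

In other words, the check scans the pool for points inside the margin; learning stops when none remains with a large enough norm, which is consistent with the margin-exhaustion criterion of Schohn and Cohn [21] cited in the text.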
Meanwhile, we take the expectation over all possible labels y^+ \in Y_{SVM} to approximate the true parameter change. Thus, for x^+ \in U such that |w^T x^+| < 1, the expected model change is

    E_{y^+} \|G_w(x^+)\| = \sum_{y^+ \in Y_{SVM}} p(y^+ | x^+; w) \, \|{-y^+ x^+}\| = \|x^+\|,    (20)

and for all other x^+ \in U, E_{y^+} \|G_w(x^+)\| = 0. We stop active learning when Equation (8) is satisfied. A natural interpretation is that the examples within the margin, that is, those with |w^T x^+| < 1, are the ones with the ability to change the SVM. Thus, an SVM classifier is sufficiently stable when no data examples lie in the margin, which yields good generalization performance. Moreover, our SC_MS for SVMs can be regarded as a general case of previous work [21], which stops active learning when there are no unlabeled points within the margin. The corresponding pseudocode of SC_MS for SVMs is presented in Algorithm 2.

ALGORITHM 2: SC_MS for Support Vector Machines
Require: The initial labeled set D, the unlabeled pool set U, the current SVM classifier w trained on D, and the predefined threshold \eta.
 1: for each x^+ \in U do
 2:   if |w^T x^+| < 1 then
 3:     Estimate the model parameter change via Equation (20).
 4:   else
 5:     The model parameter change is 0.
 6:   end if
 7: end for
 8: if E_{y^+} \|G_w(x^+)\| \le \eta is satisfied for all x^+ \in U then
 9:   Stop the active-learning process.
10: else
11:   Query the most uncertain instances by Equation (12) if using US as the sampling algorithm.
12:   Query the instances changing the model most by Equation (13) if using EMC as the sampling algorithm.
13:   Update D, U, and w, and then go to step 1.
14: end if

4.5 Computation Complexity

Since we focus on the stopping criterion, we only consider the computation time taken by our SC_MS. Suppose that there are N examples of d dimensions in the pool set U.
For LR, the computational cost of SC_MS mainly comes from computing the model parameter change, which takes O(Nd) time. For SVMs, the computation has two parts: calculating |w^T x^+| in O(Nd) time and computing the model parameter change \|x^+\|, which takes O(kd) time if k instances satisfy |w^T x^+| < 1; the total time is therefore O((N + k)d). Thus, for both LR and SVMs, the computational complexity of SC_MS is linear in the size of the unlabeled pool set, which is promising in practice.

5 THEORETICAL ANALYSIS

In this section, we theoretically analyze the proposed method, concentrating on the following two questions: (1) How stable is the model obtained when active learning stops according to our SC_MS? (2) How good is the generalization ability of the model obtained when active learning stops according to our SC_MS? In the following, we try to answer these questions, but first we introduce some notation and useful tools.

5.1 Notations

We have a space of points X and a set of labels Y. Consider a dataset Q = D \cup U drawn from an unknown distribution P; since Q is usually very large in practice, we can deem it an approximation of the distribution P. Suppose that (n - m) instances are selected during active learning by a given sampling strategy. When active learning stops, we have a labeled set S of size n in Z = X \times Y,

    S = D \cup \{z_{m+1}, \ldots, z_n\} = \{z_1, \ldots, z_m, z_{m+1}, \ldots, z_n\},    (21)

and an unlabeled pool set U^- = U \setminus \{z_{m+1}, \ldots, z_n\} of size N - (n - m), where N is the size of U and z_i = (x_i, y_i). As suggested by [5], we consider the randomness of the initial labeled set D;
then, if we obtain a training set S of size n when active learning stops according to our SC_MS, the threshold \eta_S will differ for different initial sets D. We denote the maximum of all these \eta_S by \eta_{max}, namely, \eta_{max} = \max_S \eta_S. To support the later derivations, we further define the following two sets:

(1) By adding one more instance z^+, randomly selected from U^-, to the training set S:

    S^+ = \{z_1, \ldots, z_{i-1}, z_i, z_{i+1}, \ldots, z_n, z^+\},    (22)

(2) By replacing the ith element of S with z^+:

    S^i = \{z_1, \ldots, z_{i-1}, z^+, z_{i+1}, \ldots, z_n\},    (23)

which are useful in our analysis. Since unlabeled data can usually be collected easily in practical applications, we assume that U is extremely large, so that after removing S from Q, the remaining set U^- can still be seen as an approximation of P.

Let l(\theta_S, z) be the loss function of the model \theta_S learned from the set S, and suppose that 0 \le l(\theta_S, z) \le M holds for all z \in Z and all sets S, where M is a constant. Let R = E_{z \sim P}[l(\theta_S, z)] be the generalization error of \theta_S with regard to the distribution P; we abbreviate z \sim P as z hereafter. Since P is an unknown distribution and we cannot compute R directly, we estimate R by the empirical error R_{emp} = \frac{1}{n} \sum_{i=1}^{n} l(\theta_S, z_i). Furthermore, since active learning selects examples by specific strategies, the examples in S are not i.i.d. with regard to the underlying distribution P. Approximately, however, they can be seen as i.i.d. with regard to an unknown distribution A. We define the generalization error of \theta_S on the distribution A as R_A = E_{z \sim A}[l(\theta_S, z)].

5.2 Main Tools

Here are the main tools we will use in our later analysis.

Lemma 5.1 (Lipschitz Condition [5]). A function f is said to satisfy the Lipschitz condition if there exists a constant t such that

    |f(x_1, y) - f(x_2, y)| \le t |x_1 - x_2|.    (24)

If \frac{\partial f(x, y)}{\partial x} exists and is bounded for all x, then f(x, y) satisfies the Lipschitz condition.
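As a sanity check (our own illustration, not part of the original analysis), the logistic loss of Section 4.1 satisfies this condition in its parameter argument, so the assumption is not vacuous for the models studied here:

```latex
% Lipschitz constant of the logistic loss in \theta (illustrative):
% l(\theta, z) = -y \log f(x) - (1 - y)\log(1 - f(x)), \quad f(x) = \sigma(\theta^T x).
\left\| \frac{\partial l(\theta, z)}{\partial \theta} \right\|
  = \|(f(x) - y)\, x\| \le \|x\|,
% since |f(x) - y| \le 1 for y \in \{0, 1\}.  By the mean value theorem,
% |l(\theta_1, z) - l(\theta_2, z)| \le t \, \|\theta_1 - \theta_2\|
% with t = \sup_{x} \|x\|, so bounded features imply the Lipschitz condition.
```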
Lemma 5.2. Let \theta_{S^+} and \theta_{S^i} be the models estimated with S^+ and S^i, respectively. Then, according to our SC_MS, the following hold:

    \|\theta_{S^+} - \theta_S\| = \left\| \frac{\partial l(\theta_S, z^+)}{\partial \theta} \right\| \le \eta_S
    \quad \text{and} \quad
    \|\theta_{S^+} - \theta_{S^i}\| = \left\| \frac{\partial l(\theta_{S^i}, z_i)}{\partial \theta} \right\| \le \eta_{max}.    (25)

The second equation in the above lemma holds because S^+ can be seen as the dataset constructed by adding z_i to S^i, which in turn can be deemed a set obtained by our SC_MS from some initial set D.

Lemma 5.3. Let l satisfy the Lipschitz condition. Then, for all i \in \{1, 2, \ldots, n\}, we have

    |l(\theta_{S^+}, z) - l(\theta_S, z)| \le t \|\theta_{S^+} - \theta_S\| \le t \eta_S,    (26)

and

    |l(\theta_{S^+}, z) - l(\theta_{S^i}, z)| \le t \|\theta_{S^+} - \theta_{S^i}\| \le t \eta_{max}.    (27)

Lemma 5.4. For any symmetric learning algorithm and all i \in \{1, 2, \ldots, n\}, we have

    E_S[R - R_{emp}] = E_{S, z^+ \sim A}[l(\theta_S, z^+) - l(\theta_{S^i}, z^+)] + \nu,    (28)

where E_S[R] = E_S[R_A] + \nu and \nu is a bias term that equals zero when the distribution A is the same as P.

Lemma 5.5. For any learning algorithm and loss function l such that 0 \le l(\theta_S, z) \le M holds for all z \in Z and all sets S, the following inequality holds for all i \in \{1, 2, \ldots, n\}:

    E_S[(R - R_{emp})^2] \le \frac{M^2}{2n} + 3M E_{S, z^+ \sim A}[|l(\theta_S, z_i) - l(\theta_{S^i}, z_i)|]
        + 2\nu E_{S, z^+ \sim A}[l(\theta_S, z^+) - l(\theta_{S^i}, z^+)] + \nu^2.    (29)

The proofs of the above two lemmas are similar to [5], with E_S[R] replaced by E_S[R_A] + \nu; thus, we omit the details here. For simplicity, we will drop \nu in the following, which does not affect the behavior of the analysis.

Lemma 5.6 (Efron–Stein [4], [10]). Let S and S^i be defined as above. Let F : Z^n \to R be any measurable function; then the variance of F(S) satisfies

    Var(F(S)) = E_S[(F(S) - E_S[F(S)])^2] \le \frac{1}{2} \sum_{i=1}^{n} E_{S, z^+}[(F(S) - F(S^i))^2].    (30)

Lemma 5.7 (McDiarmid [4], [19]).
Let $S$ and $S^i$ be defined as above. Let $F: \mathcal{Z}^n \to \mathbb{R}$ be any measurable function for which there exist constants $c_i$ ($i = 1, 2, \ldots, n$) such that $\sup_{S \in \mathcal{Z}^n,\, z^+ \in \mathcal{Z}} |F(S) - F(S^i)| \le c_i$. Then,
\[ P_S\big[F(S) - \mathbb{E}_S[F(S)] \ge \epsilon\big] \le \exp\left( \frac{-2\epsilon^2}{\sum_{i=1}^{n} c_i^2} \right). \tag{31} \]

5.3 Stability Analysis

Stability measures how much the model can change if we add one more instance into the training set after active learning stops. Next, we analyze model stability by deriving an upper bound on the model change that accounts for the randomness of the initial labeled set. The derived upper bound indicates that the model reaches a stable status as the training set size increases.

Theorem 5.8. With a loss function $l$ such that $0 \le l(\theta_S, z) \le M$ for all $z \in \mathcal{Z}$ and all sets $S$ and the Lipschitz condition is satisfied, for any initial labeled set $D$, when active learning stops with a training set $S$ of size $n$ according to our SC$_{MS}$, adding one more instance into the training set changes the model parameters by at most
\[ \|\theta_{S^+} - \theta_S\| \le O\left( \frac{M + \sqrt{M^2 + n\tau}}{nt} \right), \tag{32} \]
where $\tau \ge 0$ is a slack term, and $\theta_S$ and $\theta_{S^+}$ are the models estimated with $S$ and $S^+$, respectively.

Proof. By Lemma 5.3, we can get the following inequality:
\[ |R - R^i| \le \mathbb{E}_z\big[|l(\theta_S, z) - l(\theta_{S^i}, z)|\big] \le \mathbb{E}_z\big[|l(\theta_S, z) - l(\theta_{S^+}, z)| + |l(\theta_{S^+}, z) - l(\theta_{S^i}, z)|\big] \le t(\eta_S + \eta_{\max}), \tag{33} \]
where $R^i$ is the generalization error of the model $\theta_{S^i}$ with regard to $P$. Similarly, we can obtain
\[ |R_{emp} - R^i_{emp}| \le \frac{1}{n} \sum_{j \ne i} |l(\theta_S, z_j) - l(\theta_{S^i}, z_j)| + \frac{1}{n} |l(\theta_S, z_i) - l(\theta_{S^i}, z^+)| \le \frac{1}{n} \sum_{j \ne i} \big( |l(\theta_S, z_j) - l(\theta_{S^+}, z_j)| + |l(\theta_{S^+}, z_j) - l(\theta_{S^i}, z_j)| \big) + \frac{M}{n} \le t(\eta_S + \eta_{\max}) + \frac{M}{n}, \tag{34} \]
where $R^i_{emp}$ is the empirical risk of the model $\theta_{S^i}$.
Combining Equation (33) and Equation (34), we have
\[ |R - R_{emp} - (R^i - R^i_{emp})| \le |R - R^i| + |R_{emp} - R^i_{emp}| \le 2t(\eta_S + \eta_{\max}) + \frac{M}{n}. \tag{35} \]
According to Lemma 5.6, let $F(S) = R - R_{emp}$; then we can bound the variance of $R - R_{emp}$:
\[ \mathrm{Var}(R - R_{emp}) \le \frac{1}{2} \sum_{i=1}^{n} \mathbb{E}_{S, z^+}\left[ \left( 2t(\eta_S + \eta_{\max}) + \frac{M}{n} \right)^2 \right] \le 8nt^2\eta_{\max}^2 + \frac{M^2}{2n} + 4Mt\eta_{\max}, \tag{36} \]
where the last step uses the fact that $\eta_S \le \eta_{\max}$ holds for all $S \in \mathcal{Z}^n$. Hence, there exists $\tau \ge 0$ such that
\[ \mathrm{Var}(R - R_{emp}) = 8nt^2\eta_{\max}^2 + \frac{M^2}{2n} + 4Mt\eta_{\max} - \tau. \tag{37} \]
By Lemma 5.5, we have the following inequality:
\[ \mathbb{E}_S[(R - R_{emp})^2] \le \frac{M^2}{2n} + 3M\,\mathbb{E}_{S, z^+ \sim A}[t(\eta_S + \eta_{\max})] \le \frac{M^2}{2n} + 6Mt\eta_{\max}. \tag{38} \]
Since $\mathrm{Var}(R - R_{emp}) \le \mathbb{E}_S[(R - R_{emp})^2]$, we get the following inequality:
\[ 8nt^2\eta_{\max}^2 + \frac{M^2}{2n} + 4Mt\eta_{\max} - \tau \le \frac{M^2}{2n} + 6Mt\eta_{\max}, \tag{39} \]
which simplifies to the quadratic inequality $8nt^2\eta_{\max}^2 - 2Mt\eta_{\max} - \tau \le 0$, whose positive root gives $\eta_{\max} \le (M + \sqrt{M^2 + 8n\tau})/(8nt)$. Furthermore, according to our SC$_{MS}$ and Equation (25), we have the following upper bound for the model change:
\[ \|\theta_{S^+} - \theta_S\| \le \frac{M + \sqrt{M^2 + 8n\tau}}{8nt} = O\left( \frac{M + \sqrt{M^2 + n\tau}}{nt} \right). \tag{40} \]

Based on the analysis provided above, we obtain the following two conclusions. (1) Given an initial training set $D$, when active learning stops with a training set $S$ of size $n$ according to our SC$_{MS}$, the model can change by at most $O\big((M + \sqrt{M^2 + n\tau})/(nt)\big)$ when one more instance is added to the training set. (2) The upper bound is negatively related to $n$, indicating that the model becomes increasingly stable as the training set grows. In addition, the bound also guarantees good generalization performance, as proved in the next section.

5.4 Generalization Error Analysis

Generalization ability refers to the model's performance on future unseen data. In this section, we give an exponential bound for the generalization error of the model suggested by our SC$_{MS}$, in the spirit of Bousquet [5] and Boucheron [4].

Theorem 5.9.
With a loss function $l$ such that $0 \le l(\theta_S, z) \le M$ for all $z \in \mathcal{Z}$ and all sets $S$ and the Lipschitz condition is satisfied, for any initial labeled set $D$, when active learning stops with a training set $S$ of size $n$ according to our SC$_{MS}$, the following generalization error bound holds with probability at least $1 - \delta$:
\[ R \le R_{emp} + O\left( \left( M + \sqrt{M^2 + n\tau} \right) \left( \frac{1}{n} + \sqrt{\frac{1}{n} \log\frac{1}{\delta}} \right) \right), \tag{41} \]
where $\tau \ge 0$ is a slack term.

Proof. According to Equation (35), $F(S) = R - R_{emp}$ meets the condition of Lemma 5.7 with $c_i = 4t\eta_{\max} + \frac{M}{n}$; then
\[ P_S\big[R - R_{emp} - \mathbb{E}_S[R - R_{emp}] \ge \epsilon\big] \le \delta, \tag{42} \]
where $\delta = \exp\big( -\frac{2n\epsilon^2}{(4nt\eta_{\max} + M)^2} \big)$. By Lemma 5.4, we can obtain $\mathbb{E}_S[R - R_{emp}] \le 2t\eta_{\max}$. Thus, the following inequality holds:
\[ P_S(R - R_{emp} - 2t\eta_{\max} \ge \epsilon) \le \delta, \tag{43} \]
which is equivalent to
\[ P_S(R - R_{emp} - 2t\eta_{\max} \le \epsilon) \ge 1 - \delta. \tag{44} \]
Solving the expression for $\delta$ gives $\epsilon = (4nt\eta_{\max} + M)\sqrt{\frac{1}{2n}\log\frac{1}{\delta}}$, and the following inequality holds with probability at least $1 - \delta$:
\[ R \le R_{emp} + 2t\eta_{\max} + (4nt\eta_{\max} + M)\sqrt{\frac{1}{2n}\log\frac{1}{\delta}} \le R_{emp} + \frac{M + \sqrt{M^2 + 8n\tau}}{4n} + \frac{3M + \sqrt{M^2 + 8n\tau}}{2}\sqrt{\frac{1}{2n}\log\frac{1}{\delta}} = R_{emp} + O\left( \left( M + \sqrt{M^2 + n\tau} \right)\left( \frac{1}{n} + \sqrt{\frac{1}{n}\log\frac{1}{\delta}} \right) \right). \tag{45} \]

Based on the above theoretical justification, the following two conclusions are derived. (1) The above theorem provides a generalization error bound for the model obtained when active learning stops with the training set $S$. (2) Optimizing model stability can guarantee generalization ability, as the upper bound on the model change is a constituent term of the generalization error bound and the two bounds are positively related.

6 EXPERIMENTS

In this section, we carry out substantial experiments to validate the effectiveness of our proposed SC$_{MS}$.

6.1 Experiments on UCI Data Sets

6.1.1 Experimental Setup.
To validate the effectiveness of the proposed strategy, we conduct experiments on nine benchmark datasets from various domains in the UCI machine-learning repository¹: Biodeg, Firm-Teacher-Clave-Direction-Classification, Ionosphere, Spambase, WDBC, and Letter. Since Letter is a multiclass dataset, we select four pairs of letters (i.e., D-vs-P, E-vs-F, M-vs-N, U-vs-V) that are relatively difficult to distinguish and build a binary-class dataset for each pair. Similarly, we select the reverse-clave and forward-clave classes from Firm-Teacher-Clave-Direction-Classification to build the dataset Clave. Each dataset is randomly divided into three disjoint subsets: the initial labeled set D (5%) to train the initial model, the unlabeled pool set U (75%) from which to select the most informative examples, and the test set T (20%) to evaluate the different stopping criteria. The features are normalized as suggested by [8]. Two learning models are used as the base learner: LR and SVMs. The active-learning algorithms used here are US and EMC. In this study, 1% of the whole set of examples is selected in each learning iteration. To avoid random fluctuation, each experiment is repeated 20 times with random initialization and the averaged results are reported. In each experiment, all stopping criteria use the same initialization for fair comparison. For the parameter η, we find its optimal value by simulating the active-learning process using the labeled set. Since the size of the initial labeled set is limited, we enlarge it by assigning pseudolabels to instances in the pool set whose labels are highly certain under the current classifier, inspired by [18]. We then separate the enlarged labeled set into D′, U′, and T′, with the same proportions as in the real learning process, and simulate the active-learning process 20 times with random initialization. Finally, we set the threshold to the value with the best averaged performance.
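This threshold search can be sketched as follows. The sketch is illustrative only: `simulate_run` is a toy stand-in for one simulated active-learning pass on the enlarged D′/U′/T′ split (here it just returns a noisy accuracy that happens to peak near η = 1.0), not the paper's actual simulation:

```python
import random

# Toy stand-in for one simulated active-learning pass at threshold eta.
# In the real procedure, this would run active learning on D'/U' and
# report accuracy on T'; here it returns a noisy synthetic accuracy.
def simulate_run(eta, rng):
    return 0.9 - 0.05 * abs(eta - 1.0) + rng.gauss(0.0, 0.01)

def select_threshold(candidates, n_runs=20, seed=0):
    """Pick the threshold whose accuracy, averaged over n_runs
    simulated active-learning passes, is the highest."""
    rng = random.Random(seed)

    def avg_acc(eta):
        return sum(simulate_run(eta, rng) for _ in range(n_runs)) / n_runs

    return max(candidates, key=avg_acc)

best_eta = select_threshold([0.2, 0.5, 1.0, 1.5, 2.0, 2.5])
```

Averaging over 20 runs mirrors the repetition used in the experiments and keeps the selection robust to the randomness of initialization.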
We set the threshold candidates as [0.2, 0.5, 1.0, 1.5, 2.0, 2.5].

6.1.2 Implementation. An initial model is built with the seed training set. Then, we calculate the stopping-criterion score for the unlabeled examples and check whether the stopping criterion is satisfied. If it is, we cease the active-learning process. Otherwise, an informativeness score is computed for each unlabeled instance according to the active-learning algorithm in use (US or EMC in this study). Then, the top-k instances with the highest scores are selected and added into the training set with their true labels. In this study, we use the greedy batch mode, and k is set to 1% of the size of the whole dataset. The above process is repeated until the stopping criterion is satisfied or the whole unlabeled set has been added into the training set.

6.1.3 Comparison Methods. To test the effectiveness of our model stability–based stopping criterion (MS), we compare it with the following six state-of-the-art competitors: (a) the stop-set prediction-based stopping criterion (SP), (b) the gradient-based stopping criterion (GRAD), (c) overall uncertainty (OU), (d) max confidence (MC), (e) min error (ME), and (f) QUIRE. The experimental settings for these competitors are as follows. (1) For SP, Bloodgood [3] uses the Kappa statistic to measure the agreement among the most recently learned models. We follow their settings in our reimplementation of this method. (2) For GRAD, we use a window of size 100 and a threshold of 0.00005, as suggested by Laws [14]. (3) For OU, MC, and ME, Zhu [35] shows that a start threshold of 0.1 with the threshold-update technique performs best. Thus, we follow their setting in our implementation.

¹ http://archive.ics.uci.edu/ml/.
(4) For QUIRE [12], we stop the active-learning process when the informativeness and representativeness scores of all unlabeled examples are lower than a threshold that is automatically set following the principle of cross-validation. Since QUIRE is only applicable to SVMs, we compare with it only for SVMs. Because some of these methods are specific to probabilistic models, we use the sigmoid function to transform the SVM outputs into posterior probabilities, as suggested by [20].

6.1.4 Evaluation Metrics. In active-learning scenarios, when the model first reaches its highest performance (the highest classification accuracy in this study), it is suggested that the process can stop. We refer to this point as the Highest Performance Point (HPP). With the HPP, we use the following two metrics [35] to evaluate each stopping criterion. (a) The difference between the training set size at the HPP (denoted by Size_HPP) and the size at the stopping point suggested by a given stopping criterion (denoted by Size_SC):

ΔSize = Size_SC − Size_HPP.  (46)

The closer ΔSize is to zero, the better the stopping criterion. (b) The difference between the highest performance (denoted by Acc_HPP) and the performance obtained at the stopping point of a stopping criterion (denoted by Acc_SC):

ΔAcc = Acc_HPP − Acc_SC.  (47)

The smaller ΔAcc is, the better the stopping criterion. Combining the above two metrics: among the stopping criteria that obtain the highest accuracy, the one with the smaller ΔSize is the better.

6.1.5 Experimental Results with Uncertainty Sampling Strategy. The comparison results of ΔSize and ΔAcc for the seven criteria, taking US as the active-learning method, are presented in Table 1 and Table 2 for LR and SVMs, respectively. The values with the best performance are highlighted in bold.
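The two metrics above reduce to a few lines of code (a sketch in our own notation; `curve` maps training-set size to test accuracy):

```python
def evaluate_stopping_point(curve, stop_size):
    """Score a stopping point against an active-learning curve.

    The Highest Performance Point (HPP) is the smallest training-set
    size that reaches the peak accuracy; the function returns
    (delta_size, delta_acc) as in Equations (46) and (47).
    """
    acc_hpp = max(curve.values())
    size_hpp = min(s for s, a in curve.items() if a == acc_hpp)
    return stop_size - size_hpp, acc_hpp - curve[stop_size]
```

For example, on a curve that peaks at 90% accuracy with 200 training examples, a criterion stopping at 300 examples with 88% accuracy scores ΔSize = 100 and ΔAcc = 2%.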
“Initial” denotes the initial settings before selecting instances, and “All” shows the results after labeling the whole unlabeled set U and adding it into the training set. Before comparing the results, it is necessary to stress the following three facts. First, all results we present are averages; therefore, two methods can sometimes have very similar ΔSize but very different average classification accuracy. Second, some stopping points do not lie on their active-learning curves, since the results are averages. Third, a model trained with more data may have lower classification accuracy, a phenomenon called “less is more” in active learning, first reported by Schohn in [21]. As shown in these tables, MS performs the best among these methods in most cases, demonstrating its effectiveness. SP performs poorly in most cases. GRAD achieves a small ΔAcc on some datasets but also a relatively large ΔSize, possibly because the uncertainty or performance does not stabilize until later iterations on these datasets. OU and MC perform well in terms of ΔAcc but poorly in terms of ΔSize; the relatively large ΔSize indicates that OU and MC tend to stop active learning at later iterations in pursuit of high performance. ME performs well in terms of ΔSize but poorly in terms of ΔAcc, and its ΔSize is below zero in many cases, indicating that it stops before some informative instances are selected, which results in low performance. QUIRE performs well on some datasets but not consistently. MS itself sometimes performs less well. A possible reason is that the model change is computed as an approximation, that is, the true change is approximated by an expectation over all possible labels, which may hurt performance in some cases. In addition, we present two intuitive examples of active-learning curves with stopping points based on the US sampling strategy in Figure 2.
As observed, MS stops at the point that achieves

Table 1. Results of ΔSize and ΔAcc on UCI Datasets for LR with US Strategy

Dataset     Metric  Initial   HPP      All     |  MS       SP       GRAD     OU       MC       ME
Biodeg      Size    52.0      317.1    843.0   |  -84.2    187.0    27.8     360.3    437.3    134.8
            Acc     68.87%    87.5%    84.72%  |  12.31%   1.89%    3.23%    2.64%    2.81%    1.67%
Clave       Size    430.0     1694.2   6880.0  |  -541.8   -713.8   3285.2   -619.2   550.4    -1264.2
            Acc     95.71%    96.68%   96.54%  |  0.14%    0.18%    0.18%    0.15%    0.16%    0.97%
Ionosphere  Size    17.0      123.4    280.0   |  52.6     -8.6     68.0     141.4    148.5    31.0
            Acc     70.21%    89.86%   87.11%  |  2.32%    3.38%    2.68%    2.89%    3.1%     2.46%
Spambase    Size    230.0     867.9    3680.0  |  122.7    318.9    953.7    1576.3   2014.8   555.1
            Acc     77.63%    92.53%   91.44%  |  0.37%    0.57%    1.64%    0.88%    0.93%    0.78%
WDBC        Size    28.0      83.5     454.0   |  39.0     54.0     370.5    127.5    220.5    42.0
            Acc     88.7%     96.96%   95.0%   |  1.3%     1.74%    1.96%    1.3%     1.96%    1.74%
D-vs-P      Size    80.0      174.4    1286.0  |  51.2     36.8     1111.0   147.2    356.8    4.8
            Acc     95.31%    98.85%   98.7%   |  0.09%    0.19%    0.16%    0.09%    0.09%    0.34%
E-vs-F      Size    77.0      224.0    1234.0  |  0.0      6.0      903.0    66.0     276.0    -36.0
            Acc     94.56%    97.73%   97.48%  |  0.19%    0.32%    0.26%    0.39%    0.32%    0.45%
M-vs-N      Size    78.0      247.6    1259.0  |  24.0     -67.2    1011.4   328.0    558.4    -30.4
            Acc     93.58%    97.18%   96.84%  |  0.32%    0.6%     0.35%    0.44%    0.41%    0.66%
U-vs-V      Size    78.0      158.0    1260.0  |  42.0     44.0     1102.0   206.0    410.0    28.0
            Acc     95.62%    98.86%   98.62%  |  0.12%    0.12%    0.24%    0.16%    0.2%     0.2%

For the comparison methods (MS through ME), the Size row reports ΔSize and the Acc row reports ΔAcc.
Note: The thresholds of MS are 0.5, 1.0, 0.5, 0.5, 1.0, 0.5, 0.5, 1.0, and 0.5, respectively.

Table 2.
Results of ΔSize and ΔAcc on UCI Datasets for SVMs with US Strategy

Dataset     Metric  Initial   HPP      All     |  MS       SP       GRAD     OU      MC      ME       QUIRE
Biodeg      Size    52.0      393.0    843.0   |  26.4     26.4     324.8    140.8   299.2   -44.0    -211.0
            Acc     73.87%    88.3%    86.7%   |  1.04%    1.04%    1.98%    1.51%   1.32%   1.7%     2.45%
Clave       Size    430.0     1204.0   6880.0  |  -172.0   -103.2   3113.2   -258.0  894.4   -774.0   34.0
            Acc     96.15%    96.76%   96.69%  |  0.06%    0.08%    0.08%    0.12%   0.08%   0.6%     0.08%
Ionosphere  Size    17.0      133.8    280.0   |  0.8      -8.0     87.0     73.2    118.0   -3.2     -101.0
            Acc     67.75%    88.87%   87.46%  |  2.82%    3.66%    1.83%    2.96%   1.83%   2.96%    10.28%
Spambase    Size    230.0     1104.0   3680.0  |  92.0     276.0    2438.0   368.0   1380.0  -92.0    -322.0
            Acc     86.21%    91.97%   90.88%  |  0.33%    0.54%    1.09%    0.33%   0.54%   0.98%    1.41%
WDBC        Size    28.0      57.4     454.0   |  70.2     51.6     325.8    38.4    79.2    34.8     37.0
            Acc     93.04%    98.17%   97.39%  |  0.61%    0.78%    0.78%    0.78%   0.7%    0.78%    1.13%
D-vs-P      Size    80.0      160.0    1286.0  |  152.0    76.8     692.8    44.8    144.0   -27.2    99.0
            Acc     96.55%    98.91%   98.54%  |  0.31%    0.4%     0.34%    0.4%    0.31%   1.49%    0.37%
E-vs-F      Size    77.0      192.5    1234.0  |  102.0    87.0     472.5    51.0    130.5   52.5     90.0
            Acc     94.24%    98.09%   97.48%  |  0.29%    0.52%    0.65%    0.49%   0.42%   0.71%    0.58%
M-vs-N      Size    78.0      193.2    1259.0  |  192.0    97.6     852.8    83.2    291.2   -28.8    53.0
            Acc     94.84%    97.69%   97.31%  |  0.32%    0.51%    0.38%    0.47%   0.41%   1.9%     0.32%
U-vs-V      Size    78.0      146.8    1260.0  |  102.4    38.4     528.0    24.0    100.8   -68.8    88.0
            Acc     96.85%    99.43%   99.31%  |  0.06%    0.19%    0.09%    0.16%   0.16%   2.59%    0.22%

For the comparison methods (MS through QUIRE), the Size row reports ΔSize and the Acc row reports ΔAcc.
Note: The thresholds of MS are 1.0, 1.0, 2.0, 1.0, 2.0, 2.0, 2.0, 2.0, and 2.0, respectively.

Fig. 2. Examples of active-learning curves with stopping points based on the US sampling strategy.
Fig. 3. Examples of active-learning curves with stopping points based on the EMC sampling strategy.
high accuracy with relatively few training examples, demonstrating that MS can stop active learning at an appropriate point. In summary, our MS method outperforms the state-of-the-art criteria on most datasets for both LR and SVMs with US, indicating the effectiveness of the proposed method. We attribute this to SP and GRAD being performance based: they suggest a stopping point when they observe stable performance on the stop set or on a window of recently selected instances. However, before some important informative unlabeled instances are added into the training set, the capability of the model will not change, that is, the model's performance on the stop set and on recently selected instances will appear stable. The OU method calculates the average uncertainty over all remaining unlabeled examples. However, even if some important unlabeled instances have high uncertainty, with a large pool set the average value will still be very small. MC considers the maximum uncertainty of the instances in the pool set, which is easily influenced by outliers. The performance of ME is inconsistent: it works well on some datasets but poorly on others, likely because the expected error estimation is highly correlated with the characteristics of the datasets. QUIRE is based on both informativeness and representativeness, which are hard to balance; therefore, its performance cannot be guaranteed.

6.1.6 Experimental Results with Expected Model Change Strategy. Table 3 and Table 4 present the results for LR and SVMs, taking EMC as the active-learning method. Again, MS outperforms the baselines most of the time, which demonstrates the effectiveness of our proposed method. In addition, we present two intuitive examples of active-learning curves with stopping points based on the EMC strategy in Figure 3. Similar results can be observed.

6.1.7 Efficiency Comparison.
Indeed, the time required by the whole active-learning process consists of two parts: the time cost of the active-learning algorithm and the time cost of the stopping

Table 3. Results of ΔSize and ΔAcc on UCI Datasets for LR with EMC Strategy

Dataset     Metric  Initial   HPP      All     |  MS       SP       GRAD     OU      MC      ME
Biodeg      Size    52.0      387.0    843.0   |  149.0    124.0    227.0    305.0   388.0   107.0
            Acc     72.78%    86.66%   84.3%   |  1.42%    1.75%    2.56%    2.16%   2.22%   1.89%
Clave       Size    430.0     1221.0   6880.0  |  -149.0   -235.0   4208.0   -120.0  1350.0  -791.0
            Acc     95.84%    96.65%   96.47%  |  0.17%    0.19%    0.19%    0.19%   0.18%   0.8%
Ionosphere  Size    17.0      154.0    280.0   |  30.0     -27.0    126.0    111.0   118.0   14.0
            Acc     69.08%    90.56%   88.38%  |  1.55%    5.28%    2.18%    2.39%   2.32%   2.11%
Spambase    Size    230.0     2052.0   3680.0  |  -1311.0  -741.0   1559.0   920.0   1615.0  -207.0
            Acc     79.83%    91.83%   91.26%  |  3.29%    1.03%    0.6%     0.61%   0.58%   0.24%
WDBC        Size    28.0      79.0     454.0   |  69.0     69.0     375.0    150.0   246.0   57.0
            Acc     85.65%    97.39%   95.22%  |  1.3%     1.3%     2.17%    1.74%   2.17%   1.74%
D-vs-P      Size    80.0      170.0    1286.0  |  -3.0     29.0     1116.0   154.0   381.0   64.0
            Acc     95.96%    98.82%   98.39%  |  0.12%    0.37%    0.43%    0.37%   0.43%   0.25%
E-vs-F      Size    77.0      190.0    1234.0  |  9.0      41.0     954.0    113.0   344.0   27.0
            Acc     95.6%     98.25%   97.83%  |  0.26%    0.58%    0.42%    0.39%   0.36%   0.65%
M-vs-N      Size    78.0      318.0    1259.0  |  -12.0    -80.0    941.0    256.0   512.0   -48.0
            Acc     93.59%    96.84%   96.6%   |  0.24%    0.79%    0.24%    0.32%   0.32%   0.55%
U-vs-V      Size    78.0      177.0    1260.0  |  13.0     26.0     1083.0   195.0   416.0   32.0
            Acc     94.45%    98.86%   98.68%  |  0.13%    0.19%    0.19%    0.19%   0.19%   0.25%

For the comparison methods (MS through ME), the Size row reports ΔSize and the Acc row reports ΔAcc.
Note: The thresholds of MS are 0.5, 1.0, 1.0, 0.2, 0.5, 1.0, 1.0, 0.5, and 0.5, respectively.

Table 4.
Results of ΔSize and ΔAcc on UCI Datasets for SVMs with EMC Strategy

Dataset     Metric  Initial   HPP      All     |  MS       SP       GRAD     OU      MC      ME       QUIRE
Biodeg      Size    52.0      470.0    843.0   |  -6.0     -23.0    -137.0   373.0   373.0   317.0    -351.0
            Acc     75.16%    86.91%   85.85%  |  0.55%    1.38%    2.87%    1.06%   1.06%   0.98%    8.41%
Clave       Size    430.0     2648.0   6880.0  |  -1437.0  -1573.0  -1358.0  104.0   4232.0  -2218.0  762.0
            Acc     96.06%    96.73%   96.54%  |  0.12%    0.16%    0.16%    0.17%   0.19%   0.66%    0.15%
Ionosphere  Size    17.0      107.0    280.0   |  59.0     69.0     52.0     173.0   173.0   139.0    -67.0
            Acc     70.25%    89.17%   87.15%  |  1.58%    1.85%    2.55%    2.02%   2.02%   2.29%    11.0%
Spambase    Size    230.0     2484.0   3680.0  |  -718.0   -1187.0  -340.0   1196.0  1196.0  -386.0   -1829.0
            Acc     85.17%    90.92%   90.51%  |  0.22%    1.8%     1.59%    0.41%   0.41%   0.24%    4.33%
WDBC        Size    28.0      83.0     454.0   |  4.0      15.0     157.0    134.0   361.0   371.0    9.0
            Acc     92.05%    98.01%   96.27%  |  0.37%    1.12%    1.12%    1.37%   1.49%   1.74%    0.62%
D-vs-P      Size    80.0      158.0    1286.0  |  72.0     16.0     192.0    98.0    979.0   -78.0    118.0
            Acc     96.77%    98.63%   98.48%  |  0.06%    0.31%    0.09%    0.12%   0.16%   1.86%    0.16%
E-vs-F      Size    77.0      316.0    1234.0  |  -46.0    -56.0    26.0     19.0    907.0   918.0    166.0
            Acc     94.15%    98.06%   97.39%  |  0.35%    0.86%    0.71%    0.5%    0.65%   0.67%    0.54%
M-vs-N      Size    78.0      272.0    1259.0  |  6.0      -10.0    119.0    170.0   978.0   65.0     -30.0
            Acc     94.83%    97.58%   97.12%  |  0.59%    0.57%    0.68%    0.51%   0.44%   1.69%    1.04%
U-vs-V      Size    78.0      180.0    1260.0  |  29.0     -27.0    246.0    684.0   338.0   489.0    109.0
            Acc     97.1%     99.31%   99.15%  |  0.09%    0.38%    0.19%    0.16%   0.16%   1.23%    0.09%

For the comparison methods (MS through QUIRE), the Size row reports ΔSize and the Acc row reports ΔAcc.
Note: The thresholds of MS are 1.5, 2.5, 1.0, 0.2, 1.5, 2.0, 2.0, 2.0, and 1.5, respectively.

Table 5. CPU Runtime (in Seconds) of the Entire Active-Learning Process for SVMs with US Strategy
Dataset     Exam. × Fea. (U)   MS       SP       GRAD     OU       MC       ME      QUIRE
Biodeg      1003 × 41          21.655   21.846   32.005   22.788   30.728   13.218  121.985
Clave       8170 × 16          40.035   30.44    232.971  20.34    84.902   0.006   1565.004
Ionosphere  334 × 34           5.507    5.632    7.351    6.846    8.452    4.355   9.893
Spambase    4371 × 58          138.743  106.336  364.799  103.159  233.793  55.686  955.127
WDBC        541 × 30           4.561    3.996    10.935   2.523    3.947    2.369   53.857
D-vs-P      1528 × 16          8.773    5.817    19.734   3.42     6.194    1.415   164.752
E-vs-F      1466 × 16          8.153    7.506    15.621   4.55     6.794    4.547   185.895
M-vs-N      1497 × 16          12.081   8.042    26.008   5.625    11.744   2.329   156.94
U-vs-V      1499 × 16          6.467    3.971    14.84    2.526    4.629    0.001   146.474

judgment. To validate the efficiency, we compare the averaged CPU runtime measured when the entire active-learning process stops. Due to space limitations, we present only the results for SVMs with the US strategy; similar observations hold in the other settings. Table 5 shows the results together with the size of the pool set U. As shown in Table 5, ME needs the least CPU time to stop active learning in most cases, owing to fewer learning iterations. MS, SP, GRAD, OU, and MC have similar time costs. QUIRE is relatively inefficient and costs much more time than the other six methods because it requires many complicated matrix computations during the stopping judgment.

6.2 Experiments on Image Categorization

6.2.1 Experimental Setup. In this section, we apply our method to a large-scale image dataset for image categorization: ImageNet². A detailed description can be found in [9]. For diversity, we construct the image-classification datasets at three levels of the mammal subtree, representing three different perceived difficulty levels.
We randomly select seven categories of domestic dogs and create a total of fourteen datasets at the following three levels: (1) one kind of dog versus the rest of the dogs: Basenji, Chihuahua, English setter, German police dog, Standard poodle, Vizsla, and Yorkshire terrier; (2) dogs versus other canids (i.e., wild dog, wolf, and fox): Wild dog, Wolf, and Fox; (3) dogs versus other mammals (i.e., rabbit, cat, panda, and elephant): Rabbit, Cat, Panda, and Elephant. Examples of images from ImageNet are presented in Figure 4. We then construct a balanced binary-class dataset for each versus pair by random sampling. Images are represented by the publicly available bag-of-words features [9]. We select the top 50 features using Principal Component Analysis (PCA), as is commonly done for vision data [11]. The other settings follow the experiments on the UCI datasets.

6.2.2 Experimental Results with Uncertainty Sampling Strategy. Table 6 and Table 7 present the results for the ImageNet datasets with US as the sampling method. MS is still observed to perform the best, consistent with the results on the UCI datasets. SP sometimes performs well on ΔSize but poorly on ΔAcc. It stops either too early or too late, which may be related to the stop set: in some cases, the performance on the stop set appears stable before some important instances are added, so SP stops early; in other cases, the performance on the stop set does not stabilize until much later iterations, so SP stops late. GRAD stops with a

² http://www.image-net.org.

Fig. 4. Examples of images from ImageNet: dog, wild dog, wolf, fox, rabbit, cat, panda, and elephant.

Table 6.
Results of ΔSize and ΔAcc on ImageNet Datasets for LR with US Strategy

Dataset            Metric  Initial  HPP      All      |  MS       SP       GRAD      OU       MC       ME
Basenji            Size    184.0    1999.9   2951.0   |  806.6    951.1    -1534.7   951.1    951.1    951.1
                   Acc     50.69%   62.54%   61.27%   |  1.06%    1.27%    4.52%     1.27%    1.27%    1.27%
Chihuahua          Size    175.0    2198.0   2800.0   |  465.5    602.0    -1683.5   602.0    602.0    584.5
                   Acc     51.86%   65.4%    64.6%    |  0.7%     0.8%     4.36%     0.8%     0.8%     0.87%
English setter     Size    242.0    2765.5   3881.0   |  1015.6   1115.5   -2048.2   1115.5   1115.5   1103.8
                   Acc     53.24%   63.04%   62.23%   |  0.72%    0.8%     5.03%     0.8%     0.8%     0.82%
German police dog  Size    174.0    2255.8   2785.0   |  348.2    529.2    -1813.0   529.2    529.2    523.7
                   Acc     51.31%   62.82%   61.8%    |  1.07%    1.02%    5.87%     1.02%    1.02%    0.98%
Standard poodle    Size    195.0    2386.8   3129.0   |  308.1    641.1    -1868.1   735.0    741.3    331.5
                   Acc     58.52%   71.92%   71.2%    |  0.63%    0.7%     4.71%     0.74%    0.73%    0.74%
Vizsla             Size    233.0    2601.8   3734.0   |  799.0    1014.8   -1997.5   1132.2   1132.2   1071.6
                   Acc     53.92%   66.56%   65.57%   |  0.91%    1.08%    6.26%     1.0%     1.0%     1.02%
Yorkshire terrier  Size    304.0    3044.4   4874.0   |  919.0    639.5    -2229.2   1629.7   1766.3   662.2
                   Acc     66.6%    73.58%   72.76%   |  0.82%    0.81%    3.12%     0.75%    0.8%     0.77%
Wild dog           Size    242.0    2357.5   3875.0   |  1054.4   497.9    -1717.1   1517.5   1517.5   999.7
                   Acc     62.84%   71.72%   70.68%   |  0.8%     0.78%    3.46%     1.04%    1.04%    0.77%
Wolf               Size    561.0    5825.0   8986.0   |  1736.0   -545.1   -2023.4   3136.0   3140.2   1396.3
                   Acc     69.05%   73.7%    73.09%   |  0.34%    0.43%    1.31%     0.54%    0.55%    0.37%
Fox                Size    478.0    5402.8   7654.0   |  931.2    -744.0   -1784.4   2251.2   2251.2   998.4
                   Acc     67.48%   73.72%   73.32%   |  0.32%    0.64%    1.2%      0.4%     0.4%     0.36%
Rabbit             Size    349.0    4444.0   5593.0   |  266.0    -308.0   -2626.2   1149.0   1149.0   70.0
                   Acc     63.37%   74.13%   73.41%   |  0.56%    0.72%    3.17%     0.72%    0.72%    0.69%
Cat                Size    732.0    5045.7   11722.0  |  5871.4   4043.2   -2926.0   6676.3   6676.3   5694.0
                   Acc     62.94%   67.32%   66.14%   |  0.97%    0.74%    1.45%     1.18%    1.18%    0.95%
Panda              Size    351.0    3862.7   5628.0   |  79.3     -714.0   -1283.7   1397.7   1612.3   -443.3
                   Acc     71.65%   81.0%    80.58%   |  0.43%    0.62%    1.77%     0.43%    0.38%    0.48%
Elephant           Size    392.0    3994.4   6282.0   |  79.0     -948.0   686.5     1769.6   2125.1   -434.5
                   Acc     73.42%   82.72%   82.2%    |  0.46%    0.59%    1.35%     0.48%    0.48%    0.48%

For the comparison methods (MS through ME), the Size row reports ΔSize and the Acc row reports ΔAcc.
Note: The thresholds of MS are 1.0, 0.5, 1.0, 1.0, 1.0, 1.0, 0.5, 1.0, 2.0, 2.0, 0.5, 2.0, 0.5, and 0.5, respectively.

Table 7. Results of ΔSize and ΔAcc on ImageNet Datasets for SVMs with US Strategy

Dataset            Metric  Initial  HPP      All      |  MS       SP       GRAD      OU       MC       ME       QUIRE
Basenji            Size    184.0    1659.2   2951.0   |  799.9    1261.0   -1005.7   1291.8   1291.8   1289.2   -956.0
                   Acc     50.86%   62.62%   60.88%   |  1.62%    1.84%    4.1%      1.73%    1.73%    1.73%    4.08%
Chihuahua          Size    175.0    1669.5   2800.0   |  533.8    960.8    -1030.8   1130.5   1130.5   1125.3   -845.0
                   Acc     49.03%   65.56%   64.39%   |  1.16%    1.19%    4.44%     1.18%    1.18%    1.29%    2.87%
English setter     Size    242.0    2458.0   3881.0   |  834.7    1389.5   -1566.8   1423.0   1423.0   1412.3   -1509.0
                   Acc     50.72%   62.16%   60.91%   |  1.15%    1.35%    4.15%     1.25%    1.25%    1.26%    3.68%
German police dog  Size    174.0    2048.3   2785.0   |  306.3    657.1    -1445.2   736.8    736.8    728.4    -1307.0
                   Acc     49.69%   63.34%   62.27%   |  1.32%    1.43%    5.02%     1.07%    1.07%    1.06%    5.15%
Standard poodle    Size    195.0    1822.0   3129.0   |  477.5    687.2    -1058.8   1298.3   1303.1   600.3    -1028.0
                   Acc     63.39%   72.2%    71.01%   |  1.06%    1.07%    3.27%     1.1%     1.12%    1.16%    2.73%
Vizsla             Size    233.0    2222.0   3734.0   |  667.4    1056.3   -1324.7   1512.0   1512.0   1429.3   -1279.0
                   Acc     55.9%    65.81%   64.68%   |  0.97%    1.02%    3.3%      1.13%    1.13%    1.09%    3.17%
Yorkshire terrier  Size    304.0    2642.7   4874.0   |  832.7    1121.8   -821.2    2128.3   2198.8   683.3    -1098.0
                   Acc     67.98%   73.77%   72.89%   |  0.87%    0.87%    2.12%     0.83%    0.89%    0.9%     1.71%
Wild dog           Size    242.0    2009.5   3875.0   |  686.7    633.9    -763.6    1865.5   1865.5   1121.1   -1042.0
                   Acc     66.44%   72.47%   71.37%   |  0.86%    0.87%    1.95%     1.09%    1.09%    0.88%    2.08%
Wolf               Size    561.0    5970.6   8986.0   |  -302.4   -638.4   -323.5    2923.2   2992.9   616.0    -2251.0
                   Acc     70.36%   73.63%   73.01%   |  0.55%    0.6%     1.12%     0.55%    0.61%    0.45%    0.87%
Fox                Size    478.0    3222.0   7654.0   |  1563.2   1582.4   2028.8    4412.8   4432.0   2593.6   -307.0
                   Acc     69.77%   73.81%   73.09%   |  0.75%    0.76%    1.09%     0.72%    0.72%    0.74%    0.82%
Rabbit             Size    349.0    2682.1   5593.0   |  1020.6   1162.7   -347.1    2904.5   2910.9   1261.4   -834.0
                   Acc     65.98%   73.54%   72.44%   |  0.91%    0.93%    1.92%     1.11%    1.1%     0.93%    1.67%
Cat                Size    732.0    5438.9   11722.0  |  2382.9   2810.6   -934.2    6283.1   6283.1   5128.8   -811.0
                   Acc     63.74%   67.25%   65.81%   |  0.6%     0.7%     1.38%     1.45%    1.45%    1.17%    0.81%
Panda              Size    351.0    2830.4   5628.0   |  509.6    587.3    1018.3    2104.9   2508.8   107.8    -832.0
                   Acc     75.99%   81.03%   80.44%   |  0.58%    0.65%    1.09%     0.53%    0.55%    0.65%    1.12%
Elephant           Size    392.0    2834.0   6282.0   |  1286.8   686.4    2428.0    2329.6   3024.8   212.4    -799.0
                   Acc     75.78%   81.55%   80.95%   |  0.65%    0.93%    0.78%     0.67%    0.67%    0.95%    1.23%

For the comparison methods (MS through QUIRE), the Size row reports ΔSize and the Acc row reports ΔAcc.
Note: The thresholds of MS are all 2.0.

small size of training data and low accuracies most of the time, indicating that it stops early and omits some important instances; this is especially obvious on the ImageNet datasets. The reason may be that in such image-categorization tasks, a great many instances are highly uncertain under the linear classifier at early iterations of the learning process, which produces apparently stable uncertainty in the window of recently selected instances and the early stop suggested by GRAD. OU and MC sometimes obtain the smallest ΔAcc but mostly a large ΔSize, showing that they aspire to high accuracies and thus stop with a large training set. The performance of ME is inconsistent across datasets, likely because the expected error estimation is highly correlated with the characteristics of the datasets. QUIRE performs poorly most of the time, which may be because informativeness and representativeness are hard to balance; thus, its performance cannot be guaranteed.

6.2.3 Experimental Results with Expected Model Change Strategy.
Table 8 and Table 9 present the results for LR and SVMs with EMC as the sampling method. As shown in the tables, similar results are obtained; that is, MS outperforms the baselines in most cases, demonstrating the effectiveness of our proposed stopping criterion. Similar to the previous experiments, the other methods do not perform well most of the time; we omit the detailed descriptions here due to space limitations.

Table 8. Results of ΔSize and ΔAcc on ImageNet Datasets for LR with EMC Strategy

Dataset            Metric  Initial  HPP      All      |  MS       SP       GRAD     OU      MC      ME
Basenji            Size    184.0    2312.0   2951.0   |  502.0    640.0    -1233.0  640.0   640.0   640.0
                   Acc     51.03%   61.87%   60.87%   |  0.84%    1.0%     3.1%     1.0%    1.0%    1.0%
Chihuahua          Size    175.0    1792.0   2800.0   |  830.0    1008.0   -525.0   1008.0  1008.0  994.0
                   Acc     53.1%    65.26%   64.04%   |  1.19%    1.21%    2.0%     1.21%   1.21%   1.39%
English setter     Size    242.0    3123.0   3881.0   |  594.0    758.0    -1252.0  758.0   758.0   747.0
                   Acc     50.91%   62.45%   61.19%   |  1.03%    1.26%    3.3%     1.26%   1.26%   1.2%
German police dog  Size    174.0    2336.0   2785.0   |  384.0    449.0    -1553.0  449.0   449.0   447.0
                   Acc     50.36%   62.81%   61.89%   |  0.72%    0.92%    4.25%    0.92%   0.92%   0.99%
Standard poodle    Size    195.0    2824.0   3129.0   |  -94.0    111.0    -1872.0  300.0   305.0   -47.0
                   Acc     55.81%   73.33%   72.64%   |  0.54%    0.59%    4.52%    0.56%   0.69%   0.72%
Vizsla             Size    233.0    2501.0   3734.0   |  1107.0   1095.0   47.0     1233.0  1233.0  1206.0
                   Acc     55.42%   65.81%   65.15%   |  0.56%    0.7%     1.06%    0.66%   0.66%   0.62%
Yorkshire terrier  Size    304.0    2598.0   4874.0   |  1427.0   872.0    -757.0   2092.0  2214.0  1171.0
                   Acc     67.08%   73.22%   72.46%   |  0.57%    0.66%    2.75%    0.57%   0.66%   0.7%
Wild dog           Size    242.0    2558.0   3875.0   |  948.0    361.0    178.0    1317.0  1317.0  868.0
                   Acc     62.72%   71.81%   70.84%   |  0.76%    1.08%    1.31%    0.96%   0.96%   0.76%
Wolf               Size    561.0    6990.0   8986.0   |  358.0    -1859.0  826.0    1971.0  1981.0  426.0
                   Acc     68.97%   73.59%   73.19%   |  0.36%    0.52%    0.46%    0.39%   0.36%   0.38%
Fox                Size    478.0    5105.0   7654.0   |  1382.0   -499.0   -922.0   2549.0  2549.0  1373.0
                   Acc     68.36%   74.21%   73.74%   |  0.36%    0.38%    1.19%    0.47%   0.47%   0.42%
Rabbit             Size    349.0    3988.0   5593.0   |  862.0    92.0     -905.0   1605.0  1605.0  638.0
                   Acc     63.87%   73.81%   73.05%   |  0.61%    0.79%    2.14%    0.76%   0.76%   0.71%
Cat                Size    732.0    7347.0   11722.0  |  3704.0   1316.0   1537.0   4375.0  4375.0  3410.0
                   Acc     64.01%   67.5%    66.85%   |  0.76%    0.55%    0.65%    0.65%   0.65%   0.77%
Panda              Size    351.0    4388.0   5628.0   |  -389.0   -1229.0  -122.0   879.0   1097.0  -902.0
                   Acc     71.71%   81.02%   80.69%   |  0.28%    0.6%     0.86%    0.32%   0.3%    0.44%
Elephant           Size    392.0    4073.0   6282.0   |  158.0    -924.0   1770.0   1754.0  2054.0  -395.0
                   Acc     74.11%   82.88%   82.42%   |  0.34%    0.51%    0.55%    0.43%   0.47%   0.42%

For the comparison methods (MS through ME), the Size row reports ΔSize and the Acc row reports ΔAcc.
Note: The thresholds of MS are 1.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, and 1.0, respectively.

7 EXTENSIONS

In the above, the proposed SC_MS is applied to linear models learned in a stochastic-gradient-descent manner. In fact, our SC_MS is flexible and can be generalized to many models. In this section, we extend SC_MS to kernelized models and to non-gradient-based models. Since SC_MS stops active learning when C(x⁺) ≤ η, ∀x⁺ ∈ U, the main problem for these extensions lies in the measurement of the model change C(x⁺).

7.1 Stopping Criterion for Kernelized Models

In this section, we discuss the extension of our method to kernelized models, including kernel LR and kernel SVM. The kernel trick transforms the feature space into a Hilbert feature space with

Table 9.
Results of ΔSize and ΔAcc on ImageNet Datasets for SVMs with EMC Strategy
(For Initial, HPP, and All, cells give Size (Acc); for the stopping methods, cells give ΔSize (ΔAcc).)

Dataset           | Initial         | HPP             | All              | MS              | SP               | GRAD             | OU              | MC              | ME              | QUIRE
Basenji           | 184.0 (50.04%)  | 2001.0 (61.73%) | 2951.0 (60.54%)  | 828.0 (0.83%)   | 940.0 (1.23%)    | −895.0 (3.1%)    | 950.0 (1.19%)   | 950.0 (1.19%)   | 950.0 (1.19%)   | −1103.0 (3.98%)
Chihuahua         | 175.0 (53.49%)  | 2167.0 (64.69%) | 2800.0 (63.76%)  | 420.0 (0.7%)    | 620.0 (0.87%)    | −805.0 (2.29%)   | 634.0 (0.93%)   | 634.0 (0.93%)   | 634.0 (0.93%)   | −1715.0 (5.96%)
English setter    | 242.0 (49.28%)  | 2920.0 (62.06%) | 3881.0 (61.34%)  | 923.0 (0.69%)   | 961.0 (0.72%)    | −1152.0 (2.5%)   | 961.0 (0.72%)   | 961.0 (0.72%)   | 961.0 (0.72%)   | −2311.0 (6.24%)
German police dog | 174.0 (50.62%)  | 2337.0 (62.44%) | 2785.0 (61.54%)  | 368.0 (0.72%)   | 448.0 (0.9%)     | −1195.0 (4.71%)  | 448.0 (0.9%)    | 448.0 (0.9%)    | 448.0 (0.9%)    | −1649.0 (5.77%)
Standard poodle   | 195.0 (64.6%)   | 2289.0 (71.69%) | 3129.0 (70.98%)  | 335.0 (0.64%)   | 513.0 (0.66%)    | −635.0 (2.46%)   | 840.0 (0.7%)    | 840.0 (0.7%)    | 838.0 (0.7%)    | −1509.0 (3.78%)
Vizsla            | 233.0 (54.61%)  | 3006.0 (65.48%) | 3734.0 (64.78%)  | 423.0 (0.61%)   | 630.0 (0.63%)    | −2059.0 (3.46%)  | 728.0 (0.71%)   | 728.0 (0.71%)   | 728.0 (0.71%)   | −2186.0 (4.23%)
Yorkshire terrier | 304.0 (68.39%)  | 3952.0 (73.82%) | 4874.0 (73.3%)   | −6.0 (0.29%)    | 43.0 (0.3%)      | −479.0 (1.79%)   | 922.0 (0.52%)   | 922.0 (0.52%)   | 573.0 (0.49%)   | −2788.0 (4.25%)
Wild dog          | 242.0 (66.96%)  | 2690.0 (71.81%) | 3875.0 (70.86%)  | 638.0 (0.72%)   | 182.0 (0.82%)    | 599.0 (1.11%)    | 1185.0 (0.95%)  | 1185.0 (0.95%)  | 1185.0 (0.95%)  | −1858.0 (3.33%)
Wolf              | 561.0 (70.34%)  | 5900.0 (73.79%) | 8986.0 (73.42%)  | 1997.0 (0.29%)  | −373.0 (0.39%)   | 1170.0 (0.76%)   | 3086.0 (0.37%)  | 3086.0 (0.37%)  | 3086.0 (0.37%)  | −3677.0 (1.85%)
Fox               | 478.0 (69.32%)  | 6248.0 (73.36%) | 7654.0 (73.12%)  | 336.0 (0.35%)   | −1258.0 (0.52%)  | 1406.0 (0.24%)   | 1406.0 (0.24%)  | 1406.0 (0.24%)  | 1406.0 (0.24%)  | −4195.0 (1.46%)
Rabbit            | 349.0 (66.35%)  | 4612.0 (72.75%) | 5593.0 (72.22%)  | 336.0 (0.41%)   | −42.0 (0.44%)    | −389.0 (1.22%)   | 981.0 (0.54%)   | 981.0 (0.54%)   | 981.0 (0.54%)   | −3220.0 (3.27%)
Cat               | 732.0 (64.17%)  | 9748.0 (66.47%) | 11722.0 (65.77%) | 1801.0 (0.63%)  | −588.0 (0.64%)   | −2021.0 (1.33%)  | 1974.0 (0.7%)   | 1974.0 (0.7%)   | 1974.0 (0.7%)   | −6468.0 (2.47%)
Panda             | 351.0 (75.82%)  | 3872.0 (80.56%) | 5628.0 (80.2%)   | 238.0 (0.24%)   | −266.0 (0.67%)   | 912.0 (0.46%)    | 1756.0 (0.36%)  | 1756.0 (0.36%)  | 1015.0 (0.25%)  | −3010.0 (4.27%)
Elephant          | 392.0 (76.65%)  | 4872.0 (82.14%) | 6282.0 (81.8%)   | −429.0 (0.27%)  | −813.0 (0.5%)    | 904.0 (0.52%)    | 1410.0 (0.34%)  | 1410.0 (0.34%)  | −446.0 (0.25%)  | −3702.0 (2.07%)

Note: The thresholds of MS are all 1.5.

kernel function $K(\cdot,\cdot)$. Here, we consider two commonly used kernels: the radial basis function (RBF) kernel $K_r(z, x) = \exp\left(-\frac{\|z - x\|^2}{2\sigma^2}\right)$ and the polynomial (PN) kernel $K_p(z, x) = (z^T x + 1)^p$.

7.1.1 SC_MS for Kernel Logistic Regression. With the kernel trick, the kernel LR model [34] can be formulated as $f(x) = \frac{1}{1 + \exp(-\theta_\phi^T \phi(x))}$, where $\theta_\phi$ is the model parameter vector in the Hilbert feature space and $\phi(\cdot)$ is the mapping function. Based on the analysis in [24], we obtain $\theta_\phi \approx \phi(\theta)$; thus, the kernel LR model can be rewritten as

$$f(x) = \frac{1}{1 + \exp(-K(\theta, x))}. \quad (48)$$

Given the dataset D, training the kernel LR is to minimize the following logistic loss:

$$L_\theta(D) = \sum_{i=1}^{m} \left\{ -y_i K(\theta, x_i) + \log\left[1 + \exp(K(\theta, x_i))\right] \right\}, \quad (49)$$

which can be solved by the stochastic gradient update rule as well. Therefore, we can estimate the model change by the gradient of the logistic loss $L_\theta(x^+)$ with regard to the parameter vector $\theta$. Then, for kernel LR, the expected model change can be denoted by

$$C_\theta(x^+) = E_{y^+} \left\| G_\theta(x^+) \right\| = E_{y^+} \left\| \frac{\partial L_\theta(x^+)}{\partial \theta} \right\| = 2 f(x^+)\left(1 - f(x^+)\right) \left\| \frac{\partial K(\theta, x^+)}{\partial \theta} \right\|, \quad (50)$$

where for the RBF kernel

$$\frac{\partial K_r(\theta, x^+)}{\partial \theta} = -\exp\left(-\frac{\|\theta - x^+\|^2}{2\sigma^2}\right) \frac{\theta - x^+}{\sigma^2}, \quad (51)$$

and for the PN kernel

$$\frac{\partial K_p(\theta, x^+)}{\partial \theta} = p\,(\theta^T x^+ + 1)^{p-1} x^+. \quad (52)$$

According to Equation (2), SC_MS will stop active learning when $C_\theta(x^+) \le \eta$, $\forall x^+ \in U$.

7.1.2 SC_MS for Kernel SVM.
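As a concrete illustration of the kernel-LR criterion derived above (Equations (50)–(52)), the following is a minimal NumPy sketch assuming an RBF kernel; the function names (`model_change_klr`, `should_stop`) are illustrative and not from the paper's released code:

```python
import numpy as np

def rbf_kernel(theta, x, sigma=1.0):
    # K_r(theta, x) = exp(-||theta - x||^2 / (2 sigma^2))
    return np.exp(-np.sum((theta - x) ** 2) / (2 * sigma ** 2))

def rbf_kernel_grad(theta, x, sigma=1.0):
    # Eq. (51): dK_r/dtheta = -exp(-||theta - x||^2 / (2 sigma^2)) (theta - x) / sigma^2
    return -rbf_kernel(theta, x, sigma) * (theta - x) / sigma ** 2

def model_change_klr(theta, x_plus, sigma=1.0):
    # Eq. (50): C(x+) = 2 f(x+) (1 - f(x+)) ||dK/dtheta||
    f = 1.0 / (1.0 + np.exp(-rbf_kernel(theta, x_plus, sigma)))
    return 2 * f * (1 - f) * np.linalg.norm(rbf_kernel_grad(theta, x_plus, sigma))

def should_stop(theta, unlabeled, eta, sigma=1.0):
    # SC_MS: stop when C(x+) <= eta for every remaining unlabeled example.
    return all(model_change_klr(theta, x, sigma) <= eta for x in unlabeled)
```

A point far from the current parameter vector contributes a vanishing RBF gradient, so its estimated model change is near zero, whereas nearby uncertain points keep the criterion above the threshold.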
The kernel SVM model can be represented by a hyperplane $f(x) = w_\phi^T \phi(x)$, where $w_\phi \approx \phi(w)$ is the model parameter in the Hilbert feature space, as defined in [24]. Therefore, kernel SVM can be rewritten as $f(x) = K(w, x)$ with the loss function

$$L_w(D) = \frac{\lambda}{2} K(w, w) + \sum_{i=1}^{m} \left[1 - y_i K(w, x_i)\right]_+, \quad (53)$$

which can also be solved by the stochastic gradient update rule. As before, after adding $x^+$, the model change can be measured by the gradient of the loss $L_w(x^+)$ with regard to $w$. Then, for an instance $x^+ \in U$ satisfying $|K(w, x^+)| < 1$, the expectation of the model change can be formulated as

$$C_w(x^+) = E_{y^+} \left\| G_w(x^+) \right\| = E_{y^+} \left\| \frac{\partial L_w(x^+)}{\partial w} \right\| = \left\| \frac{\partial K(w, x^+)}{\partial w} \right\|, \quad (54)$$

where the gradient of the kernel function $K(w, x^+)$ with regard to $w$ for the RBF and PN kernels is analogous to Equation (51) and Equation (52); for all other $x^+ \in U$, we have $C_w(x^+) = 0$. Similarly, taking kernel SVM as the base model, we stop active learning when $C_w(x^+) \le \eta$ is satisfied for $\forall x^+ \in U$. Detailed experimental results can be found at https://github.com/zhyxun/Stopping-criterion-for-active-learning.

7.2 Stopping Criterion for Nongradient-Based Models
As the essential step of SC_MS is to estimate the model change, it is naturally applicable to gradient-based models. Applying SC_MS to nongradient-based models is more complex; the key challenge is to obtain a closed form of the model change. Here, we extend SC_MS to nongradient-based models, taking Gaussian process regression (GPR) as an example. Given the training set D, GPR can be defined as

$$f(x; \alpha) = \sum_{i=1}^{m} \alpha_i \kappa(x_i, x), \quad (55)$$

where $\kappa(\cdot,\cdot)$ is a given kernel function and $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_m]^T$ is the model parameter vector, which can be obtained by $\alpha = (K + \sigma_m^2 I)^{-1} Y$, where $K$ is the kernel matrix of the set D, $I$ is the identity matrix, $\sigma_m^2$ is the assumed noise variance, and $Y$ is the label vector.
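Fitting the GPR parameter vector defined above reduces to a single linear solve. A minimal NumPy sketch follows, assuming an RBF kernel for κ; all names (`fit_gpr`, `predict`) are illustrative:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    # kappa(a, b) = exp(-||a - b||^2 / (2 sigma^2))
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def fit_gpr(X, Y, sigma_m2=0.1, sigma=1.0):
    # alpha = (K + sigma_m^2 I)^{-1} Y, where K is the kernel matrix of the training set.
    m = len(X)
    K = np.array([[rbf(X[i], X[j], sigma) for j in range(m)] for i in range(m)])
    return np.linalg.solve(K + sigma_m2 * np.eye(m), Y)

def predict(alpha, X, x, sigma=1.0):
    # f(x; alpha) = sum_i alpha_i * kappa(x_i, x)   (Eq. (55))
    return sum(a_i * rbf(x_i, x, sigma) for a_i, x_i in zip(alpha, X))
```

With the noise variance set to zero, the model interpolates the training labels exactly, which is a quick sanity check on the linear solve.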
As suggested by Cai [7], after adding a new instance $x^+$ into the training set, the updated model parameter for GPR can be computed in closed form:

$$\alpha^+ = \begin{bmatrix} \alpha \\ 0 \end{bmatrix} + \frac{\kappa^T \alpha - y^+}{\sigma_{f^+}^2 + \sigma_m^2} \begin{bmatrix} (K + \sigma_m^2 I)^{-1} \kappa \\ -1 \end{bmatrix}, \quad (56)$$

where $\kappa$ is the vector of kernel values between the training set and $x^+$ (i.e., the $i$th entry is $\kappa(x_i, x^+)$) and $\sigma_{f^+}^2$ is the predictive variance of $x^+$. Therefore, the expectation of the model change for GPR can be formulated as

$$C_{GPR}(x^+) = E_{y^+} \left\| \alpha^+ - [\alpha^T, 0]^T \right\| = \int_Y \left\| \alpha^+ - [\alpha^T, 0]^T \right\| P(y^+ | x^+) \, dy^+. \quad (57)$$

According to SC_MS, taking GPR as the base model, we stop active learning if the condition $C_{GPR}(x^+) \le \eta$, $\forall x^+ \in U$ is satisfied. Detailed experimental results are presented at https://github.com/zhyxun/Stopping-criterion-for-active-learning.

8 CONCLUSIONS
In this article, we propose a novel model stability–based stopping criterion, which considers the potential of each unlabeled example to change the model once added to the training set. The main idea is that active learning should stop when the model would not change with more training data. Accordingly, we stop active learning when the model-changing ability of every remaining unlabeled example is less than a given threshold. Inspired by the stochastic gradient update rule, we use the gradient of the loss at each candidate example to measure its capability to change the model. We apply the method to two popular classifiers: logistic regression and support vector machines. We then theoretically analyze the stability and generalization ability of the model obtained by our method, providing theoretical support for the criterion. Substantial experiments on various datasets demonstrate the effectiveness of the proposed method. Finally, we generalize the criterion to several other kinds of models, including kernelized models and nongradient-based models, to show its broad applicability. In this study, we focus on binary classification.
In fact, our proposed model stability–based stopping criterion can also be generalized to multiclass classification problems, which we leave for future work. In addition, the model change is currently estimated as an expectation over all possible labels; in the future, we will investigate more reliable ways to estimate the model change.

ACKNOWLEDGMENTS
The work is partially supported by the High Technology Research and Development Program of China 2015AA015801, NSFC 61521062, and STCSM 12DZ2272600.

REFERENCES
[1] N. Abe and H. Mamitsuka. 1998. Query learning strategies using boosting and bagging. In Proceedings of the 15th International Conference on Machine Learning. 1–10.
[2] A. I. Schein and L. H. Ungar. 2007. Active learning for logistic regression: An evaluation. Machine Learning 68, 3, 235–265.
[3] M. Bloodgood and K. Vijay-Shanker. 2009. A method for stopping active learning based on stabilizing predictions and the need for user-adjustable stopping. In Proceedings of the 13th Conference on Computational Natural Language Learning. 39–47.
[4] S. Boucheron, G. Lugosi, and O. Bousquet. 2013. Concentration Inequalities. Oxford University Press, Oxford.
[5] O. Bousquet and A. Elisseeff. 2002. Stability and generalization. Journal of Machine Learning Research 2, 3, 499–526.
[6] C. Burges. 1998. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2, 2, 121–167.
[7] W. Cai, M. Zhang, and Y. Zhang. 2017. Batch mode active learning for regression with expected model change. IEEE Transactions on Neural Networks and Learning Systems 28, 7, 1668–1681.
[8] W. Cai, Y. Zhang, and J. Zhou. 2013. Maximizing expected model change for active learning in regression. In Proceedings of the 13th International Conference on Data Mining. 51–60.
[9] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the Conference on Computer Vision and Pattern Recognition. 248–255.
[10] B. Efron and C. Stein. 1981. The jackknife estimate of variance. The Annals of Statistics 9, 3, 586–596.
[11] T. M. Hospedales, S. Gong, and T. Xiang. 2013. Finding rare classes: Active learning with generative and discriminative models. IEEE Transactions on Knowledge and Data Engineering 25, 2, 374–386.
[12] S. Huang, R. Jin, and Z. Zhou. 2014. Active learning by querying informative and representative examples. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 10, 1936–1949.
[13] J. Kivinen, A. Smola, and R. Williamson. 2004. Online learning with kernels. IEEE Transactions on Signal Processing 52, 8, 2165–2176.
[14] F. Laws and H. Schütze. 2008. Stopping criteria for active learning of named entity recognition. In Proceedings of the 22nd International Conference on Computational Linguistics. 465–472.
[15] D. D. Lewis and W. A. Gale. 1994. A sequential algorithm for training text classifiers. In Proceedings of the 17th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3–12.
[16] L. Li, X. Jin, S. Pan, and J. Sun. 2012. Multi-domain active learning for text classification. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1086–1094.
[17] M. Li and I. K. Sethi. 2006. Confidence-based active learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 8, 1251–1261.
[18] A. Liu, G. Jun, and J. Ghosh. 2009. A self-training approach to cost sensitive uncertainty sampling. Machine Learning 76, 2, 257–270.
[19] C. McDiarmid. 1989. On the method of bounded differences. Surveys in Combinatorics 141, 1, 148–188.
[20] J. Platt. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods.
Advances in Large Margin Classifiers 10, 3, 61–74.
[21] G. Schohn and D. Cohn. 2000. Less is more: Active learning with support vector machines. In Proceedings of the 17th International Conference on Machine Learning. 839–846.
[22] B. Settles. 2010. Active learning literature survey. University of Wisconsin–Madison 39, 2, 127–131.
[23] H. S. Seung, M. Opper, and H. Sompolinsky. 1992. Query by committee. In Proceedings of the 5th Workshop on Computational Learning Theory. 287–294.
[24] G. Sharma, F. Jurie, and P. Perez. 2014. Learning non-linear SVM in input space for image classification. HAL Archives. 1–14.
[25] S. Shalev-Shwartz, Y. Singer, and N. Srebro. 2007. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning. 807–814.
[26] K. Small and D. Roth. 2010. Margin-based active learning for structured predictions. International Journal of Machine Learning and Cybernetics 1, 1-4, 3–25.
[27] M. Tang, X. Luo, and S. Roukos. 2002. Active learning for statistical natural language parsing. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 120–127.
[28] K. Tomanek, J. Wermter, and U. Hahn. 2007. An approach to text corpus construction which cuts annotation costs and maintains reusability of annotated data. In Proceedings of the Joint Meeting of the Conference on Empirical Methods in Natural Language Processing and the Conference on Natural Language Learning. 486–495.
[29] S. Tong and E. Chang. 2001. Support vector machine active learning for image retrieval. In Proceedings of the 9th ACM International Conference on Multimedia. 107–118.
[30] S. Tong and D. Koller. 2002. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research 2, 1, 45–66.
[31] A. Vlachos. 2008. A stopping criterion for active learning. Computer Speech and Language 22, 3, 295–312.
[32] J. Wang, E. Sung, and W. Yau. 2011.
Active learning for solving the incomplete data problem in facial age classification by the furthest nearest-neighbor criterion. IEEE Transactions on Image Processing 20, 7, 2049–2062.
[33] W. Wang, W. Cai, and Y. Zhang. 2014. Stability-based stopping criterion for active learning. In Proceedings of the 14th International Conference on Data Mining. 1019–1024.
[34] J. Zhu and T. Hastie. 2001. Kernel logistic regression and the import vector machine. In Advances in Neural Information Processing Systems 14. 1081–1088.
[35] J. Zhu, H. Wang, and E. Hovy. 2010. Confidence-based stopping criteria for active learning for data annotation. ACM Transactions on Speech and Language Processing 6, 3, 1–24.

Received March 2017; revised June 2017; accepted July 2017
