© 2017 Society for Industrial and Applied Mathematics
SIAM J. IMAGING SCIENCES, Vol. 10, No. 4, pp. 1845–1877

Sequential Convex Programming for Computing Information-Theoretic Minimal Partitions: Nonconvex Nonsmooth Optimization∗

Youngwook Kee†, Yegang Lee‡, Mohamed Souiai§, Daniel Cremers§, and Junmo Kim‡

Abstract. We consider an unsupervised image segmentation problem—from figure-ground separation to multiregion partitioning—that consists of maximal distribution separation (in terms of mutual information) with spatial regularity (total variation regularization), which is what we call information-theoretic minimal partitioning. Adopting the bounded variation framework, we provide an in-depth analysis of the problem that establishes theoretical foundations and investigates the structure of the associated energy from a variational perspective. In doing so, we show that the objective exhibits a form of difference of convex functionals, which leads us to a class of large-scale nonconvex optimization problems where convex optimization techniques can be successfully applied. In this regard, we propose sequential convex programming based on the philosophy of stochastic optimization and the Chambolle–Pock primal-dual algorithm. The key idea behind it is to construct a stochastic family of convex approximations of the original nonconvex function and sequentially minimize the associated subproblems. Indeed, its stochastic nature often makes it possible to escape from bad local minima toward near-optimal solutions, where such optimality can be justified in terms of recent findings in statistical physics regarding a striking characteristic of high-dimensional landscapes. We experimentally demonstrate this favorable ability of the proposed algorithm and show the capacity of our approach in numerous experiments. The preliminary conference paper can be found in [Y. Kee, M. Souiai, D. Cremers, and J. Kim, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014].

Key words. information theory, total variation regularization, image partitioning, difference of convex functions, stochastic optimization

AMS subject classifications. 49M20, 49M29, 65K15, 68U10

DOI. 10.1137/16M1078653

∗Received by the editors June 6, 2016; accepted for publication (in revised form) April 15, 2017; published electronically October 26, 2017. http://www.siam.org/journals/siims/10-4/M107865.html
Funding: This work was supported in part by the National Research Foundation of Korea within the Ministry of Science, ICT and Future Planning through the Korean Government under grant NRF-2017R1A2A2A05001400, in part by the ICT R&D program of MSIP/IITP 2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion, and in part by the ERC Starting grant "Convex Vision."
†School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, South Korea, and Department of Radiology, Weill Cornell Medical College, New York, NY 10021 (youngwook.kee@kaist.ac.kr, yok2013@med.cornell.edu).
‡School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, South Korea (askhow@kaist.ac.kr, junmo.kim@kaist.ac.kr).
§Department of Computer Science, Technical University of Munich, 85748 Garching, Germany (souiai@tum.de, cremers@tum.de).

[Figure 1: panels "Input," "Initialization," "Intermediate," "Segmentation," and "Histograms."] Figure 1. From an observation (input), we jointly estimate the ground truth partition and the associated distributions, which are both unknown to us. The colors (red, green, and blue) of the estimated intensity distributions (histograms) correspond to those of the estimated partition (segmentation). Starting with a random initialization, we obtain the partition and its associated distributions by solving a family of convex subproblems of the information-theoretic minimal partitioning problem, which will be presented throughout the paper.

1. Introduction. We present and analyze what we call the information-theoretic minimal partitioning problem for fully unsupervised (multiregion) image segmentation. The problem includes foreground-background segmentation, i.e., figure-ground separation, which can be of particular interest in computer vision and machine learning—here the problem boils down to fully unsupervised binary image labeling. Yet, the task of figure-ground separation is rather challenging because it is inherently a chicken-and-egg problem: Where is the object, and what distinguishes it from the background? A common assumption that lies behind the task is that the object and background have different intensity (or color) distributions. Likewise, one can extend the assumption to the task of multiregion image segmentation: Each region of the partition has a distinct intensity (color) distribution. Again, if these are completely unknown and if they show considerable overlap, then the joint estimation of color distributions and segmentation becomes a major algorithmic challenge. In this paper, we attempt to tackle such a challenging unsupervised (binary and multiregion) image segmentation problem from a variational perspective where the above assumption plays a pivotal role.

Let φ : (Ω ⊂ R²) → R be an observed intensity image function. The information-theoretic minimal partitioning problem solves the variational problem

(1.1)  minimize_{ {R_ℓ}_{ℓ=1}^k }  (1/2) ∑_{ℓ=1}^k Per(R_ℓ; Ω) − λ D(P_1, …, P_k)
       subject to  R_i ∩ R_j = ∅ ∀ i ≠ j  and  ⋃_{ℓ=1}^k R_ℓ = Ω,

where the first term of the objective is half the sum of the perimeters of the sets {R_ℓ}_{ℓ=1}^k, and the second term is a generalized statistical distance (possibly a divergence) defined on the space of probability distributions, weighted by a tuning parameter λ > 0, that measures the discrepancy between the distributions P_1, …, P_k associated with R_1, …, R_k, respectively. Notice that the associated distributions are unknown to us; rather, we estimate them from subsets of {φ(x) : x ∈ R_ℓ} for ℓ = 1, …, k. Then, an optimal solution of the problem (1.1) maximally separates the distributions with spatial regularity in terms of the total length of region boundaries. As shown in Figure 1, it is a joint estimation problem—segmentation and the associated distributions—that we consider throughout the paper.

The aim of the variational formulation (1.1) matches well the assumption of unsupervised image segmentation that each region of the partition has a different intensity distribution. Indeed, the formulation would better capture subtle differences between the distributions that
do not appear in lower-order statistics than do other existing models—based on the Mumford–Shah functional [36]—such as the Chan–Vese model [14] and its multiphase extension [53]. The primary reason for this is that such existing models do not take into account any higher-order statistics but the first-order moment. Needless to say, their convex formulations [13, 9, 11] inherently have the same limitations. One may want to incorporate the second-order moment into the models—see [49], for example—but obviously this does not address the fundamental issue. On the other hand, it is not surprising that such limitations barely disappear as long as the associated distributions are estimated in a parametric way, because natural images typically exhibit such complex coloring that a parametric representation would not be able to capture the differences.

In this light, there has been a line of research incorporating nonparametric statistical methods into the framework of unsupervised image segmentation with active contour models. For example, Kim et al. [30] and Herbulot et al. [25] proposed curve evolution equations driven by mutual information and conditional entropy, respectively, using the Parzen–Rosenblatt window method [46, 41]. Similarly, Michailovich, Rathi, and Tannenbaum [35] derived the gradient flow of the Bhattacharyya coefficient for curve evolution using nonparametric methods. Note that the active contour models discussed were all implemented by means of level-set methods [40].

At the same time, computing optimal solutions of the existing energy models in low-level vision has been another line of research along with recent advances in large-scale algorithms for convex optimization [31]. The basic idea behind this line of research—also famously known as convex relaxation methods—is that a given hard-to-solve nonconvex optimization problem is approximated by a convex alternative that is relatively easy to solve, in the sense that an optimal solution can be computed within a reasonable amount of time. Most importantly, an unknown optimal solution of the original problem can be recovered from the optimal solution of the convex one. In this regard, the seminal work of Chan, Esedoglu, and Nikolova [13] proposed a convex formulation for the two-phase instance of the piecewise constant Mumford–Shah functional (relaxation) and proved that an optimal solution of the original nonconvex geometric problem is exactly recovered from that of the convex formulation by simple thresholding (thresholding theorem). However, the class of such problems whose optimal solutions can be computed by the above technique (relaxation and thresholding) is quite limited: Not only should the objective itself be convex along with its relaxed constraints, but the coarea formula from geometric measure theory [21] should also hold. Note that if the objective is convex but the coarea formula does not hold, a near-optimal solution can be computed with a performance guarantee—a per-instance bound, for example. See [57, 33, 11] for convex approaches to computing a near-optimal solution of the continuous Potts model. Also, see the work of Pock et al.
[42], where they find a near-optimal solution of the piecewise smooth Mumford–Shah problem, which is based on the technique of Chan, Esedoglu, and Nikolova [13] and the calibration approach of Alberti, Bouchitté, and Dal Maso [2]. Meanwhile, there have been algorithmic advances in splitting methods for nonlinear and nonsmooth optimization in line with such convex relaxation techniques: A first-order primal-dual algorithm was proposed by Chambolle and Pock [12] with applications to imaging, the split Bregman method was proposed by Goldstein and Osher [24] for ℓ1-regularized problems, and a globally convergent Douglas–Rachford scheme was proposed by Lellmann and Schnörr [33], to name a few. For other low-level vision tasks and recent advances in this context, see [17, 39, 50] and references therein.

Not surprisingly, the problem (1.1) we consider throughout the paper, in which various types of statistical distance measures can be involved, does not fall into this class; i.e., the objective in (1.1) is nonconvex, so one needs to devise a new optimization algorithm to efficiently compute a near-optimal solution. Indeed, the paper is dedicated to an analysis of the mathematical structure of the problem and the development of such an algorithm. In this line of research, we mention some notable techniques recently proposed—yet quite different from ours: Among others—such as those making use of user inputs, e.g., scribbles or bounding boxes—the work of Punithakumar et al. [43] proposed a bound optimization technique for the Bhattacharyya coefficient between a given (a priori known) distribution and the distributions of the candidate figures to be segmented, which is obviously considered to be supervised figure-ground separation. Another notable example regarding unsupervised image segmentation is the work of Ni et al. [38]. In their work, an energy functional was proposed that consists of the Wasserstein distance between (local) intensity histograms from two disjoint regions (foreground and background) and total variation regularization. Most recently, a couple of research papers [55, 44] addressed similar issues in the context of binary image segmentation, but their performance largely depends on initialization (bounding boxes). Unfortunately, none of these approaches addresses the problem of unsupervised multiregion image segmentation.

1.1. Outline and contributions. From a variational perspective, the paper is at the cutting edge of fully unsupervised image segmentation via what we call information-theoretic minimal partitioning. The technical contributions of this work are, in a nutshell, threefold.

First, we establish theoretical foundations of the information-theoretic minimal partitioning problem with a rigorous treatment (section 2). To this end, we revisit the notion of mutual information proposed by Kim et al. [30] for the statistical distance measure in (1.1), which turns out to be the generalized Jensen–Shannon (JS) divergence (sections 2.1 and 2.2).
Then, we introduce, step by step, an objective that combines the negative mutual information and the total length of the region boundaries (the Potts prior); its minimization is what we call throughout the paper the problem of information-theoretic minimal partitioning (section 2.3). Furthermore, we reformulate the problem using the theory of functions of bounded variation and convex relaxation techniques for the Potts prior so that we can analyze its mathematical structure from the perspective of numerical optimization: We show that the problem exhibits a form of the difference of convex (DC) functionals (sections 2.3.1 and 2.3.2).

Second, we develop an optimization algorithm to compute a near-optimal solution of the original nonconvex problem (section 3). To this end, we introduce a sequential convex optimization technique that employs the difference of convex functions algorithm (DCA) [51, 52] and the Chambolle–Pock primal-dual splitting method [12]. Roughly speaking, the proposed algorithm constructs a family of convex subproblems for the original nonconvex objective and sequentially solves them in such a way that a minimizing sequence is guaranteed to converge to a critical point of the original nonconvex function (section 3.2). Note that the proposed procedure is based on first-order methods and the problem we tackle is nonconvex, so the minimizing sequence is likely to end up at a critical point (often a bad local minimum, both energetically and visually). Therefore, we further consider two strategies applied to the procedure—sections 3.2.1 and 3.2.2—to let a minimizing sequence escape from such undesirable points toward good local minima. Indeed, the proposed algorithm can be successfully applied to the class of nonconvex and nonsmooth problems into which the partitioning problem we consider falls.

Last, we experimentally validate the capacity of our approach by showing convincing fully unsupervised image partitions, and we compare the proposed method with our previous conference work [28] for figure-ground separation. The proposed method has a couple of favorable properties over [28]: partitions can be computed with a provably convergent algorithm, and it is substantially faster. Then, we consider the question of how to establish the relationship between the unknown optimal solution of the original nonconvex problem and a computed solution from the proposed algorithm. Unfortunately, we are unaware of any performance guarantee such as per-instance bounds in our case, because the convex subproblems are in fact upper bounds of the objective, albeit tight. Nevertheless, we believe that a computed solution is statistically likely to be a near-optimal solution in light of recent findings in statistical physics, which will be elaborated in section 4.3. We end the paper with a conclusion in section 5, along with some open issues for future research.

2. Information-theoretic minimal partitions. We start off the section by revisiting the notion of mutual information proposed by Kim et al. [29, 30] in the context of image segmentation.
We view the notion from a measure-theoretic perspective and employ the bounded variation framework in order to formulate what we call the information-theoretic minimal partitioning problem and generalize it to any measurable space in any dimension; in addition, this provides a new perspective on optimal labeling in computer vision in that a given image is a realization of the unknown target random image. In doing so, we establish the existence of a minimizer and the mathematical structure of the problem, which leads us to a certain class of large-scale nonconvex optimization problems to which convex optimization techniques can be successfully applied. The main contributions of this section can be found in Theorem 2.3, Proposition 2.4, and Proposition 2.5. Furthermore, in section 2.2 we elaborate on the connection between the mutual information criterion and the generalized JS divergence for those interested, which would broaden the understanding of the mutual information criterion for image partitioning. Notation and symbols used throughout the paper are listed in Table 1.

Table 1. Notation and symbols.

About geometry:
  Ω — Bounded open subset of R^d, d ≥ 2, with Lipschitz boundary.
  {Ω_ℓ}_{ℓ=1}^k — Ground truth partition such that Ω_i ∩ Ω_j = ∅ ∀ i ≠ j and ⋃_{ℓ=1}^k Ω_ℓ = Ω.
  {R_ℓ}_{ℓ=1}^k — Candidate partition such that R_i ∩ R_j = ∅ ∀ i ≠ j and ⋃_{ℓ=1}^k R_ℓ = Ω.
  χ_{R_ℓ} — Characteristic function of R_ℓ, i.e., χ_{R_ℓ}(x) = 1 if x ∈ R_ℓ and 0 otherwise.
  u_ℓ — χ_{R_ℓ}.
  u — k-tuple u = (u_1, …, u_k) such that ∑_{ℓ=1}^k u_ℓ(x) = 1.
  Per_g(R_ℓ; Ω) — Weighted perimeter of R_ℓ with a nonnegative function g.
  ∫_Ω g|Du_ℓ| — Weighted total variation of u_ℓ with a nonnegative function g.
  BV(Ω) — Space of functions of bounded variation.
  L^p(Ω) — Lebesgue spaces.

About information theory:
  (A, A, P_ℓ) — Probability space for ℓ = 1, …, k.
  X — Random location; its realization is x.
  Φ_ℓ — Vector-valued random variable associated with the probability measure P_ℓ.
  φ_ℓ — Observed function (realization) of Φ_ℓ.
  P_ℓ — Induced probability measure on (R^n, B(R^n)), i.e., P_ℓ := P_ℓ ∘ Φ_ℓ^{-1}.
  P_ℓ ≪ µ — P_ℓ is absolutely continuous with respect to µ.
  P_ℓ (candidate) — Mixture of P_1, …, P_k, i.e., P_ℓ := P_{Φ(X)|X∈R_ℓ}.
  h(Φ(X)) — Entropy of Φ(X).
  h(Φ(X)|u(X)) — Conditional entropy of Φ(X) given u(X).
  I(Φ(X); u(X)) — Mutual information between Φ(X) and u(X).
  π_1, …, π_k — Nonnegative real numbers such that π_1 + ··· + π_k = 1.
  Q — Q := ∑_{ℓ=1}^k π_ℓ P_ℓ, i.e., a probability measure on (R^n, B(R^n)).
  D_KL(P_ℓ||Q) — Kullback–Leibler divergence from P_ℓ to Q.
  dP_ℓ/dQ — Radon–Nikodym derivative of P_ℓ with respect to Q.
  D_π^JS(P_1, …, P_k) — Generalized JS divergence of P_1, …, P_k.
  p_ℓ — Density of P_ℓ, i.e., p_ℓ := p_{Φ(X)|u_ℓ(X)=1}.
  p̂_ℓ — Kernel density estimate for p_ℓ.
  K_H — Kernel function with the n × n bandwidth matrix H.

About optimization:
  H, K — Real Hilbert spaces.
  Γ_0(H) — Class of all proper, lower semicontinuous, and convex functions defined on H.
  argminimax — Set of saddle points of the argument.
  Prox_{g*} — Proximity operator of δ_K(ξ), i.e., orthogonal projector onto the convex set K.
  Prox_q — Proximity operator of δ_S(u), i.e., orthogonal projector onto the convex set S.

2.1. Mutual information revisited. Let Ω ⊂ R^d, d ≥ 2, be a bounded open set with Lipschitz boundary, and let (A, A, P_ℓ) be probability spaces for ℓ = 1, …, k. Suppose that {Ω_ℓ}_{ℓ=1}^k is a partition of Ω such that Ω_i ∩ Ω_j = ∅ ∀ i ≠ j and ⋃_{ℓ=1}^k Ω_ℓ = Ω. Then, the functions Φ_ℓ : A × Ω_ℓ → R^n, (a, x) ↦ (Φ_ℓ^1(a, x), …, Φ_ℓ^n(a, x)), are vector-valued random variables associated with the probability measures P_ℓ; for example, n = 3 for RGB color images. Suppose that each P_ℓ := P_ℓ ∘ Φ_ℓ^{-1} is absolutely continuous with respect to µ (written P_ℓ ≪ µ) for ℓ = 1, …, k, where µ denotes the Lebesgue measure on R^n. Then, the P_ℓ are the induced probability measures on (R^n, B(R^n)), i.e., distributions. We consider collections of the random variables as independent and identically distributed spatial stochastic processes (random fields) as follows:

  {Φ_ℓ(x) ∈ R^n | x ∈ Ω_ℓ} ~ P_ℓ (i.i.d.)  ∀ ℓ = 1, …, k,

where the index variable x takes values in the uncountable set Ω. Here, we omit the elements a of the sample space A for notational convenience. It is important to notice that φ_ℓ : Ω_ℓ → R^n is an observed function of Φ_ℓ; i.e., φ_ℓ(·) is a snapshot (realization) of the stochastic process Φ_ℓ(·), which is the set of snapshots across the ensemble. Let Φ(x) denote the sum of the random variables Φ_1(x) + ··· + Φ_k(x), and let φ(x) denote the sum of the realizations φ_1(x) + ··· + φ_k(x), respectively. Then, Φ(x) = Φ_ℓ(x) and φ(x) = φ_ℓ(x) for x ∈ Ω_ℓ, ℓ = 1, …, k, which is illustrated in Figure 2(a) when d = 2 and k = 4.

[Figure 2: panels (a) Ground truth model, (b) Candidate partition, and (c) Characteristic functions.] Figure 2. Suppose that the candidate partition (b) is made on the ground truth model (a) such that each region R_ℓ is encoded by its characteristic function u_ℓ as shown in (c), where ∑_{ℓ=1}^k u_ℓ(x) = 1 almost everywhere in Ω. Then, each distribution P_ℓ := P_{Φ(X)|u_ℓ(X)=1} is given by a mixture of P_1, …, P_4. The key idea of what we call information-theoretic minimal perimeter partitioning is to reconstruct the unknown ground truth model from an observation φ in such a way that the associated distributions of a computed partition are as homogeneous as possible with spatial regularity.

Let X be a uniformly distributed random location (index) over the domain Ω. For example, if one randomly extracts a pixel from the domain of a digital image, every pixel is equally likely to occur. Then, the distribution of the random variable Φ(X) is given by the following finite mixture:

(2.1)  P_{Φ(X)}(φ) = ∑_{ℓ=1}^k Pr(X ∈ Ω_ℓ) P_{Φ(X)|X∈Ω_ℓ}(φ) = (|Ω_1|/|Ω|) P_1(φ) + ··· + (|Ω_k|/|Ω|) P_k(φ),

where |·| denotes the d-dimensional Lebesgue measure. As can be seen, the distribution P_{Φ(X)} involves two sources of uncertainty: first, the uncertainty of the point location, modeled by Pr(X ∈ Ω_ℓ), i.e., the region to which a given location belongs; and second, the uncertainty in distribution for each region, given by P_{Φ(X)|X∈Ω_ℓ}. Note that we assume P_ℓ ≪ µ for ℓ = 1, …, k, so that the density of each P_ℓ, denoted by p_ℓ, is defined as the Radon–Nikodym derivative of P_ℓ with respect to µ. In doing so, P and p are used interchangeably throughout the paper. For instance, the mixture density p_{Φ(X)}(φ) can also be written in the form of (2.1) but in terms of p_{Φ(X)|X∈Ω_ℓ}.
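To make the mixture identity (2.1) concrete, the following minimal Python sketch—a toy example with region shapes and Gaussian intensities of our own choosing, not part of the paper's pipeline—checks empirically that the histogram of a uniformly drawn pixel value equals the area-weighted mixture of the per-region histograms:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ground-truth model on a 64x64 grid with k = 2 regions:
# the left half is Omega_1 ~ N(0.3, 0.05), the right half is Omega_2 ~ N(0.7, 0.05).
labels = np.zeros((64, 64), dtype=int)
labels[:, 32:] = 1
phi = np.where(labels == 0,
               rng.normal(0.3, 0.05, labels.shape),
               rng.normal(0.7, 0.05, labels.shape))

# Distribution of Phi(X) for a uniformly random pixel X, estimated by a histogram.
bins = np.linspace(0.0, 1.0, 41)
hist_mix, _ = np.histogram(phi, bins=bins, density=True)

# Area-weighted mixture of the per-region distributions, as in (2.1).
weights = [np.mean(labels == l) for l in (0, 1)]  # |Omega_l| / |Omega|
hists = [np.histogram(phi[labels == l], bins=bins, density=True)[0] for l in (0, 1)]
hist_from_regions = weights[0] * hists[0] + weights[1] * hists[1]

print(np.max(np.abs(hist_mix - hist_from_regions)))  # ~0: the two agree (up to round-off)
```

The agreement is exact by construction: normalizing each per-region histogram and reweighting by area reproduces the pooled histogram, which is precisely the finite mixture in (2.1).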
Then, the problem we address in this paper is to jointly estimate the ground truth partition {Ω_ℓ}_{ℓ=1}^k and its associated distributions P_1, …, P_k, which are both unknown to us, thereby estimating the mixture distribution P_{Φ(X)}. To this end, let us consider a candidate partition {R_ℓ}_{ℓ=1}^k such that R_i ∩ R_j = ∅ ∀ i ≠ j and ⋃_{ℓ=1}^k R_ℓ = Ω (up to sets of Lebesgue measure zero) and its associated distributions P_1, …, P_k, as illustrated in Figure 2(b) when d = 2 and k = 4. Here, it is crucial to notice that each P_ℓ := P_{Φ(X)|X∈R_ℓ} is given by a mixture distribution of P_1, …, P_k as follows:

  P_ℓ = (|R_ℓ ∩ Ω_1|/|R_ℓ|) P_1 + ··· + (|R_ℓ ∩ Ω_k|/|R_ℓ|) P_k  ∀ ℓ = 1, …, k.

Hence, coming up with a certain criterion that measures the homogeneity of the distributions P_1, …, P_k, one may want to optimize it with respect to the candidate partition {R_ℓ}_{ℓ=1}^k in order to obtain the ground truth partition. Notice that the larger the portion of R_ℓ that coincides with one of the sets Ω_1, …, Ω_k, the more homogeneous its associated distribution becomes. Indeed, the partitioning problem can be captured in this way by maximizing the mutual information between Φ(X) and the characteristic function of the candidate partition {R_ℓ}_{ℓ=1}^k. To elaborate, let us observe that the characteristic function of {R_ℓ}_{ℓ=1}^k is given as follows:

(2.2)  u_ℓ(X) := χ_{R_ℓ}(X) = { 1 with probability |R_ℓ|/|Ω|,
                               0 with probability |Ω \ R_ℓ|/|Ω|,

such that ∑_{ℓ=1}^k u_ℓ(x) = 1, where x ∈ Ω. Note that the function u_ℓ(X) is a random variable, considering the uncertainty of the point location X. Then, the mutual information between Φ(X) and u(X) := (u_1(X), …, u_k(X)) is defined as

(2.3)  I(Φ(X); u(X)) := h(Φ(X)) − h(Φ(X)|u(X))
                      = h(Φ(X)) − ∑_{ℓ=1}^k Pr(u_ℓ(X) = 1) h(Φ(X)|u_ℓ(X) = 1),

where Pr(u_ℓ(X) = 1) = Pr(X ∈ R_ℓ) = |R_ℓ|/|Ω|. Now, let us come back to what we have mentioned—the partitioning problem can be captured by maximizing the mutual information between Φ(X) and u(X).

Theorem 2.1 (Kim et al. [30]). Let {R_ℓ}_{ℓ=1}^k be a candidate partition of Ω. Then, the mutual information I(Φ(X); u(X)) is maximized if and only if the candidate partition coincides with the ground truth partition {Ω_ℓ}_{ℓ=1}^k.

Proof. It is a natural extension of the fact about mutual information [30], proven for the two-region case, to the general case with multiple regions. As shown in [30], the mutual information I(Φ(X); u(X)) is maximized if and only if Φ(X) and X are conditionally independent given u(X), so it suffices to show that P_{Φ(X)|u(X)} = P_{Φ(X)|X,u(X)} if and only if the candidate partition {R_ℓ}_{ℓ=1}^k coincides with the ground truth partition {Ω_ℓ}_{ℓ=1}^k, i.e., R_ℓ = Ω_{π(ℓ)} for 1 ≤ ℓ ≤ k, where π(·) is a permutation function. Similar to [30], the rest of the proof comes from the fact that P_{Φ(X)|u(X)=e_ℓ}, where e_ℓ = (0, 0, …, 1, …, 0) whose ℓth element is 1, is homogeneous for all ℓ if the two partitions coincide, and that if the two partitions do not coincide, there is at least one region R_ℓ for which P_{Φ(X)|u(X)=e_ℓ} is a mixture of the P_i's. The conditional distribution P_{Φ(X)|u(X)=e_ℓ} is given as a mixture of P_1, …, P_k as follows:

  P_{Φ(X)|u(X)=e_ℓ} = (|R_ℓ ∩ Ω_1|/|R_ℓ|) P_1 + ··· + (|R_ℓ ∩ Ω_k|/|R_ℓ|) P_k.
On the other hand, P_{Φ(X)|X=x,u(X)=e_ℓ} = P_j if x ∈ Ω_j, indicating that P_{Φ(X)|X=x,u(X)=e_ℓ} is homogeneous.

(⇐=): If R_ℓ = Ω_{π(ℓ)} for ℓ = 1, …, k, then P_{Φ(X)|X=x,u(X)=e_ℓ} = P_{π(ℓ)} (∵ e_ℓ = u(X) = u(x) ⇒ x ∈ R_ℓ = Ω_{π(ℓ)}) and P_{Φ(X)|u(X)=e_ℓ} = P_{π(ℓ)} (∵ u(X) = e_ℓ ⇒ X ∈ R_ℓ = Ω_{π(ℓ)}).

(=⇒): If the partitions do not coincide, there is at least one ℓ such that |R_ℓ ∩ Ω_a| ≠ 0, |R_ℓ ∩ Ω_b| ≠ 0, and P_{Φ(X)|u(X)=e_ℓ} = α_a P_a + α_b P_b + ··· ≠ P_{Φ(X)|X=x,u(X)=e_ℓ} = P_j if x ∈ Ω_j.

Theorem 2.1 can be understood as a justification for the suitability of maximizing the mutual information for unsupervised image segmentation; its inherent issue, however, will be elaborated in section 2.3. Before moving on to how to optimize the mutual information criterion, we further analyze it in terms of the Kullback–Leibler (KL) divergence. This will broaden our understanding of the notion of mutual information from a slightly different perspective.

2.2. More on the mutual information criterion. Suppose that π_1, …, π_k are nonnegative real numbers such that π_1 + ··· + π_k = 1. Setting Q := ∑_{ℓ=1}^k π_ℓ P_ℓ, one can see that Q is again a probability measure on the measurable space (R^n, B(R^n)). The KL divergence from P_ℓ to Q is defined as

(2.4)  D_KL(P_ℓ||Q) := ∫_{R^n} log(dP_ℓ/dQ) dP_ℓ,  P_ℓ ≪ Q,

where dP_ℓ/dQ is the Radon–Nikodym derivative of P_ℓ with respect to Q and the logarithm is taken to base 2. Then, the generalized JS divergence for the probability measures P_1, …, P_k is defined as follows:¹

(2.5)  D_π^JS(P_1, …, P_k) := ∑_{ℓ=1}^k π_ℓ D_KL(P_ℓ||Q),

which is a convex combination of KL divergences between P_ℓ and Q. A schematic illustration of how it measures the differences between P_1, P_2, and P_3 on the probability simplex is shown in Figure 3.

[Figure 3: the 2-simplex spanned by a_1, a_2, a_3, with P_1, P_2, P_3 and Q := ∑_{ℓ=1}^3 π_ℓ P_ℓ, where ∑_{ℓ=1}^3 π_ℓ = 1.] Figure 3. Suppose that A = {a_1, a_2, a_3} and A = 2^A. Then, the generalized JS divergence on the 2-simplex measures the similarity between the probability measures P_1, P_2, and P_3 by means of a convex combination of KL divergences between P_ℓ and ∑_{ℓ=1}^3 π_ℓ P_ℓ. Note that Q := ∑_{ℓ=1}^3 π_ℓ P_ℓ is again a probability measure.

Setting π_ℓ = |R_ℓ|/|Ω| for ℓ = 1, …, k, we can rewrite the mutual information between Φ(X) and u(X) as follows:

  I(Φ(X); u(X)) = h(Q) − ∑_{ℓ=1}^k π_ℓ h(P_ℓ)
                = −∫_{R^n} log(dQ/dµ) dQ + ∑_{ℓ=1}^k π_ℓ ∫_{R^n} log(dP_ℓ/dµ) dP_ℓ
                = −∑_{ℓ=1}^k π_ℓ ∫_{R^n} log(dQ/dµ) dP_ℓ + ∑_{ℓ=1}^k π_ℓ ∫_{R^n} log(dP_ℓ/dµ) dP_ℓ
                = ∑_{ℓ=1}^k π_ℓ ∫_{R^n} log(dP_ℓ/dQ) dP_ℓ
                = ∑_{ℓ=1}^k π_ℓ D_KL(P_ℓ||Q).

¹Note that the JS divergence between two probability distributions P_1 and P_2 is defined as D_JS(P_1||P_2) := (1/2) D_KL(P_1||M) + (1/2) D_KL(P_2||M), where M = (1/2)(P_1 + P_2). One can find its generalization, i.e., the generalized JS divergence, in [34].

Thus, maximizing the mutual information between Φ(X) and u(X) can be understood as maximizing the generalized JS divergence of the probability measures P_1, …, P_k, which maximally separates the associated distributions from Q in terms of the weighted KL divergences.
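For discrete (histogram) distributions, (2.4) and (2.5) translate directly into a few lines of code. The following minimal sketch—our own helper, not part of the paper's implementation—computes D_π^JS and lets one check numerically that it never exceeds log₂ k (cf. Lemma 2.2 below):

```python
import numpy as np

def generalized_js(ps, pis, eps=1e-12):
    """Generalized JS divergence (2.5) for discrete distributions.

    ps  : array of shape (k, m) -- k probability vectors over m bins
    pis : array of shape (k,)   -- convex weights pi_1, ..., pi_k
    Logarithms are taken to base 2, matching (2.4).
    """
    ps = np.asarray(ps, dtype=float)
    pis = np.asarray(pis, dtype=float)
    q = pis @ ps                      # mixture Q = sum_l pi_l P_l
    kl = np.where(ps > eps, ps * np.log2((ps + eps) / (q + eps)), 0.0).sum(axis=1)
    return float(pis @ kl)            # sum_l pi_l KL(P_l || Q)

# Identical distributions give 0; disjoint supports give the maximum log2(k).
p = np.array([1/3, 1/3, 1/3])
print(generalized_js([p, p], [0.5, 0.5]))            # ~0.0
print(generalized_js([[1, 0], [0, 1]], [0.5, 0.5]))  # ~1.0 = log2(2)
```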
The following lemma is then quite straightforward.

Lemma 2.2. 0 ≤ D_π^JS(P_1, …, P_k) ≤ log k.

Proof. It follows from the fact that mutual information is nonnegative and I(Φ(X); u(X)) ≤ h(u(X)) ≤ log k, where the last inequality follows from the number of labels (cardinality) in the range of u(X).

2.3. Regularized mutual information minimization. The fundamental problem we are now faced with is the fact that computing the mutual information between Φ(X) and u(X) involves the ground truth model, {Ω_ℓ}_{ℓ=1}^k and P_1, …, P_k. Obviously, one cannot compute the mutual information in practice because the ground truth model is unknown to us, let alone maximize it. Hence, we replace the mutual information with its estimate. To this end, let us remind the readers of what we assumed in section 2.1: Suppose P_ℓ ≪ µ for ℓ = 1, …, k. Then, the density of each P_ℓ, denoted by p_ℓ := p_{Φ(X)|u_ℓ(X)=1}, is defined by dP_ℓ/dµ. Let {φ(x) : u_ℓ(x) = 1, x ∈ Ω} be a collection of independent and identically distributed samples drawn from the unknown density p_ℓ. Then, for ℓ = 1, …, k, we estimate the unknown density by means of kernel density estimation as follows:

  p̂_ℓ(ϕ) = ∫_Ω K_H(ϕ − φ(x)) u_ℓ(x) dx / ∫_Ω u_ℓ(x) dx,

where K_H is the kernel function with the n × n bandwidth matrix H (symmetric and positive definite). When n = 1, for instance, K_H is the Gaussian kernel with the scalar bandwidth H, which is given by

  K_H(ϕ − φ(x)) = (1/√(2πH²)) exp(−(ϕ − φ(x))²/(2H²)).

In [47], one can find a number of feasible definitions of the kernel function and the bandwidth matrices. We then estimate the conditional entropy terms in (2.3) using the integral estimate proposed in [20] as follows:

(2.6)  ĥ(Φ(X)|u_ℓ(X) = 1) := −∫_{R^n} p̂_ℓ(ϕ) log p̂_ℓ(ϕ) dϕ,

where it is worth mentioning that the above estimate converges to the true entropy almost surely as the number of samples for kernel density estimation goes to infinity. In practice, the collection {φ(x) : u_ℓ(x) = 1, x ∈ Ω} is given as a finite set whose cardinality is the number of pixels in R_ℓ. Thus, the kernel density estimate is computed as p̂_ℓ(ϕ) = (1/N) ∑_{i=1}^N K_H(ϕ − φ(x_i)), where N is the number of pixels. Let ĥ_N be an integral estimate of the true entropy h computed by the above kernel density estimate; then Dmitriev and Tarasenko [20] showed that Pr(lim_{N→∞} ĥ_N = h) = 1. On the other hand, the entropy terms in [30] are estimated by using the resubstitution estimate proposed in [1], which is essentially the law of large numbers. That estimate is known to be mean square consistent. For further details, we refer the readers to an overview of nonparametric entropy estimation [6].

Now, the following maximization problem (maximizing the mutual information estimate) looks tempting in that we can actually compute its solutions:

  maximize_{ {R_ℓ}_{ℓ=1}^k }  Î(Φ(X); u(X)) := ĥ(Φ(X)) − ĥ(Φ(X)|u(X)).

Yet, as often observed in empirical risk minimization [37], the solutions of such empirical objectives (risks) typically exhibit overfitting, which is obviously unwanted.
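As an aside before addressing the overfitting issue, the kernel density and integral entropy estimates above can be illustrated numerically. In this minimal sketch, the test distribution, bandwidth, and integration grid are our own choices, not the paper's settings:

```python
import numpy as np

def kde_gaussian(phi_vals, samples, h):
    """1-D Parzen-Rosenblatt estimate p_hat evaluated at phi_vals (scalar bandwidth h)."""
    diff = phi_vals[:, None] - samples[None, :]                  # (m, N)
    k = np.exp(-diff**2 / (2 * h**2)) / np.sqrt(2 * np.pi * h**2)
    return k.mean(axis=1)                                        # (1/N) sum_i K_H(phi - phi(x_i))

def entropy_integral(samples, h, grid):
    """Integral entropy estimate (2.6): -int p_hat log2 p_hat, by a Riemann sum."""
    p = kde_gaussian(grid, samples, h)
    integrand = np.where(p > 0, -p * np.log2(p), 0.0)
    return np.sum(integrand) * (grid[1] - grid[0])

rng = np.random.default_rng(0)
samples = rng.normal(0.5, 0.1, 2000)        # pixel intensities of one "region"
grid = np.linspace(-0.5, 1.5, 1001)
print(entropy_integral(samples, h=0.05, grid=grid))
# The true differential entropy of N(0.5, 0.1) is 0.5*log2(2*pi*e*sigma^2) ~ -1.27 bits;
# the estimate comes out slightly larger because the kernel smoothing inflates the variance.
```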
Therefore, we penalize the complexity of solutions by combining the mutual information estimate Î(Φ(X); u(X)) with spatial regularity—which is what we call information-theoretic minimal partitioning—as follows: for an observed function φ(x) of Φ(x) a.e. in Ω,

(2.7)  minimize_{ {R_ℓ}_{ℓ=1}^k }  (1/2) ∑_{ℓ=1}^k Per_g(R_ℓ; Ω) − λ ∫_Ω Î(Φ(X); u(X)) dx,

where the first term is half the sum of the perimeters of the sets {R_ℓ}_{ℓ=1}^k with respect to a metric defined by a nonnegative function g(x), which includes the geodesic active contours [10]—see [17] for more details. The mutual information estimate Î(Φ(X); u(X)) is now integrated over the entire (bounded) domain to incorporate the total amount of information. Note that the mutual information between Φ(X) and u(X) is the amount of information defined at a single (random) location X. Then, we can rewrite the second term as λ|Ω|Î(Φ(X); u(X)) since the mutual information estimate is not a function of x; here λ > 0 is a parameter. See [30] for details. In short, the problem combines maximal distribution separation—one might want to recall the fact that the mutual information is the same as the generalized JS divergence estimate (see section 2.2)—with spatial regularity in the sense of the total length of the region boundaries. Also, one may want to write the problem as

  minimize_{ {R_ℓ}_{ℓ=1}^k }  (1/2) ∑_{ℓ=1}^k Per_g(R_ℓ; Ω) − λ|Ω| D_π^JS(P̂_1, …, P̂_k),

where π_ℓ = |R_ℓ|/|Ω|. This is clearly the same as (2.7).

2.3.1. The bounded variation framework. We introduced the characteristic function to encode the candidate partition {R_ℓ}_{ℓ=1}^k, a geometric variable, as shown in (2.2). As can be found in [33, 11], we further assume that each u_ℓ is in BV(Ω), the space of functions of bounded variation [3], so that the k-tuple u = (u_1, …, u_k) ∈ BV(Ω; R^k) satisfies ∑_{ℓ=1}^k u_ℓ(x) = 1, where u_ℓ(x) ∈ {0, 1}, almost everywhere in Ω. Then, the perimeter of R_ℓ in Ω is defined as the total variation of its characteristic function, ∫_Ω |Dχ_{R_ℓ}|; thus we rewrite (2.7) as

(2.8)  minimize_{u∈∆}  E(u) := (1/2) ∑_{ℓ=1}^k ∫_Ω g|Du_ℓ| − λ|Ω| Î(Φ(X); u(X))

with the set

(2.9)  ∆ = { u ∈ BV(Ω; {0,1}^k) : ∑_{ℓ=1}^k u_ℓ = 1 a.e. in Ω },

where g(x) is a nonnegative function as described in (2.7). The encoding scheme is illustrated in Figure 2(c). The most immediate question that follows is whether the above partitioning problem admits a minimizer.

Theorem 2.3 (existence theorem). There exists a minimizer u(·) such that inf{E(u) : u ∈ ∆} is achieved.

Proof. We follow the direct method of the calculus of variations: (1) Since inf{E(u) : u ∈ ∆} > −∞ (recall that the mutual information estimate is bounded—see Lemma 2.2), there exists a minimizing sequence {u_n}_{n≥1} ∈ ∆, meaning E(u_n) → inf{E(u) : u ∈ ∆}, such that sup_n ||u_n||_BV < +∞. (2) Then, the minimizing sequence admits some subsequence {u_{n_h}}_{h≥1} converging to u⋆ ∈ ∆ in L¹(Ω; {0,1}^k) by the compactness result in BV (see [3, Theorem 3.23]). (3) Since the total variation is lower semicontinuous (l.s.c.) in the L¹(Ω; {0,1}^k) topology, we only need to show that the mutual information estimate Î(Φ(X); u(X)) is also l.s.c. in the same topology.
To this end, recall that the integral estimate of the differential entropy is continuous with respect to convergence in variational distance (which is the same as L¹-convergence) [54]. Hence, it suffices to show that kernel density estimates are continuous with respect to the L¹(Ω; {0,1}) topology (recall the universal property of the product topology); i.e., we claim that ∀ε > 0, ∃δ > 0 such that

  ||u_ℓ − u_ℓ⋆||_{L¹(Ω;{0,1})} < δ  ⟹  ||p̂_{Φ(X)|u_ℓ(X)=1} − p̂_{Φ(X)|u_ℓ⋆(X)=1}||_TV < ε,

where ||·||_TV is the total variation distance (variational distance) of probability measures. See Appendix A for the proof of the claim. Then, E(u) is sequentially l.s.c. with respect to the L¹(Ω; {0,1}^k) topology.

Now that the existence of a minimizer has been established, we turn our attention to the question of how to actually compute such a minimizer within a reasonable time. Indeed, the rest of this section is devoted to finding out the underlying mathematical structure of the information-theoretic minimal partitioning problem, upon which we develop an efficient optimization algorithm for the problem with an in-depth analysis and demonstrate the capacity of our approach in numerous experiments in the second half of the paper.

We start off by reminding the readers of convex relaxation techniques for the total variation of characteristic functions (of Caccioppoli sets). Note that its discrete counterpart, known as the Potts model, is NP-hard [7], so we should make use of convex relaxation methods to deal with the constraint set ∆. Following the local convex envelope approach to minimal partitions [11], we set J : L²(Ω; R^k) → [0, +∞] such that u ↦ J(u) = (1/2) ∑_{ℓ=1}^k ∫_Ω |Du_ℓ| if u ∈ ∆ and +∞ otherwise. Then, the information-theoretic minimal partitioning problem, i.e., (2.8) with (2.9), can be rewritten as follows (here we assume g = 1):

(2.10)  minimize_{u∈L²(Ω;R^k)}  J(u) − λ|Ω| Î(Φ(X); u(X)).

Let J** be the biconjugate of J, i.e., J**(u) = sup{⟨v, u⟩ − J*(v) : v ∈ L²(Ω; R^k)}, where J* is the Fenchel conjugate of J—see, for example, [5]—and the (effective) domain of J is ∆. From the convex envelope approach [11, Proposition 3.1], we have the domain of J** as follows:

  dom J** = { u ∈ BV(Ω; [0,1]^k) : ∑_{ℓ=1}^k u_ℓ = 1 a.e. in Ω }.

Unfortunately, it is believed that no explicit form of J** is known. As a consequence, a significant amount of attention has been paid in recent years [57, 33, 11] to searching for a convex, l.s.c. function J̆ such that J̆ ≤ J in dom J**. Considering the dual representation of the total variation, it turns out that the function J̆ is of the form

(2.11)  J̆(u) = sup{ −∫_Ω ∑_{ℓ=1}^k u_ℓ div ξ_ℓ : ξ(x) ∈ K ∀ x ∈ Ω },

where the dual variable ξ := (ξ_1, …, ξ_k) is an element of the set K ⊂ C_c¹(Ω; R^{d×k}). Then, the convex relaxation techniques developed so far boil down to the way constraints are imposed on the dual variable ξ through the convex set K. For example, Zach et al. [57] proposed the set K as follows:

(2.12)  K := K_Zach = { ξ ∈ C_c¹(Ω; R^{d×k}) : |ξ_ℓ| ≤ 1/2 ∀ ℓ = 1, …, k },

where |·| denotes the Euclidean norm in R^d.
The approach of Lellmann and Schnörr [33], on the other hand, reduces to the following convex set:

(2.13)  K := K_Lellmann = { ξ ∈ C_c¹(Ω; R^{d×k}) : |ξ| ≤ 1/√2 },

where |·| denotes the Frobenius norm in R^{d×k}. The convex set from the local convex envelope approach proposed by Chambolle, Cremers, and Pock [11] is given as

(2.14)  K := K_Chambolle = { ξ ∈ C_c¹(Ω; R^{d×k}) : |ξ_i − ξ_j| ≤ 1 ∀ i < j },

where |·| denotes the Euclidean norm in R^d. See Lemma 3.4 and Proposition 5.1 in [11] for more details on tightness. Therefore, the relaxed problem is given as

(2.15)  minimize_{u∈dom J**}  E_rel(u) := J̆(u) − λ|Ω| Î(Φ(X); u(X)),

where J̆(u) is given in (2.11) and the respective convex set K is either (2.12), (2.13), or (2.14). Note that it is straightforward to incorporate a nonnegative function g into the convex sets.

2.3.2. Difference of convex functionals. Now that we have established the existence of minimizers (Theorem 2.3), we aim to compute such a minimizer of the problem. Indeed, if the relaxed problem (2.15) is convex in dom J** (in other words, if −λ|Ω|Î(Φ(X); u(X)) is convex), one can make use of suitable convex optimization techniques to efficiently compute a globally optimal solution of (2.15). Then, a near-optimal binary solution of the original nonconvex problem, (2.8) with (2.9), can be obtained by thresholding the global minimizer of the relaxed problem with a performance guarantee (per-instance energy bound²).

²Let u⋆_bin be a thresholded version of u⋆_rel, a global minimizer of E_rel(u) over the domain dom J**, that is to say, u⋆_bin = 1_{u⋆_rel ≥ θ} for any value θ ∈ (0, 1), and let u_opt be a globally optimal binary solution of E(u). A per-instance energy bound for the optimal solution is given by |E(u⋆_bin) − E(u_opt)| ≤ |E(u⋆_bin) − E(u⋆_rel)|; i.e., by evaluating E(u) at u⋆_rel and u⋆_bin, one can check how close u⋆_bin is to u_opt energetically.

Therefore, we need to investigate the structure of the objective to check whether the overall energy is convex.

Proposition 2.4. Suppose that λ is nonnegative and the domain Ω is bounded; then the data term, −λ|Ω|Î(Φ(X); u(X)), is concave in u ∈ dom J**.

Proof. Since the entropy estimate ĥ(Φ(X)) is a constant with respect to u, it suffices to show that the conditional entropy estimate ĥ(Φ(X)|u(X)) is concave in u. To this end, we rewrite Pr(u_ℓ(X) = 1) ĥ(Φ(X)|u_ℓ(X) = 1) using its integral estimate (2.6) as follows:

  Ẽ(u_ℓ) := Pr(u_ℓ(X) = 1) ĥ(Φ(X)|u_ℓ(X) = 1)
         = −(1/|Ω|) ∫_Ω u_ℓ(x) dx ∫_{R^n} p̂_ℓ(ϕ) log p̂_ℓ(ϕ) dϕ
         = −(1/|Ω|) ∫_{R^n} ∫_Ω K_H(ϕ − φ(x)) u_ℓ(x) dx log [ ∫_Ω K_H(ϕ − φ(x)) u_ℓ(x) dx / ∫_Ω u_ℓ(x) dx ] dϕ,

where the second equality is from Fubini's theorem. Then, for 0 ≤ t ≤ 1 and u_{ℓ1}, u_{ℓ2} ∈ dom J**, we have

  Ẽ(t u_{ℓ1} + (1 − t) u_{ℓ2})
    = −(1/|Ω|) ∫_{R^n} ∫_Ω K_H(ϕ − φ(x)) (t u_{ℓ1}(x) + (1 − t) u_{ℓ2}(x)) dx
        × log [ ∫_Ω K_H(ϕ − φ(x)) (t u_{ℓ1}(x) + (1 − t) u_{ℓ2}(x)) dx / ∫_Ω (t u_{ℓ1}(x) + (1 − t) u_{ℓ2}(x)) dx ] dϕ
    ≥ −(t/|Ω|) ∫_{R^n} ∫_Ω K_H(ϕ − φ(x)) u_{ℓ1}(x) dx log p̂_{ℓ1}(ϕ) dϕ
      − ((1 − t)/|Ω|) ∫_{R^n} ∫_Ω K_H(ϕ − φ(x)) u_{ℓ2}(x) dx log p̂_{ℓ2}(ϕ) dϕ
    = t Ẽ(u_{ℓ1}) + (1 − t) Ẽ(u_{ℓ2}),

where the inequality is from the log sum inequality [16]. Since the sum of concave functions is concave, ĥ(Φ(X)|u(X)) is concave in u.

Proposition 2.4 implies that the overall energy E_rel is nonconvex.
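Proposition 2.4 also lends itself to a quick numerical sanity check. The sketch below—a discrete 1-D analogue with toy data, kernel, and grid of our own choosing—evaluates the weighted conditional-entropy term and verifies midpoint concavity along random segments:

```python
import numpy as np

rng = np.random.default_rng(1)
phi = rng.random(200)                      # observed 1-D "image"
grid = np.linspace(-0.5, 1.5, 400)
# Unnormalized Gaussian kernel matrix K_H(phi_grid - phi(x)); constants do not affect concavity.
K = np.exp(-(grid[:, None] - phi[None, :])**2 / (2 * 0.05**2))

def E_tilde(u, eps=1e-12):
    """Discrete analogue of the weighted conditional-entropy term in Proposition 2.4."""
    w = K @ u                              # ~ int K_H(phi - phi(x)) u(x) dx, on the grid
    s = u.sum() + eps                      # ~ int u(x) dx
    dphi = grid[1] - grid[0]
    return -np.sum(w * np.log((w + eps) / s)) * dphi / len(phi)

# Midpoint-concavity check: w log(w/s) is the (jointly convex) perspective of t log t,
# and both w and s are linear in u, so -sum is concave -- the log sum inequality in action.
for _ in range(5):
    u1, u2 = rng.random(200), rng.random(200)
    lhs = E_tilde(0.5 * (u1 + u2))
    rhs = 0.5 * E_tilde(u1) + 0.5 * E_tilde(u2)
    print(lhs >= rhs - 1e-10)              # True on every trial
```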
Thus, it does not seem straightforward to efficiently minimize the relaxed energy in order to compute a solution such that inf{E_rel(u) : u ∈ dom J**} is achieved, let alone a near-optimal binary solution of the original nonconvex problem. Yet, we make an intriguing observation on the structure of the relaxed problem as follows.

Proposition 2.5. E_rel(u) with the domain dom J** is a form of the DC functionals.

Proof. Let Γ_0(H) denote the class of all proper, l.s.c., and convex functions³ defined on a real Hilbert space H and taking values in (−∞, ∞]; suppose that H = L²(Ω; R^k). Then, as previously done for the total variation of u, ∑_{ℓ=1}^k ∫_Ω |Du_ℓ|, we can similarly extend J̆(u) to H by letting J̆(u) = +∞ if u ∉ dom J**. In other words, setting 𝒥 : H → [0, +∞] such that u ↦ J̆(u) if u ∈ dom J**, +∞ otherwise, we rewrite the problem (2.15) as follows:

(2.16)  minimize_{u∈H}  𝒥(u) − λ|Ω| Î(Φ(X); u(X)).

Here, the regularization term 𝒥(u) is clearly in Γ_0(H). For the data term λ|Ω|Î(Φ(X); u(X)), one should check that the following inner product makes sense (integrability):

  ⟨K_H(ϕ − φ), u_ℓ⟩_{L²(Ω;R)} = ∫_Ω K_H(ϕ − φ(x)) u_ℓ(x) dx,

where ϕ ∈ R^n. That is, one checks for almost every ϕ that the function x ↦ K_H(ϕ − φ(x)) u_ℓ(x) is integrable, which can be shown by Fubini's theorem and the Cauchy–Schwarz inequality [48]. Hence, the data term is also in Γ_0(H).

³A key property of functions in Γ_0(H) is that every element h ∈ Γ_0(H) is characterized by the supremum of its continuous affine minorants: h(x) = sup{⟨u, x⟩ − h*(u) : u ∈ H} ∀ x ∈ H, where h*, defined by h*(u) = sup{⟨x, u⟩ − h(x) : x ∈ H} ∀ u ∈ H, is the Fenchel conjugate of h. In fact, h* ∈ Γ_0(H).

Therefore, we observe that the problem amounts to the minimization of a difference of convex functionals, where convex analysis and optimization come into play.

3. Sequential convex programming. What we have established so far is as follows: The information-theoretic minimal partitioning problem admits a minimizer; however, the overall relaxation is still nonconvex, so the efficient computation of such a minimizer becomes a major algorithmic challenge. Yet, we have shown that the relaxed problem exhibits a form of DC functionals, which, in this section, leads us to DC programming, or the so-called majorization-minimization (MM) algorithm, where convex optimization techniques can be successfully applied. Therefore, we start off the section with essential preliminaries on convex analysis for DC programming and the DCA, where the notions of DC duality and optimality conditions become crucial for designing such a convergent algorithm. For more details, we refer the readers to [51, 52, 26, 4, 19] and references therein. Then, we consider a larger class of DC functions where a linear operator is involved—the partitioning problem we consider falls into this class.

3.1. Preliminaries: Convex analysis for DC programming and DCA. With the convention +∞ − (+∞) = +∞, the canonical DC programming we consider is of the form

(3.1)  α = inf{ f(u) := g(u) − h(u) : u ∈ H },

where g, h ∈ Γ_0(H). The function f is called a DC function, and its components g and h are called DC components.
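As a preview of the DCA iteration discussed below (Algorithm 1), here is a minimal, self-contained Python toy of the canonical form (3.1). The decomposition g(u) = ½||u||², h(u) = ||u||₁ is our own illustrative choice, not the paper's energy; for it, ∂g* is the identity and sign(u) is a subgradient of h:

```python
import numpy as np

# Toy DC program of the canonical form (3.1):
#   f(u) = g(u) - h(u),  g(u) = 0.5 * ||u||^2,  h(u) = ||u||_1  (both convex).
def dca(u, iters=20):
    f = lambda u: 0.5 * np.dot(u, u) - np.abs(u).sum()
    for _ in range(iters):
        v = np.sign(u)                     # v^n in dh(u^n); sign(0) = 0 is a valid choice
        u_next = v                         # u^{n+1} = grad g*(v^n) = v^n, since g* = 0.5||.||^2
        assert f(u_next) <= f(u) + 1e-12   # the DC values decrease monotonically
        u = u_next
    return u

u0 = np.array([0.3, -2.0, 0.0])
print(dca(u0))   # [ 1. -1.  0.]: a critical point that depends on the initialization
```

The toy already exhibits the behavior discussed in section 1.1: the iterates converge after one step to a critical point determined entirely by the signs of the initial point, which is exactly why the paper later perturbs the procedure stochastically.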
A closed convex constraint set C ⊂ H is incorporated into the canonical form (3.1) by adding the indicator function δ_C of C to g, i.e., δ_C(u) = 0 if u ∈ C, and +∞ otherwise. Substituting h in the primal program (3.1) with its Fenchel conjugate, we obtain the Fenchel–Rockafellar-type dual

(3.2)  α = inf{ h*(v) − g*(v) : v ∈ H },

where the finiteness of α implies dom g ⊂ dom h (dually, dom h* ⊂ dom g*). Then, one can clearly see that the dual program (3.2) is also another DC problem whose dual (dual of dual) is exactly the primal (3.1). In addition, the optimal value in the dual program (3.2) is the same as that of the primal program—see the Toland–Singer theorem [5, Corollary 14.20]. This property, also known as DC duality, plays a pivotal role in establishing optimality conditions in DC programming, and it has opened up new possibilities for convex optimization to tackle a wide class of nonconvex optimization problems.⁴

Theorem 3.1 (KKT conditions and local optimality [51]). Let u* be a critical point of f in (3.1); then ∂g(u*) ∩ ∂h(u*) ≠ ∅. Furthermore, if u* is a local minimizer, then ∂h(u*) ⊂ ∂g(u*). Dually, let v* be a critical point (resp., local minimizer) of h* − g* in (3.2); then ∂h*(v*) ∩ ∂g*(v*) ≠ ∅ (resp., ∂g*(v*) ⊂ ∂h*(v*)).

Based on the above conditions and DC duality, a globally convergent algorithm⁵—also known as the DCA—was proposed in [51, 52]. Again, notice that a globally convergent algorithm does not guarantee that a minimizing sequence converges to a globally optimal solution achieving the infimum of the objective. Rather, the algorithm minimizes (3.1) and (3.2) in such a way that it lets the sequences {g(u^n) − h(u^n)} and {h*(v^n) − g*(v^n)} decrease monotonically. More importantly, the DCA constructs a sequence {(u^n, v^n)} converging to a pair of critical points (u*, v*) under some regularity conditions (see section 3.2 in [51]). See Algorithm 1.

Algorithm 1. Simplified DCA [51, 52]
1: Choose u^0 ∈ dom g.
2: while not converged do
3:   v^n ∈ ∂h(u^n).
4:   u^{n+1} ∈ ∂g*(v^n).
5: end while

Meanwhile, the algorithm can be alternatively derived by replacing h in the primal (3.1) with its Fenchel conjugate:

(3.3)  α = inf_u inf_v { g(u) + h*(v) − ⟨u, v⟩ : u, v ∈ H },

which is convex in u for fixed v ∈ H and in v for fixed u ∈ H, respectively. Note that the problem is not jointly convex with respect to both u and v. Hence, we make use of alternating minimization for a given initial point u^0 ∈ dom g as follows:

  v^n ∈ argmin{ g(u^n) + h*(v) − ⟨u^n, v⟩ : v ∈ H } ⟺ 0 ∈ ∂h*(v^n) − u^n ⟺ v^n ∈ ∂h(u^n);
  u^{n+1} ∈ argmin{ g(u) + h*(v^n) − ⟨u, v^n⟩ : u ∈ H } ⟺ 0 ∈ ∂g(u^{n+1}) − v^n ⟺ u^{n+1} ∈ ∂g*(v^n).

The above derivation boils down to Algorithm 1. As can be clearly seen, the algorithm does not guarantee the global optimality of solutions due to the nature of alternating optimization. In what follows, we consider a larger class of DC functions where a linear operator is involved. Indeed, the objective we consider in (2.15) falls into this class.

⁴According to [51] and references therein, the vector space of DC functions, DC(H) = Γ_0(H) − Γ_0(H), encompasses most real-world objectives and is closed under the usual operations in optimization.
⁵Roughly speaking, an iterative algorithm A is said to be globally convergent if, for any initial point u^0, a sequence {u^t}_{t≥0} generated by the algorithm converges to a point for which an optimality condition holds. For precise definitions, see section 3 in [32].

3.2. Primal-dual DC procedure. Let K be a real Hilbert space, and let L : H → K be a continuous linear operator with the induced norm ||L|| = sup{||Lu|| : u ∈ H with ||u|| ≤ 1}. Then, the problem (2.16) is of the form

(3.4)  β = inf{ g(Lu) − h(u) : u ∈ H },

where g ∈ Γ_0(K) and h ∈ Γ_0(H). Similarly, the finiteness of β implies dom g ⊂ L(dom h). Here, one notices that it is far from straightforward to derive the Fenchel–Rockafellar-type dual of (3.4), in contrast to (3.1). In addition, it does not seem helpful to do so, considering that one would not be able to make use of Theorem 3.1 and Algorithm 1; i.e., the equivalence v^n ∈ ∂(g ◦ L)(u^{n+1}) ⟺ u^{n+1} ∈ ∂(g ◦ L)*(v^n) need not hold in general. On the other hand, if the composite g ◦ L is in Γ_0(H), one can apply the algorithm by replacing g* with (g ◦ L)*. The total variation 𝒥(u) is in Γ_0(H), as observed in section 2.3.2. Thus, we arrive at a primal-dual DC procedure in which u^{n+1} ∈ ∂(g ◦ L)*(v^n) is formulated as a saddle-point problem that can be solved by the first-order primal-dual (Chambolle–Pock) algorithm in [12]. Note that different types of first-order algorithms—such as Nesterov's method, Bregman iterative algorithms, and the Douglas–Rachford splitting method—can be used for such a saddle-point problem; see section 6 in [33] for a comparison.

Suppose that q ∈ Γ_0(H) is added into the form (3.4) to take additional convex constraints into account. Then, for a given initial point u^0 ∈ L^{-1}(dom g), the procedure is as follows:

  v^n ∈ argmin{ g(Lu^n) + h*(v) − ⟨u^n, v⟩ + q(u^n) : v ∈ H }
    ⟺ 0 ∈ ∂h*(v^n) − u^n ⟺ v^n ∈ ∂h(u^n);
  u^{n+1} ∈ argmin{ g(Lu) + h*(v^n) − ⟨u, v^n⟩ + q(u) : u ∈ H }
    ⟺ (u^{n+1}, ξ̂) ∈ argminimax{ ⟨Lu, ξ⟩ − g*(ξ) + h*(v^n) − ⟨u, v^n⟩ + q(u) : u ∈ H, ξ ∈ K }
    ⟺ Lu^{n+1} ∈ ∂g*(ξ̂), v^n − L*ξ̂ ∈ ∂q(u^{n+1}),

where argminimax denotes the set of saddle points of the argument (see Definition 11.49 in [45]). As mentioned above, we apply the Chambolle–Pock algorithm [12] to solve the saddle-point problem, i.e., Lu^{n+1} ∈ ∂g*(ξ̂), v^n − L*ξ̂ ∈ ∂q(u^{n+1}).

Remark 3.2. When g ◦ L ∈ Γ_0(H), the convergence analysis of the primal-dual DC procedure becomes straightforward from Theorem 3.7 in [52] and Theorem 1 in [12] by replacing g with g ◦ L. Note that the sequence converges weakly in general real Hilbert spaces, and if h in the primal is differentiable, the DC algorithm reduces to the concave-convex procedure [56]. This implies that the concave-convex procedure is also globally convergent, which was proved alternatively in [32] by means of Zangwill's global convergence theory of iterative algorithms. The MM principle [27], on the other hand, is capable of encompassing a larger class of nonconvex problems, providing a rather general strategy for constructing surrogate functions (majorizers).
However, constructing a good surrogate of the original nonconvex objective is in general far from straightforward, let alone establishing its optimality conditions. Note that the monotonicity of a minimizing sequence itself does not guarantee its convergence to a critical point of the objective function. Last, one recognizes that the philosophy behind the DC algorithm can be understood as an MM algorithm when the objective is a DC function and its concave part is majorized by the first-order Taylor expansion. Then, from an algorithmic perspective, the major difference between the current study and the preliminary conference work of the authors [28] is that the surrogate function in the primal-dual DC procedure consists of a linear function and a convex regularizer, whereas the one in [28] consists of a quadratic function and a convex regularizer—more importantly, the convergence of a minimizing sequence was not properly addressed there. In addition, we will show later on that the proposed method is substantially faster.

Now, we apply the primal-dual DC procedure to the problem (2.16) in order to compute an optimal solution (critical point). Because the negative mutual information estimate is elementwise Gâteaux differentiable at u ∈ L²(Ω; R^k), the subdifferential ∂h(u) := (∂h(u_1), …, ∂h(u_k)) is a singleton whose elements are computed as

(3.5)  v_ℓ ∈ ∂h(u_ℓ) = ∫_{R^n} K_H(ϕ − φ(x)) log(1/p̂_ℓ(ϕ)) dϕ,

from which v is defined by the k-tuple v := (v_1, …, v_k). Let us recall the dual representation of the total variation seminorm and its convex relaxations—see section 2.3.1. We identify g*(ξ) as δ_K(ξ), where the convex set K is either K_Zach, K_Lellmann, or K_Chambolle. The proximity operator of δ_K(ξ), denoted by Prox_{g*}, is an orthogonal projector onto the convex set K. Likewise, q(u) is identified as δ_S(u), where S := dom J**. The proximity operator of δ_S(u), denoted by Prox_q, is then an orthogonal projector onto the set S. See Algorithm 2 for the overall procedure. We refer the readers to sections 5 and 6 in [33] for implementation details on such projectors.

Algorithm 2. Primal-dual DC procedure
1: Choose u^0 ∈ dom g, ξ^0 ∈ K, and θ ∈ [0, 1].
2: Choose τ, σ > 0 such that τσ||L||² < 1.
3: while not converged do
4:   v^{n₁} ∈ ∂h(u^{n₁}).
5:   while not converged do
6:     ξ^{n₂+1} = Prox_{σg*}(ξ^{n₂} + σLū^{n₂}).
7:     u^{n₂+1} = Prox_{τq}(u^{n₂} − τ(v^{n₁} − L*ξ^{n₂+1})).
8:     ū^{n₂+1} = u^{n₂+1} + θ(u^{n₂+1} − u^{n₂}).
9:   end while
10:  u^{n₁} = u^{n₂+1}.
11: end while

Remark 3.3. Suppose that the concave part of a DC function is differentiable, that is, ∂h(u) is a singleton. Then, in the concave-convex procedure, h(u) is linearized as

(3.6)  −h(u) ≤ −h(u^n) − ⟨∂h(u^n), u − u^n⟩,  u^n ∈ H,

where ∂h(u^n) is computed as in (3.5) for the problem (2.16). From an MM perspective, the right-hand side is one of the majorizers that can be constructed in various ways. Clearly, the linear one is the tightest convex upper bound for the concave function h. As previously mentioned, one can find a quadratic upper bound in [28]; see Proposition 3 therein.
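Before turning to the stochastic variant (Algorithm 3 below), here is a minimal, self-contained Python sketch of the inner saddle-point loop of Algorithm 2 (lines 5–9). It assumes the K_Zach set (2.12), θ = 1, forward differences for L = ∇ (so L*ξ = −div ξ and ||L||² ≤ 8), and a Chan–Vese-like linear data term standing in for the mutual-information linearization (3.5); it is an illustration under our own conventions, not the authors' MATLAB implementation:

```python
import numpy as np

def grad(u):
    """Forward-difference gradient of u (M, N, k) -> (M, N, k, 2); L in Algorithm 2."""
    g = np.zeros(u.shape + (2,))
    g[:-1, :, :, 0] = u[1:] - u[:-1]
    g[:, :-1, :, 1] = u[:, 1:] - u[:, :-1]
    return g

def div(xi):
    """Discrete divergence, the negative adjoint of grad (so L* xi = -div(xi))."""
    x, y = xi[..., 0], xi[..., 1]
    d = np.zeros(x.shape)
    d[:-1] += x[:-1]; d[1:] -= x[:-1]
    d[:, :-1] += y[:, :-1]; d[:, 1:] -= y[:, :-1]
    return d

def proj_zach(xi):
    """Projection onto K_Zach (2.12): |xi_l(x)| <= 1/2 for each channel l."""
    nrm = np.linalg.norm(xi, axis=3, keepdims=True)
    return xi / np.maximum(1.0, nrm / 0.5)

def proj_simplex(u):
    """Pixelwise projection onto {u >= 0, sum_l u_l = 1}: Prox_q with S = dom J**."""
    s = np.sort(u, axis=2)[:, :, ::-1]
    css = np.cumsum(s, axis=2)
    j = np.arange(1, u.shape[2] + 1)
    rho = np.sum(s - (css - 1.0) / j > 0, axis=2, keepdims=True)
    theta = (np.take_along_axis(css, rho - 1, axis=2) - 1.0) / rho
    return np.maximum(u - theta, 0.0)

def inner_loop(u, v, n2=200, sigma=0.5, tau=0.2, theta=1.0):
    """Lines 5-9 of Algorithm 2: Chambolle-Pock for min_u J(u) + <v, u> over S."""
    xi = np.zeros(u.shape + (2,))
    u_bar = u.copy()
    for _ in range(n2):                                     # tau*sigma*8 = 0.8 < 1
        xi = proj_zach(xi + sigma * grad(u_bar))            # line 6
        u_new = proj_simplex(u - tau * (v - div(xi)))       # line 7, with L* xi = -div xi
        u_bar = u_new + theta * (u_new - u)                 # line 8
        u = u_new
    return u

# Toy run: a Chan-Vese-like linear data term stands in for the linearization (3.5).
rng = np.random.default_rng(0)
f = np.zeros((64, 64)); f[16:48, 16:48] = 1.0
f += 0.3 * rng.standard_normal(f.shape)
lam = 2.0
v = lam * np.stack([(f - 0.0) ** 2, (f - 1.0) ** 2], axis=2)   # k = 2 channels
u = np.full((64, 64, 2), 0.5)
u = inner_loop(u, v)
labels = u.argmax(axis=2)      # first-max rounding of the relaxed solution
```

The outer loop of Algorithm 2 would then recompute v from the rounded/relaxed partition via (3.5) and rerun the inner loop; only the inner solver is sketched here.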
Algorithm 3. Stochastic primal-dual DC procedure
1: Choose u^0 ∈ dom g, ξ^0 ∈ K, and θ ∈ [0, 1].
2: Choose τ, σ > 0 such that τσ||L||^2 < 1.
3: while not converged do
4:   Extract pixels using random sampling.
5:   v^{n_1} ∈ ∂h(u^{n_1}).
6:   for n_2 = 1 : N_2 do
7:     ξ^{n_2+1} = Prox_{σg*}(ξ^{n_2} + σLū^{n_2}).
8:     u^{n_2+1} = Prox_{τq}(u^{n_2} − τ(v^{n_1} − L*ξ^{n_2+1})).
9:     ū^{n_2+1} = u^{n_2+1} + θ(u^{n_2+1} − u^{n_2}).
10:  end for
11:  u^{n_1} = u^{n_2+1}.
12: end while

As we have seen so far, the primal-dual DC procedure computes a minimizing sequence that provably converges to a critical point. Other than that, there is no further optimality guarantee—recall that the procedure is based solely on first-order methods. Needless to say, the minimization problem (2.16) is nonconvex, and the procedure is prone to getting stuck in bad local minima—visually and energetically. Hence, we make some modifications to the primal-dual DC procedure in such a way that it guides the path of a minimizing sequence {u^t}_{t≥1} from an arbitrary initial point towards good local minima. In what follows, we suggest two viable strategies.

3.2.1. Tuning the number of inner iterations. We fix the number of inner iterations of the primal-dual DC procedure at a certain number rather than let the procedure compute the optimal solution of each convex subproblem. In doing so, we guide the minimizing sequence not to get easily stuck in bad local minima; see the for-loop in the sixth line of Algorithm 3 compared to the corresponding while-loop in Algorithm 2, where N_2 is the number of inner iterations. Indeed, the limit of a minimizing sequence that consists of optimal solutions of the consecutive convex subproblems is statistically likely to end up far from near-optimal. The reason is that optimal solutions of the first few convex subproblems, i.e., initial partitions at an early stage, would make the kernel density estimation (3.5) prematurely overfitted; we will demonstrate and analyze this experimentally in more detail in section 4.3.

3.2.2. Incorporating random sampling. Another way to keep the minimizing sequence from getting stuck in bad local minima is to introduce some perturbation during the optimization process. In so doing, we would make the sequence escape from undesirable points toward a good local minimum—preferably a global minimum—under the assumption that such undesirable points are unstable with respect to a perturbation. To this end, we incorporate the notion of stochastic optimization into the primal-dual DC procedure in terms of random sampling, where such a perturbation is implemented by means of random sampling for kernel density estimation. To be specific, at every outer loop (or for the first few) of the procedure in Algorithm 2, a subset of R_ℓ is randomly extracted and p̂_ℓ(ϕ) is computed based on that set for ℓ = 1, ..., k. A simple implementation of the random sampling is to use the following modules in order: a random number generator between 0 and 1, the orthogonal projector onto the standard k-simplex, and the first-max rounding scheme—see section 4.5.3 in [39], for example.
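As an illustration of the sampling step in Algorithm 3, the following fragment keeps a random fraction rho of the pixels when a region's density is estimated. It is a minimal sketch under our own naming (rho, mask, and the helper subgradientKDE from the sketch above are hypothetical), not the authors' code:

rho  = 0.01;                   % e.g., use 1 percent of the pixels
mask = rand(size(phi)) < rho;  % Bernoulli sampling of the image domain
uS   = u.*mask;                % restrict the region indicator to samples
% density estimation (and hence (3.5)) then uses only the sampled pixels:
[v, p] = subgradientKDE(phi, uS, H, nBins);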
Indeed, stochastic optimization plays a significant role in preventing a minimizing sequence from converging to either a stationary point or a bad local minimum, which will be demonstrated and analyzed in more detail in the following section. Having taken the two strategies into consideration, we describe what we call the stochastic primal-dual DC procedure in Algorithm 3. Note that the two strategies can be employed just for the first few iterations to make sure the minimizing sequence converges.

4. Experiments. In sections 4.1 and 4.2, we demonstrate the capacity of our approach by showing fully unsupervised multiregion image partitions and compare the proposed method with our preliminary work [28] for figure-ground segmentation. We refer the readers to [28] for other compelling examples such as segmenting invisible texts from the background—see Figure 6 therein. Compared to the work of Kim et al. [30], where the energy functional is formulated by means of active contours and optimized by using curve evolution, we achieve a much better algorithm that finds near-optimal solutions. The primary reason is that the update of the segmentation map in [30] happens only in the vicinity of the current position of the active contour, which limits the possible directions of updates, whereas the formulation presented here removes this limitation entirely and makes it possible to update the segmentation map in arbitrary directions, enlarging the search space enormously. Clearly, the proposed algorithm outperforms the optimization method in [30]. Referring the readers to Figure 1 in [28] for a comparison, we instead demonstrate in section 4.3 the effectiveness of the two strategies presented in the previous sections (sections 3.2.1 and 3.2.2). We apply the algorithm to fully unsupervised binary segmentation and analyze how these two factors come into play during optimization. Note that for all experiments reported in the paper, we set g = 1, i.e., isotropic total variation without weighting.

4.1. Unsupervised multiregion image partitioning. We compute information-theoretic minimal partitions on a few images taken from the Berkeley Segmentation Dataset (BSD500), as shown in Figure 4, where the image size is 321 × 481, by means of the proposed algorithm (Algorithm 3 without random sampling), setting H = 0.08, λ = 0.1, σ = 0.5, τ = 0.25, and N_2 = 200. The cameraman image experiment shown in Figure 1 is also from Algorithm 3 with random initialization; see Appendix B for implementation details. As a termination criterion, we use the norm of the difference between the current and the previous relaxed solution, i.e., if the norm of this difference divided by the entire number of pixels falls below a certain threshold, we regard the procedure as converged; we use ||u^t − u^{t−1}||_2 / |Ω| < 10^{−8} (a one-line code equivalent is given at the end of this paragraph). The runtime of the primal-dual nested loop (resp., the outer loop) is approximately 0.075 (resp., 0.57), 0.139 (resp., 0.86), 0.203 (resp., 1.15), 0.295 (resp., 1.43), and 0.383 (resp., 1.73) seconds per iteration when k = 2, ..., 6, respectively, with unoptimized MATLAB code.
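For concreteness, the convergence check above amounts to the following fragment; uPrev, u, and nPixels are hypothetical variable names of ours for the previous iterate, the current iterate, and |Ω|:

tol = 1e-8;
converged = norm(u(:) - uPrev(:), 2)/nPixels < tol;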
Here, we report the total numbers of iterations required for convergence: for the zebra image (left, top), 1328 and 1906 for 2 and 3 regions, respectively; for the leopard image (right, top), 1580 and 1900 for 2 and 3 regions, respectively; for the fish image (left, second row), 1321 and 2912 for 2 and 3 regions, respectively; for the elephant image (right, second row), 1571 and 1514 for 2 and 3 regions, respectively; for the horse image (third row), 1287, 1932, 2608, 5656, and 4001 for 2 to 6 regions, respectively; and for the mountain image (bottom row), 1186, 1523, 1610, 4355, and 3977 for 2 to 6 regions, respectively. The algorithm runs on an Intel Core i5 CPU computer with 2.67 GHz and 8 GB of memory. We also compute the intensity histograms that correspond to each partition as shown in the same figure, where one can see that the unknown latent histograms are well separated by minimizing the negative JS divergence with spatial regularity. Note that the parameters can be tuned subjectively if prior information about the images to be segmented is available—semisupervised partitioning.

Figure 4. Information-theoretic minimal partitions computed by the stochastic primal-dual DC procedure. For each input image, the computed partitions into 2 to 6 regions are shown together with the corresponding intensity histograms.

4.2. Comparison. We apply the proposed algorithm (Algorithm 3) to the zebra image shown in Figure 5 for binary segmentation in order to compare with our preliminary work [28], where the MM algorithm with a family of quadratic upper bounds is employed. Note that the major difference between the current work and the work of Kee et al. [28] from an MM perspective is the tightness of the convex upper bounds: the linear (3.6) and quadratic [28] bounds. Based on the parameter settings in [28], we set N_2 = 1, H = 0.08, λ = 0.05, σ = 0.5, and τ = 0.25 for both cases; in addition, the random sampling procedure in Algorithm 3 is not executed. As shown in Figure 5, the primal energy of the quadratic bound (red) decreases in much the same way as that of the linear bound (blue), and the quality of segmentation is almost the same when terminated. However, the linear majorizer drastically improves the runtime of the algorithm (on average 0.26 seconds per iteration) compared to the quadratic one (0.64 seconds per iteration).

Figure 5. Comparison with the work of Kee et al. [28] for binary segmentation. Panels: input, primal energy, and runtime per iteration over 1400 iterations for Kee et al. [28] and the proposed method.
The total number of iterations taken to converge was 1500 for both cases. We also observed that the convergence speed with and without random sampling (50 percent of the entire set of pixels) for the linear bound (3.5) remains the same for this zebra image.

4.3. Effects of the number of inner iterations and random sampling.

4.3.1. Nonstochastic primal-dual DC programming. We compute information-theoretic minimal partitions (binary segmentation) of a cameraman image as shown in Figure 6 with the nonstochastic primal-dual DC procedure, i.e., the procedure makes use of the entire set of pixels for density estimation (Algorithm 2 with a fixed number of iterations, or Algorithm 3 without random sampling). Setting H = 0.1 and λ = 0.05, we run the procedure 500 times with random initialization for each different number of inner iterations, 1, 5, 10, 50, 100, 200, 300, 400, and 500, respectively, i.e., 4500 times in total. As a termination criterion for the nonstochastic procedure, we use the same criterion described in section 4.1. Throughout the experiments, we observe that the procedure interestingly ends up with, roughly speaking, one of four partition classes—which we call P1, P2, P3, and P4, respectively—as shown in the top row of Figure 6. The associated distributions are also computed and shown in the second row of the same figure. Here, the solid line is from the region labeled as black (0), and the dashed line is from the region labeled as white (1). Compared to the great deal of visual difference that is clearly seen between the classes, the amount of difference within each partition class is negligibly small. Indeed, their variations are rarely visible even along the relaxed boundaries, as shown in the third row from the top of Figure 6, where each pixel value is supposed to represent a pixelwise standard deviation (SD), but the images are displayed in such a way that the minimum value (SD) in each image is black and the maximum value (SD) is white. From the P1 class to P4, the maximum (minimum) SDs are 0.0426 (0), 0.0685 (0), 0.0027 (0), and 0.036 (0), respectively.

Figure 6. Nonstochastic primal-dual DC programming for a cameraman image—see section 4.3.1 for details. The figure shows the input, the initialization, the four partition classes P1–P4 with their associated distributions and pixelwise SD maps, and the following table of class counts over 500 runs per setting:

# inner iterations   #(class P1)   #(class P2)   #(class P3)   #(class P4)
1                    490           10            0             0
5                    490           10            0             0
10                   482           18            0             0
50                   388           112           0             0
100                  263           235           0             2
200                  24            474           2             0
300                  10            488           1             1
400                  10            485           2             3
500                  7             492           0             1

maximum SD (minimum SD): 0.0426 (0), 0.0685 (0), 0.0027 (0), and 0.036 (0) for classes P1–P4;
termination criterion: ||u^t − u^{t−1}||_2 / |Ω| < 1 × 10^{−8}.

More profoundly, we observe from the table in Figure 6
that the number of partitions (binary segmentation results) falling into the P1 class increases monotonically as the number of inner iterations of the procedure decreases from 500 to 1. To figure out what is happening, we need to classify such partition classes in terms of not only their appearance but also their energy, which leads us to use the term “energy band” to consider appearance and energy together, as shown in Table 2.

Table 2. The highest and lowest energy values of each energy band (class).

class/energy band   P1          P2          P3          P4
lowest energy       −962.683    −922.233    −237.273    −233.898
highest energy      −933.246    −275.763    −237.212    −233.307

Not surprisingly, partitions in the P1 energy band have the lowest energy among all, whereas those in the P4 band have the highest energy. Indeed, this matches our expectations to a considerable extent. Nonetheless, it is by no means certain that the lowest energy partition in the P1 class is a global minimizer of the problem, in that neither is the objective convex nor is the procedure a global search algorithm; rather, the best explanation would be that such a partition is merely a local minimizer that has the lowest energy among the solutions observed throughout the massive amount of experiments. Hence, the question is how to escape from such a band, namely, P1, toward another one, if any, which is located below it—preferably near-optimal. Before we turn our attention to this question in the following section, we make another observation: We compute energy histograms of the computed partitions according to the number of inner iterations, as shown in Figure 7, where the number of inner iterations is written right at the bottom of each histogram. As can be seen, the energy becomes concentrated as the number of inner iterations decreases. Note that the “bandgap” between P1 and P2 is 11.013, between P2 and P3 is 38.49, and between P3 and P4 is 3.314.

Figure 7. Energy histograms for the partition classes P1, P2, P3, and P4, one histogram per number of inner iterations (5, 10, 50, 100, 200, 300, 400, and 500). The x-axis (energy) limits are set to range from −1000 to −200, and the y-axis (frequency) limits are set to range from 0 to 500 for the first six histograms from the top; for the rest (bottom) they are set to range from 0 to 50. Note that when the number of inner iterations is 1, its histogram is not distinguishable from that when the number of inner iterations is 5 at this scale, so we do not show it here.

The following sums up this section.

Observation 4.1. The fewer inner iterations the nonstochastic primal-dual DC procedure is set to run with, the more likely it is to generate low energy partitions, which may hint that too many inner iterations lead to bad local minima or bad critical points.

4.3.2. Stochastic primal-dual DC programming. Apart from the number of inner iterations, there is another tuning parameter—the number of pixels used for density estimation. Again, setting H = 0.1 and λ = 0.05 to make sure we optimize the same energy, we run the procedure this time using a subset of the whole set of pixels, which is randomly extracted from the entire domain Ω. For the moment, suppose that the procedure is set to use 1 percent of the total number of pixels at each outer iteration, and we set the number of inner iterations to 100. Then, one can see that sequences of computed partitions are not likely to converge, in contrast to the previous case where the procedure makes use of all the available pixels. In other words, the computed partitions that are consecutively generated by the procedure keep changing iteration by iteration even after a sufficiently large number of outer iterations. This is because the current distribution estimates are not likely to be the same as the previous ones, which are estimated based on statistically different sets of pixels, meaning
the energy landscape of the current objective is slightly different from the previous one. Yet, such sequences of partitions change drastically neither geometrically nor energetically after some outer iterations, since a great number of pixel values are likely to be similar by then. In this regard, the termination criterion we used in the previous section needs to be loosened; otherwise the procedure may not stop. To this end, we use

(4.1)    tol(N, ε) := (1/N) Σ_{τ=1}^{N} ||u^{t−(τ−1)} − u^{t−τ}||_2 / (ε|Ω|) < 1.

Here, we set N = 300 and ε = 5 × 10^{−7} for this particular case—when the number of inner iterations is 100. (A minimal code sketch of this check is given after Figure 8.) With this stopping criterion (4.1), we now observe that the procedure is capable of computing a lower energy partition, as shown in Figure 8. Note that this result was not seen in section 4.3.1, where all the pixels were used. In what follows, we elaborate on our new finding.

Figure 8. With a subset of the entire set of pixels, the primal-dual DC procedure is capable of computing a lower energy partition (E = −970). Top row: energy trajectories over the iterations. Bottom row: the initialization and four further partitions with E = −713, −962, −967, and −970. See section 4.3.2 for details.
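The following MATLAB fragment is our sketch of the relaxed stopping rule (4.1); the function name tolCheck and the representation of the iterate history as a cell array are our own assumptions, not the authors' implementation:

function stop = tolCheck(uHist, N, epsVal, nPixels)
% uHist   = cell array holding the last N+1 iterates u^{t-N}, ..., u^t
% N       = window length (e.g., 300)
% epsVal  = tolerance epsilon (e.g., 5e-7)
% nPixels = |Omega|, the number of pixels
s = 0;
for tau = 1:N
    d = uHist{end-tau+1}(:) - uHist{end-tau}(:);
    s = s + norm(d, 2)/(epsVal*nPixels);
end
stop = (s/N) < 1;   % stopping rule (4.1)
end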
Suppose that we have such a finite sequence of partitions {u^t}_{t=1}^{6000} generated by the primal-dual DC procedure that makes use of 1 percent of the total number of pixels at each outer iteration for density estimation. Since E_rel : dom J** → (−∞, +∞], we can draw an energy trajectory parameterized by the iteration number on the energy landscape in dom J** × (−∞, +∞], as shown in the top row of Figure 8, where the x-axis (iterations) limits are set to range from 200 to 6000, and the y-axis (E_rel(u^t)) limits are set to range from −950 to −550 for the left plot and from −975 to −940 for the rest. Note that the overall energy E_rel(u^t) is evaluated at each iteration using kernel density estimation over the entire set of pixels of the input image, even though only a fraction of them is actually used during optimization. What we can see from the trajectory is, roughly, that E_rel(u^t) decreases rapidly until the 700th iteration; from that point to the 1400th iteration, u^t seems to wander around a sort of plateau, struggling to decrease the energy toward a global minimum. At the 1400th iteration, the procedure seems to have found a way to do so; u^t then escapes from this high plateau and settles down on a low plateau from the 1450th iteration on. The bottom row of Figure 8 shows five different partitions in energetically descending order, where the third partition from the left (E_rel = −962) is a partition that is typically seen on the high plateau and the fifth partition (E_rel = −970) is one seen on the low plateau. Again, u^t wanders around this low plateau seeking another way out into a possibly lower one, constantly—yet slightly—changing its geometry and energy. However, no such region is found before the procedure terminates (6000 iterations), as shown on the right in the top row of Figure 8. Indeed, during the experiment, we observe that once a sequence of partitions escapes from the high plateau and wanders around on the low plateau, it rarely jumps out of place to go back to the high plateau—and even if it does, it immediately returns to the low plateau, as seen on the right (around the 2700th iteration) in the top row of Figure 8. Now, we clearly have another partition class, which we call P0, whose energy seems to be distributed around −970; more importantly, such a class is not observable when the procedure makes use of the total number of pixels, namely, the nonstochastic primal-dual DC procedure. In order to figure out in what circumstances such partitions can be obtained, we run the procedure with different sizes of pixel sets randomly extracted from the entire domain Ω and also with different numbers of inner iterations, as shown in Table 3, from which the energy band of the partition class P0 is specified, as shown in Table 4. Clearly, the obtained partition class P0 is located below the P1 energy band and slightly overlaps it. This is mainly because the termination criteria were not quite tight enough—the stochastic procedure would not stop otherwise. Then, the following observation is now quite straightforward.

Observation 4.2. The fewer inner iterations and the fewer pixels—randomly extracted from the entire domain Ω—the primal-dual DC procedure makes use of, the more likely it is, statistically, to generate lower energy partitions.

Then again, it is by no means certain that such partitions in P0 are global minimizers, since the objective is nonconvex.
Even so, nonconvexity notwithstanding, we are quite inclined to believe that such partitions are much more than “mere local minimizers (or possibly critical points),” considering the recent findings in statistical physics regarding the nature of large-scale nonconvex energy landscapes and the nature of stochastic optimization. The gist of the findings is as follows: In high-dimensional energy landscapes, the number of saddle points grows exponentially, but the number of local minima does not; most importantly, such local minima are statistically likely to lie near the global minima in such circumstances. In this light, as long as numerical algorithms guarantee that computed solutions are local minimizers, we may be able to say that they are near-optimal solutions. Indeed, the assertion can be confirmed (somewhat arguably)—see [8, 23, 22], for example—in conjunction with stochastic gradient descent. Although the models therein are quite different from what we consider in this paper, we have already provided empirical evidence from the massive amount of experiments shown in Table 3. See also [18, 15], for example, and the references therein.

Now it seems quite clear how such stochastic moves make it possible for a sequence of partitions to escape from the high plateau with the smaller number of pixels. However, it is still unclear why quite a number of partitions still end up falling into the class P2—sometimes as frequently as the number of partitions in the class P0. To put this into perspective, we consider another sequence of partitions {u^t}_{t=1}^{6000} generated by the same procedure, but this time it ends up falling into the class P2, as shown in Figure 9. Undoubtedly, the region where the sequence is wandering around—approximately from the 500th iteration—is another plateau, and the stochastic moves in this case are not likely to make the sequence escape from the place once it gets trapped there. Indeed, we observed that none of the sequences that fall into this “cursed” plateau gets out of the area toward a “blessed” plateau, in contrast to what happens on what we call the high plateau with the smaller number of pixels. The reason is possibly that such a plateau is surrounded by high hills like a basin, so that even the stochastic DC procedure is very likely to end up generating a sequence of partitions inside the plateau at each outer iteration. For future work, a rigorous analysis should take place to investigate this phenomenon in nonconvex optimization.
Table 3. Comparison between nonstochastic and stochastic primal-dual DC programming for computing information-theoretic minimal partitions (when k = 2) of the cameraman image. As for the termination criteria for each case, setting N = 300, we fixed ε at 1.0 × 10^{−6} when % samples = 0.5; at 1.5 × 10^{−6} (# inner iters. = 10), 1.0 × 10^{−6} (# inner iters. = 50), and 5.0 × 10^{−7} for the rest, when % samples = 1, 2, and 3; at 1.0 × 10^{−6} (# inner iters. = 5) and 5.0 × 10^{−7} for the rest, when % samples = 5; and at 5.0 × 10^{−7} when % samples = 10.

% samples   # inner iters.   |class P0|   |class P1|   |class P2|   |class P3|   |class P4|
0.5         50               357          1            142          0            0
0.5         100              222          10           268          0            0
0.5         200              159          20           320          1            0
1           10               402          53           45           0            0
1           50               292          52           156          0            0
1           100              281          15           204          0            0
1           200              141          23           332          1            3
1           300              35           14           447          4            0
1           400              31           13           452          2            2
1           500              19           11           465          5            0
2           10               0            455          45           0            0
2           50               71           275          154          0            0
2           100              104          145          248          1            2
2           200              53           88           357          0            2
2           300              17           34           444          2            3
2           400              21           31           442          3            3
2           500              7            13           474          5            1
3           10               0            465          35           0            0
3           50               3            362          135          0            0
3           100              26           246          225          0            3
3           200              29           104          363          0            4
3           300              4            33           461          0            2
3           400              7            51           433          2            7
3           500              6            16           476          2            0
5           5                0            480          20           0            0
5           10               0            476          24           0            0
5           50               0            381          119          0            0
5           100              2            277          221          0            0
5           200              0            104          389          3            4
5           300              1            30           463          2            4
5           400              2            35           458          3            2
5           500              0            9            490          0            1
10          5                0            485          15           0            0
10          10               0            484          16           0            0
10          50               0            373          126          0            1
10          100              0            269          228          0            3

Table 4. The highest and lowest energy values of the P0 partition class.

class/energy band   lowest energy   highest energy
P0                  −975.246        −969.479

Figure 9. With a subset of all the pixels, the stochastic primal-dual DC procedure may end up with a bad local minimum (E = −924). Top row: energy trajectories over the iterations. Bottom row: the initialization and partitions with E = −646, −810, −913, and −924. See section 4.3.2 for details.

5. Conclusion. We have considered what we call the information-theoretic minimal partitioning problem, providing a rigorous mathematical treatment, and we have proposed a sequential convex programming approach for large-scale optimization problems involving DC functions in which a linear operator appears. We have demonstrated the capacity of our approach in numerous experiments—also referring the reader to our preliminary conference paper [28]—that show convincing fully unsupervised image partitioning in terms of both segmentation quality and robustness to initialization. In addition, we have thoroughly examined the behavior of the proposed sequential convex programming for nonconvex nonsmooth optimization. Although the proposed approach does not guarantee global optimality of the computed solutions, we have experimentally demonstrated that the stochastic primal-dual DC procedure is capable of finding good local minima, which we believe are near-optimal in the context of the recent findings in statistical physics. Indeed, this is one of the significant contributions of the paper to energy minimization methods in computer vision, especially continuous optimization. Therefore, we believe that the framework described here opens up new possibilities for tackling large-scale nonconvex objectives not only in computer vision but also in machine learning.

Appendix A. Proof of the claim in Theorem 2.3. We prove the following claim: For all ε > 0, there exists δ > 0 such that

||u_ℓ − u_ℓ^⋆||_{L^1(Ω;{0,1})} < δ  ⟹  ||p̂_{Φ(X)|u_ℓ(X)=1} − p̂_{Φ(X)|u_ℓ^⋆(X)=1}||_{TV} < ε,

where || · ||_{TV} is the total variation distance (variational distance) of probability measures. To this end, we make the following observations.
Observation A.1.

∫_{R^n} |∫_Ω K_H(ϕ − φ(x)) u_ℓ(x) dx − ∫_Ω K_H(ϕ − φ(x)) u_ℓ^⋆(x) dx| dϕ
    ≤ ∫_{R^n} ∫_Ω |K_H(ϕ − φ(x))(u_ℓ(x) − u_ℓ^⋆(x))| dx dϕ ≤ ∫_Ω |u_ℓ(x) − u_ℓ^⋆(x)| dx = ||u_ℓ − u_ℓ^⋆||_{L^1}.

Observation A.2.

∫_{R^n} ∫_Ω K_H(ϕ − φ(x)) u_ℓ(x) dx dϕ = ∫_Ω u_ℓ(x) dx.

Observation A.3.

|∫_Ω u_ℓ(x) dx − ∫_Ω u_ℓ^⋆(x) dx| ≤ ∫_Ω |u_ℓ(x) − u_ℓ^⋆(x)| dx = ||u_ℓ − u_ℓ^⋆||_{L^1}.

From the above observations, we prove the claim as follows. Writing A(ϕ) := ∫_Ω K_H(ϕ − φ(x)) u_ℓ(x) dx and A^⋆(ϕ) := ∫_Ω K_H(ϕ − φ(x)) u_ℓ^⋆(x) dx,

||p̂_{Φ(X)|u_ℓ(X)=1} − p̂_{Φ(X)|u_ℓ^⋆(X)=1}||_{TV}
= (1/2) ∫_{R^n} | A(ϕ)/∫_Ω u_ℓ(x) dx − A^⋆(ϕ)/∫_Ω u_ℓ^⋆(x) dx | dϕ
= ∫_{R^n} | A(ϕ) ∫_Ω u_ℓ^⋆(x) dx − A^⋆(ϕ) ∫_Ω u_ℓ(x) dx | / (2 ∫_Ω u_ℓ(x) dx ∫_Ω u_ℓ^⋆(x) dx) dϕ
≤ ∫_{R^n} | A(ϕ) ∫_Ω u_ℓ^⋆(x) dx − A^⋆(ϕ) ∫_Ω u_ℓ^⋆(x) dx | / (2 ∫_Ω u_ℓ(x) dx ∫_Ω u_ℓ^⋆(x) dx) dϕ
    + ∫_{R^n} | A^⋆(ϕ) ∫_Ω u_ℓ^⋆(x) dx − A^⋆(ϕ) ∫_Ω u_ℓ(x) dx | / (2 ∫_Ω u_ℓ(x) dx ∫_Ω u_ℓ^⋆(x) dx) dϕ
= ∫_{R^n} | A(ϕ) − A^⋆(ϕ) | / (2 ∫_Ω u_ℓ(x) dx) dϕ
    + ∫_{R^n} A^⋆(ϕ) | ∫_Ω u_ℓ^⋆(x) dx − ∫_Ω u_ℓ(x) dx | / (2 ∫_Ω u_ℓ(x) dx ∫_Ω u_ℓ^⋆(x) dx) dϕ
≤ ||u_ℓ − u_ℓ^⋆||_{L^1(Ω;{0,1})} / ∫_Ω u_ℓ(x) dx ≤ 2 ||u_ℓ − u_ℓ^⋆||_{L^1(Ω;{0,1})} / ∫_Ω u_ℓ(x) dx < ε,

where the first inequality is the triangle inequality, and the remaining estimates follow from the observations. Clearly, the fact that ||u_ℓ − u_ℓ^⋆||_{L^1(Ω;{0,1})} < (ε/2) ∫_Ω u_ℓ(x) dx implies ||p̂_{Φ(X)|u_ℓ(X)=1} − p̂_{Φ(X)|u_ℓ^⋆(X)=1}||_{TV} < ε. Hence, for all ε > 0, δ := (ε/2) ∫_Ω u_ℓ(x) dx is the number fulfilling the claim.

Appendix B. A MATLAB implementation for random initialization. We provide a simple MATLAB program for random initialization as follows:

function [u] = randInitial(M,N,nL,rounding)
% M = # rows in an input image
% N = # columns in an input image
% nL = # labels
% rounding option: either 0 or 1
tmp  = rand(M,N,nL);
tmp_ = repmat(sum(tmp,3), [1 1 nL]);
u    = tmp./tmp_;                 % normalize so the labels sum to 1 pixelwise
if rounding == 1
    % first-max rounding
    [u_max,idx] = max(u,[],3);
    v = zeros(M,N,nL);
    for k = 1 : nL
        tmp = zeros(M,N);
        tmp(idx == k) = 1;
        v(:,:,k) = tmp;
    end
    u = v;
end
end
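For instance, the following call (with sizes chosen to match the BSD500 images of section 4.1) produces a random relaxed labeling and its first-max rounding; this usage note is ours:

u_relaxed = randInitial(321, 481, 3, 0);   % relaxed labels on the simplex
u_rounded = randInitial(321, 481, 3, 1);   % hard labels via first-max rounding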
Acknowledgments. The authors thank the anonymous reviewers for their insightful comments and the Cornell MRI Research Lab for providing computational resources in revising the manuscript.

REFERENCES

[1] I. Ahmad and P.-E. Lin, A nonparametric estimation of the entropy for absolutely continuous distributions (corresp.), IEEE Trans. Inform. Theory, 22 (1976), pp. 372–375, https://doi.org/10.1109/TIT.1976.1055550.
[2] G. Alberti, G. Bouchitté, and G. Dal Maso, The calibration method for the Mumford-Shah functional and free-discontinuity problems, Calc. Var. Partial Differential Equations, 16 (2003), pp. 299–333.
[3] L. Ambrosio, N. Fusco, and D. Pallara, Functions of Bounded Variation and Free Discontinuity Problems, Clarendon Press, Oxford, UK, 2000.
[4] L. T. H. An and P. D. Tao, The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems, Ann. Oper. Res., 133 (2005), pp. 23–46.
[5] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, Springer, New York, 2011.
[6] J. Beirlant, E. J. Dudewicz, L. Györfi, and E. C. van der Meulen, Nonparametric entropy estimation: An overview, Int. J. Math. Stat. Sci., 6 (1997), pp. 17–40.
[7] Y. Boykov, O. Veksler, and R. Zabih, Fast approximate energy minimization via graph cuts, IEEE Trans. Pattern Anal. Machine Intelligence, 23 (2001), pp. 1222–1239.
[8] A. J. Bray and D. S. Dean, Statistics of critical points of Gaussian fields on large-dimensional spaces, Phys. Rev. Lett., 98 (2007), 150201.
[9] E. S. Brown, T. F. Chan, and X. Bresson, Completely convex formulation of the Chan-Vese image segmentation model, Int. J. Comput. Vis., 98 (2012), pp. 103–121.
[10] V. Caselles, R. Kimmel, and G. Sapiro, Geodesic active contours, Int. J. Comput. Vis., 22 (1997), pp. 61–79.
[11] A. Chambolle, D. Cremers, and T. Pock, A convex approach to minimal partitions, SIAM J. Imaging Sci., 5 (2012), pp. 1113–1158.
[12] A. Chambolle and T. Pock, A first-order primal-dual algorithm for convex problems with applications to imaging, J. Math. Imaging Vis., 40 (2011), pp. 120–145, https://doi.org/10.1007/s10851-010-0251-1.
[13] T. F. Chan, S. Esedoglu, and M. Nikolova, Algorithms for finding global minimizers of image segmentation and denoising models, SIAM J. Appl. Math., 66 (2006), pp. 1632–1648.
[14] T. F. Chan, L. Vese, et al., Active contours without edges, IEEE Trans. Image Process., 10 (2001), pp. 266–277.
[15] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun, The Loss Surface of Multilayer Networks, arXiv:1412.0233, 2014.
[16] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, 2006.
[17] D. Cremers, T. Pock, K. Kolev, and A. Chambolle, Convex relaxation techniques for segmentation, stereo and multiview reconstruction, in Markov Random Fields for Vision and Image Processing, MIT Press, Cambridge, MA, 2011.
[18] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, in Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA, 2014, pp. 2933–2941.
[19] T. P. Dinh and H. A. Le Thi, Recent advances in DC programming and DCA, in Transactions on Computational Intelligence XIII, Springer, New York, 2014, pp. 1–37.
[20] Y. G. Dmitriev and F. P. Tarasenko, On the estimation of functionals of the probability density and its derivatives, Theory Probab. Appl., 18 (1974), pp. 628–633.
[21] H. Federer, Geometric Measure Theory, Springer, New York, 1969.
[22] Y. V. Fyodorov and C. Nadal, Critical behavior of the number of minima of a random landscape at the glass transition point and the Tracy-Widom distribution, Phys. Rev. Lett., 109 (2012), 167203.
[23] Y. V. Fyodorov and I. Williams, Replica symmetry breaking condition exposed by random matrix calculation of landscape complexity, J. Statist. Phys., 129 (2007), pp. 1081–1116.
[24] T. Goldstein and S. Osher, The split Bregman method for l1-regularized problems, SIAM J. Imaging Sci., 2 (2009), pp. 323–343.
[25] A. Herbulot, S. Jehan-Besson, S. Duffner, M. Barlaud, and G. Aubert, Segmentation of vectorial image features using shape gradients and information measures, J. Math. Imaging Vision, 25 (2006), pp. 365–386.
[26] R. Horst and N. V. Thoai, DC programming: Overview, J. Optim. Theory Appl., 103 (1999), pp. 1–43.
[27] D. R. Hunter and K. Lange, A tutorial on MM algorithms, Amer. Statist., 58 (2004), pp. 30–37.
[28] Y. Kee, M. Souiai, D. Cremers, and J. Kim, Sequential convex relaxation for mutual information-based unsupervised figure-ground segmentation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[29] J. Kim, J. W. Fisher III, A. Yezzi, M. Çetin, and A. S. Willsky, Nonparametric methods for image segmentation using information theory and curve evolution, in Proceedings of the IEEE Conference on Image Processing, Vol. 3, IEEE, New York, 2002, pp. 797–800.
[30] J. Kim, J. W. Fisher III, A. Yezzi, M. Çetin, and A. S. Willsky, A nonparametric statistical method for image segmentation using information theory and curve evolution, IEEE Trans. Image Process., 14 (2005), pp. 1486–1502.
[31] N. Komodakis and J.-C. Pesquet, Playing with Duality: An Overview of Recent Primal-Dual Approaches for Solving Large-Scale Optimization Problems, arXiv:1406.5429, 2014.
[32] G. R. Lanckriet and B. K. Sriperumbudur, On the convergence of the concave-convex procedure, in Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA, 2009, pp. 1759–1767.
[33] J. Lellmann and C. Schnörr, Continuous multiclass labeling approaches and algorithms, SIAM J. Imaging Sci., 4 (2011), pp. 1049–1096.
[34] J. Lin, Divergence measures based on the Shannon entropy, IEEE Trans. Inform. Theory, 37 (1991), pp. 145–151.
[35] O. Michailovich, Y. Rathi, and A. Tannenbaum, Image segmentation using active contours driven by the Bhattacharyya gradient flow, IEEE Trans. Image Process., 16 (2007), pp. 2787–2801.
[36] D. Mumford and J. Shah, Optimal approximations by piecewise smooth functions and associated variational problems, Comm. Pure Appl. Math., 42 (1989), pp. 577–685.
[37] K. P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, Cambridge, MA, 2012.
[38] K. Ni, X. Bresson, T. F. Chan, and S. Esedoglu, Local histogram based segmentation using the Wasserstein distance, Int. J. Comput. Vis., 84 (2009), pp. 97–111.
[39] C. Nieuwenhuis, E. Töppe, and D. Cremers, A survey and comparison of discrete and continuous multi-label optimization approaches for the Potts model, Int. J. Comput. Vis., 104 (2013), pp. 223–240.
[40] S. Osher and J. A. Sethian, Fronts propagating with curvature-dependent speed: Algorithms based on Hamilton-Jacobi formulations, J. Comput. Phys., 79 (1988), pp. 12–49, https://doi.org/10.1016/0021-9991(88)90002-2.
[41] E. Parzen, On estimation of a probability density function and mode, Ann. Math. Statist., 33 (1962), pp. 1065–1076.
[42] T. Pock, D. Cremers, H. Bischof, and A. Chambolle, An algorithm for minimizing the piecewise smooth Mumford-Shah functional, in Proceedings of ICCV, 2009.
[43] K. Punithakumar, J. Yuan, I. B. Ayed, S. Li, and Y. Boykov, A convex max-flow approach to distribution-based figure-ground separation, SIAM J. Imaging Sci., 5 (2012), pp. 1333–1354.
[44] J. Rabin and N. Papadakis, Convex color image segmentation with optimal transport distances, in Scale Space and Variational Methods in Computer Vision, Springer, New York, 2015, pp. 256–269.
[45] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren Math. Wiss. 317, Springer, Berlin, 1998.
[46] M. Rosenblatt, Remarks on some nonparametric estimates of a density function, Ann. Math. Statist., 27 (1956), pp. 832–837.
[47] J. S. Simonoff, Smoothing Methods in Statistics, Springer, New York, 1996.
[48] E. M. Stein and R. Shakarchi, Real Analysis: Measure Theory, Integration, and Hilbert Spaces, Princeton University Press, Princeton, NJ, 2009.
[49] P. Strandmark, F. Kahl, and N. C. Overgaard, Optimizing parametric total variation models, in Proceedings of the 12th International Conference on Computer Vision, IEEE, 2009, pp. 2240–2247.
[50] E. Strekalovskiy, A. Chambolle, and D. Cremers, Convex relaxation of vectorial problems with coupled regularization, SIAM J. Imaging Sci., 7 (2014), pp. 294–336.
[51] P. D. Tao and L. T. H. An, Convex analysis approach to DC programming: Theory, algorithms and applications, Acta Math. Vietnam., 22 (1997), pp. 289–355.
[52] P. D. Tao and L. T. H. An, A DC optimization algorithm for solving the trust-region subproblem, SIAM J. Optim., 8 (1998), pp. 476–505.
[53] L. A. Vese and T. F. Chan, A multiphase level set framework for image segmentation using the Mumford and Shah model, Int. J. Comput. Vis., 50 (2002), pp. 271–293.
[54] R. W. Yeung, Information Theory and Network Coding, Springer, New York, 2008.
[55] R. Yıldızoğlu, J.-F. Aujol, and N. Papadakis, A convex formulation for global histogram based binary segmentation, in Energy Minimization Methods in Computer Vision and Pattern Recognition, Springer, New York, 2013, pp. 335–349.
[56] A. L. Yuille and A. Rangarajan, The concave-convex procedure, Neural Comput., 15 (2003), pp. 915–936.
[57] C. Zach, D. Gallup, J.-M. Frahm, and M. Niethammer, Fast global labeling for real-time stereo using multiple plane sweeps, in Proceedings of VMV, 2008.
