© 2017 Society for Industrial and Applied Mathematics
SIAM J. IMAGING SCIENCES
Vol. 10, No. 4, pp. 1845–1877
Sequential Convex Programming for Computing Information-Theoretic Minimal
Partitions: Nonconvex Nonsmooth Optimization∗
Youngwook Kee†, Yegang Lee‡, Mohamed Souiai§, Daniel Cremers§, and Junmo Kim‡
Abstract. We consider an unsupervised image segmentation problem—from figure-ground separation to
multiregion partitioning—that consists of maximal distribution separation (in terms of mutual information) with spatial regularity (total variation regularization), which is what we call information-theoretic minimal partitioning. Adopting the bounded variation framework, we provide an in-depth
analysis of the problem which establishes theoretical foundations and investigates the structure of
the associated energy from a variational perspective. In doing so, we show that the objective exhibits
a form of difference of convex functionals, which leads us to a class of large-scale nonconvex optimization problems where convex optimization techniques can be successfully applied. In this regard, we
propose sequential convex programming based on the philosophy of stochastic optimization and the
Chambolle–Pock primal-dual algorithm. The key idea behind it is to construct a stochastic family of
convex approximations of the original nonconvex function and sequentially minimize the associated
subproblems. Indeed, its stochastic nature makes it possible to often escape from bad local minima
toward near-optimal solutions, where such optimality can be justified in terms of the recent findings in statistical physics regarding a striking characteristic of the high-dimensional landscapes. We
experimentally demonstrate such a favorable ability of the proposed algorithm as well as show the
capacity of our approach in numerous experiments. The preliminary conference paper can be found
in [Y. Kee, M. Souiai, D. Cremers, and J. Kim, Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2014].
Key words. information theory, total variation regularization, image partitioning, difference of convex functions, stochastic optimization
AMS subject classifications. 49M20, 49M29, 65K15, 68U10
DOI. 10.1137/16M1078653
1. Introduction. We present and analyze what we call the information-theoretic minimal
partitioning problem for fully unsupervised (multiregion) image segmentation. The problem
includes foreground-background segmentation, i.e., figure-ground separation, which can be
∗ Received by the editors June 6, 2016; accepted for publication (in revised form) April 15, 2017; published electronically October 26, 2017.
http://www.siam.org/journals/siims/10-4/M107865.html
Funding: This work was supported in part by the National Research Foundation of Korea within the Ministry of Science, ICT and Future Planning through the Korean Government under grant NRF-2017R1A2A2A05001400, in part by the ICT R&D program of MSIP/IITP 2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion, and in part by the ERC Starting grant “Convex Vision.”
† School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, South Korea, and Department of Radiology, Weill Cornell Medical College, New York, NY 10021 (youngwook.kee@kaist.ac.kr, yok2013@med.cornell.edu).
‡ School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, South Korea (askhow@kaist.ac.kr, junmo.kim@kaist.ac.kr).
§ Department of Computer Science, Technical University of Munich, 85748 Garching, Germany (souiai@tum.de, cremers@tum.de).
Figure 1. (Panels, left to right: Input, Initialization, Intermediate, Segmentation, Histograms.) From an observation (input), we jointly estimate the ground truth partition and the associated distributions, which are both unknown to us. The colors (red, green, and blue) of the estimated intensity distributions (histograms) correspond to those of the estimated partition (segmentation). Starting with a random initialization, we obtain the partition and its associated distributions by solving a family of convex subproblems of the information-theoretic minimal partitioning problem, which will be presented throughout the paper.
more interesting in computer vision and machine learning—here the problem boils down to
fully unsupervised binary image labeling. Yet, the task of figure-ground separation is rather
challenging because it is inherently a chicken-and-egg problem: Where is the object and what
distinguishes it from the background? A common assumption that lies behind the task is that
the object and background have different intensity (or color) distributions. Likewise, one can
extend the assumption to the task of multiregion image segmentation: Each region of the
partition has a distinct intensity (color) distribution. Again, if these distributions are completely unknown
and if they show considerable overlapping, then the joint estimation of color distributions and
segmentation becomes a major algorithmic challenge. In this paper, we attempt to tackle
such a challenging unsupervised (binary and multiregion) image segmentation problem from
a variational perspective where the above assumption plays a pivotal role.
Let φ : (Ω ⊂ R²) → R be an observed intensity image function. The information-theoretic minimal partitioning problem solves the variational problem

(1.1)    minimize_{ {R_ℓ}_{ℓ=1}^k }   (1/2) Σ_{ℓ=1}^k Per(R_ℓ; Ω) − λ D(P_1, . . . , P_k)
         subject to   R_i ∩ R_j = ∅  ∀ i ≠ j   and   ⋃_{ℓ=1}^k R_ℓ = Ω,
where the first term of the objective is half the sum of the perimeters of the sets {R_ℓ}_{ℓ=1}^k, and the second term is a generalized statistical distance (possibly a divergence) defined on the space of probability distributions, weighted by a tuning parameter λ > 0, that measures the discrepancy between the distributions P_1, . . . , P_k associated with R_1, . . . , R_k, respectively. Notice that the associated distributions are unknown to us; rather, we estimate them from the subsets {φ(x) : x ∈ R_ℓ} for ℓ = 1, . . . , k. Then, an optimal solution of the problem (1.1) maximally separates the distributions while enforcing spatial regularity in terms of the total length of the region boundaries. As shown in Figure 1, it is a joint estimation problem—segmentation and the associated distributions—that we consider throughout the paper.
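To make (1.1) concrete, the following is a minimal discrete sketch of evaluating the energy of a candidate two-region partition on a grayscale image (our own illustration, not part of the original formulation): the perimeters are approximated by the anisotropic total variation of the binary masks, and D is taken to be the area-weighted Jensen–Shannon divergence between per-region intensity histograms, the statistical distance that section 2.2 identifies with the mutual information criterion.

# Hypothetical sketch (not from the paper): discrete evaluation of the energy in (1.1) for k = 2,
# approximating Per by the anisotropic total variation of binary masks and taking D to be the
# area-weighted Jensen-Shannon divergence between per-region intensity histograms.
import numpy as np

def perimeter(mask):
    # Sum of absolute forward differences of a {0,1} mask (anisotropic total variation).
    return np.abs(np.diff(mask, axis=0)).sum() + np.abs(np.diff(mask, axis=1)).sum()

def region_histogram(image, mask, bins=32):
    counts, _ = np.histogram(image[mask > 0], bins=bins, range=(0.0, 1.0))
    return (counts + 1e-12) / (counts.sum() + 1e-12 * bins)   # normalized, strictly positive

def js_divergence(p1, p2, w1, w2):
    q = w1 * p1 + w2 * p2
    kl = lambda p: np.sum(p * np.log2(p / q))
    return w1 * kl(p1) + w2 * kl(p2)

def energy(image, mask, lam=10.0):
    masks = [mask, 1 - mask]                                   # R_1 and R_2 = Omega \ R_1
    hists = [region_histogram(image, m) for m in masks]
    weights = [m.sum() / mask.size for m in masks]             # pi_l = |R_l| / |Omega|
    per = 0.5 * sum(perimeter(m) for m in masks)               # half the sum of perimeters
    return per - lam * js_divergence(hists[0], hists[1], *weights)

rng = np.random.default_rng(0)
img = np.clip(rng.normal(0.3, 0.1, (64, 64)), 0, 1)
img[:, 32:] = np.clip(rng.normal(0.7, 0.1, (64, 32)), 0, 1)
aligned = np.zeros((64, 64)); aligned[:, 32:] = 1              # matches the two distributions
random_mask = rng.integers(0, 2, (64, 64)).astype(float)
print(energy(img, aligned), energy(img, random_mask))          # aligned partition has lower energy

The regularization weight lam and the histogram binning are illustrative choices only; the paper estimates densities nonparametrically rather than with fixed-bin histograms, as developed in section 2.3.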
The aim of the variational formulation (1.1) matches well the assumption of unsupervised image segmentation that each region of the partition has a different intensity distribution.
Indeed, the formulation would better capture subtle differences between the distributions that
do not appear in lower-order statistics than other existing models—based on the Mumford–
Shah functional [36]—such as the Chan–Vese model [14] and its multiphase extension [53]. The
primary reason for this is that such existing models do not take into account any higher-order
statistics but the first-order moment. Needless to say, their convex formulations [13, 9, 11]
inherently have the same limitations. One may want to incorporate the second-order moment
into the models—see [49], for example—but obviously it does not address the fundamental
issue. On the other hand, it is not surprising that such limitations persist as long
as the associated distributions are estimated in a parametric way, because natural images
typically exhibit complex coloring so that such a parametric representation would not be able
to capture the differences. In this light, there has been a line of research to incorporate nonparametric statistical methods into the framework of unsupervised image segmentation with
active contour models. For example, Kim et al. [30] and Herbulot et al. [25] proposed curve
evolution equations driven by mutual information and conditional entropy, respectively, using
the Parzen–Rosenblatt window method [46, 41]. Similarly, Michailovich, Rathi, and Tannenbaum [35] derived the gradient flow of the Bhattacharyya coefficient for curve evolution using
nonparametric methods. Note that the active contour models discussed were all implemented
by means of the level-set methods [40].
At the same time, computing optimal solutions of the existing energy models in low-level
vision has been another line of research along with recent advances in large-scale algorithms for
convex optimization [31]. The basic idea behind this line of research—also famously known
as convex relaxation methods—is that a given hard-to-solve nonconvex optimization problem is approximated by a convex alternative that is relatively easy to solve in terms of computing
an optimal solution within a reasonable amount of time. Most importantly, an unknown
optimal solution of the original problem can be recovered from the optimal solution of the
convex one. In this regard, the seminal work of Chan, Esedoglu, and Nikolova [13] proposed
a convex formulation for the two-phase instance of the piecewise constant Mumford–Shah
functional (relaxation) and proved that an optimal solution of the original nonconvex geometric problem is exactly recovered from that of the convex formulation by simple thresholding
(thresholding theorem). However, the class of such problems whose optimal solutions are
computed by the above technique (relaxation and thresholding) is quite limited: Not only
should the objective itself be convex along with its relaxed constraints, but also the coarea
formula from geometric measure theory [21] should hold. Note that if the objective is convex but the coarea formula does not hold, a near-optimal solution can be computed with
a performance guarantee—per-instance bound, for example. See [57, 33, 11] for convex approaches for computing a near-optimal solution of the continuous Potts model. Also, see the
work of Pock et al. [42], where they find a near-optimal solution of the piecewise smooth
Mumford–Shah problem, which is based on the technique of Chan, Esedoglu, and Nikolova
[13] and the calibration approach of Alberti, Bouchitté, and Dal Maso [2]. Meanwhile, there
have been algorithmic advances in splitting methods for nonlinear and nonsmooth optimization in line with such convex relaxation techniques: A first-order primal-dual algorithm was
proposed by Chambolle and Pock [12] with applications to imaging, the split Bregman method was
proposed by Goldstein and Osher [24] for ℓ1-regularized problems, and a globally convergent
Douglas–Rachford scheme was proposed by Lellmann and Schnörr [33], to name a few. For
other low-level vision tasks and recent advances in this context, see [17, 39, 50] and references
therein.
Not surprisingly, the problem (1.1) we consider throughout the paper, where various types
of statistical distance measures may be involved, does not fall into this class, i.e., the
objective in (1.1) is nonconvex so that one needs to devise a new optimization algorithm to
efficiently compute a near-optimal solution. Indeed, the paper is dedicated to an analysis of the
mathematical structure of the problem and the development of such an algorithm. In this line
of research, we here mention some notable techniques recently proposed—yet quite different
from ours: Among others such as those making use of user inputs, e.g., scribbles or bounding
boxes, the work of Punithakumar et al. [43] proposed a bound optimization technique for the
Bhattacharyya coefficient between a given (a priori known) distribution and distributions of
the candidate figures to be segmented, which is clearly a form of supervised figure-ground separation. Another notable example regarding unsupervised image segmentation is
the work of Ni et al. [38]. In their work, an energy functional that consists of the Wasserstein distance between (local) intensity histograms from two disjoint regions (foreground and
background) and the total variation regularization was proposed. Most recently, a couple of
research papers [55, 44] addressed similar issues in the context of binary image segmentation,
but their performance largely depends on initialization (bounding boxes). Unfortunately, none
of these approaches addresses the problem of unsupervised multiregion image segmentation.
1.1. Outline and contributions. From a variational perspective, the paper is at the cutting edge of fully unsupervised image segmentation via what we call information-theoretic
minimal partitioning. The technical contributions of this work are in a nutshell threefold.
First, we establish theoretical foundations of the information-theoretic minimal partitioning
problem with a rigorous treatment (section 2). To this end, we revisit the notion of mutual
information proposed by Kim et al. [30] for the statistical distance measure in (1.1), which
turns out to be the generalized Jensen–Shannon (JS) divergence (sections 2.1 and 2.2). Then,
we introduce, step by step, an objective that combines the negative mutual information and
the total length of the region boundaries (the Potts prior); its minimization is what we call
throughout the paper the problem of information-theoretic minimal partitioning (section 2.3).
Furthermore, we reformulate the problem using the theory of functions of bounded variation
and convex relaxation techniques for the Potts prior so that we analyze its mathematical
structure from the perspective of numerical optimization: We show that the problem exhibits
a form of the difference of convex (DC) functionals (sections 2.3.1 and 2.3.2).
Second, we develop an optimization algorithm to compute a near-optimal solution of the
original nonconvex problem (section 3). To this end, we introduce a sequential convex optimization technique that employs the difference of convex functions algorithm (DCA) [51, 52]
and the Chambolle–Pock primal-dual splitting method [12]. Roughly speaking, the proposed
algorithm constructs a family of convex subproblems for the original nonconvex objective and
sequentially solves it in such a way that a minimizing sequence is guaranteed to converge
to a critical point of the original nonconvex function (section 3.2). Note that the proposed
procedure is based on the first-order methods and the problem we tackle is nonconvex so
that the minimizing sequence is likely to end up at a critical point (often a bad local minimum, both energetically and visually). Therefore, we further consider two strategies applied to the
procedure—sections 3.2.1 and 3.2.2—to let a minimizing sequence escape from such undesirable points toward good local minima. Indeed, the proposed algorithm can be successfully
applied to such a class of nonconvex and nonsmooth problems where the partitioning problem
we consider falls.
Last, we experimentally validate the capacity of our approach by showing convincing fully
unsupervised image partitions and we compare the proposed method with our previous conference work [28] for figure-ground separation. The proposed method has a couple of favorable
properties over [28]: partitions can be computed with a provably convergent algorithm, and it is substantially faster. Then, we consider the question of how to establish the
relationship between the unknown optimal solution of the original nonconvex problem and
a computed solution from the proposed algorithm. Unfortunately, we are unaware of any
performance guarantee such as per-instance bounds in our case because the convex subproblems are in fact upper bounds of the objective, albeit tight. Nevertheless, we believe that a
computed solution is statistically likely to be a near-optimal solution in light of recent findings in
statistical physics, which will be elaborated in section 4.3. We end the paper with a conclusion
in section 5, along with some open issues for future research.
2. Information-theoretic minimal partitions. We start off the section revisiting the notion
of mutual information proposed by Kim et al. [29, 30] in the context of image segmentation.
We view the notion from a measure-theoretic perspective and employ the bounded variation
framework in order to formulate what we call the information-theoretic minimal partitioning
problem and generalize it to any measurable space in any dimension; in addition, it provides a
new perspective on optimal labeling in computer vision in that a given image is a realization
of the unknown target random image. In doing so, we establish the existence of a minimizer
and the mathematical structure of the problem, which leads us to a certain class of large-scale
nonconvex optimization problems to which convex optimization techniques can be successfully
applied. The main contributions of this section can be found in Theorem 2.3, Proposition 2.4,
and Proposition 2.5. Furthermore, in section 2.2 we elaborate on the connection between the
mutual information criterion and the generalized JS divergence for those interested, which
would broaden the understanding of the mutual information criterion for image partitioning.
Notation and symbols used throughout the paper are listed in Table 1.
2.1. Mutual information revisited. Let Ω ⊂ R^d, d ≥ 2, be a bounded open set with Lipschitz boundary, and let (A, 𝒜, P_ℓ) be probability spaces for ℓ = 1, . . . , k. Suppose that {Ω_ℓ}_{ℓ=1}^k is a partition of Ω such that Ω_i ∩ Ω_j = ∅ ∀i ≠ j and ⋃_{ℓ=1}^k Ω_ℓ = Ω. Then, functions Φ_ℓ : A × Ω_ℓ → R^n, (a, x) ↦ (Φ_ℓ^1(a, x), . . . , Φ_ℓ^n(a, x)), are vector-valued random variables associated with the probability measures P_ℓ; for example, n = 3 for RGB color images. Suppose that each P_ℓ := P_ℓ ∘ Φ_ℓ^{-1} is absolutely continuous with respect to µ (written P_ℓ ≪ µ) for ℓ = 1, . . . , k, where µ denotes the Lebesgue measure on R^n. Then, P_ℓ are the induced probability measures on (R^n, ℬ(R^n)), i.e., distributions. We consider collections of the random variables as independent and identically distributed spatial stochastic processes (random fields) as follows:

    {Φ_ℓ(x) ∈ R^n | x ∈ Ω_ℓ}  ~  P_ℓ  (i.i.d.)   ∀ℓ = 1, . . . , k,

where the index variable x takes values in the uncountable set Ω. Here, we omit the elements a of the sample space A for notational convenience. It is important to notice that φ_ℓ : Ω_ℓ → R^n is an observed function of Φ_ℓ, i.e., φ_ℓ(·) is a snapshot (realization) of the stochastic process
Table 1
Notation and symbols.

About geometry
Ω : Bounded open subset of R^d, d ≥ 2, with Lipschitz boundary.
{Ω_ℓ}_{ℓ=1}^k : Ground truth partition such that Ω_i ∩ Ω_j = ∅ ∀i ≠ j and ⋃_{ℓ=1}^k Ω_ℓ = Ω.
{R_ℓ}_{ℓ=1}^k : Candidate partition such that R_i ∩ R_j = ∅ ∀i ≠ j and ⋃_{ℓ=1}^k R_ℓ = Ω.
χ_{R_ℓ} : Characteristic function of R_ℓ, i.e., χ_{R_ℓ}(x) = 1 if x ∈ R_ℓ and 0 otherwise.
u_ℓ : χ_{R_ℓ}.
u : k-tuple u = (u_1, . . . , u_k) such that Σ_{ℓ=1}^k u_ℓ(x) = 1.
Per_g(R_ℓ; Ω) : Weighted perimeter of R_ℓ with a nonnegative function g.
∫_Ω g|Du_ℓ| : Weighted total variation of u_ℓ with a nonnegative function g.
BV(Ω) : Space of functions of bounded variation.
L^p(Ω) : Lebesgue spaces.

About information theory
(A, 𝒜, P_ℓ) : Probability space for ℓ = 1, . . . , k.
X : Random location; its realization is x.
Φ_ℓ : Vector-valued random variable associated with probability measure P_ℓ.
φ_ℓ : Observed function (realization) of Φ_ℓ.
P_ℓ : Induced probability measure on (R^n, ℬ(R^n)), i.e., P_ℓ := P_ℓ ∘ Φ_ℓ^{-1}.
P_ℓ ≪ µ : P_ℓ is absolutely continuous with respect to µ.
P_ℓ : Mixture of P_1, . . . , P_k, i.e., P_ℓ := P_{Φ(X)|X∈R_ℓ}.
h(Φ(X)) : Entropy of Φ(X).
h(Φ(X)|u(X)) : Conditional entropy: the entropy of Φ(X) given u(X).
I(Φ(X); u(X)) : Mutual information between Φ(X) and u(X).
π_1, . . . , π_k : Nonnegative real numbers such that π_1 + · · · + π_k = 1.
Q : Q := Σ_{ℓ=1}^k π_ℓ P_ℓ, i.e., probability measure on (R^n, ℬ(R^n)).
D_KL(P_ℓ||Q) : Kullback–Leibler divergence from P_ℓ to Q.
dP_ℓ/dQ : Radon–Nikodym derivative of P_ℓ with respect to Q.
D_π^JS(P_1, . . . , P_k) : Generalized JS divergence of P_1, . . . , P_k.
p_ℓ : Density of P_ℓ, i.e., p_ℓ := p_{Φ(X)|u_ℓ(X)=1}.
p̂_ℓ : Kernel density estimate for p_ℓ.
K_H : Kernel function with the n × n bandwidth matrix H.

About optimization
H, K : Real Hilbert spaces.
Γ₀(H) : Class of all proper, lower semicontinuous, and convex functions defined on H.
argminimax : Set of saddle points of the argument.
Prox_{g*} : Proximity operator of δ_K(ξ), i.e., orthogonal projector onto the convex set K.
Prox_q : Proximity operator of δ_S(u), i.e., orthogonal projector onto the convex set S.
Φ_ℓ(·), which is the set of snapshots across the ensemble. Let Φ(x) denote the sum of the random variables Φ_1(x) + · · · + Φ_k(x), and let φ(x) denote the sum of the realizations φ_1(x) + · · · + φ_k(x), respectively. Then, Φ(x) = Φ_ℓ(x) and φ(x) = φ_ℓ(x) for x ∈ Ω_ℓ, ℓ = 1, . . . , k, which is illustrated in Figure 2(a) when d = 2 and k = 4.
Let X be a uniformly distributed random location (index) over the domain Ω. For example, if one randomly extracts a pixel from the domain of a digital image, every pixel is equally likely to occur. Then, the distribution of the random variable Φ(X) is given by the following finite mixture:

(2.1)    P_{Φ(X)}(φ) = Σ_{ℓ=1}^k Pr(X ∈ Ω_ℓ) P_{Φ(X)|X∈Ω_ℓ}(φ) = (|Ω_1|/|Ω|) P_1(φ) + · · · + (|Ω_k|/|Ω|) P_k(φ),
Figure 2. Suppose that the candidate partition (b) is made on the ground truth model (a) such that each region R_ℓ is encoded by its characteristic function u_ℓ as shown in (c), where Σ_{ℓ=1}^k u_ℓ(x) = 1 almost everywhere in Ω. Then, each distribution P_ℓ := P_{Φ(X)|u_ℓ(x)=1} is given by a mixture of P_1, . . . , P_4. The key idea of what we call information-theoretic minimal perimeter partitioning is to reconstruct the unknown ground truth model from an observation φ in such a way that the associated distributions of a computed partition are as homogeneous as possible with spatial regularity. (Panels: (a) ground truth model, (b) candidate partition, (c) characteristic functions.)
where | · | denotes the d-dimensional Lebesgue measure. As can be seen, the distribution
PΦ(X) involves two sources of uncertainty: first, the uncertainty of point location modeled by
Pr(X ∈ Ω` ), i.e., the region where a given location belongs; and second, the uncertainty in
distribution for each region given by P_{Φ(X)|X∈Ω_ℓ}. Note that we assume P_ℓ ≪ µ for ℓ = 1, . . . , k,
so that the density of each P_ℓ, denoted by p_ℓ, is defined as the Radon–Nikodym derivative of
P_ℓ with respect to µ. Accordingly, P and p are used interchangeably throughout the paper.
For instance, the mixture density pΦ(X) (φ) can also be written in the form of (2.1) but in
terms of pΦ(X)|X∈Ω` .
Then, the problem we address in this paper is to jointly estimate the ground truth partition {Ω_ℓ}_{ℓ=1}^k and its associated distributions P_1, . . . , P_k, which are both unknown to us, thereby estimating the mixture distribution P_{Φ(X)}. To this end, let us consider a candidate partition {R_ℓ}_{ℓ=1}^k such that R_i ∩ R_j = ∅ ∀i ≠ j and ⋃_{ℓ=1}^k R_ℓ = Ω (up to sets of Lebesgue measure zero) and its associated distributions P_1, . . . , P_k, as illustrated in Figure 2(b) when d = 2 and k = 4. Here, it is crucial to notice that each P_ℓ := P_{Φ(X)|X∈R_ℓ} is given by a mixture distribution of P_1, . . . , P_k as follows:
    P_ℓ = (|R_ℓ ∩ Ω_1| / |R_ℓ|) P_1 + · · · + (|R_ℓ ∩ Ω_k| / |R_ℓ|) P_k    ∀ℓ = 1, . . . , k.
Hence, coming up with a certain criterion that measures the homogeneity of the distributions P_1, . . . , P_k, one may want to optimize it with respect to the candidate partition {R_ℓ}_{ℓ=1}^k in order to obtain the ground truth partition. Notice that the larger the portion of R_ℓ that coincides with one of the sets Ω_1, . . . , Ω_k, the more homogeneous its associated distribution becomes. Indeed, the partitioning problem can be captured in this way by maximizing the mutual information between Φ(X) and the characteristic function of the candidate partition {R_ℓ}_{ℓ=1}^k. To elaborate, let us observe that the characteristic function of {R_ℓ}_{ℓ=1}^k is given as follows:

(2.2)    u_ℓ(X) := χ_{R_ℓ}(X) = 1 with probability |R_ℓ|/|Ω|,  and 0 with probability |Ω \ R_ℓ|/|Ω|,
such that Σ_{ℓ=1}^k u_ℓ(x) = 1, where x ∈ Ω. Note that the function u_ℓ(X) is a random variable considering the uncertainty of the point location X. Then, the mutual information between Φ(X) and u(X) := (u_1(X), . . . , u_k(X)) is defined as

(2.3)    I(Φ(X); u(X)) := h(Φ(X)) − h(Φ(X)|u(X)) = h(Φ(X)) − Σ_{ℓ=1}^k Pr(u_ℓ(X) = 1) h(Φ(X)|u_ℓ(X) = 1),

where Pr(u_ℓ(X) = 1) = Pr(X ∈ R_ℓ) = |R_ℓ|/|Ω|. Now, let us come back to what we have
mentioned—the partitioning problem can be captured by maximizing the mutual information
between Φ(X) and u(X).
Theorem 2.1 (Kim et al. [30]). Let {R_ℓ}_{ℓ=1}^k be a candidate partition of Ω. Then, the mutual information I(Φ(X); u(X)) is maximized if and only if the candidate partition coincides with the ground truth partition {Ω_ℓ}_{ℓ=1}^k.
Proof. It is a natural extension of the fact about mutual information [30] proven for the two-region case to the general case with multiple regions. As shown in [30], the mutual information I(Φ(X); u(X)) is maximized if and only if Φ(X) and X are conditionally independent given u(X), so it suffices to show that P_{Φ(X)|u(X)} = P_{Φ(X)|X,u(X)} if and only if the candidate partition {R_ℓ}_{ℓ=1}^k coincides with the ground truth partition {Ω_ℓ}_{ℓ=1}^k, i.e., R_ℓ = Ω_{π(ℓ)} for 1 ≤ ℓ ≤ k, where π(·) is a permutation function. Similar to [30], the rest of the proof comes from the fact that P_{Φ(X)|u(X)=e_ℓ}, where e_ℓ = (0, 0, . . . , 1, . . . , 0) whose ℓth element is 1, is homogeneous for all ℓ if the two partitions coincide, and that if the two partitions do not coincide there is at least one region R_ℓ for which P_{Φ(X)|u(X)=e_ℓ} is a mixture of P_i's. The conditional distribution P_{Φ(X)|u(X)=e_ℓ} is given as a mixture of P_1, . . . , P_k as follows:
    P_{Φ(X)|u(X)=e_ℓ} = (|R_ℓ ∩ Ω_1| / |R_ℓ|) P_1 + · · · + (|R_ℓ ∩ Ω_k| / |R_ℓ|) P_k.

On the other hand, P_{Φ(X)|X=x, u(X)=e_ℓ} = P_j if x ∈ Ω_j, indicating that P_{Φ(X)|X=x, u(X)=e_ℓ} is homogeneous. (⇐): If R_ℓ = Ω_{π(ℓ)} for ℓ = 1, . . . , k, then P_{Φ(X)|X=x, u(X)=e_ℓ} = P_{π(ℓ)} (∵ e_ℓ = u(X) = u(x) ⇒ x ∈ R_ℓ = Ω_{π(ℓ)}) and P_{Φ(X)|u(X)=e_ℓ} = P_{π(ℓ)} (∵ u(X) = e_ℓ ⇒ X ∈ R_ℓ = Ω_{π(ℓ)}). (⇒): If the partitions do not coincide, there is at least one ℓ such that |R_ℓ ∩ Ω_a| ≠ 0, |R_ℓ ∩ Ω_b| ≠ 0, and P_{Φ(X)|u(X)=e_ℓ} = α_a P_a + α_b P_b + · · · ≠ P_{Φ(X)|X=x, u(X)=e_ℓ} = P_j if x ∈ Ω_j.
Theorem 2.1 can be understood as a justification for the suitability of maximizing the
mutual information for unsupervised image segmentation; its inherent issue, however, will be
elaborated in section 2.3. Before moving on to how to optimize the mutual information criterion, we further analyze it in terms of the Kullback–Leibler (KL) divergence. It will broaden
our understanding of the notion of mutual information with a slightly different perspective.
2.2. More on the mutual information criterion. Suppose that π_1, . . . , π_k are nonnegative real numbers such that π_1 + · · · + π_k = 1. Setting Q := Σ_{ℓ=1}^k π_ℓ P_ℓ, one can see that Q is again a probability measure on the measurable space (R^n, ℬ(R^n)). The KL divergence from P_ℓ to Q is defined as
Figure 3. Suppose that A = {a_1, a_2, a_3} and 𝒜 = 2^A. Then, the generalized JS divergence on the 2-simplex measures the similarity between the probability measures P_1, P_2, and P_3 by means of a convex combination of KL divergences between P_ℓ and Σ_{ℓ=1}^3 π_ℓ P_ℓ. Note that Q := Σ_{ℓ=1}^3 π_ℓ P_ℓ is again a probability measure.
(2.4)    D_KL(P_ℓ || Q) := ∫_{R^n} log( dP_ℓ/dQ ) dP_ℓ,    P_ℓ ≪ Q,

where dP_ℓ/dQ is the Radon–Nikodym derivative of P_ℓ with respect to Q and the logarithm is taken to base 2. Then, the generalized JS divergence for the probability measures P_1, . . . , P_k is defined as follows:¹

(2.5)    D_π^JS(P_1, . . . , P_k) := Σ_{ℓ=1}^k π_ℓ D_KL(P_ℓ || Q),

which is a convex combination of KL divergences between P_ℓ and Q. A schematic illustration of how it measures the differences between P_1, P_2, and P_3 on the probability simplex is shown in Figure 3. Setting π_ℓ = |R_ℓ|/|Ω| for ℓ = 1, . . . , k, we can rewrite the mutual information between Φ(X) and u(X) as follows:
    I(Φ(X); u(X)) = h(Q) − Σ_{ℓ=1}^k π_ℓ h(P_ℓ)
                  = − ∫_{R^n} log( dQ/dµ ) dQ + Σ_{ℓ=1}^k π_ℓ ∫_{R^n} log( dP_ℓ/dµ ) dP_ℓ
                  = − Σ_{ℓ=1}^k π_ℓ ∫_{R^n} log( dQ/dµ ) dP_ℓ + Σ_{ℓ=1}^k π_ℓ ∫_{R^n} log( dP_ℓ/dµ ) dP_ℓ
                  = Σ_{ℓ=1}^k π_ℓ D_KL(P_ℓ || Q).
¹ Note that the JS divergence between two probability distributions P_1 and P_2 is defined as D_JS(P_1||P_2) := (1/2) D_KL(P_1||M) + (1/2) D_KL(P_2||M), where M = (1/2)(P_1 + P_2). One can find its generalization, i.e., the generalized JS divergences, in [34].
Thus, maximizing the mutual information between Φ(X) and u(X) can be understood as
maximizing the generalized JS divergence of the probability measures P1 , . . . , Pk , which maximally separates the associated distributions from Q in terms of the weighted KL divergences.
The following lemma is then quite straightforward.
Lemma 2.2. 0 ≤ DπJS (P1 , . . . , Pk ) ≤ log k.
Proof. It follows from the fact that mutual information is nonnegative and I(Φ(X); u(X)) ≤
h(u(X)) ≤ log k, where the last inequality is from the number of labels (cardinality) in the
range of u(X).
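As a quick numerical illustration of Lemma 2.2 (our own sketch, assuming discrete distributions on a finite alphabet), the generalized JS divergence vanishes when all P_ℓ coincide and stays below log k even when the P_ℓ have disjoint supports:

# Hypothetical numerical check of Lemma 2.2 for discrete distributions (base-2 logarithm).
import numpy as np

def generalized_js(ps, pis):
    ps, pis = np.asarray(ps, float), np.asarray(pis, float)
    q = pis @ ps                                   # mixture Q = sum_l pi_l P_l
    kl = lambda p: float(np.sum(p[p > 0] * np.log2(p[p > 0] / q[p > 0])))
    return sum(pi * kl(p) for pi, p in zip(pis, ps))

pis = [0.2, 0.3, 0.5]
identical = [[0.1, 0.4, 0.5]] * 3                  # all P_l equal  ->  divergence 0
disjoint = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]       # disjoint supports  ->  close to log k
print(generalized_js(identical, pis))              # 0.0
print(generalized_js(disjoint, pis), np.log2(3))   # ~1.485 <= log2(3) ~ 1.585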
2.3. Regularized mutual information minimization. The fundamental problem we are
now faced with is the fact that computing the mutual information between Φ(X) and u(X)
involves the ground truth model, {Ω` }k`=1 and P1 , . . . , Pk . Obviously, one cannot compute the
mutual information in practice because the ground truth model is unknown to us, let alone
its maximization. Hence, we replace the mutual information with its estimate. To this end,
let us remind the readers of what we assumed in section 2.1: Suppose P_ℓ ≪ µ for ℓ = 1, . . . , k. Then, the density of each P_ℓ, denoted by p_ℓ := p_{Φ(X)|u_ℓ(X)=1}, is defined by dP_ℓ/dµ. Let {φ(x) : u_ℓ(x) = 1, x ∈ Ω} be a collection of independent and identically distributed samples drawn from the unknown density p_ℓ. Then, for ℓ = 1, . . . , k, we estimate the unknown density by means of kernel density estimation as follows:

    p̂_ℓ(ϕ) = ∫_Ω K_H(ϕ − φ(x)) u_ℓ(x) dx  /  ∫_Ω u_ℓ(x) dx,
where KH is the kernel function with the n × n bandwidth matrix H (symmetric and positive
definite). When n = 1, for instance, KH is the Gaussian kernel with the scalar bandwidth H,
which is given by
    K_H(ϕ − φ(x)) = (1/√(2πH²)) exp( −(ϕ − φ(x))² / (2H²) ).
In [47], one can find a number of feasible definitions of the kernel function and the bandwidth
matrices. We then estimate the conditional entropy terms in (2.3) using the integral estimate
proposed in [20] as follows:
(2.6)    ĥ(Φ(X)|u_ℓ(X) = 1) := − ∫_{R^n} p̂_ℓ(ϕ) log p̂_ℓ(ϕ) dϕ,
where it is worth mentioning that the above estimate converges to the true entropy almost
surely as the number of samples for kernel density estimation goes to infinity. In practice, the
collection {φ(x) : u_ℓ(x) = 1, x ∈ Ω} is given as a finite set whose cardinality is the number of pixels in R_ℓ. Thus, the kernel density estimate is computed as p̂_ℓ(ϕ) = (1/N) Σ_{i=1}^N K_H(ϕ − φ(x_i)),
where N is the number of pixels. Let ĥN be an integral estimate of the true entropy h
computed by the above kernel density estimate; then Dmitriev and Tarasenko [20] showed
Pr(limN →∞ ĥN = h) = 1. On the other hand, the entropy terms in [30] are estimated by using
the resubstitution estimate proposed in [1], which is essentially the law of large numbers. The
estimate is known to be mean square consistent. For further details, we refer the readers to
an overview of the nonparametric entropy estimation [6].
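The following is a minimal numerical sketch of the two estimates above (our own illustration, assuming n = 1, a Gaussian kernel with scalar bandwidth H, and a fixed quadrature grid for the integral in (2.6)); it builds p̂_ℓ from the pixel values currently assigned to a region and evaluates the corresponding entropy estimate.

# Hypothetical sketch of the kernel density estimate and the integral entropy estimate (2.6),
# assuming n = 1 (grayscale), a Gaussian kernel with scalar bandwidth H, and a quadrature grid.
import numpy as np

def kde(samples, grid, H=0.05):
    # p_hat(phi) = (1/N) sum_i K_H(phi - phi_i), evaluated on a grid of intensity values.
    diffs = grid[:, None] - samples[None, :]
    k = np.exp(-diffs**2 / (2 * H**2)) / np.sqrt(2 * np.pi * H**2)
    return k.mean(axis=1)

def entropy_estimate(samples, grid, H=0.05):
    p = np.maximum(kde(samples, grid, H), 1e-300)
    dphi = grid[1] - grid[0]
    return -np.sum(p * np.log2(p)) * dphi          # quadrature version of (2.6), base-2 log

grid = np.linspace(-0.5, 1.5, 512)
rng = np.random.default_rng(0)
narrow = rng.normal(0.5, 0.02, 2000)               # nearly homogeneous region: low entropy
broad = np.concatenate([rng.normal(0.2, 0.1, 1000), rng.normal(0.8, 0.1, 1000)])
print(entropy_estimate(narrow, grid), entropy_estimate(broad, grid))  # narrow < broad

The bandwidth H and the grid are illustrative choices; as noted above, bandwidth selection is discussed in [47].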
Now, the following maximization problem (maximizing the mutual information estimate)
looks tempting in that we can actually compute its solutions:
    maximize_{ {R_ℓ}_{ℓ=1}^k }   Î(Φ(X); u(X)) := ĥ(Φ(X)) − ĥ(Φ(X)|u(X)).
Yet, as often observed in empirical risk minimization [37], the solutions of such empirical objectives (risks) typically exhibit overfitting, which is obviously unwanted. Therefore, we penalize the complexity of solutions by combining the mutual information estimate Î(Φ(X); u(X)) with spatial regularity—which is what we call information-theoretic minimal partitioning—as follows: For an observed function φ(x) of Φ(x) for a.e. x ∈ Ω,

(2.7)    minimize_{ {R_ℓ}_{ℓ=1}^k }   (1/2) Σ_{ℓ=1}^k Per_g(R_ℓ; Ω) − λ ∫_Ω Î(Φ(X); u(X)) dx,

where the first term is half the sum of the perimeters of the sets {R_ℓ}_{ℓ=1}^k with respect to a metric defined by a nonnegative function g(x), which includes the geodesic active contours [10]—see [17] for more details. The mutual information estimate Î(Φ(X); u(X)) is now integrated over the entire (bounded) domain to incorporate the total amount of information. Note that the mutual information between Φ(X) and u(X) is the amount of information defined on a single (random) location X. Then, we can rewrite the second term as λ|Ω| Î(Φ(X); u(X)) since the mutual information estimate is not a function of x; here λ > 0 is a parameter. See [30] for details. In short, the problem combines maximal distribution separation—one might want to recall the fact that the mutual information is the same as the generalized JS divergence estimate (see section 2.2)—with spatial regularity in the sense of the total length of the region boundaries. Also, one may want to write the problem as

    minimize_{ {R_ℓ}_{ℓ=1}^k }   (1/2) Σ_{ℓ=1}^k Per_g(R_ℓ; Ω) − λ|Ω| D_π^JS(P̂_1, . . . , P̂_k),

where π_ℓ = |R_ℓ|/|Ω|. This is clearly the same as (2.7).
2.3.1. The bounded variation framework. We introduced the characteristic function to encode the candidate partition {R_ℓ}_{ℓ=1}^k, a geometric variable, as shown in (2.2). As can be found in [33, 11], we further assume that each u_ℓ is in BV(Ω), the space of functions of bounded variation [3], so that the k-tuple u = (u_1, . . . , u_k) ∈ BV(Ω; R^k) satisfies Σ_{ℓ=1}^k u_ℓ(x) = 1, where u_ℓ(x) ∈ {0, 1}, almost everywhere in Ω. Then, the perimeter of R_ℓ in Ω is defined as the total variation of its characteristic function, ∫_Ω |Dχ_{R_ℓ}|; thus we rewrite (2.7) as

(2.8)    minimize_{u ∈ Δ}   E(u) := (1/2) Σ_{ℓ=1}^k ∫_Ω g|Du_ℓ| − λ|Ω| Î(Φ(X); u(X))

with the set

(2.9)    Δ = { u ∈ BV(Ω; {0, 1}^k) : Σ_{ℓ=1}^k u_ℓ = 1 a.e. in Ω },
where g(x) is a nonnegative function as described in (2.7). The encoding scheme is illustrated
in Figure 2(c). The most immediate question that follows is whether the above partitioning
problem admits a minimizer.
Theorem 2.3 (existence theorem). There exists a minimizer u(·) such that inf{E(u) : u ∈
∆} is achieved.
Proof. We follow the direct method of the calculus of variations: (1) Since inf{E(u) : u ∈ Δ} > −∞ (recall that the mutual information estimate is bounded—see Lemma 2.2), there exists a minimizing sequence {u^n}_{n≥1} ∈ Δ, meaning E(u^n) → inf{E(u) : u ∈ Δ}, such that sup_n ||u^n||_BV < +∞. (2) Then, the minimizing sequence admits some subsequence {u^{n_h}}_{h≥1} converging to u⋆ ∈ Δ in L¹(Ω; {0, 1}^k) by the compactness result in BV (see [3, Theorem 3.23]). (3) Since the total variation is lower-semicontinuous (l.s.c.) in the L¹(Ω; {0, 1}^k) topology, we only need to show that the mutual information estimate Î(Φ(X); u(X)) is also l.s.c. in the same topology.
To this end, recall that the integral estimate of the differential entropy is continuous with respect to convergence in variational distance (which is the same as L¹-convergence) [54]. Hence, it suffices to show that kernel density estimates are continuous with respect to the L¹(Ω; {0, 1}) topology (recall the universal property of the product topology); i.e., we claim that ∀ε > 0, ∃δ > 0 such that ||u_ℓ − u⋆_ℓ||_{L¹(Ω;{0,1})} < δ ⟹ ||p̂_{Φ(X)|u_ℓ(X)=1} − p̂_{Φ(X)|u⋆_ℓ(X)=1}||_TV < ε, where ||·||_TV is the total variation distance (variational distance) of probability measures. See Appendix A for the proof of the claim. Then, E(u) is sequentially l.s.c. with respect to the L¹(Ω; {0, 1}^k) topology.
Now that the existence of a minimizer has been established, we turn our attention to
the question of how to actually compute such a minimizer within a reasonable time. Indeed,
the rest of this section is devoted to finding out the underlying mathematical structure of the
information-theoretic minimal partitioning problem, upon which we develop an efficient optimization algorithm for the problem with an in-depth analysis and demonstrate the capacity
of our approach in numerous experiments in the second half of the paper.
We start off reminding the readers of convex relaxation techniques for the total variation of characteristic functions (of Caccioppoli sets). Note that its discrete counterpart, known as the Potts model, is NP-hard [7], so that we should make use of convex relaxation methods to deal with the constraint set Δ. From the local convex envelope approach to minimal partitions [11], we set J : L²(Ω; R^k) → [0, +∞] such that u ↦ J(u) = (1/2) Σ_{ℓ=1}^k ∫_Ω |Du_ℓ| if u ∈ Δ and +∞ otherwise. Then, the information-theoretic minimal partitioning problem, i.e., (2.8) with (2.9), can be rewritten as follows (here we assume g = 1):

(2.10)    minimize_{u ∈ L²(Ω; R^k)}   J(u) − λ|Ω| Î(Φ(X); u(X)).
Let J** be the biconjugate of J, i.e., J**(u) = sup{⟨v, u⟩ − J*(v) : v ∈ L²(Ω; R^k)}, where J* is the Fenchel conjugate of J—see, for example, [5]—and the (effective) domain of J is Δ. From the convex envelope approach [11, Proposition 3.1], we have the domain of J** as follows:
    dom J** = { u ∈ BV(Ω; [0, 1]^k) : Σ_{ℓ=1}^k u_ℓ = 1 a.e. in Ω }.
Unfortunately, it is believed that no explicit forms of J** are known. As a consequence, a significant amount of attention has been paid in recent years [57, 33, 11] to searching for a convex, l.s.c. function J̆ such that J̆ ≤ J in dom J**. Considering the dual representation of the total variation, it turns out that the function J̆ is of the form

(2.11)    J̆(u) = sup{ − ∫_Ω Σ_{ℓ=1}^k u_ℓ div ξ_ℓ : ξ(x) ∈ K ∀x ∈ Ω },
where the dual variable ξ := (ξ_1, . . . , ξ_k) is an element of the set K ⊂ C_c^1(Ω; R^{d×k}). Then, the convex relaxation techniques developed so far boil down to the way of imposing constraints on the dual variable ξ via the convex set K. For example, Zach et al. [57] proposed the set K as follows:

(2.12)    K = K_Zach := { ξ ∈ C_c^1(Ω; R^{d×k}) : |ξ_ℓ| ≤ 1/2 ∀ℓ = 1, . . . , k },

where |·| denotes the Euclidean norm in R^d. The approach of Lellmann and Schnörr [33], on the other hand, reduces to the following convex set:

(2.13)    K = K_Lellmann := { ξ ∈ C_c^1(Ω; R^{d×k}) : |ξ| ≤ 1/√2 },

where |·| denotes the Frobenius norm in R^{d×k}. The convex set from the local convex envelope approach proposed by Chambolle, Cremers, and Pock [11] is given as

(2.14)    K = K_Chambolle := { ξ ∈ C_c^1(Ω; R^{d×k}) : |ξ_i − ξ_j| ≤ 1 ∀i < j },
where |·| denotes the Euclidean norm in R^d. See Lemma 3.4 and Proposition 5.1 in [11] for more details on tightness. Therefore, the relaxed problem is given as

(2.15)    minimize_{u ∈ dom J**}   E_rel(u) := J̆(u) − λ|Ω| Î(Φ(X); u(X)),

where J̆(u) is given in (2.11) and the respective convex set K is either (2.12), (2.13), or (2.14).
Note that it is straightforward to incorporate a nonnegative function g into the convex sets.
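In a discretized implementation, imposing ξ(x) ∈ K amounts to a pointwise operation on the dual variable. The sketch below (our own hedged illustration; the paper defers implementation details to [33]) shows the projections corresponding to K_Zach and K_Lellmann, which reduce to simple norm clippings; the projection onto K_Chambolle is not a pointwise clipping and typically requires an iterative scheme (e.g., Dykstra-type projections), so it is omitted here.

# Hypothetical pointwise projections of a dual field xi onto K_Zach and K_Lellmann.
# xi has shape (k, d, H, W): k labels, d spatial derivative components, on an H x W grid.
import numpy as np

def project_zach(xi):
    # |xi_l(x)| <= 1/2 for every label l, Euclidean norm over the d derivative components.
    norms = np.sqrt((xi**2).sum(axis=1, keepdims=True))       # shape (k, 1, H, W)
    return xi / np.maximum(1.0, 2.0 * norms)

def project_lellmann(xi):
    # |xi(x)| <= 1/sqrt(2), Frobenius norm over all k*d components at each point.
    norms = np.sqrt((xi**2).sum(axis=(0, 1), keepdims=True))   # shape (1, 1, H, W)
    return xi / np.maximum(1.0, np.sqrt(2.0) * norms)

xi = np.random.default_rng(0).normal(size=(3, 2, 8, 8))
print(project_zach(xi).shape, project_lellmann(xi).shape)

These projections will reappear as the proximity operator Prox_{g*} in the algorithms of section 3.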
2.3.2. Difference of convex functionals. Now that we have established the existence of minimizers (Theorem 2.3), we are aiming to compute such a minimizer of the problem. Indeed, if the relaxed problem (2.15) is convex in dom J** (in other words, if −λ|Ω| Î(Φ(X); u(X)) is convex), one can make use of some suitable convex optimization techniques to efficiently compute a globally optimal solution of (2.15). Then, a near-optimal binary solution of the original nonconvex problem, (2.8) with (2.9), can be obtained by thresholding the global minimizer of the relaxed problem with a performance guarantee (per-instance energy bound²).
² Let u⋆_bin be a thresholded version of u⋆_rel, a global minimizer of E_rel(u) over the domain dom J**, that is to say, u⋆_bin = 1_{u⋆_rel ≥ θ} for any value θ ∈ (0, 1), and let u_opt be a globally optimal binary solution of E(u). A per-instance energy bound for the optimal solution is given by |E(u⋆_bin) − E(u_opt)| ≤ |E(u⋆_bin) − E(u_rel)|, i.e., by evaluating E(u) at u_rel and u_bin, one can check how close u_bin is to u_opt energetically.
Therefore, we need to investigate the structure of the objective to check whether the overall
energy is convex.
Proposition 2.4. Suppose that λ is nonnegative and the domain Ω is bounded; then the data term, −λ|Ω| Î(Φ(X); u(X)), is concave in u ∈ dom J**.
Proof. Since the entropy estimate ĥ(Φ(X)) is a constant with respect to u, it suffices to show that the conditional entropy estimate ĥ(Φ(X)|u(X)) is concave in u. To this end, we rewrite Pr(u_ℓ(X) = 1) ĥ(Φ(X)|u_ℓ(X) = 1) using its integral estimate (2.6) as follows:

    Ẽ(u_ℓ) := Pr(u_ℓ(X) = 1) ĥ(Φ(X)|u_ℓ(X) = 1)
            = − (1/|Ω|) ∫_Ω u_ℓ(x) dx ∫_{R^n} p̂_ℓ(ϕ) log p̂_ℓ(ϕ) dϕ
            = − (1/|Ω|) ∫_{R^n} ( ∫_Ω K_H(ϕ − φ(x)) u_ℓ(x) dx ) log [ ∫_Ω K_H(ϕ − φ(x)) u_ℓ(x) dx / ∫_Ω u_ℓ(x) dx ] dϕ,

where the second equality is from Fubini's theorem. Then, for 0 ≤ t ≤ 1 and u_{ℓ1}, u_{ℓ2} ∈ dom J**, we have

    Ẽ(t u_{ℓ1} + (1 − t) u_{ℓ2})
        = − (1/|Ω|) ∫_{R^n} ( ∫_Ω K_H(ϕ − φ(x)) (t u_{ℓ1}(x) + (1 − t) u_{ℓ2}(x)) dx )
              × log [ ∫_Ω K_H(ϕ − φ(x)) (t u_{ℓ1}(x) + (1 − t) u_{ℓ2}(x)) dx / ∫_Ω (t u_{ℓ1}(x) + (1 − t) u_{ℓ2}(x)) dx ] dϕ
        ≥ − (t/|Ω|) ∫_{R^n} ( ∫_Ω K_H(ϕ − φ(x)) u_{ℓ1}(x) dx ) log p̂_{ℓ1}(ϕ) dϕ
              − ((1 − t)/|Ω|) ∫_{R^n} ( ∫_Ω K_H(ϕ − φ(x)) u_{ℓ2}(x) dx ) log p̂_{ℓ2}(ϕ) dϕ
        = t Ẽ(u_{ℓ1}) + (1 − t) Ẽ(u_{ℓ2}),

where the inequality is from the log sum inequality [16]. Since the sum of concave functions is concave, ĥ(Φ(X)|u(X)) is concave in u.
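As a sanity check on the proof above (our own numerical sketch, assuming a one-dimensional discretization with |Ω| = 1, a Gaussian kernel, and a quadrature grid in ϕ), one can verify the concavity inequality Ẽ(t u_{ℓ1} + (1 − t) u_{ℓ2}) ≥ t Ẽ(u_{ℓ1}) + (1 − t) Ẽ(u_{ℓ2}) on random relaxed labelings u ∈ [0, 1]^N:

# Hypothetical numerical check of the concavity inequality in the proof above, assuming a
# 1D discretization (|Omega| = 1, dx = 1/N), a Gaussian kernel, and a quadrature grid over phi.
import numpy as np

rng = np.random.default_rng(1)
N, H = 200, 0.05
phi_img = rng.uniform(0, 1, N)                    # observed intensities phi(x_i)
grid = np.linspace(-0.5, 1.5, 400)
dphi = grid[1] - grid[0]
K = np.exp(-(grid[:, None] - phi_img[None, :])**2 / (2 * H**2)) / np.sqrt(2 * np.pi * H**2)

def E_tilde(u):
    A = (K @ u) / N                               # A(phi) = integral of K_H(phi - phi(x)) u(x) dx
    B = u.sum() / N                               # integral of u(x) dx
    return -np.sum(A * np.log2(A / B)) * dphi

u1, u2 = rng.uniform(0, 1, N), rng.uniform(0, 1, N)
for t in (0.1, 0.5, 0.9):
    lhs = E_tilde(t * u1 + (1 - t) * u2)
    rhs = t * E_tilde(u1) + (1 - t) * E_tilde(u2)
    print(t, lhs >= rhs - 1e-10)                  # concavity: always True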
Proposition 2.4 implies that the overall energy Erel is nonconvex. Thus, it does not seem
straightforward to efficiently minimize the relaxed energy in order to compute a solution such
that inf{Erel (u) : u ∈ dom J ∗∗ } is achieved, let alone a near-optimal binary solution of the
original nonconvex problem. Yet, we make an intriguing observation on the structure of the
relaxed problem as follows.
Proposition 2.5. Erel (u) with the domain dom J ∗∗ is a form of the DC functionals.
Proof. Let Γ₀(H) denote the class of all proper, l.s.c., and convex functions³ defined on a real Hilbert space H taking values in (−∞, ∞], and suppose that H = L²(Ω; R^k). Then, as previously done for the total variation of u, Σ_{ℓ=1}^k ∫_Ω |Du_ℓ|, we can similarly extend J̆(u) to H by letting J̆(u) = +∞ if u ∉ dom J**. In other words, setting 𝒥 : H → [0, +∞] such that u ↦ J̆(u) if u ∈ dom J**, and +∞ otherwise, we rewrite the problem (2.15) as follows:

³ A key property of such functions in Γ₀(H) is that every element h ∈ Γ₀(H) is characterized by the supremum of its continuous affine minorants: h(x) = sup{⟨u, x⟩ − h*(u) : u ∈ H} ∀x ∈ H, where h*, defined by h*(u) = sup{⟨x, u⟩ − h(x) : x ∈ H} ∀u ∈ H, is the Fenchel conjugate of h. In fact, h* ∈ Γ₀(H).
(2.16)    minimize_{u ∈ H}   𝒥(u) − λ|Ω| Î(Φ(X); u(X)).
Here, the regularization term 𝒥(u) is clearly in Γ₀(H). For the data term λ|Ω| Î(Φ(X); u(X)), one should check if the following inner product makes sense (integrability):

    ⟨K_H(ϕ − φ), u_ℓ⟩_{L²(Ω;R)} = ∫_Ω K_H(ϕ − φ(x)) u_ℓ(x) dx,

where ϕ ∈ R^n. That is, one checks for almost every ϕ whether the function x ↦ K_H(ϕ − φ(x)) u_ℓ(x) is integrable, which can be shown by Fubini's theorem and the Cauchy–Schwarz inequality [48]. Hence, the data term is also in Γ₀(H).
Therefore, we observe that the problem amounts to minimizing a difference of convex functions, where we will see that convex analysis and optimization come into play.
3. Sequential convex programming. What we have established so far is as follows: The
information-theoretic minimal partitioning problem admits a minimizer; however, the overall
relaxation is still nonconvex so that the efficient computation of such a minimizer becomes
a major algorithmic challenge. Yet, we have shown that the relaxed problem exhibits a
form of DC functionals which, in this section, leads us to DC programming or the so-called
majorization-minimization (MM) algorithm where convex optimization techniques can be successfully applied. Therefore, we start off the section with essential preliminaries on convex
analysis for DC programming and the DCA, where the notion of DC duality and optimality
conditions become crucial to design such a convergent algorithm. For more details, we refer
the readers to [51, 52, 26, 4, 19] and references therein. Then, we consider a larger class of
DC functions where a linear operator is involved—the partitioning problem we consider falls
into this class.
3.1. Preliminaries: Convex analysis for DC programming and DCA. With the convention +∞ − (+∞) = +∞, the canonical DC programming we consider is of the form
(3.1)    α = inf{ f(u) := g(u) − h(u) : u ∈ H },
where g, h ∈ Γ0 (H). The function f is called a DC function, and its components g and h
are called DC components. A closed convex constrained set C ⊂ H is incorporated into the
canonical form (3.1) by adding the indicator function δC of C to g, i.e., δC (u) = 0 if u ∈ C,
and +∞ otherwise. Substituting h in the primal program (3.1) with its Fenchel conjugate,
we obtain the Fenchel–Rockafellar type dual
(3.2)
α = inf{h∗ (v) − g ∗ (v) : v ∈ H},
where the finiteness of α implies dom g ⊂ dom h (dually, dom h∗ ⊂ dom g ∗ ). Then, one can
clearly see that the dual program (3.2) is also another DC problem whose dual (dual of dual)
is exactly the primal (3.1). In addition, the optimal value in the dual program (3.2) is the
same as that of the primal program—see the Toland–Singer theorem [5, Corollary 14.20]. This
property, also known as DC duality, plays a pivotal role in establishing optimal conditions in
Algorithm 1. Simplified DCA [51, 52]
1: Choose u⁰ ∈ dom g.
2: while not converged do
3:    v^n ∈ ∂h(u^n).
4:    u^{n+1} ∈ ∂g*(v^n).
5: end while
DC programming, and it has opened up new possibilities of convex optimization to tackle a
wide class of nonconvex optimization problems.4
Theorem 3.1 (KKT conditions and local optimality [51]). Let u* be a critical point of f in (3.1); then ∂g(u*) ∩ ∂h(u*) ≠ ∅. Furthermore, if u* is a local minimizer, then ∂h(u*) ⊂ ∂g(u*). Dually, let v* be a critical point (resp., local minimizer) of h* − g* in (3.2); then ∂h*(v*) ∩ ∂g*(v*) ≠ ∅ (resp., ∂g*(v*) ⊂ ∂h*(v*)).
Based on the above conditions and DC duality, a globally convergent algorithm⁵—also known as the DCA—was proposed in [51, 52]. Again, notice that a globally convergent algorithm does not guarantee that a minimizing sequence converges to a globally optimal solution achieving the infimum of the objective. Rather, the algorithm minimizes (3.1) and (3.2) in such a way that it lets the sequences {g(u^n) − h(u^n)} and {h*(v^n) − g*(v^n)} decrease monotonically. More importantly, DCA constructs the sequence {(u^n, v^n)} converging to a pair of critical points (u*, v*) under some regularity conditions (see section 3.2 in [51]). See Algorithm 1.
Meanwhile, the algorithm can be alternatively derived by replacing h in the primal (3.1) with its Fenchel conjugate:

(3.3)    α = inf inf{ g(u) + h*(v) − ⟨u, v⟩ : u, v ∈ H },

which is convex in u for fixed v ∈ H and in v for fixed u ∈ H, respectively. Note that the problem is not jointly convex with respect to both u and v. Hence, we make use of alternating minimization for a given initial point u⁰ ∈ dom g as follows:

    v^n ∈ argmin{ g(u^n) + h*(v) − ⟨u^n, v⟩ : v ∈ H }  ⟺  0 ∈ ∂h*(v^n) − u^n  ⟺  v^n ∈ ∂h(u^n);
    u^{n+1} ∈ argmin{ g(u) + h*(v^n) − ⟨u, v^n⟩ : u ∈ H }  ⟺  0 ∈ ∂g(u^{n+1}) − v^n  ⟺  u^{n+1} ∈ ∂g*(v^n).
The above derivation boils down to Algorithm 1. As can be clearly seen, the algorithm does
not guarantee the global optimality of solutions due to the nature of alternating optimization.
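For intuition, here is a minimal sketch of Algorithm 1 on a toy one-dimensional DC function (our own example, not from the paper): f(u) = |u| − u² on the box [−2, 2], with g(u) = |u| + δ_{[−2,2]}(u) and h(u) = u². The convex subproblem u^{n+1} ∈ ∂g*(v^n), i.e., u^{n+1} ∈ argmin{g(u) − ⟨u, v^n⟩}, is solved by brute force on a grid to keep the example self-contained; depending on the initial point, the iterates settle at different critical points, which already hints at the sensitivity to initialization addressed in section 3.2.

# Hypothetical toy DCA (Algorithm 1) for f(u) = |u| - u^2 on [-2, 2]:
# g(u) = |u| + indicator of [-2, 2] (convex), h(u) = u^2 (convex), f = g - h.
import numpy as np

grid = np.linspace(-2.0, 2.0, 4001)

def dca(u0, iters=20):
    u = u0
    for _ in range(iters):
        v = 2.0 * u                               # v^n in dh(u^n), since h is differentiable
        # u^{n+1} in dg*(v^n)  <=>  u^{n+1} minimizes g(u) - <u, v^n>; solved on a grid here.
        u = grid[np.argmin(np.abs(grid) - v * grid)]
    return u

for u0 in (0.3, -0.7, 0.0):
    u_star = dca(u0)
    print(u0, u_star, abs(u_star) - u_star**2)    # critical points: 0 (value 0) or +/-2 (value -2)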
In what follows, we consider a larger class of DC functions where a linear operator is involved.
Indeed, the objective we consider in (2.15) falls into this class.
⁴ According to [51] and references therein, the vector space of DC functions, DC(H) = Γ₀(H) − Γ₀(H), encompasses most real-world objectives and is closed under usual operations in optimization.
⁵ Roughly speaking, an iterative algorithm A is said to be globally convergent if, for any initial point u⁰, a sequence {u^t}_{t=0}^∞ generated by the algorithm converges to a point for which an optimality condition holds. For precise definitions, see section 3 in [32].
3.2. Primal-dual DC procedure. Let K be a real Hilbert space, and let L : H → K be a
continuous linear operator with the induced norm ||L|| = sup{||Lu|| : u ∈ H with ||u|| ≤ 1}.
Then, the problem (2.16) is of the form
(3.4)
β = inf{g(Lu) − h(u) : u ∈ H},
where g ∈ Γ0 (K) and h ∈ Γ0 (H). Similarly, the finiteness of β implies dom g ⊂ L(dom h).
Here, one notices that it is far from straightforward to derive the Fenchel–Rockafellar type dual
of (3.4) in contrast to (3.1). In addition, it does not seem helpful to do so considering that one
would not be able to make use of Theorem 3.1 and Algorithm 1 directly, i.e., the update u^{n+1} ∈ ∂(g ◦ L)*(v^n) is not readily available from v^n ∈ ∂(g ◦ L)(u^{n+1}). On the other hand, if the composite g ◦ L is in Γ₀(H), one can apply
the algorithm by replacing g* with (g ◦ L)*. The total variation 𝒥(u) is in Γ₀(H) as observed
in section 2.3.2. Thus, it leads us to a primal-dual DC procedure where un+1 ∈ ∂(g ◦ L)∗ (v n )
is formulated as a saddle-point problem that can be solved by the first-order primal-dual
(Chambolle–Pock ) algorithm in [12]. Note that different types of first-order algorithms—
such as Nesterov’s method, Bregman iterative algorithms, and the Douglas–Rachford splitting
method—can be used for such a saddle-point problem; see section 6 in [33] for comparison.
Suppose that q ∈ Γ₀(H) is added into the form (3.4) to take additional convex constraints into account. Then, for a given initial point u⁰ ∈ L⁻¹(dom g), the procedure is as follows:

    v^n ∈ argmin{ g(Lu^n) + h*(v) − ⟨u^n, v⟩ + q(u^n) : v ∈ H }
        ⟺  0 ∈ ∂h*(v^n) − u^n  ⟺  v^n ∈ ∂h(u^n);
    u^{n+1} ∈ argmin{ g(Lu) + h*(v^n) − ⟨u, v^n⟩ + q(u) : u ∈ H }
        ⟺  (u^{n+1}, ξ̂) ∈ argminimax{ ⟨Lu, ξ⟩ − g*(ξ) + h*(v^n) − ⟨u, v^n⟩ + q(u) : u ∈ H, ξ ∈ K }
        ⟺  Lu^{n+1} ∈ ∂g*(ξ̂),  v^n − L*ξ̂ ∈ ∂q(u^{n+1}),

where argminimax denotes the set of saddle points of the argument (see Definition 11.49 in [45]). As mentioned above, we here apply the Chambolle–Pock algorithm [12] to solve the saddle-point problem, i.e., Lu^{n+1} ∈ ∂g*(ξ̂), v^n − L*ξ̂ ∈ ∂q(u^{n+1}); its convergence analysis then becomes straightforward from Theorem 3.7 in [52] and Theorem 1 in [12] by replacing g with g ◦ L ∈ Γ₀(H).
Remark 3.2. When g ◦ L ∈ Γ0 (H), the convergence analysis of the primal-dual DC procedure becomes straightforward from Theorem 3.7 in [52] and Theorem 1 in [12] by replacing g
with g ◦ L.
Note that the sequence converges weakly in general real Hilbert spaces, and if h in the
primal is differentiable, the DC algorithm reduces to the concave-convex procedure [56]. It
implies that the concave-convex procedure is also globally convergent, which was proved alternatively in [32] by means of Zangwill’s global convergence theory of iterative algorithms. The
MM principle [27], on the other hand, is capable of encompassing a larger class of nonconvex
problems providing a rather general strategy for constructing surrogate functions (majorizers). However, constructing a good surrogate of the original nonconvex objective is in general
far from straightforward, let alone its optimality conditions. Note that the monotonicity of
a minimizing sequence itself does not guarantee its convergence to a critical point of the objective function. Last, one recognizes that the philosophy behind the DC algorithm can be
Algorithm 2. Primal-dual DC procedure
1: Choose u⁰ ∈ dom g, ξ⁰ ∈ K, and θ ∈ [0, 1].
2: Choose τ, σ > 0 such that τσ||L||² < 1.
3: while not converged do
4:    v^{n1} ∈ ∂h(u^{n1}).
5:    while not converged do
6:        ξ^{n2+1} = Prox_{σg*}(ξ^{n2} + σLū^{n2}).
7:        u^{n2+1} = Prox_{τq}(u^{n2} − τ(v^{n1} − L*ξ^{n2+1})).
8:        ū^{n2+1} = u^{n2+1} + θ(u^{n2+1} − u^{n2}).
9:    end while
10:   u^{n1} = u^{n2+1}.
11: end while
understood as an MM algorithm when the objective is a DC function and its concave part
is majorized by the first-order Taylor expansion. Then, from an algorithmic perspective, the
major difference between the current study and the preliminary conference work of the authors [28] is that the surrogate function in the primal-dual DC procedure consists of a linear
function and a convex regularizer, whereas the one in [28] consists of a quadratic function and
a convex regularizer—more importantly, the convergence of a minimizing sequence was not
properly addressed. In addition, we will show later on that the proposed method is substantially
faster.
Now, we apply the primal-dual DC procedure to the problem (2.16) in order to compute an
optimal solution (critical point). Because the negative mutual information estimate is elementwise Gâteaux differentiable at u ∈ L2 (Ω; Rk ), the subdifferential ∂h(u) := (∂h(u1 ), . . . , ∂h(uk ))
is a singleton whose elements are computed as
(3.5)    v_\ell \in \partial h(u_\ell) = \int_{\mathbb{R}^n} K_H(\varphi - \phi(x)) \log \frac{1}{\hat{p}_\ell(\varphi)} \, d\varphi,
from which v is defined by the k-tuple v := (v1 , . . . , vk ). Let us recall the dual representation of
the total variation seminorm and its convex relaxations—see section 2.3.1. We identify g ∗ (ξ) as
δK (ξ), where the convex set K is either KZach , KLellmann , or KChambolle . The proximity operator
of δK (ξ) denoted by Proxg∗ is an orthogonal projector onto the convex set K. Likewise, q(u) is
identified as δS (u), where S := dom J ∗∗ . The proximity operator of δS (u) denoted by Proxq
is then an orthogonal projector onto the set S. See Algorithm 2 for the overall procedure. We
refer the readers to sections 5 and 6 in [33] for implementation details on such projectors.
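To make the roles of these two proximity operators concrete, the following MATLAB sketch spells out one pass of the inner primal-dual loop for the binary case (k = 2), in which Prox_{σg*} reduces to a pointwise projection of the dual field onto the ball of radius λ and Prox_{τq} reduces to clipping u to [0, 1]. The forward-difference discretization of L, the Neumann boundary handling, and the function names are assumptions of this sketch rather than details taken from the text above.

function [u, ubar, xi] = innerStep(u, ubar, xi, v, lambda, tau, sigma, theta)
% One inner iteration of Algorithm 2 for the binary case (k = 2), a sketch.
% u, ubar, v : M-by-N arrays; xi : M-by-N-by-2 dual field; lambda : TV weight.
    % Dual ascent, xi <- Prox_{sigma g*}(xi + sigma*L*ubar): pointwise
    % projection onto the ball of radius lambda (isotropic TV).
    xi  = xi + sigma * grad(ubar);
    nrm = max(1, sqrt(sum(xi.^2, 3)) / lambda);
    xi  = xi ./ repmat(nrm, [1 1 2]);
    % Primal descent, u <- Prox_{tau q}(u - tau*(v - L'*xi)): with L = grad
    % and L' = -div, this is clipping u - tau*(v + div(xi)) to [0, 1].
    u_old = u;
    u = min(max(u - tau * (v + div(xi)), 0), 1);
    % Over-relaxation step.
    ubar = u + theta * (u - u_old);
end

function g = grad(u)
% Forward differences with Neumann boundary conditions; M-by-N-by-2 output.
    g = cat(3, [diff(u,1,1); zeros(1,size(u,2))], ...
               [diff(u,1,2), zeros(size(u,1),1)]);
end

function d = div(xi)
% Discrete divergence, the negative adjoint of grad above.
    d = [xi(1,:,1); diff(xi(:,:,1),1,1)] + [xi(:,1,2), diff(xi(:,:,2),1,2)];
end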
Remark 3.3. Suppose that the concave part of a DC function is differentiable, that is,
∂h(u) is a singleton. Then, in the concave-convex procedure, h(u) is linearized as
(3.6)    -h(u) \le -h(u^n) - \langle \partial h(u^n), u - u^n \rangle, \qquad u^n \in H,
where ∂h(un ) is computed as in (3.5) for the problem (2.16). From an MM perspective, the
right-hand side is one of the majorizers that can be constructed in various ways. Clearly,
the linear one is the tightest convex upper bound for the concave function h. As previously
mentioned, one can find a quadratic upper bound in [28]; see Proposition 3 therein.
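For a scalar-valued (grey-level) image, the integral in (3.5) can be approximated on a discrete grey-level grid: estimate p̂_ℓ by a Gaussian-smoothed, u_ℓ-weighted histogram of the image values and then evaluate the convolution of K_H with log(1/p̂_ℓ) at each pixel value φ(x). The MATLAB sketch below is one such discretization; the grid size, the normalizations, and the small constant guarding the logarithm are choices made in the sketch, not prescribed by the paper.

function v = subgradientMI(phi, u, H)
% Sketch: discrete evaluation of (3.5) for a grey-level image phi in [0,1].
% phi, u : M-by-N arrays (image and relaxed region indicator); H : bandwidth.
    nbins = 256;
    t  = linspace(0, 1, nbins);                 % grey-level grid
    dt = t(2) - t(1);
    % u-weighted histogram of the image values on the grid.
    [~, bin] = histc(phi(:), [t, inf]);
    w = accumarray(bin, u(:), [nbins, 1]);
    % Parzen estimate: convolve with a Gaussian kernel K_H and normalize.
    s  = (-0.5:dt:0.5)';
    KH = exp(-0.5*(s/H).^2);  KH = KH / (sum(KH)*dt);
    p  = conv(w, KH, 'same');
    p  = p / (sum(p)*dt + eps);
    % v(x) = \int K_H(t - phi(x)) log(1/p(t)) dt, again via convolution.
    vGrid = conv(log(1 ./ (p + eps)), KH, 'same') * dt;
    v = reshape(interp1(t, vGrid, phi(:), 'linear', 'extrap'), size(phi));
end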
Algorithm 3. Stochastic primal-dual DC procedure
 1: Choose u^0 ∈ dom g, ξ^0 ∈ K, and θ ∈ [0, 1].
 2: Choose τ, σ > 0 such that τσ‖L‖² < 1.
 3: while not converged do
 4:     Extract pixels using random sampling.
 5:     v^{n_1} ∈ ∂h(u^{n_1}).
 6:     for n_2 = 1 : N_2 do
 7:         ξ^{n_2+1} = Prox_{σg*}(ξ^{n_2} + σL ū^{n_2}).
 8:         u^{n_2+1} = Prox_{τq}(u^{n_2} − τ(v^{n_1} − L*ξ^{n_2+1})).
 9:         ū^{n_2+1} = u^{n_2+1} + θ(u^{n_2+1} − u^{n_2}).
10:     end for
11:     u^{n_1} = u^{n_2+1}.
12: end while
As we have seen so far, the primal-dual DC procedure computes a minimizing sequence
that provably converges to a critical point. Other than that, there is no further optimality
guarantee—recall the procedure is solely based on the first-order methods. Needless to say,
the minimization problem (2.16) is nonconvex. The procedure is prone to getting stuck in bad
local minima—visually and energetically. Hence, we make some modifications to the primaldual DC procedure in such a way that it guides the path of a minimizing sequence {ut }t≥1
from an arbitrary initial point towards good local minima: In what follows, we suggest two
viable strategies.
3.2.1. Tuning the number of inner iterations. We fix the number of inner iterations of
the primal-dual DC procedure at a certain number rather than let the procedure compute the
optimal solution of each convex subproblem. In doing so, we guide the minimizing sequence
not to get easily stuck in bad local minima; see the for-loop in the sixth line of Algorithm 3
compared to the corresponding while-loop in Algorithm 2, where N2 is the number of inner
iterations. Indeed, the limit of such a minimizing sequence that consists of optimal solutions
of the consecutive convex subproblems is statistically likely to become far from near-optimal.
The reason for this is that optimal solutions of the first few convex subproblems, i.e., initial
partitions at an early stage, would make the kernel density estimation (3.5) prematurely
overfitted; we will experimentally demonstrate and analyze in more detail in section 4.3.
3.2.2. Incorporating random sampling. Another way to keep the minimizing sequence from getting stuck in bad local minima is to introduce some perturbation during the
optimization process. In so doing, we would make the sequence escape from undesirable points
toward a good local minimum—preferably global minimum—under the assumption that such
undesirable points are unstable with respect to a perturbation. To this end, we incorporate
the notion of stochastic optimization into the primal-dual DC procedure in terms of random
sampling, where such a perturbation is implemented by means of random sampling for kernel
density estimation. To be specific, at every outer loop (or for the first few) of the procedure
in Algorithm 2, a subset of R` is randomly extracted and p̂` (ϕ) is computed based on the
set for ` = 1, . . . , k. A simple implementation here for the random sampling is to use the
following modules in order: A random number generator between 0 and 1, the orthogonal
projector onto the standard k-simplex, and the first-max rounding scheme—see section 4.5.3
in [39], for example. Indeed, stochastic optimization plays a significant role in preventing a
minimizing sequence from converging to either a stationary point or a bad local minimum,
which will be demonstrated and analyzed in more detail in the following section. Having taken
into consideration the two strategies, we describe what we call the stochastic primal-dual DC
procedure in Algorithm 3. Note that the two strategies can be employed just for the first few
iterations to make sure the minimizing sequence converges.
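A minimal way to realize the sampling step of Algorithm 3 is to draw, at every outer iteration, a uniformly random subset of pixel indices and to restrict the kernel density estimation to that subset, leaving all other updates unchanged. The MATLAB sketch below illustrates this; the image size and the 1 percent sampling rate are placeholders, not values prescribed by the text.

% Sketch of the random-sampling step of Algorithm 3.
M = 256; N = 256; pct = 0.01;           % illustrative image size and sampling rate
npix  = M * N;
nkeep = max(1, round(pct * npix));
idx   = randperm(npix, nkeep);          % uniformly drawn pixel indices
mask  = false(M, N);
mask(idx) = true;                       % logical mask of the retained pixels
% The Parzen estimates p_l (and hence v in (3.5)) are then computed from the
% pixels with mask == true only, e.g., by passing phi(mask) and u_l(mask)
% to the density estimator instead of the full arrays.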
4. Experiments. In sections 4.1 and 4.2, we demonstrate the capacity of our approach by
showing fully unsupervised multiregion image partitions and compare the proposed method
with our preliminary work [28] for figure-ground segmentation. We refer the readers to
the work [28] for other compelling examples such as segmenting invisible texts from the
background—see Figure 6 therein. Compared to the work of Kim et al. [30], where the
energy functional is formulated by means of active contours and optimized by using curve
evolution, we also achieve a much better algorithm that finds near-optimal solutions. The
primary reason for this is that the update of the segmentation map in [30] happens only at
the vicinity of the current position of active contour, which limits the possible directions of
updates, whereas the formulation presented here removes this limitation entirely and makes
it possible to update the segmentation map toward arbitrary directions, enlarging the search
space enormously. Clearly, the proposed algorithm outperforms the optimization method in
[30]. Referring the readers to Figure 1 in [28] for comparison, we rather demonstrate in section
4.3 the effectiveness of the two strategies presented in the previous sections (sections 3.2.1 and
3.2.2). We apply the algorithm to fully unsupervised binary segmentation and analyze how
these two factors come into play during optimization. Note that for the entire experiments
reported in the paper, we set g = 1, i.e., isotropic total variation without weighting.
4.1. Unsupervised multiregion image partitioning. We compute information-theoretic
minimal partitions on a few images taken from the Berkeley Segmentation Dataset (BSD500)
as shown in Figure 4, where the image size is 321 × 481, by means of the proposed algorithm
(Algorithm 3 without random sampling) setting H = 0.08, λ = 0.1, σ = 0.5, τ = 0.25, and
N2 = 200. The cameraman image experiment shown in Figure 1 is also from Algorithm 3 with
random initialization; see Appendix B for implementation details. As a termination criterion,
we use the norm of the difference between the current and the previous relaxed solution, i.e.,
if the norm of this difference divided by the entire number of pixels falls below a certain
threshold, we regard the procedure as converged; we use $\|u^t - u^{t-1}\|_2 / |\Omega| < 10^{-8}$. The runtime
of the primal-dual nested loop (resp., the outer loop) is approximately 0.075 (resp., 0.57),
0.139 (resp., 0.86), 0.203 (resp., 1.15), 0.295 (resp., 1.43), and 0.383 (resp., 1.73) seconds per
iteration when k = 2, . . . , 6, respectively, with unoptimized MATLAB code. Here, we report
the total numbers of iterations required for convergence: for the zebra image (left, top), 1328
and 1906 for 2 and 3 regions, respectively; for the leopard image (right, top), 1580 and 1900
for 2 and 3 regions, respectively; for the fish image (left, second row), 1321 and 2912 for 2
and 3 regions, respectively; for the elephant image (right, second row), 1571 and 1514 for 2
and 3 regions, respectively; for the horse image (third row), 1287, 1932, 2608, 5656, and 4001
for 2 to 6 regions, respectively; and for the mountain image (bottom row), 1186, 1523, 1610,
4355, and 3977 for 2 to 6 regions, respectively.

Figure 4. Information-theoretic minimal partitions computed by the stochastic primal-dual DC procedure. [Panels: input images with their intensity histograms, the computed multiregion partitions (from 2 up to 6 regions), and the corresponding per-region histograms.]

Figure 5. Comparison with the work of Kee et al. [28] for binary segmentation. [Panels: the input image, the segmentations obtained by Kee et al. [28] and by the proposed method, and plots of the primal energy and of the runtime per iteration.]

The algorithm runs on an Intel Core i5 CPU
computer with 2.67 GHz and 8 GB of memory. We also compute the intensity histograms that
correspond to each partition as shown in the same figure, where one can see that unknown
latent histograms are well separated by minimizing the negative JS divergence with spatial
regularity. Note that the parameters can be tuned subjectively if prior information about
images to be segmented is available—semisupervised partitioning.
4.2. Comparison. We apply the proposed algorithm (Algorithm 3) to the zebra image
shown in Figure 5 for binary segmentation in order to compare with our preliminary work
[28], where the MM algorithm with a family of quadratic upper bounds is employed. Note that
the major difference between the current work and the work of Kee et al. [28] from an MM
perspective is the tightness of convex upper bounds: linear (3.6) and quadratic [28] bounds.
Based on the parameter settings in [28], we set N2 = 1, H = 0.08, λ = 0.05, σ = 0.5, and
τ = 0.25 for both cases; in addition, the random sampling procedure in Algorithm 3 is not
executed. As shown in Figure 5, the primal energy of the quadratic bound (red) decreases in
much the same way as that of the linear bound (blue), and the quality of segmentation is almost the same when
terminated. However, the linear majorizer drastically improves the runtime of the algorithm
(on average 0.26 seconds per iteration) compared to the quadratic one (0.64 seconds per
iteration). The total number of iterations taken to converge was 1500 for both cases. We also
observed that the convergence speed with and without random sampling (50 percent of the
entire pixels) for the linear bound (3.5) remains the same for this zebra image.
4.3. Effects of the number of inner iterations and random sampling.
4.3.1. Nonstochastic primal-dual DC programming. We compute information-theoretic
minimal partitions (binary segmentation) of a cameraman image as shown in Figure 6 with the
nonstochastic primal-dual DC procedure, i.e., the procedure makes use of the entire number
of pixels for density estimation (Algorithm 2 with the fixed number of iterations or Algorithm
3 without random sampling). Setting H = 0.1 and λ = 0.05, we run the procedure 500 times
with random initialization for each different number of inner iterations, 1, 5, 10, 50, 100,
200, 300, 400, and 500, respectively, i.e., 4500 times in total. As a termination criterion for
the nonstochastic procedure, we use the same criterion described in section 4.1. Throughout
the experiments, we observe that the procedure interestingly ends up with, roughly speaking,
either of the four partition classes—which we call P1 , P2 , P3 , and P4 , respectively—as shown
in the top row of Figure 6. The associated distributions are also computed and shown in the
second row of the same figure. Here, the solid line is from the region labeled as black (0),
and the dashed line is from the region labeled as white (1).

Figure 6. Nonstochastic primal-dual DC programming for a cameraman image—see section 4.3.1 for details. [Panels: the input image, a random initialization, and representative partitions of the classes P1–P4 with their associated distributions and pixelwise SD maps. The table embedded in the figure is reproduced below.]

# inner iterations | #(class P1) | #(class P2) | #(class P3) | #(class P4)
  1 | 490 |  10 | 0 | 0
  5 | 490 |  10 | 0 | 0
 10 | 482 |  18 | 0 | 0
 50 | 388 | 112 | 0 | 0
100 | 263 | 235 | 0 | 2
200 |  24 | 474 | 2 | 0
300 |  10 | 488 | 1 | 1
400 |  10 | 485 | 2 | 3
500 |   7 | 492 | 0 | 1
Maximum SD (minimum SD): class P1 0.0426 (0), class P2 0.0685 (0), class P3 0.0027 (0), class P4 0.036 (0).
Termination criterion: ||u^t − u^{t−1}||_2 / |Ω| < 1 × 10^{-8}.

Compared to a great deal of
visual difference that is clearly seen between the classes, the amount of difference within each
partition class is negligibly small. Indeed, their variations are rarely visible even along the
relaxed boundaries as shown in the third row from the top of Figure 6, where each pixel value
is supposed to represent a pixelwise standard deviation (SD), but the images are displayed in
such a way that the minimum value (SD) in each image is black, and the maximum value (SD)
is white. From the P1 class to P4 , the maximum (minimum) SDs are 0.0426(0), 0.0685(0),
0.0027(0), and 0.036(0), respectively. More profoundly, we observe from the table in Figure
6 that the number of partitions (binary segmentation results) falling into the P1 class is
monotonically increasing as the number of inner iterations of the procedure is decreasing from
500 to 1. To figure out what is happening, we need to classify such partition classes in terms of
not only their appearance but also energy, which may allow us to make use of a term “energy
band” to consider appearance and energy all together as shown in Table 2.

Table 2
The highest and lowest energy values of each energy band (class).

Class (energy band) | Lowest energy | Highest energy
P1 | −962.683 | −933.246
P2 | −922.233 | −275.763
P3 | −237.273 | −237.212
P4 | −233.898 | −233.307
Not surprisingly, partitions in the P1 energy band have the lowest energy among others,
whereas those in the P4 band have the highest energy. Indeed, it matches our expectations to
a considerable extent. Nonetheless, it is by no means certain that one could argue the
lowest energy partition in the P1 class is a global minimizer of the problem, since the
objective is not convex and the procedure is not a global search algorithm; rather, the best explanation
would be that such a partition is merely a local minimizer that has the lowest energy among the
observed solutions throughout the massive amount of experiments. Hence, the question is
how to escape from such a band, namely, P1 , toward another one, if any, which is located
below it—preferably near-optimal.
Before we turn our attention to this question in the following section, we make another
observation: We compute energy histograms of computed partitions according to the number
of inner iterations as shown in Figure 7, where the number of inner iterations is written right
on the bottom of each histogram. As can be seen, energy becomes concentrated as the number
of inner iterations decreases. Note that the “bandgap” between P1 and P2 is 11.013, P2 and
P3 is 38.49, and P3 and P4 is 3.314. The following sums up this section.
Observation 4.1. The fewer inner iterations the nonstochastic primal-dual DC procedure is
set to run with, the more likely it is to generate low energy partitions, which may hint that too
many inner iterations lead to bad local minima or bad critical points.
4.3.2. Stochastic primal-dual DC programming. Apart from the number of inner iterations, there is another tuning parameter—the number of pixels for density estimation. Again,
setting H = 0.1 and λ = 0.05 to make sure we optimize the same energy, we run the procedure this time using a subset of the whole pixels, which is randomly extracted from the
entire domain Ω. For the moment, suppose that the procedure is set to use 1 percent of the
total amount of pixels at each outer iteration, and we set the number of inner iterations to
100. Then, one can see that sequences of computed partitions are not likely to converge in
contrast to the previous case where the procedure makes use of all the available pixels. In
other words, such computed partitions that are consecutively generated by the procedure constantly change iteration by iteration even after a sufficiently large number of outer iterations.
This is because the current distribution estimates are not likely to be the same as
the previous ones, which are estimated based on statistically different sets of pixels, meaning
the energy landscape of the current objective is slightly different from the previous one.

Figure 7. Energy histograms for the partition classes, P1, P2, P3, and P4. The x-axis (energy) limits are set to range from −1000 to −200, and the y-axis (frequency) limits are set to range from 0 to 500 for the first six histograms from the top; for the rest (bottom) they are set to range from 0 to 50. Note that when the number of inner iterations is 1, its histogram is not distinguishable from that when the number of inner iterations is 5 in this scale, so we do not show it here. [Panels: histograms for 5, 10, 50, 100, 200, 300, 400, and 500 inner iterations.]

Yet,
such sequences of partitions drastically change neither geometrically nor energetically after
some outer iterations since a great amount of pixel values are likely to be similar then. In this
regard, the termination criterion we used in the previous section needs to be loose; otherwise
the procedure may not stop. To this end, we use

(4.1)    \operatorname{tol}(N, \epsilon) \overset{\mathrm{def}}{=} \frac{1}{\epsilon}\,\frac{1}{N} \sum_{\tau=1}^{N} \frac{\|u^{t-(\tau-1)} - u^{t-\tau}\|_2}{|\Omega|} < 1.

Here, we set N = 300 and ε = 5 × 10^{-7} for this particular case—when the number of inner
iterations is 100. With this stopping criterion (4.1), we now observe that the procedure is
capable of computing a lower energy partition as shown in Figure 8. Note that this result was
not seen in section 4.3.1, where all the pixels were used. In what follows, we elaborate our
new finding.
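A direct way to implement the criterion (4.1) is to keep the last N normalized increments in a circular buffer and to stop once their mean divided by ε falls below 1. The MATLAB sketch below does exactly this; the function handle performing one outer pass of Algorithm 3 and the use of numel(u) for |Ω| are assumptions of the sketch.

function u = runUntilTol(u, step, N, epsilon)
% Sketch: run the outer loop until the stopping rule (4.1) is met.
% step : function handle performing one outer pass of Algorithm 3 (assumed).
    buf = inf(N, 1);                    % last N normalized increments
    t   = 0;
    while true
        t = t + 1;
        u_prev = u;
        u = step(u);
        % ||u^t - u^{t-1}||_2 / |Omega|, with numel(u) standing in for |Omega|.
        buf(mod(t-1, N) + 1) = norm(u(:) - u_prev(:)) / numel(u);
        if t >= N && mean(buf) / epsilon < 1   % criterion (4.1)
            break
        end
    end
end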
Figure 8. With a subset of the entire pixels, the primal-dual DC procedure is capable of computing a lower energy partition (E = −970). See section 4.3.2 for details. [Top row: energy trajectories E_rel(u^t) over iterations; bottom row: the initialization and partitions with E = −713, −962, −967, and −970.]
Suppose that we have such a finite sequence of partitions $\{u^t\}_{t=1}^{6000}$ generated by the primal-dual DC procedure that makes use of 1 percent of the total amount of pixels at each outer
iteration for density estimation. Since Erel : dom J ∗∗ → (−∞, +∞], we can draw an energy
trajectory parameterized by iteration number on the energy landscape in dom J ∗∗ ×(−∞, +∞]
as shown in the top row of Figure 8, where the x-axis (iterations) limits are set to range from
200 to 6000, and the y-axis (Erel (ut )) limits are set to range from −950 to −550 for the left
one and from −975 to −940 for the rest. Note that the overall energy Erel (ut ) is evaluated
at each iteration using kernel density estimation that makes use of all the pixels of the
input image, even though only a fraction of them is actually used during optimization. What
we can see from the trajectory is roughly that Erel (ut ) is rapidly decreasing until the 700th
iteration; then from that point to the 1400th iteration ut seems to be wandering around a
sort of plateau struggling to decrease the energy toward a global minimum. At the 1400th
iteration, the procedure seems to have found a way to do so; then u^t escapes from this
high plateau and settles down on a low plateau from the 1450th iteration on. The bottom row
of Figure 8 shows five different partitions in energetically descending order, where the third
partition from the left (Erel = −962) is a partition that is typically seen in the high plateau
and the fifth partition (Erel = −970) is that seen in the low one. Again, ut wanders around
this low plateau seeking another way out to go into a possibly lower one, constantly—yet
slightly—changing its geometry and energy. However, such a region has not been found until
the procedure terminates (6000 iterations) as shown on the right in the top row of Figure 8.
Indeed, during the experiment, we observe that once a sequence of partitions escapes from
the high plateau and wanders around on the low plateau, it rarely jumps out of place to go
back to the high plateau—and even if so it immediately goes back to the low plateau, as seen
on the right (around the 2700th iteration) in the top row of Figure 8.
Now, we clearly have another partition class which we call P0 whose energy seems to be
distributed around −970; more importantly, such a class is not observable when the procedure
makes use of the total amount of pixels, namely, nonstochastic primal-dual DC procedure.
In order to figure out in what circumstances such partitions can be obtained, we run the
procedure with different size of sets of pixels randomly extracted from the entire domain Ω
and also with different numbers of inner iterations as shown in Table 3, from which the energy
band of the partition class, P0 , is specified as shown in Table 4. Clearly, the obtained partition
class, P0, is located below the P1 energy band and slightly overlaps it. This is
mainly because the termination criteria were not especially tight—
the stochastic procedure would not stop otherwise. The following observation is now
quite straightforward.
Observation 4.2. The fewer inner iterations and the fewer pixels—randomly extracted from
the entire domain Ω—the primal-dual DC procedure makes use of, the more likely it is,
statistically, to generate lower energy partitions.
Then again, it is by no means certain that such partitions in P0 become global minimizers
since the objective is nonconvex. Even so, nonconvexity notwithstanding, we are quite inclined
to believe that such partitions are much more than “mere local minimizers (or possibly critical
points)” considering the recent findings in statistical physics in regard to the nature of large-scale nonconvex energy landscapes and the nature of stochastic optimization. The gist of
the findings is as follows: in high-dimensional energy landscapes, saddle points are exponentially more numerous than local
minima. Most importantly, such local minima are
statistically likely to be around the global minima in such circumstances. In this light, as
long as numerical algorithms guarantee that computed solutions are local minimizers, we may
be able to say that they are near-optimal solutions. Indeed, the assertion can be confirmed
(somewhat arguably)—see [8, 23, 22] for example—in conjunction with stochastic gradient
descent. Although the models therein are quite different from what we consider in this paper,
we have already provided empirical evidence from the massive amount of experiments as shown
in Table 3. See also [18, 15], for example, and references therein.
Now it seems quite clear how such stochastic moves make it possible for a sequence of
partitions to escape from the high plateau with the smaller amount of pixels. However, it is still
unclear why quite a number of partitions still end up falling into the class P2—sometimes as
frequently as they fall into the class P0. To put it into perspective, we consider
another sequence of partitions $\{u^t\}_{t=1}^{6000}$ generated by the same procedure, but this time it
ends up falling into the class P2 as shown in Figure 9. Undoubtedly, the region where the
sequence is wandering around—approximately from the 500th iteration—is another plateau,
and such stochastic moves in this case are not likely to make the sequence escape from the
place once it gets trapped there. Indeed, we observed none of the sequences that once fall
into this “cursed” plateau gets out of the area toward a “blessed” plateau, in contrast to
what is happening in what we call the high plateau with the smaller amount of pixels. The
reason for this is possibly that such a plateau is surrounded by high hills like a basin, so that even
the stochastic DC procedure is very likely to end up generating a sequence of partitions
inside the plateau at each outer iteration.

Table 3
Comparison between nonstochastic and stochastic primal-dual DC programming for computing information-theoretic minimal partitions (when k = 2) of the cameraman image. As for the termination criteria for each case, setting N = 300, we fixed ε at 1.0 × 10^{-6} when % samples = 0.5; 1.5 × 10^{-6} (# inner iters. = 10), 1.0 × 10^{-6} (# inner iters. = 50), and 5.0 × 10^{-7} for the rest, when % samples = 1, 2, and 3; 1.0 × 10^{-6} (# inner iters. = 5) and 5.0 × 10^{-7} for the rest, when % samples = 5; 5.0 × 10^{-7} when % samples = 10.

% samples | # inner iters. | |class P0| | |class P1| | |class P2| | |class P3| | |class P4|
0.5 |  50 | 357 |   1 | 142 | 0 | 0
0.5 | 100 | 222 |  10 | 268 | 0 | 0
0.5 | 200 | 159 |  20 | 320 | 1 | 0
1   |  10 | 402 |  53 |  45 | 0 | 0
1   |  50 | 292 |  52 | 156 | 0 | 0
1   | 100 | 281 |  15 | 204 | 0 | 0
1   | 200 | 141 |  23 | 332 | 1 | 3
1   | 300 |  35 |  14 | 447 | 4 | 0
1   | 400 |  31 |  13 | 452 | 2 | 2
1   | 500 |  19 |  11 | 465 | 5 | 0
2   |  10 |   0 | 455 |  45 | 0 | 0
2   |  50 |  71 | 275 | 154 | 0 | 0
2   | 100 | 104 | 145 | 248 | 1 | 2
2   | 200 |  53 |  88 | 357 | 0 | 2
2   | 300 |  17 |  34 | 444 | 2 | 3
2   | 400 |  21 |  31 | 442 | 3 | 3
2   | 500 |   7 |  13 | 474 | 5 | 1
3   |  10 |   0 | 465 |  35 | 0 | 0
3   |  50 |   3 | 362 | 135 | 0 | 0
3   | 100 |  26 | 246 | 225 | 0 | 3
3   | 200 |  29 | 104 | 363 | 0 | 4
3   | 300 |   4 |  33 | 461 | 0 | 2
3   | 400 |   7 |  51 | 433 | 2 | 7
3   | 500 |   6 |  16 | 476 | 2 | 0
5   |   5 |   0 | 480 |  20 | 0 | 0
5   |  10 |   0 | 476 |  24 | 0 | 0
5   |  50 |   0 | 381 | 119 | 0 | 0
5   | 100 |   2 | 277 | 221 | 0 | 0
5   | 200 |   0 | 104 | 389 | 3 | 4
5   | 300 |   1 |  30 | 463 | 2 | 4
5   | 400 |   2 |  35 | 458 | 3 | 2
5   | 500 |   0 |   9 | 490 | 0 | 1
10  |   5 |   0 | 485 |  15 | 0 | 0
10  |  10 |   0 | 484 |  16 | 0 | 0
10  |  50 |   0 | 373 | 126 | 0 | 1
10  | 100 |   0 | 269 | 228 | 0 | 3

Table 4
The highest and lowest energy values of the P0 partition class.

Class (energy band) | Lowest energy | Highest energy
P0 | −975.246 | −969.479

Figure 9. With a subset of all the pixels, the stochastic primal-dual DC procedure may end up with a bad local minimum (E = −924). See section 4.3.2 for details. [Top row: energy trajectories E_rel(u^t) over iterations; bottom row: the initialization and partitions with E = −646, −810, −913, and −924.]

For future work, a rigorous analysis should take
place to investigate this phenomenon in nonconvex optimization.
5. Conclusion. We have considered what we call the information-theoretic minimal partitioning problem by providing a rigorous mathematical treatment, and we proposed a sequential
convex programming approach for large-scale DC optimization problems in which a
linear operator is involved. We have demonstrated the capacity of our approach in numerous
experiments—also referring the reader to our preliminary conference paper [28]—that show
convincing fully unsupervised image partitioning in terms of both segmentation and robustness
to initialization. In addition, we have thoroughly examined the behavior of the proposed sequential convex programming for nonconvex nonsmooth optimization. Although the proposed
approach does not guarantee global optimality of computed solutions, we have experimentally
demonstrated that the stochastic primal-dual DC procedure is capable of finding good local
minima which we believe are near-optimal in the context of the recent findings in statistical
physics. Indeed, this is one of the significant contributions of the paper in energy minimization
methods in computer vision, especially continuous optimization. Therefore, we believe that
the framework described here would open up new possibilities to tackle large-scale nonconvex
objectives in not only computer vision but also machine learning.
Appendix A. Proof of the claim in Theorem 2.3. We prove the following claim:
\forall \epsilon > 0,\ \exists \delta > 0 \text{ such that } \|u_\ell - u_\ell^\star\|_{L^1(\Omega;\{0,1\})} < \delta \implies \|\hat{p}_{\Phi(X)|u_\ell(X)=1} - \hat{p}_{\Phi(X)|u_\ell^\star(X)=1}\|_{TV} < \epsilon,
where || · ||T V is the total variation distance (variational distance) of probability measures. To
this end, we make the following observations.
Observation A.1.
\int_{\mathbb{R}^n} \left| \int_\Omega K_H(\varphi - \phi(x)) u_\ell(x)\,dx - \int_\Omega K_H(\varphi - \phi(x)) u_\ell^\star(x)\,dx \right| d\varphi
\le \int_{\mathbb{R}^n} \int_\Omega \left| K_H(\varphi - \phi(x)) \big(u_\ell(x) - u_\ell^\star(x)\big) \right| dx\,d\varphi
\le \int_\Omega |u_\ell(x) - u_\ell^\star(x)|\,dx = \|u_\ell - u_\ell^\star\|_{L^1}.
Observation A.2.
\int_{\mathbb{R}^n} \int_\Omega K_H(\varphi - \phi(x)) u_\ell(x)\,dx\,d\varphi = \int_\Omega u_\ell(x)\,dx.
Observation A.3.
\left| \int_\Omega u_\ell(x)\,dx - \int_\Omega u_\ell^\star(x)\,dx \right| \le \int_\Omega |u_\ell(x) - u_\ell^\star(x)|\,dx = \|u_\ell - u_\ell^\star\|_{L^1}.
From the above observations, we prove the claim as follows:
\begin{aligned}
&\|\hat{p}_{\Phi(X)|u_\ell(X)=1} - \hat{p}_{\Phi(X)|u_\ell^\star(X)=1}\|_{TV} \\
&\quad= \frac{1}{2} \int_{\mathbb{R}^n} \left| \frac{\int_\Omega K_H(\varphi - \phi(x)) u_\ell(x)\,dx}{\int_\Omega u_\ell(x)\,dx} - \frac{\int_\Omega K_H(\varphi - \phi(x)) u_\ell^\star(x)\,dx}{\int_\Omega u_\ell^\star(x)\,dx} \right| d\varphi \\
&\quad= \int_{\mathbb{R}^n} \frac{\left| \int_\Omega K_H(\varphi - \phi(x)) u_\ell(x)\,dx \int_\Omega u_\ell^\star(x)\,dx - \int_\Omega K_H(\varphi - \phi(x)) u_\ell^\star(x)\,dx \int_\Omega u_\ell(x)\,dx \right|}{2 \int_\Omega u_\ell(x)\,dx \int_\Omega u_\ell^\star(x)\,dx}\, d\varphi \\
&\quad\le \int_{\mathbb{R}^n} \frac{\left| \int_\Omega K_H(\varphi - \phi(x)) u_\ell(x)\,dx \int_\Omega u_\ell^\star(x)\,dx - \int_\Omega K_H(\varphi - \phi(x)) u_\ell^\star(x)\,dx \int_\Omega u_\ell^\star(x)\,dx \right|}{2 \int_\Omega u_\ell(x)\,dx \int_\Omega u_\ell^\star(x)\,dx}\, d\varphi \\
&\qquad+ \int_{\mathbb{R}^n} \frac{\left| \int_\Omega K_H(\varphi - \phi(x)) u_\ell^\star(x)\,dx \int_\Omega u_\ell^\star(x)\,dx - \int_\Omega K_H(\varphi - \phi(x)) u_\ell^\star(x)\,dx \int_\Omega u_\ell(x)\,dx \right|}{2 \int_\Omega u_\ell(x)\,dx \int_\Omega u_\ell^\star(x)\,dx}\, d\varphi \\
&\quad= \int_{\mathbb{R}^n} \frac{\left| \int_\Omega K_H(\varphi - \phi(x)) u_\ell(x)\,dx - \int_\Omega K_H(\varphi - \phi(x)) u_\ell^\star(x)\,dx \right|}{2 \int_\Omega u_\ell(x)\,dx}\, d\varphi \\
&\qquad+ \int_{\mathbb{R}^n} \frac{\int_\Omega K_H(\varphi - \phi(x)) u_\ell^\star(x)\,dx \left| \int_\Omega u_\ell^\star(x)\,dx - \int_\Omega u_\ell(x)\,dx \right|}{2 \int_\Omega u_\ell(x)\,dx \int_\Omega u_\ell^\star(x)\,dx}\, d\varphi \\
&\quad\le \frac{\|u_\ell - u_\ell^\star\|_{L^1(\Omega;\{0,1\})}}{\int_\Omega u_\ell(x)\,dx} \le \frac{2\,\|u_\ell - u_\ell^\star\|_{L^1(\Omega;\{0,1\})}}{\int_\Omega u_\ell(x)\,dx} < \epsilon,
\end{aligned}
where the first inequality is from the triangle inequality, and the second inequality is from the
observations. Clearly, the fact that $\|u_\ell - u_\ell^\star\|_{L^1(\Omega;\{0,1\})} < \frac{\epsilon \int_\Omega u_\ell(x)\,dx}{2}$ implies $\|\hat{p}_{\Phi(X)|u_\ell(X)=1} - \hat{p}_{\Phi(X)|u_\ell^\star(X)=1}\|_{TV} < \epsilon$. Hence, for all $\epsilon > 0$, $\delta := \frac{\epsilon \int_\Omega u_\ell(x)\,dx}{2}$ is the number fulfilling the claim.
Appendix B. A MATLAB implementation for random initialization. We provide a simple MATLAB program for random initialization as follows:
function [u] = randInitial(M,N,nL,rounding)
% M        = # rows in an input image
% N        = # columns in an input image
% nL       = # labels
% rounding = rounding option: either 0 or 1

% Draw, at every pixel, a random point on the standard nL-simplex by
% normalizing i.i.d. uniform samples.
tmp  = rand(M,N,nL);
tmp_ = repmat(sum(tmp,3), [1 1 nL]);
u    = tmp./tmp_;

if rounding == 1
    % First-max rounding: assign each pixel to its largest component.
    [~,idx] = max(u,[],3);
    v = zeros(M,N,nL);
    for k = 1 : nL
        tmp = zeros(M,N);
        tmp(idx == k) = 1;
        v(:,:,k) = tmp;
    end
    u = v;
end
end
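As a usage example, a random initialization for one of the 321 × 481 BSD images used in section 4.1 could be generated as follows; the choice of the relaxed option (rounding = 0) versus the rounded one is a choice of this note, not a setting stated above.

% Relaxed random initialization for a 321-by-481 image with 3 labels.
u0 = randInitial(321, 481, 3, 0);
% With rounding = 1, the first-max scheme returns a binary (hard) labeling.
u1 = randInitial(321, 481, 3, 1);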
Acknowledgments. The authors thank the anonymous reviewers for their insightful comments and the Cornell MRI Research Lab for providing computational resources in revising
the manuscript.
REFERENCES
[1] I. Ahmad and P.-E. Lin, A nonparametric estimation of the entropy for absolutely continuous distributions (corresp.), IEEE Trans. Inform. Theory, 22 (1976), pp. 372–375, https://doi.org/10.1109/TIT.
1976.1055550.
[2] G. Alberti, G. Bouchitté, and G. Dal Maso, The calibration method for the Mumford-Shah
functional and free-discontinuity problems, Calc. Var. Partial Differential Equations, 16 (2003),
pp. 299–333.
[3] L. Ambrosio, N. Fusco, and D. Pallara, Functions of Bounded Variation and Free Discontinuity
Problems, Clarendon Press, Oxford, UK, 2000.
[4] L. T. H. An and P. D. Tao, The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems, Ann. Oper. Res., 133 (2005),
pp. 23–46.
[5] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert
Spaces, Springer, New York, 2011.
[6] J. Beirlant, E. J. Dudewicz, L. Györfi, and E. C. van der Meulen, Nonparametric entropy
estimation: An overview, Int. J. Math. Stat. Sci., 6 (1997), pp. 17–40.
[7] Y. Boykov, O. Veksler, and R. Zabih, Fast approximate energy minimization via graph cuts, IEEE
Trans. Pattern Anal. Machine Intelligence, 23 (2001), pp. 1222–1239.
[8] A. J. Bray and D. S. Dean, Statistics of critical points of Gaussian fields on large-dimensional spaces,
Phys. Rev. Lett., 98 (2007), 150201.
[9] E. S. Brown, T. F. Chan, and X. Bresson, Completely convex formulation of the Chan-Vese image
segmentation model, Int. J. Comput. Vis., 98 (2012), pp. 103–121.
[10] V. Caselles, R. Kimmel, and G. Sapiro, Geodesic active contours, Int. J. Comput. Vis., 22 (1997),
pp. 61–79.
[11] A. Chambolle, D. Cremers, and T. Pock, A convex approach to minimal partitions, SIAM J. Imaging
Sci., 5 (2012), pp. 1113–1158.
[12] A. Chambolle and T. Pock, A first-order primal-dual algorithm for convex problems with applications to imaging, J. Math. Imaging Vis., 40 (2011), pp. 120–145, https://doi.org/10.1007/
s10851-010-0251-1.
[13] T. F. Chan, S. Esedoglu, and M. Nikolova, Algorithms for finding global minimizers of image
segmentation and denoising models, SIAM J. Appl. Math., 66 (2006), pp. 1632–1648.
[14] T. F. Chan, L. Vese, et al., Active contours without edges, IEEE Trans. Image Process., 10 (2001),
pp. 266–277.
[15] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun, The Loss Surface of
Multilayer Networks, arXiv:1412.0233, 2014.
[16] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, 2006.
[17] D. Cremers, T. Pock, K. Kolev, and A. Chambolle, Convex Relaxation Techniques for Segmentation, Stereo and Multiview Reconstruction, in Markov Random Fields for Vision and Image Processing,
MIT Press, Cambridge, MA, 2011.
[18] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, Identifying
and attacking the saddle point problem in high-dimensional non-convex optimization, in Advances in
Neural Information Processing Systems, MIT Press, Cambridge, MA, 2014, pp. 2933–2941.
[19] T. P. Dinh and H. A. Le Thi, Recent advances in DC programming and DCA, in Transactions on
Computational Intelligence XIII, Springer, New York, 2014, pp. 1–37.
[20] Y. G. Dmitriev and F. P. Tarasenko, On the estimation of functionals of the probability density and
its derivatives, Theory Probab. Appl., 18 (1974), pp. 628–633.
[21] H. Federer, Geometric Measure Theory, Springer, New York, 1969.
[22] Y. V. Fyodorov and C. Nadal, Critical behavior of the number of minima of a random landscape at
the glass transition point and the Tracy-Widom distribution, Phys. Rev. Lett., 109 (2012), 167203.
[23] Y. V. Fyodorov and I. Williams, Replica symmetry breaking condition exposed by random matrix
calculation of landscape complexity, J. Statist. Phys., 129 (2007), pp. 1081–1116.
[24] T. Goldstein and S. Osher, The split Bregman method for l1-regularized problems, SIAM J. Imaging
Sci., 2 (2009), pp. 323–343.
[25] A. Herbulot, S. Jehan-Besson, S. Duffner, M. Barlaud, and G. Aubert, Segmentation of vectorial image features using shape gradients and information measures, J. Math. Imaging Vision, 25
(2006), pp. 365–386.
[26] R. Horst and N. V. Thoai, DC programming: Overview, J. Optim. Theory Appl., 103 (1999), pp. 1–43.
[27] D. R. Hunter and K. Lange, A tutorial on MM algorithms, Amer. Statist., 58 (2004), pp. 30–37.
[28] Y. Kee, M. Souiai, D. Cremers, and J. Kim, Sequential convex relaxation for mutual information-based unsupervised figure-ground segmentation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[29] J. Kim, J. W. Fisher III, A. Yezzi, M. Çetin, and A. S. Willsky, Nonparametric methods for image
segmentation using information theory and curve evolution, in Proceedings of the IEEE Conference
on Image Processing, Vol. 3, IEEE, New York, 2002, pp. 797–800.
[30] J. Kim, J. W. Fisher III, A. Yezzi, M. Çetin, and A. S. Willsky, A nonparametric statistical method
for image segmentation using information theory and curve evolution, IEEE Trans. Image Process.,
14 (2005), pp. 1486–1502.
[31] N. Komodakis and J.-C. Pesquet, Playing with Duality: An Overview of Recent Primal-Dual Approaches for Solving Large-Scale Optimization Problems, arXiv:1406.5429, 2014.
[32] G. R. Lanckriet and B. K. Sriperumbudur, On the convergence of the concave-convex procedure, in Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA, 2009,
pp. 1759–1767.
[33] J. Lellmann and C. Schnörr, Continuous multiclass labeling approaches and algorithms, SIAM J.
Imaging Sci., 4 (2011), pp. 1049–1096.
[34] J. Lin, Divergence measures based on the Shannon entropy, IEEE Trans. Inform. Theory, 37 (1991),
pp. 145–151.
[35] O. Michailovich, Y. Rathi, and A. Tannenbaum, Image segmentation using active contours driven
by the Bhattacharyya gradient flow, IEEE Trans. Imaging Process., 16 (2007), pp. 2787–2801.
[36] D. Mumford and J. Shah, Optimal approximations by piecewise smooth functions and associated variational problems, Comm. Pure Appl. Math., 42 (1989), pp. 577–685.
[37] K. P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, Cambridge, MA, 2012.
[38] K. Ni, X. Bresson, T. F. Chan, and S. Esedoglu, Local histogram based segmentation using the
Wasserstein distance, Int. J. Comput. Vis., 84 (2009), pp. 97–111.
[39] C. Nieuwenhuis, E. Töppe, and D. Cremers, A survey and comparison of discrete and continuous
multi-label optimization approaches for the Potts model, Int. J. Comput. Vis., 104 (2013), pp. 223–240.
[40] S. Osher and J. A. Sethian, Fronts propagating with curvature-dependent speed: Algorithms based
on Hamilton-Jacobi formulations, J. Comput. Phys., 79 (1988), pp. 12–49, https://doi.org/10.1016/
0021-9991(88)90002-2.
[41] E. Parzen, On estimation of a probability density function and mode, Ann. Math. Statist., 33 (1962),
pp. 1065–1076.
[42] T. Pock, D. Cremers, H. Bischof, and A. Chambolle, An algorithm for minimizing the piecewise
smooth Mumford-Shah functional, in Proceedings of ICCV, 2009.
[43] K. Punithakumar, J. Yuan, I. B. Ayed, S. Li, and Y. Boykov, A convex max-flow approach to
distribution-based figure-ground separation, SIAM J. Imaging Sci., 5 (2012), pp. 1333–1354.
[44] J. Rabin and N. Papadakis, Convex Color Image Segmentation with Optimal Transport Distances, in
Scale Space and Variational Methods in Computer Vision, Springer, New York, 2015, pp. 256–269.
[45] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren Math. Wiss. 317, Springer,
Berlin, 1998.
[46] M. Rosenblatt, Remarks on some nonparametric estimates of a density function, Ann. Math. Statist.,
27 (1956), pp. 832–837.
[47] J. S. Simonoff, Smoothing Methods in Statistics, Springer, New York, 1996.
[48] E. M. Stein and R. Shakarchi, Real Analysis: Measure Theory, Integration, and Hilbert Spaces,
Princeton University Press, Princeton, NJ, 2009.
[49] P. Strandmark, F. Kahl, and N. C. Overgaard, Optimizing parametric total variation models, in
Proceedings of the 12th International Conference on Computer Vision, IEEE, 2009, pp. 2240–2247.
[50] E. Strekalovskiy, A. Chambolle, and D. Cremers, Convex relaxation of vectorial problems with
coupled regularization, SIAM J. Imaging Sci., 7 (2014), pp. 294–336.
[51] P. D. Tao and L. T. H. An, Convex analysis approach to DC programming: Theory, algorithms and
applications, Acta Math. Vietnam., 22 (1997), pp. 289–355.
[52] P. D. Tao and L. T. H. An, A DC optimization algorithm for solving the trust-region subproblem, SIAM
J. Optim., 8 (1998), pp. 476–505.
[53] L. A. Vese and T. F. Chan, A multiphase level set framework for image segmentation using the Mumford
and Shah model, Int. J. Comput. Vis., 50 (2002), pp. 271–293.
[54] R. W. Yeung, Information Theory and Network Coding, Springer, New York, 2008.
[55] R. Yıldızoğlu, J.-F. Aujol, and N. Papadakis, A convex formulation for global histogram based
binary segmentation, in Energy Minimization Methods in Computer Vision and Pattern Recognition,
Springer, New York, 2013, pp. 335–349.
[56] A. L. Yuille and A. Rangarajan, The concave-convex procedure, Neural Comput., 15 (2003),
pp. 915–936.
[57] C. Zach, D. Gallup, J.-M. Frahm, and M. Niethammer, Fast global labeling for real-time stereo
using multiple plane sweeps, in Proceedings of VMV, 2008.