Int. J. Inf. Secur.
DOI 10.1007/s10207-017-0392-y
REGULAR CONTRIBUTION
Optimization algorithm for k-anonymization of datasets with low
information loss
Keisuke Murakami1 · Takeaki Uno2
© Springer-Verlag GmbH Germany 2017
Abstract Anonymization is the modification of data to
mask the correspondence between a person and sensitive
information in the data. Several anonymization models such
as k-anonymity have been intensively studied. Recently, a
new model with less information loss than existing models
was proposed; this is a type of non-homogeneous generalization. In this paper, we present an alternative anonymization
algorithm that further reduces the information loss using
optimization techniques. We also prove that whether a modified
dataset satisfies k-anonymity can be checked by
a polynomial-time algorithm. Computational experiments
were conducted and demonstrated the efficiency of our algorithm even on large datasets.
Keywords Security and protection · Database models ·
K-anonymity · Large-scale dataset · Graph problem ·
Optimization
1 Introduction
Recent progress in information technology has enabled the
easy collection of large datasets containing information concerning the lives and personalities of individuals. These
individual data, often called microdata [1] (Table 1), are
useful for demographic analysis and similar tasks but have
Keisuke Murakami (corresponding author)
k_mura@kansai-u.ac.jp
Takeaki Uno
uno@nii.jp
1 Kansai University, 3-3-35 Yamate, Suita, Osaka 564-8680, Japan
2 National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
a significant risk of being unintentionally disclosed. Simply removing the attributes identifying individuals from
microdata cannot prevent this threat because individuals can
be identified by other attributes. In general, datasets are
classified as “public” or “private”. Public datasets contain
individual-identification attribute(s) such as names, whereas
private datasets contain sensitive individual information that
should be dissociated from specific persons, such as income.
Examples of public and private datasets are listed in Table 2
and 3, respectively. When data of both types are disclosed,
privacy cannot be protected because each individual is linked
with his or her personal information. For example, if
Tables 2 and 3 are published, each name could be linked to
an income, even though there are no individual-identifying attributes
in Table 3 (e.g., an attacker would know that Alan's income
is 10,000 because each table associates a single individual with Zip code = 94221, Gender = M and Country = US).
Therefore, to protect individuals' privacy, it is necessary to
de-identify microdata before disclosing them. In this paper,
it is assumed that any user can know the public dataset.
A commonly used de-identification strategy is k-anonymity
[2], which requires that the probability of an attacker identifying an anonymized record be at most 1/k [6,8,10]. In
k-anonymity, a dataset contains three types of attributes.
Identifiers (IDs): These attributes identify individual
records. IDs must be removed from the microdata in the private dataset. Examples of IDs are passport number, social
security number, fingerprint information, and name.
Quasi-identifiers (QIDs): These attributes can link records
in public datasets with those in private datasets. Linking a
QID to an individual poses no problem. Examples of QIDs
are zip code, gender, race, and birth date.
Sensitive attributes (SAs): These attributes should remain
strictly unassociated with specific persons. Examples of SAs
are disease, income, and criminal offense.
Table 1 Microdata

Name        Zip code   Gender   Country   Income
Alan        94221      M        US        10,000
Betiina     94112      F        US        5000
Christina   94121      F        US        1500
Devola      94111      M        Canada    3000
Edmond      94222      M        Canada    30,000
Flora       94122      F        UK        20,000
Georgia     93111      M        Canada    40,000
Table 2 Public dataset

Record ID   Name        Zip code   Gender   Country
r1          Alan        94221      M        US
r2          Betiina     94112      F        US
r3          Christina   94121      F        US
r4          Devola      94111      M        Canada
r5          Edmond      94222      M        Canada
r6          Flora       94122      F        UK
r7          Georgia     93111      M        Canada
Table 3 Private dataset

Record ID   Zip code   Gender   Country   Income
r1          94221      M        US        10,000
r2          94112      F        US        5000
r3          94121      F        US        1500
r4          94111      M        Canada    3000
r5          94222      M        Canada    30,000
r6          94122      F        UK        20,000
r7          93111      M        Canada    40,000
Public and private datasets contain IDs and SAs, respectively, and both datasets contain QIDs. K-anonymity edits
some of the QIDs in private datasets.
K-anonymity is commonly achieved by generalization [3,
4], which represents the QIDs as hierarchies. The precision
of the values is reduced as one moves up the hierarchy. For
example, Fig. 1a illustrates a hierarchical grouping of "Country". The cardinality of a QID value is defined as the number
of its descendant leaves. For example, in Fig. 1a, the cardinalities of "America" and "Country" are 2 and 7, respectively.
At the top of the hierarchy, the QID value is fully concealed.
The method of concealing the QID values as one moves up
the hierarchy is demonstrated in Fig. 1b [15]. To simplify the
explanation, k-anonymity is hereafter discussed in terms of
the generalization of concealing the QID values. However,
our approach can easily be extended to the general case in which the hierarchies are given (as described in
Sect. 3.2.4).
1.1 Existing models and our contribution
There exist two models of k-anonymity: the global-recoding
model [5–8] and the local-recoding model [12,14].
The global-recoding model generalizes the same sites in all
records. An example is presented in Table 4, where the generalized values are indicated by asterisks (*). When Tables 2
and 4 are visible, record r1 of Table 2 can correspond to either r1
or r5 of Table 4. Similarly, each record in Table 2 can be identified
with at least two records in Table 4. Thus, Table 4 satisfies 2-anonymity. One of the most fundamental algorithms
for implementing the global-recoding model is the incognito
algorithm [9], which is based on a hill-descending approach.
Initially, the anonymization is fully generalized; that is, all
records have the same attribute values. The attributes are then
specialized one by one, until further specialization would
violate the k-anonymity. This approach yields a minimal
k-anonymization. All minimal k-anonymizations can be enumerated by backtracking searches.
The local-recoding model determines the generalized sites
in each record. Table 5 shows an example of local-recoding
with 2-anonymity. Note that Table 5 uses 15 asterisks,
while the global-recoding model (Table 4) uses 28 asterisks.
Clearly, the local-recoding model preserves more information than the global-recoding model. The local-recoding
model is therefore applied in this paper. If a QID cannot
be partially hidden, such as Gender and Country, its value
will be either wholly hidden or wholly retained.
Fig. 1 Hierarchy for country. a Standard generalization, b generalization of concealing the QID values
Table 4 Global-recoding

Record ID   Zip code   Gender   Country   Income
r1          9*2**      M        *         10,000
r2          9*1**      F        *         5000
r3          9*1**      F        *         1500
r4          9*1**      M        *         3000
r5          9*2**      M        *         30,000
r6          9*1**      F        *         20,000
r7          9*1**      M        *         40,000

Table 6 Non-homogeneous

Record ID   Zip code   Gender   Country   Income
r1          9422*      M        *         10,000
r2          941*2      F        *         5000
r3          941*1      F        US        1500
r4          9*111      M        Canada    3000
r5          9422*      M        *         30,000
r6          941**      F        *         20,000
r7          9*111      M        Canada    40,000
Table 5 Local-recoding

Record ID   Zip code   Gender   Country   Income
r1          9422*      M        *         10,000
r2          941**      F        *         5000
r3          941**      F        *         1500
r4          9*111      M        Canada    3000
r5          9422*      M        *         30,000
r6          941**      F        *         20,000
r7          9*111      M        Canada    40,000
On the other hand, if the QID can be partially hidden, such as a Zip code,
each digit of the QID is regarded as a pseudo-QID, and whether
to hide each pseudo-QID is decided independently in order to reduce the
information loss (if the values of Zip code were always wholly hidden,
the information loss would increase). For simplicity, pseudo-QIDs
are hereafter referred to as QIDs. Table 3 thus contains seven
QIDs, because each Zip code contains five pseudo-QIDs.
In most previous studies of local-recoding models, k-anonymity is considered to be satisfied when each record in
a dataset shares its (generalized) QID values with at least k − 1 other records; Table 5 is an example, and under this view it is only necessary to consider the private
dataset, not the public dataset. This approach is called the
homogeneous approach [11,12]: the records are first partitioned into groups whose QIDs are generalized to identical values. However, "the same" generalized QID is not
necessarily required for k-anonymity. Wong et al. proposed a
new approach, called non-homogeneous, which can achieve
k-anonymity without identical generalized QIDs [16] (see
Table 6). In general, the non-homogeneous approach incurs
less information loss than the homogeneous approach. For
example, the number of asterisks in Table 6 is reduced to 12.
Meanwhile, the non-homogeneous approach is not as easy to
apply as the homogeneous one, because the generalized sites
(QIDs) must be determined for each record. To overcome
this difficulty, Wong et al. proposed a simple algorithm that
first divides the records into partitions sized between k and
2k, and then randomly generalizes the QIDs of each record in each
partition. However, while Wong et al.'s algorithm is very fast,
the small partitions and the random generalization might incur
high information loss in the anonymized dataset.
The present study focuses on large databases (of the order
of 10 million records), whereas previous studies on local-recoding models have handled at most hundreds of thousands
of records. Thus, the anonymization time could become problematic. However, because each
database requires only one anonymization, longer execution
times (about one day) are considered practical. It is also
assumed that an anonymized database is useful if it
has a moderately sized k. A large k value usually means high
information loss in an anonymized database, which renders
the database useless for analysis.
Our proposed optimization algorithm for k-anonymity
incorporates the non-homogeneous approach and techniques
for graph algorithms. Its main aim is to minimize the information loss in the k-anonymity while maintaining a practical
execution time. In this algorithm, the information loss is
prioritized over the computation time. It is expected that
this algorithm will retain information better than Wong et
al.’s algorithm, at the expense of longer computational time.
Nevertheless, the algorithm finishes within a few hours. Optimization algorithms are not generally applicable to large
datasets as they are, because the size of the dataset can exceed the
memory capacity of a computer. Even when implemented
on a high-memory computer, a polynomial-time algorithm
may not obtain a solution within a practical time. Therefore, we effectively reduce the size of the problem before
solving it: some unpromising variables in
the optimization problem are excluded, and the
database is further partitioned after applying suitable sorting techniques (these details are presented in Sects. 3.2.1 and 3.2.3,
respectively).
The rest of this paper is organized as follows. Section 2
models the k-anonymity problem using a bipartite graph, and
Sect. 3 formulates it as an optimization problem and proposes
a solution algorithm. In Sect. 4, our algorithm is compared
with the existing method for large-scale datasets. Section 5
concludes the paper.
Fig. 2 Anonymity graph
Fig. 3 Biased anonymity graph
2 Graph expression of anonymization
Consider a public dataset T′ (Table 2), which has already
been published, and a private dataset T (Table 3). Both datasets
contain QIDs, but whereas T′ contains ID(s), the private
dataset T contains SAs. Some or all of the QIDs are common
to both datasets. Suppose that it is also desired to release T with privacy protection. If the private dataset T were published without modification, individuals might be identified
through the common QIDs, which relate T to T′ as described in
the previous section. Hence, to protect individual privacy, it
is necessary to anonymize dataset T, that is, to preserve its
k-anonymity.
Consider a bipartite graph with two vertex sets VT′ and VT representing the sets of records in T′ and T, respectively [17–
19]. The vertices ri ∈ VT′ and rj ∈ VT are connected if rj
can correspond to ri, that is, if the h-th QID of rj is hidden
(i.e., replaced by an asterisk (*)) or identical to that of ri for
all h. An example is shown in Fig. 2. In Tables 2 and 6, r2
is adjacent to r2 and r6. In other words, if vertex ri can be
adjacent to rj, the QID values of record rj that differ from those of ri are hidden. Such a bipartite graph is called an "anonymity
graph".
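As a concrete illustration, the adjacency test can be sketched as follows (a minimal Python sketch; the flatten/consistent helpers and the per-digit treatment of the Zip code encode the pseudo-QIDs of Sect. 1 and are our own names, not code from the paper):

```python
def flatten(rec):
    """Split the Zip code into per-digit pseudo-QIDs (Sect. 1)."""
    zip_code, gender, country = rec
    return tuple(zip_code) + (gender, country)

def consistent(public_rec, private_rec):
    """True if the (generalized) private record can correspond to the
    public record: every pseudo-QID is hidden ('*') or identical."""
    return all(p == "*" or p == q for q, p in zip(public_rec, private_rec))

def anonymity_graph(public, private):
    """For each private record, list the public records it is consistent
    with (the adjacency of the anonymity graph)."""
    return {j: [i for i, q in enumerate(public) if consistent(q, p)]
            for j, p in enumerate(private)}

# Records r1 and r2 of Table 2 (public) and of Table 6 (generalized private).
public  = [flatten(("94221", "M", "US")), flatten(("94112", "F", "US"))]
private = [flatten(("9422*", "M", "*")),  flatten(("941*2", "F", "*"))]
print(anonymity_graph(public, private))   # {0: [0], 1: [1]}
```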
2.1 Conditions for k-anonymity
To clarify k-anonymity, a stepwise discussion follows, beginning with the fundamental conditions of k-anonymity. We
first introduce the following simple condition.

Condition 1 In an anonymity graph G, the modified table
T satisfies k-anonymity if the degree of every vertex in VT
and VT′ is not less than k.

Condition 1 originates from the principle that an attacker
can identify an anonymized record with a probability of at most 1/k. Thus, Condition 1 is a necessary anonymization condition.
However, it is insufficient for privacy protection, because individuals might still be identified. For example, Fig. 3 displays
the anonymity graph when some values in T are hidden. Because the degree of every vertex in VT and VT′ is
at least 3, the generalized dataset T achieves
3-anonymity under Condition 1. However, record r4 is unambiguously linked to r4; therefore, that individual is identified.
This linkage occurs because edge (r4, r4) is used in all the
perfect matchings. A perfect matching is a set of edges such that every vertex of the graph is incident to exactly one of them. In Fig. 3,
a perfect matching is composed of seven edges such as
(r1, r1), (r2, r2), (r3, r3), (r4, r4), (r5, r5), (r6, r6), (r7, r7).
Wong et al. overcame the shortcoming of Condition 1 by
postulating that:
Definition 1 A modified table T has k-anonymity if and
only if the anonymity graph G includes k edge-disjoint perfect matchings.
In k edge-disjoint perfect matchings, each edge is included
in only one perfect matching.
2.2 Verification of k-anonymity condition
Wong et al. proposed Definition 1 of k-anonymity but did
not show how to verify it. Here it is proven that Definition 1 can be verified
in polynomial time. The verification must check the existence of k edge-disjoint perfect matchings in the anonymity
graph G. Hall's theorem [21] implies that a bipartite graph
includes k edge-disjoint perfect matchings if and only if the
graph includes a k-regular subgraph, in which every vertex
has degree k. This condition can be checked by a maximum-flow algorithm, as follows.
A flow from a source s to a sink t in a directed network assigns a
value to each arc such that the inflow equals the outflow at
every intermediate vertex v (the flow condition); the total value leaving s is called the amount
of the flow. The flow value on an arc cannot exceed the capacity of the arc.
Intuitively, a flow is a superposition of paths from the source
vertex s to the sink vertex t. The maximum-flow problem
finds the flow that maximizes this amount. The maximum flow
in a network with |V| vertices can be computed in O(|V|³) time [13].
The polynomiality of the verification can be stated in terms of the maximum
flow.
Theorem 1 The k-anonymity of an anonymity graph G =
(V, E) given by Definition 1 can be verified in O(|V|³) time.

Proof Let G̃ = (Ṽ, Ẽ) be the graph obtained from G by
adding two vertices s and t, such that s is connected to all
vertices in VT′ and t is connected to all vertices in VT. All
edges of G̃ are directed from s to VT′, from VT′ to VT, and from VT to
t, so G̃ is a directed graph. The edge capacity is set to k for
edges incident to s or t, and to 1 for the other edges (see Fig. 4).
We show that G̃ admits a flow of amount k × |VT| if and only
if G has k-anonymity.
Suppose that G satisfies k-anonymity. Because G has k
edge-disjoint perfect matchings, it includes a k-regular subgraph
H. Consider a flow from s to t in which the flow value is
k on the edges incident to s or t and 1 on the edges of H. Since
each vertex in VT′ and VT is incident to exactly k edges of H, the
flow condition is satisfied. Thus, the maximum-flow value is
k × |VT|.
Conversely, suppose that the maximum amount of a flow is k × |VT|.
As each edge from s to a vertex v in VT′ carries a flow of
k, v is incident to exactly k edges with flow value 1. As
each edge from a vertex u in VT to t also carries a flow
of k, u is also incident to exactly k edges with flow value
1. Thus, the set of edges with flow value 1 forms a
k-regular subgraph. Consequently, G includes k edge-disjoint
perfect matchings.

Fig. 4 Maximum-flow problem
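The verification of Theorem 1 can be sketched as follows. This is an illustrative Python sketch that builds the flow network of the proof with networkx; the paper itself assumes an O(|V|³) maximum-flow algorithm [13], and the node labels ("pub"/"priv") are our own convention.

```python
# Sketch of the check in Theorem 1: T satisfies k-anonymity
# (Definition 1) iff the flow network built from the anonymity graph
# admits a flow of amount k * |V_T|.
import networkx as nx

def satisfies_k_anonymity(adjacency, n_public, k):
    """adjacency: dict private-index -> list of public-indices
    (the anonymity graph of Sect. 2)."""
    G = nx.DiGraph()
    for i in range(n_public):                        # source -> public side
        G.add_edge("s", ("pub", i), capacity=k)
    for j, pubs in adjacency.items():
        G.add_edge(("priv", j), "t", capacity=k)     # private side -> sink
        for i in pubs:                               # middle edges, capacity 1
            G.add_edge(("pub", i), ("priv", j), capacity=1)
    value, _ = nx.maximum_flow(G, "s", "t")
    return value == k * len(adjacency)
```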
2.3 Information loss
Although the modified private dataset T must satisfy k-anonymity,
its utility should not be degraded too much; otherwise, the dataset T would become unusable. The utility is usually measured by
the information loss. Popular measures are the normalized
certainty penalty (NCP) and the global certainty penalty (GCP).
This subsection presents NCP and GCP first for the standard
generalized dataset (i.e., with multilevel hierarchies such as
Fig. 1a), and then for the generalized dataset
with concealed QID values (i.e., with one-level hierarchies
such as Fig. 1b).
Let A(q, r) be the attribute value at QID q in record r, and let |A(q, r)| denote the cardinality of A(q, r) (e.g.,
in Fig. 1a, if A(q, r) = "Europe", then |A(q, r)| = 3, because
"Europe" = {Italy, France, England}). In particular, A* denotes
the QID attribute value at the top of the generalization
hierarchy, and |A*| is the cardinality of A* (i.e., in Fig. 1a,
A* = "Country" and |A*| = 7). Then NCP(q, r) is given
by

NCP(q, r) = (|A(q, r)| − 1) / (|A*| − 1)    (1)

NCP(q, r) expresses the degree of generalization at QID q in record r. The degrees at all QIDs in all records of T are summed as follows:

NCP(T) = Σ_{q ∈ QIDs} Σ_{r ∈ T} (|A(q, r)| − 1) / (|A*| − 1)    (2)

The GCP is then defined by

GCP(T) = NCP(T) / (|QIDs| × |T|)    (3)

Note that GCP(T) is a normalized version of NCP(T).
We now present the NCP and GCP of a generalized dataset
with concealed QID values. Generalizing (concealing) an attribute value at QID q in record r implies that
A(q, r) = A* (see Fig. 1). If the attribute value is generalized (concealed), NCP(q, r) = 1; otherwise NCP(q, r) = 0.
Accordingly, NCP(T) is the number of hidden sites in T, and GCP(T) is the ratio of hidden sites. Based on the GCP(T) of the generalized
dataset with concealed QID values, the information loss
is thus defined as follows:

Definition 2 (Information loss under generalization by concealing
the QID values) Given an anonymized dataset T, the information loss is defined as the ratio of the number of hidden
sites.

In this paper, the information loss is measured by Definition 2 (i.e., GCP(T)).
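For the concealment model of Definition 2, NCP and GCP reduce to counting hidden sites. A minimal sketch, assuming records are flattened into tuples of pseudo-QIDs as in Sect. 1 (the example values come from Table 6, whose 12 asterisks were mentioned in Sect. 1.1):

```python
# Sketch: NCP and GCP of Eqs. (1)-(3) for the concealment model of
# Fig. 1b, where each hidden value contributes 1 and each retained
# value contributes 0.

def ncp(anonymized):
    """Number of hidden sites '*' over all records (Eq. (2))."""
    return sum(v == "*" for rec in anonymized for v in rec)

def gcp(anonymized):
    """Normalized loss: ratio of hidden sites (Eq. (3))."""
    n_qids = len(anonymized[0])
    return ncp(anonymized) / (n_qids * len(anonymized))

# Table 6 flattened to 7 pseudo-QIDs (5 Zip digits, Gender, Country):
table6 = [
    ("9","4","2","2","*","M","*"), ("9","4","1","*","2","F","*"),
    ("9","4","1","*","1","F","US"), ("9","*","1","1","1","M","Canada"),
    ("9","4","2","2","*","M","*"), ("9","4","1","*","*","F","*"),
    ("9","*","1","1","1","M","Canada"),
]
print(ncp(table6), round(gcp(table6), 3))   # 12 0.245
```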
3 Optimization approach
3.1 Formulation
Our objective is to minimize the information loss of a modified private dataset T with k-anonymity.
Let Aij denote the set of attributes on which ri and rj have different values.
For an edge set F ⊆ VT′ × VT, let N(v, F) be
the set of vertices adjacent to vertex v via the edges in F, and let C(rj, F) = ∪_{u ∈ N(rj, F)} Aju. Then
GCP(T) is proportional to Σ_{rj ∈ VT} |C(rj, F)|, and the problem can be
written as follows:

min_{F ⊆ VT′ × VT}   Σ_{rj ∈ VT} |C(rj, F)|    (4)
s.t.   |N(v, F)| = k,   ∀ v ∈ VT ∪ VT′    (5)
       F : decision variable    (6)
The objective function (4) gives the total weight of the edges.
Here, we explain the relation between the objective function
and the edge weights. Let w(ri, rj) denote the weight of the
edge between ri ∈ VT′ and rj ∈ VT. The weight w(ri, rj)
is defined as the increment of |C(rj, F)| caused by the addition of
(ri, rj), i.e., w(ri, rj) = |C(rj, F ∪ {(ri, rj)})| − |C(rj, F)|.
For example, in Fig. 5, when there is no adjacent vertex,
the weights of (r2, r1) and (r2, r3) are 3 and 2, respectively.
Constraints (5) ensure that the edge set F forms a k-regular
graph. Constraint (6) states that the edge set F is the decision variable, i.e., the objective of this problem is to find
an edge set minimizing the total weight. This optimization
problem is called the k-anonymity problem.
3.2 Algorithm
It is obvious that the weight w(ri, ri) is 0 when the two records ri
(one in each table) correspond to the same individual. We obtain a perfect matching
using these edges, and the total weight of its edges is 0. Thus,
when k = 1, this perfect matching is the optimal solution and
the objective value is 0. This perfect matching is called
the "identical perfect matching". Because any dataset includes an
identical perfect matching, the anonymity problem is nontrivial
only when k ≥ 2.
The edge weights are dynamic rather than fixed, because
the number of newly hidden sites depends on the sites already hidden
in rj ∈ VT. For example, in Fig. 6, when r4 is adjacent to r1 and r3 and some sites are already hidden,
edges (r2, r1) and (r2, r3) are weighted as 2 and 1, respectively, which differ from the weights in Fig. 5.
This dynamism greatly complicates the k-anonymity problem. Note that when k = 2, the edge weights are
not dynamic, because this method requires only one further perfect
matching (the other perfect matching is the identical perfect
matching).
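The dynamic weight can be sketched as follows: the weight of (ri, rj) is simply the number of additional sites of rj that would have to be hidden, given the sites already hidden by earlier matchings (a sketch under the flattened-record encoding used above; the example records are taken from Tables 2 and 3):

```python
# Sketch of the edge weight of Sect. 3.1:
# w(r_i, r_j) = |C(r_j, F ∪ {(r_i, r_j)})| - |C(r_j, F)|, where the
# current generalization of r_j is encoded directly in private_rec.

def weight(public_rec, private_rec):
    """Number of not-yet-hidden sites of private_rec that differ from
    public_rec (and would therefore have to be hidden)."""
    return sum(p != "*" and p != q for q, p in zip(public_rec, private_rec))

r1_pub  = ("9","4","2","2","1","M","US")
r4_priv = ("9","4","1","1","1","M","Canada")    # nothing hidden yet
print(weight(r1_pub, r4_priv))                   # 3 sites would be hidden
r4_priv = ("9","4","1","1","1","M","*")          # Country already hidden
print(weight(r1_pub, r4_priv))                   # only 2 more sites
```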
Here we propose a heuristic algorithm for the large-scale
k-anonymity problem. In our experience, complicated algorithms are impractical for large-scale problems; thus, our
algorithm employs the basic techniques of graph algorithms.
Our algorithm is a two-step process based on minimum-cost
flow algorithms; an initial feasible solution is first found by
a greedy technique and is improved by canceling negative
cycles in the second step.
Fig. 5 Computation of weight
3.2.1 Initial feasible solution
Because the k-anonymity problem is modeled as a graph
problem, it should be naturally solvable by graph algorithms. In the k-anonymity problem, the numbers and sites of
asterisks in each record are obtainable only when k records
adjacent to the record of interest are determined. Thus, the
edge weights dynamically change as described above. However, as most graph algorithms presume fixed edge weights,
our algorithm tentatively fixes the weights and seeks a perfect
matching. The perfect matching becomes part of the solution
and is used to update the weight of each edge. This process
is iterated k times (finding k perfect matching). This method
adopts a greedy algorithm (described below) that cannot find
an exact solution, but it is known to find a good solution
within a short time.
The algorithm first computes the weights of the edges
when dataset T contains no hidden sites (*). A perfect
matching is then found by using these weights of edges; this
corresponds to solving a minimum weight perfect matching
problem, for which there exist several exact and fast algorithms (e.g., [20]). After solving the minimum weight perfect
matching problem, the values in the dataset T that are incident to edges included in the perfect matching (solution) are
hidden. For example, in Fig. 6, if edge (r4, r1) is included in
the perfect matching (solution), the QIDs of r1 are modified
to Zip code = 94**1, Gender = M, Country = *. The perfect
matching is stored as part of the initial solution, and the edges
included in the matching are removed from graph G. The
weights of the edges are then recomputed, excluding the already hidden
sites, and in a similar way we find a perfect matching and then
modify the dataset T (QIDs). This process is iterated k − 1
times. Note that the identical perfect matching is found before
executing the greedy algorithm, so only k − 1 further perfect matchings
are required. We describe this greedy algorithm below.
Greedy Algorithm
Step 0 Compute the initial weight of each edge wT (ri , r j ),
assuming no hidden sites in T . Set the edge set F =
∅.
Step 1 Find a minimum weight perfect matching.
Step 2 Update the edge set F by adding the edges included
in the perfect matching (the solution).
Step 3 If k − 1 perfect matchings exist in edge set F, exit;
otherwise, go to Step 4.
Step 4 Modify the dataset T by hiding the values in T corresponding to the perfect matching found in Step 1.
Step 5 Recompute the weight of each edge wT (ri , r j ) in
the modified dataset T and return to Step 1. Note
that the weights of the edges included in F are set
to ∞, which has the same effect as removing the
edges from graph G.
In Step 1, the minimum weight perfect matching problem is a special case of the minimum-cost flow problem,
which is solvable by a pre-flow push algorithm. Here, the
practically efficient algorithm of Goldberg [20] was
adopted.
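The greedy first phase can be sketched as follows. This is only an illustration: scipy's assignment solver stands in for the minimum weight perfect matching solver of Goldberg [20] used by the paper, a large finite constant stands in for the weight ∞ of Step 5, and weight() is the function sketched in Sect. 3.1 above.

```python
# Sketch of the greedy first phase (Sect. 3.2.1). `public` and `private`
# are equal-length lists of flattened records (tuples of pseudo-QIDs)
# in the same record order, so the identical perfect matching is (i, i).
import numpy as np
from scipy.optimize import linear_sum_assignment

FORBIDDEN = 1e9   # stands in for the weight "infinity" of Step 5

def greedy_anonymize(public, private, k):
    n = len(public)
    used = {(i, i) for i in range(n)}              # identical perfect matching
    for _ in range(k - 1):                         # k-1 further matchings
        cost = np.array([[weight(public[i], private[j])
                          if (i, j) not in used else FORBIDDEN
                          for j in range(n)] for i in range(n)])
        rows, cols = linear_sum_assignment(cost)   # Step 1
        for i, j in zip(rows, cols):               # Steps 2 and 4
            used.add((i, j))
            private[j] = tuple("*" if p != "*" and p != q else p
                               for q, p in zip(public[i], private[j]))
    return private, used                           # modified T and edge set F
```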
Example of the greedy algorithm Consider an anonymity
graph such as Fig. 7, where the numbers of records and QIDs
are 7 and 3, respectively. Note that the anonymity graph bears
no relation to the datasets in Tables 1, 2, 3, 4, 5 and 6.
Step 0 computes the initial weight of each edge in Fig. 7a.
The weights are shown in Fig. 7d. For example, the weight
w(r1, r2) is 1, because the second (middle) QID value of r2
is hidden (i.e., r2 = 2*3) when it is assumed that r1 becomes
adjacent to r2. Step 1 finds a minimum weight perfect matching using the initial weights (Fig. 7d); i.e., a perfect matching
such as that in Fig. 7b is found. Step 2 stores the perfect matching.
Step 4 hides some QID values in the dataset T corresponding to the perfect matching, as in Fig. 7b. Step 5 recomputes the
weight of each edge in Fig. 7b. The weights are shown in
Fig. 7e. For example, the weight w(r1, r2) is 0, because no
QID value of r2 needs to be hidden when it is assumed that r1 becomes
adjacent to r2. We can see that several weights change from
Fig. 7d to e (i.e., the weights are dynamic). Again, Step 1 finds a minimum weight perfect matching using the updated weights (Fig. 7e);
i.e., a perfect matching such as that in Fig. 7c is found. These steps
are iterated until k − 1 perfect matchings are found.
Fig. 7 Example of greedy algorithm. a Initial graph, b 1-perfect matching graph, c 2-perfect matching graph, d initial weights, e updated weights

If the anonymity graph is naively constructed (i.e., the
anonymity graph G contains all edges between both sides),
the minimum weight perfect matching problem cannot be
solved because the memory and computational time requirements exceed the practical limits. Therefore, only the edges
with smaller weights are considered. Ideally, the edges
with smaller weights at each modification of T should be
found; however, this method demands excessive computational time. Therefore, some edges are only selected once as
follows.
If an edge is low-weighted in the original (non-modified)
dataset T , it is selected as a promising candidate. Although
these edges are not always included in the optimal solution,
we believe that this strategy selects appropriate edges and
reduces the number of the edges effectively. Computing the
edge weights in the original dataset T is equivalent to finding
the Hamming distance between ri ∈ VT and r j ∈ VT .
Thus, the lower-weighted edges (the Hamming distance) are
obtained by the fast algorithm proposed by Uno [24].
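A naive sketch of this candidate selection is given below; the paper obtains the small-distance pairs with Uno's multi-sorting algorithm [24], whereas this quadratic version is for illustration only.

```python
# Sketch of the edge pruning of Sect. 3.2.1: only edges whose weight in
# the original (unmodified) private table is small are kept as
# candidates. That weight is the Hamming distance between the
# flattened records.

def candidate_edges(public, private, max_dist):
    edges = []
    for i, q in enumerate(public):
        for j, p in enumerate(private):
            d = sum(a != b for a, b in zip(q, p))   # Hamming distance
            if d <= max_dist:
                edges.append((i, j, d))
    return edges
```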
3.2.2 Improvement of solution
The initial solution is improved by negative cycle canceling,
a popular technique for solving minimum-cost flow problems [22,23].
First, a directed bipartite graph is constructed, including
the edges used in the minimum weight perfect matching
problems of our greedy algorithm. Here, G′ = (V′, E′) is
a directed graph in which the edges in set
F (the initial feasible solution) are directed from VT to
VT′, while the other edges are directed from VT′ to VT. In
this directed graph, the edges from VT to VT′ consistently
represent the (initial) solution. The number of hidden sites then changes
when a directed edge rj ∈ VT → rh ∈ VT′ is removed from the solution
and an edge ri ∈ VT′ → rj ∈ VT is added to it. For example, in the upper part of
Fig. 8, the current solution contains the edge from r2 to r6.
When this edge is removed from the current solution and the
edge from r2 to r2 is newly added (i.e., the directions of both
edges are reversed, as shown in the bottom part of Fig. 8), the
number of hidden sites in record r2 is incremented by −2,
because Zip code = 941*2 and Country = * are changed into
Zip code = 94112 and Country = US.
Fig. 8 Increment of the hidden sites

The increments are regarded as weights associated with
pairs of edges such as ri ∈ VT′ → rj ∈ VT and rj ∈ VT →
rh ∈ VT′. That is, a negative weight implies that the number
of hidden sites reduces when the directions of both edges
are reversed (i.e., when an edge included in the solution is
exchanged).
When a sequence of such edge pairs forms a directed
cycle, the feasibility of the solution is retained when the
edges in the directed cycle reverse their direction, because the
updated anonymity graph certainly includes a k-regular graph
(k edge-disjoint perfect matchings). Thus, if the sum of the
weights on a directed cycle is negative and the edges reverse
their directions, the number of hidden sites is reduced without compromising the feasibility of the solution. Therefore,
the solution can be repeatedly improved by finding negativeweight directed cycles.
Example of improvement of the anonymized dataset by detecting a negative cycle Consider the initial solution in Fig. 9a,
where k = 2 and the number of hidden sites (*) is 9.
Then, a negative cycle such as that in Fig. 9b is detected (r1 →
r4 → r2 → r3 → r4 → r1 → r3 → r2 → r1), where the
increments of the hidden sites at the records are −3, +1, 0, and
−1, respectively; thus, their sum is −3. This means that the solution is improved (i.e., the number of hidden
sites is reduced from 9 to 6) when the edges along the negative
cycle reverse their directions (Fig. 9c). This is repeated until
no negative cycle is detected.
We next explain how to find negative cycles in an
anonymity graph. Several algorithms for detecting negative
cycles have been proposed but are unsuitable for large-scale
k-anonymity problems. For example, the time complexity of
the Bellman–Ford algorithm, a basic algorithm for detecting negative cycles, is O(|V′| × |E′|) per negative cycle.
Despite being a polynomial-time algorithm, it cannot identify negative cycles in large-scale datasets (graphs)
within a reasonable computational time. Therefore, we propose a new algorithm that efficiently detects negative cycles
in large-scale networks. Our algorithm is based on a labeling
algorithm proposed for the shortest path problem. Although
it may not find all negative cycles, it is expected to find negative cycles rapidly,
by virtue of the fast labeling algorithm.
In a labeling algorithm, each vertex has a label, indicating
the (incumbent) shortest path length from the source to that
vertex, and a parent, representing the predecessor of the vertex on that path. The label and parent of each vertex
are updated whenever the label (shortest path length) is improved.
This process is iterated until no label is improved. Ultimately, the label of each vertex represents the exact length of
the shortest path from the source to that vertex. The shortest
path to each vertex is found by tracking back through the parents
from the vertex to the source. The paths through the parents from
all vertices together form a tree, here called the
shortest path tree.
Optimization algorithm for k-anonymization of datasets with low information loss
Fig. 9 Example of improvement of the anonymized dataset by detecting negative cycle. a Initial solution (k = 2), b negative cycle, c improved graph
The shortest path tree cannot be maintained in a directed
graph G′ containing negative cycles (because trees are acyclic). Graphs including such cycles are here called shortest path
graphs. A shortest path graph might consist of several connected components, where one is a tree and the others are graphs
including a negative cycle. For example, the left and right
graphs of Fig. 10 represent a negative cycle and a tree structure, respectively.
The objective is to find negative cycles using the shortest path graph. When tracing upstream along the shortest
path from a vertex included in a negative cycle, we eventually return to the same vertex (and thus identify the negative
cycle). Even when a vertex is included in a component containing
a negative cycle but is excluded from the cycle itself, the negative cycle can be found. For example, vertex j is excluded
from the negative cycle in the left graph of Fig. 10, but the
negative cycle can be reached by following the shortest path
graph upstream from vertex j.
Thus, it is necessary to find the vertices that are probably
included in graphs containing negative cycles. Because the
labels of such vertices are likely to be updated many times,
our algorithm focuses on the update frequency of the label
on each vertex. When the label of a certain vertex is updated
several times, the shortest path graph is tracked upstream
from the vertex (Fig. 10). This method cannot always detect
a negative cycle, but it is expected to be effective because
it intensively searches some promising candidate vertices.
This approach for detecting negative cycles in large-scale
networks (problems) has not been previously proposed.
Fig. 10 Shortest path graph

Our new algorithm for detecting negative cycles proceeds
as follows. Each pair of edges is regarded as a single edge
whose ends are vertices in VT′. Note that each edge pair is
a sequence such as ri ∈ VT′ → rj ∈ VT and rj ∈ VT →
rh ∈ VT′. Here, li is the label of vertex vi ∈ VT′, wih denotes
the weight of the combined edge from vertex vi ∈ VT′ to vh ∈ VT′, and
the threshold t is defined such that if a vertex is updated t
times, the algorithm proceeds upstream from that vertex along
the shortest path graph. The algorithm terminates after p
iterations of Steps 1–6, where p is preset. The negative
cycle detection algorithm is detailed below:
Negative Cycle Detection Algorithm
Step 0 Choose an arbitrary vertex in VT′ as the root and set
its label to 0. Initialize the labels of the other
vertices to ∞.
Step 1 Find the vertex with the smallest label and denote it by vi.
Step 2 If li + wih < lh, update lh ← li + wih for all vh adjacent to vi, and then update the shortest path graph.
Step 3 If label lh has been updated t times, proceed upstream from
vertex vh along the shortest path graph. Reset the
update count of vertex vh to 0.
Step 4 If a negative cycle is detected, go to Step 5; otherwise,
return to Step 1.
Step 5 Update G′ = (V′, E′) by reversing the directions of the
edges included in the negative cycle. The labels of
all descendants of the vertices included in the deleted
negative cycle are set to ∞.
Step 6 If the number of iterations of Steps 1–6 exceeds the
preset value p, exit; otherwise, return to Step 1.
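A compact sketch of this heuristic is given below (assuming the contracted graph is given as adjacency lists of (neighbour, weight) pairs with every vertex present as a key; the priority queue, the cycle-weight check, and the variable names are our own choices, and, as noted above, the heuristic may miss some negative cycles):

```python
# Sketch of the negative cycle detection of Sect. 3.2.2: a labeling
# search that walks the parent pointers upstream once some vertex has
# been updated t times; the caller controls the number of iterations.
import heapq

def find_negative_cycle(graph, root, t=3):
    label = {v: float("inf") for v in graph}
    parent = {}                              # child -> (parent, edge weight)
    updates = {v: 0 for v in graph}
    label[root] = 0
    heap = [(0, root)]                       # Step 1: smallest label first
    while heap:
        l, v = heapq.heappop(heap)
        if l > label[v]:
            continue                         # stale heap entry
        for u, w in graph[v]:
            if label[v] + w < label[u]:      # Step 2: improve the label
                label[u] = label[v] + w
                parent[u] = (v, w)
                updates[u] += 1
                heapq.heappush(heap, (label[u], u))
                if updates[u] >= t:          # Step 3: walk upstream
                    updates[u] = 0
                    chain, x = [], u
                    while x in parent and x not in chain:
                        chain.append(x)
                        x = parent[x][0]
                    if x in chain:           # a cycle in the parent graph
                        cyc = chain[chain.index(x):]
                        if sum(parent[y][1] for y in cyc) < 0:
                            return cyc[::-1]  # Step 4: negative cycle found
    return None
```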
3.2.3 Application of our two-phase method to large dataset
The two-phase method finds an anonymized dataset T .
However, larger-scale k-anonymity problems are rendered
123
K. Murakami, T. Uno
difficult by their large memory and computational time requirements. For
example, the graph of a dataset containing 10 million records
(vertices), each of which is incident to one thousand edges,
includes 10 billion edges, requiring a vast memory capacity.
Reducing the number of edges incident to each vertex conserves memory but degrades the solution; in some cases, a
feasible solution may not be found. Moreover, the computation time remains prohibitive. Although a minimum weight
perfect matching problem in our greedy algorithm is solved in
polynomial time (namely, O(n²√m), where n and m are the
number of vertices and edges, respectively [20]), the huge
numbers of vertices and edges (i.e., n = 10,000,000 and
m = 1000 × n) preclude a solution within any practical
timeframe.
To overcome this difficulty, the original dataset (microdata; see Table 1) is partitioned and several smaller-scale
anonymity graphs are generated. Before partitioning, the
dataset is sorted, as in Wong's algorithm. We
first sort the QIDs in ascending order of the cardinality of the QID
attribute values. For example, in Table 3, the cardinalities of
"Zip code", "Gender", and "Country" are assumed to be 7,
2, and 3, respectively (in a real dataset, the cardinalities of the potential "Zip code" and "Country" values would presumably be larger). The dataset is reconstructed so that the QID order
is "Gender", "Country", and "Zip code". Next, the records
in the dataset are sorted in lexicographic QID order. Finally, the
sorted records are partitioned in a top-down fashion. This sorting
technique is effective because the records in a partition are likely to be similar to each other. Thus, the
information loss is expected to be reduced.
Experiments demonstrated that after executing the sorting
techniques, our algorithm derives a better solution than when
the data are unsorted (see Tables 19 and 20 in “Appendix”).
In our experiments, the size of the partitioned dataset is set
to 100,000 (100k) records.
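The sorting and partitioning step can be sketched as follows (a minimal sketch; records are tuples of QID values, and the cardinalities are estimated from the data itself, which is an assumption of this illustration):

```python
# Sketch of the pre-partitioning of Sect. 3.2.3: QIDs are ordered by
# increasing cardinality, records are sorted lexicographically in that
# QID order, and the sorted table is cut into fixed-size blocks.

def partition(records, block_size=100_000):
    n_qids = len(records[0])
    # cardinality of each QID = number of distinct values in the column
    card = [len({rec[q] for rec in records}) for q in range(n_qids)]
    order = sorted(range(n_qids), key=lambda q: card[q])
    reordered = [tuple(rec[q] for q in order) for rec in records]
    reordered.sort()                       # lexicographic in the new order
    return [reordered[i:i + block_size]
            for i in range(0, len(reordered), block_size)]
```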
3.2.4 Extension to standard generalization
Our algorithm can be extended from the generalization that conceals the QID values to standard generalization (Fig. 1).
Our model defines the weight of each edge as the increase in the
number of hidden values in T when the edge is adopted, i.e., when its two endpoints become adjacent. Under standard generalization, the
weight is quantified by the increase in NCP(T) instead, which likewise
defines the weight of each edge in G. To compute this increase, the generalized value of each QID that covers both endpoints of the edge is required.
When the hierarchies are given in advance (as in Fig. 1a), this generalized value
is easily obtained by a bottom-up search through the hierarchy; the search terminates when the
generalized value covers the values of both endpoints. For
example, in Tables 2 and 3, the increase in NCP(T) for the
QID "Country" is computed under the condition that
dataset T is completely unmodified. When r1 is adjacent to
r4, NCP(T) increases by 1/6, because "Canada" is generalized
(modified) to "America" in record r4, and |America| = 2
and |Country| = 7 (see Eq. (1)). NCP(T) increases similarly
for all QIDs, and the increases are then summed to yield the weight
of the edge.
In our algorithm, the two generalizations differ only by
the weight computation method, enabling easy extension to
the latter. However, the bottom-up search increases the computational time.
4 Experiments
In this section, the experimental results obtained using our
two-phase algorithm are compared with those obtained using
Wong's algorithm (the implementation of Wong's algorithm
was provided to us by Wong). The experiments are performed
on a real dataset, CENSUS (http://www.ipums.org), and on
datasets randomly generated from a normal distribution. For
the CENSUS dataset, we use five instances, in which the number of records is varied from 100,000 to 500,000 in increments of 100,000
and the number of QIDs is fixed at 8; information on the QIDs is shown in Table 7. These datasets were
used in [16] and were provided to us by the authors.
The five cases are denoted "Occ_100k", "Occ_200k",
…, "Occ_500k". In the randomly generated datasets, the standard deviation (sd) is set to 1.0 or 1.5, the number of QIDs
is fixed at 10, and the number of records is set to 100,000,
1,000,000, or 10,000,000 (yielding six cases).
To evaluate the performance of our two-phase algorithm, a
lower bound on the number of hidden sites (the objective function) is computed. In the k-anonymity problem (4)–(6), half of
the constraints (5) are relaxed (omitted); that is, the degrees of
all vertices in VT′ are ignored. The problem then becomes:

min_{F ⊆ VT′ × VT}   Σ_{rj ∈ VT} |C(rj, F)|    (7)
s.t.   |N(v, F)| = k,   ∀ v ∈ VT    (8)
       F : decision variable    (9)
Table 7 QIDs in CENSUS dataset

QID               Cardinality
Age               79
Gender            2
Education level   17
Marital status    6
Race              9
Work class        10
Country           83
Occupation        50
The optimal solution of this relaxed problem is determined by evaluating all combinations of hidden sites in each record in VT. For example,
for record r1 in Fig. 3, the vertices (records) adjacent to
r1 are determined once a part of r1 is hidden, such as
Zip code = *4221, Gender = M and Country = US. Then, all
combinations of the hidden sites are considered as follows:

Zip code = 9*221, Gender = M and Country = US,
Zip code = 94*21, Gender = M and Country = US,
...,
Zip code = *****, Gender = * and Country = *.

However, evaluating all combinations is usually unnecessary, because the optimal combinations of most records are
found at small numbers of hidden sites (i.e., when few
QIDs are hidden, many of the records in VT are already adjacent
to several vertices in VT′). Thus, starting with no
hidden sites, the number of hidden sites is iteratively incremented by one, and the new sets are examined. As soon as the degree
of a record in VT is not less than k, the adjacent vertices
and the number of hidden QIDs constitute the optimal combination and the lower bound of that record, respectively. The lower
bounds of all records in VT are summed to give the lower
bound of the k-anonymity problem.

Table 9 Occ_200k (|T| = 200k)

     Wong's              Two-phase
k    GCPw   Time (s)     GCPt   Time (min)   Gap (%)
3    0.12   0.45         0.10   8            15.0
4    0.14   0.49         0.12   12           13.6
5    0.15   0.48         0.14   13           13.7
6    0.17   0.48         0.15   17           12.3
7    0.18   0.51         0.16   19           12.3
8    0.19   0.49         0.17   22           12.0
9    0.20   0.52         0.18   26           11.9
10   0.21   0.51         0.19   27           11.7
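The per-record lower bound described above can be sketched as follows (an illustrative Python sketch; consistent() is the adjacency test of Sect. 2, and the enumeration mirrors the incremental search described in the text):

```python
# Sketch of the per-record lower bound of Sect. 4: for each private
# record, find the smallest number of hidden sites that already makes
# it consistent with at least k public records, then sum the minima.
from itertools import combinations

def consistent(public_rec, private_rec):
    return all(p == "*" or p == q for q, p in zip(public_rec, private_rec))

def lower_bound(public, private, k):
    n_qids = len(public[0])
    total = 0
    for rec in private:
        for h in range(n_qids + 1):              # number of hidden sites
            found = False
            for hide in combinations(range(n_qids), h):
                gen = tuple("*" if q in hide else rec[q]
                            for q in range(n_qids))
                if sum(consistent(p, gen) for p in public) >= k:
                    found = True                 # degree constraint (8) met
                    break
            if found:
                total += h
                break
    return total
```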
Tables 8, 9, 10, 11, 12, 13, 14, 15, 16, 17 and 18 list the
values of the objective function GCP obtained by Wong's algorithm and by our two-phase algorithm (denoted GCPw
and GCPt, respectively), and their computational times. In
addition, the gaps between GCPw and GCPt and between
GCPt and the lower bound are computed using Eqs. (10)
and (11), respectively:

Gap = (GCPw − GCPt) / GCPt    (10)
LB_Gap = (GCPt − LB) / LB    (11)
Table 10 Occ_300k (|T| = 300k)

     Wong's              Two-phase
k    GCPw   Time (s)     GCPt   Time (min)   Gap (%)
3    0.10   0.65         0.09   13           18.8
4    0.12   0.67         0.11   17           16.9
5    0.14   0.67         0.12   22           15.6
6    0.15   0.70         0.13   26           15.3
7    0.16   0.73         0.14   29           14.7
8    0.17   0.76         0.15   34           14.4
9    0.18   0.73         0.16   37           14.0
10   0.19   0.76         0.17   42           13.2
Table 11 Occ_400k (|T| = 400k)

     Wong's              Two-phase
k    GCPw   Time (s)     GCPt   Time (min)   Gap (%)
3    0.10   0.93         0.08   15           17.2
4    0.11   0.88         0.10   20           16.1
5    0.13   0.93         0.11   27           14.3
6    0.14   0.90         0.12   31           13.9
7    0.15   0.93         0.13   38           14.0
8    0.16   0.96         0.14   42           14.3
9    0.17   1.01         0.15   49           14.5
10   0.18   1.07         0.16   54           13.3
Table 8 Occ_100k (|T| = 100k)

     Wong's              Two-phase
k    GCPw   Time (s)     GCPt   Time (min)   Gap (%)
3    0.14   0.24         0.12   4            15.1
4    0.16   0.25         0.14   5            12.9
5    0.18   0.23         0.16   6            11.9
6    0.19   0.25         0.17   8            11.6
7    0.21   0.28         0.18   10           11.0
8    0.22   0.27         0.19   10           11.1
9    0.22   0.27         0.20   12           10.6
10   0.23   0.28         0.21   13           10.5
where LB represents the lower bound. The gaps are plotted
as functions of k in Figs. 11 and 12.
As is evident in the tables, Wong's algorithm executed
much faster than our two-phase algorithm in all cases.
Wong's algorithm solved the problem within 30 s even at
the largest scales, whereas our two-phase algorithm consumed up to 18 h 40 min (see Table 18). However, our
algorithm identified GCPs superior to those obtained by
Wong's algorithm. Although the differences appear small,
the gaps exceeded 40% in some instances. In the larger-scale random simulations, the gaps decrease in the cases in which
the higher sd was used. When the data vary more widely,
more hidden sites are required; consequently, the feasibility of even a randomly chosen solution is relatively high,
and Wong's algorithm (a random approach) yields reasonable
results. Our algorithm found better solutions than Wong's
algorithm within a reasonable time, even when executed on
large-scale datasets. Therefore, the aims of the study were
satisfied.
As shown in Figs. 11 and 12, the gap between GCPt and
the lower bound increases with k. At large k, the lower bound
departs from the optimal solution because some vertices in
VT′ (not VT) are probably adjacent to more than k vertices,
while others are adjacent to far fewer than k vertices. In other
words, the relaxed degree constraints of (5) are considerably
violated. In Fig. 11, the gaps remain below 0.32 even for
k = 10, suggesting that our algorithm performed well on the CENSUS dataset. In Fig. 12, the gaps are
larger for Random_1m_sd = 1.0 and Random_10m_sd = 1.0
than for the other sample datasets, reflecting that optimization
problems with less-biased values (i.e., datasets with less variation) are notoriously difficult to solve.
Table 12 Occ_500k (|T| = 500k)

     Wong's              Two-phase
k    GCPw   Time (s)     GCPt   Time (min)   Gap (%)
3    0.09   1.12         0.08   18           18.9
4    0.11   1.15         0.09   24           17.8
5    0.12   1.13         0.10   31           17.2
6    0.13   1.16         0.11   38           17.5
7    0.14   1.18         0.12   43           17.4
8    0.15   1.21         0.13   49           17.3
9    0.16   1.26         0.14   54           16.7
10   0.17   1.27         0.14   61           17.0

Table 15 Random_1m (|T| = 1m, sd = 1.0)

     Wong's              Two-phase
k    GCPw   Time (s)     GCPt   Time (min)   Gap (%)
3    0.08   1.74         0.05   28           53.1
4    0.11   1.82         0.07   40           49.7
5    0.13   1.99         0.09   50           42.9
6    0.14   1.99         0.10   61           41.2
7    0.16   2.21         0.11   72           37.9
8    0.17   2.27         0.12   86           36.1
9    0.18   2.43         0.13   95           34.1
10   0.19   2.49         0.14   110          32.1
Table 13 Random_100k (|T| = 100k, sd = 1.0)

     Wong's              Two-phase
k    GCPw   Time (s)     GCPt   Time (min)   Gap (%)
3    0.31   0.20         0.21   5            47.5
4    0.35   0.18         0.26   5            36.4
5    0.38   0.26         0.29   8            32.1
6    0.40   0.24         0.31   7            26.6
7    0.41   0.23         0.34   11           22.2
8    0.42   0.28         0.36   12           18.7
9    0.43   0.28         0.38   14           14.7
10   0.44   0.34         0.39   15           13.2

Table 16 Random_1m (|T| = 1m, sd = 1.5)

     Wong's              Two-phase
k    GCPw   Time (s)     GCPt   Time (min)   Gap (%)
3    0.28   1.90         0.20   41           37.3
4    0.31   1.90         0.24   55           29.9
5    0.33   2.05         0.27   70           23.0
6    0.35   2.33         0.29   82           18.1
7    0.36   2.38         0.31   95           14.6
8    0.37   2.55         0.33   108          11.7
9    0.38   2.74         0.34   122          9.5
10   0.38   2.83         0.36   133          7.5
Table 14 Random_100k (|T| = 100k, sd = 1.5)

     Wong's              Two-phase
k    GCPw   Time (s)     GCPt   Time (min)   Gap (%)
3    0.40   0.20         0.27   3            45.9
4    0.44   0.20         0.33   5            35.7
5    0.47   0.20         0.36   8            31.6
6    0.49   0.26         0.39   8            25.0
7    0.50   0.24         0.41   9            22.8
8    0.51   0.28         0.42   12           19.6
9    0.52   0.29         0.44   14           16.8
10   0.52   0.35         0.46   15           14.5

Table 17 Random_10m (|T| = 10m, sd = 1.0)

     Wong's              Two-phase
k    GCPw   Time (s)     GCPt   Time (min)   Gap (%)
3    0.04   20.43        0.03   273          47.3
4    0.06   20.85        0.04   373          44.2
5    0.07   21.47        0.05   477          37.7
6    0.08   22.14        0.06   581          35.6
7    0.09   22.97        0.07   693          32.8
8    0.10   23.86        0.07   799          30.9
9    0.11   24.81        0.08   910          28.8
10   0.11   25.91        0.09   1024         27.0
Table 18 Random_10m (|T| = 10m, sd = 1.5)

     Wong's              Two-phase
k    GCPw   Time (s)     GCPt   Time (min)   Gap (%)
3    0.15   21.35        0.11   353          38.2
4    0.18   22.21        0.14   458          30.3
5    0.20   23.13        0.16   563          23.9
6    0.21   24.35        0.18   673          20.1
7    0.22   25.70        0.19   790          17.0
8    0.23   27.21        0.20   901          14.6
9    0.24   28.74        0.22   1009         12.5
10   0.25   30.06        0.23   1119         10.6
Fig. 11 Gap between GCP(T) (computed by our algorithm) and lower bound in dataset CENSUS
Fig. 12 Gap between GCP(T) (computed by our algorithm) and lower bound in random instances
5 Conclusions
We formulated the k-anonymity problem as an optimization problem and proposed a two-phase algorithm. The first
phase seeks an initial solution by iteratively finding minimum weight perfect matchings; the second phase improves
the (initial) solution by canceling negative cycles. In all
of the computational experiments, our algorithm obtained
smaller objective function values than Wong et al.'s algorithm, at
the expense of longer computational time. Shortening the
computation time and further decreasing the information loss
will be targeted in our future work.

Acknowledgements Part of this research is supported by the Funding
Program for World-Leading Innovative R&D on Science and Technology, Japan. We thank Professor Wong for providing us with the
programs used in our experiments.

Appendix

The gaps between GCPt and GCPunsort are computed by

Gap_sort = (GCPunsort − GCPt) / GCPt,    (12)

where GCPunsort represents the information loss of the
anonymized dataset obtained using our algorithm without
the preliminary sorting. Note that we do not compute Gap_sort
for the instances with |T| = 100k because those datasets are not
partitioned; thus, GCPunsort equals GCPt (Tables 19, 20).

Table 19 Gap_sort (Occ)

k    100k   200k   300k   400k   500k
3    –      19.3   37.7   48.1   91.9
4    –      18.7   36.3   46.5   86.4
5    –      19.2   34.8   43.7   82.7
6    –      17.3   33.3   41.8   79.4
7    –      17.0   31.9   40.8   76.5
8    –      16.6   31.0   40.3   74.4
9    –      16.7   30.0   39.8   71.7
10   –      16.5   28.9   38.0   70.7

Table 20 Gap_sort (Random)

        100k          1m            10m
k\sd    1.0    1.5    1.0    1.5    1.0     1.5
3       –      –      14.9   38.8   101.9   136.5
4       –      –      14.4   38.4   97.1    124.9
5       –      –      12.3   37.0   92.1    117.0
6       –      –      13.3   35.8   89.4    112.5
7       –      –      12.1   35.2   87.0    109.1
8       –      –      12.2   34.5   85.3    106.7
9       –      –      12.2   34.1   83.9    104.4
10      –      –      12.2   33.7   82.6    102.5
References
1. Sacharidis, D., Mouratidis, K., Papadias, D.: k-Anonymity in the
presence of external databases. IEEE Trans. Knowl. Data Eng.
22(3), 392–403 (2010)
2. Dalenius, T.: Finding a needle in a haystack or identifying anonymous census records. J. Off. Stat. 2(3), 329–336 (1986)
3. Wang, K., Yu, P.S., Chakraborty, S.: Bottom-up generalization: a
data mining solution to privacy protection. In: Fourth IEEE International Conference on Data Mining, 2004. ICDM’04, pp. 249–256
(2004)
4. Samarati, P., Sweeney, L.: Generalizing data to provide anonymity
when disclosing information. In: Proceedings of the 17th ACM
SIGMOD-SIGACT-SIGART Symposium on the Principles of
Database Systems, p. 188 (1998)
5. Fung, B., Wang, K., YU, P.: Top-down specialization for information and privacy preservation. In: Proceedings. 21st International
Conference on Data Engineering, 2005. ICDE 2005. IEEE, pp.
205–216 (2005)
6. Samarati, P.: Protecting respondents' identities in microdata release.
IEEE Trans. Knowl. Data Eng. 13(6), 1010–1027 (2001)
7. Sun, X., Li, M., Wang, H., Plank, A.: An efficient hash-based algorithm for minimal k-anonymity. In: Proceedings of the Thirty-First
Australasian Conference on Computer Science, vol. 74, pp. 101–
107 (2008)
8. Samarati, P., Sweeney, L.: Protecting privacy when disclosing
information: k-anonymity and its enforcement through generalization and suppression. Technical Report SRI-CSL-98–04, SRI
Computer Science Laboratory (1998)
9. LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Incognito: efficient
full-domain k-anonymity. In: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, vol. 21,
pp. 49–60 (2005)
10. Machanavajjhala, A., Gehrke, J., Kifer, D.: l-Diversity: privacy
beyond k-anonymity. ACM Trans. Knowl. Discov. Data 1(3), 1–52
(2007)
11. Domingo-Ferrer, J.: Microaggregation for database and location
privacy. In: Etzion, O., Kuflik, T., Motro, A. (eds.) Next Generation Information Technologies and Systems. Lecture Notes in
Computer Science, vol. 4032. Springer, Berlin, Heidelberg, pp.
106–116 (2006)
123
12. Campan, A., Truta, T.M., Miller, J., Sinca, R.A.: A clustering
approach for achieving data privacy. In: Proceedings of the International Data Mining Conference, pp. 321–330 (2007)
13. Goldberg, A.V., Tarjan, R.E.: Efficient maximum flow algorithms.
Commun. ACM 57(8), 82–89 (2014)
14. Aggarwal, G., Feder, T., Kenthapadi, K., Motwani, R., Panigrahy,
R., Thomas, D., Zhu, A.: Anonymizing tables. In: Proceedings
of the 10th International Conference on Database Theory, LNCS,
3363, pp. 246–258 (2005)
15. Sweeney, L.: Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl.
Based Syst. 10(5), 571–588 (2002)
16. Wong, W.K., Mamoulis, N., Cheung, D.W.-L.: Non-homogeneous
generalization in privacy preserving data publishing. In: The ACM
SIGMOD International Conference on Data Management (SIGMOD), pp. 747–758 (2010)
17. Murakami, K., Uno, T.: A matching model and an algorithm for
k-anonymity of large-scale data. In: Proceedings of the 15th KoreaJapan Joint Workshop on Algorithms and Computation, pp. 154–
160 (2012)
18. Shmueli, E., Tassa, T., Wasserstein, R., Shapira, B., Rokach, L.:
Limiting disclosure of sensitive data in sequential releases of
databases. Inf. Sci. 191, 98–127 (2012)
19. Shmueli, E., Tassa, T.: Privacy by diversity in sequential releases
of databases. Inf. Sci. 298, 344–372 (2015)
20. Goldberg, A.V.: An Efficient implementation of a scaling
minimum-cost flow algorithm. J. Algorithms 22(1), 1–29 (1997)
21. Hall, P.: On representatives of subsets. J. Lond. Math. Soc. 10(1),
26–30 (1935)
22. Goldberg, A.V., Tarjan, R.E.: Finding minimum-cost circulations
by canceling negative cycles. J. ACM 36(4), 873–886 (1989)
23. Sokkalingam, P.T.: New polynomial-time cycle-canceling algorithms for minimum cost flows. Ph.D. Thesis, Massachusetts
Institute of Technology, Cambridge, MA 02139, USA (1997)
24. Uno, T.: Multi-sorting algorithm for finding pairs of similar short
substrings from large-scale string data. Knowl. Inf. Syst. 25(2),
229–251 (2010)