Int. J. Inf. Secur. DOI 10.1007/s10207-017-0392-y

REGULAR CONTRIBUTION

Optimization algorithm for k-anonymization of datasets with low information loss

Keisuke Murakami (Kansai University, 3-3-35 Yamate, Suita, Osaka 564-8680, Japan; k_mura@kansai-u.ac.jp) · Takeaki Uno (National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan; uno@nii.jp)

© Springer-Verlag GmbH Germany 2017

Abstract Anonymization is the modification of data to mask the correspondence between a person and sensitive information in the data. Several anonymization models, such as k-anonymity, have been intensively studied. Recently, a new model with less information loss than existing models was proposed; it is a type of non-homogeneous generalization. In this paper, we present an alternative anonymization algorithm that further reduces the information loss using optimization techniques. We also prove that whether a modified dataset satisfies k-anonymity can be checked by a polynomial-time algorithm. Computational experiments were conducted and demonstrate the efficiency of our algorithm even on large datasets.

Keywords Security and protection · Database models · k-anonymity · Large-scale dataset · Graph problem · Optimization

1 Introduction

Recent progress in information technology has enabled the easy collection of large datasets containing information concerning the lives and personalities of individuals. These individual data, often called microdata [1] (Table 1), are useful for demographic analysis and similar tasks but carry a significant risk of unintended disclosure. Simply removing the attributes that identify individuals from microdata cannot prevent this threat, because individuals can still be identified through other attributes.

In general, datasets are classified as "public" or "private". Public datasets contain individual-identifying attribute(s) such as names, whereas private datasets contain sensitive individual information that should be dissociated from specific persons, such as income. Examples of public and private datasets are listed in Tables 2 and 3, respectively. When data of both types are disclosed, privacy cannot be protected, because each individual can be linked with his or her personal information. For example, if Tables 2 and 3 are published, each name can be linked to an income, even though there are no individual-identifying attributes in Table 3 (e.g., an attacker learns that Alan's income is 10,000 because each table associates a single individual with Zip code = 94221, Gender = M, and Country = US). Therefore, to protect individuals' privacy, it is necessary to de-identify microdata before disclosing them. In this paper, it is assumed that any user can access the public dataset.

A commonly used de-identification criterion is k-anonymity [2], which requires that the probability of an attacker identifying an anonymized record is at most 1/k [6,8,10]. In k-anonymity, a dataset contains three types of attributes.

Identifiers (IDs): These attributes identify individual records. IDs must be removed from the microdata in the private dataset. Examples of IDs are passport number, social security number, fingerprint information, and name.

Quasi-identifiers (QIDs): These attributes can link records in public datasets with those in private datasets. Linking a QID to an individual poses no problem in itself. Examples of QIDs are zip code, gender, race, and birth date.
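To make the linkage attack concrete, the following toy sketch (ours, not the paper's; the two records follow Tables 2 and 3) joins a public and a private table on their shared QIDs and re-identifies an individual whose QID combination is unique:

```python
# Toy illustration of the linkage attack described above: joining a public
# and a private table on their shared QIDs re-identifies every individual
# whose QID combination is unique in both tables.

public = [  # (name, zip_code, gender, country)
    ("Alan",   "94221", "M", "US"),
    ("Devola", "94111", "M", "Canada"),
]
private = [  # (zip_code, gender, country, income)
    ("94221", "M", "US",     10_000),
    ("94111", "M", "Canada",  3_000),
]

# Index the private records by their QID tuple, then look each public record up.
by_qid = {(z, g, c): income for z, g, c, income in private}
for name, z, g, c in public:
    income = by_qid.get((z, g, c))
    if income is not None:
        print(f"{name} is linked to income {income}")  # privacy breach
```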
Sensitive attributes (SAs): These attributes should remain strictly unassociated with specific persons. Examples of SAs are disease, income, and criminal offense.

Table 1 Microdata

  Name       Zip code  Gender  Country  Income
  Alan       94221     M       US       10,000
  Betiina    94112     F       US       5000
  Christina  94121     F       US       1500
  Devola     94111     M       Canada   3000
  Edmond     94222     M       Canada   30,000
  Flora      94122     F       UK       20,000
  Georgia    93111     M       Canada   40,000

Table 2 Public dataset

  Record ID  Name       Zip code  Gender  Country
  r1         Alan       94221     M       US
  r2         Betiina    94112     F       US
  r3         Christina  94121     F       US
  r4         Devola     94111     M       Canada
  r5         Edmond     94222     M       Canada
  r6         Flora      94122     F       UK
  r7         Georgia    93111     M       Canada

Table 3 Private dataset

  Record ID  Zip code  Gender  Country  Income
  r′1        94221     M       US       10,000
  r′2        94112     F       US       5000
  r′3        94121     F       US       1500
  r′4        94111     M       Canada   3000
  r′5        94222     M       Canada   30,000
  r′6        94122     F       UK       20,000
  r′7        93111     M       Canada   40,000

(Throughout this paper, records of the private dataset are written with primes, r′i, to distinguish them from the public records ri.)

Public and private datasets contain IDs and SAs, respectively, and both datasets contain QIDs. k-anonymization edits some of the QIDs in private datasets. k-anonymity is popularly achieved by generalization [3,4], which represents the QIDs as hierarchies. The precision of the values is reduced as one moves up the hierarchy. For example, Fig. 1a illustrates a hierarchical grouping of "Country". The cardinality of a QID value is defined as the number of its leaf descendants. For example, in Fig. 1a, the cardinalities of "America" and "Country" are 2 and 7, respectively. At the top of the hierarchy, the QID value is fully concealed. Generalization by concealing QID values (a one-level hierarchy) is demonstrated in Fig. 1b [15]. To simplify the explanation, k-anonymity is hereafter discussed in terms of the generalization that conceals QID values; however, our approach can easily be extended to the general case in which hierarchies are given (as described in Sect. 3.2.4).

Fig. 1 Hierarchy for country. a Standard generalization, b generalization by concealing the QID values
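As a small illustration of Fig. 1a, the sketch below (our own; the membership of "Asia" is an assumption, since only "America" and "Europe" are spelled out in the text) computes the cardinality of a QID value as its number of leaf descendants:

```python
# A minimal sketch (structure assumed, following Fig. 1a): a generalization
# hierarchy as nested dicts, where leaves are None. The cardinality of a
# value is the number of its leaf descendants; at the top ("Country") it is 7.

hierarchy = {
    "Country": {
        "America": {"US": None, "Canada": None},
        "Europe":  {"Italy": None, "France": None, "England": None},
        "Asia":    {"Japan": None, "China": None},   # assumed members
    }
}

def cardinality(node) -> int:
    """Count the leaf descendants of a hierarchy node (a leaf counts as 1)."""
    if node is None:
        return 1
    return sum(cardinality(child) for child in node.values())

def find(tree, value):
    """Locate the subtree rooted at `value`, searching recursively."""
    for key, child in tree.items():
        if key == value:
            return child
        if child is not None:
            sub = find(child, value)
            if sub is not False:
                return sub
    return False

print(cardinality(find(hierarchy, "America")))   # 2
print(cardinality(hierarchy["Country"]))         # 7
```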
1.1 Existing models and our contribution

There are two models of k-anonymity: the global-recoding model [5–8] and the local-recoding model [12,14]. The global-recoding model generalizes the same sites in all records. An example is presented in Table 4, where the generalized values are indicated by asterisks (*). When Tables 2 and 4 are visible, record r1 in Table 2 can correspond to either r′1 or r′5. Similarly, each record in Table 2 can be identified with at least two records in Table 4; thus, Table 4 satisfies 2-anonymity. One of the most fundamental algorithms for the global-recoding model is the incognito algorithm [9], which is based on a hill-descending approach. Initially, the anonymization is fully generalized; that is, all records have the same attribute values. The attributes are then specialized one by one until further specialization would violate k-anonymity. This approach yields a minimal k-anonymization, and all minimal k-anonymizations can be enumerated by backtracking searches.

Table 4 Global-recoding

  Record ID  Zip code  Gender  Country  Income
  r′1        9*2**     M       *        10,000
  r′2        9*1**     F       *        5000
  r′3        9*1**     F       *        1500
  r′4        9*1**     M       *        3000
  r′5        9*2**     M       *        30,000
  r′6        9*1**     F       *        20,000
  r′7        9*1**     M       *        40,000

The local-recoding model chooses the generalized sites in each record individually. Table 5 shows an example of local recoding with 2-anonymity. Note that Table 5 uses 15 asterisks, whereas the global-recoding model (Table 4) uses 28. Clearly, the local-recoding model preserves more information than the global-recoding model, and it is therefore the model adopted in this paper.

Table 5 Local-recoding

  Record ID  Zip code  Gender  Country  Income
  r′1        9422*     M       *        10,000
  r′2        941**     F       *        5000
  r′3        941**     F       *        1500
  r′4        9*111     M       Canada   3000
  r′5        9422*     M       *        30,000
  r′6        941**     F       *        20,000
  r′7        9*111     M       Canada   40,000

Table 6 Non-homogeneous

  Record ID  Zip code  Gender  Country  Income
  r′1        9422*     M       *        10,000
  r′2        941*2     F       *        5000
  r′3        941*1     F       US       1500
  r′4        9*111     M       Canada   3000
  r′5        9422*     M       *        30,000
  r′6        941**     F       *        20,000
  r′7        9*111     M       Canada   40,000

If a QID cannot be partially hidden, such as Gender and Country, its value is either wholly hidden or wholly retained. On the other hand, if a QID can be partially hidden, such as a Zip code, each digit of the QID is regarded as a pseudo-QID, and whether to hide each digit is decided independently in order to reduce the information loss (if Zip code values were always wholly hidden, the information loss would increase). For simplicity, pseudo-QIDs are hereafter referred to as QIDs. Table 3 thus contains seven QIDs, because each Zip code contains five pseudo-QIDs.

In most previous studies of local-recoding models, k-anonymity is considered satisfied when each record in the dataset has the same QID values as at least k − 1 other records; as in Table 5, only the private dataset then needs to be considered, not the public dataset. This approach is called the homogeneous approach [11,12]: the records are partitioned into groups with identical QIDs (i.e., within each group the QIDs are identically generalized). However, identical generalized QIDs are not actually required for k-anonymity. Wong et al. proposed a new approach, called non-homogeneous, which achieves k-anonymity without identical generalized QIDs [16] (see Table 6). In general, the non-homogeneous approach incurs less information loss than the homogeneous approach; for example, the number of asterisks in Table 6 is reduced to 12. On the other hand, the non-homogeneous approach is not as easy to apply as the homogeneous one, because the generalized sites (QIDs) must be determined for each record. To overcome this difficulty, Wong et al. proposed a simple algorithm that first divides the records into partitions of size between k and 2k, and then randomly generalizes the QIDs of each record in each partition. While Wong et al.'s algorithm is very fast, the small partitions and random generalization may incur high information loss in the anonymized dataset.

The present study focuses on large databases (on the order of 10 million records), whereas previous studies on local-recoding models have handled at most hundreds of thousands of records. At this scale, the anonymization time can become problematic. However, because each database requires only one anonymization, longer execution times (about one day) are considered practical. It is also assumed that an anonymized database is useful only for moderate values of k: a large k usually means high information loss, which renders the database useless for analysis.

Our proposed optimization algorithm for k-anonymity incorporates the non-homogeneous approach and techniques from graph algorithms. Its main aim is to minimize the information loss while maintaining a practical execution time; information loss is prioritized over computation time.
It is expected that this algorithm retains information better than Wong et al.'s algorithm, at the expense of a longer computation time. Nevertheless, the algorithm finishes within a few hours.

Optimization algorithms are not directly applicable to large datasets, because the size of the dataset can exceed the memory capacity of a computer, and even a polynomial-time algorithm implemented on a high-memory computer may not obtain a solution within a practical time. Therefore, we effectively reduce the size of the problem before solving it: some unpromising variables of the optimization problem are excluded, and the database is further partitioned after applying suitable sorting techniques (these details are presented in Sects. 3.2.1 and 3.2.3, respectively).

The rest of this paper is organized as follows. Section 2 models the k-anonymity problem using a bipartite graph, and Sect. 3 formulates it as an optimization problem and proposes a solution algorithm. In Sect. 4, our algorithm is compared with the existing method on large-scale datasets. Section 5 concludes the paper.

2 Graph expression of anonymization

Consider a public dataset T (Table 2), which has already been published, and a private dataset T′ (Table 3). Both datasets contain QIDs, but whereas T contains ID(s), the private dataset T′ contains SAs. Some or all of the QIDs are common to both datasets. Suppose that it is desired to release T′ with privacy protection. If the private dataset T′ were published without modification, individuals might be identified through the common QIDs, which relate T′ to T as described in the previous section. Hence, to protect individual privacy, it is necessary to anonymize dataset T′, that is, to make it k-anonymous.

Consider a bipartite graph with two vertex sets VT and VT′ representing the sets of records in T and T′, respectively [17–19]. The vertices ri ∈ VT and r′j ∈ VT′ are connected if r′j can correspond to ri, that is, if for every h, the h-th QID of r′j is hidden (replaced by an asterisk (*)) or identical to that of ri. An example is shown in Fig. 2: with Tables 2 and 6, r2 is adjacent to r′2 and r′6. In other words, for a vertex ri to be adjacent to r′j, the QID values of record r′j that differ from those of ri must be hidden. Such a bipartite graph is called an "anonymity graph".

Fig. 2 Anonymity graph
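The adjacency rule just defined is easy to state in code. The sketch below is our own (records are tuples of pseudo-QIDs, as in Sect. 1.1); it tests whether a modified private record can correspond to a public record and builds the adjacency lists of the anonymity graph for records r2/r6 of Table 2 and r′2/r′6 of Table 6:

```python
# A small sketch of the anonymity-graph edge rule: public record r_i is
# adjacent to modified private record r'_j iff every QID of r'_j is hidden
# ("*") or equal to the corresponding QID of r_i.

def compatible(public_rec, private_rec) -> bool:
    return all(p == "*" or p == q for q, p in zip(public_rec, private_rec))

def anonymity_graph(T, T_prime):
    """Return adjacency lists: for each r'_j, the indices of compatible r_i."""
    return [
        [i for i, r in enumerate(T) if compatible(r, r_prime)]
        for r_prime in T_prime
    ]

# Records r2 and r6 of Table 2, and r'_2 and r'_6 of Table 6, split into
# pseudo-QIDs (five zip digits, gender, country):
T = [tuple("94112") + ("F", "US"), tuple("94122") + ("F", "UK")]
T_prime = [tuple("941*2") + ("F", "*"), tuple("941**") + ("F", "*")]
print(anonymity_graph(T, T_prime))  # [[0, 1], [0, 1]]: both match r2 and r6
```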
2.1 Conditions for k-anonymity

To clarify k-anonymity, a stepwise discussion follows, beginning with a fundamental condition. We first introduce the following simple condition.

Condition 1 In an anonymity graph G, the modified table T′ satisfies k-anonymity if the degree of every vertex in VT and VT′ is not less than k.

Condition 1 originates from the principle that an attacker should identify an anonymized record with probability at most 1/k; it is thus a necessary condition for anonymization. However, it is insufficient for privacy protection, because individuals might still be identified. For example, Fig. 3 displays the anonymity graph when some values in T′ are hidden. Because the degree of every vertex in VT and VT′ is at least 3, the generalized dataset T′ achieves 3-anonymity under Condition 1. However, record r4 is unambiguously linked to r′4, so that individual is identified. This linkage occurs because edge (r4, r′4) is used in all perfect matchings. A perfect matching is a set of edges such that every vertex of the graph is incident to exactly one of its edges. In Fig. 3, a perfect matching is composed of seven edges, such as (r1, r′1), (r2, r′2), (r3, r′3), (r4, r′4), (r5, r′5), (r6, r′6), (r7, r′7).

Fig. 3 Biased anonymity graph

Wong et al. overcame this shortcoming of Condition 1 by postulating the following:

Definition 1 A modified table T′ has k-anonymity if and only if the anonymity graph G includes k edge-disjoint perfect matchings.

Here, edge-disjoint means that each edge is included in at most one of the k perfect matchings.

2.2 Verification of the k-anonymity condition

Wong et al. proposed Definition 1 of k-anonymity but did not show how to verify it. Here it is proven that Definition 1 can be verified in polynomial time. The verification must check the existence of k edge-disjoint perfect matchings in the anonymity graph G. By Hall's theorem [21], a bipartite graph includes k edge-disjoint perfect matchings if and only if it includes a k-regular subgraph, in which every vertex has degree k. This condition can be checked by a maximum-flow algorithm. A flow from a source s to a sink t of a directed network assigns a value to each arc such that the inflow equals the outflow at each intermediate vertex (the flow condition); the total value leaving s is called the amount of the flow, and the value on an arc cannot exceed its capacity. Intuitively, a flow is a superposition of paths from the source s to the sink t. The maximum-flow problem finds a flow of maximum amount; the maximum flow in a network with |V| vertices can be computed in O(|V|³) time [13]. The polynomiality of the verification can be stated in terms of the maximum flow.

Theorem 1 The k-anonymity of an anonymity graph G = (V, E) given by Definition 1 can be verified in O(|V|³) time.

Proof Let G̃ = (Ṽ, Ẽ) be the graph obtained from G by adding two vertices s and t, such that s is connected to all vertices in VT and t is connected to all vertices in VT′. All edges of G̃ are directed, from s to VT, from VT to VT′, and from VT′ to t, so G̃ is a directed graph. The edge capacity is set to k for edges incident to s or t, and to 1 for the other edges (see Fig. 4). We show that G̃ has a flow of amount k × |VT| if and only if G has k-anonymity.

Suppose that G satisfies k-anonymity. Because G has k edge-disjoint perfect matchings, it includes a k-regular subgraph H. Consider a flow from s to t in which the flow value is k on edges incident to s or t, and 1 on the edges of H. Since each vertex in VT and VT′ is incident to exactly k edges of H, the flow condition is satisfied. Thus, the maximum-flow amount is k × |VT|.

Conversely, suppose that the maximum amount of a flow is k × |VT|. As each edge from s to a vertex v ∈ VT carries flow k, v is incident to exactly k edges with flow value 1. As each edge from a vertex u ∈ VT′ to t also carries flow k, u is likewise incident to exactly k edges with flow value 1. Thus, the set of edges with flow value 1 forms a k-regular subgraph. Consequently, G includes k edge-disjoint perfect matchings. □

Fig. 4 Maximum-flow problem
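The construction in the proof of Theorem 1 translates directly into a few lines of code with any max-flow routine. The sketch below is ours and assumes networkx, standing in for the O(|V|³) algorithm of [13]:

```python
# A sketch of the verification in Theorem 1: add source s -> V_T and
# V_T' -> sink t edges with capacity k, give anonymity-graph edges
# capacity 1, and check whether the max-flow value equals k * |V_T|.

import networkx as nx  # assumed available; any max-flow routine would do

def satisfies_k_anonymity(adj, n_public, k) -> bool:
    """adj[j] lists the public records compatible with private record j;
    returns True iff G has k edge-disjoint perfect matchings (Definition 1)."""
    G = nx.DiGraph()
    for i in range(n_public):
        G.add_edge("s", ("u", i), capacity=k)            # s -> V_T
    for j, neighbors in enumerate(adj):
        G.add_edge(("v", j), "t", capacity=k)            # V_T' -> t
        for i in neighbors:
            G.add_edge(("u", i), ("v", j), capacity=1)   # V_T -> V_T'
    value, _ = nx.maximum_flow(G, "s", "t")
    return value == k * n_public

# The adjacency lists computed earlier form a 2-regular graph, so:
print(satisfies_k_anonymity([[0, 1], [0, 1]], 2, 2))     # True
```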
2.3 Information loss

Although the modified private dataset T′ must satisfy k-anonymity, its utility should not be lowered excessively; otherwise, the database T′ would become unusable. Utility is usually measured by the information loss, and popular measures are the normalized certainty penalty (NCP) and the global certainty penalty (GCP). This subsection presents the NCP and GCP first for the standard generalized dataset (i.e., with multilevel hierarchies such as Fig. 1a), and then for the generalized dataset in which QID values are concealed (i.e., with one-level hierarchies such as Fig. 1b).

Let A(q, r) be the attribute value at QID q in record r, and let |A(q, r)| denote the cardinality of A(q, r) (e.g., in Fig. 1a, if A(q, r) = "Europe", then |A(q, r)| = 3, because "Europe" = {Italy, France, England}). Further, A* is the QID attribute value at the top of the generalization hierarchy, and |A*| is its cardinality (in Fig. 1a, A* = "Country" and |A*| = 7). Then NCP(q, r) is given by

  NCP(q, r) = (|A(q, r)| − 1) / (|A*| − 1)   (1)

NCP(q, r) expresses the degree of generalization at QID q in record r. The degrees at all QIDs in all records of T′ are summed as follows:

  NCP(T′) = Σ_{q ∈ QIDs} Σ_{r ∈ T′} (|A(q, r)| − 1) / (|A*| − 1)   (2)

The GCP is then defined by

  GCP(T′) = NCP(T′) / (|QIDs| × |T′|)   (3)

Note that GCP(T′) is a normalized version of NCP(T′).

We now present the NCP and GCP of a dataset generalized by concealing QID values. Concealing an attribute value at QID q in record r implies that A(q, r) = A* (see Fig. 1). If the attribute value is concealed, NCP(q, r) = 1; otherwise NCP(q, r) = 0. Accordingly, NCP(T′) is the number of hidden sites in T′, and GCP(T′) is the ratio of hidden sites. Based on this, the information loss is defined as follows:

Definition 2 (Information loss under generalization by concealing QID values) Given an anonymized dataset T′, the information loss is the ratio of the number of hidden sites.

In this paper, the information loss is measured by Definition 2 (i.e., GCP(T′)).

3 Optimization approach

3.1 Formulation

Our objective is to minimize the information loss of a modified private dataset T′ with k-anonymity. Let Aij denote the set of attributes on which ri and r′j take different values. For an edge set F ⊆ VT × VT′, let N(v, F) be the set of vertices adjacent to vertex v through edges of F, and let C(r′j, F) = ∪_{u ∈ N(r′j, F)} Aju. Then the number of hidden sites in T′ (i.e., GCP(T′) up to normalization) is Σ_{r′j ∈ VT′} |C(r′j, F)|, and the problem can be written as follows:

  min_{F ⊆ VT × VT′}  Σ_{r′j ∈ VT′} |C(r′j, F)|   (4)
  s.t.  |N(v, F)| = k  for all v ∈ VT ∪ VT′   (5)
        F : decision variable   (6)

The objective function (4) gives the total weight of the edges. Here we explain the relation between the objective function and the edge weights. Let w(ri, r′j) denote the weight of the edge between ri ∈ VT and r′j ∈ VT′, defined as the increment of C(r′j, F) caused by adding (ri, r′j), i.e., w(ri, r′j) = |C(r′j, F ∪ {(ri, r′j)})| − |C(r′j, F)|. For example, in Fig. 5, when there is no adjacent vertex yet, the weights of (r2, r′1) and (r2, r′3) are 3 and 2, respectively. Constraints (5) ensure that the edge set F forms a k-regular graph. Constraint (6) states that the edge set F is the decision variable; the objective is thus to find an edge set minimizing the total weight. This optimization problem is called the k-anonymity problem.

Fig. 5 Computation of weight

3.2 Algorithm

Obviously, the weight w(ri, r′i) is 0 when records ri and r′i belong to the same individual. A perfect matching using only these edges has total weight 0. Thus, when k = 1, this perfect matching is an optimal solution with objective value 0; it is called the "identical perfect matching". Because any dataset includes the identical perfect matching, the k-anonymity problem is of interest only when k ≥ 2.
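In the concealing setting, the weight w(ri, r′j) of Sect. 3.1 counts the sites of r′j that must additionally be hidden to make the edge valid. A small sketch (our own naming) follows, together with the corresponding generalization step applied once an edge is chosen:

```python
# w(r_i, r'_j) = number of additional sites of r'_j that must be hidden so
# that r'_j becomes compatible with r_i, i.e. sites where the two records
# differ and r'_j is not hidden already.

def edge_weight(public_rec, private_rec) -> int:
    return sum(
        1 for q, p in zip(public_rec, private_rec) if p != "*" and p != q
    )

def hide(private_rec, public_rec):
    """Generalize r'_j so that it covers r_i (used when the edge is chosen)."""
    return tuple(
        p if p == "*" or p == q else "*"
        for q, p in zip(public_rec, private_rec)
    )

r1 = tuple("94221") + ("M", "US")       # public r1 of Table 2
r2p = tuple("94112") + ("F", "US")      # private r'_2, nothing hidden yet
print(edge_weight(r1, r2p))             # 4: three zip digits and the gender
print(hide(r2p, r1))                    # ('9', '4', '*', '*', '*', '*', 'US')
```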
The edge weights are dynamic rather than fixed, because the number of newly hidden sites depends on the sites of r′j ∈ VT′ that were hidden previously. For example, in Fig. 6, when r4 is adjacent to r′1 and r′3 and some of their sites are therefore already hidden, edges (r2, r′1) and (r2, r′3) have weights 2 and 1, respectively, which differ from the weights in Fig. 5. This dynamism greatly complicates the k-anonymity problem. Note that when k = 2, the edge weights are not dynamic, because only one perfect matching needs to be found (the other is the identical perfect matching).

Here we propose a heuristic algorithm for the large-scale k-anonymity problem. In our experience, complicated algorithms are impractical for large-scale problems; our algorithm therefore employs basic techniques of graph algorithms. It is a two-step process based on minimum-cost flow algorithms: an initial feasible solution is first found by a greedy technique, and it is then improved by canceling negative cycles.

Fig. 6 Dynamic weight

3.2.1 Initial feasible solution

Because the k-anonymity problem is modeled as a graph problem, it should naturally be solvable by graph algorithms. In the k-anonymity problem, the number and sites of asterisks in each record are known only once the k records adjacent to the record of interest are determined; thus, the edge weights change dynamically as described above. However, most graph algorithms presume fixed edge weights, so our algorithm tentatively fixes the weights and seeks a perfect matching. The perfect matching becomes part of the solution and is used to update the weight of each edge. This process is iterated until k perfect matchings have been found. This greedy approach (described below) cannot guarantee an exact solution, but it is known to find good solutions within a short time.

The algorithm first computes the weights of the edges when dataset T′ contains no hidden sites (*). A perfect matching is then found using these edge weights; this corresponds to solving a minimum weight perfect matching problem, for which several exact and fast algorithms exist (e.g., [20]). After solving the minimum weight perfect matching problem, the values in T′ incident to the edges of the perfect matching (the solution) are hidden. For example, in Fig. 6, if edge (r4, r′1) is included in the perfect matching, the QIDs of r′1 are modified to Zip code = 94**1, Gender = M, Country = *. The perfect matching is stored as part of the initial solution, and its edges are removed from graph G. The edge weights are then recomputed with respect to the hidden sites, and in the same way a further perfect matching is found and T′ is modified. This process is iterated k − 1 times; the identical perfect matching is found before executing the greedy algorithm, so only k − 1 further perfect matchings are required. The greedy algorithm is summarized below.

Greedy Algorithm
Step 0 Compute the initial weight of each edge w(ri, r′j), assuming no hidden sites in T′. Set the edge set F = ∅.
Step 1 Find a minimum weight perfect matching.
Step 2 Update the edge set F by adding the edges of the perfect matching (the solution).
Step 3 If F contains k − 1 perfect matchings, exit; otherwise, go to Step 4.
Step 4 Modify the dataset T′ by hiding the values in T′ corresponding to the perfect matching found in Step 1.
Step 5 Recompute the weight of each edge w(ri, r′j) on the modified dataset T′ and return to Step 1.
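The greedy phase can be sketched compactly. The version below is ours and makes simplifying assumptions: it reuses edge_weight and hide from the earlier sketches, works on the complete bipartite graph, and solves Step 1 with SciPy's linear_sum_assignment (the assignment problem is exactly a minimum weight perfect matching), whereas the paper uses Goldberg's minimum-cost-flow code [20] on a sparse candidate graph:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

BIG = 10 ** 9   # stands in for the "infinite" weight of removed edges

def greedy_anonymize(T, T_prime, k):
    """Steps 0-5: find k-1 min weight perfect matchings, hiding as we go."""
    n = len(T)
    F = set()                                    # chosen edges (Step 2)
    for _ in range(k - 1):                       # identical matching is free
        # Step 0 / Step 5: (re)compute weights against the current T'.
        W = np.array([[edge_weight(T[i], T_prime[j]) for j in range(n)]
                      for i in range(n)], dtype=float)
        for i, j in F:                           # emulate removal from G
            W[i, j] = BIG
        rows, cols = linear_sum_assignment(W)    # Step 1
        for i, j in zip(rows, cols):
            F.add((i, j))                        # Step 2
            T_prime[j] = hide(T_prime[j], T[i])  # Step 4: hide the sites
    return T_prime, F
```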
Note that the weights of edges already included in F are set to ∞, which effectively removes those edges from graph G. In Step 1, the minimum weight perfect matching problem is a special case of the minimum-cost flow problem, which can be solved by pre-flow push techniques; here the practically efficient scaling algorithm of Goldberg [20] was adopted.

Example of the greedy algorithm Consider an anonymity graph such as Fig. 7, with 7 records and 3 QIDs. Note that this anonymity graph bears no relation to the datasets in Tables 1, 2, 3, 4, 5 and 6. Step 0 computes the initial weight of each edge in Fig. 7a; the weights are shown in Fig. 7d. For example, the weight w(r1, r′2) is 1, because the second (middle) QID value of r′2 must be hidden (i.e., r′2 = 2*3) if r1 becomes adjacent to r′2. Step 1 finds a minimum weight perfect matching using the initial weights (Fig. 7d), i.e., a perfect matching as in Fig. 7b. Step 2 stores the perfect matching. Step 4 hides the QID values in T′ corresponding to the perfect matching, as in Fig. 7b. Step 5 recomputes the weight of each edge in Fig. 7b; the weights are shown in Fig. 7e. For example, the weight w(r1, r′2) is now 0, because no QID value of r′2 needs to be hidden if r1 becomes adjacent to r′2. Several weights change between Fig. 7d and e (the dynamic weights). Step 1 then finds a minimum weight perfect matching using the updated weights (Fig. 7e), i.e., a perfect matching as in Fig. 7c. These steps are iterated until k − 1 perfect matchings have been found.

Fig. 7 Example of greedy algorithm. a Initial graph, b 1-perfect matching graph, c 2-perfect matching graph, d initial weights, e updated weights

If the anonymity graph is constructed naively (i.e., G contains all edges between the two sides), the minimum weight perfect matching problem cannot be solved, because the memory and computational time requirements exceed practical limits. Therefore, only edges with small weights are considered. Ideally, the small-weight edges should be recomputed at each modification of T′; however, this would demand excessive computational time, so the candidate edges are selected only once, as follows. If an edge has low weight in the original (unmodified) dataset T′, it is selected as a promising candidate. Although these edges are not always those of an optimal solution, we believe that this strategy selects appropriate edges and reduces the number of edges effectively. Computing the edge weights on the original dataset T′ is equivalent to computing the Hamming distance between ri ∈ VT and r′j ∈ VT′; thus, the low-weight edges (small Hamming distances) can be obtained by the fast algorithm proposed by Uno [24].
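Uno's multi-sorting algorithm [24] is not reproduced here, but the following simplified sketch (ours) conveys the standard blocking idea behind fast Hamming-distance search: two records at distance at most d must agree exactly on at least one of d + 1 blocks, so bucketing records by each block yields a small candidate set to verify.

```python
# Simplified candidate-edge selection by blocking: split each record into
# d+1 blocks, bucket records by each block's exact value, and verify the
# pairs that share a bucket (pairs at Hamming distance <= d must agree on
# some block, by the pigeonhole principle).

from collections import defaultdict
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def close_pairs(records, d):
    m = len(records[0])
    bounds = [round(b * m / (d + 1)) for b in range(d + 2)]  # d+1 blocks
    found = set()
    for b in range(d + 1):
        buckets = defaultdict(list)
        lo, hi = bounds[b], bounds[b + 1]
        for idx, rec in enumerate(records):
            buckets[rec[lo:hi]].append(idx)
        for group in buckets.values():
            for i, j in combinations(group, 2):
                if (i, j) not in found and hamming(records[i], records[j]) <= d:
                    found.add((i, j))   # buckets are small, so this is cheap
    return found

recs = ["94221MUS", "94222MUS", "94122FUK"]
print(close_pairs(recs, 2))   # {(0, 1)}: the first two differ in one digit
```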
3.2.2 Improvement of the solution

The initial solution is improved by negative cycle canceling, a popular technique for solving minimum-cost flow problems [22,23]. First, a directed bipartite graph G′ = (V′, E′) is constructed over the edges used in the minimum weight perfect matching problems of our greedy algorithm: the edges in set F (the initial feasible solution) are directed from VT′ to VT, while the other edges are directed from VT to VT′. In this directed graph, the edges from VT′ to VT always represent the (current) solution. The number of hidden sites changes when a directed edge r′j ∈ VT′ → rh ∈ VT is removed from the solution and an edge ri ∈ VT → r′j ∈ VT′ is added to it, i.e., when the directions of both edges are reversed. For example, in the upper part of Fig. 8, the current solution contains the edge from r′2 to r6. When this edge is removed from the current solution and the edge from r2 to r′2 is newly added (i.e., the directions of both edges are reversed, as shown in the bottom part of Fig. 8), the number of hidden sites in record r′2 changes by −2, because Zip code = 941*2 and Country = * revert to Zip code = 94112 and Country = US.

Fig. 8 Increment of the hidden sites

These increments are regarded as weights associated with pairs of edges of the form ri ∈ VT → r′j ∈ VT′ and r′j ∈ VT′ → rh ∈ VT. A negative weight implies that the number of hidden sites decreases when the directions of both edges are reversed (i.e., when an edge of the solution is exchanged). When a sequence of such edge pairs forms a directed cycle, reversing all edges of the cycle preserves the feasibility of the solution, because the updated anonymity graph still includes a k-regular subgraph (k edge-disjoint perfect matchings). Thus, if the sum of the weights along a directed cycle is negative, reversing its edges reduces the number of hidden sites without compromising feasibility. The solution can therefore be repeatedly improved by finding negative-weight directed cycles.

Example of improving the anonymized dataset by detecting a negative cycle Consider the initial solution in Fig. 9a, where k = 2 and the number of hidden sites (*) is 9. A negative cycle is detected as in Fig. 9b (r′1 → r4 → r′2 → r3 → r′4 → r1 → r′3 → r2 → r′1), where the increments of the hidden sites at the four records of VT′ along the cycle are −3, +1, 0, and −1, respectively; their sum is −3. Hence the solution is improved (the number of hidden sites drops from 9 to 6) by reversing the edges along the negative cycle (Fig. 9c). This is repeated until no negative cycle is detected.

Fig. 9 Example of improvement of the anonymized dataset by detecting negative cycle. a Initial solution (k = 2), b negative cycle, c improved graph
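The pair weight that drives the cycle search can be sketched as follows (the helper names are our own); the usage lines reproduce the Fig. 8 example, where r′2 swaps neighbor r6 for r2 and the count of hidden sites changes by −2:

```python
# Pair weight for the improvement phase: if record r'_j drops public
# neighbor `drop` and gains `gain`, its hidden sites become the positions
# where it differs from any remaining neighbor; the pair weight is the
# change in their count (negative = improvement).

def hidden_sites(private_orig, neighbors):
    """Positions of r'_j that must be hidden to cover all matched publics."""
    return {
        pos
        for rec in neighbors
        for pos, (q, p) in enumerate(zip(rec, private_orig))
        if q != p
    }

def swap_weight(private_orig, neighbors, drop, gain):
    before = len(hidden_sites(private_orig, neighbors))
    after = len(hidden_sites(private_orig, neighbors - {drop} | {gain}))
    return after - before

r2_pub = tuple("94112") + ("F", "US")
r6_pub = tuple("94122") + ("F", "UK")
r2_orig = tuple("94112") + ("F", "US")   # unmodified private r'_2
print(swap_weight(r2_orig, frozenset({r2_pub, r6_pub}), r6_pub, r2_pub))  # -2
```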
We next explain how to find negative cycles in the anonymity graph. Several algorithms for detecting negative cycles have been proposed, but they are unsuitable for large-scale k-anonymity problems. For example, the time complexity of the Bellman–Ford algorithm, a basic algorithm for detecting negative cycles, is O(|V′| × |E′|) per negative cycle. Despite being polynomial, this algorithm cannot identify negative cycles in large-scale graphs within a reasonable computational time. Therefore, we propose a new algorithm that efficiently detects negative cycles in large-scale networks. Our algorithm is based on a labeling algorithm for the shortest path problem. Although it may not find every negative cycle, it is expected to determine negative cycles rapidly, by virtue of the fast labeling algorithm.

In a labeling algorithm, each vertex has a label, indicating the (incumbent) shortest path length from the source to that vertex, and a parent, representing the predecessor of the vertex on that path. The label and parent of each vertex are updated whenever the label (shortest path length) is improved. This process is iterated until no label improves. Ultimately, the label of each vertex represents the exact length of the shortest path from the source to that vertex; the shortest path to each vertex is found by tracking the parents back from the vertex to the source. The paths through the parents together form a tree, here called the shortest path tree.

A shortest path tree cannot be maintained in a directed graph G′ containing negative cycles (because trees are acyclic). Parent-pointer graphs that may include cycles are called shortest path graphs. A shortest path graph may consist of several connected components, of which one is a tree and the others are components containing a negative cycle. For example, the left and right parts of Fig. 10 represent a component with a negative cycle and a tree structure, respectively. The objective is to find negative cycles using the shortest path graph. When tracing upstream along the parent pointers from a vertex included in a negative cycle, we eventually return to the same vertex (and thus identify the cycle). Even when a vertex lies in a component containing a negative cycle but outside the cycle itself, the cycle can still be found: for example, vertex j in the left part of Fig. 10 is outside the negative cycle, but the cycle is entered by following the shortest path graph upstream from j. It is thus necessary to find vertices that are likely to lie in components containing negative cycles. Because the labels of such vertices are likely to be updated many times, our algorithm focuses on the update frequency of each vertex's label: when the label of a vertex has been updated several times, the shortest path graph is traced upstream from that vertex (Fig. 10). This method cannot always detect a negative cycle, but it is expected to be effective because it intensively searches promising candidate vertices. To our knowledge, this approach to detecting negative cycles in large-scale networks has not been proposed previously.

Fig. 10 Shortest path graph

Our new algorithm for detecting negative cycles proceeds as follows. Each pair of edges is regarded as a single edge whose endpoints are vertices in VT (recall that each edge pair is a sequence ri ∈ VT → r′j ∈ VT′ and r′j ∈ VT′ → rh ∈ VT). Here li is the label of vertex vi ∈ VT, wih denotes the weight of the contracted edge from vertex vi ∈ VT to vh ∈ VT, and the threshold t is defined such that, if a vertex's label is updated t times, the algorithm traces upstream from that vertex along the shortest path graph. The algorithm terminates after p iterations of Steps 1–6, where p is preset. The negative cycle detection algorithm is detailed below:

Negative Cycle Detection Algorithm
Step 0 Choose an arbitrary vertex in VT as the root and set its label to 0. Initialize the labels of all other vertices to ∞.
Step 1 Find the vertex with the smallest label and call it vi.
Step 2 For every vh adjacent to vi with li + wih < lh, update lh ← li + wih and update the shortest path graph.
Step 3 If label lh has been updated t times, trace upstream from vertex vh along the shortest path graph and reset the update frequency of vh to 0.
Step 4 If a negative cycle is detected, go to Step 5; otherwise, return to Step 1.
Step 5 Update G′ = (V′, E′) by reversing the directions of the edges on the negative cycle. Set the labels of all offspring of the vertices of the removed negative cycle to ∞.
Step 6 If the number of iterations of Steps 1–6 exceeds the preset value p, exit; otherwise, return to Step 1.
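A compact sketch of the detection idea follows (our own simplification: a queue-based labeling relaxation rather than the smallest-label selection of Step 1, and a generic weighted digraph standing in for the contracted edge-pair graph over VT):

```python
# Run a Bellman-Ford-style labeling and, once a vertex's label has improved
# t times, walk the parent pointers upstream; inside a component containing
# a negative cycle this walk revisits a vertex, exposing the cycle.
# Graph format: {u: [(v, w), ...]} with weights w possibly negative.

from collections import deque

def find_negative_cycle(graph, source, t=3):
    label = {v: float("inf") for v in graph}
    parent = {v: None for v in graph}
    hits = {v: 0 for v in graph}
    label[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v, w in graph[u]:
            if label[u] + w < label[v]:
                label[v], parent[v] = label[u] + w, u
                hits[v] += 1
                queue.append(v)
                if hits[v] >= t:                  # promising vertex: trace up
                    seen, x = {}, v
                    while x is not None and x not in seen:
                        seen[x] = True
                        x = parent[x]
                    if x is not None:             # revisited: cycle found
                        cycle, y = [x], parent[x]
                        while y != x:
                            cycle.append(y)
                            y = parent[y]
                        return cycle[::-1]
                    hits[v] = 0                   # reached the root: no cycle
    return None

g = {"a": [("b", 1)], "b": [("c", -2)], "c": [("a", -2)]}
print(find_negative_cycle(g, "a"))   # the cycle a -> b -> c -> a (rotated)
```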
3.2.3 Application of our two-phase method to large datasets

The two-phase method above finds an anonymized dataset T′. However, very large k-anonymity problems remain difficult because of memory and computational time. For example, the graph of a dataset containing 10 million records (vertices), each of which is connected by one thousand edges, includes 10 billion edges, requiring a vast memory capacity. Reducing the number of edges incident to each vertex conserves memory but degrades the solution; in some cases, no feasible solution may be found. Moreover, the computation time remains prohibitive: although the minimum weight perfect matching problem in our greedy algorithm is solved in polynomial time (namely, O(n²√m), where n and m are the numbers of vertices and edges, respectively [20]), the huge numbers of vertices and edges (n = 10,000,000 and m = 1000 × n) preclude a solution within any practical timeframe.

To overcome this difficulty, the original dataset (microdata; see Table 1) is partitioned, and several smaller-scale anonymity graphs are generated. Before partitioning, the dataset is sorted, as in Wong's algorithm. We first sort the QIDs in ascending order of the cardinality of their attribute values. For example, in Table 3, suppose the cardinalities of "Zip code", "Gender", and "Country" are 7, 2, and 3, respectively (in a real dataset, the cardinalities of the potential "Zip code" and "Country" values would be larger). The dataset is reconstructed so that the QID order is "Gender", "Country", "Zip code". Next, the records are sorted in lexicographic QID order. Finally, the sorted records are partitioned in a top-down fashion. This sorting is clearly effective because the records within a partition are likely to be similar to each other, so the information loss is expected to decrease. Experiments confirmed that, after the sorting, our algorithm derives better solutions than when the data are unsorted (see Tables 19 and 20 in the "Appendix"). In our experiments, the size of each partition is set to 100,000 (100k) records.

3.2.4 Extension to standard generalization

Our algorithm can be extended from generalization by concealing QID values to standard generalization (Fig. 1). Our model defines the weight of each edge as the increase in the number of hidden values in T′ when the edge becomes incident to its two endpoints in the directed graph G′. Under standard generalization, the weight is instead quantified by the increase in NCP(T′). To compute NCP(T′), the generalized value of each QID consistent with the edge's two endpoints is required. When the hierarchies are given in advance (as in Fig. 1a), the optimal generalized value is easily obtained by a bottom-up search through the hierarchy, which terminates as soon as the generalized value covers both endpoint values.
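The preprocessing of Sect. 3.2.3 is straightforward to express in code. The sketch below is ours; the cardinalities follow the example above, and the partition size is 3 purely for illustration (100k in the experiments):

```python
# Sort QIDs by ascending cardinality, sort records lexicographically in
# that QID order, then cut the sorted list into fixed-size partitions.

def sort_and_partition(records, cardinalities, part_size):
    order = sorted(range(len(cardinalities)), key=lambda q: cardinalities[q])
    keyed = sorted(records, key=lambda r: tuple(r[q] for q in order))
    return [keyed[i:i + part_size] for i in range(0, len(keyed), part_size)]

# Table 3 records as (zip, gender, country); assumed cardinalities 7, 2, 3,
# so the effective sort order is Gender, Country, Zip code.
rows = [("94221", "M", "US"), ("94112", "F", "US"), ("94121", "F", "US"),
        ("94111", "M", "Canada"), ("94222", "M", "Canada"),
        ("94122", "F", "UK"), ("93111", "M", "Canada")]
for part in sort_and_partition(rows, [7, 2, 3], 3):
    print(part)   # similar records (same gender/country) end up together
```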
For example, for Tables 2 and 3, the increase in NCP(T′) for the QID "Country" is computed as follows, assuming dataset T′ is completely unmodified. When r1 becomes adjacent to r′4, NCP(T′) increases by 1/6, because "Canada" is generalized to "America" in record r′4, and |America| = 2 and |Country| = 7 (see Eq. (1)). The increases in NCP(T′) are computed similarly for all QIDs and summed to yield the weight of the edge. In our algorithm, the two generalizations differ only in the weight computation, enabling an easy extension to the standard one. However, the bottom-up search increases the computational time.

4 Experiments

In this section, the experimental results obtained using our two-phase algorithm are compared with those obtained using Wong's algorithm (the implementation of Wong's algorithm was provided to us by Wong). The experiments are performed on a real dataset, CENSUS (http://www.ipums.org), and on datasets randomly generated from normal distributions. For CENSUS, we use five instances in which the number of records varies from 100,000 to 500,000 in increments of 100,000 and the number of QIDs is fixed at 8; the QIDs are described in Table 7. These datasets were used in the article [16] and were provided to us by its authors. The five cases are denoted "Occ_100k", "Occ_200k", …, "Occ_500k". In the randomly generated datasets, the standard deviation (sd) is set to 1.0 or 1.5, the number of QIDs is fixed at 10, and the number of records is 100,000, 1,000,000, or 10,000,000 (yielding six cases).

To evaluate the performance of our two-phase algorithm, a lower bound on the number of hidden sites (the objective function) is computed. In the k-anonymity problem (4)–(6), half of the constraints (5) are relaxed (omitted): the degrees of the vertices in VT are ignored. The problem then becomes:

  min_{F ⊆ VT × VT′}  Σ_{r′j ∈ VT′} |C(r′j, F)|   (7)
  s.t.  |N(v, F)| = k  for all v ∈ VT′   (8)
        F : decision variable   (9)

The optimal solution of the relaxed problem is determined by evaluating all combinations of hidden sites in each record of VT′. For example, for record r′1 in Fig. 3, the vertices (records) adjacent to r′1 are determined once part of r′1 is hidden, such as Zip code = *4221, Gender = M, Country = US. All combinations of hidden sites are considered:

  Zip code = *4221, Gender = M, Country = US,
  Zip code = 9*221, Gender = M, Country = US,
  Zip code = 94*21, Gender = M, Country = US,
  ...,
  Zip code = *****, Gender = *, Country = *.

However, evaluating all combinations is usually unnecessary, because the optimal combinations of most records are found at small numbers of hidden sites (i.e., when few QIDs are hidden, many records in VT′ are already adjacent to several vertices in VT). Thus, starting from the set with no hidden sites, the number of hidden sites is incremented by one at a time, and the new sets are examined. As soon as the degree of a record in VT′ reaches at least k, the corresponding adjacent vertices and the number of hidden QIDs constitute the optimal combination and the lower bound of that record, respectively. The lower bounds of all records in VT′ are summed to give the lower bound of the k-anonymity problem.

Table 7 QIDs in the CENSUS dataset

  QID              Cardinality
  Age              79
  Gender           2
  Education level  17
  Marital status   6
  Race             9
  Work class       10
  Country          83
  Occupation       50
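The per-record lower bound described above can be sketched as follows (our own naming; it reuses compatible from the earlier sketch, and the early exit at small h matches the observation that most records need few hidden sites):

```python
# For each private record, try hiding 0, 1, 2, ... sites and stop at the
# smallest number for which at least k public records are compatible;
# summing these minima over all records gives the relaxed bound LB.

from itertools import combinations

def min_hidden(record, publics, k):
    m = len(record)
    for h in range(m + 1):                     # number of hidden sites
        for hidden in combinations(range(m), h):
            masked = tuple("*" if pos in hidden else v
                           for pos, v in enumerate(record))
            degree = sum(compatible(pub, masked) for pub in publics)
            if degree >= k:                    # relaxed degree constraint (8)
                return h
    return m

def lower_bound(privates, publics, k):
    return sum(min_hidden(r, publics, k) for r in privates)

publics = [tuple("94221") + ("M", "US"), tuple("94222") + ("M", "Canada")]
print(min_hidden(tuple("94221") + ("M", "US"), publics, 2))  # 2 hidden sites
```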
The lower bounds of all records in VT are summed to give the lower bound of the k-anonymity problem. Tables 8, 9, 10, 11, 12, 13, 14, 15, 16, 17 and 18 list the value of the objective functions GCP obtained by our twophase algorithm and Wong’s algorithm (denoted as GCPw and GCPt , respectively), and their computational times. In addition, the gaps between GCPw and GCPt and between GCPt and the lower bound are computed by using Eqs. (10) and (11), respectively: GCPw − GCPt Gap = GCPt GCPt − L B LB_Gap = LB (10) (11) Time (min) Table 10 Occ_300k (|T | = 300k) k Wong’s Two-phase Gap (%) GCPw Time (s) GCPt Time (min) 3 0.10 0.65 0.09 13 18.8 4 0.12 0.67 0.11 17 16.9 5 0.14 0.67 0.12 22 15.6 6 0.15 0.70 0.13 26 15.3 7 0.16 0.73 0.14 29 14.7 8 0.17 0.76 0.15 34 14.4 9 0.18 0.73 0.16 37 14.0 10 0.19 0.76 0.17 42 13.2 Table 11 Occ_400k (|T | = 400k) k Wong’s Two-phase Gap (%) GCPw Time (s) GCPt Time (min) 3 0.10 0.93 0.08 15 17.2 4 0.11 0.88 0.10 20 16.1 5 0.13 0.93 0.11 27 14.3 6 0.14 0.90 0.12 31 13.9 7 0.15 0.93 0.13 38 14.0 8 0.16 0.96 0.14 42 14.3 9 0.17 1.01 0.15 49 14.5 10 0.18 1.07 0.16 54 13.3 Table 8 Occ_100k (|T | = 100k) k Wong’s Two-phase Gap (%) GCPw Time (s) GCPt Time (min) 3 0.14 0.24 0.12 4 15.1 4 0.16 0.25 0.14 5 12.9 5 0.18 0.23 0.16 6 11.9 6 0.19 0.25 0.17 8 11.6 7 0.21 0.28 0.18 10 11.0 8 0.22 0.27 0.19 10 11.1 9 0.22 0.27 0.20 12 10.6 10 0.23 0.28 0.21 13 10.5 where L B represents the lower bound. The gaps are plotted as functions of k in Figs. 11 and 12. As is evident in the tables, Wong’s algorithm executed much faster than our two-phase algorithm in all cases. Wong’s algorithm solved the problem within 30 s, even at the largest scales, whereas our two-phase algorithm consumed up to 18 h 40 min (see Table 18). However, our algorithm identified GCPs superior to those obtained by Wong’s algorithm. Although the differences appear small, the gaps exceeded 40% in some instances. In the largerscale random simulations, the gaps decrease in cases in which 123 K. Murakami, T. 
Table 12 Occ_500k (|T′| = 500k)

  k   GCPw  Time (s)  GCPt  Time (min)  Gap (%)
  3   0.09  1.12      0.08  18          18.9
  4   0.11  1.15      0.09  24          17.8
  5   0.12  1.13      0.10  31          17.2
  6   0.13  1.16      0.11  38          17.5
  7   0.14  1.18      0.12  43          17.4
  8   0.15  1.21      0.13  49          17.3
  9   0.16  1.26      0.14  54          16.7
  10  0.17  1.27      0.14  61          17.0

Table 13 Random_100k (|T′| = 100k, sd = 1.0)

  k   GCPw  Time (s)  GCPt  Time (min)  Gap (%)
  3   0.31  0.20      0.21   5          47.5
  4   0.35  0.18      0.26   5          36.4
  5   0.38  0.26      0.29   8          32.1
  6   0.40  0.24      0.31   7          26.6
  7   0.41  0.23      0.34  11          22.2
  8   0.42  0.28      0.36  12          18.7
  9   0.43  0.28      0.38  14          14.7
  10  0.44  0.34      0.39  15          13.2

Table 14 Random_100k (|T′| = 100k, sd = 1.5)

  k   GCPw  Time (s)  GCPt  Time (min)  Gap (%)
  3   0.40  0.20      0.27   3          45.9
  4   0.44  0.20      0.33   5          35.7
  5   0.47  0.20      0.36   8          31.6
  6   0.49  0.26      0.39   8          25.0
  7   0.50  0.24      0.41   9          22.8
  8   0.51  0.28      0.42  12          19.6
  9   0.52  0.29      0.44  14          16.8
  10  0.52  0.35      0.46  15          14.5

Table 15 Random_1m (|T′| = 1m, sd = 1.0)

  k   GCPw  Time (s)  GCPt  Time (min)  Gap (%)
  3   0.08  1.74      0.05   28         53.1
  4   0.11  1.82      0.07   40         49.7
  5   0.13  1.99      0.09   50         42.9
  6   0.14  1.99      0.10   61         41.2
  7   0.16  2.21      0.11   72         37.9
  8   0.17  2.27      0.12   86         36.1
  9   0.18  2.43      0.13   95         34.1
  10  0.19  2.49      0.14  110         32.1

Table 16 Random_1m (|T′| = 1m, sd = 1.5)

  k   GCPw  Time (s)  GCPt  Time (min)  Gap (%)
  3   0.28  1.90      0.20   41         37.3
  4   0.31  1.90      0.24   55         29.9
  5   0.33  2.05      0.27   70         23.0
  6   0.35  2.33      0.29   82         18.1
  7   0.36  2.38      0.31   95         14.6
  8   0.37  2.55      0.33  108         11.7
  9   0.38  2.74      0.34  122          9.5
  10  0.38  2.83      0.36  133          7.5

Table 17 Random_10m (|T′| = 10m, sd = 1.0)

  k   GCPw  Time (s)  GCPt  Time (min)  Gap (%)
  3   0.04  20.43     0.03   273        47.3
  4   0.06  20.85     0.04   373        44.2
  5   0.07  21.47     0.05   477        37.7
  6   0.08  22.14     0.06   581        35.6
  7   0.09  22.97     0.07   693        32.8
  8   0.10  23.86     0.07   799        30.9
  9   0.11  24.81     0.08   910        28.8
  10  0.11  25.91     0.09  1024        27.0

Table 18 Random_10m (|T′| = 10m, sd = 1.5)

  k   GCPw  Time (s)  GCPt  Time (min)  Gap (%)
  3   0.15  21.35     0.11   353        38.2
  4   0.18  22.21     0.14   458        30.3
  5   0.20  23.13     0.16   563        23.9
  6   0.21  24.35     0.18   673        20.1
  7   0.22  25.70     0.19   790        17.0
  8   0.23  27.21     0.20   901        14.6
  9   0.24  28.74     0.22  1009        12.5
  10  0.25  30.06     0.23  1119        10.6

When the data vary more widely, more hidden sites are required; consequently, even a randomly chosen solution is relatively likely to be feasible, and Wong's algorithm (a random approach) yields reasonable results. Our algorithm found better solutions than Wong's algorithm within a reasonable time, even on large-scale datasets; the aims of the study were therefore satisfied.

As shown in Figs. 11 and 12, the gap between GCPt and the lower bound increases with k. At large k, the lower bound departs from the optimal solution because some vertices in VT (not VT′) are likely adjacent to more than k vertices, while others are adjacent to far fewer than k vertices; in other words, the relaxed degree constraints (5) are considerably violated. In Fig. 11, the gaps remain below 0.32 even for k = 10, suggesting that our algorithm performed well on the CENSUS dataset. In Fig. 12, the gaps are larger for Random_1m_sd = 1.0 and Random_10m_sd = 1.0 than for the other datasets, reflecting that optimization problems with less-biased values (i.e., datasets with less variation) are notoriously difficult to solve.

Fig. 11 Gap between GCP(T′) (computed by our algorithm) and the lower bound on the CENSUS datasets (Occ_100k–Occ_500k)

Fig. 12 Gap between GCP(T′) (computed by our algorithm) and the lower bound on the random instances (Random_100k/1m/10m, sd = 1.0 and 1.5)
5 Conclusions

We formulated the k-anonymity problem as an optimization problem and proposed a two-phase algorithm: the first phase seeks an initial solution by iteratively finding minimum weight perfect matchings, and the second phase improves the initial solution by canceling negative cycles. In all of the computational experiments, our algorithm obtained smaller objective values than Wong et al.'s algorithm, at the expense of longer computational time. Shortening the computation time and further decreasing the information loss will be targeted in our future work.

Acknowledgements Part of this research is supported by the Funding Program for World-Leading Innovative R&D on Science and Technology, Japan. We thank Professor Wong for providing us with the programs used in our experiments.

Appendix

The gaps between GCPt and GCPunsort are computed by

  Gap_sort = (GCPunsort − GCPt) / GCPt   (12)

where GCPunsort represents the information loss of the anonymized dataset obtained using our algorithm without the preliminary sorting. Note that Gap_sort is not computed for the instances with |T′| = 100k, because those datasets are not partitioned; thus GCPunsort equals GCPt (Tables 19, 20).

Table 19 Gap_sort (Occ)

  k   100k  200k  300k  400k  500k
  3   –     19.3  37.7  48.1  91.9
  4   –     18.7  36.3  46.5  86.4
  5   –     19.2  34.8  43.7  82.7
  6   –     17.3  33.3  41.8  79.4
  7   –     17.0  31.9  40.8  76.5
  8   –     16.6  31.0  40.3  74.4
  9   –     16.7  30.0  39.8  71.7
  10  –     16.5  28.9  38.0  70.7

Table 20 Gap_sort (Random)

  k\sd  100k 1.0  100k 1.5  1m 1.0  1m 1.5  10m 1.0  10m 1.5
  3     –         –         14.9    38.8    101.9    136.5
  4     –         –         14.4    38.4     97.1    124.9
  5     –         –         12.3    37.0     92.1    117.0
  6     –         –         13.3    35.8     89.4    112.5
  7     –         –         12.1    35.2     87.0    109.1
  8     –         –         12.2    34.5     85.3    106.7
  9     –         –         12.2    34.1     83.9    104.4
  10    –         –         12.2    33.7     82.6    102.5

References

1. Sacharidis, D., Mouratidis, K., Papadias, D.: k-Anonymity in the presence of external databases. IEEE Trans. Knowl. Data Eng. 22(3), 392–403 (2010)
2. Dalenius, T.: Finding a needle in a haystack or identifying anonymous census records. J. Off. Stat. 2(3), 329–336 (1986)
3. Wang, K., Yu, P.S., Chakraborty, S.: Bottom-up generalization: a data mining solution to privacy protection. In: Fourth IEEE International Conference on Data Mining (ICDM'04), pp. 249–256 (2004)
4. Samarati, P., Sweeney, L.: Generalizing data to provide anonymity when disclosing information. In: Proceedings of the 17th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, p. 188 (1998)
5. Fung, B., Wang, K., Yu, P.: Top-down specialization for information and privacy preservation. In: Proceedings of the 21st International Conference on Data Engineering (ICDE 2005), pp. 205–216 (2005)
6. Samarati, P.: Protecting respondents' identities in microdata release. IEEE Trans. Knowl. Data Eng. 13(6), 1010–1027 (2001)
7. Sun, X., Li, M., Wang, H., Plank, A.: An efficient hash-based algorithm for minimal k-anonymity. In: Proceedings of the Thirty-First Australasian Conference on Computer Science, vol. 74, pp. 101–107 (2008)
8. Samarati, P., Sweeney, L.: Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression.
Technical Report SRI-CSL-98-04, SRI Computer Science Laboratory (1998)
9. LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Incognito: efficient full-domain k-anonymity. In: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pp. 49–60 (2005)
10. Machanavajjhala, A., Gehrke, J., Kifer, D.: l-Diversity: privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data 1(3), 1–52 (2007)
11. Domingo-Ferrer, J.: Microaggregation for database and location privacy. In: Etzion, O., Kuflik, T., Motro, A. (eds.) Next Generation Information Technologies and Systems, LNCS 4032, pp. 106–116. Springer, Berlin, Heidelberg (2006)
12. Campan, A., Truta, T.M., Miller, J., Sinca, R.A.: A clustering approach for achieving data privacy. In: Proceedings of the International Data Mining Conference, pp. 321–330 (2007)
13. Goldberg, A.V., Tarjan, R.E.: Efficient maximum flow algorithms. Commun. ACM 57(8), 82–89 (2014)
14. Aggarwal, G., Feder, T., Kenthapadi, K., Motwani, R., Panigrahy, R., Thomas, D., Zhu, A.: Anonymizing tables. In: Proceedings of the 10th International Conference on Database Theory, LNCS 3363, pp. 246–258 (2005)
15. Sweeney, L.: Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl. Based Syst. 10(5), 571–588 (2002)
16. Wong, W.K., Mamoulis, N., Cheung, D.W.-L.: Non-homogeneous generalization in privacy preserving data publishing. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 747–758 (2010)
17. Murakami, K., Uno, T.: A matching model and an algorithm for k-anonymity of large-scale data. In: Proceedings of the 15th Korea-Japan Joint Workshop on Algorithms and Computation, pp. 154–160 (2012)
18. Shmueli, E., Tassa, T., Wasserstein, R., Shapira, B., Rokach, L.: Limiting disclosure of sensitive data in sequential releases of databases. Inf. Sci. 191, 98–127 (2012)
19. Shmueli, E., Tassa, T.: Privacy by diversity in sequential releases of databases. Inf. Sci. 298, 344–372 (2015)
20. Goldberg, A.V.: An efficient implementation of a scaling minimum-cost flow algorithm. J. Algorithms 22(1), 1–29 (1997)
21. Hall, P.: On representatives of subsets. J. Lond. Math. Soc. 10(1), 26–30 (1935)
22. Goldberg, A.V., Tarjan, R.E.: Finding minimum-cost circulations by canceling negative cycles. J. ACM 36(4), 873–886 (1989)
23. Sokkalingam, P.T.: New polynomial-time cycle-canceling algorithms for minimum cost flows. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA (1997)
24. Uno, T.: Multi-sorting algorithm for finding pairs of similar short substrings from large-scale string data. Knowl. Inf. Syst. 25(2), 229–251 (2010)
