DDMWikiMain Page | About | Help | FAQ | Special pages | Log in
The Free Encyclopedia
Printable version | Disclaimers

Privacy Preserving Data Mining Bibliography

From DDMWiki

 Please follow this code demo to edit bibliography in this page. Check out bibtex template for a template of bibtex entries. Make sure you include as much information as possible in the bibtex you provided.
Enlarge
Please follow this code demo to edit bibliography in this page. Check out bibtex template for a template of bibtex entries. Make sure you include as much information as possible in the bibtex you provided.

Contents

Privacy Preserving Data Mining Philosophical Issues

• Mary J. Cronin, "e-Privacy?," HOOVER DIGEST, No. 3, 2000. Adapted from the essay "Privacy and Electronic Commerce," by Mary J. Cronin, in the new Hoover Press book Public Policy and the Internet: Privacy, Taxes, and Contract, edited by Nicholas Imparato. [link]

[abstract]

Imagine going to a shopping mall in which researchers follow you from store to store, taking notes on every product you examine or buy. Would you shop in such a place? Chances are, you already do. Welcome to the Internet.

[bib]

''this is the bibtex entry ...''

• Alan J. Broder, "Data Mining, the Internet, and Privacy," International WEBKDD’99 Workshop San Diego, CA, USA, August 15, 1999. [link]

[abstract]

This paper addresses the inherent technological conflict between the desire for privacy by World Wide Web users, and the need of Web content providers and advertisers to more fully collect and utilize data about users. As the other papers in this volume illustrate, the Data Mining community has now turned its collective attention towards the Web as a fertile venue for research and development. In doing so, Data Miners have found themselves at the nexus of this conflict.

We present the technical issues regarding privacy from two perspectives. First, from the perspective of the Web user who may be unaware of the degree to which identifying information can be inadvertently disclosed. And second, from the perspective of a Data Miner we consider the extent to which privacy enhancing technologies could substantially invalidate data mining results.

[bib]

@INCOLLECTION{Broder_99,
  AUTHOR =       "Alan J. Broder",
  TITLE =        "Data Mining, the Internet, and Privacy",
  BOOKTITLE =    "Web Usage Analysis and User Profiling: International WEBKDD'99 Workshop San Diego, CA, USA, August 15, 1999 ",
  PUBLISHER =    "Springer Berlin/Heidelberg",
  YEAR =         "1999",
  editor =       "",
  volume =       "1836/2000",
  number =       "",
  series =       "",
  type =         "",
  chapter =      "p.56",
  pages =        "",
  address =      "",
  edition =      "",
  month =        "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• EPIC: Electronic Privacy Information Center, a public interest research center in Washington, D.C. [link]

[abstract]

[bib]

	 


Privacy Preserving Data Mining Survey/General Issues

• V.S. Verykios, E. Bertino, I.N. Fovino, L.P. Provenza, Y. Saygin, and Y. Theodoridis, “State-of-the-Art in Privacy Preserving Data Mining,” ACM SIGMOD Record, vol. 3, no. 1, pp. 50-57, Mar. 2004. [link]

[abstract]

We provide here an overview of the new and rapidly emerging research area of privacy preserving data mining. We also propose a classification hierarchy that sets the basis for analyzing the work which has been performed in this context. A detailed review of the work accomplished in this area is also given, along with the coordinates of each work to the classification hierarchy. A brief evaluation is performed, and some initial conclusions are made.

[bib]

@ARTICLE{Verykios_04v,
  AUTHOR =       "V. S. Verykios and E. Bertino and I. N. Fovino and L. P. Provenza and Y. Saygin and Y. Theodoridis",
  TITLE =        "State-of-the-art in Privacy Preserving Data Mining",
  JOURNAL =      "ACM SIGMOD Record",
  YEAR =         "2004",
  volume =       "3",
  number =       "1",
  pages =        "50--57",
  month =        "March",
  note =         "",
  abstract =     "",
  keywords =     "",
}

• C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Zhu, “Tools for Privacy Preserving Distributed Data Mining,” ACM SIGKDD Explorations, vol. 4, no. 2, 2003. [link]

[abstract]

Privacy preserving mining of distributed data has numerous applications. Each application poses di�erent constraints: What is meant by privacy, what are the desired results, how is the data distributed, what are the constraints on collaboration and cooperative computing, etc. We suggest that the solution to this is a toolkit of components that can be combined for speci�c privacy-preserving data mining applications. This paper presents some components of such a toolkit, and shows how they can be used to solve several privacy-preserving data mining problems.

[bib]

@article{Clifton:03,
    author      = {C. Clifton and M. Kantarcioglu and J. Vaidya and X. Lin and M. Zhu},
    title       = "Tools for Privacy Preserving Distributed Data Mining",
    journal     = {ACM SIGKDD Explorations},
    volume      = {4},
    number      = {2},
    year        = {2003},
}

• B. Pinkas, “Cryptographic Techniques for Privacy Preserving Data Mining,” SIGKDD Explorations, vol. 4, no. 2, pp. 12-19, 2002. [link]

[abstract]

Research in secure distributed computation, which was done as part of a larger body of research in the theory of cryptography, has achieved remarkable results. It was shown that non-trusting parties can jointly compute functions of their different inputs while ensuring that no party learns anything but the defined output of the function. These results were shown using generic constructions that can be applied to any function that has an efficient representation as a circuit. We describe these results, discuss their efficiency, and demonstrate their relevance to privacy preserving computation of data mining algorithms. We also show examples of secure computation of data mining algorithms that use these generic constructions.

[bib]

@ARTICLE{Pinkas_02,
  AUTHOR =       "B. Pinkas",
  TITLE =        "Cryptographic Techniques for Privacy Preserving Data Mining",
  JOURNAL =      "SIGKDD Explorations",
  YEAR =         "2002",
  volume =       "4",
  number =       "2",
  pages =        "12--19 ",
  month =        "",
  note =         "",
  abstract =     "",
  keywords =     "",
  source =       "",
}

• Murat Kantarcio\&\#487;lu and Jiashun Jin and Chris Clifton, "When do data mining results violate privacy?," in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 2004. [link]

[abstract]

Privacy-preserving data mining has concentrated on obtaining valid results when the input data is private. An extreme example is Secure Multiparty Computation-based methods, where only the results are revealed. However, this still leaves a potential privacy breach: Do the results themselves violate privacy? This paper explores this issue, developing a framework under which this question can be addressed. Metrics are proposed, along with analysis that those metrics are consistent in the face of apparent problems.

[bib]

@inproceedings{1014126,
 author = {Murat Kantarcio\&\#487;lu and Jiashun Jin and Chris Clifton},
 title = {When do data mining results violate privacy?},
 booktitle = {KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining},
 year = {2004},
 isbn = {1-58113-888-1},
 pages = {599--604},
 location = {Seattle, WA, USA},
 doi = {http://doi.acm.org/10.1145/1014052.1014126},
 publisher = {ACM Press},
 address = {New York, NY, USA},
 }

• Shipra Agrawal, Vijay Krishnan, Jayant Haritsa, "On Addressing Efficiency Concerns in Privacy Preserving Data Mining," In Proceedings of the 9th International Conference on Database Systems for Advanced Applications (DASFAA-2004). [link]

[abstract]

Data mining services require accurate input data for their results to be meaningful, but privacy concerns may influence users to provide spurious information. To encourage users to provide correct inputs, we recently proposed a data distortion scheme for association rule mining that simultaneously provides both privacy to the user and accuracy in the mining results. However, mining the distorted database can be orders of magnitude more time-consuming as compared to mining the original database. In this paper, we address this issue and demonstrate that by (a) generalizing the distortion process to perform symbol-specific distortion, (b) appropriately choosing the distortion parameters, and (c) applying a variety of optimizations in the reconstruction process, runtime efficiencies that are well within an order of magnitude of undistorted mining can be achieved.

[bib]

''this is the bibtex entry ...''

• Chris Clifton and Murat Kantarcioglu and Jaideep Vaidya, "Defining Privacy for Data Mining," in Proceedings of the National Science Foundation Workshop on Next Generation Data Mining, November 1-3, 2002, Baltimore, MD. [link]

[abstract]

Privacy preserving data mining – getting valid data mining results without learning the underlying data values – has been receiving attention in the research community and beyond. It is unclear what privacy preserving means. This paper provides a framework and metrics for discussing the meaning of privacy preserving data mining, as a foundation for further research in this field.

[bib]

''this is the bibtex entry ...''




Additive Data Perturbation

• R. Agrawal and R. Srikant, “Privacy Preserving Data Mining,” Proc. ACM SIGMOD Conf. Management of Data, pp. 439-450, May 2000. [link]

[abstract]

A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question. Since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? We consider the concrete case of building a decision-tree classifier from training data in which the values of individual records have been perturbed. The resulting data records look very different from the original records and the distribution of data values is also very different from the original distribution. While it is not possible to accurately estimate original values in individual data records, we propose a novel reconstruction procedure to accurately estimate the distribution of original data values. By using these reconstructed distributions, we are able to build classifiers built with the original data.

[bib]

@INPROCEEDINGS{Agrawal:00,
  AUTHOR =       "R. Agrawal and R. Srikant",
  TITLE =        "Privacy Preserving Data Mining",
  BOOKTITLE =    "Proceedings of the ACM SIGMOD Conference on Management of Data",
  YEAR =         "2000",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "439--450",
  address =      "Dallas, TX",
  month =        "May",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• D. Agrawal and C. C. Aggarwal, "On the Design and Quantification of Privacy Preserving Data Mining Algorithms," Proc. ACM SIGMOD, pp. 247-255, 2001. [link]

[abstract]

The increasing ability to track and collect large amounts of data with the use of current hardware technology has lead to an interest in the development of data mining algorithms which preserve user privacy. A recently proposed technique addresses the issue of privacy preservation by perturbing the data and reconstructing distributions at an aggregate level in order to perform the mining. This method is able to re- tain privacy while accessing the information implicit in the original attributes. The distribution reconstruction process naturally leads to some loss of information which is accept- able in many practical situations. This paper discusses an Expectation Maximization (EM) algorithm for distribution reconstruction which is more effective than the currently available method in terms of the level of information loss. Specifically, we prove that the EM algorithm converges to the maximum likelihood estimate of the original distribu- tion based on the perturbed data. We show that when a large amount of data is available, the EM algorithm pro- vides robust estimates of the original distribution. We pro- pose metrics for quantification and measurement of privacy- preserving data mining algorithms. Thus, this paper pro- vides the foundations for measurement of the effectiveness of privacy preserving data mining algorithms. Our privacy metrics illustrate some interesting results on the relative ef- fectiveness of different perturbing distributions.

[bib]

@INPROCEEDINGS{Agrawal:01,
  AUTHOR =       "D. Agrawal and C. C. Aggarwal" 
  TITLE =        "On the Design and Quantification of Privacy
                   Preserving Data Mining Algorithms",
  BOOKTITLE =    "Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of Database Systems",
  YEAR =         "2001",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "247--255",
  address =      "Santa Barbara, CA",
  month =        "",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
}

• H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar, “On the Privacy Preserving Properties of Random Data Perturbation Techniques,” Proc. IEEE Int’l Conf. Data Mining, Nov. 2003. [link]

[abstract]

Privacy is becoming an increasingly important issue in many data mining applications. This has triggered the development of many privacy-preserving data mining techniques. A large fraction of them use randomized data distortion techniques to mask the data for preserving the privacy of sensitive data. This methodology attempts to hide the sensitive data by randomly modifying the data values often using additive noise. This paper questions the utility of the random value distortion technique in privacy preservation. The paper notes that random objects (particularly random matrices) have "predictable" structures in the spectral domain and it develops a random matrix-based spectral filtering technique to retrieve original data from the dataset distorted by adding random values. The paper presents the theoretical foundation of this filtering method and extensive experimental results to demonstrate that in many cases random data distortion preserve very little data privacy. The paper also points out possible avenues for the development of new privacy-preserving data mining techniques like exploiting multiplicative and colored noise for preserving privacy in data mining applications.

[bib]

@INPROCEEDINGS{Kargupta:03a,
    AUTHOR      = {H. Kargupta and S. Datta and Q. Wang and K. Sivakumar},
    TITLE       = "On the Privacy Preserving Properties of Random Data Perturbation Techniques",
    BOOKTITLE   = {{Proceedings of the IEEE International Conference on Data Mining}},
    YEAR        = {2003},
    EDITOR      = {},
    PAGES       = {99},
    VOLUME      = {},
    NUMBER      = {},
    SERIES      = {},
    ADDRESS     = {Melbourne, FL},
    MONTH       = {November},
    NOTE        = {},
    KEY         = {},
    KEYWORDS    = {},
    ISBN        = {},
    URL         = {},
    ABSTRACT    = {},
}

• K. Muralidhar and R. Sarathy, "Security of Random Data Perturbation Methods," ACM Transactions on Database Systems, Vol. 24, No. 4, December 1999, Pages 487–493. [link]

[abstract]

Statistical databases often use random data perturbation (RDP) methods to protect against disclosure of confidential numerical attributes. One of the key requirements of RDP methods is that they provide the appropriate level of security against snoopers who attempt to obtain information on confidential attributes through statistical inference. In this study, we evaluate the security provided by three methods of perturbation. The results of this study allow the database administrator to select the most effective RDP method that assures adequate protection against disclosure of confidential information.

[bib]

''this is the bibtex entry ...''

• K. Muralidhar and R. Sarathy, "A theoretical basis for perturbation methods," Statistics and Computing 13: 329–335, 2003. [link]

[abstract]

In this paper we discuss a new theoretical basis for perturbation methods. In developing this new theoretical basis, we define the ideal measures of data utility and disclosure risk. Maximum data utility is achieved when the statistical characteristics of the perturbed data are the same as that of the original data. Disclosure risk is minimized if providing users with microdata access does not result in any additional information. We show that when the perturbed values of the confidential variables are generated as independent realizations from the distribution of the confidential variables conditioned on the non-confidential variables, they satisfy the data utility and disclosure risk requirements. We also discuss the relationship between the theoretical basis and some commonly used methods for generating perturbed values of confidential numerical variables.

[bib]

''this is the bibtex entry ...''

• Alexandre Evfimievski, "Randomization in Privacy Preserving Data Mining," ACM SIGKDD Explorations Newsletter, Volume 4, Issue 2, Pages: 43-48, December 2002. [link]

[abstract]

Suppose there are many clients, each having some personal information, and one server, which is interested only in aggregate, statistically significant, properties of this information. The clients can protect privacy of their data by perturbing it with a randomization algorithm and then submitting the randomized version. The randomization algorithm is chosen so that aggregate properties of the data can be recovered with sufficient precision, while individual entries are significantly distorted. How much distortion is needed to protect privacy can be determined using a privacy measure. Several possible privacy measures are known; finding the best measure is an open question. This paper presents some methods and results in randomization for numerical and categorical data, and discusses the issue of measuring privacy.

[bib]

''this is the bibtex entry ...''

• A. Evfimevski, J. Gehrke, and R. Srikant, “Limiting Privacy Breaches in Privacy Preserving Data Mining,” Proc. ACM SIGMOD/PODS Conf., June 2003. [link]

[abstract]

There has been increasing interest in the problem of building accurate data mining models over aggregate data, while protecting privacy at the level of individual records. One approach for this problem is to randomize the values in individual records, and only disclose the randomized values. The model is then built over the randomized data, after first compensating for the randomization (at the aggregate level). This approach is potentially vulnerable to privacy breaches: based on the distribution of the data, one may be able to learn with high confidence that some of the randomized records satisfy a specified property, even though privacy is preserved on average.In this paper, we present a new formulation of privacy breaches, together with a methodology, "amplification", for limiting them. Unlike earlier approaches, amplification makes it is possible to guarantee limits on privacy breaches without any knowledge of the distribution of the original data. We instantiate this methodology for the problem of mining association rules, and modify the algorithm from [9] to limit privacy breaches without knowledge of the data distribution. Next, we address the problem that the amount of randomization required to avoid privacy breaches (when mining association rules) results in very long transactions. By using pseudorandom generators and carefully choosing seeds such that the desired items from the original transaction are present in the randomized transaction, we can send just the seed instead of the transaction, resulting in a dramatic drop in communication and storage cost. Finally, we define new information measures that take privacy breaches into account when quantifying the amount of privacy preserved by randomization.

[bib]

@INPROCEEDINGS{Evfimievski:03,
  AUTHOR =       "A. Evfimevski and J. Gehrke and R. Srikant",
  TITLE =        "Limiting Privacy Breaches in Privacy Preserving Data Mining",
  BOOKTITLE =    "Proceedings of the ACM SIGMOD/PODS Conference",
  YEAR =         "2003",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "211--222",
  address =      "San Diego, CA",
  month =        "June",
  organization = "",
  note =         "",
  abstract =     "",
  keywords =     "",
}

• C. W. Wu, "Privacy Preserving Data Mining: a signal processing perspective and a simple data perturbation protocol," 2nd Workshop on Privacy Preserving Data Mining, IEEE International Conference on Data Mining, Melbourne, FL, 2003. [link]

[abstract]

Privacy concerns over the proliferation of gathering of personal information by various institutions over the internet led to the development of data mining algorithms that preserve the privacy of those whose personal data are collected and analyzed. A novel approach to such privacy preserving data mining algorithms was proposed where the individual datum in a data set is perturbed by adding a random value from a known distribution. In these applications, the distribution of the original data set is important and estimating it is one of the goals of the data mining algorithm. This distribution is estimated via an iterative algorithmsuch as the ExpectationMaximization (EM) algorithmwhich was shown to have desirable properties such as low privacy loss and high fidelity estimates of the distribution. Each iteration of EM requires computation that is proportional to the size of the data set and can require large computation time to estimate the distribution. In this paper we propose two ways to reduce the amount of computation. First, we show that the problem can be recast as a deconvolution problem and signal processing algorithms can be applied to solve this problem. In particular we consider both a direct method and iterative methods which are more robust against noise and ill-conditioning. We show that the Richardson-Lucy deblurring algorithm is equivalent to EM after quantization. The signal processing approach also shows how the choice of perturbation affects information loss and privacy loss and allows us to clarify some points made in the literature.

In the second part of this paper, we propose a scheme for perturbing data which also has the nice properties of arbitrarily small privacy loss and arbitrarily high fidelity in the estimate. The main advantage of the proposed scheme is the simplicity of the estimation algorithm. In contrast to iterative algorithms such as EM, the proposed scheme estimates the unknown distribution in one step. This is significant in applications where the data set is very large or when the data mining algorithm is run in an online environment.

[bib]

''this is the bibtex entry ...''

• Z. Huang and W. Du and B. Chen, "Deriving Private Information from Randomized Data," in Proceedings of the 2005 ACM SIGMOD Conference, pp. 37-48, Baltimroe, MD, 2005. [link]

[abstract]

Randomization has emerged as a useful technique for data disguising in privacy-preserving data mining. Its privacy properties have been studied in a number of papers. Kar- gupta et al. challenged the randomization schemes, and they pointed out that randomization might not be able to preserve privacy. However, it is still unclear what factors cause such a security breach, how they affect the privacy preserving property of the randomization, and what kinds of data have higher risk of disclosing their private contents even though they are randomized.

We believe that the key factor is the correlations among

attributes. We propose two data reconstruction methods that are based on data correlations. One method uses the Principal Component Analysis (PCA) technique, and the other method uses the Bayes Estimate (BE) technique. We have conducted theoretical and experimental analysis on the relationship between data correlations and the amount of private information that can be disclosed based our proposed data reconstructions schemes. Our studies have shown that when the correlations are high, the original data can be re- constructed more accurately, i.e., more private information can be disclosed.

To improve privacy, we propose a modified randomization

scheme, in which we let the correlation of random noises "similar" to the original data. Our results have shown that the reconstruction accuracy of both PCA-based and BE- based schemes become worse as the similarity increases.

[bib]

@INPROCEEDINGS{Huang_05,
  AUTHOR =       "Z. Huang and W. Du and B. Chen",
  TITLE =        "Deriving Private Information from Randomized Data",
  BOOKTITLE =    "Proceedings of the 2005 ACM SIGMOD Conference",
  YEAR =         "2005",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "37--48",
  address =      "Baltimroe, MD",
  month =        "June",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• S. Guo and X. Wu, "On the Use of Spectral Filtering for Privacy Preserving Data Mining," Proceedings of the 21st ACM Symposium on Applied Computing (SAC06), Dijon, France, April 23-27, pp. 622-626, 2006. [link]

[abstract]

Randomization has been a primary tool to hide sensitive pri- vate information during privacy preserving data mining.The previous work based on spectral ¯ltering, show the noise may be separated from the perturbed data under some conditions and as a result privacy can be seriously compromised. In this paper, we explicitly assess the e®ects of perturbation on the accuracy of the estimated value and give the explicit rela- tion on how the estimation error varies with perturbation. In particular, we derive one upper bound for the Frobenius norm of reconstruction error. This upper bound may be ex- ploited by attackers to determine how close their estimates are from the original data using spectral ¯ltering technique, which imposes a serious threat of privacy breaches.

[bib]

@INPROCEEDINGS{Guo_06,
  AUTHOR =       "S. Guo and X. Wu",
  TITLE =        "On the Use of Spectral Filtering for Privacy Preserving Data Mining",
  BOOKTITLE =    "Proceedings of the 21st ACM Symposium on Applied Computing (SAC'06)",
  YEAR =         "2006",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "622--626",
  address =      "Dijon, France",
  month =        "April",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• S.Guo, X.Wu and Y. Li, "On the Lower Bound of Reconstruction Error for Spectral Filting," Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD06), Berlin, Germany, Sept 18-22, 2006. [link]

[abstract]

Additive Randomization has been a primary tool to hide sensitive private information during privacy preserving data mining. The previous work based on Spectral Filtering empirically showed that individual data can be separated from the perturbed one and as a result privacy can be seriously compromised. Our previous work initiated the theoretical study on how the estimation error varies with the noise and gave an upper bound for the Frobenius norm of reconstruction error using matrix perturbation theory. In this paper, we propose one Singular Value Decomposition (SVD) based reconstruction method and derive a lower bound for the reconstruction error. We then prove the equivalence between the Spectral Filtering based approach and the proposed SVD approach and as a result the achieved lower bound can also be considered as the lower bound of the Spectral Filtering based approach.

[bib]

@INPROCEEDINGS{,
  AUTHOR =       "S.Guo and X.Wu and Y. Li",
  TITLE =        "On the Lower Bound of Reconstruction Error for Spectral Filting",
  BOOKTITLE =    "Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06)",
  YEAR =         "2006",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "",
  address =      "Berlin, Germany",
  month =        "September",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• Islam, M. Z. and Brankovic, L. (2004): A Framework for Privacy Preserving Classification in Data Mining, In Proc. of Australasian Workshop on Data Mining and Web Intelligence (DMWI 2004), Dunedin, New Zealand, CRPIT, 32, J. Hogan, P. Montague, M. Purvis and C. Steketee, Eds., Australian Computer Science Communications, 163-168. [link]

[abstract]

Nowadays organizations all over the world are dependent on mining gigantic datasets. These datasets typically contain delicate individual information, which inevitably gets exposed to different parties. Consequently privacy issues are constantly under the limelight and the public dissatisfaction may well threaten the exercise of data mining and all its benefits. It is thus of great importance to develop adequate security techniques for protecting confidentiality of individual values used for data mining.

In the last 30 years several techniques have been proposed in the context of statistical databases. It was noticed early on that non-careful noise addition introduces biases to statistical parameters, including means, variances and covariances, and sophisticated techniques that avoid biases were developed. However, when these techniques are applied in the context of data mining, they do not appear to be bias-free. Wilson and Rosen (2002) suggest the existence of Type Data Mining (DM) bias that relates to the loss of underlying patters in the database and cannot be eliminated by preserving simple statistical parameters. In this paper we propose a noise addition framework specifically tailored towards the classification task in data mining. It builds upon some previous techniques that introduce noise to the class and the so-called innocent attributes. Our framework extends these techniques to the influential attributes; additionally, it caters for the preservation of the variances and covariances, along with patterns, thus making the perturbed dataset useful for both statistical and data mining purposes. Our preliminary experimental results indicate that data patterns are highly preserved suggesting the non-existence of DM bias.

[bib]

@inproceedings{Isla04,
  author    = {Md. Zahidul Islam and
               Ljiljana Brankovic},
  title     = {A Framework for Privacy Preserving Classification in Data
               Mining.},
  booktitle = {Proceedings of Workshop on Data Mining and Web Intelligence (DMWI2004)},
  year      = {2004},
  pages     = {163-168},
  ee        = {http://crpit.com/confpapers/CRPITV32Islam.pdf},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

• Islam, M. Z. and Brankovic, L. (2003): Noise Addition for Protecting Privacy in Data Mining, In Proceedings of the 6th Engineering Mathematics and Applications Conference (EMAC 2003), Sydney, Australia, 85-90. [link]

[abstract]

In recent years advances in technology facilitated collection and storage of vast amount of data. Many organizations, including large and small businesses, and hospitals and government bodies rely on data for day-to-day operation as well as marketing, planning and research purposes. Examples include criminal records used by law enforcement and national security agencies, medical records used for treatment and research purposes and shopping records used for marketing and enhancing business strategies. The benefits of the information extracted from such data can hardly be overestimated. For example, we are all witnessing huge progress made in the human genetic project bringing new promises of previously unimaginable treatments such as gene therapy.

However, simultaneously with the data explosion there is an uprise of anxiety about the confidentiality of delicate personal information open to potential misuses. This is not necessarily limited to data as sensitive as medical and genetic records mentioned above. Other personal information, although not as vulnerable as health records, is also considered to be confidential and as such is open to malicious exploitation. For example, detailed credit card records can be used to monitor personal habits.

The IBM Multinational Consumer Privacy Survey performed in 1999 in Germany, USA and UK illustrates public concern about privacy [6]. Most consumers (80%) feel that “consumers have lost all control over how personal information is collected and used by companies.” The majority of consumers (94%) are concerned about the possible misuse of their personal information. This survey also shows that, when it comes to the confidence that their personal information is properly handled, consumers have most trust in health care providers and banks and the least trust in credit card agencies and Internet companies.

Personal data are typically collected with the consent (presumed or otherwise) of the subject. It seems that the main public concern comes from so-called secondary use of personal information without the consent of the subject, that is, any use other than the one for which the data were originally collected. In other words, consumers feel strongly that their personal information should not be sold to other organizations without their prior consent. Indeed, the above-mentioned survey shows that over 50% of respondents have asked a company not to sell their information.

The main concern of collectors and owners of personal records is that public apprehension about privacy may result in difficulties in obtaining truthful information from individuals. Additionally, privacy concerns may lead to future laws and regulations that will restrict and constrain such data collection. In this paper we argue that it is possible to provide confidentiality of individual records and still preserve the usefulness of data for research and planning purposes.

We first note that removing names and other unique identifiers is not enough to ensure the confidentiality of personal records. Privacy invasion is possible whenever a record can be uniquely identified by using a combination of other attributes. For example, an individual, who is the only one with certain characteristics, say age of 25 and salary of 45000, may be uniquely identified and all other characteristics may be learned from other attributes in that record. Thus better techniques are needed to ensure privacy and a lot of work has been already done in the area of statistical databases (see, for example, [10,12,13]). In this paper we focus on records used for building decision trees and we develop techniques for adding noise to class and other attributes, using various probability distributions. We show that decision trees built from the perturbed records are the same or very similar to the trees built from the original records.

[bib]

@inproceedings{Isla03b,
  author    = {Md. Zahidul Islam and
               Ljiljana Brankovic},
  title     = {Noise Addition for Protecting Privacy in Data Mining},
  booktitle = {Proceedings of of the 6th Engineering Mathematics and Applications Conference (EMAC 2003)},
  year      = {2003},
  address   = {Sydney, Australia},
  pages     = {85-90},
  volume    = {2},
}

• Islam, M. Z., Barnaghi, P. M. and Brankovic, L.(2003): Measuring Data Quality: Predictive Accuracy vs. Similarity of Decision Trees, In Proceedings of the 6th International Conference on Computer & Information Technology (ICCIT 2003), Dhaka, Bangladesh, Vol.2, 457-462. [link]

[abstract]

Nowadays huge amount of data is regularly being collected for various purposes by organizations and companies, including business, government departments, and medical service providers. Data mining techniques are often used on these gigantic datasets to discover previously hidden information. Upon release of these datasets for data mining, individual sensitive and delicate information is at high risk of being exposed to unauthorised disclosure. Due to the growing public concern about privacy, many control techniques have been proposed to protect confidentiality of individual information. Some of these techniques involve perturbing datasets by adding a noise to data in some controlled fashion. The effectiveness of such techniques is typically evaluated by measuring the security and data quality of perturbed dataset. In this paper we experimentally evaluate the data quality by comparing the prediction capability of decision trees and neural networks, built from original and perturbed datasets. We then compare this evaluation technique to the one that uses logic rules associated with the decision tree classifiers.

[bib]

@inproceedings{Isla03a,
  author    = {Md. Zahidul Islam and
               Payam Mamaani Barnaghi and
               Ljiljana Brankovic},
  title     = {Measuring Data Quality: Predictive Accuracy vs. Similarity of Decision Trees},
  booktitle = {Proceedings of the 6th International Conference on Computer \& Information Technology (ICCIT 2003)},
  year      = {2003},
  address   = {Dhaka, Bangladesh},
  pages     = {457-462},
  volume    = {2},
}


Multiplicative Data Perturbation

• Kun Liu, Hillol Kargupta, and Jessica Ryan, Random Projection-based Multiplicative Perturbation for Privacy Preserving Distributed Data Mining. IEEE Transactions on Knowledge and Data Engineering (TKDE), VOL. 18, NO. 1, pages 92--106, Piscataway, NJ, January 2006. [link]

[abstract]

This paper explores the possibility of using multiplicative random projection matrices for privacy preserving distributed data mining. It specifically considers the problem of computing statistical aggregates like the inner product matrix, correlation coefficient matrix, and Euclidean distance matrix from distributed privacy sensitive data possibly owned by multiple parties. This class of problems is directly related to many other data-mining problems such as clustering, principal component analysis, and classification. This paper makes primary contributions on two different grounds. First, it explores Independent Component Analysis as a possible tool for breaching privacy in deterministic multiplicative perturbation-based models such as random orthogonal transformation and random rotation. Then, it proposes an approximate random projection-based technique to improve the level of privacy protection while still preserving certain statistical characteristics of the data. The paper presents extensive theoretical analysis and experimental results. Experiments demonstrate that the proposed technique is effective and can be successfully used for different types of privacypreserving data mining applications.

[bib]

@ARTICLE{Liu_06a,
    AUTHOR      = {Kun Liu and Hillol Kargupta and Jessica Ryan},
    TITLE       = {{Random Projection-Based Multiplicative Data Perturbation for Privacy Preserving Distributed Data Mining}},
    JOURNAL     = {{IEEE Transactions on Knowledge and Data Engineering (TKDE)}},
    YEAR        = {2006},
    VOLUME      = {18},
    NUMBER      = {1},
    PAGES       = {92--106},
    MONTH       = {January},
    NOTE        = {},
    KEY         = {},
    KEYWORDS    = {},
    ISBN        = {1041-4347},
    URL         = {http://doi.ieeecomputersociety.org/10.1109/TKDE.2006.14},
    ABSTRACT    = {},
}

• Kun Liu, Chris Giannella and Hillol Kargupta, "An Attacker's View of Distance Preserving Maps for Privacy Preserving Data Mining," in Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06), Berlin, Germany, September, 2006. [link]

[abstract]

We examine the effectiveness of distance preserving transformations in privacy preserving data mining. These techniques are potentially very useful in that some important data mining algorithms can be applied to the transformed data and produce exactly the same results as if applied to the original data e.g. distance-based clustering, k-nearest neighbor classification. However, the issue of how well the original data is hidden has, to our knowledge, not been carefully studied. We take a step in this direction by assuming the role of an attacker armed with two types of prior information regarding the original data. We examine how well the attacker can recover the original data from the transformed data and prior information. Our results offer insight into the vulnerabilities of distance preserving transformations.

[bib]

 
@INPROCEEDINGS{Liu_06b,
  AUTHOR =       "Kun Liu and Chris Giannella and Hillol Kargupta",
  TITLE =        "An Attacker's View of Distance Preserving Maps for Privacy Preserving Data Mining",
  BOOKTITLE =    "Proceedings of the 10th European Conference on Principles and
Practice of Knowledge Discovery in Databases (PKDD'06)",
  YEAR =         "2006",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "",
  address =      "Berlin, Germany",
  month =        "September",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• Hillol Kargupta, Kun Liu, and Jessica Ryan, "Privacy Sensitive Distributed Data Mining from Multi-party Data," Proceedings of the first NSF/NIJ Symposium on Intelligence and Security Informatics, (Tucson, AZ), June 2003, Lecture Notes in Computer Science, Volume 2665/2003, pp. 336-342. [link]

[abstract]

Privacy is becoming an increasingly important issue in data mining, particularly in security and counter-terrorism-related applications where the data is often sensitive. This paper considers the problem of mining privacy sensitive distributed multi-party data. It specifically considers the problem of computing statistical aggregates like the correlation matrix from privacy sensitive data where the program for computing the aggregates is not trusted by the owner(s) of the data. It presents a brief overview of a random projection-based technique to compute the correlation matrix from a single third-party data site and also multiple homogeneous sites.

[bib]

@INPROCEEDINGS{Kargupta_03,
  AUTHOR =       "Hillol Kargupta and Kun Liu and Jessica Ryan",
  TITLE =        "Privacy Sensitive Distributed Data Mining from Multi-party Data",
  BOOKTITLE =    "Proceedings of the first NSF/NIJ Symposium on Intelligence and Security Informatics",
  YEAR =         "2003",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "Lecture Notes in Computer Science",
  pages =        "336--342",
  address =      "Tucson, AZ",
  month =        "June",
  organization = "",
  publisher =    "Springer Berlin/Heidelberg",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• M. Trottini, S.E. Fienberg, U.E. Makov and M.M. Meyer, "Additive noise and multiplicative bias as disclosure limitation techniques for continuous microdata: A simulation study," Journal of Computational Methods in Sciences and Eigineering 4 (2004), pages 5-16. [link]

[abstract]

This paper focuses on a combination of two disclosure limitation techniques, additive noise and multiplicative bias, and studies their efficacy in protecting confidentiality of continuous microdata. A Bayesian intruder model is extensively simulated in order to assess the performance of these disclosure limitation techniques as a function of key parameters like the variability amongst profiles in the original data, the amount of users prior information, the amount of bias and noise introduced in the data. The results of the simulation offer insight into the degree of vulnerability of data on continuous random variables and suggests some guidelines for effective protection measures.

[bib]

@ARTICLE{Trottini_04,
    AUTHOR      = {M. Trottini and S. E. Fienberg and U. E. Makov and M. M. Meyer},
    TITLE       = {Additive noise and multiplicative bias as disclosure limitation techniques for continuous microdata: A simulation study},
    JOURNAL     = {{Journal of Computational Methods in Sciences and Engineering}},
    YEAR        = {2004},
    VOLUME      = {4},
    NUMBER      = {},
    PAGES       = {5--16},
    MONTH       = {},
    NOTE        = {},
    KEY         = {},
    KEYWORDS    = {},
    ISBN        = {},
    URL         = {},
    ABSTRACT    = {},
}

• J.J. Kim and W.E. Winkler, “Multiplicative Noise for Masking Continuous Data,” Technical Report Statistics #2003-01, Statistical Research Division, US Bureau of the Census, Washington D.C., Apr. 2003. [link]

[abstract]

To protect the identity of the persons or firms on a microdata file, noise is sometimes added to the data before releasing it to the public. There has been conjecture that, rather than adding noise, multiplying noise might better protect the confidentiality. Two forms of multiplicative noise are considered. The first approach is generating random numbers which have mean one and small variance, and multiplying the original data by the noise. The second approach is to take a logarithmic transformation, compute a covariance matrix of the transformed data, generate random number which follows mean zero and variance/covariance c times the variance/covariance computed in the previous step, add the noise to the transformed data and take an antilog of the noise added data. This paper investigates the statistical properties of both methods and shows how well they protect the identity of those on the file via reidentification trials.

[bib]

@TECHREPORT{Kim_03,
  AUTHOR =       "J. J. Kim and W. E. Winkler",
  TITLE =        "Multiplicative Noise for Masking Continuous Data",
  INSTITUTION =  "Statistical Research Division, U.S. Bureau of the Census",
  YEAR =         "2003",
  type =         "",
  number =       "Statistics \#2003-01",
  address =      "Washington D.C.",
  month =        "April",
  note =         "",
  abstract =     "",
  keywords =     "",
  source =       "",
  file = F
}

• K. Chen and L. Liu. Privacy preserving data classification with rotation perturbation. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05), pages 589–592, Houston, TX, November 2005. [link]

[abstract]

This paper presents a random rotation perturbation approach for privacy preserving data classification. Concretely, we identify the importance of classificationspecific information with respect to the loss of information factor, and present a random rotation perturbation framework for privacy preserving data classification. Our approach has two unique characteristics. First, we identify that many classification models utilize the geometric properties of datasets, which can be preserved by geometric rotation. We prove that the three types of classifiers will deliver the same performance over the rotation perturbed dataset as over the original dataset. Second, we propose a multi-column privacy model to address the problems of evaluating privacy quality for multidimensional perturbation. With this metric, we develop a local optimal algorithm to find the good rotation perturbation in terms of privacy guarantee. We also analyze both naive estimation and ICA-based reconstruction attacks with the privacy model. Our initial experiments show that the random rotation approach can provide high privacy guarantee while maintaining zero-loss of accuracy for the discussed classifiers.

[bib]

@INPROCEEDINGS{Chen_05,
  AUTHOR =       "K. Chen and L. Liu",
  TITLE =        "Privacy Preserving Data Classification with Rotation Perturbation",
  BOOKTITLE =    "Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05)",
  YEAR =         "2005",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "589--592",
  address =      "Houston, TX",
  month =        "November",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• S. R. M. Oliveira and O. R. Zaiane. Privacy preservation when sharing data for clustering. In Proceedings of the International Workshop on Secure Data Management in a Connected World, pages 67–82, Toronto, Canada, August 2004. [link]

[abstract]

In this paper, we address the problem of protecting the underlying attribute values when sharing data for clustering. The challenge is how to meet privacy requirements and guarantee valid clustering results as well. To achieve this dual goal, we propose a novel spatial data transformation method called Rotation-Based Transformation (RBT). The major features of our data transformation are: a) it is independent of any clustering algorithm, b) it has a sound mathematical foundation; c) it is efficient and accurate; and d) it does not rely on intractability hypotheses from algebra and does not require CPU-intensive operations. We show analytically that although the data are transformed to achieve privacy, we can also get accurate clustering results by the safeguard of the global distances between data points.

[bib]

@INPROCEEDINGS{Oliveira_04p,
  AUTHOR =       "S. R. M. Oliveira and O. R. {Za\"{i}ane}",
  TITLE =        "Privacy Preservation When Sharing Data For Clustering",
  BOOKTITLE =    "Proceedings of the International Workshop on Secure Data Management in a Connected World",
  YEAR =         "2004",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "67--82",
  address =      "Toronto, Canada",
  month =        "August",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}




Categorical or General Data Perturbation

• Alexandre Evfimievski, "Randomization in Privacy Preserving Data Mining," ACM SIGKDD Explorations Newsletter, Volume 4, Issue 2, Pages: 43-48, December 2002. [link]

[abstract]

Suppose there are many clients, each having some personal information, and one server, which is interested only in aggregate, statistically significant, properties of this information. The clients can protect privacy of their data by perturbing it with a randomization algorithm and then submitting the randomized version. The randomization algorithm is chosen so that aggregate properties of the data can be recovered with sufficient precision, while individual entries are significantly distorted. How much distortion is needed to protect privacy can be determined using a privacy measure. Several possible privacy measures are known; finding the best measure is an open question. This paper presents some methods and results in randomization for numerical and categorical data, and discusses the issue of measuring privacy.

[bib]

''this is the bibtex entry ...''

• A. Evfimevski, J. Gehrke, and R. Srikant, “Limiting Privacy Breaches in Privacy Preserving Data Mining,” Proc. ACM SIGMOD/PODS Conf., June 2003. [link]

[abstract]

There has been increasing interest in the problem of building accurate data mining models over aggregate data, while protecting privacy at the level of individual records. One approach for this problem is to randomize the values in individual records, and only disclose the randomized values. The model is then built over the randomized data, after first compensating for the randomization (at the aggregate level). This approach is potentially vulnerable to privacy breaches: based on the distribution of the data, one may be able to learn with high confidence that some of the randomized records satisfy a specified property, even though privacy is preserved on average.In this paper, we present a new formulation of privacy breaches, together with a methodology, "amplification", for limiting them. Unlike earlier approaches, amplification makes it is possible to guarantee limits on privacy breaches without any knowledge of the distribution of the original data. We instantiate this methodology for the problem of mining association rules, and modify the algorithm from [9] to limit privacy breaches without knowledge of the data distribution. Next, we address the problem that the amount of randomization required to avoid privacy breaches (when mining association rules) results in very long transactions. By using pseudorandom generators and carefully choosing seeds such that the desired items from the original transaction are present in the randomized transaction, we can send just the seed instead of the transaction, resulting in a dramatic drop in communication and storage cost. Finally, we define new information measures that take privacy breaches into account when quantifying the amount of privacy preserved by randomization.

[bib]

@INPROCEEDINGS{Evfimievski:03,
  AUTHOR =       "A. Evfimevski and J. Gehrke and R. Srikant",
  TITLE =        "Limiting Privacy Breaches in Privacy Preserving Data Mining",
  BOOKTITLE =    "Proceedings of the ACM SIGMOD/PODS Conference",
  YEAR =         "2003",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "211--222",
  address =      "San Diego, CA",
  month =        "June",
  organization = "",
  note =         "",
  abstract =     "",
  keywords =     "",
}

• Islam, M. Z. and Brankovic, L (2005): DETECTIVE: A Decision Tree Based Categorical Value Clustering and Perturbation Technique in Privacy Preserving Data Mining, In Proc. of the 3rd International IEEE Conference on Industrial Informatics (INDIN 2005), 10-12 August, Perth, Australia. [link]

[abstract]

Data Mining is a powerful tool for information discovery from huge datasets. Various sectors, including com-mercial, government, financial, medical, and scientific, are ap-plying Data Mining techniques on their datasets that typically contain sensitive individual information. During this process the datasets get exposed to several parties, which can poten-tially lead to disclosure of sensitive information and thus to breaches of privacy.

Several Data Mining privacy preserving techniques have been recently proposed. In this paper we focus on data pertur-bation techniques, i.e., those that add noise to the data in order to prevent exact disclosure of confidential values. Some of these techniques were designed for datasets having only nu-merical non-class attributes and a categorical class attribute. However, natural datasets are more likely to have both nu-merical and categorical non-class attributes, and occasionally they contain only categorical attributes. Noise addition tech-niques developed for numerical attributes are not suitable for such datasets, due to the absence of natural ordering among categorical values. In this paper we propose a new method for adding noise to categorical values, which makes use of the clusters that exist among these values. We first discuss several existing categorical clustering methods and point out the limi-tations they exhibit in our context. Then we present a novel approach towards clustering of categorical values and use it to perturb data while maintaining the patterns in the dataset. Our clustering approach partitions the values of a given cate-gorical attribute rather than the records of the datasets; addi-tionally, our approach operates on the horizontally partitioned dataset and it is possible for two values to belong to the same cluster in one horizontal partition of the dataset, and to two distinct clusters in another partition. Finally, we provide some experimental results in order to evaluate our perturbation technique and to compare our clustering approach with an existing method, the so-called CACTUS.

[bib]

@inproceedings{Isla05,
  author    = {Md. Zahidul Islam and
               Ljiljana Brankovic},
  title     = {DETECTIVE: A Decision Tree Based Categorical Value Clustering and Perturbation Technique in Privacy Preserving Data Mining},
  booktitle = {Proceedings of the 3rd International IEEE Conference on Industrial Informatics (INDIN 2005)},
  year      = {2005},
  address   = {Perth, Australia}
}


Data Anonymization

• Latanya Sweeney, "k-anonymity: a model for protecting privacy," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, Volume 10, Issue 5, pp. 557-570, 2002. [link]

[abstract]

Consider a data holder, such as a hospital or a bank, that has a privately held collection of person-specific, field structured data. Suppose the data holder wants to share a version of the data with researchers. How can a data holder release a version of its private data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful? The solution provided in this paper includes a formal protection model named k-anonymity and a set of accompanying policies for deployment. A release provides k-anonymity protection if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release. This paper also examines re-identification attacks that can be realized on releases that adhere to k- anonymity unless accompanying policies are respected. The k-anonymity protection model is important because it forms the basis on which the real-world systems known as Datafly, µ-Argus and k-Similar provide guarantees of privacy protection.

[bib]

@article{774552,
 author = {Latanya Sweeney},
 title = {k-anonymity: a model for protecting privacy},
 journal = {Int. J. Uncertain. Fuzziness Knowl.-Based Syst.},
 volume = {10},
 number = {5},
 year = {2002},
 issn = {0218-4885},
 pages = {557--570},
 doi = {http://dx.doi.org/10.1142/S0218488502001648},
 publisher = {World Scientific Publishing Co., Inc.},
 address = {River Edge, NJ, USA},
 }

• L. Sweeney, "Achieving k-anonymity privacy protection using generalization and suppression," International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10 (5), pp. 571-588, 2002. [link]

[abstract]

Often a data holder, such as a hospital or bank, needs to share person-specific records in such a way that the identities of the individuals who are the subjects of the data cannot be determined. One way to achieve this is to have the released records adhere to k- anonymity, which means each released record has at least (k-1) other records in the release whose values are indistinct over those fields that appear in external data. So, k- anonymity provides privacy protection by guaranteeing that each released record will relate to at least k individuals even if the records are directly linked to external information. This paper provides a formal presentation of combining generalization and suppression to achieve k-anonymity. Generalization involves replacing (or recoding) a value with a less specific but semantically consistent value. Suppression involves not releasing a value at all. The Preferred Minimal Generalization Algorithm (MinGen), which is a theoretical algorithm presented herein, combines these techniques to provide k-anonymity protection with minimal distortion. The real-world algorithms Datafly and µ-Argus are compared to MinGen. Both Datafly and µ-Argus use heuristics to make approximations, and so, they do not always yield optimal results. It is shown that Datafly can over distort data and µ-Argus can additionally fail to provide adequate protection.

[bib]

@article{774553,
 author = {Latanya Sweeney},
 title = {Achieving k-anonymity privacy protection using generalization and suppression},
 journal = {Int. J. Uncertain. Fuzziness Knowl.-Based Syst.},
 volume = {10},
 number = {5},
 year = {2002},
 issn = {0218-4885},
 pages = {571--588},
 doi = {http://dx.doi.org/10.1142/S021848850200165X},
 publisher = {World Scientific Publishing Co., Inc.},
 address = {River Edge, NJ, USA},
 }

• A. Meyerson and R. Williams, "General k-anonymization is Hard," Carnegie Mellon School of Computer Science Tech Report, 03-113, 2003. [link]

[abstract]

This is the abstract ...

[bib]

''this is the bibtex entry ...''

• K. Wang, P. S. Yu, and S. Chakraborty, "Bottom-up generalization: a data mining solution to privacy protection," in Proceedings of the 4th IEEE International Conference on Data Mining (ICDM 2004), November 2004. [link]

[abstract]

The well-known privacy-preserved data mining modifies existing data mining techniques to randomized data. In this paper, we investigate data mining as a technique for masking data, therefore, termed data mining based privacy protection. This approach incorporates partially the requirement of a targeted data mining task into the process of masking data so that essential structure is preserved in the masked data. The idea is simple but novel: we explore the data generalization concept from data mining as a way to hide detailed information, rather than discover trends and patterns. Once the data is masked, standard data mining techniques can be applied without modification. Our work demonstrated another positive use of data mining technology: not only can it discover useful patterns, but also mask private information.

We consider the following privacy problem: a data holder wants to release a version of data for building classification models, but wants to protect against linking the released data to an external source for inferring sensitive information. We adapt an iterative bottom-up generalization from data mining to generalize the data. The generalized data remains useful to classification but becomes difficult to link to other sources. The generalization space is specified by a hierarchical structure of generalizations. A key is identifying the best generalization to climb up the hierarchy at each iteration. Enumerating all candidate generalizations is impractical. We present a scalable solution that examines at most one generalization in each iteration for each attribute involved in the linking.

[bib]

@inproceedings{WYC04icdm,
  author = "K. Wang and P. S. Yu and S. Chakraborty",
  title = "Bottom-up generalization: a data mining solution to privacy protection",
  booktitle = "Proc. of the 4th IEEE International Conference on Data Mining (ICDM 2004)",
  month = "November",
  year = "2004",
}

• Roberto J. Bayardo and Rakesh Agrawal, "Data Privacy Through Optimal k-Anonymization," in Proceedings of the 21st International Conference on Data Engineering (ICDE'05) - Volume 00, pp. 217-228, 2005. [link]

[abstract]

Data de-identification reconciles the demand for release of data for research purposes and the demand for privacy from individuals. This paper proposes and evaluates an optimization algorithm for the powerful de-identification procedure known as k-anonymization. A k-anonymized dataset has the property that each record is indistinguishable from at least k - 1 others. Even simple restrictions of optimized k-anonymity are NP-hard, leading to significant computational challenges. We present a new approach to exploring the space of possible anonymizations that tames the combinatorics of the problem, and develop data-management strategies to reduce reliance on expensive operations such as sorting. Through experiments on real census data, we show the resulting algorithm can find optimalk-anonymizations under two representative cost measures and a wide range of k. We also show that the algorithm can produce good anonymizations in circumstances where the input data or input parameters preclude finding an optimal solution in reasonable time. Finally, we use the algorithm to explore the effects of different coding approaches and problem variations on anonymization quality and performance. To our knowledge, this is the first result demonstrating optimal k-anonymization of a nontrivial dataset under a general model of the problem.

[bib]

@inproceedings{1054048,
 author = {Roberto J. Bayardo and Rakesh Agrawal},
 title = {Data Privacy through Optimal k-Anonymization},
 booktitle = {ICDE '05: Proceedings of the 21st International Conference on Data Engineering (ICDE'05)},
 year = {2005},
 isbn = {0-7695-2285-8},
 pages = {217--228},
 doi = {http://dx.doi.org/10.1109/ICDE.2005.42},
 publisher = {IEEE Computer Society},
 address = {Washington, DC, USA},
 }

• Sheng Zhong and Zhiqiang Yang and Rebecca N. Wright, "Privacy-enhancing k-anonymization of customer data," in Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 139--147, 2005. [link]

[abstract]

In order to protect individuals' privacy, the technique of k-anonymization has been proposed to de-associate sensitive attributes from the corresponding identifiers. In this paper, we provide privacy-enhancing methods for creating k-anonymous tables in a distributed scenario. Specifically, we consider a setting in which there is a set of customers, each of whom has a row of a table, and a miner, who wants to mine the entire table. Our objective is to design protocols that allow the miner to obtain a k-anonymous table representing the customer data, in such a way that does not reveal any extra information that can be used to link sensitive attributes to corresponding identifiers, and without requiring a central authority who has access to all the original data. We give two different formulations of this problem, with provably private solutions. Our solutions enhance the privacy of k-anonymization in the distributed scenario by maintaining end-to-end privacy from the original customer data to the final k-anonymous results.

[bib]

@inproceedings{1065185,
 author = {Sheng Zhong and Zhiqiang Yang and Rebecca N. Wright},
 title = {Privacy-enhancing k-anonymization of customer data},
 booktitle = {PODS '05: Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems},
 year = {2005},
 isbn = {1-59593-062-0},
 pages = {139--147},
 location = {Baltimore, Maryland},
 doi = {http://doi.acm.org/10.1145/1065167.1065185},
 publisher = {ACM Press},
 address = {New York, NY, USA},
 }

• B. C. M. Fung, K. Wang, and P. S. Yu, "Top-down specialization for information and privacy preservation," in Proceedings of the 21st IEEE International Conference on Data Engineering (ICDE 2005), Tokyo, Japan, April 2005, pp. 205-216. [link]

[abstract]

Releasing person-specific data in its most specific state poses a threat to individual privacy. This paper presents a practical and efficient algorithm for determining a generalized version of data that masks sensitive information and remains useful for modelling classification. The generalization of data is implemented by specializing or detailing the level of information in a top-down manner until a minimum privacy requirement is violated. This top-down specialization is natural and efficient for handling both categorical and continuous attributes. Our approach exploits the fact that data usually contains redundant structures for classification. While generalization may eliminate some structures, other structures emerge to help. Our results show that quality of classification can be preserved even for highly restrictive privacy requirements. This work has great applicability to both public and private sectors that share information for mutual benefits and productivity.

[bib]

@inproceedings{FWY05icde,
  author = "B. C. M. Fung and K. Wang and P. S. Yu",
  title = "Top-Down Specialization for Information and Privacy Preservation",
  booktitle = "Proc. of the 21st IEEE International Conference on Data Engineering (ICDE 2005)",
  pages = "205-216",
  address = "Tokyo, Japan",
  month = "April",
  year = "2005",
}

• K. Wang, B. C. M. Fung, and G. Dong, "Integrating private databases for data analysis," in Proceedings of the 2005 IEEE International Conference on Intelligence and Security Informatics (ISI 2005), Atlanta, GA, May 2005, pp. 171-182. [link]

[abstract]

In today’s globally networked society, there is a dual demand on both information sharing and information protection. A typical scenario is that two parties wish to integrate their private databases to achieve a common goal beneficial to both, provided that their privacy requirements are satisfied. In this paper, we consider the goal of building a classifier over the integrated data while satisfying the k-anonymity privacy requirement. The k-anonymity requirement states that domain values are generalized so that each value of some specified attributes identifies at least k records. The generalization process must not leak more specific information other than the final integrated data. We present a practical and efficient solution to this problem.

[bib]

@inproceedings{WFD05isi,
  author = "K. Wang and B. C. M. Fung and G. Dong",
  title = "Integrating Private Databases for Data Analysis",
  booktitle = "Proc. of the 2005 IEEE International Conference on Intelligence and Security Informatics (ISI 2005)",
  series = "Lecture Notes in Computer Science",
  volume = "3495",
  pages = "171-182",
  address = "Atlanta, GA",
  month = "May",
  year = "2005",
}

• K. Wang, B. C. M. Fung, and P. S. Yu, "Template-based privacy preservation in classification problems," in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005), Houston, TX, November 2005, pp. 466-473. [link]

[abstract]

In this paper, we present a template-based privacy preservation to protect against the threats caused by data mining abilities. The problem has dual goals: preserve the information for a wanted classification analysis and limit the usefulness of unwanted sensitive inferences that may be derived from the data. Sensitive inferences are specified by a set of "privacy templates". Each template specifies the sensitive information to be protected, a set of identifying attributes, and the maximum association between the two. We show that suppressing the domain values is an effective way to eliminate sensitive inferences. For a large data set, finding an optimal suppression is hard, since it requires optimization over all suppressions. We present an approximate but scalable solution. We demonstrate the effectiveness of this approach on real life data sets.

[bib]

@inproceedings{WFY05icdm,
  author = "K. Wang and B. C. M. Fung and P. S. Yu",
  title = "Template-Based Privacy Preservation in Classification Problems",
  booktitle = "Proc. of the 5th IEEE International Conference on Data Mining (ICDM 2005)",
  address = "Houston, TX",
  pages = "466-473",
  month = "November",
  year = "2005",
}

• M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi, "k-Anonymous Patterns," In Proceedings of the Ninth European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'05) Lecture Notes in Computer Science, Volume 3721, Springer. October 3-7, 2005, Porto, Portugal. [link]

[abstract]

It is generally believed that data mining results do not violate the anonymity of the individuals recorded in the source database. In fact, data mining models and patterns, in order to ensure a required statistical significance, represent a large number of individuals and thus conceal individual identities: this is the case of the minimum support threshold in association rule mining. In this paper we show that this belief is ill-founded. By shifting the concept of k-anonymity from data to patterns, we formally characterize the notion of a threat to anonymity in the context of pattern discovery, and provide a methodology to ef�ciently and effectively identify all possible such threats that might arise from the disclosure of a set of extracted patterns.

[bib]

@INPROCEEDINGS{Atzori_05a,
  AUTHOR =       "M. Atzori, F. Bonchi, F. Giannotti and D. Pedreschi",
  TITLE =        "k-Anonymous Patterns",
  BOOKTITLE =    "Proceedings of the Ninth European Conference on Principles
                  and Practice of Knowledge Discovery in Databases (PKDD'05)",
  YEAR =         "2005",
  editor =       "",
  volume =       "3721",
  number =       "",
  series =       "Lecture Notes in Computer Science, Springer",
  pages =        "",
  address =      "Porto, Portugal",
  month =        "October",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi, "Blocking Anonymity Threats Raised by Frequent Itemset Mining," In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05), IEEE. November 27-30, 2005, New Orleans, Louisiana. [link]

[abstract]

In this paper we study when the disclosure of data min- ing results represents, per se, a threat to the anonymity of the individuals recorded in the analyzed database. The novelty of our approach is that we focus on an objective definition of privacy compliance of patterns without any reference to a preconceived knowledge of what is sensi- tive and what is not, on the basis of the rather intuitive and realistic constraint that the anonymity of individu- als should be guaranteed. In particular, the problem ad- dressed here arises from the possibility of inferring from the output of frequent itemset mining (i.e., a set of item- sets with support larger than a threshold \sigma), the exis- tence of patterns with very low support (smaller than an anonymity threshold k). In the following we develop a simple methodology to block such inference opportuni- ties by introducing distortion on the dangerous patterns.

[bib]

@INPROCEEDINGS{Atzori_05b,
  AUTHOR =       "M. Atzori, F. Bonchi, F. Giannotti and D. Pedreschi",
  TITLE =        "Blocking Anonymity Threats Raised by Frequent Itemset Mining",
  BOOKTITLE =    "Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05)",
  YEAR =         "2005",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "",
  address =      "",
  month =        "November",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, Muthuramakrishnan Venkitasubramaniam, "l -Diversity: Privacy Beyond k-Anonymity," p. 24, 22nd International Conference on Data Engineering (ICDE'06), 2006. [link]

[abstract]

Publishing data about individuals without revealing sensitive information about them is an important problem. In recent years, a new definition of privacy called k-anonymity has gained popularity. In a k-anonymized dataset, each record is indistinguishable from at least k−1 other records with respect to certain “identifying” attributes. In this paper we show with two simple attacks that a k-anonymized dataset has some subtle, but severe privacy problems. First, we show that an attacker can discover the values of sensitive attributes when there is little diversity in those sensitive attributes. Second, attackers often have background knowledge, and we show that k-anonymity does not guarantee privacy against attackers using background knowledge. We give a detailed analysis of these two attacks and we propose a novel and powerful privacy defi- nition called l-diversity. In addition to building a formal foundation for l-diversity, we show in an experimental evaluation that ℓ-diversity is practical and can be implemented efficiently.

[bib]

@inproceedings{Machanavajjhala_06,
  author    = {Ashwin Machanavajjhala and
               Johannes Gehrke and
               Daniel Kifer and
               Muthuramakrishnan Venkitasubramaniam},
  title     = {l-Diversity: Privacy Beyond k-Anonymity.},
  booktitle = {Proceedings of the 22nd International Conference on Data
               Engineering (ICDE'06)},
  year      = {2006},
  pages     = {24},
  ee        = {http://doi.ieeecomputersociety.org/10.1109/ICDE.2006.1},
  crossref  = {DBLP:conf/icde/2006},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

• K. Wang and B. C. M. Fung, "Anonymizing sequential releases," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006), Philadelphia, PA, August 2006, pp.414-423. [link]

[abstract]

An organization makes a new release as new information become available, releases a tailored view for each data request, releases sensitive information and identifying information separately. The availability of related releases sharpens the identification of individuals by a global quasi-identifier consisting of attributes from related releases. Since it is not an option to anonymize previously released data, the current release must be anonymized to ensure that a global quasi-identifier is not effective for identification. In this paper, we study the sequential anonymization problem under this assumption. A key question is how to anonymize the current release so that it cannot be linked to previous releases yet remains useful for its own release purpose. We introduce the lossy join, a negative property in relational database design, as a way to hide the join relationship among releases, and propose a scalable and practical solution.

[bib]

@inproceedings{WF06kdd,
  author = "K. Wang and B. C. M. Fung",
  title = "Anonymizing Sequential Releases",
  booktitle = "Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006)",
  pages = "414-423",
  address = "Philadelphia, PA",
  month = "August",
  year = "2006",
}

• T. M. Truta and B. Vinay, “Privacy Protection: p-Sensitive k-Anonymity Property,” International Workshop of Privacy Data Management (PDM 2006), In Conjunction with 22th International Conference of Data Engineering (ICDE 2006), Atlanta, April 2006. [link]

[abstract]

In this paper, we introduce a new privacy protection property called p-sensitive k-anonymity. The existing k-anonymity property protects against identity disclosure, but it fails to protect against attribute disclosure. The new introduced privacy model avoids this shortcoming. Two necessary conditions to achieve p-sensitive k-anonymity property are presented, and used in developing algorithms to create masked microdata with p-sensitive k-anonymity property using generalization and suppression.

[bib]

@inproceedings{PDM06ICDE,
  author = " T. M. Truta and B. Vinay",
  title = " Privacy Protection: p-Sensitive k-Anonymity Property",
  booktitle = "Proc. of the ICDE 2006 Workshops - 22nd International Conference on Data Engineering Workshops (ICDE 2006)”,
  pages = "94-94",
  address = "Atlanta, GA",
  month = "April",
  year = "2006",
}

• A. Friedman, A. Schuster and R. Wolff, “k-Anonymous Decision Tree Induction”, In Proceedings of the Tenth European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06) Lecture Notes in Computer Science, Volume 4213, Springer. September 2006, Berlin, Germany. [link]

[abstract]

In this paper we explore an approach to privacy preserving data mining that relies on the k-anonymity model. The k-anonymity model guarantees that no private information in a table can be linked to a group of less than k individuals. We suggest extended definitions of k-anonymity that allow the k-anonymity of a data mining model to be determined. Using these definitions, we present decision tree induction algorithms that are guaranteed to maintain k-anonymity of the learning examples. Experiments show that embedding anonymization within the decision tree induction process provides better accuracy than anonymizing the data first and inducing the tree later.

[bib]

@INPROCEEDINGS{FriedmanPKDD06,
  AUTHOR =       "A. Friedman, A. Schuster and R. Wolff",
  TITLE =        "k-Anonymous Decision Tree Induction",
  BOOKTITLE =    "Proceedings of the Tenth European Conference on Principles
                  and Practice of Knowledge Discovery in Databases (PKDD'06)",
  YEAR =         "2006",
  volume =       "4213",
  series =       "Lecture Notes in Computer Science, Springer",
  pages =        "151-162",
  address =      "Berlin, Germany",
  month =        "September",
}

• K. Wang, B. C. M. Fung, and P. S. Yu, "Handicapping attacker's confidence: an alternative to k-anonymization," in Knowledge and Information Systems: An International Journal (KAIS), 11(3):345-368, April 2007. [link]

[abstract]

We present an approach of limiting the confidence of inferring sensitive properties to protect against the threats caused by data mining abilities. The problem has dual goals: preserve the information for a wanted data analysis request and limit the usefulness of unwanted sensitive inferences that may be derived from the release of data. Sensitive inferences are specified by a set of "privacy templates". Each template specifies the sensitive property to be protected, the attributes identifying a group of individuals, and a maximum threshold for the confidence of inferring the sensitive property given the identifying attributes. We show that suppressing the domain values monotonically decreases the maximum confidence of such sensitive inferences. Hence, we propose a data transformation that minimally suppresses the domain values in the data to satisfy the set of privacy templates. The transformed data is free of sensitive inferences even in the presence of data mining algorithms. The prior k-anonymization focuses on personal identities. This work focuses on the association between personal identities and sensitive properties.

[bib]

@article{WFY07kais,
  author = "K. Wang and B. C. M. Fung and P. S. Yu",
  title = "Handicapping Attacker's Confidence: An Alternative to k-Anonymization",
  journal = "Knowledge and Information Systems: An International Journal (KAIS)",
  publisher = "Springer-Verlag",
  volume = "11",
  number = "3",
  pages = "345--368",
  month = "April",
  year = "2007",
}

• B. C. M. Fung, K. Wang, and P. S. Yu, "Anonymizing Classification Data for Privacy Preservation," IEEE Transactions on Knowledge and Data Engineering (TKDE), 19(5):711-725, May 2007. [link]

[abstract]

Classification is a fundamental problem in data analysis. Training a classifier requires accessing a large collection of data. Releasing person-specific data, such as customer data or patient records, may pose a threat to individual's privacy. Even after removing explicit identifying information such as Name and SSN, it is still possible to link released records back to their identities by matching some combination of non-identifying attributes such as {Sex, Zip, Birthdate}. A useful approach to combat such linking attacks, called k-anonymization, is anonymizing the linking attributes so that at least k released records match each value combination of the linking attributes. Previous work attempted to find an optimal k-anonymization that minimizes some data distortion metric. We argue that minimizing the distortion to the training data is not relevant to the classification goal that requires extracting the structure of predication on the "future" data.

In this paper, we propose a k-anonymization solution for classification. Our goal is to find a k-anonymization, not necessarily optimal in the sense of minimizing data distortion, that preserves the classification structure. We conducted intensive experiments to evaluate the impact of anonymization on the classification on future data. Experiments on real life data show that the quality of classification can be preserved even for highly restrictive anonymity requirements.

[bib]

@article{FWY07tkde,
    author = "B. C. M. Fung and Ke Wang and P. S. Yu",
    title = "Anonymizing Classification Data for Privacy Preservation",
    journal = "IEEE Transactions on Knowledge and Data Engineering (TKDE)",
    volume = "19",
    number = "5",
    pages = "711--725",
    month = "May",
    year = "2007",
}

• Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian, "t-closeness: Privacy Beyond k-Anonymity and l-Diversity", 2007 IEEE International Conference on Data Engineering", Apr 2007. [link]

[abstract]

The $k$-anonymity privacy requirement for publishing microdata requires that each equivalence class (i.e., a set of records that are indistinguishable from each other with respect to certain identifying attributes) contains at least $k$ records. Recently, several authors have recognized that $k$-anonymity cannot prevent attribute disclosure. The notion of $\ell$-diversity has been proposed to address this; $\ell$-diversity requires that each equivalence class has at least $\ell$ well-represented values for each sensitive attribute.

In this paper we show that $\ell$-diversity has a number of limitations. In particular, it is neither necessary nor sufficient to prevent attribute disclosure. We propose a novel privacy notion called $t$-closeness, which requires that the distribution of a sensitive attribute in any equivalence class is close to the distribution of the attribute in the overall table (i.e., the distance between the two distributions should be no more than a threshold $t$). We choose to use the Earth Mover Distance measure for our $t$-closeness requirement. We discuss the rationale for $t$-closeness and illustrate its advantages through examples and experiments.

[bib]

@InProceedings{closeness,
   author = 	 {Ninghui Li  and Tiancheng Li and Suresh Venkatasubramanian},
   title = 	 {$t$-Closeness: Privacy Beyond $k$-Anonymity and $\ell$-Diversity},
   booktitle = 	 {IEEE International Conference on Data Engineering (this 
proceedings)},
   year =	 2007
}

• B. C. M. Fung, K. Wang, A. W. C. Fu, and J. Pei, "Anonymity for continuous data publishing," In Proceedings of the 11th International Conference on Extending Database Technology (EDBT 2008), Nantes, France: ACM Press, March 2008. [link]

[abstract]