DDMWikiMain Page | About | Help | FAQ | Special pages | Log in
The Free Encyclopedia
Printable version | Disclaimers

Privacy Preserving Data Mining

From DDMWiki

Contents

Background

Recent interest in the collection and monitoring of data using data mining technology for the purpose of security and business-related applications has raised serious concerns about privacy issues. For example, mining health-care data for detection of bio-terrorism may require analyzing clinical records and pharmacy transaction data of certain off-the-shelf drugs. However, combining such diverse data sets belonging to different parties may violate privacy laws. Although health organizations are allowed to release data as long as the identifiers (e.g. name, SSN, address, etc.) are removed, it is not considered safe enough since re-identification attacks [1] may be constructed for linking different public databases to identify the original subjects. Privacy Preserving Data Mining (PPDM) strives to provide a solution to this dilemma. It aims to allow useful data patterns to be extracted without compromising privacy.

What is Privacy?

Generally, when we are talking about privacy, we mean the "right to be left alone" or "keep information about me from being available to others or from being misused". According to the National Information Infrastructure Advisory Council, the privacy of an individual or an organization depends on three distinct perspectives [2]:

Unfortunately, even with these guidelines, it is still very difficult defining privacy exactly because this term refers to different aspects to different people on different occasions. Asking a 10-year-old child his age over the internet has different privacy implications from asking the same question of a middle-aged adult. A question that may not be seen as violating our privacy in one situation (e.g. a doctor's office) could have that appearance in another. Even being asked to define your privacy boundaries can be an invasion of your privacy. By naming areas of your life that you wish to have kept private you are in essence calling attention to some areas over others and revealing something about yourself that you might prefer not to reveal. From this point of view, the definition of privacy actually correlates to a variety of social, legal, governmental, ethical and philosophical issues.

Despite these facts, technically speaking, given a data set, the main consideration about privacy falls into two categories:

Algorithms

Data Hiding

Additive Perturbation

Data perturbation represents one common approach in privacy preserving data mining. Here, the original dataset is perturbed and the result is released for data analysis. Perturbation approaches typically face a "privacy/accuracy" trade-off. On the one hand, perturbation must not allow the original data records to be adequately recovered. On the other, it must allow "patterns" in the original data to be recovered. In many cases, increased privacy comes at the cost of reduced accuracy and vice versa. For example, Agrawal and Srikant [3] proposed adding randomly generated i.i.d. noise to the dataset. They show how the distribution from which the original data arose can be estimated using only the perturbed data. However, Kargupta et al. [4] and Huang et al. [5] point out how, in many cases, the noise can be filtered off leaving a reasonably good estimation of the original data. These results point to the fact that unless the variance of the additive noise is sufficiently large, original data records can be recovered unacceptably well. However, this increase in variance reduces the accuracy with which the original data distribution can be estimated.

Multiplicative Perturbation

Two basic forms of multiplicative noise have been studied in the statistics community [6]. One multiplies each data element by a random number that has a truncated Gaussian distribution with mean one and small variance. The other takes a logarithmic transformation of the data first, adds multivariate Gaussian noise, then takes the anti-log of the noise-added data. Neither of these perturbations preserve distance.

Distance Preserving Perturbation

Liu et al. [7] proposed an approach where the data is multiplied by a randomly generated matrix -- in effect, the data is projected into a lower dimensional space. This technique preserves distance on expectation. Oliveira and Zaiane [8], Chen and Liu [9] discussed the use of random rotation for privacy-preserving data mining. These authors observeed that the distance preserving nature of random rotation makes it useful in this setting, but do not analyze its privacy limitations.

Categorical Data Perturbation

Evfimievski et al. [10], Rizvi and Haritza [11] consider the use of data categorical perturbation. They develop algorithms from which association rules present in the original data can be estimated from the perturbed data.

Data Anonymization

Sweeney [1] developed the k-anonymity framework wherein the original data is transformed so that the information for any individual cannot be distinguished from k-1 others. Values from the original data are generalized (replaced by a less specific value) to produce the anonymized data. This technique makes no guarantees regarding subsequent analysis of the transformed data.

Statistical Disclosure Control

Statisticians have long considered the problem of allowing data analysis while protecting sensitive data. Such techniques fall under the general heading of statistical disclosure control. For an introductory survey, the reader is referred to [12].

Additive and Multiplicative Perturbation

Data perturbation is one such technique.

Sampling Method

Liew et al. [13] described a probability distribution-based method for protecting a single confidential attribute in an secure database. This method is applicable to both categorical and numerical attributes. However, the generalization to multiple dependent confidential attributes is not straightforward. This method consists of three steps: (1) Identify the underlying probability density function of the attribute values and estimate the parameters of this function, (2) Generate a sample series of data from the estimated probability density function of the confidential attribute, (3) Substitute the generated data for the original confidential data in the same rank order, i.e., the smallest value of the new sample should replace the smallest value in the original data, and so on. The noise introduced by this method is larger when the secure database is small; thus better security is achieved but biased-query responses are provided with users. When the size of the database increases, the bias becomes smaller but less security of confidential attribute is achieved.

Analytical Method

Lefons et al. [14] proposed a method for protecting multi-numerical sensitive attributes by replacing the original secure database with its probability distribution. This method consists of estimating the joint probability density function of several numerical attributes. The key contribution of this work lies in the approximation of the data distribution by orthogonal polynomials. The coefficients used in the computation of this approximation are called canonical coefficients. These coefficients are well suited for usage in an online environment because they can be adopted easily in case of insertions and deletions of the database records. However, if the new probability distribution function is a very precise description of the original data, then there is hardly any protection against partial disclosures. On the other hand, if deviations between the distribution function and the original sensitive data are possible, then issues such as how to avoid bias and how could the database administrator exercise control on the trade-off between precision and security need to be addressed.

Data Swapping

The basic idea of data swapping, which was first proposed by Dalenius and Reiss [15], is to transform the database by switching a subset of attributes between selected pairs of records so that the lower order frequency counts or marginal are preserved and data confidentiality is not compromised. This technique could equally as well be classified under the data perturbation category. A variety of refinements and applications of data swapping have been addressed since its initial appearance. We refer readers to [16] for a thorough treatment.

Randomized Response

Randomized response was first developed by Warner [17] for the purpose of protecting data owner's privacy in a survey. This method deals with a single confidential attribute that can take only the value 0 or 1. For example, to estimate the percentage of people population that are addicted to drug, queries are sent to a group of people. Since this question is related to some privacy sensitive aspects of an individual, respondents may decide not reply at all or to reply with incorrect answers. Two models (Related-Question Model and Unrelated-Question Model) have been proposed to solve this survey problem. We describe related-Question Model first. In this model, instead of asking each respondent whether s/he is addicted to drug, the interviewer asks each respondent two related questions, the answers to which are opposite to each other. For example, the questions could be like the following: (1) I am addicted to drug. (2) I am not addicted to drug. Respondents use a randomizing device to decide which question to answer, without letting the interviewer know which question is answered. The randomizing device is designed in such a way that the probability of choosing the first question is $p$, and the probability of choosing the second question is $1-p$. Although the interviewer learns the responses (e.g. ``yes" or ``no"), s/he does not know which question was answered by the respondents. Thus the respondents' privacy is preserved (assume the respondents do not lie when answering the question). To estimate the percentage of people who are addicted to drug, we can use the following equations:

\begin{eqnarray*}

P^{*} (Yes) &=& P(Addicted to Drug = Yes) \cdot p + P(Addicted to Drug = No) \cdot (1-p) \\

P^{*} (No) &=& P(Addicted to Drug = No) \cdot p + P(Addicted to Drug = Yes) \cdot (1-p)

\end{eqnarray*}

where $P^{*}(Yes)$ (resp. $P^{*}(No)$) is the proportion of the ``Yes" (resp. ``No") responses obtained from the survey, and $P(Addicted to Drug = Yes)$ (resp. $P(Addicted to Drug = No)$) is the real proportion of the people who are addicted to drug (resp. not addicted to drug). Getting $P(Addicted to Drug = Yes)$ and $P(Addicted to Drug = No)$ is the goal of the survey. By solving the above equations, we can get $P(Addicted to Drug = Yes)$ and $P(Addicted to Drug = No)$ if $p \neq 1/2$. For the case where $p = 1/2$, we can apply Unrelated-Question Model where two unrelated questions are asked with one probability for one of the questions is known. The paper in [18] proposed a randomized response-based technique for constructing decision tree classifier from multiple-attribute database from a survey. Their results show that by choosing the appropriate randomization parameters, the accuracy is very close to the accuracy achieved using the original ID3 on the original data.

Secure Multiparty Computation

The Secure Multi-party Computation (SMC) [19] technique considers the problem of evaluating a function of two or more parties' secret inputs, such that no party learns anything but the designated output of the function. A large body of cryptographic protocols including circuit evaluation protocol, oblivious transfer, homomorphic encryption, commutative encryption serve as the building blocks of SMC. The work in [20] offered a broad view of SMC framework and its applications to data mining. The work in [21] detailed a rigorous introduction to SMC. It was shown that any function that can be expressed by an arithmetic circuit is privately computable using a generic circuit evaluation protocol. However, the communication and computational complexity of doing so makes this general approach infeasible for large data sets. A collection of SMC tools useful for large scale privacy preserving data mining (e.g., secure sum, set union, inner product) are discussed in [22]. An overview of the state-of-the-art privacy preserving data mining techniques is presented in [23].

Distributed Data Mining

The distributed data mining (DDM) [24] approach supports computation of data mining models and extraction of "patterns" at a given node by exchanging only the minimal necessary information among the participating nodes. The work in [25] proposed a paradigm for clustering distributed privacy sensitive data in an unsupervised or a semi-supervised scenario. In this algorithm, each local data site builds a model and transmit only the parameters of the model to the central site where a global clustering model is constructed. A distributed privacy-preserving algorithm for Bayesian network parameter learning is reported elsewhere [26].

Rule Hiding

The main objective of rule hiding is to transform the database such that the sensitive rules are masked, and all the other underlying patterns can still be discovered. The work in [27] gave a formal proof that the optimal sanitization is an NP-hard problem for the hiding of sensitive large item sets in the context of association rule mining. For this reason, some heuristic approaches have been applied to address the complexity issues. For example, the perturbation-based association rule hiding technique [28] is implemented by changing a selected set of 1-values to 0-values or vice versa so that the frequent item sets that generate the rule are hidden or the support of sensitive rules is lowered to a user-specified threshold. The blocking-based association rule hiding approach [29] replaces certain attributes of the data with a question mark. In this regard, the minimum support and confidence will be altered into a minimum interval. As long as the support and/or the confidence of a sensitive rule lies below the middle in these two ranges, the confidentiality of data is expected to be protected.

The work in [30] presented a new framework for dealing with inference problem which arises from parsimonious downgrading. Here ``parsimonious downgrading" refers to the phenomenon of trimming out sensitive information from a data set when it is transferred from a secure environment (referred to as High) to a public domain (referred to as Low). This framework has been applied for decision tree classification analysis such that the receiver of the data will unable to build informative models for the data that is not downgraded. The following section considers a random orthogonal transformation-based perturbation technique and explores the privacy-preserving (or lack of it) properties of such transformation.


Kleinburg et al. [31] consider the auditing problem. An outside party issues a series of statistical queries (give the average salary of all people aged 30 to 40) on a sensitive database. The database server takes into account information revealed by past queries and decides whether answering the current query will result in the revelation of sensitive data.

Privacy Metrics

...

Industrial Applications & Products

On December 1, 2005, AGNIK, LLC, a data analytics company at Baltimore USA for distributed, mobile, and embedded environments, wins SBIR Phase II contract from the Department of Homeland Security for developing a "Cross-Domain Intrusion Detection system using Privacy-Preserving Distributed Data Mining Technologies." This SBIR project develops PURSUIT, a cross-domain intrusion detection and prevention system that relies upon state-of-the-art privacy-preserving distributed data mining (PPDM) technology. PURSUIT has a distributed multi-agent architecture that supports formation of ad-hoc peer-to-peer, hierarchical, and othercollaborative coalitions with due attention to the security and privacy issues. It will be equipped with PPDM algorithms so that the patterns can be computed and shared across the sites without sharing the privacy-sensitive data. The algorithmic foundation of the approach is based on combination of secured multi-party computation and randomized transformation techniques that allow sharing of attack patterns, not the raw data.



See also

References

  1. L. Sweeney, “k-Anonymity: A Model for Protecting Privacy,” Int’l J. Uncertainty, Fuzziness, and Knowledge-Based Systems, vol. 10, no. 5, pp. 557-570, 2002. [Sweeney_02]
  2. The National Information Infrastructure Advisory Council. Common ground: Fundamental principles for the national information infrastructure, 1995. [NIIAC_95]
  3. R. Agrawal and R. Srikant. Privacy preserving data mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 439–450, Dallas, TX, May 2000. [Agrawal_00]
  4. H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. Random data perturbation techniques and privacy preserving data mining. Knowledge and Information Systems, 7(5):387–414, 2005. [Kargupta_05]
  5. Z. Huang, W. Du, and B. Chen. Deriving private information from randomized data. In Proceedings of the 2005 ACM SIGMOD Conference, pages 37–48, Baltimroe, MD, June 2005. [Huang_05]
  6. J. J. Kim and W. E. Winkler. Multiplicative noise for masking continuous data. Technical Report Statistics #2003-01, Statistical Research Division, U.S. Bureau of the Census, Washington D.C., April 2003. [Kim_03]
  7. K. Liu, H. Kargupta, and J. Ryan, Random Projection-based Multiplicative Perturbation for Privacy Preserving Distributed Data Mining. IEEE Transactions on Knowledge and Data Engineering (TKDE), VOL. 18, NO. 1, pages 92--106, Piscataway, NJ, January 2006. [Liu_06]
  8. S. R. M. Oliveira and O. R. Zaïane. Privacy preservation when sharing data for clustering. In Proceedings of the International Workshop on Secure Data Management in a Connected World, pages 67–82, Toronto, Canada, August 2004. [Oliveira_04]
  9. K. Chen and L. Liu. Privacy preserving data classification with rotation perturbation. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05), pages 589–592, Houston, TX, November 2005. [Chen_05]
  10. A. Evfimevski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the ACM SIGMOD/PODS Conference, San Diego, CA, June 2003. [Evfimevski_03]
  11. S. J. Rizvi and J. R. Haritsa. Maintaining data privacy in association rule mining. In Proceedings of the 28th VLDB Conference, Hong Kong, China, August 2002. [Rizvi_02]
  12. L. Willenborg and T. de Waal. Statistical Disclosure Control in Practice. Springer-Verlag, Lecture Notes in Statistics 111, New York, 1996. [Willenborg_96]
  13. C. K. Liew, U. J. Choi, and C. J. Liew. A data distortion by probability distribution. ACM Transactions on Database Systems (TODS), 10(3):395–411, 1985. [Liew_85]
  14. E. Lefons, A. Silvestri, and F. Tangorra. An analytic approach to statistical databases. In Proceedings of the 9th International Conference on Very Large Data Bases, pages 260–274, Florence, Italy, November 1983. Morgan Kaufmann Publishers Inc. [Lefons_83]
  15. T. Dalenius and S.P. Reiss, “Data-Swapping: A Technique for Disclosure Control,” J. Statistical Planning and Inference, vol. 6, pp. 73-85, 1982. [Dalenius_82]
  16. S.E. Fienberg and J. McIntyre, “Data Swapping: Variations on a Theme by Dalenius and Reiss,” technical report, Nat’l Inst. of Statistical Sciences, Research Triangle Park, NC, 2003. [Fienberg_03]
  17. S. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60:63–69, 1965. [Warner_65]
  18. W. Du and Z. Zhan. Using randomized response techniques for privacypreserving data mining. In Proceedings of The 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, August 2003. [Du_03]
  19. A.C. Yao, “How to Generate and Exchange Secrets,” Proc. 27th IEEE Symp. Foundations of Computer Science, pp. 162-167, 1986. [Yao_86]
  20. B. Pinkas, “Cryptographic Techniques for Privacy Preserving Data Mining,” SIGKDD Explorations, vol. 4, no. 2, pp. 12-19, 2002. [Pinkas_02]
  21. O. Goldreich, The Foundations of Cryptography, vol. 2, chapter 7. Cambridge Univ. Press, 2004. [Goldreich_04]
  22. C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Zhu, “Tools for Privacy Preserving Distributed Data Mining,” ACM SIGKDD Explorations, vol. 4, no. 2, 2003. [Clifton_03]
  23. V.S. Verykios, E. Bertino, I.N. Fovino, L.P. Provenza, Y. Saygin, and Y. Theodoridis, “State-of-the-Art in Privacy Preserving Data Mining,” ACM SIGMOD Record, vol. 3, no. 1, pp. 50-57, Mar. 2004. [Verykios_04]
  24. B.-H. Park and H. Kargupta, “Distributed Data Mining,” The Handbook of Data Mining, ser. Human Factors and Ergonomics, pp. 341-358, N. Ye, ed., Lawrence Erlbaum Associates, Inc., 2003. [Park_03]
  25. S. Merugu and J. Ghosh, “Privacy-Preserving Distributed Clustering Using Generative Models,” Proc. Third IEEE Int’l Conf. Data Mining (ICDM’03), Nov. 2003. [Merugu_03]
  26. D. Meng, K. Sivakumar, and H. Kargupta, “Privacy Sensitive Bayesian Network Parameter Learning,” Proc. Fourth IEEE Int’l Conf. Data Mining (ICDM’04), Nov. 2004. [Meng_04]
  27. M.J. Atallah, E. Bertino, A.K. Elmagarmid, M. Ibrahim, and V.S. Verykios, “Disclosure Limitation of Sensitive Rules,” Proc. IEEE Knowledge and Data Eng. Workshop, pp. 45-52, 1999. [Atallah_99]
  28. V.S. Verykios, A.K. Elmagarmid, B. Elisa, Y. Saygin, and D. Elena, “Association Rule Hiding,” IEEE Trans. Knowledge and Data Eng., 2003. [Verykios_03]
  29. Y. Saygin, V.S. Verykios, and C. Clifton, “Using Unknowns to Prevent Discovery of Association Rules,” SIGMOD Record, vol. 30, no. 4, pp. 45-54, Dec. 2001. [Saygin_01]
  30. L. Chang and I. S. Moskowitz. Parsimonious downgrading and decision tree applied to the inference problem. In Proceedings of the 1998 New Security Paradigms Workshop, pages 82–89, Charlottesville, VA, September 1998. [Chang_98]
  31. J. Kleinberg, C.Papadimitriou, and P. Raghavan. Auditing boolean attributes. Journal of Computer and System Sciences, 66(1):244 – 253, 2003. [Kleinberg_03]

Bibliography

Retrieved from "http://www.umbc.edu/ddm/wiki/Privacy_Preserving_Data_Mining"

This page has been accessed 2,062 times. This page was last modified 16:06, 26 June 2006. Content is available under a Creative Commons Attribution-ShareAlike 2.5 License.


Find
Browse
Main Page
Bulletin Board
Recent changes
Random page
Wiki Help
Sandbox
Feedback
Edit
Edit this page
Editing help
This page
Discuss this page
Post a comment
Printable version
Context
Page history
What links here
Related changes
My pages
Create an account or log in
Special pages
New pages
File list
Statistics
Bug reports
More...