Assessing the utility of a public dataset for analytical research

doi:10.26565/2304-6201-2024-61-07

Oksana Podoliaka, V.N. Karazin Kharkiv National University, Svobody Square 4, Kharkiv, Ukraine, 61022, https://orcid.org/0000-0002-3401-2996
Oleksii Podoliaka, V.N. Karazin Kharkiv National University, Svobody Square 4, Kharkiv, Ukraine, 61022, https://orcid.org/0000-0002-5755-3728

Keywords: data privacy, de-identification, data publishing, data utility, GDPR (General Data Protection Regulation)

Abstract

Organizations and agencies release data for analysis, for training artificial intelligence systems, and for other research purposes. Under current personal data protection regulations, public data must be anonymized and protected against threats of personal data disclosure. These threats are mitigated by reducing the accuracy of the data while preparing it for release, and this loss of accuracy reduces the usefulness of the data for analysis. The paper considers entropy-based utility metrics and the problems of computing them, as well as metrics for the loss of utility of particular subsets of public data.

Objective. To develop effective metrics for assessing the usefulness of a public dataset for analysis, taking into account the requirements of personal data protection.

Research methods. Information security, Shannon's information theory, data governance.

Results. Metrics for assessing information loss and data usefulness for analysis are proposed, based on the entropy metrics of Shannon's information theory. Procedures for speeding up the calculation of these metrics are suggested.

Conclusions. The procedures for building a secure public dataset are described. The application of entropy metrics from Shannon's information theory to assessing information loss and data usefulness for analysis is considered. It is shown that calculating these metrics is a computationally complex task that is practically infeasible for large databases. Procedures for speeding up the calculation are proposed, in particular creating a less accurate copy of the original data and drawing a random sample from a large database to compute the necessary statistics. Metrics for assessing the usefulness of particular subsets (clusters) of public data are also considered.
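The abstract does not give the formulas, so the following is only a minimal sketch of the general idea and not the authors' metrics. It assumes that information loss for a single attribute can be illustrated as the drop in empirical Shannon entropy after generalization, and that a uniform random sample stands in for the full table. The function names, the 10-year age bands, and the toy data are all hypothetical.

```python
import math
import random
from collections import Counter


def shannon_entropy(values):
    """Empirical Shannon entropy, in bits, of a list of attribute values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def generalize_age(age, width=10):
    """Hypothetical generalization step: replace an exact age with a band."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def entropy_loss(original, released):
    """Entropy of the original column minus entropy of the released column."""
    return shannon_entropy(original) - shannon_entropy(released)


def sampled_entropy_loss(original, generalize, sample_size, seed=0):
    """Estimate the loss on a uniform random sample instead of the full data.

    Assumption: the sample is drawn without replacement; the plug-in
    entropy computed on a small sample tends to underestimate the
    entropy of the full column, so this is a rough estimate only.
    """
    rng = random.Random(seed)
    sample = rng.sample(original, min(sample_size, len(original)))
    return entropy_loss(sample, [generalize(v) for v in sample])


# Toy data (hypothetical): eight distinct ages, released as 10-year bands.
ages = [23, 27, 31, 35, 38, 42, 47, 51]
released = [generalize_age(a) for a in ages]
loss = entropy_loss(ages, released)
print(f"entropy loss: {loss:.4f} bits")
```

Because the released value is a deterministic function of the original one, the loss computed this way is never negative. On a real database the sample would typically be drawn by the database engine itself rather than in application code.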

Author Biographies

Oksana Podoliaka, V.N. Karazin Kharkiv National University, Svobody Square 4, Kharkiv, Ukraine, 61022

PhD in Technical Sciences, Docent

Oleksii Podoliaka, V.N. Karazin Kharkiv National University, Svobody Square 4, Kharkiv, Ukraine, 61022

Senior lecturer
