Assessing the utility of a public dataset for analytical research

Keywords: data privacy, de-identification, data publishing, data utility, GDPR (General Data Protection Regulation)

Abstract

Organizations and agencies release data intended for analysis, for training artificial intelligence systems, and for other research purposes. Under current personal data protection regulations, publicly released data must be anonymized and protected against threats of personal data disclosure. These threats are mitigated by reducing the accuracy of the data while preparing it for release. The loss of accuracy in turn reduces the usefulness of the data for analysis. The paper considers entropy-based metrics of utility and the problems of computing them, as well as metrics of utility loss for particular subsets of public data.

Objective. To develop effective metrics for assessing the usefulness of a public dataset for analysis, taking into account the requirements of personal data protection. 

Research methods. Information security, Shannon's theory of information, Data Governance.

Results. Metrics for assessing information loss and data usefulness for analysis based on the entropy metrics of Shannon's information theory are proposed. Procedures aimed at increasing the speed of calculations of the considered metrics are suggested.
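
As an illustration only, the idea of an entropy-based information-loss metric can be sketched as the drop in Shannon entropy of a column after it is generalized for release. This is a minimal sketch, not the paper's own definition: the function names `shannon_entropy` and `entropy_loss`, the sample ages, and the ten-year banding rule are all assumptions made for the example.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_loss(original, generalized):
    """Hypothetical loss metric: entropy removed by generalizing a column."""
    return shannon_entropy(original) - shannon_entropy(generalized)


# Assumed example data: exact ages versus ten-year age bands.
ages = [23, 27, 31, 38, 42, 45, 51, 58]
age_bands = [f"{(a // 10) * 10}-{(a // 10) * 10 + 9}" for a in ages]

loss_bits = entropy_loss(ages, age_bands)
print(f"Entropy lost by banding: {loss_bits:.2f} bits")
```

In this toy case the eight distinct ages carry 3 bits and the four equally sized bands carry 2 bits, so generalization removes 1 bit of information from the column.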

Conclusions. The procedures for building a secure public dataset are described. The application of entropy metrics from Shannon's information theory to assessing information loss and the usefulness of data for analysis is considered. It is shown that calculating these metrics is a computationally complex task that is practically infeasible for large databases. Procedures for speeding up the calculation of these metrics are proposed, in particular creating a less accurate copy of the original data and drawing a random sample from a large database to compute the necessary statistics. Metrics for assessing the usefulness of particular subsets (clusters) of public data are also considered.
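
The random-sampling speed-up mentioned above might look roughly like the following. This is a hedged sketch under stated assumptions, not the authors' procedure: `sampled_entropy`, the sample size, and the synthetic column are hypothetical, and the simple plug-in estimate used here tends to underestimate entropy slightly when the sample is small relative to the number of distinct values.

```python
import math
import random
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def sampled_entropy(values, sample_size, seed=0):
    """Estimate entropy from a simple random sample (hypothetical helper)."""
    if sample_size >= len(values):
        return shannon_entropy(values)
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    sample = rng.sample(values, sample_size)  # sampling without replacement
    return shannon_entropy(sample)


# Assumed synthetic column: 100,000 records with four equally likely values.
column = [i % 4 for i in range(100_000)]

exact = shannon_entropy(column)
estimate = sampled_entropy(column, sample_size=5_000)
print(f"exact={exact:.4f} bits, sampled estimate={estimate:.4f} bits")
```

The sample-based estimate touches only a fraction of the records, which is the point of the speed-up; how large the sample must be for a given accuracy depends on the data and is not addressed by this sketch.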

Author Biographies

Oksana Podoliaka, V.N. Karazin Kharkiv National University, Svobody Square 4, Kharkiv, Ukraine, 61022

PhD in Technical Sciences, docent

Oleksii Podoliaka, V.N. Karazin Kharkiv National University, Svobody Square 4, Kharkiv, Ukraine, 61022

Senior lecturer

Published
2024-05-27
How to Cite
Podoliaka, O., & Podoliaka, O. (2024). Assessing the utility of a public dataset for analytical research. Bulletin of V.N. Karazin Kharkiv National University, Series «Mathematical Modeling. Information Technology. Automated Control Systems», 61, 61-67. https://doi.org/10.26565/2304-6201-2024-61-07
Section
Articles