A few months ago, we published an article about the launch of the Data Hub and the addition of synthetic datasets. The Data Hub is part of the EU Digital Finance Platform, which facilitates data exchange between financial companies and supervisory authorities. Its aim is to give companies access to synthetic supervisory data for testing new applications and training AI/ML models.
This article focuses on the technology used for data sharing on the Data Hub, known as synthetic data. Synthetic data is artificially generated to resemble real-world data in terms of statistical properties, without including real individuals or identifiable information. It can be created using algorithms, simulations, or machine learning models.
The process of creating synthetic data involves three steps. First, a machine learning model analyses real customer data to understand relationships between factors, such as how income level affects loan default risk. Then, this analysis is used to train an algorithm to replicate these trends. Finally, the model generates new data that mirrors the patterns of the original data but does not represent any actual person. For instance, it might create a synthetic "customer" with certain financial characteristics, but this customer doesn’t exist.
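The three steps can be illustrated with a minimal Python sketch. It is not the method used for the Data Hub: the "real" records are themselves made-up toy data, the model is a deliberately simple one (a default rate plus one income distribution per group), and all function names are hypothetical. Sampling from a fitted model in this way shows the mechanics only and does not by itself guarantee privacy.

```python
import random
import statistics

def make_toy_real_data(n, rng):
    """Stand-in for real customer records as (income, defaulted). Entirely made up."""
    records = []
    for _ in range(n):
        income = max(rng.gauss(40_000, 12_000), 5_000)
        # Toy assumption: lower income goes with a higher default probability.
        p_default = 0.30 if income < 30_000 else 0.05
        records.append((income, rng.random() < p_default))
    return records

def fit(records):
    """Steps 1 and 2: learn the default rate and the income distribution per group."""
    model = {
        "p_default": sum(d for _, d in records) / len(records),
        "income": {},
    }
    for flag in (True, False):
        incomes = [inc for inc, d in records if d == flag]
        model["income"][flag] = (statistics.mean(incomes), statistics.stdev(incomes))
    return model

def generate(model, n, rng):
    """Step 3: sample new records from the fitted model instead of copying rows."""
    synthetic = []
    for _ in range(n):
        defaulted = rng.random() < model["p_default"]
        mean, std = model["income"][defaulted]
        synthetic.append((rng.gauss(mean, std), defaulted))
    return synthetic

rng = random.Random(0)
real = make_toy_real_data(5_000, rng)
model = fit(real)
synthetic = generate(model, 5_000, rng)
```

Each synthetic "customer" here is drawn from the learned distributions, so the link between income and default carries over while no record corresponds to a row of the input data.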
There are several reasons for using synthetic data. One major benefit is privacy protection, as the synthetic nature of the data removes the risk of revealing real customer information, ensuring compliance with privacy regulations. It also enables data sharing, as central banks and national authorities can share synthetic data with external partners or researchers without privacy concerns. Additionally, the synthetic data retains enough quality to train models effectively, meaning there is no significant loss in accuracy.
This allows financial companies to test algorithms without the need for real customer data, making the process safer and compliant. A recent report by the Commission's Joint Research Centre (JRC) reviews the use of this synthetic data technology in the Data Hub, comparing the statistical characteristics of the original and synthetic datasets. The report evaluates the efficiency of the data creation process and addresses privacy concerns. Tests show that the synthetic data replicates key patterns from the original data while ensuring the confidentiality of sensitive information.
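As an illustration of what comparing the statistical characteristics of two datasets can look like, the sketch below contrasts the mean, standard deviation and empirical distribution of a single numeric column. It is an assumption-laden example, not the JRC's methodology: both columns are toy data, the function names are hypothetical, and the two-sample Kolmogorov-Smirnov statistic is just one common choice of distance.

```python
import random
import statistics

def ks_statistic(a, b):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Move both pointers past every value <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

def compare(original, synthetic):
    """Summarise how closely one synthetic column tracks the original column."""
    return {
        "mean_original": statistics.mean(original),
        "mean_synthetic": statistics.mean(synthetic),
        "stdev_original": statistics.stdev(original),
        "stdev_synthetic": statistics.stdev(synthetic),
        "ks": ks_statistic(original, synthetic),
    }

rng = random.Random(1)
# Toy stand-ins for one column (for example income) of each dataset.
original = [rng.gauss(40_000, 12_000) for _ in range(2_000)]
synthetic = [rng.gauss(40_000, 12_000) for _ in range(2_000)]
report = compare(original, synthetic)
```

A small `ks` value together with similar means and standard deviations suggests that the synthetic column follows the original distribution closely. A fuller evaluation would also look at relationships between columns and at privacy risk.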
The JRC concludes in its report that the synthetic datasets are comparable in value to real data while meeting privacy requirements. This contributes to the EU's digital finance goals by supporting innovation without exposing real data to third parties.
Related links
Digital Finance Platform - Launch of phase II & data hub
European Data Spaces - Scientific Insights into Data Sharing and Utilisation at Scale
Technological Enablers for Privacy Preserving Data Sharing and Analysis
Details
- Publication date: 8 October 2024
- Author: Directorate-General for Financial Stability, Financial Services and Capital Markets Union