Luke Scanlon and Holly Lambert of Pinsent Masons look at the opportunities for synthetic data and AI in financial services

Artificial intelligence (AI) is transforming financial services, and synthetic data is central to this transformation. Large language models (LLMs) and other forms of generative AI depend on huge amounts of data, so it is not difficult to see why many financial institutions and fintech businesses are investing in synthetic data.

However, despite the increasing popularity of generative AI, the huge amounts of data used to train and develop foundation models have become the subject of increasing scrutiny.

For example, in March the Italian Data Protection Authority banned ChatGPT for a short period due to concerns around its ‘massive collection and processing’ of data, while other regulators have taken steps to ensure closer supervision of use of the technology. 

Such concerns may lead to more calls within financial services providers to focus on using synthetic data to train AI models. These calls may also be fuelled by the growing number of opportunities to share data across organisations as anti-money laundering, risk prevention and other AI technologies develop at a rapid pace. 

There are different types of synthetic data, most of which are created by using algorithms to replicate the statistical patterns and properties of real-world data and so produce an artificial dataset.
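To make the idea of replicating statistical patterns concrete, the sketch below fits the means, spreads and correlation of two numeric columns and then samples new, artificial rows with the same properties. It is a minimal illustration only: the records are made up, the function names `fit_gaussian` and `sample_synthetic` are hypothetical, and a simple two-column Gaussian model is assumed in place of the far richer generators used in practice.

```python
import math
import random
import statistics

def fit_gaussian(rows):
    """Estimate means, standard deviations and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    return {"mean": (mean_x, mean_y), "sd": (sd_x, sd_y), "corr": cov / (sd_x * sd_y)}

def sample_synthetic(params, n, seed=0):
    """Draw n artificial rows from a bivariate normal with the fitted properties."""
    rng = random.Random(seed)
    (mean_x, mean_y), (sd_x, sd_y), rho = params["mean"], params["sd"], params["corr"]
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + sd_x * z1
        y = mean_y + sd_y * (rho * z1 + residual * z2)  # keeps the fitted correlation
        rows.append((x, y))
    return rows

# Made-up (annual income, monthly spend) records standing in for real-world data.
real_records = [(32000, 1400), (45000, 1900), (51000, 2100),
                (60000, 2600), (75000, 3000), (88000, 3900)]

params = fit_gaussian(real_records)
synthetic_records = sample_synthetic(params, n=5000)
```

The synthetic rows share the overall shape of the original data without copying any individual record, and the same approach can produce far more rows than the original dataset contained.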

As synthetic data does not contain datapoints from real-world datasets, its use is not subject to the same constraints as real-world data. 

Synthetic data is often referred to as a privacy-preserving technique. However, it can also be used to scale a real-world dataset to increase the amount of data that is available.

Its potential to scale datasets is particularly useful in the context of LLMs. In a financial services context where the training of an LLM is dependent on access to large amounts of highly sensitive data, generating and using synthetic data effectively may be an attractive way to develop products and gain market share quickly. 

Obtaining synthetic data from third parties carries risk. Assurances should be sought to clarify the provenance and lineage of and rights to the data, as well as any restrictions on its use. 

Ownership, database rights and restrictions on the use of trade secrets and confidential information should be considered in conjunction with any restrictive licensing or contractual terms that prevent further use of the data.

Where personal data has been used to generate synthetic data, a legal ground must be identified to collect and use the data for that purpose. The person whom the data identifies must have a reasonable expectation that the data may be used to generate synthetic data. 

There is a legal risk if data relating to a person has been used to generate a specific type of dataset without a legal basis first being established for that data to be used as part of the generation process.

Additional protections can also be put in place if sensitive information, such as bank account and trading information, is processed. Regulatory requirements may prevent further use of this type of data within an organisation even if the sole purpose of further use is to generate synthetic data.

The quality of the synthetic data used should be examined to ensure it is fit for purpose. To be useful, the datasets need to be sufficiently realistic, granular and linked so that changes made in one dataset are reflected in the other. 

For example, characteristics of one synthetic person should be consistent across multiple datasets if they are to be used collectively to decide real-world outcomes. A failure to maintain consistency may lead to unwanted consequences such as unfair bias and discrimination. 
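A consistency check of this kind can be sketched in a few lines. The example below is illustrative only: the datasets, field names and the helper `find_inconsistencies` are hypothetical, and it assumes each synthetic person carries the same identifier in every dataset.

```python
def find_inconsistencies(dataset_a, dataset_b, shared_fields):
    """Return (person_id, field, value_a, value_b) for each shared field that disagrees."""
    issues = []
    for person_id, record_a in dataset_a.items():
        record_b = dataset_b.get(person_id)
        if record_b is None:
            continue  # this synthetic person appears in only one dataset
        for field in shared_fields:
            if record_a.get(field) != record_b.get(field):
                issues.append((person_id, field, record_a.get(field), record_b.get(field)))
    return issues

# Made-up synthetic datasets keyed by a synthetic person identifier.
synthetic_accounts = {
    "P001": {"age_band": "35-44", "region": "North"},
    "P002": {"age_band": "25-34", "region": "South"},
}
synthetic_loans = {
    "P001": {"age_band": "35-44", "region": "North", "loan_amount": 12000},
    "P002": {"age_band": "45-54", "region": "South", "loan_amount": 8000},
}

issues = find_inconsistencies(synthetic_accounts, synthetic_loans, ["age_band", "region"])
print(issues)  # [('P002', 'age_band', '25-34', '45-54')]
```

Here the second synthetic person has a different age band in each dataset, which is the sort of mismatch that could distort any decision relying on both datasets together.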

Sometimes synthetic data may resemble too closely the real-world personal data from which it was generated, introducing a ‘reidentification risk’. Synthetic data may also be subject to ‘membership attacks’, which establish whether an individual is present in a real-world dataset by observing the statistical properties of the synthetic data.
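One simple way to screen for synthetic records that sit too close to real ones is a distance-to-closest-record check, sketched below. This is a rough proxy rather than a full assessment of membership attack risk. The records, the threshold and the function names are assumptions made for illustration, and the features are assumed to have been scaled to comparable ranges beforehand.

```python
import math

def distance_to_closest_record(synthetic_row, real_rows):
    """Euclidean distance from one synthetic row to its nearest real row."""
    return min(math.dist(synthetic_row, real_row) for real_row in real_rows)

def flag_near_copies(synthetic_rows, real_rows, threshold):
    """Return the synthetic rows that lie within `threshold` of any real row."""
    return [row for row in synthetic_rows
            if distance_to_closest_record(row, real_rows) < threshold]

# Made-up records with both features already scaled to the range 0 to 1.
real_rows = [(0.32, 0.14), (0.45, 0.19), (0.75, 0.30)]
synthetic_rows = [(0.321, 0.141), (0.60, 0.25), (0.10, 0.05)]

near_copies = flag_near_copies(synthetic_rows, real_rows, threshold=0.01)
print(near_copies)  # [(0.321, 0.141)]
```

The first synthetic row is almost identical to a real record and is flagged for review, while the other two are far enough from every real record to pass this particular screen. The appropriate threshold depends on the data and the level of risk an organisation is prepared to accept.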

Implementing responsible AI and robust data governance frameworks is essential to ensure synthetic data risks are effectively managed.

Financial services businesses that put the right governance, risk management and contractual protections in place now will be best placed to leverage the opportunities which synthetic data-powered AI technology will continue to create.