"The world is one big data problem." - Andrew McAfee
Every year the world generates more data than it did the year before. According to the International Data Corporation (IDC), an estimated 59 zettabytes of data were expected to be “created, captured, copied, and consumed” in 2020.
Although data is burgeoning, that does not mean everyone can access it. Companies and organizations are concerned about their users' privacy, and the Covid-19 pandemic has led to the shutdown of research labs and other organizations. Without access to observed data, it is hard to train machine learning models or meet other industry needs. Enter synthetic data, which McGraw-Hill defines as “any production data applicable to a given situation that is not obtained by direct measurement”.
Synthetic Data and its real-world use cases
What is Synthetic Data?
As the name suggests, synthetic data is artificially created rather than being generated by actual events. It is often made with the help of algorithms and is used for a wide range of activities, including test data for new products and tools, model validation, and AI model training.
Synthetic data is inexpensive to produce and can support AI and deep learning model development as well as software testing. Data privacy is one of its most important advantages. User data often includes personally identifiable information (PII) and protected health information (PHI); synthetic data lets companies build software without exposing that information to developers or software tools.
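To make the idea concrete, here is a deliberately simple sketch in base R: fit a model to observed values, then draw new values from the model instead of releasing the originals. The "observed" values are simulated for illustration, and the normal model is an assumption chosen for brevity; real synthesizers are more sophisticated.

```r
# Toy illustration only: synthesise one numeric column by sampling from a
# normal distribution fitted to the observed values.
set.seed(42)

# Hypothetical "observed" measurements (simulated here, not real data)
observed <- rnorm(200, mean = 73, sd = 7)

# Fit a simple model: estimate the mean and standard deviation
fit_mean <- mean(observed)
fit_sd <- sd(observed)

# Draw synthetic values from the fitted model rather than copying records
synthetic <- rnorm(length(observed), mean = fit_mean, sd = fit_sd)

# The two summaries should be similar but not identical
summary(observed)
summary(synthetic)
```

The synthetic column keeps the overall shape of the observed one, yet no row corresponds to a real record.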
Real-world use cases
- Amazon using synthetic data to train Alexa’s language system
- Google’s Waymo using synthetic data to train its autonomous vehicles
- Amazon using synthetic images to train Amazon Go vision recognition systems
- American Express using synthetic financial data to improve fraud detection
- Roche using synthetic medical data for clinical research
Generating Synthetic Data in R
The synthpop package is an add-on package for the statistical software R. It is freely available from the Comprehensive R Archive Network (CRAN). It can be downloaded and installed, for example, from inside an R session via
install.packages("synthpop")
Once the synthpop package is installed, it needs to be attached to the current R session with the command
library("synthpop")
To generate synthetic data and evaluate its quality, a real-world data set is used.
Load the data into the R workspace:
df_observed <- read.csv(file = "/Users/reputation/HeartRate.csv")
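If the CSV file is not available, a stand-in data frame makes it possible to follow the remaining steps. The sketch below is an assumption, not the original data: the column names HeartRate and BodyTemperature are taken from the compare() calls later in this article, while the Gender column name, the row count, the units (degrees Fahrenheit and beats per minute), and all values are made up for illustration.

```r
# Stand-in for HeartRate.csv, built from simulated values (NOT the real data)
set.seed(1)
n <- 130  # arbitrary number of rows

df_observed <- data.frame(
  BodyTemperature = round(rnorm(n, mean = 98.2, sd = 0.7), 1),  # assumed deg F
  Gender = factor(sample(c("Female", "Male"), n, replace = TRUE)),  # hypothetical column name
  HeartRate = round(rnorm(n, mean = 74, sd = 7))  # assumed beats per minute
)

# Quick checks that are worth running on any freshly loaded data set
str(df_observed)      # column types
summary(df_observed)  # per-column ranges and counts
```

The same str() and summary() calls are a reasonable first check on the real file, since the variable types in the data influence how each variable is synthesized.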
Generate the synthetic data using syn(), where m specifies the number of synthetic data sets to create. The observed data set contains body temperature, sex, and heart rate as variables.
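# method = "cart" synthesises with classification and regression trees; cart.minbucket sets the minimum number of observations per terminal node
df_synthetic <- syn(df_observed, m = 10, method = "cart", cart.minbucket = 10)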
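The object returned by syn() is not a plain data frame, so it helps to look inside it. The sketch below assumes synthpop is installed and uses a simulated stand-in for the observed data (the Gender column name and all values are hypothetical). The seed and print.flag arguments are additions for reproducibility and quieter output, not part of the original call.

```r
library(synthpop)

# Simulated stand-in for the observed data (hypothetical values)
set.seed(1)
df_observed <- data.frame(
  BodyTemperature = round(rnorm(130, 98.2, 0.7), 1),
  Gender = factor(sample(c("Female", "Male"), 130, replace = TRUE)),
  HeartRate = round(rnorm(130, 74, 7))
)

# seed makes the synthesis reproducible; print.flag = FALSE hides progress output
df_synthetic <- syn(df_observed, m = 10, method = "cart",
                    cart.minbucket = 10, seed = 2020, print.flag = FALSE)

class(df_synthetic)          # a "synds" object
length(df_synthetic$syn)     # one data frame per synthetic data set
head(df_synthetic$syn[[1]])  # first rows of the first synthetic data set
```

With m greater than 1, the syn component is a list, and each element can be used like an ordinary data frame.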
compare() can be used to compare df_synthetic against df_observed. For each selected variable, it tabulates and plots the distributions of the observed and synthetic values, so any differences are easy to see.
compare(df_synthetic, df_observed, vars = "HeartRate")
compare(df_synthetic, df_observed, vars = "BodyTemperature")
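The visual comparison can be complemented with a quick numeric check. The sketch below computes the mean heart rate in each synthetic data set and sets it beside the observed mean. It assumes synthpop is installed and again uses simulated stand-in data (hypothetical values and Gender column name). This is an informal sanity check, not a formal utility measure.

```r
library(synthpop)

# Simulated stand-in for the observed data (hypothetical values)
set.seed(1)
df_observed <- data.frame(
  BodyTemperature = round(rnorm(130, 98.2, 0.7), 1),
  Gender = factor(sample(c("Female", "Male"), 130, replace = TRUE)),
  HeartRate = round(rnorm(130, 74, 7))
)

df_synthetic <- syn(df_observed, m = 5, method = "cart",
                    cart.minbucket = 10, seed = 2020, print.flag = FALSE)

# Mean heart rate in the observed data and in each synthetic data set
observed_mean <- mean(df_observed$HeartRate)
synthetic_means <- sapply(df_synthetic$syn, function(d) mean(d$HeartRate))

# Observed mean next to the average of the synthetic means
round(c(observed = observed_mean, synthetic = mean(synthetic_means)), 2)

# Spread of the mean across the synthetic data sets
range(synthetic_means)
```

If the synthetic means sit far from the observed mean, the synthesis settings probably need another look before the data is used.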
By varying the synthesis method (the method argument of syn()), we can generate synthetic data with different patterns.
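One way to vary the synthesis is through the method argument of syn(). The sketch below contrasts tree-based synthesis ("cart") with the package's parametric defaults ("parametric"). It assumes synthpop is installed and uses simulated stand-in data (hypothetical values and Gender column name). A single synthetic data set per method keeps the example short.

```r
library(synthpop)

# Simulated stand-in for the observed data (hypothetical values)
set.seed(1)
df_observed <- data.frame(
  BodyTemperature = round(rnorm(130, 98.2, 0.7), 1),
  Gender = factor(sample(c("Female", "Male"), 130, replace = TRUE)),
  HeartRate = round(rnorm(130, 74, 7))
)

# Same data and seed, two different synthesis methods
syn_cart <- syn(df_observed, method = "cart", seed = 2020, print.flag = FALSE)
syn_param <- syn(df_observed, method = "parametric", seed = 2020, print.flag = FALSE)

# Method actually applied to each variable
syn_cart$method
syn_param$method

# Compare one variable across the two synthetic versions
summary(syn_cart$syn$HeartRate)
summary(syn_param$syn$HeartRate)
```

With m left at its default of 1, the syn component is a single data frame rather than a list, which is why the columns are accessed directly here.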
In this article, I presented the fundamental importance of Synthetic Data and the functionality of the R package named “synthpop” for generating synthetic versions of microdata containing confidential information.
Thanks for reading and good luck — Surya Nuchu