“The world is one big data problem.” (Andrew McAfee)

Introduction

Every year the world generates more data than it did the year before. According to the International Data Corporation (IDC), an estimated 59 zettabytes of data would be “created, captured, copied, and consumed” in 2020.

Synthetic Data and its real-world use cases

What is Synthetic Data?

As the name suggests, synthetic data is artificially created rather than being generated by actual events. It is often made with the help of algorithms and is used for a wide range of activities, including test data for new products and tools, model validation, and AI model training.
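To make the idea concrete, here is a minimal base R sketch of one very simple way to create synthetic values: summarise the observed data with a model, then draw new values from that model. The observed vector is made up for illustration, and the normal distribution is an assumption chosen for simplicity. This is not how synthpop works internally.

```r
set.seed(1)  # make the random draws reproducible

# Hypothetical "observed" measurements (made-up values for illustration)
observed <- c(72, 75, 68, 80, 77, 71, 74, 69, 82, 76)

# Summarise the observed data with a simple model: a normal distribution
mu <- mean(observed)
sigma <- sd(observed)

# Draw new values from the fitted model instead of copying real records
synthetic <- round(rnorm(length(observed), mean = mu, sd = sigma))

# The two summaries should look broadly similar, but no synthetic value
# is tied to a specific real record
print(summary(observed))
print(summary(synthetic))
```

Real synthesisers go further than this sketch: they also try to preserve the relationships between variables, not just the shape of each variable on its own.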

Real-world use cases

  • Amazon using synthetic data to train Alexa’s language system
  • Google’s Waymo using synthetic data to train its autonomous vehicles
  • Amazon using synthetic images to train Amazon Go vision recognition systems
  • American Express using synthetic financial data to improve fraud detection
  • Roche using synthetic medical data for clinical research

Generating Synthetic Data in R

The synthpop package is an add-on package to the statistical software R. It is freely available from the Comprehensive R Archive Network (CRAN). It can be downloaded and installed, for example, from inside an R session via

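# Install synthpop from CRAN (needed only once), then load it
install.packages("synthpop")
library(synthpop)

# Read the observed (original) data
df_observed <- read.csv(file = "/Users/reputation/HeartRate.csv")

# Generate m = 10 synthetic data sets with the CART method
df_synthetic <- syn(df_observed, m = 10, method = "cart", cart.minbucket = 10)

# Compare the distribution of HeartRate in the synthetic and observed data
compare(df_synthetic, df_observed, vars = "HeartRate")

Figure (a): Comparing observed and synthetic data for HeartRate using the CART method

# Compare the distribution of BodyTemperature in the synthetic and observed data
compare(df_synthetic, df_observed, vars = "BodyTemperature")

Figure (b): Comparing observed and synthetic data for BodyTemperature using the CART method

Figure (c): Z-value comparison between observed and synthetic data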
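A Z-value plot like the one in Figure (c) can be produced by fitting the same model to the synthetic and the observed data and comparing the estimates. The sketch below shows one way to do this with synthpop's lm.synds() and compare(). The data frame is simulated as a stand-in for HeartRate.csv, and the model HeartRate ~ BodyTemperature is an assumption for illustration, not necessarily the model behind the figure.

```r
library(synthpop)

# Simulated stand-in for HeartRate.csv (values are made up for illustration)
set.seed(42)
n <- 130
body_temp <- round(rnorm(n, mean = 98.2, sd = 0.7), 1)
df_observed <- data.frame(
  BodyTemperature = body_temp,
  HeartRate = round(74 + 2.5 * (body_temp - 98.2) + rnorm(n, sd = 7))
)

# Generate m = 5 synthetic data sets with the CART method
df_synthetic <- syn(df_observed, m = 5, method = "cart",
                    cart.minbucket = 10, seed = 2020, print.flag = FALSE)

# Fit the same linear model to each of the synthetic data sets
fit_synthetic <- lm.synds(HeartRate ~ BodyTemperature, data = df_synthetic)

# Refit the model on the observed data and compare the coefficients;
# by default this also plots the Z-values with confidence intervals
comparison <- compare(fit_synthetic, df_observed)
print(comparison)
```

If the synthetic and observed intervals overlap closely, an analysis run on the synthetic data would lead to similar conclusions as one run on the original data.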

Conclusion

In this article, I introduced synthetic data and why it matters, and showed how the R package “synthpop” can generate synthetic versions of microdata containing confidential information.

References

  1. https://tuvalabs.com/datasets/body_temperature_sex__heart_rate/activities
  2. M. S. Santos, R. C. Pereira, A. F. Costa, J. P. Soares, J. Santos and P. H. Abreu, “Generating Synthetic Missing Data: A Review by Missing Mechanism,” in IEEE Access, vol. 7, pp. 11651–11667, 2019.
