dbt_synth_data

dbt_synth_data is a dbt package for creating synthetic data, which I built as part of my work at Education Analytics (EA). The package's features include:

  • support for Snowflake, Postgres, SQLite, and DuckDB backends
  • ability to generate various distributions including normal, exponential, binomial, and more
  • ability to combine basic distributions by union or average to create more complex ones
  • ability to generate many basic data types including boolean, numeric, string, and date
  • ability to generate more complex data types including references to other tables, words, names, and addresses
  • high performance, with the ability (on Snowflake) to create billions of rows and hundreds of GB of synthetic data
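
To make the idea of combining distributions concrete, here is a minimal Python sketch. It is not the package's API: dbt_synth_data is used from dbt models, and the helper names below (`normal`, `union`, `average`) are hypothetical. The sketch assumes that "union" means each value is drawn from one of the component distributions, and that "average" means each value is the mean of one draw from every component.

```python
import random
import statistics

# Hypothetical helpers for illustration only; these are NOT dbt_synth_data macros.

def normal(mean, stddev):
    """Return a sampler that draws from a normal distribution."""
    return lambda rng: rng.gauss(mean, stddev)

def union(samplers, weights=None):
    """Combine by union: each draw comes from one randomly chosen sampler."""
    def sample(rng):
        chosen = rng.choices(samplers, weights=weights, k=1)[0]
        return chosen(rng)
    return sample

def average(samplers):
    """Combine by average: each draw is the mean of one draw per sampler."""
    def sample(rng):
        return sum(s(rng) for s in samplers) / len(samplers)
    return sample

rng = random.Random(42)  # fixed seed so the run is repeatable
low, high = normal(0.0, 1.0), normal(10.0, 1.0)

# Union of two well-separated normals gives a two-peaked (bimodal) shape.
bimodal = [union([low, high])(rng) for _ in range(10_000)]
# Averaging the same two normals gives a single narrow peak between them.
blended = [average([low, high])(rng) for _ in range(10_000)]

print("union   mean/stdev:", statistics.mean(bimodal), statistics.stdev(bimodal))
print("average mean/stdev:", statistics.mean(blended), statistics.stdev(blended))
```

Both combinations are centered near 5, but the union is spread across two peaks while the average is concentrated in one, which is why the two operations are useful for building different shapes from the same basic distributions.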

At EA, we use dbt_synth_data to create synthetic data in the Ed-Fi data standard, which can then be used for:

  • testing user interfaces
  • demoing applications to users without permission to access real data
  • performance-tuning operational systems
  • preparing training and other materials with realistic data

You can learn more about dbt_synth_data in this presentation.