Strata+Hadoop World 2017
The dangers of statistical significance when studying weak effects in big data: From natural experiments to p-hacking
When there is a strong signal in a large dataset, many machine-learning algorithms will find it. On the other hand, when the effect is weak and the data is large, there are many ways to "discover" an effect that is in fact nothing more than noise. Robert (Bob) Grossman works through three case studies to share best practices that make it a bit less likely you will be accused of p-hacking.
The first case study looks at the geospatial variation of common diseases in the US. In the second case study, Bob dives into different methods for understanding whether exposure to particulate matter (solid and liquid particles suspended in the air) affects your health. The third case study examines the claim that beautiful parents have more daughters. Bob extracts several techniques from these case studies that have proved useful in practice and highlights common mistakes data scientists make when small effects may or may not be present.
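As a quick illustration of the core problem (a sketch for this listing, not code from the talk): if you run enough hypothesis tests on data where every null hypothesis is true, a predictable fraction will come back "significant" by chance alone. The simulation below runs 1,000 two-sample tests on pure noise, using a normal approximation to the two-sample t-test so it needs only the standard library.

```python
import math
import random

random.seed(0)
n_tests, n = 1000, 50  # 1,000 independent tests, 50 samples per group

def two_sample_p(x, y):
    """Two-sided p-value for a difference in means (normal approximation)."""
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((v - mx) ** 2 for v in x) / (n - 1)
    vy = sum((v - my) ** 2 for v in y) / (n - 1)
    z = (mx - my) / math.sqrt(vx / n + vy / n)
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Both groups are drawn from the same distribution, so every null is true:
# there is no real effect anywhere in this "dataset".
p_values = [
    two_sample_p([random.gauss(0, 1) for _ in range(n)],
                 [random.gauss(0, 1) for _ in range(n)])
    for _ in range(n_tests)
]

# At alpha = 0.05 we still expect roughly 5% spurious "discoveries".
n_false = sum(p < 0.05 for p in p_values)
print(f"{n_false} of {n_tests} null tests are 'significant' at p < 0.05")
```

Testing a single prespecified hypothesis at p < 0.05 is one thing; searching a large dataset for any of thousands of possible weak effects is another, and that is where the corrections covered in the talk come in.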
- Effect size, variance, and statistical power
- Correcting for multiple experiments
- Randomized and natural experiments
- Saying no to p-values: Hierarchical Bayesian models
- Why small sample sizes still occur with big data
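To make the "correcting for multiple experiments" topic concrete, here is a minimal sketch (illustrative, not from the talk materials) of two standard procedures: the Bonferroni correction, which controls the family-wise error rate, and the Benjamini-Hochberg step-up procedure, which controls the false discovery rate.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i < alpha / m; controls the family-wise error rate."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; controls the false discovery rate.

    Sort the p-values, find the largest rank k with p_(k) <= k * alpha / m,
    and reject the hypotheses with the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k_max = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

# Example: four tests, alpha = 0.05. Bonferroni's per-test threshold is
# 0.0125, so only the first test survives; Benjamini-Hochberg, being less
# conservative, rejects the first three.
p = [0.01, 0.02, 0.03, 0.5]
print(bonferroni(p))          # [True, False, False, False]
print(benjamini_hochberg(p))  # [True, True, True, False]
```

Bonferroni is the blunter instrument: with thousands of tests, alpha / m becomes so small that real weak effects are missed, which is one reason false-discovery-rate control is common in large-scale studies.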