Hypothesis Testing and Statistically-sound Pattern Mining
Time and Location Slides 2-pager Introduction Outline Meet the Tutors Bibliography
Time and Location
KDD'19, Sunday August 4, 8am — 12pm, Summit 4- Ground Level, Egan
Slides
2-page summary
A 2-page summary of the tutorial, including the outline.
Introduction
The availability of massive datasets has highlighted the need of computationally efficient and statistically-sound methods to extracts patterns while providing rigorous guarantees on the quality of the results, in particular with respect to false discoveries.
In this tutorial we survey recent methods that properly combine computational and statistical considerations to efficiently mine statistically reliable patterns from large datasets. We start by introducing the fundamental concepts in statistical hypothesis testing, which may not be familiar to everyone in the data mining community. We then explain how the computational and statistical challenges in pattern mining have been tackled in different ways. Finally, we describe the application of these methods in areas such as market basket analysis, subgraph mining, social networks analysis, and cancer genomics.
The purpose of this tutorial is to introduce the audience to statistical hypothesis testing, to emphasize the importance of properly balancing the computational and statistical aspects of pattern mining, highlighting the usefulness of techniques that do so for the data mining researcher, and also to encourage further research in this area.
Outline
- Introduction and Theoretical Foundations (1 hour)
- Testing a single hypothesis: setting, basic concepts, and applications to social networks and computational biology
- Fundamental tests: Fisher's test, χ2 test, Barnard's test, A/B testing
- The challenge of testing multiple hypotheses: Family-Wise Error Rate and False Discovery Rate
- Taming the challenge: the Bonferroni-Holm procedure and the Benjamini-Yekutieli correction
- Mining Statistically-Sound Patterns (1 hour and 30 minutes)
- Computational and statistical challenges in pattern mining
- Computational aspects: LAMP, permutation testing, TopKWY
- Statistical aspects: hold-out approach and layered critical values, a threshold for significant pattern mining, true frequent itemsets
- Other statistical measures: emerging patterns, discriminative patterns, significant association rules, significant subgroups
- Applications: subgraph mining, cancer genomics, computational biology, and survival analysis
- Recent developments and advanced topics (30 minutes)
- Removing assumptions on the data generating process
- Data-dependent hypothesis weighting
- Conclusions, future directions, and discussion
Meet the Tutors
Leonardo Pellegrina is a Ph.D. student in Information Engineering from the Department of Information Engineering of the University of Padova, under the supervision of Prof. Fabio Vandin, and a Visiting Research Fellow at Brown University. His research activities focus on efficient and statistically sound algorithms for pattern discovery from large data, with applications to computational biology.
Matteo Riondato is an assistant professor of computer science at Amherst College, and a visiting faculty at Brown University. His research focuses on randomized approximation algorithms for knowledge discovery, data mining, and machine learning. His works received best-of-conference awards at SIAM SDM'14, ACM KDD'16, and IEEE ICDM'18. He has presented tutorials at ACM KDD'15, ACM CIKM'15, ECML PKDD'15, and WWW'16.
Fabio Vandin is an associate professor at the Department of Information Engineering at the University of Padova (Italy). Prior to joining the University of Padova he was an assistant professor of research at Brown University and an assistant professor at the University of Southern Denmark, where he holds an adjunct associate professor of computer science position. His research interests are in algorithms for statistically significant pattern mining and machine learning, and in applications to large datasets from molecular biology. He received the best paper award at RECOMB'13. His work has been funded in part by the National Science Foundation (NSF), the National Institute of Health (NIH), and the Italian Ministry for Education, University and Research (MIUR). He has presented tutorials at the Computational Genomics Summer Institute (CGSI) of UCLA in 2017, and will give another tutorial at CGSI in 2019.