Ehsan Karim, PhD, M.Sc.
Scientist, Advancing Health
Assistant Professor, School of Population and Public Health, UBC
Finding the Optimal Number of Splits in Double Cross-Fitting Targeted Maximum Likelihood Estimators
Flexible machine learning (ML) algorithms have become vital in epidemiological research, offering refined insights in real-data analyses. However, integrating highly flexible algorithms within doubly robust methods, such as the Targeted Maximum Likelihood Estimator (TMLE), complicates variance estimation and can result in notable undercoverage of confidence intervals, a critical concern. The Double Cross-Fitting (DCF) method enables the use of diverse machine learning estimators while facilitating asymptotically valid inference. Nonetheless, the DCF literature is unclear about the optimal number of data splits. This research explores how the number of DCF splits affects the performance of TMLE estimators, using statistical simulations and real-world data analysis. We generalize DCF beyond the traditional setup, experimenting with various numbers of splits to optimize TMLE while employing a super learner. Our study examines configurations across different sample sizes and DCF generalizations, with real-world implications demonstrated using data from the National Health and Nutrition Examination Survey (NHANES), focusing on the risk of obesity and diabetes. The study emphasizes the importance of careful split selection in DCF TMLE methods for both computational efficiency and accurate statistical inference, and finds that three to five splits are optimal. It offers guidance to epidemiologists who use complex machine learning in causal studies and advocates prudent split selection in DCF. This presentation is based on collaborative work with Momenul Haque Mondol, a trainee at the UBC School of Population and Public Health.
This is a virtual event. Please register to receive the Zoom link.