Levin Lecture Series: Fall 2019 Colloquium Seminars

SEPTEMBER 5, 2019

 

Topic: "Statistical Inference for Online Decision Making via Stochastic Gradient Descent"

 

Speaker: Dr. Rui Song

Email: rsong@ncsu.edu

Associate Professor, Statistics Department, NCSU 

11:30am-12:30pm

AR Building, 8th Floor Auditorium

Hosted by: Bin Cheng

 

Abstract: Online decision making aims to learn the optimal decision rule by making personalized decisions and updating the decision rule recursively. It has become easier than before with the help of big data, but new challenges also come along. Since the decision rule should be updated once per step, an offline update that uses all of the historical data is inefficient in computation and storage. To this end, we propose a completely online algorithm that can make decisions and update the decision rule online via stochastic gradient descent. It is not only computationally efficient but also supports a wide range of parametric reward models. Focusing on the statistical inference of online decision making, we establish the asymptotic normality of the parameter estimator produced by our algorithm and of the online inverse probability weighted value estimator used to estimate the optimal value. Online plug-in estimators for the variances of the parameter and value estimators are also provided and shown to be consistent, so that interval estimation and hypothesis testing are possible using our method. The proposed algorithm and theoretical results are tested by simulations and a real data application to news article recommendation.
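To make the fully online update concrete, here is a minimal sketch. This is not Dr. Song's algorithm; it is an illustrative epsilon-greedy rule with a hypothetical two-arm linear reward model (the coefficients in `TRUE_THETA` are made up for simulation), where each decision triggers a single stochastic-gradient step on the chosen action's coefficients rather than a refit on all historical data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-arm coefficients, used only to simulate rewards.
TRUE_THETA = np.array([[1.0, 0.5, -0.5],
                       [0.5, -0.5, 1.0]])

def online_decisions(T=2000, eps=0.1, lr=0.1):
    """Epsilon-greedy online decision making with a linear reward model.
    At each step: observe a context, choose an arm, observe a noisy
    reward, and update only that arm's coefficients with one SGD step."""
    d = TRUE_THETA.shape[1]
    theta = np.zeros_like(TRUE_THETA)   # running coefficient estimates
    for t in range(1, T + 1):
        x = rng.normal(size=d)          # context features for this step
        if rng.random() < eps:          # explore with probability eps
            a = int(rng.integers(2))
        else:                           # otherwise exploit current estimates
            a = int(theta[1] @ x > theta[0] @ x)
        r = TRUE_THETA[a] @ x + rng.normal(scale=0.1)
        grad = (theta[a] @ x - r) * x   # gradient of the squared-error loss
        theta[a] -= (lr / np.sqrt(t)) * grad  # decaying step size
    return theta
```

Only a constant amount of state (the current coefficients) is kept between steps, which is the computational and storage advantage of the online update described above.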


SEPTEMBER 12, 2019

 

Topic: "Dissecting the Genetic Architecture of Complex Diseases Through Genome Wide Association Studies"

 

Speaker: Dr. Hongyu Zhao

Email: hongyu.zhao@yale.edu

Chair and Ira V. Hiscock Professor, Department of Biostatistics, Yale School of Public Health

11:30am-12:30pm

AR Building, 8th Floor Auditorium

Hosted by: Zhezhen Jin

 

Abstract: Genome-wide association studies (GWAS) have been a great success in the past decade, with thousands of chromosomal regions in the human genome implicated for hundreds of complex diseases. A typical GWAS involves thousands to hundreds of thousands of individuals, each queried at millions of sites in the human genome. Despite these successes, significant challenges remain in both identifying new risk loci and interpreting results from these large data sets. In this presentation, I will describe our recent efforts to develop statistical methods, theories, and resources to infer the genetic architecture of complex diseases, such as the proportion of phenotypic variation explained by genetic variation, the tissue and cell-type origins of diseases, and genetic correlations among phenotypes. The effectiveness of our methods will be demonstrated through their applications to GWAS results for a large number of traits/diseases. This is joint work with Can Yang, Jiming Jiang, Qiongshi Lu, Debashis Paul, Wei Jiang, Cecilia Dao, and others.


SEPTEMBER 18, 2019  ***(WEDNESDAY)***

 

Topic: "Sample size considerations for precision medicine"

 

Speaker: Dr. Eric Laber

Email: eblaber@ncsu.edu

Professor, Statistics Department, NCSU

4pm-5pm

AR Building, 8th Floor Auditorium

Hosted by: Min Qian

 

Abstract: Sequential Multiple Assignment Randomized Trials (SMARTs) are considered the gold standard for estimation and evaluation of treatment regimes. SMARTs are typically sized to ensure sufficient power for a simple comparison, e.g., the comparison of two fixed treatment sequences. Estimation of an optimal treatment regime is conducted as part of a secondary, hypothesis-generating analysis, with formal evaluation of the estimated optimal regime deferred to a follow-up trial. However, running a follow-up trial to evaluate an estimated optimal treatment regime is costly and time-consuming; furthermore, the estimated optimal regime that is to be evaluated in such a follow-up trial may be far from optimal if the original trial was underpowered for estimation of an optimal regime. We derive sample size procedures for a SMART that ensure: (i) sufficient power for comparing the optimal treatment regime with standard of care; and (ii) that the estimated optimal regime is within a given tolerance of the true optimal regime with high probability. We establish asymptotic validity of the proposed procedures and demonstrate their finite-sample performance in a series of simulation experiments.


SEPTEMBER 26, 2019  

 

Topic: "A propensity scoring framework for multiple and continuous treatments and their applicability to EHR and genomic data"

 

Speaker: Dr. Stacia DeSantis

Email: stacia.m.desantis@uth.tmc.edu

Professor, Department of Biostatistics and Data Science, University of Texas Health Science Center at Houston

11:30am-12:30pm

AR Building, 8th Floor Auditorium

Hosted by: Cody Chiuzan

 

Abstract: Large observational studies such as those derived from electronic health records (EHRs) and whole-genome sequencing are becoming readily available to the public for data analysis. However, utilizing these complex data types to make unbiased inference about the effects of treatments or genetic loci on clinical outcomes or phenotypes remains challenging. We present a unified causal inference propensity score framework for multiple treatments (i.e., ordinal, categorical, and continuous) that can be applied regardless of the chosen propensity model, present its application to EHR data, and discuss its potential scalability to genetic causal inference, with the goal of identifying treatments, behaviors, and genetic factors that influence health outcomes.


OCTOBER 3, 2019

 

Topic: "Fast Algorithms for Detection of Structural Breaks in High Dimensional Data"

 

Speaker: Dr. George Michailidis

Email: gmichail@ufl.edu

Professor and Director, Informatics Institute, University of Florida

11:30am-12:30pm

AR Building, 8th Floor Auditorium

Hosted by: Gen Li

 

Abstract: Many real time series data sets exhibit structural changes over time. It is then of interest to estimate both the (unknown) number of structural break points and the parameters of the statistical model employed to capture the relationships among the variables/features of interest. An additional challenge emerges in the presence of very large data sets, namely how to accomplish these two objectives in a computationally efficient manner. In this talk, we outline a novel procedure that leverages a block segmentation scheme (BSS) to reduce the number of model parameters to be estimated through a regularized least squares criterion. Specifically, BSS examines appropriately defined blocks of the available data, which, when combined with a fused-lasso-based estimation criterion, lead to significant computational gains without compromising statistical accuracy in identifying the number and location of the structural breaks. This procedure is further coupled with new local and global screening steps to consistently estimate the number and location of break points. The procedure is scalable to large high-dimensional time series data sets and can provably achieve significant computational gains. It is further applicable to various statistical models, including regression, graphical models, and vector autoregressive models. Extensive numerical work on synthetic data supports the theoretical findings and illustrates the attractive properties of the procedure. Applications to neuroimaging data will also be discussed.
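As a toy illustration of the block idea (this is not the authors' BSS/fused-lasso procedure; it is a simplified stand-in for a block-based screening step, with a made-up series and threshold), one can scan fixed-size blocks of a univariate mean-shift series and flag boundaries where adjacent block means jump:

```python
import numpy as np

def screen_block_breaks(y, block_size=20, thresh=1.0):
    """Flag candidate break points in a 1-D series by comparing the
    means of adjacent fixed-size blocks; a boundary is kept when the
    jump between neighboring block means exceeds `thresh`. Each
    observation is touched a constant number of times, so the cost is
    linear in the series length."""
    flagged = []
    for b in range(block_size, len(y) - block_size + 1, block_size):
        left = y[b - block_size:b].mean()    # mean of block ending at b
        right = y[b:b + block_size].mean()   # mean of block starting at b
        if abs(right - left) > thresh:
            flagged.append(b)
    return flagged
```

A subsequent local step would then re-examine only the neighborhoods of the flagged boundaries, which is where the computational savings over an exhaustive search come from.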


OCTOBER 10, 2019 

 

Topic: "Constructing tumor-specific gene regulatory networks based on samples with tumor purity heterogeneity"

 

Speaker: Dr. Pei Wang

Email: pei.wang@mssm.edu

Professor, Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai

11:30am-12:30pm

AR Building, Hess Commons

Hosted by: Zhezhen Jin

 

Abstract: Tumor tissue samples often contain an unknown fraction of normal cells. This problem, known as tumor purity heterogeneity (TPH), was recently recognized as a severe issue in omics studies. Specifically, if TPH is ignored when inferring co-expression networks, edges are likely to be estimated among genes with a mean shift between normal and tumor cells rather than among gene pairs interacting with each other in tumor cells. To address this issue, we propose TSNet, a new method that constructs tumor-cell-specific gene/protein co-expression networks based on gene/protein expression profiles of tumor tissues. TSNet treats the observed expression profile as a mixture of expressions from different cell types and explicitly models the tumor purity percentage in each tumor sample. The advantage of TSNet over existing methods that ignore TPH is illustrated through extensive simulation examples. We then apply TSNet to estimate tumor-specific co-expression networks based on ovarian cancer expression profiles. We identify novel co-expression modules and hub structures specific to tumor cells.


 

OCTOBER 11, 2019   ***(FRIDAY)***

 

Topic: "Efficient adjustment sets for population average causal effect estimation in graphical models"

 

Speaker: Dr. Andrea Rotnitzky

Email: arotnitzky@utdt.edu

Professor, Universidad Torcuato Di Tella; also Adjunct Professor of Biostatistics, Harvard University

11:30am-12:30pm

AR Building, 8th Floor Auditorium

Hosted by: Ian McKeague

 

Abstract: Covariate adjustment is often used for estimation of population average causal effects (ATE). In recent years, graphical rules have been derived for determining, from a causal diagram, all covariate adjustment sets. Restricting attention to causal linear models, a very recent article introduced two graphical criteria: one to compare the asymptotic variance of linear regression estimators that control for certain distinct adjustment sets, and a second to identify the optimal adjustment set that provides the smallest asymptotic variance. In this talk, I will show that the same graphical criteria can be used in arbitrary causal diagrams when the goal is to minimize the asymptotic variance of non-parametric estimators of the ATE that ignore the causal diagram assumptions. Furthermore, I will provide a graphical criterion to determine the optimal adjustment set among the minimal adjustment sets. In addition, I will provide another graphical criterion for determining when a non-parametric estimator of the ATE is as efficient as an efficient estimator that exploits the causal diagram assumptions. Finally, I will show that for estimating the effect of time-dependent treatments in the presence of time-dependent confounders, there exist diagrams with no optimal adjustment sets.


 

OCTOBER 17, 2019 

 

Topic: "Targeted Machine Learning for Causal Inference based on Real World Data"

 

Speaker: Dr. Mark van der Laan

Email: laan@berkeley.edu 

Jiann-Ping Hsu/Karl E. Peace Endowed Chair and Professor of Biostatistics, University of California, Berkeley

11:30am-12:30pm

AR Building, 8th Floor Auditorium

Hosted by: Caleb Miles

 

Abstract: We discuss a general roadmap for generating causal inference based on observational studies used to generate real-world evidence. This defines the statistical estimation problem in terms of knowledge about the data-generating experiment and a target estimand, where the target estimand aims to identify or best approximate the causal quantity of interest. We review targeted minimum loss estimation (TMLE), which provides a general template for the construction of asymptotically efficient plug-in estimators of a target estimand for realistic (i.e., infinite-dimensional) statistical models. TMLE is a two-stage procedure that first uses ensemble machine learning, termed super-learning, to estimate the relevant stochastic relations between the treatment, censoring, covariates, and outcome of interest. The super-learner allows one to fully utilize all the advances in machine learning (in addition to more conventional parametric-model-based estimators) to build a single most powerful machine learning algorithm. In the second step, TMLE maximizes a parametric likelihood along a so-called least favorable parametric model through the super-learner fit of the relevant stochastic relations in the observed data, where this least favorable parametric model will also involve an estimator of the treatment and censoring mechanism. This second step bridges the state of the art in machine learning to estimators of target estimands for which statistical inference is available (i.e., confidence intervals, p-values, etc.). We present an approach, collaborative TMLE, to regularize the targeting step, involving targeted estimation of the treatment and censoring mechanism, thereby further optimizing and robustifying the TMLE.

 

The asymptotic normality and efficiency of the TMLE rely on the asymptotic negligibility of a second-order remainder term. This typically requires the initial (super-learner) estimator to converge at a rate faster than n^(-1/4) in the sample size n. We show that a new Highly Adaptive LASSO (HAL) estimator of the data distribution and its functionals indeed converges at a sufficient rate regardless of the dimensionality of the data/model, under almost no additional regularity conditions. This allows us to propose a general TMLE, using a super-learner whose library includes HAL, that is asymptotically normal and efficient in great generality.

 

We demonstrate the practical performance of the corresponding HAL-TMLE (and its confidence intervals) for the average causal effect for dimensions up to 10, based on simulations that randomly generate data distributions. We also discuss a nonparametric bootstrap method for inference that takes into account the higher-order contributions of the HAL-TMLE, providing excellent robust coverage.


OCTOBER 21, 2019   ***(MONDAY)***

 

Topic: "Designing Combined Data Products: Examples from Economic Research Studies"

 

Speaker: Dr. Frauke Kreuter

Email: fkreuter@umd.edu

Professor and Director, Joint Program in Survey Methodology, University of Maryland

11:30am-12:30pm

Hammer Building, Room LL106

Hosted by: Qixuan Chen

 

Abstract: Combining data from different sources will be key for social scientists to take full advantage of the data deluge resulting from the increasing digitalization of society. Currently we see many attempts at using single (big) data sources, with mixed results; the most exciting projects rely on a combination of different data, some still collected with traditional modes. This talk will highlight a few approaches and provide a framework with which researchers can think about creating new data products. An important element in this endeavor, however, is respect for people's privacy. While different cultures have different norms about the collection of specific types of data for specific purposes, the notion of contextual integrity still holds. Learning how to design data collections for new insights in a more holistic way will be the overarching theme of this talk.

 

In the talk I will use several economic research examples, in particular the IAB-SMART research project, to discuss privacy issues and approaches to creating high-quality combined data sources. See the attached paper for details on the privacy part. In brief: the IAB-SMART study combines data from administrative records, surveys, and digital traces from smartphones. The digital trace data are collected via an app. The purpose of the IAB-SMART study is to measure the effects of long-term unemployment on social integration and social activity, as well as the inhibiting effects of reduced social networks and activities on finding reentry into the labor market. Creating measures of social integration requires access to the phone's address book and usage data, as well as sensor data from the accelerometer and geoposition. For valid population estimates, statisticians need to account for potential coverage bias and bias due to nonresponse and measurement error. Using this case study, I will demonstrate how we approached these problems.


OCTOBER 24, 2019 

 

Topic: "Distributed learning from multiple EHRs databases for predicting medical events"

 

Speaker: Dr. Qi Long

Email: qlong@pennmedicine.upenn.edu

Professor, Biostatistics, Perelman School of Medicine, University of Pennsylvania

11:30am-12:30pm

AR Building, 8th Floor Auditorium

Hosted by: Yuanjia Wang

 

Abstract: Electronic health record (EHR) data offer great promise in personalized medicine. However, EHR data also present significant analytical challenges due to their irregularity and complexity. For example, EHRs include data from multiple domains collected over time and include both structured and unstructured data. In addition, analyzing EHR data involves privacy issues, and sharing such data across multiple institutions/sites may be infeasible. Building on a contextual embedding model, we propose a distributed learning approach that learns from multiple EHR databases and predicts multiple medical events simultaneously, handling both structured and unstructured data. We further augment the proposed approach with differential privacy to enhance privacy protection. Our numerical studies demonstrate that the proposed method can build predictive models in a distributed fashion with privacy protection, and the resulting models achieve reasonable prediction accuracy compared with methods that use pooled data across all sites. Our algorithm, if integrated into an EHR system as a decision support tool, has the potential to improve early detection and diagnosis of diseases, which is known to be associated with better patient outcomes. This is joint work with Ziyi Li, Kirk Roberts, and Xiaoqian Jiang.
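The differential-privacy ingredient can be illustrated generically. The sketch below is the standard Gaussian mechanism with gradient clipping, not necessarily the authors' construction; the function name and the `clip` and `noise_mult` settings are hypothetical:

```python
import numpy as np

def privatize_gradient(grad, clip=1.0, noise_mult=1.1, rng=None):
    """Gaussian-mechanism sketch: rescale the gradient so its L2 norm
    is at most `clip`, then add N(0, (noise_mult*clip)^2) noise to each
    coordinate before the gradient leaves the local site."""
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(grad)
    clipped = grad * min(1.0, clip / max(norm, 1e-12))  # bound sensitivity
    return clipped + rng.normal(scale=noise_mult * clip, size=grad.shape)
```

Clipping bounds the influence any single record can have on the shared update, and the noise scale is calibrated to that bound, which is what lets sites exchange model updates rather than raw patient data.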


OCTOBER 31, 2019 

 

Topic: "Restricted function-on-function linear regression model"

 

Speaker: Dr. Ruiyan Luo

Email: rluo@gsu.edu

Associate Professor, Division of Epidemiology & Biostatistics, School of Public Health, Georgia State University

11:30am-12:30pm

AR Building, 8th Floor Auditorium

Hosted by: Todd Ogden

 

Abstract: In the usual function-on-function linear regression model, the coefficient surface is defined on a rectangular region, and the value of the response curve at any point is influenced by the entire trajectory of the predictor curve. In a couple of variants of function-on-function models, such as the historical model, the response value is influenced only by a subset of the predictor curve. In this paper, we consider the restricted function-on-function linear model, where the coefficient surface is defined on a sub-region of the rectangle and the value of the response curve at any point is influenced by a sub-curve of the predictor. The restricted function-on-function model includes the usual function-on-function model and its variants as special cases. We have two major purposes. First, given the sub-region, we propose an efficient estimation procedure for the corresponding restricted function-on-function model based on the optimal expansion of the coefficient surface. Second, as the sub-region is seldom specified in practice, we propose a sub-region selection procedure that can lead to models with better interpretation and better performance than the model without any restriction. Algorithms are developed for both model estimation and sub-region selection.


NOVEMBER 4, 2019   ***(MONDAY)***

 

Topic: "TBA"

 

Speaker: Dr. Peng Zhang

Email: pczhang@med.umich.edu

Research Assistant Professor, Department of Surgery, University of Michigan

11:30am-12:30pm

AR Building, 6th Floor, Room 657

Hosted by: Min Qian

 

Abstract: "TBA"


NOVEMBER 7, 2019   

 

Topic: "TBA"

 

Speaker: Dr. Dylan Small

Email: dsmall@wharton.upenn.edu

Class of 1965 Wharton Professor of Statistics, The Wharton School, University of Pennsylvania

11:30am-12:30pm

AR Building, 8th Floor Auditorium

Hosted by: Min Qian

 

Abstract: "TBA"


NOVEMBER 14, 2019   

 

Topic: "TBA"

 

Speaker: Dr. Yusuke Narita

Email: yusuke.narita@yale.edu

Assistant Professor, Economics, Yale University

11:30am-12:30pm

AR Building, 8th Floor Auditorium

Hosted by: Ying Wei

 

Abstract: "TBA"


NOVEMBER 21, 2019   

 

Topic: "TBA"

 

Speaker: Dr. Lei Liu

Email: lei.liu@wustl.edu

Professor, Division of Biostatistics, Washington University School of Medicine in St. Louis

11:30am-12:30pm

AR Building, 8th Floor Auditorium

Hosted by: Shuang Wang

 

Abstract: "TBA"


NOVEMBER 25, 2019  ***(MONDAY)***  

 

Topic: "TBA"

 

Speaker: Dr. Li-Shan Huang

Email: lhuang@stat.nthu.edu.tw

Professor, Institute of Statistics, National Tsing Hua University

11:30am-12:30pm

AR Building, 6th Floor, Room 657

Hosted by: Zhezhen Jin

 

Abstract: "TBA"


DECEMBER 5, 2019

 

Topic: "TBA"

 

Speaker: Dr. Jenny Bryan

Email: jenny@rstudio.com

Data Scientist, RStudio

11:30am-12:30pm

AR Building, 8th Floor Auditorium

Hosted by: Jeff Goldsmith

 

Abstract: "TBA"


DECEMBER 12, 2019  

 

Topic: "TBA"

 

Speaker: Dr. Zhigang Li

Email: zhigang.li@ufl.edu

11:30am-12:30pm

AR Building, 8th Floor Auditorium

Hosted by: Ian McKeague

 

Abstract: "TBA"