Seminars
Seminars are held on Thursdays at 4:00 pm in WVB 1161
2024-11-14 - Spencer Wadsworth - Bayesian Stacking via Proper Scoring Rule Optimization Using a Gibbs Posterior
Abstract:
In a probabilistic forecast hub, many researchers collaborate by each submitting their own forecasts of the same events. A forecast hub thus provides a locale where forecasts from the various participants may be assessed for skill, directly compared with other forecasts, and findings may be disseminated. Often, the forecasts in a forecast hub are combined into an ensemble forecast. A common method for combining forecasts is to construct an optimal linear pool where forecast distributions are weighted and summed into a mixture distribution. The selection of the model weights in a linear pool has received much attention and is often done by minimizing some score function or via Bayesian model averaging methods. Most methods either fail to provide uncertainty in their estimation of the weights or they do not cater well to problem specific needs. In this work we present the stacked Gibbs posterior (SGP), a novel method for combining forecasts by constructing an optimal linear pool via a Gibbs posterior. The weights of the linear pool are intended to optimize a proper scoring rule, which rule may be specified according to problem needs, and the Gibbs posterior allows for uncertainty quantification of the weights and regularization from a prior distribution. In two simulation studies, we compare the predictive performance of linear pool forecasts constructed using the SGP to linear pools constructed via Bayesian model averaging methods and an equally weighted pool. We also apply the SGP to forecasts from the 2023-24 CDC FluSight collaborative hub and compare the results to those of the model averaging and equally weighted methods. In each study, the forecasts from the SGP outperform ensemble forecasts from the other methods.
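For readers new to these ideas, a generic linear pool and Gibbs posterior take the following form; the notation is illustrative background and not necessarily the speaker's exact specification.

```latex
% Linear pool of K forecast densities with simplex weights w, and a Gibbs posterior that
% tempers a (positively oriented) proper scoring rule S instead of a likelihood.
\[
f_{w}(y) = \sum_{k=1}^{K} w_k\, f_k(y), \qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1,
\]
\[
\pi(w \mid y_{1:n}) \propto \exp\!\Big\{ \eta \sum_{i=1}^{n} S\big(f_{w}, y_i\big) \Big\}\, \pi(w),
\]
% where \eta is a learning-rate parameter and the prior \pi(w) provides regularization.
```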
Bio:
I'm in my final semester as a PhD student at Iowa State University and am currently a research fellow in the ISU Dependable Data Driven Discovery (D4) grant. I previously earned a master's degree in statistics at ISU and a bachelor's degree from Brigham Young University. My research interests are in Bayesian methodology for disease outbreak forecasting and probabilistic forecast hubs.
2024-11-12 - Thomas Kerby - Beyond Black Boxes: Learning Local Higher-Order Interactions for Interpretable Data and Models
Abstract:
In this era of big data and complex machine learning models, achieving model explainability remains a significant challenge. This work explores two interconnected black boxes: the generative processes behind high-dimensional datasets and the opaque mechanisms through which neural networks learn from these data. We introduce a method, called Local CorEx, that interprets neural network models by treating the representations passed through the layers as a complex dataset, allowing us to uncover both localized pairwise dependencies and higher-order interactions (HOIs) within the underlying data. Local CorEx leverages techniques from information theory, manifold learning, and representation learning to identify and extract these HOIs, revealing how groups of features interact in intricate and sometimes surprising ways. We validate Local CorEx through experiments on both synthetic and real-world datasets to assess its robustness and applicability. We find that Local CorEx uncovers meaningful local HOIs that capture dependencies beyond pairwise relationships in an unsupervised manner, paving the way for enhanced interpretability in neural networks and insights into the generative processes of complex data.
Bio:
Tom Kerby is a PhD candidate in Mathematical Sciences with a specialization in Statistics at Utah State University, studying under Dr. Kevin Moon and supported by the Presidential Doctoral Research Fellowship. Tom is a proud BYU alumnus, where he earned his undergraduate in Statistics with minors in Computer Science, Mathematics, and Molecular Biology. While at BYU, he conducted interdisciplinary research with plant geneticists and bioinformaticians in the orphan crops lab. His current research focuses on interpretable machine learning, representation learning, and diffusion models. Tom’s industry experience includes internships with Idaho National Laboratory, where he worked on detecting living-off-the-land cyber attacks, and with Enveda Biosciences, a biotech drug discovery company, where he applied large language models to predict molecular structures and properties. He married his sweetheart before coming to BYU and is now the father of two children, with a third expected in March. In his spare time, he enjoys outdoor activities with his family, astrophotography, playing the piano, and cheering on BYU sports. -
2024-11-07 - Steven Barnett - Generating higher resolution sky maps using a deep Gaussian process Poisson model
Bio:
Steven Barnett is a fifth-year graduate student at Virginia Tech pursuing a Ph.D. in Statistics. He received his M.S. in Statistics from Virginia Tech in 2022 after receiving a B.S. in Computer Science from Brigham Young University in 2016 and working as a software engineer at Qualtrics for four years. His research focuses on Gaussian process regression, surrogate modeling, and computer model calibration, with applications in the physical and engineering sciences. Steven has worked as a Graduate Research Assistant at Los Alamos National Laboratory and has served as a lead statistical consultant in the Statistical Applications and Innovations Group at Virginia Tech, collaborating with researchers in other fields and co-authoring several academic publications. He currently serves as the Communications Chair for the Chemical and Process Industries Division of the American Society for Quality and is the president of the Virginia Tech chapter of Mu Sigma Rho.
Abstract:
The Interstellar Boundary Explorer (IBEX) satellite was launched in 2008 in an effort to learn more about the heliosphere, which sits at the boundary between our solar system and interstellar space. IBEX detects energetic neutral atoms (ENAs) originating at the edge of the heliosphere to create sky maps describing their rate of emission. These sky maps are used by physicists to inform their theoretical models about the heliosphere, its many properties, and the processes that govern its creation. However, the data collected by IBEX are both noisy and irregular. We propose a deep Gaussian process Poisson model to estimate the rate at which ENAs are generated throughout the heliosphere. We believe this constitutes a more cohesive model than those developed previously. Our deep Gaussian process Poisson model has shown a greater ability to learn the complex, non-stationary mean response surface, and to provide appropriate uncertainty quantification in different regions of the input space, while maintaining a simpler covariance function. We have developed a Markov chain Monte Carlo algorithm utilizing elliptical slice sampling and the Vecchia approximation to help better predict the underlying latent deep Gaussian process for use in sky map generation, sequential design, and heliospheric computer model calibration.
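As background, a minimal deep Gaussian process Poisson model for counts indexed by sky location s can be sketched as follows; the talk's exact specification may differ.

```latex
% Two-layer (warped-input) deep GP with a Poisson observation layer.
\[
y(s) \mid \lambda(s) \sim \mathrm{Poisson}\{\lambda(s)\}, \qquad
\log \lambda(s) = f\{g(s)\}, \qquad
g \sim \mathcal{GP}(\mu_g, k_g), \quad f \sim \mathcal{GP}(\mu_f, k_f).
\]
```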
2024-11-05 - Zoe Gibbs - Dynamic State-Space Modeling for Determining the Long-Term Impact of Mobile Health Interventions
Abstract:
While many smartphone applications measure and/or seek to modify human behavior via mobile health (mHealth) interventions, it can be challenging to determine whether the interventions result in lasting behavioral changes. In this paper, we introduce a novel framework for assessing mHealth interventions. Specifically, we define a latent variable that measures a person's tendency toward a behavior of interest. We assume this tendency can deviate over time due to random changes and interventions. We also assume there is an overall long-term effect of interventions, but allow the effect to vary over time using an AR(1) correlation structure. The resulting model can be thought of as a local linear trend state-space model. Additionally, because the type and frequency of mHealth applications often rely on previous response data, we show how endogenous variables can be incorporated into the model. We also show how a hurdle likelihood can be used in the case of zero-inflated data. Because of the complexity of the model, we use Bayesian methods to fit the model. We show via simulations that the model can be fit using a No-U-Turn sampler (NUTS) in Stan. We apply our model to step count data, offering interpretations and areas for future work.
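One way to write a local linear trend model with a time-varying intervention effect is sketched below; this is only a hedged illustration of the class of models described above, with symbols that are not taken from the paper.

```latex
% Observation: latent tendency mu_t plus an intervention effect delta_t on the indicator x_t.
\[
y_t = \mu_t + \delta_t x_t + \varepsilon_t, \qquad \varepsilon_t \sim \mathrm{N}(0, \sigma_{\varepsilon}^2),
\]
% Local linear trend for the latent tendency; intervention effect varies around a
% long-term effect delta with AR(1) structure.
\[
\mu_t = \mu_{t-1} + \beta_{t-1} + \eta_t, \qquad \beta_t = \beta_{t-1} + \zeta_t, \qquad
\delta_t - \delta = \phi\,(\delta_{t-1} - \delta) + \nu_t .
\]
```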
Bio:
Zoe McBride is a Ph.D. student at the University of Connecticut. Her research interests include Bayesian hierarchical modeling and functional data analysis, especially in applications surrounding behavioral health and actuarial science. She received B.S. and M.S. degrees in Statistics from Brigham Young University in 2020. Zoe is also an associate of the Society of Actuaries. In her free time, she enjoys running, hiking, paddleboarding, and spending time with her husband and daughter. -
2024-10-31 - Dr. Tim Swartz - Two Problems in Soccer Analytics
Abstract:
This talk concerns two problems in soccer analytics that both rely on tracking data. The first problem begins with a review of average aging curves in sport. Then, a new approach is introduced for personal aging curves in soccer, an important problem that has not previously been addressed.
The second problem concerns the development of a metric that identifies soccer players who have a similar style to a player of interest. Whereas performance variables have been well studied, the same is not true of stylistic variables. Unlike assessments from scouting, the metric is automatic and objective. The metric is developed using a Bayesian framework.
Bio:
Tim Swartz is Professor and former Chair in the Department of Statistics and Actuarial Science at Simon Fraser University. He obtained a PhD and MSc in Statistics from the University of Toronto and a BMath from the University of Waterloo. He has over 120 research publications and has written several books, including an Oxford text (2000) with Michael Evans on Approximating Integrals via Monte Carlo and Deterministic Methods. He is a Fellow of the American Statistical Association and an associate editor for five journals. Most of his current research involves sports analytics.
2024-10-24 - Dr. Jean-François Bégin - Modeling Subnational Mortality in the Presence of Aggregated Data
Abstract:
This study proposes a new approach to modeling subnational mortality that relies on individual features (e.g., sex, geographical region, socioeconomic status) instead of dealing directly with subpopulations. Our strategy leads to more parsimonious models because fewer parameters are needed to explain mortality. Also, data providers might aggregate data over privacy concerns, and our framework allows for the use of such data, unlike the common subnational mortality modeling approach. A general one-step Bayesian estimation methodology that works well with most age–period–cohort mortality models proposed thus far in the literature is presented; it uses Markov chain Monte Carlo techniques by combining deterministic filtering with adaptive Metropolis steps and is well-suited for high-dimensional cases like the one investigated in this article. In a case study using real data, the framework is applied to Canadian mortality data from three datasets that encompass three features: sex, geographic region, and socioeconomic status. We show that the proposed approach combined with a reasonable mortality model provides realistic, coherent, and plausible mortality projections and that it fits the data reasonably.
This is joint work with Barbara Sanders and Xueyi Xu.
Bio:
Dr. Jean-François Bégin is an Associate Professor in the Department of Statistics and Actuarial Science at Simon Fraser University in British Columbia. He is a specialist in financial modelling as well as statistical and mathematical applications to finance and insurance. Before joining Simon Fraser University, he received his PhD from HEC Montréal in Financial Engineering. He is also a Fellow of both the Society of Actuaries and the Canadian Institute of Actuaries.
Over the past few years, his research program has focused on the construction of complex models for long-term economic predictions, the understanding and management of credit risk, the modelling of option prices, and the development of sustainable retirement solutions and designs. He was the recipient of the first Young Researcher Award of the Canadian Institute of Actuaries in 2017. He also won the 2019 Bob Alting von Geusau Prize of the AFIR-ERM section of the International Actuarial Association and the Early Career Faculty Research Award of Simon Fraser University’s Faculty of Science.
2024-10-17 - Dr. Sameer Deshpande - Scalable Targeted Smoothing In High Dimensions With BART
Abstract:
Bayesian Additive Regression Trees (BART) is an easy-to-use and highly effective nonparametric regression model that approximates unknown functions with a sum of binary regression trees (i.e., piecewise-constant step functions). Consequently, BART is fundamentally limited in its ability to estimate smooth functions. Initial attempts to overcome this limitation replaced the constant output in each leaf of a tree with a realization of a Gaussian Process (GP). While these elaborations are conceptually elegant, most implementations thereof are computationally prohibitive, displaying cubic per-iteration complexity.
We propose a version of BART built with trees that output linear combinations of ridge functions; that is, our trees return linear combinations of compositions between affine transforms of the inputs and a (potentially non-linear) activation function. We develop a new MCMC sampler that updates trees in linear time. Our proposed model includes a random Fourier feature-inspired approximation to treed GPs as a special case. More generally, our proposed model can be viewed as an ensemble of local neural networks, which combines the representational flexibility of neural networks with the uncertainty quantification and computational tractability of BART. -
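In symbols, a tree whose leaves return linear combinations of ridge functions can be written roughly as below; this is a sketch consistent with the description above, not the speaker's exact notation.

```latex
% \ell(x) is the leaf of tree T containing x, \varphi is an activation function, and the
% BART-style fit sums M such trees.
\[
g(x; T) = \sum_{j=1}^{J} \beta_{\ell(x), j}\, \varphi\!\big(a_{\ell(x), j}^{\top} x + b_{\ell(x), j}\big),
\qquad
\hat{f}(x) = \sum_{m=1}^{M} g(x; T_m).
\]
```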
2024-10-10 - Dr. Alex Petersen - FunCLIME: Sparse Estimation of a Functional Precision Operator by Constrained L1 Minimization
Abstract:
In multivariate data analysis, the precision (or inverse covariance) matrix is an important estimation target as it encodes information about linear dependencies between variables conditional on others, going beyond the pairwise correlations represented in the covariance matrix itself. Using the sample covariance as an input, modern algorithms provide sparse estimates of the precision matrix, with zero estimates indicating conditional linear independence. The first part of this talk will summarize two of these algorithms, commonly known as the graphical Lasso and CLIME, the latter being a simple and very computationally convenient relaxation of the former. In the second part, the extension of the precision matrix to multivariate functional data will be presented, along with a recently proposed (and by no means obvious) extension of the graphical Lasso methodology. Finally, I will present some ongoing work related to the Functional CLIME relaxation (FunCLIME), including the associated optimization problem, the computational algorithm, and an outline of some finite-sample theoretical properties.
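For reference, the classical CLIME estimator that FunCLIME extends solves the following constrained L1 problem (standard form, with S the sample covariance):

```latex
\[
\hat{\Omega} = \arg\min_{\Omega} \|\Omega\|_{1}
\quad \text{subject to} \quad \|S\,\Omega - I\|_{\infty} \le \lambda ,
\]
% followed by a symmetrization step; zeros in the estimate are read as conditional linear independence.
```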
2024-10-03 - Dr. Jeffery Dawson - Thinking Outside the Box Plot: Nontraditional Roles for Statisticians
Bio:
Dr. Dawson (BS in Statistics, BYU, and ScD in Biostatistics, Harvard) is a Professor of Biostatistics at the University of Iowa’s College of Public Health (UI CPH), where he serves as Associate Dean for Faculty Affairs, Chair of the Global Public Health Initiative, and a Fulbright Scholar mentor. He has collaborated with over 30 departments across campus and is particularly known for his research in elderly and neurologically impaired drivers, cardiovascular health, and cancer. His methodological interests include longitudinal data and clinical trials. He is a biostatistical consultant for Statistics Without Borders and has participated in several public health and educational outreach activities in Haiti, the Dominican Republic, and Mexico. He is a Fellow of the American Statistical Association.
Abstract:
Statistical training often focuses on data analysis methods and the theoretical underpinnings thereof. However, the actual work responsibilities of a statistician often expand into many other areas, and some statisticians may become involved in activities well beyond their anticipated roles. In this presentation, we discuss several examples of nontraditional roles, including addressing the media, providing training in ethics, conflict management, community outreach, and supervising study abroad programs. Other opportunities for nontraditional activities exist, depending on our work environment and our talents and interests. The quantitative skills that statisticians have often provide an important perspective when participating in nontraditional activities. -
2024-09-19 - Dr. Nate Sandholtz - Investigating the Spatial Component of Serving Strategy in Tennis
Abstract:
A key component of a player's strategy in tennis is the decision of where to aim their serve. At a high level, there are two primary strategies: aiming "out wide" (Wide), which entails serving across the court toward the outside boundary of the service region; or "up the T" (T), in which the server aims toward the center line of the service region. Once the player has decided which high-level region they will serve to, they must choose (perhaps subconsciously) a specific spatial target within that region. This component of the serving decision is the focus of our research. Specifically, conditional on a chosen region (Wide or T), where should a player aim their serve, knowing that they will not be able to execute their intention perfectly? To answer this question, we model the serve as a two-period Markov decision process (MDP) where both the transition dynamics and expected rewards must be learned from data. We examine two sources of data: first, data from an experiment we conducted with players from the BYU Men's tennis team; and second, observational data from the 2020 and 2021 Australian Opens. In our presentation, we will show results from our analysis on each data source and discuss their respective differences, advantages, and limitations.
2024-09-12 - Dr. Jared Whitehead - Aliasing Beyond Bias and Variance, a Nearly Label-Independent Decomposition for Quantitative Risk
Abstract:
We develop a novel decomposition of error in regression problems with intuition taken from the effects of signal aliasing from the signal processing community. This new decomposition agrees with the classical interpretation of the bias-variance tradeoff in the underparameterized regime, but also completely explains the presence of the so-called 'double descent' phenomenon as well as the potential for multiple descents and a slew of other non-convex behavior in the risk. Depending on the distribution of the training vs testing data (labels), we can use this decomposition to determine when the classical 'sweet spot' or underparameterized model is preferred, or when extreme over-parameterization (such as deep neural networks) is warranted. This generalized aliasing decomposition is demonstrated on a few simple regression problems, and on a large dataset arising from materials science where we show that the non-convex behavior of the risk is completely explained via this decomposition.
2024-04-04 - Xinyi "Lucy" Lu - Latent trajectory models for spatio-temporal ecosystem dynamics
Abstract:
State-space models are commonly used to characterize variation in ecosystems in space and time. However, Markov models (e.g., Discrete Time Markov Chain) are susceptible to imputing unrealistically volatile states when sampling occurred at irregular intervals. We proposed an alternative class of latent trajectory models that represent ecological processes using dynamic state probabilities instead of state transition probabilities. We demonstrated our method using two case studies. The first case study quantified the structural transformation of Alaskan land covers using remotely sensed imagery. We developed a Bayesian hierarchical model to infer the rates of land cover transitions and a Polya-Gamma sampling strategy to improve computation. The second case study differentiated population vulnerability to seasonal weather using native brook trout (Salvelinus fontinalis) count data. We inferred the latent states by incorporating ecological mechanisms and developed a Bayesian regularization strategy to promote temporal coherence. -
2024-03-28 - Amy Herring - Bayesian Learning of Clinically Meaningful Disease Phenotypes
Abstract:
Sepsis is a life-threatening condition that arises when the immune system's response to infection causes damage to internal organs and tissues. Recently, researchers have hypothesized that sepsis consists of a heterogeneous spectrum of distinct subtypes, motivating several studies to identify clusters of sepsis patients that correspond to subtypes, with the long-term goal of using these clusters to design more personalized treatments. Clinicians therefore rely on clusters having a concrete medical interpretation, usually corresponding to clinically meaningful regions of the sample space with direct implications for practitioners. In this article, we propose Clustering Around Meaningful Regions (CLAMR), a Bayesian clustering method that explicitly models the medical interpretation of each cluster center. CLAMR favors clusterings that can be summarized via meaningful feature values, leading to medically significant sepsis patient clusters. We also provide details on measuring the effect of each feature on the clustering using Bayesian hypothesis tests, so one can assess which features are relevant for cluster interpretation. The method is applied to clustering sepsis patients from Moshi, Tanzania, where patients are younger and the prevalence of HIV infection is higher than in previous sepsis subtyping cohorts from Europe and North America.
Bio:
Amy H. Herring is Sara & Charles Ayres Distinguished Professor of Statistical Science, Global Health, and Biostatistics and Bioinformatics at Duke University. Dr. Herring received her doctorate in biostatistics at Harvard University and came to Duke from UNC-Chapel Hill, where she was distinguished professor of biostatistics. Her research interests include development of statistical methodology for longitudinal or clustered data, Bayesian methods, latent class and latent variable models, missing data, complex environmental mixtures, and applications of statistics in population health and medicine. She has received numerous awards for her work, including the Mortimer Spiegelman Award from the American Public Health Association as the best applied public health statistician under age 40. Her research program is funded by NIH, and she holds leadership positions at the national and international level, including as Chair of the American Statistical Association's Section on Bayesian Statistical Science, as President of the International Society for Bayesian Analysis, and as a member of the Board of the International Biometric Society. -
2024-03-21 - Jyotishka Datta - Global-Local Shrinkage Priors: An Overview and New Directions
Abstract:
Building scalable Bayesian methods for handling high-dimensional data with complex structure remains an important methodological challenge with diverse applications. While there is a vast literature proposing elaborate shrinkage and sparsity priors for high-dimensional continuous data and real-valued parameters, there has been limited consideration of compositional or count data and admixtures. In the first part of my talk, I will provide a broad overview of the state-of-the-art in global-local shrinkage priors, covering theoretical optimality as well as computational aspects. In the second part, I will discuss a few recent developments, namely designing a shrinkage prior to handle bi-level sparsity and handling sparse compositional data, routinely observed in microbiomics. I will address the methodological challenges associated with each of these problems and propose to fill this gap by using new prior distributions specially designed to enable handling structured data. I will provide theoretical support for the proposed methods and demonstrate improved performance in simulation settings and applications to environmetrics and microbiome data.
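As background for the overview, the canonical global-local form, with the horseshoe as the best-known example, is:

```latex
\[
\beta_j \mid \lambda_j, \tau \sim \mathrm{N}\!\big(0, \lambda_j^2 \tau^2\big), \qquad
\lambda_j \sim \mathrm{C}^{+}(0, 1), \qquad \tau \sim \mathrm{C}^{+}(0, 1),
\]
% where the local scales \lambda_j let individual signals escape shrinkage while the
% global scale \tau adapts to the overall level of sparsity.
```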
Author Bio: Jyotishka Datta is an assistant professor in the Department of Statistics at Virginia Tech and a core faculty member at the Center of Biostatistics and Health Data Science. He was an Assistant Professor in the Department of Mathematical Sciences at the University of Arkansas at Fayetteville from 2016-2020. Jyotishka received his PhD in Statistics from Purdue University and worked as a postdoctoral fellow at Duke University and SAMSI (Statistics and Applied Mathematical Sciences Institute). His research interest spans developing new methodology and theory for high-dimensional data with low-dimensional structures. He has contributed to the areas of shrinkage estimation, sparse signal recovery, changepoint detection, nonparametric Bayes, and graphical models. Recent applications include next-gen sequencing studies, auditory neuroscience, ecology, and crime forecasting.
2024-03-14 - Lane Muranaka - A Salary Negotiation Discussion: Resources, Tactics, and Strategies
Lane is the Career Director for Computer Science, Math, and Statistics.
Why I love working here:
I love working where we can blend secular and spiritual learning to impact the lives of the students that we serve.
Quote I live by:
Trust in the LORD with all thine heart; and lean not unto thine own understanding. In all thy ways acknowledge him, and he shall direct thy paths – Proverbs 3:5-6
Advice for Students:
Follow your curiosity and ask lots of questions.
Favorite BYU Ice Cream Flavor:
Mint Chocolate Chip (do they have that here?)
What I love doing when I’m not at work:
Right now, I love to chase my kids around and support their activities. I love sports, playing golf, and it is a goal of mine to see a game at every major league baseball stadium. -
2024-03-07 - Trevor Harris - Climate Models, Variables & Projections
Bio:
Trevor Harris is an assistant professor in the Department of Statistics at Texas A&M University. His research interests include the development and application of deep learning tools to problems in Climate Science and Epidemiology. Some recent works include developing new Wasserstein metrics for validating climate models, large-scale Granger causality testing and policy evaluation through deep neural networks, and forecasting West Nile virus with graph neural networks. In the fall, he will join the Department of Statistics at the University of Connecticut through their cluster hire on Climate Science and AI.
Abstract:
Large ensembles of climate models are the primary method for quantifying uncertainty in future climate projections. However, these ensembles lack traditional statistical guarantees regarding their coverage and precision, potentially leading to either over or under representation of certainty in climate projections. To address this, we propose a conformal inference based approach, called conformal ensembles, that is guaranteed to have joint coverage over the prediction field without being excessively wide. This approach allows us to use nearly any prediction algorithm, including black box algorithms, to combine climate model output and generate statistically valid prediction regions whose sharpness directly improves with the accuracy of the algorithm. We demonstrate our approach on three climate variables: average temperature, maximum temperature, and precipitation, and show that it significantly outperforms inter-model variability on a wide range of uncertainty quantification metrics. We also show that conformal ensembles have good marginal performance across spatial locations, do not generally deteriorate over time, and typically improve with increasing climate model ensemble size. Finally, we investigate issues related to distribution shift that can impact the statistical guarantees of our approach.
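For context, the generic split-conformal construction on which conformal approaches build is shown below; this is textbook background rather than the paper's exact method.

```latex
% With calibration residuals r_i = |y_i - \hat{y}(x_i)|, i = 1,\dots,n, and miscoverage level \alpha:
\[
q_{1-\alpha} = \text{the } \lceil (n+1)(1-\alpha) \rceil / n \text{ empirical quantile of } \{ r_i \},
\qquad
C(x) = \big[\, \hat{y}(x) - q_{1-\alpha}, \ \hat{y}(x) + q_{1-\alpha} \,\big],
\]
% which attains marginal coverage of at least 1 - \alpha under exchangeability.
```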
2024-02-15 - Matthew Heaton - Adjusting for Spatial Correlation in Machine and Deep Learning
Abstract:
Spatial data display correlation between observations collected at neighboring locations. Generally, machine and deep learning methods either do not account for this correlation or do so indirectly through correlated predictors and thereby forfeit predictive accuracy. To remedy this shortcoming, we propose preprocessing the data using a simple spatial decorrelation transform derived from principles of Gaussian processes. The decorrelating transform is computationally scalable and allows the transformed data to be ported immediately into any machine or deep learning tool. After model fitting on the transformed data, the output from machine or deep learning methods is then spatially re-correlated via the corresponding back transformation. We show that including such a spatial adjustment results in higher predictive accuracy on simulated and real spatial datasets.
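A minimal sketch of the decorrelate, fit, and re-correlate workflow described above, assuming a known exponential covariance and a random forest as the learner. The exp_cov helper, parameter values, and toy data are illustrative assumptions, not taken from the talk.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.ensemble import RandomForestRegressor

def exp_cov(coords, sigma2=1.0, phi=0.3, nugget=1e-6):
    """Exponential covariance matrix for 2-D locations (assumed known here)."""
    D = cdist(coords, coords)
    return sigma2 * np.exp(-D / phi) + nugget * np.eye(len(coords))

rng = np.random.default_rng(0)
coords = rng.uniform(size=(300, 2))                          # spatial locations
X = rng.normal(size=(300, 3))                                # covariates
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=300)    # toy response

L = np.linalg.cholesky(exp_cov(coords))   # Sigma = L L'
y_dec = np.linalg.solve(L, y)             # decorrelated response
X_dec = np.linalg.solve(L, X)             # decorrelated predictors

# Fit any off-the-shelf learner on the decorrelated data, then re-correlate its output.
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_dec, y_dec)
y_fit = L @ model.predict(X_dec)          # back-transform to the original (correlated) scale
```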
2024-02-08 - Brian Hartman - Personal Finance and Financial Independence
We will discuss some basics of personal finance. We will examine how to make choices with your money to be able to achieve your goals and live the life you want to live. We will also talk about early retirement and financial independence, and how unusual saving can give you ultimate career and life flexibility. You are welcome to bring friends, partners, or roommates (especially those who never seem to be able to pay their portion of the utilities).
2024-02-01 - Cameron Bale - Can We Protect Time Series Data While Maintaining Accurate Forecasts?
Abstract:
In the context of increasing data breaches and privacy concerns, we focus on the often-neglected goal of maintaining the usefulness of protected data by exploring the intersection of data privacy and time series forecasting. Using both simulated and real-world time series data sets, we test various privacy methods, including a proposed swapping-based method (k-nTS+) designed to maintain time series features, a differentially private method, and an approach based on sharing model weights trained on unprotected data. Based on forecasts from seven different forecasting models, none of the privacy methods based on swapping or random noise addition can consistently maintain forecast accuracy at an acceptable level of privacy. On the other hand, sharing model weights trained on unprotected data enables accurate forecasting, but accurate forecasts themselves can uncover the identities of the unprotected time series. We show that transforming time series into rates increases the similarity of time series features, values, and forecasts, and enables k-nTS+ to produce protected rate time series with a reduction in forecast accuracy of just 3.6% on average. Overall, this paper shows that except under certain conditions, creating protected time series with acceptable privacy is incompatible with obtaining accurate forecasts.
Author Bio: Cameron is a Ph.D. candidate in Business Analytics at Drexel University Lebow College of Business. He specializes in statistical and machine learning-based privacy methods with a focus on location, textual, and time series data. His research emphasizes a utility-first approach to privacy protection that balances data usefulness with privacy risk. Cameron holds a B.S. in Economics from Brigham Young University and has published in Expert Systems with Applications and Transactions on Data Privacy. He will join the Brigham Young University Marketing Department as an assistant professor in Fall 2024. -
2024-01-25 - Matt Koslovsky - Accommodating Measurement Error in Multivariate Compositional Count Data with Applications to Microbiome Research
Abstract:
The human microbiome is the collection of microorganisms that live on and inside of our bodies. Microbiome data are inherently challenging to analyze due to their high-dimensionality, overdispersion, and zero-inflation. Analysis is further complicated by the steps taken to collect and process microbiome samples. For example, sequencing instruments have a fixed capacity for the total number of reads delivered. It is therefore essential to treat microbial samples as compositional. Another complicating factor of modeling microbiome data is that taxa counts are subject to measurement error introduced at various stages of the measurement protocol. Recently, the Dirichlet-multinomial (DM) distribution and its variants have been used extensively to model microbiome data due to its ability to accommodate the compositional structure of the data as well as overdispersion. A major limitation of the DM distribution is that it is unable to handle excess zeros typically found in practice which may bias inference. In this talk, I will introduce a novel Bayesian zero-inflated DM model for multivariate compositional count data with excess zeros, designed to boost scalability without sacrificing interpretability or imposing limiting assumptions. I will then present extensions to handle high-dimensional regression settings and potential taxonomic misclassification. The performance of the proposed methods is examined through simulation and is further illustrated using human microbiome data.
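Schematically, the Dirichlet-multinomial layer for taxa counts, with zero-inflation handled by latent indicators, can be written as follows; this is generic notation, not the speaker's full model.

```latex
\[
\mathbf{y}_i \mid \mathbf{p}_i \sim \mathrm{Multinomial}(N_i, \mathbf{p}_i), \qquad
\mathbf{p}_i \sim \mathrm{Dirichlet}(\boldsymbol{\gamma}_i),
\]
% with excess zeros accommodated by mixing each taxon's contribution with a point mass
% at (near) zero via latent indicators, e.g. z_{ij} \sim \mathrm{Bernoulli}(\theta_j).
```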
Bio:
I graduated with a PhD in Biostatistics from The University of Texas Health Science Center School of Public Health (UTHealth) in December 2016 and held an appointment as a Post-Doctoral Research Associate in Data Science at Rice University in Dr. Marina Vannucci’s research lab from March 2018 to August 2020. Prior to this role, I served as a statistical consultant in the Biostatistics Lab at Johnson Space Center on the Human Health and Performance Contract. I joined the Statistics Department at Colorado State University as an Assistant Professor in Fall 2020. My research interests include Bayesian methodology, nonparametric Bayes, variable selection, joint modeling, and compositional data analysis with applications in cancer prevention, mobile health, microbiome research, and ecology. Here's a link to my personal webpage: https://mkoslovsky.github.io
2024-01-18 - Bruce Schaalje - Randomness and Providence in an Inside-Outside Career
Bio:
Born and raised in Alberta, Canada. Served a mission in Finland. Attended Ricks College, BYU, U of Washington, and North Carolina State U. Degrees in Mathematics, Zoology, and Statistics. Worked for Agriculture Canada (12 years), BYU (24 years), and Bayer Crop Science (8 years). Married for 48 years with 6 children and 17 amazing grandchildren.
Abstract:
I have studied or worked in statistics for half a century -- in 3 universities, 2 countries, and 3 employment sectors (government, academia, and the private sector). The field has gone through a lot of change, but surprisingly many ideas are timeless.
I haven’t conquered statistics, but I am excited about it and still trying to make contributions. I have had the opportunity to interact with many wonderful people. I hope to excite you about what is in store for you in your studies and future career path. -
2023-11-28 - Michael Christensen - Utilizing network structure to flexibly model areal data
Abstract:
Regionally indexed or areal data is common within many ecological, epidemiological, and economic applications, yet commonly used spatial models for this data type, such as the conditional autoregressive model, are often insufficiently flexible to represent the types of between-region dependence that may be present within such data sets. The adjacency structure of a partitioned spatial domain allows for a graphical representation of the space on which data is observed. Within this presentation we provide a brief discussion of graph metrics and relevant theory before showing how these tools may be used to more thoughtfully characterize dependence structures within a network. We then review research and methods from current projects focused on modeling the complex spatial and spatio-temporal dependence patterns present within species abundance data taken from the eBird database. -
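For reference, the conditional autoregressive (CAR) baseline mentioned above specifies, for region i with adjacency weights w_ij, the standard conditional form:

```latex
\[
y_i \mid y_{-i} \sim \mathrm{N}\!\Big( \rho \sum_{j} \frac{w_{ij}}{w_{i+}}\, y_j, \ \frac{\tau^2}{w_{i+}} \Big),
\qquad w_{i+} = \sum_{j} w_{ij},
\]
% so dependence is driven entirely by the adjacency (graph) structure of the regions.
```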
2023-11-16 - Godwin Osabutey - From Ising Block Model to Bayesian Inference: Limit Theorems and Parametric Analysis
Bio:
Godwin Osabutey is a doctoral student at the University of Bologna, Italy, specializing in Mathematics.
Abstract:
This presentation delves into the Ising model applied to a network, where vertices are partitioned into r-blocks, each representing a fixed proportion of the total vertices. Interactions among vertex variables are governed by their respective block memberships. The discussion unfolds in two segments. The first part focuses on examining the asymptotic properties of the random vector of block sample means, establishing sufficient conditions for the validity of the Central Limit Theorem (CLT). In the second part, Bayesian statistics are employed to conduct a parametric analysis of the model. This involves the use of an advanced MCMC sampling method, introducing a novel approach to address an inverse Ising problem, leveraging prior knowledge of the data. The validity of the CLT provides a theoretical underpinning for the parametric analysis, as the sample mean of the data tends to approximate a normal distribution for a sufficiently large sample size.
2023-11-07 - David Arthur - On Bayesian Methods for Small Sample Psychometrics
Abstract:
Psychometrics, a crucial branch of psychology, plays a pivotal role in quantifying mental traits, abilities, and processes. One powerful tool in this domain is the Cognitive Diagnostic Model (CDM), gaining popularity for its ability to offer precise, personalized feedback on skill mastery. However, despite their potential, CDMs face underutilization in everyday classrooms, partly due to their demanding sample size prerequisites. This talk focuses on two critical aspects of cognitive diagnosis that are adversely affected by small sample sizes: parameter estimation and item-level model selection. To tackle these challenges, innovative Bayesian methodologies are introduced. For the first problem, data-driven shrinkage priors, also known as catalytic priors, are used to enhance the accuracy of parameter estimation. For the second problem, Reversible-Jump Markov Chain Monte Carlo (RJ-MCMC) is used to increase the chances of selecting the most appropriate CDM for each assessment item. Finally, the presentation concludes by outlining promising avenues for future research within the realm of CDMs, contributing to the ongoing progress in educational assessment practices.
2023-10-26 - Dr. Huiyan Sang - Graph Random Partition Models for Clustering, Classification, and Regression of Spatial and Network Data
Abstract:
This talk introduces random partition models constrained by graph contiguity. We focus on the random spanning tree and graph product partition model constructions. These models can be flexibly embedded into various Bayesian hierarchical frameworks as Bayesian priors for modeling clustered latent variables on graphs. We illustrate the utility of such graph random partition prior models in spatial clustering problems and classification/regression problems using non-axis parallel graph-split-based ensemble Bayesian additive decision trees. If time allows, we will also discuss strategies to extend the random partition model from a finite graph to a spatially soft partitioned process.
Bio:
Huiyan Sang is a professor and the director of the undergraduate program in statistics at Texas A&M University. She joined Texas A&M in 2008 as an assistant professor after earning her Ph.D. in Statistics from Duke University. Her research interests include the development of theory, methodology, and computation for spatial statistics, graph and network data analysis, Bayesian nonparametrics, machine learning methods, computational statistics, high-dimensional data analysis, and extreme values. Her interdisciplinary research work spans applications of statistics in environmental sciences, geosciences, urban and traffic planning, economics and business, biomedical research, chemometrics, and electrical/manufacturing/petroleum engineering. -
2023-10-12 - Jeffrey S. Morris - Quantile Functional Regression for Distributional Regression of Biomedical Imaging Data
Co-authors: Quy Cao, Hojin Yang, Veera Baladandayuthapani, Benny Renn, Elizabeth Sweeney
Abstract:
In many areas of science, technological advances have led to devices that produce an enormous number of measurements per subject, including biomedical imaging data. Frequently, researchers deal with these data by extracting summary statistics (e.g. mean or variance) and then modeling those, but this approach can miss key insights when the summaries do not capture all of the relevant information in the raw data. One of the key challenges in modern statistics is to devise methods that can extract information from these big data while avoiding reductionist assumptions. In this talk, we will discuss methods for modeling the entire distribution of the measurements observed for each subject and relating properties of the distribution to covariates, with possible smooth nonlinear covariate and longitudinally varying effects. We apply this method to two biomedical imaging applications: one computing how the distribution of pixel intensities within a glioblastoma region relates to various biological and clinical factors, and the second using quantitative susceptibility mapping measuring inflammatory processes in brain imaging from multiple sclerosis patients. This general approach has many important applications, including many biomedical imaging applications, as well as wearable device data from accelerometers, blood pressure, and blood sugar monitors, and other types of high-frequency data streams.
2023-10-05 - Alex Petersen - Graphical Modeling for Multivariate Functional Data
Abstract:
This talk will have 3 parts. In Part I, I will describe the topic of Gaussian Graphical Models for classical multivariate data. From a statistical perspective, a particularly nice feature of GGMs is that they are characterized by the (inverse) covariance matrix, so that interpretable estimation can be done using penalized likelihood estimators of the precision matrix. In Part II, I will outline how to generalize GGMs to multivariate functional data, where each node in the graph corresponds to a random function, with neuroimaging data being a key motivating example. Some existing estimators will be discussed. In Part III, I will describe two projects I will be investigating related to i) leveraging separability for functional GGM estimation and ii) extending the so-called CLIME GGM estimator to the functional case.
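The penalized-likelihood estimator referenced in Part I is the graphical lasso, which in its standard form solves:

```latex
\[
\hat{\Omega} = \arg\max_{\Omega \succ 0} \ \log\det\Omega - \mathrm{tr}(S\,\Omega) - \lambda \|\Omega\|_{1},
\]
% where S is the sample covariance and zeros in \hat{\Omega} encode conditional independence.
```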
2023-09-28 - Dr. Alex Franks - On Sensitivity to Unobserved Confounding in Studies with Factor-structured Outcomes
Bio:
Alex Franks is an Associate Professor in the Department of Statistics and Applied Probability at the University of California, Santa Barbara. His research interests include covariance estimation, sensitivity analysis and causal inference, missing data and measurement error, high throughput applications in biology (“omics”), Bayesian statistics and sports.
Abstract:
In this work, we propose an approach for assessing sensitivity to unobserved confounding in studies with multiple outcomes. We demonstrate how prior knowledge unique to the multi-outcome setting can be leveraged to strengthen causal conclusions beyond what can be achieved from analyzing individual outcomes in isolation. We argue that it is often reasonable to make a shared confounding assumption, under which residual dependence amongst outcomes can be used to simplify and sharpen sensitivity analyses. We focus on a class of factor models for which we can bound the causal effects for all outcomes conditional on a single sensitivity parameter that represents the fraction of treatment variance explained by unobserved confounders. We characterize how causal ignorance regions shrink under additional prior assumptions about the presence of null control outcomes, and provide new approaches for quantifying the robustness of causal effect estimates. Finally, we illustrate our sensitivity analysis workflow in practice, in an analysis of both simulated data and a case study with data from the National Health and Nutrition Examination Survey (NHANES).
2023-09-21 - Kimberly F. Sellers, Ph.D. - Dispersed Methods for Handling Dispersed Count Data
Bio:
Kimberly F. Sellers, Ph.D. is the Head of the Statistics Department at North Carolina State University in Raleigh, NC. She further serves as a Principal Researcher with the Center for Statistical Research and Methodology Division of the U.S. Census Bureau. Her primary research interests and expertise center on statistical methods for count data that contain data dispersion with methodological interests in distribution theory, regression analysis, multivariate analysis, stochastic processes, and time series analysis. She recently authored the book, The Conway-Maxwell-Poisson Distribution (Cambridge University Press, 2023), which is the first comprehensive reference on the distribution and the flexible statistical methods derived using it to analyze dispersed count data. Sellers is an Elected Member of the International Statistical Institute, and a Fellow in both the American Statistical Association (ASA) and the Association for Women in Mathematics in recognition of her research, and active contributions to diversifying the fields of mathematical and statistical sciences with respect to gender and race/ethnicity.
A Washington, DC-area native, Sellers completed her BS and MA degrees in Mathematics at the University of Maryland College Park, and obtained her PhD in Mathematical Statistics at The George Washington University. She held previous faculty positions at Carnegie Mellon University, the University of Pennsylvania School of Medicine, and Georgetown University.
Abstract:
While the Poisson distribution is a classical statistical model for count data, it hinges on the constraining equi-dispersion property (i.e., that the mean and variance are equal). This assumption, however, does not usually hold for real count data; over-dispersion (i.e., when the variance is greater than the mean) is a more common phenomenon for count data, although data under-dispersion has also been prevalent in various settings. It would be more convenient to work with a distribution that can effectively model data (over- or under-) dispersion because it can offer more flexibility (and, thus, more appropriate inference) in the statistical methodology. This talk introduces the Conway-Maxwell-Poisson distribution along with several associated statistical methods motivated by this model to better analyze count data under various scenarios (e.g., distributional theory, generalized linear modeling, control chart theory, and count processes). As time permits, this talk will likewise acquaint the audience with available associated tools for statistical computing.
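For reference, the Conway-Maxwell-Poisson pmf adds a dispersion parameter ν to the Poisson (standard form):

```latex
\[
P(Y = y) = \frac{\lambda^{y}}{(y!)^{\nu}\, Z(\lambda, \nu)}, \qquad
Z(\lambda, \nu) = \sum_{s=0}^{\infty} \frac{\lambda^{s}}{(s!)^{\nu}}, \qquad y = 0, 1, 2, \ldots,
\]
% with \nu = 1 recovering the Poisson, \nu < 1 allowing over-dispersion, and \nu > 1 under-dispersion.
```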
2023-09-14 - Dr. Abhirup Datta - On Combining Machine Learning With Gaussian Processes For Geospatial Data
Bio:
Dr. Datta is an Associate Professor in the Department of Biostatistics at Johns Hopkins University. He completed his PhD in Biostatistics from the University of Minnesota. Dr. Datta’s research focuses on developing spatial models for geographically indexed data. His work on Nearest Neighbor Gaussian Processes (NNGP) has become one of the most widely used methods for scalable analysis of massive geospatial data. His recent work focuses on developing theory and methodology for combining machine learning algorithms with traditional spatial modeling, and application of the methodology to air pollution and infectious disease modeling. He also works on developing Bayesian hierarchical models for multi-source data with applications in global health. His research as Principal Investigator is funded by grants from the National Science Foundation (NSF), National Institute of Environmental Health Sciences (NIEHS) and the Bill and Melinda Gates Foundation. He has received the Early Career Investigator award from the American Statistical Association Section of Environmental Health, the Young Statistical Scientist Award (YSSA) from the International Indian Statistical Association (IISA), and the Abdel El-Shaarawi Early Investigator's Award from The International Environmetrics Society (TIES).
Abstract:
Spatial generalized linear mixed-models, consisting of a linear covariate effect and a Gaussian Process (GP) distributed spatial random effect, are widely used for analyses of geospatial data. We consider the setting where the covariate effect is non-linear and propose modeling it using a flexible machine learning algorithm like random forests or deep neural networks. We propose well-principled extensions of these methods, for estimating non-linear covariate effects in spatial mixed models where the spatial correlation is still modeled using GP. The basic principle is guided by how ordinary least squares extends to generalized least squares for linear models to account for dependence. We demonstrate how the same extension can be done for these machine learning approaches like random forests and neural networks. We provide extensive theoretical and empirical support for the methods and show how they fare better than naïve or brute-force approaches to use machine learning algorithms for spatially correlated data. We demonstrate the RandomForestsGLS R-package that implements this extension for random forests. -
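The guiding analogy in the abstract, stated in symbols: with correlated errors ε ∼ N(0, Σ), ordinary least squares is replaced by generalized least squares, and it is this decorrelation idea that gets carried over to random forests and neural networks.

```latex
\[
\hat{\beta}_{\mathrm{OLS}} = (X^{\top}X)^{-1}X^{\top}y,
\qquad
\hat{\beta}_{\mathrm{GLS}} = (X^{\top}\Sigma^{-1}X)^{-1}X^{\top}\Sigma^{-1}y .
\]
```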
2023-04-06 - Martina Narcisi - On the Effect of Confounding in Linear Regression Models: An Approach Based on the Theory of Quadratic Forms
Abstract:
In the last two decades, prominent research has been devoted to the issue of spatial confounding in linear regression models. In general, confounding arises when a relationship between an observed process, the covariate, and an outcome process, the response, is influenced by the confounder, an unmeasured process associated with both. Consequently, the estimators for the regression coefficients of the measured covariates might be severely biased, less efficient, and characterized by misleading interpretations. Under a simple setup with one covariate, this work aims to extend the spatial confounding framework to general structured settings and to understand the behaviour of confounding as a function of the data generating process structure parameters in several scenarios focusing on the joint covariate-confounder structure. The focus is on the sampling properties of the target parameter estimators. They are derived as ratios of dependent quadratic forms in Gaussian random variables in order to provide an analytic expression of the marginal sampling properties of the estimator by means of Carlson’s R function. We propose a representative scalar quantity for the extent of confounding, a first-order Laplace approximation that serves as a proxy for the estimator bias.
For the application study, we work under several frameworks that consider spatial geostatistical and areal data and temporal data with specific assumptions regarding the covariance and cross-covariance functions used to generate the processes involved. The results allow us to claim that the variability of the confounder-covariate interaction and of the covariate plays the most relevant role in determining the principal marker of the magnitude of confounding. -
2023-03-30 - Tamara Broderick - An Automatic Finite-Sample Robustness Check: Can Dropping a Little Data Change Conclusions?
Abstract:
Commonly researchers will run a statistical analysis on a data sample, with the goal of applying any conclusions to a new population. For instance, if economists conclude microcredit is effective at alleviating poverty based on observed data, policymakers might decide to distribute microcredit in other locations or future years. Typically, the original data is not a perfect random sample from the population where policy is applied -- but researchers might feel comfortable generalizing anyway so long as deviations from random sampling are small, and the corresponding impact on conclusions is small as well. Conversely, researchers might worry if a very small proportion of the data sample was instrumental to the original conclusion. So we propose a method to assess the sensitivity of statistical conclusions to the removal of a very small fraction of the data set. Manually checking all small data subsets is computationally infeasible, so we propose an approximation based on the classical influence function. Our method is automatically computable for common estimators. We provide finite-sample error bounds on approximation performance and a low-cost exact lower bound on sensitivity. We find that sensitivity is driven by a signal-to-noise ratio in the inference problem, does not disappear asymptotically, and is not decided by misspecification. While some empirical applications are robust, conclusions of several influential economics papers can be changed by removing (much) less than 1% of the data. -
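The flavor of the influence-function approximation described above is sketched below; sign and scaling conventions vary, so this is only a schematic.

```latex
% Approximate change in the estimate from dropping the index set S, using empirical influence scores.
\[
\hat{\theta}_{-S} - \hat{\theta} \approx -\sum_{i \in S} \mathrm{IF}_i ,
\]
% so the most influential subset of a given size can be found by ranking the \mathrm{IF}_i,
% avoiding an infeasible search over all small subsets.
```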
2023-03-23 - Joshua Christensen/Elizabeth Petersen/Logan Clarke - Graduate Student Presentations
Joshua Christensen: Posterior Sampling for Non-Gaussian Dynamic Linear Models
Abstract:
Dynamic linear models (DLMs) are a flexible class of time series models based on the concept of time-varying parameters (TVPs). The forward-filtering backward-sampling (FFBS) algorithm provides an efficient framework for sampling the joint posterior of TVPs in a conditionally Gaussian DLM. We propose a novel sampling method that leverages the efficiency of the FFBS algorithm while relaxing the requirement of conditional Gaussianity. We rely on recent advances in slice sampling which eliminate the need for tuning parameters with transformations to approximate (pseudo) targets. The FFBS conversion of the joint distribution into a chain of univariate conditional distributions facilitates this transformation. We use this new slice sampling technique with an approximating conditionally Gaussian DLM as a pseudo target to sample from the joint posterior distribution of TVPs in non-Gaussian DLMs. We demonstrate the effectiveness of our technique by modeling pollution in French rivers with a dynamic harmonic regression (DHR) model.
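For readers unfamiliar with the notation, the conditionally Gaussian DLM underlying FFBS has the standard form:

```latex
\[
y_t = F_t^{\top} \theta_t + v_t, \qquad v_t \sim \mathrm{N}(0, V_t), \qquad
\theta_t = G_t\, \theta_{t-1} + w_t, \qquad w_t \sim \mathrm{N}(0, W_t),
\]
% FFBS filters forward in t and then samples \theta_T, \theta_{T-1}, \ldots, \theta_1 backward;
% the work above targets the case where the observation error is non-Gaussian.
```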
Elizabeth Petersen: Grade of Membership Scores in Estimating Heterogeneous Treatment Effects for Instrumental Variable Scenarios
Abstract:
In order to make causal conclusions in observational studies, researchers must look beyond standard modeling methodology and instead use tools such as instrumental variables. Valid instruments provide the ability to infer causality without the requirement of balancing on all relevant covariates. The use of covariates can provide information on heterogeneous treatment effects, which we approach by clustering the covariates. Traditional clustering requires points to exist within one group. This ignores the potential that observations may share characteristics with other groups. Grade of Membership (GOM) models allow observations to share membership with multiple groups or pure types. We introduce the use of GOM scores in instrumental variable analysis to consider heterogeneous treatment effects. Our analysis applies GOM scores in understanding the effects of Medicaid enrollment when a valid instrument, lottery assignment, is present.
Logan Clarke: Multinomial Classification of Costa Rican Household Poverty Level
Abstract:
In this project, we use a variety of statistical tools to predict poverty levels in Costa Rican households. Welfare fund distribution in Costa Rica is determined by poverty level, but limited documentation of income and expenses makes it difficult for welfare workers to determine how to allocate aid. In order to accurately classify households into their corresponding level of poverty, we will use multinomial logistic regression, multiple linear regression, random forest, neural networks, gradient boosted models, and k-nearest neighbor models to predict poverty level. By comparing model performance, we seek to determine which type of model is most appropriate for multinomial classification situations. Finally, using the model performance rankings from this data set, we explore how similar results may apply to other classification data sets.
Toggle Item2023-03-16 - Felipe Barrientos - Differentially Private Methods for Managing Model Uncertainty in Linear Regression ModelsAbstract:
Many data producers are concerned about protecting individuals' private information while still allowing modelers to draw inferences from confidential data sets. The framework of differential privacy enables statistical analyses while controlling and quantifying the potential leakage of private information. In this talk, we present differentially private methods for hypothesis testing, model averaging, and model selection for normal linear models. We consider both Bayesian and non-Bayesian methods for the tasks. The procedures are asymptotically consistent and straightforward to implement with existing software. We focus on practical issues such as quantifying the uncertainty introduced by the privacy-ensuring mechanisms. We evaluate the empirical performance of the approaches using simulated and real data. This is joint work with Víctor Peña at Universitat Politècnica de Catalunya, Spain. -
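To give a flavor of how a privacy-ensuring mechanism works, the sketch below implements the standard Laplace mechanism for releasing a bounded sample mean in Python; this is a generic textbook construction, not the hypothesis-testing or model-selection procedures of the talk.

import numpy as np

def private_mean(x, epsilon, lower=0.0, upper=1.0, rng=None):
    """Release the mean of values clamped to [lower, upper] under epsilon-DP
    using the Laplace mechanism."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.clip(np.asarray(x, dtype=float), lower, upper)
    sensitivity = (upper - lower) / len(x)     # effect of changing one record
    noise = rng.laplace(scale=sensitivity / epsilon)
    return x.mean() + noise

x = np.random.beta(2, 5, size=500)             # toy confidential data
print(private_mean(x, epsilon=0.5))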
Toggle Item2023-03-09 - Erin Schliep - Correcting for Informative Sampling in Spatial Covariance Estimation and KrigingAbstract:
Informative sampling designs are broadly used across many application areas of statistical modeling and can have a large impact on model inference and prediction. In spatial modeling, informative sampling can result in biased spatial covariance parameter estimation, which in turn can bias spatial prediction. To mitigate these biases, we develop a weighted composite likelihood approach to improve spatial covariance parameter estimation under informative sampling designs. Then, given these parameter estimates, we propose two approaches to quantify the effects of the sampling design on the variance estimates in spatial prediction in order to make informed decisions for population-based inference. -
Toggle Item2023-03-02 - Christophe Giraud-Carrier - Meta-learning and Dataset EmbeddingNo abstract listed.
-
Toggle Item2023-02-23 - Evan Johnson - Tools and Methods for Profiling Host-Microbe Interactions in Pulmonary DiseasesAbstract:
Big data technologies are playing an increasingly influential role in multiple sectors within industry and research. This is providing significant opportunities for individuals with expertise in the acquisition, management, and analysis of data, so-called data scientists. In the biomedical arena in particular, data science and informatics have revolutionized discoveries in molecular research, public health, and clinical care. One unique application of big data technology is the use of DNA- and RNA-sequencing to evaluate the interplay between the microbiome and host immune and inflammatory pathways, particularly in relation to human pulmonary diseases. Here I will discuss several relevant computational and statistical innovations in the context of host-microbe profiling in asthma, lung cancer, and tuberculosis research. -
Toggle Item2023-02-16 - Dave Higdon - Computer Model Calibration and Agent Based ModelsAbstract:
Agent-based models (ABMs) use rules at the individual (agent) level to simulate a social, ecological, or socio-technical system, producing structured behavior when viewed at an aggregated level. ABMs allow consideration of interactions, behaviors, outcomes, and interventions at a very fine resolution, giving them a connection to reality that is often missing from more aggregated modeling approaches. Of course, this comes with a price. For example, ABMs typically come with high computational cost, random realizations, and difficulty in capturing more aggregated properties of the real system. This talk will use a toy example to highlight the key concepts of Bayesian model calibration for constraining ABMs with observations; it will also touch on Approximate Bayesian Computation (ABC). I'll then go on to show how such methodology can be used in more involved applications, using an ABM developed during the 2014 Ebola epidemic. -
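Since the talk touches on Approximate Bayesian Computation, here is a minimal rejection-ABC sketch in Python; the simulator and summary statistic are toy stand-ins for an agent-based model, and all names are hypothetical.

import numpy as np

def abc_rejection(observed_summary, simulate, prior_draw, n_draws=20000, tol=0.5):
    """Keep prior draws whose simulated summary lies within tol of the observed one."""
    accepted = []
    for _ in range(n_draws):
        theta = prior_draw()
        if abs(simulate(theta) - observed_summary) < tol:
            accepted.append(theta)
    return np.array(accepted)

# Toy stand-in for an ABM: the "model" outputs a final outbreak size
rng = np.random.default_rng(1)
simulate = lambda beta: rng.poisson(50 * beta)      # crude simulator (hypothetical)
prior_draw = lambda: rng.uniform(0.0, 2.0)          # prior on a transmission parameter
posterior_draws = abc_rejection(observed_summary=60, simulate=simulate,
                                prior_draw=prior_draw, tol=5)
print(posterior_draws.mean(), len(posterior_draws))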
Toggle Item2023-01-26 - Alejandro Jara - A Class of Random Bernstein Copula ModelsAbstract:
Copula models provide great flexibility in modeling relationships between random variables. For inference to take full advantage of this flexibility, one needs appropriately rich families of copula functions, capable of approximating any copula. One such family is the family of Bernstein copulas, which are a variety of multivariate Bernstein polynomials and which have been shown to be dense in the space of continuous copula functions. Bernstein copulas have been used for inference before, but only using likelihood-free approximation methods. We observe a fact about the geometry of the parameter space of Bernstein copulas, and note that it is closely related to a different class of copula known as grid uniform copulas. Based on this relationship, we propose a Bayesian model based on Bernstein copulas and an automatic MCMC algorithm capable of performing full posterior inference on the copula and marginal distributions. Joint work with Nicolas Kuschinski. N. Kuschinski’s research is supported by Fondecyt grant 3210553. A. Jara’s research is supported by Fondecyt grant 1220907. -
Toggle Item2023-01-19 - Richard Hahn - Feature Selection for Causal InferenceAbstract:
Feature selection for the estimation of causal effects is a challenging and subtle problem about which there is a wealth of --- sometimes contradictory! --- folk wisdom. In this talk I will analyze the problem in the simplified setting of discrete control variables, where it is possible to provide rigorous statements of various folk-theorems, allowing us to better understand their practical implications. My presentation will use elements of three distinct methodological frameworks for causal inference: potential outcomes, causal diagrams, and structural equations. Each framework emphasizes unique aspects of the feature selection problem. -
Toggle Item2022-12-01 - Daniel Bauer - Calculation of the Enterprise Capital via Least-squares Monte Carlo – Regress Now or Later?Abstract:
There has been substantial interest in applied research and practice on the estimation of risk capital within enterprise risk models via Monte Carlo procedures. The difficulty arises from characterizing the distribution of the company’s available capital, which takes the form of a conditional expected value of the firm’s future cash flows given the market and company state at the risk horizon.
One prevalent approach, referred to as regress-now, projects the company’s realized cash flows and regresses the resulting discounted values against a set of basis functions of the Markov states at the risk horizon. Another approach seeks to approximate the cash flows based on functions of realizations of a class of tractable processes, and in a second step calculates the conditional expected value of the approximating functional combination at the risk horizon. This approach is referred to as regress-later. Different authors have documented advantages and disadvantages to both approaches in specific situations.
This paper makes two contributions. The first is theoretical. We show that the left and right singular functions of the valuation operator mapping cash flows to capital present robust approximating functions for the regress-now and regress-later approach, respectively. We show superior performance of the approaches when relying on these functions. The second contribution is applied. We characterize situations in which either of the approaches results in a superior performance. In essence, the regress-later approach performs better when underlying theoretical conditions are met, whereas the regress-now approach is the more robust approach. -
Toggle Item2022-11-17 - Jake Rhodes - Random Forest Geometry-Preserving Proximities with Applications in Manifold LearningAbstract:
Many machine learning algorithms use distances or similarities between observations to make predictions, cluster similar data, visualize patterns, or generally explore the data. Most distance or similarity measures do not incorporate known data labels, or they generate metrics that exaggerate separation between data points of different classes. Instead, we leverage a popular algorithm used for making data-driven predictions, known as random forests, to naturally incorporate data labels (either continuous or categorical) into similarity measures known as random forest proximities. In this talk, I will discuss the development of a new proximity definition that can be used to reconstruct the random forest’s predictions. We call these Random Forest-Geometry- and Accuracy-Preserving proximities, or RF-GAP. I will discuss improvements in common proximity-based applications, including imputation, outlier detection, and data visualization. I will also introduce a random forest proximity-based manifold learning approach that can be used to generate 2- or 3-dimensional data representations as a tool to visually explore the data. I show quantitatively and qualitatively that this method surpasses other existing methods for this task. -
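A classical (unweighted) random forest proximity can be computed by counting shared terminal nodes; the Python sketch below uses scikit-learn's apply() to do exactly that. Note this is the traditional proximity definition, not the RF-GAP construction discussed in the talk.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

leaves = forest.apply(X)                 # (n_samples, n_trees) leaf indices
n, n_trees = leaves.shape
proximity = np.zeros((n, n))
for t in range(n_trees):
    same_leaf = leaves[:, t][:, None] == leaves[:, t][None, :]
    proximity += same_leaf
proximity /= n_trees                     # fraction of trees in which a pair shares a leaf

# proximity[i, j] near 1 means the forest treats observations i and j as very similar
print(proximity[0, :5].round(2))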
Toggle Item2022-11-10 - John Kidd - Expectation-Maximization Algorithms and Mediation AnalysisAbstract:
Determining the relationship between two variables is the aim of many statistical procedures. When there are one or more additional variables between the independent variable and the response, mediation analysis is often employed. However, these middle variables, called mediators, may be missing from study data. This scenario is particularly prevalent in genetic studies where the relationship between a specific genetic variant and a phenotype, such as a disease or ailment, is of interest. Omic variables such as RNA and protein expression mediate these relationships but can be difficult to measure. In this talk, I will discuss a new method that uses likelihood methods to find unbiased effect estimates with improved efficiency in the presence of multiple mediators when mediator information may be missing. -
Toggle Item2022-11-03 - Christopher Challis - Consumer Modeling at Wells FargoAbstract:
Wells Fargo employs hundreds of statisticians and data scientists. Chris Challis will give an overview of focus areas across the bank, then dive into details of consumer-facing statistical models, including marketing, personalization, pricing, forecasting, and optimization. There will also be time for questions and discussion on the daily work of members of his team, and the types of skills that he has seen lead to success at Wells Fargo and other companies. -
Toggle Item2022-10-27 - Jonathan Chipman - Experimental Design with Sequential Rematched RandomizationAbstract:
This two-part talk overviews (i) the University of Utah PhD program in Population Health Sciences in the Division of Biostatistics and (ii) novel developments in balancing covariates between treatment arms in randomized trials. Randomization plays a foundational role in causal hypothesis testing. It eliminates systematic confounding and provides a ‘reasoned basis for inference’ through randomization-based inference (RBI). However, chance imbalances in key baseline covariates can still occur and complicate interpretation of the estimated causal effect (such was the case in a trial reviewed by the Food and Drug Administration this October). Sequential Matched Randomization (SMR) reduces this risk by randomizing treatment within participants who match closely on key baseline covariates. We refine SMR to use a dynamic matching criterion, allow batch enrollment, and allow matches to break and rematch throughout enrollment. Through extensive simulation using real trial data, we compare SMR with and without our extensions in terms of average covariate balance and estimator efficiency (via RBI power and variance of estimators). Our extensions further decreased the risk of covariate imbalances and increased efficiency 1.3-fold relative to randomization without covariate adjustment. We’ll show a case where adjusting for covariates in the design has as much power as adjusting for covariates in a regression model, though with fewer assumptions. -
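To make the matching idea concrete, here is a simplified Python sketch, not the SMR algorithm itself: participants are greedily paired by Mahalanobis distance on baseline covariates, and treatment is then randomized within each pair. All data and names are hypothetical.

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))                  # baseline covariates for 40 participants
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
dist = cdist(X, X, metric="mahalanobis", VI=cov_inv)
np.fill_diagonal(dist, np.inf)

assigned = np.full(len(X), -1)
unmatched = set(range(len(X)))
while len(unmatched) > 1:
    i = min(unmatched)
    j = min(unmatched - {i}, key=lambda k: dist[i, k])   # greedy nearest match
    assigned[i], assigned[j] = rng.permutation([0, 1])   # randomize within the pair
    unmatched -= {i, j}
print(assigned)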
Toggle Item2022-10-20 - Abel Rodriguez - Analyzing Cognitive Social StructuresAbstract:
Cognitive social structures (CSSs) are a type of social network data that appear in psychology, sociology, and industrial organization applications. CSSs involve a collection of networks, each one of them reflecting the perceptions of an individual about the interactions among all members of the community. This makes cognitive social structures richer than other forms of social networks that only reflect the perspective of a single, “objective” observer.
Traditionally, cognitive social structures have been analyzed either by collapsing the multiple networks into some sort of "consensus" network or by analyzing different "slices" independently. In this talk we discuss two classes of models used to generate insights from CSS data. In the first part of the talk, we use a hierarchical embedding of the networks into a continuous latent space along with carefully constructed zero-inflated priors to explore the perceptual agreement between individuals and the group consensus. In the second part of the talk we discuss a novel class of hierarchical blockmodels that uses Chinese Restaurant Process priors and fragmentation-coagulation processes to identify motifs that are conserved across all observers. This is joint work with Juan Sosa, currently at Universidad Nacional de Colombia. -
Toggle Item2022-10-13 - Rebecca Nichols - Go Forth to Serve: Statistics Service Opportunities to Enhance your Career and Make a Difference in the WorldAbstract:
Interested in making a difference in the world through statistics? Finding joy in your profession and fulfillment through service? This presentation will discuss service opportunities and ideas for how you can be involved while enhancing your career and making a difference in the profession, your community, and even the world. What are the resources available to you? How can service enhance your career path? How can this fit into a balanced life? These thoughts and more for students and faculty. -
Toggle Item2022-09-29 - Volodymyr Melnykov - Finite Mixture Modeling in StylometryAbstract:
Stylometry is the area of linguistics that deals with the quantification of language style. Applications of stylometry can be found in plagiarism detection, authorship identification, speech style recognition, association analysis of text samples, as well as the prediction of the chronological order of several written fragments. Among the most typical statistical techniques used in stylometric analysis are goodness-of-fit tests, analysis of contingency tables, paired t-tests, and other traditional but sometimes overly naive and simplistic procedures. The main focus of the talk will be on detecting the chronological order of several text fragments written by Plato. -
Toggle Item2022-09-22 - Brad Barney - Clinical ResearchAbstract:
In our search to determine best practices for clinical care, numerous resources are devoted to assessing the safety and efficacy of medical interventions. In this presentation, I will discuss pragmatic aspects of operating and analyzing clinical trials and observational studies to identify best practices. I will give an overview of various study designs and their strengths and limitations. I will discuss the range of statistical complexity commonly encountered in clinical research. I will also review some of the studies for which I have provided assistance, as well as findings if they are already publicly available. Although some of the presented analyses were conducted with financial support from the National Institutes of Health, the views do not necessarily represent those of the NIH or any other government entity. -
Toggle Item2022-09-15 - Hans Mueller - Statistics for Random ObjectsAbstract:
Random objects, i.e., random variables that take values in a separable metric space, pose many challenges for statistical analysis, since vector operations are not available in general metric spaces. Examples include random variables that take values in the space of distributions, covariance matrices and covariance surfaces, networks and other spaces. The increasing prevalence of samples of random objects has stimulated the development of metric statistics, a collection of statistical tools to characterize, infer and relate samples of such random objects. For the geometric exploration of random objects a useful tool is the depth profile; the depth profile for any given object is the distribution of distances to all other objects. These distributions can be subjected to further statistical analysis, leading to notions of transport ranks and centrality. Theory concerns the convergence of the empirical estimates to the population targets. For random objects in geodesic spaces, a recent development is transport regression, based on a transport algebra, where predictors and responses are transports. Parts of this presentation are based on joint work with Yaqing Chen, Paromita Dubey, Alexander Petersen and Changbo Zhu. -
Toggle Item2022-09-08 - Joey Stanley - Modeling Change in American English AccentsAbstract:
Gone are the days when a linguist could analyze spoken language using their ears alone. Contemporary research on vowel pronunciation relies heavily on statistical modeling of acoustic measurements extracted from recordings of speech. In this presentation, I showcase some aspects of a typical quantitative analysis in sociophonetic research, incorporating the latest developments in linguistic methodology. Specifically, I will show how linear mixed-effects models and generalized additive mixed-effects models can be used to analyze and model vowel pronunciation. In addition to using data from my own voice, I will analyze data representing Southern American English and Utah English and will show how they have changed across 100 years. -
Toggle Item2022-03-17 - Paromita Dubey - Functional Models for Time Varying Random ObjectsAbstract
In recent years, samples of time-varying object data such as time-varying networks that are not in a vector space have been increasingly collected. These data can be viewed as elements of a general metric space that lacks local or global linear structure and therefore common approaches that have been used with great success for the analysis of functional data, such as functional principal component analysis, cannot be applied directly.
In this talk, I will propose some recent advances along this direction. First, I will discuss ways to obtain dominant modes of variations in time varying object data. I will describe metric covariance, a new association measure for paired object data lying in a metric space (Ω, d) that we use to define a metric auto-covariance function for a sample of random Ω-valued curves, where Ω will not have a vector space or manifold structure. The proposed metric auto-covariance function is non-negative definite when the squared metric d^2 is of negative type. The eigenfunctions of the linear operator with the metric auto-covariance function as the kernel can be used as building blocks for an object functional principal component analysis for Ω-valued functional data, including time-varying probability distributions, covariance matrices and time-dynamic networks. Then I will describe how to obtain analogues of functional principal components for time-varying objects by applying weighted Fréchet means which serve as projections of the random object trajectories in the directions of the eigenfunctions, leading to Ω-valued Fréchet integrals. This talk is based on joint work with Hans-Georg Müller. -
Toggle Item2022-03-03 - Matt Morgan/Nate Hawkins/Garrett Duncan - Graduate Students PresentationSpeaker 1: Matt Morgan
Title: Implementing Minibatch Processing In Spatial Analyses
Abstract: Working with spatial statistics can often be computationally intensive, and when performing a spatial analysis on large amounts of data, the computational burden grows rapidly. Fitting a model and cross-validating can take days or weeks even when utilizing more RAM and an increased number of CPUs. Given the increasing size of datasets in recent years and the computational complexity that comes with spatial statistics, much research has been done to reduce the computational burden of spatial analyses. One method of reducing computational complexity that has been utilized in other areas is minibatching. Minibatching is a process where a subset of the data is used to approximate a computation on the whole dataset without having to use the whole dataset. This study aims to incorporate minibatching into spatial analyses so as to reduce the amount of data that must be used at each step of model fitting. The hope is that minibatching will allow spatial analyses to be performed on large datasets with less computational strain.
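As a generic illustration of minibatching (outside the spatial setting of the project), the Python sketch below estimates regression coefficients by minibatch stochastic gradient descent, touching only a random subset of the data at each iteration; the data and step sizes are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
n, p = 100_000, 5
X = rng.normal(size=(n, p))
beta_true = np.arange(1, p + 1, dtype=float)
y = X @ beta_true + rng.normal(size=n)

beta = np.zeros(p)
batch_size, lr = 256, 0.01
for step in range(2000):
    idx = rng.integers(0, n, size=batch_size)            # random minibatch of rows
    grad = X[idx].T @ (X[idx] @ beta - y[idx]) / batch_size
    beta -= lr * grad
print(beta.round(2))                                     # approximately [1, 2, 3, 4, 5]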
Speaker 2: Nate Hawkins
Title: Real-Time Prediction Using Covariate-Informed Product Partition Models
Abstract: Covariate-informed product partition models excel at modeling data that are non-linear in nature. These models exploit available covariates by increasing the probability of co-clustering for two individuals with similar covariate values. In practice, however, prediction using these models can be computationally expensive and constrained to specific model parameters. We develop a covariate-based prediction algorithm that runs at compiled speed and allows for changes in model parameters to improve fit. We show the utility of this algorithm using data from the 2018 men’s world volleyball championship. First, we train a product partition model using team performance statistics from the round robin stage of the tournament to predict wins. We then use our algorithm to predict the win probability after every point of every match in the knockout stages.
Speaker 3: Garrett Duncan
Title: COVID-19 Spread Feature Engineering
Abstract: This research investigated potential methods for modeling the spread of COVID-19 in higher education. Basic demographic and student schedule information was used to engineer features such as the number of credits taken with students who tested positive, the median distance to the nearest positive case for students living near the university, the level of risk brought by in-person classes for each student, and other additional features. These features were calculated on a week-by-week or semester-by-semester basis. Different features were also aggregated at the individual class section and overall course level. A dashboard of these features was shown to university administration to help monitor virus spread. We consider these features and make recommendations for feature engineering for future models of virus spread in higher education communities. -
Toggle Item2022-02-10 - Nathan Bean - Bayesian Multi-Regional Clinical Trials Using Model AveragingAbstract:
Sponsors often rely on multi-regional clinical trials (MRCTs) to introduce new treatments more rapidly into the global market. Many commonly used statistical methods do not account for regional differences, and small regional sample sizes frequently result in lower estimation quality of region-specific treatment effects. Recent ICH E17 guidelines suggest the consideration of methods that allow for information borrowing across regions to improve estimation. In response to these guidelines, we develop novel methodology to estimate global and region-specific treatment effects from MRCTs with either a single continuous or time-to-event endpoint using Bayesian model averaging (BMA), and we discuss how posterior model probabilities can be used to assess the consistency of treatment effects across regions. We show through simulation studies that the proposed modeling approach results in lower MSE than traditional fixed effects models while preserving the rejection rate of the global treatment effect. We then apply the BMA approach to data from the LEADER trial, an MRCT designed to evaluate the cardiovascular safety of an anti-diabetic treatment. -
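The role of posterior model probabilities can be sketched with a BIC-based approximation to Bayesian model averaging; the Python example below (simulated data and a deliberately simple pair of candidate treatment-effect models, not the MRCT methodology of the talk) weights the two fits by their approximate posterior probabilities.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
region = rng.integers(0, 3, size=n)                      # hypothetical region labels
treat = rng.integers(0, 2, size=n)
y = 1.0 * treat + 0.3 * (region == 1) * treat + rng.normal(size=n)

# Model 1: common treatment effect; Model 2: region-specific treatment effects
X1 = sm.add_constant(np.column_stack([treat]))
X2 = sm.add_constant(np.column_stack([treat, treat * (region == 1), treat * (region == 2)]))
fits = [sm.OLS(y, X).fit() for X in (X1, X2)]

bics = np.array([f.bic for f in fits])
weights = np.exp(-0.5 * (bics - bics.min()))
weights /= weights.sum()                 # approximate posterior model probabilities
effects = np.array([f.params[1] for f in fits])   # treatment coefficient in each model
print(weights.round(3), np.dot(weights, effects).round(3))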
Toggle Item2022-01-27 - Carlos Carvalho - Searching for Dusty Corners: Understanding the Prediction of the Cross Section of ReturnsAbstract:
Bayesian nonparametric regression models will be presented in order to predict equity returns from various characteristics. We will focus on model modifications that incorporate economic information and time variability, and we will explore ways to develop interpretable summaries of otherwise black-box strategies. -
Toggle Item2022-01-20 - William Christensen - A Statistician's Perspective on Climate Change AssessmentAbstract:
In this talk, we consider statistical perspectives on various aspects of climate change assessment. We briefly discuss: (1) a review of major findings in the Sixth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC), (2) how uncertainty is discussed in IPCC assessment reports, (3) how the notion of uncertainty can be misunderstood within society and even the scientific community, and (4) how uncertainty can be more rigorously integrated in scientific (and economic) decision making. Several examples will illustrate how uncertainties can be integrated into an analysis of climate change data, including the modeling of ice accumulation in Antarctica, climate model ensemble data in North America and in High Mountain Asia, sea level rise off the US Atlantic Coast, and precipitation trends in Antarctica. -
Toggle Item2022-01-13 - Jacob Andros/Camilla Handley/Travis Andersen - Masters Students PresentationJacob Andros
Title: Search Algorithms and Loss Functions for Bayesian Feature Allocation Models
Abstract: Feature allocation models are a Bayesian means to flexibly borrow information among items where each item possesses zero or more features. These models are typically fit with Markov chain Monte Carlo, yet there has been relatively little work done on summarizing a feature allocation posterior distribution. We propose a new loss function between two feature allocations and suggest a search algorithm to minimize the expected loss and summarize a feature allocation distribution. This loss is easy to compute and allows for comparison of matrices with differing numbers of features. With this loss function and a set of feature allocation samples from MCMC, our search algorithm uses principles of the assignment problem to minimize the expected loss without being restricted to only the samples present in the Markov chain.
Camilla Handley
Title: A Simulation-Based Approach to Controlling False Discovery Rate and COVID-19 Hotspot Detection
Abstract: In this project, we aim to answer the question of whether in-class transmission is happening on campus in order to regulate restrictions and keep students and faculty safe. Using demographic information, we built a logistic regression model that produces an estimated probability of testing positive for each student. Using a simulation-based approach, we developed a method that identifies possible spots of in-class transmission on campus from these probabilities. We simulated semesters under the null hypothesis (that there is no in-class transmission) and compared them to the observed positivity rates to find a p-value for each group (section, course, or major). Given that our p-values are dependent and non-uniform, we developed a simulation-based method to control the false discovery rate in this setting.
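A stripped-down version of the simulation step might look like the following Python sketch (hypothetical probabilities and counts, not the campus data): a section's positive count is simulated many times under the no-transmission null, and the observed count is compared with the simulated distribution to obtain a p-value.

import numpy as np

rng = np.random.default_rng(0)
n_students, n_sims = 60, 10_000
p_student = rng.uniform(0.01, 0.10, size=n_students)    # modeled positivity probabilities
observed_positives = 9                                   # observed count in one section

# Simulate the section's positive count under the null of no in-class transmission
sim_counts = rng.binomial(1, p_student, size=(n_sims, n_students)).sum(axis=1)
p_value = (np.sum(sim_counts >= observed_positives) + 1) / (n_sims + 1)
print(p_value)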
Travis Andersen
Title: A Permutation-Based Confidence Distribution for Rare-Event Meta-Analysis
Abstract: Confidence distributions (CDs), which provide evidence across all levels of significance, are receiving increasing attention, especially in meta-analysis. Meta-analyses allow independent study results to be combined to produce one overall conclusion and are particularly useful in public health and medicine. For studies with binary outcomes that are rare, traditional meta-analysis methods often fail. Zabriskie et al. develop a permutation-based method to analyze such data when study treatment effects vary beyond what is expected by chance. We show how this method can be considered a CD and develop new metrics to assess a CD's relative performance. We compare this permutation-based CD with other CDs for rare-event meta-analyses. -
Toggle Item2021-12-09 - Andrew Whetten - Localized Mutual Information Monitoring of Pairwise Associations in Animal MovementAbstract
Advances in satellite imaging and GPS tracking devices have given rise to a new era of remote sensing and geospatial analysis. In environmental science and conservation ecology, the biotelemetric data recorded are often high-dimensional (spatially and/or temporally) and functional in nature, meaning that there is an underlying continuity to the biological process of interest. GPS tracking of animal movement is commonly characterized by irregular time-recording of animal position, and the movement relationships between animals are prone to sudden change. In this paper, we propose a measure of localized mutual information (LMI) to derive a correlation function for monitoring changes in the pairwise association between animal movement trajectories. The properties of the LMI measure are assessed analytically and by simulation under a variety of circumstances. Advantages and disadvantages of the LMI measure are assessed, and alternate measures of LMI are proposed to handle potential disadvantages. The proposed measure of LMI is shown to be an effective tool for detecting shifts in the correlation of animal movements and in seasonal/phasal correlation structure. -
Toggle Item2021-12-02 - Zach Baucom - State Space Models for Neuropsychological DataAbstract:
In the research area of cognitive decline, longitudinal analysis plays a critical role in understanding how and why a disease may progress. However, the predominant analysis techniques often fail to provide both (1) a simple effect interpretation and (2) proper accommodation of subject-level heterogeneity. This has led to the development and use of State Space Models (SSMs) in scenarios typical of traditional Linear Mixed Effect Models (LMEMs) with joint model applications. First, I develop an SSM along with estimation procedures for modeling cognitive decline. Estimation procedures include a traditional full-likelihood approach, a partitioned SSM, and a Bayesian SSM. These models are compared to LMEMs for longitudinal cognitive data using simulation and real data analyses. The SSM framework is then applied to understand the effect of the APOE e4 allele on cognitive trajectory. As many of the popular cognition studies offer a battery of tests to gain broader insight into different aspects of cognition (e.g., memory), the SSM is being extended to a joint model and factor analysis. -
Toggle Item2021-11-18 - Kimberly Kaufeld - Integrating public health surveillance and environmental data to model presence of "Histoplasma" in the United StatesAbstract
In the United States, the true geographic distribution of the environmental fungus Histoplasma capsulatum remains poorly understood but appears to have changed since it was first characterized. Histoplasmosis is caused by inhalation of the fungus and can range in severity from asymptomatic to life-threatening. Due to limited public health surveillance and underdetection of infections, it is challenging to directly use reported case data to characterize spatial risk. Using monthly and yearly county-level public health surveillance data and various environmental and socioeconomic characteristics, we use a spatio-temporal occupancy model to estimate latent, or unobserved, presence of H. capsulatum, accounting for imperfect detection of histoplasmosis cases. We estimate areas with higher probabilities of the presence of H. capsulatum in the East North Central states around the Great Lakes, reflecting a northward shift of the endemic region relative to previous estimates. This approach also mitigates challenges related to limited reporting. This work aims to help inform future surveillance needs, clinical awareness, and testing decisions for histoplasmosis. -
Toggle Item2021-11-04 - Brad Ferguson - How Data Science Unravels Social MediaAbstract:
Predicting how many views, clicks, and conversions an Instagram Story gets is an important tool in the influencer marketing space. It allows businesses to prioritize which influencers will perform well enough for a specific brand and makes budgets and planning more efficient. In this talk, I discuss the successes and challenges in building predictive models for Instagram and YouTube and what lessons can be learned when applying machine learning models in the real world. I also discuss what I have found makes a strong data scientist and what things to look for in your next employer to set you up for a successful career in data science. -
Toggle Item2021-10-28 - David Dahl - Shrinking a Partition Distribution Towards a Baseline Partition, With Applications to Dependent PartitionsCo-Authors:
Richard L. Warr
Thomas P. Jensen
Abstract:
Random partition models are closely related to Bayesian nonparametric models and provide flexible means to borrow strength in Bayesian data analysis. Parsimony is obtained by postulating that observations share model parameters with other observations belonging to the same cluster in a latent partition. In many contexts, prior knowledge regarding the partitioning of observations may be available and we may desire to use this baseline partition information to influence the prior partition distribution. To this end, we propose the shrinkage partition distribution (SPD), which shrinks any partition distribution towards a baseline partition. Recognizing that prior knowledge may be stronger for some items than others, our formulation allows for item-specific shrinkage towards the baseline partition. Further, our approach has a tractable normalizing constant, permitting posterior inference on the shrinkage and parameters associated with the baseline distribution. We explore the properties of our proposed distribution and other comparable distributions. We also show how the SPD can hierarchically model a collection of random partition distributions and can also model time-dependent random partitions. -
Toggle Item2021-10-21 - Nathan Sandholtz - An Inverse Optimization Analysis of the Fourth Down Decision in FootballAbstract:
The fourth down decision in football has been primarily studied as an optimization problem; using win probability as the criterion to optimize, analysts estimate optimal decisions for every fourth down situation. Prescriptions from these models have been publicly available for decades, informed by increasingly sophisticated win probability models in recent years. Despite this availability, NFL coaches’ observed fourth down decisions have remained distant from analysts’ recommendations. Inverse optimization provides a mathematical framework to make sense of the gap between coaches’ decisions and analysts’ prescriptions. Leveraging this paradigm, we assume that the coaches’ observed decisions are optimal but that the risk preferences governing their decisions are unknown. Our goal is to model and infer these latent risk preferences such that when the resulting optimization model is solved, it yields their observed decisions as optimal (or minimally suboptimal).
To this end, we model a football game as a Markov decision process, the components of which we estimate from regular season NFL play-by-play data from 2014 through 2020. Rather than assuming that win probability governs the coaches' decisions, we treat the optimality criterion as an unknown parameter. Using the quantile function to parameterize risk, we estimate which quantile-optimal policy yields the coaches’ observed decisions as minimally suboptimal. We find that coaches almost always exhibit conservative risk preferences, but that they have different risk preferences depending on various features of the fourth down situation. -
Toggle Item2021-10-21 - Jared Fisher - A Bayesian Semiparametric Approach to Treatment Effect Variation with NoncomplianceAbstract:
Estimating varying treatment effects in randomized trials with noncompliance is inherently challenging since variation comes from two separate sources: variation in the impact itself and variation in the compliance rate. In this setting, existing Frequentist and ML-based methods are quite flexible but are highly sensitive to the so-called weak instruments problem, in which the compliance rate is (locally) close to zero, and require pre-specifying subgroups of interest. Parametric Bayesian approaches, which account for noncompliance via imputation, are more robust in this case, but are much more sensitive to model specification. In this paper, we propose a Bayesian semiparametric approach that combines the best features of both approaches. Our main contribution is to embed Bayesian Additive Regression Trees (BART) in a broader Bayesian noncompliance framework in which we repeatedly impute individuals’ compliance types. This allows us to flexibly estimate varying treatment effects among Compliers while mitigating the weak instruments problem. We then apply our method to the Oregon health insurance experiment and show that analyses that only focus on a single source of variation can miss important heterogeneity. -
Toggle Item2021-10-14 - Oliver Johnson - Inference, Uncertainty Quantification, and Uncertainty Propagation for Grain Boundary Structure-Property ModelsAbstract:
Brandon D. Snow, Sterling G. Baird, David E. Page, Kathryn F. Varela, Christian Kurniawan, David T. Fullwood, Eric R. Homer, Oliver K. Johnson
We present a non-parametric Bayesian approach for developing structure-property models for grain boundaries (GBs) with built-in uncertainty quantification (UQ). Using this method we infer a structure-property model for H diffusivity in [100] tilt GBs in Ni at 700K based on molecular dynamics (MD) data. We then leverage these results to perform uncertainty propagation (UP) for mesoscale simulations of the effective diffusivity of polycrystals to investigate the interaction between structure-property model uncertainties and GB network structure. We observe a fundamental interaction between crystallographic correlations and spatial correlations in GB networks that causes certain types of microstructures (those with large populations of J2- and J3-type triple junctions) to exhibit intrinsically larger uncertainty in their effective properties. We also investigate the influence of different types of input data (bicrystal vs. polycrystal) and observe evidence of a transition between a data-rich regime in which bicrystals yield more accurate results to a data-limited regime in which polycrystals provide improved inferences. -
Toggle Item2021-10-07 - Candace Berrett - A Bayesian Change Point Model to Identify Local Temperature Changes Related to UrbanizationAbstract:
Changes to the environment surrounding a temperature measuring station can cause local changes to the recorded temperature that deviate from regional temperature trends. This phenomenon -- often caused by construction or urbanization -- occurs at a local level. If these local changes are assumed to represent regional or global processes, they can have significant impacts on historical data analyses. These changes or deviations are generally gradual, but can be abrupt, and arise as construction or other environmental changes occur near a recording station. We propose a methodology to examine if changes in temperature trends at a point in time exist at a local level at various locations in a region. Specifically, we propose a Bayesian change point model for spatio-temporally dependent data where we select the number of change points at each location using a "forwards" selection process using deviance information criterion (DIC). We then fit the selected model and examine the linear slopes across time to quantify the local changes in long-term temperature behavior. We show the utility of this model and method using a synthetic data set and observed temperature measurements from eight stations in Utah consisting of daily temperature data for 60 years. -
Toggle Item2021-09-30 - Elizabeth Hammond - The Role of Statistics in Evidence Based MedicineAbstract:
Throughout my long career in medical research, I have seen firsthand the critical role of statistical expertise in refining hypotheses, creating impactful research proposals, and refining or refuting research conclusions. Meta-analyses and well-conducted research studies have had a major impact on patient care. Major questions that have changed patient care have only been definitively answered by the collaboration of statisticians and medical professionals.
Some of these seminal questions answered in this way are:
- How is the prognosis of cancer patients affected by disease spread? Which cancers are lethal? What treatments are effective and which treatments should be abandoned?
- In coronary artery disease, do all patients require surgery? How can damage from a heart attack be mitigated without surgery?
- In the COVID-19 pandemic, what are the serious risk factors of death? Are vaccines effective in preventing death? How long does immunity last?
Without statistical expertise applied to these questions, definitive answers would not have been possible. During this seminar, I will describe important research in which statistical collaboration has played a major role.
-
Toggle Item2021-09-16 - David Wingate - Harnessing Algorithmic Fidelity in Large-Scale Language Models for Social ScienceAbstract:
Large-scale language models, such as BERT, T5 or GPT3, have revolutionized natural language processing because of their ability to capture nuanced structure in the syntax and semantics of natural language. However, the text they generate can exhibit problematic biases, such as racism, sexism, populism, extremism, and more.
In our recent work, we ask: instead of mitigating such bias, can we harness it for good? Can we treat it as signal, instead of noise? Our work demonstrates that the patterns described by these large-scale language models exhibit strong correlations with actual human opinions, value judgements, cultural experiences, assumptions, and thought processes. We term this "algorithmic fidelity".
To the extent that language models accurately reflect human thought processes, then they can be studied as an "in silico brain." In other words, researchers from psychology, neuroscience, cognitive science, or political science could potentially analyze these models and their output as surrogates for humans, performing new kinds of experiments that would never be possible with actual human subjects.
In this talk, I will discuss some of our recent efforts to use language models in the service of social sciences, including demonstrating how language models can take virtual political science surveys, and the potential to use language models to code free-form text.
Joint work with Chris Rytting, Lisa Argyle, Joshua Gubler, Ethan Busby, Taylor Sorensen, and Nancy Fulda. -
Toggle Item2020-03-19 - Marie Tuft - Interpretable Localized Time-Frequency Analysis via Penalized Reduced Rank RegressionAbstract:
Spectral analysis of nonstationary biological processes such as heart rate variability (HRV) and EEG poses a unique challenge: localized, accurate and interpretable descriptions of both frequency and time are required. By reframing this question in a reduced rank regression setting, we propose a novel approach that produces a low-dimensional and empirical basis that is localized in bands of time and frequency. To estimate this frequency-time basis, we apply penalized reduced rank regression with singular value decomposition to the localized discrete Fourier transform. An adaptive sparse fused lasso penalty is applied to the left and right singular vectors, resulting in low-dimensional measures that are interpretable as localized bands in time and frequency. Asymptotic properties of this method are derived, and it is shown to provide a consistent estimator of the time-varying spectrum. Simulation studies are used to evaluate its performance, and its utility in practice is illustrated through the analysis of HRV during sleep. -
Toggle Item2020-03-12 - Julia Silge - Understanding Word EmbeddingsAbstract:
Modern natural language processing frameworks (including word2vec, GloVe, fastText, ULMFIT, and more) depend on word embeddings, a way of statistically modeling language where words or phrases are mapped to vectors of real numbers. In this talk, we will build an understanding of word embeddings by investigating how to generate them using count-based statistics and dimensionality reduction, then learn how to make use of pre-trained embeddings based on enormous datasets. Finally, we will explore the ethical issues involved in using word embeddings and how they can amplify systemic and historical bias. -
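The count-based route mentioned in the abstract can be sketched in a few lines of Python (a toy corpus, purely for illustration): build a word-word co-occurrence matrix within a small context window, then reduce its dimension with a truncated SVD to obtain embedding vectors.

import numpy as np
from collections import Counter
from itertools import combinations

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog played",
]
window_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for i, j in combinations(range(len(tokens)), 2):
        if abs(i - j) <= 2:                      # small context window
            window_counts[(tokens[i], tokens[j])] += 1
            window_counts[(tokens[j], tokens[i])] += 1

vocab = sorted({w for pair in window_counts for w in pair})
index = {w: k for k, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(vocab)))
for (w1, w2), count in window_counts.items():
    M[index[w1], index[w2]] = count

# Truncated SVD of the (log-scaled) co-occurrence counts gives low-dimensional embeddings
U, s, _ = np.linalg.svd(np.log1p(M))
embeddings = U[:, :3] * s[:3]
print(vocab[:5], embeddings.shape)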
Toggle Item2020-03-05 - Tom Greene - Validation and Application of Surrogate Endpoints in Chronic Kidney Disease (CKD) Clinical TrialsAbstract:
Randomized trials of CKD treatments traditionally use clinical events late in CKD progression as end points. This requires costly studies with large sample sizes and long follow-up times. Recently, there has been much interest in the validation of surrogate endpoints to enable the use of smaller trials with shorter follow-up. In this talk, I will overview methodological challenges for the validation of surrogate endpoints, in general and specifically in CKD clinical trials. I will then describe recent work to evaluate potential surrogate endpoints using an evolving library of patient level data sets that currently includes 47 previously conducted CKD clinical trials. Finally, I will overview new ideas for applying validated surrogate endpoints in the design and analysis of newly conducted randomized trials. -
Toggle Item2020-02-27 - David Dahl - Two for One: 1. Focal Random Partition Distribution and 2. Optimization of Clustering CriteriaAbstract:
Random partition models, such as the Chinese restaurant process, allow a Bayesian model to flexibly borrow strength. We present two related working papers on random partition models. First, while many partition priors are exchangeable, we propose a nonexchangeable prior based on a focal partition, a Bayesian's prior guess for the unknown partition. We show how our approach modifies the Chinese restaurant process so that partitions that are similar to the focal partition have higher probability. There is a weight parameter that varies between -1 and infinity, where 0 corresponds to the original Chinese restaurant process and infinity yields a point mass distribution at the focal partition. In the later part of the talk, we present a novel stochastic search algorithm to minimize the posterior expected loss of a clustering criterion based on a pairwise similarity matrix. Several loss functions for clustering have been proposed, but the minimization of the expected loss is challenging given the large size and discrete nature of the search space. Our approach is a stochastic search based on a series of micro-optimizations performed in a random order. Our approach is embarrassingly parallel.
Co-Authors:
Richard Warr
Thomas Jensen
Devin Johnson
Peter Müller
Website:
https://statistics.byu.edu/directory/dahl-david-b -
Toggle Item2020-02-20 - Paul Sabin - Estimating Player Value in Football Using Plus-Minus ModelsThe use of statistical methods in sports has exploded during the past decade. Football, America’s most popular sport, has lagged behind in the adoption of “Analytics.” A framework for calculating the expected points of each play was introduced by former BYU Quarterback and statistics student Virgil Carter and Robert Machol in 1971. Thirty-five years later, this framework was reintroduced by Brian Burke and opened up the possibilities of analysis in the sport. Until recently, calculating the value of football players' on-field performance has been limited to scouting methods and quarterbacks. Adjusted Plus-Minus (APM) models have long been used in other sports, most notably basketball (Rosenbaum (2004), Kubatko et al. (2007), Winston (2009), Sill (2010)) to estimate each player’s value by accounting for those in the game at the same time. More recent methods have found ways to incorporate plus-minus models in other sports such as Hockey (Macdonald (2011)) and Soccer (Schultze and Wellbrock (2018) and Macdonald (2011)). These models are especially useful in coming up with results-oriented estimation of each player’s value. In American Football, it is difficult to estimate every player’s value since many positions, such as offensive linemen, have no recorded statistics. While player-tracking data in the NFL is allowing new analysis, such data does not exist in other levels of football such as the NCAA. Using expected points, I provide a model framework that solves many of the traditional issues APM models face in football. This methodology allows the models to estimate the value of each position in each level of the sport. These coarse models will be essential to pair with fine-level player tracking models in a multiscale framework in the future.
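The core of an adjusted plus-minus model is a (ridge) regression of play outcomes on player indicator columns; the Python sketch below uses synthetic lineups and outcomes, not the expected-points framework developed in the talk.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_players, n_plays = 30, 5000
true_value = rng.normal(size=n_players)          # hidden per-player value (hypothetical)

# Each row: +1 for offensive players on the field, -1 for defensive players
X = np.zeros((n_plays, n_players))
for row in X:
    offense = rng.choice(n_players, size=5, replace=False)
    defense = rng.choice(np.setdiff1d(np.arange(n_players), offense), size=5, replace=False)
    row[offense], row[defense] = 1.0, -1.0
# Outcome: points added on the play, driven by the players involved plus noise
y = X @ true_value + rng.normal(scale=3.0, size=n_plays)

apm = Ridge(alpha=10.0).fit(X, y)
print(np.corrcoef(apm.coef_, true_value)[0, 1])  # how well player values are recovered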
-
Toggle Item2020-02-13 - Greg MacFarlane - Using Mobile Device Data to Measure Park Choice, Access, and HealthParks provide benefits to the people who can access them, but previous research attempts to quantify these benefits have used largely arbitrary means to measure access. In two papers currently under review, my coauthors and I have applied a measurement of access based in econometric choice theory to this problem. We inform this measure using mobile device data, developing models of how far people are willing to travel to reach marginally larger parks. We then correlate these choice-based measures of access to parks with tract-level data on physical activity and obesity rates, controlling for demographic variables and spatial effects. The results indicate that excellent park access improves physical activity participation rates, and they suggest a marginal improvement in obesity rates beyond what physical activity and demographics can explain.
-
Toggle Item2020-02-06 - Summer Rupper - Tapping into Spy Satellite Imagery to Measure Glacier Changes in the Water Towers of AsiaAbstract:
The high mountain regions of Asia are host to more snow and ice than anywhere outside of the Polar Regions. Changes in snow and ice storage in these remote landscapes have the potential to impact the nearly 1 billion people living downstream. While it is clear that glaciers are generally retreating (shrinking in size) globally, there is a significant paucity of data on glacier changes in high mountain Asia. These glacial systems are extremely remote, occur at very high altitudes, and are often located along disputed borders in geopolitically unstable regions. This has hampered our ability to access these glaciers and directly monitor changes over time. Here we tap into historical spy satellite imagery to measure the change in glacier volume across the Himalayan range over the past ~50 years. We use this new data set to assess the magnitude and rates of change in glacier-related water resources, assess the mechanisms driving these changes, and model the impacts on downstream populations. -
Toggle Item2020-01-30 - Abbas Zaidi - Evaluating the Effect of Residential Care on Self-Concept and Ego-Resilience: A Generalized Propensity Score Analysis with Clustered DataAbstract:
This discussion focuses on the parametric estimation of average causal effects under a continuous treatment in a hierarchical setting. Our approach is applied to assessing the impact of the Udayan Ghar Program. This residential care system began in New Delhi, India with the purpose of providing surrogate housing and care to vulnerable and at risk children in an effort to improve their psychological development. We investigate the impact of staying in the system on the self-concept and ego-resilience of the residents as proxied by the Piers-Harris score. We find that there is a positive effect of staying in the residential care system at Udayan and that there are variations in this effect by gender. Furthermore, we strengthen our causal claims by demonstrating robustness against unmeasured confounding using a novel sensitivity analysis technique to assess how violations of this key identifying assumption impact our findings.
Website:
https://sites.google.com/site/amzaidistatistics/ -
Toggle Item2020-01-23 - Kevin Moon - Visualizing the True Structure of Big Data for Data ExplorationAbstract:
We live in an era of big data in which researchers in nearly every field are generating thousands or even millions of samples in high dimensions. Most methods in data science focus on prediction or impose restrictive assumptions that require established knowledge and understanding of the data; i.e. these methods require some level of expert supervision. However, in many cases, this knowledge is unavailable and the goal of data analysis is scientific discovery and to develop a better understanding of the data. There is especially a strong need for methods that perform unsupervised data visualization that accurately represents the true structure of the data, which is crucial for developing intuition and understanding of the data. In this talk, I will present PHATE: an unsupervised data visualization tool based on a new information distance that excels at denoising the data while preserving both global and local structure. I will demonstrate PHATE on a variety of datasets including facial images, mass cytometry data, and new single-cell RNA-sequencing data. On the latter, I will show how PHATE can be used to discover novel surface markers for sorting cell populations. In addition, I will present DIG, a visualization tool inspired by PHATE that theoretically eliminates nearly all sources of noise when visualizing dynamical systems. I will then demonstrate DIG on EEG sleep data.
Website:
https://sites.google.com/a/umich.edu/kevin-r-moon/home -
Toggle Item2019-12-05 - Derek Tucker - Elastic Functional Data AnalysisAbstract:
Functional data analysis (FDA) is an important research area, due to its broad applications across many disciplines where functional data is prevalent. An essential component in solving these problems is the registration of points across functional objects. Without proper registration, the results are often inferior and difficult to interpret. The current practice in the FDA literature is to treat registration as a pre-processing step, using off-the-shelf alignment procedures, and follow it up with statistical analysis of the resulting data. In contrast, an Elastic framework is a more comprehensive approach, where one solves for the registration and statistical inferences in a simultaneous fashion. Our goal is to use a metric with appropriate invariance properties, to form objective functions for alignment and to develop statistical models involving functional data. While these elastic metrics are complicated in general, we have developed a family of square-root transformations that map these metrics into simpler Euclidean metrics, thus enabling more standard statistical procedures. Specifically, we have developed techniques for elastic functional PCA, elastic tolerance bounds, and elastic regression models involving functional variables. I will demonstrate these ideas using simulated data and real data from various sources.
J. Derek Tucker is a Principal Member of the Technical Staff at Sandia National Laboratories. He received his B.S. in Electrical Engineering cum laude and M.S. in Electrical Engineering from Colorado State University in 2007 and 2009, respectively. In 2014 he received a Ph.D. degree in Statistics from Florida State University in Tallahassee, FL, under the co-advisement of Dr. Anuj Srivastava and Dr. Wei Wu. He currently is leading research projects in the area of satellite image registration and point process modeling for monitoring applications. His research is focused on pattern-theoretic approaches to problems in image analysis, computer vision, signal processing, and functional data analysis. In 2017, he received the Director of National Intelligence Team Award for his contributions to the Signal Location in Complex Environments (SLiCE) team.
Affiliation:
Sandia National Labs
Date:
2019-12-05 -
Toggle Item2019-11-21 - Antonio Villanueva-Morales - Modified Pseudo-likelihood Estimation for Markov Random Fields on LatticeAbstract:
The probability function of spatial statistical models involves, in general, an extremely awkward normalizing function of the parameters known as the partition function in statistical mechanics with the consequence that a direct approach to statistical inference through maximum likelihood (ML) is rarely possible. In order to avoid such intractability Besag (1975) introduced an alternative technique known as the method of maximum pseudo-likelihood (MPL) owing to its merit of being easy to implement. The maximum pseudo-likelihood estimator (MPLE) is the value of the parameter that maximizes the pseudo-likelihood defined as the direct product of conditional probabilities or conditional probability densities of the variable at each site. It has been mathematically demonstrated that, under suitable conditions, the MPLEs are strongly consistent and asymptotically normally distributed around the true parameter value for large samples of various spatial processes. On the other hand, the MPL method trades away efficiency for computational ease. It has been shown that in many situations the MPLE is not efficient in comparison with the ML estimator (MLE). According to these studies, the MPLEs are as good as the MLEs in the weak interaction case, but the difference between the two becomes substantial when spatial interactions are strong.
Huang and Ogata (2002) address the problem of improving the efficiency of MPLEs while still keeping the technique computationally feasible and proposed the maximum generalized pseudo-likelihood (MGPL) method for Markov random field (MRF) models on lattice. The MGPL estimator (MGPLE) is the value of the parameter that maximizes the generalized pseudo-likelihood function (GPL). This GPL is the multivariate version of Besag's pseudo-likelihood which is constructed first by defining a group of adjacent sites for each site in the lattice and then taking the product of the multivariate conditional probability distributions (MCPD) of the groups of random variables defined on each group of adjacent sites. Simulation results for an Ising and two auto-normal models on a region of square lattice showed better performance of the MGPLE than the MPLE, and the performance became better as the size of the groups of adjacent sites increased. On the other hand, it was observed that as the size of the groups of adjacent sites increased, the computing complexity for the MGPLE increased exponentially due to the presence of a normalizing integral (a sum in the case of discrete site variables) in the expression for each MCPD which has to be evaluated all over the support of the joint distribution for groups of site variables in each case. Because of this, for continuous MRFs other than auto-normal and discrete MRFs with site variables assuming more than two values, an enormous effort might be required making the implementation of the MGPL method practically unfeasible even for small square lattices. For example, in MRFs where each site variable, conditional on its neighbors, follows the distribution of a Winsorized Poisson random variable (Kaiser and Cressie (1997)) the computation of the normalizing integrals rapidly becomes prohibitive with the size of the groups of adjacent sites even for small square lattices, as the support of this distribution may be in the hundreds (or thousands).
In our research we propose a conditional pairwise pseudo-likelihood (CPPL) for parameter estimation in Markov random fields on lattice. The CPPL is defined as the direct product of conditional pairwise distributions corresponding to the pairs of random variables associated with the cliques of size two from the collection of spatial locations on a region of a lattice. Thus the CPPL is a modified version of Besag's pseudo-likelihood (PL) and Huang and Ogata's generalized pseudo-likelihood (GPL) in that it is not constructed based on defining a group of adjacent sites for each site in the lattice. We carry out calculations of the correspondingly defined maximum conditional pairwise pseudo-likelihood estimator (MCPPLE) for Markov random fields with Winsorized Poisson conditional distributions on the lattice. These simulation studies show that the MCPPLE has significantly better performance than Besag's maximum pseudo-likelihood estimator (MPLE), and its calculation is almost as easy to implement as the MPLE. Therefore, we suggest that for situations where each discrete local random variable conditional on its neighbors assumes more than two possible values, as in the Winsorized Poisson case, estimation based on the CPPL may be a computationally more feasible alternative than estimation based on Huang and Ogata's GPL.
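As a purely illustrative aside (not the authors' CPPL or MGPL implementation), the following Python sketch computes a Besag-style log pseudo-likelihood for a simple +/-1 autologistic (Ising-type) field on a square lattice, where each site's conditional distribution depends on its four nearest neighbors; the lattice data, the fixed alpha, and the grid search over beta are hypothetical.

import numpy as np

def neighbor_sum(x):
    # Sum of the four nearest-neighbor values at each site (zero padding at the edges)
    s = np.zeros_like(x, dtype=float)
    s[1:, :] += x[:-1, :]
    s[:-1, :] += x[1:, :]
    s[:, 1:] += x[:, :-1]
    s[:, :-1] += x[:, 1:]
    return s

def log_pseudolikelihood(x, alpha, beta):
    # Besag's pseudo-likelihood: sum over sites of log P(x_i | neighbors),
    # each conditional being logistic in alpha + beta * (neighbor sum)
    eta = alpha + beta * neighbor_sum(x)
    return np.sum(x * eta - np.logaddexp(eta, -eta))

# Hypothetical 20 x 20 field of +/-1 values and a crude MPLE by grid search over beta
rng = np.random.default_rng(0)
x = rng.choice([-1, 1], size=(20, 20))
betas = np.linspace(-1, 1, 201)
mple_beta = betas[np.argmax([log_pseudolikelihood(x, 0.0, b) for b in betas])]
print(mple_beta)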
Affiliation:
Statistics Department, Chapingo Autonomous University -
Toggle Item2019-11-14 - Jennifer Sinnott - Genetic Association Testing with Imperfect Phenotypes Derived From Electronic Health RecordsAbstract:
Electronic health records linked to blood samples form a powerful new data resource that can provide much larger, more diverse samples for testing associations between genetic markers and disease. However, algorithms for estimating certain phenotypes, especially those that are complex and/or difficult to diagnose, produce outcomes subject to measurement error. Much work is needed to determine best practices for implementing and analyzing such data. To this end, we recently proposed a method for analyzing case-control studies when disease status is estimated by a phenotyping algorithm; our method improves power and eliminates bias when compared to the standard approach of dichotomizing the algorithm prediction and analyzing the data as though case-control status were known perfectly. The method relies on knowing certain qualities of the algorithm, such as its sensitivity, specificity, and positive predictive value, but in practice these may not be known if no "gold-standard" phenotypes are known in the population. A common setting where that occurs is in phenome-wide association studies (PheWASs), in which a wide range of phenotypes are of interest, and all that is available for each phenotype is a surrogate measure, such as the number of billing codes for that disease. We proposed a method to perform genetic association tests in this setting, which improves power over existing methods that typically identify cases based on thresholding the number of billing codes. In this talk, I will describe these methods, and present applications to studies of rheumatoid arthritis in the Partners Healthcare System.
Affiliation:
Ohio State University/University of Utah
Date:
2019-11-14 -
Toggle Item2019-11-05 - Jacob Mortensen - Statistical Methods for Modeling MovementAbstract:
In recent years, tracking data has become widespread, allowing researchers to model movement at a very high level of detail. In this talk I will present two examples of statistical research inspired by this type of data. In the first, I present a method for nonparametric estimation of continuous-state Markov transition densities. Our approach uses a Poisson point process to represent the joint transition space, then divides that process by the marginal intensity to estimate the conditional transition density. Modeling a transition density as a point process creates a general framework that admits a wide variety of implementations, depending on suitability for a given application and at the discretion of the modeler. A key feature of this point process representation is that it allows the presence of spatial structure to inform transition density estimation. We illustrate this by using our method to model ball movement in the National Basketball Association, enabling us to capture the effects of spatial features, such as the three point line, that impact transition density values. In the second, I will show how broadcast-derived tracking data can be used to estimate external load metrics in sports science. Sports scientists use high resolution coordinate data to estimate external load metrics, such as acceleration load and high speed running distance, traditionally used to understand the physical toll a game takes on an athlete. Unfortunately, collecting this data requires installation of expensive hardware and paying costly licensing fees to data providers, restricting its availability. Algorithms have been developed that allow a traditional broadcast feed to be converted to x-y coordinate data, making tracking data easier to acquire, but coordinates are available for an athlete only when that player is within the camera frame. Obviously, this leads to inaccuracies in external load estimates, limiting the usefulness of this data for sports scientists. In this research, we develop models that predict offscreen load metrics and demonstrate the viability of broadcast-derived tracking data for understanding external load in soccer.
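As a rough sketch of the general device of estimating a transition density as a joint estimate divided by a marginal estimate (using kernel density estimates rather than the Poisson point process construction described above), the following Python snippet is illustrative only; the simulated series stands in for real tracking data.

import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical one-dimensional Markov series standing in for tracking data
rng = np.random.default_rng(1)
x = np.zeros(2000)
for t in range(1, len(x)):
    x[t] = 0.8 * x[t - 1] + rng.normal(scale=0.5)

pairs = np.vstack([x[:-1], x[1:]])      # (current state, next state) pairs
joint = gaussian_kde(pairs)             # estimate of the joint density f(s, s')
marginal = gaussian_kde(x[:-1])         # estimate of the marginal density f(s)

def transition_density(s, s_next):
    # Estimated conditional transition density f(s' | s) = f(s, s') / f(s)
    return joint([[s], [s_next]])[0] / marginal([s])[0]

print(transition_density(0.0, 0.5))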
Affiliation:
Simon Fraser University -
Toggle Item2019-10-31 - Nathan Sandholtz - Modeling human decision-making in spatio-temporal systems: An observational and an experimental case studyAbstract:
In this talk I present two contrasting analyses of human decision-making behavior in spatio-temporal systems. In the first case, we examine player shooting decisions in professional basketball. We assume that all players operate under the same objective function on offense---namely, maximizing their team's total expected points. Our goal is to identify areas where, conditional on location, lineups exhibit potential inefficiencies in allocating shots among their players. We do this by comparing a player's probability of making a shot to the rate at which he shoots in context of both his four teammates on the court and the spatial distribution of his shots. While on average players are highly efficient with respect to the shot allocation metrics we introduce, nearly every lineup exhibits some degree of potential inefficiency. We estimate and visualize the points that are potentially lost and identify which players are responsible.
In the second case, we analyze an experiment in which subjects were tasked with maximizing a reward in a simple "hotspot" computer game. As in the basketball example, subjects made decisions to maximize a point total. However, unlike shots in a basketball game, this task was specifically designed to induce uncertainty about the effect an action has on the subsequent point outcome. This forced subjects to balance exploration and exploitation in their strategy. Our analysis shows that subjects exhibit vastly different preferences regarding the exploration vs. exploitation tradeoff. For this reason, we cannot assume a global strategy which all subjects follow. On the contrary, inferring each subject's latent strategy (or acquisition function, as referred to in the Bayesian optimization literature) actually becomes the primary goal of our research. We find that the classical suite of acquisition functions doesn't adequately explain every subject's behavior, and we propose a modification to this suite of acquisition functions which better explains the subjects' collective set of decisions.
Affiliation:
Simon Fraser University
Date:
2019-10-31 -
Toggle Item2019-10-24 - Alex Petersen - Partial Separability and Graphical Models for Multivariate Functional DataAbstract:
Graphical models are a ubiquitous tool for identifying dependencies among components of high-dimensional multivariate data. Recently, these tools have been extended to estimate dependencies between components of multivariate functional data by applying multivariate methods to the coefficients of truncated basis expansions. A key difficulty compared to multivariate data is that the covariance operator is compact, and thus not invertible. In this talk, we will discuss a property called partial separability that circumvents the invertibility issue and identifies the functional graphical model with a countable collection of finite-dimensional graphical models. This representation allows for the development of simple and intuitive estimators. Finally, we will demonstrate the empirical findings of our method through simulation and analysis of functional brain connectivity during a motor task.
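The following Python sketch is not the estimator from the talk, only an illustration of the recipe partial separability suggests: project each curve onto a common basis and fit a separate sparse Gaussian graphical model at each basis level. The data, dimensions, and the pooled-PCA basis are hypothetical simplifications.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.covariance import GraphicalLassoCV

# Hypothetical multivariate functional data: n subjects, p components, T time points
rng = np.random.default_rng(2)
n, p, T, L = 50, 6, 100, 3
curves = rng.normal(size=(n, p, T))

# Common basis from pooled curves; the score array has shape (n, p, L)
basis = PCA(n_components=L).fit(curves.reshape(n * p, T))
scores = basis.transform(curves.reshape(n * p, T)).reshape(n, p, L)

# Under partial separability, fit one finite-dimensional graphical model per basis level
graphs = []
for level in range(L):
    model = GraphicalLassoCV().fit(scores[:, :, level])   # p x p sparse precision matrix
    graphs.append(model.precision_ != 0)                   # adjacency of the level-specific graph
print(graphs[0].astype(int))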
Affiliation:
University of California, Santa Barbara
Date:
2019-10-24 -
Toggle Item2019-10-17 - Jonathan Blake - A Risk Manager's Guide to a Faith JourneyAbstract:
Jonathan Blake, recipient of the college Alumni Achievement Award, will give a guest lecture on Thursday, October 17. Entitled “A Risk Manager’s Guide to a Faith Journey,” the lecture will take place at 11 a.m. in room 1170 of the Talmage Building. The public is invited to attend the event. For over twenty years, Blake has been employed in a variety of actuarial roles. He is currently the Vice President and Lead Actuary at The Hanover Insurance Group. In this position, he assesses the financial strength of a domestic reserve position of over three billion dollars. Blake is also a partner with the Personal, Commercial, and Specialty business units, where he helps them pursue profitable growth strategic initiatives. Blake graduated magna cum laude with a B.S. and M.S. from the Department of Statistics in the College of Physical and Mathematical Sciences. He has previously worked in Ohio, Texas, and Massachusetts and served as president, vice president, and board member of the Casualty Actuaries of New England. He is currently a Fellow of the Casualty Actuarial Society and a member of the American Academy of Actuaries. Blake and his wife, Julia, have six children. Each year, every college on Brigham Young University campus honors one alumnus or alumna with this prestigious award. Blake received the 2019 Alumni Achievement Award from the College of Physical and Mathematical Sciences.
Affiliation:
Hanover -
Toggle Item2019-10-10 - Wes Johnson - Gold Standards are Out and Bayes is In: Implementing the Cure for Imperfect Reference Tests in Diagnostic Accuracy StudiesAbstract:
Bayesian mixture models, often termed latent class models, allow users to estimate the diagnostic accuracy of tests and true prevalence in one or more populations when the positive and/or negative reference standards are imperfect. Moreover, they allow the data analyst to show the superiority of a novel test over an old test, even if this old test is the (imperfect) reference standard. We use published data on Toxoplasmosis in pigs to explore the effects of numbers of tests, numbers of populations, and dependence structure among tests to ensure model (local) identifiability. We discuss and make recommendations about the use of priors, sensitivity analysis, model identifiability and study design options, and strongly argue for the use of Bayesian mixture models as a logical and coherent approach for estimating the diagnostic accuracy of two or more tests.
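A toy version of the core likelihood behind such latent class models, for two conditionally independent tests in a single population, is sketched below; the counts and parameter values are invented, and the Bayesian machinery from the talk (priors, MCMC, dependence between tests) is not shown.

import numpy as np

def cell_probs(prev, se1, sp1, se2, sp2):
    # Probabilities of the four (test1, test2) outcomes under a two-test,
    # one-population latent class model with conditional independence
    probs = {}
    for t1 in (1, 0):
        for t2 in (1, 0):
            p_pos = prev * (se1 if t1 else 1 - se1) * (se2 if t2 else 1 - se2)
            p_neg = (1 - prev) * ((1 - sp1) if t1 else sp1) * ((1 - sp2) if t2 else sp2)
            probs[(t1, t2)] = p_pos + p_neg
    return probs

def log_likelihood(counts, prev, se1, sp1, se2, sp2):
    probs = cell_probs(prev, se1, sp1, se2, sp2)
    return sum(counts[k] * np.log(probs[k]) for k in counts)

# Hypothetical cross-classified counts: (test1 result, test2 result) -> number of animals
counts = {(1, 1): 38, (1, 0): 12, (0, 1): 9, (0, 0): 141}
print(log_likelihood(counts, prev=0.2, se1=0.9, sp1=0.95, se2=0.8, sp2=0.9))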
Affiliation:
Department of Statistics, University of California Irvine
Date:
October 10th, 2019
Website:
https://www.ics.uci.edu/~wjohnson/ -
Toggle Item2019-10-03 - John Lawson - Useful Models and Design Strategies for Experimentation - A Career Long PerspectiveAffiliation:
Department of Statistics, Brigham Young University
Date:
October 3, 2019
Website:
John Lawson -
Toggle Item2019-09-26 - Matt Heiner - Bayesian Nonparametric Density Autoregression with Lag SelectionAbstract:
We propose and illustrate a Bayesian nonparametric autoregressive model applied to flexibly estimate general transition densities exhibiting nonlinear lag dependence. Our approach is related to Bayesian curve fitting via joint density estimation using Dirichlet process mixtures, with the Markovian likelihood defined as the conditional distribution obtained from the mixture. This results in a nonparametric extension of a mixture-of-experts formulation. We address computational challenges to posterior sampling that arise from the conditional likelihood. We illustrate the base model by fitting to synthetic data simulated from a classical model for population dynamics, as well as a time series of successive waiting times between eruptions of Old Faithful Geyser. We explore inferences available through the base model before extending the model to include automatic relevance detection among a pre-specified set of lags. We explore methods and inferences for global and local lag selection with additional simulation studies, and illustrate by fitting to an annual time series of pink salmon abundance in a stream in Alaska. We further explore and compare transition density estimation performance for alternative configurations of the proposed model.
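The model in the talk is a Dirichlet process mixture; the sketch below substitutes a finite Gaussian mixture fit to lagged pairs, purely to illustrate the device of taking the transition density to be the conditional distribution implied by a joint mixture. The series and the number of components are hypothetical.

import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

# Hypothetical time series; build (y_{t-1}, y_t) pairs and fit a joint mixture
rng = np.random.default_rng(3)
y = np.cumsum(rng.normal(size=500)) % 10
pairs = np.column_stack([y[:-1], y[1:]])
gm = GaussianMixture(n_components=4, covariance_type="full", random_state=0).fit(pairs)

def transition_density(y_next, y_prev):
    # Conditional density f(y_t | y_{t-1}) implied by the fitted joint mixture
    w, mu, cov = gm.weights_, gm.means_, gm.covariances_
    marg = w * norm.pdf(y_prev, mu[:, 0], np.sqrt(cov[:, 0, 0]))   # component weights given y_prev
    resp = marg / marg.sum()
    cmean = mu[:, 1] + cov[:, 0, 1] / cov[:, 0, 0] * (y_prev - mu[:, 0])
    cvar = cov[:, 1, 1] - cov[:, 0, 1] ** 2 / cov[:, 0, 0]
    return np.sum(resp * norm.pdf(y_next, cmean, np.sqrt(cvar)))

print(transition_density(5.0, 4.0))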
Affiliation:
Department of Statistics, Brigham Young University
Date:
September 26, 2019 -
Toggle Item2019-09-19 - Adam Smith - Bayesian Analysis of Partitioned and Large-Scale Demand ModelsAbstract:
The analysis of consumer purchase behavior is a core component of marketing and economic research, but becomes challenging with large product assortments. I discuss two approaches for estimating demand models with a high-dimensional set of products. The first approach is based on partitioning demand: these models assume that products can be categorized into groups and then define consumer substitution patterns at the group-level rather than product-level. While this can significantly reduce the dimension of the parameter space, it can also lead to inaccurate inferences if the product categories do not match the structure of consumer preferences. To overcome this problem, I let the partition be a model parameter and propose a Bayesian method for inference. The second approach is based on regularization: I propose a new class of shrinkage priors for price elasticities in high-dimensional demand models. The prior has a hierarchical structure where the direction and rate of shrinkage depend on the information in a product classification tree. Both approaches are illustrated with store-level scanner data and the effects on demand predictions and product competition are discussed
Affiliation:
UCL School of Management, University College London
Date:
September 19, 2019
Website:
Adam Smith -
Toggle Item2019-04-04 - Daniel Apley - Understanding the Effects of Predictor Variables in Black-Box Supervised Learning Models
Presenter:
Daniel Apley
Title:
Understanding the Effects of Predictor Variables in Black-Box Supervised Learning Models
Affiliation:
Northwestern University
Date:
April 4, 2019
Abstract:
For many supervised learning applications, understanding and visualizing the effects of the predictor variables on the predicted response is of paramount importance. A shortcoming of black-box supervised learning models (e.g., complex trees, neural networks, boosted trees, random forests, nearest neighbors, local kernel-weighted methods, support vector regression, etc.) in this regard is their lack of interpretability or transparency. Partial dependence (PD) plots, which are the most popular general approach for visualizing the effects of the predictors with black box supervised learning models, can produce erroneous results if the predictors are strongly correlated, because they require extrapolation of the response at predictor values that are far outside the multivariate envelope of the training data. Functional ANOVA for correlated inputs can avoid this extrapolation but involves prohibitive computational expense and subjective choice of additive surrogate model to fit to the supervised learning model. We present a new visualization approach that we term accumulated local effects (ALE) plots, which have a number of advantages over existing methods. First, ALE plots do not require unreliable extrapolation with correlated predictors. Second, they are orders of magnitude less computationally expensive than PD plots, and many orders of magnitude less expensive than functional ANOVA. Third, they yield convenient variable importance/sensitivity measures that possess a number of desirable properties for quantifying the impact of each predictor.
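A bare-bones version of the accumulated local effect computation for a single numeric predictor is sketched below, following the description above; the fitted model and data are hypothetical, and details such as the quantile binning and centering convention are simplified. Because predictions are differenced only at observations that actually fall in each bin, no extrapolation outside the data envelope is required, which is the key contrast with partial dependence plots.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def ale_1d(model, X, j, n_bins=20):
    # Accumulated local effects of predictor j for any fitted model with .predict()
    z = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1))          # bin edges
    idx = np.clip(np.digitize(X[:, j], z[1:-1]), 0, n_bins - 1)
    local = np.zeros(n_bins)
    for k in range(n_bins):
        rows = X[idx == k]
        if len(rows) == 0:
            continue
        lo, hi = rows.copy(), rows.copy()
        lo[:, j], hi[:, j] = z[k], z[k + 1]
        # average change in prediction across the bin, using only points in the bin
        local[k] = np.mean(model.predict(hi) - model.predict(lo))
    ale = np.cumsum(local)
    return z, ale - ale.mean()                                        # centered ALE curve

# Hypothetical data with strongly correlated predictors
rng = np.random.default_rng(4)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2])
y = x1 + x2 ** 2 + rng.normal(scale=0.1, size=500)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
edges, effects = ale_1d(rf, X, j=0)
print(effects)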
Website:
Dr. Apley's Website
-
Toggle Item2019-03-28 - Jeff Miller - Flexible perturbation models for robustness to misspecification
Presenter:
Dr. Jeff Miller
Title:
Flexible perturbation models for robustness to misspecification
Affiliation:
Harvard
Date:
March 28, 2019
Abstract:
In many applications, there are natural statistical models with interpretable parameters that provide insight into questions of interest. While useful, these models are almost always wrong in the sense that they only approximate the true data generating process. In some cases, it is important to account for this model error when quantifying uncertainty in the parameters. We propose to model the distribution of the observed data as a perturbation of an idealized model of interest by using a nonparametric mixture model in which the base distribution is the idealized model. This provides robustness to small departures from the idealized model and, further, enables uncertainty quantification regarding the model error itself. Inference can easily be performed using existing methods for the idealized model in combination with standard methods for mixture models. Remarkably, inference can be even more computationally efficient than in the idealized model alone, because similar points are grouped into clusters that are treated as individual points from the idealized model. We demonstrate with simulations and an application to flow cytometry.
Website:
Dr. Miller's Website
-
Toggle Item2019-03-21 - Yue Zhang - Multi-state Approach for Studying Cancer Care Continuum using EHR data
Presenter:
Dr. Yue Zhang
Title:
Multi-state Approach for Studying Cancer Care Continuum using EHR data
Affiliation:
University of Utah
Date:
March 21, 2019
Abstract:
Diagnostic evaluation of suspected breast cancer due to abnormal screening mammography results is common, creates anxiety for women and is costly for the healthcare system. Timely evaluation with minimal use of additional diagnostic testing is key to minimizing anxiety and cost. In this paper, we propose a Bayesian semi-Markov model that allows for flexible, semi-parametric specification of the sojourn time distributions and apply our model to an investigation of the process of diagnostic evaluation with mammography, ultrasound and biopsy following an abnormal screening mammogram. We also investigate risk factors associated with the sojourn time between diagnostic tests. By utilizing semi-Markov processes, we expand on prior work that described the timing of the first test received by providing additional information such as the mean time to resolution and proportion of women with unresolved mammograms after 90 days for women requiring different sequences of tests in order to reach a definitive diagnosis. Overall, we found that older women were more likely to have unresolved positive mammograms after 90 days. Differences in the timing of imaging evaluation and biopsy were generally on the order of days and thus did not represent clinically important differences in diagnostic delay.
Website:
Dr. Zhang's Webpage
-
Toggle Item2019-03-14 - Dennis Tolley - DATA: Whence it Came…Where it’s Going
Presenter:
Dr. Dennis Tolley
Title:
DATA: Whence it Came…Where it’s Going
Affiliation:
BYU
Date:
March 14, 2019
Abstract:
A defining activity of statisticians is the handling, processing, analyzing and interpreting of data. With “big data” upon us, it is sometimes easy to forget some basic principles in the use of data. In this seminar I review some basic guidelines regarding data that apply before one actually begins to physically process the data files. I also review some guidelines based on the ultimate use of the results that assist in how a statistician will formulate a methodology and carry out the analysis. Application of these guidelines is illustrated with a simple problem in liquid chromatography that gives rise to a family of random walk models. These models, in turn, lay the foundation for a family of research problems in statistics.
Website:
Dr. Tolley's Website
-
Toggle Item2019-03-07 - Grant Schultz - Utah Crash Prediction Models: A Joint Effort for Success
Presenter:
Dr. Grant Schultz
Title:
Utah Crash Prediction Models: A Joint Effort for Success
Affiliation:
BYU
Date:
March 7, 2019
Abstract:
The Utah Department of Transportation (UDOT) continues to advance the safety of the state roadway network through its participation in and endorsement of the “Zero Fatalities: A Goal We Can All Live With™” campaign to increase awareness of the importance of highway safety. As a continuing effort by UDOT to advance the safety of their roadway network, research has been conducted wherein statistical models have been developed that allow users to evaluate the safety of roadways within the state. Three models have been developed by a team of Civil and Environmental Engineering and Statistics faculty and students. These models include the Utah Crash Prediction Model (UCPM), the Utah Crash Severity Model (UCSM), and the Utah Intersection Crash Prediction Model (UICPM). Using the output from these models, UDOT Safety Programs engineers, Region directors, and other interested users have access to data that will allow them to make informed decisions related to prioritizing highway safety projects and programs within the state of Utah.
Website:
Dr. Schultz Webpage
-
Toggle Item2019-02-28 - Ephraim Hanks - Random walk spatial models for spatially correlated genetic data
Presenter:
Dr. Ephraim Hanks
Title:
Random walk spatial models for spatially correlated genetic data
Affiliation:
Penn State
Date:
February 28, 2019
Abstract:
Landscape genetics is the study of how landscape features, like rivers, mountains, and roads, influence genetic connectivity of wildlife populations. We build models for spatial genetic correlation based on spatio-temporal models for how animals move across the landscape. This approach provides insights into common spatial models, such as simultaneous autoregressive (SAR) models and common Matérn covariance models. It also allows for scientific interpretation of spatial covariance parameters. We illustrate this approach in a study of brook trout, where we provide the first parametric description of how stream characteristics influence genetic connectivity.
Website:
Dr. Hanks' Website
-
Toggle Item2019-02-21 - Michele Guindani - Bayesian Approaches to Dynamic Model Selection
Presenter:
Michele Guindani
Title:
Bayesian Approaches to Dynamic Model Selection
Affiliation:
University of California, Irvine
Date:
February 21, 2019
Abstract:
In many applications, investigators monitor processes that vary in space and time, with the goal of identifying temporally persistent and spatially localized departures from a baseline or "normal" behavior. In this talk, I will first discuss a principled Bayesian approach for estimating time-varying functional connectivity networks from brain fMRI data. Dynamic functional connectivity, i.e., the study of how interactions among brain regions change dynamically over the course of an fMRI experiment, has recently received wide interest in the neuroimaging literature. Our method utilizes a hidden Markov model for classification of latent neurological states, achieving estimation of the connectivity networks in an integrated framework that borrows strength over the entire time course of the experiment. Furthermore, we assume that the graph structures, which define the connectivity states at each time point, are related within a super-graph, to encourage the selection of the same edges among related graphs. Then, I will propose a Bayesian nonparametric model selection approach with an application to the monitoring of pneumonia and influenza (P&I) mortality, to detect influenza outbreaks in the continental United States. More specifically, we introduce a zero-inflated conditionally identically distributed species sampling prior which allows borrowing information across time and assigning data to clusters associated with either a null or an alternate process. Spatial dependences are accounted for by means of a Markov random field prior, which allows the selection to be informed by inferences conducted at nearby locations. We show how the proposed modeling framework performs in an application to the P&I mortality data and in a simulation study, and compare with common threshold methods for detecting outbreaks over time, with more recent Markov switching based models, and with other Bayesian nonparametric priors that do not take into account spatio-temporal dependence.
Website:
Dr. Guindani's Website
-
Toggle Item2019-02-14 - Garritt Page - Temporal and Spatio-Temporal Random Partition Models
Presenter:
Dr. Garritt Page
Title:
Temporal and Spatio-Temporal Random Partition Models
Affiliation:
BYU
Date:
February 14, 2019
Abstract:
The number of scientific fields that regularly collect data that are temporally and spatially referenced continues to experience rapid growth. An intuitive feature in data that are spatio-temporal is that measurements taken on experimental units near each other in time and space tend to be similar. Because of this, many methods developed to accommodate spatio-temporal dependent structures perform a type of implicit grouping based on time and space. Rather than implicitly grouping observations through a type of smoothing, we develop a class of dependent random partition models that explicitly models spatio-temporal clustering. This model can be thought of as a joint distribution for a sequence of random partitions indexed by time and space. We first detail how temporal dependence is incorporated so that partitions evolve gently over time. Then a few properties of the joint model are derived and induced dependence at the observation level is explored. Afterwards, we demonstrate how space can be integrated. Computation strategies are detailed and we apply the method to Chilean standardized testing scores.
Website:
Dr. Page's Website
-
Toggle Item2019-02-07 - Gil Fellingham - Predicting Performance of Professional Golfers
Presenter:
Dr. Gil Fellingham
Title:
Predicting Performance of Professional Golfers
Affiliation:
BYU
Date:
February 7, 2019
Abstract:
Many statisticians agree that building models that predict well should be a high priority (Harville, 2014; Stern, 2014; Berry and Berry, 2014). The purpose of this paper is to test the predictive ability of various Bayesian models using a group of closely matched members of the Professional Golf Association (PGA). Predicting performance of PGA golfers is a notoriously difficult task. We fit six different models to scores produced by 22 PGA golfers playing on 18 different golf courses in 2014. We then use these models to predict scores for the same golfers and golf courses as well as other golfers and other courses in 2015. We varied model complexity across two different dimensions. In one dimension we fit model intercepts using parametric Bayesian, nonparametric Bayesian, and hierarchical Bayesian methods. In the other dimension, we either included covariates for driving distance, greens hit in regulation, and difficulty of course as measured by slope, or we did not include the covariates. Preliminary results indicate that nonparametric Bayesian methods seem marginally better.
Website:
Dr. Fellingham's Webpage
-
Toggle Item2019-01-31 - Matthias Katzfuss - Gaussian-Process Approximations for Big Data
Presenter:
Matthias Katzfuss
Title:
Gaussian-Process Approximations for Big Data
Affiliation:
Texas A&M University
Date:
January 31, 2019
Abstract:
Gaussian processes (GPs) are popular, flexible, and interpretable probabilistic models for functions. GPs are well suited for big data in areas such as machine learning, regression, and geospatial analysis. However, direct application of GPs is computationally infeasible for large datasets. We consider a framework for fast GP inference based on the so-called Vecchia approximation. Our framework contains many popular existing GP approximations as special cases. Representing the models by directed acyclic graphs, we determine the sparsity of the matrices necessary for inference, which leads to new insights regarding the computational properties. Based on these results, we propose novel Vecchia approaches for noisy, non-Gaussian, and massive data. We provide theoretical results, conduct numerical comparisons, and apply the methods to satellite data.
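A minimal sketch of the basic Vecchia device (each ordered observation conditioned on a small set of previous neighbors) is given below; it is not the general framework or the non-Gaussian extensions discussed in the talk, and the covariance function, the simple coordinate ordering, and the data are hypothetical.

import numpy as np
from scipy.stats import norm

def exp_cov(locs, variance=1.0, range_=0.3):
    d = np.abs(locs[:, None] - locs[None, :])
    return variance * np.exp(-d / range_)

def vecchia_loglik(y, locs, m=5):
    # Vecchia-type approximate Gaussian log-likelihood: each y_i is conditioned
    # on (at most) the m immediately preceding observations in the ordering
    order = np.argsort(locs)
    yo, lo = y[order], locs[order]
    ll = 0.0
    for i in range(len(yo)):
        prev = np.arange(max(0, i - m), i)
        K = exp_cov(np.append(lo[prev], lo[i]))
        if len(prev) == 0:
            mean, var = 0.0, K[0, 0]
        else:
            Kpp, kpi = K[:-1, :-1], K[:-1, -1]
            w = np.linalg.solve(Kpp, kpi)
            mean, var = w @ yo[prev], K[-1, -1] - kpi @ w
        ll += norm.logpdf(yo[i], loc=mean, scale=np.sqrt(var))
    return ll

# Hypothetical one-dimensional spatial data simulated from the exact model
rng = np.random.default_rng(5)
locs = np.sort(rng.uniform(size=300))
Sigma = exp_cov(locs) + 1e-6 * np.eye(300)
y = np.linalg.cholesky(Sigma) @ rng.normal(size=300)
print(vecchia_loglik(y, locs, m=5))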
Website:
Dr. Katzfuss Website
-
Toggle Item2019-01-24 - Brennan Bean - Interval-Valued Kriging with Applications in Design Ground Snow Load Prediction
Presenter:
Brennan Bean
Title:
Interval-Valued Kriging with Applications in Design Ground Snow Load Prediction
Affiliation:
Utah State University
Date:
January 24, 2019
Abstract:
The load induced by snow on the roof of a structure is a serious design consideration in many western and northeastern states: under-estimating loads can lead to structure failure while over-estimating loads unnecessarily increases construction costs. Recent updates to the design ground snow load requirements in Utah use geostatistics models to produce design ground snow load estimates that have shown significantly improved accuracy. However, the model inputs are subject to several sources of uncertainty including measurement limitations, short observation periods, and shortcomings in the distribution fitting process, among others. Ignoring these uncertainties in the modeling process could result in critical information loss that robs the final predictions of proper context. One way to account for these uncertainties is to express the data by intervals, as opposed to single numbers. Interval-valued geostatistics models for uncertainty characterization were originally considered and studied in the late 1980s. However, those models suffer from several fundamental problems that limit their application. This presentation proposes to modify and improve the interval-valued kriging models proposed by Diamond (1989) based on recent developments of random set theory. The resulting new models are shown to have more structured formulation and computational feasibility. A numerical implementation of these models is developed based on a modified Newton-Raphson algorithm and its finite sample performance is demonstrated through a simulation study. These models are applied to the Utah snow load dataset and produce an interval-valued version of the 2018 Utah Snow Load Study. The interesting and promising implications of these new results to design ground snow load and structural risk analysis will be thoroughly discussed.
Website:
Brennan's Webpage
-
Toggle Item2019-01-17 - Ron Reeder - Improving outcomes after pediatric cardiac arrest – a hybrid stepped-wedge trial
Presenter:
Ron Reeder
Title:
Improving outcomes after pediatric cardiac arrest – a hybrid stepped-wedge trial
Affiliation:
University of Utah
Date:
January 17, 2019
Abstract:
Quality of cardiopulmonary resuscitation (CPR) is associated with survival, but recommended guidelines are often not met, and less than half the children with an in-hospital arrest will survive to discharge. A single-center before-and-after study demonstrated that outcomes may be improved with a novel training program in which all pediatric intensive care unit staff are encouraged to participate in frequent CPR refresher training and regular, structured resuscitation debriefings focused on patient-centric physiology.
I’ll present the design of an ongoing trial that will assess whether a program of structured debriefings and point-of-care bedside practice that emphasizes physiologic resuscitation targets improves the rate of survival to hospital discharge with favorable neurologic outcome in children receiving CPR in the intensive care unit. This study is designed as a hybrid stepped-wedge trial in which two of ten participating hospitals are randomly assigned to enroll in the intervention group and two are assigned to enroll in the control group for the duration of the trial. The remaining six hospitals enroll initially in the control group but will transition to enrolling in the intervention group at randomly assigned staggered times during the enrollment period.
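Purely as an illustration of the assignment pattern that description implies (two hospitals always in the intervention arm, two always in control, and six crossing over at staggered, randomly chosen periods), the following sketch builds a hypothetical enrollment matrix; hospital labels, the number of periods, and the randomization are invented.

import numpy as np

rng = np.random.default_rng(6)
n_periods = 8
hospitals = [f"H{i + 1}" for i in range(10)]
rng.shuffle(hospitals)

always_treat, always_control, steppers = hospitals[:2], hospitals[2:4], hospitals[4:]
# give each stepping hospital a distinct, randomly chosen crossover period
crossover = dict(zip(steppers, rng.choice(np.arange(2, n_periods + 1), size=6, replace=False)))

# assignment[h][t-1] = 1 if hospital h enrolls in the intervention arm during period t
assignment = {}
for h in hospitals:
    if h in always_treat:
        assignment[h] = [1] * n_periods
    elif h in always_control:
        assignment[h] = [0] * n_periods
    else:
        assignment[h] = [1 if t >= crossover[h] else 0 for t in range(1, n_periods + 1)]

for h in sorted(hospitals):
    print(h, assignment[h])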
This trial is the first implementation of a hybrid stepped-wedge design. It was chosen over a traditional stepped-wedge design because the resulting improvement in statistical power reduces the required enrollment by 9 months (14%). However, this design comes with additional challenges, including logistics of implementing an intervention prior to the start of enrollment. Nevertheless, if results from the single-center pilot are confirmed in this trial, it will have a profound effect on CPR training and quality improvement initiatives.
Website:
Dr. Reeder's Website
-
Toggle Item2019-01-10 - Juan Rodriguez - Deep Learning to Save Humanity
Presenter:
Juan Rodriguez
Title:
Deep Learning to Save Humanity
Affiliation:
Recursion Pharmaceuticals
Date:
January 10, 2019
Abstract:
During the last 50 years, the advances in computational processing and storage have overshadowed the progress of most areas of research. At Recursion Pharmaceuticals we are translating these advances into biological results to change the way drug discovery is done. We are hyper-parallelizing the scientific method to discover new treatments for patients. This new approach presents unique statistical and mathematical challenges in the area of artificial intelligence and computer vision which will be presented.
Website:
Company Website
-
Toggle Item2018-12-06 - Dennis Eggett - Making the best of messy data: A return to basics
Presenter:
Dr. Dennis Eggett
Title:
Making the best of messy data: A return to basics.
Affiliation:
BYU
Date:
December 6, 2018
Abstract:
When your data does not meet the basic assumptions of an analysis method, you have to go back to the basics in order to glean the information you need. Three data sets will be used to explore resampling methods based on the definition of a p-value and the central limit theorem. A simple two-sample t-test of a data set that is not near normal and does not conform to non-parametric methods is used to demonstrate resampling in its simplest form. A mixed model analysis of highly skewed data will be used to demonstrate how to maintain its structure through the resampling process. Finally, a resampling of a very large data set will be used to demonstrate finding parameter estimates and confidence intervals.
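A minimal example of the kind of resampling argument described above, a two-sample permutation test built directly from the definition of a p-value, is sketched below with invented skewed data.

import numpy as np

rng = np.random.default_rng(7)
# Hypothetical skewed samples for which a plain t-test would be questionable
group_a = rng.lognormal(mean=0.0, sigma=1.0, size=30)
group_b = rng.lognormal(mean=0.4, sigma=1.0, size=25)

observed = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])

n_resamples = 10_000
count = 0
for _ in range(n_resamples):
    rng.shuffle(pooled)                       # reallocate observations to the two groups at random
    diff = pooled[len(group_a):].mean() - pooled[:len(group_a)].mean()
    if abs(diff) >= abs(observed):            # as extreme as, or more extreme than, the observed difference
        count += 1
p_value = (count + 1) / (n_resamples + 1)
print(observed, p_value)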
Website:
Dr. Eggett's Webpage
-
Toggle Item2018-11-29 - Bruno Sanso - Multi-Scale Models for Large Non-Stationary Spatial Datasets
Presenter:
Bruno Sanso
Title:
Multi-Scale Models for Large Non-Stationary Spatial Datasets
Affiliation:
University of California Santa Cruz
Date:
November 29, 2018
Abstract:
Large spatial datasets often exhibit features that vary at different scales as well as at different locations. To model random fields whose variability changes at differing scales, we use multiscale kernel convolution models. These models rely on nested grids of knots at different resolutions. Thus, lower order terms capture large scale features, while high order terms capture small scale ones. In this talk we consider two approaches to fitting multi-resolution models with space-varying characteristics. In the first approach, to accommodate the space-varying nature of the variability, we consider priors for the coefficients of the kernel expansion that are structured to provide increasing shrinkage as the resolution grows. Moreover, a tree shrinkage prior auto-tunes the degree of resolution necessary to model a subregion in the domain. In addition, compactly supported kernel functions allow local updating of the model parameters, which achieves massive scalability by suitable parallelization. As an alternative, we develop an approach that relies on knot selection, rather than shrinkage, to achieve parsimony, and discuss how this induces a field with spatially varying resolution. We extend shotgun stochastic search to the multi-resolution model setting, and demonstrate that this method is computationally competitive and produces excellent fit to both synthetic and real datasets.
Website:
Dr. Sanso's Website
-
Toggle Item2018-11-15 - Margie Rosenberg - Unsupervised Clustering Techniques using all Categorical Variables
Presenter:
Margie Rosenberg
Title:
Unsupervised Clustering Techniques using all Categorical Variables
Affiliation:
University of Wisconsin-Madison
Date:
November 15, 2018
Abstract:
We present a case study to illustrate a novel way of clustering individuals into groups of similar individuals where covariates are all categorical. Our method is especially useful when applied to multi-level categorical data where there is no inherent order in the variable, like race. We use data from the National Health Interview Survey (NHIS) to form the clusters and apply these clusters for prediction purposes to the Medical Expenditure Panel Survey (MEPS). Our approach considers the person-weighting of the surveys to produce clusters and estimates of expenditures per cluster that are representative of the US adult civilian non-institutionalized population. For our clustering method, we apply the K-Medoids approach with an adapted version of the Goodall dissimilarity index. We validate our approach on independent NHIS/MEPS data from a different panel. Our results indicate the robustness of the clusters across years and indicate the ability to distinguish clusters for the predictability of expenditures.
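The sketch below is a stripped-down illustration of clustering purely categorical records from a dissimilarity matrix with a medoid-based algorithm; it uses simple matching distance and a basic alternating k-medoids update rather than the weighted Goodall index and survey-weighted procedure described in the talk, and the data are invented.

import numpy as np

rng = np.random.default_rng(8)
# Hypothetical categorical data: 200 records, 4 unordered categorical variables
X = rng.integers(0, 3, size=(200, 4))

# Simple matching dissimilarity: share of variables on which two records disagree
D = np.array([[np.mean(a != b) for b in X] for a in X])

def k_medoids(D, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)            # assign each record to its nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]  # member minimizing within-cluster dissimilarity
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(D[:, medoids], axis=1)

medoids, labels = k_medoids(D, k=3)
print(medoids, np.bincount(labels))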
Website:
Dr. Rosenberg's Website
-
Toggle Item2018-11-08 - Terrance Savitsky - Bayesian Uncertainty Estimation under Complex Sampling
Presenter:
Terrance Savitsky
Title:
Bayesian Uncertainty Estimation under Complex Sampling
Affiliation:
Bureau of Labor Statistics
Date:
November 8, 2018
Abstract:
Multistage, unequal probability sampling designs utilized by federal statistical agencies are typically constructed to maximize the efficiency of the target domain level estimator (e.g., indexed by geographic area) within cost constraints for survey administration. Such designs may induce dependence between the sampled units; for example, with employment of a sampling step that selects geographically-indexed clusters of units. A sampling-weighted pseudo-posterior distribution may be used to estimate the population model on the observed sample. The dependence induced between co-clustered units inflates the scale of the resulting pseudo-posterior covariance matrix that has been shown to induce under coverage of the credibility sets. We demonstrate that the scale and shape of the asymptotic distributions are different between each of the pseudo-MLE, the pseudo-posterior and the MLE under simple random sampling. We devise a correction applied as a simple and fast post-processing step to MCMC draws of the pseudo-posterior distribution that projects the pseudo-posterior covariance matrix such that the nominal coverage is approximately achieved. We demonstrate the efficacy of our scale and shape projection procedure on synthetic data and make an application to the National Survey on Drug Use and Health.
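A toy sampling-weighted pseudo-posterior for a single mean parameter, fit by random-walk Metropolis, is sketched below; it only illustrates the basic device of exponentiating each unit's likelihood contribution by its sampling weight and does not implement the scale-and-shape projection proposed in the talk. Data, weights, and the prior are hypothetical.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)
# Hypothetical sample with unequal (made-up) sampling weights, normalized to sum to n
y = rng.normal(loc=2.0, scale=1.0, size=100)
w = rng.uniform(0.5, 3.0, size=100)
w = w * len(w) / w.sum()

def log_pseudo_posterior(mu):
    # weight-exponentiated (pseudo-) log-likelihood plus a diffuse normal prior
    return np.sum(w * norm.logpdf(y, loc=mu, scale=1.0)) + norm.logpdf(mu, 0.0, 10.0)

draws, mu = [], 0.0
current = log_pseudo_posterior(mu)
for _ in range(5000):
    proposal = mu + rng.normal(scale=0.2)
    candidate = log_pseudo_posterior(proposal)
    if np.log(rng.uniform()) < candidate - current:     # Metropolis accept/reject step
        mu, current = proposal, candidate
    draws.append(mu)
print(np.mean(draws[1000:]), np.std(draws[1000:]))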
Website:
-
Toggle Item2018-11-01 - Dustin Harding - How Renting Products Increases Consumer Confidence and Commitment
Presenter:
Dustin Harding
Title:
How Renting Products Increases Consumer Confidence and Commitment
Affiliation:
UVU
Date:
November 1, 2018
Abstract:
Consumers can obtain skill-based products through a variety of acquisition modes, such as purchase or rental. Despite the rise of nonpurchase acquisition modes, surprisingly little research has explored the effects of differential acquisition modes on consumer behavior. This research begins to fill this gap in the literature by examining the effect of acquisition mode on the expected time necessary to master newly adopted skill-based products and the downstream consequences for consumers and marketers. Results of four experiments and a field study show that purchasing, versus renting, products requiring skill-based learning increases the amount of time consumers expect to be required to master them. Further, the differences in speed of product mastery, in turn, impact subsequent consumer behavior via differential levels of product use commitment.
Website:
Dr. Harding's Website
-
Toggle Item2018-10-25 - Alex Petersen - Wasserstein Regression and Covariance for Random Densities
Presenter:
Alex Petersen
Title:
Wasserstein Regression and Covariance for Random Densities
Affiliation:
UC Santa Barbara
Date:
October 25, 2018
Abstract:
Samples of density functions appear in a variety of disciplines, including distributions of mortality across nations, CT density histograms of hematoma in post-stroke patients, and distributions of voxel-to-voxel correlations of fMRI signals across subjects. The nonlinear nature of density space necessitates adaptations and new methodologies for the analysis of random densities. We define our geometry using the Wasserstein metric, an increasingly popular choice in theory and application. First, when densities appear as responses in a regression model, the utility of Fréchet regression, a general purpose methodology for response objects in a metric space, is demonstrated. Due to the manifold structure of the space, inferential methods are developed allowing for tests of global and partial effects, as well as simultaneous confidence bands for fitted densities. Second, a notion of Wasserstein covariance is proposed for multivariate density data (a vector of densities), where multiple densities are observed for each subject. This interpretable dependence measure is shown to reveal interesting differences in functional connectivity between a group of Alzheimer's subjects and a control group.
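For one-dimensional distributions the Wasserstein-2 distance reduces to an L2 distance between quantile functions, which is all the sketch below uses; the Fréchet regression and Wasserstein covariance constructions from the talk are not reproduced, and the samples are simulated.

import numpy as np

def wasserstein2(sample_a, sample_b, grid_size=500):
    # Wasserstein-2 distance between two 1-D samples via empirical quantile functions
    p = (np.arange(grid_size) + 0.5) / grid_size
    return np.sqrt(np.mean((np.quantile(sample_a, p) - np.quantile(sample_b, p)) ** 2))

rng = np.random.default_rng(10)
a = rng.normal(0.0, 1.0, size=100_000)
b = rng.normal(1.0, 2.0, size=100_000)
# For two Gaussians the exact value is sqrt((mu1 - mu2)^2 + (sigma1 - sigma2)^2) = sqrt(2)
print(wasserstein2(a, b))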
Website:
Dr. Petersen's Website
-
Toggle Item2018-10-18 - Abel Rodriguez - Spherical Factor Analysis for Binary Data: A Look at the Conservative Revolt in the US House of Representatives
Presenter:
Abel Rodriguez
Title:
Spherical Factor Analysis for Binary Data: A Look at the Conservative Revolt in the US House of Representatives
Affiliation:
UC Santa Cruz
Date:
October 18, 2018
Abstract:
Factor models for binary data are extremely common in many social science disciplines. For example, in political science binary factor models are often used to explain voting patterns in deliberative bodies such as the US Congress, leading to an “ideological” ranking of legislators. Binary factor models can be motivated through so-called “spatial” voting models, which posit that legislators have a most preferred policy – their ideal point – which can be represented as a point in some Euclidean “policy space”. Legislators then vote for/against motions in accordance with the distance between their (latent) preferences and the position of the bill in the same policy space. In this talk we introduce a novel class of binary factor models derived from spatial voting models in which the policy space corresponds to a non-Euclidean manifold. In particular, we consider embedding legislators’ preferences on the surface of an n-dimensional sphere. The resulting model contains the standard binary Euclidean factor model as a limiting case, and provides a mechanism to operationalize (and extend) the so-called “horseshoe theory” in political science, which postulates that the far-left and far-right are more similar to each other in essentials than either are to the political center. The performance of the model is illustrated using voting data from recent US Congresses. In particular, we show that voting patterns for the 113th US House of Representatives are better explained by a circular factor model than by either a one- or a two-dimensional Euclidean model, and that the circular model yields a ranking of legislators more in accord with experts’ expectations.
Website:
Dr. Rodriguez's Website
-
Toggle Item2018-09-20 - Scott Grimshaw - Going Viral, Binge Watching, and Attention Cannibalism
Presenter:
Dr. Scott Grimshaw
Title:
Going Viral, Binge Watching, and Attention Cannibalism
Affiliation:
BYU
Date:
September 20, 2018
Abstract:
Since digital entertainment is often described as viral this paper uses the vocabulary and statistical methods for diseases to analyze viewer data from an experiment at BYUtv where a program's premiere was exclusively digital. Onset time, the days from the program premiere to a viewer watching the first episode, is modeled using a changepoint between epidemic viewing with a non-constant hazard rate and endemic viewing with a constant hazard rate. Finish time, the days from onset to a viewer watching all episodes, uses an expanded negative binomial hurdle model to reflect the characteristics of binge watching. The hurdle component models binge racing where a viewer watches all episodes on the same day as onset. One reason binge watching appeals to viewers is that they can focus attention on a single program's story line and characters before moving on to a second program. This translates to a competing risks model that has an impact on scheduling digital premieres. Attention cannibalism occurs when a viewer takes a long time watching their first choice program and then never watches a second program or delays watching the second program until much later. Scheduling a difference in premieres reduces attention cannibalism.
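A generic zero-hurdle negative binomial log-likelihood is sketched below to make the structure concrete: a hurdle probability for finishing on the day of onset (binge racing) and a zero-truncated count part for everyone else. This is not the expanded model from the talk (no covariates, no changepoint onset component), and the data are simulated.

import numpy as np
from scipy.stats import nbinom

def hurdle_nb_loglik(y, pi0, size, prob):
    # Zero-hurdle negative binomial: P(Y = 0) = pi0 (e.g., binge racing),
    # and for y >= 1 a zero-truncated NB(size, prob) count distribution
    y = np.asarray(y)
    zero = y == 0
    ll_zero = zero.sum() * np.log(pi0)
    trunc = 1.0 - nbinom.pmf(0, size, prob)
    ll_pos = np.sum(np.log1p(-pi0) + nbinom.logpmf(y[~zero], size, prob) - np.log(trunc))
    return ll_zero + ll_pos

# Hypothetical finish times (days from onset to watching all episodes)
rng = np.random.default_rng(11)
finish = np.where(rng.uniform(size=300) < 0.3, 0, 1 + rng.negative_binomial(2, 0.3, size=300))
print(hurdle_nb_loglik(finish, pi0=0.3, size=2, prob=0.3))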
Website:
Dr. Grimshaw's website
-
Toggle Item2018-04-12 - Cristian Tomasetti - Cancer etiology, evolution and early detection
Presenter:
Dr. Cristian Tomasetti
Title:
Cancer etiology, evolution, and early detection
Affiliation:
Johns Hopkins University School of Medicine
Date:
Apr 12, 2018
Abstract:
The standard paradigm in cancer etiology is that inherited factors and lifestyle or environmental exposures are the causes of cancer. I will present recent findings indicating that a third cause, never considered before, plays a large role: "bad luck", i.e., the pure chance involved in DNA replication when cells divide. Novel mathematical and statistical methodologies for distinguishing among these causes will also be introduced. I will then conclude with a new approach for the early detection of cancer.
Website:
Dr. Tomasetti's Website
-
Toggle Item2018-03-29 - H. Dennis Tolley - What's the Likelihood?
Presenter:
H. Dennis Tolley
Title:
What's the Likelihood?
Affiliation:
BYU
Date:
Mar 29, 2018
Abstract:
The likelihood function plays a major role in both frequentist and Bayesian methods of data analysis. Non-parametric Bayesian models also rely heavily on the form of the likelihood. Despite its heuristic foundation, the likelihood has several desirable large sample statistical properties that prompt its use among frequentists. Additionally, there are other important facets of the likelihood that warrant its formulation in many circumstances. As fundamental as the likelihood is, however, beginning students are only given a cursory introduction into how to formulate the likelihood. This seminar illustrates the formulation of the likelihood for a family of statistical problems common in the physical sciences. By examining the basic scientific principles associated with an experimental set-up, we show the step by step construction of the likelihood, starting with the discrete random walk model as a paradigm. The resulting likelihood is the solution to a stochastic differential equation. Elementary applications of the likelihood are illustrated.
Website:
Dr. Tolley's website
-
Toggle Item2018-03-22 - Matthew Heaton - Methods for Analyzing Large Spatial Data: A Review and Comparison
Presenter:
Dr. Matthew Heaton
Title:
Methods for Analyzing Large Spatial Data: A Review and Comparison
Affiliation:
BYU
Date:
Mar 22, 2018
Abstract:
The Gaussian process is an indispensable tool for spatial data analysts. The onset of the “big data” era, however, has led to the traditional Gaussian process being computationally infeasible for modern spatial data. As such, various alternatives to the full Gaussian process that are more amenable to handling big spatial data have been proposed. These modern methods often exploit low rank structures and/or multi-core and multi-threaded computing environments to facilitate computation. This study provides, first, an introductory overview of several methods for analyzing large spatial data. Second, this study describes the results of a predictive competition among the described methods as implemented by different groups with strong expertise in the methodology. Specifically, each research group was provided with two training datasets (one simulated and one observed) along with a set of prediction locations. Each group then wrote their own implementation of their method to produce predictions at the given locations, each of which was subsequently run on a common computing environment. The methods were then compared in terms of various predictive diagnostics.
Website:
Dr. Heaton's website
-
Toggle Item2018-03-15 - Timothy Hanson - A unified framework for fitting Bayesian semiparametric models to arbitrarily censored spatial survival data
Presenter:
Timothy Hanson
Title:
A unified framework for fitting Bayesian semiparametric models to arbitrarily censored spatial survival data
Affiliation:
Medtronic
Date:
Mar 15, 2018
Abstract:
A comprehensive, unified approach to modeling arbitrarily censored spatial survival data is presented for the three most commonly-used semiparametric models: proportional hazards, proportional odds, and accelerated failure time. Unlike many other approaches, all manner of censored survival times are simultaneously accommodated including uncensored, interval censored, current-status, left and right censored, and mixtures of these. Left truncated data are also accommodated leading to models for time-dependent covariates. Both georeferenced (location observed exactly) and areally observed (location known up to a geographic unit such as a county) spatial locations are handled. Variable selection is also incorporated. Model fit is assessed with conditional Cox-Snell residuals, and model choice carried out via LPML and DIC. Baseline survival is modeled with a novel transformed Bernstein polynomial prior. All models are fit via new functions which call efficient compiled C++ in the R package spBayesSurv. The methodology is broadly illustrated with simulations and real data applications. An important finding is that proportional odds and accelerated failure time models often fit significantly better than the commonly-used proportional hazards model.
Website:
Dr. Hanson's LinkedIn
-
Toggle Item2018-03-08 - Daniel Nettleton - Random Forest Prediction Intervals
Presenter:
Dr. Daniel Nettleton
Title:
Random Forest Prediction Intervals
Affiliation:
Iowa State University
Date:
Mar 8, 2018
Abstract:
Breiman's seminal paper on random forests has more than 30,000 citations according to Google Scholar. The impact of Breiman's random forests on machine learning, data analysis, data science, and science in general is difficult to measure but unquestionably substantial. The virtues of random forest methodology include no need to specify functional forms relating predictors to a response variable, capable performance for low-sample-size high-dimensional data, general prediction accuracy, easy parallelization, few tuning parameters, and applicability to a wide range of prediction problems with categorical or continuous responses. Like many algorithmic approaches to prediction, random forests are typically used to produce point predictions that are not accompanied by information about how far those predictions may be from true response values. From the statistical point of view, this is unacceptable; a key characteristic that distinguishes statistically rigorous approaches to prediction from others is the ability to provide quantifiably accurate assessments of prediction error from the same data used to generate point predictions. Thus, we develop a prediction interval -- based on a random forest prediction -- that gives a range of values that will contain an unknown continuous univariate response with any specified level of confidence. We illustrate our proposed approach to interval construction with examples and demonstrate its effectiveness relative to other approaches for interval construction using random forests.
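One simple way to attach a range to random forest point predictions, shown below, is to take quantiles of out-of-bag residuals; this is a generic construction and not necessarily the specific interval developed in the talk. Data and the nominal 90% level are hypothetical.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(12)
X = rng.uniform(-2, 2, size=(500, 3))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=500)

rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0).fit(X, y)

# Out-of-bag residuals approximate out-of-sample prediction errors
oob_resid = y - rf.oob_prediction_
lo, hi = np.quantile(oob_resid, [0.05, 0.95])        # empirical 90% residual range

X_new = rng.uniform(-2, 2, size=(5, 3))
for pred in rf.predict(X_new):
    print(f"prediction {pred:.2f}, interval ({pred + lo:.2f}, {pred + hi:.2f})")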
Website:
Dr. Nettleton's website
-
Toggle Item2018-02-22 - Robert Richardson - Non-Gaussian Translation Processes
Presenter:
Robert Richardson
Title:
Non-Gaussian Translation Processes
Affiliation:
BYU
Date:
Feb 22, 2018
Abstract:
A non-Gaussian translation process is a method used in some engineering applications where a stochastic process is used with non-Gaussian marginal distributions. It could be considered a hierarchical copula model where the correlation structure of the process is defined separately from the marginal distributional characteristics. This approach also yields a simple likelihood function for the finite dimensional distributions of the stochastic process. These processes will be shown, in a few applications, either to perform tasks that could not be done previously or to perform them much more efficiently, such as non-Gaussian option pricing, general multivariate stable spatial processes, and non-Gaussian spatio-temporal dynamic modeling.
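A minimal translation-process construction is sketched below: simulate a Gaussian process with a chosen correlation structure, push it through the standard normal CDF, and then apply the quantile function of the target marginal (here a gamma). The correlation function and marginal are arbitrary illustrations, not the applications named above.

import numpy as np
from scipy.stats import norm, gamma

rng = np.random.default_rng(13)
t = np.linspace(0, 1, 200)

# Gaussian process with exponential correlation on [0, 1]
corr = np.exp(-np.abs(t[:, None] - t[None, :]) / 0.1)
z = np.linalg.cholesky(corr + 1e-10 * np.eye(len(t))) @ rng.normal(size=len(t))

# Translation: uniformize the Gaussian margins, then apply the target quantile function
u = norm.cdf(z)
x = gamma.ppf(u, a=2.0, scale=1.5)       # a process with gamma(2, 1.5) marginal distributions
print(x[:5])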
Website:
Dr. Richardson's Website
-
Toggle Item2018-02-15 - Jeffery Tessem - How to make more beta cells: exploring molecular pathways that increase functional beta cell mass as a cure for Type 1 and Type 2 diabetes
Presenter:
Dr. Jeffery S Tessem
Title:
How to make more beta cells: exploring molecular pathways that increase functional beta cell mass as a cure for Type 1 and Type 2 diabetes
Affiliation:
Department of Nutrition, Dietetics and Food Science at BYU
Date:
Feb 15, 2018
Abstract:
Both Type 1 (T1D) and Type 2 diabetes (T2D) are caused by a relative insufficiency in functional β-cell mass. Current therapeutic options for diabetes include daily insulin injections to maintain normoglycemia, pharmacological agents to stimulate β-cell function and enhance insulin sensitivity, and islet transplantation. A major obstacle to greater application of islet transplantation therapy is the scarcity of human islets. Thus, new methods for expansion of β-cell mass, applied in vitro to generate the large numbers of human islet cells needed for transplantation, or in situ to induce expansion of the patient's remaining β-cells, could have broad therapeutic implications for this disease. To this end, our lab is interested in delineating the molecular pathways that increase β-cell proliferation, enhance glucose-stimulated insulin secretion, and protect against β-cell death.
Website:
Dr. Tessem's Website
-
Toggle Item2018-02-08 - Chris Groendyke - Bayesian Inference for Contact Network Models using Epidemic Data
Presenter:
Chris Groendyke
Title:
Bayesian Inference for Contact Network Models using Epidemic Data
Affiliation:
Robert Morris University
Date:
Feb 8, 2018
Abstract:
I will discuss how network models can be used to study the spread of epidemics through a population, and in turn what epidemics can tell us about the structure of that population. I apply a Bayesian methodology to data from a disease presumed to have spread across a contact network in a population in order to perform inference on the parameters of the underlying network and disease models. Using a simulation study, I will discuss the strengths, weaknesses, and limitations of these models, as well as the data required for this type of inference. Finally, I will describe an analysis of an actual measles epidemic that spread through the town of Hagelloch, Germany, in 1861, and share the conclusions it allows us to make regarding the population structure.
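As a toy version of the forward model that underlies this kind of inference, the sketch below simulates a discrete-time stochastic SIR epidemic over an Erdős–Rényi contact network. The network size, edge probability, and per-step infection and recovery probabilities are made-up illustrative values; an actual analysis such as the Hagelloch study would place a simulator or likelihood of this kind inside a Bayesian sampler for the network and disease parameters.

    import numpy as np
    import networkx as nx

    rng = np.random.default_rng(1)

    # Contact network and epidemic parameters (illustrative values only).
    n, edge_prob = 200, 0.03
    infect_prob, recover_prob = 0.4, 0.2     # per-contact and per-step probabilities

    G = nx.erdos_renyi_graph(n, edge_prob, seed=1)
    status = {v: "S" for v in G}             # S(usceptible), I(nfected), R(ecovered)
    status[0] = "I"                          # index case

    history = []
    while any(s == "I" for s in status.values()):
        infected = [v for v, s in status.items() if s == "I"]
        # Each infected node tries to infect susceptible neighbors, then may recover.
        for v in infected:
            for w in G.neighbors(v):
                if status[w] == "S" and rng.random() < infect_prob:
                    status[w] = "I"
            if rng.random() < recover_prob:
                status[v] = "R"
        history.append(sum(s == "I" for s in status.values()))

    print("epidemic lasted", len(history), "steps; final size:",
          sum(s == "R" for s in status.values()))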
Website:
Chris's Website
-
Toggle Item2018-02-01 - Larry Baxter - Structure in Prior PDFs and Its Effect on Bayesian Analysis
Presenter:
Larry Baxter
Title:
Structure in Prior PDFs and Its Effect on Bayesian Analysis
Affiliation:
BYU
Date:
Feb 1, 2018
Abstract:
Bayesian statistics formalizes a procedure for combining established (prior) statistical knowledge with current knowledge to produce a posterior statistical description that presumably is better than either the prior or the new knowledge by itself. Two common applications of this theory involve (a) combining established (literature) estimates of model parameters with new data to produce better parameter estimates, and (b) estimating model prediction confidence bands. Frequently, the prior information includes reasonable parameter estimates, poorly quantified and often subjective parameter uncertainty estimates, and no information regarding how the values of one parameter affect the confidence intervals of other parameters. All three of these parameter characteristics affect Bayesian analysis. The first two receive a great deal of attention. The third characteristic, the dependence of model parameters on one another, creates structure in the prior pdfs. This structure strongly influences Bayesian results, often to an extent that rivals or surpasses the influence of the parameter uncertainty estimates themselves. Nevertheless, Bayesian analyses commonly ignore this structure.
This structure stems primarily from the form of the model and, in linear models, does not depend on the observations themselves. Most models produce correlated parameters when applied to real-world engineering and science data. The most common example of structure is parameter correlation. Linear models produce linear parameter correlations that depend on the magnitude of the independent variable under analysis but that in most practical applications are large, with correlation coefficients often close to unity. Nonlinear models also generally have correlated parameters. However, the correlations can be nonlinear, even discontinuous, and generally involve more complexity than linear model parameter correlations. Parameter correlations profoundly affect the results of Bayesian parameter estimation and prediction uncertainty. Properly incorporated structure produces Bayesian results that powerfully illustrate the strength and potential contribution of the theory. Bayesian analyses that ignore such structure produce poor or even nonsensical results, often significantly worse than a superficial guess.
This seminar demonstrates the importance of prior structure in both parameter estimation and uncertainty quantification using real data from typical engineering systems. Perhaps most importantly, the discussion illustrates methods for incorporating the parameter structure implied by any given model without relying on the observations. These methods quantify parameter structure, including the lack of structure, for linear and nonlinear models.
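A small numerical illustration of the kind of structure described above, using synthetic data rather than anything from the talk: for a straight-line model fit to x-values far from the origin, the intercept and slope estimates are almost perfectly negatively correlated, which can be read directly from the usual covariance matrix sigma^2 (X'X)^(-1).

    import numpy as np

    rng = np.random.default_rng(2)

    # Straight-line model y = b0 + b1 * x with x-values far from the origin.
    x = np.linspace(100.0, 110.0, 50)
    X = np.column_stack([np.ones_like(x), x])
    y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

    # Ordinary least squares and the usual covariance of the estimates.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)

    corr = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
    print("intercept/slope correlation:", corr)   # very close to -1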
Website:
Larry's Website
-
Toggle Item2018-01-18 - Brad Barney - Growing Curve Methodology with Application to Neonatal Growth Curves
Presenter:
Brad Barney
Title:
Growing Curve Methodology with Application to Neonatal Growth Curves
Affiliation:
BYU
Date:
Jan 18, 2018
Abstract:
As part of postnatal care, newborns are routinely monitored to assess the stability and adequacy of their growth. Interest lies in learning about typical postnatal growth, especially for preterm infants. We briefly consider some general methodological strategies currently employed to parsimoniously construct growth curves for use in medical practice. We present original results using existing methodology known as generalized additive models for location, scale and shape (GAMLSS). We also expand existing methodology on the Bayesian analogue of GAMLSS, known as structured additive distributional regression. In particular, we hierarchically model weight and length jointly, from which we are able to induce a time-varying distribution for Body Mass Index.
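For readers unfamiliar with GAMLSS, its general form (written here in generic notation rather than the authors') models every parameter of the response distribution, not just the mean, through its own link function and additive predictor:

    y_i \sim \mathcal{D}(\mu_i, \sigma_i, \nu_i, \tau_i), \qquad g_k(\theta_{k,i}) = \mathbf{x}_{k,i}^{\top} \boldsymbol{\beta}_k + \sum_j f_{j,k}(z_{j,k,i}), \quad k = 1, \dots, 4,

where \theta_1 = \mu (location), \theta_2 = \sigma (scale), \theta_3 = \nu and \theta_4 = \tau (shape), and the f_{j,k} are smooth or random-effect terms.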
Co-Authors:
Adrienne Williamson, Josip Derado, Gregory Saunders, Irene Olsen, Reese Clark, Louise Lawson, Garritt Page, and Miguel de Carvalho
Website:
Brad's page