Seminars
Seminars are held on Thursdays at 4:00 pm in WVB 1161
2024-11-14 - Spencer Wadsworth - Bayesian Stacking via Proper Scoring Rule Optimization Using a Gibbs Posterior
Abstract:
In a probabilistic forecast hub, many researchers collaborate by each submitting their own forecasts of the same events. A forecast hub thus provides a locale where forecasts from the various participants may be assessed for skill, directly compared with other forecasts, and findings may be disseminated. Often, the forecasts in a forecast hub are combined into an ensemble forecast. A common method for combining forecasts is to construct an optimal linear pool where forecast distributions are weighted and summed into a mixture distribution. The selection of the model weights in a linear pool has received much attention and is often done by minimizing some score function or via Bayesian model averaging methods. Most methods either fail to provide uncertainty in their estimation of the weights or they do not cater well to problem specific needs. In this work we present the stacked Gibbs posterior (SGP), a novel method for combining forecasts by constructing an optimal linear pool via a Gibbs posterior. The weights of the linear pool are intended to optimize a proper scoring rule, which rule may be specified according to problem needs, and the Gibbs posterior allows for uncertainty quantification of the weights and regularization from a prior distribution. In two simulation studies, we compare the predictive performance of linear pool forecasts constructed using the SGP to linear pools constructed via Bayesian model averaging methods and an equally weighted pool. We also apply the SGP to forecasts from the 2023-24 CDC FluSight collaborative hub and compare the results to those of the model averaging and equally weighted methods. In each study, the forecasts from the SGP outperform ensemble forecasts from the other methods.
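For readers new to these ideas, a generic linear pool and Gibbs posterior take the following form; the notation is illustrative background and not necessarily the speaker's exact specification.

```latex
% Linear pool of K forecast densities with simplex weights w, and a Gibbs posterior that
% tempers a (positively oriented) proper scoring rule S instead of a likelihood.
\[
f_{w}(y) = \sum_{k=1}^{K} w_k\, f_k(y), \qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1,
\]
\[
\pi(w \mid y_{1:n}) \propto \exp\!\Big\{ \eta \sum_{i=1}^{n} S\big(f_{w}, y_i\big) \Big\}\, \pi(w),
\]
% where \eta is a learning-rate parameter and the prior \pi(w) provides regularization.
```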
Bio:
I'm in my final semester as a PhD student at Iowa State University and am currently a research fellow in the ISU Dependable Data Driven Discovery (D4) grant. I previously earned a master's degree in statistics at ISU and a bachelor's degree from Brigham Young University. My research interests are in Bayesian methodology for disease outbreak forecasting and probabilistic forecast hubs.
2024-11-12 - Thomas Kerby - Beyond Black Boxes: Learning Local Higher-Order Interactions for Interpretable Data and Models
Abstract:
In this era of big data and complex machine learning models, achieving model explainability remains a significant challenge. This work explores two interconnected black boxes: the generative processes behind high-dimensional datasets and the opaque mechanisms through which neural networks learn from these data. We introduce a method, called Local CorEx, that interprets neural network models by treating the representations passed through the layers as a complex dataset, allowing us to uncover both localized pairwise dependencies and higher-order interactions (HOIs) within the underlying data. Local CorEx leverages techniques from information theory, manifold learning, and representation learning to identify and extract these HOIs, revealing how groups of features interact in intricate and sometimes surprising ways. We validate Local CorEx through experiments on both synthetic and real-world datasets to assess its robustness and applicability. We find that Local CorEx uncovers meaningful local HOIs that capture dependencies beyond pairwise relationships in an unsupervised manner, paving the way for enhanced interpretability in neural networks and insights into the generative processes of complex data.
Bio:
Tom Kerby is a PhD candidate in Mathematical Sciences with a specialization in Statistics at Utah State University, studying under Dr. Kevin Moon and supported by the Presidential Doctoral Research Fellowship. Tom is a proud BYU alumnus, where he earned his undergraduate in Statistics with minors in Computer Science, Mathematics, and Molecular Biology. While at BYU, he conducted interdisciplinary research with plant geneticists and bioinformaticians in the orphan crops lab. His current research focuses on interpretable machine learning, representation learning, and diffusion models. Tom’s industry experience includes internships with Idaho National Laboratory, where he worked on detecting living-off-the-land cyber attacks, and with Enveda Biosciences, a biotech drug discovery company, where he applied large language models to predict molecular structures and properties. He married his sweetheart before coming to BYU and is now the father of two children, with a third expected in March. In his spare time, he enjoys outdoor activities with his family, astrophotography, playing the piano, and cheering on BYU sports. -
2024-11-07 - Steven Barnett - Generating higher resolution sky maps using a deep Gaussian process Poisson model
Bio:
Steven Barnett is a fifth-year graduate student at Virginia Tech pursuing a Ph.D. in Statistics. He received his M.S. in Statistics from Virginia Tech in 2022 after receiving a B.S. in Computer Science from Brigham Young University in 2016 and working as a software engineer at Qualtrics for four years. His research focuses on Gaussian process regression, surrogate modeling, and computer model calibration, with applications in the physical and engineering sciences. Steven has worked as a Graduate Research Assistant at Los Alamos National Laboratory and has served as a lead statistical consultant in the Statistical Applications and Innovations Group at Virginia Tech, collaborating with researchers in other fields and co-authoring several academic publications. He currently serves as the Communications Chair for the Chemical and Process Industries Division of the American Society for Quality and is the president of the Virginia Tech chapter of Mu Sigma Rho.
Abstract:
The Interstellar Boundary Explorer (IBEX) satellite was launched in 2008 in an effort to learn more about the heliosphere, which sits at the boundary between our solar system and interstellar space. IBEX detects energetic neutral atoms (ENAs) originating at the edge of the heliosphere to create sky maps describing their rate of emission. These sky maps are used by physicists to inform their theoretical models about the heliosphere, its many properties, and the processes that govern its creation. However, the data collected by IBEX are both noisy and irregular. We propose a deep Gaussian process Poisson model to estimate the rate at which ENAs are generated throughout the heliosphere. We believe this constitutes a more cohesive model than those developed previously. Our deep Gaussian process Poisson model has shown a greater ability to learn the complex, non-stationary mean response surface, and to provide appropriate uncertainty quantification in different regions of the input space, while maintaining a simpler covariance function. We have developed a Markov chain Monte Carlo algorithm utilizing elliptical slice sampling and the Vecchia approximation to help better predict the underlying latent deep Gaussian process for use in sky map generation, sequential design, and heliospheric computer model calibration.
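As background, a minimal deep Gaussian process Poisson model for counts indexed by sky location s can be sketched as follows; the talk's exact specification may differ.

```latex
% Two-layer (warped-input) deep GP with a Poisson observation layer.
\[
y(s) \mid \lambda(s) \sim \mathrm{Poisson}\{\lambda(s)\}, \qquad
\log \lambda(s) = f\{g(s)\}, \qquad
g \sim \mathcal{GP}(\mu_g, k_g), \quad f \sim \mathcal{GP}(\mu_f, k_f).
\]
```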
2024-11-05 - Zoe Gibbs - Dynamic State-Space Modeling for Determining the Long-Term Impact of Mobile Health Interventions
Abstract:
While many smartphone applications measure and/or seek to modify human behavior via mobile health (mHealth) interventions, it can be challenging to determine whether the interventions result in lasting behavioral changes. In this paper, we introduce a novel framework for assessing mHealth interventions. Specifically, we define a latent variable that measures a person's tendency toward a behavior of interest. We assume this tendency can deviate over time due to random changes and interventions. We also assume there is an overall long-term effect of interventions, but allow the effect to vary over time using an AR(1) correlation structure. The resulting model can be thought of as a local linear trend state-space model. Additionally, because the type and frequency of mHealth applications often rely on previous response data, we show how endogenous variables can be incorporated into the model. We also show how a hurdle likelihood can be used in the case of zero-inflated data. Because of the complexity of the model, we use Bayesian methods to fit the model. We show via simulations that the model can be fit using a No-U-Turn sampler (NUTS) in Stan. We apply our model to step count data, offering interpretations and areas for future work.
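One way to write a local linear trend model with a time-varying intervention effect is sketched below; this is only a hedged illustration of the class of models described above, with symbols that are not taken from the paper.

```latex
% Observation: latent tendency mu_t plus an intervention effect delta_t on the indicator x_t.
\[
y_t = \mu_t + \delta_t x_t + \varepsilon_t, \qquad \varepsilon_t \sim \mathrm{N}(0, \sigma_{\varepsilon}^2),
\]
% Local linear trend for the latent tendency; intervention effect varies around a
% long-term effect delta with AR(1) structure.
\[
\mu_t = \mu_{t-1} + \beta_{t-1} + \eta_t, \qquad \beta_t = \beta_{t-1} + \zeta_t, \qquad
\delta_t - \delta = \phi\,(\delta_{t-1} - \delta) + \nu_t .
\]
```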
Bio:
Zoe McBride is a Ph.D. student at the University of Connecticut. Her research interests include Bayesian hierarchical modeling and functional data analysis, especially in applications surrounding behavioral health and actuarial science. She received B.S. and M.S. degrees in Statistics from Brigham Young University in 2020. Zoe is also an associate of the Society of Actuaries. In her free time, she enjoys running, hiking, paddleboarding, and spending time with her husband and daughter. -
2024-10-31 - Dr. Tim Swartz - Two Problems in Soccer Analytics
Abstract:
This talk concerns two problems in soccer analytics that both rely on tracking data. The first problem begins with a review of average aging curves in sport. Then, a new approach is introduced for personal aging curves in soccer, an important problem that has not previously been addressed.
The second problem concerns the development of a metric that identifies soccer players who have a similar style to a player of interest. Whereas performance variables have been well studied, the same is not true of stylistic variables. Unlike assessments from scouting, the metric is automatic and objective. The metric is developed using a Bayesian framework.
Bio:
Tim Swartz is Professor and former Chair in the Department of Statistics and Actuarial Science at Simon Fraser University. He obtained a PhD and MSc in Statistics from the University of Toronto and a BMath from the University of Waterloo. He has over 120 research publications and has written several books, including an Oxford text (2000) with Michael Evans on Approximating Integrals via Monte Carlo and Deterministic Methods. He is a Fellow of the American Statistical Association and an associate editor for five journals. Most of his current research involves sports analytics.
2024-10-24 - Dr. Jean-François Bégin - Modeling Subnational Mortality in the Presence of Aggregated Data
Abstract:
This study proposes a new approach to modeling subnational mortality that relies on individual features (e.g., sex, geographical region, socioeconomic status) instead of dealing directly with subpopulations. Our strategy leads to more parsimonious models because fewer parameters are needed to explain mortality. Also, data providers might aggregate data over privacy concerns, and our framework allows for the use of such data, unlike the common subnational mortality modeling approach. A general one-step Bayesian estimation methodology that works well with most age–period–cohort mortality models proposed thus far in the literature is presented; it uses Markov chain Monte Carlo techniques by combining deterministic filtering with adaptive Metropolis steps and is well-suited for high-dimensional cases like the one investigated in this article. In a case study using real data, the framework is applied to Canadian mortality data from three datasets that encompass three features: sex, geographic region, and socioeconomic status. We show that the proposed approach combined with a reasonable mortality model provides realistic, coherent, and plausible mortality projections and that it fits the data reasonably.
This is joint work with Barbara Sanders and Xueyi Xu.
Bio:
Dr. Jean-François Bégin is an Associate Professor in the Department of Statistics and Actuarial Science at Simon Fraser University in British Columbia. He is a specialist in financial modelling as well as statistical and mathematical applications to finance and insurance. Before joining Simon Fraser University, he received his PhD from HEC Montréal in Financial Engineering. He is also a Fellow of both the Society of Actuaries and the Canadian Institute of Actuaries.
Over the past few years, his research program has focused on the construction of complex models for long-term economic predictions, the understanding and management of credit risk, the modelling of option prices, and the development of sustainable retirement solutions and designs. He was the recipient of the first Young Researcher Award of the Canadian Institute of Actuaries in 2017. He also won the 2019 Bob Alting von Geusau Prize of the AFIR-ERM section of the International Actuarial Association and the Early Career Faculty Research Award of Simon Fraser University’s Faculty of Science.
2024-10-17 - Dr. Sameer Deshpande - Scalable Targeted Smoothing In High Dimensions With BART
Abstract:
Bayesian Additive Regression Trees (BART) is an easy-to-use and highly effective nonparametric regression model that approximates unknown functions with a sum of binary regression trees (i.e., piecewise-constant step functions). Consequently, BART is fundamentally limited in its ability to estimate smooth functions. Initial attempts to overcome this limitation replaced the constant output in each leaf of a tree with a realization of a Gaussian Process (GP). While these elaborations are conceptually elegant, most implementations thereof are computationally prohibitive, displaying cubic per-iteration complexity.
We propose a version of BART built with trees that output linear combinations of ridge functions; that is, our trees return linear combinations of compositions between affine transforms of the inputs and a (potentially non-linear) activation function. We develop a new MCMC sampler that updates trees in linear time. Our proposed model includes a random Fourier feature-inspired approximation to treed GPs as a special case. More generally, our proposed model can be viewed as an ensemble of local neural networks, which combines the representational flexibility of neural networks with the uncertainty quantification and computational tractability of BART. -
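In symbols, a tree whose leaves return linear combinations of ridge functions can be written roughly as below; this is a sketch consistent with the description above, not the speaker's exact notation.

```latex
% \ell(x) is the leaf of tree T containing x, \varphi is an activation function, and the
% BART-style fit sums M such trees.
\[
g(x; T) = \sum_{j=1}^{J} \beta_{\ell(x), j}\, \varphi\!\big(a_{\ell(x), j}^{\top} x + b_{\ell(x), j}\big),
\qquad
\hat{f}(x) = \sum_{m=1}^{M} g(x; T_m).
\]
```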
2024-10-10 - Dr. Alex Petersen - FunCLIME: Sparse Estimation of a Functional Precision Operator by Constrained L1 Minimization
Abstract:
In multivariate data analysis, the precision (or inverse covariance) matrix is an important estimation target as it encodes information about linear dependencies between variables conditional on others, going beyond the pairwise correlations represented in the covariance matrix itself. Using the sample covariance as an input, modern algorithms provide sparse estimates of the precision matrix, with zero estimates indicating conditional linear independence. The first part of this talk will summarize two of these algorithms, commonly known as the graphical Lasso and CLIME, the latter being a simple and very computationally convenient relaxation of the former. In the second part, the extension of the precision matrix to multivariate functional data will be presented, along with a recently proposed (and by no means obvious) extension of the graphical Lasso methodology. Finally, I will present some ongoing work related to the Functional CLIME relaxation (FunCLIME), including the associated optimization problem, the computational algorithm, and an outline of some finite-sample theoretical properties.
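For reference, the classical CLIME estimator that FunCLIME extends solves the following constrained L1 problem (standard form, with S the sample covariance):

```latex
\[
\hat{\Omega} = \arg\min_{\Omega} \|\Omega\|_{1}
\quad \text{subject to} \quad \|S\,\Omega - I\|_{\infty} \le \lambda ,
\]
% followed by a symmetrization step; zeros in the estimate are read as conditional linear independence.
```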
2024-10-03 - Dr. Jeffery Dawson - Thinking Outside the Box Plot: Nontraditional Roles for Statisticians
Bio:
Dr. Dawson (BS in Statistics, BYU, and ScD in Biostatistics, Harvard) is a Professor of Biostatistics at the University of Iowa’s College of Public Health (UI CPH), where he serves as Associate Dean for Faculty Affairs, Chair of the Global Public Health Initiative, and a Fulbright Scholar mentor. He has collaborated with over 30 departments across campus and is particularly known for his research in elderly and neurologically impaired drivers, cardiovascular health, and cancer. His methodological interests include longitudinal data and clinical trials. He is a biostatistical consultant for Statistics Without Borders and has participated in several public health and educational outreach activities in Haiti, the Dominican Republic, and Mexico. He is a Fellow of the American Statistical Association.
Abstract:
Statistical training often focuses on data analysis methods and the theoretical underpinnings thereof. However, the actual work responsibilities of a statistician often expand into many other areas, and some statisticians may become involved in activities well beyond their anticipated roles. In this presentation, we discuss several examples of nontraditional roles, including addressing the media, providing training in ethics, conflict management, community outreach, and supervising study abroad programs. Other opportunities for nontraditional activities exist, depending on our work environment and our talents and interests. The quantitative skills that statisticians have often provide an important perspective when participating in nontraditional activities. -
2024-09-19 - Dr. Nate Sandholtz - Investigating the Spatial Component of Serving Strategy in Tennis
Abstract:
A key component of a player's strategy in tennis is the decision of where to aim their serve. At a high level, there are two primary strategies: aiming "out wide" (Wide), which entails serving across the court toward the outside boundary of the service region; or "up the T" (T), in which the server aims toward the center line of the service region. Once the player has decided which high-level region they will serve to, they must choose (perhaps subconsciously) a specific spatial target within that region. This component of the serving decision is the focus of our research. Specifically, conditional on a chosen region (Wide or T), where should a player aim their serve, knowing that they will not be able to execute their intention perfectly? To answer this question, we model the serve as a two-period Markov decision process (MDP) where both the transition dynamics and expected rewards must be learned from data. We examine two sources of data: first, data from an experiment we conducted with players from the BYU Men's tennis team; and second, observational data from the 2020 and 2021 Australian Opens. In our presentation, we will show results from our analysis on each data source and discuss their respective differences, advantages, and limitations.
2024-09-12 - Dr. Jared Whitehead - Aliasing Beyond Bias and Variance, a Nearly Label-Independent Decomposition for Quantitative Risk
Abstract:
We develop a novel decomposition of error in regression problems with intuition taken from the effects of signal aliasing from the signal processing community. This new decomposition agrees with the classical interpretation of the bias-variance tradeoff in the underparameterized regime, but also completely explains the presence of the so-called 'double descent' phenomenon as well as the potential for multiple descents and a slew of other non-convex behavior in the risk. Depending on the distribution of the training vs testing data (labels), we can use this decomposition to determine when the classical 'sweet spot' or underparameterized model is preferred, or when extreme over-parameterization (such as deep neural networks) is warranted. This generalized aliasing decomposition is demonstrated on a few simple regression problems, and on a large dataset arising from materials science where we show that the non-convex behavior of the risk is completely explained via this decomposition.
2024-04-04 - Xinyi "Lucy" Lu - Latent trajectory models for spatio-temporal ecosystem dynamics
Abstract:
State-space models are commonly used to characterize variation in ecosystems in space and time. However, Markov models (e.g., Discrete Time Markov Chain) are susceptible to imputing unrealistically volatile states when sampling occurred at irregular intervals. We proposed an alternative class of latent trajectory models that represent ecological processes using dynamic state probabilities instead of state transition probabilities. We demonstrated our method using two case studies. The first case study quantified the structural transformation of Alaskan land covers using remotely sensed imagery. We developed a Bayesian hierarchical model to infer the rates of land cover transitions and a Polya-Gamma sampling strategy to improve computation. The second case study differentiated population vulnerability to seasonal weather using native brook trout (Salvelinus fontinalis) count data. We inferred the latent states by incorporating ecological mechanisms and developed a Bayesian regularization strategy to promote temporal coherence. -
2024-03-28 - Amy Herring - Bayesian Learning of Clinically Meaningful Disease Phenotypes
Abstract:
Sepsis is a life-threatening condition that arises when the immune system's response to infection causes damage to internal organs and tissues. Recently, researchers have hypothesized that sepsis consists of a heterogeneous spectrum of distinct subtypes, motivating several studies to identify clusters of sepsis patients that correspond to subtypes, with the long-term goal of using these clusters to design more personalized treatments. Clinicians therefore rely on clusters having a concrete medical interpretation, usually corresponding to clinically meaningful regions of the sample space with direct implications for practitioners. In this article, we propose Clustering Around Meaningful Regions (CLAMR), a Bayesian clustering method that explicitly models the medical interpretation of each cluster center. CLAMR favors clusterings that can be summarized via meaningful feature values, leading to medically significant sepsis patient clusters. We also provide details on measuring the effect of each feature on the clustering using Bayesian hypothesis tests, so one can assess which features are relevant for cluster interpretation. The method is applied to clustering sepsis patients from Moshi, Tanzania, where patients are younger and the prevalence of HIV infection is higher than in previous sepsis subtyping cohorts from Europe and North America.
Bio:
Amy H. Herring is Sara & Charles Ayres Distinguished Professor of Statistical Science, Global Health, and Biostatistics and Bioinformatics at Duke University. Dr. Herring received her doctorate in biostatistics at Harvard University and came to Duke from UNC-Chapel Hill, where she was distinguished professor of biostatistics. Her research interests include development of statistical methodology for longitudinal or clustered data, Bayesian methods, latent class and latent variable models, missing data, complex environmental mixtures, and applications of statistics in population health and medicine. She has received numerous awards for her work, including the Mortimer Spiegelman Award from the American Public Health Association as the best applied public health statistician under age 40. Her research program is funded by NIH, and she holds leadership positions at the national and international level, including as Chair of the American Statistical Association's Section on Bayesian Statistical Science, as President of the International Society for Bayesian Analysis, and as a member of the Board of the International Biometric Society. -
2024-03-21 - Jyotishka Datta - Global-Local Shrinkage Priors: An Overview and New Directions
Abstract:
Building scalable Bayesian methods for handling high-dimensional data with complex structure remains an important methodological challenge with diverse applications. While there is a vast literature proposing elaborate shrinkage and sparsity priors for high-dimensional continuous data and real-valued parameters, there has been limited consideration of compositional or count data and admixtures. In the first part of my talk, I will provide a broad overview of the state-of-the-art in global-local shrinkage priors, covering theoretical optimality as well as computational aspects. In the second part, I will discuss a few recent developments, namely designing a shrinkage prior to handle bi-level sparsity and handling sparse compositional data, routinely observed in microbiomics. I will address the methodological challenges associated with each of these problems and propose to fill this gap by using new prior distributions specially designed to enable handling structured data. I will provide theoretical support for the proposed methods and demonstrate improved performance in simulation settings and applications to environmetrics and microbiome data.
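As background for the overview, the canonical global-local form, with the horseshoe as the best-known example, is:

```latex
\[
\beta_j \mid \lambda_j, \tau \sim \mathrm{N}\!\big(0, \lambda_j^2 \tau^2\big), \qquad
\lambda_j \sim \mathrm{C}^{+}(0, 1), \qquad \tau \sim \mathrm{C}^{+}(0, 1),
\]
% where the local scales \lambda_j let individual signals escape shrinkage while the
% global scale \tau adapts to the overall level of sparsity.
```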
Author Bio: Jyotishka Datta is an assistant professor in the Department of Statistics at Virginia Tech and a core faculty member at the Center of Biostatistics and Health Data Science. He was an Assistant Professor in the Department of Mathematical Sciences at the University of Arkansas at Fayetteville from 2016-2020. Jyotishka received his PhD in Statistics from Purdue University and worked as a postdoctoral fellow at Duke University and SAMSI (Statistics and Applied Mathematical Sciences Institute). His research interest spans developing new methodology and theory for high-dimensional data with low-dimensional structures. He has contributed to the areas of shrinkage estimation, sparse signal recovery, changepoint detection, nonparametric Bayes, and graphical models. Recent applications include next-gen sequencing studies, auditory neuroscience, ecology, and crime forecasting.
2024-03-14 - Lane Muranaka - A Salary Negotiation Discussion: Resources, Tactics, and Strategies
Lane is the Career Director for Computer Science, Math, and Statistics.
Why I love working here:
I love working where we can blend secular and spiritual learning to impact the lives of the students that we serve.
Quote I live by:
Trust in the LORD with all thine heart; and lean not unto thine own understanding. In all thy ways acknowledge him, and he shall direct thy paths – Proverbs 3:5-6
Advice for Students:
Follow your curiosity and ask lots of questions.
Favorite BYU Ice Cream Flavor:
Mint Chocolate Chip (do they have that here?)
What I love doing when I’m not at work:
Right now, I love to chase my kids around and support their activities. I love sports, playing golf, and it is a goal of mine to see a game at every major league baseball stadium. -
2024-03-07 - Trevor Harris - Climate Models, Variables & Projections
Bio:
Trevor Harris is an assistant professor in the Department of Statistics at Texas A&M University. His research interests include the development and application of deep learning tools to problems in Climate Science and Epidemiology. Some recent works include developing new Wasserstein metrics for validating climate models, large-scale Granger causality testing and policy evaluation through deep neural networks, and forecasting West Nile virus with graph neural networks. In the fall, he will join the Department of Statistics at the University of Connecticut through their cluster hire on Climate Science and AI.
Abstract:
Large ensembles of climate models are the primary method for quantifying uncertainty in future climate projections. However, these ensembles lack traditional statistical guarantees regarding their coverage and precision, potentially leading to either over or under representation of certainty in climate projections. To address this, we propose a conformal inference based approach, called conformal ensembles, that is guaranteed to have joint coverage over the prediction field without being excessively wide. This approach allows us to use nearly any prediction algorithm, including black box algorithms, to combine climate model output and generate statistically valid prediction regions whose sharpness directly improves with the accuracy of the algorithm. We demonstrate our approach on three climate variables: average temperature, maximum temperature, and precipitation, and show that it significantly outperforms inter-model variability on a wide range of uncertainty quantification metrics. We also show that conformal ensembles have good marginal performance across spatial locations, do not generally deteriorate over time, and typically improve with increasing climate model ensemble size. Finally, we investigate issues related to distribution shift that can impact the statistical guarantees of our approach.
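For context, the generic split-conformal construction on which conformal approaches build is shown below; this is textbook background rather than the paper's exact method.

```latex
% With calibration residuals r_i = |y_i - \hat{y}(x_i)|, i = 1,\dots,n, and miscoverage level \alpha:
\[
q_{1-\alpha} = \text{the } \lceil (n+1)(1-\alpha) \rceil / n \text{ empirical quantile of } \{ r_i \},
\qquad
C(x) = \big[\, \hat{y}(x) - q_{1-\alpha}, \ \hat{y}(x) + q_{1-\alpha} \,\big],
\]
% which attains marginal coverage of at least 1 - \alpha under exchangeability.
```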
2024-02-15 - Matthew Heaton - Adjusting for Spatial Correlation in Machine and Deep Learning
Abstract:
Spatial data display correlation between observations collected at neighboring locations. Generally, machine and deep learning methods either do not account for this correlation or do so indirectly through correlated predictors and thereby forfeit predictive accuracy. To remedy this shortcoming, we propose preprocessing the data using a simple spatial decorrelation transform derived from principles of Gaussian processes. The decorrelating transform is computationally scalable and allows the transformed data to be ported immediately into any machine or deep learning tool. After model fitting on the transformed data, the output from machine or deep learning methods is then spatially re-correlated via the corresponding back transformation. We show that including such a spatial adjustment results in higher predictive accuracy on simulated and real spatial datasets.
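A minimal sketch of the decorrelate, fit, and re-correlate workflow described above, assuming a known exponential covariance and a random forest as the learner. The exp_cov helper, parameter values, and toy data are illustrative assumptions, not taken from the talk.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.ensemble import RandomForestRegressor

def exp_cov(coords, sigma2=1.0, phi=0.3, nugget=1e-6):
    """Exponential covariance matrix for 2-D locations (assumed known here)."""
    D = cdist(coords, coords)
    return sigma2 * np.exp(-D / phi) + nugget * np.eye(len(coords))

rng = np.random.default_rng(0)
coords = rng.uniform(size=(300, 2))                          # spatial locations
X = rng.normal(size=(300, 3))                                # covariates
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=300)    # toy response

L = np.linalg.cholesky(exp_cov(coords))   # Sigma = L L'
y_dec = np.linalg.solve(L, y)             # decorrelated response
X_dec = np.linalg.solve(L, X)             # decorrelated predictors

# Fit any off-the-shelf learner on the decorrelated data, then re-correlate its output.
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_dec, y_dec)
y_fit = L @ model.predict(X_dec)          # back-transform to the original (correlated) scale
```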
2024-02-08 - Brian Hartman - Personal Finance and Financial Independence
We will discuss some basics of personal finance. We will examine how to make choices with your money to be able to achieve your goals and live the life you want to live. We will also talk about early retirement and financial independence, and how unusual saving can give you ultimate career and life flexibility. You are welcome to bring friends, partners, or roommates (especially those who never seem to be able to pay their portion of the utilities).
2024-02-01 - Cameron Bale - Can We Protect Time Series Data While Maintaining Accurate Forecasts?
Abstract:
In the context of increasing data breaches and privacy concerns, we focus on the often-neglected goal of maintaining the usefulness of protected data by exploring the intersection of data privacy and time series forecasting. Using both simulated and real-world time series data sets, we test various privacy methods, including a proposed swapping-based method (k-nTS+) designed to maintain time series features, a differentially private method, and an approach based on sharing model weights trained on unprotected data. Based on forecasts from seven different forecasting models, none of the privacy methods based on swapping or random noise addition can consistently maintain forecast accuracy at an acceptable level of privacy. On the other hand, sharing model weights trained on unprotected data enables accurate forecasting, but accurate forecasts themselves can uncover the identities of the unprotected time series. We show that transforming time series into rates increases the similarity of time series features, values, and forecasts, and enables k-nTS+ to produce protected rate time series with a reduction in forecast accuracy of just 3.6% on average. Overall, this paper shows that except under certain conditions, creating protected time series with acceptable privacy is incompatible with obtaining accurate forecasts.
Author Bio: Cameron is a Ph.D. candidate in Business Analytics at Drexel University Lebow College of Business. He specializes in statistical and machine learning-based privacy methods with a focus on location, textual, and time series data. His research emphasizes a utility-first approach to privacy protection that balances data usefulness with privacy risk. Cameron holds a B.S. in Economics from Brigham Young University and has published in Expert Systems with Applications and Transactions on Data Privacy. He will join the Brigham Young University Marketing Department as an assistant professor in Fall 2024. -
2024-01-25 - Matt Koslovsky - Accommodating Measurement Error in Multivariate Compositional Count Data with Applications to Microbiome Research
Abstract:
The human microbiome is the collection of microorganisms that live on and inside of our bodies. Microbiome data are inherently challenging to analyze due to their high-dimensionality, overdispersion, and zero-inflation. Analysis is further complicated by the steps taken to collect and process microbiome samples. For example, sequencing instruments have a fixed capacity for the total number of reads delivered. It is therefore essential to treat microbial samples as compositional. Another complicating factor of modeling microbiome data is that taxa counts are subject to measurement error introduced at various stages of the measurement protocol. Recently, the Dirichlet-multinomial (DM) distribution and its variants have been used extensively to model microbiome data due to its ability to accommodate the compositional structure of the data as well as overdispersion. A major limitation of the DM distribution is that it is unable to handle excess zeros typically found in practice which may bias inference. In this talk, I will introduce a novel Bayesian zero-inflated DM model for multivariate compositional count data with excess zeros, designed to boost scalability without sacrificing interpretability or imposing limiting assumptions. I will then present extensions to handle high-dimensional regression settings and potential taxonomic misclassification. The performance of the proposed methods is examined through simulation and is further illustrated using human microbiome data.
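Schematically, the Dirichlet-multinomial layer for taxa counts, with zero-inflation handled by latent indicators, can be written as follows; this is generic notation, not the speaker's full model.

```latex
\[
\mathbf{y}_i \mid \mathbf{p}_i \sim \mathrm{Multinomial}(N_i, \mathbf{p}_i), \qquad
\mathbf{p}_i \sim \mathrm{Dirichlet}(\boldsymbol{\gamma}_i),
\]
% with excess zeros accommodated by mixing each taxon's contribution with a point mass
% at (near) zero via latent indicators, e.g. z_{ij} \sim \mathrm{Bernoulli}(\theta_j).
```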
Bio:
I graduated with a PhD in Biostatistics from The University of Texas Health Science Center School of Public Health (UTHealth) in December 2016 and held an appointment as a Post-Doctoral Research Associate in Data Science at Rice University in Dr. Marina Vannucci’s research lab from March 2018 to August 2020. Prior to this role, I served as a statistical consultant in the Biostatistics Lab at Johnson Space Center on the Human Health and Performance Contract. I joined the Statistics Department at Colorado State University as an Assistant Professor in Fall 2020. My research interests include Bayesian methodology, nonparametric Bayes, variable selection, joint modeling, and compositional data analysis with applications in cancer prevention, mobile health, microbiome research, and ecology. Here's a link to my personal webpage: https://mkoslovsky.github.io
2024-01-18 - Bruce Schaalje - Randomness and Providence in an Inside-Outside Career
Bio:
Born and raised in Alberta, Canada. Served a mission in Finland. Attended Ricks College, BYU, U of Washington, and North Carolina State U. Degrees in Mathematics, Zoology, and Statistics. Worked for Agriculture Canada (12 years), BYU (24 years), and Bayer Crop Science (8 years). Married for 48 years with 6 children and 17 amazing grandchildren.
Abstract:
I have studied or worked in statistics for half a century -- in 3 universities, 2 countries, and 3 employment sectors (government, academia, and the private sector). The field has gone through a lot of change, but surprisingly many ideas are timeless.
I haven’t conquered statistics, but I am excited about it and still trying to make contributions. I have had the opportunity to interact with many wonderful people. I hope to excite you about what is in store for you in your studies and future career path. -
2023-11-28 - Michael Christensen - Utilizing network structure to flexibly model areal data
Abstract:
Regionally indexed or areal data is common within many ecological, epidemiological, and economic applications, yet commonly used spatial models for this data type, such as the conditional autoregressive model, are often insufficiently flexible to represent the types of between-region dependence that may be present within such data sets. The adjacency structure of a partitioned spatial domain allows for a graphical representation of the space on which data is observed. Within this presentation we provide a brief discussion of graph metrics and relevant theory before showing how these tools may be used to more thoughtfully characterize dependence structures within a network. We then review research and methods from current projects focused on modeling the complex spatial and spatio-temporal dependence patterns present within species abundance data taken from the eBird database. -
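For reference, the conditional autoregressive (CAR) baseline mentioned above specifies, for region i with adjacency weights w_ij, the standard conditional form:

```latex
\[
y_i \mid y_{-i} \sim \mathrm{N}\!\Big( \rho \sum_{j} \frac{w_{ij}}{w_{i+}}\, y_j, \ \frac{\tau^2}{w_{i+}} \Big),
\qquad w_{i+} = \sum_{j} w_{ij},
\]
% so dependence is driven entirely by the adjacency (graph) structure of the regions.
```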
2023-11-16 - Godwin Osabutey - From Ising Block Model to Bayesian Inference: Limit Theorems and Parametric Analysis
Bio:
Godwin Osabutey is a doctoral student at the University of Bologna, Italy, specializing in Mathematics.
Abstract:
This presentation delves into the Ising model applied to a network, where vertices are partitioned into r-blocks, each representing a fixed proportion of the total vertices. Interactions among vertex variables are governed by their respective block memberships. The discussion unfolds in two segments. The first part focuses on examining the asymptotic properties of the random vector of block sample means, establishing sufficient conditions for the validity of the Central Limit Theorem (CLT). In the second part, Bayesian statistics are employed to conduct a parametric analysis of the model. This involves the use of an advanced MCMC sampling method, introducing a novel approach to address an inverse Ising problem, leveraging prior knowledge of the data. The validity of the CLT provides a theoretical underpinning for the parametric analysis, as the sample mean of the data tends to approximate a normal distribution for a sufficiently large sample size.
2023-11-07 - David Arthur - On Bayesian Methods for Small Sample Psychometrics
Abstract:
Psychometrics, a crucial branch of psychology, plays a pivotal role in quantifying mental traits, abilities, and processes. One powerful tool in this domain is the Cognitive Diagnostic Model (CDM), gaining popularity for its ability to offer precise, personalized feedback on skill mastery. However, despite their potential, CDMs face underutilization in everyday classrooms, partly due to their demanding sample size prerequisites. This talk focuses on two critical aspects of cognitive diagnosis that are adversely affected by small sample sizes: parameter estimation and item-level model selection. To tackle these challenges, innovative Bayesian methodologies are introduced. For the first problem, data-driven shrinkage priors, also known as catalytic priors, are used to enhance the accuracy of parameter estimation. For the second problem, Reversible-Jump Markov Chain Monte Carlo (RJ-MCMC) is used to increase the chances of selecting the most appropriate CDM for each assessment item. Finally, the presentation concludes by outlining promising avenues for future research within the realm of CDMs, contributing to the ongoing progress in educational assessment practices.
2023-10-26 - Dr. Huiyan Sang - Graph Random Partition Models for Clustering, Classification, and Regression of Spatial and Network Data
Abstract:
This talk introduces random partition models constrained by graph contiguity. We focus on the random spanning tree and graph product partition model constructions. These models can be flexibly embedded into various Bayesian hierarchical frameworks as Bayesian priors for modeling clustered latent variables on graphs. We illustrate the utility of such graph random partition prior models in spatial clustering problems and classification/regression problems using non-axis parallel graph-split-based ensemble Bayesian additive decision trees. If time allows, we will also discuss strategies to extend the random partition model from a finite graph to a spatially soft partitioned process.
Bio:
Huiyan Sang is a professor and the director of the undergraduate program in statistics at Texas A&M University. She joined Texas A&M in 2008 as an assistant professor after earning her Ph.D. in Statistics from Duke University. Her research interests include the development of theory, methodology, and computation for spatial statistics, graph and network data analysis, Bayesian nonparametrics, machine learning methods, computational statistics, high-dimensional data analysis, and extreme values. Her interdisciplinary research work spans applications of statistics in environmental sciences, geosciences, urban and traffic planning, economics and business, biomedical research, chemometrics, and electrical/manufacturing/petroleum engineering. -
2023-10-12 - Jeffrey S. Morris - Quantile Functional Regression for Distributional Regression of Biomedical Imaging Data
Co-authors: Quy Cao, Hojin Yang, Veera Baladandayuthapani, Benny Renn, Elizabeth Sweeney
Abstract:
In many areas of science, technological advances have led to devices that produce an enormous number of measurements per subject, including biomedical imaging data. Frequently, researchers deal with these data by extracting summary statistics (e.g. mean or variance) and then modeling those, but this approach can miss key insights when the summaries do not capture all of the relevant information in the raw data. One of the key challenges in modern statistics is to devise methods that can extract information from these big data while avoiding reductionist assumptions. In this talk, we will discuss methods for modeling the entire distribution of the measurements observed for each subject and relating properties of the distribution to covariates, with possible smooth nonlinear covariate and longitudinally varying effects. We apply this method to two biomedical imaging applications: one computing how the distribution of pixel intensities within a glioblastoma region relates to various biological and clinical factors, and the second using quantitative susceptibility mapping measuring inflammatory processes in brain imaging from multiple sclerosis patients. This general approach has many important applications, including many biomedical imaging applications, as well as wearable device data from accelerometers, blood pressure, and blood sugar monitors, and other types of high-frequency data streams.
2023-10-05 - Alex Petersen - Graphical Modeling for Multivariate Functional Data
Abstract:
This talk will have 3 parts. In Part I, I will describe the topic of Gaussian Graphical Models for classical multivariate data. From a statistical perspective, a particularly nice feature of GGMs is that they are characterized by the (inverse) covariance matrix, so that interpretable estimation can be done using penalized likelihood estimators of the precision matrix. In Part II, I will outline how to generalize GGMs to multivariate functional data, where each node in the graph corresponds to a random function, with neuroimaging data being a key motivating example. Some existing estimators will be discussed. In Part III, I will describe two projects I will be investigating related to i) leveraging separability for functional GGM estimation and ii) extending the so-called CLIME GGM estimator to the functional case.
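The penalized-likelihood estimator referenced in Part I is the graphical lasso, which in its standard form solves:

```latex
\[
\hat{\Omega} = \arg\max_{\Omega \succ 0} \ \log\det\Omega - \mathrm{tr}(S\,\Omega) - \lambda \|\Omega\|_{1},
\]
% where S is the sample covariance and zeros in \hat{\Omega} encode conditional independence.
```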
2023-09-28 - Dr. Alex Franks - On Sensitivity to Unobserved Confounding in Studies with Factor-structured Outcomes
Bio:
Alex Franks is an Associate Professor in the Department of Statistics and Applied Probability at the University of California, Santa Barbara. His research interests include covariance estimation, sensitivity analysis and causal inference, missing data and measurement error, high throughput applications in biology (“omics”), Bayesian statistics and sports.
Abstract:
In this work, we propose an approach for assessing sensitivity to unobserved confounding in studies with multiple outcomes. We demonstrate how prior knowledge unique to the multi-outcome setting can be leveraged to strengthen causal conclusions beyond what can be achieved from analyzing individual outcomes in isolation. We argue that it is often reasonable to make a shared confounding assumption, under which residual dependence amongst outcomes can be used to simplify and sharpen sensitivity analyses. We focus on a class of factor models for which we can bound the causal effects for all outcomes conditional on a single sensitivity parameter that represents the fraction of treatment variance explained by unobserved confounders. We characterize how causal ignorance regions shrink under additional prior assumptions about the presence of null control outcomes, and provide new approaches for quantifying the robustness of causal effect estimates. Finally, we illustrate our sensitivity analysis workflow in practice, in an analysis of both simulated data and a case study with data from the National Health and Nutrition Examination Survey (NHANES).
2023-09-21 - Kimberly F. Sellers, Ph.D. - Dispersed Methods for Handling Dispersed Count Data
Bio:
Kimberly F. Sellers, Ph.D. is the Head of the Statistics Department at North Carolina State University in Raleigh, NC. She further serves as a Principal Researcher with the Center for Statistical Research and Methodology Division of the U.S. Census Bureau. Her primary research interests and expertise center on statistical methods for count data that contain data dispersion with methodological interests in distribution theory, regression analysis, multivariate analysis, stochastic processes, and time series analysis. She recently authored the book, The Conway-Maxwell-Poisson Distribution (Cambridge University Press, 2023), which is the first comprehensive reference on the distribution and the flexible statistical methods derived using it to analyze dispersed count data. Sellers is an Elected Member of the International Statistical Institute, and a Fellow in both the American Statistical Association (ASA) and the Association for Women in Mathematics in recognition of her research, and active contributions to diversifying the fields of mathematical and statistical sciences with respect to gender and race/ethnicity.
A Washington, DC-area native, Sellers completed her BS and MA degrees in Mathematics at the University of Maryland College Park, and obtained her PhD in Mathematical Statistics at The George Washington University. She held previous faculty positions at Carnegie Mellon University, the University of Pennsylvania School of Medicine, and Georgetown University.
Abstract:
While the Poisson distribution is a classical statistical model for count data, it hinges on the constraining equi-dispersion property (i.e., that the mean and variance are equal). This assumption, however, does not usually hold for real count data; over-dispersion (i.e., when the variance is greater than the mean) is a more common phenomenon for count data, although data under-dispersion has also been prevalent in various settings. It would be more convenient to work with a distribution that can effectively model data (over- or under-) dispersion because it can offer more flexibility (and, thus, more appropriate inference) in the statistical methodology. This talk introduces the Conway-Maxwell-Poisson distribution along with several associated statistical methods motivated by this model to better analyze count data under various scenarios (e.g., distributional theory, generalized linear modeling, control chart theory, and count processes). As time permits, this talk will likewise acquaint the audience with available associated tools for statistical computing.
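For reference, the Conway-Maxwell-Poisson pmf adds a dispersion parameter ν to the Poisson (standard form):

```latex
\[
P(Y = y) = \frac{\lambda^{y}}{(y!)^{\nu}\, Z(\lambda, \nu)}, \qquad
Z(\lambda, \nu) = \sum_{s=0}^{\infty} \frac{\lambda^{s}}{(s!)^{\nu}}, \qquad y = 0, 1, 2, \ldots,
\]
% with \nu = 1 recovering the Poisson, \nu < 1 allowing over-dispersion, and \nu > 1 under-dispersion.
```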
2023-09-14 - Dr. Abhirup Datta - On Combining Machine Learning With Gaussian Processes For Geospatial Data
Bio:
Dr. Datta is an Associate Professor in the Department of Biostatistics at Johns Hopkins University. He completed his PhD in Biostatistics from the University of Minnesota. Dr. Datta’s research focuses on developing spatial models for geographically indexed data. His work on Nearest Neighbor Gaussian Processes (NNGP) has become one of the most widely used methods for scalable analysis of massive geospatial data. His recent work focuses on developing theory and methodology for combining machine learning algorithms with traditional spatial modeling, and application of the methodology to air pollution and infectious disease modeling. He also works on developing Bayesian hierarchical models for multi-source data with applications in global health. His research as Principal Investigator is funded by grants from the National Science Foundation (NSF), National Institute of Environmental Health Sciences (NIEHS) and the Bill and Melinda Gates Foundation. He has received the Early Career Investigator award from the American Statistical Association Section of Environmental Health, the Young Statistical Scientist Award (YSSA) from the International Indian Statistical Association (IISA), and the Abdel El-Shaarawi Early Investigator's Award from The International Environmetrics Society (TIES).
Abstract:
Spatial generalized linear mixed-models, consisting of a linear covariate effect and a Gaussian Process (GP) distributed spatial random effect, are widely used for analyses of geospatial data. We consider the setting where the covariate effect is non-linear and propose modeling it using a flexible machine learning algorithm like random forests or deep neural networks. We propose well-principled extensions of these methods, for estimating non-linear covariate effects in spatial mixed models where the spatial correlation is still modeled using GP. The basic principle is guided by how ordinary least squares extends to generalized least squares for linear models to account for dependence. We demonstrate how the same extension can be done for these machine learning approaches like random forests and neural networks. We provide extensive theoretical and empirical support for the methods and show how they fare better than naïve or brute-force approaches to use machine learning algorithms for spatially correlated data. We demonstrate the RandomForestsGLS R-package that implements this extension for random forests. -
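The guiding analogy in the abstract, stated in symbols: with correlated errors ε ∼ N(0, Σ), ordinary least squares is replaced by generalized least squares, and it is this decorrelation idea that gets carried over to random forests and neural networks.

```latex
\[
\hat{\beta}_{\mathrm{OLS}} = (X^{\top}X)^{-1}X^{\top}y,
\qquad
\hat{\beta}_{\mathrm{GLS}} = (X^{\top}\Sigma^{-1}X)^{-1}X^{\top}\Sigma^{-1}y .
\]
```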
2023-04-06 - Martina Narcisi - On the Effect of Confounding in Linear Regression Models: An Approach Based on the Theory of Quadratic Forms
Abstract:
In the last two decades, prominent research has been devoted to the issue of spatial confounding in linear regression models. In general, confounding arises when a relationship between an observed process, the covariate, and an outcome process, the response, is influenced by the confounder, an unmeasured process associated with both. Consequently, the estimators for the regression coefficients of the measured covariates might be severely biased, less efficient, and characterized by misleading interpretations. Under a simple setup with one covariate, this work aims to extend the spatial confounding framework to general structured settings and to understand the behaviour of confounding as a function of the data generating process structure parameters in several scenarios focusing on the joint covariate-confounder structure. The focus is on the sampling properties of the target parameter estimators. They are derived as ratios of dependent quadratic forms in Gaussian random variables in order to provide an analytic expression of the marginal sampling properties of the estimator by means of Carlson’s R function. We propose a representative scalar quantity for the extent of confounding, a first-order Laplace approximation that serves as a proxy for the estimator bias.
For the application study, we work under several frameworks that consider spatial geostatistical and areal data and temporal data with specific assumptions regarding the covariance and cross-covariance functions used to generate the processes involved. The results allow us to claim that the variability of the confounder-covariate interaction and of the covariate plays the most relevant role in determining the principal marker of the magnitude of confounding. -
2023-03-30 - Tamara Broderick - An Automatic Finite-Sample Robustness Check: Can Dropping a Little Data Change Conclusions?
Abstract:
Commonly researchers will run a statistical analysis on a data sample, with the goal of applying any conclusions to a new population. For instance, if economists conclude microcredit is effective at alleviating poverty based on observed data, policymakers might decide to distribute microcredit in other locations or future years. Typically, the original data is not a perfect random sample from the population where policy is applied -- but researchers might feel comfortable generalizing anyway so long as deviations from random sampling are small, and the corresponding impact on conclusions is small as well. Conversely, researchers might worry if a very small proportion of the data sample was instrumental to the original conclusion. So we propose a method to assess the sensitivity of statistical conclusions to the removal of a very small fraction of the data set. Manually checking all small data subsets is computationally infeasible, so we propose an approximation based on the classical influence function. Our method is automatically computable for common estimators. We provide finite-sample error bounds on approximation performance and a low-cost exact lower bound on sensitivity. We find that sensitivity is driven by a signal-to-noise ratio in the inference problem, does not disappear asymptotically, and is not decided by misspecification. While some empirical applications are robust, conclusions of several influential economics papers can be changed by removing (much) less than 1% of the data. -
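The flavor of the influence-function approximation described above is sketched below; sign and scaling conventions vary, so this is only a schematic.

```latex
% Approximate change in the estimate from dropping the index set S, using empirical influence scores.
\[
\hat{\theta}_{-S} - \hat{\theta} \approx -\sum_{i \in S} \mathrm{IF}_i ,
\]
% so the most influential subset of a given size can be found by ranking the \mathrm{IF}_i,
% avoiding an infeasible search over all small subsets.
```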
2023-03-23 - Joshua Christensen/Elizabeth Petersen/Logan Clarke - Graduate Student Presentations
Joshua Christensen: Posterior Sampling for Non-Gaussian Dynamic Linear Models
Abstract:
Dynamic linear models (DLMs) are a flexible class of time series models based on the concept of time-varying parameters (TVPs). The forward-filtering backward-sampling (FFBS) algorithm provides an efficient framework for sampling the joint posterior of TVPs in a conditionally Gaussian DLM. We propose a novel sampling method that leverages the efficiency of the FFBS algorithm while relaxing the requirement of conditional Gaussianity. We rely on recent advances in slice sampling which eliminate the need for tuning parameters with transformations to approximate (pseudo) targets. The FFBS conversion of the joint distribution into a chain of univariate conditional distributions facilitates this transformation. We use this new slice sampling technique with an approximating conditionally Gaussian DLM as a pseudo target to sample from the joint posterior distribution of TVPs in non-Gaussian DLMs. We demonstrate the effectiveness of our technique by modeling pollution in French rivers with a dynamic harmonic regression (DHR) model.
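For readers unfamiliar with the notation, the conditionally Gaussian DLM underlying FFBS has the standard form:

```latex
\[
y_t = F_t^{\top} \theta_t + v_t, \qquad v_t \sim \mathrm{N}(0, V_t), \qquad
\theta_t = G_t\, \theta_{t-1} + w_t, \qquad w_t \sim \mathrm{N}(0, W_t),
\]
% FFBS filters forward in t and then samples \theta_T, \theta_{T-1}, \ldots, \theta_1 backward;
% the work above targets the case where the observation error is non-Gaussian.
```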
Elizabeth Petersen: Grade of Membership Scores in Estimating Heterogeneous Treatment Effects for Instrumental Variable Scenarios
Abstract:
In order to make causal conclusions in observational studies, researchers must look beyond standard modeling methodology and instead use tools such as instrumental variables. Valid instruments provide the ability to infer causality without the requirement of balancing on all relevant covariates. The use of covariates can provide information on heterogeneous treatment effects, which we approach by clustering the covariates. Traditional clustering requires points to exist within one group. This ignores the potential that observations may share characteristics with other groups. Grade of Membership (GOM) models allow observations to share membership with multiple groups or pure types. We introduce the use of GOM scores in instrumental variable analysis to consider heterogeneous treatment effects. Our analysis applies GOM scores in understanding the effects of Medicaid enrollment when a valid instrument, lottery assignment, is present.
Logan Clarke: Multinomial Classification of Costa Rican Household Poverty Level
Abstract:
In this project, we use a variety of statistical tools to predict poverty levels in Costa Rican households. Welfare fund distribution in Costa Rica is determined by poverty level, but limited documentation of income and expenses makes it difficult for welfare workers to determine how to allocate aid. In order to accurately classify households into their corresponding level of poverty, we will use multinomial logistic regression, multiple linear regression, random forest, neural networks, gradient boosted models, and k-nearest neighbor models to predict poverty level. By comparing model performance, we seek to determine which type of model is most appropriate for multinomial classification situations. Finally, using the model performance rankings from this data set, we explore how similar results may apply to other classification data sets.
Toggle Item2023-03-16 - Felipe Barrientos - Differentially Private Methods for Managing Model Uncertainty in Linear Regression ModelsAbstract:
Many data producers are concerned about protecting individuals' private information while still allowing modelers to draw inferences from confidential data sets. The framework of differential privacy enables statistical analyses while controlling and quantifying the potential leakage of private information. In this talk, we present differentially private methods for hypothesis testing, model averaging, and model selection for normal linear models. We consider both Bayesian and non-Bayesian methods for the tasks. The procedures are asymptotically consistent and straightforward to implement with existing software. We focus on practical issues such as quantifying the uncertainty introduced by the privacy-ensuring mechanisms. We evaluate the empirical performance of the approaches using simulated and real data. This is joint work with Víctor Peña at Universitat Politècnica de Catalunya, Spain. -
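To give a flavor of how a privacy-ensuring mechanism works, the sketch below implements the standard Laplace mechanism for releasing a bounded sample mean in Python; this is a generic textbook construction, not the hypothesis-testing or model-selection procedures of the talk.

import numpy as np

def private_mean(x, epsilon, lower=0.0, upper=1.0, rng=None):
    """Release the mean of values clamped to [lower, upper] under epsilon-DP
    using the Laplace mechanism."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.clip(np.asarray(x, dtype=float), lower, upper)
    sensitivity = (upper - lower) / len(x)     # effect of changing one record
    noise = rng.laplace(scale=sensitivity / epsilon)
    return x.mean() + noise

x = np.random.beta(2, 5, size=500)             # toy confidential data
print(private_mean(x, epsilon=0.5))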
Toggle Item2023-03-09 - Erin Schliep - Correcting for Informative Sampling in Spatial Covariance Estimation and KrigingAbstract:
Informative sampling designs are broadly used across many application areas of statistical modeling and can have a large impact on model inference and prediction. In spatial modeling, informative sampling can result in biased spatial covariance parameter estimation, which in turn can bias spatial prediction. To mitigate these biases, we develop a weighted composite likelihood approach to improve spatial covariance parameter estimation under informative sampling designs. Then, given these parameter estimates, we propose two approaches to quantify the effects of the sampling design on the variance estimates in spatial prediction in order to make informed decisions for population-based inference. -
Toggle Item2023-03-02 - Christophe Giraud-Carrier - Meta-learning and Dataset EmbeddingNo abstract listed.
-
Toggle Item2023-02-23 - Evan Johnson - Tools and Methods for Profiling Host-Microbe Interactions in Pulmonary DiseasesAbstract:
Big data technologies are playing an increasingly influential role in multiple sectors within industry and research. This is providing significant opportunities for individuals with expertise in the acquisition, management, and analysis of data, so-called data scientists. In the biomedical arena in particular, data science and informatics have revolutionized discoveries in molecular research, public health, and clinical care. One unique application of big data technology is the use of DNA- and RNA-sequencing to evaluate the interplay between the microbiome and host immune and inflammatory pathways, particularly in relation to human pulmonary diseases. Here I will discuss several relevant computational and statistical innovations in the context of host-microbe profiling in asthma, lung cancer, and tuberculosis research. -
Toggle Item2023-02-16 - Dave Higdon - Computer Model Calibration and Agent Based ModelsAbstract:
Agent-based models (ABMs) use rules at the individual (agent) level to simulate a social, ecological, or socio-technical system, producing structured behavior when viewed at an aggregated level. ABMs allow consideration of interactions, behaviors, outcomes, and interventions at a very fine resolution, giving them a connection to reality that is often missing from more aggregated modeling approaches. Of course, this comes with a price. For example, ABMs typically come with high computational cost, random realizations, and difficulty in capturing more aggregated properties of the real system. This talk will use a toy example to highlight the key concepts of Bayesian model calibration for constraining ABMs with observations; it will also touch on Approximate Bayesian Computation (ABC). I'll then go on to show how such methodology can be used in more involved applications, using an ABM developed during the 2014 Ebola epidemic. -
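Since the talk touches on Approximate Bayesian Computation, here is a minimal rejection-ABC sketch in Python; the simulator and summary statistic are toy stand-ins for an agent-based model, and all names are hypothetical.

import numpy as np

def abc_rejection(observed_summary, simulate, prior_draw, n_draws=20000, tol=0.5):
    """Keep prior draws whose simulated summary lies within tol of the observed one."""
    accepted = []
    for _ in range(n_draws):
        theta = prior_draw()
        if abs(simulate(theta) - observed_summary) < tol:
            accepted.append(theta)
    return np.array(accepted)

# Toy stand-in for an ABM: the "model" outputs a final outbreak size
rng = np.random.default_rng(1)
simulate = lambda beta: rng.poisson(50 * beta)      # crude simulator (hypothetical)
prior_draw = lambda: rng.uniform(0.0, 2.0)          # prior on a transmission parameter
posterior_draws = abc_rejection(observed_summary=60, simulate=simulate,
                                prior_draw=prior_draw, tol=5)
print(posterior_draws.mean(), len(posterior_draws))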
Toggle Item2023-01-26 - Alejandro Jara - A Class of Random Bernstein Copula ModelsAbstract:
Copula models provide great flexibility in modeling relationships between random variables. For inference to take full advantage of this flexibility, one needs appropriately rich families of copula functions, capable of approximating any copula. One such family is the family of Bernstein copulas, which are a variety of multivariate Bernstein polynomials and which have been shown to be dense in the space of continuous copula functions. Bernstein copulas have been used for inference before, but only using likelihood-free approximation methods. We observe a fact about the geometry of the parameter space of Bernstein copulas, and note that it is closely related to a different class of copula known as grid uniform copulas. Based on this relationship, we propose a Bayesian model based on Bernstein copulas and an automatic MCMC algorithm capable of performing full posterior inference on the copula and marginal distributions. Joint work with Nicolas Kuschinski. N. Kuschinski’s research is supported by Fondecyt grant 3210553. A. Jara’s research is supported by Fondecyt grant 1220907. -
Toggle Item2023-01-19 - Richard Hahn - Feature Selection for Causal InferenceAbstract:
Feature selection for the estimation of causal effects is a challenging and subtle problem about which there is a wealth of --- sometimes contradictory! --- folk wisdom. In this talk I will analyze the problem in the simplified setting of discrete control variables, where it is possible to provide rigorous statements of various folk-theorems, allowing us to better understand their practical implications. My presentation will use elements of three distinct methodological frameworks for causal inference: potential outcomes, causal diagrams, and structural equations. Each framework emphasizes unique aspects of the feature selection problem. -
Toggle Item2022-12-01 - Daniel Bauer - Calculation of the Enterprise Capital via Least-squares Monte Carlo – Regress Now or Later?Abstract:
There has been substantial interest in applied research and practice on the estimation of risk capital within enterprise risk models via Monte Carlo procedures. The difficulty arises from characterizing the distribution of the company’s available capital, which takes the form of a conditional expected value of the firm’s future cash flows given the market and company state at the risk horizon.
One prevalent approach, referred to as regress-now, projects the company’s realized cash flows and regresses the resulting discounted values against a set of basis functions of the Markov states at the risk horizon. Another approach seeks to approximate the cash flows based on functions of realizations of a class of tractable processes, and in a second step calculates the conditional expected value of the approximating functional combination at the risk horizon. This approach is referred to as regress-later. Different authors have documented advantages and disadvantages to both approaches in specific situations.
This paper makes two contributions. The first is theoretical. We show that the left and right singular functions of the valuation operator mapping cash flows to capital present robust approximating functions for the regress-now and regress-later approach, respectively. We show superior performance of the approaches when relying on these functions. The second contribution is applied. We characterize situations in which either of the approaches results in a superior performance. In essence, the regress-later approach performs better when underlying theoretical conditions are met, whereas the regress-now approach is the more robust approach. -
Toggle Item2022-11-17 - Jake Rhodes - Random Forest Geometry-Preserving Proximities with Applications in Manifold LearningAbstract:
Many machine learning algorithms use distances or similarities between observations to make predictions, cluster similar data, visualize patterns, or generally explore the data. Most distance or similarity measures do not incorporate known data labels, or they generate metrics that exaggerate separation between data points of different classes. Instead, we leverage a popular algorithm used for making data-driven predictions, known as random forests, to naturally incorporate data labels (either continuous or categorical) into similarity measures known as random forest proximities. In this talk, I will discuss the development of a new proximity definition that can be used to reconstruct the random forest’s predictions. We call these Random Forest-Geometry- and Accuracy-Preserving proximities, or RF-GAP. I will discuss improvements in common proximity-based applications, including imputation, outlier detection, and data visualization. I will also introduce a random forest proximity-based manifold learning approach that can be used to generate 2- or 3-dimensional data representations as a tool to visually explore the data. I show quantitatively and qualitatively that this method surpasses other existing methods for this task. -
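A classical (unweighted) random forest proximity can be computed by counting shared terminal nodes; the Python sketch below uses scikit-learn's apply() to do exactly that. Note this is the traditional proximity definition, not the RF-GAP construction discussed in the talk.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

leaves = forest.apply(X)                 # (n_samples, n_trees) leaf indices
n, n_trees = leaves.shape
proximity = np.zeros((n, n))
for t in range(n_trees):
    same_leaf = leaves[:, t][:, None] == leaves[:, t][None, :]
    proximity += same_leaf
proximity /= n_trees                     # fraction of trees in which a pair shares a leaf

# proximity[i, j] near 1 means the forest treats observations i and j as very similar
print(proximity[0, :5].round(2))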
Toggle Item2022-11-10 - John Kidd - Expectation-Maximization Algorithms and Mediation AnalysisAbstract:
Determining the relationship between two variables is the aim of many statistical procedures. When there are one or more additional variables between the independent variable and the response, mediation analysis is often employed. However, these middle variables, called mediators, may be missing from study data. This scenario is particularly prevalent in genetic studies where the relationship between a specific genetic variant and a phenotype, such as a disease or ailment, is of interest. Omic variables such as RNA and protein expression mediate these relationships but can be difficult to measure. In this talk, I will discuss a new method that uses likelihood methods to find unbiased effect estimates with improved efficiency in the presence of multiple mediators when mediator information may be missing. -
Toggle Item2022-11-03 - Christopher Challis - Consumer Modeling at Wells FargoAbstract:
Wells Fargo employs hundreds of statisticians and data scientists. Chris Challis will give an overview of focus areas across the bank, then dive into details of consumer-facing statistical models, including marketing, personalization, pricing, forecasting, and optimization. There will also be time for questions and discussion on the daily work of members of his team, and the types of skills that he has seen lead to success at Wells Fargo and other companies. -
Toggle Item2022-10-27 - Jonathan Chipman - Experimental Design with Sequential Rematched RandomizationAbstract:
This two-part talk overviews (i) the University of Utah PhD program in Population Health Sciences in the Division of Biostatistics and (ii) novel developments in balancing covariates between treatment arms in randomized trials. Randomization plays a foundational role in causal hypothesis testing. It eliminates systematic confounding and provides a ‘reasoned basis for inference’ through randomization-based inference (RBI). However, chance imbalances in key baseline covariates can still occur and complicate interpretation of the estimated causal effect (such was the case in a trial reviewed by the Food and Drug Administration this October). Sequential Matched Randomization (SMR) reduces this risk by randomizing treatment within participants who match closely on key baseline covariates. We refine SMR to use a dynamic matching criterion, allow batch enrollment, and allow matches to break and rematch throughout enrollment. Through extensive simulation using real trial data, we compare SMR with and without our extensions in terms of average covariate balance and estimator efficiency (via RBI power and variance of estimators). Our extensions further decreased the risk of covariate imbalances and increased efficiency 1.3-fold relative to randomization without covariate adjustment. We’ll show a case where adjusting for covariates in the design has as much power as adjusting for covariates in a regression model, though with fewer assumptions. -
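To make the matching idea concrete, here is a simplified Python sketch, not the SMR algorithm itself: participants are greedily paired by Mahalanobis distance on baseline covariates, and treatment is then randomized within each pair. All data and names are hypothetical.

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))                  # baseline covariates for 40 participants
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
dist = cdist(X, X, metric="mahalanobis", VI=cov_inv)
np.fill_diagonal(dist, np.inf)

assigned = np.full(len(X), -1)
unmatched = set(range(len(X)))
while len(unmatched) > 1:
    i = min(unmatched)
    j = min(unmatched - {i}, key=lambda k: dist[i, k])   # greedy nearest match
    assigned[i], assigned[j] = rng.permutation([0, 1])   # randomize within the pair
    unmatched -= {i, j}
print(assigned)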
Toggle Item2022-10-20 - Abel Rodriguez - Analyzing Cognitive Social StructuresAbstract:
Cognitive social structures (CSSs) are a type of social network data that appear in psychology, sociology, and industrial organization applications. CSSs involve a collection of networks, each one of them reflecting the perceptions of an individual about the interactions among all members of the community. This makes cognitive social structures richer than other forms of social networks that only reflect the perspective of a single, “objective” observer.
Traditionally, cognitive social structures have been analyzed either by collapsing the multiple networks into some sort of "consensus" network or by analyzing different "slices" independently. In this talk we discuss two classes of models used to generate insights from CSS data. In the first part of the talk, we use a hierarchical embedding of the networks into a continuous latent space along with carefully constructed zero-inflated priors to explore the perceptual agreement between individuals and the group consensus. In the second part of the talk we discuss a novel class of hierarchical blockmodels that uses Chinese Restaurant Process priors and fragmentation-coagulation processes to identify motifs that are conserved across all observers. This is joint work with Juan Sosa, currently at Universidad Nacional de Colombia. -
Toggle Item2022-10-13 - Rebecca Nichols - Go Forth to Serve: Statistics Service Opportunities to Enhance your Career and Make a Difference in the WorldAbstract:
Interested in making a difference in the world through statistics? Finding joy in your profession and fulfillment through service? This presentation will discuss service opportunities and ideas for how you can be involved while enhancing your career and making a difference in the profession, your community, and even the world. What are the resources available to you? How can service enhance your career path? How can this fit into a balanced life? These thoughts and more for students and faculty. -
Toggle Item2022-09-29 - Volodymyr Melnykov - Finite Mixture Modeling in StylometryAbstract:
Stylometry is the area of linguistics that deals with the quantification of language style. Applications of stylometry can be found in plagiarism detection, authorship identification, speech style recognition, association analysis of text samples, as well as the prediction of the chronological order of several written fragments. Among the most typical statistical techniques used in stylometric analysis are goodness-of-fit tests, analysis of contingency tables, paired t-tests, and other traditional but sometimes overly naive and simplistic procedures. The main focus of the talk will be on detecting the chronological order of several text fragments written by Plato. -
Toggle Item2022-09-22 - Brad Barney - Clinical ResearchAbstract:
In our search to determine best practices for clinical care, numerous resources are devoted to assessing the safety and efficacy of medical interventions. In this presentation, I will discuss pragmatic aspects of operating and analyzing clinical trials and observational studies to identify best practices. I will give an overview of various study designs and their strengths and limitations. I will discuss the range of statistical complexity commonly encountered in clinical research. I will also review some of the studies for which I have provided assistance, as well as findings if they are already publicly available. Although some of the presented analyses were conducted with financial support from the National Institutes of Health, the views do not necessarily represent those of the NIH or any other government entity. -
Toggle Item2022-09-15 - Hans Mueller - Statistics for Random ObjectsAbstract:
Random objects, i.e., random variables that take values in a separable metric space, pose many challenges for statistical analysis, since vector operations are not available in general metric spaces. Examples include random variables that take values in the space of distributions, covariance matrices and covariance surfaces, networks and other spaces. The increasing prevalence of samples of random objects has stimulated the development of metric statistics, a collection of statistical tools to characterize, infer and relate samples of such random objects. For the geometric exploration of random objects a useful tool is the depth profile; the depth profile for any given object is the distribution of distances to all other objects. These distributions can be subjected to further statistical analysis, leading to notions of transport ranks and centrality. Theory concerns the convergence of the empirical estimates to the population targets. For random objects in geodesic spaces, a recent development is transport regression, based on a transport algebra, where predictors and responses are transports. Parts of this presentation are based on joint work with Yaqing Chen, Paromita Dubey, Alexander Petersen and Changbo Zhu. -
Toggle Item2022-09-08 - Joey Stanley - Modeling Change in American English AccentsAbstract:
Gone are the days when a linguist could analyze spoken language using their ears alone. Contemporary research on vowel pronunciation relies heavily on statistical modeling of acoustic measurements extracted from recordings of speech. In this presentation, I showcase some aspects of a typical quantitative analysis in sociophonetic research, incorporating the latest developments in linguistic methodology. Specifically, I will show how linear mixed-effects models and generalized additive mixed-effects models can be used to analyze and model vowel pronunciation. In addition to using data from my own voice, I will analyze data representing Southern American English and Utah English and will show how they have changed across 100 years. -
Toggle Item2022-03-17 - Paromita Dubey - Functional Models for Time Varying Random ObjectsAbstract
In recent years, samples of time-varying object data such as time-varying networks that are not in a vector space have been increasingly collected. These data can be viewed as elements of a general metric space that lacks local or global linear structure and therefore common approaches that have been used with great success for the analysis of functional data, such as functional principal component analysis, cannot be applied directly.
In this talk, I will propose some recent advances along this direction. First, I will discuss ways to obtain dominant modes of variations in time varying object data. I will describe metric covariance, a new association measure for paired object data lying in a metric space (Ω, d) that we use to define a metric auto-covariance function for a sample of random Ω-valued curves, where Ω will not have a vector space or manifold structure. The proposed metric auto-covariance function is non-negative definite when the squared metric d^2 is of negative type. The eigenfunctions of the linear operator with the metric auto-covariance function as the kernel can be used as building blocks for an object functional principal component analysis for Ω-valued functional data, including time-varying probability distributions, covariance matrices and time-dynamic networks. Then I will describe how to obtain analogues of functional principal components for time-varying objects by applying weighted Fréchet means which serve as projections of the random object trajectories in the directions of the eigenfunctions, leading to Ω-valued Fréchet integrals. This talk is based on joint work with Hans-Georg Müller. -
Toggle Item2022-03-03 - Matt Morgan/Nate Hawkins/Garrett Duncan - Graduate Students PresentationSpeaker 1: Matt Morgan
Title: Implementing Minibatch Processing In Spatial Analyses
Abstract: Working with spatial statistics can often be computationally intensive, and when performing a spatial analysis on large amounts of data, the computational burden grows rapidly. Fitting a model and cross-validating can take days or weeks even when utilizing more RAM and an increased number of CPUs. Given the increasing size of datasets in recent years and the computational complexity that comes with spatial statistics, much research has been done to reduce the computational burden of spatial analyses. One method of reducing computational complexity that has been utilized in other areas is minibatching. Minibatching is a process where a subset of the data is used to approximate a computation on the whole dataset without having to use the whole dataset. This study aims to incorporate minibatching into spatial analyses so as to reduce the amount of data that must be used at each step of model fitting. The hope is that minibatching will allow spatial analyses to be performed on large datasets with less computational strain.
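As a generic illustration of minibatching (outside the spatial setting of the project), the Python sketch below estimates regression coefficients by minibatch stochastic gradient descent, touching only a random subset of the data at each iteration; the data and step sizes are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
n, p = 100_000, 5
X = rng.normal(size=(n, p))
beta_true = np.arange(1, p + 1, dtype=float)
y = X @ beta_true + rng.normal(size=n)

beta = np.zeros(p)
batch_size, lr = 256, 0.01
for step in range(2000):
    idx = rng.integers(0, n, size=batch_size)            # random minibatch of rows
    grad = X[idx].T @ (X[idx] @ beta - y[idx]) / batch_size
    beta -= lr * grad
print(beta.round(2))                                     # approximately [1, 2, 3, 4, 5]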
Speaker 2: Nate Hawkins
Title: Real-Time Prediction Using Covariate-Informed Product Partition Models
Abstract: Covariate-informed product partition models excel at modeling data that are non-linear in nature. These models exploit available covariates by increasing the probability of co-clustering for two individuals with similar covariate values. In practice, however, prediction using these models can be computationally expensive and constrained to specific model parameters. We develop a covariate-based prediction algorithm that runs at compiled speed and allows for changes in model parameters to improve fit. We show the utility of this algorithm using data from the 2018 men’s world volleyball championship. First, we train a product partition model using team performance statistics from the round robin stage of the tournament to predict wins. We then use our algorithm to predict the win probability after every point of every match in the knockout stages.
Speaker 3: Garrett Duncan
Title: COVID-19 Spread Feature Engineering
Abstract: This research investigated potential methods for modeling the spread of COVID-19 in higher education. Basic demographic and student schedule information was used to engineer features such as the number of credits taken with students who tested positive, the median distance to the nearest positive case for students living near the university, the level of risk brought by in-person classes for each student, and other additional features. These features were calculated on a week-by-week or semester-by-semester basis. Different features were also aggregated at the individual class section and overall course level. A dashboard of these features was shown to university administration to help monitor virus spread. We consider these features and make recommendations for feature engineering for future models of virus spread in higher education communities. -
Toggle Item2022-02-10 - Nathan Bean - Bayesian Multi-Regional Clinical Trials Using Model AveragingAbstract:
Sponsors often rely on multi-regional clinical trials (MRCTs) to introduce new treatments more rapidly into the global market. Many commonly used statistical methods do not account for regional differences, and small regional sample sizes frequently result in lower estimation quality of region-specific treatment effects. Recent ICH E17 guidelines suggest the consideration of methods that allow for information borrowing across regions to improve estimation. In response to these guidelines, we develop novel methodology to estimate global and region-specific treatment effects from MRCTs with either a single continuous or time-to-event endpoint using Bayesian model averaging (BMA), and we discuss how posterior model probabilities can be used to assess the consistency of treatment effects across regions. We show through simulation studies that the proposed modeling approach results in lower MSE than traditional fixed effects models while preserving the rejection rate of the global treatment effect. We then apply the BMA approach to data from the LEADER trial, an MRCT designed to evaluate the cardiovascular safety of an anti-diabetic treatment. -
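The role of posterior model probabilities can be sketched with a BIC-based approximation to Bayesian model averaging; the Python example below (simulated data and a deliberately simple pair of candidate treatment-effect models, not the MRCT methodology of the talk) weights the two fits by their approximate posterior probabilities.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
region = rng.integers(0, 3, size=n)                      # hypothetical region labels
treat = rng.integers(0, 2, size=n)
y = 1.0 * treat + 0.3 * (region == 1) * treat + rng.normal(size=n)

# Model 1: common treatment effect; Model 2: region-specific treatment effects
X1 = sm.add_constant(np.column_stack([treat]))
X2 = sm.add_constant(np.column_stack([treat, treat * (region == 1), treat * (region == 2)]))
fits = [sm.OLS(y, X).fit() for X in (X1, X2)]

bics = np.array([f.bic for f in fits])
weights = np.exp(-0.5 * (bics - bics.min()))
weights /= weights.sum()                 # approximate posterior model probabilities
effects = np.array([f.params[1] for f in fits])   # treatment coefficient in each model
print(weights.round(3), np.dot(weights, effects).round(3))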
Toggle Item2022-01-27 - Carlos Carvalho - Searching for Dusty Corners: Understanding the Prediction of the Cross Section of ReturnsAbstract:
Bayesian nonparametric regression models will be presented in order to predict equity returns from various characteristics. We will focus on model modifications that incorporate economic information and time variability, and we will explore ways to develop interpretable summaries of otherwise black-box strategies. -
Toggle Item2022-01-20 - William Christensen - A Statistician's Perspective on Climate Change AssessmentAbstract:
In this talk, we consider statistical perspectives on various aspects of climate change assessment. We briefly discuss: (1) a review of major findings in the Sixth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC), (2) how uncertainty is discussed in IPCC assessment reports, (3) how the notion of uncertainty can be misunderstood within society and even the scientific community, and (4) how uncertainty can be more rigorously integrated in scientific (and economic) decision making. Several examples will illustrate how uncertainties can be integrated into an analysis of climate change data, including the modeling of ice accumulation in Antarctica, climate model ensemble data in North America and in High Mountain Asia, sea level rise off the US Atlantic Coast, and precipitation trends in Antarctica. -
Toggle Item2022-01-13 - Jacob Andros/Camilla Handley/Travis Andersen - Masters Students PresentationJacob Andros
Title: Search Algorithms and Loss Functions for Bayesian Feature Allocation Models
Abstract: Feature allocation models are a Bayesian means to flexibly borrow information among items where each item possesses zero or more features. These models are typically fit with Markov chain Monte Carlo, yet there has been relatively little work done on summarizing a feature allocation posterior distribution. We propose a new loss function between two feature allocations and suggest a search algorithm to minimize the expected loss and summarize a feature allocation distribution. This loss is easy to compute and allows for comparison of matrices with differing numbers of features. With this loss function and a set of feature allocation samples from MCMC, our search algorithm uses principles of the assignment problem to minimize the expected loss without being restricted to only the samples present in the Markov chain.
Camilla Handley
Title: A Simulation-Based Approach to Controlling False Discovery Rate and COVID-19 Hotspot Detection
Abstract: In this project, we aim to answer the question of whether in-class transmission is happening on campus in order to regulate restrictions and keep students and faculty safe. Using demographic information, we built a logistic regression model that produces an estimated probability of testing positive for each student. Using a simulation-based approach, we developed a method that identifies possible spots of in-class transmission on campus from these probabilities. We simulated semesters under the null hypothesis (that there is no in-class transmission) and compared them to the observed positivity rates to find a p-value for each group (section, course, or major). Given that our p-values are dependent and non-uniform, we developed a simulation-based method to control the false discovery rate in this setting.
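A stripped-down version of the simulation step might look like the following Python sketch (hypothetical probabilities and counts, not the campus data): a section's positive count is simulated many times under the no-transmission null, and the observed count is compared with the simulated distribution to obtain a p-value.

import numpy as np

rng = np.random.default_rng(0)
n_students, n_sims = 60, 10_000
p_student = rng.uniform(0.01, 0.10, size=n_students)    # modeled positivity probabilities
observed_positives = 9                                   # observed count in one section

# Simulate the section's positive count under the null of no in-class transmission
sim_counts = rng.binomial(1, p_student, size=(n_sims, n_students)).sum(axis=1)
p_value = (np.sum(sim_counts >= observed_positives) + 1) / (n_sims + 1)
print(p_value)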
Travis Andersen
Title: A Permutation-Based Confidence Distribution for Rare-Event Meta-Analysis
Abstract: Confidence distributions (CDs), which provide evidence across all levels of significance, are receiving increasing attention, especially in meta-analysis. Meta-analyses allow independent study results to be combined to produce one overall conclusion and are particularly useful in public health and medicine. For studies with binary outcomes that are rare, traditional meta-analysis methods often fail. Zabriskie et al. develop a permutation-based method to analyze such data when study treatment effects vary beyond what is expected by chance. We show how this method can be considered a CD and develop new metrics to assess a CD's relative performance. We compare this permutation-based CD with other CDs for rare-event meta-analyses. -
Toggle Item2021-12-09 - Andrew Whetten - Localized Mutual Information Monitoring of Pairwise Associations in Animal MovementAbstract
Advances in satellite imaging and GPS tracking devices have given rise to a new era of remote sensing and geospatial analysis. In environmental science and conservation ecology, the biotelemetric data recorded are often high-dimensional (spatially and/or temporally) and functional in nature, meaning that there is an underlying continuity to the biological process of interest. GPS tracking of animal movement is commonly characterized by irregular time-recording of animal position, and the movement relationships between animals are prone to sudden change. In this paper, we propose a measure of localized mutual information (LMI) to derive a correlation function for monitoring changes in the pairwise association between animal movement trajectories. The properties of the LMI measure are assessed analytically and by simulation under a variety of circumstances. Advantages and disadvantages of the LMI measure are assessed, and alternate measures of LMI are proposed to handle potential disadvantages. The proposed measure of LMI is shown to be an effective tool for detecting shifts in the correlation of animal movements and in seasonal/phasal correlation structure. -
Toggle Item2021-12-02 - Zach Baucom - State Space Models for Neuropsychological DataAbstract:
In the research area of cognitive decline, longitudinal analysis plays a critical role in understanding how and why a disease may progress. However, the predominant analysis techniques often fail to provide both (1) a simple effect interpretation and (2) proper accommodation of subject-level heterogeneity. This has led to the development and use of State Space Models (SSMs) in scenarios typical of traditional Linear Mixed Effect Models (LMEMs) with joint model applications. First, I develop an SSM along with estimation procedures for modeling cognitive decline. Estimation procedures include a traditional full-likelihood approach, a partitioned SSM, and a Bayesian SSM. These models are compared to LMEMs for longitudinal cognitive data using simulation and real data analyses. The SSM framework is then applied to understand the effect of the APOE e4 allele on cognitive trajectory. As many of the popular cognition studies offer a battery of tests to gain broader insight into different aspects of cognition (e.g., memory), the SSM is being extended to a joint model and factor analysis. -
Toggle Item2021-11-18 - Kimberly Kaufeld - Integrating public health surveillance and environmental data to model presence of "Histoplasma" in the United StatesAbstract
In the United States, the true geographic distribution of the environmental fungus Histoplasma capsulatum remains poorly understood but appears to have changed since it was first characterized. Histoplasmosis is caused by inhalation of the fungus and can range in severity from asymptomatic to life-threatening. Due to limited public health surveillance and underdetection of infections, it is challenging to directly use reported case data to characterize spatial risk. Using monthly and yearly county-level public health surveillance data and various environmental and socioeconomic characteristics, we use a spatio-temporal occupancy model to estimate latent, or unobserved, presence of H. capsulatum, accounting for imperfect detection of histoplasmosis cases. We estimate areas with higher probabilities of the presence of H. capsulatum in the East North Central states around the Great Lakes, reflecting a northward shift of the endemic region relative to previous estimates. This approach also mitigates challenges related to limited reporting. This work aims to help inform future surveillance needs, clinical awareness, and testing decisions for histoplasmosis. -
Toggle Item2021-11-04 - Brad Ferguson - How Data Science Unravels Social MediaAbstract:
Predicting how many views, clicks, and conversions an Instagram Story gets is an important tool in the influencer marketing space. It allows businesses to prioritize which influencers will perform well enough for a specific brand and makes budgets and planning more efficient. In this talk, I discuss the successes and challenges in building predictive models for Instagram and YouTube and what lessons can be learned when applying machine learning models in the real world. I also discuss what I have found makes a strong data scientist and what things to look for in your next employer to set you up for a successful career in data science. -
Toggle Item2021-10-28 - David Dahl - Shrinking a Partition Distribution Towards a Baseline Partition, With Applications to Dependent PartitionsCo-Authors:
Richard L. Warr
Thomas P. Jensen
Abstract:
Random partition models are closely related to Bayesian nonparametric models and provide flexible means to borrow strength in Bayesian data analysis. Parsimony is obtained by postulating that observations share model parameters with other observations belonging to the same cluster in a latent partition. In many contexts, prior knowledge regarding the partitioning of observations may be available and we may desire to use this baseline partition information to influence the prior partition distribution. To this end, we propose the shrinkage partition distribution (SPD), which shrinks any partition distribution towards a baseline partition. Recognizing that prior knowledge may be stronger for some items than others, our formulation allows for item-specific shrinkage towards the baseline partition. Further, our approach has a tractable normalizing constant, permitting posterior inference on the shrinkage and parameters associated with the baseline distribution. We explore the properties of our proposed distribution and other comparable distributions. We also show how the SPD can hierarchically model a collection of random partition distributions and can also model time-dependent random partitions. -
Toggle Item2021-10-21 - Nathan Sandholtz - An Inverse Optimization Analysis of the Fourth Down Decision in FootballAbstract:
The fourth down decision in football has been primarily studied as an optimization problem; using win probability as the criterion to optimize, analysts estimate optimal decisions for every fourth down situation. Prescriptions from these models have been publicly available for decades, informed by increasingly sophisticated win probability models in recent years. Despite this availability, NFL coaches’ observed fourth down decisions have remained distant from analysts’ recommendations. Inverse optimization provides a mathematical framework to make sense of the gap between coaches’ decisions and analysts’ prescriptions. Leveraging this paradigm, we assume that the coaches’ observed decisions are optimal but that the risk preferences governing their decisions are unknown. Our goal is to model and infer these latent risk preferences such that when the resulting optimization model is solved, it yields their observed decisions as optimal (or minimally suboptimal).
To this end, we model a football game as a Markov decision process, the components of which we estimate from regular season NFL play-by-play data from 2014 through 2020. Rather than assuming that win probability governs the coaches' decisions, we treat the optimality criterion as an unknown parameter. Using the quantile function to parameterize risk, we estimate which quantile-optimal policy yields the coaches’ observed decisions as minimally suboptimal. We find that coaches almost always exhibit conservative risk preferences, but that they have different risk preferences depending on various features of the fourth down situation. -
Toggle Item2021-10-21 - Jared Fisher - A Bayesian Semiparametric Approach to Treatment Effect Variation with NoncomplianceAbstract:
Estimating varying treatment effects in randomized trials with noncompliance is inherently challenging since variation comes from two separate sources: variation in the impact itself and variation in the compliance rate. In this setting, existing Frequentist and ML-based methods are quite flexible but are highly sensitive to the so-called weak instruments problem, in which the compliance rate is (locally) close to zero, and require pre-specifying subgroups of interest. Parametric Bayesian approaches, which account for noncompliance via imputation, are more robust in this case, but are much more sensitive to model specification. In this paper, we propose a Bayesian semiparametric approach that combines the best features of both approaches. Our main contribution is to embed Bayesian Additive Regression Trees (BART) in a broader Bayesian noncompliance framework in which we repeatedly impute individuals’ compliance types. This allows us to flexibly estimate varying treatment effects among Compliers while mitigating the weak instruments problem. We then apply our method to the Oregon health insurance experiment and show that analyses that only focus on a single source of variation can miss important heterogeneity. -
Toggle Item2021-10-14 - Oliver Johnson - Inference, Uncertainty Quantification, and Uncertainty Propagation for Grain Boundary Structure-Property ModelsAbstract:
Brandon D. Snow, Sterling G. Baird, David E. Page, Kathryn F. Varela, Christian Kurniawan, David T. Fullwood, Eric R. Homer, Oliver K. Johnson
We present a non-parametric Bayesian approach for developing structure-property models for grain boundaries (GBs) with built-in uncertainty quantification (UQ). Using this method we infer a structure-property model for H diffusivity in [100] tilt GBs in Ni at 700K based on molecular dynamics (MD) data. We then leverage these results to perform uncertainty propagation (UP) for mesoscale simulations of the effective diffusivity of polycrystals to investigate the interaction between structure-property model uncertainties and GB network structure. We observe a fundamental interaction between crystallographic correlations and spatial correlations in GB networks that causes certain types of microstructures (those with large populations of J2- and J3-type triple junctions) to exhibit intrinsically larger uncertainty in their effective properties. We also investigate the influence of different types of input data (bicrystal vs. polycrystal) and observe evidence of a transition between a data-rich regime in which bicrystals yield more accurate results to a data-limited regime in which polycrystals provide improved inferences. -
Toggle Item2021-10-07 - Candace Berrett - A Bayesian Change Point Model to Identify Local Temperature Changes Related to UrbanizationAbstract:
Changes to the environment surrounding a temperature measuring station can cause local changes to the recorded temperature that deviate from regional temperature trends. This phenomenon -- often caused by construction or urbanization -- occurs at a local level. If these local changes are assumed to represent regional or global processes, they can have significant impacts on historical data analyses. These changes or deviations are generally gradual, but can be abrupt, and arise as construction or other environmental changes occur near a recording station. We propose a methodology to examine if changes in temperature trends at a point in time exist at a local level at various locations in a region. Specifically, we propose a Bayesian change point model for spatio-temporally dependent data where we select the number of change points at each location using a "forwards" selection process using deviance information criterion (DIC). We then fit the selected model and examine the linear slopes across time to quantify the local changes in long-term temperature behavior. We show the utility of this model and method using a synthetic data set and observed temperature measurements from eight stations in Utah consisting of daily temperature data for 60 years. -
Toggle Item2021-09-30 - Elizabeth Hammond - The Role of Statistics in Evidence Based MedicineAbstract:
Throughout my long career in medical research, I have seen firsthand the critical role of statistical expertise in refining hypotheses, creating impactful research proposals, and refining or refuting research conclusions. Meta-analyses and well-conducted research studies have had a major impact on patient care. Major questions that have changed patient care have only been definitively answered by the collaboration of statisticians and medical professionals.
Some of these seminal questions answered in this way are:
- How is the prognosis of cancer patients affected by disease spread? Which cancers are lethal? What treatments are effective and which treatments should be abandoned?
- In coronary artery disease, do all patients require surgery? How can damage from a heart attack be mitigated without surgery?
- In the COVID-19 pandemic, what are the serious risk factors of death? Are vaccines effective in preventing death? How long does immunity last?
Without statistical expertise applied to these questions, definitive answers would not have been possible. During this seminar, I will describe important research in which statistical collaboration has played a major role.
-
Toggle Item2021-09-16 - David Wingate - Harnessing Algorithmic Fidelity in Large-Scale Language Models for Social ScienceAbstract:
Large-scale language models, such as BERT, T5 or GPT3, have revolutionized natural language processing because of their ability to capture nuanced structure in the syntax and semantics of natural language. However, the text they generate can exhibit problematic biases, such as racism, sexism, populism, extremism, and more.
In our recent work, we ask: instead of mitigating such bias, can we harness it for good? Can we treat it as signal, instead of noise? Our work demonstrates that the patterns described by these large-scale language models exhibit strong correlations with actual human opinions, value judgements, cultural experiences, assumptions, and thought processes. We term this "algorithmic fidelity".
To the extent that language models accurately reflect human thought processes, then they can be studied as an "in silico brain." In other words, researchers from psychology, neuroscience, cognitive science, or political science could potentially analyze these models and their output as surrogates for humans, performing new kinds of experiments that would never be possible with actual human subjects.
In this talk, I will discuss some of our recent efforts to use language models in the service of social sciences, including demonstrating how language models can take virtual political science surveys, and the potential to use language models to code free-form text.
Joint work with Chris Rytting, Lisa Argyle, Joshua Gubler, Ethan Busby, Taylor Sorensen, and Nancy Fulda. -
Toggle Item2020-03-19 - Marie Tuft - Interpretable Localized Time-Frequency Analysis via Penalized Reduced Rank RegressionAbstract:
Spectral analysis of nonstationary biological processes such as heart rate variability (HRV) and EEG poses a unique challenge: localized, accurate and interpretable descriptions of both frequency and time are required. By reframing this question in a reduced rank regression setting, we propose a novel approach that produces a low-dimensional and empirical basis that is localized in bands of time and frequency. To estimate this frequency-time basis, we apply penalized reduced rank regression with singular value decomposition to the localized discrete Fourier transform. An adaptive sparse fused lasso penalty is applied to the left and right singular vectors, resulting in low-dimensional measures that are interpretable as localized bands in time and frequency. Asymptotic properties of this method are derived, and it is shown to provide a consistent estimator of the time-varying spectrum. Simulation studies are used to evaluate its performance, and its utility in practice is illustrated through the analysis of HRV during sleep. -
Toggle Item2020-03-12 - Julia Silge - Understanding Word EmbeddingsAbstract:
Modern natural language processing frameworks (including word2vec, GloVe, fastText, ULMFIT, and more) depend on word embeddings, a way of statistically modeling language where words or phrases are mapped to vectors of real numbers. In this talk, we will build an understanding of word embeddings by investigating how to generate them using count-based statistics and dimensionality reduction, then learn how to make use of pre-trained embeddings based on enormous datasets. Finally, we will explore the ethical issues involved in using word embeddings and how they can amplify systemic and historical bias. -
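The count-based route mentioned in the abstract can be sketched in a few lines of Python (a toy corpus, purely for illustration): build a word-word co-occurrence matrix within a small context window, then reduce its dimension with a truncated SVD to obtain embedding vectors.

import numpy as np
from collections import Counter
from itertools import combinations

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog played",
]
window_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for i, j in combinations(range(len(tokens)), 2):
        if abs(i - j) <= 2:                      # small context window
            window_counts[(tokens[i], tokens[j])] += 1
            window_counts[(tokens[j], tokens[i])] += 1

vocab = sorted({w for pair in window_counts for w in pair})
index = {w: k for k, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(vocab)))
for (w1, w2), count in window_counts.items():
    M[index[w1], index[w2]] = count

# Truncated SVD of the (log-scaled) co-occurrence counts gives low-dimensional embeddings
U, s, _ = np.linalg.svd(np.log1p(M))
embeddings = U[:, :3] * s[:3]
print(vocab[:5], embeddings.shape)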
Toggle Item2020-03-05 - Tom Greene - Validation and Application of Surrogate Endpoints in Chronic Kidney Disease (CKD) Clinical TrialsAbstract:
Randomized trials of CKD treatments traditionally use clinical events late in CKD progression as end points. This requires costly studies with large sample sizes and long follow-up times. Recently, there has been much interest in the validation of surrogate endpoints to enable the use of smaller trials with shorter follow-up. In this talk, I will overview methodological challenges for the validation of surrogate endpoints, in general and specifically in CKD clinical trials. I will then describe recent work to evaluate potential surrogate endpoints using an evolving library of patient level data sets that currently includes 47 previously conducted CKD clinical trials. Finally, I will overview new ideas for applying validated surrogate endpoints in the design and analysis of newly conducted randomized trials. -
Toggle Item2020-02-27 - David Dahl - Two for One: 1. Focal Random Partition Distribution and 2. Optimization of Clustering CriteriaAbstract:
Random partition models, such as the Chinese restaurant process, allow a Bayesian model to flexibly borrow strength. We present two related working papers on random partition models. First, while many partition priors are exchangeable, we propose a nonexchangeable prior based on a focal partition, a Bayesian's prior guess for the unknown partition. We show how our approach modifies the Chinese restaurant process so that partitions that are similar to the focal partition have higher probability. There is a weight parameter that varies between -1 and infinity, where 0 corresponds to the original Chinese restaurant process and infinity yields a point mass distribution at the focal partition. In the later part of the talk, we present a novel stochastic search algorithm to minimize the posterior expected loss of a clustering criterion based on a pairwise similarity matrix. Several loss functions for clustering have been proposed, but the minimization of the expected loss is challenging given the large size and discrete nature of the search space. Our approach is a stochastic search based on a series of micro-optimizations performed in a random order. Our approach is embarrassingly parallel.
Co-Authors:
Richard Warr
Thomas Jensen
Devin Johnson
Peter Müller
Website:
https://statistics.byu.edu/directory/dahl-david-b -
Toggle Item2020-02-20 - Paul Sabin - Estimating Player Value in Football Using Plus-Minus ModelsThe use of statistical methods in sports has exploded during the past decade. Football, America’s most popular sport, has lagged behind in the adoption of “Analytics.” A framework for calculating the expected points of each play was introduced by former BYU Quarterback and statistics student Virgil Carter and Robert Machol in 1971. Thirty-five years later, this framework was reintroduced by Brian Burke and opened up the possibilities of analysis in the sport. Until recently, calculating the value of football players' on-field performance has been limited to scouting methods and quarterbacks. Adjusted Plus-Minus (APM) models have long been used in other sports, most notably basketball (Rosenbaum (2004), Kubatko et al. (2007), Winston (2009), Sill (2010)) to estimate each player’s value by accounting for those in the game at the same time. More recent methods have found ways to incorporate plus-minus models in other sports such as Hockey (Macdonald (2011)) and Soccer (Schultze and Wellbrock (2018) and Macdonald (2011)). These models are especially useful in coming up with results-oriented estimation of each player’s value. In American Football, it is difficult to estimate every player’s value since many positions, such as offensive linemen, have no recorded statistics. While player-tracking data in the NFL is allowing new analysis, such data does not exist in other levels of football such as the NCAA. Using expected points, I provide a model framework that solves many of the traditional issues APM models face in football. This methodology allows the models to estimate the value of each position in each level of the sport. These coarse models will be essential to pair with fine-level player tracking models in a multiscale framework in the future.
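The core of an adjusted plus-minus model is a (ridge) regression of play outcomes on player indicator columns; the Python sketch below uses synthetic lineups and outcomes, not the expected-points framework developed in the talk.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_players, n_plays = 30, 5000
true_value = rng.normal(size=n_players)          # hidden per-player value (hypothetical)

# Each row: +1 for offensive players on the field, -1 for defensive players
X = np.zeros((n_plays, n_players))
for row in X:
    offense = rng.choice(n_players, size=5, replace=False)
    defense = rng.choice(np.setdiff1d(np.arange(n_players), offense), size=5, replace=False)
    row[offense], row[defense] = 1.0, -1.0
# Outcome: points added on the play, driven by the players involved plus noise
y = X @ true_value + rng.normal(scale=3.0, size=n_plays)

apm = Ridge(alpha=10.0).fit(X, y)
print(np.corrcoef(apm.coef_, true_value)[0, 1])  # how well player values are recovered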
-
Toggle Item2020-02-13 - Greg MacFarlane - Using Mobile Device Data to Measure Park Choice, Access, and HealthParks provide benefits to the people who can access them, but previous research attempts to quantify these benefits have used largely arbitrary means to measure access. In two papers currently under review, my coauthors and I have applied a measurement of access based in econometric choice theory to this problem. We inform this measure using mobile device data, developing models of how far people are willing to travel to reach marginally larger parks. We then correlate these choice-based measures of access to parks with tract-level data on physical activity and obesity rates, controlling for demographic variables and spatial effects. The results indicate that excellent park access improves physical activity participation rates, and they suggest a marginal improvement in obesity rates beyond what physical activity and demographics can explain.
-
Toggle Item2020-02-06 - Summer Rupper - Tapping into Spy Satellite Imagery to Measure Glacier Changes in the Water Towers of AsiaAbstract:
The high mountain regions of Asia are host to more snow and ice than anywhere outside of the Polar Regions. Changes in snow and ice storage in these remote landscapes have the potential to impact the nearly 1 billion people living downstream. While it is clear that glaciers are generally retreating (shrinking in size) globally, there is a significant paucity of data on glacier changes in high mountain Asia. These glacial systems are extremely remote, occur at very high altitudes, and are often located along disputed borders in geopolitically unstable regions. This has hampered our ability to access these glaciers and directly monitor changes over time. Here we tap into historical spy satellite imagery to measure the change in glacier volume across the Himalayan range over the past ~50 years. We use this new data set to assess the magnitude and rates of change in glacier-related water resources, assess the mechanisms driving these changes, and model the impacts on downstream populations. -
Toggle Item2020-01-30 - Abbas Zaidi - Evaluating the Effect of Residential Care on Self-Concept and Ego-Resilience: A Generalized Propensity Score Analysis with Clustered DataAbstract:
This discussion focuses on the parametric estimation of average causal effects under a continuous treatment in a hierarchical setting. Our approach is applied to assessing the impact of the Udayan Ghar Program. This residential care system began in New Delhi, India with the purpose of providing surrogate housing and care to vulnerable and at risk children in an effort to improve their psychological development. We investigate the impact of staying in the system on the self-concept and ego-resilience of the residents as proxied by the Piers-Harris score. We find that there is a positive effect of staying in the residential care system at Udayan and that there are variations in this effect by gender. Furthermore, we strengthen our causal claims by demonstrating robustness against unmeasured confounding using a novel sensitivity analysis technique to assess how violations of this key identifying assumption impact our findings.
Website:
https://sites.google.com/site/amzaidistatistics/ -
Toggle Item2020-01-23 - Kevin Moon - Visualizing the True Structure of Big Data for Data ExplorationAbstract:
We live in an era of big data in which researchers in nearly every field are generating thousands or even millions of samples in high dimensions. Most methods in data science focus on prediction or impose restrictive assumptions that require established knowledge and understanding of the data; i.e. these methods require some level of expert supervision. However, in many cases, this knowledge is unavailable and the goal of data analysis is scientific discovery and to develop a better understanding of the data. There is especially a strong need for methods that perform unsupervised data visualization that accurately represents the true structure of the data, which is crucial for developing intuition and understanding of the data. In this talk, I will present PHATE: an unsupervised data visualization tool based on a new information distance that excels at denoising the data while preserving both global and local structure. I will demonstrate PHATE on a variety of datasets including facial images, mass cytometry data, and new single-cell RNA-sequencing data. On the latter, I will show how PHATE can be used to discover novel surface markers for sorting cell populations. In addition, I will present DIG, a visualization tool inspired by PHATE that theoretically eliminates nearly all sources of noise when visualizing dynamical systems. I will then demonstrate DIG on EEG sleep data.
Website:
https://sites.google.com/a/umich.edu/kevin-r-moon/home -
Toggle Item2019-12-05 - Derek Tucker - Elastic Functional Data AnalysisAbstract:
Functional data analysis (FDA) is an important research area, due to its broad applications across many disciplines where functional data is prevalent. An essential component in solving these problems is the registration of points across functional objects. Without proper registration, the results are often inferior and difficult to interpret. The current practice in the FDA literature is to treat registration as a pre-processing step, using off-the-shelf alignment procedures, and follow it up with statistical analysis of the resulting data. In contrast, an Elastic framework is a more comprehensive approach, where one solves for the registration and statistical inferences in a simultaneous fashion. Our goal is to use a metric with appropriate invariance properties, to form objective functions for alignment and to develop statistical models involving functional data. While these elastic metrics are complicated in general, we have developed a family of square-root transformations that map these metrics into simpler Euclidean metrics, thus enabling more standard statistical procedures. Specifically, we have developed techniques for elastic functional PCA, elastic tolerance bounds, and elastic regression models involving functional variables. I will demonstrate these ideas using simulated data and real data from various sources.
J. Derek Tucker is a Principal Member of the Technical Staff at Sandia National Laboratories. He received his B.S. in Electrical Engineering cum laude and M.S. in Electrical Engineering from Colorado State University in 2007 and 2009, respectively. In 2014 he received a Ph.D. degree in Statistics from Florida State University in Tallahassee, FL, under the co-advisement of Dr. Anuj Srivastava and Dr. Wei Wu. He currently is leading research projects in the area of satellite image registration and point process modeling for monitoring applications. His research is focused on pattern-theoretic approaches to problems in image analysis, computer vision, signal processing, and functional data analysis. In 2017, he received the Director of National Intelligence Team Award for his contributions to the Signal Location in Complex Environments (SLiCE) team.
Affiliation:
Sandia National Labs
Date:
2019-12-05 -
Toggle Item2019-11-21 - Antonio Villanueva-Morales - Modified Pseudo-likelihood Estimation for Markov Random Fields on LatticeAbstract:
The probability function of spatial statistical models involves, in general, an extremely awkward normalizing function of the parameters known as the partition function in statistical mechanics with the consequence that a direct approach to statistical inference through maximum likelihood (ML) is rarely possible. In order to avoid such intractability Besag (1975) introduced an alternative technique known as the method of maximum pseudo-likelihood (MPL) owing to its merit of being easy to implement. The maximum pseudo-likelihood estimator (MPLE) is the value of the parameter that maximizes the pseudo-likelihood defined as the direct product of conditional probabilities or conditional probability densities of the variable at each site. It has been mathematically demonstrated that, under suitable conditions, the MPLEs are strongly consistent and asymptotically normally distributed around the true parameter value for large samples of various spatial processes. On the other hand, the MPL method trades away efficiency for computational ease. It has been shown that in many situations the MPLE is not efficient in comparison with the ML estimator (MLE). According to these studies, the MPLEs are as good as the MLEs in the weak interaction case, but the difference between the two becomes substantial when spatial interactions are strong.
Huang and Ogata (2002) address the problem of improving the efficiency of MPLEs while still keeping the technique computationally feasible and proposed the maximum generalized pseudo-likelihood (MGPL) method for Markov random field (MRF) models on lattice. The MGPL estimator (MGPLE) is the value of the parameter that maximizes the generalized pseudo-likelihood function (GPL). This GPL is the multivariate version of Besag's pseudo-likelihood which is constructed first by defining a group of adjacent sites for each site in the lattice and then taking the product of the multivariate conditional probability distributions (MCPD) of the groups of random variables defined on each group of adjacent sites. Simulation results for an Ising and two auto-normal models on a region of square lattice showed better performance of the MGPLE than the MPLE, and the performance became better as the size of the groups of adjacent sites increased. On the other hand, it was observed that as the size of the groups of adjacent sites increased, the computing complexity for the MGPLE increased exponentially due to the presence of a normalizing integral (a sum in the case of discrete site variables) in the expression for each MCPD which has to be evaluated all over the support of the joint distribution for groups of site variables in each case. Because of this, for continuous MRFs other than auto-normal and discrete MRFs with site variables assuming more than two values, an enormous effort might be required making the implementation of the MGPL method practically unfeasible even for small square lattices. For example, in MRFs where each site variable, conditional on its neighbors, follows the distribution of a Winsorized Poisson random variable (Kaiser and Cressie (1997)) the computation of the normalizing integrals rapidly becomes prohibitive with the size of the groups of adjacent sites even for small square lattices, as the support of this distribution may be in the hundreds (or thousands).
In our research we propose a conditional pairwise pseudo-likelihood (CPPL) for parameter estimation in Markov random fields on lattice. The CPPL is defined as the direct product of conditional pairwise distributions corresponding to the pairs of random variables associated with the cliques of size two from the collection of spatial locations on a region of a lattice. Thus the CPPL is a modified version of Besag's pseudo-likelihood (PL) and Huang and Ogata's generalized pseudo-likelihood (GPL) in that it is not constructed based on defining a group of adjacent sites for each site in the lattice. We carry out calculations of the correspondingly defined maximum conditional pairwise pseudo-likelihood estimator (MCPPLE) for Markov random fields with Winsorized Poisson conditional distributions on the lattice. These simulation studies show that the MCPPLE has significantly better performance than Besag's maximum pseudo-likelihood estimator (MPLE), and its calculation is almost as easy to implement as the MPLE. Therefore, we suggest that for situations where each discrete local random variable conditional on its neighbors assumes more than two possible values, as in the Winsorized Poisson case, estimation based on the CPPL may be a computationally more feasible alternative than estimation based on Huang and Ogata's GPL.
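As a purely illustrative aside (not the authors' CPPL or MGPL implementation), the following Python sketch computes a Besag-style log pseudo-likelihood for a simple +/-1 autologistic (Ising-type) field on a square lattice, where each site's conditional distribution depends on its four nearest neighbors; the lattice data, the fixed alpha, and the grid search over beta are hypothetical.

import numpy as np

def neighbor_sum(x):
    # Sum of the four nearest-neighbor values at each site (zero padding at the edges)
    s = np.zeros_like(x, dtype=float)
    s[1:, :] += x[:-1, :]
    s[:-1, :] += x[1:, :]
    s[:, 1:] += x[:, :-1]
    s[:, :-1] += x[:, 1:]
    return s

def log_pseudolikelihood(x, alpha, beta):
    # Besag's pseudo-likelihood: sum over sites of log P(x_i | neighbors),
    # each conditional being logistic in alpha + beta * (neighbor sum)
    eta = alpha + beta * neighbor_sum(x)
    return np.sum(x * eta - np.logaddexp(eta, -eta))

# Hypothetical 20 x 20 field of +/-1 values and a crude MPLE by grid search over beta
rng = np.random.default_rng(0)
x = rng.choice([-1, 1], size=(20, 20))
betas = np.linspace(-1, 1, 201)
mple_beta = betas[np.argmax([log_pseudolikelihood(x, 0.0, b) for b in betas])]
print(mple_beta)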
Affiliation:
Statistics Department, Chapingo Autonomous University -
Toggle Item2019-11-14 - Jennifer Sinnott - Genetic Association Testing with Imperfect Phenotypes Derived From Electronic Health RecordsAbstract:
Electronic health records linked to blood samples form a powerful new data resource that can provide much larger, more diverse samples for testing associations between genetic markers and disease. However, algorithms for estimating certain phenotypes, especially those that are complex and/or difficult to diagnose, produce outcomes subject to measurement error. Much work is needed to determine best practices for implementing and analyzing such data. To this end, we recently proposed a method for analyzing case-control studies when disease status is estimated by a phenotyping algorithm; our method improves power and eliminates bias when compared to the standard approach of dichotomizing the algorithm prediction and analyzing the data as though case-control status were known perfectly. The method relies on knowing certain qualities of the algorithm, such as its sensitivity, specificity, and positive predictive value, but in practice these may not be known if no "gold-standard" phenotypes are known in the population. A common setting where that occurs is in phenome-wide association studies (PheWASs), in which a wide range of phenotypes are of interest, and all that is available for each phenotype is a surrogate measure, such as the number of billing codes for that disease. We proposed a method to perform genetic association tests in this setting, which improves power over existing methods that typically identify cases based on thresholding the number of billing codes. In this talk, I will describe these methods, and present applications to studies of rheumatoid arthritis in the Partners Healthcare System.
Affiliation:
Ohio State University/University of Utah
Date:
2019-11-14 -
Toggle Item2019-11-05 - Jacob Mortensen - Statistical Methods for Modeling MovementAbstract:
In recent years, tracking data has become widespread, allowing researchers to model movement at a very high level of detail. In this talk I will present two examples of statistical research inspired by this type of data. In the first, I present a method for nonparametric estimation of continuous-state Markov transition densities. Our approach uses a Poisson point process to represent the joint transition space, then divides that process by the marginal intensity to estimate the conditional transition density. Modeling a transition density as a point process creates a general framework that admits a wide variety of implementations, depending on suitability for a given application and at the discretion of the modeler. A key feature of this point process representation is that it allows the presence of spatial structure to inform transition density estimation. We illustrate this by using our method to model ball movement in the National Basketball Association, enabling us to capture the effects of spatial features, such as the three point line, that impact transition density values. In the second, I will show how broadcast-derived tracking data can be used to estimate external load metrics in sports science. Sports scientists use high resolution coordinate data to estimate external load metrics, such as acceleration load and high speed running distance, traditionally used to understand the physical toll a game takes on an athlete. Unfortunately, collecting this data requires installation of expensive hardware and paying costly licensing fees to data providers, restricting its availability. Algorithms have been developed that allow a traditional broadcast feed to be converted to x-y coordinate data, making tracking data easier to acquire, but coordinates are available for an athlete only when that player is within the camera frame. Obviously, this leads to inaccuracies in external load estimates, limiting the usefulness of this data for sports scientists. In this research, we develop models that predict offscreen load metrics and demonstrate the viability of broadcast-derived tracking data for understanding external load in soccer.
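As a rough sketch of the general device of estimating a transition density as a joint estimate divided by a marginal estimate (using kernel density estimates rather than the Poisson point process construction described above), the following Python snippet is illustrative only; the simulated series stands in for real tracking data.

import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical one-dimensional Markov series standing in for tracking data
rng = np.random.default_rng(1)
x = np.zeros(2000)
for t in range(1, len(x)):
    x[t] = 0.8 * x[t - 1] + rng.normal(scale=0.5)

pairs = np.vstack([x[:-1], x[1:]])      # (current state, next state) pairs
joint = gaussian_kde(pairs)             # estimate of the joint density f(s, s')
marginal = gaussian_kde(x[:-1])         # estimate of the marginal density f(s)

def transition_density(s, s_next):
    # Estimated conditional transition density f(s' | s) = f(s, s') / f(s)
    return joint([[s], [s_next]])[0] / marginal([s])[0]

print(transition_density(0.0, 0.5))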
Affiliation:
Simon Fraser University -
Toggle Item2019-10-31 - Nathan Sandholtz - Modeling human decision-making in spatio-temporal systems: An observational and an experimental case studyAbstract:
In this talk I present two contrasting analyses of human decision-making behavior in spatio-temporal systems. In the first case, we examine player shooting decisions in professional basketball. We assume that all players operate under the same objective function on offense---namely, maximizing their team's total expected points. Our goal is to identify areas where, conditional on location, lineups exhibit potential inefficiencies in allocating shots among their players. We do this by comparing a player's probability of making a shot to the rate at which he shoots in context of both his four teammates on the court and the spatial distribution of his shots. While on average players are highly efficient with respect to the shot allocation metrics we introduce, nearly every lineup exhibits some degree of potential inefficiency. We estimate and visualize the points that are potentially lost and identify which players are responsible.
In the second case, we analyze an experiment in which subjects were tasked with maximizing a reward in a simple "hotspot" computer game. As in the basketball example, subjects made decisions to maximize a point total. However, unlike shots in a basketball game, this task was specifically designed to induce uncertainty about the effect an action has on the subsequent point outcome. This forced subjects to balance exploration and exploitation in their strategy. Our analysis shows that subjects exhibit vastly different preferences regarding the exploration vs. exploitation tradeoff. For this reason, we cannot assume a global strategy which all subjects follow. On the contrary, inferring each subject's latent strategy (or acquisition function, as referred to in the Bayesian optimization literature) actually becomes the primary goal of our research. We find that the classical suite of acquisition functions doesn't adequately explain every subject's behavior, and we propose a modification to this suite of acquisition functions which better explains the subjects' collective set of decisions.
Affiliation:
Simon Fraser University
Date:
2019-10-31 -
Toggle Item2019-10-24 - Alex Petersen - Partial Separability and Graphical Models for Multivariate Functional DataAbstract:
Graphical models are a ubiquitous tool for identifying dependencies among components of high-dimensional multivariate data. Recently, these tools have been extended to estimate dependencies between components of multivariate functional data by applying multivariate methods to the coefficients of truncated basis expansions. A key difficulty compared to multivariate data is that the covariance operator is compact, and thus not invertible. In this talk, we will discuss a property called partial separability that circumvents the invertibility issue and identifies the functional graphical model with a countable collection of finite-dimensional graphical models. This representation allows for the development of simple and intuitive estimators. Finally, we will demonstrate the empirical findings of our method through simulation and analysis of functional brain connectivity during a motor task.
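The following Python sketch is not the estimator from the talk, only an illustration of the recipe partial separability suggests: project each curve onto a common basis and fit a separate sparse Gaussian graphical model at each basis level. The data, dimensions, and the pooled-PCA basis are hypothetical simplifications.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.covariance import GraphicalLassoCV

# Hypothetical multivariate functional data: n subjects, p components, T time points
rng = np.random.default_rng(2)
n, p, T, L = 50, 6, 100, 3
curves = rng.normal(size=(n, p, T))

# Common basis from pooled curves; the score array has shape (n, p, L)
basis = PCA(n_components=L).fit(curves.reshape(n * p, T))
scores = basis.transform(curves.reshape(n * p, T)).reshape(n, p, L)

# Under partial separability, fit one finite-dimensional graphical model per basis level
graphs = []
for level in range(L):
    model = GraphicalLassoCV().fit(scores[:, :, level])   # p x p sparse precision matrix
    graphs.append(model.precision_ != 0)                   # adjacency of the level-specific graph
print(graphs[0].astype(int))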
Affiliation:
University of California, Santa Barbara
Date:
2019-10-24 -
Toggle Item2019-10-17 - Jonathan Blake - A Risk Manager's Guide to a Faith JourneyAbstract:
Jonathan Blake, recipient of the college Alumni Achievement Award, will give a guest lecture on Thursday, October 17. Entitled “A Risk Manager’s Guide to a Faith Journey,” the lecture will take place at 11 a.m. in room 1170 of the Talmage Building. The public is invited to attend the event. For over twenty years, Blake has been employed in a variety of actuarial roles. He is currently the Vice President and Lead Actuary at The Hanover Insurance Group. In this position, he assesses the financial strength of a domestic reserve position of over three billion dollars. Blake is also a partner with the Personal, Commercial, and Specialty business units, where he helps them pursue profitable growth strategic initiatives. Blake graduated magna cum laude with a B.S. and M.S. from the Department of Statistics in the College of Physical and Mathematical Sciences. He has previously worked in Ohio, Texas, and Massachusetts and served as president, vice president, and board member of the Casualty Actuaries of New England. He is currently a Fellow of the Casualty Actuarial Society and a member of the American Academy of Actuaries. Blake and his wife, Julia, have six children. Each year, every college on Brigham Young University campus honors one alumnus or alumna with this prestigious award. Blake received the 2019 Alumni Achievement Award from the College of Physical and Mathematical Sciences.
Affiliation:
Hanover -
Toggle Item2019-10-10 - Wes Johnson - Gold Standards are Out and Bayes is In: Implementing the Cure for Imperfect Reference Tests in Diagnostic Accuracy StudiesAbstract:
Bayesian mixture models, often termed latent class models, allow users to estimate the diagnostic accuracy of tests and true prevalence in one or more populations when the positive and/or negative reference standards are imperfect. Moreover, they allow the data analyst to show the superiority of a novel test over an old test, even if this old test is the (imperfect) reference standard. We use published data on Toxoplasmosis in pigs to explore the effects of numbers of tests, numbers of populations, and dependence structure among tests to ensure model (local) identifiability. We discuss and make recommendations about the use of priors, sensitivity analysis, model identifiability and study design options, and strongly argue for the use of Bayesian mixture models as a logical and coherent approach for estimating the diagnostic accuracy of two or more tests.
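A toy version of the core likelihood behind such latent class models, for two conditionally independent tests in a single population, is sketched below; the counts and parameter values are invented, and the Bayesian machinery from the talk (priors, MCMC, dependence between tests) is not shown.

import numpy as np

def cell_probs(prev, se1, sp1, se2, sp2):
    # Probabilities of the four (test1, test2) outcomes under a two-test,
    # one-population latent class model with conditional independence
    probs = {}
    for t1 in (1, 0):
        for t2 in (1, 0):
            p_pos = prev * (se1 if t1 else 1 - se1) * (se2 if t2 else 1 - se2)
            p_neg = (1 - prev) * ((1 - sp1) if t1 else sp1) * ((1 - sp2) if t2 else sp2)
            probs[(t1, t2)] = p_pos + p_neg
    return probs

def log_likelihood(counts, prev, se1, sp1, se2, sp2):
    probs = cell_probs(prev, se1, sp1, se2, sp2)
    return sum(counts[k] * np.log(probs[k]) for k in counts)

# Hypothetical cross-classified counts: (test1 result, test2 result) -> number of animals
counts = {(1, 1): 38, (1, 0): 12, (0, 1): 9, (0, 0): 141}
print(log_likelihood(counts, prev=0.2, se1=0.9, sp1=0.95, se2=0.8, sp2=0.9))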
Affiliation:
Department of Statistics, University of California Irvine
Date:
October 10th, 2019
Website:
https://www.ics.uci.edu/~wjohnson/ -
Toggle Item2019-10-03 - John Lawson - Useful Models and Design Strategies for Experimentation - A Career Long PerspectiveAffiliation:
Department of Statistics, Brigham Young University
Date:
October 3, 2019
Website:
John Lawson -
Toggle Item2019-09-26 - Matt Heiner - Bayesian Nonparametric Density Autoregression with Lag SelectionAbstract:
We propose and illustrate a Bayesian nonparametric autoregressive model applied to flexibly estimate general transition densities exhibiting nonlinear lag dependence. Our approach is related to Bayesian curve fitting via joint density estimation using Dirichlet process mixtures, with the Markovian likelihood defined as the conditional distribution obtained from the mixture. This results in a nonparametric extension of a mixture-of-experts formulation. We address computational challenges to posterior sampling that arise from the conditional likelihood. We illustrate the base model by fitting to synthetic data simulated from a classical model for population dynamics, as well as a time series of successive waiting times between eruptions of Old Faithful Geyser. We explore inferences available through the base model before extending the model to include automatic relevance detection among a pre-specified set of lags. We explore methods and inferences for global and local lag selection with additional simulation studies, and illustrate by fitting to an annual time series of pink salmon abundance in a stream in Alaska. We further explore and compare transition density estimation performance for alternative configurations of the proposed model.
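The model in the talk is a Dirichlet process mixture; the sketch below substitutes a finite Gaussian mixture fit to lagged pairs, purely to illustrate the device of taking the transition density to be the conditional distribution implied by a joint mixture. The series and the number of components are hypothetical.

import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

# Hypothetical time series; build (y_{t-1}, y_t) pairs and fit a joint mixture
rng = np.random.default_rng(3)
y = np.cumsum(rng.normal(size=500)) % 10
pairs = np.column_stack([y[:-1], y[1:]])
gm = GaussianMixture(n_components=4, covariance_type="full", random_state=0).fit(pairs)

def transition_density(y_next, y_prev):
    # Conditional density f(y_t | y_{t-1}) implied by the fitted joint mixture
    w, mu, cov = gm.weights_, gm.means_, gm.covariances_
    marg = w * norm.pdf(y_prev, mu[:, 0], np.sqrt(cov[:, 0, 0]))   # component weights given y_prev
    resp = marg / marg.sum()
    cmean = mu[:, 1] + cov[:, 0, 1] / cov[:, 0, 0] * (y_prev - mu[:, 0])
    cvar = cov[:, 1, 1] - cov[:, 0, 1] ** 2 / cov[:, 0, 0]
    return np.sum(resp * norm.pdf(y_next, cmean, np.sqrt(cvar)))

print(transition_density(5.0, 4.0))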
Affiliation:
Department of Statistics, Brigham Young University
Date:
September 26, 2019 -
Toggle Item2019-09-19 - Adam Smith - Bayesian Analysis of Partitioned and Large-Scale Demand ModelsAbstract:
The analysis of consumer purchase behavior is a core component of marketing and economic research, but becomes challenging with large product assortments. I discuss two approaches for estimating demand models with a high-dimensional set of products. The first approach is based on partitioning demand: these models assume that products can be categorized into groups and then define consumer substitution patterns at the group-level rather than product-level. While this can significantly reduce the dimension of the parameter space, it can also lead to inaccurate inferences if the product categories do not match the structure of consumer preferences. To overcome this problem, I let the partition be a model parameter and propose a Bayesian method for inference. The second approach is based on regularization: I propose a new class of shrinkage priors for price elasticities in high-dimensional demand models. The prior has a hierarchical structure where the direction and rate of shrinkage depend on the information in a product classification tree. Both approaches are illustrated with store-level scanner data and the effects on demand predictions and product competition are discussed
Affiliation:
UCL School of Management, University College London
Date:
September 19, 2019
Website:
Adam Smith -
Toggle Item2019-04-04 - Daniel Apley - Understanding the Effects of Predictor Variables in Black-Box Supervised Learning Models
Presenter:
Daniel Apley
Title:
Understanding the Effects of Predictor Variables in Black-Box Supervised Learning Models
Affiliation:
Northwestern University
Date:
April 4, 2019
Abstract:
For many supervised learning applications, understanding and visualizing the effects of the predictor variables on the predicted response is of paramount importance. A shortcoming of black-box supervised learning models (e.g., complex trees, neural networks, boosted trees, random forests, nearest neighbors, local kernel-weighted methods, support vector regression, etc.) in this regard is their lack of interpretability or transparency. Partial dependence (PD) plots, which are the most popular general approach for visualizing the effects of the predictors with black box supervised learning models, can produce erroneous results if the predictors are strongly correlated, because they require extrapolation of the response at predictor values that are far outside the multivariate envelope of the training data. Functional ANOVA for correlated inputs can avoid this extrapolation but involves prohibitive computational expense and subjective choice of additive surrogate model to fit to the supervised learning model. We present a new visualization approach that we term accumulated local effects (ALE) plots, which have a number of advantages over existing methods. First, ALE plots do not require unreliable extrapolation with correlated predictors. Second, they are orders of magnitude less computationally expensive than PD plots, and many orders of magnitude less expensive than functional ANOVA. Third, they yield convenient variable importance/sensitivity measures that possess a number of desirable properties for quantifying the impact of each predictor.
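A bare-bones version of the accumulated local effect computation for a single numeric predictor is sketched below, following the description above; the fitted model and data are hypothetical, and details such as the quantile binning and centering convention are simplified. Because predictions are differenced only at observations that actually fall in each bin, no extrapolation outside the data envelope is required, which is the key contrast with partial dependence plots.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def ale_1d(model, X, j, n_bins=20):
    # Accumulated local effects of predictor j for any fitted model with .predict()
    z = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1))          # bin edges
    idx = np.clip(np.digitize(X[:, j], z[1:-1]), 0, n_bins - 1)
    local = np.zeros(n_bins)
    for k in range(n_bins):
        rows = X[idx == k]
        if len(rows) == 0:
            continue
        lo, hi = rows.copy(), rows.copy()
        lo[:, j], hi[:, j] = z[k], z[k + 1]
        # average change in prediction across the bin, using only points in the bin
        local[k] = np.mean(model.predict(hi) - model.predict(lo))
    ale = np.cumsum(local)
    return z, ale - ale.mean()                                        # centered ALE curve

# Hypothetical data with strongly correlated predictors
rng = np.random.default_rng(4)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2])
y = x1 + x2 ** 2 + rng.normal(scale=0.1, size=500)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
edges, effects = ale_1d(rf, X, j=0)
print(effects)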
Website:
Dr. Apley's Website
-
Toggle Item2019-03-28 - Jeff Miller - Flexible perturbation models for robustness to misspecification
Presenter:
Dr. Jeff Miller
Title:
Flexible perturbation models for robustness to misspecification
Affiliation:
Harvard
Date:
March 28, 2019
Abstract:
In many applications, there are natural statistical models with interpretable parameters that provide insight into questions of interest. While useful, these models are almost always wrong in the sense that they only approximate the true data generating process. In some cases, it is important to account for this model error when quantifying uncertainty in the parameters. We propose to model the distribution of the observed data as a perturbation of an idealized model of interest by using a nonparametric mixture model in which the base distribution is the idealized model. This provides robustness to small departures from the idealized model and, further, enables uncertainty quantification regarding the model error itself. Inference can easily be performed using existing methods for the idealized model in combination with standard methods for mixture models. Remarkably, inference can be even more computationally efficient than in the idealized model alone, because similar points are grouped into clusters that are treated as individual points from the idealized model. We demonstrate with simulations and an application to flow cytometry.
Website:
Dr. Miller's Website
-
Toggle Item2019-03-21 - Yue Zhang - Multi-state Approach for Studying Cancer Care Continuum using EHR data
Presenter:
Dr. Yue Zhang
Title:
Multi-state Approach for Studying Cancer Care Continuum using EHR data
Affiliation:
University of Utah
Date:
March 21, 2019
Abstract:
Diagnostic evaluation of suspected breast cancer due to abnormal screening mammography results is common, creates anxiety for women and is costly for the healthcare system. Timely evaluation with minimal use of additional diagnostic testing is key to minimizing anxiety and cost. In this paper, we propose a Bayesian semi-Markov model that allows for flexible, semi-parametric specification of the sojourn time distributions and apply our model to an investigation of the process of diagnostic evaluation with mammography, ultrasound and biopsy following an abnormal screening mammogram. We also investigate risk factors associated with the sojourn time between diagnostic tests. By utilizing semi-Markov processes, we expand on prior work that described the timing of the first test received by providing additional information such as the mean time to resolution and proportion of women with unresolved mammograms after 90 days for women requiring different sequences of tests in order to reach a definitive diagnosis. Overall, we found that older women were more likely to have unresolved positive mammograms after 90 days. Differences in the timing of imaging evaluation and biopsy were generally on the order of days and thus did not represent clinically important differences in diagnostic delay.
Website:
Dr. Zhang's Webpage
-
Toggle Item2019-03-14 - Dennis Tolley - DATA: Whence it Came…Where it’s Going
Presenter:
Dr. Dennis Tolley
Title:
DATA: Whence it Came…Where it’s Going
Affiliation:
BYU
Date:
March 14, 2019
Abstract:
A defining activity of statisticians is the handling, processing, analyzing and interpreting of data. With “big data” upon us, it is sometimes easy to forget some basic principles in the use of data. In this seminar I review some basic guidelines regarding data that apply before one actually begins to physically process the data files. I also review some guidelines based on the ultimate use of the results that assist in how a statistician will formulate a methodology and carry out the analysis. Application of these guidelines is illustrated with a simple problem in liquid chromatography that gives rise to a family of random walk models. These models, in turn, lay the foundation for a family of research problems in statistics.
Website:
Dr. Tolley's Website
-
Toggle Item2019-03-07 - Grant Schultz - Utah Crash Prediction Models: A Joint Effort for Success
Presenter:
Dr. Grant Schultz
Title:
Utah Crash Prediction Models: A Joint Effort for Success
Affiliation:
BYU
Date:
March 7, 2019
Abstract:
The Utah Department of Transportation (UDOT) continues to advance the safety of the state roadway network through its participation in and endorsement of the “Zero Fatalities: A Goal We Can All Live With™” campaign to increase awareness of the importance of highway safety. As a continuing effort by UDOT to advance the safety of their roadway network, research has been conducted wherein statistical models have been developed that allow users to evaluate the safety of roadways within the state. Three models have been developed by a team of Civil and Environmental Engineering and Statistics faculty and students. These models include the Utah Crash Prediction Model (UCPM), the Utah Crash Severity Model (UCSM), and the Utah Intersection Crash Prediction Model (UICPM). Using the output from these models, UDOT Safety Programs engineers, Region directors, and other interested users have access to data that will allow them to make informed decisions related to prioritizing highway safety projects and programs within the state of Utah.
Website:
Dr. Schultz Webpage
-
Toggle Item2019-02-28 - Ephraim Hanks - Random walk spatial models for spatially correlated genetic data
Presenter:
Dr. Ephraim Hanks
Title:
Random walk spatial models for spatially correlated genetic data
Affiliation:
Penn State
Date:
February 28, 2019
Abstract:
Landscape genetics is the study of how landscape features, like rivers, mountains, and roads, influence genetic connectivity of wildlife populations. We build models for spatial genetic correlation based on spatio-temporal models for how animals move across the landscape. This approach provides insights into common spatial models, such as simultaneous autoregressive (SAR) models and common Matérn covariance models. It also allows for scientific interpretation of spatial covariance parameters. We illustrate this approach in a study of brook trout, where we provide the first parametric description of how stream characteristics influence genetic connectivity.
Website:
Dr. Hanks' Website
-
Toggle Item2019-02-21 - Michele Guindani - Bayesian Approaches to Dynamic Model Selection
Presenter:
Michele Guindani
Title:
Bayesian Approaches to Dynamic Model Selection
Affiliation:
University of California, Irvine
Date:
February 21, 2019
Abstract:
In many applications, investigators monitor processes that vary in space and time, with the goal of identifying temporally persistent and spatially localized departures from a baseline or "normal" behavior. In this talk, I will first discuss a principled Bayesian approach for estimating time-varying functional connectivity networks from brain fMRI data. Dynamic functional connectivity, i.e., the study of how interactions among brain regions change dynamically over the course of an fMRI experiment, has recently received wide interest in the neuroimaging literature. Our method utilizes a hidden Markov model for classification of latent neurological states, achieving estimation of the connectivity networks in an integrated framework that borrows strength over the entire time course of the experiment. Furthermore, we assume that the graph structures, which define the connectivity states at each time point, are related within a super-graph, to encourage the selection of the same edges among related graphs. Then, I will propose a Bayesian nonparametric model selection approach with an application to the monitoring of pneumonia and influenza (P&I) mortality, to detect influenza outbreaks in the continental United States. More specifically, we introduce a zero-inflated conditionally identically distributed species sampling prior which allows borrowing information across time and assigning data to clusters associated with either a null or an alternate process. Spatial dependences are accounted for by means of a Markov random field prior, which allows the selection to be informed by inferences conducted at nearby locations. We show how the proposed modeling framework performs in an application to the P&I mortality data and in a simulation study, and compare with common threshold methods for detecting outbreaks over time, with more recent Markov switching based models, and with other Bayesian nonparametric priors that do not take into account spatio-temporal dependence.
Website:
Dr. Guindani's Website
-
Toggle Item2019-02-14 - Garritt Page - Temporal and Spatio-Temporal Random Partition Models
Presenter:
Dr. Garritt Page
Title:
Temporal and Spatio-Temporal Random Partition Models
Affiliation:
BYU
Date:
February 14, 2019
Abstract:
The number of scientific fields that regularly collect data that are temporally and spatially referenced continues to experience rapid growth. An intuitive feature in data that are spatio-temporal is that measurements taken on experimental units near each other in time and space tend to be similar. Because of this, many methods developed to accommodate spatio-temporal dependent structures perform a type of implicit grouping based on time and space. Rather than implicitly grouping observations through a type of smoothing, we develop a class of dependent random partition models that explicitly models spatio-temporal clustering. This model can be thought of as a joint distribution for a sequence of random partitions indexed by time and space. We first detail how temporal dependence is incorporated so that partitions evolve gently over time. Then a few properties of the joint model are derived and induced dependence at the observation level is explored. Afterwards, we demonstrate how space can be integrated. Computation strategies are detailed and we apply the method to Chilean standardized testing scores.
Website:
Dr. Page's Website
-
Toggle Item2019-02-07 - Gil Fellingham - Predicting Performance of Professional Golfers
Presenter:
Dr. Gil Fellingham
Title:
Predicting Performance of Professional Golfers
Affiliation:
BYU
Date:
February 7, 2019
Abstract:
Many statisticians agree that building models that predict well should be a high priority (Harville, 2014; Stern, 2014; Berry and Berry, 2014). The purpose of this paper is to test the predictive ability of various Bayesian models using a group of closely matched members of the Professional Golf Association (PGA). Predicting performance of PGA golfers is a notoriously difficult task. We fit six different models to scores produced by 22 PGA golfers playing on 18 different golf courses in 2014. We then use these models to predict scores for the same golfers and golf courses as well as other golfers and other courses in 2015. We varied model complexity across two different dimensions. In one dimension we fit model intercepts using parametric Bayesian, nonparametric Bayesian, and hierarchical Bayesian methods. In the other dimension, we either included covariates for driving distance, greens hit in regulation, and difficulty of course as measured by slope, or we did not include the covariates. Preliminary results indicate that nonparametric Bayesian methods seem marginally better.
Website:
Dr. Fellingham's Webpage
-
Toggle Item2019-01-31 - Matthias Katzfuss - Gaussian-Process Approximations for Big Data
Presenter:
Matthias Katzfuss
Title:
Gaussian-Process Approximations for Big Data
Affiliation:
Texas A&M University
Date:
January 31, 2019
Abstract:
Gaussian processes (GPs) are popular, flexible, and interpretable probabilistic models for functions. GPs are well suited for big data in areas such as machine learning, regression, and geospatial analysis. However, direct application of GPs is computationally infeasible for large datasets. We consider a framework for fast GP inference based on the so-called Vecchia approximation. Our framework contains many popular existing GP approximations as special cases. Representing the models by directed acyclic graphs, we determine the sparsity of the matrices necessary for inference, which leads to new insights regarding the computational properties. Based on these results, we propose novel Vecchia approaches for noisy, non-Gaussian, and massive data. We provide theoretical results, conduct numerical comparisons, and apply the methods to satellite data.
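A minimal sketch of the basic Vecchia device (each ordered observation conditioned on a small set of previous neighbors) is given below; it is not the general framework or the non-Gaussian extensions discussed in the talk, and the covariance function, the simple coordinate ordering, and the data are hypothetical.

import numpy as np
from scipy.stats import norm

def exp_cov(locs, variance=1.0, range_=0.3):
    d = np.abs(locs[:, None] - locs[None, :])
    return variance * np.exp(-d / range_)

def vecchia_loglik(y, locs, m=5):
    # Vecchia-type approximate Gaussian log-likelihood: each y_i is conditioned
    # on (at most) the m immediately preceding observations in the ordering
    order = np.argsort(locs)
    yo, lo = y[order], locs[order]
    ll = 0.0
    for i in range(len(yo)):
        prev = np.arange(max(0, i - m), i)
        K = exp_cov(np.append(lo[prev], lo[i]))
        if len(prev) == 0:
            mean, var = 0.0, K[0, 0]
        else:
            Kpp, kpi = K[:-1, :-1], K[:-1, -1]
            w = np.linalg.solve(Kpp, kpi)
            mean, var = w @ yo[prev], K[-1, -1] - kpi @ w
        ll += norm.logpdf(yo[i], loc=mean, scale=np.sqrt(var))
    return ll

# Hypothetical one-dimensional spatial data simulated from the exact model
rng = np.random.default_rng(5)
locs = np.sort(rng.uniform(size=300))
Sigma = exp_cov(locs) + 1e-6 * np.eye(300)
y = np.linalg.cholesky(Sigma) @ rng.normal(size=300)
print(vecchia_loglik(y, locs, m=5))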
Website:
Dr. Katzfuss Website
-
Toggle Item2019-01-24 - Brennan Bean - Interval-Valued Kriging with Applications in Design Ground Snow Load Prediction
Presenter:
Brennan Bean
Title:
Interval-Valued Kriging with Applications in Design Ground Snow Load Prediction
Affiliation:
Utah State University
Date:
January 24, 2019
Abstract:
The load induced by snow on the roof of a structure is a serious design consideration in many western and northeastern states: under-estimating loads can lead to structure failure while over-estimating loads unnecessarily increases construction costs. Recent updates to the design ground snow load requirements in Utah use geostatistics models to produce design ground snow load estimates that have shown significantly improved accuracy. However, the model inputs are subject to several sources of uncertainty including measurement limitations, short observation periods, and shortcomings in the distribution fitting process, among others. Ignoring these uncertainties in the modeling process could result in critical information loss that robs the final predictions of proper context. One way to account for these uncertainties is to express the data by intervals, as opposed to single numbers. Interval-valued geostatistics models for uncertainty characterization were originally considered and studied in the late 1980s. However, those models suffer from several fundamental problems that limit their application. This presentation proposes to modify and improve the interval-valued kriging models proposed by Diamond (1989) based on recent developments of random set theory. The resulting new models are shown to have more structured formulation and computational feasibility. A numerical implementation of these models is developed based on a modified Newton-Raphson algorithm and its finite sample performance is demonstrated through a simulation study. These models are applied to the Utah snow load dataset and produce an interval-valued version of the 2018 Utah Snow Load Study. The interesting and promising implications of these new results to design ground snow load and structural risk analysis will be thoroughly discussed.
Website:
Brennan's Webpage
-
Toggle Item2019-01-17 - Ron Reeder - Improving outcomes after pediatric cardiac arrest – a hybrid stepped-wedge trial
Presenter:
Ron Reeder
Title:
Improving outcomes after pediatric cardiac arrest – a hybrid stepped-wedge trial
Affiliation:
University of Utah
Date:
January 17, 2019
Abstract:
Quality of cardiopulmonary resuscitation (CPR) is associated with survival, but recommended guidelines are often not met, and less than half the children with an in-hospital arrest will survive to discharge. A single-center before-and-after study demonstrated that outcomes may be improved with a novel training program in which all pediatric intensive care unit staff are encouraged to participate in frequent CPR refresher training and regular, structured resuscitation debriefings focused on patient-centric physiology.
I’ll present the design of an ongoing trial that will assess whether a program of structured debriefings and point-of-care bedside practice that emphasizes physiologic resuscitation targets improves the rate of survival to hospital discharge with favorable neurologic outcome in children receiving CPR in the intensive care unit. This study is designed as a hybrid stepped-wedge trial in which two of ten participating hospitals are randomly assigned to enroll in the intervention group and two are assigned to enroll in the control group for the duration of the trial. The remaining six hospitals enroll initially in the control group but will transition to enrolling in the intervention group at randomly assigned staggered times during the enrollment period.
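Purely as an illustration of the assignment pattern that description implies (two hospitals always in the intervention arm, two always in control, and six crossing over at staggered, randomly chosen periods), the following sketch builds a hypothetical enrollment matrix; hospital labels, the number of periods, and the randomization are invented.

import numpy as np

rng = np.random.default_rng(6)
n_periods = 8
hospitals = [f"H{i + 1}" for i in range(10)]
rng.shuffle(hospitals)

always_treat, always_control, steppers = hospitals[:2], hospitals[2:4], hospitals[4:]
# give each stepping hospital a distinct, randomly chosen crossover period
crossover = dict(zip(steppers, rng.choice(np.arange(2, n_periods + 1), size=6, replace=False)))

# assignment[h][t-1] = 1 if hospital h enrolls in the intervention arm during period t
assignment = {}
for h in hospitals:
    if h in always_treat:
        assignment[h] = [1] * n_periods
    elif h in always_control:
        assignment[h] = [0] * n_periods
    else:
        assignment[h] = [1 if t >= crossover[h] else 0 for t in range(1, n_periods + 1)]

for h in sorted(hospitals):
    print(h, assignment[h])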
This trial is the first implementation of a hybrid stepped-wedge design. It was chosen over a traditional stepped-wedge design because the resulting improvement in statistical power reduces the required enrollment by 9 months (14%). However, this design comes with additional challenges, including logistics of implementing an intervention prior to the start of enrollment. Nevertheless, if results from the single-center pilot are confirmed in this trial, it will have a profound effect on CPR training and quality improvement initiatives.
Website:
Dr. Reeder's Website
-
Toggle Item2019-01-10 - Juan Rodriguez - Deep Learning to Save Humanity
Presenter:
Juan Rodriguez
Title:
Deep Learning to Save Humanity
Affiliation:
Recursion Pharmaceuticals
Date:
January 10, 2019
Abstract:
During the last 50 years, the advances in computational processing and storage have overshadowed the progress of most areas of research. At Recursion Pharmaceuticals we are translating these advances into biological results to change the way drug discovery is done. We are hyper-parallelizing the scientific method to discover new treatments for patients. This new approach presents unique statistical and mathematical challenges in the area of artificial intelligence and computer vision which will be presented.
Website:
Company Website
-
Toggle Item2018-12-06 - Dennis Eggett - Making the best of messy data: A return to basics
Presenter:
Dr. Dennis Eggett
Title:
Making the best of messy data: A return to basics.
Affiliation:
BYU
Date:
December 6, 2018
Abstract:
When your data does not meet the basic assumptions of an analysis method, you have to go back to the basics in order to glean the information you need. Three data sets will be used to explore resampling methods based on the definition of a p-value and the central limit theorem. A simple two-sample t-test of a data set that is not near normal and does not conform to non-parametric methods is used to demonstrate resampling in its simplest form. A mixed model analysis of highly skewed data will be used to demonstrate how to maintain its structure through the resampling process. Finally, a resampling of a very large data set will be used to demonstrate finding parameter estimates and confidence intervals.
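A minimal example of the kind of resampling argument described above, a two-sample permutation test built directly from the definition of a p-value, is sketched below with invented skewed data.

import numpy as np

rng = np.random.default_rng(7)
# Hypothetical skewed samples for which a plain t-test would be questionable
group_a = rng.lognormal(mean=0.0, sigma=1.0, size=30)
group_b = rng.lognormal(mean=0.4, sigma=1.0, size=25)

observed = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])

n_resamples = 10_000
count = 0
for _ in range(n_resamples):
    rng.shuffle(pooled)                       # reallocate observations to the two groups at random
    diff = pooled[len(group_a):].mean() - pooled[:len(group_a)].mean()
    if abs(diff) >= abs(observed):            # as extreme as, or more extreme than, the observed difference
        count += 1
p_value = (count + 1) / (n_resamples + 1)
print(observed, p_value)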
Website:
Dr. Eggett's Webpage
-
Toggle Item2018-11-29 - Bruno Sanso - Multi-Scale Models for Large Non-Stationary Spatial Datasets
Presenter:
Bruno Sanso
Title:
Multi-Scale Models for Large Non-Stationary Spatial Datasets
Affiliation:
University of California Santa Cruz
Date:
November 29, 2018
Abstract:
Large spatial datasets often exhibit features that vary at different scales as well as at different locations. To model random fields whose variability changes at differing scales, we use multiscale kernel convolution models. These models rely on nested grids of knots at different resolutions. Thus, lower order terms capture large scale features, while high order terms capture small scale ones. In this talk we consider two approaches to fitting multi-resolution models with space-varying characteristics. In the first approach, to accommodate the space-varying nature of the variability, we consider priors for the coefficients of the kernel expansion that are structured to provide increasing shrinkage as the resolution grows. Moreover, a tree shrinkage prior auto-tunes the degree of resolution necessary to model a subregion in the domain. In addition, compactly supported kernel functions allow local updating of the model parameters, which achieves massive scalability by suitable parallelization. As an alternative, we develop an approach that relies on knot selection, rather than shrinkage, to achieve parsimony, and discuss how this induces a field with spatially varying resolution. We extend shotgun stochastic search to the multi-resolution model setting, and demonstrate that this method is computationally competitive and produces excellent fit to both synthetic and real datasets.
Website:
Dr. Sanso's Website
-
Toggle Item2018-11-15 - Margie Rosenberg - Unsupervised Clustering Techniques using all Categorical Variables
Presenter:
Margie Rosenberg
Title:
Unsupervised Clustering Techniques using all Categorical Variables
Affiliation:
University of Wisconsin-Madison
Date:
November 15, 2018
Abstract:
We present a case study to illustrate a novel way of clustering individuals into groups of similar individuals where covariates are all categorical. Our method is especially useful when applied to multi-level categorical data where there is no inherent order in the variable, like race. We use data from the National Health Interview Survey (NHIS) to form the clusters and apply these clusters for prediction purposes to the Medical Expenditure Panel Survey (MEPS). Our approach considers the person-weighting of the surveys to produce clusters and estimates of expenditures per cluster that are representative of the US adult civilian non-institutionalized population. For our clustering method, we apply the K-Medoids approach with an adapted version of the Goodall dissimilarity index. We validate our approach on independent NHIS/MEPS data from a different panel. Our results indicate the robustness of the clusters across years and indicate the ability to distinguish clusters for the predictability of expenditures.
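The sketch below is a stripped-down illustration of clustering purely categorical records from a dissimilarity matrix with a medoid-based algorithm; it uses simple matching distance and a basic alternating k-medoids update rather than the weighted Goodall index and survey-weighted procedure described in the talk, and the data are invented.

import numpy as np

rng = np.random.default_rng(8)
# Hypothetical categorical data: 200 records, 4 unordered categorical variables
X = rng.integers(0, 3, size=(200, 4))

# Simple matching dissimilarity: share of variables on which two records disagree
D = np.array([[np.mean(a != b) for b in X] for a in X])

def k_medoids(D, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)            # assign each record to its nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]  # member minimizing within-cluster dissimilarity
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(D[:, medoids], axis=1)

medoids, labels = k_medoids(D, k=3)
print(medoids, np.bincount(labels))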
Website:
Dr. Rosenberg's Website
-
Toggle Item2018-11-08 - Terrance Savitsky - Bayesian Uncertainty Estimation under Complex Sampling
Presenter:
Terrance Savitsky
Title:
Bayesian Uncertainty Estimation under Complex Sampling
Affiliation:
Bureau of Labor Statistics
Date:
November 8, 2018
Abstract:
Multistage, unequal probability sampling designs utilized by federal statistical agencies are typically constructed to maximize the efficiency of the target domain level estimator (e.g., indexed by geographic area) within cost constraints for survey administration. Such designs may induce dependence between the sampled units; for example, with employment of a sampling step that selects geographically-indexed clusters of units. A sampling-weighted pseudo-posterior distribution may be used to estimate the population model on the observed sample. The dependence induced between co-clustered units inflates the scale of the resulting pseudo-posterior covariance matrix that has been shown to induce under coverage of the credibility sets. We demonstrate that the scale and shape of the asymptotic distributions are different between each of the pseudo-MLE, the pseudo-posterior and the MLE under simple random sampling. We devise a correction applied as a simple and fast post-processing step to MCMC draws of the pseudo-posterior distribution that projects the pseudo-posterior covariance matrix such that the nominal coverage is approximately achieved. We demonstrate the efficacy of our scale and shape projection procedure on synthetic data and make an application to the National Survey on Drug Use and Health.
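A toy sampling-weighted pseudo-posterior for a single mean parameter, fit by random-walk Metropolis, is sketched below; it only illustrates the basic device of exponentiating each unit's likelihood contribution by its sampling weight and does not implement the scale-and-shape projection proposed in the talk. Data, weights, and the prior are hypothetical.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)
# Hypothetical sample with unequal (made-up) sampling weights, normalized to sum to n
y = rng.normal(loc=2.0, scale=1.0, size=100)
w = rng.uniform(0.5, 3.0, size=100)
w = w * len(w) / w.sum()

def log_pseudo_posterior(mu):
    # weight-exponentiated (pseudo-) log-likelihood plus a diffuse normal prior
    return np.sum(w * norm.logpdf(y, loc=mu, scale=1.0)) + norm.logpdf(mu, 0.0, 10.0)

draws, mu = [], 0.0
current = log_pseudo_posterior(mu)
for _ in range(5000):
    proposal = mu + rng.normal(scale=0.2)
    candidate = log_pseudo_posterior(proposal)
    if np.log(rng.uniform()) < candidate - current:     # Metropolis accept/reject step
        mu, current = proposal, candidate
    draws.append(mu)
print(np.mean(draws[1000:]), np.std(draws[1000:]))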
Website:
-
Toggle Item2018-11-01 - Dustin Harding - How Renting Products Increases Consumer Confidence and Commitment
Presenter:
Dustin Harding
Title:
How Renting Products Increases Consumer Confidence and Commitment
Affiliation:
UVU
Date:
November 1, 2018
Abstract:
Consumers can obtain skill-based products through a variety of acquisition modes, such as purchase or rental. Despite the rise of nonpurchase acquisition modes, surprisingly little research has explored the effects of differential acquisition modes on consumer behavior. This research begins to fill this gap in the literature by examining the effect of acquisition mode on the expected time necessary to master newly adopted skill-based products and the downstream consequences for consumers and marketers. Results of four experiments and a field study show that purchasing, versus renting, products requiring skill-based learning increases the amount of time consumers expect to be required to master them. Further, the differences in speed of product mastery, in turn, impact subsequent consumer behavior via differential levels of product use commitment.
Website:
Dr. Harding's Website
-
Toggle Item2018-10-25 - Alex Petersen - Wasserstein Regression and Covariance for Random Densities
Presenter:
Alex Petersen
Title:
Wasserstein Regression and Covariance for Random Densities
Affiliation:
UC Santa Barbara
Date:
October 25, 2018
Abstract:
Samples of density functions appear in a variety of disciplines, including distributions of mortality across nations, CT density histograms of hematoma in post-stroke patients, and distributions of voxel-to-voxel correlations of fMRI signals across subjects. The nonlinear nature of density space necessitates adaptations and new methodologies for the analysis of random densities. We define our geometry using the Wasserstein metric, an increasingly popular choice in theory and application. First, when densities appear as responses in a regression model, the utility of Fréchet regression, a general purpose methodology for response objects in a metric space, is demonstrated. Due to the manifold structure of the space, inferential methods are developed allowing for tests of global and partial effects, as well as simultaneous confidence bands for fitted densities. Second, a notion of Wasserstein covariance is proposed for multivariate density data (a vector of densities), where multiple densities are observed for each subject. This interpretable dependence measure is shown to reveal interesting differences in functional connectivity between a group of Alzheimer's subjects and a control group.
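For one-dimensional distributions the Wasserstein-2 distance reduces to an L2 distance between quantile functions, which is all the sketch below uses; the Fréchet regression and Wasserstein covariance constructions from the talk are not reproduced, and the samples are simulated.

import numpy as np

def wasserstein2(sample_a, sample_b, grid_size=500):
    # Wasserstein-2 distance between two 1-D samples via empirical quantile functions
    p = (np.arange(grid_size) + 0.5) / grid_size
    return np.sqrt(np.mean((np.quantile(sample_a, p) - np.quantile(sample_b, p)) ** 2))

rng = np.random.default_rng(10)
a = rng.normal(0.0, 1.0, size=100_000)
b = rng.normal(1.0, 2.0, size=100_000)
# For two Gaussians the exact value is sqrt((mu1 - mu2)^2 + (sigma1 - sigma2)^2) = sqrt(2)
print(wasserstein2(a, b))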
Website:
Dr. Petersen's Website
-
Toggle Item2018-10-18 - Abel Rodriguez - Spherical Factor Analysis for Binary Data: A Look at the Conservative Revolt in the US House of Representatives
Presenter:
Abel Rodriguez
Title:
Spherical Factor Analysis for Binary Data: A Look at the Conservative Revolt in the US House of Representatives
Affiliation:
UC Santa Cruz
Date:
October 18, 2018
Abstract:
Factor models for binary data are extremely common in many social science disciplines. For example, in political science binary factor models are often used to explain voting patterns in deliberative bodies such as the US Congress, leading to an “ideological” ranking of legislators. Binary factor models can be motivated through so-called “spatial” voting models, which posit that legislators have a most preferred policy – their ideal point – which can be represented as a point in some Euclidean “policy space”. Legislators then vote for/against motions in accordance with the distance between their (latent) preferences and the position of the bill in the same policy space. In this talk we introduce a novel class of binary factor models derived from spatial voting models in which the policy space corresponds to a non-Euclidean manifold. In particular, we consider embedding legislators’ preferences on the surface of an n-dimensional sphere. The resulting model contains the standard binary Euclidean factor model as a limiting case, and provides a mechanism to operationalize (and extend) the so-called “horseshoe theory” in political science, which postulates that the far-left and far-right are more similar to each other in essentials than either are to the political center. The performance of the model is illustrated using voting data from recent US Congresses. In particular, we show that voting patterns for the 113th US House of Representatives are better explained by a circular factor model than by either a one- or a two-dimensional Euclidean model, and that the circular model yields a ranking of legislators more in accord with experts’ expectations.
Website:
Dr. Rodriguez's Website
-
Toggle Item2018-09-20 - Scott Grimshaw - Going Viral, Binge Watching, and Attention Cannibalism
Presenter:
Dr. Scott Grimshaw
Title:
Going Viral, Binge Watching, and Attention Cannibalism
Affiliation:
BYU
Date:
September 20, 2018
Abstract:
Since digital entertainment is often described as viral this paper uses the vocabulary and statistical methods for diseases to analyze viewer data from an experiment at BYUtv where a program's premiere was exclusively digital. Onset time, the days from the program premiere to a viewer watching the first episode, is modeled using a changepoint between epidemic viewing with a non-constant hazard rate and endemic viewing with a constant hazard rate. Finish time, the days from onset to a viewer watching all episodes, uses an expanded negative binomial hurdle model to reflect the characteristics of binge watching. The hurdle component models binge racing where a viewer watches all episodes on the same day as onset. One reason binge watching appeals to viewers is that they can focus attention on a single program's story line and characters before moving on to a second program. This translates to a competing risks model that has an impact on scheduling digital premieres. Attention cannibalism occurs when a viewer takes a long time watching their first choice program and then never watches a second program or delays watching the second program until much later. Scheduling a difference in premieres reduces attention cannibalism.
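A generic zero-hurdle negative binomial log-likelihood is sketched below to make the structure concrete: a hurdle probability for finishing on the day of onset (binge racing) and a zero-truncated count part for everyone else. This is not the expanded model from the talk (no covariates, no changepoint onset component), and the data are simulated.

import numpy as np
from scipy.stats import nbinom

def hurdle_nb_loglik(y, pi0, size, prob):
    # Zero-hurdle negative binomial: P(Y = 0) = pi0 (e.g., binge racing),
    # and for y >= 1 a zero-truncated NB(size, prob) count distribution
    y = np.asarray(y)
    zero = y == 0
    ll_zero = zero.sum() * np.log(pi0)
    trunc = 1.0 - nbinom.pmf(0, size, prob)
    ll_pos = np.sum(np.log1p(-pi0) + nbinom.logpmf(y[~zero], size, prob) - np.log(trunc))
    return ll_zero + ll_pos

# Hypothetical finish times (days from onset to watching all episodes)
rng = np.random.default_rng(11)
finish = np.where(rng.uniform(size=300) < 0.3, 0, 1 + rng.negative_binomial(2, 0.3, size=300))
print(hurdle_nb_loglik(finish, pi0=0.3, size=2, prob=0.3))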
Website:
Dr. Grimshaw's website
-
Toggle Item2018-04-12 - Cristian Tomasetti - Cancer etiology, evolution and early detection
Presenter:
Dr. Cristian Tomasetti
Title:
Cancer etiology, evolution, and early detection
Affiliation:
Johns Hopkins University School of Medicine
Date:
Apr 12, 2018
Abstract:
The standard paradigm in cancer etiology is that inherited factors and lifestyle or environmental exposures are the causes of cancer. I will present recent findings indicating that a third cause, never considered before, plays a large role: "bad luck", i.e., the pure chance involved in DNA replication when cells divide. Novel mathematical and statistical methodologies for distinguishing among these causes will also be introduced. I will then conclude with a new approach for the early detection of cancer.
Website:
Dr. Tomasetti's Website
-
Toggle Item2018-03-29 - H. Dennis Tolley - What's the Likelihood?
Presenter:
H. Dennis Tolley
Title:
What's the Likelihood?
Affiliation:
BYU
Date:
Mar 29, 2018
Abstract:
The likelihood function plays a major role in both frequentist and Bayesian methods of data analysis. Non-parametric Bayesian models also rely heavily on the form of the likelihood. Despite its heuristic foundation, the likelihood has several desirable large sample statistical properties that prompt its use among frequentists. Additionally, there are other important facets of the likelihood that warrant its formulation in many circumstances. As fundamental as the likelihood is, however, beginning students are only given a cursory introduction into how to formulate the likelihood. This seminar illustrates the formulation of the likelihood for a family of statistical problems common in the physical sciences. By examining the basic scientific principles associated with an experimental set-up, we show the step by step construction of the likelihood, starting with the discrete random walk model as a paradigm. The resulting likelihood is the solution to a stochastic differential equation. Elementary applications of the likelihood are illustrated.
Website:
Dr. Tolley's website
-
Toggle Item2018-03-22 - Matthew Heaton - Methods for Analyzing Large Spatial Data: A Review and Comparison
Presenter:
Dr. Matthew Heaton
Title:
Methods for Analyzing Large Spatial Data: A Review and Comparison
Affiliation:
BYU
Date:
Mar 22, 2018
Abstract:
The Gaussian process is an indispensable tool for spatial data analysts. The onset of the “big data” era, however, has led to the traditional Gaussian process being computationally infeasible for modern spatial data. As such, various alternatives to the full Gaussian process that are more amenable to handling big spatial data have been proposed. These modern methods often exploit low rank structures and/or multi-core and multi-threaded computing environments to facilitate computation. This study provides, first, an introductory overview of several methods for analyzing large spatial data. Second, this study describes the results of a predictive competition among the described methods as implemented by different groups with strong expertise in the methodology. Specifically, each research group was provided with two training datasets (one simulated and one observed) along with a set of prediction locations. Each group then wrote their own implementation of their method to produce predictions at the given locations, each of which was subsequently run on a common computing environment. The methods were then compared in terms of various predictive diagnostics.
Website:
Dr. Heaton's website
-
Toggle Item2018-03-15 - Timothy Hanson - A unified framework for fitting Bayesian semiparametric models to arbitrarily censored spatial survival data
Presenter:
Timothy Hanson
Title:
A unified framework for fitting Bayesian semiparametric models to arbitrarily censored spatial survival data
Affiliation:
Medtronic
Date:
Mar 15, 2018
Abstract:
A comprehensive, unified approach to modeling arbitrarily censored spatial survival data is presented for the three most commonly-used semiparametric models: proportional hazards, proportional odds, and accelerated failure time. Unlike many other approaches, all manner of censored survival times are simultaneously accommodated including uncensored, interval censored, current-status, left and right censored, and mixtures of these. Left truncated data are also accommodated leading to models for time-dependent covariates. Both georeferenced (location observed exactly) and areally observed (location known up to a geographic unit such as a county) spatial locations are handled. Variable selection is also incorporated. Model fit is assessed with conditional Cox-Snell residuals, and model choice carried out via LPML and DIC. Baseline survival is modeled with a novel transformed Bernstein polynomial prior. All models are fit via new functions which call efficient compiled C++ in the R package spBayesSurv. The methodology is broadly illustrated with simulations and real data applications. An important finding is that proportional odds and accelerated failure time models often fit significantly better than the commonly-used proportional hazards model.
Website:
Dr. Hanson's LinkedIn
-
Toggle Item2018-03-08 - Daniel Nettleton - Random Forest Prediction Intervals
Presenter:
Dr. Daniel Nettleton
Title:
Random Forest Prediction Intervals
Affiliation:
Iowa State University
Date:
Mar 8, 2018
Abstract:
Breiman's seminal paper on random forests has more than 30,000 citations according to Google Scholar. The impact of Breiman's random forests on machine learning, data analysis, data science, and science in general is difficult to measure but unquestionably substantial. The virtues of random forest methodology include no need to specify functional forms relating predictors to a response variable, capable performance for low-sample-size high-dimensional data, general prediction accuracy, easy parallelization, few tuning parameters, and applicability to a wide range of prediction problems with categorical or continuous responses. Like many algorithmic approaches to prediction, random forests are typically used to produce point predictions that are not accompanied by information about how far those predictions may be from true response values. From the statistical point of view, this is unacceptable; a key characteristic that distinguishes statistically rigorous approaches to prediction from others is the ability to provide quantifiably accurate assessments of prediction error from the same data used to generate point predictions. Thus, we develop a prediction interval -- based on a random forest prediction -- that gives a range of values that will contain an unknown continuous univariate response with any specified level of confidence. We illustrate our proposed approach to interval construction with examples and demonstrate its effectiveness relative to other approaches for interval construction using random forests.
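One simple way to attach a range to random forest point predictions, shown below, is to take quantiles of out-of-bag residuals; this is a generic construction and not necessarily the specific interval developed in the talk. Data and the nominal 90% level are hypothetical.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(12)
X = rng.uniform(-2, 2, size=(500, 3))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=500)

rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0).fit(X, y)

# Out-of-bag residuals approximate out-of-sample prediction errors
oob_resid = y - rf.oob_prediction_
lo, hi = np.quantile(oob_resid, [0.05, 0.95])        # empirical 90% residual range

X_new = rng.uniform(-2, 2, size=(5, 3))
for pred in rf.predict(X_new):
    print(f"prediction {pred:.2f}, interval ({pred + lo:.2f}, {pred + hi:.2f})")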
Website:
Dr. Nettleton's website
-
Toggle Item2018-02-22 - Robert Richardson - Non-Gaussian Translation Processes
Presenter:
Robert Richardson
Title:
Non-Gaussian Translation Processes
Affiliation:
BYU
Date:
Feb 22, 2018
Abstract:
A non-Gaussian translation process is a method used in some engineering applications where a stochastic process is used with non-Gaussian marginal distributions. It could be considered a hierarchical copula model where the correlation structure of the process is defined separately from the marginal distributional characteristics. This approach also yields a simple likelihood function for the finite dimensional distributions of the stochastic process. These processes will be shown, in a few applications, either to perform tasks that could not be done previously or to perform them much more efficiently, such as non-Gaussian option pricing, general multivariate stable spatial processes, and non-Gaussian spatio-temporal dynamic modeling.
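A minimal translation-process construction is sketched below: simulate a Gaussian process with a chosen correlation structure, push it through the standard normal CDF, and then apply the quantile function of the target marginal (here a gamma). The correlation function and marginal are arbitrary illustrations, not the applications named above.

import numpy as np
from scipy.stats import norm, gamma

rng = np.random.default_rng(13)
t = np.linspace(0, 1, 200)

# Gaussian process with exponential correlation on [0, 1]
corr = np.exp(-np.abs(t[:, None] - t[None, :]) / 0.1)
z = np.linalg.cholesky(corr + 1e-10 * np.eye(len(t))) @ rng.normal(size=len(t))

# Translation: uniformize the Gaussian margins, then apply the target quantile function
u = norm.cdf(z)
x = gamma.ppf(u, a=2.0, scale=1.5)       # a process with gamma(2, 1.5) marginal distributions
print(x[:5])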
Website:
Dr. Richardson's Website
-
Toggle Item2018-02-15 - Jeffery Tessem - How to make more beta cells: exploring molecular pathways that increase functional beta cell mass as a cure for Type 1 and Type 2 diabetes
Presenter:
Dr. Jeffery S Tessem
Title:
How to make more beta cells: exploring molecular pathways that increase functional beta cell mass as a cure for Type 1 and Type 2 diabetes
Affiliation:
Department of Nutrition, Dietetics and Food Science at BYU
Date:
Feb 15, 2018
Abstract:
Both Type 1 (T1D) and Type 2 diabetes (T2D) are caused by a relative insufficiency in functional β-cell mass. Current therapeutic options for diabetes include daily insulin injections to maintain normoglycemia, pharmacological agents to stimulate β-cell function and enhance insulin sensitivity, and islet transplantation. A major obstacle to greater application of islet transplantation therapy is the scarcity of human islets. Thus, new methods for expansion of β-cell mass, applied in vitro to generate the large numbers of human islet cells needed for transplantation, or in situ to induce expansion of the patient's remaining β-cells, could have broad therapeutic implications for this disease. To this end, our lab is interested in delineating the molecular pathways that increase β-cell proliferation, enhance glucose-stimulated insulin secretion, and protect against β-cell death.
Website:
Dr. Tessem's Website
-
Toggle Item2018-02-08 - Chris Groendyke - Bayesian Inference for Contact Network Models using Epidemic Data
Presenter:
Chris Groendyke
Title:
Bayesian Inference for Contact Network Models using Epidemic Data
Affiliation:
Robert Morris University
Date:
Feb 8, 2018
Abstract:
I will discuss how network models can be used to study the spread of epidemics through a population, and in turn what epidemics can tell us about the structure of that population. I apply a Bayesian methodology to data from a disease presumed to have spread across a contact network in a population in order to perform inference on the parameters of the underlying network and disease models. Using a simulation study, I will discuss the strengths, weaknesses, and limitations of these models, as well as the data required for this type of inference. Finally, I will describe an analysis of an actual measles epidemic that spread through the town of Hagelloch, Germany, in 1861, and share the conclusions it allows us to make regarding the population structure.
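As a toy version of the forward model that underlies this kind of inference, the sketch below simulates a discrete-time stochastic SIR epidemic over an Erdős–Rényi contact network. The network size, edge probability, and per-step infection and recovery probabilities are made-up illustrative values; an actual analysis such as the Hagelloch study would place a simulator or likelihood of this kind inside a Bayesian sampler for the network and disease parameters.

    import numpy as np
    import networkx as nx

    rng = np.random.default_rng(1)

    # Contact network and epidemic parameters (illustrative values only).
    n, edge_prob = 200, 0.03
    infect_prob, recover_prob = 0.4, 0.2     # per-contact and per-step probabilities

    G = nx.erdos_renyi_graph(n, edge_prob, seed=1)
    status = {v: "S" for v in G}             # S(usceptible), I(nfected), R(ecovered)
    status[0] = "I"                          # index case

    history = []
    while any(s == "I" for s in status.values()):
        infected = [v for v, s in status.items() if s == "I"]
        # Each infected node tries to infect susceptible neighbors, then may recover.
        for v in infected:
            for w in G.neighbors(v):
                if status[w] == "S" and rng.random() < infect_prob:
                    status[w] = "I"
            if rng.random() < recover_prob:
                status[v] = "R"
        history.append(sum(s == "I" for s in status.values()))

    print("epidemic lasted", len(history), "steps; final size:",
          sum(s == "R" for s in status.values()))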
Website:
Chris's Website
-
Toggle Item2018-02-01 - Larry Baxter - Structure in Prior PDFs and Its Effect on Bayesian Analysis
Presenter:
Larry Baxter
Title:
Structure in Prior PDFs and Its Effect on Bayesian Analysis
Affiliation:
BYU
Date:
Feb 1, 2018
Abstract:
Bayesian statistics formalizes a procedure for combining established (prior) statistical knowledge with current knowledge to produce a posterior statistical description that presumably is better than either the prior or the new knowledge by itself. Two common applications of this theory involve (a) combining established (literature) estimates of model parameters with new data to produce better parameter estimates, and (b) estimating model prediction confidence bands. Frequently, the prior information includes reasonable parameter estimates, poorly quantified and often subjective parameter uncertainty estimates, and no information regarding how the values of one parameter affect the confidence intervals of other parameters. All three of these parameter characteristics affect Bayesian analysis. The first two receive a great deal of attention. The third characteristic, the dependence of model parameters on one another, creates structure in the prior pdfs. This structure strongly influences Bayesian results, often to an extent that rivals or surpasses the influence of the parameter uncertainty estimates themselves. Nevertheless, Bayesian analyses commonly ignore this structure.
This structure stems primarily from the form of the model and, in linear models, does not depend on the observations themselves. Most models produce correlated parameters when applied to real-world engineering and science data. The most common example of structure is parameter correlation. Linear models produce linear parameter correlations that depend on the magnitude of the independent variable under analysis but that in most practical applications are large, with correlation coefficients often close to unity. Nonlinear models also generally have correlated parameters. However, the correlations can be nonlinear, even discontinuous, and generally involve more complexity than linear model parameter correlations. Parameter correlations profoundly affect the results of Bayesian parameter estimation and prediction uncertainty. Properly incorporated structure produces Bayesian results that powerfully illustrate the strength and potential contribution of the theory. Bayesian analyses that ignore such structure produce poor or even nonsensical results, often significantly worse than a superficial guess.
This seminar demonstrates the importance of prior structure in both parameter estimation and uncertainty quantification using real data from typical engineering systems. Perhaps most importantly, the discussion illustrates methods for incorporating the parameter structure implied by any given model without relying on the observations. These methods quantify parameter structure, including the lack of structure, for linear and nonlinear models.
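A small numerical illustration of the kind of structure described above, using synthetic data rather than anything from the talk: for a straight-line model fit to x-values far from the origin, the intercept and slope estimates are almost perfectly negatively correlated, which can be read directly from the usual covariance matrix sigma^2 (X'X)^(-1).

    import numpy as np

    rng = np.random.default_rng(2)

    # Straight-line model y = b0 + b1 * x with x-values far from the origin.
    x = np.linspace(100.0, 110.0, 50)
    X = np.column_stack([np.ones_like(x), x])
    y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

    # Ordinary least squares and the usual covariance of the estimates.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)

    corr = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
    print("intercept/slope correlation:", corr)   # very close to -1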
Website:
Larry's Website
-
Toggle Item2018-01-18 - Brad Barney - Growing Curve Methodology with Application to Neonatal Growth Curves
Presenter:
Brad Barney
Title:
Growing Curve Methodology with Application to Neonatal Growth Curves
Affiliation:
BYU
Date:
Jan 18, 2018
Abstract:
As part of postnatal care, newborns are routinely monitored to assess the stability and adequacy of their growth. Interest lies in learning about typical postnatal growth, especially for preterm infants. We briefly consider some general methodological strategies currently employed to parsimoniously construct growth curves for use in medical practice. We present original results using existing methodology known as generalized additive models for location, scale and shape (GAMLSS). We also expand existing methodology on the Bayesian analogue of GAMLSS, known as structured additive distributional regression. In particular, we hierarchically model weight and length jointly, from which we are able to induce a time-varying distribution for Body Mass Index.
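For readers unfamiliar with GAMLSS, its general form (written here in generic notation rather than the authors') models every parameter of the response distribution, not just the mean, through its own link function and additive predictor:

    y_i \sim \mathcal{D}(\mu_i, \sigma_i, \nu_i, \tau_i), \qquad g_k(\theta_{k,i}) = \mathbf{x}_{k,i}^{\top} \boldsymbol{\beta}_k + \sum_j f_{j,k}(z_{j,k,i}), \quad k = 1, \dots, 4,

where \theta_1 = \mu (location), \theta_2 = \sigma (scale), \theta_3 = \nu and \theta_4 = \tau (shape), and the f_{j,k} are smooth or random-effect terms.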
Co-Authors:
Adrienne Williamson, Josip Derado, Gregory Saunders, Irene Olsen, Reese Clark, Louise Lawson, Garritt Page, and Miguel de Carvalho
Website:
Brad's page