Methods for Variable Selection and Treatment Effect Estimation in Nonrandomized Studies with Few Outcome Events and Many Confounders


Structured Abstract

Background:

Nonrandomized studies of comparative effectiveness and safety evaluate treatments as used in routine care by diverse patient populations and are therefore critical for producing the information necessary for making patient-centered treatment decisions. Most nonrandomized studies based on electronic health care data use a propensity score (PS) to control for hundreds of measured covariates and to estimate the causal effect of treatment. Even in large studies, high-dimensional confounder control can lead to problems in causal inference due to unstable estimation of the PS model or inappropriate use of observations with extreme PS values. In studies with few outcome events, however, each observed event is highly influential, and these potential problems are exacerbated.

Objective:

To evaluate and improve analytic strategies for nonrandomized studies with many confounders and few outcome events, including methods for variable selection for the PS in health care database analyses and methods for treatment effect estimation based on the PS.

Methods:

In the first simulation study, we compared the high-dimensional PS algorithm for variable selection with approaches that utilize direct adjustment for all potential confounders via regularized regression, including ridge regression and least absolute shrinkage and selection operator (lasso) regression. In the second simulation study, we compared a wide variety of propensity-based estimators of the marginal relative risk. In contrast to prior research that has focused on specific statistical methods in isolation of other analytic choices, we instead considered a method to be defined by the complete multistep process from PS modeling to final treatment effect estimation. PS model estimation methods considered included ordinary logistic regression, Bayesian logistic regression, lasso, and boosted regression trees. Methods for utilizing the PS included pair matching, full matching, decile strata, fine strata, regression adjustment using 1 or 2 nonlinear splines, inverse propensity weighting, and matching weights. In each study, we based simulations on 2 previously published pharmacoepidemiologic cohorts and used the plasmode simulation framework to create realistic simulated data sets with many potential confounders, and we evaluated performance of methods with respect to bias and mean squared error of the estimated effects.

Results:

In the first set of simulations, high-dimensional PS approaches generally performed better than regularized regression approaches. However, simulations that included the variables selected by lasso regression in a regular PS model also performed well. In the second set of simulations, regression adjustment for the PS and matching weights provided lower bias and mean squared error in the context of rare binary outcomes.

Conclusions:

Some automated analysis approaches can provide highly robust treatment effect estimates across a wide variety of scenarios. Therefore, their use in nonrandomized PCOR studies from administrative health care databases would be expected to improve treatment effect estimates and eventually result in better treatment decision making by patients and providers.

Limitations and Subpopulation Considerations:

All simulation results are specific to the data-generating mechanisms studied. Although we attempted to explore a wide variety of realistic scenarios, simulation scenarios built on other base data sets could result in different conclusions.

Background

Nonrandomized studies of health care interventions and treatments are critical for producing the information necessary for making patient-centered treatment decisions. Nonrandomized studies evaluate treatments as used in routine patient care by populations that are often excluded from randomized studies, such as children, the elderly, women of child-bearing age, or patients with many comorbidities or comedications.1,2 In addition, nonrandomized studies can answer questions of comparative effectiveness and safety that are of direct interest to patients and providers faced with a treatment decision; these questions are often not answered in randomized trials, where treatments may be compared against a placebo.3

The evidence produced with nonrandomized studies is particularly important early in the introduction of a new treatment to routine care, when patients, physicians, regulators, and payers are all struggling to make decisions in spite of incomplete data.4 During this phase, patients are exposed to treatments without a full understanding of their benefits and risks, and health plans pay for treatments without knowing how their effectiveness compares with that of alternatives.5 Large health care databases, including claims and/or electronic medical records, provide an opportunity to quickly and efficiently evaluate new treatments and compare them with existing alternatives.6 Like all nonrandomized studies, these evaluations require control of likely confounding of the treatment–outcome association, but the number of patients receiving the new treatment limits the study size and, thus, the number of observed outcome events.

This scenario results in the analytical challenge of drawing inference with few outcome events while adjusting for many potential confounders.

Despite extensive research into methods for confounding control, the performance of analytic approaches in this scenario has received almost no attention. Therefore, a key requirement for progress in the expedited assessment of treatments is the development of methods that can make maximum use of few outcome events while adjusting for many measured confounders. This methodology is needed not only for studies of new treatments, but also for other study scenarios with few outcomes that are of interest to PCOR researchers, including assessments of treatment effects in patient subgroups and studies of treatments for rare diseases.

To learn about the causal effects of treatments from nonrandomized studies, alternative treatment strategies must be compared in patients who are similar with respect to their baseline risk of outcome. Equitable comparisons are achieved by measuring and adjusting for factors that influence outcome through 1 of several available methods, including matching, stratification, regression, or weighting directly on covariates7-9 or on a confounder score, such as the propensity score10-13 or a disease risk score.14 Currently, the majority of nonrandomized comparative effectiveness studies control confounding via the PS,15 defined as the probability of treatment given observed covariates. Modeling the PS provides distinct advantages over other methods of adjustment, such as multivariate modeling of the outcome, when there are many confounding factors and few outcome events, because the PS is able to collapse many measured variables down to a single summary score to be used for adjustment and because it requires modeling the treatment mechanism rather than the outcome mechanism, implying improved estimation precision when there are many covariates and few outcomes.

PS methodology also easily accommodates trimming—that is, removing from the analysis patients that are not comparable. Trimming patients with PS values outside the region of overlap between treated and control patients results in better covariate balance and improved validity of treatment effect estimates16 because it restricts comparisons to patients who are similar with respect to factors that influence the outcome. Therefore, trimming is generally recommended, and can be implemented regardless of the approach used for treatment effect estimation. However, when there are few outcome events, analytic choices that affect the number and makeup of the trimmed patients can cause treatment effect estimates to vary dramatically, as the few patients who had an event are forced in and out of the analysis. These analytic choices primarily occur at 2 stages: (1) estimation of the PS model and (2) application of the estimated PS to estimate treatment effects. These choices may have relatively low impact on the final effect estimates when the outcome is continuous or a frequent binary event,17 but they are crucial in studies with few events, where each observed event is highly influential. Given how important studies with few outcome events are to patient-centered research, the acute consequences of analytic choices in such studies, and the paucity of literature on methods in this context, further study is urgently needed.

The PS is usually estimated with a logistic regression model for treatment that includes measured covariates—potentially hundreds of variables. Even with large data sets, these complex models can fail to converge to the maximum likelihood solution if some covariate patterns occur in only 1 treatment group,18 which becomes more likely as the number of binary covariates increases. Although the principles behind confounder selection are now well established, in practice, variable selection for confounder adjustment remains a difficult problem.19 Specifically, it is known that adjusting for all confounders will eliminate bias (assuming all confounders have been measured), but additionally adjusting for predictors of outcome that are unrelated to treatment will lead to estimates with lower variance.20,21 Furthermore, instruments and near-instruments—variables strongly associated with treatment but unrelated or only weakly associated with outcome—should be avoided, because adjusting for these variables can increase variance and amplify bias from unmeasured confounders.22,23 However, in retrospective database studies with hundreds or thousands of measured covariates, investigators rarely know a priori which variables are confounders vs instruments, and instrumental variables cannot be verified empirically.24

The high-dimensional PS (hdPS) algorithm was proposed as a solution to this problem in studies of treatment effects in health care claims databases.25,26 This algorithm uses empirical assessments of variables' prevalence and associations with exposure and outcome to screen thousands of unique diagnoses, procedures, and medications recorded in claims. Variables are then ranked with respect to their potential for confounding, and investigators can select the highest ranked variables for inclusion in a PS model. Example studies25,26 and a small simulation study27 have shown hdPS to be effective for removing bias in comparative effectiveness studies, but this approach has never been compared with alternative approaches for confounder selection in a high-dimensional covariate space.

One such alternative, originally suggested by Greenland28 and recently applied to pharmacoepidemiology, is regularized regression. In this approach, no variable selection is necessary; all potential confounders are included in a regression model of outcome on treatment. Because covariates are adjusted for directly in the outcome model, instrumental variables should have estimated coefficients near zero, limiting potential bias amplification.14 In order to accommodate many potential confounders when there may be relatively few observed outcome events, model estimation penalizes large, imprecise coefficient estimates and shrinks them toward zero, thereby reducing the overall variability in the model.29 Despite these desirable properties, regularized regression was not originally developed for confounder adjustment, and its performance in this context has not been studied. Evaluations of alternative PS estimation methods have all agreed that they hold great promise for improving the validity of nonrandomized studies but that extensive simulation studies are needed before their use can be recommended. So far, these methods have not been widely used in comparative effectiveness research.

Once the PS is estimated, an investigator must choose among PS matching, stratification, regression, weighting, or some combination of these approaches for estimating the treatment effect. In specific scenarios and for specific target parameters, any of these approaches can yield an unbiased treatment effect estimate,12,13,30 and some literature reviews have found approximately equivalent results across strategies.31,32 However, when there are few outcome events, the different methods of trimming employed across these approaches can cause vast differences in treatment effect estimation.33 For example, trimming must be done explicitly prior to regression or weighting on the PS, while noncomparable patients are trimmed implicitly when matching on the PS using a caliper or when using fine stratification on the PS. In studies with few outcome events, 1:1 matching with a caliper—a method commonly used in published comparative effectiveness research—can yield particularly poor results, as many patients and, thus, many outcome events may be discarded unnecessarily.34

Direct comparisons of state-of-the-art implementations of these approaches are needed to optimize the analysis of few outcomes, but most prior work has focused on naïve implementations in the context of continuous or frequent binary outcomes. Furthermore, prior comparisons have not included evaluations of trimming and other practical choices as part of the overall analytic approach, even though these choices are critical to its success. For example, weighting on the PS was shown in simulation to be least biased for estimation of the marginal odds ratio and relative risk,35 but weighting can be problematic if there is poor overlap in the PS distributions,36 which occurs whenever there is highly discriminating treatment assignment, a common scenario when studying new treatments.37 Simulations have generally assumed perfect overlap and no trimming. New implementations of regression on the PS can estimate the marginal odds ratio using semiparametric regression splines37,38 and multiple imputation.38 These methods avoid problems of earlier implementations due to model misspecification39 or due to the “noncollapsibility” of the odds ratio—that is, the fact that marginal and conditional odds ratios are not identical.40,41 However, these methods have not been evaluated in the context of few outcomes, where semiparametric regression will be more challenging.

We sought to compare the hdPS algorithm with regularized outcome regression models for estimation of the effect of a binary treatment. We also describe and compare a wide variety of propensity-adjusted estimators of the marginal relative risk (RR). We evaluated these methods in a “plasmode” simulation, which creates simulated data sets based on a real empirical cohort study.27 This approach preserves the number and type of covariates observed in the real study, as well as the complex correlation structure among covariates and exposure, allowing for the first large-scale evaluation of these methods in realistic simulated data. This work will yield valuable guidance for PCOR of new therapies, in subgroup analyses, and for rare diseases.

Participation of Patients and Other Stakeholders in the Design and Conduct of Research and Dissemination of Findings

In this grant, we studied innovative methods for optimizing the analytic approach for nonrandomized studies with a high-dimensional covariate space and few events. This research is highly relevant to patients because it will improve the accuracy (both validity and precision) of treatment effect estimates from nonrandomized studies with few events, thereby strengthening the evidence about treatments that can be gained and subsequently communicated to patients for making treatment decisions. Improving accuracy from nonrandomized comparisons is crucial, since many questions of great importance to patients will never be evaluated in a resource-intensive randomized controlled trial. Furthermore, randomized studies often exclude complex patients with many comorbidities or comedications, who may have a systematically different response to treatment. Therefore, nonrandomized studies often provide the only source of information on some treatment questions or for some patients.

Because nonrandomized comparisons can be completed much more quickly and efficiently than randomized trials, they are particularly useful for studying treatments with low uptake in the population, such as new treatments and treatments for rare diseases. They can also provide evidence on treatment effects in patient subgroups, which is often lacking in randomized trials due to limited study size. However, all of these scenarios that make nonrandomized studies most useful ultimately restrict the size of the potential study and result in studies with few events. In addition, these studies must control for all confounders of the treatment–outcome association in order to provide accurate comparisons of health effects that are relevant to patients. Thus, analytic approaches that allow for the control of many confounders while making optimal use of available outcome information are a high priority for providing the best possible evidence to patients, but these approaches have not previously been a focus of methods research and are in need of improvement.

In interviews with patient stakeholders that we undertook in preparing the proposal for this research, patients noted many of these issues as important to them, the ultimate consumers of PCOR. For example, several patients expressed frustration with past research findings that were later discredited, such as the false claims of cardio-protective effects of female hormone therapy. Patients whom we interviewed put study validity above any other aspect of PCOR. However, despite concerns about validity, patients also conveyed the importance of nonrandomized comparisons. For example, Lisa Freeman (patient advocate and caregiver) said, “I look at research to see that it captures a [population] in a meaningful way. Don't study a disease that's prevalent in a minority group on Caucasians! Don't study problems in elderly groups on healthy young people!” Tanya Lord (caregiver and researcher) said, “I know that randomized control trials are the gold standard. It's about finding that perfect population … Studies that remove certain populations skew the results and risk not being generalizable.”

Providing patients with answers to questions such as “Given my personal characteristics, conditions, and preferences, what should I expect will happen to me?” or “What are my options and what are the potential benefits and harms of those options?” means giving them accurate estimates of the differing health effects expected from the available treatment plans. For a large and important set of treatment questions and for a growing number of patients, these estimates can only be derived from a nonrandomized study with a high-dimensional covariate space and few events, but analytic approaches in these studies have simply not received the attention merited. Strengthening the analytic approach will ultimately lead to better evidence on treatments, allowing for improved health care decision making by patients and providers and improved patient outcomes.

The Division of Pharmacoepidemiology Patient Advisory Board is an active organization of patients who demonstrate above average knowledge of navigating health care systems, as well as an interest in treatment safety and efficacy. The board was established to guide the Division of Pharmacoepidemiology's research toward outcomes that are patient centered. Meeting each quarter, members assist investigators by identifying key problems they encounter in health care delivery, advising on the most important questions of interest in comparative effectiveness research, acting as a sounding board for consumer-facing study materials, representing the patient's voice in all stages of research conduct, and helping to ensure that key findings are disseminated in ways that are accessible to patients and families.

During the grant period, Dr. Franklin met with the patient advisory group and gave them background material on simulation studies and PS adjustment—2 methods that feature prominently in the analyses—and then provided an overview of the project and the plans for the coming months. She received numerous questions and comments from the group. Although their ability to contribute to methodology studies is somewhat limited, all of the patients enjoyed learning about methods and indicated that this new knowledge would make them more informed research consumers and advisors in the future.

Methods

The objective of this project was to evaluate and improve analytic strategies for nonrandomized studies with few outcome events and many potential confounders, which would subsequently improve the ability to evaluate treatments quickly after they are made available, evaluate treatment effects in patient subgroups, and evaluate treatments for rare diseases. Specifically, we conducted 2 extensive simulation studies evaluating these different methods. In the first simulation study (see the “Simulation Study 1” section in the Methods), we compare the hdPS algorithm for variable selection with approaches that utilize direct adjustment for all potential confounders via regularized regression, including ridge regression and least absolute shrinkage and selection operator (lasso) regression.42 In the second simulation study (see the “Simulation Study 2” section in the Methods), we describe and compare a wide variety of propensity-adjusted estimators of the marginal relative risk.43

In both cases, we use multiple simulated data sets, rather than real study data sets, to assess the performance of methods because the simulated data sets are generated in such a way as to have a known true treatment effect that is the target of estimation. Comparing our estimates to this known truth allows us to assess the accuracy of the estimation approach in a way that is not possible in real data, where the true treatment effect is unknown.

Specifically, we assess bias as the average difference between the estimated and true treatment effects across multiple simulated data sets. In addition, using multiple simulated data sets allows us to assess the variability in treatment effect estimates across random samples (the precision) in a way that is not possible with a single real data set. We seek methods that can produce estimates of treatment effect with low bias and low variability, as these estimates lead to the most accurate treatment decision making. Therefore, bias, precision, and their combination (mean squared error) were the primary performance metrics used to compare methods in both simulation studies.
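To make these metrics concrete, the following is a minimal R sketch of how bias, variance, and mean squared error might be computed from a set of simulation estimates; the object names (`estimates`, `truth`) are hypothetical and not taken from the authors' code.

```r
# Minimal sketch: performance metrics across simulation replicates (illustrative only).
# `estimates` is a hypothetical vector of treatment effect estimates, one per
# simulated data set; `truth` is the known true effect used to generate the data.
performance <- function(estimates, truth) {
  c(bias     = mean(estimates) - truth,       # average deviation from the truth
    variance = var(estimates),                # variability across random samples
    mse      = mean((estimates - truth)^2))   # combines bias and variance
}
```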

Simulation Study 1

Empirical Data

Nonsteroidal anti-inflammatory drugs

We based simulations on 2 previously published cohort studies carried out in claims data. The first example comes from a study of 49 653 patients initiating a nonsteroidal anti-inflammatory drug (NSAID) during 1999-2002.44 Study patients included Medicare beneficiaries 65 years of age and older who were enrolled in the Pharmaceutical Assistance Contract for the Elderly program provided by the state of Pennsylvania. Exposure was classified as either a COX-2 inhibitor (32 042 exposed) or a nonselective NSAID (unexposed). Patients were followed for 180 days after the initiation of therapy for severe gastrointestinal complications, with 367 and 185 events observed in exposed and unexposed patients, respectively.

Anticonvulsants

The second study included 166 031 patients 40 to 64 years old from the HealthCore Integrated Research Database who had initiated an anticonvulsant medication between 2001 and 2006.34 Anticonvulsant exposure was classified as “highly inducing,” meaning anticonvulsants that highly induce cytochrome P450 enzyme system activity, which may contribute to increased cardiovascular risk (12 580 exposed), vs regular anticonvulsants that do not have this property (unexposed). Patients were followed for cardiovascular hospitalization or death for 90 days following therapy initiation, including 68 exposed and 496 unexposed events. Prior analyses have indicated potential problems with the hdPS approach to variable selection and covariate adjustment in these data, and we therefore chose this cohort to provide a challenging data set for variable selection.

Simulation Setup

In order to create simulated data sets from these example studies, we used the “plasmode” simulation framework.27 We began by estimating a logistic regression model for the observed study outcome as a function of the exposure indicator, demographics, and a subset of the potential confounders measured from claims during the 6 months prior to exposure initiation (variables included are described in Appendix A). This estimated model served as the basis for subsequent simulated outcomes, as described below. The variables (not including exposure) that entered the outcome-generating model are referred to as the true confounders, because all these variables influence outcome and most are also associated with exposure.

To create a simulated data set, we sampled exposed and unexposed patients with replacement from the original study to achieve the desired study size and prevalence of exposure. We used the covariate and exposure data for each patient without modification, so that associations among these variables remained intact in the sampled population. Next, we used the estimated model for outcome as the outcome-generating model, but we replaced the estimated coefficient on exposure with a desired log odds ratio (OR) treatment effect, and we specified the intercept value to set the prevalence of outcome. We also multiplied the value of all other model coefficients by 1.1 to increase the total amount of confounding in the simulated data. We applied this outcome-generating model to the exposure and covariate data of sampled patients to calculate the probability of outcome, which we used to generate a random binary outcome status for each patient. We repeated this process, beginning with patient sampling, 500 times to yield 500 simulated data sets in each simulation scenario.
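As an illustration of this step, the following is a simplified R sketch of one plasmode replicate under the process just described; the object names (`cohort`, `out_mod`, the exposure column `z`) and implementation details are assumptions rather than the authors' actual code.

```r
# Simplified sketch of one plasmode replicate (object names and details are assumptions).
simulate_outcome <- function(cohort, out_mod, n, p_exp, log_or, intercept, conf_mult = 1.1) {
  # Resample exposed and unexposed patients separately to fix the exposure prevalence;
  # covariate and exposure data are carried over unchanged.
  exposed   <- cohort[cohort$z == 1, ]
  unexposed <- cohort[cohort$z == 0, ]
  samp <- rbind(exposed[sample(nrow(exposed), round(n * p_exp), replace = TRUE), ],
                unexposed[sample(nrow(unexposed), round(n * (1 - p_exp)), replace = TRUE), ])
  # Modify the estimated outcome-generating coefficients
  beta <- coef(out_mod)
  keep <- setdiff(names(beta), c("(Intercept)", "z"))
  beta[keep] <- conf_mult * beta[keep]            # inflate confounding
  beta["(Intercept)"] <- intercept                # sets the outcome prevalence
  beta["z"] <- log_or                             # true conditional log OR for exposure
  # Probability of outcome and simulated binary outcome for each sampled patient
  X <- model.matrix(formula(out_mod), data = samp)
  samp$y_sim <- rbinom(nrow(samp), 1, plogis(as.vector(X %*% beta)))
  samp
}
# Calling this function 500 times, one call per replicate, yields the simulated data sets.
```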

Figure 1 depicts the resulting causal structure of this simulation process.

Figure 1. Diagram Depicting Causal Relationships in Simulated Data.


Simulation Scenarios

We explored 7 simulation scenarios in each empirical study, all with a study size of 30 000 patients (Table 1).

Table 1. Parameters for Simulation Scenarios Explored in Each Data Set in the First Simulation Study.


Variable Creation

In each simulated data set, we created a pool of several thousand potential confounders from the available claims information as described below; these variables were then available to each method under study. In order to use the hdPS variable selection algorithm, all variables under consideration must be binary, so we used the variable creation portion of the hdPS algorithm to create binary variables from the thousands of diagnoses, procedures, and medications in the simulated data. Briefly, hdPS variable creation identifies the top 200 most prevalent diagnosis, procedure, or medication codes from each data dimension (inpatient claims, outpatient claims, drug claims, etc) and creates 3 binary variables from each code selected, indicating the frequency with which the code was observed for each patient relative to the reference of the code never being observed. For example, if the ICD-9-CM diagnosis code for diabetes (code 250) is selected as a prevalent code, the hdPS variable creation algorithm creates 3 binary variables based on the frequency of this code: (1) at least 1 instance of code 250, (2) code 250 appearing > median number of times across patients, and (3) code 250 appearing > 75th percentile number of times across patients. Thus, 600 (200 × 3) variables are created per data dimension, resulting in 4800 variables in the NSAID data and 3000 in the anticonvulsant data. In each simulated data set, some of these variables were constant, reducing the number of potential confounders. Although these variables did not enter the outcome-generating model directly, they may be thought of as proxies for the true confounders, which were similarly based on frequencies of claims for specific codes or medications.
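The following simplified R sketch illustrates this variable-creation step for a single data dimension; the input format (a long-format `claims` table with columns `id` and `code`) and the variable naming are assumptions, and the published hdPS software differs in implementation details.

```r
# Simplified sketch of hdPS-style variable creation for one data dimension.
# `claims` has columns `id` (patient) and `code`; `ids` lists all cohort patients.
create_hdps_vars <- function(claims, ids, n_codes = 200) {
  counts <- table(factor(claims$id, levels = ids), claims$code)   # patient-by-code frequency
  prevalent <- names(sort(colSums(counts > 0), decreasing = TRUE))[seq_len(n_codes)]
  out <- data.frame(id = ids)
  for (cd in prevalent) {
    freq <- as.vector(counts[, cd])
    nonzero <- freq[freq > 0]
    out[[paste0(cd, "_ge1")]]       <- as.integer(freq >= 1)                      # seen at least once
    out[[paste0(cd, "_gt_median")]] <- as.integer(freq > median(nonzero))         # more often than median
    out[[paste0(cd, "_gt_p75")]]    <- as.integer(freq > quantile(nonzero, 0.75)) # more often than 75th percentile
  }
  out   # 3 binary variables per prevalent code, i.e., up to 600 per data dimension
}
```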

High-Dimensional Propensity Score

We used several variations of hdPS variable selection in each simulated data set. In exposure-based variable selection, ranking of potential confounders is based solely on the magnitude of their RR association with exposure.26 In bias-based variable selection, ranking is based on the Bross formula for bias of a binary confounder, which depends on a variable's association with exposure, association with outcome, and prevalence.45 We used both potential rankings and selected the top 30, the top 250, and the top 500 variables for inclusion in the logistic regression PS model for exposure, leading to 6 distinct models. Each PS model also contained available demographic variables. To estimate treatment effect, we fit a logistic regression model for the simulated outcome that included indicators for the exposure and deciles of the PS as predictors in the model.30,35
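A rough R sketch of the two ranking strategies and the subsequent decile-of-PS adjustment is given below. The Bross-type bias term follows the form commonly described for hdPS, but the variable names, small-count safeguards, and thresholds are illustrative assumptions.

```r
# Sketch of hdPS-style variable ranking (illustrative; names and safeguards are assumptions).
# `covs` is a 0/1 matrix of candidate variables; `z` is exposure; `y` is the simulated outcome.
rank_hdps <- function(covs, z, y, method = c("bias", "exposure")) {
  method <- match.arg(method)
  pc1 <- colMeans(covs[z == 1, , drop = FALSE])     # prevalence among exposed
  pc0 <- colMeans(covs[z == 0, , drop = FALSE])     # prevalence among unexposed
  rr_ce <- pmax(pc1, 1e-6) / pmax(pc0, 1e-6)        # covariate-exposure relative risk
  rr_cd <- apply(covs, 2, function(x)               # covariate-outcome relative risk
    max(mean(y[x == 1]), 1e-6) / max(mean(y[x == 0]), 1e-6))
  bross <- (pc1 * (rr_cd - 1) + 1) / (pc0 * (rr_cd - 1) + 1)   # Bross-type bias term
  score <- if (method == "bias") abs(log(bross)) else abs(log(rr_ce))
  names(sort(score, decreasing = TRUE))             # variables ordered by potential for confounding
}

# Example use (commented): top 500 bias-ranked variables in a logistic PS, then decile adjustment
# top <- rank_hdps(covs, z, y, "bias")[1:500]
# ps  <- fitted(glm(z ~ covs[, top], family = binomial))
# fit <- glm(y ~ z + cut(ps, quantile(ps, 0:10 / 10), include.lowest = TRUE), family = binomial)
```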

Regularization Approaches

We considered 2 approaches to regularized regression. In each approach, all potential confounders were included in a logistic model for the simulated outcome, along with demographics and an indicator of exposure. This model was then estimated by penalized maximum likelihood estimation, where a penalty term for the magnitude of the regression coefficients is added to the log-likelihood function in order to shrink very large, noisy coefficients toward zero. Specifically, in ridge regression,46 a common approach to regularization, the penalty is given by:

$$\lambda \sum_{p=1}^{P} \beta_p^{2}$$

where $\lambda$ is the penalty parameter, indicating the amount of shrinkage to be applied; $\beta_p$ is the $p$th model coefficient; and $P$ is the total number of coefficients to be shrunk. Similarly, in lasso regression,47 the penalty is given by:

$$\lambda \sum_{p=1}^{P} \left|\beta_p\right|$$

where the bars indicate the absolute value.

In both methods, independent variables, including all covariates used for adjustment, are standardized prior to model estimation to ensure shrinkage is applied evenly across variables, and cross-validation is used to select the value of λ that minimizes the model deviance.48,49 However, the 2 methods can produce very different estimated coefficients. Unlike ridge regression, lasso regression may shrink some estimated coefficients all the way to zero, effectively eliminating the corresponding variables from the model. Thus, the lasso method provides not only regularization, but also another method for variable selection. When implementing these methods in the simulated data, shrinkage was applied to all model coefficients except the coefficient on the exposure indicator.
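A minimal sketch of this estimation strategy using the glmnet package in R is shown below; leaving the exposure coefficient unpenalized is handled through the penalty.factor argument, and the object names are assumptions.

```r
# Minimal sketch: ridge or lasso outcome regression with an unpenalized exposure term (glmnet).
library(glmnet)

fit_regularized <- function(covs, z, y, alpha = 1) {      # alpha = 1 is lasso, alpha = 0 is ridge
  x  <- cbind(z = z, covs)                                # exposure plus all candidate confounders
  pf <- c(0, rep(1, ncol(covs)))                          # penalty.factor = 0 exempts exposure from shrinkage
  cv <- cv.glmnet(x, y, family = "binomial", alpha = alpha,
                  penalty.factor = pf, standardize = TRUE)   # deviance-based cross-validation for lambda
  glmnet(x, y, family = "binomial", alpha = alpha,
         penalty.factor = pf, standardize = TRUE, lambda = cv$lambda.min)
}
```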

Combination Approaches

In addition to the hdPS and regularization approaches discussed above, we considered several approaches that combined these methods to determine whether bias and precision could be improved over the basic implementations. First, we estimated outcome models using both ridge and lasso regression that included demographics, an indicator for exposure, and only the top 500 bias-based hdPS variables. Thus, the hdPS algorithm is used as a tool for prescreening variables prior to inclusion in a regularized outcome regression model.

Second, we identified variables selected by the original lasso model without prescreening. We considered a variable to be selected by lasso if its estimated coefficient in the final model was nonzero. We then estimated a logistic regression PS model using these variables only and adjusted for deciles of the PS in a regular logistic regression model for outcome, as in the hdPS approaches above.
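The following short R sketch illustrates this second combination, assuming `lasso_fit` is a fitted lasso outcome model (as in the sketch above) and `covs`, `z`, and `y` hold the candidate covariates, exposure, and simulated outcome; all names are hypothetical.

```r
# Sketch: carry lasso-selected variables into a conventional PS, then adjust for PS deciles.
b        <- as.matrix(coef(lasso_fit))[, 1]
selected <- setdiff(names(b)[b != 0], c("(Intercept)", "z"))       # nonzero lasso coefficients
ps        <- fitted(glm(z ~ covs[, selected], family = binomial))  # PS built from selected variables only
ps_decile <- cut(ps, quantile(ps, 0:10 / 10), include.lowest = TRUE)
out_fit   <- glm(y ~ z + ps_decile, family = binomial)             # decile-adjusted outcome model
```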

Treatment Effect Estimation

All methods described above result in a logistic regression model for outcomes that includes an indicator for patient exposure status and is adjusted either for deciles of an estimated PS or for individual confounders. The usual practice in such analyses is to use the estimated coefficient on exposure from these models as the estimated log OR treatment effect. However, because the OR is noncollapsible, these estimates are all biased for the conditional log OR chosen in the simulation setup, which is defined conditional on the specific set of variables used in the outcome-generating model. Therefore, to provide a fairer comparison of method performance in removing confounding bias and avoid the issue of noncollapsibility, we calculate a risk difference treatment effect from each outcome model by computing the counterfactual risk of outcome under treatment and under control for each patient and averaging the difference in these quantities across patients. This estimated risk difference can then be compared with the true risk difference for the simulation to accurately represent the bias associated with each variable selection approach.
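A minimal R sketch of this standardization step is shown below, assuming a fitted outcome model `out_fit` that contains the exposure indicator `z` and an analysis data frame `dat`; the names are hypothetical.

```r
# Sketch: model-standardized (counterfactual) risk difference from a fitted outcome model.
risk_difference <- function(out_fit, dat) {
  d1 <- transform(dat, z = 1)                  # everyone set to treated
  d0 <- transform(dat, z = 0)                  # everyone set to untreated
  p1 <- predict(out_fit, newdata = d1, type = "response")
  p0 <- predict(out_fit, newdata = d0, type = "response")
  mean(p1 - p0)                                # average counterfactual risk difference
}
```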

Simulation Study 2

Empirical Data

We performed a second simulation study to compare methods for estimating PS-adjusted treatment effects with respect to bias, variance, and mean squared error (MSE) across a range of data-generating scenarios. We based simulations on 2 previously published cohort studies carried out in health care claims data. Each example study was used independently to generate separate simulation scenarios.

The first example comes from a study of 18 475 patients initiating either dabigatran (the new treatment) or warfarin (control) from October 2010 (when dabigatran was first approved for sale in the United States) through June 2012.50 Patients were followed for stroke for 180 days after the initiation of anticoagulation therapy. There were 125 and 279 events observed on new and control therapies, respectively. The second study used the same anticonvulsant data set from simulation study 1.

In addition to the investigator-specified covariates that accompanied each study, we used the variable creation method in the hdPS algorithm to again create a large pool of potential covariates from the claims observed during the 6 months prior to treatment initiation.

Simulation Setup

To create simulated data sets from these example studies, we used a variation of the plasmode simulation framework.27 We began by selecting which variables, among the thousands of potential covariates created by the hdPS algorithm, would define the covariate space for the simulations. In each example study, we ranked the hdPS variables according to their likely contribution to confounding bias using the strength of associations with the observed exposure and outcome. In the anticoagulant study, we selected the 150 top-ranking hdPS variables along with 53 investigator-specified variables that were previously defined. The predefined variables included the demographic variables age, sex, and calendar year; 44 binary indicators of prior diagnoses, procedures, or medication use relevant to risk of stroke; and 6 continuous variables that summarize comorbidity or health system use, such as number of hospitalizations and a stroke risk score. In the anticonvulsant study, prior experience indicated that adjusting for a large number of hdPS covariates performs poorly with respect to bias and variance due to very large associations between some variables and exposure. Therefore, in this study, we selected the 50 top-ranking hdPS variables along with 36 predefined variables, including demographics; 30 binary indicators of prior diagnoses, procedures, or medications; and 3 continuous summary variables.

We then partitioned the covariate space into risk factors for the outcome that are not associated with exposure (XY, binary variables with a univariate association with treatment in the bottom 10%); instrumental variables that are associated with exposure, but not associated with outcome except through exposure (XZ, binary variables with a univariate association with outcome in the bottom 10%); and confounders associated with both exposure and outcome (XZY, all other variables). We made the partition based on observed associations of covariates with exposure and outcome. In the anticoagulant (OAC) study, this process resulted in 178 confounders, 12 instruments, and 13 risk factors for outcome. In the anticonvulsant study, there were 70 confounders, 8 instruments, and 8 risk factors for outcome. In both studies, all demographics and all continuous summary variables were in XZY, so we can further partition XZY into the binary confounders (XB) and the continuous confounders (XC). In addition, we selected 30 influential pairwise interactions of the confounders in the OAC study and 10 in the anticonvulsant (ACV) study, denoted XI.
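For illustration, a simplified R sketch of this partition based on univariate associations is given below; the specific association measures and the 10% cutoffs shown are assumptions intended only to convey the idea.

```r
# Sketch: partition candidate covariates by strength of univariate association with exposure and outcome.
partition_covariates <- function(covs, z, y) {
  assoc_z <- apply(covs, 2, function(x)        # association with exposure (log prevalence ratio)
    abs(log(max(mean(x[z == 1]), 1e-6) / max(mean(x[z == 0]), 1e-6))))
  assoc_y <- apply(covs, 2, function(x)        # association with outcome (log relative risk)
    abs(log(max(mean(y[x == 1]), 1e-6) / max(mean(y[x == 0]), 1e-6))))
  xy <- assoc_z <= quantile(assoc_z, 0.10)            # outcome risk factors: weakly related to exposure
  xz <- assoc_y <= quantile(assoc_y, 0.10) & !xy      # instruments: weakly related to outcome
  list(XY  = colnames(covs)[xy],
       XZ  = colnames(covs)[xz],
       XZY = colnames(covs)[!xy & !xz])               # confounders: everything else
}
```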

In order to create data-generating models for exposure and outcome based on the covariate partition, we estimated a logistic model for the observed indicator of exposure as a function of confounders and instruments:

$$\log\left\{\frac{\Pr(Z=1)}{\Pr(Z=0)}\right\} = \gamma_0 + \gamma_Z X_Z + \gamma_B X_B + f(X_C;\gamma_X) + \gamma_I X_I$$

Another logistic model was estimated for the observed outcome as a function of the exposure indicator, confounders, and risk factors:

$$\log\left\{\frac{\Pr(Y=1)}{\Pr(Y=0)}\right\} = \Phi_0 + \Phi_Z Z + \Phi_Y X_Y + \Phi_B X_B + f(X_C;\Phi_X) + \Phi_I X_I$$

In the OAC cohort, we estimated these models using the gam function in the mgcv package in R. Continuous variables were modeled nonlinearly with thin-plate splines using 2 degrees of freedom, and cross-validation was used to select the smoothing parameter. In the ACV study, we estimated these models using the bayesglm function with a Cauchy prior distribution in order to avoid convergence problems.

These models served as the basis for subsequent simulation of exposure and outcomes. To create a simulated data set, we sampled patients with replacement from the original study to achieve the desired study size. We used the covariate data for each patient without modification, so that associations among these variables remained intact in the sampled population. Next, we simulated exposure for the sampled patients from a Bernoulli distribution with probability defined by the estimated model for exposure. We also simulated 2 binary potential outcomes for each sampled patient: 1 assuming that the patient received the treatment of interest (setting Z = 1) and 1 assuming that the patient received the control treatment (Z = 0). In each case, the probability was defined by the estimated model for outcome. Depending on simulated exposure status, 1 of these 2 outcomes was used as the simulated outcome for analyses. The other was the counterfactual outcome. This process, beginning with patient sampling, was repeated 2500 times to yield 2500 simulated data sets in each simulation scenario. Figure 1 depicts the resulting causal structure of this simulation process.

In both the exposure- and outcome-generating models, we altered the intercept value to set the prevalence of exposure or outcome, and in some scenarios we multiplied all estimated covariate coefficients by a constant factor to increase the total amount of confounding in the simulated data, as Table 2 depicts. In the outcome model, we also modified the coefficient on treatment to define the true log OR treatment effect. To calculate the true marginal RR (referred to as the Sample Average Treatment Effect, or SATE), we averaged the treated and control potential outcomes separately and took the ratio. To calculate the true Sample Average Treatment Effect among the Treated (SATT), we averaged the 2 sets of potential outcomes among exposed patients only and took the ratio. In scenarios 1 through 6, there was a null treatment effect and no treatment effect heterogeneity, so the marginal RR should be 1, regardless of the population of interest. In scenario 7, we specified a heterogeneous treatment effect so that the true RR depends on the population of interest. In all cases, we calculated bias as the mean of the difference between the estimated and true marginal RR.
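Given the simulated exposure `z` and the two potential outcomes for each patient, the true marginal effects can be computed directly, as in this short R sketch (the names `y1` and `y0` are hypothetical).

```r
# Sketch: true marginal effects computed from the simulated potential outcomes.
# `y1` and `y0` are the binary potential outcomes under treatment and control; `z` is simulated exposure.
sate <- mean(y1) / mean(y0)                    # sample average treatment effect (marginal RR)
satt <- mean(y1[z == 1]) / mean(y0[z == 1])    # sample average treatment effect among the treated
```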

Table 2. Parameter Values for Simulation Scenarios in the Second Simulation Study.


Simulation Scenarios

We explored 7 simulation scenarios in each empirical study, all with a study size of 5000 patients. Table 2 gives the parameter values used in scenarios based on the anticoagulant study and the anticonvulsant study.

Treatment Effect Estimation

In each simulated data set, we estimated a total of 12 unique PS models. We considered 3 covariate specifications: (1) all variables (XZ, XZY, XY), (2) predictors of outcome (XZY, XY), and (3) predictors of exposure (XZ, XZY). We also used each of 4 modeling approaches, including (1) ordinary logistic regression (logistic), (2) Bayesian logistic regression with a Cauchy prior distribution (bayesglm), (3) lasso logistic regression with cross-validation to select the smoothing parameter (lasso), and (4) boosted logistic regression (boosting) with an interaction depth of 2, a shrinkage parameter of 0.03, a training fraction of 50% for selection of the optimal number of trees, and up to 1000 trees. Although these choices are not likely to be optimal, they were made to provide a reasonable balance of computation time vs method performance. For approaches 1 through 3, we used main effect linear terms only, as is common in nonrandomized studies with many potential confounders. For approach 4, the model considers interactions and nonlinearities automatically.
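The following R sketch illustrates the 4 PS estimation approaches for a single covariate specification, using the arm, glmnet, and gbm packages with the tuning values stated above; the object names and remaining defaults are assumptions rather than the authors' code.

```r
# Sketch: the 4 PS estimation approaches for one covariate specification (illustrative).
library(arm)      # bayesglm
library(glmnet)   # lasso
library(gbm)      # boosted regression trees

ps_dat <- data.frame(z = z, covs)                              # z = exposure; covs = chosen covariates

ps_logistic <- fitted(glm(z ~ ., data = ps_dat, family = binomial))
ps_bayes    <- fitted(bayesglm(z ~ ., data = ps_dat, family = binomial))   # default Cauchy prior
cv_fit      <- cv.glmnet(as.matrix(covs), z, family = "binomial", alpha = 1)
ps_lasso    <- as.vector(predict(cv_fit, newx = as.matrix(covs),
                                 s = "lambda.min", type = "response"))
boost       <- gbm(z ~ ., data = ps_dat, distribution = "bernoulli",
                   interaction.depth = 2, shrinkage = 0.03,
                   train.fraction = 0.5, n.trees = 1000)
n_best      <- gbm.perf(boost, method = "test", plot.it = FALSE)           # optimal number of trees
ps_boost    <- predict(boost, newdata = ps_dat, n.trees = n_best, type = "response")
```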

Using each of the 12 estimated PSs, we applied several methods of treatment effect estimation. There were 6 unique methods for estimating the sample marginal RR, including (1) full matching; (2) stratification on deciles; (3) fine stratification using 40 quantile-based strata; (4) regression on the PS using 1 thin-plate regression spline with 4 degrees of freedom; (5) regression on the PS using 2 splines, each with 4 degrees of freedom; and (6) stabilized inverse probability of treatment weights (IPTW). We considered 8 methods for estimating the effect in the treated patients, including (1) 1-to-1 greedy matching; (2) full matching; (3) stratification on deciles; (4) fine stratification, as above; (5) regression on the PS using 1 spline; (6) regression on the PS using 2 splines; (7) stabilized standardized mortality ratio weights (which we will refer to as IPTW for convenience); and (8) matching weights. We implemented all matching methods with a caliper of 0.2 standard deviations of the logit of the PS, as generally recommended.51
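As a reference for the weighting estimators in this list, the sketch below shows one common way to construct inverse probability, SMR-type, and matching weights from an estimated PS; it is illustrative only, the SMR-type weights are shown in unstabilized form, and all names are hypothetical.

```r
# Sketch: PS-based weights used by the weighting estimators (formulas as commonly defined).
p_treat <- mean(z)                                           # marginal probability of treatment

# Stabilized inverse probability of treatment weights, targeting the full-sample effect (SATE)
w_iptw <- ifelse(z == 1, p_treat / ps, (1 - p_treat) / (1 - ps))

# SMR-type weights targeting the effect in the treated (SATT); shown here unstabilized
w_smr <- ifelse(z == 1, 1, ps / (1 - ps))

# Matching weights: behave like 1:1 matching but without discarding patients
w_mw <- pmin(ps, 1 - ps) / ifelse(z == 1, ps, 1 - ps)

# A simple weighted (Hajek-type) estimate of the marginal RR under IPTW
rr_iptw <- weighted.mean(y[z == 1], w_iptw[z == 1]) /
           weighted.mean(y[z == 0], w_iptw[z == 0])
```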

We implemented all of these methods in the full simulated data set, as well as in the subset remaining after trimming at the 2.5% threshold, a commonly used trimming threshold. For the weighting approaches, we additionally considered weight truncation at the 2.5% level. These combinations resulted in a total of 156 unique estimates of the sample marginal RR and 216 unique estimates of the treated marginal RR.
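One plausible implementation of the trimming and truncation rules is sketched below, assuming symmetric 2.5% cut points on the PS and weight distributions; the exact rule used in the study may differ.

```r
# Sketch: one plausible reading of the 2.5% trimming and truncation rules.
lo <- quantile(ps, 0.025)
hi <- quantile(ps, 0.975)
trimmed <- ps >= lo & ps <= hi                   # analysis restricted to the central 95% of the PS

# Weight truncation: cap extreme weights at their 2.5th and 97.5th percentiles
w_trunc <- pmin(pmax(w_iptw, quantile(w_iptw, 0.025)), quantile(w_iptw, 0.975))
```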

Results

Simulation Study 1

As Figure 2 shows, lasso selected an average of 102 to 383 variables for adjustment in the NSAID data and 174 to 452 variables in the anticonvulsant data. In both cohorts, lasso selected fewer variables when the model used for outcome generation was smaller, leading to fewer true confounders (scenario 1), or when the number of cases was decreased (scenario 5). In those scenarios, the number of variables selected by the lasso model that included all variables was less than or approximately equal to the number of variables selected by the lasso model that included only prescreened variables. More variables were selected when the number of cases was increased (scenario 6).

Figure 2. Variable Selection Across All Simulation Scenarios.


The specific variables selected by the prescreened lasso method were most similar to the 500 selected by bias-based hdPS with an average overlap of 30% to 54%. Because the variables that could be selected by the prescreened lasso method are limited to those selected by bias-based hdPS, the overlap in this case is determined solely by the number of variables selected. Other methods that did not have this restriction generally had lower overlap with the bias-based hdPS method. Ordinary lasso that included all potential covariates selected between 13% and 41% of the bias-based hdPS variables, on average. In contrast, exposure-based hdPS (using 500 variables) selected 19% to 26% of the bias-selected variables in the NSAID cohort and 46% to 49% in the anticonvulsant cohort, on average.

Appendix B, Figure 1 presents bias and root mean squared error results for the first simulation scenario in the NSAID data. The crude estimates, displayed at the top of the figure, were biased (mean odds ratio = 1.2, mean risk difference = 1%) for the true null treatment effect. Ridge and lasso regression using all potential covariates reduced bias compared with the unadjusted estimates but still had the poorest performance of all adjusted methods. When prescreening potential variables by using bias-based hdPS and including only the top 500 in a ridge or lasso model for the outcome, performance improved slightly. PS approaches generally performed better. Specifically, bias was lowest when using the lasso-selected variables in the PS or including the top 500 variables from either hdPS approach. When fewer hdPS-selected variables were used, the bias-based approach was preferred over the exposure-based approach. Appendix B, Figures 1 through 7 contain complete estimation results for all scenarios.

The patterns observed in scenario 1 were generally repeated across all other simulation scenarios evaluated in the NSAID cohort (Figure 3). One exception was in scenario 4, where the number of exposed patients was decreased. In this scenario, exposure-based hdPS selection with 500 variables overadjusted the treatment effect estimate, leading to negative bias; bias-based hdPS selection or lasso selection combined with a PS approach were preferred.

Figure 3. Bias for the Risk Difference Estimates on the Percent Scale From All Simulations Based on the Cohort of Nonsteroidal Anti-inflammatory Drug Initiators.


In the anticonvulsant cohort, both ridge and lasso regression of the outcome on exposure and confounders again failed to sufficiently remove bias, regardless of whether confounders were prescreened by hdPS (Figure 4). Including lasso-selected variables in a PS approach usually resulted in lower bias, except in scenario 4, where the number of exposed patients was decreased. Similarly, other PS approaches that performed well in most scenarios, including bias-based hdPS with 30 variables, performed worse in that scenario, indicating a potential problem with the PS in this scenario. In general, exposure-based hdPS did not perform well, underadjusting the treatment effect estimate in the case of 30 variables and overadjusting when using more variables. Including a smaller number of hdPS variables was also generally preferred over including a larger number, which contrasts with the trends seen in the NSAID cohort and may also indicate PS convergence problems in these data.

Figure 4. Bias for the Risk Difference Estimates on the Percent Scale From All Simulations Based on the Cohort of Anticonvulsant Initiators.


Simulation Study 2

In Figure 5, we present the bias of treatment effect estimation methods in the OAC cohort. We focus on results based on PSs that included only predictors of outcome (XY, XZY). Results for other PS variable specifications were very similar and are presented in Appendix C. We plot the absolute bias, calculated as the difference between the mean estimated log RR and the true log RR. Although not shown, the Monte Carlo error for calculation of the bias was generally low (standard error of calculated bias <0.007). Across the first 6 scenarios in the OAC cohort, where the true treatment effect was a homogeneous null effect, regression on the PS using 1 spline had low bias, regardless of the method used for PS estimation or whether the sample was trimmed. However, many other methods that were biased when applied to the full sample—such as full matching, deciles, regression using 2 splines, and IPTW—were less biased after trimming the tails of the PS distributions. In contrast, in scenario 7, where there was strong treatment effect heterogeneity along the PS, trimmed methods were typically more biased than the associated untrimmed approach, likely because these methods target a different estimand than the true SATE used to calculate bias. Regression on the PS using 1 spline also performs poorly with respect to bias in scenario 7, as this method assumes a uniform effect of treatment that is violated in this scenario. Among estimators of the SATT, most approaches had little bias, because the PS models included all true confounders (although not necessarily all interactions), and there was good overlap on the PS in the treated population. Use of boosting or lasso for PS model estimation had inconsistent effects on bias across scenarios and estimands, typically resulting in decreased bias for estimators that were highly biased with a logistic regression PS model, but sometimes resulting in increased bias for approaches that had relatively low bias with logistic regression.

Figure 5. Absolute Bias of the Log RR Estimates for Each Method That Adjusted for Predictors of Outcome (XY, XZY) in All 7 Scenarios in the OAC Cohort.


In Figure 6, we present the corresponding standard errors, calculated as the standard deviation of log RR estimates across simulation iterations. The values in the plot are divided by the standard deviation of the crude estimator to account for varying numbers of outcomes across simulation scenarios, so values can be interpreted as the relative change in the standard error from the crude estimator. Trimmed estimators, which had low bias, also tended to have lower variance than the associated untrimmed estimators of the SATE, despite the decrease in sample size. Results were generally consistent across scenarios and across PS estimation methods, although boosted or lasso regression nearly always improved the variance over logistic regression or Bayesglm.

Figure 6. Relative Standard Error of the Log RR Estimates (Compared With the Crude Standard Error) for Each Method That Adjusted for Predictors of Outcome (XY, XZY) in All 7 Scenarios in the OAC Cohort.


In Figures 7 and 8, we present the bias and standard error for simulations in the ACV cohort. Again, regression on the PS using 1 nonlinear spline had very low bias and low standard error in the first 6 scenarios, resulting in the lowest MSE of all methods considered, regardless of the PS estimation method used. When estimating the SATT, matching weights also had low bias and standard error, whether implemented with or without truncation. Many of the methods that performed poorly in the OAC simulations for estimation of the SATE also performed poorly in these simulations for estimation of either the SATE or the SATT. In general, approaches that focus on the treatment effect in the full sample or the treated sample (rather than the feasible sample, that is, the sample of patients in the region of overlap on the PS) were often biased because there was substantial nonoverlap in this study. In contrast, trimmed estimators, pair matching, or matching weights all focus on feasible populations and therefore had lower bias in scenarios with a homogeneous treatment effect. However, trimming nearly always substantially increased the standard error of the estimator. Truncation of the standardized IPTW weights reduced standard error, but generally did not improve bias.

Figure 7. Absolute Bias of the Log RR Estimates for Each Method That Adjusted for Predictors of Outcome (XY, XZY) in All 7 Scenarios in the ACV Cohort.


Figure 8. Relative Standard Error of the Log RR Estimates (Compared With the Crude Standard Error) for Each Method That Adjusted for Predictors of Outcome (XY, XZY) in All 7 Scenarios in the ACV Cohort.


In scenario 7, where there was strong treatment effect heterogeneity, regression on the PS using 1 spline was not the least biased method. Instead, 1-to-1 matching, full matching, fine stratification, regression on the PS using 2 nonlinear splines, and matching weights were less biased for estimating the SATT. Among these methods, only matching weights also had relatively low standard error, resulting in the lowest MSE for this scenario. As in the OAC cohort, results in the ACV cohort were generally consistent across PS model estimation methods, although using lasso and boosting to estimate the PS sometimes improved bias over logistic regression and bayesglm. This was particularly striking in the estimation of the SATT using regression on 2 splines without trimming, where a logistic PS model often resulted in high bias, but a boosted PS model resulted in unbiased estimation in all 7 scenarios. The impact of PS model choice was generally largest in scenario 2, where the exposure prevalence was lowered from 20% to 8%, yielding approximately 4 exposed patients per variable. In this case, lasso and boosted regression may have provided a more stable PS estimate, leading to better performance of all methods that subsequently used the estimated PS.

Discussion

Study Results and Implications

In this research, we analyzed real health care claims data sets and used a plasmode simulation framework to produce realistic simulated data sets with a known outcome-generating model. In the first set of simulated data sets, we compared the hdPS variable selection algorithm with regularization methods that could accommodate all potential covariates without prior selection. Across 7 simulation scenarios in 2 cohorts, we found that performing variable selection and implementing a PS adjustment approach was preferable to including all covariates in a regularized outcome model. While the method of treatment effect estimation was important for determining the resulting bias and MSE, the specific method of variable selection was less important, since both lasso selection and hdPS selection performed well despite the fact that the methods selected very different sets of covariates. Therefore, the poorer performance of lasso and ridge regression models for estimating treatment effect was likely due to shrinkage-induced bias in the estimated coefficients on confounders, which may have reduced the confounding control achieved by the model.

The confounding control achieved by a given variable selection technique depended somewhat on the simulation scenario. Specifically, in the NSAID cohort, any method that chose a large number of variables for inclusion in the PS performed well. If fewer variables were selected, then the method of variable selection was more important, with bias-based hdPS preferred over exposure-based hdPS. These findings correspond with previous evaluations of hdPS in empirical cohorts, where more variables selected generally corresponded with a reduction in bias.25 However, in the scenario with only 10% of patients exposed, using the lasso outcome model for variable selection for the PS performed best with respect to bias, perhaps because this method does not require evaluation of variable association with exposure. In the anticonvulsant cohort, selection methods that chose fewer variables, using either lasso or bias-based hdPS selection, generally performed best. These results align well with the extensive analyses of this cohort performed by Patorno et al,34 who also found that fewer variables led to more stable estimated PS models and better estimates of treatment effect. The problems with PS convergence observed with more variables may also explain the relatively poor performance of lasso selection in these data, since lasso selected at least 100 variables in every simulated data set.

To better understand the different results observed in the 2 cohorts used for simulations, we compared the prevalence of each of the top 500 bias-based hdPS variables in exposed and unexposed patients from each cohort. In the NSAID data, hdPS variables were relatively balanced between exposed and unexposed patients, indicating that variable associations with exposure were not strong. The anticonvulsant data contained many variables that were strongly associated with exposure to highly inducing anticonvulsants, potentially leading to nonpositivity in treatment assignment and nonconvergence of the PS model when these variables were selected for confounder adjustment. Furthermore, some of these variables were not included (in any form) in the simulation models for generating the outcome and therefore constituted instruments for the exposure–outcome association in the simulations. In particular, when entering the hdPS covariates into the PS model one at a time, as described by Patorno et al,34 the first convergence failure occurred when an indicator for convulsions was entered into the model. This variable indicates nonspecific convulsions, not attributable to a previously diagnosed condition, a case in which older therapies may be preferred by many physicians.
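As a simple diagnostic for this issue, the sketch below tabulates covariate prevalence by exposure group; covariates that are nearly absent in one arm flag potential nonpositivity. The column names (eg, "exposed") are placeholders, not variables from the study data.

```python
import pandas as pd

def prevalence_by_exposure(df, covariate_cols, exposure_col="exposed"):
    """Compare prevalence of binary covariates between exposed and unexposed.

    Covariates with near-zero prevalence in one arm are candidates for
    nonpositivity and can destabilize the PS model if adjusted for."""
    prev = df.groupby(exposure_col)[covariate_cols].mean().T
    prev.columns = ["prev_unexposed", "prev_exposed"]  # assumes exposure coded 0/1
    prev["abs_diff"] = (prev["prev_exposed"] - prev["prev_unexposed"]).abs()
    return prev.sort_values("abs_diff", ascending=False)
```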

Overall, automated variable selection, combined with PS adjustment, performed better than adjustment for all variables via a regularized multivariate outcome model. However, in cases with potential problems from instrumental variables or PS model convergence, as in the anticonvulsant example, investigators may need to restrict the covariate set further, making it even more parsimonious than what is selected by either the standard 500-variable hdPS approach or the lasso approach. Selecting a smaller number of hdPS variables may be a good approach to further restriction, since the highest-ranking hdPS variables are unlikely to be instruments.

In the second set of simulations, we compared approaches for estimating treatment effect using the PS given a predefined set of covariates. We found that regression on the PS using 1 nonlinear spline and no trimming performed best when there was no treatment effect heterogeneity, regardless of the method used for PS estimation. Untrimmed, untruncated matching weights for estimation of the SATT also performed well, even in the presence of strong effect heterogeneity. However, the matching weights method has the disadvantage that the PS must be modeled approximately correctly; this was true by construction in these simulations but cannot be verified in real data. Therefore, in comparative effectiveness research studies with few observed outcome events, we recommend that researchers investigate the potential for treatment effect heterogeneity across the range of the PS, for example, by first estimating the treatment effect within quintiles of the PS. If there is little evidence of heterogeneity, regression on the PS using a nonlinear GAM fit may be preferable. If there is evidence of heterogeneity, then matching weights may be preferred. More commonly used approaches—such as 1-to-1 matching, stratification, and IPTW—are not recommended when there are few outcomes and nonoverlap of PS distributions.
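The sketch below illustrates one such heterogeneity check: estimating crude risk ratios within PS quintiles. Column names ("ps", "treated", "outcome") are placeholders, and the code is illustrative rather than the analysis code used in this study.

```python
import numpy as np
import pandas as pd

def risk_ratio_by_ps_quintile(df, ps_col="ps", trt_col="treated", y_col="outcome"):
    """Crude risk ratio within quintiles of the estimated PS.

    Marked variation in the stratum-specific risk ratios suggests treatment
    effect heterogeneity along the PS, in which case matching weights may be
    preferred over regression on a single PS spline."""
    df = df.assign(ps_q=pd.qcut(df[ps_col], 5, labels=False, duplicates="drop"))
    rows = []
    for q, grp in df.groupby("ps_q"):
        risk_t = grp.loc[grp[trt_col] == 1, y_col].mean()
        risk_c = grp.loc[grp[trt_col] == 0, y_col].mean()
        rows.append({"quintile": int(q),
                     "risk_treated": risk_t,
                     "risk_control": risk_c,
                     "risk_ratio": np.nan if risk_c == 0 else risk_t / risk_c})
    return pd.DataFrame(rows)
```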

Although regression on the PS was previously the most common PS method in the medical literature,15,32 it has been criticized recently. Several studies have found that the estimated coefficient on treatment from a regression on treatment and the PS can result in biased effect estimates.35,39,52 Vansteelandt and Daniel53 also showed that the marginal regression estimators presented in this paper are biased if the outcome logistic regression model is misspecified. Our approach nonetheless performed well because we allowed for a nonlinear association between the PS and the outcome through a spline term, limiting the potential for model misspecification when there is no heterogeneity in treatment effect along the PS. When there is strong heterogeneity, as in scenario 7 of our simulations, the 1-spline outcome model is not correctly specified, and the marginal RR estimate may therefore be biased. Regression on the PS using 2 nonlinear splines has the advantage that it can correctly model the association between the PS and the outcome even in the presence of treatment effect heterogeneity. However, we found that this approach could be badly biased and highly imprecise when there was nonoverlap in the PS distributions of treated and control patients.
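A minimal sketch of this type of marginalized regression on the PS is shown below: a logistic outcome model with a spline in the PS (with or without a treatment-by-spline interaction) is fit and then standardized to obtain a marginal RR. The report used GAM smoothers; this sketch substitutes B-splines for simplicity, and the column names ("outcome", "treated", "ps") are placeholders.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

def marginal_rr_ps_spline(df, two_splines=False):
    """Marginal risk ratio from a logistic outcome model with a PS spline.

    With two_splines=True, the PS-outcome association is allowed to differ by
    treatment group (a separate spline per arm via the interaction), which can
    accommodate effect heterogeneity along the PS."""
    formula = ("outcome ~ treated * bs(ps, df=4)" if two_splines
               else "outcome ~ treated + bs(ps, df=4)")
    fit = smf.glm(formula, data=df, family=sm.families.Binomial()).fit()

    # Standardize: predict each patient's risk under treatment and under
    # control, then take the ratio of the average predicted risks. (Averaging
    # over treated patients only would target an SATT-type estimand.)
    df_trt, df_ctl = df.assign(treated=1), df.assign(treated=0)
    return fit.predict(df_trt).mean() / fit.predict(df_ctl).mean()
```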

Matching weights, a variation on IPTW that targets the treatment effect in the 1-to-1 matched sample, also performed well in the simulations, even in the presence of heterogeneity, and provide many practical advantages, but may be more biased than other approaches when the PS model is misspecified. The matching weight approach focuses estimation on patients with good overlap on the PS without discarding any patients or events in the region of nonoverlap. For this reason, the matching weight estimator had bias similar to that of 1-to-1 matching but was generally much more precise. In contrast, we found that IPTW estimators, even with weight stabilization, can be highly biased and imprecise in the presence of nonpositivity or nonoverlap of PS distributions.36 However, we also found that the problems resulting from extreme weights extend to stratified or full matching estimators that rely on weights to target the marginal treatment effect. For example, full matching, implemented with a caliper, removed patients beyond the region of overlap but could still produce extremely imbalanced matched sets in areas of sparse overlap, such as 1 treated patient matched to 500 or more control patients, or vice versa. This imbalance yields weights that are nearly as extreme as those from IPTW, with correspondingly poor bias and variance.
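The contrast between the two weighting schemes follows directly from their definitions; the sketch below computes both from an estimated PS. The function name and the stabilization option are illustrative choices on our part.

```python
import numpy as np

def iptw_and_matching_weights(ps, treated, stabilize_iptw=False):
    """Compute IPTW and matching weights from an estimated PS.

    The matching weight, min(PS, 1 - PS) divided by the probability of the
    treatment actually received, is bounded above by 1, so patients in regions
    of poor overlap are downweighted rather than discarded. IPTW, in contrast,
    can become extreme as the PS approaches 0 or 1."""
    ps = np.asarray(ps, dtype=float)
    treated = np.asarray(treated, dtype=int)

    p_received = np.where(treated == 1, ps, 1.0 - ps)
    iptw = 1.0 / p_received
    if stabilize_iptw:
        # multiply by the marginal probability of the treatment received
        iptw *= np.where(treated == 1, treated.mean(), 1.0 - treated.mean())
    mw = np.minimum(ps, 1.0 - ps) / p_received
    return iptw, mw
```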

In scenarios with low exposure prevalence, we also found that using boosted regression or lasso logistic regression to estimate the PS model could reduce the bias of treatment effect estimators that suffered from unstable weights. In most other scenarios, the method used to estimate the PS model did not greatly affect treatment effect estimates. In our simulations, both the exposure- and outcome-generating models included interaction terms and sometimes nonlinear associations, although these associations were often weak. In cases of strong interactions and nonlinear associations, the added flexibility of the boosted regression tree model may provide greater improvements.
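For illustration, the sketch below fits a PS model with L1-penalized (lasso) logistic regression and with boosted classification trees using scikit-learn; the specific hyperparameter values are arbitrary illustrations, not the settings used in this study.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegressionCV

def ps_lasso(X, treated):
    """L1-penalized (lasso) logistic PS model with a cross-validated penalty."""
    model = LogisticRegressionCV(penalty="l1", solver="saga", Cs=20, cv=5,
                                 scoring="neg_log_loss", max_iter=5000)
    return model.fit(X, treated).predict_proba(X)[:, 1]

def ps_boosted(X, treated):
    """Boosted classification trees for the PS; shallow trees with a small
    learning rate can capture interactions and nonlinearities."""
    model = GradientBoostingClassifier(n_estimators=500, learning_rate=0.01,
                                       max_depth=3)
    return model.fit(X, treated).predict_proba(X)[:, 1]
```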

Limitations

Although the relative performance of methods was generally consistent across the simulation scenarios considered in each study, the specific results observed are dependent on the data-generating process and parameter values chosen. In the first set of simulations, we attempted to induce unmeasured confounding in scenario 7 by excluding important claims-based confounders from the analysis. However, unmeasured confounding appeared to be relatively weak, likely because of strong correlation between measured and unmeasured covariates. In real data, unmeasured confounders may be completely uncorrelated with the information available in the health care database; for example, smoking history is largely absent from health care claims databases and is not likely to be proxied well by the information that is available in claims. In those cases, the relative performance of methods may differ from that observed in this study, and all methods are likely to perform poorly. In addition, even the expanded outcome-generating model used in scenarios 2 through 7 may be unrealistically simplistic and may not appropriately represent the data-generating mechanism of a real health care outcome. In the second set of simulations, consideration of other base data sets with better PS overlap would likely lead to greatly improved performance for IPTW and other approaches that focus on the SATE and ATT rather than the feasible estimands. However, the performance of all methods would improve in that case, so relative performance would not differ substantially.

Finally, our simulations assume throughout that the covariates available for adjustment are all pretreatment covariates and that mediators on the causal pathway between exposure and outcome have been excluded. The performance of all methods evaluated would likely be worse if this assumption is not satisfied,54 so investigators must exercise care in covariate specification, even when using an automated method for selection and robust PS analysis methods.

Future Research

In the first study, we focused on hdPS variable selection vs regularization of an outcome regression model, 2 of many possible general approaches to variable selection and treatment effect estimation. We chose these methods because they are the automated approaches most likely to be used regularly for confounder selection in comparative effectiveness studies implemented in health care claims databases. Other approaches to variable selection, such as stepwise regression, are known to produce biased treatment effect estimates and underestimates of uncertainty.55 Other high-dimensional modeling approaches, such as boosted regression, aim to minimize prediction error and so may be inappropriate for confounder adjustment.56 Recently proposed confounder selection methods that rely on Bayesian model averaging or stochastic search are computationally intensive and would require significant adaptation to analyze the thousands of potential covariates encountered in secondary databases.57-60

In the second study, we focused on methods that are commonly used for analysis of the PS, such as 1-to-1 matching and IPTW, as well as a few promising alternatives, such as marginalized regression on the PS and matching weights. The findings of this study emphasize the need for additional research in this area, as neither of the 2 best-performing methods was robust in all data-generating scenarios. Specifically, regression on the PS with 1 spline is robust to moderate misspecification of the PS but is biased when the treatment effect is heterogeneous. Matching weights are robust to a heterogeneous treatment effect but are more dependent on correct specification of the PS model. Therefore, approaches that combine the advantages of these 2 methods could lead to highly robust inference. One approach that deserves consideration is an optimally weighted regression adjustment, proposed by Vansteelandt and Daniel,53 which estimates the association between PS and outcome separately in each treatment group (allowing for treatment effect heterogeneity) but avoids model extrapolation by estimating the treatment effect only in the overlapping population through the use of weights. Additional work is also needed to evaluate the accuracy of standard error estimators for these methods and to explore the performance of methods when there is important unmeasured confounding or uncertainty in model specification, which was not explored in this study. Because the bias-variance trade-off of each method depends on the study size, the low bias observed with 1-to-1 matching is likely to become more important in determining MSE as the number of observed outcomes increases. Therefore, future work should focus on improving understanding of how the relative performance of methods varies as the number of observed outcome events increases. Finally, there is a strong need for evaluation of methods as applied to survival outcomes, which are also common in CER and PCOR. Although many methods would be expected to perform similarly in the context of a survival outcome, some methods—for example, outcome regression on the PS—would need additional study in this context.
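As a rough schematic of that general idea (per-arm PS-outcome regression combined with overlap-focused weights), consider the sketch below. It is not the estimator as proposed by Vansteelandt and Daniel53 and not code from this study; the column names, spline settings, and choice of overlap weights are placeholders.

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

def overlap_weighted_rr(df):
    """Schematic: fit outcome ~ spline(PS) separately by treatment arm,
    predict each patient's risk under treatment and under control, then
    average the predictions with overlap-focused (matching-weight-style)
    weights so the contrast is confined to the region of PS overlap."""
    fam = sm.families.Binomial()
    # explicit bounds keep the spline basis valid when predicting for the other arm
    spline = "bs(ps, df=4, lower_bound=0, upper_bound=1)"
    fit_trt = smf.glm(f"outcome ~ {spline}", data=df[df.treated == 1], family=fam).fit()
    fit_ctl = smf.glm(f"outcome ~ {spline}", data=df[df.treated == 0], family=fam).fit()

    risk_trt, risk_ctl = fit_trt.predict(df), fit_ctl.predict(df)
    w = np.minimum(df["ps"], 1.0 - df["ps"])  # weight toward the overlap region
    return np.average(risk_trt, weights=w) / np.average(risk_ctl, weights=w)
```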

Conclusions

When combined with appropriate study design,61 the methods evaluated in this study may play an integral part in an automated learning health care environment that can provide fast and accurate answers to patients' pressing clinical questions.62 Our findings suggest that automated variable selection methods, such as hdPS and lasso, can be useful for building PS models that appropriately adjust for confounding, but regularized regression approaches should not be used to simultaneously select variables and adjust for confounding via the outcome model. Our findings also suggest a reconsideration of the most popular approaches to PS adjustment in nonrandomized studies of treatments with few observed outcome events, such as 1-to-1 matching and IPTW. Approaches that focus on treatment effect estimates in the feasible population (ie, the population of patients with PS values in the region of good PS overlap between treatment and control groups) while preserving study size and number of outcomes are likely to lead to better estimates of treatment effect. Incorporation of these findings into statistical practice for the analysis of nonrandomized PCOR studies based on large administrative health care databases should lead to robust estimates that are less dependent on investigator choice and better reflect the true underlying causal association, thereby giving patients and their providers better information and, ultimately, leading to better treatment decision making.

References

1.
Schneeweiss S, Avorn J. A review of uses of health care utilization databases for epidemiologic research on therapeutics. J Clin Epidemiol. 2005;58(4):323-337. [PubMed: 15862718]
2.
Gurwitz JH, Col NF, Avorn J. The exclusion of the elderly and women from clinical trials in acute myocardial infarction. JAMA. 1992;268(11):1417-1422. [PubMed: 1512909]
3.
Goldberg NH, Schneeweiss S, Kowal MK, Gagne JJ. Availability of comparative efficacy data at the time of drug approval in the United States. JAMA. 2011;305(17):1786-1789. [PubMed: 21540422]
4.
Lloyd-Jones D, Adams RJ, Brown TM, et al. Heart disease and stroke statistics: 2010 update. Circulation. 2010;121(7):e46-e215. [PubMed: 20019324]
5.
Halpern SD, Ubel PA, Berlin JA, Townsend RR, Asch DA. Physicians' preferences for active-controlled versus placebo-controlled trials of new antihypertensive drugs. J Gen Intern Med. 2002;17(9):689-695. [PMC free article: PMC1495099] [PubMed: 12220365]
6.
US Food and Drug Administration. The Sentinel Initiative: A National Strategy for Monitoring Medical Product Safety. US Department of Health and Human Services; 2008.
7.
Billewicz W. The efficiency of matched samples: An empirical investigation. Biometrics. 1965;21(3):623-644. [PubMed: 5858095]
8.
Cochran WG. The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics. 1968;24(2):295-313. [PubMed: 5683871]
9.
Cochran WG, Rubin DB. Controlling bias in observational studies: a review. Sankhyā Indian J Stat Ser A. 1973;35(4):417-446.
10.
Hirano K, Imbens GW. Estimation of causal effects using propensity score weighting: an application to data on right heart catheterization. Health Serv Outcomes Res Methodol. 2001;2(3):259-278.
11.
Hirano K, Imbens GW, Ridder G. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica. 2003;71(4):1161-1189.
12.
Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11(5):550-560. [PubMed: 10955408]
13.
Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41-55.
14.
Glynn RJ, Gagne JJ, Schneeweiss S. Role of disease risk scores in comparative effectiveness research with emerging therapies. Pharmacoepidemiol Drug Saf. 2012;21(S2):138-147. [PMC free article: PMC3454457] [PubMed: 22552989]
15.
Glynn RJ, Schneeweiss S, Stürmer T. Indications for propensity scores and review of their use in pharmacoepidemiology. Basic Clin Pharmacol Toxicol. 2006;98(3):253-259. [PMC free article: PMC1790968] [PubMed: 16611199]
16.
Stürmer T, Rothman KJ, Avorn J, Glynn RJ. Treatment effects in the presence of unmeasured confounding: dealing with observations in the tails of the propensity score distribution—a simulation study. Am J Epidemiol. 2010;172(7):843-854. [PMC free article: PMC3025652] [PubMed: 20716704]
17.
Stürmer T, Schneeweiss S, Brookhart MA, Rothman KJ, Avorn J, Glynn RJ. Analytic strategies to adjust confounding using exposure propensity scores and disease risk scores: nonsteroidal antiinflammatory drugs and short-term mortality in the elderly. Am J Epidemiol. 2005;161(9):891-898. [PMC free article: PMC1407370] [PubMed: 15840622]
18.
Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol. 1996;49(12):1373-1379. [PubMed: 8970487]
19.
Brookhart MA, Stürmer T, Glynn RJ, Rassen J, Schneeweiss S. Confounding control in healthcare database research: challenges and potential approaches. Med Care. 2010;48(suppl 6):S114-S120. doi:10.1097/MLR.0b013e3181dbebe3 [PMC free article: PMC4024462] [PubMed: 20473199] [CrossRef]
20.
Austin PC, Grootendorst P, Anderson GM. A comparison of the ability of different propensity score models to balance measured variables between treated and untreated subjects: a Monte Carlo study. Stat Med. 2007;26(4):734-753. [PubMed: 16708349]
21.
Rubin DB. Estimating causal effects from large data sets using propensity scores. Ann Intern Med. 1997;127(8 Pt 2):757-763. [PubMed: 9382394]
22.
Pearl J. On a Class of Bias-Amplifying Variables That Endanger Effect Estimates. Association for Uncertainty in Artificial Intelligence; 2010.
23.
Myers JA, Rassen JA, Gagne JJ, et al. Effects of adjusting for instrumental variables on bias and precision of effect estimates. Am J Epidemiol. 2011;174(11):1213-1222. [PMC free article: PMC3254160] [PubMed: 22025356]
24.
Hernán MA, Robins JM. Instruments for causal inference: an epidemiologist's dream? Epidemiology. 2006;17(4):360-372. doi:10.1097/01.ede.0000222409.00878.37 [PubMed: 16755261] [CrossRef]
25.
Schneeweiss S, Rassen JA, Glynn RJ, Avorn J, Mogun H, Brookhart MA. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology. 2009;20(4):512-522. [PMC free article: PMC3077219] [PubMed: 19487948]
26.
Rassen JA, Glynn RJ, Brookhart MA, Schneeweiss S. Covariate selection in high-dimensional propensity score analyses of treatment effects in small samples. Am J Epidemiol. 2011;173(12):1404-1413. [PMC free article: PMC3145392] [PubMed: 21602301]
27.
Franklin JM, Schneeweiss S, Polinski JM, Rassen JA. Plasmode simulation for the evaluation of pharmacoepidemiologic methods in complex healthcare databases. Comput Stat Data Anal. 2014;72:219-226. [PMC free article: PMC3935334] [PubMed: 24587587]
28.
Greenland S. Invited commentary: variable selection versus shrinkage in the control of multiple confounders. Am J Epidemiol. 2008;167(5):523-529; discussion 530-531. doi:10.1093/aje/kwm355 [PubMed: 18227100] [CrossRef]
29.
Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer; 2009.
30.
Rosenbaum PR, Rubin DB. Reducing bias in observational studies using subclassification on the propensity score. J Am Stat Assoc. 1984;79(387):516-524.
31.
Shah BR, Laupacis A, Hux JE, Austin PC. Propensity score methods gave similar results to traditional regression modeling in observational studies: a systematic review. J Clin Epidemiol. 2005;58(6):550-559. [PubMed: 15878468]
32.
Weitzen S, Lapane KL, Toledano AY, Hume AL, Mor V. Principles for modeling propensity scores in medical research: a systematic literature review. Pharmacoepidemiol Drug Saf. 2004;13(12):841-853. [PubMed: 15386709]
33.
Kurth T, Walker AM, Glynn RJ, et al. Results of multivariable logistic regression, propensity matching, propensity adjustment, and propensity-based weighting under conditions of nonuniform effect. Am J Epidemiol. 2006;163(3):262-270. [PubMed: 16371515]
34.
Patorno E, Glynn RJ, Hernández-Díaz S, Liu J, Schneeweiss S. Studies with many covariates and few outcomes: selecting covariates and implementing propensity-score-based confounding adjustments. Epidemiology. 2014;25(2):268-278. [PubMed: 24487209]
35.
Austin PC. The performance of different propensity score methods for estimating marginal hazard ratios. Stat Med. 2013;32(16):2837-2849. doi:10.1002/sim.5705 [PMC free article: PMC3747460] [PubMed: 23239115] [CrossRef]
36.
Lunceford JK, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat Med. 2004;23(19):2937-2960. [PubMed: 15351954]
37.
Franklin JM, Rassen JA, Bartels DB, Schneeweiss S. Prospective cohort studies of newly marketed medications: using covariate data to inform the design of large-scale studies. Epidemiology. 2014;25(1):126-133. [PubMed: 24240651]
38.
Gutman R, Rubin D. Robust estimation of causal effects of binary treatments in unconfounded studies with dichotomous outcomes. Stat Med. 2013;32(11):1795-1814. [PubMed: 23019093]
39.
Cangul M, Chretien Y, Gutman R, Rubin D. Testing treatment effects in unconfounded studies under model misspecification: logistic regression, discretization, and their combination. Stat Med. 2009;28(20):2531-2551. [PubMed: 19572258]
40.
Freedman DA. Randomization does not justify logistic regression. Stat Sci. 2008;23(2):237-249.
41.
Gail MH, Wieand S, Piantadosi S. Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika. 1984;71(3):431-444.
42.
Franklin JM, Eddings W, Glynn RJ, Schneeweiss S. Regularized regression versus the high-dimensional propensity score for confounding adjustment in secondary database analyses. Am J Epidemiol. 2015;182(7):651-659. [PubMed: 26233956]
43.
Franklin JM, Eddings W, Austin PC, Stuart EA, Schneeweiss S. Comparing the performance of propensity score methods in healthcare database studies with rare outcomes. Stat Med. 2017;36(12):1946-1963. [PubMed: 28208229]
44.
Schneeweiss S, Solomon DH, Wang PS, Rassen J, Brookhart MA. Simultaneous assessment of short-term gastrointestinal benefits and cardiovascular risks of selective cyclooxygenase 2 inhibitors and nonselective nonsteroidal antiinflammatory drugs: an instrumental variable analysis. Arthritis Rheum. 2006;54(11):3390-3398. doi:10.1002/art.22219 [PubMed: 17075817] [CrossRef]
45.
Bross ID. Spurious effects from an extraneous variable. J Chronic Dis. 1966;19(6):637-647. [PubMed: 5966011]
46.
Le Cessie S, Van Houwelingen J. Ridge estimators in logistic regression. Appl Stat. 1992;41(1):191-201.
47.
Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B Methodol. 1996;58(1):267-288.
48.
Efron B. Estimating the error rate of a prediction rule: improvement on cross-validation. J Am Stat Assoc. 1983;78(382):316-331.
49.
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1-22. [PMC free article: PMC2929880] [PubMed: 20808728]
50.
Seeger JD, Bykov K, Bartels DB, Huybrechts K, Zint K, Schneeweiss S. Safety and effectiveness of dabigatran and warfarin in routine care of patients with atrial fibrillation. Thromb Haemost. 2015;114(6):1277-1289. doi:10.1160/TH15-06-0497 [PubMed: 26446507] [CrossRef]
51.
Austin PC. Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies. Pharm Stat. 2011;10(2):150-161. [PMC free article: PMC3120982] [PubMed: 20925139]
52.
Austin PC, Grootendorst P, Normand S-LT, Anderson GM. Conditioning on the propensity score can result in biased estimation of common measures of treatment effect: a Monte Carlo study. Stat Med. 2007;26(4):754-768. doi:10.1002/sim.2618 [PubMed: 16783757] [CrossRef]
53.
Vansteelandt S, Daniel RM. On regression adjustment for the propensity score. Stat Med. 2014;33(23):4053-4072. [PubMed: 24825821]
54.
Schisterman EF, Cole SR, Platt RW. Overadjustment bias and unnecessary adjustment in epidemiologic studies. Epidemiology. 2009;20(4):488-495. [PMC free article: PMC2744485] [PubMed: 19525685]
55.
Steyerberg EW, Eijkemans MJ, Habbema JD. Stepwise selection in small data sets: a simulation study of bias in logistic regression analysis. J Clin Epidemiol. 1999;52(10):935-942. [PubMed: 10513756]
56.
Friedman JH. Greedy function approximation: a gradient boosting machine [English summary]. Ann Stat. 2001;29(5):1189-1232.
57.
Zigler CM, Dominici F. Uncertainty in propensity score estimation: Bayesian methods for variable selection and model averaged causal effects. J Am Stat Assoc. 2014;109(505):95-107. doi:10.1080/01621459.2013.869498 [PMC free article: PMC3969816] [PubMed: 24696528] [CrossRef]
58.
Vansteelandt S, Bekaert M, Claeskens G. On model selection and model misspecification in causal inference. Stat Methods Med Res. 2012;21(1):7-30. [PubMed: 21075803]
59.
Wang C, Parmigiani G, Dominici F. Bayesian effect estimation accounting for adjustment uncertainty. Biometrics. 2012;68(3):661-671. doi:10.1111/j.1541-0420.2011.01731.x [PubMed: 22364439] [CrossRef]
60.
Crainiceanu CM, Dominici F, Parmigiani G. Adjustment uncertainty in effect estimation. Biometrika. 2008;95(3):635-651. doi:10.1093/biomet/asn015 [CrossRef]
61.
Schneeweiss S. A basic study design for expedited safety signal evaluation based on electronic healthcare data. Pharmacoepidemiol Drug Saf. 2010;19(8):858-868. [PMC free article: PMC2917262] [PubMed: 20681003]
62.
Schneeweiss S. Learning from big health care data. N Engl J Med. 2014;370(23):2161-2163. doi:10.1056/NEJMp1401111 [PubMed: 24897079] [CrossRef]

Acknowledgment

Research reported in this report was [partially] funded through a Patient-Centered Outcomes Research Institute® (PCORI®) Award (ME-1303-5796). Further Information available at: https://www.pcori.org/research-results/2013/methods-variable-selection-and-treatment-effect-estimation-nonrandomized-studies-few-outcome-events-and-many-confounders

Appendices

PCORI ID: ME-1303-5796

Suggested citation:

Franklin JM, Eddings W, Gopalakrishnan C, et al. (2018). Methods for Variable Selection and Treatment Effect Estimation in Nonrandomized Studies with Few Outcome Events and Many Confounders. Patient-Centered Outcomes Research Institute (PCORI). https://www.doi.org/10.25302/3.2018.ME.13035796

Disclaimer

The [views, statements, opinions] presented in this report are solely the responsibility of the author(s) and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute® (PCORI®), its Board of Governors or Methodology Committee.

Copyright © 2018 Brigham and Women's Hospital All Rights Reserved.

This book is distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits noncommercial use and distribution provided the original author(s) and source are credited. (See https://creativecommons.org/licenses/by-nc-nd/4.0/.)

Bookshelf ID: NBK591439; PMID: 37184181; DOI: 10.25302/3.2018.ME.13035796
