Considerations for the Design and Conduct of Externally Controlled
Trials
Summary
Design Considerations - The protocol should be finalized before initiating the externally controlled (EC) trial. - Consider using the estimand framework to aid in the design and analysis plan. - Prespecify plans for measuring and analyzing confounding factors and sources of bias.
Data Selection - For the external control arm, include a comprehensive table describing comparability considerations.
Analysis Plan - The analysis plan should be prespecified, thorough (including sensitivity analyses, plans for handling missing data, etc.), and include a formal evaluation of comparability.
Given that externally controlled trials do not involve the randomization of the study population to the treatments being compared, it is crucial that the treatment and control arm populations are as similar as possible regarding known factors that can affect the outcome being measured. These factors, discussed in more detail in section III, include important baseline characteristics (e.g., demographic factors, comorbidities), disease attributes (e.g., severity, symptoms, duration of illness), start of follow-up for the treatment of interest, concomitant therapies, and the clinical observations collected. Importantly, before opting to conduct a clinical trial using an external control arm as a comparator, sponsors and investigators should consider the likelihood that such a trial design would be able to distinguish the effect of a drug from other factors that impact the outcome of interest and meet regulatory requirements.
The suitability of an externally controlled trial design warrants a case-by-case assessment, informed by factors including the heterogeneity of the disease (e.g., clinical presentation, severity, prognosis), preliminary evidence regarding the drug product under investigation, the approach to ascertaining the outcome of interest, and whether the goal of the trial is to show superiority or non-inferiority. Notably, if the natural history of a disease is well-defined and the disease is known not to improve in the absence of an intervention or with available therapies, historical information can potentially serve as the control group. For example, the objective response rate is often used as a single-arm trial endpoint in oncology, given the established understanding that tumor shrinkage rarely occurs without an intervention.
Study Design
Reducing the potential for bias in externally controlled trials is best addressed during the design phase, where well-chosen design elements enhance confidence in the interpretability of study results. Sponsors should finalize a study protocol before initiating the externally controlled trial. This includes selecting the external control arm and analytic approach, rather than choosing an external control arm after the completion of a single-arm trial. Specific design elements to be prespecified in the protocol include suitable study data sources, baseline eligibility (inclusion and exclusion) criteria, appropriate exposure definitions and windows, well-defined and clinically meaningful endpoints, cogent analytic plans, and approaches to minimize missing data and sources of bias.
Estimand Framework
The estimand framework involves a precise description of the treatment effect reflecting the clinical question posed by the study objective. This can aid in designing an externally controlled trial. An estimand is conceptually composed of the study population, treatment of interest and comparator, outcome of interest, handling of intercurrent events, and summary measures. Many elements of the estimand framework are described individually in the subsections below, promoting alignment of trial objectives, conduct, analysis, and interpretation of results.
Confounding and Bias
A specific design consideration for externally controlled trials involves prespecifying plans on how to measure and analyze data on important confounding factors and sources of bias.
Before deciding whether an externally controlled trial is a suitable design to answer the research question of interest, sponsors should confirm that recognized, important prognostic characteristics can be assessed in the data sources that will be used in the trial. Specifically, the source population for the external control arm should be as comparable as possible to the treatment arm population.
Although unmeasured confounding, lack of blinding, and other sources of bias cannot be eliminated in externally controlled trials, an assessment of the extent of confounding and bias, along with analytic methods to reduce the impact of such bias, are critically important in the conduct of such trials.
Designation of Index Date (Time Zero)
A specific and difficult challenge when designing externally controlled trials is specifying the index date (also called time zero), which is the start of the observation period for assessing endpoints. Given the lack of randomization in externally controlled trials, differences in how the index date is determined across trial arms may lead to biased effect estimates.
If there are temporal differences in this date relative to treatment initiation or other important landmark times by treatment arm, any observed treatment effects may be biased. Determination of the index date in the treatment arm and the external control arm should avoid analyses that include a period of time (immortal time) during which the outcome of interest could not have occurred in one of the two arms. For example, consider an externally controlled trial that involves a time-to-event mortality endpoint and an index date established as the time of having failed prior therapy. If analyses of participants in the treatment arm include only those who actually receive the drug of interest, then any period of time between eligibility determination (i.e., failed prior therapy) and treatment initiation is immortal time; that is, the person must survive the period to receive the drug and be accounted for in the analysis. In contrast, if patients in the external control arm do not receive subsequent therapy after determination of eligibility (i.e., failed prior therapy), these patients would be included in the analysis regardless of survival. Accordingly, patients with very short survival times would be included in the control arm but not in the treatment arm, introducing a bias that makes the drug seem more effective than it actually is.
Assessment of Outcomes
The lack of blinding to treatments in externally controlled trials can pose challenges when considering certain outcomes. Sponsors should seek to assess outcomes consistently across the treatment arm and the external control arm for the results of an externally controlled trial to be credible.
When considering outcomes in externally controlled trials, sponsors should also evaluate the consistency of timing of outcome assessments in the treatment arm compared to the external control arm. In general, the timing and frequency of outcome assessments in RWD will have been determined during clinical care and may have been influenced by the patient’s clinical status, whereas outcome assessments in the treatment arm are protocol-specified. Accordingly, sponsors should first establish for what total duration of time and at what intervals the outcome of interest should be assessed in the analysis of data from an externally controlled trial. Based on such determinations, sponsors can then evaluate whether the availability and timing of outcome assessments are sufficient and comparable across both arms of the externally controlled trial for the research hypothesis being tested.
Data from Clinical Trials
Using data from another clinical trial for an external control arm can offer advantages over using data collected during routine clinical care, primarily due to the rigor of protocol-based (and thus more consistent) data collection. However, such use is only appropriate when comparability exists between the two trial arms in terms of participant eligibility criteria, treatment administration, patterns of care (e.g., location of treatment sites), recording of concomitant medications, and assessments of adverse events and outcomes.
Data from RWD Sources
The concerns discussed in the preceding section about comparability of participant characteristics, timing and frequency of data collection, and patterns of care should also be addressed when using real-world data (RWD) collected from patients for non-research purposes as an external control arm. Additionally, specific issues related to missing data from RWD sources, which are obtained as part of routine clinical practice, can threaten the validity of the results of an externally controlled trial.
Time Periods
Clinical care aspects, such as the standard of care, types of
treatments, supportive care regimens, and criteria for disease response
or progression, may change over time. Addressing such temporal
differences with statistical analyses alone is challenging. It’s crucial
to consider the impact of different time frames in the treatment and
external control arms on the interpretability of study
findings.
Geographic Region
Standards of care and factors affecting health-related outcomes (e.g.,
access to care) can vary across geographic regions and healthcare
systems. Balancing participants across geographic regions and healthcare
systems in an externally controlled trial, when possible, can help
mitigate confounding based on such differences.
Diagnosis
Diagnostic criteria may differ due to practice variation or changes over
time between the data collection for the treatment arm and the external
control arm. Sponsors should consider the diagnostic standards used and
ensure that relevant clinical tests to establish a diagnosis are
conducted and reported consistently across compared arms.
Prognosis
When sufficient knowledge of relevant prognostic factors is available,
prognostic indicators for participants in each arm should be evaluated
to ensure they are similar enough to allow an unbiased assessment of the
treatment-outcome association.
Treatments
Attributes of the treatment of interest—including drug formulation,
dose, route of administration, timing, frequency, duration, and specific
rules for dose modifications, interruptions, discontinuations, and
adherence—will have been prespecified or measured in the treatment arm.
In contrast, aspects of a comparator treatment in the external control
arm may not have been protocol-driven, depending on the data source.
Sponsors should assess whether the external control arm data can be
meaningfully compared to the treatment arm data.
Other Treatment-Related Factors
Treatment-related considerations, when relevant, include previous
treatments received (e.g., lines of therapy in cancer patients),
concomitant medications affecting the outcome, or predictive biomarkers
(e.g., genomic testing) related to treatment. When differentially
distributed across compared groups, these factors can compromise the
assessment of the drug-outcome association.
Follow-Up Periods
The designation of the index date should be consistent between the
treatment and external control arms, and the duration of follow-up
periods should be comparable across compared arms.
Intercurrent Events
Assess the relevance of intercurrent events across treatment arms,
including the differential use of additional therapies after the
initiation of the treatment of interest.
Outcome
The reliability and consistency of endpoint measurements in an
externally controlled trial can be influenced by endpoint definitions,
the data source for the external control arm, and knowledge of treatment
received. Sponsors should apply the same criteria for evaluating and
timing outcome assessments across both arms of the externally controlled
trial.
Missing Data
Assessing the extent of missing data in the external control arm is
crucial before conducting an externally controlled trial to evaluate its
feasibility. When analyzing results, the impact of missing data in both
the treatment and external control arms should also be
evaluated.
General Considerations
Statistical Analysis Plan: Before conducting an externally controlled trial, sponsors should develop a statistical analysis plan that prespecifies analyses of interest, such as analyses of primary and secondary endpoints, calculations of statistical power and sample size, and plans to control the chance of erroneous conclusions (e.g., to control the overall type I error probability). The statistical analysis plan should be submitted along with the protocol to the relevant review division before initiating enrollment in the clinical trial for the experimental treatment.
Blinding Decisions: Decisions regarding the study design and statistical analysis plan for an externally controlled trial should be blinded to any observed external control data (e.g., from an existing RWD source), with the exception of planned feasibility analyses, such as evaluating the availability of key variables or assessing missing data.
Analytic Methods: In general, the analytic method used should identify and manage sources of confounding and bias, including a strategy to account for differences in baseline factors and confounding variables between trial arms.
Assumptions and Diagnostics: The assumptions involved should be made explicit, and sensitivity analyses as well as model diagnostics should be conducted to examine such assumptions.
Comparability Analyses: Even when employing analytic methods to balance the trial arm populations, sponsors should propose additional analyses to evaluate the actual comparability between the external control and treatment arms for important covariates. Determining similarity across trial arms will require selection of specific population characteristics to compare, a method for comparison, and criteria to demonstrate similarity. For example, an a priori threshold could be set to determine whether the external control population has a statistical distribution of covariates that is similar to the treatment arm population after a balancing method, such as weighting, has been applied.
Effect Size Considerations: Consideration should also be given, based on available scientific data, to the anticipated effect size for analyses of the primary endpoint. Especially when the anticipated effect size is modest, an externally controlled trial may not be an appropriate study design because of concerns for bias affecting the results. In addition, sponsors should develop a priori plans for assessing the impact of confounding factors and sources of bias, with quantitative or qualitative bias analyses used to evaluate these concerns.
Missing Data
Additional Analyses
Randomized Controlled Trials (RCTs) - Overview: RCTs are considered the gold standard for evaluating treatment effects because they use random allocation to assign treatments to subjects, which ensures that, on average, treatment status is not confounded by baseline characteristics (both measured and unmeasured). - Effect Estimation: In RCTs, the effect of a treatment on outcomes is directly estimated by comparing the outcomes between treated and untreated (control) groups.
Observational Studies - Challenge: Unlike RCTs, observational studies do not use random treatment allocation, leading to potential systematic differences in baseline characteristics between treated and untreated subjects. - Necessity for Adjustment: To estimate treatment effects accurately in observational studies, it’s crucial to account for these baseline differences.
Propensity Score - Definition: The propensity score is the probability of receiving treatment, conditional on observed baseline characteristics. - Purpose: It serves to mimic some characteristics of RCTs within observational studies by balancing the observed baseline covariates between treated and untreated groups.
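As a concrete illustration, the propensity score is typically estimated with a logistic regression of treatment assignment on baseline covariates. The following is a minimal base-R sketch using hypothetical data and variable names (`dat`, `treat`, `age`, `sex`, `severity`); it is not a prescribed model specification.

```r
# Minimal sketch: estimate propensity scores by logistic regression (hypothetical data).
ps_model <- glm(treat ~ age + sex + severity,
                family = binomial(),
                data   = dat)

# Estimated probability of treatment given covariates, i.e., the propensity score
dat$ps <- predict(ps_model, type = "response")

# Quick look at the overlap of the score distributions between arms
summary(dat$ps[dat$treat == 1])
summary(dat$ps[dat$treat == 0])
```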
Propensity score matching approximates a randomized trial by pairing treated (experimental) subjects with control subjects who have similar propensity scores.
When both underlying assumptions are satisfied, namely that treatment assignment is unconfounded given the measured covariates (no unmeasured confounding) and that every subject has a positive probability of receiving either treatment (overlap, or common support), the propensity score can be effectively used as a balancing score: treated and untreated subjects with similar propensity scores will have similar distributions of observed covariates. This allows researchers to approximate the conditions of a randomized controlled trial, thereby enabling them to estimate causal treatment effects from observational data.
Propensity score matching (PSM) is a statistical technique used to create comparable groups in observational studies where random assignment is not possible. This method helps to reduce bias in estimates of treatment effects by balancing observed covariates between treated and untreated groups. Here’s a closer look at how PSM works and the steps involved:
PSM aims to mimic the conditions of a randomized controlled trial by matching units (e.g., patients, schools, etc.) that have received a treatment with similar units that have not, based on their propensity scores. A propensity score is the probability of a unit being assigned to a particular treatment, given a set of observed covariates.
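To make the matching step concrete, here is a minimal sketch of 1:1 nearest-neighbor matching with the MatchIt package (mentioned later in this document), continuing the hypothetical data from the previous sketch. The caliper value is illustrative, not a recommendation.

```r
# Minimal sketch: 1:1 nearest-neighbor propensity score matching (hypothetical data).
library(MatchIt)

m_out <- matchit(treat ~ age + sex + severity,
                 data     = dat,
                 method   = "nearest",   # greedy nearest-neighbor matching
                 distance = "glm",       # logistic-regression propensity score
                 ratio    = 1,
                 caliper  = 0.2)         # caliper in SD units of the distance measure

summary(m_out)                 # covariate balance before vs. after matching
matched <- match.data(m_out)   # matched sample for the outcome analysis
```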
Alternative Methods Using Propensity Scores
Besides matching, propensity scores can be utilized through: - Stratification: Dividing the sample into quintiles or deciles based on propensity scores and comparing outcomes within these strata. - Regression Adjustment: Including the propensity score as a covariate in a regression model. - Weighting: Applying weights based on the inverse probability of treatment to create a synthetic sample in which the distribution of measured baseline covariates is independent of treatment assignment.
Matching algorithms play a pivotal role in propensity score matching (PSM) by determining how participants in treatment and control groups are paired based on their estimated propensity scores. Here’s a detailed breakdown of the key matching methods and algorithms, as well as the critical decisions involved in the process:
Greedy Matching: Quickly selects matches based on immediate proximity in propensity scores, without considering future matches. This includes nearest-neighbor matching, with or without a caliper that caps the allowable difference in propensity scores between matched subjects.
Genetic Matching: Refines matches iteratively by considering both propensity scores and Mahalanobis distance, enhancing the overall match quality.
Optimal Matching: Aims to minimize the total within-pair difference in propensity scores across all pairs, striving for the most statistically balanced matches.
Matching without Replacement vs. Matching with Replacement
Greedy vs. Optimal Matching
When implementing propensity score matching (PSM) in observational studies, selecting the right matching algorithm is crucial, especially when sample sizes are small. The choice of algorithm affects the trade-off between bias and variance, and ultimately, the validity and reliability of the estimated treatment effects.
In propensity score matching, achieving a balance between minimizing bias and variance is essential. A match that is too strict (e.g., requiring exact matches on many covariates) may reduce bias but increase variance because fewer matches are available, leading to less precise estimates. Conversely, more lenient matching criteria can increase the sample size of matched pairs but may introduce bias if the matches are not sufficiently similar.
Assessing the quality of matching in propensity score analysis is a crucial step to ensure the validity of causal inference. Below are detailed methodologies and considerations to check the quality of matching, focusing on overlap and common support, and subsequent steps to validate matching effectiveness:
Visual Analysis
Sensitivity to Extreme Values
Standardized Bias (SB)
Two-Sample t-Test
Joint Significance and Pseudo-R^2
Stratification Test
Complexity in Variance Estimation
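Of the diagnostics listed above, the standardized bias (standardized mean difference) is the most commonly reported and is easy to compute directly. The sketch below uses the hypothetical full (`dat`) and matched (`matched`) data sets from the earlier examples; a value below roughly 0.1 is often used as a rule of thumb for acceptable balance.

```r
# Minimal sketch: standardized bias (standardized mean difference) for selected covariates.
std_bias <- function(x, treat) {
  m1 <- mean(x[treat == 1]); m0 <- mean(x[treat == 0])
  v1 <- var(x[treat == 1]);  v0 <- var(x[treat == 0])
  (m1 - m0) / sqrt((v1 + v0) / 2)   # difference in means over a pooled SD
}

covs   <- c("age", "severity")
before <- sapply(covs, function(v) std_bias(dat[[v]], dat$treat))
after  <- sapply(covs, function(v) std_bias(matched[[v]], matched$treat))
round(cbind(before, after), 3)      # balance before vs. after matching
```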
Stratification on the propensity score is a method of controlling for confounding in observational studies by dividing subjects into strata based on their estimated propensity scores. This technique aims to make the treatment and control groups within each stratum more comparable, thereby reducing bias and approximating the conditions of a randomized controlled trial.
Stratification Process: Subjects are ranked and divided into mutually exclusive subsets based on the quintiles of their estimated propensity scores. This division often results in five equal-size groups, each representing a different segment of the propensity score distribution.
Bias Reduction: According to research by Cochran (1968) and later by Rosenbaum and Rubin (1984), stratifying on the quintiles of a continuous confounder can eliminate approximately 90% of the bias. This effectiveness is maintained when applying stratification to the propensity score, significantly reducing the bias due to measured confounders.
Increasing Strata for Bias Reduction: While increasing the number of strata can further reduce bias, the marginal benefit decreases with more strata. This diminishing return needs to be balanced against the complexity and sample size requirements of additional strata.
Quasi-Randomized Controlled Trials (quasi-RCTs): Each stratum can be seen as an independent quasi-RCT where the treatment effect is estimated by directly comparing outcomes between treated and untreated subjects within that stratum.
Pooling of Stratum-Specific Estimates: The treatment effects estimated within each stratum can be pooled to derive an overall estimate of the treatment effect. This is done using weighted averages, where weights are typically equal to 1/K for K strata, or proportional to the number of treated subjects in each stratum to focus on the average treatment effect on the treated (ATT).
Variance Estimation: Pooling the variances of the stratum-specific treatment effects provides a comprehensive estimate of the variance for the overall treatment effect. This aspect of variance estimation is crucial for assessing the precision and statistical significance of the estimated effects.
Within-Stratum Regression Adjustment: To further refine the estimates and account for any residual differences between treated and untreated subjects within each stratum, regression adjustment can be applied. This step adjusts for covariates that may still be imbalanced within strata.
Stratum-Specific Effects: Each stratum-specific effect provides insight into how treatment effects might vary across different levels of propensity score, offering a more nuanced understanding of the treatment’s impact across different subgroups.
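The stratification workflow described above (quintile strata, stratum-specific contrasts, pooling with equal or treated-proportional weights) can be sketched in base R on the hypothetical data, assuming a continuous outcome `y` and the estimated score `ps` from the earlier example.

```r
# Minimal sketch: stratification on the propensity score (hypothetical data).
# 1. Form quintile strata from the estimated propensity scores
dat$stratum <- cut(dat$ps,
                   breaks = quantile(dat$ps, probs = seq(0, 1, by = 0.2)),
                   include.lowest = TRUE, labels = FALSE)

# 2. Stratum-specific treatment effects (difference in mean outcome within each stratum)
strata  <- sort(unique(dat$stratum))
effects <- sapply(strata, function(s) {
  d <- dat[dat$stratum == s, ]
  mean(d$y[d$treat == 1]) - mean(d$y[d$treat == 0])
})

# 3. Pool the stratum-specific estimates
ate_equal <- mean(effects)   # equal weights 1/K across K strata
n_treated <- sapply(strata, function(s) sum(dat$treat == 1 & dat$stratum == s))
att <- sum(effects * n_treated) / sum(n_treated)   # weights proportional to treated (ATT)
```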
Propensity score weighting is a statistical technique commonly used in observational studies to control for confounding variables. This technique allows researchers to estimate the effect of a treatment by creating a more balanced comparison between treated and untreated groups.
Overview
Advantages
Limitations
In Real-World Evidence (RWE) studies, different propensity score weighting methods are tailored to specific analytical goals and study designs. Here, we outline three common methods and their specific applications and advantages in the context of estimating treatment effects:
Inverse probability of treatment weighting (IPTW) is designed to estimate the Average Treatment Effect (ATE) across the entire population under study, assuming that every individual could potentially receive the treatment. This method assigns weights based on the inverse probability of receiving the treatment as predicted by the propensity score. Specifically: - Treated patients receive weights of \(\frac{1}{\text{propensity score}}\). - Control patients receive weights of \(\frac{1}{1 - \text{propensity score}}\).
Advantages: - Ensures a balanced representation by adjusting for the differences in baseline characteristics across the treated and control groups. - Particularly useful when evaluating the potential impact of a treatment on a general population.
Standardized mortality/morbidity ratio (SMR) weighting is tailored to studies where it is important to preserve the characteristics of one study arm, typically the clinical trial arm, while making comparisons with an external control arm (ECA). This approach adjusts the ECA so that it resembles the trial population more closely, rather than balancing both populations to a common standard.
Advantages: - Preservation of Trial Results: Keeps the integrity of the clinical trial arm intact while adjusting the ECA. - Useful for External Comparisons: Ideal for studies incorporating ECAs where the clinical trial data is considered the standard.
Overlap weighting focuses on the subset of patients whose characteristics most strongly overlap between the treated and untreated groups. It assigns weights that are inherently bounded between zero and one, which represents a proportionate influence based on the degree of overlap in their propensity scores.
Advantages: - Reduction of Extreme Weights: Unlike IPTW, which can give rise to extreme weights if patients have very low or very high propensity scores, overlap weighting naturally bounds weights, reducing the influence of outliers. - Balances Confounders: Achieves exact balance on the means of measured confounders included in the propensity model (when the score is estimated by logistic regression), minimizing residual confounding.
Practical Considerations: - Selection of Method: The choice between IPTW, SMR, and overlap weighting should depend on the specific objectives of the study and the nature of the data. - Addressing Limitations: While these methods can significantly reduce bias due to confounding, they still rely on the assumption that all relevant confounders have been measured and correctly modeled. - Software Implementation: In R, packages like MatchIt and twang provide tools to implement these weighting methods efficiently, allowing for robust sensitivity analyses and diagnostics to check the balance and performance of the weights.
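The three weighting schemes differ only in how the estimated propensity score is converted into a weight. Below is a minimal base-R sketch, reusing the hypothetical `ps`, `treat`, and outcome `y` from earlier examples; note that inference on weighted estimates typically requires robust (sandwich) standard errors or bootstrapping.

```r
# Minimal sketch: IPTW, SMR, and overlap weights from an estimated propensity score.
ps    <- dat$ps
treat <- dat$treat

# IPTW (targets the ATE): 1/ps for treated, 1/(1 - ps) for controls
w_iptw <- ifelse(treat == 1, 1 / ps, 1 / (1 - ps))

# SMR weighting (targets the treated/trial population):
# treated keep weight 1; controls are re-weighted by the odds ps / (1 - ps)
w_smr <- ifelse(treat == 1, 1, ps / (1 - ps))

# Overlap weights: treated weighted by (1 - ps), controls by ps; bounded between 0 and 1
w_overlap <- ifelse(treat == 1, 1 - ps, ps)

# Weighted outcome comparison (point estimate only; use robust SEs for inference)
fit <- lm(y ~ treat, data = dat, weights = w_iptw)
coef(fit)["treat"]
```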
Covariate adjustment using the propensity score involves incorporating the propensity score as a covariate in a regression model that also includes the treatment indicator. This approach allows for the control of confounding variables that are accounted for in the propensity score, providing a more precise estimate of the treatment effect.
Propensity Score Estimation: First, calculate the propensity score for each participant. This score is typically estimated using logistic regression, where the treatment assignment is regressed on observed covariates.
Regression Model: Choose the appropriate regression model based on the nature of the outcome variable (e.g., linear regression for a continuous outcome, logistic regression for a binary outcome, or a Cox proportional hazards model for a time-to-event outcome).
Treatment Effect Estimation:
Advantages
Disadvantages
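A minimal sketch of covariate adjustment using the propensity score, continuing the hypothetical example: the estimated score is entered alongside the treatment indicator in an outcome regression whose form matches the outcome type (the outcome names `y` and `y_bin` are placeholders).

```r
# Minimal sketch: covariate adjustment using the propensity score (hypothetical data).
# Continuous outcome: linear model with treatment indicator and propensity score
fit_cont <- lm(y ~ treat + ps, data = dat)

# Binary outcome: logistic model (treatment effect on the odds-ratio scale)
fit_bin <- glm(y_bin ~ treat + ps, family = binomial(), data = dat)

# The coefficient on 'treat' is the adjusted treatment effect estimate
coef(summary(fit_cont))["treat", ]
```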
When estimating the propensity score, two critical decisions must be made: the choice of the model for estimating the score and the selection of variables to include in that model. Both of these choices are essential for ensuring that the propensity score effectively balances the treatment and control groups on observed covariates, thereby reducing bias in the estimation of treatment effects.
In summary, when estimating the propensity score, the choice of model depends on the nature of the treatment (binary vs. multiple treatments), with logit and probit models commonly used for binary treatments, and multinomial probit often preferred for multiple treatments. The selection of variables to include in the model is critical and should focus on those that influence both treatment assignment and the outcome to reduce confounding and bias in the estimated treatment effects.
a. Binary Treatment Case: - Logit vs. Probit Models: - In cases where there is a binary treatment (i.e., participation vs. non-participation), either the logit or probit models are typically used to estimate the propensity score. - Logit Model: The logit model assumes a logistic distribution of the error terms and is characterized by its “S” shaped curve, with more density mass at the extremes (closer to 0 and 1). This characteristic makes it slightly more sensitive to extreme values (cases where the probability of receiving the treatment is very close to 0 or 1). - Probit Model: The probit model assumes a normal distribution of the error terms. The choice between logit and probit often results in similar propensity scores because, in practice, the differences in their predictions are minor. - Conclusion: Since both models generally yield similar results, the choice between them is not critical. However, if the distribution of the propensity scores is a concern, the logit model might be preferred due to its distribution characteristics.
b. Multiple Treatments: - Multinomial Logit vs. Multinomial Probit Models: - In cases where there are more than two treatment options, the model choice becomes more complex. - Multinomial Logit Model: This model extends the logit model to multiple categories. However, it is based on the Independence of Irrelevant Alternatives (IIA) assumption, which implies that the relative odds of choosing between any two alternatives are independent of the presence of other alternatives. - Multinomial Probit Model: This model is often preferred for multiple treatments because it relaxes the IIA assumption. It allows for more flexible correlations between the error terms across different treatment categories, leading to potentially more accurate estimations of the propensity scores in cases with multiple treatments. - Conclusion: The multinomial probit model is generally preferred for multiple treatment scenarios because it makes fewer restrictive assumptions compared to the multinomial logit model.
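The logit-versus-probit comparison can be checked directly on a given data set: fitting both links to the same hypothetical data and comparing the fitted scores typically shows that they are nearly identical, as the discussion above suggests.

```r
# Minimal sketch: logit vs. probit propensity score models (hypothetical data).
ps_logit  <- glm(treat ~ age + sex + severity,
                 family = binomial(link = "logit"),  data = dat)
ps_probit <- glm(treat ~ age + sex + severity,
                 family = binomial(link = "probit"), data = dat)

# Fitted propensity scores from the two links are usually very close
cor(fitted(ps_logit), fitted(ps_probit))
plot(fitted(ps_logit), fitted(ps_probit),
     xlab = "Logit propensity score", ylab = "Probit propensity score")
abline(0, 1, lty = 2)
```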
The choice of variables to include in the propensity score model is crucial and has been the subject of extensive debate in the applied literature. This choice directly impacts the effectiveness of the propensity score in balancing covariates between treated and control groups.
a. Theoretical Considerations: - The propensity score is defined as the probability of treatment assignment given a set of covariates: \(e_i = Pr(Z_i = 1 | X_i)\). The goal is to include variables in the model that ensure the Conditional Independence Assumption (CIA), which states that the potential outcomes are independent of treatment assignment given the observed covariates (and therefore also given the propensity score).
b. Common Approaches: 1. All Measured Baseline Covariates: - This approach involves including every available baseline characteristic in the propensity score model, regardless of whether they are related to treatment assignment or the outcome. While comprehensive, this method may introduce noise and reduce the efficiency of the propensity score estimation. 2. Restricted Variable Sets: - Alternatives restrict the model to variables associated with treatment assignment, variables associated with the outcome (potential confounders), or variables associated with both (true confounders); consistent with the summary above, variables that influence both treatment assignment and the outcome are the most important to include.
c. Practical Considerations: - Overfitting: Including too many variables can lead to overfitting, where the propensity score model fits the sample data well but does not generalize to the population. - Omitted Variable Bias: Excluding important confounders can lead to biased estimates of the treatment effect. Thus, a careful balance is needed to include relevant covariates without overcomplicating the model. - Interaction Terms and Non-Linearities: Sometimes, including interaction terms or non-linear transformations of variables (e.g., squared terms) can improve the balance achieved by the propensity score.
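Interaction terms and non-linear transformations can be added directly to the propensity score model formula. The sketch below (hypothetical covariates) adds a squared term and an interaction; whether to keep such terms should be judged by the covariate balance they produce, not by model fit alone.

```r
# Minimal sketch: richer propensity score specifications (hypothetical data).
ps_simple <- glm(treat ~ age + sex + severity,
                 family = binomial(), data = dat)

ps_flex   <- glm(treat ~ age + I(age^2) + sex * severity,   # squared term and interaction
                 family = binomial(), data = dat)

dat$ps_flex <- predict(ps_flex, type = "response")
# Balance diagnostics (e.g., standardized bias after matching or weighting)
# should drive the choice between these specifications.
```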
Over-parameterization in propensity score models can indeed create issues in statistical analysis, as noted by Bryson et al. (2002) and others. Here’s a breakdown of the two primary reasons why over-parameterized models should be avoided, especially when dealing with propensity scores:
1. Exacerbation of the Support Problem
2. Increased Variance of Propensity Score Estimates
Practical Implications
When designing a propensity score model, a practical guideline for avoiding over-parameterization is to restrict the model to covariates that plausibly affect both treatment assignment and the outcome, rather than including every available variable.
When constructing a propensity score model, selecting the appropriate variables is crucial because these choices significantly affect the ability to balance covariates between the treated and untreated groups. This process involves statistical strategies to optimize the selection of variables based not on maximizing the prediction accuracy per se, but rather on achieving a balance that supports causal inference. Here’s an overview of the three strategies outlined by researchers like Heckman and colleagues for variable selection in the context of propensity score estimation:
This approach focuses on maximizing the correct classification rate within the sample: - Methodology: Variables are selected based on their ability to correctly predict treatment assignment. An observation is classified as receiving treatment (1) if the estimated propensity score is higher than the sample proportion of individuals actually receiving the treatment. Otherwise, it’s classified as not receiving treatment (0). - Goal: The objective is to maximize the overall prediction accuracy, assuming equal misclassification costs for both groups. This method effectively prioritizes variables that improve the model’s ability to distinguish between those likely and unlikely to receive treatment. - Limitation: While this method can optimize the model’s predictive accuracy, it might not necessarily lead to the best covariate balance between treatment groups, as it emphasizes prediction over balance.
This method uses a stepwise approach to building the propensity score model: - Methodology: Starting with a basic model (often including key demographic variables like age and location), additional variables are incrementally tested and included if they prove statistically significant at conventional levels (e.g., p<0.05). - Combined Approach: It can be combined with the hit or miss method, where variables are included if they are both statistically significant and enhance the prediction rates substantially. - Consideration: This approach aligns with traditional statistical modeling principles but may lead to overfitting if too many variables are tested and included based solely on their statistical significance without considering their practical impact on treatment effect estimation.
This method focuses on model accuracy and the impact of each variable block on model performance: - Methodology: Beginning with a minimal model, additional blocks of variables are progressively added. The inclusion of each block is evaluated based on its effect on the mean squared error (MSE) of the model. - Consideration: This method emphasizes goodness-of-fit, but as noted by Black and Smith, it should be guided by theoretical and empirical considerations regarding the variables’ relevance to treatment assignment and outcomes. - Limitation: There’s a risk of selecting a model more for its fit than for its theoretical justification, which might skew the propensity score’s effectiveness in balancing covariates crucial for causal inference.
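Of the three strategies above, the 'hit or miss' criterion is the simplest to sketch in code: each observation is classified as treated when its estimated score exceeds the sample treatment proportion, and the correct classification rate is compared across candidate specifications (hypothetical data and formulas below).

```r
# Minimal sketch: 'hit or miss' classification rate for candidate PS models (hypothetical data).
hit_rate <- function(formula, data) {
  model  <- glm(formula, family = binomial(), data = data)
  ps     <- predict(model, type = "response")
  cutoff <- mean(data$treat)            # sample proportion actually treated
  pred   <- as.integer(ps > cutoff)     # classify as treated if PS exceeds the proportion
  mean(pred == data$treat)              # correct classification rate
}

hit_rate(treat ~ age + sex, dat)
hit_rate(treat ~ age + sex + severity, dat)   # does the added variable improve prediction?
```

As the text notes, a higher hit rate does not guarantee better covariate balance, so this criterion should not be used in isolation.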
Sensitivity analysis in causal inference is crucial when using matching estimators to assess treatment effects, as it helps researchers understand how deviations from the critical assumption of unconfoundedness due to unobserved heterogeneity might affect their results.
1. Rosenbaum and Rubin’s Approach (1983): - Concept: This method proposes evaluating the sensitivity of the Average Treatment Effect (ATE) to assumptions about an unobserved binary covariate \(U\) that influences both the treatment assignment and the outcome. - Implementation: Assume that the treatment assignment is not entirely random but could be unconfounded given both observable characteristics \(X\) and an unobserved covariate \(U\). By hypothesizing different scenarios for the distribution of \(U\) and its relationships with the treatment \(D\) and outcomes \(Y(0)\) and \(Y(1)\), researchers can assess how changes in these assumptions might alter the ATE estimates. - Utility: This approach allows for a structured assessment of how sensitive results are to the omission of potentially important unobserved covariates.
2. Imbens’s Method (2003): - Concept: Rather than focusing on the coefficients of the unobserved covariates, this approach uses partial \(R^2\) values to quantify the impact of unobserved variables. - Implementation: The partial \(R^2\) represents how much of the variation in treatment assignment is explained by the unobserved covariates, after accounting for the observed covariates. - Utility: By comparing these \(R^2\) values to those of observed covariates, researchers can more easily interpret and judge the likelihood that unobserved factors could substantially alter the results, enhancing the practicality of sensitivity analyses.
1. Rosenbaum’s Framework (2002): - Concept: This approach assesses how much an unmeasured variable could potentially impact the treatment selection process, enough to invalidate the conclusions derived from a matching analysis. - Implementation: Analyze how variations in the influence of an unmeasured variable \(U\) might change the treatment effect estimates, assuming all else constant. This could involve hypothesizing different levels of association between \(U\) and both the treatment and the outcome. - Utility: It provides a way to conceptually and quantitatively assess the vulnerability of causal claims to hidden biases due to unmeasured confounding.
2. Ichino et al.’s Technique (2006): - Concept: Focuses specifically on the sensitivity of the Average Treatment effect on the Treated (ATT) to deviations from the assumption of unconfoundedness. - Implementation: - Simulate different distributions for \(U\) and incorporate these into the set of matching variables. - Re-estimate the ATT by including \(U\) in the analysis, effectively assessing how the ATT changes under various hypothetical scenarios concerning \(U\). - This simulation helps visualize how robust the ATT estimates are to changes in assumptions about the nature of the unmeasured confounder.
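A heavily simplified sketch of the simulation idea behind Ichino et al.'s technique is shown below: a hypothetical binary confounder `U` is generated with assumed (arbitrary) prevalences that depend on treatment and on a binarized outcome, added to the matching variables, and the ATT is re-estimated to see how far it moves. This illustrates the general idea under made-up parameters; it is not an implementation of the published algorithm.

```r
# Heavily simplified sketch of an Ichino-style sensitivity analysis (hypothetical data).
library(MatchIt)

att_from_match <- function(formula, data) {
  m  <- matchit(formula, data = data, method = "nearest")
  md <- match.data(m)
  mean(md$y[md$treat == 1]) - mean(md$y[md$treat == 0])   # simple matched ATT estimate
}

sim_att_with_u <- function(data) {
  high_y <- data$y > median(data$y)
  # Assumed probabilities that U = 1 by treatment arm and outcome level (arbitrary values)
  p_u <- ifelse(data$treat == 1 &  high_y, 0.7,
         ifelse(data$treat == 1 & !high_y, 0.5,
         ifelse(data$treat == 0 &  high_y, 0.4, 0.2)))
  data$U <- rbinom(nrow(data), 1, p_u)
  att_from_match(treat ~ age + sex + severity + U, data)
}

set.seed(1)
att_base   <- att_from_match(treat ~ age + sex + severity, dat)
att_with_u <- replicate(50, sim_att_with_u(dat))
c(baseline = att_base, low = min(att_with_u), high = max(att_with_u))
```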
Austin PC. An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies. Multivariate Behav Res. 2011 May;46(3):399-424. doi: 10.1080/00273171.2011.568786. Epub 2011 Jun 8. PMID: 21818162; PMCID: PMC3144483.
Caliendo, M., & Kopeinig, S. (2008). Some practical guidance for the implementation of propensity score matching. Journal of Economic Surveys, 22(1), 31-72. doi:10.1111/j.1467-6419.2007.00527.x
Understanding propensity score weighting methods https://aetion.com/evidence-hub/understanding-propensity-score-weighting-methods-rwe/
Diamond, A. & Sekhon, J. (2013). Genetic Matching for Estimating Causal Effects: A General Multivariate Matching Method for Achieving Balance in Observational Studies. The Review of Economics and Statistics. July 2013, Vol. 95, No. 3, Pages: 932-945.
Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2007). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis, 15(3), 199-236.
King, G., & Nielsen, R. Why Propensity Scores Should Not Be Used for Matching. Retrieved February 2, 2017, from http://gking.harvard.edu/files/gking/files/psnot.pdf