Sample Size Determination for Survival Endpoint
Survival analysis, also known as time-to-event (TTE) analysis, is a statistical method used to estimate the time it takes for a specific event of interest to occur. This type of analysis is crucial in fields where timing is an essential factor, such as in clinical trials, where understanding the length of time until an event such as death, disease progression, or recurrence can guide treatment decisions and policy-making.
Events in Survival Analysis
Survival analysis is particularly prevalent in clinical areas such as oncology and cardiology, where understanding the progression of diseases over time is critical. These analyses help quantify the benefits of new treatments or interventions over standard care or placebos.
Early vs. Late Stage Trials
The power of a survival analysis study is influenced by several design choices. Sample size determination is therefore a central aspect of designing studies that investigate time-to-event outcomes; the main steps are outlined below.
1. Define Objectives First, clearly define the study objective. In survival analysis, this typically involves comparing survival times between two or more groups. For example, comparing the survival time of patients receiving a new treatment versus those receiving a standard treatment.
2. Choose the Survival Function Determine which type of survival function or model will be used, such as the exponential, Weibull, or Cox proportional hazards model. The choice depends on the nature of the data and the specific assumptions each model holds.
3. Determine Endpoint Identify the primary endpoint of the study, such as time until death, disease recurrence, or another event. This will also involve deciding how censored data (participants who leave the study early or do not experience the event by the end of the study) will be handled.
4. Specify Effect Size Determine the effect size that is clinically meaningful. This could be a difference in median survival times between groups or a hazard ratio.
5. Set the Parameters for Calculation
- Alpha (Type I error rate): Typically set at 0.05, representing a 5% chance of falsely declaring a difference in survival.
- Beta (Type II error rate): Commonly set at 0.20, corresponding to 80% power to detect a difference if one truly exists.
- Accrual Time: The period during which subjects are recruited into the study.
- Follow-up Time: The time after the end of accrual, during which events continue to be collected.
6. Estimate Event Rate and Survival Probabilities Estimate the proportion of participants expected to experience the event in each group, using prior data or studies to inform these estimates. Survival probabilities at specific time points (such as 1-year or 5-year survival) are also useful.
7. Adjust for Dropouts and Non-adherence Finally, adjust the initial sample size estimate to account for potential dropouts and non-adherence to treatment protocols so that the study maintains sufficient power. A minimal R sketch combining steps 4-7 follows this list.
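To make steps 4-7 concrete, here is a minimal R sketch (ours, not from any package; all numeric inputs are hypothetical placeholders) that turns a clinically meaningful hazard ratio into a required number of events via Schoenfeld's formula, and then inflates that to a total sample size using an assumed event probability and dropout rate:
alpha <- 0.05     # two-sided type I error rate (step 5)
power <- 0.80     # 1 - beta (step 5)
HR <- 0.70        # hypothetical clinically meaningful hazard ratio (step 4)
p1 <- 0.5         # proportion of subjects allocated to the control group
p.event <- 0.6    # assumed probability of observing the event during the study (step 6)
dropout <- 0.10   # assumed dropout rate (step 7)
# Required number of events (Schoenfeld, 1983)
E <- (qnorm(1 - alpha / 2) + qnorm(power))^2 / (p1 * (1 - p1) * log(HR)^2)
# Convert events to subjects, then inflate for dropout
N <- ceiling(E / p.event / (1 - dropout))
c(events = ceiling(E), subjects = N)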
In survival analysis, the proportional hazards (PH) assumption is a cornerstone of commonly used methods like the log-rank test and the Cox proportional hazards model. This assumption implies that the hazard ratios between groups are constant over time. However, there are many situations, particularly with newer treatments like immunotherapies, where this assumption does not hold, leading to non-proportional hazards (NPH) scenarios.
Non-proportional hazards occur when the hazard ratio between treatment groups changes over time. This can manifest in several ways, for example as a delayed treatment effect, crossing survival curves, or a treatment effect that diminishes over time.
NPH scenarios pose significant challenges in the analysis and interpretation of clinical trial data, as traditional methods may lead to biased or inaccurate estimates of treatment effects. This can affect the statistical power and the type I error rate of a study, potentially leading to incorrect conclusions about a therapy’s efficacy.
Due to the limitations of traditional PH-based methods in handling NPH scenarios, there is growing interest in developing robust alternatives; candidate methods are summarized in the table of methods later in this document.
Ongoing Debate
The debate over the best approach to handle NPH scenarios is ongoing. No single method can universally handle all types of non-proportional hazards effectively. The choice of method often depends on:
- The specific nature of the hazard function in the data.
- The clinical context and the mechanism of action of the treatment.
- The goal of the analysis (e.g., regulatory submission, exploratory research).
Power for a TTE endpoint is related to several design choices:
- Power is driven primarily by the number of events (E), not the sample size (N).
- E is calculated separately from N.
- Accrual and follow-up periods.
- Survival distribution and effect size.
- Other considerations.
Reference
Note:
We assume that a study is to be made comparing the survival (or healing) of a control group with an experimental group. The control group (group 1) consists of patients who will receive the existing treatment; where no existing treatment exists, group 1 consists of patients who will receive a placebo. The experimental group (group 2) will receive the new treatment. We assume that the critical event of interest is death and that the two treatments have survival distributions with instantaneous death (hazard) rates \(\lambda_1\) and \(\lambda_2\). A hazard rate is approximately a subject's probability of death in a short period of time.
There are several ways to compare two hazard rates. One is the difference, \(\lambda_2-\lambda_1\). Another is the ratio, \(\lambda_2 / \lambda_1\), called the hazard ratio. \[ H R=\frac{\lambda_2}{\lambda_1} \] Note that since HR is formed by dividing the hazard rate of the experimental group by that of the control group, a treatment that has a smaller hazard rate than the control will have a hazard ratio that is less than one.
The hazard ratio may be formulated in other ways. If the proportions surviving during the study are \(S_1\) and \(S_2\) for the control and experimental groups, the hazard ratio is given by \[ HR=\frac{\log \left(S_2\right)}{\log \left(S_1\right)} \] Furthermore, if the median survival times of the two groups are \(M_1\) and \(M_2\), the hazard ratio is given by \[ HR=\frac{M_1}{M_2} \]
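For example, under exponential survival the median-based formulation gives the hazard ratio directly; with medians of 6 months (control) and 9 months (experimental), as in the trial design discussed later:
M1 <- 6; M2 <- 9   # median survival times (months), control and experimental
M1 / M2            # HR = 0.667: the experimental hazard is lower, so HR < 1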
We assume that the logrank test will be used to analyze the data once they are collected (although Cox proportional hazards regression is often used for the actual analysis). The power calculation for the logrank test is based on \[ z_{1-\beta}=\frac{|H R-1| \sqrt{N(1-w) \varphi\left[\left(1-S_1\right)+\varphi\left(1-S_2\right)\right] /(1+\varphi)}}{1+\varphi H R}-z_{1-\alpha / k} \] where \(k\) is 1 for a one-sided hypothesis test or 2 for a two-sided test, \(\alpha\) and \(\beta\) are the error rates defined as usual, the \(z\)'s are the usual quantiles of the standard normal distribution, \(w\) is the proportion lost to follow-up, and \(\varphi\) is the sample size ratio between the two groups: \[ \varphi=\frac{N_2}{N_1}. \] Note that the null hypothesis is that the hazard ratio is one, i.e., \[ H_0: \frac{\lambda_2}{\lambda_1}=1. \]
ssc.logRank.Freedman <- function(S.trt, S.ctrl, sig.level = 0.05, power = 0.8,
                                 alternative = c("two.sided", "less", "greater"),
                                 method = c("Freedman"),
                                 pr = TRUE) {
  # FIXME: Relabel S.trt and S.ctrl as S.ctrl and S.trt
  alt <- match.arg(alternative)
  za <- if (alt == "two.sided") {
    stats::qnorm(sig.level / 2)
  } else {
    stats::qnorm(sig.level)
  }
  zb <- stats::qnorm(1 - power)
  haz.ratio <- log(S.trt) / log(S.ctrl)
  if (pr) {  # braces so that all three cat() calls are controlled by pr
    cat("\nHazard ratio:", format(haz.ratio), "\n")
    cat("Expected number of events:", 4 * (za + zb)^2 / log(1 / haz.ratio)^2)
    cat("\n")
  }
  # Freedman (1982) total sample size for equal allocation
  (((haz.ratio + 1) / (haz.ratio - 1))^2) * (za + zb)^2 / (2 - S.trt - S.ctrl)
}
ssc.logRank.Freedman(0.5,0.7,power = 0.817)
##
## Hazard ratio: 1.943358
## Expected number of events: 74.32079
## [1] 99.81032
# HR
# log(0.7) / log(0.5)
Using an unstratified log-rank test at the one-sided 2.5% significance level, a total of 282 events would allow 92.6% power to demonstrate a 33% risk reduction (a hazard ratio for RAD001/placebo of about 0.67, as calculated from an anticipated 50% increase in median PFS, from 6 months in the placebo arm to 9 months in the RAD001 arm).
With a uniform accrual of approximately 23 patients per month over 74 weeks and a minimum follow-up of 39 weeks, a total of 352 patients would be required to obtain 282 PFS events, assuming an exponential progression-free survival distribution with a median of 6 months in the placebo arm and of 9 months in the RAD001 arm. With an estimated 10% of patients lost to follow-up, a total sample size of 392 patients should be randomized.
Yao JC, Shah MH, Ito T, Bohas CL, Wolin EM, Van Cutsem E, Hobday TJ, Okusaka T, Capdevila J, de Vries EG, Tomassetti P, Pavel ME, Hoosen S, Haas T, Lincy J, Lebwohl D, Öberg K; RAD001 in Advanced Neuroendocrine Tumors, Third Trial (RADIANT-3) Study Group. Everolimus for advanced pancreatic neuroendocrine tumors. N Engl J Med. 2011 Feb 10;364(6):514-23. doi: 10.1056/NEJMoa1009290. PMID: 21306238; PMCID: PMC4208619. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4208619/
Parameter | Value
---|---
Significance Level (1-Sided) | 0.025
Placebo Median Survival (months) | 6
Everolimus Median Survival (months) | 9
Hazard Ratio | 0.66667
Accrual Period (Weeks) | 74
Minimum Follow-Up (Weeks) | 39
Power % (under constant HR) | 92.6
# Load required library
library("powerSurvEpi")
# Define parameters
power <- 0.926
alpha <- 0.05                        # ssizeCT.default treats alpha as two-sided,
                                     # so 0.05 corresponds to the one-sided 0.025 level
a <- 74 / 52                         # accrual period in years
f <- 39 / 52                         # minimum follow-up in years
median_survival_placebo <- 6 / 12    # median survival in years for placebo
median_survival_treatment <- 9 / 12  # median survival in years for treatment
dropout_rate <- 0.10
# Hazard rates assuming exponential survival: lambda = log(2) / median
lambda_C <- log(2) / median_survival_placebo
lambda_E <- log(2) / median_survival_treatment
RR <- lambda_E / lambda_C            # hazard ratio, approximately 0.66667
# pE and pC must be probabilities of an event during the study (not hazard
# rates); under uniform accrual over a and additional follow-up f,
# pi = 1 - (exp(-lambda * f) - exp(-lambda * (a + f))) / (a * lambda)
prob_event <- function(lambda, a, f) {
  1 - (exp(-lambda * f) - exp(-lambda * (a + f))) / (a * lambda)
}
pC <- prob_event(lambda_C, a, f)
pE <- prob_event(lambda_E, a, f)
# k is the ratio of participants in group E to group C (not the accrual
# time); ssizeCT.default returns the two group sizes (nE, nC)
sample_size <- ssizeCT.default(power = power, k = 1,
                               pE = pE, pC = pC, RR = RR, alpha = alpha)
# Adjust for dropout and round up to the next whole number
final_sample_size <- ceiling(sum(sample_size) / (1 - dropout_rate))
# Output the calculated sample size
print(paste("Total sample size needed, accounting for dropout:", final_sample_size))
## [1] "Total sample size needed, accounting for dropout: 242" "Total sample size needed, accounting for dropout: 112"
proc power;
   twosamplesurvival test=logrank     /* log-rank test via the TWOSAMPLESURVIVAL statement */
      groupmedsurvtimes = (0.5 0.75)  /* median survival times in years (6 and 9 months), */
                                      /* implying a hazard ratio of 0.66667 */
      accrualtime = 1.4231            /* accrual time: 74 weeks in years */
      followuptime = 0.75             /* follow-up time: 39 weeks in years */
      sides = 1                       /* one-sided test */
      power = 0.926                   /* desired power */
      alpha = 0.025                   /* one-sided significance level */
      groupweights = (1 1)            /* equal weighting of groups */
      ntotal = . ;                    /* let SAS calculate required total sample size */
run;
Reference
Introduction
One reason log-rank tests are useful is that they provide an objective criterion (statistical significance) around which to plan a study:
In survival analysis, we need to specify information regarding the censoring mechanism and the particular survival distributions in the null and alternative hypotheses.
We shall assume that patients enter a trial over a certain accrual period of length \(a\) and are then followed for an additional period of time \(f\), known as the follow-up time. Patients still alive at the end of follow-up are censored.
Exponential Approximation
In general, constant hazards (i.e., exponential distributions) are assumed for the sake of simplicity, because work in the literature has indicated that the power/sample size obtained from assuming constant hazards is fairly close to the empirical power of the log-rank test, provided that the ratio between the two hazard functions is constant. Typically in a power analysis we are simply trying to find the approximate number of subjects required by the study, and many approximations/guesses are involved, so formulas based on the exponential distribution are often good enough.
Reference
The cpower function (from the Hmisc package) assumes exponential distributions for both treatment groups. It uses the George-Desu method along with formulas of Schoenfeld that allow estimation of the expected number of events in the two groups. To allow for drop-ins (noncompliance with control therapy, crossover to intervention) and noncompliance with the intervention, the method of Lachin and Foulkes is used.
For handling noncompliance, uses a modification of formula (5.4) of Lachin and Foulkes. Their method is based on a test for the difference in two hazard rates, whereas cpower is based on testing the difference in two log hazards. It is assumed here that the same correction factor can be approximately applied to the log hazard ratio as Lachin and Foulkes applied to the hazard difference.
Note that Schoenfeld approximates the variance of the log hazard ratio by 4/m, where m is the total number of events, whereas the George-Desu method uses the slightly better 1/m1 + 1/m2. Power from this function will thus differ slightly from that obtained with the SAS samsizc program.
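As a sketch, a call to Hmisc::cpower along the following lines produces reports like the two listings below; the argument values are reconstructed from the first listing (the 44.4% reduction corresponds to 5-year mortality falling from 0.18 to 0.10), so treat them as assumptions rather than the original code:
library(Hmisc)
cpower(tref = 5,                     # time at which mortalities are specified
       n = 950,                      # total sample size
       mc = 0.18,                    # 5-year mortality, control
       r = 100 * (1 - 0.10 / 0.18),  # % reduction in mortality, intervention
       accrual = 1.5,                # accrual duration (years)
       tmin = 5,                     # minimum follow-up (years)
       noncomp.c = 10,               # drop-in rate (%), controls
       noncomp.i = 15)               # non-adherence rate (%), intervention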
##
## Accrual duration: 1.5 years Minimum follow-up: 5 years
##
## Total sample size: 950
##
## Alpha= 0.05
##
## 5-year Mortalities (Events Rate)
## Control Intervention
## 0.18 0.10
##
## Hazard Rates
## Control Intervention
## 0.03969019 0.02107210
##
## Probabilities of an Event During Study
## Control Intervention
## 0.2039322 0.1140750
##
## Expected Number of Events
## Control Intervention
## 96.9 54.2
##
## Hazard ratio: 0.5309147
##
## Drop-in rate (controls):10%
## Non-adherence rate (intervention):15%
## Effective hazard ratio with non-compliance: 0.6219687
## Standard deviation of log hazard ratio: 0.1696421
## Approximation method of variance of the log hazard ratio based on Peterson B, George SL: Controlled Clinical Trials 14:511–522; 1993.
##
## Power
## 0.7993381
##
## Accrual duration: 1.5 years Minimum follow-up: 5 years
##
## Total sample size: 950
##
## Alpha= 0.05
##
## 5-year Mortalities (Events Rate)
## Control Intervention
## 0.18 0.10
##
## Hazard Rates
## Control Intervention
## 0.03969019 0.02107210
##
## Probabilities of an Event During Study
## Control Intervention
## 0.2039322 0.1140750
##
## Expected Number of Events
## Control Intervention
## 91.8 57.0
##
## Hazard ratio: 0.5309147
##
## Drop-in rate (controls):10%
## Non-adherence rate (intervention):15%
## Effective hazard ratio with non-compliance: 0.6219687
## Standard deviation of log hazard ratio: 0.1639526
## Approximation method of variance of the log hazard ratio based on Schoenfeld D: Biometrics 39:499–503; 1983.
##
## Power
## 0.8254654
Patients will be accrued uniformly over two years and then followed for an additional three years past the accrual period. Some loss to follow-up is expected, with roughly exponential rates that would result in about 50% loss with the standard treatment within 10 years. The loss to follow-up with the proposed treatment is more difficult to predict, but 50% loss would be expected to occur sometime between years 5 and 20.
# time at event estimated = 2
# duration of accrual period = 2
# minimum follow-up time = 3
# Standard treatment: 50% loss with the standard treatment within 10 years
# Proposed treatment: 50% loss would be expected to occur sometime between years 5 and 20
# The "Standard" curve specifying an exponential form with a survival probability of 0.5 at year 5.
# The "Proposed" curve is a piecewise linear curve defined by the five points shown
proc power;
twosamplesurvival test= logrank
accrualtime=2
followuptime=3
power = 0.8
alpha = 0.05
sides = 2
curve("Standard") = 5 : 0.5
curve("Proposed") = (1 to 5 by 1):(0.95 0.9 0.75 0.7 0.6)
groupsurvival = "Standard" | "Proposed"
groupmedlosstimes = 10 | 20 10 5 /* median loss times: standard 10 years; proposed 20, 10, or 5 years */
npergroup = .;
run;
data _null_;
   /* exponential hazard rates implied by event probabilities of 30%, 45%, and 50% */
   h1  = -log(1 - 0.30);   /* 0.3567 */
   h2a = -log(1 - 0.45);   /* 0.5978 */
   h2b = -log(1 - 0.50);   /* 0.6931 */
   put h1 h2a h2b;
run;
proc power;
twosamplesurvival test=logrank
/* Specify Analysis Information */
accrualtime=2
followuptime=3
power = 0.8
alpha = 0.05
sides = 2
/* Specify Effects */
gexphs = 0.3567 | 0.5978 0.6931
groupweights = (2 1)
/* Specify Loss Information */
grouplossexphazards=(0.3567 0.3567)
ntotal= .;
plot y=power min=0.5 max=0.90;
run;
A clinical trial to assess a new treatment for patients with chronic active hepatitis.
Calculation
proc power;
twosamplesurvival test=logrank
/* Specify Analysis Information */
followuptime = 5
totaltime = 5
power = 0.8
alpha = 0.05
sides = 2
/* Specify Effects */
hazardratio = 0.57
refsurvexphazard=0.178
ntotal = . ;
run;
proc power;
twosamplesurvival
test=logrank
curve("Control") = (0 5):(1 0.8)
curve("Treatment") = (0 5):(1 0.85)
refsurvival = "Control"
accrualtime = 2.5
followuptime = 2.5
hazardratio = 1.373
alpha = 0.05
sides = 2
ntotal = .
power = 0.8;
run;
Piecewise linear survival curve
proc power;
twosamplesurvival test=logrank
curve("Existing Treatment") = 5 : 0.5
curve("Proposed Treatment") = 1 : 0.95 2 : 0.90 3:0.75 4:0.70 5:0.60
groupsurvival = "Existing Treatment" | "Proposed Treatment"
accrualtime = 2
FOLLOWUPTIME = 3
power = 0.80
alpha=0.05
npergroup = . ;
run;
Group sequential design with interim analyses
The survival probabilities at 12 months for the standard and proposed groups are specified, and the grouplossexphazards option is used to account for the dropout rate.
proc power;
twosamplesurvival test=logrank
curve("Standard") = 12 : 0.8781
curve("Proposed") = 12 : 0.9012
groupsurvival = "Standard" | "Proposed"
accrualtime = 18
totaltime = 24
grouplossexphazards = (0.0012 0.0012)
nsubinterval = 1
power = 0.85
ntotal = . ;
run;
Reference
This procedure is based on the formulas presented in Pintilie (2006) and Machin et al. (2009), which are both based on the original paper Pintilie (2002).
Introduction
The logrank test is used to compare two survival distributions because it is easy to apply and is usually more powerful than an analysis based simply on proportions. It compares survival across the whole span of follow-up time, not at just one or two points, and it accounts for censoring.
When analyzing time-to-event data and calculating power and sample size, a complication arises when individuals in the study die from risks that are not directly related to the risk factor of interest. For example, a researcher may wish to determine whether a new drug for some disease improves patient survival time compared to a standard treatment, and so is interested in how long each patient lives until dying from the disease. However, during the course of the study, patients may also die from other risks such as myocardial infarction, diabetes, or even an accident. When a patient dies from one of these competing risks, the main event of interest cannot be observed, so the true time-to-event of the disease for that patient can never be determined.
Power Overestimated
If the results are not adjusted, then the power calculated for the logrank test of the main event of interest may be grossly overestimated, depending on the incidence of competing risks.
Assumptions
The power and sample size calculations in the module for the logrank test are based on the following assumptions:
Details
The hazard rates for the event of interest and for competing risks in group \(i\) are calculated from the cumulative survival functions as \[ \begin{aligned} h_{ev,i}&=\frac{-\ln \left(S_{ev,i}(T_0)\right)}{T_0} \\ h_{cr,i}&=\frac{-\ln \left(S_{cr,i}(T_0)\right)}{T_0} \end{aligned} \] The hazard ratio used in power calculations is calculated from the hazard rates for the event of interest as \[ HR=\frac{h_{ev,2}}{h_{ev,1}}, \] the hazard rate for the treatment group divided by the hazard rate for the control group. The hazard rates may be calculated using cumulative survival proportions or cumulative incidences as described above.
From these we can calculate the probability of an event and the number of events.
Probability of Event
With the hazard rates for the event of interest and competing risks, the probability of observing the event of interest for a subject in group \(i\), \(Pr_{ev,i}\), is given as \[ Pr_{ev,i}=\frac{h_{ev,i}}{h_{ev,i}+h_{cr,i}}\left(1-\frac{\exp \left\{-(T-R) \times\left(h_{ev,i}+h_{cr,i}\right)\right\}-\exp \left\{-T \times\left(h_{ev,i}+h_{cr,i}\right)\right\}}{R \times\left(h_{ev,i}+h_{cr,i}\right)}\right), \] where \(T\) is the total time of the trial and \(R\) is the accrual time. The follow-up time is calculated from \(T\) and \(R\) as \[ \text{Follow-Up Time}=T-R. \] The overall probability of observing the event of interest during the study across both groups is \[ Pr_{ev}=p_1 Pr_{ev,1}+\left(1-p_1\right) Pr_{ev,2}, \] where \(p_1\) is the proportion of subjects in group 1, the control group.
Number of Events
When dealing with time-to-event data, it is the number of events observed, not the total number of subjects, that is important to achieve the specified power. The total required number of events (for the event of interest), \(E\), is calculated from the total sample size \(N\) and \(Pr_{ev}\) as \[ E=N \times Pr_{ev}. \] The number of events in group \(i\) is calculated as \[ E_i=n_i \times Pr_{ev,i}, \] where \(n_i\) is the sample size for the \(i^{\text{th}}\) group.
Power and Sample Size Calculations
Assuming an exponential model and independence of failure times for the event of interest and competing risks, Pintilie (2006) gives the following equation relating E (total number of events for the risk factor of interest) and power:
\[ z_{1-\beta}=\sqrt{E \times p_1\left(1-p_1\right)}\,\left|\log (HR)\right|-z_{1-\alpha / 2}, \] with \(p_1\) the proportion of subjects in group 1, the control group, and \(z_{1-\alpha/2}\) and \(z_{1-\beta}\) the usual standard normal quantiles.
This power formula indicates that it is the total number of events observed, not the number of subjects that is critical for achieving the desired power for the logrank test.
The power formula can be rearranged to solve for \(E\), the total number of events required: \[ E=\left(\frac{1}{p_1\left(1-p_1\right)}\right) \times\left(\frac{z_{1-\alpha / 2}+z_{1-\beta}}{\log (H R)}\right)^2 . \] The overall sample size can be computed from \(E\) and \(Pr_{ev}\) as \[ N=\frac{E}{Pr_{ev}}=\left(\frac{1}{p_1\left(1-p_1\right) \times Pr_{ev}}\right) \times\left(\frac{z_{1-\alpha / 2}+z_{1-\beta}}{\log (H R)}\right)^2 . \] The individual group sample sizes are calculated as \[ \begin{aligned} & n_1=N \times p_1, \\ & n_2=N \times\left(1-p_1\right), \end{aligned} \] where \(p_1\) is the proportion of subjects in group 1, the control group.
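These formulas are easy to script directly. The following R sketch (a direct transcription of the formulas above, not a package function) reproduces the worked example summarized below:
Racc <- 3; Ttot <- 5; T0 <- 3       # accrual time R, total time T, fixed time point T0
alpha <- 0.05; HR <- 0.5; p1 <- 0.5
h.ev1 <- -log(0.5) / T0             # hazard for the event of interest, control
h.ev2 <- HR * h.ev1                 # hazard for the event of interest, treatment
h.cr  <- -log(0.4) / T0             # competing-risk hazard (same in both groups)
# Probability of observing the event of interest for a subject in a group
pr.event <- function(h.ev, h.cr) {
  h <- h.ev + h.cr
  (h.ev / h) * (1 - (exp(-(Ttot - Racc) * h) - exp(-Ttot * h)) / (Racc * h))
}
pr1 <- pr.event(h.ev1, h.cr)        # 0.3574638
pr2 <- pr.event(h.ev2, h.cr)        # 0.2072824
pr  <- p1 * pr1 + (1 - p1) * pr2    # overall event probability
N <- 150
E <- N * pr                         # expected total number of events (about 42.4)
# Power at N = 150
pnorm(sqrt(E * p1 * (1 - p1)) * abs(log(HR)) - qnorm(1 - alpha / 2))  # 0.6162274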
Alternative Hypothesis: Two-Sided
Alpha: 0.05
R (Accrual Time): 3
T-R (Follow-Up Time): 2
T0 (Fixed Time Point): 3
Sev1(T0) (Control): 0.5
HR (Hazard Ratio = hev2 / hev1): 0.5
Scr1(T0) (Control): 0.4
Percent in Group 1: 50
Power: 0.6162274
Total Sample Size (N): 150
##
## Sample Size Calculation using Logrank Tests Accounting for Competing Risks
## Alpha 0.025
## Power 61.62274 %
##
## Accrual time: 3 years
## Total time of trial: 5 years
## Follow-Up Time: 2 years
##
## Survival probability for the event of interest in group 1: 0.5
## Survival probability for the event of interest in group 2: 0.7071068
## Hazard Ratio: 0.5
##
## Competing risks probability: 0.4
##
## Proportion of subjects in group 1: 0.5
## Proportion of subjects in group 2: 0.5
##
## The probability of observing the event of interest in a subject during the study for the group 1: 0.3574638
## The probability of observing the event of interest in a subject during the study for the group 2: 0.2072824
##
## The number of events required for the group 1: 27
## The number of events required for the group 2: 16
## The total number of events required for the study: 43
##
## The sample sizes for the group 1: 75
## The sample sizes for the group 2: 75
## The total sample size of both groups combined: 150
With Interim Analysis
## N1_Event IA N2_Event IA N1_Patient IA N2_Patient IA NTotal IA N_Patient FU Power
## 62.00 3.00 134.00 15.00 149.00 32.00 91.24
Method | Description |
---|---|
Log-Rank | “Average Hazard Ratio” – same as from univariate Cox Regression model |
Linear-Rank (Weighted) | Gehan-Breslow-Wilcoxon, Tarone-Ware, Farrington-Manning, Peto-Peto, Threshold Lag, Modestly Weighted Linear-Rank (MWLRT) |
Piecewise Linear-Rank | Piecewise Parametric, Weighted Piecewise Model (e.g. APPLE), Change Point Models |
Combination | Maximum Combination (MaxCombo) Test Procedure |
Survival Time | Milestone Survival (KM), Restricted Mean Survival Time, Landmark Analysis |
Relative Time | Ratio of Times to Reach Event Proportion, Accelerated Failure Time Models |
Others | Responder-Based, Frailty Models, Renyi Models, Net Benefit (Buyse) |
1. Concept: The MaxCombo test is designed to handle multiple linear-rank tests simultaneously and to select the "best" test from the candidate tests. This approach helps in controlling Type I error rates while still allowing flexibility in the choice of statistical tests.
2. Test Variants: Various forms of the Fleming-Harrington family of tests (denoted as F-H(G) tests) are used, each specified by a different parameterization G(p,q) that emphasizes different portions of the survival curve. For example, some may focus more on early failures, others on late failures.
F-H (G) Tests | Proposal |
---|---|
G(0,1; 1,0) | Lee (2007) |
G(0,0*; 0,1; 1,0) | Karrison (2016) |
G(0,0; 0,1; 1,0; 1,1) | Lin et al (2020) |
G(0,0; 0,0.5; 0.5,0; 0.5,0.5) | Roychoudhury et al (2021) |
G(0,0; 0,0.5) | Mukhopadhyay et al (2022) |
G(0,0; 0,0.5; 0.5,0) | Mukhopadhyay et al (2022) |
3. Common Usage: Typically, 2-4 candidate tests are considered, with the Fleming-Harrington family being popular due to its flexibility: it accommodates the log-rank and Peto-Peto tests, among others, allowing researchers to tailor the analysis to the specific characteristics of their survival data.
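As a small illustration (our own sketch, using the survival package's built-in lung data), two members of the G(p,q) family can be computed with survdiff(), whose rho argument gives weights S(t)^rho, i.e., G(rho,0). A full MaxCombo procedure additionally needs the joint distribution of the candidate statistics (available in dedicated packages), not just the largest individual statistic:
library(survival)
# G(0,0): the ordinary log-rank test
fit0 <- survdiff(Surv(time, status) ~ sex, data = lung, rho = 0)
# G(1,0): Peto-Peto, weighting early differences more heavily
fit1 <- survdiff(Surv(time, status) ~ sex, data = lung, rho = 1)
c(logrank.chisq = fit0$chisq, peto.chisq = fit1$chisq)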
Issues with MaxCombo Tests
1. Type I Error and Estimand: Critics point out that MaxCombo tests, while versatile, can yield significant results even when the treatment is never better than the control at any time. This can mislead conclusions about a therapy's efficacy, especially if it is only effective late in the follow-up period (late efficacy).
2. Interpretability: There are concerns about the interpretability of using an average hazard ratio as the estimand, because it might not accurately reflect the dynamics of the treatment effect over time, particularly under non-proportional hazards scenarios.
3. Alternatives for Improvement: Modifications to the Fleming-Harrington weights (the G(p,q) parameters) are suggested to better handle scenarios with non-proportional hazards. For example, shifting the emphasis from early to late survival times can be achieved by adjusting these parameters.
4. Communication of Results: It is recommended to use the MaxCombo for analytical purposes but to communicate the results using more interpretable measures such as the Restricted Mean Survival Time (RMST), which provides a direct, clinically meaningful measure of survival benefit.
Reference
Introduction
We will use \(\hat{\theta}\) as our test statistic, and reject \(H_0\) in favor of \(H_A\) if \(\hat{\theta}>k\) for some constant \(k\).
- The significance level of the test, or Type I error rate, is \(\alpha=P\left(\hat{\theta}>k \mid \theta=\theta_0\right)\). If \(Z=\frac{\hat{\theta}-\theta}{1 / \sqrt{d}}\), then \(\alpha=P\left(Z>\frac{k-\theta_0}{1 / \sqrt{d}}\right)\). Letting \(\Phi\left(z_\alpha\right)=1-\alpha\), we have \(z_\alpha=\frac{k-\theta_0}{1 / \sqrt{d}}\) and hence \(k=\theta_0+\frac{z_\alpha}{\sqrt{d}}\).
- The power of the test is given by \[ 1-\beta=P\left(\hat{\theta}>k \mid \theta=\theta_A\right)=P\left(Z>\frac{k-\theta_A}{1 / \sqrt{d}}\right). \]
- Solving for \(d\) we have \[ \begin{gathered} z_{1-\beta}=-z_\beta=\sqrt{d}\left(k-\theta_A\right)=\sqrt{d}\left(\theta_0+\frac{z_\alpha}{\sqrt{d}}-\theta_A\right) \\ \Rightarrow d=\frac{\left(z_\beta+z_\alpha\right)^2}{\left(\theta_A-\theta_0\right)^2}=\frac{\left(z_\beta+z_\alpha\right)^2}{(\log \Delta)^2} . \end{gathered} \]
Probability of Event
Calculate patient/subject needed based on Probability of Event
We need to provide an estimate of the proportion \(\pi\) of patients who will die by the time of analysis.
- If all patients entered at the same time, we would simply have \(\pi=1-S_\lambda(t)\), where \(t\) is the follow-up time.
- However, patients actually enter over an accrual period of length \(a\) and then, after accrual to the trial has ended, are followed for an additional time \(f\).
- So a patient who enters at time \(t=0\) will have failure probability \(\pi(0)=1-S_\lambda(a+f)\), as this patient has the maximum possible follow-up time \(a+f\).
- Similarly, any patient who enters at a time \(t \in[0, a]\) has failure probability \(\pi(t)=1-S_\lambda(a+f-t)\).
- Assuming that patients enter uniformly between times 0 and \(a\), the probability of death can be computed as \[ \pi=\int_0^a \frac{1}{a}\left[1-S_\lambda(a+f-t)\right] d t . \]
- Assuming \(S_\lambda(t)=e^{-\lambda t}\), we have \[ \pi=1-\frac{1}{a \lambda}\left[e^{-\lambda f}-e^{-\lambda(a+f)}\right] . \]
Suppose that we are designing a Phase II oncology trial where we plan a 5% level (one-sided) test, and we need 80% power to detect a hazard ratio of 1.5. We can find the required number of deaths as follows:
# Log-mean based approach
# Expected number of events
ssc.onesample.logMean(HR = 1.5, sig.level = 0.05, power = 0.8)
##
## Hazard ratio: 1.5
## Alpha (one-sided): 0.05
## Power: 80 %
##
## Log-mean based approach
## Expected number of events: 38
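The helper ssc.onesample.logMean used above is not defined in this document; a minimal sketch consistent with the output shown is:
ssc.onesample.logMean <- function(HR, sig.level = 0.05, power = 0.8) {
  # d = (z_alpha + z_beta)^2 / (log HR)^2, with one-sided alpha
  d <- ceiling((qnorm(1 - sig.level) + qnorm(power))^2 / log(HR)^2)
  cat("\nHazard ratio:", HR, "\nAlpha (one-sided):", sig.level,
      "\nPower:", 100 * power, "%\n\nLog-mean based approach",
      "\nExpected number of events:", d, "\n")
  invisible(d)
}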
We want to design a Phase II oncology trial where we plan a 5% level (one-sided) test, and we need 80% power to detect a hazard ratio of 1.5.
Suppose that \(\lambda_0=0.15\), then we have \(\lambda_A=\lambda_0 / \Delta=0.1\). Assume accrual period \(a=2\) years and follow-up time \(f=3\) years. The probability of death under \(H_A: \lambda=0.1\) is computed as:
ssc.onesample.logMean2(HR = 1.5, sig.level = 0.05, power = 0.8, lambda=0.10, accrual=2, followup=3)
## Expected number of events: 38
## Probability of event: 0.329
## Expected number of patients: 116
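Again, ssc.onesample.logMean2 is not shown in the text; a sketch that matches the output above combines the events formula with the probability-of-event expression derived earlier:
ssc.onesample.logMean2 <- function(HR, sig.level = 0.05, power = 0.8,
                                   lambda, accrual, followup) {
  d <- ceiling((qnorm(1 - sig.level) + qnorm(power))^2 / log(HR)^2)
  # probability of event under uniform accrual (see the derivation above)
  pi <- 1 - (exp(-lambda * followup) - exp(-lambda * (accrual + followup))) /
    (accrual * lambda)
  cat("Expected number of events:", d, "\n")
  cat("Probability of event:", round(pi, 3), "\n")
  cat("Expected number of patients:", ceiling(d / pi), "\n")
}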
Reference
Introduction
For fixed \(d\), \(V=\sum t_i \sim \operatorname{Gamma}(d, \lambda)\), and it is known that \[ W=\frac{2 d \lambda}{\hat{\lambda}} \sim \chi_{2 d}^2 , \] although this result is approximate for general censoring patterns. Under \(H_0: \lambda=\lambda_0\), we need to find a constant \(k\) such that \(\alpha=P\left(1 / \hat{\lambda}>k \mid \lambda=\lambda_0\right)=P\left(W>2 d k \lambda_0\right)\). Thus we have \(\chi_{2 d, \alpha}^2=2 d k \lambda_0\) and hence \(k=\frac{\chi_{2 d, \alpha}^2}{2 d \lambda_0}\). The power of the test is given by \[ 1-\beta=P\left(1 / \hat{\lambda}>k \mid \lambda=\lambda_A\right)=P\left(W>2 d k \lambda_A\right) . \] We have \(\chi_{2 d, 1-\beta}^2=2 d k \lambda_A\), so \(\chi_{2 d, 1-\beta}^2=\frac{\chi_{2 d, \alpha}^2 \lambda_A}{\lambda_0}\), and hence \(\Delta=\frac{\lambda_0}{\lambda_A}=\frac{\chi_{2 d, \alpha}^2}{\chi_{2 d, 1-\beta}^2}\). For specified \(\alpha\), power \(1-\beta\), and ratio \(\Delta\), we may solve this for the required number of deaths, \(d\).
\(\Delta\) can be computed using the following function:
expLikeRatio = function(d, alpha, pwr){
  # ratio of upper-tail chi-square quantiles with 2d degrees of freedom
  num   = qchisq(alpha, df = 2*d, lower.tail = FALSE)
  denom = qchisq(pwr,   df = 2*d, lower.tail = FALSE)
  num / denom
}
To get the number of deaths \(d\) for a specified \(\Delta\), we define a new function \(L R(d)=\frac{\chi_{2 d, \alpha}^2}{\chi_{2 d, 1-\beta}^2}-\Delta\). The solution for \(L R(d)=0\) is the required number of deaths and is computed as:
expLRdeaths = function(Delta, alpha, pwr){
  LR = function(d, alpha, pwr, Delta){
    expLikeRatio(d, alpha, pwr) - Delta
  }
  # Find the root of LR(d) = 0: the required number of deaths
  result = uniroot(f = LR, lower = 1, upper = 1000,
                   alpha = alpha, pwr = pwr, Delta = Delta)
  result$root
}
Suppose that we are designing a Phase II oncology trial where we plan a 5% level (one-sided) test, and we need 80% power to detect a hazard ratio of 1.5. We can find the required number of deaths as follows:
ssc.onesample.LR(HR = 1.5, sig.level = 0.05, power = 0.8)
## Expected number of events: 37
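ssc.onesample.LR is likewise not shown in the text; a minimal sketch consistent with the output above simply wraps expLRdeaths and rounds up:
ssc.onesample.LR <- function(HR, sig.level = 0.05, power = 0.8) {
  d <- ceiling(expLRdeaths(Delta = HR, alpha = sig.level, pwr = power))
  cat("Expected number of events:", d, "\n")
  invisible(d)
}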
Reference
Cox, D. R., & Oakes, D. (1984). Analysis of survival data. CRC Press.
Cox, D. R. (1972). Regression models and life‐tables. Journal of the Royal Statistical Society: Series B (Methodological), 34(2), 187-202.
Collett, D. (2015). Modelling survival data in medical research. CRC Press.
Fleming, T. R., & Harrington, D. P. (2013). Counting processes and survival analysis. John Wiley & Sons.
Klein, J. P., Van Houwelingen, H. C., Ibrahim, J. G., & Scheike, T. H. (Eds.). (2016). Handbook of survival analysis. CRC Press.
Andersen, P. K., Borgan, O., Gill, R. D., & Keiding, N. (2012). Statistical models based on counting processes. Springer Science & Business Media.
Lin, H., & Zelterman, D. (2002). Modeling survival data: Extending the Cox model. Springer.
Klein, J. P., & Moeschberger, M. L. (2003). Survival analysis: Techniques for censored and truncated data. Springer.
Lemeshow, S., May, S., & Hosmer Jr, D. W. (2011). Applied survival analysis: Regression modeling of time-to-event data. John Wiley & Sons.
Aalen, O., Borgan, O., & Gjessing, H. (2008). Survival and event history analysis: A process point of view. Springer Science & Business Media.
U.S. Food and Drug Administration. (2018). Clinical Trial Endpoints for the Approval of Cancer Drugs and Biologics. Retrieved from http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm071590.pdf
Kilickap, S., Demirci, U., Karadurmus, N., Dogan, M., Akinci, B., & Sendur, M. A. N. (2018). Endpoints in oncology clinical trials. J BUON, 23, 1-6.
Freedman, L. S. (1982). Tables of the number of patients required in clinical trials using the logrank test. Statistics in Medicine, 1(2), 121-129.
Schoenfeld, D. A. (1983). Sample-size formula for the proportional-hazards regression model. Biometrics, 499-503.
Lachin, J. M., & Foulkes, M. A. (1986). Evaluation of sample size and power for analyses of survival with allowance for nonuniform patient entry, losses to follow-up, noncompliance, and stratification. Biometrics, 507-519.
Lakatos, E. (1988). Sample sizes based on the log-rank statistic in complex clinical trials. Biometrics, 229-241.
Lakatos, E., & Lan, K. K. G. (1992). A comparison of sample size methods for the logrank statistic. Statistics in Medicine, 11(2), 179-191.
Bariani, G. M., de Celis Ferrari, A. C. R., Precivale, M., Arai, R., Saad, E. D., & Riechelmann, R. P. (2015). Sample size calculation in oncology trials. American Journal of Clinical Oncology, 38(6), 570-574.
Tang, Y. (2021). A unified approach to power and sample size determination for log-rank tests under proportional and nonproportional hazards. Statistical Methods in Medical Research, 30(5), 1211-1234.
Tang, Y. (2022). Complex survival trial design by the product integration method. Statistics in Medicine, 41(4), 798-814.
Hsieh, F. Y., & Lavori, P. W. (2000). Sample-size calculations for the Cox proportional hazards regression model with nonbinary covariates. Controlled Clinical Trials, 21(6), 552-560.
Wu, J. (2014). Sample size calculation for the one-sample log-rank test. Pharmaceutical Statistics, 14(1), 26-33.
Phadnis, M. A. (2019). Sample size calculation for small sample single-arm trials for time-to-event data: Logrank test with normal approximation or test statistic based on exact chi-square distribution? Contemporary Clinical Trials Communications, 15, 100360.
Jung, S. H. (2008). Sample size calculation for the weighted rank statistics with paired survival data. Statistics in Medicine, 27(17), 3350-3365.
Lachin, J. M. (2013). Sample size and power for a logrank test and Cox proportional hazards model with multiple groups and strata, or a quantitative covariate with multiple strata. Statistics in Medicine, 32(25), 4413-4425.
Litwin, S., Wong, Y.-N., & Hudes, G. (2007). Early stopping designs based on progression-free survival at an early time point in the initial cohort. Statistics in Medicine, 26(14), 4400-4415.
Liu, Y., & Lim, P. (2017). Sample size increase during a survival trial when interim results are promising. Communications in Statistics-Theory and Methods, 46(14), 6846-6863.
Freidlin, B., & Korn, E. L. (2017). Sample size adjustment designs with time-to-event outcomes: a caution. Clinical Trials, 14(6), 597-604.
Ren, S., & Oakley, J. E. (2014). Assurance calculations for planning clinical trials with time-to-event outcomes. Statistics in Medicine, 33(1), 31-45.
Yao, J. C., et al. (2011). Everolimus for advanced pancreatic neuroendocrine tumors. New England Journal of Medicine, 364(6), 514-523.
Fine, G. D. (2007). Consequences of delayed treatment effects on analysis of time-to-event endpoints. Drug Information Journal, 41(4), 535-539.
Alexander, B. M., Schoenfeld, J. D., & Trippa, L. (2018). Hazards of hazard ratios-deviations from model assumptions in immunotherapy. The New England Journal of Medicine, 378(12), 1158-1159.
Public Workshop: Oncology Clinical Trials in the Presence of Non‐Proportional Hazards, The Duke‐Margolis Center for Health Policy, February 2018. Available at: https://slideplayer.com/slide/14007912/.
Royston, P., & Parmar, M. K. (2020). A simulation study comparing the power of nine tests of the treatment effect in randomized controlled trials with a time-to-event outcome. Trials, 21(1), 1-17.
Logan, B. R., Klein, J. P., & Zhang, M. J. (2008). Comparing treatments in the presence of crossing survival curves: An application to bone marrow transplantation. Biometrics, 64(3), 733-740.
Fleming, T. R., & Harrington, D. P. (1981). A class of hypothesis tests for one and two-sample censored survival data. Communications in Statistics-Theory and Methods, 10(8), 763-794.
Pepe, M. S., & Fleming, T. R. (1989). Weighted Kaplan-Meier statistics: A class of distance tests for censored survival data. Biometrics, 497-507.
Breslow, N. E., Edler, L., & Berger, J. (1984). A two-sample censored-data rank test for acceleration. Biometrics, 1049-1062.
Lan, K. K. G., & Wittes, J. (1990). Linear rank tests for survival data: Equivalence of two formulations. The American Statistician, 44(1), 23-26.
Yang, S., & Prentice, R. (2010). Improved log-rank-type tests for survival data using adaptive weights. Biometrics, 66, 30-38.
Lee, S. H. (2007). On the versatility of the combination of the weighted log-rank statistics. Computational Statistics & Data Analysis, 51(12), 6557-6564.
Hasegawa, T. (2014). Sample size determination for the weighted log-rank test with the Fleming–Harrington class of weights in cancer vaccine studies. Pharmaceutical Statistics, 13(2), 128-135.
Karrison, T. (2016). Versatile tests for comparing survival curves based on weighted log-rank statistics. Stata Journal, 16(3), 678-690.
Lin, R. S., Lin, J., Roychoudhury, S., Anderson, K. M., Hu, T., Huang, B., Leon, L. F., Liao, J. J., Liu, R., Luo, X., & Mukhopadhyay, P. (2020). Alternative analysis methods for time-to-event endpoints under nonproportional hazards: A comparative analysis. Statistics in Biopharmaceutical Research, 12(2), 187-198.
Roychoudhury, S., Anderson, K. M., Ye, J., & Mukhopadhyay, P. (2021). Robust design and analysis of clinical trials with nonproportional hazards: A straw man guidance from a cross-pharma working group. Statistics in Biopharmaceutical Research, 1-15.
Mukhopadhyay, P., Ye, J., Anderson, K. M., Roychoudhury, S., Rubin, E. H., Halabi, S., & Chappell, R. J. (2022). Log-rank test vs MaxCombo and difference in restricted mean survival time tests for comparing survival under nonproportional hazards in immuno-oncology trials: A systematic review and meta-analysis. JAMA Oncology.
Freidlin, B., & Korn, E. L. (2019). Methods for accommodating nonproportional hazards in clinical trials: Ready for the primary analysis? Journal of Clinical Oncology, 37(35), 3455.
Bartlett, J. W., Morris, T. P., Stensrud, M. J., Daniel, R. M., Vansteelandt, S. K., & Burman, C. F. (2020). The hazards of period-specific and weighted hazard ratios. Statistics in Biopharmaceutical Research, 12(4), 518.
Magirr, D., & Burman, C. F. (2019). Modestly weighted log-rank tests. Statistics in Medicine, 38(20), 3782-3790.
Magirr, D. (2021). Non‐proportional hazards in immuno‐oncology: Is an old perspective needed? Pharmaceutical Statistics, 20(3), 512-527.
Magirr, D., & Burman, C. F. (2023). The MaxCombo Test Severely Violates the Type I Error Rate. JAMA Oncology, 9(4), 571–572.
Mukhopadhyay, P., Roychoudhury, S., & Anderson, K. M. (2023). The MaxCombo Test Severely Violates the Type I Error Rate—Reply. JAMA Oncology, 9(4), 572-572.
Royston, P., & Parmar, M. K. (2013). Restricted mean survival time: An alternative to the hazard ratio for the design and analysis of randomized trials with a time-to-event outcome. BMC Medical Research Methodology, 13(1), 1-15.
Uno, H., Claggett, B., Tian, L., Inoue, E., Gallo, P., Miyata, T., Schrag, D., Takeuchi, M., Uyama, Y., Zhao, L., & Skali, H. (2014). Moving beyond the hazard ratio in quantifying the between-group difference in survival analysis. Journal of Clinical Oncology, 32(22), 2380.
Uno, H., Wittes, J., Fu, H., Solomon, S., Claggett, B., Tian, L., Cai, T., Pfeffer, M., Evans, S., & Wei, L. (2015). Alternatives to hazard ratios for comparing the efficacy or safety of therapies in noninferiority studies. Annals of Internal Medicine, 163(2), 127.
Kim, D. H., Uno, H., & Wei, L. J. (2017). Restricted mean survival time as a measure to interpret clinical trial results. JAMA Cardiology, 2(11), 1179-1180.
Uno, H., & Tian, L. (2020). Is the log-rank and hazard ratio test/estimation the best approach for primary analysis for all trials? Journal of Clinical Oncology, 38(17), 2000-2001.
Eaton, A., Therneau, T., & Le-Rademacher, J. (2020). Designing clinical trials with (restricted) mean survival time endpoint: Practical considerations. Clinical Trials, 17(3), 285-294.
Xu, Z., Zhen, B., Park, Y., & Zhu, B. (2017). Designing therapeutic cancer vaccine trials with delayed treatment effect. Statistics in Medicine, 36(4), 592-605.
Xu, Z., Park, Y., Zhen, B., & Zhu, B. (2018). Designing cancer immunotherapy trials with random treatment time‐lag effect. Statistics in Medicine, 37(30), 4589-4609.
Xu, Z., Park, Y., Liu, K., & Zhu, B. (2020). Treating non-responders: Pitfalls and implications for cancer immunotherapy trial design. Journal of Hematology & Oncology, 13(1), 1-11.
Phadnis, M. A., & Mayo, M. S. (2021). Sample size calculation for two-arm trials with time-to-event endpoint for non-proportional hazards using the concept of Relative Time when inference is built on comparing Weibull distributions. Biometrical Journal, 63(7), 1406-1433.
Anderson, K. M. (1991). A nonproportional hazards Weibull accelerated failure time regression model. Biometrics, 281-288.
Balan, T. A., & Putter, H. (2020). A tutorial on frailty models. Statistical Methods in Medical Research, 29(11), 3424-3454.
Chen, L. M., Ibrahim, J. G., & Chu, H. (2014). Sample size determination in shared frailty models for multivariate time-to-event data. Journal of Biopharmaceutical Statistics, 24(4), 908-923.
Rényi, A. (1953). On the theory of order statistics. Acta Math. Acad. Sci. Hung., 4(2), 48-89.