Design and Monitoring of Adaptive Clinical Trials
An adaptive design is a clinical trial design that allows adaptations or modifications to aspects of the trial after its initiation without undermining the validity and integrity of the trial (Chow, Chang, and Pong, 2005). The PhRMA Working Group defines an adaptive design as a clinical study design that uses accumulating data to decide how to modify aspects of the study as it continues, without undermining the validity and integrity of the trial (Gallo et al., 2006; Dragalin and Fedorov, 2006).
Adaptive designs may include, but are not limited to, (1) a group sequential design, (2) a sample-size reestimation design, (3) a drop-arm design, (4) an add-arm design, (5) a biomarker-adaptive design, (6) an adaptive randomization design, and (7) an adaptive dose-escalation design. An adaptive design typically consists of multiple stages. At each stage, data analyses are conducted, and adaptations are made based on updated information to maximize the probability of success.
A group sequential design (GSD) is an adaptive design that allows a trial to stop earlier based on the results of interim analyses. For a trial with a positive result, early stopping ensures that a new drug product can be available to the patients sooner. If a negative result is indicated, early stopping avoids wasting resources and reduces the unnecessary exposure to the ineffective drug. Sequential methods typically lead to savings in sample size, time, and cost when compared with the classical design with a fixed sample size. GSD is probably one of the most commonly used adaptive designs in clinical trials. There are three different types of GSDs:
- Early efficacy stopping design,
- Early futility stopping design,
- Early efficacy or futility stopping design.
If we believe (based on prior knowledge) that the test treatment is very promising, then an early efficacy stopping design should be used. If we are very concerned that the test treatment may not work, an early futility stopping design should be employed. If we are not certain about the magnitude of the effect size, a GSD permitting early stopping for both efficacy and futility should be considered.
A sample-size reestimation (SSR) design refers to an adaptive design that allows for sample size adjustment or reestimation based on the review of interim analysis results (Figure 1.1). The sample size requirement for a trial is sensitive to the treatment effect and its variability. An inaccurate estimation of the effect size and its variability could lead to an underpowered or overpowered design, neither of which is desirable. If a trial is underpowered, it will be unlikely to detect a clinically meaningful difference, and consequently could prevent a potentially effective drug from being delivered to patients. On the other hand, if a trial is overpowered, it could lead to unnecessary exposure of many patients to a potentially harmful compound when the drug, in fact, is not practically effective. In practice, it is often difficult to estimate the effect size and variability when designing a clinical trial protocol. Thus, it is desirable to have the flexibility to reestimate the sample size in the middle of the trial.
There are three types of sample-size reestimation procedures: blinded, unblinded, and mixed. In a blinded SSR, the sample size adjustment is based on the (observed) pooled variance at the interim analysis to recalculate the required sample size; it does not require unblinding the data. In an unblinded SSR, the effect size and its variability are reassessed, and the sample size is adjusted based on the updated (unblinded) information. The mixed approach also requires unblinded data, but does not fully use the unblinded information, thus providing an information masker to the public (see Chapter 5). The statistical method for adjustment could be based on effect size or the conditional power.
A GSD can also be viewed as an SSR design in which the sample size increase is predetermined. For example, if the trial passes the interim analysis, the sample size will increase to the second-stage sample size regardless of effect size. In contrast, when an SSR trial passes the interim analysis, the sample size is usually determined by the observed effect size at interim and the so-called conditional power.
A drop-arm or drop-loser design (DLD) is an adaptive design consisting of multiple stages. At each stage, interim analyses are performed, and the losers (i.e., inferior treatment groups) are dropped based on prespecified criteria (Figure 1.2). Ultimately, the best arm(s) are retained. If there is a control group, it is usually retained for the purpose of comparison. This type of design can be used in phase II/III combined trials. A phase II clinical trial is often a dose-response study, where the goal is to assess whether there is a treatment effect. If there is a treatment effect, the goal becomes finding the appropriate dose level (or treatment groups) for the phase III trials. This type of traditional design is not efficient with respect to time and resources because the phase II efficacy data are not pooled with data from phase III trials, which are the pivotal trials for confirming efficacy. Therefore, it is desirable to combine phases II and III so that the data can be used efficiently, and the time required for drug development can be reduced.
In a classical drop-loser design, patients are randomized into all arms (doses) and, at the interim analysis, inferior arms are dropped. Therefore, compared to the traditional dose-finding design, this adaptive design can reduce the sample size by not carrying over all doses to the end of the trial or dropping the losers earlier. However, all the doses have to be explored. For unimodal (including linear or umbrella) response curves, we proposed an effective dose-finding design that allows adding arms at the interim analysis. The trial design starts with two arms; depending on the response of the two arms and the unimodality assumption, we can decide which new arms to add. This design does not require exploring all arms (doses) to find the best responsive dose; therefore, it can further reduce the sample size from the drop-loser design by as much as 10%–20% (Chang and Wang, 2014).
An adaptive randomization design (ARD) allows modification of randomization schedules during the conduct of the trial. In clinical trials, randomization is commonly used to ensure a balance with respect to patient characteristics among treatment groups. A particular type of ARD is response-adaptive randomization (RAR), in which the allocation probability is based on the responses of previous patients. RAR was initially proposed for ethical considerations (i.e., to give patients a larger probability of being allocated to a superior treatment group); response-adaptive randomization can also be considered a drop-loser design with a seamless shifting of allocation probability from an inferior arm to a superior arm. Well-known response-adaptive models include the randomized play-the-winner (RPW) model.
Biomarker-adaptive design (BAD) or biomarker-enrichment design (BED) refers to a design that allows for adaptations using information obtained from biomarkers. A biomarker is a characteristic that is objectively measured and evaluated as an indicator of normal biologic or pathogenic processes or pharmacological response to a therapeutic intervention (Chakravarty, 2005). A biomarker can be a classifier, prognostic, or predictive marker. A classifier biomarker is a marker that usually does not change over the course of the study, like DNA markers. Classifier biomarkers can be used to select the most appropriate target population, or even for personalized treatment. Classifier markers can also be used in other situations. For example, it is often the case that a pharmaceutical company has to make a decision whether to target a very selective population for whom the test drug likely works well or to target a broader population for whom the test drug is less likely to work well. However, the size of the selective population may be too small to justify the overall benefit to the patient population. In this case, a BAD may be used, where the biomarker response at interim analysis can be used to determine which target populations should be focused on (Figure 1.3).
Dose-escalation is often considered in early phases of clinical development for identifying the maximum tolerated dose (MTD), which is often considered the optimal dose for later phases of clinical development. An adaptive dose-escalation design is a design in which the dose level used to treat the next-entered patient depends on the toxicity observed in previous patients, based on some escalation rules. Many early dose-escalation rules, such as the 3+3 rule, are adaptive, but the adaptation algorithm is somewhat ad hoc. More recently, advanced dose-escalation rules have been developed using modeling approaches (frequentist or Bayesian framework), such as the continual reassessment method (CRM) (O'Quigley et al., 1990; Chang and Chow, 2005) and other accelerated escalation algorithms. These algorithms can reduce the sample size and overall toxicity in a trial and improve the accuracy and precision of the estimation of the MTD.
(Figure: overview of the three major phases of drug development and the FDA approval process; the green and red circles at each decision point indicate go/no-go outcomes.)
Two-Group Superiority Trials
\[ \alpha=\operatorname{Pr}\left\{\text{reject } H_0 \text{ when } H_0 \text{ is true}\right\} \]
\[ \beta=\operatorname{Pr}\left\{\text{fail to reject } H_0 \text{ when } H_a \text{ is true}\right\} \]
Two-Group Noninferiority Trial
There are three major sources of uncertainty about the conclusions from a noninferiority (NI) study: (1) the uncertainty of the active-control effect over a placebo, which is estimated from historical data; (2) the possibility that the control effect may change over time, violating the “constancy assumption”; and (3) the risk of making a wrong decision from the test of the noninferiority hypothesis in the NI study, i.e., the type-I error. These three uncertainties have to be considered in developing a noninferiority design method.
Two-Group Equivalence Trial
Pharmacokinetics (PK) is the study of the body's absorption, distribution, metabolism, and elimination of a drug. An important outcome of a PK study is the bioavailability of the drug. The bioavailability of a drug is defined as the rate and extent to which the active drug ingredient or therapeutic moiety is absorbed and becomes available at the site of drug action. As bioavailability cannot be easily measured directly, the concentration of drug that reaches the circulating bloodstream is taken as a surrogate. Therefore, bioavailability can be viewed as the concentration of drug in the blood. Two drugs are bioequivalent if they have the same bioavailability. There are a number of instances in which trials are conducted to show that two drugs are bioequivalent (Jones and Kenward, 2003): (1) when different formulations of the same drug are to be marketed, for instance, in solid-tablet or liquid-capsule form; (2) when a generic version of an innovator drug is to be marketed; (3) when production of a drug is scaled up, and the new production process needs to be shown to produce drugs of equivalent strength and effectiveness to those of the original process.
At the present time, average bioequivalence (ABE) serves as the international standard for bioequivalence (BE) testing, using a 2 × 2 crossover design. The PK parameters used for assessing ABE are the area under the curve (AUC) and the peak concentration (Cmax). The recommended statistical method is the two one-sided tests procedure, which determines whether the average values of the PK measures after administration of the T (test) and R (reference) products are comparable. It is equivalent to the so-called confidence interval method, which involves the calculation of a 90% confidence interval for the ratio of the averages (population geometric means) of the measures for the T and R products. To establish BE, the calculated confidence interval should fall within a BE limit, usually 80%–125% for the ratio of the product averages. The 1992 guidance also provides specific recommendations for logarithmic transformation of PK data, methods to evaluate sequence effects, and methods to evaluate outlier data.
In practice, people also use parallel designs and the 90% confidence interval for nontransformed data. To establish BE, the calculated confidence interval should fall within a BE limit, usually 80% − 120% for the difference of the product averages.
Adaptive design in clinical trials is evolving rapidly, with increasing complexity and regulatory support shaping its future directions. The integration of new statistical methods and broader applications signifies a dynamic shift towards more efficient and effective drug development processes.
In Phase I of adaptive clinical trials, there is a significant shift towards designs that simultaneously evaluate toxicity and efficacy. This approach, guided by frameworks such as the FDA’s “Project Optimus,” emphasizes optimizing the dosage of drugs for oncologic diseases. The initiative, formally titled “Optimizing the Dosage of Human Prescription Drugs and Biological Products for the Treatment of Oncologic Diseases” and launched on August 24, encourages developing trials that define both the maximum tolerated dose (MTD) and the optimal biological dose (OBD) early in the clinical development process.
Phase II trials are increasingly incorporating adaptive elements, particularly through designs that allow for interim futility analysis, a practice already common but growing in sophistication.
In confirmatory Phase III trials, interim analyses have become a standard practice, but there is now a growing interest in integrating these trials with more complex elements. This includes the use of advanced statistical methods and adaptive features that allow trials to be more responsive to emerging data without compromising on scientific rigor.
Regulatory Landscape: Draft ICH E20 on Adaptive Clinical Trials
Looking ahead to Q1 2025, the draft ICH E20 guideline on adaptive clinical trials represents a major milestone.
Significantly, there is continued interest in platform trials, particularly those building on successes in oncology (like umbrella and basket master-protocol trials) and COVID-19 therapeutics. These trial designs are expected to expand into non-oncology areas, leveraging their flexibility and efficiency to accelerate drug development across various therapeutic areas.
There are three common methods for two-stage adaptive designs. The methods are differentiated by the test statistics defined at each stage, where the test statistic measures the evidence against the null hypothesis in terms of cumulative information (data). The three methods are constructed from three different ways of combining the so-called stagewise p-values: (1) the sum of p-values (MSP), (2) the product of p-values (MPP), and (3) the inverse-normal p-values (MINP). MINP is a generalization of the test statistic used in the classical group sequential design. The operating characteristics of MPP and MINP are very similar. The methods are very general, meaning that they can be applied to broad adaptations and any number of stages in an adaptive trial. However, we will focus on two-stage designs and provide the closed forms for determining the stopping boundaries.
Suppose, in a two-group parallel design trial, that the objective is to test the treatment difference (one-sided):
\[ H_0: \mu_1=\mu_2 \text { versus } H_a: \mu_1>\mu_2 \]
The test statistic at the \(k^{th}\) stage is denoted by \(T_k\), which is constructed based on the cumulative data at the time of the analysis. A convenient way to combine the data from different stages is to combine \(p_i\) \((i=1,2)\), the \(p\)-value from the subsample obtained at the \(i^{th}\) stage. For a normal endpoint, the stagewise \(p\)-value is \(p_i=1-\Phi\left(z_i\right)\), where \(z_i\) is the z-score calculated from the subsample obtained at the \(i^{th}\) stage.
For group sequential design, the stopping rules at the first stage are
\[ \left\{\begin{array}{l} \text { Reject } H_0 \text { (stop for efficacy) if } T_1 \leq \alpha_1 \\ \text { Accept } H_0 \text { (stop for futility) if } T_1>\beta_1 \\ \text { Continue with adaptations if } \alpha_1<T_1 \leq \beta_1 \end{array}\right. \]
where \(0 \leq \alpha_1<\beta_1 \leq 1\). The stopping rules at the second stage are
\[ \left\{\begin{array}{l} \text { Stop for efficacy if } T_2 \leq \alpha_2 \\ \text { Stop for futility if } T_2>\alpha_2 \end{array}\right. \]
For convenience, \(\alpha_k\) and \(\beta_k\) are called the efficacy and futility boundaries, respectively.
To reach the second stage, a trial has to pass the first stage. Therefore, the probability of rejecting the null hypothesis \(H_0\) at the second stage or, simply, the rejection probability at the second stage is given by \[ \operatorname{Pr}\left(\alpha_1<T_1<\beta_1, T_2<\alpha_2\right)=\int_{\alpha_1}^{\beta_1} \int_{-\infty}^{\alpha_2} f_{T_1 T_2}\left(t_1, t_2\right) d t_2 d t_1 \] where \(f_{T_1 T_2}\) is the joint probability density function (p.d.f.) of \(T_1\) and \(T_2\).
When the futility boundary \(\beta_1=1\), futility stopping is impossible. In this case, the efficacy boundary at the second stage \(\alpha_2\) is determined by the equation \(\alpha_1+\operatorname{Pr}\left(\alpha_1<T_1<1, T_2<\alpha_2\right)=\alpha\). Denote by \(\alpha_2^*\) the efficacy boundary when there is a (binding) futility boundary \(\beta_1\); it satisfies \(\operatorname{Pr}\left(\alpha_1<T_1<\beta_1, T_2<\alpha_2^*\right)=\operatorname{Pr}\left(\alpha_1<T_1<1, T_2<\alpha_2\right)\). It follows that \(\alpha_2 \leq \alpha_2^*\). However, in current practice, the futility boundary \(\beta_1\) may not be followed by the pharmaceutical industry (nonbinding futility boundary); therefore, the FDA will not accept \(\alpha_2^*\) for phase-III studies. Instead, the boundary \(\alpha_2\), computed without counting on futility stopping, is suggested. This is the so-called FDA's nonbinding futility rule.
Chang (2006a) proposed an adaptive design method, in which the test statistic is defined as the sum of the stagewise \(p\)-values. This method is referred to as MSP and the test statistic at the \(k^{t h}\) stage is defined as \[ T_k=\Sigma_{i=1}^k p_i, k=1, \ldots, K \]
Let \(\pi_1\) and \(\pi_2\) be the type-I errors spent (i.e., the probability of false rejection allowed) at the first and second stages, respectively. Under the nonbinding futility rule ( \(\beta_1=\alpha_2\) ), the stopping boundaries \(\alpha_1\) and \(\alpha_2\) must satisfy the following equations:
\[ \left\{\begin{array}{l} \pi_1=\alpha_1 \\ \pi_2=\frac{1}{2}\left(\alpha_2-\alpha_1\right)^2 \end{array}\right. \]
Since \(\pi_1+\pi_2=\alpha\), the stopping boundaries can be written as \[ \left\{\begin{array}{l} \alpha_1=\pi_1 \\ \alpha_2=\sqrt{2\left(\alpha-\pi_1\right)}+\pi_1 \end{array}\right. \]
As soon as we decide the significance level \(\alpha\) and \(\pi_1\), we can determine the stopping boundaries \(\alpha_1\) and \(\alpha_2\).
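For illustration, here is a minimal R sketch (not the book's program) that transcribes the MSP boundary formulas above under the nonbinding futility rule:

```r
# MSP two-stage stopping boundaries under the nonbinding futility rule
# (beta1 = alpha2): alpha1 = pi1, alpha2 = sqrt(2*(alpha - pi1)) + pi1.
msp_boundaries <- function(alpha = 0.025, pi1 = 0.01) {
  c(alpha1 = pi1, alpha2 = sqrt(2 * (alpha - pi1)) + pi1)
}
msp_boundaries(0.025, 0.01)  # alpha1 = 0.010, alpha2 ~ 0.183
```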
The second method is referred to as MPP. The test statistic in this method is based on the product of the stagewise \(p\)-values from the subsamples. For two-stage designs, the test statistic is defined as (Bauer and Kohne, 1994)
\[ T_k=\Pi_{i=1}^k p_i, k=1,2 \]
Under the nonbinding futility rule \(\left(\beta_1=1\right)\), the \(\alpha\) spent at the two stages are given by
\[ \left\{\begin{array}{l} \pi_1=\alpha_1 \\ \pi_2=-\alpha_2 \ln \alpha_1 \end{array}\right. \]
Since \(\pi_1+\pi_2=\alpha\), the stopping boundaries can be written as \[ \left\{\begin{array}{l} \alpha_1=\pi_1 \\ \alpha_2=\frac{\pi_1-\alpha}{\ln \pi_1} \end{array}\right. \]
As soon as we decide \(\alpha\) and \(\pi_1\), we can determine the stopping boundaries \(\alpha_1\) and \(\alpha_2\).
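Analogously, a minimal R sketch of the MPP boundary formulas above:

```r
# MPP two-stage stopping boundaries under the nonbinding futility rule
# (beta1 = 1): alpha1 = pi1, alpha2 = (pi1 - alpha)/log(pi1).
mpp_boundaries <- function(alpha = 0.025, pi1 = 0.01) {
  c(alpha1 = pi1, alpha2 = (pi1 - alpha) / log(pi1))
}
mpp_boundaries(0.025, 0.01)  # alpha1 = 0.010, alpha2 ~ 0.00326
```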
It is interesting to note that when \(p_1<\alpha_2\), there is no point in continuing the trial because \(p_1 p_2 \leq p_1<\alpha_2\), so efficacy is guaranteed to be claimed at the final analysis regardless of \(p_2\). Therefore, it is suggested that we choose \(\beta_1>\alpha_1>\alpha_2\).
Lehmacher and Wassmer (1999) proposed the test statistic at the \(k^{t h}\) stage that results from the inverse-normal method of combining independent stagewise \(p\)-values: \[ Z_k=\sum_{i=1}^k w_{k i} \Phi^{-1}\left(1-p_i\right) \]
where the weights satisfy the equality \(\sum_{i=1}^k w_{k i}^2=1\), and \(\Phi^{-1}\) is the inverse function of \(\Phi\), the standard normal c.d.f. Under the null hypothesis, the stagewise \(p_i\) is usually uniformly distributed over \([0,1]\). The random variables \(z_{1-p_i}=\Phi^{-1}\left(1-p_i\right)\) and \(Z_k\) have the standard normal distribution. MINP is a generalization of classical GSD (details will be provided later).
To be consistent with MSP and MPP, we transform the test statistic \(Z_k\) to the \(p\)-scale, i.e.,
\[ T_k=1-\Phi\left(\sum_{i=1}^k w_{k i} \Phi^{-1}\left(1-p_i\right)\right) \]
so that the stopping boundary is on the \(p\)-scale and can easily be compared with the other methods regarding operating characteristics.
With MINP, the classical group sequential boundaries remain valid regardless of the timing and sample-size adjustments that may be based on the observed data at the previous stages. Note that under the null hypothesis, \(p_i\) is usually uniformly distributed over \([0,1]\) and hence \(z_{1-p_i}=\Phi^{-1}\left(1-p_i\right)\) has the standard normal distribution. The Lehmacher-Wassmer method provides a broad method for different endpoints as long as the \(p\)-value under the null hypothesis is stochastically larger than or equal to a \(p\)-value that is uniformly distributed over \([0,1]\).
Examples of stopping boundaries for a two-stage design use the weights \(w_1^2=\frac{n_1}{n_1+n_2}\).
In classical group sequential design (GSD), the stopping boundaries are usually specified using a function of stage \(k\). The commonly used such functions are Pocock and O’Brien-Fleming boundary functions. Wang and Tsiatis (1987) proposed a family of two-sided tests with a shape parameter \(\Delta\), which includes Pocock’s and O’Brien-Fleming’s boundary functions as special cases. Because W-T boundaries are based on the \(z\)-scale, for consistency, we can convert them to p-scale. The W-T boundary on \(p\)-scale is given by \[ a_k=1-\Phi\left(\alpha_K \tau_k^{\Delta-1 / 2}\right) \] where \(\tau_k=\frac{k}{K}\) or \(\tau_k=\frac{n_k}{N_K}\) (information time), and \(\alpha_K\) is the stopping boundary at the final stage and a function of the number of stages \(K, \alpha\), and \(\Delta\).
Let \(n_i\) be the stagewise sample size per group at stage \(i\), and let \(N_k=\sum_{i=1}^k n_i\) be the cumulative sample size at stage \(k\); \(N_k/N_K\) is then the information fraction (not the information time!). If we choose \(w_{k i}=\sqrt{\frac{n_i}{N_k}}\), then MINP is consistent with the classical group sequential design.
For the same hypothesis test with two stages, the test statistic for the two-stage design can be generalized to any \(K\)-stage design: \[ \left\{\begin{array}{l} T_k=\sum_{i=1}^k p_i \text { for MSP } \\ T_k=\prod_{i=1}^k p_i \text { for MPP } \\ T_k=1-\Phi\left(\sum_{i=1}^k w_{k i} z_{1-p_i}\right) \text { for MINP } \end{array}\right. \]
where \(p_i\) is the stagewise \(p\)-value at stage \(i\) and the subscript \(k=1, \ldots, K\). When the weight \(w_{k i}=\sqrt{\frac{n_i}{\sum_{i=1}^k n_i}}\), \(T_k\) gives the test statistic for the classical group sequential design, where \(n_i\) is the size of the subsample from stage \(i\).
The stopping rules at stage \(k\) are: \[ \begin{cases}\text { Stop for efficacy } & \text { if } T_k \leq \alpha_k, \\ \text { Stop for futility } & \text { if } T_k>\beta_k, \\ \text { Continue with adaptations if } \alpha_k<T_k \leq \beta_k,\end{cases} \] where \(\alpha_k<\beta_k(k=1, \ldots, K-1)\), and \(\alpha_K=\beta_K\). For convenience, \(\alpha_k\) and \(\beta_k\) are called the efficacy and futility boundaries, respectively.
For a two-stage design with MSP, \[ \left\{\begin{array}{l} \pi_1=\alpha_1 \\ \pi_2=\frac{1}{2}\left(\alpha_2-\alpha_1\right)^2 \end{array}\right. \]
Chang (2010) provides analytical solutions for up to five-stage designs. For the third stage, we have \[ \pi_3=\alpha_1 \alpha_2 \alpha_3+\frac{1}{3} \alpha_2^3+\frac{1}{6} \alpha_3^3-\frac{1}{2} \alpha_1 \alpha_2^2-\frac{1}{2} \alpha_1 \alpha_3^2-\frac{1}{2} \alpha_2^2 \alpha_3 . \]
After we obtain \(\alpha_1\) and \(\alpha_2\), we can obtain \(\alpha_3\) for the stopping boundary using numerical methods.
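For example, a minimal R sketch that solves for \(\alpha_3\) numerically from the \(\pi_3\) formula above (the equal error-spending values are illustrative):

```r
# Solve the three-stage MSP boundary alpha3 from pi3, given alpha1 and
# alpha2 (Chang 2010 formula above); pi3_fun is increasing in a3 on (a2, 1).
pi3_fun <- function(a3, a1, a2)
  a1*a2*a3 + a2^3/3 + a3^3/6 - a1*a2^2/2 - a1*a3^2/2 - a2^2*a3/2
solve_alpha3 <- function(pi3, a1, a2)
  uniroot(function(a3) pi3_fun(a3, a1, a2) - pi3, c(a2, 1))$root

# Equal error spending pi_k = 0.025/3 at each of the three stages:
a1 <- 0.025 / 3
a2 <- a1 + sqrt(2 * 0.025 / 3)   # two-stage MSP formula with pi2 = 0.025/3
a3 <- solve_alpha3(0.025 / 3, a1, a2)
```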
The general steps to determine the stopping boundaries for \(K\)-stage designs can be described as follows: (1) choose the error-spending \(\pi_1, \ldots, \pi_K\) (e.g., via an error-spending function); (2) set \(\alpha_1=\pi_1\); (3) given \(\alpha_1, \ldots, \alpha_{k-1}\), solve for \(\alpha_k\) from \(\pi_k\), analytically or numerically; (4) repeat step (3) until the final stage \(K\) is reached.
For a two-stage design with MPP, \[ \left\{\begin{array}{l} \pi_1=\alpha_1 \\ \pi_2=-\alpha_2 \ln \alpha_1 \end{array}\right. \]
To determine the stopping boundaries for the first two stages with MPP, we use the two-stage formulas above, with \(\pi_2\) in place of \(\alpha-\pi_1\): \[ \left\{\begin{array}{l} \alpha_1=\pi_1, \\ \alpha_2=\frac{\pi_2}{-\ln \alpha_1} \end{array}\right. \]
For the third stage, we have (Chang, 2015) \[ \alpha_3=\frac{\pi_3}{\ln \alpha_2 \ln \alpha_1-\frac{1}{2} \ln ^2 \alpha_1} \]
As soon as \(\pi_1, \pi_2\), and \(\pi_3\) are determined, \(\alpha_1, \alpha_2\), and \(\alpha_3\) can be easily obtained.
For MINP, the stopping boundaries can be determined through numerical integration or simulation (Table 4.3). Specifically, we use the two-stage adaptive design simulation R program with \(\alpha_1=\pi_1\) and run the simulation under \(H_0\) with different values of \(\alpha_2\) until the simulated rejection probability equals \(\pi_1+\pi_2\). After the stopping boundaries \(\alpha_1\) and \(\alpha_2\) are obtained, we use the three-stage simulation R program with the \(\alpha_1\) and \(\alpha_2\) already determined, and try different values of \(\alpha_3\) until the simulated rejection probability equals \(\alpha=0.025\).
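The book's simulation programs are not reproduced here; the following is a minimal R sketch of the same idea for the two-stage case (assumptions: normal endpoint and fixed weights \(w_1=w_2=\sqrt{1/2}\)):

```r
# Simulated rejection probability of a two-stage MINP design under H0,
# for p-scale boundaries alpha1 (stage 1) and alpha2 (final stage).
minp_error <- function(alpha1, alpha2, w1 = sqrt(0.5), nsim = 1e6) {
  z1 <- rnorm(nsim)                               # stagewise z-scores under H0
  z2 <- rnorm(nsim)
  p1 <- 1 - pnorm(z1)                             # stage-1 stagewise p-value
  t2 <- 1 - pnorm(w1 * z1 + sqrt(1 - w1^2) * z2)  # T2 on the p-scale
  mean(p1 <= alpha1 | (p1 > alpha1 & t2 <= alpha2))
}
# Grid search: pick alpha2 so that the total error is close to 0.025
# when alpha1 = pi1 = 0.01 (nonbinding futility).
for (a2 in seq(0.020, 0.026, by = 0.001)) cat(a2, minp_error(0.01, a2), "\n")
```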
If we want the error-spending \(\pi_1, \ldots, \pi_K\) to follow a certain trend (e.g., monotonic increase), we set up a so-called error-spending function \(\pi^*\left(\tau_k\right)\), where \(\tau_k\) is the information time or sample-size fraction at the \(k^{th}\) interim analysis. The commonly used error-spending functions with one-sided \(\alpha\) are the O'Brien-Fleming-like error-spending function \[ \pi^*\left(\tau_k\right)=2\left\{1-\Phi\left(\frac{z_{1-\alpha / 2}}{\sqrt{\tau_k}}\right)\right\}, \] the Pocock-like error-spending function \[ \pi^*\left(\tau_k\right)=\alpha \ln \left[1+(e-1) \tau_k\right], \] and the power family \[ \pi^*\left(\tau_k\right)=\alpha \tau_k^\gamma, \] where \(\gamma>0\) is a constant.
The error-spending function \(\pi^*\left(\tau_k\right)\) represents the cumulative error \((\alpha)\) spent up to the information time \(\tau_k\). Therefore, the error to spend at the \(k^{th}\) stage with information time \(\tau_k\) is determined by \[ \pi_k=\pi^*\left(\tau_k\right)-\pi^*\left(\tau_{k-1}\right) \]
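As a quick illustration, a minimal R sketch of these spending functions and the stagewise errors \(\pi_k\) (the analysis times are assumed for the example):

```r
# Error-spending functions (one-sided alpha) and stagewise errors pi_k.
obf_like    <- function(tau, alpha = 0.025) 2 * (1 - pnorm(qnorm(1 - alpha/2) / sqrt(tau)))
pocock_like <- function(tau, alpha = 0.025) alpha * log(1 + (exp(1) - 1) * tau)
power_fam   <- function(tau, alpha = 0.025, gamma = 1) alpha * tau^gamma

tau <- c(0.5, 1)              # interim at 50% information, final at 100%
diff(c(0, obf_like(tau)))     # pi_1, pi_2 for the O'Brien-Fleming-like function
diff(c(0, pocock_like(tau)))  # pi_1, pi_2 for the Pocock-like function
```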
It is desirable to have a trial design that allows for reestimation of sample-size at the interim analysis. Several different methods have been proposed for sample-size reestimation (SSR), including the blinded, unblinded, and mixed methods.
Futility Stopping: Implementing a futility boundary can help reduce the sample size when early results suggest that continuing the trial would be inefficient due to a very small effect size. Even with an increased maximum sample size (Nmax), the trial may still lack sufficient power to detect a meaningful effect.
Alpha Control: In the adaptive methods discussed (MSP, MPP, and MINP with unblinded or mixed SSR), the sample size adjustment algorithm does not need to be set in advance. Instead, it can be determined based on the interim analysis results. This flexibility allows for more responsive adjustments to trial parameters based on accumulating data.
Concerns with Predetermined Algorithms: Using a predetermined algorithm for sample size adjustment may require revealing efficacy data to the Independent Data Monitoring Committee (IDMC), which could bias the trial. A mixed SSR approach offers a straightforward way to manage this issue, maintaining the trial’s integrity without full disclosure of efficacy information.
Relevance of Power: Power is initially estimated to determine the trial's probability of rejecting the null hypothesis. However, when sample sizes are adjustable during interim analyses, the initial total sample size becomes less critical for final outcomes and more relevant for budgeting and operational planning. Key parameters like the minimum and maximum sample sizes (Nmin and Nmax) directly impact the trial's power.
Importance of Conditional Power: In adaptive designs, conditional power, which measures the likelihood of success given interim results, is often more critical than overall power. MSP generally offers superior conditional power compared to other methods. Regulatory bodies often favor nonbinding futility rules, under which MSP usually performs better.
Effectiveness of Mixed SSR Method: Simulations suggest that the mixed SSR method is highly effective and preferable over purely blinded or unblinded SSR approaches. It is recommended for practical use due to its ability to manage type-I error slightly better. Adjusting the significance level (alpha) based on simulation results can further help control type-I error rates.
Wittes and Brittain \((1990)\) and Gould and Shih \((1992,1998)\) discussed methods of blinded SSR. A blinded SSR assumes that the actually realized effect size estimate is not revealed through unblinding the treatment code. In blinded sample-size reestimation, interim data are used to provide an updated estimate of a nuisance parameter without unblinding the treatment assignments. The nuisance parameter in this context is usually the variance for continuous outcomes. Blinded sample-size reestimation is generally well accepted by regulators (ICH E-9, 1999).
In clinical trials, the sample size is determined by the estimated treatment difference and the variability of the primary endpoint. Due to lack of knowledge of the test drug, the estimate of the variability is often not precise. As a result, the initially planned sample size may turn out to be inappropriate and needs to be adjusted at the interim analysis to ensure the power. A simple approach to dealing with the nuisance parameter \(\sigma\) without unblinding the treatment code is to use the so-called maximum information design. The idea behind this approach is that recruitment continues until the prespecified information level is reached, i.e., \(I=N /\left(2 \hat{\sigma}^2\right)=I_{\max }\). For a given sample size, the information level reduces as the observed variance increases. The sample size per group for the two-group parallel design can be written in this familiar form: \[ N=\frac{2 \sigma^2}{\delta^2}\left(z_{1-\alpha}+z_{1-\beta}\right)^2, \] where \(\delta\) is the treatment difference and \(\sigma\) is the common standard deviation. For a normal endpoint, the pooled (blinded) observations follow a mixture of two normal distributions with variance (Chang, 2014) \[ \left(\sigma^*\right)^2=\sigma^2+\left(\frac{\delta}{2}\right)^2 . \] If we know \(\sigma^2\) (e.g., from a phase-II single-arm trial), we can estimate the treatment difference \(\delta\) from \(\left(\sigma^*\right)^2\), and the sample size can be written as \[ N=2\left(\frac{\left(\sigma^*\right)^2}{\delta^2}-\frac{1}{4}\right)\left(z_{1-\alpha}+z_{1-\beta}\right)^2 \]
If we use the “lumped variance” \(\left(\hat{\sigma}^*\right)^2\) estimated at the interim analysis from the blinded data, the new sample size required for stage 2 is \(N_2=N-N_1\), or \[ N_2=2\left(\frac{\left(\hat{\sigma}^*\right)^2}{\delta^2}-\frac{1}{4}\right)\left(z_{1-\alpha}+z_{1-\beta}\right)^2-N_1, \] where \(N_1\) is the interim sample size.
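A minimal R sketch of this blinded (maximum-information) reestimation, where \(\delta\) is taken to be the initially assumed treatment difference and the inputs are illustrative:

```r
# Blinded SSR: additional per-group stage-2 sample size from the lumped
# variance sig_star2 (blinded pooled estimate) and assumed difference delta.
blinded_ssr_n2 <- function(sig_star2, delta, N1, alpha = 0.025, beta = 0.1) {
  N <- 2 * (sig_star2 / delta^2 - 1/4) * (qnorm(1 - alpha) + qnorm(1 - beta))^2
  max(ceiling(N) - N1, 0)   # additional subjects per group for stage 2
}
blinded_ssr_n2(sig_star2 = 1.1, delta = 0.4, N1 = 100)
```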
In the unblinded method, we will not adjust the sample size if the observed treatment effect is zero or less. Otherwise, the sample size will be adjusted to meet the targeted conditional power, subject to the maximum sample size allowed by cost and time considerations: \[ n_2=\min \left(N_{\max }, \frac{2 \hat{\sigma}^2}{\hat{\delta}^2}\left[\frac{z_{1-\alpha}-w_1 z_{1-p_1}}{\sqrt{1-w_1^2}}-z_{1-c P}\right]^2\right), \text { for } \hat{\delta}>0 \]
where \(cP\) is the target conditional power, \(n_2\) is the sample size per group at stage 2, and \(z_x=\Phi^{-1}(x)\). Note that \(\hat{\delta}\) and \(\hat{\sigma}^2\) are the MLEs of the treatment difference and the common variance from the unblinded data. A minimum sample size \(N_{\min }\) can also be imposed.
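A minimal R sketch of this rule (the MINP weight \(w_1\) is assumed fixed at the design stage; the example inputs are hypothetical):

```r
# Unblinded conditional-power SSR: stage-2 per-group sample size.
# delta_hat, sigma_hat: unblinded interim MLEs; p1: stage-1 p-value.
unblinded_ssr_n2 <- function(delta_hat, sigma_hat, p1, w1,
                             cP = 0.9, alpha = 0.025, Nmax = 500) {
  if (delta_hat <= 0) return(NA)   # no adjustment if observed effect <= 0
  z1 <- qnorm(1 - p1)
  n2 <- 2 * sigma_hat^2 / delta_hat^2 *
    ((qnorm(1 - alpha) - w1 * z1) / sqrt(1 - w1^2) - qnorm(1 - cP))^2
  min(Nmax, ceiling(n2))
}
unblinded_ssr_n2(delta_hat = 0.3, sigma_hat = 1, p1 = 0.15, w1 = sqrt(0.5))
```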
Stopping Boundaries for SSR
When implementing an adaptive design in clinical trials, one common strategy is to re-estimate the sample size at a subsequent stage based on the interim results. The validity of stopping boundaries established under a fixed sample size setting (as might be done in a group sequential design or GSD) is questioned when these adjustments are made.
For MINP
Under the assumption of normality, a linear combination of the independent stagewise statistics \(z_1\) and \(z_2\) would also be normally distributed. This would imply that adjustments to the sample size would not necessarily invalidate the established stopping boundaries if \(\tau_1\) were constant. However, because \(\tau_1\) is a function of the stage sample sizes, and \(n_2\) can change based on the interim results (\(z_1\) or \(p_1\)), \(T_2\) does not strictly remain a linear combination of two independent normal variables. Thus, adjustments to the stopping boundaries might be necessary to maintain control of the type-I error rate.
For MSP and MPP
These methods rely on the assumption that the conditional distribution of the second-stage \(p\)-value \(p_2\), given the first-stage \(p\)-value \(p_1\), is uniform on \([0,1]\) under \(H_0\).
The independence of \(p_1\) and \(p_2\) holds if the second-stage data are independent of the first-stage data and the stage-2 sample size \(n_2\) does not depend on the first-stage results. In designs without sample-size reestimation, \(p_1\) and \(p_2\) therefore remain independent and uniformly distributed. If \(n_2\) is directly or indirectly adjusted based on \(p_1\), this independence may be compromised. However, under the null hypothesis and irrespective of \(n_2\), \(p_2\) remains uniformly distributed because the second-stage test statistic is standard normal for any \(n_2\); this allows the existing GSD boundaries to remain applicable for MSP and MPP.
In this method, the sample-size adjustment has a similar formulation as for the unblinded SSR but replaces the term \(\frac{\hat{\sigma}^2}{\hat{\delta}^2}\) with the blinded estimate \(\left(\frac{\hat{\sigma}^*}{\delta_0}\right)^2\): \[ n_2=\min \left(N_{\max }, 2\left(\frac{\hat{\sigma}^*}{\delta_0}\right)^2\left[\frac{z_{1-\alpha}-w_1 z_{1-p_1}}{\sqrt{1-w_1^2}}-z_{1-c P}\right]^2\right), \hat{\delta}>0 \] where \(\delta_0\) is the initial estimate of the treatment difference and \(\left(\hat{\sigma}^*\right)^2\) is the variance estimated from the blinded data: \[ \left(\hat{\sigma}^*\right)^2=\frac{\sum_{i=1}^N\left(x_i-\bar{x}\right)^2}{N} \]
To reduce the sample size when the drug is ineffective ( \(\delta\) is very small or negative), we need to have the futility boundary. Since the sample size adjustment is based on the blinded value \(\frac{\hat{\sigma}^*}{\delta_0}\), while \(z_{1-p_1}\) and the futility boundary are based on unblinded data, we call this method the mixed method.
Like the blinded maximum-information SSR, the sample-size adjustment with the mixed method will not release the treatment difference to the public. Most importantly, the mixed SSR method is much more efficient than the other methods: when the true treatment effect is smaller than the initial estimate, the sample size increases automatically, effectively preventing a large drop in power. However, the problem with this method is that it will increase the sample size dramatically even when the treatment effect is zero. To avoid this, we have to use a futility boundary; for example, when the observed treatment effect is zero or less, the trial will be stopped.
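Continuing the sketch above, a mixed-method version would look like this (again only a sketch; \(\delta_0\) is the initial estimate of the treatment difference, and the futility rule is assumed to be applied separately on the unblinded data):

```r
# Mixed-method SSR: same conditional-power rule, but with the blinded
# ratio (sigma_star/delta0)^2 replacing the unblinded sigma_hat^2/delta_hat^2.
mixed_ssr_n2 <- function(sigma_star, delta0, p1, w1,
                         cP = 0.9, alpha = 0.025, Nmax = 500) {
  z1 <- qnorm(1 - p1)   # z1 (and the futility boundary) use unblinded data
  n2 <- 2 * (sigma_star / delta0)^2 *
    ((qnorm(1 - alpha) - w1 * z1) / sqrt(1 - w1^2) - qnorm(1 - cP))^2
  min(Nmax, ceiling(n2))
}
```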
The discussion around survival analyses in clinical trials often pivots on whether to focus on the number of patients or the number of events, like deaths or disease progression. Event-based designs are specifically structured around the occurrence of these clinical endpoints, not just patient counts. This focus is driven by the statistical models used, which assume that the events at each stage are independent of each other.
Independence of Statistics Across Stages: Traditional methods in survival analysis assume that the statistics at each stage are independent, meaning that the outcomes at one stage do not influence those at subsequent stages. For instance, the first group of patients enrolled—let’s say \(N_1\) patients—is used for early interim analysis regardless of whether these patients have experienced the event or not.
The log-rank test, commonly used for this analysis type, relies on event counts. Its statistic at various stages is given by: \[ T(D_k) = \sqrt{\frac{\hat{D}_k}{2}} \cdot \frac{\lambda_1}{\lambda_2} \sim N\left(\sqrt{\frac{D_k}{2}} \cdot \frac{\lambda_1}{\lambda_2}, 1\right) \] Here, \(\hat{D}_k\) represents the observed events, and \(D_k\) represents the expected events at the \(k^{th}\) stage, with \(\lambda_1\) and \(\lambda_2\) as model parameters.
Dependence Issues and Normal Approximation
While the test statistics \(T(D_k)\) are not independent due to their reliance on the cumulative data across stages, studies by Breslow and Haug (1977) and Canner (1997) affirm that the normal approximation is still valid, even with a small number of events.
Noninferiority Margin in Clinical Trials
In noninferiority (NI) trials, the objective is to demonstrate that the new treatment is not worse than the active control by more than a prespecified margin. There are two main approaches to defining this margin: a prefixed (prespecified) margin, or a margin defined as a fraction of the control effect, as in the Farrington-Manning formulation below.
Farrington-Manning Test for Noninferiority
Originally proposed for trials with binary endpoints, this method has since been extended to adaptive designs that might have different types of endpoints (such as mean, proportion, or median survival time). The null hypothesis for this test is structured as follows: \[ H_0: u_t - (1 - \delta_{NI}) u_c \leq 0 \] where \(u_t\) and \(u_c\) are the responses for the test and control groups, respectively, and \(\delta_{NI}\) is a fraction between 0 and 1 that represents how close the test treatment’s efficacy needs to be to the control’s efficacy.
Calculation of the Test Statistic
The test statistic for evaluating this hypothesis is given by: \[ z_k = \frac{\hat{u}_t - (1 - \delta_{NI}) \hat{u}_c}{\sqrt{\frac{\sigma_t^2}{n_k} + (1 - \delta_{NI})^2 \frac{\sigma_c^2}{n_k}}} \] Here, \(\hat{u}_t\) and \(\hat{u}_c\) are the estimated responses for the test and control groups at the \(k^{th}\) stage, \(\sigma_t^2\) and \(\sigma_c^2\) are the variances of these estimates, and \(n_k\) is the number of subjects at stage \(k\). The variances are provided by supplementary tables or derived from prior data and are used to normalize the effect size difference between the two treatments.
It is noted that using the Farrington-Manning approach can lead to a reduction in variance compared to the prefixed margin approach. The variance in this context is calculated as: \[ \sigma_t^2 + (1 - \delta_{NI})^2 \sigma_c^2 \] This reduction in variance can enhance the power of the test to detect noninferiority, making the Farrington-Manning approach potentially more powerful than the fixed margin approach.
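For concreteness, a small R sketch of the stagewise statistic above (all input values are illustrative, not from the text):

```r
# Farrington-Manning-type NI test statistic at stage k (normal approximation,
# equal per-group sample size n_k; var_t, var_c are the variance estimates).
fm_z <- function(u_t, u_c, var_t, var_c, n_k, delta_NI) {
  (u_t - (1 - delta_NI) * u_c) /
    sqrt(var_t / n_k + (1 - delta_NI)^2 * var_c / n_k)
}
fm_z(u_t = 0.95, u_c = 1.0, var_t = 1, var_c = 1, n_k = 200, delta_NI = 0.2)
```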
Clinical trials with diseases of unknown etiology or where no single clinical endpoint has been established as most crucial might use coprimary endpoints. This approach accounts for diseases that manifest in multiple ways, requiring trials to assess efficacy across several metrics simultaneously. Traditionally, trials focused on a single primary endpoint, but the need to address multiple facets of a disease has led to the development of methodologies that cater to multiple endpoints.
Hypothesis Testing: For trials with coprimary endpoints, the null hypothesis typically states that the treatment is not superior (or noninferior) to the control for at least one of the endpoints. Rejection of this null hypothesis means the treatment has shown the desired effect across all coprimary endpoints.
Power and Sample Size: The inclusion of multiple primary endpoints affects the power of the trial, which is the probability of correctly rejecting the null hypothesis when it is false. The overall power to detect an effect can be considerably reduced compared to a single-endpoint trial, depending on the correlation among the endpoints. Calculating sample size requires careful consideration to ensure sufficient power for all endpoints.
Equivalently, we can write the hypothesis test as \[ H_0: \cup_{j=1}^d H_{0 j} \text { versus } H_a: \cap_{j=1}^d H_{a j}, \] where \(H_{0 j}: \mu_j \leq 0\) and \(H_{a j}: \mu_j>0\). The test statistics for the individual endpoints are defined in the usual way: \(Z_{N j}=\sum_{i=1}^N X_{i j} / \sqrt{N}\). We reject \(H_{0 j}\) if \(Z_{N j} \geq c\), the common critical value for all endpoints; we reject the global null hypothesis \(H_0\), and thus accept \(H_a\), if \(Z_{N j} \geq c\) for all \(j\). It is straightforward to prove that for all \(j\) and \(k\) \((j, k=1,2, \ldots, d)\), the correlation between the test statistics is the same as the correlation between the endpoints, i.e., \(\operatorname{Corr}\left(Z_{N j}, Z_{N k}\right)=\rho\) for \(j \neq k\).
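A minimal R sketch of the power for \(d=2\) coprimary endpoints with a common critical value \(c\) (the effect sizes, sample size, and correlation are hypothetical; the mvtnorm package computes the joint normal probability):

```r
# Power for d coprimary endpoints: Pr(Z_Nj >= c for all j) under Ha,
# with common correlation rho among the endpoint statistics.
library(mvtnorm)
copower <- function(c, mu, N, rho) {
  d <- length(mu)
  corr <- matrix(rho, d, d); diag(corr) <- 1
  pmvnorm(lower = rep(c, d), upper = rep(Inf, d),
          mean = sqrt(N) * mu, corr = corr)[1]
}
copower(c = 1.96, mu = c(0.2, 0.25), N = 300, rho = 0.5)
```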
Consider a two-stage ( \(K=2\) ) clinical trial with an interim analysis planned at information time \(\tau=0.5\) and the O’Brien-Fleming stopping boundaries are used; that is, the rejection boundaries are \(z_{1-\alpha_1}=2.80\) and \(z_{1-\alpha_2}=1.98\) at a one-sided level of significance \(\alpha=0.025\).
Overview of Adaptive Designs for Multiple-Endpoint Trials:
Methodology for Group Sequential Trials with Multiple Endpoints: These trials are designed to evaluate endpoints at multiple stages or times, using interim analyses to potentially stop the trial early based on the results at each stage.
Procedures Proposed by Tang and Geller:
Procedure 1:
Interim Analyses: Conduct interim analyses to test the global hypothesis \(H_{0,M}\) using predefined stopping boundaries, which are set as one-sided alpha levels across the different stages.
Application of Closed Testing Procedure: If a particular hypothesis \(H_{0,M}\) is rejected at a stage \(t^*\), the closed testing procedure is then applied to test all remaining hypotheses \(H_{0,F}\) using the test statistics gathered until \(t^*\) with adjusted critical values.
Continuation Upon Non-Rejection: If any hypothesis is not rejected at \(t^*\), the trial continues to the next stage. The closed testing procedure is repeated, with previously rejected hypotheses considered as rejected without retesting.
Repetition Until Conclusion: This step is repeated until either all hypotheses are rejected or the trial reaches its final planned stage.
Procedure 2 (Modification of Procedure 1):
Example
A phase-III two-parallel group non-Hodgkin’s lymphoma trial was designed with three analyses. The primary endpoint is progression-free survival (PFS); the secondary endpoints are (1) overall response rate (ORR) including complete and partial response and (2) complete response rate (CRR). The estimated median PFS is 7.8 months and 10 months for the control and test groups, respectively. Assume a uniform enrollment with an accrual period of 9 months and a total study duration of 23 months. The estimated ORR is 16% for the control group and 45% for the test group. The classical design with a fixed sample-size of 375 subjects per group will allow for detecting a 3-month difference in median PFS with 82% power at a one-sided significance level of α = 0.025. The first interim analysis (IA) will be conducted on the first 125 patients/group (or total N1 = 250) based on ORR. The objective of the first IA is to modify the randomization. Specifically, if the difference in ORR (test-control), ∆ORR > 0, the enrollment will continue. If ∆ORR ≤ 0, then the enrollment will stop. If the enrollment is terminated prematurely, there will be one final analysis for efficacy based on PFS and possible efficacy claimed on the secondary endpoints. If the enrollment continues, there will be an interim analysis based on PFS and the final analysis of PFS. When the primary endpoint (PFS) is significant, the analyses for the secondary endpoints will be performed for the potential claim on the secondary endpoints. During the interim analyses, the patient enrollment will not stop.
The generalized Tang-Geller procedure 2 with MINP for this trial is illustrated below.
Adaptive Seamless Phase-II/III Design Explained:
This design type combines elements from both phase-II and phase-III trials and is typically referred to as a seamless adaptive design. It begins with a learning phase, akin to a traditional phase-II trial, which then seamlessly transitions into a confirmatory phase that mirrors a traditional phase-III trial. This design is particularly effective because it allows for continuous learning and adaptation as the data are collected.
Stages of the Adaptive Design:
Selection Stage: The initial part of the trial functions under a randomized parallel design where several doses (and potentially a placebo) are tested simultaneously. This stage determines the most effective or safest dose, referred to as ‘the winner,’ which will then be taken forward into the next phase of the trial.
Confirmation Stage: In this stage, new patients are recruited and randomized to either the selected dose or a placebo group. The trial continues with these groups to confirm the efficacy or safety profile identified in the selection stage. The final analysis incorporates cumulative data from both stages, which helps in affirming the drug’s effectiveness or safety with greater precision.
Huang, Liu, and Hsiao (2011) proposed a seamless design that allows prespecifying probabilities of rejecting the drug at each stage to improve the efficiency of the trial. Posch, Maurer, and Bretz (2011) described two approaches to control the type-I error rate in adaptive designs with sample-size reassessment and/or treatment selection. The first method adjusts the critical value using a simulation-based approach, which incorporates the number of patients at an interim analysis, the true response rates, the treatment selection rule, etc. The second method is an adaptive Bonferroni-Holm test procedure based on conditional error rates of the individual treatment-control comparisons. They showed that this procedure controls the type-I error rate, even if a deviation from a preplanned adaptation rule or the time point of such a decision is necessary.
Suppose in a \(K\)-group trial, we define the global null hypothesis as \(H_G: \mu_0=\mu_1=\mu_2=\cdots=\mu_K\) and the hypothesis test between the selected arm (winner) and the control as \[ H_0: \mu_0=\mu_s, \quad s=\text { selected arm } \]
Chang and Wang (2014) derived formulations for the two-stage pick-the-winner design with any number of arms. The design starts with all doses under consideration; at the interim analysis, the winner with the best observed response will continue to the second stage along with the control group.
Suppose a trial starts with \(K\) dose groups (arms) and one control arm (arm 0). The maximum sample size for each group is \(N\). The interim analysis will be performed on \(N_1\) independent observations per group, \(x_{i j}\) from \(N\left(\mu_i, \sigma^2\right)\) \((i=0, \ldots, K ; j=1, \ldots, N_1)\). The active arm with the maximum response at the interim analysis and the control arm will be selected, and an additional \(N_2=N-N_1\) subjects per arm will be recruited. We denote by \(\bar{x}_i\) the mean of the first \(N_1\) observations in the \(i^{th}\) arm \((i=0,1, \ldots, K)\), and by \(\bar{y}_i\) the mean of the additional \(N_2\) observations \(y_{i j}\) from \(N\left(\mu_i, \sigma^2\right)\) \((i=0, S ; j=1, \ldots, N_2)\). Here, arm \(S\) is the active arm selected for the second stage. Let \(t_i=\frac{\bar{x}_i}{\sigma} \sqrt{N_1}\) and \(\tau_i=\frac{\bar{y}_i}{\sigma} \sqrt{N_2}\), so that, under \(H_G\), \(t_i\) and \(\tau_i\) have the standard normal distribution with pdf \(\phi\) and cdf \(\Phi\), respectively.
Define the maximum statistic at the end of stage 1 as \[ Z_{(1)}=\max \left(t_1, t_2, \ldots, t_K\right) \]
At the final stage, using all data from the winner, we define the statistic \[ T^*=Z_{(1)} \sqrt{\tau}+\tau_s \sqrt{1-\tau} \] where \(\tau=\frac{N_1}{N}\) is the information time at the interim analysis, and \(s\) is the selected arm.
The final test statistic is defined as \[ T=\left(T^*-t_0\right) / \sqrt{2}, \] and its cdf under \(H_G\) is \[ F_T(t)=\int_{-\infty}^{+\infty} \int_{-\infty}^{\infty}\left[\Phi\left(\frac{z-\tau_s \sqrt{1-\tau}}{\sqrt{\tau}}\right)\right]^K \phi\left(\tau_s\right) \phi(\sqrt{2} t-z) d \tau_s d z \]
When the information time \(\tau=1\), for the Dunnett test: \[ F_T(t)=\int_{-\infty}^{\infty}[\Phi(z)]^K \phi(\sqrt{2} t-z) d z \]
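To make the design concrete, here is a minimal R simulation sketch that follows the statistics defined above (per those definitions, only the stage-1 control statistic \(t_0\) enters \(T\)); with all means equal, it estimates the type-I error for a given critical value \(c_\alpha\), which is an illustrative input here and would in practice be obtained from \(F_T\):

```r
# Two-stage pick-the-winner design: simulate Pr(T >= c_alpha).
# mu = c(mu0, mu1, ..., muK), the true means in units of sigma;
# N1, N2 are the per-group stage-1 and stage-2 sample sizes.
pick_winner_sim <- function(mu, N1, N2, c_alpha, nsim = 1e5) {
  K   <- length(mu) - 1
  tau <- N1 / (N1 + N2)                          # information time at interim
  hit <- 0
  for (r in seq_len(nsim)) {
    t1 <- rnorm(K + 1, mean = mu * sqrt(N1))     # stage-1 stats t_0, ..., t_K
    s  <- which.max(t1[-1])                      # winner among the active arms
    ts <- rnorm(1, mean = mu[s + 1] * sqrt(N2))  # stage-2 stat of the winner
    Tstar <- t1[s + 1] * sqrt(tau) + ts * sqrt(1 - tau)
    hit <- hit + ((Tstar - t1[1]) / sqrt(2) >= c_alpha)
  }
  hit / nsim
}
pick_winner_sim(mu = rep(0, 5), N1 = 60, N2 = 60, c_alpha = 2.35)  # c_alpha illustrative
```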
The concept of a Seamless Phase 2a/b Combination Design in clinical trials represents an innovative approach to streamline drug development by integrating what are traditionally two separate phases into a single, continuous study. This design allows for the more efficient evaluation of a therapeutic candidate, reducing the time and cost associated with transitioning between phases.
Typical Applications
Components of the Seamless Phase 2a/b Combination Design
Key Features and Methodologies
The objective of this trial in asthma patients is to confirm sustained treatment effect, measured as FEV1 change from baseline to 1 year of treatment. Initially, patients are equally randomized to four doses of the new compound and a placebo. Based on early studies, the estimated FEV1 changes at week 4 are 6%, 12%, 13%, 14%, and 15% (with pooled standard deviation 18%) for the placebo (dose level 0) and dose levels 1, 2, 3, and 4, respectively. One interim analysis is planned when 60 patients per group, or 50% of patients, have the efficacy assessments. The interim analysis will lead to picking the winner (the arm with the best observed response). The winner and placebo will continue to stage 2, at which an additional 60 patients per group will be enrolled in the winner and control groups.
The simulations show that the pick-the-winner design has 95% power with a total sample size of 420. The two R outputs below give the simulated rejection probability under the null hypothesis (approximately 0.029) and the power under the alternative (approximately 0.948), respectively.
## [1] "Power=" "0.0289010000000082" "Total N=" "420"
## [1] "Power=" "0.948279999998319" "Total N=" "420"
The Seamless Phase 2b/3 Study Design is an innovative clinical trial strategy that combines Phase 2b (often the dose confirmation phase) and Phase 3 (pivotal trials for efficacy and safety) into a single, continuous protocol. This design allows for a more fluid transition between the late-stage clinical development phases, reducing delays, and potentially accelerating the drug approval process.
Key Features of Seamless Phase 2b/3 Study Design
Example Studies:
The Add-Arms Design in clinical trials is a strategic approach primarily used in phase-II dose-finding studies to identify the Minimum Effective Dose (MED) of a drug. This design is pivotal for balancing efficacy and safety while managing the type-I error reasonably well. The MED is characterized as the dose level where the mean efficacy outcome matches a predefined target, using either a placebo or an active control as a reference.
Key Features of Add-Arms Design:
Methodology and Strategic Approach:
Advantages of Add-Arms Design:
Statistical Considerations:
The add-arm design is a three-stage adaptive design, in which we can use interim analyses and the unimodal-response property to determine which doses cannot (or are unlikely to) be the arm with the best response; thus, no exposure to those doses is necessary.
For convenience, we define the global null hypothesis as \(H_G: \mu_0=\mu_1=\) \(\mu_2=\cdots=\mu_K\) and the hypothesis test between the selected arm (winner) and the control as \[ H_0: \mu_0=\mu_s, s=\text { selected arm } \]
In the \(4+1\) arm design, there are \(K=4\) dose levels (active arms) and a placebo arm (dose level 0). Theoretically, if we know dose 2 has a larger response than dose 3, then we know, by the unimodal-response assumption, that the best response arm can be either dose 1 or 2, but not dose 4. Therefore, we don't need to test dose 4 at all. Similarly, if we know dose 3 has a larger response than dose 2, then the best response arm can be either dose 4 or 3, but not dose 1, and we don't need to test dose 1 at all. The problem is that we don't know the true responses for doses 2 and 3; we have to find them out based on the observed responses. Of course, we want the observed rates to reflect the true responses with high probability, which means the sample size cannot be very small.
At the first stage, randomize subjects into two active arms and the placebo group. The second stage is the add-arm stage, at which we determine which arm to add based on the observed data from the first stage and the unimodal property of the response curve. At the third or final stage, more subjects will be added to the winner arm and the placebo. The randomization is specified as follows: at the first stage, \(N_1/2\), \(N_1\), and \(N_1\) subjects are randomized to the placebo and the two starting active arms, respectively; at the second stage, \(N_1/2\) and \(N_1\) subjects are randomized to the placebo and the added arm; at the third stage, \(N_2\) subjects are randomized to each of the winner and the placebo (see the note on randomization below).
Therefore, there will be \(4 N_1+2 N_2\) total subjects. In the final analysis for the hypothesis test, we will use the data from \(N_1+N_2\) subjects in the winner and \(N_1+N_2\) subjects in arm 0 .
Note for Randomization
Use an \(N_1 / 2: N_1: N_1\) randomization instead of an \(N_1: N_1: N_1\) randomization. The imbalanced randomization keeps the treatment blinded and balances the confounding factors at both stages. If all \(N_1\) placebo subjects were randomized in the first stage, then at the second stage all \(N_1\) subjects would be assigned to the active group without randomization, thus unblinding the treatment and potentially imbalancing some (baseline) confounding factors.
To summarize, the two key ideas in this design are: (1) using the unimodal-response property to determine which arms not to explore, and (2) determining the rule (\(c_R\)) for picking the winner so that the selection probabilities for all active arms are (at least approximately) equal under a flat response curve.
When implementing add-arm designs in clinical trials, especially those with multiple dose levels such as the 4+1 design, managing selection probability and controlling Type-I error are critical components.
Establishing Thresholds and Critical Values
Modifying Stopping Boundaries
For phase-II dose-finding trials, we need to define the response value at the minimum effective dose (MED), \(\mu_{M E D}\), which will be used to define the utility function: \[ U=\frac{1}{\left(\mu_i-\mu_{M E D}\right)^2} \] where \(\mu_i\) is the response in arm \(i\). Using this utility, we can convert the problem of finding the MED into the problem of finding the dose with the maximum utility \(U\), because the maximum of \(U\) is achieved at or near the MED. However, to prevent division by zero (a numerical overflow) in the simulation, we implement the utility as \[ U=\frac{1}{\left(\hat{\mu}_i-\mu_{M E D}\right)^2+\varepsilon} \] where \(\varepsilon>0\) is a very small value (e.g., \(10^{-8}\)) introduced to avoid the overflow when the observed \(\hat{\mu}_i=\mu_{M E D}\).
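A one-line R sketch of this utility (the observed responses \(\hat{\mu}_i\) are illustrative):

```r
# MED-targeting utility with the epsilon guard; which.max picks the arm
# whose observed response is closest to the MED target mu_med.
med_utility <- function(mu_hat, mu_med, eps = 1e-8) 1 / ((mu_hat - mu_med)^2 + eps)
which.max(med_utility(c(0.30, 0.65, 0.90, 1.10), mu_med = 0.7))  # picks arm 2
```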
Anemia, specifically iron deficiency anemia (IDA), is a widespread condition characterized by a lack of sufficient healthy red blood cells, primarily due to iron deficiency. This deficiency impedes the body’s ability to produce adequate red blood cells, which are vital for transporting oxygen throughout the body. The most common causes of IDA include significant blood loss through menstruation, pregnancy-related issues, gastrointestinal bleeding, chronic illnesses like cancer, and chronic kidney disease (CKD).
In this context, FXT, an intravenous (IV) drug candidate, has been developed to treat IDA. The primary safety concerns associated with IV iron therapies like FXT include serious hypersensitivity reactions and potential cardiovascular events. The therapeutic objective is to determine the minimum effective dose (MED) of FXT, ideally providing a significant yet safe elevation in hemoglobin levels. The minimal clinically meaningful increase in hemoglobin (Hb) has been identified as 0.5 g/dL.
The design of the clinical trial for FXT involves testing four active doses: 200 mg, 300 mg, 400 mg, and 500 mg. The primary endpoint for the trial is the change in hemoglobin levels from baseline to week 5. Assuming an Emax model, the expected responses for these doses are calculated, with the placebo showing negligible effect due to its objective measurement in a laboratory setting.
The target response at the MED is set above the minimal clinically meaningful difference, at 0.7 g/dL. If the MED were defined at a 0.5 g/dL increase, there would be roughly a 50% chance that the observed response at that dose falls below the threshold. Setting the target at 0.7 g/dL ensures that the observed response at the MED exceeds 0.5 g/dL with a substantially higher probability.
To optimize the trial design, both add-arm and drop-arm strategies are evaluated.
The efficacy of these designs is evaluated through simulations, which help determine the probability of selecting each dose as the best option and calculate the total sample size required for robust statistical power. The add-arm design typically requires fewer subjects compared to the drop-arm design, offering a more resource-efficient approach while maintaining a high level of statistical power and controlling the type-I error rate effectively.
\[ \begin{array}{lccccc} \hline \text{Design method} & d_1\ (200\ \mathrm{mg}) & d_2\ (400\ \mathrm{mg}) & d_3\ (500\ \mathrm{mg}) & d_4\ (300\ \mathrm{mg}) & \text{Sample size} \\ \hline \text{Add-arm design} & 0.074 & 0.436 & 0.326 & 0.164 & 360 \\ \text{Drop-arm design} & 0.102 & 0.319 & 0.296 & 0.283 & 432 \\ \hline \end{array} \]
(Entries under \(d_1\) through \(d_4\) are the selection probabilities for each dose.)
## [1] "power=" "0.0253699999999991" "Sample Size=" "600" "Selection Prob=" "0.25060000000011" "0.248690000000108"
## [8] "0.25076000000011" "0.249950000000109"
## [1] "power=" "0.948099999999912" "Sample Size=" "696" "Selection Prob=" "0.002" "0.153299999999999"
## [8] "0.822099999999926" "0.0225999999999999"
## [1] "power=" "0.0216899999999993" "Sample Size=" "260" "Selection Prob=" "0.251790000000111" "0.247660000000107"
## [8] "0.25088000000011" "0.249670000000109"
This hypothetical phase-II-III seamless design is motivated by an actual asthma clinical development program. AXP is a second-generation compound in the class of asthma therapies known as 5-LO inhibitors, which block the production of leukotrienes, major mediators of the inflammatory response. The company's preclinical and early clinical data suggested that the drug candidate has the potential for improved efficacy and is well tolerated up to a total dose of 1600 mg.
The objective of the multicenter seamless trial is to evaluate the effectiveness (as measured by FEV1) of oral AXP in adult patients with chronic asthma. Patients were randomized to one of five treatment arms: arm 0 (placebo), arm 1 (daily dose of 200 mg for 6 weeks), arm 2 (daily dose of 400 mg for 4 weeks), arm 3 (daily dose of 500 mg for 3 weeks), and arm 4 (daily dose of 300 mg for 5 weeks). Since efficacy usually depends on both the AUC and the Cmax of the active agent, it is difficult to judge at the design stage exactly which dose-schedule combination will be best. However, based on limited data and clinical judgment, it is reasonable to assume that the following dose sequence shows a unimodal (umbrella) response curve: arm 1, arm 2, arm 3, arm 4. The dose responses are estimated to be 8%, 13%, 17%, 16%, and 15% for the placebo and the four active arms, with a standard deviation of 26%.
We compare the 4+1 add-arm design against the drop-arm design. A total sample size of 600 subjects (\(N_1 = N_2 = 100\)) for the add-arm design provides 89% power. We also tried other dose sequences, including a wave-like sequence; the simulation results show that the power ranges from 88% to 89%, except for a linear response, which gives 86% power. In comparison, the drop-arm design with 602 subjects (\(N_1 = N_2 = 86\)) provides only 84% power, regardless of dose sequence.
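The drop-arm side of this comparison can be sketched with a small Monte Carlo simulation under the response curve above (placebo 8%; actives 13%, 17%, 16%, 15%; standard deviation 26%). This is a rough illustration only: the function names are made up for this sketch, and the critical value is calibrated by simulation under a flat response curve rather than taken from analytic stopping boundaries.

```r
# Two-stage drop-arm (pick-the-winner) sketch.
# Stage 1: n1 subjects per arm in placebo + 4 actives; the active arm with the
# largest observed mean continues. Stage 2: n2 more subjects in the winner and
# placebo; the final z-test uses the pooled (n1 + n2) data per arm.
calibrateC <- function(mu0, sigma, n1, n2, alpha = 0.025, nSims = 2e5) {
  z <- replicate(nSims, {
    m1 <- rnorm(5, mu0, sigma / sqrt(n1))               # stage-1 means; arm 1 = placebo
    w  <- which.max(m1[-1]) + 1                         # winner among the actives
    mw <- (n1 * m1[w] + n2 * rnorm(1, mu0, sigma / sqrt(n2))) / (n1 + n2)
    m0 <- (n1 * m1[1] + n2 * rnorm(1, mu0, sigma / sqrt(n2))) / (n1 + n2)
    (mw - m0) / (sigma * sqrt(2 / (n1 + n2)))
  })
  quantile(z, 1 - alpha)                                # critical value under the null
}

powerDropArm <- function(mu, sigma, n1, n2, cCrit, nSims = 1e5) {
  hit <- replicate(nSims, {
    m1 <- rnorm(5, mu, sigma / sqrt(n1))
    w  <- which.max(m1[-1]) + 1
    mw <- (n1 * m1[w] + n2 * rnorm(1, mu[w], sigma / sqrt(n2))) / (n1 + n2)
    m0 <- (n1 * m1[1] + n2 * rnorm(1, mu[1], sigma / sqrt(n2))) / (n1 + n2)
    (mw - m0) / (sigma * sqrt(2 / (n1 + n2))) >= cCrit
  })
  mean(hit)
}

set.seed(1)
sigma <- 0.26; n1 <- n2 <- 86                           # drop-arm sizes from the text
cCrit <- calibrateC(0.08, sigma, n1, n2)                # flat curve at the placebo level
powerDropArm(c(0.08, 0.13, 0.17, 0.16, 0.15), sigma, n1, n2, cCrit)
```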
The 4+1 add-arm design is a flexible approach in clinical trials that can be customized to include more arms, incorporate early rejections, and implement interim futility stopping while maintaining the same stopping boundaries if a nonbinding futility rule is used.
The 5+1 add-arm design introduces five active arms plus a placebo, enhancing the trial's ability to evaluate multiple dosages. This design adjusts the number of subjects and the allocation across arms based on the responses observed at interim analyses, following a staged randomization plan analogous to that of the 4+1 design.
The 6+1 and 7+1 add-arm designs expand on this by increasing the number of active arms, broadening the scope of the investigation and potentially improving the robustness of the findings through more extensive comparisons. These designs follow a similar multistage structure.
All these designs aim to optimize the dosing strategy by dynamically adjusting the trial arms based on ongoing results, which helps in accurately pinpointing the most effective treatment with statistical efficiency. The goal is to manage the overall trial resources better and potentially speed up the development process by quickly dropping less effective or less safe options.
There are two commonly used approaches for oncology dose-escalation trials: (1) algorithm-based escalation rules, and (2) model-based approaches. The second approach can be a frequentist or Bayesian response-adaptive method and can be used in any dose-response trial.
Oncology dose-escalation trials are a critical component in the development of new cancer treatments, particularly for life-threatening conditions where the potential toxicity of therapies can pose significant risks. These trials are designed to identify the maximum tolerated dose (MTD) of a drug, which is defined as the highest dose that can be administered to patients without causing unacceptable levels of dose-limiting toxicity (DLT).
Dose Level Selection
In oncology, the initial dose for phase-I trials is generally conservative to ensure patient safety, often set at one-tenth of the dose that is lethal in 10% of mice (LD10). From this starting point, doses escalate according to a predetermined scheme, such as a modified Fibonacci sequence, which increases doses by progressively smaller multiplicative factors so as to approach, but not overshoot, the biologically active and potentially toxic dose.
Traditional Escalation Rules
The traditional "3+3" dose-escalation rule is commonly used in early-phase clinical trials. Under this rule (a simulation sketch follows the list):
- Three patients are treated at the current dose level.
- If none of the three experiences a dose-limiting toxicity (DLT), the dose is escalated to the next level.
- If one of the three experiences a DLT, three additional patients are treated at the same dose; if no more than one of the six experiences a DLT, the dose is escalated.
- If two or more patients at a dose level experience DLTs, escalation stops, and the MTD is declared to be the previous (next-lower) dose level.
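A minimal simulation of this rule, assuming a hypothetical vector of true DLT probabilities per dose level (`run3plus3` and the rates are illustrative, not from any package):

```r
# Simulate the classical 3+3 escalation rule for given true DLT probabilities.
run3plus3 <- function(pDLT) {
  d <- 1; nDose <- length(pDLT)
  repeat {
    dlt1 <- rbinom(1, 3, pDLT[d])            # first cohort of 3 patients
    if (dlt1 == 0) {
      if (d == nDose) return(d)              # top dose tolerated; stop there
      d <- d + 1                             # escalate
    } else if (dlt1 == 1) {
      dlt2 <- rbinom(1, 3, pDLT[d])          # expand with 3 more patients
      if (dlt1 + dlt2 <= 1) {
        if (d == nDose) return(d)
        d <- d + 1
      } else return(d - 1)                   # >= 2/6 DLTs: MTD = previous dose
    } else return(d - 1)                     # >= 2/3 DLTs: MTD = previous dose
  }
}

set.seed(1)
pDLT <- c(0.05, 0.10, 0.20, 0.35, 0.50)      # hypothetical true DLT rates
mtd  <- replicate(10000, run3plus3(pDLT))
table(factor(mtd, levels = 0:5)) / 10000     # 0 = even the lowest dose too toxic
```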
The “3+3” rule and its variants, like the “A+B” escalation rule, allow for structured dose increases while closely monitoring patient safety. The “A+B” model adapts the escalation process based on the DLT outcomes among a specific number of patients, offering flexibility to adjust the trial protocol based on emerging data.
A+B Escalation with and without Dose De-escalation
The A+B escalation designs can include mechanisms for both escalating and de-escalating doses, enhancing the trial's adaptability.
Such designs are crucial in oncology, where patient responses to drug doses can vary widely and the balance between efficacy and toxicity is delicate.
The continual reassessment method (CRM) is a sophisticated statistical approach used primarily in dose-finding studies, particularly within oncology trials. This method dynamically updates its dose-response model based on patient outcomes as the trial progresses, making it more responsive and potentially more efficient than traditional methods like the “3+3” design.
Probability Model for Dose-Response
CRM utilizes a probability model to estimate the likelihood of a response based on dosage. A logistic model is commonly applied:
\[ P(x) = \left[1 + b \exp(-ax)\right]^{-1} \]
Here, \(x\) represents the dose level, \(P(x)\) is the probability of observing a response at dose \(x\), \(b\) is a constant, and \(a\) is a parameter that gets updated as more data becomes available.
Likelihood Function
The likelihood function for the CRM is constructed from the observed patient responses. If \(y_i\) is the response of the \(i\)th patient treated at dose level \(x_{m_i}\), the likelihood is based on whether a response was observed or not: \[ L(a) = \prod_{i=1}^{n} \left[P(x_{m_i})\right]^{y_i} \left[1 - P(x_{m_i})\right]^{1 - y_i} \]
This setup contributes to the Bayesian framework where prior and posterior distributions help in estimating the parameters more accurately.
Prior Distribution of the Parameter
In Bayesian statistics, prior knowledge about the parameter \(a\) is required. This is represented by a prior distribution \(g_0(a)\), which might be non-informative if little is known about the parameter before the trial.
Reassessment of the Parameter
The parameter \(a\) is reassessed continually using the accumulated trial data. This can be done through Bayesian updating, in which a posterior distribution of \(a\) is derived, or by frequentist methods such as maximum likelihood estimation. These updates refine the predictions of the dose-response relationship as the trial progresses.
Assignment of the Next Patient
The next patient's dose level is determined from the updated dose-response model, aiming to assign a dose close to the current estimate of the MTD. This process uses the updated predictive probability:
\[ P(x) = \int \left[1 + b \exp(-ax)\right]^{-1} g_n(a|r) \, da \]
where \(g_n(a|r)\) is the updated posterior distribution of \(a\) given the data.
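Under these definitions, one Bayesian update step can be sketched by direct numerical integration. The logistic model and binomial likelihood follow the formulas above; the fixed value of \(b\), the exponential prior, the target DLT rate, and all names are assumptions made for illustration:

```r
# One Bayesian CRM update by numerical integration over the parameter a,
# assuming P(x) = 1 / (1 + b * exp(-a * x)) with b fixed.
crmNextDose <- function(doses, x, y, b = 150, target = 0.25,
                        prior = function(a) dexp(a, 1)) {
  pTox <- function(a, x) 1 / (1 + b * exp(-a * x))
  # unnormalized posterior g_n(a | data) = prior(a) * binomial likelihood L(a)
  post <- function(a)
    prior(a) * vapply(a, function(ai)
      prod(pTox(ai, x)^y * (1 - pTox(ai, x))^(1 - y)), numeric(1))
  normConst <- integrate(post, 0, Inf)$value
  # posterior predictive toxicity probability at each candidate dose
  pHat <- vapply(doses, function(d)
    integrate(function(a) pTox(a, d) * post(a), 0, Inf)$value / normConst,
    numeric(1))
  which.min(abs(pHat - target))      # index of dose closest to the target rate
}

doses <- c(2, 4, 6, 8, 10)           # hypothetical dose scale
x <- c(2, 2, 4); y <- c(0, 0, 1)     # doses given and DLT outcomes so far
crmNextDose(doses, x, y)             # recommended dose level for the next patient
```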
Stopping Rule
The trial employs a pragmatic stopping rule to ensure efficiency and safety. Typically, if a certain number of patients, say six, have been treated at the same dose level and this level has repeatedly been estimated as the MTD, then the trial may conclude that this dose is the MTD and stop further recruitment to this dose level. This avoids overexposure of patients to potentially suboptimal doses and focuses resources on confirming the most promising dose levels.
An Enrichment Strategy in clinical trials is designed to enhance the probability of demonstrating a treatment effect by selecting patients who are more likely to respond to the treatment based on predefined criteria. This approach can be particularly beneficial when there’s a clear hypothesis that a specific subgroup of the population may exhibit a stronger response or better tolerance to the treatment.
The TAPPAS trial provides an illustrative example of applying adaptive design principles in a clinical trial setting, particularly for treating a rare cancer with limited treatment options.
Angiosarcoma and Current Treatments
Combination Therapy in the TAPPAS Trial
Trial Objectives
Population Focus
Adaptive Design Justification
The RALES (Randomized Aldactone Evaluation Study) trial was a significant clinical study focusing on the effectiveness of an aldosterone receptor blocker compared to a placebo.
The psoriasis trial example involves:
- Primary Endpoint: achievement of PASI-75 (a 75% reduction in the Psoriasis Area and Severity Index score) by week 16.
- Design Parameters: 95% power to detect a 10% improvement with the new treatment relative to placebo, with uncertainty about the placebo response rate (\(\pi_c = 7.5\%\)).
The problem here is that the power of the trial depends on both the actual placebo response rate (π_c) and the effect size (δ), which can be unknown and vary. If π_c or δ are misestimated, it can impact the trial’s power, making the originally calculated sample size insufficient or excessive.
By using an information-based design, the trial is allowed to adapt by recalculating the necessary sample size based on accruing data about the actual placebo rate and effect size. This can be done through interim analyses, where the actual information accrued (\(J_j\)) is compared against the prespecified maximum information \(I_{\text{max}}\). If \(J_j\) meets or exceeds \(I_{\text{max}}\), or efficacy boundaries are crossed, the trial might be stopped early for efficacy or futility, or the sample size adjusted to meet the desired power.
This approach proposes using “statistical information” rather than fixed sample sizes to guide the monitoring and conclusion of clinical trials. The rationale here is to accumulate enough information to make robust statistical decisions, thereby potentially making the trial more efficient and flexible.
This approach is particularly beneficial in scenarios like the psoriasis trial where there is considerable uncertainty about critical parameters that influence study outcomes. It allows the study to adapt to the observed data, making it potentially more efficient and likely to reach conclusive results.
\[ I_{\text{max}} = \left( \frac{Z_{\alpha/2} + Z_{\beta}}{\delta} \right)^2 \times \text{Inflation Factor} \]
Irrespective of the true placebo rate \(\pi_c\), the maximum statistical information \(I_{\text{max}}\) remains constant, while the corresponding maximum sample size \(N_{\text{max}}\) adjusts according to the outcome variability implied by \(\pi_c\).
\[ J_j = \left[\text{se}(\hat{\delta}_j)\right]^{-2} = \frac{N_j/2}{\hat{\pi}_c (1 - \hat{\pi}_c) + \hat{\pi}_e (1 - \hat{\pi}_e)} \]
- \(\text{se}(\hat{\delta}_j)\): the standard error of the estimated treatment effect at the \(j\)th look; the accrued information is its inverse square.
- \(N_j/2\): assumes an equal split of the current sample size \(N_j\) between the treatment and control groups.
- \(\hat{\pi}_e\) and \(\hat{\pi}_c\): estimated rates of the endpoint for the experimental and control groups, respectively.
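A short sketch of these calculations for a difference in proportions; the inflation-factor value here is an assumed placeholder, and the variable names are illustrative:

```r
# Information-based design: I_max is fixed by (alpha, beta, delta); the
# implied N_max varies with the nuisance placebo rate pi_c.
alpha <- 0.05; beta <- 0.05; delta <- 0.10
IF    <- 1.05                                    # assumed inflation factor
Imax  <- ((qnorm(1 - alpha / 2) + qnorm(1 - beta)) / delta)^2 * IF

piC  <- c(0.05, 0.075, 0.10)                     # candidate placebo rates
piE  <- piC + delta
Nmax <- 2 * Imax * (piC * (1 - piC) + piE * (1 - piE))
round(Nmax)                                      # N_max changes with pi_c; I_max does not

# information accrued at look j, with n subjects per group and observed rates
infoAccrued <- function(pC, pE, n) n / (pC * (1 - pC) + pE * (1 - pE))
```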
Sample size re-estimation (SSR) in clinical trials is a strategic approach employed when initial assumptions about a study need adjustment based on interim data. This can be crucial for ensuring the scientific validity and efficiency of a trial.
During an interim analysis, questions might arise such as:
- Should the study continue to target a difference (\(\delta\)) of 2 points?
- Is the assumed standard deviation (\(\sigma = 7.5\)) still valid?
To address these questions:
- Conditional Power (CP): the probability that the study will detect the predefined effect size at the final analysis, given the interim results. The sample size may be increased to raise CP.
- Adjusting the Critical Cutoff: to maintain the type-I error rate, the critical value used to declare significance at the final analysis may need adjustment.
Increasing the sample size can boost CP because it typically reduces the variance of the test statistic, making it more likely that \(Z_2\) will exceed \(c\).
This adjustment ensures that, even with an altered trial design, the study's conclusions remain sound. The statistical methodology aims to maintain the trial's power (its ability to detect a true effect) while keeping the type-I error rate at its nominal level.
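A minimal sketch of conditional power under an inverse-normal combination of the two stages (a common choice, though not the only one); `cPower`, its defaults, and the example values are illustrative:

```r
# Conditional power given interim z1, with inverse-normal weights
# w1 = sqrt(n1/n), w2 = sqrt(n2/n); n1, n2 are per-group stage sizes and
# theta = delta / sigma is the standardized effect assumed for stage 2.
cPower <- function(z1, n1, n2, delta, sigma, alpha = 0.025) {
  w1 <- sqrt(n1 / (n1 + n2)); w2 <- sqrt(n2 / (n1 + n2))
  crit  <- qnorm(1 - alpha)                 # final critical value c
  theta <- delta / sigma
  # final Z = w1*z1 + w2*Z2, with Z2 ~ N(theta * sqrt(n2/2), 1) for 2 arms
  1 - pnorm((crit - w1 * z1) / w2 - theta * sqrt(n2 / 2))
}

cPower(z1 = 1.2, n1 = 50, n2 = 50, delta = 2, sigma = 7.5)
```

Increasing `n2` in this sketch raises the drift term `theta * sqrt(n2/2)`, which is the mechanism by which a larger stage-2 sample boosts CP.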
Cap on Increases: Often, increases in sample size are capped (e.g., no more than double the initial size) to prevent logistical and financial overextension.
Zones of Adjustment:
Placebo Response and Treatment Effect: The expected placebo response rate is 35%, with an anticipated 20% improvement with crofelemer treatment. These assumptions are critical for calculating the necessary sample size and for power calculations to ensure the study is adequately powered to detect a clinically meaningful effect.
Implications for Sample Size Re-Estimation: Given the uncertainty in the optimal dose and variability in the placebo response, an adaptive trial design with sample size re-estimation could be considered. This approach would allow adjustments based on interim analysis results, potentially optimizing the study design in real-time to ensure sufficient power and minimize unnecessary exposure to less effective doses.
Interim Analyses: Conducting interim analyses would allow for the assessment of preliminary efficacy and safety data. Based on these data, decisions could be made about continuing, modifying, or stopping the trial for futility or efficacy.
Adjustments Based on Conditional Power: If interim results suggest changes in the estimated placebo response or differentially greater efficacy at specific doses, the sample size could be adjusted to ensure that the study remains adequately powered to detect significant treatment effects.
1. Inverse Normal Combination of Stage 1 and Stage 2 p-values
This method combines the p-values from the two stages of the study using a weighted Z-transform (inverse normal) approach:
\[ Z_2 = \sqrt{\frac{n_1}{n_1 + n_2}} \, \Phi^{-1}(1 - p_1) + \sqrt{\frac{n_2}{n_1 + n_2}} \, \Phi^{-1}(1 - p_2) \]
where \(n_1\) and \(n_2\) are the planned stage-wise sample sizes and \(\Phi^{-1}\) is the inverse of the standard normal cumulative distribution function.
This method assumes that combining information across stages can lead to a more powerful test while still controlling for Type I error, provided the combination rule is properly calibrated.
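A direct transcription of the combination rule (one-sided testing assumed; `invNormalZ` is an illustrative name, not a package function):

```r
# Inverse-normal combination of two stage-wise p-values, with weights
# proportional to the square roots of the planned stage sample sizes.
invNormalZ <- function(p1, p2, n1, n2) {
  w1 <- sqrt(n1 / (n1 + n2)); w2 <- sqrt(n2 / (n1 + n2))
  w1 * qnorm(1 - p1) + w2 * qnorm(1 - p2)
}

# reject H0 at one-sided alpha = 0.025 if the combined Z exceeds z_{1-alpha}
invNormalZ(p1 = 0.10, p2 = 0.02, n1 = 100, n2 = 100) >= qnorm(0.975)
```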
2. Closed Testing
Closed testing is a rigorous method for controlling the familywise error rate (FWER) in multiple hypothesis testing, especially when the tests are not independent: an elementary hypothesis is rejected only if every intersection hypothesis containing it is also rejected by a local level-\(\alpha\) test.
Features of MAMS (multi-arm, multi-stage) designs:
Multiple Treatment Arms: Involves comparing several treatment options against a common control group, allowing simultaneous evaluation of multiple interventions.
Multiple Interim Analyses: Scheduled assessments of the accumulating data at multiple points during the trial. These interim looks allow for early decisions about the continuation, modification, or termination of treatment arms.
Early Stopping Rules: The trial can be stopped early for efficacy if a treatment shows clear benefit, or for futility if it’s unlikely to show benefit by the end of the study.
Continuation with Multiple Winners: Unlike traditional designs that might stop after finding one effective treatment, MAMS design can continue to evaluate other promising treatments.
Dropping Losers: Ineffective treatment arms can be discontinued at interim stages, focusing resources on more promising treatments.
Dose Selection: Flexibility to adjust doses or select the most effective dose based on interim results.
Sample Size Re-estimation (SSR): Sample sizes can be recalculated based on interim data to ensure adequate power is maintained throughout the trial, especially useful if initial estimates of effect size (δ) or variability (σ) are inaccurate.
Control of Type-1 Error: Despite the complexity and multiple hypothesis testing involved, the design includes methodologies to maintain strong control over the type-1 error rate, ensuring the validity of the trial’s conclusions.
Trial Details:
- Intervention: three doses of vericiguat compared to placebo.
- Primary Endpoint: week-12 reduction in the log of NT-proBNP, a biomarker used to assess heart function and heart failure.
- Sample Size and Power: a total of 388 patients to achieve 80% power for detecting a change of \(\delta = 0.187\) in log NT-proBNP, assuming a standard deviation (\(\sigma\)) of 0.52.
Adaptive Features:
- Adaptive Design Considerations: the trial was prepared to adjust for values of \(\delta\) and \(\sigma\) different from those initially estimated, which is crucial if the biological effect of vericiguat or the variability in NT-proBNP measurements was misestimated.
- Interim Analyses with SSR and Drop-the-Loser: the design included provisions for interim analyses to reassess the continued relevance of each dose. Less promising doses could be dropped ("drop the loser"), and the sample size could be recalculated based on the data gathered to that point ("SSR").
(In the notation for this design, subscript \(i\) denotes the treatment arm and subscript \(j\) the analysis stage.)
Returning to the TAPPAS trial, the treatment arms are:
- TRC105 + Pazopanib: TRC105 targets the endoglin receptor and is combined with pazopanib, which targets the VEGF receptor.
- Pazopanib Alone: the standard of care, serving as the control.
Subgroups:
- Two primary subgroups: cutaneous and visceral angiosarcoma. The cutaneous subgroup is notably more sensitive to TRC105, suggesting a potential for subgroup-specific efficacy.
Interim Decisions Based on the Interim Analysis:
- Favorable Results: if the interim results are favorable, the trial continues as planned.
- Promising but Uncertain Results: if results are promising but not conclusively favorable, the trial may adapt by increasing the sample size to enhance statistical power.
- Unfavorable Results for the Combined Therapy: the trial continues as planned or stops for futility, based on the specific interim findings.
- Population Enrichment: if the interim results suggest that the cutaneous subgroup is particularly responsive, the trial may shift its focus to this subgroup, enriching the patient population to those most likely to benefit.
Statistical Methodology for Decision Making:
1. In Case of No Enrichment
When there is no enrichment, i.e., the trial continues with the full patient population, significance for the full population is declared if both of the following hold: \[ w_1 \Phi^{-1}(1 - p_1^{FS}) + w_2 \Phi^{-1}(1 - p_2^{FS}) \geq Z_{\alpha} \] \[ w_1 \Phi^{-1}(1 - p_1^F) + w_2 \Phi^{-1}(1 - p_2^F) \geq Z_{\alpha} \]
Where:
- \(\Phi^{-1}\) is the inverse of the standard normal cumulative distribution function.
- \(p_1^{FS}\) and \(p_2^{FS}\) are the stage-1 and stage-2 p-values for the intersection hypothesis \(H_{FS}\) (the full population and subgroup tested jointly, as required by the closed testing procedure).
- \(p_1^{F}\) and \(p_2^{F}\) are the stage-1 and stage-2 p-values for the full-population hypothesis \(H_F\).
- \(w_1\) and \(w_2\) are the prespecified weights assigned to the p-values from the two stages.
- \(Z_{\alpha}\) is the critical value from the standard normal distribution corresponding to the desired overall one-sided type-I error rate, \(\alpha\).
2. In Case of Enrichment
When the trial opts for enrichment, i.e., focusing on a specific subgroup (e.g., the cutaneous subgroup) after finding differential treatment effects, significance for the subgroup is declared if both of the following hold: \[ w_1 \Phi^{-1}(1 - p_1^{FS}) + w_2 \Phi^{-1}(1 - p_2^{FS}) \geq Z_{\alpha} \] \[ w_1 \Phi^{-1}(1 - p_1^S) + w_2 \Phi^{-1}(1 - p_2^S) \geq Z_{\alpha} \]
Where:
- \(p_1^S\) and \(p_2^S\) are the stage-1 and stage-2 p-values for the enriched subgroup (e.g., cutaneous).
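The two decision rules can be transcribed directly; the weights, p-values, and function names below are illustrative, not from any package:

```r
# Closed-testing decisions with inverse-normal combination for the full
# population (F), subgroup (S), and intersection (FS) hypotheses.
combZ <- function(p1, p2, w1, w2) w1 * qnorm(1 - p1) + w2 * qnorm(1 - p2)

claimFull <- function(p1FS, p2FS, p1F, p2F, w1, w2, alpha = 0.025) {
  zCrit <- qnorm(1 - alpha)
  combZ(p1FS, p2FS, w1, w2) >= zCrit && combZ(p1F, p2F, w1, w2) >= zCrit
}
claimSub <- function(p1FS, p2FS, p1S, p2S, w1, w2, alpha = 0.025) {
  zCrit <- qnorm(1 - alpha)
  combZ(p1FS, p2FS, w1, w2) >= zCrit && combZ(p1S, p2S, w1, w2) >= zCrit
}

# example with equal weights (w1 = w2 = 1/sqrt(2)) and hypothetical p-values
w <- 1 / sqrt(2)
claimFull(p1FS = 0.03, p2FS = 0.01, p1F = 0.04, p2F = 0.02, w1 = w, w2 = w)
```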
Recruitment is very challenging because the disease is rare, so it is easier to start with a smaller sample size and request an increase later if the interim data warrant it. Given the rarity of the disease and the challenges in recruitment:
Zone: categorizes the possible outcomes of the interim analysis into four zones, typically defined by the interim conditional power:
- Efficacy zone: results are strong enough to stop early for efficacy.
- Favorable zone: conditional power is adequate; the trial continues as planned.
- Promising zone: conditional power is moderate; the sample size may be increased.
- Unfavorable zone: conditional power is low; the trial continues unchanged or stops for futility.