Sample Size Determination for Repeated Measures
Randomization underpins causal inference: it turns systematic differences (bias) into explainable random error that the design and analysis can account for.
Logistical and scientific reasons often make cluster trials preferable or necessary:
Trade-offs and risks
Let \(\rho\) be the intracluster correlation (proportion of outcome variance shared within clusters). Even small \(\rho\) values matter.
Design effect (DE) inflates required sample size:
\[ \text{DE} \;=\; 1 + (m-1)\rho \quad (\text{equal cluster size } m) \]
With unequal cluster sizes (mean \(\bar m\), coefficient of variation \(CV\)):
\[ \text{DE} \;\approx\; 1 + \left[(CV^2+1)\bar m - 1\right]\rho \]
Effective sample size: \(n_\text{eff} = n/\text{DE}\). Increasing clusters (not just people per cluster) is usually the most efficient way to gain power.
Parallel CRT
Matched-Pair CRT
Stepped-Wedge CRT (SW-CRT)
Cluster Crossover CRT (CRXO)
Choosing a Cluster Randomization Approach – detailed guidance
Principle. The randomization scheme should follow the trial’s objective, endpoint behavior, and what design/analysis assumptions you can credibly defend under real operational constraints. A useful rule of thumb: pick the cluster design that is the closest feasible analog to the individually randomized design you would otherwise run.
Parallel cluster randomized trial (CRT) – the default workhorse
Cluster crossover CRT – efficiency via within-cluster contrasts
Stepped-wedge CRT – staggered rollout with time adjustment
How to choose – practical decision points
Sample size and power essentials (all CRTs)
Analysis implications
Operational safeguards
Core idea. In a cluster randomized trial (CRT) you size the study at the level that is randomized—the clusters—then work down to people and measurements. Because individuals within a cluster are correlated, the effective information per person is reduced and the number of clusters becomes the main driver of power.
Key quantities
Mean cluster size \(m\) (participants per cluster); number of clusters per arm \(k\).
Intracluster correlation (ICC, \(\rho\))—proportion of total variance explained by between-cluster differences:
\[ \textstyle \rho=\frac{\sigma_B^2}{\sigma_B^2+\sigma_W^2}. \]
Design effect (DE)—inflation factor that converts an individually randomized sample size to the CRT setting. For a two-level parallel CRT with equal cluster sizes:
\[ \textstyle \text{DE}=1+(m-1)\rho. \]
With unequal cluster sizes (mean \(\bar m\), coefficient of variation \(CV\)):
\[ \textstyle \text{DE}\approx 1+\left[(CV^2+1)\bar m-1\right]\rho. \]
Effective sample size \(n_{\text{eff}}=n/\text{DE}\).
What this means in practice
Workflow to compute a parallel CRT sample size
Specify the usual ingredients for your outcome (difference in means/proportions, log-odds ratio, hazard ratio), \(\alpha\), and target power.
Obtain or bracket a plausible ICC from prior audits/literature/pilot (plan sensitivity analyses over a range, e.g., 0.005–0.10).
Decide a feasible mean cluster size and expected CV of cluster sizes.
Compute the individually randomized sample size \(N_{\text{fixed}}\) and inflate:
\[ N_{\text{CRT}}=\text{DE}\times N_{\text{fixed}}. \]
Divide by \(m\) to get clusters per arm \(k\) (round up), then add margins for cluster loss and participant attrition.
Re-check power with your intended analysis model (mixed effects or GEE) and apply small-sample corrections if clusters are few.
Beyond two-level parallel designs
Analysis alignment (affects sample size assumptions)
Planning tips
Worked Examples Reference
2-Level CRT Example
He, J., Irazola, V., Mills, K.T., Poggio, R., Beratarrechea, A., Dolan, J., Chen, C.S., Gibbons, L., Krousel-Wood, M., Bazzano, L.A. and Nejamis, A., 2017. Effect of a community health worker–led multicomponent intervention on blood pressure control in low-income patients in Argentina: a randomized clinical trial. Jama, 318(11), pp.1016-1025.
3-Level CRT Example
Leahy, A.A., Eather, N., Smith, J.J., Hillman, C., Morgan, P.J., Nilsson, M., Lonsdale, C., Plotnikoff, R.C., Noetel, M., Holliday, E. and Shigeta, T.T., 2019. School-based physical activity intervention for older adolescents: rationale and study protocol for the Burn 2 Learn cluster randomised controlled trial. BMJ open, 9(5), p.e026029.
Matched-Pair CRT Example
Moore, H., Greenwood, D., Gill, T., Waine, C., Soutter, J. and Adamson, A., 2003. A cluster randomised trial to evaluate a nutrition training programme. The British Journal of General Practice, 53(489), p.271.
Stepped-Wedge CRT Example
Fadl, H., Saeedi, M., Montgomery, S., Magnuson, A., Schwarcz, E., Berntorp, K., Sengpiel, V., Storck-Lindholm, E., Strevens, H., Wikström, A.K. and Brismar-Wendel, S., 2019. Changing diagnostic criteria for gestational diabetes in Sweden-a stepped wedge national cluster randomised controlled trial-the CDC4G study protocol. BMC pregnancy and childbirth, 19(1), p.398.
Design being sized. Two-arm, parallel cluster randomized trial with a continuous primary outcome (change in systolic BP). Clusters are centers; all patients in a center get the same arm. Target: 80% power, two-sided α=0.05, detect a Δ = 4 mmHg mean difference. Assumptions: common SD = 10, ICC ρ = 0.06, mean cluster size m = 62. The slide shows 9 clusters per group.
Parameter | Value |
---|---|
Significance Level (2-sided) | 0.05 |
Mean Difference | 4 |
Standard Deviation | 10 |
ICC | 0.06 |
Sample Size Per Cluster | 62 |
Power | 80% |
Clusters Per Group | 9 |
—
Step 1 — Size the individually randomized equivalent.
For two equal groups with common SD, the usual per-arm size is
\[ n_{\text{fixed}} \;=\; \frac{2\,(z_{1-\alpha/2}+z_{1-\beta})^{2}\,\sigma^{2}}{\Delta^{2}}. \]
With \(z_{0.975}=1.96\), \(z_{0.8}=0.84\), \(\sigma=10\), \(\Delta=4\):
\[ n_{\text{fixed}} = \frac{2\,(1.96+0.84)^2\cdot 100}{16} = \frac{2\cdot 7.84\cdot 100}{16} \approx 98\ \text{per arm}. \]
(Standardized effect size \(Δ/σ=0.4\), a moderate effect.)
Step 2 — Inflate for clustering (design effect).
For equal cluster size \(m\) the design effect
\[ \text{DE}=1+(m-1)\rho=1+(62-1)\times 0.06=1+3.66=4.66. \]
Thus the clustered per-arm person sample size is
\[ n_{\text{CRT, per arm}}= \text{DE}\times n_{\text{fixed}} = 4.66\times 98 \approx 457\ \text{patients}. \]
Step 3 — Convert to clusters per arm.
With \(m=62\) patients per cluster:
\[ k \;=\; \left\lceil \frac{n_{\text{CRT, per arm}}}{m} \right\rceil = \left\lceil \frac{457}{62} \right\rceil = 8\ \text{clusters per arm (≈7.4 rounded up)}. \]
Why does the slide show 9? In practice you round up further to cover:
—
What if clusters are unequal? (mean \(\bar m=62\), coefficient of variation \(CV\)) Use the unequal-size approximation:
\[ \text{DE} \approx 1+\left[(CV^2+1)\bar m - 1\right]\rho . \]
Example \(CV=0.30\): \(\text{DE} \approx 1+[(0.09+1)\cdot 62 -1]\cdot 0.06 \approx 5.00\). Then \(n_{\text{CRT, per arm}}\approx 5.00\times 98=490\) ⇒ \(k=\lceil 490/62\rceil=8\); with a safety margin you’d again choose 9.
—
Power is driven by clusters, not people. With \(\rho=0.06\) the effective per-arm sample is \(n_{\text{eff}} = n_{\text{CRT}}/\text{DE}\). Once \(m\) is moderate (here 62), adding more patients per cluster yields diminishing returns; adding clusters is much more efficient.
Analysis alignment (what the size assumes). Plan to analyze with a mixed-effects model (random intercept for cluster) or GEE with robust SEs; when clusters are few (e.g., 9 per arm), use small-sample corrections (Satterthwaite/Kenward–Roger or small-sample-robust sandwich) and report cluster-level degrees of freedom.
Practical sizing tips.
Bottom line. Using Δ=4, SD=10, ρ=0.06, m=62 gives ~8 clusters per arm by formula; rounding up for real-world uncertainties yields 9 clusters per arm, exactly as the slide summarizes.
Design. Two-arm, parallel three-level cluster trial: students (level-1) nested in classes (level-2) nested in schools (level-3). Outcome is change in Progressive Aerobic Cardiovascular Endurance Run (PACER) laps. Assumptions:
Parameter | Value |
---|---|
Significance Level (2-sided) | 0.05 |
Mean Difference | 0.35 |
Standard Deviation (SD)* | 12.64* |
ICC (Level 1, Level 2) | 0.2, 0.03 |
Sample Size (Level 1, 2, 3) | 280 (14, 2, 10) |
Power | 80% |
1) Variance reduction from baseline adjustment (ANCOVA).
Using the well-known ANCOVA approximation, the post-treatment SD is reduced by \(\sqrt{1-r^2}\):
\[ \sigma_{\text{adj}} = \sigma\sqrt{1-r^2} = 29\sqrt{1-0.9^2} = 29\sqrt{0.19} \approx 12.64. \]
So the standardized effect is \(d = \Delta/\sigma_{\text{adj}} = 6/12.64 \approx 0.48\).
2) Individually randomized equivalent (per arm).
For a continuous outcome with common SD,
\[ n_{\text{fixed}} = \frac{2\,(z_{1-\alpha/2}+z_{1-\beta})^2\,\sigma_{\text{adj}}^2}{\Delta^2} = \frac{2\,(1.96+0.84)^2\,(12.64)^2}{6^2} \approx 69.6 \;\Rightarrow\; 70 \text{ per arm}. \]
3) Three-level design effect (balanced sizes).
Two students in the same class share correlation \(\rho_{\text{class}}+\rho_{\text{school}}\); in different classes of the same school they share \(\rho_{\text{school}}\); in different schools correlation is 0. With \(m=14\) students/class and \(c=2\) classes/school, the design effect is
\[ \text{DE} = 1 + (m-1)\big(\rho_{\text{class}}+\rho_{\text{school}}\big) + m\,(c-1)\,\rho_{\text{school}} = 1 + 13(0.20+0.03) + 14(1)(0.03) = 1 + 2.99 + 0.42 \approx 4.02. \]
4) Inflate and convert to clusters.
Total persons per arm:
\[ n_{\text{CRT, per arm}}=\text{DE}\times n_{\text{fixed}} \approx 4.02 \times 70 \approx 282. \]
With 14 students per class and 2 classes per school, that is
\[ \text{per arm: } \frac{282}{14} \approx 20 \text{ classes} \;=\; 10 \text{ schools}, \]
which matches the slide’s 280 students (2 classes of 14 in each of 10 schools).
Notes and checks.
Design and inputs
Parameter | Value |
---|---|
Significance Level (2-sided) | 0.05 |
Control Group Proportion | 0.2 |
Treatment Group Proportion | 0.4 |
Between-Cluster Variance | 0.005 |
Number of Matched Pairs | 12/2=6 |
Subjects per Cluster | 20 |
Power | 80% |
How matched-pair sizing works (cluster-level difference approach)
Analyze within-pair differences of cluster proportions: \(d_j = \hat p_{Tj}-\hat p_{Cj}\), \(j=1,\dots,J\).
\[ \mathrm{Var}(d_j) \;\approx\; 2\sigma_B^2(1-r)\;+\;\frac{p_C(1-p_C)}{m}\;+\;\frac{p_T(1-p_T)}{m}, \]
where \(r\) is the pairing correlation between matched clusters’ underlying risks.
\[ \mathrm{SE}(\bar d)=\sqrt{\mathrm{Var}(d_j)/J}. \]
Plugging the numbers Use \(p_C=0.20\), \(p_T=0.40\), \(m=20\), \(\sigma_B^2=0.005\). The within-cluster (binomial) pieces are
\[ \frac{0.20\times0.80}{20}=0.008,\qquad \frac{0.40\times0.60}{20}=0.012, \]
so within-cluster sum \(=0.020\). The between-cluster component depends on how strong matching is via \(r\):
If moderate pairing \(r=0.30\) (typical when matching uses good prognostic covariates): \(\;2\sigma_B^2(1-r)=0.01\times0.70=0.007\). Then \(\mathrm{Var}(d_j)\approx 0.007+0.020=0.027\), so \(\mathrm{SD}(d_j)\approx 0.164\), \(\mathrm{SE}(\bar d)=\sqrt{0.027/6}\approx 0.067\), and the non-centrality $ / /0.067 $. Using the usual z approximation (\(1.96+0.84=2.80\)) gives ≈80% power.
If no matching benefit \(r=0\) (i.e., stratified only): between-cluster term \(=0.01\), so \(\mathrm{Var}(d_j)\approx 0.031\), \(\mathrm{SE}(\bar d)\approx 0.072\), NCP \(\approx 2.78\). You would need about 7 pairs (14 practices) to reach 80% power.
If strong pairing \(r=0.50\): between-cluster term \(=0.005\), \(\mathrm{SE}(\bar d)\approx 0.066\), NCP \(\approx 3.03\) → comfortably ≥80%.
These calculations show why 12 practices (6 pairs) × 20 patients lands near 80% power under a plausible matching correlation (around 0.3–0.5). The exact power depends on \(r\) and whether you use z or small-sample t critical values (with \(df=5\), the t critical is 2.571; many planning tools still use z here and then round clusters up for safety).
What the “between-cluster variance = 0.005” implies For a binary outcome with average \(\bar p \approx 0.30\), the implied ICC is roughly
\[ \rho \approx \frac{\sigma_B^2}{\sigma_B^2+\bar p(1-\bar p)} \approx \frac{0.005}{0.005+0.21} \approx 0.023, \]
a small but non-negligible clustering effect—consistent with needing clusters (not just larger \(m\)) for power.
Practical notes
Design snapshot
Parameter | Value |
---|---|
Significance Level | 0.05 |
Control Proportion | 0.1 |
Treatment Proportion | 0.085 |
Variance Method | Null Variance |
SW Design | Complete |
Number of Clusters | 11 |
Power (%) | 90 |
Total Sample Size | 47916 |
Step 1 — Individually randomized benchmark
Ignoring clustering and the stepped-wedge structure, the total sample size to detect \(\Delta=0.015\) with \(p_C=0.10\) and \(p_T=0.085\) is
\[ N_{\text{fixed}} \;=\; \frac{2\,[z_{1-\alpha/2}+z_{1-\beta}]^2\,\bar p(1-\bar p)}{\Delta^2} \quad\text{with}\quad \bar p=\tfrac{1}{2}(p_C+p_T)=0.0925. \]
Using \(z_{0.975}=1.96\) and \(z_{0.90}=1.282\),
\[ N_{\text{fixed}} \approx \frac{2\,(1.96+1.282)^2\,(0.0925\times0.9075)}{0.015^2} \approx 15{,}700 \text{ participants (total)}. \]
This is the best-case (no clustering, no time effects).
Step 2 — Inflate for stepped-wedge clustering and time
Under the Hussey–Hughes linear mixed-model framework for a cross-sectional SW-CRT with:
the variance of the treatment effect is inflated by a factor that depends on \(\rho\) and the SW layout (number of clusters \(K\), number of periods \(P\), and how clusters are assigned to sequences). This is what the software labels the “null-variance” calculation (it plugs in \(p_0(1-p_0)\) for the residual variance of the binary outcome).
For the complete SW design in the protocol (each cluster switches once and all end up treated), with \(K=11\) and the reported \(\rho=0.0026\), the required total sample is about
\[ N_{\text{SW}} \;\approx\; \underbrace{\text{Inflation factor}}_{\approx 3.05} \times N_{\text{fixed}} \;\approx\; 3.05 \times 15{,}700 \;\approx\; \mathbf{47{,}900}, \]
That “inflation factor” bundles (i) within-cluster correlation, and (ii) the need to estimate time effects while using staggered exposure.
Step 3 — What this implies about per-period counts
Total \(N\) in a cross-sectional SW-CRT equals \(K \times P \times m\). If the rollout uses, for example, \(P=12\) periods (baseline + 11 steps), then
\[ m \approx \frac{47{,}916}{11\times 12}\approx 364 \]
women per cluster-period on average (a plausible number for national registry-scale obstetric outcomes). If \(P=11\), \(m \approx 398\). Either way, you can see why a small absolute effect (1.5%) plus period adjustment requires large per-period accrual despite a small ICC.
Why the sample is so large even with a tiny ICC
Planning notes
Start from the individually randomized \(N\) for a 1.5% absolute reduction, then apply the Hussey–Hughes stepped-wedge inflation (ICC + time effects). With \(K=11\), \(\rho=0.0026\), and a complete rollout, that yields a total of ≈47,916
Description: this tests if at least one mean is different among groups, where the groups are repeated measures (more than two) for a normally distributed variable. Repeated Measures ANOVA is the extension of the Paired T-test for more than two groups.
# library("WebPower")
# https://github.com/johnnyzhz/WebPower/blob/master/R/webpower.R
source("./03_Functions/webpower.R")
wp.rmanova(n=NULL, ng=1, nm=4, f=0.608, nscor=1, alpha=0.05, power=0.80, type=1)
## Repeated-measures ANOVA analysis
##
## n f ng nm nscor alpha power
## 30.81262 0.608 1 4 1 0.05 0.8
##
## NOTE: Power analysis for within-effect test
## URL: http://psychstat.org/rmanova
Reference: SAS for Mixed Models, Chapter 12: Power Calculations for Mixed Models
Power calculations for mixed models are more difficult due to their more complex covariance structure (Helms 1992, Stroup 2002, Tempelman 2006, Rosa, Steibel, and Tempelman 2006). Assuming the hypothesis test of interest is formed as a linear combination \(\mathbf{K}^{\prime} \boldsymbol{\beta}\), recall that our general \(t\) - and \(F\)-statistics can be written using the variance matrix \(\mathbf{K}^{\prime}\left[\mathbf{X}^{\prime} \mathbf{V}^{-1} \mathbf{X}\right]^{-1} \mathbf{K}\). So the power associated with such tests is a function of the following: - the magnitude of \(\mathbf{K}^{\prime} \boldsymbol{\beta}\), also known as the effect size - the design matrix \(\mathbf{X}\), including the number of its rows (the sample size) and the structure of its columns (from the fixed effects) - the values of the variance and covariance parameters in \(\mathbf{V}\) - the test size, commonly known as \(\alpha\), the probability of a Type I error, or one minus the specificity of the test
Several authors (Helms 1992, Stroup 2002, Tempelman 2005, Rosa, Steibel, and Tempelman 2005) have shown how to compute a simple approximation to the power of an F-test by using an observed F-statistic to compute the noncentrality parameter needed for the power calculation. Consider the mice experiment example below as a pilot study:
Obs | cage | condition | diet | gain |
---|---|---|---|---|
1 | 1 | 1 | normal | 58 |
2 | 1 | 1 | restrict | 58 |
3 | 1 | 1 | suppleme | 58 |
4 | 1 | 2 | normal | 54 |
5 | 1 | 2 | restrict | 46 |
6 | 1 | 2 | suppleme | 57 |
… | … | … | … | … |
31 | 6 | 3 | normal | 65 |
32 | 6 | 3 | restrict | 64 |
33 | 6 | 3 | suppleme | 54 |
34 | 6 | 4 | normal | 59 |
35 | 6 | 4 | restrict | 47 |
36 | 6 | 4 | suppleme | 73 |
proc mixed data=mice nobound cl;
class cage condition diet;
model gain= condition|diet / solution ddfm=satterth;
random cage cage*condition;
ods output tests3=t3;
run;
data f_power;
set t3;
Noncen = NumDF*FValue;
Alpha = 0.05;
FCrit = finv(1-Alpha,NumDF,DenDF,0);
Power = 1 - probf(FCrit,NumDF,DenDF,Noncen);
run;
proc print data=f_power;
run;
Figure: Results for Power Calculations for the Mice Experiment
Interpretation
Consider the test for the interaction between condition and diet. The power value of 0.42796 indicates that for a data set having the following characteristics, the probability of observing a CONDITION \(\times\) DIET \(F\)-test \(p\)-value less than \(\alpha=0.05\) is approximately 0.42796 : - exactly the same \(\mathbf{X}\) matrix as the present one - exactly the same \(\boldsymbol{\beta}\) vector with values equal to those in the “Solution for Fixed Effects” table - exactly the same \(\mathbf{V}\) matrix with values equal to the unbounded REML estimates as shown in the “Covariance Parameter Estimates table”
We now ask: what would the power for each \(F\)-test be if we had doubled the number of cages in the study, assuming the fixed effects and covariance parameter estimates stay the same?
Our strategy for computing these new powers is to run the same code as before but to increase the denominator degrees of freedom for each \(F\)-test by an appropriate amount. A convenient way to calculate these degrees of freedom is to run PROC MIXED on a doubled data set as follows.
data mice1;
set mice;
cage = cage + 6;
data mice2;
set mice mice1;
run;
proc mixed data=mice2 nobound cl;
class cage condition diet;
model gain=condition|diet / ddfm=satterth;
random cage cage*condition;
run;
Effect | Num DF | Den DF | F Value | Pr > F |
---|---|---|---|---|
condition | 3 | 11.9 | 13.91 | 0.0003 |
diet | 2 | 40 | 2.06 | 0.1405 |
condition*diet | 6 | 40 | 3.81 | 0.0043 |
The denominator degrees of freedom for the CONDITION main effect have increased from 4.72 to 11.9, and those for DIET and CONDITION × DIET have increased from 16 to 40.
We do not perform the power calculations using the doubled data set because the covariance parameter estimates are no longer the same as the original one. Instead, we rerun the code on the original pilot data and specify the new denominator degrees of freedom using the DDF= option as follows
proc mixed data=mice nobound cl;
class cage condition diet;
model gain = condition|diet / ddf=11.9,40,40;
random cage cage*condition;
ods output tests3=t3;
run;
data f_power2;
set t3;
Alpha = 0.05;
Noncen = NumDF*FValue;
FCrit = finv(1-Alpha,NumDF,DenDF,0);
Power = 1 - probf(FCrit,NumDF,DenDF,Noncen);
run;
proc print data=f_power2;
run;
Obs | Effect | NumDF | DenDF | FValue | ProbF | Alpha | Noncen | FCrit | Power |
---|---|---|---|---|---|---|---|---|---|
1 | condition | 3 | 11.9 | 5.16 | 0.0162 | 0.05 | 15.4900 | 3.49914 | 0.80776 |
2 | diet | 2 | 40 | 0.82 | 0.4456 | 0.05 | 1.6496 | 3.23173 | 0.18124 |
3 | condition*diet | 6 | 40 | 1.52 | 0.1956 | 0.05 | 9.1388 | 2.33585 | 0.52161 |
Interpretation
The approximate power for the CONDITION × DIET F-test has increased from 0.427 to 0.521. The power for CONDITION has increased even more, from 0.579 to 0.807, while that for DIET increased from 0.166 to only 0.181 due to its small F-statistic
An analytical approximation to power makes it feasible to quickly compute entire power curves. While several different kinds of curves can be constructed by holding all but two variables fixed, a classical power curve plots effect size along the horizontal axis and power along the vertical axis. To facilitate this kind of calculation, we move from an F-distribution to a tdistribution in order to be able to specify a range of effect sizes in the numerator of the tdistribution noncentrality parameters. The following SAS macro computes the power for an arbitrary set of ESTIMATE statements via the noncentral t-distribution. The GPLOT code following the macro plots the power curves
%macro MixedTPower;
* run proc mixed;
proc mixed data=&InData &ProcMixedOptions;
&ProcMixedStatements
ods output estimates=ests;
run;
libname OutLib "&OutPath";
* compute power curves for each difference
using a noncentral t-distribution;
data OutLib.MixedPower;
set ests;
do alpha = &AlphaValues;
do effectsize = &EffectSizes;
tcrit = tinv(1-alpha/2,df);
noncen = effectsize/stderr;
power = sum(1, -cdf("t", tcrit,df,noncen),
cdf("t",-tcrit,df,noncen));
output;
end;
end;
run;
* create JMP scripting code;
filename jslfile "&OutPath.mixedtpower.jsl";
data _null_;
file jslfile;
put "%str(data=open(%".\MixedPower.sas7bdat%");)";
put " ";
put "Overlay Plot(X( :effectsize), Y( :power),
Grouping( :Label, :alpha),";
put "Connect Thru Missing(1));";
run;
%mend MixedTPower;
%let InData = mice;
%let ProcMixedStatements=
%str(
class cage condition diet;
model gain = condition|diet / ddfm=satterth;
random cage cage*condition;
estimate "est1: condition 2-1" condition -1 1 0 0;
estimate "est2: diet restrict-normal" diet -1 1 0;
estimate "est3: condition*diet 1 restr-nor"
diet -1 1 0
condition*diet -1 1 0 0 0 0 0 0 0 0 0 0;
);
%let ProcMixedOptions = nobound;
%let AlphaValues = 0.05;
%let EffectSizes = 0 to 10 by 0.5;
%let OutPath = C:\temp\;
%MixedTPower;
data power; set outlib.mixedpower;
symbol = substr(label,4,1)+0;
run;
goptions reset=all hsize=9in vsize=8in;
symbol1 color=black value=none i=join l=1 r=1 w=2;
symbol2 color=black value=none i=join l=2 r=1 w=2;
symbol3 color=black value=none i=join l=21 r=1 w=2;
axis1 minor =none
order =(0 to 1 by 0.1)
label =(font='Arial Black' Height=1.5 angle=90 'Power')
offset=(0.3in,0.3in)
value =(font='Arial Black' Height=1.3);
axis2 minor =none
order =(-2 to 12 by 2)
label =(font='Arial Black' Height=1.5 'Effect Size')
offset=(0.3in,0.3in)
value =(font='Arial Black' Height=1.3);
legend1 across =1
cborder =black
mode =protect
position=(top left inside)
label = none
value =(font='Arial Black' Height=2
'Condition 2-1'
'Diet restrict-normal'
'Condition*diet 1 restr-nor');
proc gplot data=power;
plot power*effectsize=symbol /
noframe
vaxis=axis1 haxis=axis2
legend=legend1;
run;
quit;
This code constructs three different power curves corresponding to the main effect contrasts for CONDITION and DIET and one interaction contrast for CONDITION × DIET. In contrast to earlier examples that used the F-distribution, this macro does not use the estimate of β during the power calculations because you specify effect sizes directly. So the response variable GAIN is used only to compute values of the covariance parameters. You can alternatively use a PARMS / HOLD= statement in order to directly specify the covariance parameters, in which case the response variable is not used at all in the power calculations.
Figure: Power Curves for Selected Estimates from the Mice Experiment Pilot Study
Cluster Randomization Overviews
Donner, A., & Klar, N., 2000. Design and Analysis of Cluster Randomization Trials in Health Research. Arnold Publishers. Campbell, M.J., Walters, S.J., 2014. How to design, analyse and report cluster randomised trials in medicine and health related research. John Wiley & Sons. Hayes, R.J. and Moulton, L.H., 2017. Cluster randomised trials. Chapman and Hall/CRC. Eldridge, S. and Kerry, S., 2012. A practical guide to cluster randomised trials in health services research. John Wiley & Sons. Jung, S.H., 2024. Cluster Randomization Trials: Statistical Design and Analysis. Chapman and Hall/CRC. Murray, D.M., 1998. Design and analysis of group-randomized trials. Oxford University Press Campbell, M.K., Piaggio, G., Elbourne, D.R. and Altman, D.G., 2012. Consort 2010 statement: extension to cluster randomised trials. Bmj, 345. Campbell MJ, 2019. Cluster randomised trials. Med J Aust 210(4):154 Campbell, M.J., 2023. Cluster Randomized Trials and Stepped Wedge Trials. In Handbook of Epidemiology (pp. 1-38). New York, NY: Springer New York. Hemming, K. and Taljaard, M., 2023. Key considerations for designing, conducting and analysing a cluster randomized trial. International journal of epidemiology, 52(5), pp.1648-1658. Jaikumar, V., 2022. Cluster Randomized Trials: Concepts, Students 4 Best Evidence. Available from: https://s4be.cochrane.org/blog/2022/03/28/cluster-randomized-trials-concepts/
Issues in Cluster Randomized Designs
Donner, A. and Klar, N., 2004. Pitfalls of and controversies in cluster randomization trials. American journal of public health, 94(3), pp.416-422. Campbell MJ , 2014. Challenges of cluster randomised trials. J Comp Effect Res 3(3):271–281 Edwards, S.J., Braunholtz, D.A., Lilford, R.J. and Stevens, A.J., 1999. Ethical issues in the design and conduct of cluster randomised controlled trials. Bmj, 318(7195), pp.1407-1409. Taljaard, M., Weijer, C., Grimshaw, J.M. and Eccles, M.P., 2013. The Ottawa Statement on the ethical design and conduct of cluster randomised trials: precis for researchers and research ethics committees. Bmj, 346. Arnup, S.J., Forbes, A.B., Kahan, B.C., Morgan, K.E. and McKenzie, J.E., 2016. Appropriate statistical methods were infrequently used in cluster-randomized crossover trials. Journal of clinical epidemiology, 74, pp.40-50. Moerbeek M, 2006. Power and money in cluster randomized trials: when is it worth measuring a covariate? Stat Med 25:2607–2617 Kahan, B.C., Blette, B.S., Harhay, M.O., Halpern, S.D., Jairath, V., Copas, A. and Li, F., 2024. Demystifying estimands in cluster-randomised trials. Statistical methods in medical research, 33(7), pp.1211-1232.
Identification/Selection Bias in CRT
Puffer, S., Torgerson, D. and Watson, J., 2003. Evidence for risk of bias in cluster randomised trials: review of recent trials published in three general medical journals. Bmj, 327(7418), pp.785-789. Hahn, S., Puffer, S., Torgerson, D.J. and Watson, J., 2005. Methodological bias in cluster randomised trials. BMC medical research methodology, 5(1), p.10. Eldridge, S., Kerry, S. and Torgerson, D.J., 2009. Bias in identifying and recruiting participants in cluster randomised trials: what can be done?. Bmj, 339. Eldridge, S., Campbell, M., Campbell, M., Dahota, A., Giraudeau, B., Higgins, J., Reeves, B. and Siegfried, N., 2016. Revised Cochrane risk of bias tool for randomized trials (RoB 2.0): additional considerations for cluster-randomized trials. Cochrane Methods Cochrane Database Syst Rev, 10(suppl 1). Brierley, G., Brabyn, S., Torgerson, D. and Watson, J., 2012. Bias in recruitment to cluster randomized trials: a review of recent publications. Journal of evaluation in clinical practice, 18(4), pp.878-886. Bolzern, J., Mnyama, N., Bosanquet, K. and Torgerson, D.J., 2018. A review of cluster randomized trials found statistical evidence of selection bias. Journal of clinical epidemiology, 99, pp.106-112. Bobb, J.F., Qiu, H., Matthews, A.G., McCormack, J. and Bradley, K.A., 2020. Addressing identification bias in the design and analysis of cluster-randomized pragmatic trials: a case study. Trials, 21(1), p.289. Li, F., Tian, Z., Bobb, J., Papadogeorgou, G. and Li, F., 2022. Clarifying selection bias in cluster randomized trials. Clinical Trials, 19(1), pp.33-41.
Selecting Cluster Randomized Designs
Donner, A., & Klar, N., 1996. Statistical considerations in the design and analysis of community intervention trials. Journal of Clinical Epidemiology, 49(4), 435–439. Hemming, K., Copas, A., Forbes, A. and Kasza, J., 2024. What type of cluster randomized trial for which setting?. Journal of epidemiology and population health, 72(1), p.202195. Hemming, K., Eldridge, S., Forbes, G., Weijer, C. and Taljaard, M., 2017. How to design efficient cluster randomised trials. bmj, 358. Watson, S.I., Girling, A. and Hemming, K., 2023. Optimal study designs for cluster randomised trials: An overview of methods and results. Statistical Methods in Medical Research, 32(11), pp.2135-2157.
Matched-Pair Design
Martin DC, Diehr P, Perrin EB, Koepsell TD, 1993. The effect of matching on the power of randomized community intervention studies. Stat Med 12:329–338 Klar N, Donner A., 1997. The merits of matching in community intervention trials. Stat Med 16:1753–1764 Donner, A., Taljaard, M. and Klar, N., 2007. The merits of breaking the matches: a cautionary tale. Statistics in medicine, 26(9), pp.2036-2051. Imai, K., King, G. and Nall, C., 2009. The essential role of pair matching in cluster-randomized experiments, with application to the Mexican universal health insurance evaluation. Rhodes, W., 2014. Pairwise cluster randomization: An exposition. Evaluation Review, 38(3), pp.217-250. Chondros, P., Ukoumunne, O.C., Gunn, J.M. and Carlin, J.B., 2021. When should matching be used in the design of cluster randomized trials?. Statistics in Medicine, 40(26), pp.5765-5778.
Stepped Wedge Design
Hemming, K., Lilford, R., & Girling, A. J., 2014. Stepped-wedge cluster randomised controlled trials: a generic framework including parallel and multiple-level designs. Statistics in Medicine, 34(2), 181–196. Hemming K, Haines TP, Chilton PJ, Girling AJ, Lilford RJ., 2015. The stepped wedge cluster randomised trial: rationale, design, analysis, and reporting. BMJ. 2015 Feb 6;350:h391. Barker, D., McElduff, P., D’Este, C. and Campbell, M.J., 2016. Stepped wedge cluster randomised trials: a review of the statistical methodology used and available. BMC medical research methodology, 16(1), p.69. Hussey, M. A., & Hughes, J. P., 2007. Design and analysis of stepped wedge cluster randomized trials. Contemporary Clinical Trials, 28(2), 182–191. Copas AJ, Lewis JJ, Thompson JA et al., 2015. Designing a stepped wedge trial: three main designs, carry-over effects and randomisation approaches. Trials 16:352 Hughes, J.P., Granston, T.S. and Heagerty, P.J., 2015. Current issues in the design and analysis of stepped wedge trials. Contemporary clinical trials, 45, pp.55-60 Hemming, K., Taljaard, M., McKenzie, J.E., Hooper, R., Copas, A., Thompson, J.A., Dixon-Woods, M., Aldcroft, A., Doussau, A., Grayling, M. and Kristunas, C., 2018. Reporting of stepped wedge cluster randomised trials: extension of the CONSORT 2010 statement with explanation and elaboration. bmj, 363. Campbell MJ, Hemming K, Taljaard M., 2019. The stepped wedge cluster randomised trial: what it is and when it should be used. Med J Aust 210(6):253–254 Hooper, R., 2020. Reasons for doing a stepped wedge trial. Stepped Wedgehog. Find at: https://steppedwedgehog.blog/three-kinds-of-stepped-wedge-trial/ Hooper, R., 2020. Three kinds of stepped wedge trial. Stepped Wedgehog. Find at: https://steppedwedgehog.blog/three-kinds-of-stepped-wedge-trial/ Hooper, R. and Eldridge, S.M., 2021. Cutting edge or blunt instrument: how to decide if a stepped wedge design is right for you. BMJ quality & safety, 30(3), pp.245-250. Zhan, Z., van den Heuvel, E.R., Doornbos, P.M., Burger, H., Verberne, C.J., Wiggers, T. and de Bock, G.H., 2014. Strengths and weaknesses of a stepped wedge cluster randomized design: its application in a colorectal cancer follow-up study. Journal of clinical epidemiology, 67(4), pp.454-461. Prost A, Binik A, Abubakar I et al., 2015. Logistic, ethical, and political dimensions of stepped wedge trials: critical review and case studies. Trials 16:351 Brown CA, Lilford RJ., 2006. The stepped wedge trial design: a systematic review. BMC Med Res Methodol 6:54
Cross-over Cluster Randomized Design
Hemming, K., Taljaard, M., Weijer, C. and Forbes, A.B., 2020. Use of multiple period, cluster randomised, crossover trial designs for comparative effectiveness research. bmj, 371. McKenzie, J.E., Taljaard, M., Hemming, K., Arnup, S.J., Giraudeau, B., Eldridge, S., Hooper, R., Kahan, B.C., Li, T., Moher, D. and Turner, E.L., 2025. Reporting of cluster randomised crossover trials: extension of the CONSORT 2010 statement with explanation and elaboration. bmj, 388. Arnup, S.J., McKenzie, J.E., Hemming, K., Pilcher, D. and Forbes, A.B., 2017. Understanding the cluster randomised crossover design: a graphical illustration of the components of variation and a sample size tutorial. Trials, 18(1), p.381. Moerbeek, M., 2024. Optimal design of cluster randomized crossover trials with a continuous outcome: Optimal number of time periods and treatment switches under a fixed number of clusters or fixed budget. Behavior Research Methods, 56(8), pp.8820-8830. Hemming, K., Taljaard, M., Weijer, C. and Forbes, A.B., 2020. Use of multiple period, cluster randomised, crossover trial designs for comparative effectiveness research. bmj, 371. Hooper, R. and Bourke, L., 2015. Cluster randomised trials with repeated cross sections: alternatives to parallel group designs. bmj, 350. Turner, R.M., White, I.R. and Croudace, T., 2007. Analysis of cluster randomized cross‐over trial data: a comparison of methods. Statistics in medicine, 26(2), pp.274-289. Bellomo, R., Forbes, A., Akram, M., Bailey, M., Pilcher, D.V. and Cooper, D.J., 2013. Why we must cluster and cross over. Critical Care and Resuscitation, 15(3), pp.155-157. Kasza, J., Hemming, K., Hooper, R., Matthews, J.N.S. and Forbes, A.B., 2019. Impact of non-uniform correlation structure on sample size and power in multiple-period cluster randomised trials. Statistical methods in medical research, 28(3), pp.703-716.
Sample Size for Cluster Randomized Trials
Ahn, C., Heo, M., & Zhang, S., 2015. Sample Size Calculations for Clustered and Longitudinal Outcomes in Clinical Research. CRC Press. C. Rutterford, A. Copas, S. Eldridge, Methods for sample size determination in cluster randomized trials. International Journal of Epidemiology, 44:1051–1067, 2015. Donner A, Birkett N, Buck C (1981) Randomization by cluster: sample size requirements and analysis. Am J Epidemiol 114:906–914 Hayes, R.J. and Bennett, S., 1999. Simple sample size calculation for cluster-randomized trials. International journal of epidemiology, 28(2), pp.319-326. van Breukelen, G.J. and Candel, M.J., 2012. Calculating sample sizes for cluster randomized trials: we can keep it simple and efficient!. Journal of clinical epidemiology, 65(11), pp.1212-1218.
ICC/CV for Sample Size
Campbell, M.K., Mollison, J. and Grimshaw, J.M., 2001. Cluster trials in implementation research: estimation of intracluster correlation coefficients and sample size. Statistics in medicine, 20(3), pp.391-399. Vierron, E. and Giraudeau, B., 2009. Design effect in multicenter studies: gain or loss of power?. BMC medical research methodology, 9(1), p.39. Kerry SM, Bland JM, 2001. Unequal cluster sizes for trials in English and Welsh general practice: implications for sample size calculations. Stat Med 20:377–390 Eldridge S, Kerry S, Ashby D,. 2006. Sample size for cluster randomized trials: effect of coefficient of variation of cluster size and analysis method. Int J Epidemiol 35:1292–1300 Korevaar, E., Kasza, J., Taljaard, M., Hemming, K., Haines, T., Turner, E.L., Thompson, J.A., Hughes, J.P. and Forbes, A.B., 2021. Intra-cluster correlations from the CLustered OUtcome Dataset bank to inform the design of longitudinal cluster trials. Clinical Trials, 18(5), pp.529-540. Eldridge, S.M., Ukoumunne, O.C. and Carlin, J.B., 2009. The intra‐cluster correlation coefficient in cluster randomized trials: a review of definitions. International Statistical Review, 77(3), pp.378-394.
Sample Size for Complex CRT
Hooper R, Teerenstra S, de Hoop E et al,. 2016. Sample size calculation for stepped wedge and other longitudinal cluster randomised trials. Stat Med 35(26):4718–4728 Hemming K, Taljaard M., 2016. Sample size calculations for stepped wedge and cluster randomised trials: a unified approach. J Clin Epidemiol 69:137–146 Giraudeau, B., Ravaud, P. and Donner, A., 2008. Sample size calculation for cluster randomized cross‐over trials. Statistics in medicine, 27(27), pp.5578-5585. Hemming K, Kasza J, Hooper R, Forbes A, Taljaard M., 2020. A tutorial on sample size calculation for multiple-period cluster randomized parallel, cross-over and stepped-wedge trials using the Shiny CRT calculator. Int J Epidemiol 49(3):979–995 Li, F., Forbes, A.B., Turner, E.L. and Preisser, J.S., 2019. Power and sample size requirements for GEE analyses of cluster randomized crossover trials. Statistics in Medicine, 38(4), pp.636-649. Donner, A., 1992. Sample size requirements for stratified cluster randomization designs. Statistics in medicine, 11(6), pp.743-750.