Diagnostic Study: Design and Evaluation
Introduction to Study Phases: Explanation of the different phases of diagnostic studies in medical research:
- Phase 1: Initial discovery of the diagnostic test’s potential value.
- Phase 2: Estimation of diagnostic accuracy using known disease status, often through retrospective case-control studies.
- Phase 3: Evaluation of diagnostic accuracy in a real-world setting with unknown disease status, using prospective cohort studies.
- Phase 4: Analysis of the diagnostic test’s impact on treatment outcomes and patient health, typically through randomized controlled trials.
Example of Diagnostic Tests for CAD
Single Test Design (Standard Design)
Comparative Test Design
Note
Test Result Indicator (R): Represents the outcome of a diagnostic test for an individual patient. This indicator can be either 1 (positive result, suggesting the presence of disease) or 0 (negative result, indicating the absence of disease). It’s crucial to understand that “positive” or “negative” does not necessarily correlate with desirable or undesirable outcomes for the patient; it merely reflects the presence or absence of disease as detected by the test.
True Disease Status (D): The actual health status of a patient, established by the reference standard. Like the test result indicator, this can be 1 (the patient is truly diseased) or 0 (the patient is not diseased). The reference standard is considered the most accurate method available for diagnosing the condition in question.
These are conditional probabilities representing the diagnostic test’s accuracy:
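In the notation above, sensitivity (Se) is the probability of a positive test result in diseased patients, and specificity (Sp) is the probability of a negative result in non-diseased patients:
\[ \mathrm{Se} = P(R = 1 \mid D = 1), \qquad \mathrm{Sp} = P(R = 0 \mid D = 0) \]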
Planning of a diagnostic study: a study should be carefully planned in order to:
Develop a representative setting for the study
Define the correct study objective and study population
Assess the diagnostic accuracy appropriately
Avoid bias
Draw the correct conclusion
Objective: Clearly articulate what the study aims to achieve, whether it’s to evaluate the diagnostic accuracy of a new test, compare it against an existing standard, or understand its utility in a specific clinical scenario.
Population Selection: The study population should closely resemble the patient group in whom the test will be used in practice. This involves considering factors like age, gender, disease severity, and comorbid conditions to ensure the sample is representative.
Accuracy Measures: Define which measures of diagnostic accuracy (e.g., sensitivity, specificity, positive predictive value, negative predictive value) are most relevant to your study objectives. These measures should reflect the test’s ability to correctly diagnose or rule out the disease.
Appropriate Methods: Employ statistical methods and study designs that are best suited to measure these accuracy parameters reliably.
Identification and Mitigation: Recognize potential sources of bias in your study design, such as selection bias, information bias, and observer bias. Implement strategies to minimize their impact, for example, through blinding, randomization (where applicable), and standardized test procedures.
Data Interpretation: Analyze the data with an understanding of the study’s limitations and the context of the wider body of evidence.
Real-world Application: Ensure that the conclusions drawn from the study are applicable to the real-world clinical setting, taking into account the practical aspects of test implementation, such as cost, accessibility, and patient preferences.
Note
PICO is a set of specific questions used to define the research question. An estimand is a precise description of the test accuracy that reflects the clinical study objective.
QUADAS-2 is an important tool for assessing the quality of diagnostic accuracy studies and, at the same time, for planning a study without bias. It introduces four domains for the risk-of-bias and applicability assessment: Patient Selection, Index Test, Reference Standard, and Flow and Timing.
The PICO framework (Patient, Intervention, Comparison, Outcome) is traditionally used in treatment studies to formulate research questions and guide study design in evidence-based medicine. Adapting this framework to diagnostic studies involves a slight modification to align with the objectives of diagnostic accuracy research:
Diagnostic PICO Framework Components
Population (P): The group of patients for whom the diagnostic test is intended. In the EVASCAN example, this refers to symptomatic patients with clinical indications for coronary imaging, ensuring that the study focuses on a relevant and clearly defined patient group.
Index Test (I): The diagnostic test under evaluation, which in the context of the EVASCAN study is the Computed Tomographic Coronary Angiography (CTCA). This clarity ensures that all study efforts are centered around assessing this specific test’s accuracy and utility.
Comparator/Comparison (C): This could be either a standard reference test (gold standard) against which the new test is compared or another novel test serving as a comparator. The choice between using a comparator test or a reference standard depends on the study’s objectives and the availability of an established diagnostic method for the condition in question.
Outcome (O): The outcomes in diagnostic studies are measures of test accuracy, including sensitivity, specificity, positive and negative predictive values, and positive and negative likelihood ratios. These measures provide a comprehensive view of the test’s performance in identifying the presence or absence of disease.
The estimand framework goes a step further by defining specific attributes that need to be considered in diagnostic accuracy studies:
Target Population
Index Test
Condition
Accuracy Measure
Strategies for Interfering Events
The QUADAS framework, specifically its second version introduced in 2011, provides a structured approach for evaluating the risk of bias and applicability in diagnostic accuracy studies. By examining studies through the lens of four key domains (patient selection, index test, reference standard, and flow and timing), QUADAS-2 allows for a comprehensive assessment of study quality; each domain can be applied to a concrete study such as EVASCAN.
Before sample size calculation, the study’s hypothesis and test design must be clearly defined, as they are foundational for determining the required sample size. The “single test design” approach compares the index test against predefined minimum thresholds for sensitivity and specificity. Because sensitivity and specificity pertain to distinct populations (diseased vs. non-diseased), they are regarded as independent co-primary endpoints.
The combination of these endpoints into a global hypothesis is achieved through an “intersection-union test,” which posits a global null hypothesis integrating two individual hypotheses concerning sensitivity and specificity. This global hypothesis can only be rejected if both individual hypotheses (pertaining to sensitivity and specificity) are simultaneously rejected, ensuring a comprehensive evaluation of the test’s diagnostic accuracy.
The intersection-union test framework allows for a more nuanced and rigorous approach to sample size calculation. It requires the study to meet predefined benchmarks for both sensitivity and specificity, considering them in tandem rather than in isolation. This approach ensures that the diagnostic test is adequately evaluated for its ability to correctly identify both diseased and non-diseased individuals, which is crucial for its application in clinical settings.
NOTE: Sensitivity and specificity are independent co-primary endpoints
This conventional sample size calculation is designed to ensure that the study is adequately powered to detect the expected performance of the experimental test at the specified minimum thresholds for sensitivity and specificity. The final sample size is determined by the larger of the two sample sizes calculated for sensitivity and specificity, ensuring that the study has sufficient power to test both measures effectively.
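As a rough illustration, the following R sketch performs this conventional calculation using a normal approximation; the thresholds, expected accuracies, prevalence, one-sided significance level, and power are assumed example values, not those of any particular study.

# Approximate sample size for one co-primary endpoint (one-sided test of a proportion)
n_coprimary <- function(p0, p1, alpha = 0.025, beta = 0.20) {
  z_a <- qnorm(1 - alpha); z_b <- qnorm(1 - beta)
  ceiling((z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1)))^2 / (p1 - p0)^2)
}
se_min <- 0.80; se_exp <- 0.90   # assumed minimum threshold and expected sensitivity
sp_min <- 0.70; sp_exp <- 0.80   # assumed minimum threshold and expected specificity
prev   <- 0.30                   # assumed disease prevalence
n_se <- n_coprimary(se_min, se_exp)   # required diseased patients
n_sp <- n_coprimary(sp_min, sp_exp)   # required non-diseased patients
# Conventional total: scale each requirement by the prevalence and take the maximum
N_total <- max(ceiling(n_se / prev), ceiling(n_sp / (1 - prev)))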
Problem: this conventional approach can lead to an overpowered study, as the following example calculation with fixed parameters illustrates:
The conventional approach might suggest a larger sample size (e.g., 1122 individuals), which could be significantly reduced by adopting the optimal approach without compromising the study’s power. This efficiency is further illustrated through empirical data, showing that the optimal approach maintains close to target power across varying prevalence rates, while the conventional method might result in overpowered studies.
The concepts of sample size re-estimation and interim analysis play a critical role in the design and execution of clinical trials. These processes allow researchers to adjust their study based on preliminary data, ensuring the trial remains effective and efficient. There are two main types of interim analysis: blinded and unblinded. Each has its specific context, potential adjustments, and implications for statistical integrity.
Blinded Interim Analysis
In a blinded interim analysis, the individuals analyzing the data do not see the combination of index test and reference standard results for individual participants (the diagnostic analogue of not knowing who received treatment versus control). Since no accuracy estimates can be derived from such data, no adjustment of the Type I error rate is needed. The primary goal is to assess data quality and trial conduct without bias.
Possible adjustments during a blinded interim analysis may include re-estimating nuisance parameters, such as the disease prevalence, and adapting the overall sample size accordingly.
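A minimal sketch of blinded sample size re-estimation based on the observed prevalence, in the spirit of Stark and Zapf (2020), reusing the hypothetical n_coprimary() helper and example parameters from the earlier sketch; the interim counts are assumed values.

# Only the prevalence is re-estimated at the interim; the per-group
# requirements n_se and n_sp from the planning stage stay fixed.
n_interim  <- 200                 # assumed patients verified so far
n_diseased <- 48                  # assumed reference-standard positives
prev_hat   <- n_diseased / n_interim
N_reestimated <- max(ceiling(n_se / prev_hat), ceiling(n_sp / (1 - prev_hat)))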
Unblinded Interim Analysis
In contrast, an unblinded interim analysis reviews the data with full knowledge of the combination of index test and reference standard results, so the accumulating accuracy estimates can be examined directly. This allows a more detailed look at the efficacy and safety data but requires careful handling to preserve the integrity of the trial.
Because the unblinded interim analysis can influence future decisions about the trial, including potentially stopping the trial early for efficacy or safety reasons, it’s crucial to adjust the Type I error rate. This adjustment helps maintain the statistical validity of the study’s conclusions despite the interim peek at the data.
Possible adjustments during an unblinded interim analysis include stopping the trial early for efficacy or futility and re-estimating the sample size based on the observed accuracy, in each case with an appropriate adjustment of the Type I error rate (for example, via group-sequential methods).
Endpoints and Hypotheses
Co-Primary Endpoints (Single Test Design): both null hypotheses have to be rejected \[ H_{0,\text{global}}: \; H_{0,\mathrm{Se}}: \mathrm{Se}_{E} \leq \mathrm{Se}_{\min} \;\; \cup \;\; H_{0,\mathrm{Sp}}: \mathrm{Sp}_{E} \leq \mathrm{Sp}_{\min} \]
Sensitivity and specificity are the main accuracy measures.
Predictive values are very relevant from the patient’s perspective, but they depend on the prevalence.
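By Bayes’ theorem, the predictive values can be written in terms of sensitivity, specificity, and prevalence \(\pi\), which makes the prevalence dependence explicit:
\[ \mathrm{PPV} = \frac{\mathrm{Se} \cdot \pi}{\mathrm{Se} \cdot \pi + (1-\mathrm{Sp})(1-\pi)}, \qquad \mathrm{NPV} = \frac{\mathrm{Sp} \cdot (1-\pi)}{\mathrm{Sp} \cdot (1-\pi) + (1-\mathrm{Se}) \cdot \pi} \]
For example, with \(\mathrm{Se} = \mathrm{Sp} = 0.9\), the PPV is 0.90 at a prevalence of 50% but only 0.50 at a prevalence of 10%.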
A study report should include not only positive and negative test results but also inconclusive ones, and the frequency and handling of each category should be stated.
Youden index = sensitivity + specificity - 1
Confidence intervals
Two-sided \((1-\alpha)\)-confidence interval for a single proportion: \[ \left[\hat{p} \pm z_{1-\alpha / 2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right] \]
Hypotheses
\[ H_0: p \leq p_0 \] vs. \[ H_1: p>p_0 \]
Test statistic (Wald test): \[ T=\frac{\hat{p}-p_0}{\sqrt{\frac{\hat{p} \cdot(1-\hat{p})}{n}}} \stackrel{H_0}{\sim} N(0,1) \]
Test decision: reject \(H_0\), if \(T>z_{1-\alpha / 2}\)
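The following R sketch implements this confidence interval and Wald test; the counts and the benchmark p0 are assumed example values.

# Wald confidence interval and one-sided Wald test for a single proportion
x <- 85; n <- 100                # assumed: 85 true positives among 100 diseased
p0 <- 0.75                       # assumed minimum acceptable sensitivity
alpha <- 0.05
p_hat <- x / n
se_hat <- sqrt(p_hat * (1 - p_hat) / n)
ci <- p_hat + c(-1, 1) * qnorm(1 - alpha / 2) * se_hat   # two-sided Wald CI
T <- (p_hat - p0) / se_hat                               # Wald test statistic
reject <- T > qnorm(1 - alpha / 2)                       # test decision as above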
In order to test comparative designs, we must look at the difference between estimates for the index test and the comparator test \((\Delta)\).
Hypotheses comparative design \(\quad H_0: \Delta \leq \Delta_0 \quad\) vs. \(\quad H_1: \Delta>\Delta_0 \quad\left(\Delta=p_I-p_C\right)\)
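A minimal sketch of the corresponding Wald test for a comparative design, assuming for simplicity that the index and comparator tests are evaluated in independent groups; in a paired design, where both tests are applied to the same patients, the variance additionally requires a covariance term (cf. Connor 1987). All counts and the margin are assumed example values.

# Wald test for H0: Delta <= Delta0 vs. H1: Delta > Delta0, with Delta = p_I - p_C
x_I <- 88; n_I <- 100            # assumed index test: 88 correct among 100
x_C <- 80; n_C <- 100            # assumed comparator: 80 correct among 100
Delta0 <- 0                      # assumed margin
p_I <- x_I / n_I; p_C <- x_C / n_C
se_diff <- sqrt(p_I * (1 - p_I) / n_I + p_C * (1 - p_C) / n_C)
T <- (p_I - p_C - Delta0) / se_diff
reject <- T > qnorm(1 - 0.05 / 2)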
Guidance for missing values in the reference test
Is accuracy sufficient for answering the question?
- If yes: is there a single reference standard providing adequate classification?
  - If verification is possible and no mechanisms leading to incomplete verification are observed, a classic diagnostic accuracy study is suggested.
  - If there are mechanisms leading to incomplete verification, imputing missing reference standard outcomes is considered.
- If no: is there external data on the degree of imperfection of the reference standard?
  - If there is external data, the use of a second reference standard or an alternative reference standard in unverified patients might be helpful.
  - If there is no external data: can multiple tests provide adequate classification?
    - If multiple tests cannot provide adequate classification, other types of evaluation, such as validation studies, are considered.
    - If multiple tests can provide adequate classification and there is consensus on a predefined rule to define the target condition, various methods are considered, such as correction methods for an imperfect reference standard, a composite reference standard, panel diagnosis, or latent class analysis.
Clustered data are commonly evaluated by reducing the observations to the patient level: the observations of a patient are reduced to the single most conspicuous one. This often results in overestimation of sensitivity and underestimation of specificity.
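A minimal R sketch of this patient-level reduction on a hypothetical long-format data set with one row per lesion; the variable names are assumptions for illustration.

# Hypothetical lesion-level data: several observations per patient
lesions <- data.frame(
  patient = c(1, 1, 1, 2, 2, 3),
  test    = c(0, 1, 0, 0, 0, 1)   # 1 = lesion flagged positive by the index test
)
# Most conspicuous observation per patient:
# a patient is test-positive if any of their lesions is positive
patient_level <- aggregate(test ~ patient, data = lesions, FUN = max)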
Covariates: These are variables that are analogous to independent variables in normal regression models. Covariates can influence the outcome of interest and are important to account for in statistical analyses to avoid confounding.
Clustered Data: This occurs when participants in a study are observed multiple times, resulting in repeated measures. Clustered data require special statistical approaches to account for the lack of independence among observations from the same individual.
Factorial Designs: This refers to studies that are structured such that patients are evaluated under systematically different conditions. Factorial designs can assess the effects of multiple interventions simultaneously and can also be used to investigate interactions between factors.
library("pROC")
data(aSAH)
ROC = roc(aSAH$outcome, aSAH$s100b,levels=c("Good", "Poor"))
plot.roc(ROC)
AUC = auc(ROC)
AUC.CI = ci.auc(ROC,method="delong")
AUC
## Area under the curve: 0.7314
AUC.CI
## 95% CI: 0.6301-0.8326 (DeLong)
ROC2 = roc(aSAH$outcome, aSAH$wfns,levels=c("Good", "Poor"))
plot.roc(ROC2,add=TRUE)
roc.test(ROC,ROC2)
##
## DeLong's test for two correlated ROC curves
##
## data: ROC and ROC2
## Z = -2.209, p-value = 0.02718
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
## -0.17421442 -0.01040618
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.7313686 0.8236789
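The cutoff that maximizes the Youden index defined above, together with the corresponding sensitivity and specificity, can be obtained from the same ROC object with pROC's coords():

# Threshold maximizing the Youden index for the S100B curve
coords(ROC, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))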
library(epiR)
# epi.tests() needs a 2x2 table (rows: test +/-, columns: disease +/-, positives first),
# so s100b is dichotomized at an assumed example cutoff of 0.5 (not a validated threshold)
dct = table(factor(aSAH$s100b >= 0.5, levels=c(TRUE, FALSE), labels=c("pos", "neg")),
            factor(aSAH$outcome, levels=c("Poor", "Good")))
epi.tests(dct, conf.level = 0.95)
## epi.tests() reports point estimates and exact 95% CIs for apparent and true
## prevalence, sensitivity, specificity, predictive values, likelihood ratios,
## and the correctly classified proportion; the values depend on the chosen cutoff.
ods graphics on;
/* Fit and compare ROC curves for s100b and wfns with PROC LOGISTIC */
proc logistic data=roc plots=roc(id=prob);
  class outcome wfns;
  model outcome(event='Poor') = s100b wfns / nofit;   /* nofit: ROC comparison only */
  roc 's100b' s100b;                    /* ROC curve for the biomarker */
  roc 'wfns' wfns;                      /* ROC curve for the clinical score */
  roccontrast reference / estimate e;   /* contrast of the curves vs. the reference */
run;
ods graphics off;
/* Exact binomial CI for sensitivity: restrict to diseased patients (outcome="Poor");
   use outcome="Good" instead for specificity. s100b_d is the dichotomized marker. */
proc freq data=ROC;
  where outcome="Poor"; * where outcome="Good";
  tables s100b_d / binomial(level="1");
  exact binomial;
run;
European Medicines Agency. CPMP/EWP/1119/98/Rev. 1 - Guideline on clinical evaluation of diagnostic agents. Retrieved November 30, 2023, from https://www.ema.europa.eu/en/documents/scientific-guideline/guideline-clinical-evaluation-diagnostic-agents_en.pdf.
Zapf A, Stark M, Gerke O, Ehret C, Benda N, Bossuyt P, Deeks J, Reitsma J, Alonzo T, Friede T. Adaptive trial designs in diagnostic accuracy research. Stat Med. 2020;39(5):591-601.
Healio. The death of the stress test and the rise of the coronary CT angiogram. 2021. https://www.healio.com/news/cardiology/20210715/the-death-of-stress-test-and-the-rise-of-the-coronary-ct-angiogram
Zhou XH, Obuchowski NA, McClish DK. Statistical Methods in Diagnostic Medicine. John Wiley & Sons; 2011. doi:10.1002/9780470906514
Whiting PF, Rutjes AW, Westwood ME, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155(8):529-536. doi:10.7326/0003-4819-155-8-201110180-00009
Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press; 2004.
Stahlmann K, Reitsma JB, Zapf A. Missing values and inconclusive results in diagnostic studies - A scoping review of methods. Stat Methods Med Res. 2023;32(9):1842-1855. doi:10.1177/09622802231192954
Ibrahim JG, Chen M, Gwon Y, et al. The power prior: theory and applications. Stat Med. 2015;34:3724-3749.
Neuenschwander B, Capkun-Niggli G, Branson M, et al. Summarizing historical information on controls in clinical trials. Clin Trials. 2010;7:5-18.
Beam CA. Analysis of clustered data in receiver operating characteristic studies. Stat Methods Med Res. 1998;7. doi:10.1177/096228029800700402
Fleiss JL, Levin B, Paik MC. Statistical Methods for Rates and Proportions. Wiley; 2003. doi:10.1002/0471445428
Gatsonis CA. Random-effects models for diagnostic accuracy data. Acad Radiol. 1995;2 Suppl 1:S14-21.
Hajian-Tilaki KO, Hanley JA, Joseph L, Collet JP. Extension of receiver operating characteristic analysis to data concerning multiple signal detection tasks. Acad Radiol. 1997;4. doi:10.1016/S1076-6332(05)80295-8
Hellmich M, Abrams KR, Jones DR, Lambert PC. A Bayesian approach to a general regression model for ROC curves. Med Decis Making. 1998;18. doi:10.1177/0272989X9801800412
Leisenring W, Pepe MS, Longton G. A marginal regression modelling framework for evaluating medical diagnostic tests. Stat Med. 1997;16. doi:10.1002/(SICI)1097-0258(19970615)16:11<1263::AID-SIM550>3.0.CO;2-M
Michael H, Tian L, Ghebremichael M. The ROC curve for regularly measured longitudinal biomarkers. Biostatistics. 2019;20. doi:10.1093/biostatistics/kxy010
Peng F, Hall WJ. Bayesian analysis of ROC curves using Markov-chain Monte Carlo methods. Med Decis Making. 1996;16. doi:10.1177/0272989X9601600411
Rao JN, Scott AJ. A simple method for the analysis of clustered binary data. Biometrics. 1992;48:577-585.
Tang LL, Zhang W, Li Q, Ye X, Chan L. Least squares regression methods for clustered ROC data with discrete covariates: Least squares methods for clustered ROC data. Biom J. 2016;58. doi:10.1002/bimj.201500099
Williams RL. A note on robust variance estimation for cluster-correlated data. Biometrics. 2000;56. doi:10.1111/j.0006-341X.2000.00645.x
Withanage N, de Leon AR, Rudnisky CJ. Joint estimation of disease-specific sensitivities and specificities in reader-based multi-disease diagnostic studies of paired organs. J Appl Stat. 2014;41. doi:10.1080/02664763.2014.909790
Zwinderman AH, Glas AS, Bossuyt PM, Florie J, Bipat S, Stoker J. Statistical models for quantifying diagnostic accuracy with multiple lesions per patient. Biostatistics. 2008;9. doi:10.1093/biostatistics/kxm052
Emir B, Wieand S, Jung SH, Ying Z. Comparison of diagnostic markers with repeated measurements: a nonparametric ROC curve approach. Stat Med. 2000;19. doi:10.1002/(sici)1097-0258(20000229)19:4<511::aid-sim353>3.0.co;2-3
Konietschke F, Harrar SW, Lange K, Brunner E. Ranking procedures for matched pairs with missing data — Asymptotic theory and a small sample approximation. Comput Stat Data Anal. 2012;56. doi:10.1016/j.csda.2011.03.022
Li G, Zhou K. A unified approach to nonparametric comparison of receiver operating characteristic curves for longitudinal and clustered data. J Am Stat Assoc. 2008;103. doi:10.1198/016214508000000364
Obuchowski NA. Nonparametric analysis of clustered ROC curve data. Biometrics. 1997;53:567-578.
Smith PJ, Thompson TJ, Engelgau MM, Herman WH. A generalized linear model for analysing receiver operating characteristic curves. Stat Med. 1996;15. doi:10.1002/(SICI)1097-0258(19960215)15:3<323::AID-SIM159>3.0.CO;2-A
Tang LL, Liu A, Chen Z, Schisterman EF, Zhang B, Miao Z. Nonparametric ROC summary statistics for correlated diagnostic marker data. Stat Med. 2013;32. doi:10.1002/sim.5654
Toledano AY, Gatsonis C. Ordinal regression methodology for ROC curves derived from correlated data. Stat Med. 1996;15. doi:10.1002/(SICI)1097-0258(19960830)15:16<1807::AID-SIM333>3.0.CO;2-U
Werner C, Brunner E. Rank methods for the analysis of clustered data in diagnostic trials. Comput Stat Data Anal. 2007;51. doi:10.1016/j.csda.2006.05.023
Wu Y. Optimal nonparametric estimator of the area under ROC curve based on clustered data. Commun Stat - Theory Methods. 2020;49. doi:10.1080/03610926.2018.1563176
Zou KH. Comparison of correlated receiver operating characteristic curves derived from repeated diagnostic test data. Acad Radiol. 2001;8. doi:10.1016/S1076-6332(03)80531-7
Emir B, Wieand S, Su JQ, Cha S. Analysis of repeated markers used to predict progression of cancer. Stat Med. 1998;17. doi:10.1002/(sici)1097-0258(19981130)17:22<2563::aid-sim952>3.0.co;2-o
Lim Y. A GEE approach to estimating accuracy and its confidence intervals for correlated data. Pharm Stat. 2020;19. doi:10.1002/pst.1970
Smith PJ, Hadgu A. Sensitivity and specificity for correlated observations. Stat Med. 1992;11. doi:10.1002/sim.4780111108
Sternberg MR, Hadgu A. A GEE approach to estimating sensitivity and specificity and coverage properties of the confidence intervals. Stat Med. 2001;20. doi:10.1002/sim.688
Connor RJ. Sample size for testing differences in proportions for the paired-sample design. Biometrics. 1987;43(1):207-211. doi:10.2307/2531961
Stark M, Hesse M, Brannath W, Zapf A. Blinded sample size re-estimation in a comparative diagnostic accuracy study. BMC Med Res Methodol. 2022;22(1):1-12. doi:10.1186/s12874-022-01564-2
Stark M, Zapf A. Sample size calculation and re-estimation based on the prevalence in a single-arm confirmatory diagnostic accuracy study. Stat Methods Med Res. 2020;29. doi:10.1177/0962280220913588