Diagnostic performance of ultrasound risk stratification systems on thyroid nodules cytologically classified as indeterminate: a systematic review and meta-analysis
Abstract
Purpose
Ultrasound (US) risk stratification systems (RSSs) are increasingly being utilized for the optimal management of thyroid nodules, including those with indeterminate cytology. The goal of this study was to evaluate the category-based diagnostic performance of US RSSs in identifying malignancy in indeterminate nodules.
Methods
This systematic review and meta-analysis was registered on PROSPERO (CRD42021266195). PubMed, EMBASE, and Web of Science were searched through December 1, 2022. Original articles reporting data on the performance of US RSSs for indeterminate nodules were included. The numbers of nodules classified as true negative, true positive, false negative, and false positive were extracted.
Results
Thirty-three studies evaluating 7,225 indeterminate thyroid nodules were included. The diagnostic accuracy was quantitatively synthesized using a Bayesian bivariate model based on the integrated nested Laplace approximation in R. For the intermediate- to high-risk category, the sensitivity levels of the American College of Radiology, the American Thyroid Association, the European Thyroid Association, the Korean Thyroid Association/Korean Society of Thyroid Radiology, and Kwak et al. were found to be 0.80, 0.72, 0.76, 0.96, and 0.97, respectively. The corresponding specificity measurements were 0.36, 0.50, 0.49, 0.28, and 0.17. Furthermore, for the high-risk category, the sensitivity values were 0.40, 0.46, 0.55, 0.47, and 0.10, while the specificity levels were 0.91, 0.90, 0.71, 0.91, and 0.99, respectively.
Conclusion
The overall diagnostic performance of the US RSSs was moderate in the differentiation of indeterminate nodules.
Introduction
Cytologically indeterminate thyroid nodules present a consistent challenge in medical management. Currently, the most effective method for determining which nodules require surgical intervention is the fine-needle aspiration biopsy (FNAB). However, cytological results remain indeterminate for 17%-23% of all nodules [1]. The introduction of the six-tiered Bethesda System for Reporting Thyroid Cytopathology (BSRTC) has been helpful in categorizing these results. This system divides indeterminate cytological results into three of the six categories: III (atypia of undetermined significance or follicular lesion of undetermined significance [AUS/FLUS]), IV (follicular neoplasm or suspicious for follicular neoplasm [FN/SFN]), and V (suspicious for malignancy [SM]). These categories correspond to malignancy rates of 5%-15%, 15%-30%, and 60%-75%, respectively [2,3].
The BSRTC and the American Thyroid Association (ATA) guidelines recommend repeated FNAB or diagnostic thyroidectomy for indeterminate thyroid nodules [4,5]. However, the best approach for managing these nodules remains a topic of debate, with options ranging from active surveillance and repeated FNAB to core-needle biopsy (CNB), molecular testing, and diagnostic thyroidectomy. The challenge lies in striking a balance between underestimating and undertreating thyroid cancer, and overtreating nodules that are ultimately diagnosed as benign following histological analysis [6,7]. Therefore, it is prudent to establish predictors that help identify nodules that unequivocally require surgical intervention [8].
Among the diagnostic tools widely available, ultrasound (US) is often the first to be utilized in determining the next steps in such cases. US risk stratification systems (RSSs), more commonly known as thyroid imaging reporting and data systems (TIRADS), have been developed to enhance the selection process of thyroid lesions that necessitate further FNAB or active surveillance [9]. Each category within the US RSS is associated with an escalating likelihood of malignancy, thus warranting more aggressive clinical management [10]. Presently, numerous US RSSs are included in the available guidelines, and several studies have been conducted to evaluate the diagnostic performance of US RSSs on indeterminate nodules. Consequently, the present study was performed to consolidate the diagnostic performance of various US RSSs in detecting thyroid cancer within indeterminate nodules.
Materials and Methods
This systematic review and meta-analysis was registered with PROSPERO (registration number CRD42021266195) and adheres to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) extension statement for diagnostic test accuracy [11].
Literature Search
A literature search was conducted across the PubMed, EMBASE, and Web of Science databases through December 1, 2022. The search terms were as follows: thyroid AND (indeterminate OR undetermined OR suspicious OR Bethesda) AND ((thyroid imaging reporting and data system) OR TIRADS OR TI-RADS OR stratification OR classification). The search was restricted to publications in English, but no limitations were implemented based on publication date or whether the studies involved humans or animals.
Inclusion Criteria
First, studies or their subsets that reported data on any US RSS according to the following guidelines were eligible for inclusion: the American Association of Clinical Endocrinologists/American College of Endocrinology/Associazione Medici Endocrinologi (AACE/ACE/AME US RSS) [12], the American College of Radiology (ACR-TIRADS) [13], the American Thyroid Association (ATA US RSS) [4], the British Thyroid Association (BTA US RSS) [14], the European Thyroid Association (EU-TIRADS) [15], the French-TIRADS [16], the TIRADS by Horvath et al. [9], the Korean Society of Thyroid Radiology (K-TIRADS) [17], and the TIRADS by Kwak et al. [18]. These were used as diagnostic criteria for malignant thyroid nodules among patients with a previous indeterminate FNAB report. Second, only indeterminate nodules with at least a surgical pathology result were included in the meta-analysis. Exclusion criteria were (1) articles not relevant to the subject of this review; (2) review articles, editorials or letters, comments, and conference proceedings; (3) case reports or case series; and (4) articles not written in English.
Data Extraction
One investigator extracted descriptive data, which were then verified by another researcher. These descriptive data encompassed the study and test characteristics. Two reviewers independently gathered the numerical data. Any discrepancies in the data extraction were resolved through consensus. If the data could not be extracted, the authors of the original study were contacted to request additional data.
Quality Assessment
Two reviewers independently evaluated the risk of bias and potential applicability issues using the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool [19]. Discrepancies were resolved through consensus.
Data Synthesis
For each study, two-by-two tables were created, with the results demonstrating the highest performance selected if different radiologists separately evaluated the diagnostic performance. The criteria for positive test results were defined as either intermediate to high risk (category 4 or 5) or high risk (category 5). A Bayesian bivariate model of diagnostic test studies was implemented, utilizing the integrated nested Laplace approximation (INLA). This model provided accurate posterior marginal distributions for sensitivity and specificity, along with all hyperparameters, without the need for Markov chain Monte Carlo sampling [20]. Additionally, univariate estimates of sensitivity and specificity, complete with 95% credible intervals (CrIs), were made available for interpretation. The summary receiver operating characteristic curve was also provided. The area under the receiver operating characteristic curve (AUC) values, accompanied by 95% CrIs, were combined. Summary positive and negative likelihood ratios (LR+s and LR-s, respectively) were calculated from the summary sensitivity and specificity estimates. Four parameterizations of the Bayesian bivariate model were fitted. In model 1, sensitivity and specificity were modeled jointly. In models 2, 3, and 4, the jointly modeled pairs were sensitivity and false-positive rate (1-specificity), false-negative rate (1-sensitivity) and specificity, and false-negative rate (1-sensitivity) and false-positive rate (1-specificity), respectively. Model selection was guided by the deviance information criterion (DIC), with a lower DIC indicating a better model fit. To test for publication bias, a Deeks funnel plot was constructed, and statistical significance was assessed using the Deeks asymmetry test. Subgroup analyses were performed according to indeterminate classifications (AUS/FLUS, FN/SFN, and SM).
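As an illustration of the quantities that enter the bivariate model, the following minimal sketch derives sensitivity and specificity from a single two-by-two table and lists the pair of rates modeled jointly under each of the four parameterizations. It relies on the standard definitions (false-negative rate = 1 - sensitivity, false-positive rate = 1 - specificity). The class name `TwoByTwo`, the method `outcome_pair`, and the counts are hypothetical and are not taken from any included study or from the R packages used in the analysis.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from one study at one positivity threshold (hypothetical helper)."""
    tp: int  # malignant nodules classified as positive
    fp: int  # benign nodules classified as positive
    fn: int  # malignant nodules classified as negative
    tn: int  # benign nodules classified as negative

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def outcome_pair(self, model: int) -> Tuple[float, float]:
        """Return the pair of rates modeled jointly under parameterization 1-4."""
        fnr = 1.0 - self.sensitivity  # false-negative rate
        fpr = 1.0 - self.specificity  # false-positive rate
        pairs = {
            1: (self.sensitivity, self.specificity),
            2: (self.sensitivity, fpr),
            3: (fnr, self.specificity),
            4: (fnr, fpr),
        }
        return pairs[model]


# Hypothetical counts, for illustration only.
study = TwoByTwo(tp=40, fp=30, fn=10, tn=20)
print(study.sensitivity)  # 0.8
print(study.specificity)  # 0.4
```

The four parameterizations describe the same data; they differ only in which complementary rates are placed in the bivariate model, which is why a fit criterion such as the DIC is needed to choose among them.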
The bivariate meta-regression model considered the following variables: study design (prospective vs. retrospective), sample size (cutoff at 140, which was the median sample size of the included studies), proportion of malignancy (cutoff at 35%, which was the median value of the proportions reported by the included studies), and study location (East Asia vs. other countries). All analyses were conducted using R software ver. 4.0.5 (R Foundation for Statistical Computing, Vienna, Austria; https://www.r-project.org) and the R packages meta4diag 2.0.8 and INLA 21.02.23. A P-value of <0.05 was considered to indicate statistical significance.
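The median-based dichotomization of a continuous covariate can be sketched as follows. The function name `dichotomize_at_median` and the list of sample sizes are hypothetical, and whether a value equal to the cutoff falls in the upper or lower group is an assumption here, since the text does not specify it.

```python
from statistics import median
from typing import List, Tuple


def dichotomize_at_median(values: List[float]) -> Tuple[float, List[bool]]:
    """Split values at their median; True marks the upper group.

    Assumption: values equal to the cutoff are assigned to the upper group.
    """
    cutoff = median(values)
    return cutoff, [v >= cutoff for v in values]


# Hypothetical per-study sample sizes, for illustration only.
sample_sizes = [17, 85, 120, 160, 310, 683]
cutoff, is_large = dichotomize_at_median(sample_sizes)
print(cutoff)    # 140.0
print(is_large)  # [False, False, False, True, True, True]
```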
Results
Literature Search
The study screening procedure is depicted in a PRISMA 2020 flow diagram (Fig. 1). In total, 968 records were identified from PubMed, EMBASE, and Web of Science, with an additional six articles retrieved from other sources. Following the selection process, 33 articles were included in the meta-analysis [6,10,21-51].
Study Characteristics
Table 1 and Supplementary Table 1 display the characteristics and two-by-two data of the included articles, respectively. Of the 33 studies, six were prospective in design [23,25,32,33,37,38], and only two were multicenter studies [38,49]. These studies were published between 2015 and 2021, with the number of evaluated indeterminate nodules ranging from 17 to 683. Two studies assessed the AACE/ACE/AME US RSS, 15 evaluated the ACR-TIRADS, 10 examined the ATA US RSS, one looked at the BTA US RSS, five studied the EU-TIRADS, two analyzed the French-TIRADS, two investigated the TIRADS described by Horvath et al., seven explored the K-TIRADS, and nine scrutinized the TIRADS delineated by Kwak et al. The prevalence of malignant indeterminate nodules in each study ranged from 15.4% to 80.8%. Most of the studies reported surgical pathology as the reference standard for malignant and benign diagnosis, with the exception of six studies that added repeated FNAB, CNB, or follow-up for reference [30,31,33,47,49,50]. In total, this review included 2,662 malignant and 4,563 benign nodules.
Quality Assessment
The results of the quality assessment using the QUADAS-2 tool are depicted in Fig. 2. Generally, all studies achieved the required quality standards. However, some were identified as having unclear or high risk of bias or concerns regarding applicability.
Diagnostic Performance of the US RSSs
Fig. 3 summarizes the estimates of diagnostic performance, considering intermediate to high risk as positive. The specific results of each study are presented in Supplementary Fig. 1. The overall sensitivity and specificity for US RSSs were found to be 0.86 (95% CrI, 0.80 to 0.91) and 0.33 (95% CrI, 0.25 to 0.41), respectively. For ACR-TIRADS, the sensitivity and specificity were 0.80 (95% CrI, 0.70 to 0.88) and 0.36 (95% CrI, 0.23 to 0.49), respectively. In the case of the ATA US RSS, the sensitivity and specificity were 0.72 (95% CrI, 0.50 to 0.88) and 0.50 (95% CrI, 0.35 to 0.64), respectively. For the EU-TIRADS, the sensitivity and specificity were 0.76 (95% CrI, 0.58 to 0.88) and 0.49 (95% CrI, 0.32 to 0.66), respectively. For the K-TIRADS, the sensitivity and specificity were 0.96 (95% CrI, 0.81 to 1.00) and 0.28 (95% CrI, 0.02 to 0.73), respectively. For the TIRADS described by Kwak et al., the sensitivity and specificity were 0.97 (95% CrI, 0.90 to 1.00) and 0.17 (95% CrI, 0.07 to 0.31), respectively. Furthermore, the Deeks funnel plot and asymmetry test did not indicate a significant probability of publication bias, with the exception of the ATA US RSS (P=0.023). The summary receiver operating characteristic curves for each US RSS, with intermediate to high risk categorized as positive, are shown in Supplementary Fig. 2.
Fig. 4 summarizes the estimates of diagnostic performance, with high risk considered positive. Specific data related to different US RSSs are presented in Supplementary Fig. 3. The overall sensitivity and specificity for US RSSs were found to be 0.35 (95% CrI, 0.27 to 0.43) and 0.93 (95% CrI, 0.91 to 0.96), respectively. For the ACR-TIRADS, the sensitivity and specificity were 0.40 (95% CrI, 0.27 to 0.53) and 0.91 (95% CrI, 0.87 to 0.94), respectively. The ATA US RSS showed a sensitivity and specificity of 0.46 (95% CrI, 0.28 to 0.65) and 0.90 (95% CrI, 0.85 to 0.95), respectively. The EU-TIRADS demonstrated a sensitivity and specificity of 0.55 (95% CrI, 0.40 to 0.67) and 0.71 (95% CrI, 0.57 to 0.82), respectively. For the K-TIRADS, the sensitivity and specificity were 0.47 (95% CrI, 0.23 to 0.69) and 0.91 (95% CrI, 0.84 to 0.96), respectively. The TIRADS by Kwak et al. showed a sensitivity and specificity of 0.10 (95% CrI, 0.05 to 0.18) and 0.99 (95% CrI, 0.98 to 1.00), respectively. Furthermore, no significant probability of publication bias was detected. The summary receiver operating characteristic curves for each US RSS, with high risk categorized as positive, are shown in Supplementary Fig. 4.
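To make the trade-off between the two positivity thresholds concrete, the sketch below computes positive and negative likelihood ratios from the overall summary point estimates quoted above, using the standard formulas LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity. The function name `likelihood_ratios` is hypothetical, and values derived from point estimates in this way may differ slightly from those obtained from the posterior distributions of the fitted model.

```python
from typing import Tuple


def likelihood_ratios(sensitivity: float, specificity: float) -> Tuple[float, float]:
    """Return (LR+, LR-) from point estimates of sensitivity and specificity."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg


# Overall summary point estimates (sensitivity, specificity) reported in the text.
summary = {
    "intermediate to high risk as positive": (0.86, 0.33),
    "high risk as positive": (0.35, 0.93),
}

for threshold, (sens, spec) in summary.items():
    lr_pos, lr_neg = likelihood_ratios(sens, spec)
    print(f"{threshold}: LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
# intermediate to high risk as positive: LR+ = 1.28, LR- = 0.42
# high risk as positive: LR+ = 5.00, LR- = 0.70
```

Under these point estimates, a high-risk classification shifts the odds of malignancy considerably more than an intermediate- to high-risk classification, whereas a negative result at the lower threshold is the more informative of the two for ruling out malignancy.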
The results of model selection guided by the DIC are shown in Supplementary Table 2.
Subgroup Analysis
A subgroup analysis was conducted based on various indeterminate categories, namely AUS/FLUS, FN/SFN, and SM. The results for AUS/FLUS are displayed in Table 2 and Supplementary Table 3. When considering intermediate to high risk as positive, the highest sensitivity was observed in TIRADS by Kwak et al., while the highest specificity was found in ATA US RSS. The overall sensitivity and specificity were 0.90 (95% CrI, 0.82 to 0.96) and 0.40 (95% CrI, 0.24 to 0.57), respectively. When high risk was considered positive, the highest sensitivity was seen in the EU-TIRADS, and the highest specificity was found in the TIRADS delineated by Kwak et al. The overall sensitivity and specificity in this case were 0.33 (95% CrI, 0.23 to 0.44) and 0.94 (95% CrI, 0.90 to 0.97), respectively.
For FN/SFN and SM, the number of studies was insufficient to conduct quantitative analysis for each US RSS. In the FN/SFN subgroup, the overall sensitivity and specificity were 0.64 (95% CrI, 0.44 to 0.81) and 0.40 (95% CrI, 0.26 to 0.56), respectively, when categorizing intermediate to high risk as positive. When categorizing high risk as positive, the sensitivity and specificity were 0.22 (95% CrI, 0.11 to 0.36) and 0.89 (95% CrI, 0.79 to 0.96), respectively. For SM, the overall sensitivity and specificity were 0.89 (95% CrI, 0.78 to 0.96) and 0.23 (95% CrI, 0.11 to 0.38), respectively, when categorizing intermediate to high risk as positive. When categorizing high risk as positive, the sensitivity and specificity were 0.49 (95% CrI, 0.31 to 0.68) and 0.99 (95% CrI, 0.95 to 1.00), respectively (Supplementary Table 4).
Meta-Regression
The results of the meta-regression are outlined in Table 3 (all US RSSs) and Supplementary Table 5 (each US RSS). Overall, no significant covariates were identified when the risk was set to intermediate or high. However, the sensitivity of the high-risk category was influenced by variations in malignant prevalence (P=0.010) and study location (P=0.031). The specificity, in contrast, was potentially affected by all four covariates: study design (P=0.011), number of nodules (P=0.014), prevalence of malignancy (P<0.01), and study location (P<0.01).
Discussion
To the best of the authors’ knowledge, the present study is the first in the literature to investigate the utility of US RSSs in patients with cytologically indeterminate nodules. The current meta-analysis examined the diagnostic performance of various US RSSs, using 33 studies that included 7,225 indeterminate thyroid nodules. Limited data were available on the AACE/ACE/AME, BTA, French, and Horvath et al. TIRADS. However, more studies were found evaluating the ACR TIRADS, ATA US RSS, EU-TIRADS, K-TIRADS, and TIRADS outlined by Kwak et al. Most US RSSs are pattern-based systems. For instance, the K-TIRADS incorporates solidity, echogenicity, and suspicious features (nonparallel orientation, spiculated/microlobulated margin, and microcalcifications) to stratify nodules [17]. Other examples include the ATA US RSS and the EU-TIRADS. In contrast, some US RSSs are scoring systems. For example, with the ACR-TIRADS, all US characteristics are integrated and scored from 0 to 3 based on their malignant potential [13]. The Kwak TIRADS also employs a score-based system. The advantage of pattern-based systems is that they are intuitive and practical for clinical application, while a scoring system may provide a more objective evaluation of each nodule [52].
In the present meta-analysis, individual system meta-analyses were used to identify the threshold categories with the highest accuracy for indeterminate nodules. These categories included TR5 (highly suspicious) for the ACR TIRADS, high suspicion for the ATA system, EU-TIRADS 5 (high risk) for the EU-TIRADS, K-TIRADS 5 (high suspicion) for the K-TIRADS, and category 5 (highly suggestive of malignancy) for the Kwak TIRADS. At these category thresholds, the RSSs demonstrated a sensitivity of 10%-55%, a specificity of 71%-99%, and an accuracy of 69%-79% (Fig. 4, Supplementary Table 6). Kim et al. [53] reported similar results for thyroid nodules across all categories, with a higher sensitivity of 65%-77% and a higher specificity of 82%-90%. However, the difference lay in the threshold categories with the highest accuracy for the Kwak TIRADS, which was category 4c in the study by Kim et al. Overall, the clinical application of US RSSs in indeterminate nodules provides valuable information for deciding between surgical treatment or active surveillance.
The diagnostic performance for indeterminate nodules varied among US RSSs. For the category deemed intermediate to high risk, the highest sensitivity was observed with the Kwak TIRADS (0.97; 95% CrI, 0.90 to 1.00), while the lowest was seen with the ATA US RSS (0.72; 95% CrI, 0.50 to 0.88). Conversely, the specificity was highest for the ATA US RSS (0.50; 95% CrI, 0.35 to 0.64) and lowest for the Kwak TIRADS (0.17; 95% CrI, 0.07 to 0.31). For the high-risk category, the highest and lowest sensitivity values were observed for the EU-TIRADS (0.55; 95% CrI, 0.40 to 0.67) and the Kwak TIRADS (0.10; 95% CrI, 0.05 to 0.18), respectively. The specificity was highest for the Kwak TIRADS (0.99; 95% CrI, 0.98 to 1.00) and lowest for the EU-TIRADS (0.71; 95% CrI, 0.57 to 0.82). However, due to the absence of studies directly comparing different US RSSs, these differences should be interpreted with caution. The variation in diagnostic performance was not solely due to the overlapping US appearance of benign and malignant nodules, but also to substantial variability in thyroid nodule reporting and recommendations for further workup [35]. Limited evidence was available of differences in interobserver agreement among US RSSs, with only Sahli et al. [42] reporting moderate agreement for ACR-TIRADS among the three participating radiologists. Compared to US features, the use of US RSSs may improve interobserver agreement, and when selecting nodules for FNAB, the interobserver agreement can approach perfection [54,55]. US practitioners can adapt each RSS to their clinical setting, considering the proportion of malignant thyroid nodules and other factors. In primary hospitals, most patients present due to thyroid nodules detected during routine physical examinations. However, in tertiary hospitals, many patients are referred due to an initial diagnosis and surgical recommendation from a primary hospital.
Consequently, these tertiary hospitals tend to have a higher proportion of malignant nodules. Table 3 indicates that a higher proportion of malignant nodules can increase sensitivity and decrease specificity, leading to a high proportion of false positive cases. In such cases, clinicians can opt for noninvasive strategies such as active surveillance for nodules of similar categories to avoid unnecessary FNAB. Conversely, in situations with lower proportions of malignant nodules, repeat FNAB or surgery may be chosen over active surveillance [56].
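The dependence of clinical interpretation on the proportion of malignant nodules can be illustrated with Bayes' theorem. The sketch below computes positive and negative predictive values across the range of malignancy prevalence reported by the included studies (15.4% to 80.8%, with a median of 35%), using the overall high-risk summary estimates (sensitivity 0.35, specificity 0.93). The function name `predictive_values` is hypothetical, and holding sensitivity and specificity fixed across settings is a simplifying assumption, since the meta-regression suggests that both may themselves vary with prevalence.

```python
from typing import Tuple


def predictive_values(sensitivity: float, specificity: float,
                      prevalence: float) -> Tuple[float, float]:
    """Return (PPV, NPV) for a given prevalence via Bayes' theorem."""
    tp = sensitivity * prevalence                   # expected true-positive fraction
    fp = (1.0 - specificity) * (1.0 - prevalence)   # expected false-positive fraction
    fn = (1.0 - sensitivity) * prevalence           # expected false-negative fraction
    tn = specificity * (1.0 - prevalence)           # expected true-negative fraction
    return tp / (tp + fp), tn / (tn + fn)


# Overall summary estimates with high risk considered positive (from the text).
SENS, SPEC = 0.35, 0.93

# Lowest, median, and highest malignancy proportions reported in the text.
for prevalence in (0.154, 0.35, 0.808):
    ppv, npv = predictive_values(SENS, SPEC, prevalence)
    print(f"prevalence {prevalence:.1%}: PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

With test characteristics held fixed, the positive predictive value rises and the negative predictive value falls as prevalence increases, which is one reason the same risk category may warrant different management in primary and tertiary settings.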
AUS/FLUS accounts for the majority of indeterminate nodules, yet the actual incidence of malignancy within AUS/FLUS remains uncertain due to the lack of pathologic confirmation in every case [57]. Research into AUS/FLUS has revealed a broad spectrum of malignant incidence, ranging from 5%-27% in all cases and 6%-48% in surgical cases [58]. In this meta-analysis, the K-TIRADS demonstrated the highest sensitivity (0.95; 95% CrI, 0.85 to 1.00) and specificity (0.75; 95% CrI, 0.13 to 1.00) when intermediate to high risk was categorized as positive. However, the K-TIRADS results could be impacted by an excess of zeros in the two-by-two table, as it had the lowest AUC among all of the US RSSs. For the high-risk category, the EU-TIRADS (0.59; 95% CrI, 0.41 to 0.73) and Kwak TIRADS (0.99; 95% CrI, 0.97 to 1.00) exhibited the highest sensitivity and specificity, respectively. Despite variations among US RSSs, AUS/FLUS could still benefit from US RSSs in determining the need for repeated FNAB, as opposed to diagnostic thyroidectomy [43]. Numerous studies have highlighted the advantages of repeated FNAB in reclassifying an AUS/FLUS result into a category with a more definitive malignancy rate and management strategy [59,60]. Due to insufficient data on FN/SFN and SM, only overall effects were analyzed. Generally, US RSSs were more effective in identifying malignancy in SM (AUC, 0.95; 95% CrI, 0.93 to 0.98) than in AUS/FLUS or FN/SFN when considering the intermediate to high-risk category as positive, likely due to the substantially higher malignancy rate in SM. Ultimately, in the meta-regression, factors such as sample size, the proportion of malignant nodules, and study location were identified as common sources of study heterogeneity.
This meta-analysis had several limitations. First, while category-based comparisons of diagnostic performance are intuitively interpretable, they are inherently limited due to the varying malignancy risks of the categories suggested in the guidelines. Second, most of the included studies had retrospective and single-center designs. Furthermore, despite the use of a Bayesian model to fit estimates and mitigate heterogeneity, substantial between-study heterogeneity persisted, particularly due to the mixed indeterminate components. Third, the diagnosis of both benign and malignant lesions typically relied on surgical pathology, potentially introducing a reference standard bias. Fourth, actual recommendations for FNAB are based on a combination of risk categories and nodule size, a factor not assessed in this study. Finally, insufficient studies were available to conduct quantitative analyses on all of the included US classification systems.
In conclusion, the diagnostic performance of the US RSS in accordance with the representative society guidelines was found to be moderate. This study aims to equip readers and physicians with insights into the performance of each RSS in the context of indeterminate nodules. This information could be instrumental in making decisions about system implementation. Further prospective studies that evaluate all of the most common US RSSs and utilize histology as the standard of reference are necessary.
Notes
Author Contributions
Conceptualization: Xing Z, Qiu Y, Zhu J, Wu W. Data acquisition: Xing Z, Qiu Y, Wu W. Data analysis or interpretation: Xing Z, Qiu Y, Su A. Drafting of the manuscript: Xing Z, Qiu Y. Critical revision of the manuscript: Xing Z, Qiu Y, Zhu J, Su A, Wu W. Approval of the final version of the manuscript: all authors.
No potential conflict of interest relevant to this article was reported.
Acknowledgements
We thank the authors and participants of the included studies for their important contributions.
Key point
For the intermediate- to high-risk category, the sensitivity levels of the American College of Radiology Thyroid Imaging Reporting and Data System (TIRADS), American Thyroid Association guidelines, European Thyroid Association TIRADS, Korean Society of Thyroid Radiology TIRADS, and Kwak et al. TIRADS ranged from 0.72 to 0.97, while the specificity measurements ranged from 0.17 to 0.50. For the high-risk category, European Thyroid Association TIRADS demonstrated the highest sensitivity at 0.55, while Kwak TIRADS showed the highest specificity at 0.99. This study provided information regarding the performance of each RSS in the context of indeterminate nodules.