Diagnostic performance of the modified Korean Thyroid Imaging Reporting and Data System for thyroid malignancy according to nodule size: a comparison with five society guidelines
Abstract
Purpose
The aim of this study was to evaluate the diagnostic performance of the modified Korean Thyroid Imaging Reporting and Data System (K-TIRADS) compared with five society risk stratification systems (RSSs) according to nodule size.
Methods
In total, 3,826 consecutive thyroid nodules (≥1 cm) with final diagnoses in 3,088 patients were classified according to five RSSs. The K-TIRADS was modified by raising the biopsy size threshold for low-suspicion nodules and subcategorizing intermediate-suspicion nodules. We assessed the performance of the RSSs as triage tests and their diagnostic accuracy according to nodule size (with a threshold of 2 cm).
Results
Of all nodules, 3,277 (85.7%) were benign and 549 (14.3%) were malignant. In small thyroid nodules (≤2 cm), the American College of Radiology Thyroid Imaging Reporting and Data System (ACR TI-RADS) had the highest reduction rate of unnecessary biopsies (76.3%) and the lowest sensitivity (76.1%). The modified K-TIRADS had the second highest reduction rate of unnecessary biopsies (67.6%) and sensitivity (86.6%). The modified K-TIRADS and ACR TI-RADS had the highest diagnostic odds ratios (P=0.165) and the highest areas under the curve (P=0.315). In large nodules (>2 cm), the sensitivity of the ACR TI-RADS for malignancy was significantly lower (88.8%) than the sensitivities of the modified K-TIRADS and other RSSs, which were very high (98.7%-99.3%) (P<0.001).
Conclusion
The modified K-TIRADS allows a large proportion of unnecessary biopsies to be avoided, while maintaining high sensitivity and diagnostic accuracy for small malignant tumors and very high sensitivity for large malignant tumors.
Introduction
Ultrasonography (US) is a primary diagnostic tool for the evaluation of thyroid nodules [1], and many international societies have proposed widely used US risk stratification systems (RSSs) for thyroid nodules in clinical practice guidelines [2-7]. RSSs are used for triage to select patients for US-guided aspiration/biopsy and to rule out thyroid malignancy. As triage tests, RSSs play a role in reducing unnecessary nodule biopsies and require an appropriate sensitivity for thyroid malignancy [8]. Recent comparative studies [9-14] showed a wide spectrum of diagnostic performance for the biopsy criteria in the five US RSSs: the American Association of Clinical Endocrinologists (AACE)/American College of Endocrinology (ACE)/Associazione Medici Endocrinologi (AME) guideline, the American College of Radiology (ACR) Thyroid Imaging Reporting and Data System (TI-RADS), the American Thyroid Association (ATA) guideline, the European Thyroid Imaging Reporting and Data System (EU-TIRADS) proposed by the European Thyroid Association, and the Korean Thyroid Imaging Reporting and Data System (K-TIRADS) proposed by the Korean Society of Thyroid Radiology/Korean Thyroid Association. These comparative studies [9-14] have highlighted the need for balanced, optimal biopsy criteria within an RSS, and have also shown that the K-TIRADS had both the highest sensitivity and the highest rate of unnecessary biopsies. In light of this finding, it is necessary to modify the K-TIRADS to reduce the rate of unnecessary biopsies while maintaining an appropriate sensitivity for malignancy.
Tumor size is an important prognostic factor in papillary thyroid cancer (PTC) and follicular thyroid cancer (FTC) [15,16]. The risk of distant metastasis increases for tumors larger than 2 cm [17] and the risk of local tumor invasion, nodal metastasis, and distant metastasis becomes higher as tumor size increases [18]. Therefore, the diagnostic performance of RSSs needs to be evaluated depending on the nodule size, but this has rarely been investigated. The aim of this study was to develop a modified version of the K-TIRADS and to evaluate the diagnostic performance of the modified K-TIRADS compared with the five RSSs as triage tests for the detection of thyroid malignancy according to nodule size.
Materials and Methods
Compliance with Ethical Standards
This study was approved by the institutional review board of GangNeung Asan Hospital in Korea (2020-03-020) and informed consent was waived for this retrospective study.
Study Population
Overall, 4,359 consecutive patients underwent US-guided fine-needle aspiration (FNA) or core needle biopsy (CNB) for thyroid nodules between January 2011 and December 2019 at a single tertiary hospital. Among 3,905 patients with 4,832 nodules ≥1 cm, we excluded 998 nodules without final diagnoses confirmed through surgical or biopsy findings (non-diagnostic biopsy results [n=481], atypia of undetermined significance or follicular lesion of undetermined significance [n=424], follicular neoplasm or suspicious for a follicular neoplasm [n=47], suspicious for malignancy [n=27], and one benign FNA result followed by a discordant biopsy result of follicular neoplasm, suspicious for a follicular neoplasm, or suspicious for malignancy [n=19]) and eight nodules with US images of suboptimal quality. The remaining 3,088 patients with 3,826 nodules were included in the final study population (2,497 women and 591 men; median age, 56 years; interquartile range [IQR], 47 to 64 years) (Fig. 1). Final diagnoses were determined by the definitive FNA or CNB results (benign or malignant) and surgical histologic diagnoses.
US Examinations and Image Analysis
All US examinations were performed using a 5- to 12-MHz linear probe and a real-time US system (IU22 or EPIQ7, Philips Medical Systems, Bothell, WA, USA). All US images of thyroid nodules between January 2011 and February 2017 were obtained according to the Korean Society of Thyroid Radiology guidelines [5,19] and the US images were retrospectively reviewed by one experienced radiologist (D.G.N.) with 22 years of experience in performing thyroid US, who had no previous knowledge of the FNA results or final diagnoses. The US images of the thyroid nodules obtained between March 2017 and December 2019 were prospectively evaluated before biopsy by two radiologists (D.G.N. and W.P.) with 22 years and 4 years of experience in performing thyroid US, respectively.
The US features of nodules were strictly assessed using the definitions in the US lexicons of the five RSSs [3-7] to minimize the misclassification of nodules in both the retrospective and prospective datasets (Supplementary Table 1). Extrathyroidal extension status was not evaluated in this study because of the absence of standardized specified US criteria. An isolated macrocalcification was defined as an entirely calcified nodule with posterior acoustic shadowing, in which no soft-tissue component was identified due to dense shadowing on the US image [20]. A nodule with this finding was categorized as a nodule of intermediate suspicion in the K-TIRADS [5], as a nodule of moderate suspicion (4 points) in the ACR TI-RADS [6,11], and as an unclassified nodule in other RSSs. Isoechoic nodules with an irregular margin, microcalcification, and a taller-than-wide shape were categorized as unclassified nodules in the ATA guideline. A reviewer (D.G.N.), who had no previous knowledge of the FNA results or final diagnoses, classified nodules based on the assessed US features, and determined the candidates for FNA based on the maximal diameter and category of each nodule according to the guideline of each RSS or TI-RADS (Supplementary Table 2).
Biopsy Size Thresholds According to Categories in the Five Risk Stratification Systems
Supplementary Table 2 lists the biopsy size thresholds and calculated malignancy risks according to the categories in the five RSSs. Nodules classified as low-risk by the AACE/ACE/AME guideline, not suspicious (TR2) or benign (TR1) by the ACR TI-RADS, or benign by the ATA guideline, EU-TIRADS, and K-TIRADS were considered not to be indicated for biopsy in this study because they are not routinely indicated for biopsy for diagnostic purposes according to each RSS.
Development of the Modified Korean TI-RADS
The modified K-TIRADS was developed by revising the K-TIRADS (Table 1). Intermediate-suspicion (K-TIRADS 4) nodules were subcategorized into K-TIRADS 4A and 4B based on the malignancy risk of the US patterns (Table 1). K-TIRADS category 4 includes two groups: solid hypoechoic nodules without any of three suspicious US features (microcalcification, nonparallel orientation [taller-than-wide], spiculated/microlobulated margin), and partially cystic or isoechoic and hyperechoic nodules with any of the three suspicious US features. Solid hypoechoic nodules without any of the three suspicious US features were subcategorized by the degree of hypoechogenicity (mild vs. marked hypoechogenicity) and the presence of macrocalcification, based on previous studies reporting that marked hypoechogenicity and macrocalcification increased the malignancy risk of such nodules (Shin HS, unpublished data) [21]. Marked hypoechogenicity was defined as echogenicity similar to or lower than that of the anterior neck muscle [21]. Partially cystic or isoechoic and hyperechoic nodules with any of the three suspicious US features were subcategorized according to the number of coexisting suspicious US features (one vs. two or three) because a higher number of suspicious US features in a nodule may indicate a higher malignancy risk [22]. The size threshold for biopsy was set at 1 cm for K-TIRADS 4B nodules and 1.5 cm for K-TIRADS 4A nodules, and was raised from 1.5 cm to 2 cm for low-suspicion (K-TIRADS 3) nodules.
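The size thresholds described above can be expressed as a simple lookup. The following is a minimal sketch rather than the authors' implementation: the function and dictionary names are hypothetical, only the three thresholds quoted in the text are encoded (the remaining categories are defined in Table 1), and the threshold is assumed to be inclusive.

```python
from typing import Optional

# Biopsy size thresholds (cm) quoted in the text for the modified K-TIRADS.
# Categories not quoted there are intentionally omitted from this sketch.
MODIFIED_K_TIRADS_THRESHOLD_CM = {
    "3": 2.0,   # low suspicion: raised from 1.5 cm to 2 cm
    "4A": 1.5,  # intermediate suspicion, subcategory 4A
    "4B": 1.0,  # intermediate suspicion, subcategory 4B
}


def biopsy_indicated(category: str, max_diameter_cm: float) -> Optional[bool]:
    """Return True/False for the categories above, or None when the
    category is not covered by the thresholds quoted in the text."""
    threshold = MODIFIED_K_TIRADS_THRESHOLD_CM.get(category)
    if threshold is None:
        return None
    # Assumption: biopsy is indicated when diameter >= threshold.
    return max_diameter_cm >= threshold
```

Under these assumptions, a 1.2-cm nodule would be indicated for biopsy if categorized as K-TIRADS 4B but not if categorized as K-TIRADS 4A or 3.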
Assessment of the Diagnostic Performance of Risk Stratification Systems for Thyroid Malignancy
All nodules were dichotomized into those for which a biopsy was indicated (test positivity) or was not indicated (test negativity) by the biopsy criteria of each RSS (Table 1, Supplementary Table 2). As primary measures of test performance, the reduction rate of unnecessary biopsies and sensitivity were used to assess the performance of each RSS as a triage test [8] and the negative likelihood ratio (LR-) was used to assess its performance as a rule-out test [23]. The LR indicates how much a positive or negative test result by the RSS raises or lowers the pretest probability of the target disorder (thyroid malignancy) [23,24]. The global discriminative performance was assessed as a secondary measure of the diagnostic performance of the RSSs by the diagnostic odds ratio (DOR) and the area under the receiver operating characteristic curve (AUC). The DOR is a single indicator of test performance and is independent of the prevalence of malignant tumors. The DOR is equal to the positive likelihood ratio (LR+) divided by LR-, and is the ratio of the odds of a biopsy being indicated in a malignant nodule relative to the odds of a biopsy being indicated in a benign nodule [25].
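The measures defined above can be illustrated with a short sketch that derives them from a 2×2 table. The class and function names are hypothetical and the counts in the usage note are invented for illustration only. Two assumptions are made: the reduction rate of unnecessary biopsies is taken as the proportion of benign nodules not indicated for biopsy (which coincides with specificity in the results reported here), and the AUC of a dichotomous test is taken as the mean of sensitivity and specificity, as follows from a single-point ROC curve.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TwoByTwo:
    tp: int  # malignant, biopsy indicated
    fn: int  # malignant, biopsy not indicated (missed)
    fp: int  # benign, biopsy indicated (unnecessary biopsy)
    tn: int  # benign, biopsy not indicated (biopsy avoided)


def _ratio(num: float, den: float) -> Optional[float]:
    # Return None instead of raising when a ratio is not calculable.
    return None if den == 0 else num / den


def triage_metrics(t: TwoByTwo) -> dict:
    sens = t.tp / (t.tp + t.fn)
    spec = t.tn / (t.tn + t.fp)
    lr_pos = _ratio(sens, 1 - spec)
    lr_neg = _ratio(1 - sens, spec)
    # DOR = LR+ / LR-; not calculable when LR- is 0 or either LR is undefined.
    dor = None if lr_pos is None or not lr_neg else lr_pos / lr_neg
    return {
        "sensitivity": sens,
        "reduction_rate": spec,  # assumption: equals specificity
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
        "dor": dor,
        "auc": (sens + spec) / 2,  # single operating point
    }
```

For example, `triage_metrics(TwoByTwo(tp=90, fn=10, fp=60, tn=40))` gives a sensitivity of 0.90, a reduction rate of 0.40, an LR- of 0.25, and a DOR of 6. When no malignant nodule is missed, LR- is 0 and the DOR is returned as `None`.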
Statistical Analyses
The chi-square or Fisher exact test was used to compare the frequency of categorical variables. Multivariable logistic regression analyses were performed to determine independent US predictors among US features in the subgroup of K-TIRADS 4 nodules. Sensitivity, specificity, positive predictive value, negative predictive value, LR+, LR-, DOR, and the AUC were calculated with 95% confidence intervals. All diagnostic values were compared among the RSSs in overall nodules and according to nodule size (with a size threshold of 2 cm). The statistical comparisons of the LR and DOR among the RSSs were performed using a regression model approach proposed by Gu and Pepe [26] and the Z test, respectively. The DeLong test was used to compare the AUC among the RSSs. Statistical analyses were performed using SPSS version 25 for Windows (IBM Corp., Armonk, NY, USA) and R 3.6.3 for Windows (R Development Core Team, Vienna, Austria). A significant difference was defined as a P-value <0.05.
Results
Clinical Data
The median size (maximal diameter) of the nodules was 1.7 cm (IQR, 1.3 to 2.6 cm; range, 1 to 10 cm). The maximal diameter of nodules was small (1-2 cm) in 2,385 nodules (62.3%) (median size, 1.4 cm) and large (>2 cm) in 1,441 nodules (37.7%) (median size, 2.9 cm). Of the 3,826 nodules, 3,277 (85.7%) were benign and 549 (14.3%) were malignant. Malignant nodules were diagnosed based on histologic findings after surgery (n=411) or malignant FNA or CNB results (n=138). Benign nodules were diagnosed based on histologic findings after surgery (n=390), at least two benign FNA or CNB results (n=545), and one benign FNA (n=2,055) or CNB result (n=287) (Fig. 1).
The 549 malignant nodules included 494 PTCs (90.0%), 32 FTCs (5.8%), eight anaplastic carcinomas (1.5%), eight metastases (1.5%), four lymphomas (0.7%), and three medullary thyroid carcinomas (0.5%). The proportion of PTCs among malignant tumors was significantly higher in small (1-2 cm) tumors than in large (>2 cm) tumors (96.5% vs. 73.0%, P<0.001), and the proportion of FTCs was significantly higher in large (>2 cm) malignant tumors than in small malignant tumors (1-2 cm) (15.8% vs. 2.0%, P<0.001).
Subcategorization of the Malignancy Risk of Intermediate-Suspicion Nodules in K-TIRADS
Table 1 lists the malignancy risk of nodules classified using the modified K-TIRADS, in which intermediate suspicion (K-TIRADS 4) nodules were subcategorized (Figs. 2-5). Among the solid hypoechoic nodules without any suspicious US features (microcalcification, nonparallel orientation, spiculated/microlobulated margin) classified as K-TIRADS 4, marked hypoechogenicity and macrocalcification were independently predictive of malignancy (P<0.001 for both) in the multivariable analysis. Markedly hypoechoic nodules showed a significantly higher malignancy risk than mildly hypoechoic nodules (27.1% vs. 11.2%, P<0.001), and nodules with macrocalcifications showed a significantly higher malignancy risk than nodules without macrocalcifications (38.9% vs. 13.4%, P<0.001). Among the partially cystic or isoechoic and hyperechoic nodules with suspicious US features classified as K-TIRADS 4, the malignancy risk of nodules with two or three suspicious features was significantly higher than that of nodules with one suspicious US feature (48.2% vs. 8.3%, P<0.001).
Diagnostic Performance of the Modified K-TIRADS and Five Risk Stratification Systems in All Nodules
Table 2 presents the diagnostic performance of the biopsy criteria of the RSSs for malignancy in all thyroid nodules. The ACR TI-RADS had the highest reduction rate of unnecessary biopsies (65.2%), lowest sensitivity (79.6%), highest specificity (65.2%), and a relatively high LR- (0.31). The K-TIRADS had the lowest reduction rate of unnecessary biopsies (18.6%), highest sensitivity (96.9%), lowest specificity (18.6%), and lowest LR- (0.17). The modified K-TIRADS had the second highest reduction rate of unnecessary biopsies (43.7%) and sensitivity (90.0%), and the second lowest LR- (0.23). The ACR TI-RADS had the highest DOR (7.32) and AUC (0.724).
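As a consistency check, the LR-, DOR, and AUC can be recomputed from the rounded sensitivity and specificity reported above. This sketch assumes that the AUC of a dichotomous test equals the mean of sensitivity and specificity, which is not stated in the text; the function name is hypothetical, and small differences from Table 2 are expected because the inputs are rounded percentages.

```python
def from_rates(sensitivity: float, specificity: float):
    """Return (LR-, DOR, AUC) for a dichotomous test."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    dor = lr_pos / lr_neg
    auc = (sensitivity + specificity) / 2  # assumption: single operating point
    return lr_neg, dor, auc


# Rounded values reported in the text for all nodules.
acr = from_rates(0.796, 0.652)      # ACR TI-RADS
k_tirads = from_rates(0.969, 0.186)  # K-TIRADS

print(round(acr[0], 2), round(acr[1], 1), round(acr[2], 3))  # 0.31 7.3 0.724
print(round(k_tirads[0], 2))                                  # 0.17
```

The recomputed values agree with the reported LR- (0.31 and 0.17) and AUC (0.724), and the recomputed DOR of about 7.3 is close to the reported 7.32.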
Diagnostic Performance of the Modified K-TIRADS and Five Risk Stratification Systems in Small Nodules (≤2 cm)
Table 3 shows the diagnostic performance of the biopsy criteria of the RSSs for malignancy in small thyroid nodules (≤2 cm). The reduction rate of unnecessary biopsies was the highest with the ACR TI-RADS (76.3%) and was the lowest with the K-TIRADS (27.4%). The modified K-TIRADS had the second highest reduction rate of unnecessary biopsies (67.6%), which was similar to that of the AACE/ACE/AME (P=0.343) and higher than those of the K-TIRADS, ATA guideline, and EU-TIRADS (P<0.001 for all). The highest sensitivity (96.2%) was found for the K-TIRADS and the lowest (76.1%) for the ACR TI-RADS. The modified K-TIRADS had the second highest sensitivity (86.6%), which was similar to the sensitivities of the ATA guideline (P=0.492) and the EU-TIRADS (P=0.174).
The K-TIRADS and the modified K-TIRADS had the lowest LR- values (0.14 and 0.20, respectively; P=0.106), which were significantly lower than those of the other RSSs (P<0.05). The modified K-TIRADS and the ACR TI-RADS had the highest DORs (13.55 and 10.24, respectively; P=0.165) and the highest AUCs (0.771 and 0.762, respectively; P=0.315). The ATA guideline had a reduction rate of unnecessary biopsies of 37.7%, a sensitivity of 95.7%, an LR- of 0.11, an AUC of 0.667, and a DOR of 13.54 when the unclassified nodules were categorized as intermediate-suspicion nodules.
Diagnostic Performance of the Modified K-TIRADS and Five Risk Stratification Systems in Large Nodules (>2 cm)
Table 4 lists the diagnostic performance of the biopsy criteria of the RSSs for malignancy in large thyroid nodules (>2 cm). The reduction rate of unnecessary biopsies was highest with the ACR TI-RADS (48.2%); this rate was significantly higher than those of the other RSSs (1.2%-20.5%) (P<0.001 for all). The sensitivities of the ATA guideline (80.3%) and the ACR TI-RADS (88.8%) were significantly lower than those of the other RSSs, which had similarly very high sensitivities (98.7%-99.3%) (P≤0.001 for all). The LR- was lowest (0.19) with the modified K-TIRADS. The DOR was highest with the ACR TI-RADS (7.38) and second highest with the modified K-TIRADS (5.56) (P=0.712). The AUC was highest (0.685) with the ACR TI-RADS; this AUC was significantly higher than the AUCs (0.503-0.528) of the other RSSs (P<0.001). When the unclassified nodules were categorized as intermediate-suspicion nodules, the ATA guideline had a reduction rate of unnecessary biopsies of 0.9%, a sensitivity of 100.0%, a specificity of 0.9%, an LR- of 0.00, an AUC of 0.505, and no calculable DOR. Seventeen malignant tumors were missed by the ACR TI-RADS, of which 15 were classified as TR2 and two as TR3; the histologic types of these tumors were PTC in 11 cases, including six follicular variant PTCs, and FTC in six cases. The malignant tumors classified as TR2 by the ACR TI-RADS accounted for 15 of the 152 malignant tumors larger than 2 cm (9.9%) (Fig. 6).
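The sensitivity of the ACR TI-RADS in large nodules can be reproduced from the counts given above. This is a brief arithmetic sketch using only numbers stated in the text; the variable names are illustrative.

```python
malignant_large = 152  # malignant tumors larger than 2 cm
missed_total = 17      # missed by the ACR TI-RADS (15 TR2 + 2 TR3)
missed_tr2 = 15        # missed tumors classified as TR2

sensitivity = (malignant_large - missed_total) / malignant_large
tr2_share = missed_tr2 / malignant_large

print(f"{sensitivity:.1%} {tr2_share:.1%}")  # 88.8% 9.9%
```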
Discussion
The modified K-TIRADS substantially reduced the number of unnecessary biopsies compared to the K-TIRADS, while maintaining a relatively high sensitivity (86.6%) in small nodules (≤2 cm), by raising the biopsy size threshold for low-suspicion (K-TIRADS 3) nodules and subcategorizing intermediate-suspicion (K-TIRADS 4) nodules into K-TIRADS 4A and 4B. In large nodules (>2 cm), the modified K-TIRADS, K-TIRADS, EU-TIRADS, and AACE/ACE/AME guideline had similarly very high sensitivities and very low reduction rates of unnecessary biopsies, whereas the ACR TI-RADS had a relatively low sensitivity and a high reduction rate of unnecessary biopsies. The relatively low sensitivity of the ACR TI-RADS in large nodules was mainly due to "not suspicious" (TR2) nodules, which are not indicated for biopsy. The low sensitivity of the ATA guideline for malignancy in large thyroid nodules was caused by the unclassified nodules, and the diagnostic performance of the ATA guideline was similar to that of the K-TIRADS when the unclassified nodules were categorized as intermediate-suspicion nodules, as verified in a recent study [27]. The differences in diagnostic performance among RSSs in small nodules are mostly caused by differences in the size thresholds for biopsy and in which nodules are not indicated for biopsy, rather than by differences in the structure (pattern-based versus point-based systems) or US criteria for nodule classification. The diagnostic performances of the RSSs were similar at the same size threshold for biopsy in simulation studies [13,27], and the diagnostic performance estimated by the classified categories was comparable among the RSSs [28].
Disagreements may exist regarding the most appropriate measure of test accuracy for evaluating the performance of an RSS as a triage test. Although the DOR and AUC are effective measures of global diagnostic accuracy, two tests with an identical DOR and AUC can have very different sensitivities and specificities, with distinct clinical consequences [24]. Therefore, the DOR or AUC does not seem to be an appropriate primary measure for evaluating the performance of an RSS as a triage test. The most desirable RSS should be able to reduce unnecessary biopsies as much as possible, while maintaining an appropriate sensitivity for malignancy. Several points need to be considered regarding this issue. First, the diagnostic performance of the RSS needs to be stratified according to nodule size. The strategy of a higher reduction rate of unnecessary biopsies despite a lower sensitivity of the biopsy criteria may be appropriate for small nodules (1-2 cm), considering the favorable prognosis of most small thyroid cancers. Meanwhile, the strategy of a higher sensitivity despite a lower reduction rate of unnecessary biopsies may be appropriate in large nodules (>2 cm), considering the higher risk of aggressive behavior in large malignant tumors [17]. Second, the appropriate sensitivity of the biopsy criteria for malignancy should be determined based on a careful consideration of the risks and benefits to the patients. The hazard of false-negative results poses a potential risk of increased morbidity and mortality due to missing malignant tumors, which may be mitigated by US surveillance in small thyroid cancers. The hazard of false-positive results is a risk of potential complications and increased cost due to the increased number of biopsies. However, it should be considered that US-guided FNA is a very safe procedure and the cost-effectiveness of biopsy versus US surveillance may be controversial [29].
Although the strategy of using strict biopsy criteria and US monitoring of nodules that do not meet the biopsy criteria has been adopted for small thyroid nodules, it is still uncertain whether US monitoring of nodule growth can effectively prevent the potential risk of nodal or distant metastases, because small PTCs may show macroscopic nodal metastases and small FTCs can, although rarely, show distant metastases. It should also be considered that there was no enlargement of the primary tumor in 11 of 12 low-risk papillary microcarcinomas (92%) that developed new lymph node metastasis during active surveillance [30].
Our study has several limitations. First, our study included only nodules for which US-guided biopsy had been performed, which may inevitably induce selection bias and underestimate the actual reduction rate of unnecessary biopsies. Second, the reference standards for benign and malignant diagnoses were based on the biopsy results and surgical histologic findings, meaning that rare false-negative or false-positive results may have been present. The malignancy risk of nodules might have been underestimated because many nodules were finally diagnosed based on a single benign FNA or CNB result. Third, our cohort database was generated at a single tertiary hospital. Further investigation in prospective multicenter studies will be necessary to validate the results of our study.
In conclusion, the modified K-TIRADS enables a high reduction rate of unnecessary biopsies, while maintaining a relatively high sensitivity and diagnostic accuracy for small malignant tumors compared to the K-TIRADS and other RSSs. Although the ACR TI-RADS has the strength of reducing unnecessary biopsies, it has a limitation of low sensitivity (less than 90%) for large malignant tumors, in contrast to the very high sensitivities of other RSSs. Further investigation and efforts should be made to reach a consensus on the appropriate sensitivity of the RSS for malignancy according to the nodule size.
Notes
Author Contributions
Conceptualization: Na DG. Data acquisition: Na DG, Paik W. Data analysis or interpretation: Na DG, Cha J, Kim SY, Gwon HY. Drafting of the manuscript: Na DG, Cha J. Critical revision of the manuscript: Na DG, Cha J, Yoo RE. Approval of the final version of the manuscript: all authors.
No potential conflict of interest relevant to this article was reported.
Acknowledgements
This research was supported by the Medical Research Promotion Program through the GangNeung Asan Hospital funded by the Asan Foundation (2020IC001). We thank Min Sun Kim for her assistance with the data analysis.