Assessing the diagnostic performance of thyroid biopsy with recommendations for appropriate interpretation
Abstract
Purpose
The diagnostic performance of thyroid biopsy is influenced by several factors, including differences in the Bethesda categorization for malignancy, the inclusion or exclusion of non-diagnostic results, the definition used for the final diagnosis, and the definition of an inconclusive diagnosis. The purpose of this study was to provide an understanding of the factors influencing the diagnostic performance of thyroid biopsy.
Methods
We collected data retrospectively between January and December 2013 from a cohort of 6,762 thyroid nodules from 6,493 consecutive patients who underwent biopsy. In total, 4,822 nodules from 4,553 patients were included. We calculated the biopsy sensitivity according to the inclusion of different Bethesda categories in the numerator and the exclusion of non-diagnostic results, as well as the diagnostic accuracy according to different definitions of a benign diagnosis. We obtained the conclusive and inconclusive diagnosis rates.
Results
The sensitivity increased when more Bethesda categories were included in the numerator and when non-diagnostic results were excluded. When a benign thyroid nodule diagnosis was defined as benign findings on surgical resection, concordant benign results on at least two occasions, or an initial benign biopsy result and follow-up for more than 12 months, the accuracy was higher than when the diagnosis was based on surgical resection alone (91.1% vs. 68.7%). A higher conclusive diagnosis rate was obtained when Bethesda categories I and III were considered inconclusive than when Bethesda categories I, III and IV were considered inconclusive (78.3% vs. 72.8%, P<0.001).
Conclusion
Understanding the concepts presented herein is important in order to appropriately interpret the diagnostic performance of thyroid biopsy.
Introduction
Ultrasound (US)-guided biopsy is widely used to detect thyroid cancer, with satisfactory diagnostic performance [1-7]. Although many studies have evaluated the diagnostic performance of thyroid biopsy, including fine needle aspiration (FNA) and core needle biopsy (CNB), these studies have used heterogeneous definitions of benign or inconclusive biopsy results, and there is a lack of consensus in the existing research on this topic. No studies have evaluated the fundamental factors affecting the interpretation of diagnostic performance. For example, a previous study [8] that considered Bethesda category III as positive, rather than as an indeterminate result, added Bethesda category III to Bethesda categories IV, V, and VI in the numerator of the diagnostic performance calculation, and the sensitivity of the biopsy marginally increased from 97.0% to 97.2%. Furthermore, studies differ in the definitions used for the final diagnosis. Two previous studies included surgical resection and clinical follow-up in the definition of the final diagnosis [9,10], whereas three previous studies [11-13] included only surgical resection. The impact of these unrecognized factors is especially large in studies comparing the diagnostic performance of FNA and CNB. A recently published paper [14] found that there was no benefit in performing CNB over FNA and that both had a comparable diagnostic performance. However, that study [14] did not intentionally exclude non-diagnostic results of FNA after using propensity score-matching. Despite the overwhelming number of published studies, we suggest that substantial variation exists in the interpretation of diagnostic performance across various studies and even within single studies.
The purpose of our study was to investigate the factors influencing the diagnostic performance of thyroid biopsy. Furthermore, we propose a recommendation for the appropriate interpretation of diagnostic performance.
Materials and Methods
Study Population
This retrospective study was approved by our institutional review board, and we received a waiver for informed written consent to use the data. The study population was obtained from 6,762 thyroid nodules from 6,493 consecutive patients who underwent biopsy between January and December 2013 at an academic tertiary referral hospital. Thyroid nodules in patients who had previously undergone biopsy (n=1,940) and 853 nodules without a final diagnosis were excluded. Finally, a total of 4,822 thyroid nodules with an initial biopsy from 4,553 patients were included in this study: 2,114 nodules from 1,928 patients who had undergone CNB and 2,708 nodules from 2,625 patients who had undergone FNA (Fig. 1). The study population has been analyzed in a previous study evaluating the efficacy and safety of CNB [15]. Whether to perform CNB or FNA was determined mainly according to the referring physician’s preference, and CNB was performed for calcified nodules or predominantly cystic nodules, for which FNA may be less effective [16,17].
US-Guided FNA and CNB Procedures
US images were obtained for the evaluation of thyroid nodules using either an HDI 5000 (ATL Ultrasound, Bothell, WA, USA) or a Sequoia (Acuson, Mountain View, CA, USA) instrument equipped with a 5-12 MHz or an 8-15 MHz linear-array transducer. All US-guided procedures were performed by radiologists under the supervision of two faculty radiologists (J.H.B. and J.H.L., with 19 and 14 years of clinical experience, respectively, in performing and evaluating thyroid US). The US-guided CNB and FNA procedures for thyroid nodules were performed according to current practice guidelines [5,9,18-23].
Histopathologic Analysis of CNB Specimens and Cytopathologic Analysis of FNA
All CNB specimens and FNA cytological analyses were reviewed by a thyroid cytopathologist (D.E.S., with 11 years of clinical experience in thyroid cytopathology). Although the CNB diagnostic criteria for thyroid nodules had not yet been standardized during our study period, the histologic results of CNB were categorized into the same six categories of the Bethesda system that is used in the analysis of FNA cytology, with the following six standardized options [9,19,21,24,25]: Category I (non-diagnostic) included the absence of any identifiable follicular thyroid tissue, presence of only the normal thyroid gland, and tissue containing only a few follicular cells insufficient for diagnosis. Category II (benign) included all benign thyroidal and nonthyroidal disease. Category III (indeterminate lesion) corresponded to atypia of undetermined significance or follicular lesion of undetermined significance. Category IV (follicular neoplasm or suspicious for a follicular neoplasm) encompassed neoplastic lesions with follicular proliferative patterns. A category V (suspicious for malignancy) diagnosis was given when histologic features were strongly suspicious for malignancy, but insufficient for a definite diagnosis of malignancy. A category VI (malignancy) diagnosis was given when the typical histologic features were diagnosed as malignancy on a histologic specimen. The FNA cytology diagnoses were categorized into six categories according to the Bethesda System for Reporting Thyroid Cytopathology [9,24,26,27].
Analysis of US Findings
The US images were independently reviewed by two radiologists (J.H.B. and S.M.H.). When analyzing the US images, the radiologists assessed the thyroid nodules using criteria obtained from published reports [28-32], including the size (≥1 cm or <1 cm), internal content (solid, predominantly solid, predominantly cystic, or cystic), shape (round to oval or irregular), orientation (parallel or nonparallel), margin (well-defined smooth, microlobulated or spiculated, or ill-defined), echogenicity of the solid portion (hyperechogenicity or isoechogenicity, or hypoechogenicity or marked hypoechogenicity), and the presence of microcalcifications, macrocalcifications, and/or rim calcifications. The relationship between the final diagnosis (malignancy based on histopathologic findings from surgical resection or biopsy) and malignant US findings was assessed. The suspicious US features included irregular shape, nonparallel orientation, spiculated/microlobulated margin, marked hypoechogenicity, and the presence of microcalcifications [20].
Statistical Analysis
A final diagnosis of malignancy was made based on histopathologic readings from surgical resections or biopsies. A benign diagnosis was made when one of the following conditions was fulfilled: a surgical diagnosis of benignity, concordant benign results after biopsy on at least two occasions, or an initial benign biopsy result with a reduced or stable size on US follow-up at least 12 months later. We combined the diagnostic results of FNA and CNB thyroid biopsies. The diagnostic performance was calculated according to the following four criteria with different Bethesda categorizations in the numerator: criterion 1, Bethesda category VI; criterion 2, Bethesda categories V and VI; criterion 3, Bethesda categories IV, V, and VI; and criterion 4, Bethesda categories III, IV, V, and VI (Supplementary Table 1). We calculated the diagnostic performance including the diagnostic accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). We also calculated diagnostic performance parameters in the same manner after excluding Bethesda category I from the thyroid nodules in the dataset. We also analyzed diagnostic performance according to nodule size (<1 cm or ≥1 cm), different definitions of a benign diagnosis (i.e., surgical resection, concordant benign results on at least two occasions, or an initial benign biopsy result and follow-up at least 12 months later vs. surgical resection only), and differences in tumor subtype: (1) conventional papillary thyroid cancer (PTC) only; (2) conventional PTC and follicular variant PTC (FVPTC) and follicular carcinoma (FC); and (3) non-conventional PTC, including FC, FVPTC, medullary carcinoma, anaplastic carcinoma, sarcoma, lymphoma, and metastasis.
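The four numerator criteria described above can be illustrated with a minimal sketch. The per-category counts below are hypothetical placeholders chosen only for illustration (they are not the values of Table 2), and the names COUNTS, CRITERIA, and performance are assumptions introduced here rather than part of the study's SPSS analysis.

```python
# Hypothetical counts per Bethesda category: (malignant, benign) on final diagnosis.
# These are illustrative placeholders, NOT the study's data.
COUNTS = {
    "I": (20, 80),
    "II": (30, 700),
    "III": (50, 100),
    "IV": (40, 60),
    "V": (110, 10),
    "VI": (750, 0),
}

# Bethesda categories counted as "positive for malignancy" under each criterion.
CRITERIA = {
    1: {"VI"},
    2: {"V", "VI"},
    3: {"IV", "V", "VI"},
    4: {"III", "IV", "V", "VI"},
}


def performance(counts, positive, exclude_nondiagnostic=False):
    """Build a 2x2 table from category counts and return the five metrics."""
    tp = fn = fp = tn = 0
    for category, (malignant, benign) in counts.items():
        if exclude_nondiagnostic and category == "I":
            continue  # drop non-diagnostic nodules from the dataset
        if category in positive:
            tp += malignant
            fp += benign
        else:
            fn += malignant
            tn += benign
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


if __name__ == "__main__":
    for k, positive in CRITERIA.items():
        full = performance(COUNTS, positive)
        no_cat1 = performance(COUNTS, positive, exclude_nondiagnostic=True)
        print(f"criterion {k}: sensitivity {full['sensitivity']:.3f} "
              f"(category I excluded: {no_cat1['sensitivity']:.3f})")
```

With any such counts, moving a category into the positive set can only convert false negatives into true positives, and dropping category I can only remove false negatives from the denominator, so the sensitivity cannot decrease in either case.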
Nodules were classified and the inconclusive diagnosis rate was compared according to two criteria: criterion 1 (Bethesda categories I and III) and criterion 2 (Bethesda categories I, III, and IV). The Student t-test was used for continuous variables. The chi-square test was used for comparisons of categorical variables. All tests were two-sided, and a significant P-value was defined as P<0.05. The statistical analysis was conducted using SPSS version 21.0 for Windows (IBM Corp., Armonk, NY, USA).
Results
The clinical and imaging characteristics of the nodules are shown in Table 1. In the FNA group, the mean size of the 2,708 nodules was 12.35±8.24 mm (range, 0.5 to 6.5 cm), with 54.4% (1,474 of 2,708) being ≥1 cm. In the CNB group, the mean size of the 2,114 nodules was 16.83±12.27 mm (range, 0.5 to 10 cm), with 70.0% (1,395 of 2,114) being ≥1 cm. The biopsy results and final diagnoses, according to the Bethesda category, are summarized in Table 2. The nodules (n=3,969) with a final diagnosis comprised 2,799 benign nodules and 1,170 malignant nodules (Supplementary Table 2).
Sensitivity for Malignancy
The final diagnoses of 3,969 thyroid nodules were included in the calculations of diagnostic performance (Table 3). The biopsy sensitivity was highest (94.8%) using criterion 4, and showed a steady increase moving from criterion 1 (69.7%) to criteria 2 (80.5%) and 3 (84.8%). With the exclusion of Bethesda category I (non-diagnostic) from the dataset, the sensitivity of all of the criteria increased: from 69.7% to 71.0% for criterion 1, from 80.5% to 82.0% for criterion 2, from 84.8% to 86.3% for criterion 3, and from 94.8% to 96.5% for criterion 4. When we assessed the diagnostic performance according to the nodule size, a higher sensitivity was obtained from biopsies of smaller nodules (<1 cm) than from biopsies of larger nodules (88.3% vs. 69.1% for criterion 2). Regarding tumor subtypes, a higher sensitivity and PPV were obtained when only conventional PTC was considered as a malignancy (Supplementary Table 3).
Diagnostic Accuracy According to Definitions of a Benign Diagnosis
We calculated the diagnostic performance according to different definitions of a benign diagnosis (Table 4). When a benign diagnosis was defined as a benign result upon surgical resection, concordant benign results on at least two occasions, or an initial benign biopsy result with follow-up at least 12 months later (a definition that included more benign nodules, but with no change in the number of malignant nodules), the specificity and NPV improved, and the accuracy was higher than that obtained with the stricter definition of benign findings upon surgical resection only (91.1% vs. 68.7%). The sensitivity and PPV remained similar regardless of the definition.
Conclusive Results
The conclusive diagnosis rate differed significantly according to the definition of an inconclusive diagnosis: it was 78.3% when Bethesda categories I and III were considered inconclusive (inconclusive rate, 21.7%; 1,047 of 4,822) and 72.8% when Bethesda categories I, III, and IV were considered inconclusive (inconclusive rate, 27.2%; 1,313 of 4,822) (P<0.001).
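The rates reported above follow directly from the stated counts. The short sketch below reproduces the arithmetic; the function and variable names are illustrative assumptions, and the P-value is not recomputed here.

```python
TOTAL_NODULES = 4822  # nodules included in the study

# Number of nodules classified as inconclusive under each definition (from the text).
INCONCLUSIVE = {
    "Bethesda I + III": 1047,
    "Bethesda I + III + IV": 1313,
}


def rates(inconclusive, total):
    """Return (inconclusive %, conclusive %) rounded to one decimal place."""
    fraction = inconclusive / total
    return round(100 * fraction, 1), round(100 * (1 - fraction), 1)


if __name__ == "__main__":
    for label, n in INCONCLUSIVE.items():
        inconclusive_pct, conclusive_pct = rates(n, TOTAL_NODULES)
        print(f"{label}: inconclusive {inconclusive_pct}%, conclusive {conclusive_pct}%")
```

The difference between the two counts (1,313 - 1,047 = 266) corresponds to the nodules that are moved from conclusive to inconclusive when Bethesda category IV is added to the definition.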
Discussion
Our study demonstrates several factors that may influence the diagnostic performance of thyroid nodule biopsy. The sensitivity increased when the numerator included more Bethesda categories and when nodules with non-diagnostic biopsy results were excluded from the dataset. Ideally, we recommend that Bethesda categories V and VI, or category VI alone, be considered positive for malignancy in the numerator, and that Bethesda category I nodules not be excluded from the dataset when interpreting diagnostic performance. The diagnostic accuracy increased when a benign diagnosis was defined as benign findings on surgical resection, concordant benign results on at least two occasions, or an initial benign biopsy result and follow-up for more than 12 months. When conducting a diagnostic accuracy study, we suggest generating lower and upper bound estimates for accuracy by using surgical resection alone or by including other biopsy and follow-up data as the definition for the final diagnosis. The rate of conclusive results was higher when we defined Bethesda categories I and III as inconclusive results than when we used the combination of Bethesda categories I, III, and IV. In our opinion, the inconclusive rate may include Bethesda categories I and III, as they are candidates for diagnostic surgery or repeat biopsy. Our study results will be helpful in understanding the results of various diagnostic performance studies of thyroid biopsy.
The sensitivity is influenced by the datasets assigned to the numerator or denominator. First, regarding the numerator, there is heterogeneity in the application of the Bethesda system for thyroid biopsy interpretation. In a previous study [8] that considered Bethesda category III as positive, rather than as an indeterminate result, adding Bethesda category III to Bethesda categories IV, V, and VI in the numerator marginally increased the sensitivity of thyroid FNA from 97.0% to 97.2%. Regarding the denominator, in a recent study by Choi et al. [14] comparing the diagnostic performance of thyroid biopsy procedures for detecting malignancy, excluding Bethesda category IV from the denominator increased the sensitivity of FNA and CNB from 93.8% to 94.0% and from 84.7% to 88.1%, respectively, and excluding Bethesda categories I, III, and IV from the denominator increased the sensitivity even further, to 99.8% and 99.1%, respectively. Accordingly, an unrealistically high sensitivity will be calculated when malignancies classified as Bethesda categories I, III, and IV are excluded from the denominator, because the number of false-negative results is reduced. We also observed sensitivity changes with the exclusion of Bethesda category I from the dataset. Therefore, if a study excludes the majority of non-diagnostic biopsy results from the analysis, the sensitivity will be biased, especially when comparing FNA and CNB, which have significantly different non-diagnostic result rates. Based on our analysis, counting Bethesda categories V and VI (or category VI alone) as positive for malignancy in the numerator and retaining Bethesda category I in the dataset appear to be the most appropriate conditions for diagnostic interpretation.
As the definition used for the final diagnosis can affect the results of diagnostic accuracy, an appropriate definition of the final diagnosis is critical. Two previous studies including surgical resection and clinical follow-up as the definition of the final diagnosis [9,10] showed higher accuracy than three studies [11-13] that defined the final diagnosis on the basis of surgical resection alone. Our results verified that when the definition of the final diagnosis was broader, with the inclusion of more benign thyroid nodules, the specificity and ensuing accuracy were higher than when a stricter definition was used, such as surgical resection alone. The sensitivity and PPV were maintained due to the absence of a change in the number of malignant nodules diagnosed in these different scenarios. Therefore, we suggest generating a range of lower and upper bound estimates of diagnostic accuracy corresponding to the use of surgical resection alone or surgical resection combined with biopsy and follow-up data as the definitions of the final diagnosis.
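The reasoning in the paragraph above can be written out explicitly. As a sketch, assume that the broader definition adds Δ benign nodules and that all of them had a benign (negative) index biopsy, so that they enter the 2×2 table only as true negatives; the symbols TP, FN, FP, TN, and Δ are notation introduced here for illustration.

```latex
\begin{aligned}
\text{Sensitivity} &= \frac{TP}{TP+FN}, \qquad
\text{PPV} = \frac{TP}{TP+FP}
  && \text{(unchanged: neither contains } TN\text{)} \\[4pt]
\text{Specificity} &: \frac{TN}{TN+FP} \;\longrightarrow\; \frac{TN+\Delta}{TN+\Delta+FP} \\[4pt]
\text{NPV} &: \frac{TN}{TN+FN} \;\longrightarrow\; \frac{TN+\Delta}{TN+\Delta+FN} \\[4pt]
\text{Accuracy} &: \frac{TP+TN}{TP+TN+FP+FN} \;\longrightarrow\; \frac{TP+TN+\Delta}{TP+TN+\Delta+FP+FN}
\end{aligned}
```

Because (a+Δ)/(b+Δ) > a/b whenever a < b and Δ > 0, each of the last three quantities rises as long as it was below 1 to begin with. Under this assumption, the estimate based on surgical resection alone acts as the lower bound and the estimate based on the broader definition as the upper bound.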
Several studies have applied the terms "conclusive" and "inconclusive" when comparing results across different biopsy procedures of thyroid nodules. A higher rate of conclusive biopsy results is favorable, as further unnecessary biopsies can then be minimized [29,33]. However, the definitions of these two terms are inconsistent. For example, in one study [34], inconclusive results included Bethesda categories I, III, and IV, whereas other studies [20,35] defined inconclusive results as including Bethesda categories I and III. If we simulate the previous findings of Suh et al. [20] by additionally classifying Bethesda category IV as an inconclusive result, the inconclusive result rate increases from 5.9% (Bethesda categories I and III) to 9.2%. The conclusive and inconclusive rates are complementary (they sum to 100%), so as one increases, the other decreases. The 2017 Bethesda system considers Bethesda categories IV, V, and VI to be conclusive results [26]. Therefore, we suggest that the most appropriate definition of inconclusive results would be Bethesda category I and III nodules.
Our study revealed other possible factors influencing the diagnostic performance of thyroid biopsy. The diagnostic performance may be influenced when conventional PTC prevails or the proportion of PTC among malignant tumors is relatively high in the patient population [36]. Regarding noninvasive follicular thyroid neoplasms with papillary-like nuclear features, which are included in the revised Bethesda system [26], we hypothesize that the diagnostic performance of procedures would be underestimated if this diagnosis occurs frequently, although we did not specifically evaluate this possibility. Nodule size is another possible contributor to bias: because FC and FVPTC are usually larger than conventional PTC, better sensitivity was observed in smaller nodules. This result is similar to that of a previous study concerning CNB, which found higher sensitivity in small nodules, with a greater proportion of conventional PTC [20]. Therefore, when we interpret the diagnostic performance, we should consider the proportion of conventional PTCs and the tumor size in the cohort.
As another important factor affecting the diagnostic performance of thyroid biopsy, the proportion of repeated biopsies of nodules with previous inconclusive diagnostic results should be considered and matched in the patient population in order to obtain the optimal comparison of thyroid biopsy procedures using different patient populations. The repeated biopsy of nodules with prior inconclusive results generally yields a higher rate of repeated inconclusive results and a lower diagnostic sensitivity for malignancy compared to the initial biopsy results [7,37], which may cause a biased comparison if it is not matched between two populations.
The major limitation of our study is the possibility of selection bias arising from its retrospective design and from the way patients were selected. Our study should be interpreted with some reservations because of the possibility of selection bias towards suspicious nodules owing to the usage of US for the CNB group, and because the biopsy procedure was determined according to the referring physician’s preference. However, our large study population may compensate for this selection bias. As mentioned above, the proportion of repeated biopsies of nodules with previous inconclusive diagnostic results should be considered, which we did not investigate. Future research with a high-volume dataset including either FNA- or CNB-diagnosed thyroid nodules would be beneficial to minimize these limitations. In addition, this study was carried out at a single institution, and its generalizability therefore needs to be confirmed in future multicenter studies. Lastly, most of the benign nodules were not confirmed with surgery.
In conclusion, this study suggests some factors that may influence the diagnostic performance of thyroid biopsy. Understanding these concepts is important for a more critical and appropriate interpretation of diagnostic performance.
Notes
Author Contributions
Conceptualization: Baek JH, Ha SM. Data acquisition: Baek JH, Ha SM, Suh CH, Shong YK, Sung TY, Song DE. Data analysis or interpretation: Ha SM. Drafting of the manuscript: Baek JH, Jung CK, Lee JH. Critical revision of the manuscript: Baek JH, Na DG. Approval of the final version of the manuscript: all authors.
No potential conflict of interest relevant to this article was reported.