
Development of a novel artificial intelligence algorithm to detect pulmonary nodules on chest radiography

Mitsunori Higuchi, Takeshi Nagata, Kohei Iwabuchi, Akira Sano, Hidemasa Maekawa, Takayuki Idaka, Manabu Yamasaki, Chihiro Seko, Atsushi Sato, Junzo Suzuki, Yoshiyuki Anzai, Takashi Yabuki, Takuro Saito, Hiroyuki Suzuki

Author information

Mitsunori Higuchi
Department of Thoracic Surgery, Aizu Medical Center, Fukushima Medical University

Takeshi Nagata
University of Tsukuba School of Integrative and Global Majors
Mizuho Research and Technologies, Ltd.

Kohei Iwabuchi
Mizuho Research and Technologies, Ltd.

Akira Sano
Mizuho Research and Technologies, Ltd.

Hidemasa Maekawa
Mizuho Research and Technologies, Ltd.

Takayuki Idaka
Mizuho Research and Technologies, Ltd.

Manabu Yamasaki
Mizuho Research and Technologies, Ltd.

Chihiro Seko
Mizuho Research and Technologies, Ltd.

Atsushi Sato
Fukushima Preservative Service Association of Health

Junzo Suzuki
Fukushima Preservative Service Association of Health

Yoshiyuki Anzai
Aizuwakamatsu Medical Association

Takashi Yabuki
Aizuwakamatsu Medical Association

Takuro Saito
Department of Surgery, Aizu Medical Center, Fukushima Medical University

Hiroyuki Suzuki
Department of Chest Surgery, Fukushima Medical University School of Medicine

Introduction

Recent advances in deep learning and large datasets have enabled algorithms to surpass the performance of medical professionals in a wide variety of medical imaging tasks, including imaging for diabetic retinopathy¹⁾ and hemorrhage identification²⁾. Lung cancer is the leading cause of cancer-related death worldwide³⁾, and its control is therefore an urgent problem. Early detection of lung cancer is extremely important, and some clinical trials, including the National Lung Screening Trial⁴⁾ and the NELSON trial⁵⁾, have been performed with low-dose computed tomography (CT). Despite the superior ability of CT to detect pulmonary nodules, chest radiography is still widely accepted as the first-line imaging tool to screen for and detect lung lesions⁶⁻⁸⁾. Pulmonary nodules are common initial radiologic manifestations of lung cancer; however, they can be easily missed when they are subtle, small, or localized to difficult areas. Pulmonary nodule detection by chest radiography has been the focus of several computer-aided detection (CAD) studies in recent decades⁹,¹⁰⁾. However, early solutions were limited by low sensitivity and high false-positive rates. In this study, we developed a novel AI algorithm and assessed its ability to detect pulmonary nodules on chest radiography at different levels of detection difficulty with both normal and abnormal control images. We found that the AI algorithm exceeded the average radiologist performance for pulmonary nodule detection. We suggest that automated detection of diseases based on chest radiographs at the level of expert radiologists would confer tremendous benefit in the clinical setting.

Materials and methods

Program formulation

We used the CheXNet model¹¹⁾, which is a 121-layer convolutional neural network that inputs a chest X-ray image and outputs the probability of pulmonary nodules, and produces a heatmap localizing the areas of the image that are most indicative of pulmonary nodules. We trained the CheXNet model using the National Institutes of Health (NIH) Chest X-ray 14 dataset (Bethesda, MD), which contains 112,120 frontal-view chest X-ray images individually labeled with up to 14 different thoracic diseases, including atelectasis, cardiomegaly, consolidation, edema, effusion, emphysema, fibrosis, hernia, infiltration, mass, nodule, pleural thickening, pneumonia, and pneumothorax. From the NIH Chest X-ray 14 dataset, the data of 2,500 nodules (positive data) and 2,500 normal records (negative data) were used in this study.

Each output was normalized with a sigmoid function to [0,1]. The network was initialized with a model pre-trained on ImageNet¹²⁾. First, we focused on the NIH Chest X-ray 14 dataset. The labels consisted of a C-dimensional vector [l₁, l₂, …, l_C], where C = 14, with binary values representing either the absence (0) or presence (1) of a pathology. Because this is a multi-label problem, we treated all labels independently during classification by defining C binary cross-entropy loss functions. As the dataset was highly imbalanced, we incorporated additional weights within the loss function based on the label frequency within each batch:

L(X, l_n) = −(w_p · l_n · log(p) + w_n · (1 − l_n) · log(1 − p)),

where p is the predicted probability for the label, w_p = (P_n + N_n) ÷ P_n, and w_n = (P_n + N_n) ÷ N_n, with P_n and N_n indicating the number of samples in the batch with presence and absence of nodules, respectively.
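
The weighted loss above can be illustrated with a minimal Python sketch for a single label. This is not the authors' implementation: the function names, the averaging over the batch, the clamping constant `eps`, and the error raised when a batch lacks one of the two classes are all assumptions made for illustration.

```python
import math


def class_weights(num_pos, num_neg):
    """Return (w_p, w_n) from the positive/negative counts in one batch."""
    total = num_pos + num_neg
    return total / num_pos, total / num_neg


def weighted_bce(probs, labels, eps=1e-7):
    """Mean weighted binary cross-entropy over one batch for one label.

    probs  -- predicted probabilities p in [0, 1] (sigmoid outputs)
    labels -- ground-truth labels l_n, each 0 or 1
    eps    -- assumed clamp that keeps log() away from 0
    """
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    if num_pos == 0 or num_neg == 0:
        # The weights are undefined if a class is missing from the batch.
        raise ValueError("batch needs at least one positive and one negative")
    w_p, w_n = class_weights(num_pos, num_neg)
    total = 0.0
    for p, l in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(w_p * l * math.log(p) + w_n * (1 - l) * math.log(1.0 - p))
    return total / len(labels)
```

With one positive and three negatives in a batch, `class_weights(1, 3)` gives a positive weight of 4 and a negative weight of 4/3, so an error on the rarer class contributes more to the loss.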

The network was trained end-to-end using Adam optimization with standard parameters (β₁ = 0.9 and β₂ = 0.999) and minibatches of size 16. The initial learning rate of 0.00001 was decreased by a factor of 10 each time the validation loss plateaued after an epoch, and we selected the model with the lowest validation loss.
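
The learning-rate schedule and model selection rule can be sketched in plain Python by replaying a list of per-epoch validation losses. This is only an illustration: the function name `run_schedule` is hypothetical, and treating a single epoch without improvement as a plateau (`patience=1`) is an assumption, since the exact plateau criterion is not specified here.

```python
def run_schedule(val_losses, initial_lr=1e-5, factor=0.1, patience=1):
    """Replay validation losses and return (lrs, best_epoch).

    lrs[i] is the learning rate in effect during epoch i.
    best_epoch is the index of the lowest validation loss, i.e. the
    checkpoint that would be selected.
    """
    lr = initial_lr
    best_loss = float("inf")
    best_epoch = -1
    stale = 0  # consecutive epochs without improvement
    lrs = []
    for epoch, loss in enumerate(val_losses):
        lrs.append(lr)
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor  # decrease by a factor of 10
                stale = 0
    return lrs, best_epoch
```

For example, with validation losses `[0.70, 0.65, 0.66, 0.64, 0.64]` the rate stays at 0.00001 for the first three epochs, drops to 0.000001 after the loss fails to improve in the third epoch, and the fourth epoch (index 3) is the selected checkpoint.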

This study was approved by the Institutional Review Board of Fukushima Medical University (IRB-ID: 30290), which is guided by local policy, national law, and the World Medical Association Declaration of Helsinki. The chest radiographs used in this study were acquired by the Fukushima Preservative Service Association of Health in the course of its daily practice, where local lung cancer screening is mainly conducted. The need for written informed consent to use anonymized data was waived by the ethics review board. This study was supported by Grants-in-Aid for Scientific Research in Japan (ID: 21K08890).

Data training

We used 800 chest X-ray images (400 normal images [negative data] and 400 images with pulmonary nodules [positive data]) from the Fukushima Preservative Service Association of Health, as well as 5,000 chest radiographs from the NIH Chest X-ray 14 dataset, as teacher data for analyzing image features. The labelling for pulmonary nodules of the Fukushima dataset was assured by a process indicator which guaranteed the accuracy of pulmonary nodule detection in lung cancer screening. We categorized these data into two types: type A included both the Fukushima and NIH datasets, and type B included only the Fukushima dataset. Then, we integrated the datasets for deep learning and convolutional neural network analyses using ImageNet to develop the proprietary AI algorithm. We then statistically analyzed the accuracy of radiograph interpretation. For cross-validation, we randomly divided the dataset into five groups, each of which included positive and negative data at equal rates. We then used one group as test data and the other groups as training data, assigning each group as test data in turn to obtain five sets of results (Figure 1). Finally, we averaged the accuracy over the five sets. We compared the receiver operating characteristic (ROC) area under the curve (AUC) of the AI model with values reported previously¹¹,¹³,¹⁴⁾. We also report the accuracy of radiologists’ evaluations as described on the website of L PIXEL Inc., Tokyo¹⁵⁾, which also describes the method used by the radiologists to evaluate pulmonary nodules. Nine radiologists participated in an evaluation test that included 67 radiographs with pulmonary nodules and 253 normal radiographs.

Fig. 1. Schematic view of cross-validation. We randomly divided the dataset into five groups that included positive and negative data at equal rates. We then used one group as test data and the other groups as training data, assigning each group as test data in turn to obtain five sets of results (Results 1-5). Finally, we averaged the accuracy over the five sets.
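
The five-group cross-validation scheme can be sketched as follows. This is an illustrative reconstruction rather than the code used in the study: the function names, the fixed random seed, and the round-robin dealing of shuffled indices are assumptions chosen so that each group keeps positive and negative data at equal rates.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k groups with equal class rates per group."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal shuffled indices round-robin
    return folds


def cross_validation_splits(labels, k=5, seed=0):
    """Yield (train_indices, test_indices), using each group once as test data."""
    folds = stratified_folds(labels, k, seed)
    for t in range(k):
        train = [i for f, fold in enumerate(folds) if f != t for i in fold]
        yield train, folds[t]


# Example with the size of the Fukushima dataset: 400 positive, 400 negative.
labels = [1] * 400 + [0] * 400
for train_idx, test_idx in cross_validation_splits(labels):
    positives = sum(labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), positives)
```

Each of the five iterations trains on 640 images and tests on 160, of which 80 are positive; the per-group metrics would then be averaged.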

Model interpretation

We displayed pulmonary nodules as a heatmap on each chest radiograph for easy visualization, and we presented the positive probability score as an index value (0.0-1.0) indicating the possibility of pulmonary nodules. The heatmaps were generated using class activation maps (CAMs)¹⁶⁾. To generate the CAMs, we fed an image into the fully trained network and extracted the feature maps that were output by the final convolutional layer. With f_k as the kth feature map and w_c,k as the weight in the final classification layer for feature map k leading to pulmonary nodules, we obtained a map M_c of the most salient features used to classify the images as having pulmonary nodules by taking the weighted sum of the feature maps using their associated weights. The equation is as follows:

M_c = Σ_k w_c,k · f_k

We identified the most important features used by the model to predict the presence of pulmonary nodules by upscaling the map M_c to the dimensions of the image and overlaying it on the image. Our novel AI system underwent machine learning of training data, which were obtained using the same radiographic apparatus to eliminate the effects of differences between equipment.
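
The weighted sum M_c = Σ_k w_c,k · f_k and the subsequent upscaling can be sketched in plain Python with feature maps held as nested lists. The function names are hypothetical, and nearest-neighbour upscaling is an assumption made for simplicity; the interpolation method actually used is not stated.

```python
def class_activation_map(feature_maps, weights):
    """Return M_c, the weighted sum of equally sized 2D feature maps.

    feature_maps -- list of K maps f_k, each a list of rows
    weights      -- list of K weights w_c,k for the nodule class
    """
    if len(feature_maps) != len(weights):
        raise ValueError("need exactly one weight per feature map")
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * cols for _ in range(rows)]
    for f_k, w_k in zip(feature_maps, weights):
        for r in range(rows):
            for c in range(cols):
                cam[r][c] += w_k * f_k[r][c]
    return cam


def upscale_nearest(cam, height, width):
    """Resize the map to the image dimensions by nearest-neighbour sampling."""
    rows, cols = len(cam), len(cam[0])
    return [
        [cam[r * rows // height][c * cols // width] for c in range(width)]
        for r in range(height)
    ]


# Two toy 2x2 feature maps, combined and upscaled to a 4x4 "image".
maps = [[[1, 0], [0, 0]], [[0, 0], [0, 2]]]
heatmap = upscale_nearest(class_activation_map(maps, [0.5, 1.0]), 4, 4)
```

In this toy case the top-left and bottom-right quadrants of `heatmap` carry the values 0.5 and 2.0, so the bottom-right region would be rendered as the most salient when overlaid on the radiograph.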

Statistical analysis

The data are described as median and range for continuous variables and as percentages with 95% confidence intervals for quantitative variables. Statistical analyses were performed using SPSS 28.0.1.0 software (IBM Corp., Armonk, NY).

Results

AUC, sensitivity, and specificity

The AUC, sensitivity, and specificity of each of the five groups in both the type A and type B datasets were calculated by cross-validation (Table 1 and Table 2). The ROC curves of both the type A and B datasets are shown in Figure 2. Our novel AI approach demonstrated an AUC of 0.74, a sensitivity of 0.75, and a specificity of 0.60 for the type A dataset. The respective values for the type B dataset were 0.79, 0.72, and 0.74. The AI algorithm used a positive probability cutoff value of 0.5. With both the type A and B datasets, the AUC of the AI algorithm was higher than that of radiologists (AUC 0.71) and either superior or comparable to those in previous reports¹²,¹⁴⁾. Radiologist detection of pulmonary nodules demonstrated an AUC of 0.7173 ± 0.0344, a sensitivity of 0.4710 ± 0.0611, and a specificity of 0.9635 ± 0.0198¹⁵⁾. These data are shown in Table 3. Table 4 compares our results with those of previous reports. Overall, our novel AI algorithm (using both the type A and type B datasets) yielded AUC values comparable or superior to those of previous reports.
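
For readers who want to see how these three metrics relate to the positive probability scores, the following sketch computes sensitivity and specificity at a cutoff of 0.5 and the AUC from scores and labels. It is not the authors' evaluation code: the function names and the toy scores are invented for illustration, and the AUC is computed with the pairwise (Mann-Whitney) formulation, which is equivalent to the area under the ROC curve.

```python
def sensitivity_specificity(scores, labels, cutoff=0.5):
    """Return (sensitivity, specificity); a score >= cutoff counts as positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, l in pairs if l == 1 and s >= cutoff)
    fn = sum(1 for s, l in pairs if l == 1 and s < cutoff)
    tn = sum(1 for s, l in pairs if l == 0 and s < cutoff)
    fp = sum(1 for s, l in pairs if l == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(scores, labels):
    """AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data: three nodule-positive and three normal radiographs.
scores = [0.9, 0.6, 0.4, 0.7, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
sens, spec = sensitivity_specificity(scores, labels)
auc = roc_auc(scores, labels)
```

On the toy data, two of three positives and two of three negatives are classified correctly at the 0.5 cutoff, and seven of the nine positive-negative pairs are ranked correctly. Sensitivity and specificity depend on the chosen cutoff, whereas the AUC does not.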

Table 1. Accuracy of each result with the type A dataset after cross-validation

Table 2. Accuracy of each result with the type B dataset after cross-validation

Fig. 2. Receiver operating characteristic (ROC) curves for the type A (a) and type B (b) datasets.

Table 3. Comparison of the accuracy of radiologist detection with AI algorithm detection

Table 4. Comparison of the AUC value of the AI algorithms with previous reports

Visual and numerical demonstrations

The heatmaps displayed on the monitor screen clearly corresponded to the location of pulmonary nodules in roentgenograms that contained pulmonary nodules (Figure 3). Each heatmap display indicated the location of pulmonary nodules, with the exception of Figure 3-c-2, which shows a false-positive result. We also evaluated the possibility of chest nodules as a positive probability score (Figure 3). Here, we set the cutoff value at 0.5, with a value of ≥0.5 suggesting a positive finding. However, the positive probability score in Figure 3-c-2 corresponded to a false-positive result.

Fig. 3. Three examples from datasets of a health examination center in this study. The proposed AI algorithm correctly detected pulmonary nodules and localized the areas in the image that were most indicative of pulmonary nodules (a-1, a-2). The AI algorithm also detected pulmonary nodules that were missed by the physicians (b-1, b-2). A false-positive display is shown in c-1 and c-2, which requires improvement. The positive probability scores of these cases are shown in a-2, b-2, and c-2, respectively.

Discussion

Chest radiography remains the primary diagnostic imaging modality for thoracic conditions because of its advantages over chest CT, including easier access, lower cost, and lower radiation exposure. However, previous studies have shown that 19%-26% of lung cancers that are visible on chest radiography are actually missed at the time of initial reading¹⁷,¹⁸⁾, and low-dose CT as opposed to chest radiography is thus recommended for lung cancer detection¹⁹,²⁰⁾. Resolving missed abnormal nodules or masses on chest radiography is an urgent problem for both physicians and patients. To date, many studies have reported the development of AI algorithms to read CT and radiography images, and some of these AI algorithms have already been put to clinical use.

In this study, we established two novel AI algorithms (type A and type B), which differed in whether or not they included the NIH Chest X-ray 14 dataset in addition to the 800 chest radiographs from the Fukushima Preservative Service Association of Health. The purpose of developing two types of algorithms was to confirm the accuracy of AI derived from a small number of radiographs. The novel AI algorithms, especially the one derived from our small type B dataset, were associated with improvements in the AUC and sensitivity of pulmonary nodule detection on chest radiographs compared with the respective values for nodule detection by radiologists, as reported in previous studies¹⁶⁾. However, the specificities of the AI algorithms were inferior to that of radiologist detection¹⁶⁾. Chest radiography is used first to screen for thoracic diseases; this is followed by conventional chest CT, positron emission tomography CT, magnetic resonance imaging, and/or other imaging modalities. Therefore, over-detection (false-positive results) of pulmonary nodules on chest radiography is not a crucial problem. In this study, type B yielded better accuracy than type A, which received pre-training with 5,000 radiographs from the NIH Chest X-ray 14 dataset. In general, massive teaching data, such as the NIH Chest X-ray 14 dataset, are required to achieve high accuracy by deep learning. However, the results with the type A dataset were inferior to those with the type B dataset. This may have been because the Chest X-ray 14 dataset included images from various types of radiography devices, with imaging environments and examinee postures that may have differed. In contrast, our type B dataset, while much smaller than the NIH dataset, is largely derived from regular, standardized health screening that is conducted in Fukushima as part of Japan’s system of universal health care. AI developers must be mindful of such variations. However, even with such variables, we were able to establish novel AI algorithms that demonstrated comparable or superior accuracy to those reported previously¹³,¹⁴⁾.

The present study has several limitations that should be noted. First, this study used data with “nodule-positive radiographs” and “normal radiographs” rather than “nodule-negative radiographs” obtained from two datasets. Case-control methods for diagnostic accuracy studies could lead to overestimation with respect to sensitivity and specificity. We compared the AUC-ROC with those of previous studies which examined not only pulmonary nodules but also other pulmonary abnormalities using AI algorithms. Therefore, the AUCs of the current study could be on par with, or surpass, those of previous studies. In real-world situations, chest radiographs can possibly have more than one abnormality. Since clinicians have to detect all abnormalities, including pulmonary nodules, it is still not very clear whether or not an AI algorithm that focuses only on pulmonary nodules is useful. Second, the sample size was small, and images obtained from patients with lung cancer were lacking. We are now planning data collection to resolve these data insufficiencies, especially in terms of obtaining chest radiographs from patients with lung cancer as positive data. Third, we did not use technology that accounted for differences in radiograph apparatus and imaging environments. Therefore, we had to resolve these problems by standardizing the differences. We need to collect more high-quality data from various institutions to overcome this limitation. Fourth, the accuracy of the novel AI algorithm was evaluated using the NIH Chest X-ray 14 dataset and 800 chest radiographs obtained from the Fukushima Preservative Service Association of Health. We did not assess whether this AI algorithm might improve the accuracy of pulmonary nodule detection by chest radiography if physicians used it for CAD. Therefore, the use of the AI algorithm in CAD should be validated in the future. To evaluate the capability of CAD to assist physicians, a reader performance test comparing physician performance before and after the use of CAD will be conducted. CAD is certified as medical software for use by physicians as a second opinion. However, our ultimate goal is to achieve an autonomous AI algorithm to detect pulmonary nodules on chest radiographs, even though we will need to overcome several obstacles, including technical and legal issues, among others. The first fully autonomous AI algorithm that is able to perform diagnostic assessment without the supervision of an expert clinician is the IDx-DR AI system, which is used to analyze fundus photographs in the primary care setting to detect diabetic retinopathy. The IDx-DR AI system was approved in 2018 by the Food and Drug Administration²¹⁾. An autonomous AI algorithm should include additional components that help to guarantee the robustness of the AI output.

In conclusion, we developed an AI algorithm to detect pulmonary nodules from frontal-view chest radiographs with an accuracy that exceeds that of radiologists. We hope this technology can improve healthcare delivery and increase access to medical imaging expertise in parts of the world where access to skilled radiologists is limited. Our system also has the potential to support the current high-throughput reading workflow of radiologists by enabling them to gain more confidence in using AI systems to obtain a second opinion. We are now in the process of performing various types of validation to improve the accuracy and to achieve autonomous diagnosis.

Acknowledgements

We thank Emily Woodhouse, PhD, from Edanz (https://jp.edanz.com/ac) for editing a draft of this manuscript.

Conflict of interest disclosure

The authors have no conflicts of interest to declare.
