Abstract
<jats:title>Summary</jats:title> <jats:sec> <jats:title>Background</jats:title> <jats:p>Artificial intelligence (AI) algorithms have recently achieved high accuracy in diagnosing skin cancer from dermoscopic images. This study compared the diagnostic performance of the large language model ChatGPT‐4 with that of specialized convolutional neural network (CNN)‐based models in analyzing melanocytic lesions.</jats:p> </jats:sec> <jats:sec> <jats:title>Patients and Methods</jats:title> <jats:p>A cross‐sectional comparative study was conducted on 117 dermoscopic images. ChatGPT‐4 was assessed under two conditions: diagnosing lesions directly without annotations and diagnosing after annotating dermoscopic features. Its results were compared with those of CNN‐based models (YPSONO and ResNet) and human expert evaluations. Confusion matrices, diagnostic accuracy, sensitivity, specificity, and interobserver agreement (Cohen's Kappa) were calculated for all models.</jats:p> </jats:sec> <jats:sec> <jats:title>Results</jats:title> <jats:p>In direct diagnosis, ChatGPT‐4 achieved 92% sensitivity, 89% specificity, and 89.7% accuracy. When annotations were required, sensitivity and specificity dropped to 68% and 64%, respectively. Agreement with experts on dermoscopic patterns was minimal (Cohen's Kappa = 0.13). ChatGPT‐4 outperformed the CNN models in direct diagnosis but showed notable limitations in describing dermoscopic features.</jats:p> </jats:sec> <jats:sec> <jats:title>Conclusions</jats:title> <jats:p>ChatGPT‐4 demonstrated promising potential for accurate melanoma‐versus‐nevus classification without annotations, surpassing the CNN‐based models. However, its limited ability to describe dermoscopic features accurately highlights the need for further research and training.</jats:p> </jats:sec>
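The metrics reported above (sensitivity, specificity, accuracy, and Cohen's Kappa) all derive from simple count-based formulas. The sketch below shows how they are computed from a binary melanoma-versus-nevus confusion matrix; the counts used are hypothetical placeholders, since the abstract does not report the study's per-class breakdown of the 117 images.

```python
# Sketch of the metrics named in the abstract, computed from a binary
# (melanoma vs. nevus) confusion matrix. All counts here are hypothetical
# placeholders, not the study's actual data.

def metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # true-positive rate
    specificity = tn / (tn + fp)                 # true-negative rate
    accuracy = (tp + tn) / (tp + fn + fp + tn)   # overall correct fraction
    return sensitivity, specificity, accuracy

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n)
                   for l in labels)
    return (observed - expected) / (1 - expected)

if __name__ == "__main__":
    # Hypothetical counts summing to 117 images, for illustration only.
    sens, spec, acc = metrics(tp=46, fn=4, fp=7, tn=60)
    print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.3f}")
```

A Kappa near 0 (such as the reported 0.13) means agreement barely exceeds what random labeling with the same marginal frequencies would produce, which is why it is described as minimal despite possibly high raw percent agreement.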