Abstract
In this article, the authors introduce a novel multimodal approach that harnesses large language models (LLMs) to enrich waste classification with semantic insights. The method generates descriptive prompts tailored to waste imagery, which are then used to infer semantic attributes relevant to the classification task. These descriptions drive a transformer-based architecture that bridges the textual and visual domains, enabling the model to interpret waste images with enhanced precision. They present the first multimodal waste classification model that leverages LLM-generated textual descriptors alongside visual features. Extensive experiments show that the approach outperforms existing models, achieving an accuracy improvement of up to 62.20%. A comprehensive suite of ablation studies further underscores the method's efficacy and robustness, confirming its potential to advance waste management by integrating the complementary strengths of image and textual data.
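To make the fusion idea concrete, the sketch below combines pre-extracted visual features with an embedding of the LLM-generated description through a small transformer layer before classification. It is a minimal illustration only: the module names, feature dimensions, class count, and single-layer fusion scheme are assumptions for exposition, not the authors' implementation.

```python
# Minimal sketch of LLM-description + image fusion for waste classification.
# All names, dimensions, and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn

class MultimodalWasteClassifier(nn.Module):
    def __init__(self, img_dim=768, txt_dim=768, hidden=512, num_classes=6):
        super().__init__()
        # Project image and text features into a shared embedding space.
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        # A single transformer encoder layer attends across the two modalities.
        self.fuser = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=8, batch_first=True
        )
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, img_dim) features from any image backbone.
        # txt_feats: (B, txt_dim) embedding of the LLM-generated description.
        tokens = torch.stack(
            [self.img_proj(img_feats), self.txt_proj(txt_feats)], dim=1
        )  # (B, 2, hidden): one token per modality.
        fused = self.fuser(tokens).mean(dim=1)  # pool the fused tokens.
        return self.head(fused)  # class logits.

# Example usage with random stand-in features.
model = MultimodalWasteClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 6])
```

Treating each modality as a token lets self-attention learn how much weight to give the textual descriptor versus the visual evidence for a given image, which is one simple way to realize the text-visual bridging described above.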