Abstract:
This paper proposes a ConvNeXt backbone-based model for image captioning. Specifically, ConvNeXt, a state-of-the-art convolutional computer vision architecture, is used as the encoder and integrated with a long short-term memory (LSTM) decoder that incorporates a visual attention module. Several experiments were conducted to evaluate the suitability of ConvNeXt for this task. First, the impact of using four versions of ConvNeXt for feature extraction was studied. Additionally, two different learning rates were tested when training the encoder to analyze their impact on performance.
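The encoder-decoder arrangement described above can be illustrated with a short, non-authoritative sketch. The PyTorch snippet below is not the authors' implementation; the ConvNeXt-Tiny backbone, layer names, and dimensions are assumptions made only to show how ConvNeXt grid features can feed an LSTM decoder with a soft visual attention module.

```python
import torch
import torch.nn as nn
from torchvision.models import convnext_tiny, ConvNeXt_Tiny_Weights


class ConvNeXtEncoder(nn.Module):
    """ConvNeXt-Tiny backbone used as a grid-feature extractor (assumed variant)."""
    def __init__(self):
        super().__init__()
        backbone = convnext_tiny(weights=ConvNeXt_Tiny_Weights.DEFAULT)
        self.features = backbone.features           # spatial feature maps, 768 channels

    def forward(self, images):                      # images: (B, 3, 224, 224)
        fmap = self.features(images)                # (B, 768, 7, 7)
        return fmap.flatten(2).transpose(1, 2)      # (B, 49, 768): one vector per region


class AttentionLSTMDecoder(nn.Module):
    """LSTM decoder with additive (soft) attention over the image regions."""
    def __init__(self, vocab_size, feat_dim=768, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Sequential(
            nn.Linear(feat_dim + hidden_dim, 256), nn.Tanh(), nn.Linear(256, 1))
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, token, feats, h, c):
        # Score each region against the current hidden state, then build a context vector.
        expanded = h.unsqueeze(1).expand(-1, feats.size(1), -1)
        scores = self.attn(torch.cat([feats, expanded], dim=-1))      # (B, 49, 1)
        context = (scores.softmax(dim=1) * feats).sum(dim=1)          # (B, feat_dim)
        h, c = self.lstm(torch.cat([self.embed(token), context], dim=-1), (h, c))
        return self.out(h), h, c                                      # word logits, new state
```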
Furthermore, the effect of including or excluding teacher forcing at the decoder during training was analyzed. The MS COCO 2014 dataset was used, and loss, top-5 accuracy, and BLEU-n were adopted as performance metrics. The results show that our proposed model outperforms the soft-attention and hard-attention baselines by 43.04% and 39.04%, respectively, in terms of BLEU-4. It also surpasses equivalent approaches based on vision transformers and data-efficient image transformers by 4.57% and 0.93%, respectively, in terms of BLEU-4. Moreover, it outperforms alternatives that use ResNet-101, ResNet-152, VGG-16, ResNeXt-101, and MobileNet V3 encoders by 6.44%, 6.46%, 6.47%, 6.39%, and 6.68%, respectively, in terms of top-5 accuracy, and by 18.46%, 18.44%, 18.46%, 18.24%, and 18.72%, respectively, in terms of loss.
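Continuing the sketch above, the following snippet illustrates the teacher-forcing choice studied during training: with teacher forcing the decoder receives the ground-truth previous word at each step, and without it the decoder feeds back its own prediction. The function name, hidden size, and caption layout are illustrative assumptions, not the paper's code.

```python
import torch


def decode_sequence(decoder, feats, captions, teacher_forcing=True):
    """Unroll the decoder over a caption; `captions` holds ground-truth word indices."""
    B, T = captions.shape
    h = feats.new_zeros(B, 512)                     # hidden/cell size matches the decoder above
    c = feats.new_zeros(B, 512)
    token = captions[:, 0]                          # <start> token
    step_logits = []
    for t in range(1, T):
        logits, h, c = decoder.step(token, feats, h, c)
        step_logits.append(logits)
        if teacher_forcing:
            token = captions[:, t]                  # feed the ground-truth previous word
        else:
            token = logits.argmax(dim=-1)           # feed the model's own prediction
    return torch.stack(step_logits, dim=1)          # (B, T - 1, vocab_size) for the loss
```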
Keywords: image captioning, ConvNeXt, computer vision, natural language processing, deep learning, artificial intelligence