Application of Hybrid CNN-LSTM Architecture with Optuna Optimization for Weather Image Captioning

Authors

  • Sulaeman Salasa, Institut Teknologi Sepuluh Nopember
  • Shintami Chusnul Hidayati, Institut Teknologi Sepuluh Nopember
  • Muhamad Hilmil Muchtar Aditya Pradana, Institut Teknologi Sepuluh Nopember

Keywords:

Image Captioning, Weather, ResNet101, VGG16, LSTM, Optuna

Abstract

Automating the description of weather phenomena from visual imagery is a crucial step toward efficient meteorological monitoring systems. This study compares the performance of two deep learning architectures, ResNet101-LSTM and VGG16-LSTM, in generating automatic captions for images of various weather conditions. Visual features are extracted with the ResNet101 (residual learning) and VGG16 backbones and subsequently processed by Long Short-Term Memory (LSTM) units for text generation. Hyperparameters were optimized with the Optuna framework to ensure both models were trained under their best-found configurations. The results indicate that ResNet101-LSTM provides superior linguistic accuracy, achieving a BLEU-1 score of 0.7553, a BLEU-4 score of 0.4593, and a METEOR score of 0.7264. Qualitatively, this model identifies environmental details with higher precision than VGG16-LSTM. However, loss-curve analysis reveals that VGG16-LSTM converges more stably (good fit), whereas ResNet101-LSTM shows signs of slight overfitting. The study concludes that although ResNet101-LSTM is superior under standard NLP evaluation metrics, additional regularization techniques are required to maintain its performance on validation data.
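The generation step described above (CNN image features conditioning an LSTM that emits the caption one word at a time until an end token) can be sketched as a greedy decoding loop. The `step_fn` interface, the tiny vocabulary, and the toy caption below are hypothetical stand-ins for a single trained LSTM step, not the paper's implementation:

```python
def greedy_decode(step_fn, image_feature, vocab, max_len=20):
    """Generate a caption word by word; step_fn(image_feature, words) returns
    one score per vocabulary word (a stand-in for one LSTM decoding step)."""
    words = ["<start>"]
    while len(words) < max_len:
        scores = step_fn(image_feature, words)
        word = vocab[max(range(len(scores)), key=scores.__getitem__)]  # argmax
        if word == "<end>":
            break
        words.append(word)
    return words[1:]  # drop the <start> token

# Toy step function that deterministically walks a fixed caption, so the
# loop's mechanics are visible without a trained model.
VOCAB = ["<start>", "cloudy", "sky", "over", "hills", "<end>"]
CAPTION = ["cloudy", "sky", "over", "hills", "<end>"]

def toy_step(feature, words_so_far):
    target = CAPTION[len(words_so_far) - 1]
    return [1.0 if w == target else 0.0 for w in VOCAB]

print(greedy_decode(toy_step, None, VOCAB))  # ['cloudy', 'sky', 'over', 'hills']
```

In a real captioner the scores would come from the LSTM's output layer at each step, with the CNN feature vector initializing or conditioning the hidden state.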
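Optuna tuning, as in Akiba et al., follows a define-by-run objective pattern: each trial samples a candidate configuration and returns a validation score to minimize. The search space below and the closed-form stand-in for the validation loss are illustrative assumptions; a real study would train and evaluate the CNN-LSTM model inside the objective:

```python
import math
import optuna

optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    # Hypothetical search space for the captioning decoder (illustrative only).
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    units = trial.suggest_categorical("lstm_units", [256, 512, 1024])
    dropout = trial.suggest_float("dropout", 0.1, 0.5)
    # Stand-in for validation loss after training one configuration;
    # a real study would train the CNN-LSTM model here and return its loss.
    return (math.log10(lr) + 3.5) ** 2 + dropout + (0.1 if units != 512 else 0.0)

study = optuna.create_study(direction="minimize")  # TPE sampler by default
study.optimize(objective, n_trials=30)
print(study.best_params)
```

After the study finishes, `study.best_params` holds the configuration used for the final training run.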
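The reported BLEU scores combine clipped n-gram precisions with a brevity penalty. A minimal sentence-level, single-reference sketch (uniform weights, following Papineni et al., 2002; the example captions are invented, and real evaluations are typically corpus-level with multiple references):

```python
from collections import Counter
import math

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference, uniform weights."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0  # any empty n-gram overlap zeroes the geometric mean
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

cand = "heavy rain falls over the city street".split()
ref = "heavy rain falls on the city street".split()
print(round(bleu(cand, ref, max_n=1), 4))  # BLEU-1: 6 of 7 unigrams match -> 0.8571
```

For this pair BLEU-4 is 0.0, since no 4-gram matches exactly, which is why the reported BLEU-4 (0.4593) is a much stricter signal of fluency than BLEU-1 (0.7553).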

References

H. Subyantara Wicaksana et al., “Evaluasi Kinerja Automatic Weather Station Berdasarkan Pengamatan Paralel di Stasiun Meteorologi Kemayoran,” 2021.

D. R. Wibawanty, W. Wandayantolis, and I. Ishak, “Verifikasi Kinerja Alat Automatic Weather System (AWS) dan Termometer Digital terhadap Observasi Manual di Stasiun Klimatologi Palembang,” JRST (Jurnal Riset Sains dan Teknologi), vol. 6, no. 2, p. 151, Nov. 2022, doi: 10.30595/jrst.v6i2.13541.

M. Elhoseiny, S. Huang, and A. Elgammal, “Weather classification with deep convolutional neural networks,” in 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 2015. doi: 10.1109/ICIP.2015.7351424.

N. N. H. Dinh, H. Shin, Y. Ahn, B. L. Oo, and B. T. H. Lim, “Attention-based image captioning for structural health assessment of apartment buildings,” Autom. Constr., vol. 167, Nov. 2024, doi: 10.1016/j.autcon.2024.105677.

Y. Yu et al., “Neural image caption generator based on crossbar array design of memristor module,” Neurocomputing, vol. 560, Dec. 2023, doi: 10.1016/j.neucom.2023.126766.

M. Abumohsen, A. Y. Owda, M. Owda, and A. Abumihsan, “Hybrid machine learning model combining of CNN-LSTM-RF for time series forecasting of Solar Power Generation,” e-Prime - Advances in Electrical Engineering, Electronics and Energy, vol. 9, Sep. 2024, doi: 10.1016/j.prime.2024.100636.

M. J. Parseh and S. Ghadiri, “Graph-based image captioning with semantic and spatial features,” Signal Process. Image Commun., vol. 133, Apr. 2025, doi: 10.1016/j.image.2025.117273.

T. Shahzad, M. Aoun, T. Mazhar, M. U. Tariq, K. Ouahada, and H. Hamam, “Mamba-caption: Long-range sequence modelling for efficient and accurate image captioning,” Array, vol. 28, Dec. 2025, doi: 10.1016/j.array.2025.100538.

S. R. Mujawar and S. Iyer, “Deep learning model with co-ordinated relationship for image captioning enabled via attentional language encoder-decoder,” Signal Process. Image Commun., vol. 142, p. 117466, Mar. 2026, doi: 10.1016/j.image.2025.117466.

A. Sharma, H. Singh, and M. Pant, “Pixels to prose: A comprehensive survey of image captioning techniques with deep learning and generative artificial intelligence,” Neurocomputing, Feb. 28, 2026, doi: 10.1016/j.neucom.2025.132385.

H. D. Abdulgalil and O. A. Basir, “Next-generation image captioning: A survey of methodologies and emerging challenges from transformers to Multimodal Large Language Models,” Natural Language Processing Journal, Sep. 01, 2025, doi: 10.1016/j.nlp.2025.100159.

F. Firdaus et al., “A medical image captioning system for TeleOTIVA: Supporting SDGs-oriented cervical precancer screening in Indonesia,” Inform. Med. Unlocked, vol. 60, p. 101719, Jan. 2026, doi: 10.1016/j.imu.2025.101719.

L. Li, H. Li, and P. Ren, “Underwater image captioning via attention mechanism based fusion of visual and textual information,” Information Fusion, vol. 123, Nov. 2025, doi: 10.1016/j.inffus.2025.103269.

V. P. Saxena et al., “Enhancing Image Understanding in Automatic Captioning using Spatially-Aware Transformer Architectures,” in Procedia Computer Science, Elsevier B.V., 2025, pp. 2081–2090. doi: 10.1016/j.procs.2025.04.458.

N. Aljojo, H. Ardah, A. Tashkandi, and S. Habibullah, “Predicting abnormality-guided multimodal linguistic se- mantics Arabic image captioning,” Machine Learning with Applications, vol. 21, p. 100706, Sep. 2025, doi: 10.1016/j.mlwa.2025.100706.

M. R. Sree, M. Siddhartha, P. V. Vardhan Reddy, B. Kruthika, and R. P. Singh, “A Residual Network and Bi-directional LSTM based Hybrid Approach to Remote Sensing Image Captioning,” in Procedia Computer Science, Elsevier B.V., 2025, pp. 88–97. doi: 10.1016/j.procs.2025.04.198.

X. Zhu, L. Li, J. Liu, Z. Li, H. Peng, and X. Niu, “Image captioning with triple-attention and stack parallel LSTM,” Neurocomputing, vol. 319, pp. 55–65, Nov. 2018, doi: 10.1016/j.neucom.2018.08.069.

R. Padate, A. Jain, M. Kalla, and A. Sharma, “Image caption generation using a dual attention mechanism,” Eng. Appl. Artif. Intell., vol. 123, Aug. 2023, doi: 10.1016/j.engappai.2023.106112.

Mrs. A. P and Dr. P. D, “Image Captioning System for Natural Language Processing using Optimized Attention- Augmented Residual Convolutional Neural Network,” Knowl. Based. Syst., p. 115272, Jan. 2026, doi: 10.1016/j.knosys.2026.115272.

M. Kaur and H. Kaur, “An Efficient CNN-LSTM Based Framework for Improved Image Captioning,” in Procedia Computer Science, Elsevier B.V., 2025, pp. 3601–3607. doi: 10.1016/j.procs.2025.04.615.

A. K. Poddar and R. Rani, “Hybrid Architecture using CNN and LSTM for Image Captioning in Hindi Language,” in Procedia Computer Science, Elsevier B.V., 2022, pp. 686–696. doi: 10.1016/j.procs.2023.01.049.

K. Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” in Proceedings of the 32nd International Conference on Machine Learning, 2015.

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and Tell: A Neural Image Caption Generator,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

S. Banerjee and A. Lavie, “METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments,” in Proceedings of the ACL Workshop, 2005.

K. Zhang, P. Li, and J. Wang, “A Review of Deep Learning-Based Remote Sensing Image Caption: Methods, Models, Comparisons and Future Directions,” Remote Sensing, vol. 16, no. 21, Art. no. 4113, Nov. 2024, doi: 10.3390/rs16214113.

H. Xiao, F. Zhang, Z. Shen, K. Wu, and J. Zhang, “Classification of Weather Phenomenon From Images by Using Deep Convolutional Neural Network,” Earth and Space Science, vol. 8, no. 5, May 2021, doi: 10.1029/2020EA001604.

G. Luo, L. Cheng, C. Jing, C. Zhao, and G. Song, “A thorough review of models, evaluation metrics, and datasets on image captioning,” IET Image Process., vol. 16, pp. 311–332, Feb. 2022, doi: 10.1049/ipr2.12367.

K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.

T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A Next-generation Hyperparameter Optimization Framework,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Jul. 2019, pp. 2623–2631. doi: 10.1145/3292500.3330701.

S. Mulyana et al., “Identifikasi Penyakit Tanaman Berdasarkan Citra Daun Berbasis Web dengan Pendekatan Algoritma Convolutional Neural Network,” SKANIKA: Sistem Komputer dan Teknik Informatika, vol. 8, no. 2, pp. 305–317, 2025.

E. Hari Rachmawanto and M. Muslih, “Convolutional Neural Network (CNN) untuk Klasifikasi Citra Penyakit Diabetes Retinopathy,” SKANIKA: Sistem Komputer dan Teknik Informatika, vol. 5, no. 2, pp. 167–176, 2022.

S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in International Conference on Learning Representations, 2015.

Published

2026-03-03

How to Cite

Sulaeman Salasa, Shintami Chusnul Hidayati, & Muhamad Hilmil Muchtar Aditya Pradana. (2026). Application of Hybrid CNN-LSTM Architecture with Optuna Optimization for Weather Image Captioning. Jurnal Ilmiah Multidisiplin Indonesia (JIM-ID), 5(03), 465–476. Retrieved from https://ejournal.seaninstitute.or.id/index.php/esaprom/article/view/8119