Abstract
This research presents a novel approach to obstacle detection during navigation based on a combination of Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks. The primary objective is to generate accurate captions that describe image content, which is crucial for applications such as autonomous driving and assistive technologies for the visually impaired. We systematically analyze the architecture of our model, which consists of three main components: a CNN for feature extraction, an LSTM for sequence generation, and a mechanism for sentence formulation. By employing transfer learning with the Inception v3 architecture, we enhance the model's performance while reducing computational cost. Our experiments use the Flickr8k dataset, which comprises 8,000 images, each accompanied by five descriptive sentences. We also incorporate Gated Recurrent Units (GRUs), a simplified alternative to LSTMs, demonstrating comparable performance with fewer parameters and thus improved training efficiency. The model's effectiveness is evaluated with the Bilingual Evaluation Understudy (BLEU) score, which quantifies the quality of generated captions against reference sentences. Results indicate that our architecture achieves a BLEU score of approximately 80% on the training set and approximately 75% on the test set, showing its capability to produce semantically and grammatically correct captions. Additionally, we explore the integration of attention mechanisms to sharpen the model's focus on relevant image features during caption generation. The findings suggest that our approach not only meets the challenges of automatic image captioning but also holds potential for broader applications in image understanding and navigation systems. Future work will expand the dataset and refine the model to further improve accuracy and robustness in diverse scenarios.
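To make the encoder-decoder design concrete, the sketch below outlines one plausible Keras realization of the pipeline summarized above: a frozen Inception v3 backbone supplies image features, and an embedding-plus-LSTM branch processes the partial caption before both are merged into a softmax over the vocabulary. The layer sizes, vocabulary size (VOCAB_SIZE), maximum caption length (MAX_LEN), and the merge-style decoder are illustrative assumptions, not the exact configuration reported in this work.

```python
# Minimal sketch of an Inception v3 + LSTM captioning model (illustrative only).
# VOCAB_SIZE, MAX_LEN, and EMBED_DIM are assumed values, not reported settings.
import tensorflow as tf
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

VOCAB_SIZE = 8000   # assumed vocabulary size after tokenizing Flickr8k captions
MAX_LEN = 34        # assumed maximum caption length in tokens
EMBED_DIM = 256     # assumed embedding / hidden dimension

# Encoder: pre-trained Inception v3 (transfer learning) used as a fixed feature
# extractor; global average pooling yields a 2048-d vector per image.
cnn_encoder = InceptionV3(weights="imagenet", include_top=False, pooling="avg")
cnn_encoder.trainable = False

# Decoder: the 2048-d image feature and the partial caption are projected to the
# same dimension, merged, and mapped to a softmax over the vocabulary.
img_input = Input(shape=(2048,))
img_dense = Dense(EMBED_DIM, activation="relu")(Dropout(0.5)(img_input))

seq_input = Input(shape=(MAX_LEN,))
seq_embed = Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(seq_input)
seq_state = LSTM(EMBED_DIM)(Dropout(0.5)(seq_embed))

merged = add([img_dense, seq_state])
hidden = Dense(EMBED_DIM, activation="relu")(merged)
output = Dense(VOCAB_SIZE, activation="softmax")(hidden)

caption_model = Model(inputs=[img_input, seq_input], outputs=output)
caption_model.compile(loss="categorical_crossentropy", optimizer="adam")
```

In this sketch the image features would be precomputed once with `cnn_encoder` and cached, so only the lightweight decoder is trained on Flickr8k; swapping the `LSTM` layer for a `GRU` layer of the same width is the drop-in change that yields the reduced-parameter variant discussed above.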