Image Caption Generator using Deep Learning

SARVESH AMRUTE
12 min read · Dec 4, 2022

Our brain can identify and annotate every image shown to it. What about computers, though? How can a computer analyse a picture and assign it a caption that is both highly relevant and accurate? Building a useful caption generator for images was once thought to be very hard, but thanks to improvements in computer vision and deep learning techniques, the availability of relevant datasets, and modern AI models, it is now much simpler. Caption generation is also a growing business worldwide, with data annotation companies earning billions of dollars from it. In this tutorial, we'll show you how to build an annotation tool that uses these datasets to produce highly relevant descriptions for images. To follow along, you need a basic understanding of two deep learning approaches: LSTM (a form of recurrent neural network) and Convolutional Neural Networks (CNN).

Now, let’s begin with a quick description of the image caption generator, CNN, and LSTM.

Image Caption Generator

An image caption generator uses deep learning and computer vision to recognise the context of a picture and annotate it with relevant captions. It involves tagging an image with English keywords using datasets provided during model training. The CNN we use, Xception, is trained on the ImageNet dataset and handles the extraction of image features. These extracted features are then fed to the LSTM model, which generates the image caption.

https://stats.stackexchange.com/questions/387596/image-caption-generator

What is CNN?

CNN is a deep learning architecture that uses specialised deep neural networks to recognise and classify images. It processes image data represented as 2D matrices and can handle images that have been resized, translated, or rotated. A CNN analyses an image by scanning it from left to right and top to bottom, extracting the relevant features, and finally combining those features to classify the image.
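Just to make this concrete, here is a tiny toy CNN defined in Keras. It is purely illustrative and is not part of the caption generator, which instead reuses the much larger pre-trained Xception network:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Toy CNN for illustration only (64x64 RGB inputs, 10 example classes)
toy_cnn = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),  # scan for local patterns
    MaxPooling2D((2, 2)),                                            # downsample the feature maps
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),                                                       # combine the extracted features
    Dense(10, activation='softmax')                                  # classify into 10 example classes
])
toy_cnn.summary()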

What is LSTM?

Long short-term memory (LSTM) is a type of RNN (recurrent neural network) well suited to sequence prediction problems. It is commonly used to predict the next word in a sequence, much like Google Search suggests the next word based on the text typed so far. As inputs are processed, an LSTM carries forward the relevant information and discards what is irrelevant.
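Again for intuition only, a minimal next-word model in Keras might look like the sketch below. The vocabulary size and sequence length are made-up toy values, and the real decoder we build later is defined differently:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab = 1000   # toy vocabulary size (assumed for this sketch)
seq_len = 10   # toy length of the word sequence seen so far

toy_lstm = Sequential([
    Embedding(vocab, 64, input_length=seq_len),  # map word indices to dense vectors
    LSTM(128),                                   # carry forward the relevant context
    Dense(vocab, activation='softmax')           # probability distribution over the next word
])
toy_lstm.summary()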

To build an image caption generator model, we have to merge the CNN with the LSTM. We can express this as:

Image Caption Generator Model (CNN-RNN model) = CNN + LSTM.

  • CNN- To extract features from the image. A pre-trained model called Xception is used for this.
  • LSTM- To generate a description from the extracted information of the image.

Dataset for Image Caption Generator

The Flickr 8K dataset is used to train the image caption generator model. The links below take you straight to the dataset downloads; because the dataset is about 1 GB, downloading takes some time. The Flickr8k_Dataset folder contains 8,091 photographs, while the Flickr8k_text folder contains text files with the image descriptions. The most important file is Flickr8k.token.txt, which holds all of the image names and their captions.

Pre-requisites

Our caption generator will run in a Jupyter notebook, which you can install from the Jupyter project's website. For the implementation, a solid grasp of Python, deep learning, and NLP is needed.

Install the libraries below to begin with the project:

pip install tensorflow
pip install keras
pip install pillow
pip install numpy
pip install tqdm
pip install jupyterlab

Building the Image Caption Generator

Let's start by opening a Jupyter notebook to create our Python 3 project. Name the notebook file train_caption_generate.ipynb.

Import all the required packages

import numpy as np
from PIL import Image
import os
import string
from pickle import dump
from pickle import load
from keras.applications.xception import Xception  # pre-trained Xception model
from keras.applications.xception import preprocess_input
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.preprocessing.text import Tokenizer  # for text tokenization
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers.merge import add
from keras.models import Model, load_model
from keras.layers import Input, Dense  # Keras layers to build our CNN and LSTM
from keras.layers import LSTM, Embedding, Dropout
from tqdm import tqdm_notebook as tqdm  # to show loop progress
tqdm().pandas()

Perform data cleaning

As can be seen, the Flickr8k.token.txt file in the Flickr_8k_text folder contains all the image captions. If you examine this file closely, you can see how the data is stored: each line holds an image name and one caption, separated by a tab, and every image has five captions numbered 0 to 4.
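For reference, every line of the token file follows the pattern image_name#caption_number, then a tab, then the caption. The two lines below only illustrate that layout; they are not copied from the dataset:

example_1234.jpg#0	A brown dog runs across a grassy field .
example_1234.jpg#1	A dog is playing outside on the grass .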

Here are the five cleaning functions that will be defined:

1. load_fp(filename) — Opens a document file and reads its contents into a string.

2. img_capt(filename) — Builds a dictionary of descriptions that maps each image to its five captions.

3. txt_clean(descriptions) — Takes all descriptions as input and cleans the text. Working with textual data requires several kinds of cleaning, such as converting uppercase to lowercase, removing punctuation, and removing words that contain numbers.

4. txt_vocab(descriptions) — Compiles all of the unique words extracted from the descriptions into a vocabulary.

5. save_descriptions(descriptions, filename) — Saves all of the preprocessed descriptions into a single file.


Code:

# Load the document file into memory
def load_fp(filename):
    # Open the file for reading
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text

# Get all images with their captions
def img_capt(filename):
    file = load_fp(filename)
    captions = file.split('\n')
    descriptions = {}
    for caption in captions[:-1]:
        img, caption = caption.split('\t')
        if img[:-2] not in descriptions:
            descriptions[img[:-2]] = [caption]
        else:
            descriptions[img[:-2]].append(caption)
    return descriptions

# Data cleaning: lowercase everything, remove punctuation and words containing numbers
def txt_clean(captions):
    table = str.maketrans('', '', string.punctuation)
    for img, caps in captions.items():
        for i, img_caption in enumerate(caps):
            img_caption = img_caption.replace("-", " ")
            descp = img_caption.split()
            # uppercase to lowercase
            descp = [wrd.lower() for wrd in descp]
            # remove punctuation from each token
            descp = [wrd.translate(table) for wrd in descp]
            # remove hanging 's and a
            descp = [wrd for wrd in descp if len(wrd) > 1]
            # remove words containing numbers
            descp = [wrd for wrd in descp if wrd.isalpha()]
            # convert back to string
            img_caption = ' '.join(descp)
            captions[img][i] = img_caption
    return captions

def txt_vocab(descriptions):
    # Build a vocabulary of all unique words
    vocab = set()
    for key in descriptions.keys():
        [vocab.update(d.split()) for d in descriptions[key]]
    return vocab

# Save all descriptions in one file
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + '\t' + desc)
    data = "\n".join(lines)
    file = open(filename, "w")
    file.write(data)
    file.close()

# Set these paths according to the project folder on your system,
# e.g. here a folder named 'shikha' was created on the D: drive
dataset_text = r"D:\shikha\Project - Image Caption Generator\Flickr_8k_text"
dataset_images = r"D:\shikha\Project - Image Caption Generator\Flicker8k_Dataset"

# Prepare our text data
filename = dataset_text + "/" + "Flickr8k.token.txt"
# Load the file that contains all data and map it into the descriptions dictionary
descriptions = img_capt(filename)
print("Length of descriptions =", len(descriptions))
# Clean the descriptions
clean_descriptions = txt_clean(descriptions)
# Build the vocabulary
vocabulary = txt_vocab(clean_descriptions)
print("Length of vocabulary = ", len(vocabulary))
# Save all descriptions in one file
save_descriptions(clean_descriptions, "descriptions.txt")

Extract the feature vector

To extract features from the images, we will use Xception, a pre-trained model that has already been trained on a large amount of data: it was trained on the ImageNet dataset to classify images into 1000 distinct classes. We can import this model directly from keras.applications. To make Xception fit our pipeline, we need a couple of adjustments: the final classification layer is removed so that the model outputs a 2048-element feature vector, and each image is resized to the 299x299x3 input size that Xception expects.

model = Xception( include_top=False, pooling='avg' )

The extract_features() function extracts these features for all photos, and at the end the features dictionary is dumped into a pickle file called "features.p".

def extract_features(directory):
    model = Xception(include_top=False, pooling='avg')
    features = {}
    for pic in tqdm(os.listdir(directory)):
        file = directory + "/" + pic
        image = Image.open(file)
        image = image.resize((299, 299))
        image = np.array(image)
        image = np.expand_dims(image, axis=0)
        # image = preprocess_input(image)
        image = image / 127.5
        image = image - 1.0
        feature = model.predict(image)
        features[pic] = feature
    return features

# 2048-element feature vector for every image
features = extract_features(dataset_images)
dump(features, open("features.p", "wb"))

# To directly load the features from the pickle file later:
features = load(open("features.p", "rb"))

Loading dataset for model training

Our Flickr_8k_text folder has a file called "Flickr_8k.trainImages.txt", which contains the list of 6,000 image names used for training.

The following functions are needed to load the training datasets:

load_photos(fname) — Takes a file name as an argument, loads the text file into a string, and produces a list of image names.

load_clean_descriptions(fname, photos) — Stores the captions for every image in the given list of photos in a dictionary. We wrap each caption with <start> and <end> identifiers so that the LSTM model can recognise where a caption begins and ends.

load_features(photos) — Returns the dictionary mapping each image to the feature vectors extracted with the Xception model.

# Load the data
def load_photos(filename):
    file = load_fp(filename)
    photos = file.split("\n")[:-1]
    return photos

def load_clean_descriptions(filename, photos):
    # Load clean_descriptions
    file = load_fp(filename)
    descriptions = {}
    for line in file.split("\n"):
        words = line.split()
        if len(words) < 1:
            continue
        image, image_caption = words[0], words[1:]
        if image in photos:
            if image not in descriptions:
                descriptions[image] = []
            desc = '<start> ' + " ".join(image_caption) + ' <end>'
            descriptions[image].append(desc)
    return descriptions

def load_features(photos):
    # Load all features
    all_features = load(open("features.p", "rb"))
    # Select only the features we need
    features = {k: all_features[k] for k in photos}
    return features

filename = dataset_text + "/" + "Flickr_8k.trainImages.txt"
train_imgs = load_photos(filename)
train_descriptions = load_clean_descriptions("descriptions.txt", train_imgs)
train_features = load_features(train_imgs)

Tokenizing the vocabulary

Machines cannot understand raw English words; they need a simple numerical representation to process the data. For this reason, we assign each word in the vocabulary a unique index value. The Keras library has a built-in Tokenizer class that generates these tokens from our vocabulary, and we save the fitted tokenizer to the pickle file "tokenizer.p".
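As a quick illustration of what the tokenizer does (this toy snippet is separate from the real tokenizer built below):

from keras.preprocessing.text import Tokenizer

demo = Tokenizer()
demo.fit_on_texts(["start a dog runs end", "start a child smiles end"])
print(demo.word_index)                            # every word is assigned a unique integer index
print(demo.texts_to_sequences(["a dog smiles"]))  # a caption becomes a list of those indices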

# Convert the descriptions dictionary into a flat list of captions
def dict_to_list(descriptions):
    all_desc = []
    for key in descriptions.keys():
        [all_desc.append(d) for d in descriptions[key]]
    return all_desc

# Create a tokenizer that vectorises the text corpus:
# each integer will represent a token in the dictionary
from keras.preprocessing.text import Tokenizer

def create_tokenizer(descriptions):
    desc_list = dict_to_list(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(desc_list)
    return tokenizer

# Give each word an index and store the tokenizer in the tokenizer.p pickle file
tokenizer = create_tokenizer(train_descriptions)
dump(tokenizer, open('tokenizer.p', 'wb'))
vocab_size = len(tokenizer.word_index) + 1
vocab_size  # the size of our vocabulary is 7577 words

# Calculate the maximum caption length to decide the model structure parameters
def max_length(descriptions):
    desc_list = dict_to_list(descriptions)
    return max(len(d.split()) for d in desc_list)

max_length = max_length(descriptions)
max_length  # the maximum description length is 32

Create a Data generator

To train the model as a supervised learning task, we must provide it with input and output sequences. Our training set contains 6,000 images, each with a 2048-element feature vector and captions encoded as integer sequences. It is not possible to hold this much data in memory at once, so we use a generator that yields batches.

The model takes two inputs, [x1, x2]: x1 is the 2048-element feature vector of the image and x2 is the input text sequence, while the output y is the predicted next word of the sequence, as the worked example below illustrates.
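To make that concrete, here is a toy illustration (not part of the project code) of how a single tokenised caption is expanded into input/output pairs; every pair is additionally matched with the same image feature vector:

# Illustration only: expanding one caption into training pairs
caption = ['start', 'two', 'dogs', 'run', 'end']
for i in range(1, len(caption)):
    print(caption[:i], '->', caption[i])
# ['start'] -> two
# ['start', 'two'] -> dogs
# ['start', 'two', 'dogs'] -> run
# ['start', 'two', 'dogs', 'run'] -> end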

# Data generator, used by model.fit_generator()
def data_generator(descriptions, features, tokenizer, max_length):
    while 1:
        for key, description_list in descriptions.items():
            # Retrieve the photo features
            feature = features[key][0]
            inp_image, inp_seq, op_word = create_sequences(tokenizer, max_length, description_list, feature)
            yield [[inp_image, inp_seq], op_word]

def create_sequences(tokenizer, max_length, desc_list, feature):
    x_1, x_2, y = list(), list(), list()
    # Walk through each description for the image
    for desc in desc_list:
        # Encode the sequence
        seq = tokenizer.texts_to_sequences([desc])[0]
        # Split one sequence into multiple X, y pairs
        for i in range(1, len(seq)):
            # Split into input and output pair
            in_seq, out_seq = seq[:i], seq[i]
            # Pad the input sequence
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            # One-hot encode the output word
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # Store
            x_1.append(feature)
            x_2.append(in_seq)
            y.append(out_seq)
    return np.array(x_1), np.array(x_2), np.array(y)

# Check the shape of the input and output for the model
[a, b], c = next(data_generator(train_descriptions, features, tokenizer, max_length))
a.shape, b.shape, c.shape
# ((47, 2048), (47, 32), (47, 7577))

Define the CNN-RNN model

We'll define the model's structure using the Keras Model from the Functional API. It consists of three parts:

Feature Extractor — A dense layer with dropout that takes the 2048-element feature vector extracted from the image and reduces it to 256 nodes.

Sequence Processor — An embedding layer that handles the text input, followed by an LSTM layer.

Decoder — Merges the outputs of the two parts above and passes them through a dense layer to produce the final word prediction.

from keras.utils import plot_model

# Define the captioning model
def define_model(vocab_size, max_length):
    # Features from the CNN model, compressed from 2048 to 256 nodes
    inputs1 = Input(shape=(2048,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # LSTM sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # Merge both models
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    # Tie it together: [image, seq] -> [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # Summarise the model
    print(model.summary())
    plot_model(model, to_file='model.png', show_shapes=True)
    return model

Training the Image Caption Generator model

With the 6,000 training images, we generate the input and output sequences to train our model, using the define_model() function defined above. We fit the batches to the model with fit_generator(), and save the model to our models folder after each epoch.

# Train our model
print('Dataset: ', len(train_imgs))
print('Descriptions: train=', len(train_descriptions))
print('Photos: train=', len(train_features))
print('Vocabulary Size:', vocab_size)
print('Description Length: ', max_length)

model = define_model(vocab_size, max_length)
epochs = 10
steps = len(train_descriptions)
# Create a directory named "models" to save our models
os.makedirs("models", exist_ok=True)
for i in range(epochs):
    generator = data_generator(train_descriptions, train_features, tokenizer, max_length)
    model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save("models/model_" + str(i) + ".h5")

Testing the Image Caption Generator model

Once the model has been trained successfully, we need to feed it test image data to check how well it performs. Let's write a separate Python file, test_caption.py, to load the trained model and generate predictions.

import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
import argparse
from pickle import load
from keras.models import load_model
from keras.applications.xception import Xception
from keras.preprocessing.sequence import pad_sequences

ap = argparse.ArgumentParser()
ap.add_argument('-i', '--image', required=True, help="Image Path")
args = vars(ap.parse_args())
img_path = args['image']

def extract_features(filename, model):
    try:
        image = Image.open(filename)
    except:
        print("ERROR: Can't open image! Ensure that the image path and extension are correct")
    image = image.resize((299, 299))
    image = np.array(image)
    # Images with 4 channels (RGBA) need to be converted to 3 channels
    if image.shape[2] == 4:
        image = image[..., :3]
    image = np.expand_dims(image, axis=0)
    image = image / 127.5
    image = image - 1.0
    feature = model.predict(image)
    return feature

def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

def generate_desc(model, tokenizer, photo, max_length):
    in_text = 'start'
    for i in range(max_length):
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        pred = model.predict([photo, sequence], verbose=0)
        pred = np.argmax(pred)
        word = word_for_id(pred, tokenizer)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'end':
            break
    return in_text

max_length = 32
tokenizer = load(open("tokenizer.p", "rb"))
model = load_model('models/model_9.h5')
xception_model = Xception(include_top=False, pooling="avg")

photo = extract_features(img_path, xception_model)
img = Image.open(img_path)
description = generate_desc(model, tokenizer, photo, max_length)
print("\n\n")
print(description)
plt.imshow(img)
plt.show()
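Assuming the script above is saved as test_caption.py and a trained model exists at models/model_9.h5, it can be run from the command line like this (the image path is only an example):

python test_caption.py --image path/to/example.jpg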

Output:

Conclusion

In this tutorial, we built a deep learning model with the help of CNN and LSTM. Our model was trained on a relatively small dataset of about 8,000 photos, whereas production-level models are trained on much larger datasets of more than 100,000 images. Accuracy increases with dataset size, so if you want a more accurate caption generator, train this model on a larger dataset.
