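Automatic Image Caption Generation with Deep Learning

shengguo, published 2019-07-30 18:38 / 3412 views

Abstract: This project uses deep learning to generate image captions automatically. Image features are extracted through transfer learning with the Inception model, which uses Label Smoothing to correct its loss. The text prediction part is largely the same as machine translation with an attention mechanism.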

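Introduction

This project uses deep learning to generate image captions automatically. For the image above, the model generates the caption "The person is riding a surfboard in the ocean". How do we actually implement this?

As the figure shows, we need two models: a CNN and an RNN.

CNN model:

A convolutional network is very good at extracting image features, so we use one to extract the feature information. The CNN needs strong recognition ability, which means it must have been trained on a large, multi-class training set and must reach high recognition accuracy. Here we use transfer learning with the Inception model.
The article "Recognizing OCT images with transfer learning" introduces transfer learning.

RNN model:
For text sequence data, an RNN is still our best choice at present. To improve prediction quality, we use an attention mechanism for text prediction.
The article "Machine translation with an attention mechanism" introduces attention.

Detailed requirements for each model are explained alongside the corresponding code.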

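Dataset

We train on the MS-COCO dataset. To make the rest easier to follow, here is a brief description of the data format. COCO provides five annotation types: object detection, keypoint detection, stuff segmentation, panoptic segmentation, and image captioning. The basic data structure is shown in the figure below:

Concrete sample (partial):

This project uses the image captioning annotations, in which every image has at least 5 captions: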
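Since the sample figure is not reproduced here, the following minimal sketch shows how caption annotations of this kind can be grouped per image. The keys "annotations", "image_id" and "caption" are the ones used by the article's code; the JSON content itself and the helper name captions_by_image are made up for illustration.

```python
import json
from collections import defaultdict

# Tiny hand-written stand-in for a COCO captions file (made-up values)
sample_json = """
{
  "annotations": [
    {"image_id": 9,  "id": 1, "caption": "A dog runs on the beach"},
    {"image_id": 25, "id": 2, "caption": "A cat sleeps on a sofa"},
    {"image_id": 9,  "id": 3, "caption": "A brown dog is running near the sea"}
  ]
}
"""

def captions_by_image(annotations):
    """Group caption strings by the image_id they describe."""
    grouped = defaultdict(list)
    for annot in annotations["annotations"]:
        grouped[annot["image_id"]].append(annot["caption"])
    return dict(grouped)

grouped = captions_by_image(json.loads(sample_json))
for image_id, captions in sorted(grouped.items()):
    print(image_id, len(captions), captions)
```

In the real file each image_id appears several times, once per caption, which is why the article's code later produces one (image path, caption) pair per annotation.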

Downloading and preparing the data

import tensorflow as tf
# Enable eager execution (TensorFlow 1.x)
tf.enable_eager_execution()
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
import re
import numpy as np
import os
import time
import json
from glob import glob
from PIL import Image
import pickle

annotation_zip=tf.keras.utils.get_file(
    # cache_dir (default): `~/.keras`
    # cache_subdir (default): `datasets`,
    # which would give ~/.keras/datasets/captions.zip;
    # here the file is stored in the current directory instead
    fname="captions.zip",
    cache_subdir=os.path.abspath("."),
    origin="http://images.cocodataset.org/annotations/annotations_trainval2014.zip",
    # Unpack the archive after downloading
    extract=True
)

# os.path.dirname returns the directory part, i.e. os.path.split(path)[0]
annotation_file = os.path.dirname(annotation_zip)+"/annotations/captions_train2014.json"
name_of_zip="train2014.zip"
if not os.path.exists(os.path.abspath(".")+"/"+name_of_zip):
    image_zip=tf.keras.utils.get_file(
        fname=name_of_zip,
        cache_subdir=os.path.abspath("."),
        origin="http://images.cocodataset.org/zips/train2014.zip",
        extract=True
    )
    # The leading "/" is required, otherwise the folder name is glued to the directory name
    PATH=os.path.dirname(image_zip)+"/train2014/"
else:
    PATH=os.path.abspath(".")+"/train2014/"

Reading the captions and images:

# Read the annotation JSON file
with open(annotation_file,"r") as f:
    annotations=json.load(f)

# All captions
all_captions=[]

# All image paths
all_img_name_vector=[]

# See the COCO website for the JSON format
for annot in annotations["annotations"]:

    # Add the start and end markers
    caption="<start> "+annot["caption"]+" <end>"
    # Image ID, used to build the file name
    image_id=annot["image_id"]
    # See the "concrete sample" near the start of the article
    full_coco_image_path=PATH+"COCO_train2014_"+"%012d.jpg"%(image_id)

    all_img_name_vector.append(full_coco_image_path)
    all_captions.append(caption)

# random_state is the random seed, so the shuffle is identical on every run
train_captions,img_name_vector=shuffle(
    all_captions,
    all_img_name_vector,
    random_state=1
)

# Use the first 30000 shuffled samples
num_examples=30000
train_captions=train_captions[:num_examples]
img_name_vector=img_name_vector[:num_examples]

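Reusing a pre-trained InceptionV3:

A brief introduction to the InceptionV3 model:

The most important idea in the Inception architecture is convolution kernel factorization. As the figure above shows, a 5x5 convolution can be replaced by two stacked 3x3 convolutions, and a 3x3 convolution can be replaced by a 3x1 convolution followed by a 1x3 convolution. The replacement reduces the number of weights and adds non-linearity, because there are more layers. For example, per pair of input and output channels a 5x5 kernel has 5x5 = 25 weights, while two 3x3 kernels have (3x3)x2 = 18. In the same way, InceptionV3 factorizes a 7x7 convolution into a 7x1 convolution and a 1x7 convolution.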
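The weight counts behind this argument can be checked with a few lines of arithmetic. This is only a sketch: the helper names conv_weights and stacked_weights are made up, biases are ignored, and every layer is assumed to keep the same number of channels.

```python
def conv_weights(kernel_h, kernel_w, in_channels=1, out_channels=1):
    """Number of weights in one convolution layer, ignoring biases."""
    return kernel_h * kernel_w * in_channels * out_channels

def stacked_weights(kernels, channels=1):
    """Total weights of a stack of convolutions that all keep `channels` channels."""
    return sum(conv_weights(h, w, channels, channels) for h, w in kernels)

comparisons = {
    "5x5 vs two 3x3": (stacked_weights([(5, 5)]), stacked_weights([(3, 3), (3, 3)])),
    "3x3 vs 3x1 + 1x3": (stacked_weights([(3, 3)]), stacked_weights([(3, 1), (1, 3)])),
    "7x7 vs 7x1 + 1x7": (stacked_weights([(7, 7)]), stacked_weights([(7, 1), (1, 7)])),
}

for name, (original, factorized) in comparisons.items():
    saving = 1 - factorized / original
    print("{}: {} -> {} weights ({:.0%} fewer)".format(name, original, factorized, saving))
```

The relative saving does not depend on the channel count, because both variants are multiplied by the same in_channels x out_channels factor.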

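Batch normalization (BN) was formally introduced in InceptionV2. BN normalizes the inputs of a layer to zero mean and unit standard deviation, which keeps the values in the sensitive range of the activation function and so helps prevent vanishing gradients. With vanishing gradients under control, a larger learning rate can be used, which speeds up convergence. Because BN also has a regularizing effect similar to Dropout, training can use little or no Dropout and a weaker L2 penalty.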
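As a rough illustration of the normalization step, the sketch below standardizes one batch of scalar activations and then applies a scale and shift. The function name batch_norm and the sample values are assumptions for this example; in a real layer gamma and beta are learned, and moving averages replace the batch statistics at inference time.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalars to zero mean and unit variance, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

activations = [12.0, 15.0, 9.0, 30.0]   # made-up pre-activation values
normalized = batch_norm(activations)

norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
print(normalized)
print(norm_mean, norm_var)
```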

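Asymmetric convolutions such as 1x3 and 3x1 are used (the paper's authors note that they work best on feature maps between 12x12 and 20x20).

Label Smoothing is used to correct the loss. The figure below shows the new loss function: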
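The figure with the loss is not reproduced here. A common formulation of label smoothing, which this sketch assumes, replaces the one-hot target q by q' = (1 - epsilon) * q + epsilon / K for K classes and then computes the usual cross entropy against q'. The function names, the value epsilon = 0.1 and the probabilities below are illustrative.

```python
import math

def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Mix the one-hot target with a uniform distribution over all classes."""
    uniform = epsilon / num_classes
    return [(1.0 - epsilon) + uniform if k == true_class else uniform
            for k in range(num_classes)]

def cross_entropy(target, probs):
    """Cross entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

probs = [0.7, 0.2, 0.1]      # made-up softmax output for 3 classes
hard = [1.0, 0.0, 0.0]       # one-hot target for class 0
soft = smooth_labels(true_class=0, num_classes=3, epsilon=0.1)

hard_loss = cross_entropy(hard, probs)
soft_loss = cross_entropy(soft, probs)
print(soft)
print(hard_loss, soft_loss)
```

With smoothed targets the loss also penalizes putting almost no probability on the other classes, which discourages over-confident predictions.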

The layers of the network are listed in the figure below:

# InceptionV3 expects images of size 299x299
# with input values in the range [-1,1]

def load_image(image_path):
    # channels=3 forces RGB output, so grayscale JPEGs also yield 3 channels
    img=tf.image.decode_jpeg(tf.read_file(image_path),channels=3)
    img_reshape=tf.image.resize_images(img,(299,299))

    # Scale pixel values to [-1,1]: x/127.5 - 1
    img_range=tf.keras.applications.inception_v3.preprocess_input(img_reshape)

    return img_range,image_path
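The scaling performed by preprocess_input for InceptionV3 is x / 127.5 - 1. The short sketch below applies the same arithmetic to single pixel values; the helper name scale_to_unit_range is made up for this example.

```python
def scale_to_unit_range(pixel):
    """Map a pixel value in [0, 255] to [-1, 1] via x / 127.5 - 1."""
    return pixel / 127.5 - 1.0

scaled = {pixel: scale_to_unit_range(pixel) for pixel in (0, 127.5, 255)}
for pixel, value in scaled.items():
    print(pixel, "->", value)
```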

Building a new model with transfer learning:

# The output of the last convolutional layer has shape (8,8,2048);
# the resulting feature vectors are cached on disk later
image_model=tf.keras.applications.InceptionV3(
    # Drop the final fully connected layers
    include_top=False,
    # Weights pre-trained on ImageNet
    weights="imagenet"
)

# shape: (batch_size,299,299,3)
new_input=image_model.input

# hidden_layer shape: (batch_size,8,8,2048)
hidden_layer=image_model.layers[-1].output

# Create the new model
image_features_extract_model=tf.keras.Model(
    new_input,
    hidden_layer
)

Saving the features extracted by InceptionV3:

encode_train=sorted(set(img_name_vector))

# map can process elements in parallel when num_parallel_calls is set;
# by default the elements are produced in a deterministic order,
# and relaxing that ordering can speed up reading
image_dataset=tf.data.Dataset.from_tensor_slices(encode_train).map(load_image).batch(16)


for img,path in image_dataset:
    # Features produced by InceptionV3
    batch_features=image_features_extract_model(img)
    batch_features=tf.reshape(

        # shape (batch_size,8,8,2048) -> (batch_size,64,2048)
        batch_features,shape=(batch_features.shape[0],-1,batch_features.shape[3])
    )

    # Save the features of every image in the batch
    for bf,p in zip(batch_features,path):
        path_of_feature=p.numpy().decode("utf-8")

        # np.save appends the .npy suffix to the file name
        np.save(path_of_feature,bf.numpy())
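The reshape from (8,8,2048) to (64,2048) keeps one feature vector per spatial location and simply lists the locations row by row. The sketch below shows the same idea on a toy 2x3 grid with nested lists; flatten_spatial is a hypothetical helper, not part of the article's code.

```python
def flatten_spatial(feature_map):
    """Turn an (H, W, C) nested list into an (H*W, C) list, row by row."""
    return [channels for row in feature_map for channels in row]

# Toy 2x3 feature map with 2 channels; each vector records its own (row, col)
height, width = 2, 3
feature_map = [[[r, c] for c in range(width)] for r in range(height)]

locations = flatten_spatial(feature_map)
print(len(locations), len(locations[0]))   # number of locations, channels per location
print(locations[1 * width + 1])            # the vector that came from row 1, col 1
```

Location (r, c) ends up at index r * width + c, which is why the 64 attention weights can later be reshaped back into an 8x8 grid for plotting.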

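Text processing

Text is handled in the usual way: build a vocabulary from the captions, create the word-to-ID and ID-to-word mappings, and finally pad every sequence to a common length.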
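The three steps can be sketched without any library. This is a simplified stand-in for what a Keras Tokenizer and pad_sequences do, not their actual implementation; the marker names, the helper functions and the two sample captions are assumptions made for this example.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"   # assumed marker names for this sketch

def build_vocab(captions, max_words):
    """Assign IDs to the max_words most frequent words; 0 is padding, 1 is unknown."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    word_index = {PAD: 0, UNK: 1}
    for word, _ in counts.most_common(max_words):
        word_index[word] = len(word_index)
    index_word = {index: word for word, index in word_index.items()}
    return word_index, index_word

def encode(caption, word_index):
    """Replace every word by its ID, falling back to the unknown ID."""
    return [word_index.get(word, word_index[UNK]) for word in caption.lower().split()]

def pad_post(sequences, pad_id=0):
    """Append pad_id to every sequence until all have the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

captions = ["<start> a dog runs <end>", "<start> a cat sleeps on a sofa <end>"]
word_index, index_word = build_vocab(captions, max_words=4)
sequences = [encode(caption, word_index) for caption in captions]
padded = pad_post(sequences)
print(word_index)
print(padded)
```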

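# Length of the longest sequence
def calc_max_length(tensor):
    return max(len(t) for t in tensor)

top_k=5000
tokenizer=tf.keras.preprocessing.text.Tokenizer(
    num_words=top_k,

    # Words outside the vocabulary are replaced by <unk>
    oov_token="<unk>",

    # Special characters to filter out (< and > are kept for the markers)
    filters='!"#$%&()*+.,-/:;=?@[]^_`{|}~'
)

# Fit the vocabulary on the training captions
tokenizer.fit_on_texts(train_captions)

# Convert every caption to a sequence of word IDs
train_seqs=tokenizer.texts_to_sequences(train_captions)

# ID 0 is reserved for padding
tokenizer.word_index["<pad>"]=0
tokenizer.index_word[0]="<pad>"

# If no maximum length is given, pad_sequences pads to the longest sequence
cap_vector=tf.keras.preprocessing.sequence.pad_sequences(
   sequences=train_seqs,
   # Pad at the end
   padding="post"
)
max_length=calc_max_length(train_seqs)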

Model training parameters

Splitting into training and validation sets:

img_name_train,img_name_val,cap_train,cap_val=train_test_split(
    img_name_vector,
    cap_vector,

    # The validation set is 20% of the data
    test_size=0.2,
    # Fixed seed, so the split is identical on every run
    random_state=0
)

# A power of 2 is customary and tends to suit GPU computation
BATCH_SIZE=64
# Size of the shuffle buffer
BUFFER_SIZE=1000
# Word embedding dimension
embedding_dim=256
# Number of GRU units (size of the hidden state)
units=512
vocab_size=len(tokenizer.word_index)

# The (8,8,2048) feature map is later reshaped to (64,2048);
# these dimensions must match
feature_shape=2048
attention_features_shape=64

# Load the feature file saved earlier
def map_func(img_name,cap):
    img_tensor=np.load(img_name.decode("utf-8")+".npy")
    return img_tensor,cap
dataset=tf.data.Dataset.from_tensor_slices((img_name_train,cap_train))

# Choose num_parallel_calls according to your CPU
dataset=dataset.map(lambda item1,item2:tf.py_func(
    map_func,[item1,item2],[tf.float32,tf.int32]
),num_parallel_calls=4)

# prefetch lets the CPU prepare the next batch while the GPU is computing,
# which hides part of the data loading time
dataset=dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(1)

Building the model

Encoder model:

# A single fully connected layer followed by ReLU
class CNN_Encoder(tf.keras.Model):
    def __init__(self,embedding_dim):
        super(CNN_Encoder, self).__init__()

        # fc output shape: (batch_size,64,embedding_dim)
        self.fc=tf.keras.layers.Dense(embedding_dim)

    # Keras models implement call(); Model.__call__ wraps it
    def call(self,x):
        x=self.fc(x)
        x=tf.nn.relu(x)

        return x

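Attention layer:
See the link given at the start of the article for a detailed introduction; the equations used for the computation are given here: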
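The equation image is not reproduced here. A standard way to write Bahdanau (additive) attention, using notation chosen for this note, is the following: f_i is the encoder feature at location i (64 locations), h is the decoder hidden state, and W_1, W_2 and v correspond to the W1, W2 and V layers in the code. Bias terms of the dense layers are omitted.

```latex
\begin{aligned}
e_i &= v^{\top} \tanh\left(W_1 f_i + W_2 h\right), \qquad i = 1, \dots, 64 \\
\alpha_i &= \frac{\exp(e_i)}{\sum_{j=1}^{64} \exp(e_j)} \\
c &= \sum_{i=1}^{64} \alpha_i \, f_i
\end{aligned}
```

Here e_i is the score of location i, alpha_i is its attention weight after the softmax over locations, and c is the context vector passed to the decoder.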

class BahdanauAttention(tf.keras.Model):
    def __init__(self,units):
        super(BahdanauAttention, self).__init__()

        self.W1=tf.keras.layers.Dense(units)
        self.W2=tf.keras.layers.Dense(units)
        self.V=tf.keras.layers.Dense(1)

    # Keras models implement call(); Model.__call__ wraps it
    def call(self, features,hidden):
        # Follows the attention equations
        # features shape: (batch_size,64,embedding_dim)
        # hidden shape: (batch_size,hidden_size)
        # hidden_with_time_axis shape: (batch_size,1,hidden_size)
        hidden_with_time_axis=tf.expand_dims(hidden,1)

        # score shape: (batch_size,64,units)
        score=tf.nn.tanh(self.W1(features)+self.W2(hidden_with_time_axis))

        # attention_weights shape: (batch_size,64,1), softmax over the 64 locations
        attention_weights=tf.nn.softmax(self.V(score),axis=1)

        # context_vector shape: (batch_size,embedding_dim)
        context_vector=tf.reduce_sum(attention_weights*features,axis=1)

        return context_vector,attention_weights
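To make the data flow of the attention layer concrete, here is a small pure-Python sketch of the same computation for a single sample: score each location, apply a softmax over locations, and take the weighted sum of the features. All sizes, weights and helper names are toy assumptions, and the dense-layer biases are left out.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(features, hidden, W1, W2, v):
    """features: one vector per location; hidden: decoder state vector."""
    projected_hidden = matvec(W2, hidden)
    scores = []
    for feature in features:
        projected = matvec(W1, feature)
        activated = [math.tanh(a + b) for a, b in zip(projected, projected_hidden)]
        scores.append(sum(w * a for w, a in zip(v, activated)))
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return context, weights

# Toy sizes: 3 locations, feature dim 2, hidden dim 2, 2 attention units
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [0.5, -0.5]
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

context, weights = additive_attention(features, hidden, W1, W2, v)
print(weights)
print(context)
```

The context vector has the same size as one feature vector, not the size of the hidden state, because it is a weighted average of the features.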

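GRU used in the decoder:

# Compared with LSTM, GRU has one gate fewer, so it has fewer parameters and tends to converge faster
def gru(units):
    if tf.test.is_gpu_available():

        # CuDNNGRU uses the GPU-accelerated cuDNN kernel
        return tf.keras.layers.CuDNNGRU(
            units=units,
            return_state=True,
            return_sequences=True,

            # Initializer for the recurrent kernel
            # glorot_uniform samples from a uniform distribution on [-limit, limit]
            # with limit = sqrt(6 / (fan_in + fan_out)),
            # where fan_in and fan_out are the numbers of input and output units of the weight tensor
            recurrent_initializer="glorot_uniform"
        )
    else:
        return tf.keras.layers.GRU(
            units,
            return_sequences=True,
            return_state=True,

            # The default is hard_sigmoid: 0 for x < -2.5, 1 for x > 2.5, linear in between
            recurrent_activation="sigmoid",
            recurrent_initializer="glorot_uniform"
        )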
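The two comments in the GRU helper can be checked numerically. The sketch below assumes the Keras definitions contemporary with this article: hard_sigmoid(x) = clip(0.2 * x + 0.5, 0, 1), and Glorot uniform sampling from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)). The function names are chosen for this example.

```python
import math

def hard_sigmoid(x):
    """Piecewise-linear approximation of the sigmoid: clip(0.2 * x + 0.5, 0, 1)."""
    return min(1.0, max(0.0, 0.2 * x + 0.5))

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def glorot_uniform_limit(fan_in, fan_out):
    """Half-width of the uniform interval used by Glorot uniform initialization."""
    return math.sqrt(6.0 / (fan_in + fan_out))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, hard_sigmoid(x), round(sigmoid(x), 4))

# Recurrent kernel of a GRU with 512 units: 512 inputs to 512 outputs per gate
print(glorot_uniform_limit(512, 512))
```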

Decoder model:

# Decoder with attention
class RNN_Decoder(tf.keras.Model):
    def __init__(self,embedding_dim,units,vocab_size):
        super(RNN_Decoder, self).__init__()
        self.units=units

        # Word embeddings map high-dimensional discrete data to low-dimensional continuous vectors
        # whose positions in the vector space reflect similarity between words
        self.embedding=tf.keras.layers.Embedding(input_dim=vocab_size,output_dim=embedding_dim)
        self.gru=gru(units)
        self.fc1=tf.keras.layers.Dense(self.units)
        self.fc2=tf.keras.layers.Dense(vocab_size)
        self.attention=BahdanauAttention(self.units)

    # Keras models implement call(); Model.__call__ wraps it
    def call(self,x,features,hidden):
        # Output of the attention model
        context_vector,attention_weights=self.attention(features,hidden)

        # x shape after embedding: (batch_size,1,embedding_dim)
        x=self.embedding(x)

        # Concatenate the context vector with the current input
        # context shape: (batch_size,1,embedding_dim), x shape: (batch_size,1,embedding_dim)
        # x shape after concat: (batch_size,1,2*embedding_dim)
        x=tf.concat([tf.expand_dims(context_vector,1),x],axis=-1)

        output,state=self.gru(x)

        # x shape: (batch_size,1,units), one time step per call
        x=self.fc1(output)

        # x shape: (batch_size,units)
        x=tf.reshape(x,shape=(-1,x.shape[2]))

        # x shape: (batch_size,vocab_size)
        x=self.fc2(x)

        return x,state,attention_weights

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))

Training the model

Instantiating the models:

encoder = CNN_Encoder(embedding_dim)
decoder = RNN_Decoder(embedding_dim, units, vocab_size)

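Loss function and optimizer:

# InceptionV3 itself was not trained with the Adam optimizer
# The various optimizers will be covered in detail in a separate article
optimizer=tf.train.AdamOptimizer(learning_rate=0.0001)

def loss_function(real,pred):
    # mask is 0 at padding positions (ID 0) and 1 everywhere else
    mask=1-np.equal(real,0)

    # Cross-entropy loss with the padding positions masked out
    loss_=tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=real,
        logits=pred
    )*mask

    return tf.reduce_mean(loss_)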
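The effect of the mask can be seen on a toy batch. This sketch mirrors the structure of loss_function (per-sample cross entropy, multiplied by the mask, then averaged over the whole batch) in plain Python; the function names and numbers are made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_cross_entropy(labels, logits_batch, pad_id=0):
    """Mean over the batch of per-sample cross entropy, with padded samples zeroed."""
    losses = []
    for label, logits in zip(labels, logits_batch):
        loss = -math.log(softmax(logits)[label])
        mask = 0.0 if label == pad_id else 1.0
        losses.append(loss * mask)
    return sum(losses) / len(losses)

labels = [2, 0]    # the second sample is padding at this time step
logits_batch = [[0.1, 0.2, 3.0], [5.0, 0.0, 0.0]]

loss = masked_cross_entropy(labels, logits_batch)
print(loss)
```

Note that the mean still divides by the full batch size, so padded positions lower the average rather than being excluded from it.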

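Training:

The features extracted by InceptionV3 are the encoder input.

The encoder output, the hidden state, and the caption text are the decoder inputs.

The decoder's hidden state is fed back in at the next step, and its predictions are used to compute the loss.

The ground-truth caption words are used as decoder input (teacher forcing).

Gradients are computed and applied.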
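How teacher forcing pairs inputs with labels can be sketched on a single caption: at step i the decoder receives the ground-truth word i-1 and is scored against word i. The helper name and the marker strings `<start>` and `<end>` are assumptions for this example.

```python
def teacher_forcing_pairs(target):
    """Return (decoder_input, label) for every time step of one caption."""
    return [(target[i - 1], target[i]) for i in range(1, len(target))]

caption = ["<start>", "a", "dog", "runs", "<end>"]
pairs = teacher_forcing_pairs(caption)
for step, (dec_input, label) in enumerate(pairs, start=1):
    print(step, dec_input, "->", label)
```

Because the input at each step comes from the label sequence and not from the previous prediction, an early mistake does not propagate through the rest of the caption during training.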

loss_plot=[]

EPOCHS=20

for epoch in range(EPOCHS):
    start=time.time()
    total_loss=0

    for (batch,(img_tensor,target)) in enumerate(dataset):
        loss=0

        # Reset the hidden state at the start of every batch
        hidden=decoder.reset_state(batch_size=target.shape[0])

        # One <start> token per sample; decoder input shape: (batch_size,1)
        dec_input=tf.expand_dims([tokenizer.word_index["<start>"]]*int(target.shape[0]),1)

        # Record operations for gradient computation in eager mode
        with tf.GradientTape() as tape:
            # Encode the features extracted by InceptionV3
            features=encoder(img_tensor)

            # Step through the caption one word at a time
            for i in range(1,target.shape[1]):

                # attention_weights are not needed here
                predictions,hidden,_=decoder(dec_input,features,hidden)
                loss+=loss_function(target[:,i],predictions)

                # Teacher forcing: feed the ground-truth word, not the prediction, as the next input
                dec_input=tf.expand_dims(target[:,i],1)

        total_loss+=(loss/int(target.shape[1]))

        # All trainable parameters
        variables=encoder.variables+decoder.variables

        # Compute and apply the gradients
        gradients=tape.gradient(loss,variables)
        optimizer.apply_gradients(zip(gradients,variables))

        if batch%100 == 0:
            print("epoch{},batch{},loss{:.4}".format(
                epoch+1,
                batch,
                loss.numpy()/int(target.shape[1])
            ))
    loss_plot.append(total_loss/len(cap_vector))

plt.plot(loss_plot)
plt.xlabel("epochs")
plt.ylabel("loss")
plt.show()

Prediction

Prediction does not use teacher forcing. Decoding stops as soon as the model predicts the preset end marker "<end>".
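The control flow of prediction can be sketched independently of the model: each predicted word is fed back in as the next input until the end marker appears or a maximum length is reached. In this sketch the trained decoder is replaced by a fixed lookup table, and the names greedy_decode, step_fn and next_word are hypothetical.

```python
def greedy_decode(step_fn, start_token, end_token, max_length):
    """Feed each prediction back in until end_token or max_length is reached."""
    result = []
    token = start_token
    for _ in range(max_length):
        token = step_fn(token)
        result.append(token)
        if token == end_token:
            break
    return result

# Stand-in for the trained decoder: a fixed next-word table
next_word = {"<start>": "a", "a": "dog", "dog": "runs", "runs": "<end>"}

caption = greedy_decode(next_word.get, "<start>", "<end>", max_length=10)
truncated = greedy_decode(next_word.get, "<start>", "<end>", max_length=2)
print(caption)
print(truncated)
```

The max_length guard matters because a model that never predicts the end marker would otherwise loop forever.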

def evaluate(image):
    attention_plot = np.zeros((max_length, attention_features_shape))

    # Initialize the hidden state
    hidden = decoder.reset_state(batch_size=1)

    # shape: (1,299,299,3)
    temp_input = tf.expand_dims(load_image(image)[0], 0)

    # Feature extraction
    img_tensor_val = image_features_extract_model(temp_input)

    # shape (1,8,8,2048) -> (1,64,2048)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))

    # shape: (1,64,256)
    features = encoder(img_tensor_val)

    # Add the batch dimension
    dec_input = tf.expand_dims([tokenizer.word_index["<start>"]], 0)
    result = []

    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)

        attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()

        # predictions holds unnormalized logits; softmax is monotonic, so argmax over
        # the logits picks the same word as argmax over the softmax output.
        # softmax suits mutually exclusive classes, sigmoid suits classes that can co-occur
        predicted_id = tf.argmax(predictions[0]).numpy()

        # Convert the ID back to a word
        result.append(tokenizer.index_word[predicted_id])

        # Stop at the preset end marker
        if tokenizer.index_word[predicted_id] == "<end>":
            return result, attention_plot

        # Feed the prediction back in as the next input
        # (teacher forcing would feed the ground-truth label here)
        dec_input = tf.expand_dims([predicted_id], 0)

    attention_plot = attention_plot[:len(result), :]
    return result, attention_plot

The following code visualizes where the attention mechanism looks for each predicted word:
It is mainly about displaying images, so it is not explained in detail.

def plot_attention(image, result, attention_plot):
    temp_image = np.array(Image.open(image))

    fig = plt.figure(figsize=(10, 10))

    len_result = len(result)
    # Square grid large enough to hold one subplot per predicted word
    grid_size = int(np.ceil(np.sqrt(len_result)))
    for l in range(len_result):
        # 64 attention weights -> 8x8 grid matching the feature map
        temp_att = np.resize(attention_plot[l], (8, 8))
        ax = fig.add_subplot(grid_size, grid_size, l+1)
        ax.set_title(result[l])
        img = ax.imshow(temp_image)
        ax.imshow(temp_att, cmap="gray", alpha=0.6, extent=img.get_extent())

    plt.tight_layout()
    plt.show()

# Pick a random validation image and compare the real and predicted captions
rid = np.random.randint(0, len(img_name_val))
image = img_name_val[rid]
real_caption = " ".join([tokenizer.index_word[i] for i in cap_val[rid] if i not in [0]])
result, attention_plot = evaluate(image)

print ("Real Caption:", real_caption)
print ("Prediction Caption:", " ".join(result))
plot_attention(image, result, attention_plot)

Image.open(img_name_val[rid])

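Summary

To caption an image we first need to extract its features. In this article we used transfer learning with the Inception model to do so; for Inception, the key idea to understand is convolution kernel factorization. The text prediction part is largely the same as machine translation with an attention mechanism. One last point: projects like this involve many shape conversions, which are an easy place to make mistakes, so they deserve extra care.

The code in this article comes from Yash Katariya, to whom I am grateful.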


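The copyright of this article belongs to its author; do not repost it without permission. If this article violates any rules, you can contact the administrator to have it removed.

When reposting, please cite the address of this article: http://specialneedsforspecialkids.com/yun/42798.html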
