(3) Introduction to Neural Networks: Hidden Layer Design

Posted by kun_jian on 2019-07-30 15:19 / 3529 views


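Author: chen_h
WeChat & QQ: 862251340
WeChat official account: coderpai
Jianshu: https://www.jianshu.com/p/8e1...

This tutorial is a translation of Peter Roelants's neural network tutorial, published with the author's permission.

The series is an introduction to neural networks in five parts. The full series is listed below.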

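(1) Introduction to Neural Networks: Linear Regression

Logistic classification function

(2) Introduction to Neural Networks: Logistic Regression (Classification)

(3) Introduction to Neural Networks: Hidden Layer Design

Softmax classification function

(4) Introduction to Neural Networks: Vectorization

(5) Introduction to Neural Networks: Building a Multilayer Network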

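Hidden layer

This part of the tutorial covers three topics:

Hidden layer design

Nonlinear activation functions

The backpropagation (BP) algorithm

The previous tutorials covered very simple models: a single regression model or a single classification model. In this tutorial we again build a binary classifier, but this time as a small neural network: the input is one-dimensional, the hidden layer has a single neuron, and that neuron uses a nonlinear activation function. The input x feeds the hidden neuron h through the weight wh, and h feeds the output neuron y through the weight wo.

We start by importing the packages used in this tutorial.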

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import colorConverter, ListedColormap 
from mpl_toolkits.mplot3d import Axes3D 
from matplotlib import cm

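Define the dataset

In this tutorial the input x is classified into two classes: blue for t = 1 and red for t = 0. The red class is a multimodal distribution that surrounds the blue class. The data is one-dimensional, but the two classes cannot be separated by a single linear boundary. The plot produced by the code below shows these properties.

The logistic classification model from the previous tutorial cannot classify this data correctly, because it can only learn a linear separation. Adding a hidden neuron with a nonlinear activation function lets the model learn a nonlinear classifier.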

# Define and generate the samples
nb_of_samples_per_class = 20  # The number of sample in each class
blue_mean = [0]  # The mean of the blue class
red_left_mean = [-2]  # The mean of the left red cluster
red_right_mean = [2]  # The mean of the right red cluster

std_dev = 0.5  # standard deviation of both classes
# Generate samples from both classes
x_blue = np.random.randn(nb_of_samples_per_class, 1) * std_dev + blue_mean
# Integer division: np.random.randn needs integer dimensions (a float fails in Python 3)
x_red_left = np.random.randn(nb_of_samples_per_class // 2, 1) * std_dev + red_left_mean
x_red_right = np.random.randn(nb_of_samples_per_class // 2, 1) * std_dev + red_right_mean

# Merge samples in set of input variables x, and corresponding set of
# output variables t
x = np.vstack((x_blue, x_red_left, x_red_right))
t = np.vstack((np.ones((x_blue.shape[0],1)), 
               np.zeros((x_red_left.shape[0],1)), 
               np.zeros((x_red_right.shape[0], 1))))

# Plot samples from both classes as lines on a 1D space
plt.figure(figsize=(8,0.5))
plt.xlim(-3,3)
plt.ylim(-1,1)
# Plot samples
plt.plot(x_blue, np.zeros_like(x_blue), "b|", ms = 30) 
plt.plot(x_red_left, np.zeros_like(x_red_left), "r|", ms = 30) 
plt.plot(x_red_right, np.zeros_like(x_red_right), "r|", ms = 30) 
plt.gca().axes.get_yaxis().set_visible(False)
plt.title("Input samples from the blue and red class")
plt.xlabel("$x$", fontsize=15)
plt.show()

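Nonlinear activation function

The nonlinear transfer function used here is the Gaussian radial basis function (RBF). Outside of radial basis function networks, the RBF is rarely used as an activation function in neural networks; the sigmoid function is far more common. For the input data x designed above, however, the RBF separates the blue samples from the red samples well. The code below plots the RBF function, which is defined as:

$$\text{RBF}(z) = e^{-z^2}$$

The derivative of the RBF function is:

$$\frac{d\,\text{RBF}(z)}{dz} = -2 z e^{-z^2} = -2 z \cdot \text{RBF}(z)$$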

# Define the rbf function
def rbf(z):
    return np.exp(-z**2)

# Plot the rbf function
z = np.linspace(-6,6,100)
plt.plot(z, rbf(z), "b-")
plt.xlabel("$z$", fontsize=15)
plt.ylabel("$e^{-z^2}$", fontsize=15)
plt.title("RBF function")
plt.grid()
plt.show()
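
The derivative of the RBF, -2·z·exp(-z²), is used later in the hidden-layer gradient, so it is worth checking. The following is a minimal sketch that compares the analytic derivative with a central finite difference; the helper names rbf_derivative and numerical_derivative are assumptions made for this illustration and do not appear in the tutorial code.

```python
import numpy as np

def rbf(z):
    """Gaussian RBF, as defined in the tutorial."""
    return np.exp(-z**2)

def rbf_derivative(z):
    """Analytic derivative: d/dz exp(-z^2) = -2 * z * exp(-z^2)."""
    return -2 * z * rbf(z)

def numerical_derivative(f, z, eps=1e-6):
    """Central finite-difference approximation of df/dz (hypothetical helper)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = np.linspace(-3, 3, 13)
max_diff = np.max(np.abs(rbf_derivative(z) - numerical_derivative(rbf, z)))
print("largest difference between analytic and numerical derivative:", max_diff)
```

If the two agree to within a small tolerance, the analytic form can be trusted in the backpropagation step.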

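Backpropagation

We train the model with the backpropagation (BP) algorithm, a standard way of optimizing neural networks. Each iteration of backpropagation consists of two steps:

A forward pass that computes the output of the network.

A backward pass that propagates the error between the network output and the target back through the network to update its parameters.

1. Forward pass

In the forward pass, the input is processed layer by layer until the model produces its output.

Computing the hidden-layer activation

After the activation function, the output of the hidden layer h is:

$$h = \text{RBF}(x \cdot w_h) = e^{-(x \cdot w_h)^2}$$

where wh is the weight parameter. The function hidden_activations(x, wh) implements this.

Computing the output activation

The last layer of the network takes the hidden-layer output h as its input and uses the logistic function as its activation function:

$$y = \sigma(h \cdot w_o - 1) = \frac{1}{1 + e^{-(h \cdot w_o - 1)}}$$

where wo is the weight of the output layer. The function output_activations(h, wo) implements this. Note the bias term of -1 in the formula. Without a bias term, the logistic output neuron can only learn a decision boundary that passes through the origin. The RBF in the hidden layer maps every input to a positive value, so no sample would ever fall below 0, on the other side of that boundary, and the model could not learn a useful classifier. Adding an intercept (the bias term) shifts the decision boundary away from the origin. Normally the bias is trained like any other weight, but because this example is so simple we fix it to a constant.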

# Define the logistic function
def logistic(z): 
    return 1 / (1 + np.exp(-z))

# Function to compute the hidden activations
def hidden_activations(x, wh):
    return rbf(x * wh)

# Define output layer feedforward
def output_activations(h , wo):
    return logistic(h * wo - 1)

# Define the neural network function
def nn(x, wh, wo): 
    return output_activations(hidden_activations(x, wh), wo)

# Define the neural network prediction function that only returns
#  1 or 0 depending on the predicted class
def nn_predict(x, wh, wo): 
    return np.around(nn(x, wh, wo))
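
To see why the constant bias of -1 matters, the sketch below compares the output neuron with and without it. The hidden values h and the output weight wo = 5.0 are illustrative assumptions, not values from the tutorial; the only property relied on is that the RBF output h is always positive.

```python
import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

# Example hidden-layer outputs. The RBF only produces positive values.
h = np.array([0.01, 0.1, 0.5, 1.0])
wo = 5.0  # illustrative output weight (assumption)

y_no_bias = logistic(h * wo)        # input is always > 0, so output is always > 0.5
y_with_bias = logistic(h * wo - 1)  # output crosses 0.5 where h = 1 / wo

print("without bias:", np.around(y_no_bias))
print("with bias:   ", np.around(y_with_bias))
```

Without the bias every sample lands in the same class for a given sign of wo; with the bias the threshold moves to h = 1 / wo, which lets small and large hidden activations be assigned to different classes.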

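2. Backward pass

In the backward pass we first compute the error between the network output and the target. This error is then propagated backward, layer by layer, to update the weights of the network.

In each layer, gradient descent updates every parameter in the direction of the negative gradient.

The parameters wh and wo are updated with $w(k+1) = w(k) - \Delta w(k+1)$, where $\Delta w = \mu \cdot \partial \xi / \partial w$, μ is the learning rate, and $\partial \xi / \partial w$ is the gradient of the loss function ξ with respect to the parameter w.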
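
As a minimal sketch of this update rule, the loop below applies w(k+1) = w(k) - μ·∂ξ/∂w to a toy one-parameter loss ξ(w) = (w - 3)². The toy loss, the starting point, and the learning rate are assumptions chosen for illustration; they are not part of the tutorial's model.

```python
def toy_gradient(w):
    """Gradient of the toy loss (w - 3)^2 with respect to w."""
    return 2 * (w - 3)

w = 0.0   # initial parameter value (assumption)
mu = 0.1  # learning rate (assumption)

for k in range(100):
    delta_w = mu * toy_gradient(w)  # step size scaled by the learning rate
    w = w - delta_w                 # move against the gradient

print("parameter after 100 updates:", w)
```

Each update shrinks the distance to the minimum at w = 3 by a constant factor, which is the behaviour the real network aims for on its more complicated loss surface.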

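Computing the loss function

In this model the loss function ξ is the cross-entropy loss, the same one explained in the logistic classification tutorial:

$$\xi(t, y) = -\sum_{i=1}^{n} \left[ t_i \log(y_i) + (1 - t_i) \log(1 - y_i) \right]$$

The code below plots the loss as a function of the parameters wh and wo. The error surface is no longer convex, and it is mirrored about the wh = 0 axis: wh and -wh give the same loss.

The plot also shows a very steep gradient around wh = 0 for wo > 0, with the minima running along the lower edge of that peak. If the learning rate is too large, an update can jump over the minimum and land on the steep slope on the other side. Because the slope is so steep, the next update is then very large as well. For this reason we start with a fairly small learning rate.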

# Define the cost function
def cost(y, t):
    return - np.sum(np.multiply(t, np.log(y)) + np.multiply((1-t), np.log(1-y)))

# Define a function to calculate the cost for a given set of parameters
def cost_for_param(x, wh, wo, t):
    return cost(nn(x, wh, wo) , t)

# Plot the cost in function of the weights
# Define a vector of weights for which we want to plot the cost
nb_of_ws = 200 # compute the cost nb_of_ws times in each dimension
wsh = np.linspace(-10, 10, num=nb_of_ws) # hidden weights
wso = np.linspace(-10, 10, num=nb_of_ws) # output weights
ws_x, ws_y = np.meshgrid(wsh, wso) # generate grid
cost_ws = np.zeros((nb_of_ws, nb_of_ws)) # initialize cost matrix
# Fill the cost matrix for each combination of weights
for i in range(nb_of_ws):
    for j in range(nb_of_ws):
        cost_ws[i,j] = cost(nn(x, ws_x[i,j], ws_y[i,j]) , t)
# Plot the cost function surface
fig = plt.figure()
# Axes3D(fig) no longer attaches itself to the figure in recent Matplotlib versions
ax = fig.add_subplot(projection="3d")
# plot the surface
surf = ax.plot_surface(ws_x, ws_y, cost_ws, linewidth=0, cmap=cm.pink)
ax.view_init(elev=60, azim=-30)
cbar = fig.colorbar(surf)
ax.set_xlabel("$w_h$", fontsize=15)
ax.set_ylabel("$w_o$", fontsize=15)
ax.set_zlabel("$\\xi$", fontsize=15)
cbar.ax.set_ylabel("$\\xi$", fontsize=15)
plt.title("Cost function surface")
plt.grid()
plt.show()
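
The symmetry of the loss surface about wh = 0 follows from the RBF squaring its input, so wh and -wh produce the same hidden activations. The sketch below checks this numerically. It regenerates a dataset with the same shape as the tutorial's; the fixed seed and the three test weight pairs are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility
x = np.vstack((rng.normal(0, 0.5, (20, 1)),    # blue class around 0
               rng.normal(-2, 0.5, (10, 1)),   # left red cluster
               rng.normal(2, 0.5, (10, 1))))   # right red cluster
t = np.vstack((np.ones((20, 1)), np.zeros((20, 1))))

def nn(x, wh, wo):
    h = np.exp(-(x * wh)**2)                 # RBF hidden activation
    return 1 / (1 + np.exp(-(h * wo - 1)))   # logistic output with bias -1

def cost(y, t):
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

pairs = []
for wh, wo in [(0.5, 2.0), (1.2, 5.5), (3.0, -4.0)]:
    c_pos = cost(nn(x, wh, wo), t)
    c_neg = cost(nn(x, -wh, wo), t)
    pairs.append((c_pos, c_neg))
    print("wh = +/-{:.1f}, wo = {:.1f}: cost {:.4f} vs {:.4f}".format(wh, wo, c_pos, c_neg))
```

One consequence is that gradient descent can settle at either a positive or a negative wh depending on where it starts; both describe the same classifier.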

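Output layer update

$\partial \xi_i / \partial w_o$ is the gradient at the output layer for each sample i. Following the derivation in the second tutorial, we get:

$$\frac{\partial \xi_i}{\partial w_o} = \frac{\partial z_{oi}}{\partial w_o} \frac{\partial y_i}{\partial z_{oi}} \frac{\partial \xi_i}{\partial y_i} = h_i (y_i - t_i) = h_i \delta_{oi}$$

where $z_{oi} = h_i w_o - 1$ is the input to the output neuron (the constant bias does not affect the derivative), $h_i$ is the hidden-layer activation for sample i, and $\partial \xi_i / \partial z_{oi} = \delta_{oi} = y_i - t_i$ is the derivative of the error at the output layer.

The function gradient_output(y, t) implements $\delta_o$, and gradient_weight_out(h, grad_output) implements $\partial \xi / \partial w_o$.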

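Hidden layer update

$\partial \xi_i / \partial w_h$ is the gradient at the hidden layer for each sample i. It is computed as:

$$\frac{\partial \xi_i}{\partial w_h} = \frac{\partial z_{hi}}{\partial w_h} \frac{\partial h_i}{\partial z_{hi}} \frac{\partial \xi_i}{\partial h_i}$$

where

$$z_{hi} = x_i w_h$$

$\partial \xi_i / \partial z_{hi} = \delta_{hi}$ is the gradient of the error with respect to the input of the hidden neuron. It can also be read as the contribution of $z_{hi}$ to the final error. We define this error gradient $\delta_{hi}$ as:

$$\delta_{hi} = \frac{\partial \xi_i}{\partial z_{hi}} = \frac{\partial h_i}{\partial z_{hi}} \frac{\partial z_{oi}}{\partial h_i} \frac{\partial \xi_i}{\partial z_{oi}} = -2 z_{hi} h_i \cdot w_o \cdot \delta_{oi}$$

Because $\partial z_{hi} / \partial w_h = x_i$, the final result is:

$$\frac{\partial \xi_i}{\partial w_h} = x_i \delta_{hi} = -2 x_i z_{hi} h_i w_o (y_i - t_i)$$

For batch processing, the per-sample gradients of each parameter are summed to give the final gradient.

The function gradient_hidden(wo, grad_output) computes the $w_o \delta_o$ part of $\delta_h$, that is, the gradient with respect to h.
The function gradient_weight_hidden(x, zh, h, grad_hidden) multiplies in the RBF derivative $-2 z_h h$ and the input x to give $\partial \xi / \partial w_h$.
The function backprop_update(x, t, wh, wo, learning_rate) implements one iteration of the backpropagation algorithm.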

# Define the error function
def gradient_output(y, t):
    return y - t

# Define the gradient function for the weight parameter at the output layer
def gradient_weight_out(h, grad_output): 
    return  h * grad_output

# Define the gradient function for the hidden layer
def gradient_hidden(wo, grad_output):
    return wo * grad_output

# Define the gradient function for the weight parameter at the hidden layer
def gradient_weight_hidden(x, zh, h, grad_hidden):
    return x * -2 * zh * h * grad_hidden

# Define the update function to update the network parameters over 1 iteration
def backprop_update(x, t, wh, wo, learning_rate):
    # Compute the output of the network
    # This can be done with y = nn(x, wh, wo), but we need the intermediate 
    #  h and zh for the weight updates.
    zh = x * wh
    h = rbf(zh)  # hidden_activations(x, wh)
    y = output_activations(h, wo)
    # Compute the gradient at the output
    grad_output = gradient_output(y, t)
    # Get the delta for wo
    d_wo = learning_rate * gradient_weight_out(h, grad_output)
    # Compute the gradient at the hidden layer
    grad_hidden = gradient_hidden(wo, grad_output)
    # Get the delta for wh
    d_wh = learning_rate * gradient_weight_hidden(x, zh, h, grad_hidden)
    # return the update parameters
    return (wh-d_wh.sum(), wo-d_wo.sum())
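
A common way to gain confidence in hand-derived gradients like these is a numerical gradient check. The sketch below recomputes the summed gradients for wh and wo and compares them with central finite differences of the cross-entropy loss. The five data points, the weights, and the helper names analytic_gradients and numerical_gradients are assumptions made for this illustration.

```python
import numpy as np

def rbf(z):
    return np.exp(-z**2)

def logistic(z):
    return 1 / (1 + np.exp(-z))

def nn(x, wh, wo):
    return logistic(rbf(x * wh) * wo - 1)

def cost(y, t):
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def analytic_gradients(x, t, wh, wo):
    """Summed gradients, using the same expressions as the tutorial code."""
    zh = x * wh
    h = rbf(zh)
    y = logistic(h * wo - 1)
    grad_output = y - t                                # delta_o
    d_wo = np.sum(h * grad_output)                     # d(xi)/d(wo)
    d_wh = np.sum(x * -2 * zh * h * wo * grad_output)  # d(xi)/d(wh)
    return d_wh, d_wo

def numerical_gradients(x, t, wh, wo, eps=1e-6):
    """Central finite differences of the loss (hypothetical helper)."""
    d_wh = (cost(nn(x, wh + eps, wo), t) - cost(nn(x, wh - eps, wo), t)) / (2 * eps)
    d_wo = (cost(nn(x, wh, wo + eps), t) - cost(nn(x, wh, wo - eps), t)) / (2 * eps)
    return d_wh, d_wo

# Tiny hand-made dataset (assumption): red at the edges, blue in the middle
x = np.array([[-2.1], [-0.3], [0.0], [0.4], [1.9]])
t = np.array([[0.0], [1.0], [1.0], [1.0], [0.0]])
wh, wo = 0.8, 2.0

ga = analytic_gradients(x, t, wh, wo)
gn = numerical_gradients(x, t, wh, wo)
print("analytic :", ga)
print("numerical:", gn)
```

Agreement between the two rows indicates that the chain-rule derivation and its implementation are consistent with the loss function.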

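Backpropagation updates

The code below runs 50 iterations of the update. The white points mark the parameters wh and wo on the error surface at iteration k.

During training we decrease the learning rate linearly so that it reaches 0 at the last update. This keeps the final updates from bouncing around the minimum.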

# Run backpropagation
# Set the initial weight parameter
wh = 2
wo = -5
# Set the learning rate
learning_rate = 0.2

# Start the gradient descent updates and plot the iterations
nb_of_iterations = 50  # number of gradient descent updates
lr_update = learning_rate / nb_of_iterations # learning rate update rule
w_cost_iter = [(wh, wo, cost_for_param(x, wh, wo, t))]  # List to store the weight values over the iterations
for i in range(nb_of_iterations):
    learning_rate -= lr_update # decrease the learning rate
    # Update the weights via backpropagation
    wh, wo = backprop_update(x, t, wh, wo, learning_rate) 
    w_cost_iter.append((wh, wo, cost_for_param(x, wh, wo, t)))  # Store the values for plotting

# Print the final cost
print("final cost is {:.2f} for weights wh: {:.2f} and wo: {:.2f}".format(cost_for_param(x, wh, wo, t), wh, wo))

On our machine the final output is:
final cost is 10.81 for weights wh: 1.20 and wo: 5.56

Because the samples are drawn randomly and no random seed is set, the result on your machine may differ.
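
One way to make runs comparable is to fix the seed of the random number generator that draws the samples. The sketch below is an assumption-laden rewrite, not the article's code: make_data and train are hypothetical helpers that reproduce the tutorial's data generation and update loop in compact form, and the seeds 0 and 1 are arbitrary.

```python
import numpy as np

def rbf(z):
    return np.exp(-z**2)

def logistic(z):
    return 1 / (1 + np.exp(-z))

def make_data(seed):
    """Draw 20 blue and 2 x 10 red samples with a seeded generator."""
    rng = np.random.RandomState(seed)
    x_blue = rng.randn(20, 1) * 0.5 + 0
    x_red_left = rng.randn(10, 1) * 0.5 - 2
    x_red_right = rng.randn(10, 1) * 0.5 + 2
    x = np.vstack((x_blue, x_red_left, x_red_right))
    t = np.vstack((np.ones((20, 1)), np.zeros((20, 1))))
    return x, t

def train(x, t, wh=2.0, wo=-5.0, learning_rate=0.2, nb_of_iterations=50):
    """Gradient descent with the linearly decaying learning rate."""
    lr_update = learning_rate / nb_of_iterations
    for _ in range(nb_of_iterations):
        learning_rate -= lr_update
        zh = x * wh
        h = rbf(zh)
        grad_output = logistic(h * wo - 1) - t
        d_wo = learning_rate * np.sum(h * grad_output)
        d_wh = learning_rate * np.sum(x * -2 * zh * h * wo * grad_output)
        wh, wo = wh - d_wh, wo - d_wo
    return wh, wo

run_a = train(*make_data(seed=0))
run_b = train(*make_data(seed=0))  # same seed: same samples, same weights
run_c = train(*make_data(seed=1))  # different seed: different samples
print(run_a, run_b, run_c)
```

With the same seed the two runs end at identical weights, while a different seed changes the samples and therefore the final weights and cost.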

# Plot the weight updates on the error surface
# Plot the error surface
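fig = plt.figure()
# Axes3D(fig) no longer attaches itself to the figure in recent Matplotlib versions
ax = fig.add_subplot(projection="3d")
surf = ax.plot_surface(ws_x, ws_y, cost_ws, linewidth=0, cmap=cm.pink)
ax.view_init(elev=60, azim=-30)
cbar = fig.colorbar(surf)
cbar.ax.set_ylabel("$\\xi$", fontsize=15)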

# Plot the updates
for i in range(1, len(w_cost_iter)):
    wh1, wo1, c1 = w_cost_iter[i-1]
    wh2, wo2, c2 = w_cost_iter[i]
    # Plot the weight-cost value and the line that represents the update 
    ax.plot([wh1], [wo1], [c1], "w+")  # Plot the weight cost value
    ax.plot([wh1, wh2], [wo1, wo2], [c1, c2], "w-")
# Plot the last weights
wh1, wo1, c1 = w_cost_iter[-1]
ax.plot([wh1], [wo1], [c1], "w+")  # pass the cost as a sequence, like the other coordinates
# Show figure
ax.set_xlabel("$w_h$", fontsize=15)
ax.set_ylabel("$w_o$", fontsize=15)
ax.set_zlabel("$\\xi$", fontsize=15)
plt.title("Gradient descent updates on cost surface")
plt.grid()
plt.show()

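Visualizing the classification result

The code below visualizes the final classification. Across the input domain, blue and red show the class the network predicts. The plot shows that all samples are classified correctly.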

# Plot the resulting decision boundary
# Generate a grid over the input space to plot the color of the
#  classification at that grid point
nb_of_xs = 100
xs = np.linspace(-3, 3, num=nb_of_xs)
ys = np.linspace(-1, 1, num=nb_of_xs)
xx, yy = np.meshgrid(xs, ys) # create the grid
# Initialize and fill the classification plane
classification_plane = np.zeros((nb_of_xs, nb_of_xs))
for i in range(nb_of_xs):
    for j in range(nb_of_xs):
        classification_plane[i,j] = nn_predict(xx[i,j], wh, wo)
# Create a color map to show the classification colors of each grid point
cmap = ListedColormap([
        colorConverter.to_rgba("r", alpha=0.25),
        colorConverter.to_rgba("b", alpha=0.25)])

# Plot the classification plane with decision boundary and input samples
plt.figure(figsize=(8,0.5))
plt.contourf(xx, yy, classification_plane, cmap=cmap)
plt.xlim(-3,3)
plt.ylim(-1,1)
# Plot samples from both classes as lines on a 1D space
plt.plot(x_blue, np.zeros_like(x_blue), "b|", ms = 30) 
plt.plot(x_red_left, np.zeros_like(x_red_left), "r|", ms = 30) 
plt.plot(x_red_right, np.zeros_like(x_red_right), "r|", ms = 30) 
plt.gca().axes.get_yaxis().set_visible(False)
plt.title("Input samples and their classification")
plt.xlabel("x")
plt.show()
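
The position of the decision boundary can also be worked out directly. The output is 0.5 where h·wo - 1 = 0, that is, where exp(-(x·wh)²) = 1/wo, which gives |x| = sqrt(ln(wo)) / wh for wo > 1. The sketch below evaluates this for the final weights reported above (wh = 1.20, wo = 5.56); the test inputs are assumptions chosen for illustration.

```python
import numpy as np

wh, wo = 1.20, 5.56  # final weights reported in the article's run

def nn_predict(x, wh, wo):
    h = np.exp(-(x * wh)**2)            # RBF hidden activation
    y = 1 / (1 + np.exp(-(h * wo - 1))) # logistic output with bias -1
    return np.around(y)

# |x| at which the network output equals 0.5 (valid for wo > 1)
boundary = np.sqrt(np.log(wo)) / wh
print("decision boundary at |x| =", boundary)

x_test = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # illustrative inputs (assumption)
preds = nn_predict(x_test, wh, wo)
print("predictions:", preds)
```

For these weights the boundary sits at roughly |x| = 1.09, between the blue cluster around 0 and the red clusters around -2 and 2.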

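Transformation of the input domain

How can the network classify nonlinearly when its last layer is a linear logistic classifier? The key is the nonlinear RBF in the hidden layer. The RBF maps samples near the origin (the blue class) to values well above 0 and samples far from the origin (the red class) to values close to 0. In the plot produced below, the red samples sit on the left near 0 and the blue samples lie further to the right. After this projection the two classes are linearly separable, so the logistic output layer can classify them.

Note also that the peak of the Gaussian used here has no offset: the function is centered on, and symmetric about, the origin of the input domain.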

# Plot projected samples from both classes as lines on a 1D space
plt.figure(figsize=(8,0.5))
plt.xlim(-0.01,1)
plt.ylim(-1,1)
# Plot projected samples
plt.plot(hidden_activations(x_blue, wh), np.zeros_like(x_blue), "b|", ms = 30) 
plt.plot(hidden_activations(x_red_left, wh), np.zeros_like(x_red_left), "r|", ms = 30) 
plt.plot(hidden_activations(x_red_right, wh), np.zeros_like(x_red_right), "r|", ms = 30) 
plt.gca().axes.get_yaxis().set_visible(False)
plt.title("Projection of the input samples by the hidden layer.")
plt.xlabel("h")
plt.show()
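
The same idea can be checked with a few numbers. The sketch below projects some hand-picked inputs through the hidden layer and compares them with the threshold h = 1/wo at which the output neuron switches class. The weights are the ones reported in the article's run; the sample positions are assumptions for illustration.

```python
import numpy as np

wh, wo = 1.20, 5.56  # final weights reported in the article's run

def hidden_activations(x, wh):
    return np.exp(-(x * wh)**2)

x_blue = np.array([-0.5, 0.0, 0.5])       # illustrative points near the origin
x_red = np.array([-2.0, -1.5, 1.5, 2.0])  # illustrative points far from the origin

h_blue = hidden_activations(x_blue, wh)
h_red = hidden_activations(x_red, wh)
threshold = 1 / wo  # output is 0.5 where h * wo - 1 = 0

print("h for blue points:", h_blue)
print("h for red points: ", h_red)
print("threshold in h:   ", threshold)
```

Both red clusters, although on opposite sides of the input axis, collapse to the same small range of h, so a single threshold in the hidden space separates them from the blue points.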

