A brief look at the deep stacking network: a fairly practical deep learning algorithm

wenhai.he, posted 2019-04-25 17:57

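Sharing the notes from my group-meeting talk.
The slides from the talk are attached:
http://vdisk.weibo.com/s/zfic-IP2yagqu
The talk draws on roughly a dozen original papers. I read them as printouts; if anyone needs the list, I will go back and type up the titles from the printed copies. They are papers along the lines of "Deep Stacking Networks for Information Retrieval".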

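Main text
The deep stacking network (DSN) is a discriminative model proposed by Li Deng. Its current applications are mainly CTR, IR, and classification and regression on language and image data.
The overall architecture is shown in the figure below.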

1. Brief introduction

Why DSN
DNNs already work quite well and plenty of packages are available, so why use a DSN at all?
A big part of the reason is that a DNN uses stochastic gradient descent in its fine-tuning phase, which is hard to parallelize across machines.
The DSN, in contrast, attacks the learning scalability problem.

Central Idea - Stacking
As the name suggests, the core idea of the DSN is stacking. It comes from the idea of stacked generalization, proposed by Wolpert in 1992.
See the figure below.

Level-0 models are based on different learning models and use original data (level-0 data)
Level-1 models are based on results of level-0 models (level-1 data are outputs of level-0 models) -- also called “generalizer”
Stacked generalization is widely used as an ensemble method.
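To make the level-0 / level-1 split concrete, here is a minimal sketch of stacked generalization in NumPy. The toy data, the two level-0 models (a least-squares line and a k-nearest-neighbour average), and every function name are my own assumptions for illustration; none of them come from the talk or from Wolpert's paper.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear regression with a bias term; returns a predictor."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Xn: np.hstack([Xn, np.ones((Xn.shape[0], 1))]) @ w

def fit_knn(X, y, k=5):
    """Predict the mean target of the k nearest training points."""
    def predict(Xn):
        d = ((Xn[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        idx = np.argsort(d, axis=1)[:, :k]
        return y[idx].mean(axis=1)
    return predict

def stack(X, y, fitters, n_folds=5):
    """Train level-0 models, then a level-1 generalizer on their out-of-fold outputs."""
    n = X.shape[0]
    level1_data = np.zeros((n, len(fitters)))
    for held_out in np.array_split(np.arange(n), n_folds):
        train = np.setdiff1d(np.arange(n), held_out)
        for j, fit in enumerate(fitters):
            # Level-1 data: predictions for points the level-0 model has not seen.
            level1_data[held_out, j] = fit(X[train], y[train])(X[held_out])
    generalizer = fit_linear(level1_data, y)       # level-1 model
    level0 = [fit(X, y) for fit in fitters]        # refit level-0 models on all data
    return lambda Xn: generalizer(np.column_stack([m(Xn) for m in level0]))

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = X[:, 0] + np.sin(3 * X[:, 1]) + 0.1 * rng.standard_normal(200)

model = stack(X, y, [fit_linear, fit_knn])
pred = model(X)
```

The generalizer only ever sees out-of-fold predictions during training, which is what keeps it from simply trusting whichever level-0 model memorized the training set best.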

Central Idea - DCN VS DSN
Interestingly, the DSN was first proposed under the name DCN, short for Deep Convex Network.
Deng's explanation is that the two names emphasize different things:
Deep Convex Network: accentuates the role of convex optimization
Deep Stacking Network: the key operation of stacking is emphasized

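2. Algorithm details
This part contains few formula derivations. One reason is that the formulas really are quite simple; the other is that I want to cover what matters more in engineering practice, such as hyperparameter selection and weight initialization.
The main parts are: input and output, W & U, fine-tuning, hyper-parameter, regularization, and over-fitting.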

Input
In the DSN architecture, the input is the part circled in red in the figure below.

The input is illustrated with three applications: image, speech, and semantic utterance classification.

In some cases no feature selection is needed:
For images, the input can be a number of pixels or extracted features, or values based at least in part upon intensity values, RGB values (or the like) corresponding to the respective pixels.
For speech, the input can be samples of the speech waveform or features extracted from speech waveforms (such as power spectra or cepstral coefficients).
Note that using the speech waveform as the raw features for a speech recognizer is not a crazy idea.

In other cases feature selection is needed:
For example:
Acoustic models have 9 standard MFCC features and millions of frames as training samples, so they naturally need no feature selection.
Semantic utterance classification, on the other hand, has as many as 125,000 unique trigrams as potential features but only 16,000 utterances. This makes the feature space sparse, so feature selection is needed.

Which feature selection algorithm is used makes little difference. Here is the approach based on a boosting classifier:
Semantic Classification
-The input feature space is shrunk using the n-grams selected by the Boosting classifier.
-The weights that come with the decision stumps are ignored; only binary features indicating absence or presence are used. (A decision stump is a single-node decision tree.)
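The two bullets above can be sketched as follows. This is a plain AdaBoost with decision stumps over 0/1 n-gram indicators, written from scratch as an assumption about what "selected by the Boosting classifier" means; the actual boosting tool used in the papers is not specified in these notes, and the toy data and the name `boosting_select` are hypothetical.

```python
import numpy as np

def boosting_select(X, y, n_rounds=5):
    """Return indices of the binary features picked by AdaBoost decision stumps.

    X: (n_samples, n_features) array of 0/1 n-gram indicators.
    y: labels in {-1, +1}.
    """
    n = X.shape[0]
    w = np.full(n, 1.0 / n)          # sample weights
    S = 2 * X - 1                    # stump output for polarity +1: +1 if n-gram present
    selected = []
    for _ in range(n_rounds):
        # Weighted error of each stump with polarity +1; polarity -1 has error 1 - err_pos.
        err_pos = ((S != y[:, None]) * w[:, None]).sum(axis=0)
        err = np.minimum(err_pos, 1.0 - err_pos)
        j = int(np.argmin(err))
        polarity = 1 if err_pos[j] <= 0.5 else -1
        e = np.clip(err[j], 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - e) / e)
        # Re-weight samples so the next stump focuses on current mistakes.
        w *= np.exp(-alpha * y * polarity * S[:, j])
        w /= w.sum()
        if j not in selected:
            selected.append(j)
    return selected

# Toy data (illustrative): 50 binary features, the label depends on features 3 and 17.
rng = np.random.default_rng(0)
X = (rng.random((400, 50)) < 0.1).astype(int)
y = np.where((X[:, 3] | X[:, 17]) == 1, 1, -1)

selected = boosting_select(X, y)
# The stump weights (alpha) are dropped; only presence/absence indicators are kept.
X_small = X[:, selected]
```

Only the column indices survive the selection step; the DSN then sees a much narrower binary input.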

Output

It depends on how the problem being solved is defined. The output can be:
-representative of the values 0, 1, 2, 3, and so forth up to 9, with a 0-1 coding scheme
-representative of phones
-HMM states of phones
-context-dependent HMM states of phones
and so on.
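For the digit case, a 0-1 coding scheme is just a one-hot target matrix. A small sketch, assuming targets are stored one column per sample (the layout and the name `one_hot` are my own choices):

```python
import numpy as np

def one_hot(labels, n_classes):
    """Return a (n_classes, n_samples) 0-1 target matrix, one column per sample."""
    T = np.zeros((n_classes, len(labels)))
    T[labels, np.arange(len(labels))] = 1.0
    return T

# Three samples with digit labels 3, 0 and 9.
T = one_hot(np.array([3, 0, 9]), n_classes=10)
```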

Learning
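Let us start with the bottom layer, shown in the figure below, to explain how learning works.

(Douban's editor is too much trouble, so I will just paste the figure.)
That is all there is to the computation for a single layer; it is actually quite simple.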
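Since the figure with the formulas did not survive here, the sketch below shows the single-layer computation as it is usually described for DSNs: a sigmoid hidden layer H = sigmoid(W^T X), a linear output Y = U^T H, and U obtained in closed form by least squares once W is fixed. Treat the exact form as my assumption rather than a transcription of the missing figure; the shapes and function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def module_forward(X, W, U):
    """X: (d, n) inputs, W: (d, h) lower weights, U: (h, c) upper weights."""
    H = sigmoid(W.T @ X)     # hidden activations, (h, n)
    Y = U.T @ H              # linear outputs, (c, n)
    return H, Y

def solve_upper_weights(H, T):
    """U minimising ||U.T @ H - T||^2; convex once W (and hence H) is fixed."""
    # lstsq solves H.T @ U = T.T in the least-squares sense.
    U, *_ = np.linalg.lstsq(H.T, T.T, rcond=None)
    return U

# Toy problem (illustrative): 4 input features, 8 hidden units, 3 classes.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 100))
labels = rng.integers(0, 3, size=100)
T = np.eye(3)[:, labels]              # 0-1 coded targets, (3, 100)
W = rng.standard_normal((4, 8))       # random W here; the bottom layer would normally use an RBM

H = sigmoid(W.T @ X)
U = solve_upper_weights(H, T)
_, Y = module_forward(X, W, U)
```

The convex step for U is where the "Deep Convex Network" name comes from: given W, there is nothing iterative about finding U.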

In practice, however, the computation raises a number of questions:
1. How to initialize W
2. How to tune the hyperparameters
3. When regularization is needed
4. How overfitting shows up when solving concrete problems

Setting Weight Matrices W
The previous section covered the computation for a single layer. For multiple layers the structure is as follows:

As the overall architecture diagram shows, each layer's output is appended to the input of the next layer. The remaining computation steps are the same as for a single layer.
Initializing W, however, is rather interesting. There are four options:
1. Take the same W from the immediately lower module already adjusted via fine tuning.
2. Take a copy of the RBM that initialized W at the bottom module.
3. Use random numbers, making the full W maximally random before fine tuning.
4. Mix the above three choices with various weighting and with randomized order or otherwise.
Note that the sub-matrix of W corresponding to the output units from the lower modules is always initialized with random numbers.
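Here is a rough sketch of the stacking loop using option 1: each new module copies W from the module below and gets fresh random rows for the newly appended output units. Fine-tuning and RBM pretraining are left out, so the random bottom W is only a stand-in, and `init_upper_W`, `train_stack`, and the 0.1 scale are hypothetical choices of mine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_U(H, T):
    """Closed-form least-squares upper weights for fixed hidden activations."""
    U, *_ = np.linalg.lstsq(H.T, T.T, rcond=None)
    return U

def init_upper_W(W_lower, n_new_rows, rng, scale=0.1):
    """Option 1: copy the lower W; rows for the appended output units are random."""
    random_rows = scale * rng.standard_normal((n_new_rows, W_lower.shape[1]))
    return np.vstack([W_lower, random_rows])

def train_stack(X, T, W_bottom, n_modules, rng):
    """Return a list of (W, U) pairs, one per module, bottom first."""
    modules, inp, W = [], X, W_bottom
    for _ in range(n_modules):
        H = sigmoid(W.T @ inp)
        U = fit_U(H, T)
        Y = U.T @ H
        modules.append((W, U))
        inp = np.vstack([inp, Y])                 # append this module's output to the next input
        W = init_upper_W(W, Y.shape[0], rng)      # W for the next module
    return modules

# Toy problem (illustrative): 6 features, 12 hidden units, 3 classes, 80 samples.
rng = np.random.default_rng(0)
d, h, c, n = 6, 12, 3, 80
X = rng.standard_normal((d, n))
T = np.eye(c)[:, rng.integers(0, c, size=n)]
W_bottom = rng.standard_normal((d, h))            # stand-in for an RBM-initialized W

modules = train_stack(X, T, W_bottom, n_modules=3, rng=rng)
```

Each module's W is taller than the one below it by exactly the number of output units, which is the sub-matrix the note above says is always random.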

These four options look quite different, but interestingly, with sufficient effort put into adjusting all other hyper-parameters, all four strategies eventually gave similar classification accuracy.
As the saying goes, tune the parameters well enough and even a poor structure will do. That said, although random initialization costs nothing in classification accuracy, it takes many more modules and fine-tuning iterations than the other strategies.

One point deserves particular attention: the bottom module must not be initialized randomly; it still needs an RBM.

Fine-tuning
Use batch-mode gradient descent. There is not much to say, so here is the formula directly:
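The formula image is not available here, so the following is a sketch of one way to do batch-mode fine-tuning of W, under my own assumptions. U is re-solved in closed form for the current W; because U is then optimal, its own derivative drops out and the gradient with respect to W only has to go through H. The small `ridge` term, the learning rate, and the function names are hypothetical choices added for numerical stability and illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(W, X, T, ridge=1e-2):
    """Squared error with U eliminated in closed form, and its gradient w.r.t. W."""
    H = sigmoid(W.T @ X)                                              # (h, n)
    U = np.linalg.solve(H @ H.T + ridge * np.eye(H.shape[0]), H @ T.T)  # (h, c)
    R = U.T @ H - T                                                   # residual, (c, n)
    loss = np.sum(R ** 2) + ridge * np.sum(U ** 2)
    # U is optimal for this W, so only the dependence through H contributes.
    G = H * (1.0 - H) * (2.0 * U @ R)                                 # d loss / d (W.T @ X)
    return loss, X @ G.T                                              # gradient, (d, h)

def fine_tune(W, X, T, lr=1e-3, n_iter=50):
    """Batch-mode gradient descent: every step uses the full training set."""
    for _ in range(n_iter):
        _, grad = loss_and_grad(W, X, T)
        W = W - lr * grad
    return W

# Toy problem (illustrative).
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 60))
T = np.eye(2)[:, rng.integers(0, 2, size=60)]
W0 = rng.standard_normal((4, 6))
W1 = fine_tune(W0, X, T)
```

Each step needs the whole batch at once, which is the property that makes this easier to spread across machines than per-sample stochastic updates.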

Regularization
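Regularization is not needed for image and language tasks, but it is needed for IR. The reason is that IR has few outputs (it is usually a binary classification problem), which weakens the effect that stacking passes from layer to layer.
The approach is to apply L2 regularization to U and to add a data reconstruction error term for W.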
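The L2 part on U has a simple closed form: it turns the least-squares solve into ridge regression. A minimal sketch follows, with `lam` as a hypothetical name for the regularization strength. I have not sketched the reconstruction error term for W, since these notes do not give its exact form.

```python
import numpy as np

def solve_upper_weights_l2(H, T, lam):
    """U minimising ||U.T @ H - T||^2 + lam * ||U||^2 (ridge regression)."""
    h = H.shape[0]
    return np.linalg.solve(H @ H.T + lam * np.eye(h), H @ T.T)

# Toy binary (IR-style) problem: 20 hidden units, 50 samples, a single +/-1 output.
rng = np.random.default_rng(0)
H = rng.random((20, 50))
T = np.where(rng.random((1, 50)) < 0.5, 1.0, -1.0)

U_weak = solve_upper_weights_l2(H, T, lam=0.1)
U_strong = solve_upper_weights_l2(H, T, lam=10.0)
```

A larger `lam` shrinks U, so the single output that gets passed up to the next layer is less able to fit noise in the training targets.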

Hyper-parameter
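The algorithm's hyperparameter is the number of hidden units.
It is chosen by experiment, splitting the data set into train data, test data, and development data. Roughly right is good enough; a hidden layer a few times larger than the input generally works.
The general picture is as follows:

That is, roughly 784 input features, 3000 hidden units, and 10 outputs.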

Over-fitting
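Finally, the question of how many layers to stack.
Too few layers give poor results; too many lead to over-fitting.
As a rule of thumb: if the features are well chosen, the parameters cleverly initialized, and the hyperparameters well tuned, fewer layers are needed; if those things are done poorly, more layers are needed. (Consider the extreme case: with sufficiently clever features, one layer is enough.)
The number of layers is also tuned by comparing train error and test error.
In general, train error keeps falling as layers are added, but beyond a certain depth test error starts to rise. At that point we consider over-fitting to have set in and fix the number of layers at the turning point. The figures below show two application scenarios; neither uses many layers in practice.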
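The depth selection described above can be sketched as a loop that grows the stack one layer at a time and records both errors. Everything here is an illustrative assumption: the data are synthetic, W is random with no RBM or fine-tuning, and `error_by_depth` is a hypothetical helper, so the curves it produces say nothing about real DSN results.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def error_by_depth(X_tr, T_tr, X_te, T_te, n_hidden, max_modules, rng):
    """Grow a stack layer by layer; return (train, test) mean squared error per depth."""
    in_tr, in_te = X_tr, X_te
    train_err, test_err = [], []
    for _ in range(max_modules):
        W = rng.standard_normal((in_tr.shape[0], n_hidden))   # stand-in initialization
        H_tr, H_te = sigmoid(W.T @ in_tr), sigmoid(W.T @ in_te)
        U, *_ = np.linalg.lstsq(H_tr.T, T_tr.T, rcond=None)   # fit U on training data only
        Y_tr, Y_te = U.T @ H_tr, U.T @ H_te
        train_err.append(np.mean((Y_tr - T_tr) ** 2))
        test_err.append(np.mean((Y_te - T_te) ** 2))
        # Append this layer's output to the input of the next layer.
        in_tr, in_te = np.vstack([in_tr, Y_tr]), np.vstack([in_te, Y_te])
    return np.array(train_err), np.array(test_err)

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic two-class data with 0-1 coded targets (illustrative only)."""
    X = rng.standard_normal((10, n))
    labels = (X[0] * X[1] > 0).astype(int)
    T = np.zeros((2, n))
    T[labels, np.arange(n)] = 1.0
    return X, T

X_tr, T_tr = make_data(300)
X_te, T_te = make_data(300)
train_err, test_err = error_by_depth(X_tr, T_tr, X_te, T_te,
                                     n_hidden=50, max_modules=5, rng=rng)

# Fix the depth where test error is lowest, i.e. just before it turns upward.
best_depth = int(np.argmin(test_err)) + 1
```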