Preface
Random forest is a very powerful model: a group of decision trees votes to produce the final result. To understand random forests thoroughly, you first need to understand decision trees, and then understand how a random forest improves on a single tree by ensembling many of them.
The purpose of this article is to gather in one place the materials I found useful while learning this model.
Decision Tree Basics

Key points of decision trees (a small sketch computing these three split criteria follows this list):

ID3: information gain
C4.5: information gain ratio
CART: Gini index
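To make the difference between these criteria concrete, here is a minimal sketch, not from the original article, that computes information gain (ID3), gain ratio (C4.5), and the Gini decrease (CART) for one hypothetical toy split; the helper names are my own:

import numpy as np

def entropy(labels):
    # Shannon entropy of a label vector
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini impurity of a label vector
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_scores(parent, left, right):
    n, nl, nr = len(parent), len(left), len(right)
    # ID3: information gain = parent entropy minus weighted child entropies
    info_gain = entropy(parent) - (nl / n) * entropy(left) - (nr / n) * entropy(right)
    # C4.5: gain ratio = information gain divided by the split information
    split_info = -((nl / n) * np.log2(nl / n) + (nr / n) * np.log2(nr / n))
    gain_ratio = info_gain / split_info
    # CART: decrease in Gini impurity produced by the split
    gini_decrease = gini(parent) - (nl / n) * gini(left) - (nr / n) * gini(right)
    return info_gain, gain_ratio, gini_decrease

# hypothetical toy split: 10 samples divided into two children
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]
print(split_scores(parent, left, right))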
From Programming Collective Intelligence:
Advantages:
Its biggest strength is that it is easy to interpret.
It accepts both categorical and numerical data and needs no preprocessing or normalization.
It tolerates uncertain outcomes: when a leaf node holds several possible result values and cannot be split any further, you can keep the counts and estimate a probability from them.
Disadvantages:
The algorithm works well for problems with only a few possible outcomes; on datasets with a large number of possible outcomes, the tree becomes extremely complex and its predictive performance can suffer badly.
Although it can handle simple numerical data, it can only create nodes with "greater than / less than" conditions. If the classification depends on a more complex combination of variables, classifying with a decision tree becomes difficult. For example, if the outcome is determined by the difference of two variables, the tree grows enormous and its prediction accuracy drops quickly.
In short, decision trees are best suited to datasets with clear split points, made up of a mix of categorical and numerical data.
The book claims that if the outcome is determined by the difference of two variables, the tree becomes enormous and its prediction accuracy drops quickly. We can test this with the following experiment:
library(rpart)
library(rpart.plot)
library(dplyr)   # needed for the %>% pipe

# two independent ages; the label depends only on their difference
age1 <- as.integer(runif(1000, min = 18, max = 30))
age2 <- as.integer(runif(1000, min = 18, max = 30))
df <- data.frame(age1, age2)
df <- df %>% dplyr::mutate(diff = age1 - age2, label = diff >= 0 & diff <= 5)

ct <- rpart.control(xval = 10, minsplit = 20, cp = 0.01)

# tree built on the raw variables age1 and age2
cfit <- rpart(label ~ age1 + age2, data = df, method = "class",
              control = ct, parms = list(split = "gini"))
print(cfit)
rpart.plot(cfit, branch = 1, branch.type = 2, type = 1, extra = 102,
           shadow.col = "gray", box.col = "green", border.col = "blue",
           split.col = "red", split.cex = 1.2, main = "Decision Tree")

# tree built directly on the engineered feature diff
cfit <- rpart(label ~ diff, data = df, method = "class",
              control = ct, parms = list(split = "gini"))
print(cfit)
rpart.plot(cfit, branch = 1, branch.type = 2, type = 1, extra = 102,
           shadow.col = "gray", box.col = "green", border.col = "blue",
           split.col = "red", split.cex = 1.2, main = "Decision Tree")
Predicting with age1 and age2 gives the decision tree shown in the screenshot below:
Predicting with diff gives the decision tree shown in the screenshot below:
Random Forest Theory

From the sklearn official documentation:
Each tree in the ensemble is built from a sample drawn with replacement (bootstrap sample) from the training set. When splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features.
As a result of this randomness, the bias of the forest usually slightly increases with respect to the bias of a single non-random tree, but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.
In contrast to the original publication, the sklearn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class.
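The last point, soft voting by averaging probabilities, is easy to verify on a fitted model. Below is a minimal sketch, not from the original article (the Iris dataset is just an illustrative choice), that compares the forest's predict_proba with the hand-computed average of the per-tree probabilities; the two should agree up to floating-point error:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# average the per-tree class probabilities by hand ...
per_tree = np.stack([tree.predict_proba(X[:5]) for tree in forest.estimators_])
manual_avg = per_tree.mean(axis=0)

# ... and compare with the forest's own probabilistic prediction
print(np.allclose(manual_avg, forest.predict_proba(X[:5])))  # expected: True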
Random Forest Implementation

from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X, Y)

Parameter Tuning
From the sklearn official site:
The core parameters are n_estimators and max_features:
n_estimators: the number of trees in the forest
max_features: the size of the random subsets of features to consider when splitting a node. Default values: max_features=n_features for regression problems, and max_features=sqrt(n_features) for classification tasks.
Other parameters: good results are often achieved when setting max_depth=None in combination with min_samples_split=2 (i.e., when fully developing the trees).
n_jobs=k: computations are partitioned into k jobs and run on k cores of the machine. If n_jobs=-1, all cores available on the machine are used. A small tuning sketch using these parameters follows below.
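Here is a minimal tuning sketch; the dataset and the parameter grid are illustrative choices of mine, not from the original article. It grid-searches over the core parameters discussed above and uses all cores via n_jobs=-1:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {
    "n_estimators": [50, 100, 200],    # number of trees in the forest
    "max_features": ["sqrt", None],    # size of the random feature subset per split
    "max_depth": [None, 5],            # None lets the trees grow fully
}
search = GridSearchCV(
    RandomForestClassifier(min_samples_split=2, random_state=0),
    param_grid,
    cv=5,
    n_jobs=-1,   # use all available cores
)
search.fit(X, y)
print(search.best_params_, search.best_score_)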
Feature Importance Evaluation

From the sklearn official documentation:
The depth of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features.
By averaging those expected activity rates over several randomized trees one can reduce the variance of such an estimate and use it for feature selection.
In practice those estimates are stored as an attribute named feature_importances_ on the fitted model. This is an array with shape (n_features,) whose values are positive and sum to 1.0. The higher the value, the more important is the contribution of the matching feature to the prediction function.
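As a quick illustration of the attribute described above, here is a minimal sketch (the Iris dataset is again an illustrative choice, not from the original article) that prints the importances in descending order and confirms they sum to 1.0:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(data.data, data.target)

# feature_importances_ has shape (n_features,) on the fitted model
for name, score in sorted(zip(data.feature_names, forest.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {score:.3f}")

print(forest.feature_importances_.sum())  # the importances sum to 1.0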
StackOverflow
You initialize an array feature_importances of all zeros with size n_features.
You traverse the tree: for each internal node that splits on feature i you compute the error reduction of that node multiplied by the number of samples that were routed to the node and add this quantity to feature_importances[i].
The error reduction depends on the impurity criterion that you use (e.g. Gini, Entropy). It's the impurity of the set of observations that gets routed to the internal node minus the sum of the impurities of the two partitions created by the split. The sketch below reproduces this computation on a fitted sklearn tree.
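To make the procedure concrete, here is a minimal sketch of my own (not from the StackOverflow answer) that walks a fitted sklearn decision tree, accumulates the sample-weighted impurity decreases per feature, and checks the normalized result against the tree's built-in feature_importances_:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
t = clf.tree_

importances = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:   # leaf node: no split, nothing to add
        continue
    # weighted impurity decrease of this split:
    #   samples(node)*impurity(node) - samples(left)*impurity(left) - samples(right)*impurity(right)
    decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
    importances[t.feature[node]] += decrease

importances /= importances.sum()   # normalize so the importances sum to 1
print(np.allclose(importances, clf.feature_importances_))  # expected: True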
About the author: 丹追兵 is a data analyst who works in Python and R and uses Spark, Hadoop, Storm and ODPS. This article comes from 丹追兵's pytrafficR column; please credit the author and the source when reprinting: https://segmentfault.com/blog...