Preface
Random forest is a very powerful model: a group of decision trees votes to produce the final result. To understand random forests thoroughly, you first need to understand decision trees, and then understand how a random forest improves on a single tree by ensembling many of them.
The purpose of this article is to gather in one place the materials I found useful while learning this model.
Decision Tree Basics

Key points of decision trees (a small sketch computing these three split criteria follows this list):

ID3: information gain
C4.5: information gain ratio
CART: Gini index
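To make the difference between these criteria concrete, here is a minimal sketch, not from the original article, that computes information gain (ID3), gain ratio (C4.5), and the Gini decrease (CART) for one hypothetical toy split; the helper names are my own:

import numpy as np

def entropy(labels):
    # Shannon entropy of a label vector
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini impurity of a label vector
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_scores(parent, left, right):
    n, nl, nr = len(parent), len(left), len(right)
    # ID3: information gain = parent entropy minus weighted child entropies
    info_gain = entropy(parent) - (nl / n) * entropy(left) - (nr / n) * entropy(right)
    # C4.5: gain ratio = information gain divided by the split information
    split_info = -((nl / n) * np.log2(nl / n) + (nr / n) * np.log2(nr / n))
    gain_ratio = info_gain / split_info
    # CART: decrease in Gini impurity produced by the split
    gini_decrease = gini(parent) - (nl / n) * gini(left) - (nr / n) * gini(right)
    return info_gain, gain_ratio, gini_decrease

# hypothetical toy split: 10 samples divided into two children
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]
print(split_scores(parent, left, right))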
From Programming Collective Intelligence:
Advantages:
Its biggest strength is that it is easy to interpret.
It accepts both categorical and numerical data and needs no preprocessing or normalization.
It tolerates uncertain outcomes: when a leaf node holds several possible result values and cannot be split any further, you can keep the counts and estimate a probability from them.
Disadvantages:
The algorithm works well for problems with only a few possible outcomes; on datasets with a large number of possible outcomes, the tree becomes extremely complex and its predictive performance can suffer badly.
Although it can handle simple numerical data, it can only create nodes with "greater than / less than" conditions. If the classification depends on a more complex combination of variables, classifying with a decision tree becomes difficult. For example, if the outcome is determined by the difference of two variables, the tree grows enormous and its prediction accuracy drops quickly.
In short, decision trees are best suited to datasets with clear split points, made up of a mix of categorical and numerical data.
The book claims that if the outcome is determined by the difference of two variables, the tree becomes enormous and its prediction accuracy drops quickly. We can test this with the following experiment:
library(rpart)
library(rpart.plot)
library(dplyr)   # needed for the %>% pipe

# two independent ages; the label depends only on their difference
age1 <- as.integer(runif(1000, min = 18, max = 30))
age2 <- as.integer(runif(1000, min = 18, max = 30))
df <- data.frame(age1, age2)
df <- df %>% dplyr::mutate(diff = age1 - age2, label = diff >= 0 & diff <= 5)

ct <- rpart.control(xval = 10, minsplit = 20, cp = 0.01)

# tree built on the raw variables age1 and age2
cfit <- rpart(label ~ age1 + age2, data = df, method = "class",
              control = ct, parms = list(split = "gini"))
print(cfit)
rpart.plot(cfit, branch = 1, branch.type = 2, type = 1, extra = 102,
           shadow.col = "gray", box.col = "green", border.col = "blue",
           split.col = "red", split.cex = 1.2, main = "Decision Tree")

# tree built directly on the engineered feature diff
cfit <- rpart(label ~ diff, data = df, method = "class",
              control = ct, parms = list(split = "gini"))
print(cfit)
rpart.plot(cfit, branch = 1, branch.type = 2, type = 1, extra = 102,
           shadow.col = "gray", box.col = "green", border.col = "blue",
           split.col = "red", split.cex = 1.2, main = "Decision Tree")
Predicting with age1 and age2 gives the decision tree shown in the screenshot below:
Predicting with diff gives the decision tree shown in the screenshot below:
Random Forest Theory

From the sklearn official documentation:
Each tree in the ensemble is built from a sample drawn with replacement (bootstrap sample) from the training set. When splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features.
As a result of this randomness, the bias of the forest usually slightly increases with respect to the bias of a single non-random tree, but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.
In contrast to the original publication, the sklearn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class.
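The last point, soft voting by averaging probabilities, is easy to verify on a fitted model. Below is a minimal sketch, not from the original article (the Iris dataset is just an illustrative choice), that compares the forest's predict_proba with the hand-computed average of the per-tree probabilities; the two should agree up to floating-point error:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# average the per-tree class probabilities by hand ...
per_tree = np.stack([tree.predict_proba(X[:5]) for tree in forest.estimators_])
manual_avg = per_tree.mean(axis=0)

# ... and compare with the forest's own probabilistic prediction
print(np.allclose(manual_avg, forest.predict_proba(X[:5])))  # expected: True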
Random Forest Implementation

from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X, Y)

Parameter Tuning
From the sklearn official site:
The core parameters are n_estimators and max_features:
n_estimators: the number of trees in the forest
max_features: the size of the random subsets of features to consider when splitting a node. Default values: max_features=n_features for regression problems, and max_features=sqrt(n_features) for classification tasks.
Other parameters: good results are often achieved when setting max_depth=None in combination with min_samples_split=2 (i.e., when fully developing the trees).
n_jobs=k: computations are partitioned into k jobs and run on k cores of the machine. If n_jobs=-1, all cores available on the machine are used. A small tuning sketch using these parameters follows below.
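Here is a minimal tuning sketch; the dataset and the parameter grid are illustrative choices of mine, not from the original article. It grid-searches over the core parameters discussed above and uses all cores via n_jobs=-1:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {
    "n_estimators": [50, 100, 200],    # number of trees in the forest
    "max_features": ["sqrt", None],    # size of the random feature subset per split
    "max_depth": [None, 5],            # None lets the trees grow fully
}
search = GridSearchCV(
    RandomForestClassifier(min_samples_split=2, random_state=0),
    param_grid,
    cv=5,
    n_jobs=-1,   # use all available cores
)
search.fit(X, y)
print(search.best_params_, search.best_score_)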
Feature Importance Evaluation

From the sklearn official documentation:
The depth of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features.
By averaging those expected activity rates over several randomized trees one can reduce the variance of such an estimate and use it for feature selection.
In practice those estimates are stored as an attribute named feature_importances_ on the fitted model. This is an array with shape (n_features,) whose values are positive and sum to 1.0. The higher the value, the more important is the contribution of the matching feature to the prediction function.
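As a quick illustration of the attribute described above, here is a minimal sketch (the Iris dataset is again an illustrative choice, not from the original article) that prints the importances in descending order and confirms they sum to 1.0:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(data.data, data.target)

# feature_importances_ has shape (n_features,) on the fitted model
for name, score in sorted(zip(data.feature_names, forest.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {score:.3f}")

print(forest.feature_importances_.sum())  # the importances sum to 1.0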
StackOverflow
You initialize an array feature_importances of all zeros with size n_features.
You traverse the tree: for each internal node that splits on feature i you compute the error reduction of that node multiplied by the number of samples that were routed to the node and add this quantity to feature_importances[i].
The error reduction depends on the impurity criterion that you use (e.g. Gini, Entropy). It's the impurity of the set of observations that gets routed to the internal node minus the sum of the impurities of the two partitions created by the split. The sketch below reproduces this computation on a fitted sklearn tree.
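To make the procedure concrete, here is a minimal sketch of my own (not from the StackOverflow answer) that walks a fitted sklearn decision tree, accumulates the sample-weighted impurity decreases per feature, and checks the normalized result against the tree's built-in feature_importances_:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
t = clf.tree_

importances = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:   # leaf node: no split, nothing to add
        continue
    # weighted impurity decrease of this split:
    #   samples(node)*impurity(node) - samples(left)*impurity(left) - samples(right)*impurity(right)
    decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
    importances[t.feature[node]] += decrease

importances /= importances.sum()   # normalize so the importances sum to 1
print(np.allclose(importances, clf.feature_importances_))  # expected: True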
About the author: 丹追兵 is a data analyst who works in Python and R and uses Spark, Hadoop, Storm and ODPS. This article comes from 丹追兵's pytrafficR column; please credit the author and the source when reprinting: https://segmentfault.com/blog...