Kaggle Beginner Competition: House Price Prediction (Data Analysis Part)

sarva, published 2019-07-31 11:18 / 1357 views

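The project shared here comes from a classic Kaggle competition: house price prediction. It is presented in two parts, data analysis and data mining. This is the data analysis part.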

Understanding the Competition

Competition overview

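Many factors influence house prices. The data set for this competition has 79 variables that describe almost every aspect of residential homes in Ames, Iowa, and the task is to predict the final sale price of each home.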

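Tech stack

Feature engineering (creative feature engineering)

Regression models (advanced regression techniques like random forest and gradient boosting)

Final goal

Predict the sale price of each house: for every Id in the test set, provide the corresponding value of the variable SalePrice.

Submission format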

Id,SalePrice
1461,169000.1
1462,187724.1233
1463,175221
etc.
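As a minimal sketch of producing a file in this format, assuming pandas is available: the Ids and prices below are simply copied from the sample above rather than predicted by any model, and `submission.csv` is a hypothetical file name.

```python
import io

import pandas as pd

# Placeholder Ids and prices copied from the sample format, not real predictions.
test_ids = [1461, 1462, 1463]
predicted_prices = [169000.1, 187724.1233, 175221.0]

submission = pd.DataFrame({"Id": test_ids, "SalePrice": predicted_prices})

# Write to an in-memory buffer here; pass a path such as "submission.csv"
# (a hypothetical file name) to write a real file instead.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` keeps pandas from writing its own row index as an extra first column, so the header stays exactly `Id,SalePrice`.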

Data Analysis

Data description

First, we load the data and take a look:

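import numpy as np
import pandas as pd

# Use the Id column as the index so it is not treated as a feature.
train_df = pd.read_csv("./input/train.csv", index_col=0)
test_df = pd.read_csv("./input/test.csv", index_col=0)

train_df.head()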

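We can see that there are 80 columns, that is, 79 features plus the target SalePrice.

Next we merge the training set and the test set. This makes preprocessing more convenient, because the features of both sets are transformed into the same format; once preprocessing is finished, we split them apart again.

SalePrice is our training target, so it appears only in the training set and not in the test set. We therefore need to take this column out before merging. Before doing so, let's look at its distribution.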

# Compare the raw prices with their log1p-transformed values side by side.
prices = pd.DataFrame({"price": train_df["SalePrice"], "log(price+1)": np.log1p(train_df["SalePrice"])})
prices.hist()

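Because the label itself is not smooth (its distribution is right-skewed), we first smooth it, bringing it closer to a normal distribution, so that the model can learn more accurately. Here I use log1p, i.e. log(x+1). Note that because this step transforms the label, the predictions must be transformed back when computing the final results. The inverse of log1p() is expm1(), which is covered in detail when it is used later.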
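The effect of the transform and its inverse can be sketched on a small example. The prices below are made-up values chosen to be right-skewed, not taken from the competition data, and `Series.skew()` is used here only as one convenient way to quantify skewness.

```python
import numpy as np
import pandas as pd

# Toy right-skewed prices (made-up values, not from the competition data).
toy_prices = pd.Series([50000, 80000, 120000, 150000, 200000, 750000], dtype=float)

log_prices = np.log1p(toy_prices)   # log(1 + x)
restored = np.expm1(log_prices)     # exp(y) - 1, the inverse of log1p

skew_before = toy_prices.skew()
skew_after = log_prices.skew()
print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
print("round trip recovers the prices:", np.allclose(restored, toy_prices))
```

The same `np.expm1` call is what would be applied to a model's predictions to bring them back to the original price scale.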

Then we take this column out:

y_train = np.log1p(train_df.pop("SalePrice"))

y_train.head()

This gives:

Id
1    12.247699
2    12.109016
3    12.317171
4    11.849405
5    12.429220
Name: SalePrice, dtype: float64

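At this point y_train holds the SalePrice column (log-transformed with log1p), and train_df no longer contains it.

Then we merge the two data sets: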

df = pd.concat((train_df, test_df), axis=0)

Check the shape:

df.shape

(2919, 79)

df is the merged DataFrame.
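The merge-then-split idea can be sketched on toy data as follows. The rows are made up, the `toy_*` names are hypothetical, and `pd.get_dummies` is only an assumed example of a shared preprocessing step, not necessarily the one used in this project.

```python
import pandas as pd

# Toy stand-ins for train_df and test_df (made-up rows), indexed by Id.
toy_train = pd.DataFrame(
    {"LotArea": [8450, 9600], "Street": ["Pave", "Pave"]},
    index=pd.Index([1, 2], name="Id"),
)
toy_test = pd.DataFrame(
    {"LotArea": [11622, 14267, 13830], "Street": ["Pave", "Grvl", "Pave"]},
    index=pd.Index([1461, 1462, 1463], name="Id"),
)

toy_all = pd.concat((toy_train, toy_test), axis=0)

# A shared transformation (one-hot encoding here) sees every category from
# both sets, so both halves end up with identical columns.
toy_all = pd.get_dummies(toy_all)

# Split back by the original index labels; valid because the Ids do not overlap.
train_part = toy_all.loc[toy_train.index]
test_part = toy_all.loc[toy_test.index]
print(train_part.shape, test_part.shape)
```

Encoding the two sets separately would give the toy training set no `Street_Grvl` column at all, which is the kind of mismatch the merge avoids.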

Data Preprocessing

According to the description provided by Kaggle, the data has the following features:

SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale

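Next we analyze the features. The list above contains one target variable, SalePrice, and 79 features, which is quite a lot. This feature analysis prepares for the feature engineering that follows.

Let's check which features have missing values: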

print(pd.isnull(df).sum())

This output is hard to read, so we first look at the 10 features with the most missing values:

df.isnull().sum().sort_values(ascending=False).head(10)

To make this clearer, we express the missing values as a percentage of rows:

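# Percentage of missing rows per feature.
df_na = (df.isnull().sum() / len(df)) * 100
# Drop features with no missing values and sort from most to least missing.
df_na = df_na.drop(df_na[df_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({"Missing rate (%)": df_na})
missing_data.head(10)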

Visualize it:

import matplotlib.pyplot as plt
import seaborn as sns

f, ax = plt.subplots(figsize=(15,12))
plt.xticks(rotation=90)  # rotate the feature names so they do not overlap
sns.barplot(x=df_na.index, y=df_na)
plt.xlabel("Features", fontsize=15)
plt.ylabel("Percent of missing values", fontsize=15)
plt.title("Percent missing data by feature", fontsize=15)

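We can see that features such as PoolQC, MiscFeature, Alley, Fence and FireplaceQu have a large share of missing values, LotFrontage has a missing rate of 16.7%, and GarageType, GarageFinish, GarageQual and GarageCond have similar missing rates. Some of these features are categorical and some are numerical. How to handle their missing values is covered in the feature engineering section.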
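Since the treatment depends on whether a feature is categorical or numerical, it can help to separate the affected columns by dtype first. The following is a sketch on a toy frame: the column names are borrowed from the data set, but the values are made up and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names come from the data set.
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "GarageType": ["Attchd", "Detchd", np.nan, "Attchd"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Share of missing rows per column, keeping only columns that have any.
missing_pct = toy.isnull().mean() * 100
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)

# Separate the affected columns by dtype.
numeric_cols = toy.select_dtypes(include="number").columns
numeric_missing = [c for c in missing_pct.index if c in numeric_cols]
categorical_missing = [c for c in missing_pct.index if c not in numeric_cols]

print(missing_pct)
print("numerical:", numeric_missing)
print("categorical:", categorical_missing)
```

On the real merged DataFrame the same steps would apply to `df` in place of `toy`.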

Finally, we analyze the correlations between the features and view them as a heatmap:

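# SalePrice was popped from train_df above, so this heatmap covers features only.
# Restrict to numeric columns; recent pandas versions raise on non-numeric ones.
corrmat = train_df.select_dtypes(include="number").corr()
plt.subplots(figsize=(15,12))
sns.heatmap(corrmat, vmax=0.9, square=True)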

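We can see that some features are strongly correlated with one another. Such redundant features can easily lead to overfitting, so some of them need to be removed. In the next part, on data mining, we process these features and train the model.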
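One way to turn the heatmap into a concrete list is to scan the upper triangle of the correlation matrix for pairs above a cut-off. This is only a sketch: the data is synthetic (GarageArea is deliberately generated from GarageCars), and the 0.8 threshold is an assumed value, not one chosen in this project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cars = rng.integers(0, 4, size=200)

# Synthetic data: GarageArea is built to track GarageCars closely, while the
# other columns are independent noise. None of these values are real.
toy = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 300, size=200),
    "GarageCars": cars,
    "GarageArea": cars * 250 + rng.normal(0, 20, size=200),
    "YrSold": rng.integers(2006, 2011, size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is skipped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.8  # hypothetical cut-off; tune for the task
pairs = upper.stack()
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

Which feature of each flagged pair to keep is a separate modelling decision that this sketch does not make.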

If you spot any shortcomings, corrections are welcome.

