Tianchi Study Notes: O2O Coupon Usage Prediction Competition[1]

Posted by soasme on 2019-07-30 16:52 / 2514 reads

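Summary: Randomly distributed coupons, however, are a meaningless disturbance to most users. Below we analyze how each field in the training set affects coupon usage. Discount_rate uses two discount formats: x in [0,1] is a discount rate, and x:y means "spend x, save y". The "spend x, save y" format also needs to be converted to a discount rate with the formula 1-y/x. Finally we predict and compute the average AUC to get the result.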

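Problem Description

Background: Using coupons to re-engage existing customers or attract new ones into stores is an important marketing method in O2O (Online to Offline). However, randomly distributed coupons are a meaningless disturbance to most users. For merchants, indiscriminately issued coupons can damage brand reputation and make marketing costs hard to estimate. Personalized distribution is an important technique for raising the coupon redemption rate: it gives consumers with particular preferences real savings and gives merchants stronger marketing capability.

Goal: Using the rich data provided for the O2O scenario, build a model through analysis that accurately predicts whether a user will use a given coupon within the specified time.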

Data Analysis

Load the data:
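The loading code is not shown in this post. Since every later snippet compares fields against the literal string "null", one plausible way to load the files is sketched below. The helper name load_o2o_csv and the file name in the comment are assumptions for illustration, and the sample rows are made up. Passing keep_default_na=False stops pandas from turning "null" into NaN.

```python
import io

import pandas as pd


def load_o2o_csv(source):
    """Read an O2O CSV while keeping the literal string "null" intact.

    By default pandas treats "null" as a missing value; keep_default_na=False
    disables that, so the string comparisons used in this post keep working.
    """
    return pd.read_csv(source, keep_default_na=False)


# Tiny in-memory sample standing in for the real file (values are made up).
sample = io.StringIO(
    "User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date\n"
    "1,10,null,null,0,null,20160217\n"
    "2,11,100,150:20,1,20160528,null\n"
)
dfoff_sample = load_o2o_csv(sample)
print(dfoff_sample.dtypes)

# With the real data this would look like (file name is an assumption):
# dfoff = load_o2o_csv("ccf_offline_stage1_train.csv")
```

Columns that contain "null" come back as strings, which is why the later code compares dates as strings and calls astype(str) where a column might have been parsed as integers.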

The offline training set contains the following 7 fields:
User_id
Merchant_id
Coupon_id
Discount_rate
Distance
Date_received
Date

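When Coupon_id is null, the purchase was made without a coupon, and the Discount_rate and Date_received fields are meaningless.

For the exact meaning of each field, see the competition link.

Depending on whether Coupon_id and Date are null, the records fall into four types: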

print("Coupon, purchase made:", dfoff[(dfoff["Coupon_id"] != "null") & (dfoff["Date"] != "null")].shape[0])
print("No coupon, purchase made:", dfoff[(dfoff["Coupon_id"] == "null") & (dfoff["Date"] != "null")].shape[0])
print("Coupon, no purchase:", dfoff[(dfoff["Coupon_id"] != "null") & (dfoff["Date"] == "null")].shape[0])
print("No coupon, no purchase:", dfoff[(dfoff["Coupon_id"] == "null") & (dfoff["Date"] == "null")].shape[0])

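The output shows that 75382 records used a coupon to make a purchase; these are the positive samples. 977900 records received a coupon but never used it; these coupons were wasted and form the negative samples. 701602 records are ordinary purchases without a coupon.

Next we analyze how each of the 7 fields in the training set affects coupon usage.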

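1. Coupons and distance

print("Discount_rate values:", dfoff["Discount_rate"].unique())
print("Distance values:", dfoff["Distance"].unique())

The output shows str values, which need to be converted to numeric types.

Discount_rate uses two discount formats: a value x in [0,1] is a discount rate, while x:y means "spend x, save y". The "spend x, save y" format also needs to be converted to a discount rate with the formula 1-y/x. We also build the coupon-related features discount_rate, discount_man, discount_jian, and discount_type. The code is as follows: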

# Convert Discount_rate and Distance into numeric features

def getDiscountType(row):
    """Return 1 for "spend x, save y", 0 for a direct rate, "null" for no coupon."""
    if row == "null":
        return "null"
    elif ":" in row:
        return 1
    else:
        return 0

def convertRate(row):
    """Convert a discount string to a discount rate."""
    if row == "null":
        # no coupon: treat as no discount
        return 1.0
    elif ":" in row:
        # "x:y" means spend x, save y, i.e. a rate of 1 - y/x
        rows = row.split(":")
        return 1.0 - float(rows[1])/float(rows[0])
    else:
        return float(row)

def getDiscountMan(row):
    """Return the spending threshold x of "x:y", or 0 for other formats."""
    if ":" in row:
        rows = row.split(":")
        return int(rows[0])
    else:
        return 0

def getDiscountJian(row):
    """Return the reduction y of "x:y", or 0 for other formats."""
    if ":" in row:
        rows = row.split(":")
        return int(rows[1])
    else:
        return 0

def processData(df):
    
    # convert discount_rate
    df["discount_rate"] = df["Discount_rate"].apply(convertRate)
    df["discount_man"] = df["Discount_rate"].apply(getDiscountMan)
    df["discount_jian"] = df["Discount_rate"].apply(getDiscountJian)
    df["discount_type"] = df["Discount_rate"].apply(getDiscountType)
    print(df["discount_rate"].unique())
    
    # convert distance; a missing distance is encoded as -1
    df["distance"] = df["Distance"].replace("null", -1).astype(int)
    print(df["distance"].unique())
    return df

dfoff = processData(dfoff)
dftest = processData(dftest)
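As a quick check of the 1-y/x conversion, the sketch below runs the same rule on three hand-picked inputs. The function name discount_to_rate is hypothetical and simply mirrors the logic of convertRate; the input strings are illustrative, not taken from the data.

```python
def discount_to_rate(value):
    """Map a Discount_rate string to a discount rate in [0, 1]."""
    if value == "null":
        # No coupon: treated as "no discount".
        return 1.0
    if ":" in value:
        # "x:y" means spend x, save y, so the effective rate is 1 - y/x.
        threshold, reduction = value.split(":")
        return 1.0 - float(reduction) / float(threshold)
    # Already a plain rate such as "0.95".
    return float(value)


for raw in ["150:20", "0.95", "null"]:
    print(raw, "->", round(discount_to_rate(raw), 4))
```

For example, "150:20" becomes 1 - 20/150, roughly 0.8667, which puts "spend x, save y" coupons on the same scale as coupons that are already expressed as a rate.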

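2. Time
Process the coupon receipt date date_received and the purchase date date_buy, then print the range each one covers:

# Distinct, sorted coupon receipt dates (ignoring "null")
date_received = dfoff["Date_received"].unique()
date_received = sorted(date_received[date_received != "null"])

# Distinct, sorted purchase dates (ignoring "null")
date_buy = dfoff["Date"].unique()
date_buy = sorted(date_buy[date_buy != "null"])

print("Coupons received from", date_received[0], "to", date_received[-1])
print("Purchases made from", date_buy[0], "to", date_buy[-1])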

Count the coupons customers received each day:

couponbydate = dfoff[dfoff["Date_received"] != "null"][["Date_received", "Date"]].groupby(["Date_received"], as_index=False).count()
couponbydate.columns = ["Date_received","count"]
couponbydate.head()

Count how many of those coupons were used for a purchase:

buybydate = dfoff[(dfoff["Date"] != "null") & (dfoff["Date_received"] != "null")][["Date_received", "Date"]].groupby(["Date_received"], as_index=False).count()
buybydate.columns = ["Date_received","count"]
buybydate.head()

Visualize the data above:

import matplotlib.pyplot as plt
import pandas as pd

plt.figure(figsize = (12,8))
date_received_dt = pd.to_datetime(date_received, format="%Y%m%d")

# Top panel: coupons received vs. coupons used per day (log scale)
plt.subplot(211)
plt.bar(date_received_dt, couponbydate["count"], label = "number of coupon received" )
plt.bar(date_received_dt, buybydate["count"], label = "number of coupon used")
plt.yscale("log")
plt.ylabel("Count")
plt.legend()

# Bottom panel: share of received coupons that were used
plt.subplot(212)
plt.bar(date_received_dt, buybydate["count"]/couponbydate["count"])
plt.ylabel("Ratio(coupon used/coupon received)")
plt.tight_layout()
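The ratio in the second panel divides two separately grouped tables, which only lines up when every receipt date has at least one used coupon. A possible alternative, sketched here with a hypothetical helper daily_coupon_usage and made-up sample rows, computes both counts in a single groupby so the rows are aligned by construction.

```python
import pandas as pd


def daily_coupon_usage(df):
    """Per receipt date: coupons received, coupons used, and the usage ratio."""
    # Keep only rows where a coupon was actually received.
    with_coupon = df[df["Date_received"] != "null"]
    # A coupon counts as used when the row also has a purchase date.
    used = with_coupon["Date"] != "null"
    daily = (
        with_coupon.assign(used=used)
        .groupby("Date_received")["used"]
        .agg(received="size", used="sum")
    )
    daily["ratio"] = daily["used"] / daily["received"]
    return daily


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Date_received": ["20160101", "20160101", "20160102", "null"],
    "Date": ["20160105", "null", "null", "20160103"],
})
print(daily_coupon_usage(sample))
```

Dates on which no coupon was used simply get a ratio of 0 instead of silently shifting the bars.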

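Feature Extraction

The plots above show counts for individual days. People generally go out shopping more on Sundays, which also makes coupon use more likely, so we now build new features based on the day of the week.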

from datetime import date

import numpy as np
import pandas as pd

def getWeekday(row):
    """Return the weekday (1 = Monday ... 7 = Sunday) of a YYYYMMDD string."""
    if row == "null":
        return row
    else:
        return date(int(row[0:4]), int(row[4:6]), int(row[6:8])).weekday() + 1

dfoff["weekday"] = dfoff["Date_received"].astype(str).apply(getWeekday)
dftest["weekday"] = dftest["Date_received"].astype(str).apply(getWeekday)

# weekday_type: 1 for Saturday and Sunday, 0 for working days
dfoff["weekday_type"] = dfoff["weekday"].apply(lambda x : 1 if x in [6,7] else 0 )
dftest["weekday_type"] = dftest["weekday"].apply(lambda x : 1 if x in [6,7] else 0 )

# change weekday to one-hot encoding 
weekdaycols = ["weekday_" + str(i) for i in range(1,8)]
print(weekdaycols)

# rows with weekday "null" become NaN and get all-zero indicator columns
tmpdf = pd.get_dummies(dfoff["weekday"].replace("null", np.nan))
tmpdf.columns = weekdaycols
dfoff[weekdaycols] = tmpdf

tmpdf = pd.get_dummies(dftest["weekday"].replace("null", np.nan))
tmpdf.columns = weekdaycols
dftest[weekdaycols] = tmpdf

The resulting tmpdf has the following form:
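The original output is not reproduced in this post. To illustrate what a one-hot weekday table looks like, here is a minimal sketch on made-up values. It builds the indicator columns by direct comparison rather than with pd.get_dummies, which is a swapped-in technique: it always yields all seven columns, even when some weekday never occurs in the data. The helper name weekday_one_hot is hypothetical.

```python
import pandas as pd


def weekday_one_hot(weekday):
    """One indicator column per weekday 1..7; "null" rows are all zeros."""
    columns = {
        f"weekday_{day}": (weekday == day).astype(int)
        for day in range(1, 8)
    }
    return pd.DataFrame(columns, index=weekday.index)


# Made-up weekday values, including a "null" (no coupon received).
sample_weekday = pd.Series([1, 7, "null", 3])
print(weekday_one_hot(sample_weekday))
```

Each row has a single 1 in the column of its weekday, and the "null" row contains only zeros.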

Label each record based on Date_received and Date, converting the outcome to a numeric value:

def label(row):
    if row["Date_received"] == "null":
        return -1
    if row["Date"] != "null":
        td = pd.to_datetime(row["Date"], format="%Y%m%d") -  pd.to_datetime(row["Date_received"], format="%Y%m%d")
        if td <= pd.Timedelta(15, "D"):
            return 1
    return 0
dfoff["label"] = dfoff.apply(label, axis = 1)

If Date_received == "null", then y = -1; if Date != "null" and Date - Date_received <= 15 days, then y = 1; otherwise y = 0.

The converted values are now stored in the label column as 0, 1, or -1.
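Applying a Python function row by row can be slow on a table of this size. The same labeling rule can be expressed with column-wise operations, as in the sketch below. The function name label_vectorized is hypothetical, the sample rows are made up, and the code assumes dates are stored as YYYYMMDD strings with "null" for missing values.

```python
import numpy as np
import pandas as pd


def label_vectorized(df):
    """Return -1 (no coupon), 1 (used within 15 days), or 0 (otherwise)."""
    # "null" cannot be parsed as a date, so errors="coerce" turns it into NaT.
    received = pd.to_datetime(df["Date_received"], format="%Y%m%d", errors="coerce")
    bought = pd.to_datetime(df["Date"], format="%Y%m%d", errors="coerce")
    # Comparisons involving NaT evaluate to False, so unused coupons get 0.
    used_in_time = (bought - received) <= pd.Timedelta(days=15)
    labels = np.where(received.isna(), -1, np.where(used_in_time, 1, 0))
    return pd.Series(labels, index=df.index)


# Made-up rows covering each case of the rule.
sample = pd.DataFrame({
    "Date_received": ["20160101", "20160101", "20160101", "null"],
    "Date": ["20160110", "20160201", "null", "20160105"],
})
print(label_vectorized(sample).tolist())
```

The four sample rows cover a coupon used within 15 days, a coupon used too late, an unused coupon, and a purchase without a coupon.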

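Model Training

Before applying a model, first split the data. Here we use records from 20160101 to 20160515 as the training set (train) and records from 20160516 to 20160615 as the validation set (valid).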

df = dfoff[dfoff["label"] != -1].copy()
train = df[(df["Date_received"] < "20160516")].copy()
valid = df[(df["Date_received"] >= "20160516") & (df["Date_received"] <= "20160615")].copy()
print(train["label"].value_counts())
print(valid["label"].value_counts())

Use the linear model SGDClassifier for prediction.

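import os
import pickle

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The post does not show how original_feature is defined. This list is an
# assumption that collects the features built in the sections above.
original_feature = ["discount_rate", "discount_type", "discount_man", "discount_jian",
                    "distance", "weekday", "weekday_type"] + weekdaycols

predictors = original_feature
print(predictors)

def check_model(data, predictors):
    
    # Logistic regression trained with SGD and an elastic-net penalty.
    # scikit-learn >= 1.1 names this loss "log_loss"; older releases use "log".
    classifier = lambda: SGDClassifier(
        loss="log_loss", 
        penalty="elasticnet", 
        fit_intercept=True, 
        max_iter=100, 
        shuffle=True, 
        n_jobs=1,
        class_weight=None)

    # Standardize the features, then fit the classifier
    model = Pipeline(steps=[
        ("ss", StandardScaler()),
        ("en", classifier())
    ])

    # Hyperparameter grid: regularization strength and L1/L2 mix
    parameters = {
        "en__alpha": [ 0.001, 0.01, 0.1],
        "en__l1_ratio": [ 0.001, 0.01, 0.1]
    }

    folder = StratifiedKFold(n_splits=3, shuffle=True)
    
    grid_search = GridSearchCV(
        model, 
        parameters, 
        cv=folder, 
        n_jobs=-1, 
        verbose=1)
    grid_search = grid_search.fit(data[predictors], 
                                  data["label"])
    
    return grid_search

# Train once and cache the fitted grid search on disk
if not os.path.isfile("1_model.pkl"):
    model = check_model(train, predictors)
    print(model.best_score_)
    print(model.best_params_)
    with open("1_model.pkl", "wb") as f:
        pickle.dump(model, f)
else:
    with open("1_model.pkl", "rb") as f:
        model = pickle.load(f)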

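Next, compute the AUC of the predictions for each coupon, then average over all coupons. If a coupon's labels contain only one class, skip it, because the AUC cannot be computed.

Make predictions: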

y_valid_pred = model.predict_proba(valid[predictors])
valid1 = valid.copy()
valid1["pred_prob"] = y_valid_pred[:, 1]

Compute the average AUC:

import numpy as np
from sklearn.metrics import auc, roc_curve

vg = valid1.groupby(["Coupon_id"])
aucs = []
for _, tmpdf in vg:
    # AUC is undefined when a coupon has only one class, so skip it
    if len(tmpdf["label"].unique()) != 2:
        continue
    fpr, tpr, thresholds = roc_curve(tmpdf["label"], tmpdf["pred_prob"], pos_label=1)
    aucs.append(auc(fpr, tpr))
print(np.average(aucs))

The result is 0.5348655160896371.
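To make the per-coupon averaging concrete, the sketch below computes it on a tiny made-up table. Instead of roc_curve it uses the pairwise definition of AUC (the share of positive/negative pairs in which the positive sample gets the higher score, with ties counted as one half), which gives the same value as the area under the ROC curve. The function names pairwise_auc and mean_auc_per_coupon are hypothetical.

```python
import pandas as pd


def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        # Only one class present: AUC is undefined.
        return None
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_auc_per_coupon(df):
    """Average the AUC over coupons, skipping coupons with a single class."""
    aucs = []
    for _, group in df.groupby("Coupon_id"):
        value = pairwise_auc(group["label"].tolist(), group["pred_prob"].tolist())
        if value is not None:
            aucs.append(value)
    return sum(aucs) / len(aucs) if aucs else float("nan")


# Made-up predictions for three coupons.
toy = pd.DataFrame({
    "Coupon_id": ["A", "A", "A", "B", "B", "C", "C"],
    "label": [1, 0, 0, 0, 0, 1, 0],
    "pred_prob": [0.9, 0.2, 0.4, 0.5, 0.1, 0.3, 0.6],
})
print(mean_auc_per_coupon(toy))  # prints 0.5
```

Coupon A is ranked perfectly (AUC 1.0), coupon B has only negative samples and is skipped, and coupon C is ranked backwards (AUC 0.0), so the mean over the two usable coupons is 0.5.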

Predict on the test set and write the submission file:

y_test_pred = model.predict_proba(dftest[predictors])
dftest1 = dftest[["User_id","Coupon_id","Date_received"]].copy()
dftest1["label"] = y_test_pred[:,1]
dftest1.to_csv("submit1.csv", index=False, header=False)

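We now have a first submission. The features used in this process were the coupon, distance, and time. The predictions are poor (the validation AUC is barely above the 0.5 of random guessing), so further feature engineering is needed to get better results.

Approach Summary

To summarize the approach: first analyze the data, where plots reflect its characteristics more intuitively; then extract features based on that analysis and use them to train the chosen model. During training, split the data into a training set and a validation set; finally, feed the test set to the trained model to get predictions and convert them into a .csv file that can be submitted.

That is the general workflow of a data analysis task. For this problem there is still plenty of room to think about feature engineering and model choice in order to improve accuracy.

Concepts Used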

one-hot encoding
AUC

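Problems Encountered

As far as my own learning goes, writing up this competition revealed the following 3 problems:

I don't understand the data visualization code well enough, yet plots can suggest many ideas

I need a deeper understanding of each model's parameters; to be more than someone who only calls packages, I have to grasp the principles behind parameter tuning

Feature engineering is the key to winning and takes constant practice and study

Reference links:
https://tianchi.aliyun.com/no...
https://tianchi.aliyun.com/no...

Corrections are welcome.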
