Pandas Journey (5): A First Look at Building Models: Checking Data Consistency

hqman, published 2019-07-31 10:13 / 1900 reads


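Pandas: how to build a simple model for your own needs

Hi everyone. In this installment I want to talk about building a small model with pandas. Let's start with a common scenario:

Every day you have to open N Excel files and repeat the same operations, squinting at a dizzying pile of VBA functions until your eyes blur.

In that situation, the best approach is to first think carefully about what the business actually requires and then build a small pandas model to match. Once it is built, you can start the Python script when you get to work, go for a coffee, and come back a few minutes later to find all the tedious steps already done. Life is good.

Enough chatter. To get the ball rolling, here is a small model I use a lot: a data consistency check (the number of records added and removed between the old and new data must add up). Today's article has 4 parts:

Creating fake data

Defining the model's goal

Putting it into practice

Source code and GitHub link

Right, let's go through them one by one.

1. Creating fake data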

import os

# These two lines only switch the working directory so I can upload to GitHub; you can ignore them
os.chdir(r"F:\Python教程\segmentfault\pandas_share\Pandas之旅_05 如何構建基礎模型")
os.getcwd()

"F:\\Python教程\\segmentfault\\pandas_share\\Pandas之旅_05 如何構建基礎模型"

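First, let's make some fake data. I'll generate fake order data here. By the end of the article you may find that the model does not fit this kind of data perfectly, and in practice you will adapt it to your own needs, but at least the basic idea will be in place!

Start by building a dictionary called fake_product whose keys are products and whose values are unit prices. For the product names I use a CSV dataset I found online; it has a single column, Product_Names. product_names.csv and the final code are both on GitHub, so feel free to download them if you're interested.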

import numpy as np
import pandas as pd
f"Using {pd.__name__},{pd.__version__}"

"Using pandas,0.23.0"

fake_df = pd.read_csv("product_names.csv")
fake_df.head(10)

	Product_Names
0	TrailChef Deluxe Cook Set
1	TrailChef Double Flame
2	Star Dome
3	Star Gazer 2
4	Hibernator Lite
5	Hibernator Extreme
6	Hibernator Camp Cot
7	Firefly Lite
8	Firefly Extreme
9	EverGlow Single

fake_df["Product_Names"].is_unique

True

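As we can see, the dataset is essentially a list of product names with no duplicates. Now let's load them into a dictionary and give each product a random price between 20 and 100 (random.randint includes both endpoints). Because we need to generate random fake data, we import the random module: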

import random

fake_product = { k:random.randint(20,100) for k in fake_df["Product_Names"]}
fake_product

{"TrailChef Deluxe Cook Set": 62,
 "TrailChef Double Flame": 78,
 "Star Dome": 58,
 "Star Gazer 2": 73,
 "Hibernator Lite": 56,
 "Hibernator Extreme": 99,
 "Hibernator Camp Cot": 33,
 "Firefly Lite": 27,
 "Firefly Extreme": 30,
 "EverGlow Single": 44,
 "EverGlow Butane": 33,
 "Husky Rope 50": 59,
 "Husky Rope 60": 81,
 "Husky Rope 100": 71,
 "Husky Rope 200": 81,
 "Granite Climbing Helmet": 86,
 "Husky Harness": 76,
 "Husky Harness Extreme": 73,
 "Granite Signal Mirror": 67,
 "Granite Carabiner": 63,
 "Granite Belay": 49,
 "Granite Pulley": 48,
 "Firefly Climbing Lamp": 47,
 "Firefly Charger": 60,
 "Firefly Rechargeable Battery": 52,
 "Granite Chalk Bag": 22,
 "Granite Ice": 71,
 "Granite Hammer": 50,
 "Granite Shovel": 41,
 "Granite Grip": 74,
 "Granite Axe": 68,
 "Granite Extreme": 74,
 "Mountain Man Extreme": 87,
 "Polar Sun": 82,
 "Polar Ice": 47,
 "Edge Extreme": 53,
 "Bear Survival Edge": 81,
 "Glacier GPS Extreme": 48,
 "BugShield Extreme": 87,
 "Sun Shelter Stick": 42,
 "Compact Relief Kit": 46,
 "Aloe Relief": 24,
 "Infinity": 73,
 "TX": 43,
 "Legend": 100,
 "Kodiak": 44,
 "Capri": 31,
 "Cat Eye": 62,
 "Dante": 71,
 "Fairway": 77,
 "Inferno": 59,
 "Maximus": 38,
 "Trendi": 35,
 "Zone": 87,
 "Max Gizmo": 67,
 "Pocket Gizmo": 73,
 "Ranger Vision": 73,
 "Trail Master": 96,
 "Hailstorm Steel Irons": 79,
 "Hailstorm Titanium Irons": 31,
 "Lady Hailstorm Steel Irons": 91,
 "Lady Hailstorm Titanium Irons": 99,
 "Hailstorm Titanium Woods Set": 74,
 "Hailstorm Steel Woods Set": 30,
 "Lady Hailstorm Titanium Woods Set": 99,
 "Lady Hailstorm Steel Woods Set": 84,
 "Course Pro Putter": 64,
 "Blue Steel Putter": 26,
 "Blue Steel Max Putter": 96,
 "Course Pro Golf and Tee Set": 90,
 "Course Pro Umbrella": 20,
 "Course Pro Golf Bag": 66,
 "Course Pro Gloves": 61,
 "TrailChef Canteen": 60,
 "TrailChef Kitchen Kit": 53,
 "TrailChef Cup": 88,
 "TrailChef Cook Set": 27,
 "TrailChef Single Flame": 45,
 "TrailChef Kettle": 70,
 "TrailChef Utensils": 88,
 "Star Gazer 6": 42,
 "Star Peg": 28,
 "Hibernator": 47,
 "Hibernator Self - Inflating Mat": 66,
 "Hibernator Pad": 89,
 "Hibernator Pillow": 84,
 "Canyon Mule Climber Backpack": 82,
 "Canyon Mule Weekender Backpack": 92,
 "Canyon Mule Journey Backpack": 82,
 "Canyon Mule Cooler": 23,
 "Canyon Mule Carryall": 56,
 "Firefly Mapreader": 77,
 "Firefly 2": 76,
 "Firefly 4": 75,
 "Firefly Multi-light": 91,
 "EverGlow Double": 34,
 "EverGlow Lamp": 28,
 "Mountain Man Analog": 39,
 "Mountain Man Digital": 85,
 "Mountain Man Deluxe": 84,
 "Mountain Man Combination": 40,
 "Venue": 56,
 "Lux": 44,
 "Polar Sports": 20,
 "Polar Wave": 62,
 "Bella": 45,
 "Hawk Eye": 42,
 "Seeker 35": 81,
 "Seeker 50": 90,
 "Opera Vision": 98,
 "Glacier Basic": 63,
 "Glacier GPS": 66,
 "Trail Scout": 32,
 "BugShield Spray": 34,
 "BugShield Lotion Lite": 90,
 "BugShield Lotion": 84,
 "Sun Blocker": 88,
 "Sun Shelter 15": 45,
 "Sun Shelter 30": 100,
 "Sun Shield": 62,
 "Deluxe Family Relief Kit": 43,
 "Calamine Relief": 82,
 "Insect Bite Relief": 72,
 "Star Lite": 32,
 "Star Gazer 3": 95,
 "Single Edge": 87,
 "Double Edge": 20,
 "Bear Edge": 80,
 "Glacier Deluxe": 82,
 "BugShield Natural": 83,
 "TrailChef Water Bag": 99,
 "Canyon Mule Extreme Backpack": 58,
 "EverGlow Kerosene": 78,
 "Sam": 67,
 "Polar Extreme": 34,
 "Seeker Extreme": 43,
 "Seeker Mini": 26,
 "Flicker Lantern": 44,
 "Trail Star": 47,
 "Zodiak": 31,
 "Sky Pilot": 58,
 "Retro": 99,
 "Astro Pilot": 99,
 "Auto Pilot": 20}

len(fake_product)
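A side note on reproducibility: the prices come from random.randint, so the dictionary above will look different on every run. Below is a minimal sketch of one way to pin the values down, assuming a fixed seed (42 is arbitrary) and a short hypothetical list of names standing in for fake_df["Product_Names"]. make_prices is an illustrative helper, not part of the original code.

```python
import random

# Hypothetical stand-in for fake_df["Product_Names"]
product_names = ["Star Dome", "Star Gazer 2", "Hibernator Lite", "Firefly Lite"]

def make_prices(names, seed=42, low=20, high=100):
    """Return {name: random unit price}; the same seed gives the same prices."""
    rng = random.Random(seed)  # private generator, leaves the global random state untouched
    return {name: rng.randint(low, high) for name in names}

prices_a = make_prices(product_names)
prices_b = make_prices(product_names)
print(prices_a == prices_b)  # the two runs agree because they share a seed
```

Using a private random.Random instance rather than random.seed keeps the change local, so other code that relies on the global generator is unaffected.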

Here we get fake_product, a dictionary of 144 items whose keys are product names and whose values are unit prices. Next, to save effort, I wrote a simple function, get_fake_data, that returns a filled-in fake dataset, also as a dictionary:

def get_fake_data(id_range_start, id_range_end, random_quantity_range=50):
    # Build parallel lists, one entry per generated order
    Id = []
    Quantity = []
    Product_name = []
    Unit_price = []
    Total_price = []

    for i in range(id_range_start, id_range_end):
        random_quantity = random.randint(1, random_quantity_range)
        # Pick a random (product name, unit price) pair from fake_product
        name, price = random.choice(list(fake_product.items()))

        Id.append("A00" + str(i))
        Quantity.append(random_quantity)
        Product_name.append(name)
        Unit_price.append(price)
        Total_price.append(price * random_quantity)

    result = {
        "Product_ID": Id,
        "Product_Name": Product_name,
        "Quantity": Quantity,
        "Unit_price": Unit_price,
        "Total_price": Total_price,
    }

    return result

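# Just for fun, the totals could also be computed at the end with a comprehension:
# total_price = [quantity[i] * v for i, v in enumerate(unit_price)]
# total_price = [q * p for q, p in zip(quantity, unit_price)]

First of all, this function is not very concise and you are welcome to tidy it up, but today's focus is the small model, so let's look at the dict it returns. It contains the following columns:

Product_ID: the order number, generated in increasing order

Product_Name: the product name, chosen at random

Quantity: the quantity ordered, a random integer between 1 and random_quantity_range

Unit_price: the product's unit price

Total_price: the total price (Unit_price × Quantity)

Each list has length id_range_end - id_range_start. Now let's generate the fake data: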

fake_data= get_fake_data(1,len(fake_product)+1)

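As we can see, this generates one set of fake data with IDs running from A001 to A00144 (range(1, 145) stops before 145).

Let's take a quick look at the keys of the fake data and the length of each list: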

fake_data.keys()

dict_keys(["Product_ID", "Product_Name", "Quantity", "Unit_price", "Total_price"])

for v in fake_data.values():
    print(len(v))

The list under every key has length 144.
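As noted earlier, get_fake_data can be tidied up. One possible compact variant is sketched below. get_fake_data_compact, its products and rng parameters, and the three-item demo_products dict are all illustrative assumptions rather than the author's code. It builds the item list once and takes the product dict as an argument instead of reading a global.

```python
import random

COLUMNS = ["Product_ID", "Product_Name", "Quantity", "Unit_price", "Total_price"]

def get_fake_data_compact(products, id_start, id_end, max_quantity=50, rng=random):
    """Return the same five columns as get_fake_data, one order per ID."""
    items = list(products.items())  # build the list once, not on every iteration
    rows = []
    for i in range(id_start, id_end):
        name, price = rng.choice(items)
        quantity = rng.randint(1, max_quantity)
        rows.append(("A00" + str(i), name, quantity, price, price * quantity))
    # Turn the list of row tuples into one list per column
    return {col: [row[k] for row in rows] for k, col in enumerate(COLUMNS)}

# Hypothetical subset of fake_product, just for the demonstration
demo_products = {"Star Dome": 58, "Firefly Lite": 27, "Polar Sun": 82}
demo = get_fake_data_compact(demo_products, 1, 6)
print(demo["Product_ID"])  # ['A001', 'A002', 'A003', 'A004', 'A005']
```

Collecting whole rows first keeps each order's fields together, which makes it harder for the parallel lists to drift out of step.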

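2. Defining the model's goal

pandas' built-in from_dict method turns a dict into a DataFrame. We will use the fake_data we just generated to simulate the stock in January and in February by splitting it into two groups: A001 to A00140, and A008 to A00144. That is a reasonable imitation of a real situation.

Most products carry over unchanged (IDs 8 to 140), but between January and February we dropped 7 stock items for various reasons (IDs 1 to 7) and added 4 new ones (IDs 141 to 144). The formula we use to verify consistency is:

Added + January total = Removed + February total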
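To see the formula at work before touching pandas, here is a small sketch that uses plain Python sets of IDs built from the ranges described above. The variable names (jan_ids, feb_ids, added, removed) are illustrative assumptions.

```python
# IDs follow the article's "A00" + number convention
jan_ids = {"A00" + str(i) for i in range(1, 141)}  # A001 .. A00140
feb_ids = {"A00" + str(i) for i in range(8, 145)}  # A008 .. A00144

added = feb_ids - jan_ids    # present in February only
removed = jan_ids - feb_ids  # present in January only

print(len(jan_ids), len(feb_ids), len(added), len(removed))  # 140 137 4 7
print(len(added) + len(jan_ids) == len(removed) + len(feb_ids))  # True
```

Both sides count the union of the two months (144 IDs here), so with unique IDs the equality always holds. The check becomes informative when the totals are row counts from real tables, where duplicated or miscounted rows can throw one side off.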

3. Putting it into practice

Now let's implement this small model. First, build the stock DataFrame, from which we will derive the two DataFrames stock_jan and stock_fev:

stock= pd.DataFrame.from_dict(fake_data)
stock.head()

	Product_ID	Product_Name	Quantity	Unit_price	Total_price
0	A001	Course Pro Golf Bag	39	66	2574
1	A002	EverGlow Kerosene	18	78	1404
2	A003	Lux	24	44	1056
3	A004	Course Pro Putter	12	64	768
4	A005	Seeker 50	42	90	3780

stock.set_index("Product_ID", inplace=True)
stock.head()

	Product_Name	Quantity	Unit_price	Total_price
Product_ID
A001	Course Pro Golf Bag	39	66	2574
A002	EverGlow Kerosene	18	78	1404
A003	Lux	24	44	1056
A004	Course Pro Putter	12	64	768
A005	Seeker 50	42	90	3780

# January stock: A001 to A00140 (label slices include both endpoints)
stock_jan = stock.loc[:"A00140"]
stock_jan.tail()

	Product_Name	Quantity	Unit_price	Total_price
Product_ID
A00136	Flicker Lantern	1	44	44
A00137	BugShield Spray	8	34	272
A00138	Glacier Basic	25	63	1575
A00139	Sun Blocker	23	88	2024
A00140	Granite Carabiner	11	63	693

# February stock: A008 to A00144
stock_fev = stock.loc["A008":]
stock_fev.tail()

	Product_Name	Quantity	Unit_price	Total_price
Product_ID
A00140	Granite Carabiner	11	63	693
A00141	TrailChef Utensils	24	88	2112
A00142	TrailChef Deluxe Cook Set	9	62	558
A00143	Trail Star	21	47	987
A00144	Ranger Vision	19	73	1387

Let's pause for a moment and look at the two DataFrames:

stock_jan: all rows from A001 to A00140

stock_fev: all rows from A008 to A00144
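The two slices above rely on label-based slicing, which behaves differently from positional slicing. Here is a minimal sketch on a hypothetical four-row frame (demo and its values are made up for illustration). Both endpoints are included, and the slice follows row order rather than alphabetical order, which matters because IDs such as A0010 sort before A008 as strings.

```python
import pandas as pd

# Hypothetical mini stock table using the article's ID convention
demo = pd.DataFrame(
    {"Quantity": [5, 3, 8, 2]},
    index=pd.Index(["A008", "A009", "A0010", "A0011"], name="Product_ID"),
)

first_part = demo.loc[:"A0010"]  # from the first row up to and including A0010
last_part = demo.loc["A009":]    # from A009 through the last row

print(list(first_part.index))
print(list(last_part.index))
```

This works on an unsorted index only because every label is unique and the slice bounds exist in the index. With duplicated IDs pandas raises an error instead of guessing.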

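The next step is simple: use the merge function from the previous article. The merge keys are the index Product_ID and the column Product_Name (naming an index level in on requires pandas 0.23 or later), and we use an outer merge: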

merge_keys=["Product_ID","Product_Name"]

check_corehence = stock_jan.merge(stock_fev,on=merge_keys,how="outer",suffixes=("_jan","_fev"))
check_corehence.head(10)

	Product_Name	Quantity_jan	Unit_price_jan	Total_price_jan	Quantity_fev	Unit_price_fev	Total_price_fev
Product_ID
A001	Course Pro Golf Bag	39.0	66.0	2574.0	NaN	NaN	NaN
A002	EverGlow Kerosene	18.0	78.0	1404.0	NaN	NaN	NaN
A003	Lux	24.0	44.0	1056.0	NaN	NaN	NaN
A004	Course Pro Putter	12.0	64.0	768.0	NaN	NaN	NaN
A005	Seeker 50	42.0	90.0	3780.0	NaN	NaN	NaN
A006	Course Pro Golf Bag	27.0	66.0	1782.0	NaN	NaN	NaN
A007	Husky Rope 100	3.0	71.0	213.0	NaN	NaN	NaN
A008	EverGlow Double	18.0	34.0	612.0	18.0	34.0	612.0
A009	Opera Vision	30.0	98.0	2940.0	30.0	98.0	2940.0
A0010	TX	38.0	43.0	1634.0	38.0	43.0	1634.0

check_corehence.tail()

	Product_Name	Quantity_jan	Unit_price_jan	Total_price_jan	Quantity_fev	Unit_price_fev	Total_price_fev
Product_ID
A00140	Granite Carabiner	11.0	63.0	693.0	11.0	63.0	693.0
A00141	TrailChef Utensils	NaN	NaN	NaN	24.0	88.0	2112.0
A00142	TrailChef Deluxe Cook Set	NaN	NaN	NaN	9.0	62.0	558.0
A00143	Trail Star	NaN	NaN	NaN	21.0	47.0	987.0
A00144	Ranger Vision	NaN	NaN	NaN	19.0	73.0	1387.0

As you can see, the first 7 rows are the stock that was removed, and the last 4 rows are the stock added in February. Now let's extract the removed and the newly added stock separately:

new_stock = check_corehence.loc[(check_corehence["Quantity_jan"].isnull()) & (check_corehence["Quantity_fev"].notnull())]
num_new = new_stock.shape[0]
num_new

remove_stock = check_corehence.loc[(check_corehence["Quantity_fev"].isnull()) & (check_corehence["Quantity_jan"].notnull())]
num_remove = remove_stock.shape[0]
num_remove
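The null checks above work because Quantity is never missing in the source data. As an alternative sketch, merge accepts indicator=True, which adds a _merge column recording where each row came from. The two three-row frames below are hypothetical, and Product_ID is kept as an ordinary column to keep the example small.

```python
import pandas as pd

# Hypothetical January and February tables
jan = pd.DataFrame({"Product_ID": ["A001", "A002", "A003"],
                    "Product_Name": ["Lux", "TX", "Zone"],
                    "Quantity": [24, 38, 5]})
feb = pd.DataFrame({"Product_ID": ["A002", "A003", "A004"],
                    "Product_Name": ["TX", "Zone", "Sam"],
                    "Quantity": [38, 5, 7]})

merged = jan.merge(feb, on=["Product_ID", "Product_Name"], how="outer",
                   suffixes=("_jan", "_fev"), indicator=True)

# _merge is "left_only", "right_only" or "both" for each row
num_remove = int((merged["_merge"] == "left_only").sum())
num_new = int((merged["_merge"] == "right_only").sum())
print(num_new, num_remove)  # 1 1
```

Because the classification comes from the merge itself, it still holds if a data column happens to contain genuine NaN values.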

Now let's look at the number of rows for January and February:

# Number of rows in January
num_stock_jan = stock_jan.shape[0]
num_stock_jan

# Number of rows in February
num_stock_fev = stock_fev.shape[0]
num_stock_fev

Now let's plug the numbers into the formula:

num_stock_jan + num_new

num_stock_fev + num_remove

Both sides come to 144 (140 + 4 and 137 + 7), so the data passes the consistency check!
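To reuse the check on other tables, the steps can be wrapped in a function. The sketch below is one possible shape, not the author's code: check_consistency and the keys of its result dict are hypothetical names, and it assumes both DataFrames are indexed by ID and compares on the index alone.

```python
import pandas as pd

def check_consistency(old, new):
    """Compare two DataFrames indexed by ID and test the balance formula."""
    added = new.index.difference(old.index)    # IDs only in the new data
    removed = old.index.difference(new.index)  # IDs only in the old data
    left = len(old) + len(added)
    right = len(new) + len(removed)
    return {"added": len(added), "removed": len(removed),
            "left": left, "right": right, "consistent": left == right}

# Hypothetical January and February tables
jan = pd.DataFrame({"Quantity": [24, 38, 5]}, index=["A001", "A002", "A003"])
feb = pd.DataFrame({"Quantity": [38, 5, 7]}, index=["A002", "A003", "A004"])
print(check_consistency(jan, feb)["consistent"])      # True

# A duplicated ID in one month breaks the balance, which is what the check catches
jan_dup = pd.concat([jan, jan.loc[["A002"]]])
print(check_consistency(jan_dup, feb)["consistent"])  # False
```

The second call shows the kind of problem the formula surfaces: the row count for January grows, but the sets of added and removed IDs do not, so the two sides no longer match.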

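4. Source code and GitHub link

This installment shared a simple pandas model for checking data consistency. The model is still very basic and does very little, but you should now be familiar with the overall workflow for building one. From here you can build models for your own business needs. As long as you work with Excel every day, there is a model out there for you.

I have put this installment's ipynb and py files, along with the product list used here (Category List), on GitHub. To download them, follow the link below:

GitHub repository: https://github.com/yaozeliang/pandas_share

Thanks for your continued support. That's a wrap!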
