Python Data Analysis Study Notes: Getting Started with Pandas

zqhxuyuan, published 2019-07-25 11:21 / 3077 reads

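Abstract: pandas is an open-source library for data analysis. A DataFrame closely resembles an Excel spreadsheet or a table in a relational database. Note that a DataFrame carries an index, similar to a primary key in a relational database. The notes also cover DataFrame statistical functions, grouping and aggregation (the agg method applies a series of NumPy functions to groups of data), and concatenation: concat() concatenates DataFrames, and rows were traditionally added with append().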

pandas (Python data analysis) is an open-source Python library for data analysis.
pandas has two main data structures: DataFrame and Series.

Installation: pandas depends on NumPy, python-dateutil, and pytz.

pip install pandas

DataFrame

A DataFrame is a labeled two-dimensional object that closely resembles an Excel spreadsheet or a table in a relational database. A DataFrame can be created in the following ways:

From another DataFrame

From a NumPy array with a two-dimensional shape, or from a composite structure of arrays

From Series objects

From a file such as a CSV file
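
The four construction routes listed above can be sketched in a few lines. This is only an illustration: the column names a and b and the CSV text are made up, and io.StringIO stands in for a real file on disk.

```python
import io

import numpy as np
import pandas as pd

# From a two-dimensional NumPy array (column names are hypothetical)
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

# From another DataFrame; copy=True gives an independent object
from_frame = pd.DataFrame(from_array, copy=True)

# From Series objects, one per column
from_series = pd.DataFrame({"a": pd.Series([1, 2, 3]),
                            "b": pd.Series([4.0, 5.0, 6.0])})

# From CSV text; io.StringIO plays the role of a file
csv_text = "a,b\n1,4\n2,5\n3,6\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

for frame in (from_array, from_frame, from_series, from_csv):
    print(frame.shape, list(frame.columns))
```

Each frame has three rows and the two columns a and b, so the loop prints the same shape four times.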

Data preparation: download a CSV data file from http://www.exporedata.net/Dow...

import pandas as pd

df = pd.read_csv("WHO_first9cols.csv")
print("Dataframe", df)
print("Shape", df.shape)
print("Length", len(df))
print("Column Headers", df.columns)
print("Data types", df.dtypes)
print("Index", df.index)
print("Values", df.values)

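Note: a DataFrame carries an index, similar to a primary key in a relational database. The index can be set manually or generated automatically, and it is available as df.index.
To iterate over the underlying data, use df.values to get all values as a NumPy array; missing values are printed as nan.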
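
As a small sketch of the note above, the following builds the same data once with an automatic index and once with a manual one, then walks over df.values. The country names, numbers, and row labels are invented for the example.

```python
import pandas as pd

data = {"Country": ["A", "B", "C"], "Population": [10.0, None, 30.0]}

# Automatic index: pandas numbers the rows 0, 1, 2
auto = pd.DataFrame(data)

# Manual index: row labels chosen by the caller (hypothetical codes)
manual = pd.DataFrame(data, index=["x", "y", "z"])

print(auto.index)
print(manual.index)

# df.values is a NumPy array; the missing population is printed as nan
for row in manual.values:
    print(row)
```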

Series

A Series is a one-dimensional array whose elements can be of different types; like a DataFrame, it is labeled. A Series can be created in the following ways:

From a Python dict

From a NumPy array

From a single scalar value

When creating a Series, you can pass the constructor a list of axis labels, usually called the index.
Selecting a column of a DataFrame returns a Series.
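
A minimal sketch of the three construction routes and of column selection, using made-up labels and values:

```python
import numpy as np
import pandas as pd

# From a Python dict: the keys become the index labels
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From a NumPy array, with explicit axis labels passed to the constructor
from_array = pd.Series(np.array([1.5, 2.5, 3.5]), index=["x", "y", "z"])

# From a single scalar: the value is repeated for every label
from_scalar = pd.Series(7, index=["p", "q"])

# Selecting one DataFrame column returns a Series named after the column
df = pd.DataFrame({"Country": ["A", "B"], "Value": [1, 2]})
country_col = df["Country"]

print(from_dict, from_array, from_scalar, sep="\n")
print(type(country_col), country_col.name)
```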

import numpy as np
import pandas as pd

df = pd.read_csv("WHO_first9cols.csv")
# Selecting a DataFrame column returns a Series
country_col = df["Country"]
print("Type df", type(df))
print("Type country col", type(country_col))

print("Series shape", country_col.shape)
print("Series index", country_col.index)
print("Series values", country_col.values)
print("Series name", country_col.name)

print("Last 2 countries", country_col[-2:])
print("Last 2 countries type", type(country_col[-2:]))
# NumPy functions also work on pandas DataFrames and Series;
# np.sign() needs numeric data, so the text columns are left out
print("df signs", np.sign(df.select_dtypes(include="number")))
last_col = df.columns[-1]
print("Last df column signs", last_col, np.sign(df[last_col]))

# A column minus its own values is zero wherever data is present
print(np.sum(df[last_col] - df[last_col].values))

Querying data with pandas

Data preparation: run pip install Quandl, or download the CSV file manually from http://www.quandl.com/SIDC/SU...

import quandl  # newer releases use the lowercase module name; older ones used "import Quandl"

# Data from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual
# PyPi url https://pypi.python.org/pypi/Quandl
sunspots = quandl.get("SIDC/SUNSPOTS_A")
print("Head 2", sunspots.head(2))
print("Tail 2", sunspots.tail(2))

# Label-based lookup with loc
last_date = sunspots.index[-1]
print("Last value", sunspots.loc[last_date])

# Date strings can slice a date-indexed DataFrame
print("Values slice by date", sunspots["20020101": "20131231"])

# Position-based lookup with iloc
print("Slice from a list of indices", sunspots.iloc[[2, 4, -4, -2]])

print("Scalar with Iloc", sunspots.iloc[0, 0])
print("Scalar with iat", sunspots.iat[1, 0])

print("Boolean selection", sunspots[sunspots > sunspots.mean()])
print("Boolean selection with column label", sunspots[sunspots.Number > sunspots.Number.mean()])
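
The same selection techniques can be tried without network access. The sketch below uses a small synthetic frame: the column name Number mirrors the sunspot data, but the dates and values are invented.

```python
import pandas as pd

# Synthetic yearly data standing in for the sunspot download
years = range(2000, 2006)
index = pd.to_datetime([f"{year}-01-01" for year in years])
sunspots = pd.DataFrame(
    {"Number": [120.0, 111.0, 104.0, 64.0, 40.0, 30.0]}, index=index
)

last_date = sunspots.index[-1]
last_value = sunspots.loc[last_date]             # label-based lookup, returns a Series
by_date = sunspots.loc["20020101":"20031231"]    # slice by date strings, end included
by_position = sunspots.iloc[[1, -1]]             # lookup by integer position
scalar = sunspots.iat[1, 0]                      # single value by position
above_mean = sunspots[sunspots.Number > sunspots.Number.mean()]

print(last_value, by_date, by_position, scalar, above_mean, sep="\n\n")
```

With these numbers the mean is about 78.2, so the boolean selection keeps the first three years.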

DataFrame statistical functions
describe, count, mad, median, min, max, mode, std, var, skew, kurt
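
A short sketch of these methods on made-up numbers that are easy to check by hand. One assumption to flag: mad() is, to my knowledge, no longer available in pandas 2.0 and later, so the mean absolute deviation is written out explicitly here instead of calling it.

```python
import pandas as pd

# Invented values: x has mean 2.5, y has mean 25
df = pd.DataFrame({"x": [1.0, 2.0, 2.0, 5.0], "y": [10.0, 20.0, 30.0, 40.0]})

print(df.describe())  # count, mean, std, min, quartiles, max in one table
print("count\n", df.count())
print("median\n", df.median())
print("min\n", df.min())
print("max\n", df.max())
print("mode\n", df.mode())
print("std\n", df.std())
print("var\n", df.var())
print("skew\n", df.skew())
print("kurt\n", df.kurt())

# Mean absolute deviation by hand, for pandas versions without mad()
mad = (df - df.mean()).abs().mean()
print("mad\n", mad)
```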

Grouping and aggregating DataFrames

import pandas as pd
from numpy.random import seed
from numpy.random import rand
from numpy.random import randint

seed(42)

# randint(1, 10) draws integers from 1 to 9, like the deprecated random_integers(1, 9)
df = pd.DataFrame({"Weather" : ["cold", "hot", "cold", "hot",
    "cold", "hot", "cold"],
    "Food" : ["soup", "soup", "icecream", "chocolate",
    "icecream", "icecream", "soup"],
    "Price" : 10 * rand(7), "Number" : randint(1, 10, size=(7,))})

print(df)
weather_group = df.groupby("Weather")

# Iterating over a groupby object yields (group name, sub-DataFrame) pairs
for i, (name, group) in enumerate(weather_group, start=1):
    print("Group", i, name)
    print(group)

print("Weather group first\n", weather_group.first())
print("Weather group last\n", weather_group.last())
# numeric_only=True skips the text column Food, which has no mean
print("Weather group mean\n", weather_group.mean(numeric_only=True))

wf_group = df.groupby(["Weather", "Food"])
print("WF Groups", wf_group.groups)
# agg() applies a list of functions to each group; "mean" and "median"
# name the same reductions as np.mean and np.median
print("WF Aggregated\n", wf_group.agg(["mean", "median"]))

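Because the example above uses random numbers, its output is hard to check by eye. The following sketch repeats the grouping on a tiny hand-written table (all values invented) and also shows agg() with a different set of functions per column.

```python
import pandas as pd

df = pd.DataFrame({
    "Weather": ["cold", "hot", "cold", "hot"],
    "Price": [2.0, 4.0, 6.0, 8.0],
    "Number": [1, 3, 5, 7],
})

grouped = df.groupby("Weather")

# One row per group, one column per numeric column
means = grouped.mean()

# A dict maps each column to the function or functions applied to it
summary = grouped.agg({"Price": ["min", "max"], "Number": "sum"})

print(means)
print(summary)
```

Here the cold group averages a price of 4.0 and the hot group 6.0, which is easy to confirm from the four rows.
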
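Concatenating and appending DataFrames

Database tables support inner and outer joins. DataFrames offer similar operations, namely concatenation and appending.
The concat() function concatenates DataFrames. Rows used to be appended with the append() method, which was removed in pandas 2.0; concat() handles that case as well.
For example:

pd.concat([df[:3], df[3:]])
pd.concat([df[:3], df[5:]])  # formerly df[:3].append(df[5:])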
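
A self-contained sketch of both calls, on a made-up seven-row frame. The ignore_index option at the end is an addition for illustration; it renumbers the rows instead of keeping the original labels.

```python
import pandas as pd

df = pd.DataFrame({"id": range(7), "value": list("abcdefg")})

# Concatenating the first three rows with the rest rebuilds the original frame
whole = pd.concat([df[:3], df[3:]])

# Adding rows 5 and 6 to the first three rows keeps their original labels
appended = pd.concat([df[:3], df[5:]])

# ignore_index=True renumbers the result 0..n-1
renumbered = pd.concat([df[:3], df[5:]], ignore_index=True)

print(whole.equals(df))
print(appended.index.tolist())
print(renumbered.index.tolist())
```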

pandas provides the merge() function and the DataFrame join() method, both of which implement database-style joins. By default, join() joins on the index, which is not always what we want.
Data preparation:
tips.csv

EmpNr,Amount
5,10
9,5
7,2.5

dest.csv

EmpNr,Dest
5,The Hague
3,Amsterdam
9,Rotterdam

import pandas as pd

dests = pd.read_csv("dest.csv")
tips = pd.read_csv("tips.csv")
# Use merge() to join on the employee number
print("Merge() on key\n", pd.merge(dests, tips, on="EmpNr"))
# join() matches rows by index; the suffixes tell the left and right EmpNr columns apart
print("Dests join() tips\n", dests.join(tips, lsuffix="Dest", rsuffix="Tips"))
# A more explicit way to run an inner join with merge()
print("Inner join with merge()\n", pd.merge(dests, tips, how="inner"))
# A small change turns it into a full outer join; missing data becomes NaN
print("Outer join\n", pd.merge(dests, tips, how="outer"))

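To try the joins without creating files, the two CSV tables above can be read from strings. In this sketch io.StringIO stands in for tips.csv and dest.csv; the data is the same as in the listing.

```python
import io

import pandas as pd

tips = pd.read_csv(io.StringIO("EmpNr,Amount\n5,10\n9,5\n7,2.5\n"))
dests = pd.read_csv(io.StringIO("EmpNr,Dest\n5,The Hague\n3,Amsterdam\n9,Rotterdam\n"))

# Inner join: only employees 5 and 9 appear in both tables
inner = pd.merge(dests, tips, on="EmpNr", how="inner")

# Outer join: all four employees, with NaN where one side has no row
outer = pd.merge(dests, tips, on="EmpNr", how="outer")

# join() aligns on the row index (0, 1, 2), not on EmpNr
joined = dests.join(tips, lsuffix="Dest", rsuffix="Tips")

print(inner, outer, joined, sep="\n\n")
```

The join() result pairs rows by position, so employee 3 ends up next to the tip of employee 9; this is the case where joining on the index is not what we want.
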
Handling missing data

Missing data shows up as NaN (not a number); dates have a similar marker, NaT (not a time). pandas provides two functions to test for missing values, isnull() and notnull(), and the fillna() method replaces missing data with a scalar value.

import pandas as pd
import numpy as np

df = pd.read_csv("WHO_first9cols.csv")
# Select the first 2 rows of Country and Net primary school enrolment ratio male (%)
df = df[["Country", df.columns[-2]]][:2]
print("New df\n", df)
print("Null Values\n", pd.isnull(df))
# True counts as 1, so the sum gives the number of missing values per column
print("Total Null Values\n", pd.isnull(df).sum())
print("Not Null Values\n", df.notnull())
# Arithmetic with NaN yields NaN
print("Last Column Doubled\n", 2 * df[df.columns[-1]])
print("Last Column plus NaN\n", df[df.columns[-1]] + np.nan)
print("Zero filled\n", df.fillna(0))

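The same checks can be run on a hand-made frame when the WHO file is not at hand. The column names and values below are invented; the Updated column is included only to show NaT.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Ratio": [np.nan, 94.0, np.nan],
    "Updated": pd.to_datetime(["2019-01-01", None, "2019-03-01"]),
})

null_counts = pd.isnull(df).sum()     # missing values per column
doubled = 2 * df["Ratio"]             # NaN stays NaN under arithmetic
filled = df["Ratio"].fillna(0)        # replace NaN with a scalar

print(df)
print(null_counts)
print(doubled)
print(filled)
```
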
Handling dates

http://pandas.pydata.org/pand...
Frequency (freq) short codes:

B business day frequency

C custom business day frequency (experimental)

D calendar day frequency

W weekly frequency

M month end frequency

SM semi-month end frequency (15th and end of month)

BM business month end frequency

CBM custom business month end frequency

MS month start frequency

SMS semi-month start frequency (1st and 15th)

BMS business month start frequency

CBMS custom business month start frequency

Q quarter end frequency

BQ business quarter end frequency

QS quarter start frequency

BQS business quarter start frequency

A year end frequency

BA business year end frequency

AS year start frequency

BAS business year start frequency

BH business hour frequency

H hourly frequency

T, min minutely frequency

S secondly frequency

L, ms milliseconds

U, us microseconds

N nanoseconds
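
A few of these codes in action, as a rough sketch. The start date is an arbitrary choice (2019-07-01, a Monday), and only codes from the table above are used.

```python
import pandas as pd

start = "2019-07-01"  # a Monday

daily = pd.date_range(start, periods=7, freq="D")        # every calendar day
business = pd.date_range(start, periods=7, freq="B")     # weekdays only
weekly = pd.date_range(start, periods=3, freq="W")       # weekly, anchored on Sundays
month_start = pd.date_range(start, periods=3, freq="MS") # first day of each month

print(daily)
print(business)
print(weekly)
print(month_start)
```

The business-day range skips the weekend of 6 and 7 July, so its seventh entry is 9 July rather than 7 July.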

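import pandas as pd
from pandas.tseries.offsets import DateOffset

print("Date range", pd.date_range("1/1/1900", periods=42, freq="D"))

try:
    # 1677-01-01 lies before the earliest date a nanosecond timestamp can hold
    print("Date range", pd.date_range("1/1/1677", periods=4, freq="D"))
except Exception as exc:
    print("Error encountered", type(exc), exc)

# Timestamps count nanoseconds in a 64-bit integer, so the valid range is
# 1970-01-01 plus or minus 2 ** 63 nanoseconds; // keeps the seconds an integer
offset = DateOffset(seconds=2 ** 63 // 10 ** 9)
mid = pd.to_datetime("1/1/1970")
print("Start valid range", mid - offset)
print("End valid range", mid + offset)
# The two strings use different separators; pandas 2.0 and later may need format="mixed" here
print(pd.to_datetime(["1900/1/1", "1901.12.11"]))

print("With format", pd.to_datetime(["19021112", "19031230"], format="%Y%m%d"))

try:
    print("Illegal date", pd.to_datetime(["1902-11-12", "not a date"]))
except ValueError as exc:
    print("Illegal date", exc)
# errors="coerce" (spelled coerce=True in old pandas releases) turns bad input into NaT
print("Illegal date coerced", pd.to_datetime(["1902-11-12", "not a date"], errors="coerce"))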
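
The valid range computed above from 2 ** 63 nanoseconds can also be read directly from pd.Timestamp.min and pd.Timestamp.max. The sketch below prints both bounds and shows how errors="coerce" replaces an unparseable string with NaT; the input strings are the ones used in the listing.

```python
import pandas as pd

# Bounds of the default nanosecond-resolution timestamp
print("Earliest timestamp", pd.Timestamp.min)
print("Latest timestamp", pd.Timestamp.max)

# errors="coerce" converts input that cannot be parsed into NaT
parsed = pd.to_datetime(["1902-11-12", "not a date"], errors="coerce")
print(parsed)
```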

Pivot tables (pivot_table)

Pivot tables summarize data. pandas provides the pivot_table() function and a DataFrame method of the same name.

import pandas as pd
from numpy.random import seed
from numpy.random import rand
from numpy.random import randint

seed(42)
N = 7
# randint(1, 10) draws integers from 1 to 9, like the deprecated random_integers(1, 9)
df = pd.DataFrame({
    "Weather" : ["cold", "hot", "cold", "hot",
    "cold", "hot", "cold"],
    "Food" : ["soup", "soup", "icecream", "chocolate",
    "icecream", "icecream", "soup"],
    "Price" : 10 * rand(N), "Number" : randint(1, 10, size=(N,))})

print("DataFrame\n", df)
# columns (called cols in old pandas releases) names the column whose values become
# the new column headers, values lists the columns to aggregate, and aggfunc sets
# the aggregation function
print(pd.pivot_table(df, values=["Price", "Number"], columns=["Food"], aggfunc="sum"))
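
For a result that can be checked by hand, here is a sketch with fixed, invented prices. It also adds an index argument, which the example above does not use, so that Weather forms the rows and Food the columns.

```python
import pandas as pd

df = pd.DataFrame({
    "Weather": ["cold", "hot", "cold", "hot", "cold"],
    "Food": ["soup", "soup", "icecream", "icecream", "soup"],
    "Price": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Rows are Weather values, columns are Food values, cells are summed prices
table = pd.pivot_table(df, index="Weather", columns="Food",
                       values="Price", aggfunc="sum")

# The DataFrame method takes the same arguments
same = df.pivot_table(index="Weather", columns="Food",
                      values="Price", aggfunc="sum")

print(table)
```

The two cold soup rows (1.0 and 5.0) are summed into a single cell of 6.0.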


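Copyright belongs to the author. Please do not repost without permission. If this article violates any rules, you can contact the administrator to have it removed.

When reposting, please cite the address of this article: http://specialneedsforspecialkids.com/yun/38355.html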
