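Big Data and Cloud Computing Study Notes: Data Analysis (Part 1)

dunizb, published 2019-07-30 14:48 / 2766 reads

Python basics

Let's start with the basics.

Points to note

Slicing

When we take elements out of a list with a slice, the range is closed on the left and open on the right, i.e. [3,6): the element 7 at index 6 is not printed.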
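The original screenshot for this step is missing, so here is a minimal sketch of the half-open slice. The list `numbers` is a hypothetical example, chosen so that index 6 holds the element 7.

```python
# Hypothetical list: values chosen so that index 6 holds the element 7.
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Indices 3, 4 and 5 are included; index 6 is excluded.
sliced = numbers[3:6]
print(sliced)        # [4, 5, 6]
print(numbers[6])    # 7, the element just past the end of the slice
```

A handy consequence of the half-open convention is that `len(numbers[a:b])` equals `b - a` whenever both indices are inside the list.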

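Changing elements of a list

Adding and removing elements

Two ways of copying a list

Assigning list2 to y: when y changes, list2 changes too

Copying list2 into y: when y changes, list2 stays the same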
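The two cases above can be sketched as follows. The values are made up for illustration, and the slice copy is only one of several ways to copy a list.

```python
list2 = [1, 2, 3]

# Plain assignment: y is just another name for the same list object.
y = list2
y[0] = 100
alias_result = list(list2)   # list2 was changed through y

# Slice copy: y_copy is a new list, so list2 is unaffected.
list2 = [1, 2, 3]
y_copy = list2[:]            # list2.copy() or list(list2) behave the same way
y_copy[0] = 100
copy_result = list(list2)    # list2 is unchanged

print(alias_result)   # [100, 2, 3]
print(copy_result)    # [1, 2, 3]
```

Note that these copies are shallow: for nested lists the inner lists are still shared, and `copy.deepcopy` would be needed to separate them.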

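Deleting an instance attribute vs. deleting a dictionary key

a = {"a":1,"b":2}
del a["a"]        # remove the key "a" from the dictionary
a = classname()
del a.attrname    # remove the attribute attrname from the instance a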
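A runnable version of the same idea, as a small sketch. The class `Point` and its attributes are hypothetical stand-ins for `classname` and `attrname`.

```python
class Point:
    """Hypothetical class used only for this illustration."""

    def __init__(self):
        self.x = 1
        self.y = 2


d = {"a": 1, "b": 2}
del d["a"]    # removes the key "a"; deleting a missing key raises KeyError

p = Point()
del p.x       # removes the instance attribute x; a missing one raises AttributeError

print(d)                                   # {'b': 2}
print(hasattr(p, "x"), hasattr(p, "y"))    # False True
```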

with as

https://www.cnblogs.com/DswCn...
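As a minimal sketch of what `with ... as` does: the object returned by the context manager is bound to the name after `as`, and cleanup runs when the block exits. The file name `demo.txt` is an assumption for the example.

```python
import os
import tempfile

# Hypothetical file name inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")

    # The file object is bound to f and closed when the block exits,
    # even if an exception is raised inside it.
    with open(path, "w", encoding="utf-8") as f:
        f.write("hello")

    with open(path, encoding="utf-8") as f:
        content = f.read()

print(content)     # hello
print(f.closed)    # True
```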

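if __name__ == "__main__":

A Python file can be used in two ways:
first, it can be run directly as a script;
second, it can be imported into another Python script and called from there (module reuse).
The purpose of if __name__ == "__main__":
is to control which code runs in each of these two cases:
the code under if __name__ == "__main__": runs only in the first case (the file is executed directly as a script),
and does not run when the file is imported into another script...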
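A minimal sketch of the pattern. The function name `main` is just a common convention, not something the language requires.

```python
def main():
    # Work that should only happen when the file is run as a script.
    return "running as a script"


# __name__ is "__main__" when the file is executed directly,
# and the module's own name when it is imported.
if __name__ == "__main__":
    print(main())
```

Importing this file from another script makes `main` available without printing anything.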

Functions / methods, regular expressions

See here for the basics

import re
line = "jwxddxsw33"
if line == "jxdxsw33":
    print("yep")
else:
    print("no")

# ^ anchors the match at the start of the string
regex_str = "^j.*"
if re.match(regex_str, line):
    print("yes")
# $ anchors the match at the end of the string
regex_str1 = "^j.*3$"
if re.match(regex_str1, line):
    print("yes")

# "^j.3$" only matches a three-character string, so nothing is printed here
regex_str1 = "^j.3$"
if re.match(regex_str1, line):
    print("yes")
# Greedy matching: the leading .* consumes as much as it can, so the group starts at the last d
regex_str2 = ".*(d.*w).*"
match_obj = re.match(regex_str2, line)
if match_obj:
    print(match_obj.group(1))
# Non-greedy matching
# The ? makes the leading .* stop at the first d
regex_str3 = ".*?(d.*w).*"
match_obj = re.match(regex_str3, line)
if match_obj:
    print(match_obj.group(1))
# * means zero or more repetitions
# ? after a quantifier switches it to non-greedy mode
# + means at least one repetition, so .+ is any character occurring at least once
line1 = "jxxxxxxdxsssssswwwwjjjww123"
regex_str3 = ".*(w.+w).*"
match_obj = re.match(regex_str3, line1)
if match_obj:
    print(match_obj.group(1))
# {2} fixes how many times the preceding character occurs; {2,} means 2 or more times; {2,5} means at least 2 and at most 5 times
line2 = "jxxxxxxdxsssssswwaawwjjjww123"
regex_str3 = ".*(w.{3}w).*"
match_obj = re.match(regex_str3, line2)
if match_obj:
    print(match_obj.group(1))

line2 = "jxxxxxxdxsssssswwaawwjjjww123"
regex_str3 = ".*(w.{2}w).*"
match_obj = re.match(regex_str3, line2)
if match_obj:
    print(match_obj.group(1))

line2 = "jxxxxxxdxsssssswbwaawwjjjww123"
regex_str3 = ".*(w.{5,}w).*"
match_obj = re.match(regex_str3, line2)
if match_obj:
    print(match_obj.group(1))

# | means "or"

line3 = "jx123"
regex_str4 = "((jx|jxjx)123)"
match_obj = re.match(regex_str4, line3)
if match_obj:
    print(match_obj.group(1))
    print(match_obj.group(2))
# [] matches any one of the characters inside the brackets
line4 = "ixdxsw123"
regex_str4 = "([hijk]xdxsw123)"
match_obj = re.match(regex_str4, line4)
if match_obj:
    print(match_obj.group(1))
# [0-9]{9}: any digit from 0 to 9, repeated 9 times (nine digits)
line5 = "15955224326"
regex_str5 = "(1[234567][0-9]{9})"
match_obj = re.match(regex_str5, line5)
if match_obj:
    print(match_obj.group(1))
# [^1]{9}: any character except 1, repeated 9 times
line6 = "15955224326"
regex_str6 = "(1[234567][^1]{9})"
match_obj = re.match(regex_str6, line6)
if match_obj:
    print(match_obj.group(1))

# [.*]: inside square brackets, . and * stand for the literal characters . and *
line7 = "1.*59224326"
regex_str7 = "(1[.*][^1]{9})"
match_obj = re.match(regex_str7, line7)
if match_obj:
    print(match_obj.group(1))

# \s matches a whitespace character
line8 = "你 好"
regex_str8 = r"(你\s好)"
match_obj = re.match(regex_str8, line8)
if match_obj:
    print(match_obj.group(1))

# \S matches anything that is not whitespace
line9 = "你真好"
regex_str9 = r"(你\S好)"
match_obj = re.match(regex_str9, line9)
if match_obj:
    print(match_obj.group(1))

# \w matches a word character. Unlike ., for ASCII text it means [A-Za-z0-9_]
# (in Python 3 it also matches Unicode letters and digits unless re.ASCII is set)
line9 = "你adsfs好"
regex_str9 = r"(你\w\w\w\w\w好)"
match_obj = re.match(regex_str9, line9)
if match_obj:
    print(match_obj.group(1))

line10 = "你adsf_好"
regex_str10 = r"(你\w\w\w\w\w好)"
match_obj = re.match(regex_str10, line10)
if match_obj:
    print(match_obj.group(1))
# Uppercase \W is the opposite: any character that is not a word character
line11 = "你 好"
regex_str11 = r"(你\W好)"
match_obj = re.match(regex_str11, line11)
if match_obj:
    print(match_obj.group(1))

# Unicode ranges: [\u4E00-\u9FA5] matches Chinese characters
line12 = "鏡心的小樹屋"
regex_str12 = r"([\u4E00-\u9FA5]+)"
match_obj = re.match(regex_str12, line12)
if match_obj:
    print(match_obj.group(1))

print("----- greedy match -----")
line13 = "reading in 鏡心的小樹屋"
regex_str13 = r".*([\u4E00-\u9FA5]+樹屋)"
match_obj = re.match(regex_str13, line13)
if match_obj:
    print(match_obj.group(1))

print("----- non-greedy match -----")
line13 = "reading in 鏡心的小樹屋"
regex_str13 = r".*?([\u4E00-\u9FA5]+樹屋)"
match_obj = re.match(regex_str13, line13)
if match_obj:
    print(match_obj.group(1))

# \d matches a digit
line14 = "XXX出生于2011年"
regex_str14 = r".*(\d{4})年"
match_obj = re.match(regex_str14, line14)
if match_obj:
    print(match_obj.group(1))

regex_str15 = r".*?(\d+)年"
match_obj = re.match(regex_str15, line14)
if match_obj:
    print(match_obj.group(1))

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

###
# Exercise: write a regular expression that validates email addresses. Version 1 should accept addresses like:
# someone@gmail.com
# bill.gates@microsoft.com
###

import re
addr = "someone@gmail.com"
addr2 = "bill.gates@microsoft.com"
def is_valid_email(addr):
    # Anchored at both ends so the whole string has to look like name@domain
    return re.match(r"^[a-zA-Z0-9_.]+@[a-zA-Z0-9.]+$", addr) is not None

print(is_valid_email(addr))
print(is_valid_email(addr2))

# Version 2 extracts the name from addresses that carry one:
# <Tom Paris> tom@voyager.org => Tom Paris
# bob@example.com => bob

addr3 = "<Tom Paris> tom@voyager.org"
addr4 = "bob@example.com"

def name_of_email(addr):
    # Group 2 is the display name if one is present, otherwise the local part before @
    r = re.compile(r"^(<?)([\w\s]*)(>?)([\w\s]*)@([\w.]*)$")
    m = r.match(addr)
    if not m:
        return None
    return m.group(2)

print(name_of_email(addr3))
print(name_of_email(addr4))

Case study

Finding the most frequent word in a text

text = "the clown ran after the car and the car ran into the tent and the tent fell down on the clown and the car"
words = text.split()
print(words)

for word in words:  # print every word, one per line
    print(word)


# Step 1: build the list of distinct words (i.e. remove duplicates)
unique_words = list()  # start with an empty list
for word in words:
    if word not in unique_words:  # use in to test whether an element is already in the list
        unique_words.append(word)
print(unique_words)


# Step 2: initialise the list of word counts

# [e] * n is a quick way to create a list of n copies of e
counts = [0] * len(unique_words)
print(counts)

# Step 3: count the occurrences
for word in words:
    index = unique_words.index(word)

    counts[index] = counts[index] + 1
    print(counts[index])
print(counts)
# Step 4: find the highest count and the word it belongs to
bigcount = None  # None means "no value yet"; initialise bigcount
bigword = None

for i in range(len(counts)):
    if bigcount is None or counts[i] > bigcount:
        bigword = unique_words[i]
        bigcount = counts[i]
print(bigword,bigcount)

Using a dictionary instead:

# Case study revisited: find the most frequent word in a text

text = """the clown ran after the car and the car ran into the tent 
        and the tent fell down on the clown and the car"""
words = text.split()  # get the list of words

# A dictionary simplifies the steps considerably
# Build the word -> count dictionary
counts = dict()  # start with an empty dictionary
for word in words:
    counts[word] = counts.get(word, 0) + 1  # get needs the default value 0 so that a word seen for the first time ends up with count 1
print(counts)

# Look up the most frequent word in the dictionary
bigcount = None
bigword = None
for word,count in counts.items():
    if bigcount is None or count > bigcount:
        bigword = word
        bigcount = count

print(bigword, bigcount)
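For comparison, the standard library's `collections.Counter` does the counting and the lookup in two calls. This is an alternative to the hand-written loops above, not the approach the original article uses.

```python
from collections import Counter

text = ("the clown ran after the car and the car ran into the tent "
        "and the tent fell down on the clown and the car")

# Counter builds the word -> count mapping in one step.
counts = Counter(text.split())

# most_common(1) returns a list with the single (word, count) pair of highest count.
bigword, bigcount = counts.most_common(1)[0]
print(bigword, bigcount)   # the 7
```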

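Writing a custom weekly pay calculator function

# Use the input() function to read text typed on the keyboard
# a = input("Enter some text: ")
# print("You entered:", a)

def salary_calculator():  # a function with no parameters
    user = ""  # initialise user as an empty string
    print("---- Pay calculator ----")

    while True:
        user = input("\nEnter your name, or enter 0 to end the report: ")

        if user == "0":
            print("End of report")
            break
        else:
            hours = float(input("Enter the number of hours worked: "))
            payrate = float(input("Enter your hourly pay rate: ￥"))

            if hours <= 40:
                print("Employee name:", user)
                print("Overtime hours: 0")
                print("Overtime pay: ￥0.00")
                regularpay = round(hours * payrate, 2)  # round keeps two decimal places
                print("Gross pay: ￥" + str(regularpay))
            else:
                overtimehours = round(hours - 40, 2)
                print("Employee name: " + user)
                print("Overtime hours: " + str(overtimehours))
                regularpay = round(40 * payrate, 2)
                overtimerate = round(payrate * 1.5, 2)
                overtimepay = round(overtimehours * overtimerate, 2)
                grosspay = round(regularpay + overtimepay, 2)
                print("Regular pay: ￥" + str(regularpay))
                print("Overtime pay: ￥" + str(overtimepay))
                print("Gross pay: ￥" + str(grosspay))

# Call salary_calculator
salary_calculator()

In this example, watch out for a small pitfall with Python's round function: it rounds ties to the nearest even digit and operates on binary floating-point values, so some results differ from schoolbook rounding.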
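A short sketch of the round pitfall. The helper `round_half_up` is a hypothetical name introduced here; it uses the standard `decimal` module to get schoolbook rounding when that is what a pay slip needs.

```python
from decimal import Decimal, ROUND_HALF_UP

# Ties go to the nearest even integer ("banker's rounding").
print(round(0.5), round(1.5), round(2.5))   # 0 2 2

# 2.675 cannot be represented exactly in binary and is stored slightly below 2.675.
print(round(2.675, 2))                      # 2.67


def round_half_up(value, places=2):
    """Hypothetical helper: round half up using decimal arithmetic."""
    quantum = Decimal(10) ** -places        # e.g. Decimal("0.01") for places=2
    return Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)


print(round_half_up(2.675))                 # 2.68
```

Converting through `str(value)` is the design choice that matters: it uses the short decimal representation of the float rather than its exact binary value.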

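Data structures, functions, conditionals and loops; package management

See awesome-python for a list of popular Python packages

Numpy: array processing / numerical computing extension

ndarray: a multidimensional array object

Data processing with arrays

File input and output for arrays

Multidimensional operations

Linear algebra

Random number generation

Random walks

Advanced Numpy

Internals of the ndarray object

Advanced array manipulation

Broadcasting

Advanced ufunc usage

Structured and record arrays

More on sorting

NumPy's matrix class

Advanced array input and output

Matplotlib: data visualization

Pandas: data analysis

pandas data structures

Basic functionality

Summarizing and computing descriptive statistics

Handling missing data

Hierarchical indexing

Aggregation and grouping

Basic principles of logistic regression

jupyter

pip3 install jupyter
jupyter notebook

scipy

Descriptive statistics

Scikit-learn: data mining, machine learning

keras: artificial neural networks

tensorflow: neural networks

Install pip, the Python package management tool, which is mainly used to install packages from PyPI

Installation tutorial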

sudo apt-get install python3-pip
pip3 install numpy
pip3 install scipy
pip3 install matplotlib

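Or download the installation script get-pip.py

How to import packages

Python is object-oriented and a module acts as a namespace, so the recommended import style is still to import the package and access its members through it: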

import numpy
numpy.array([1,2,3])

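Data storage, data manipulation, generating data

Generate a two-dimensional array with 5000 elements, where each element holds a height and a weight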

import numpy as np
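The rest of the original snippet is missing, so the following is only a sketch of one way to do it. The means and standard deviations (1.70 m, 65 kg, and their spreads) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seeded so that the run is repeatable

# Assumed distributions, for illustration only.
height = rng.normal(loc=1.70, scale=0.10, size=5000)   # metres
weight = rng.normal(loc=65.0, scale=10.0, size=5000)   # kilograms

# Stack the two 1-D arrays as columns: one row per person.
people = np.column_stack((height, weight))

print(people.shape)   # (5000, 2)
print(people[:3])     # first three (height, weight) pairs
```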

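Generate 1000 longitude/latitude positions close to (117, 32) and write them out as CSV

import pandas as pd
import numpy as np

# Any number of arrays of equal length
lng = np.random.normal(117,0.20,1000)

lat = np.random.normal(32.00,0.20,1000)

# The dictionary keys become the column names in the CSV
dataframe = pd.DataFrame({"lng":lng,"lat":lat})


# Save the DataFrame as CSV; index controls whether row labels are written (default True)
# The data/ directory must already exist
dataframe.to_csv("data/lng-lat.csv",index = False, sep="," )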

Common numpy operations

# encoding=utf-8
import numpy as np
def main():
    lst = [[1,3,5],[2,4,6]]
    print(type(lst))
    np_lst = np.array(lst)
    print(type(np_lst))
    # A numpy.array holds a single data type
    # The dtype can be set explicitly
    # Available dtypes include: bool, int8/16/32/64, uint8/16/32/64, float16/32/64, complex64/128
    np_lst = np.array(lst,dtype=np.float64)  # the np.float alias was removed in NumPy 1.24

    print(np_lst.shape)
    print(np_lst.ndim)  # number of dimensions
    print(np_lst.dtype)  # data type
    print(np_lst.itemsize)  # size of each element in bytes
    print(np_lst.size)  # total number of elements

    # numpy array
    print(np.zeros([2,4]))  # a 2 x 4 array filled with zeros
    print(np.ones([3,5]))

    print("--------- random numbers: rand -------")
    print(np.random.rand(2,4))  # rand draws uniform random numbers in [0, 1); here a 2 x 4 array
    print(np.random.rand())
    print("--------- random numbers: randint -------")
    print(np.random.randint(1,10))  # a random integer from 1 to 9 (the upper bound is excluded)
    print(np.random.randint(1,10,3))  # three random integers from 1 to 9
    print("--------- random numbers: randn, standard normal distribution -------")
    print(np.random.randn(2,4))  # a 2 x 4 array of standard normal random numbers
    print("--------- random numbers: choice -------")
    print(np.random.choice([10,20,30]))  # pick one value at random from 10, 20, 30
    print("--------- distributions -------")
    print(np.random.beta(1,10,100))  # 100 samples from a beta distribution
if __name__ == "__main__":
    main()

Examples of common functions

Compute the mean of every attribute of the wine data (that is, the mean of every column)
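The wine data file itself is not included in the article, so this sketch uses a tiny made-up array in its place. With the real file, something like `np.loadtxt` with the appropriate delimiter would produce the 2-D array.

```python
import numpy as np

# Hypothetical stand-in for the wine data: 3 samples (rows) x 4 attributes (columns).
wine = np.array([
    [7.4, 0.70, 1.9, 9.4],
    [7.8, 0.88, 2.6, 9.8],
    [7.8, 0.76, 2.3, 9.8],
])

# axis=0 collapses the rows, leaving one mean per column.
column_means = wine.mean(axis=0)
print(column_means)
```

Passing `axis=1` instead would give one mean per sample (row).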

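Data analysis tools: data visualization

Exploring data

Presenting data

Data ---> story

matplotlib plotting basics

Plotting function curves

Setting figure details

Case study: visualizing sales records

Bar charts

Drawing multiple plots

Pie charts

Scatter plots

Histograms

seaborn: a data visualization package

Scatter plots for categorical data

Box plots for categorical data

Multivariate plots

For more, see Data Visualization

Installing matplotlib

Note that you may run into this error:

ImportError: No module named "_tkinter", please install the python3-tk package

In that case you need to install python3-tk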
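If installing python3-tk is not an option (for example on a server without a display), a non-interactive backend such as Agg avoids the Tk dependency and writes the figure to a file instead. The sketch below plots a function curve that way; the output file name `sine_demo.png` is an assumption.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")              # select the non-interactive backend before importing pyplot
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Sine curve")
ax.legend()

# Hypothetical output location in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), "sine_demo.png")
fig.savefig(out_path)
plt.close(fig)

print(os.path.exists(out_path))
```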

More examples: line plots

Scatter plots & bar charts

Data analysis

pandas

High-level data operations

The DataFrame data structure

import pandas as pd
brics = pd.read_csv("/home/wyc/study/python_lession/python_lessions/數據分析/brics.csv",index_col = 0)

pandas basic operations


import numpy as np
import pandas as pd

def main():

    #Data Structure
    s = pd.Series([i*2 for i in range(1,11)])
    print(type(s))

    dates = pd.date_range("20170301",periods=8)
    df = pd.DataFrame(np.random.randn(8,5),index=dates,columns=list("ABCDE"))
    print(df)
    # basic

    print(df.head(3))
    print(df.tail(3))
    print(df.index)
    print(df.values)
    print(df.T)
    # print(df.sort_values(by="C"))  # the old df.sort(columns="C") no longer exists in current pandas
    print(df.sort_index(axis=1,ascending=False))
    print(df.describe())

    #select
    print(type(df["A"]))
    print(df[:3])
    print(df["20170301":"20170304"])
    print(df.loc[dates[0]])
    print(df.loc["20170301":"20170304",["B","D"]])
    print(df.at[dates[0],"C"])


    print(df.iloc[1:3,2:4])
    print(df.iloc[1,4])
    print(df.iat[1,4])

    print(df[(df.B>0) & (df.A<0)])  # combine both conditions in a single boolean mask
    print(df[df>0])
    print(df[df["E"].isin([1,2])])

    # Set
    s1 = pd.Series(list(range(10,18)),index = pd.date_range("20170301",periods=8))
    df["F"]= s1
    print(df)
    df.at[dates[0],"A"] = 0
    print(df)
    df.iat[1,1] = 1
    df.loc[:,"D"] = np.array([4]*len(df))
    print(df)

    df2 = df.copy()
    df2[df2>0] = -df2
    print(df2)

    # Missing Value
    df1 = df.reindex(index=dates[:4],columns = list("ABCD") + ["G"])
    df1.loc[dates[0]:dates[1],"G"]=1
    print(df1)
    print(df1.dropna())
    print(df1.fillna(value=1))

    # Statistic
    print(df.mean())
    print(df.var())

    s = pd.Series([1,2,4,np.nan,5,7,9,10],index=dates)
    print(s)
    print(s.shift(2))
    print(s.diff())
    print(s.value_counts())
    print(df.apply(np.cumsum))
    print(df.apply(lambda x:x.max()-x.min()))

    #Concat
    pieces = [df[:3],df[-3:]]
    print(pd.concat(pieces))

    left = pd.DataFrame({"key":["x","y"],"value":[1,2]})
    right = pd.DataFrame({"key":["x","z"],"value":[3,4]})
    print("LEFT",left)
    print("RIGHT", right)
    print(pd.merge(left,right,on="key",how="outer"))
    df3 = pd.DataFrame({"A": ["a","b","c","b"],"B":list(range(4))})
    print(df3.groupby("A").sum())



if __name__ == "__main__":
    main()

# First create a dictionary called gdp
gdp = {"country":["United States", "China", "Japan", "Germany", "United Kingdom"],
       "capital":["Washington, D.C.", "Beijing", "Tokyo", "Berlin", "London"],
       "population":[323, 1389, 127, 83, 66],
       "gdp":[19.42, 11.8, 4.84, 3.42, 2.5],
       "continent":["North America", "Asia", "Asia", "Europe", "Europe"]}

import pandas as pd
gdp_df = pd.DataFrame(gdp)
print(gdp_df)

# The index option adds custom row labels
# The columns option sets the order of the columns
gdp_df = pd.DataFrame(gdp, columns = ["country", "capital", "population", "gdp", "continent"],index = ["us", "cn", "jp", "de", "uk"])
print(gdp_df)

# Changing the row and column labels
# They can also be changed directly through index and columns
gdp_df.index=["US", "CN", "JP", "DE", "UK"]
gdp_df.columns = ["Country", "Capital", "Population", "GDP", "Continent"]
print(gdp_df)
# Add a rank column saying that these countries are in the top 5 by GDP
gdp_df["rank"] = "Top5 GDP"
# Add a land area variable, in millions of square kilometers (source: http://data.worldbank.org/)
gdp_df["Area"] = [9.15, 9.38, 0.37, 0.35, 0.24]
print(gdp_df)


# A very simple Series
series = pd.Series([2,4,5,7,3],index = ["a","b","c","d","e"])
print(series)
# Accessing a column with the dot operator returns a pandas Series
# Using dot access in the boolean filters below keeps the code short
# US,...,UK are the index labels
print(gdp_df.GDP)


# The index can be inspected directly
print(gdp_df.GDP.index)
# The type is pandas.core.series.Series
print(type(gdp_df.GDP))

# Returns a boolean Series; this is used heavily in DataFrame boolean indexing below
print(gdp_df.GDP > 4)

# A Series can also be seen as an ordered, fixed-length dictionary, and some dictionary operations work on it
gdp_dict = {"US": 19.42, "CN": 11.80, "JP": 4.84, "DE": 3.42, "UK": 2.5}
gdp_series = pd.Series(gdp_dict)
print(gdp_series)

# Check whether the label "US" is in gdp_series

print("US" in gdp_series)
# Select a column with the variable name plus [[]]
print(gdp_df[["Country"]])
# Several columns can be selected at once
print(gdp_df[["Country", "GDP"]])


# Using a single [] produces a Series
print(type(gdp_df["Country"]))
# Selecting rows works much like a 2-D array
# With [], slicing is the only way to select rows
print(gdp_df[2:5])  # the end index is not included!

# The loc method
# Above we selected rows by position. Can we select them by row label instead?
# loc is exactly the method that selects data by label
print(gdp_df.loc[["JP","DE"]])
# The example above selected all columns
# We can add the column labels we want
print(gdp_df.loc[["JP","DE"],["Country","GDP","Continent"]])

# To select all rows, use : in the row position
print(gdp_df.loc[:,["Country","GDP","Continent"]])

# iloc selects by position; this is equivalent to gdp_df.loc[["JP","DE"]]
print(gdp_df.iloc[[2,3]])

print(gdp_df.loc[["JP","DE"],["Country", "GDP", "Continent"]])
print(gdp_df.iloc[[2,3],[0,3,4]])

# Select the Asian countries; the next two statements give the same result
print(gdp_df[gdp_df.Continent == "Asia"])

print(gdp_df.loc[gdp_df.Continent == "Asia"])
# Select the European countries whose GDP is above 3 trillion US dollars
print(gdp_df[(gdp_df.Continent == "Europe") & (gdp_df.GDP > 3)])

Missing-value handling, data mining

Case study: the Iris data set
Let's look at the classic iris data:

The Iris flower data set, from the UCI Machine Learning Repository

It was originally collected by Edgar Anderson

Four features are used for the quantitative analysis of the samples: the length and width of the sepal and of the petal

#####
# Loading and inspecting the data
#####
import pandas as pd
# Store the column labels in a list
col_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
# Read the data and assign a label to every column
iris = pd.read_csv("data/iris.txt", names = col_names)

# Use head/tail to look at the beginning and end of the data

print(iris.head(10))

# Use the info method to get an overview of the data
iris.info()

# shape gives the number of rows and columns of the DataFrame
# iris has 150 observations and 5 variables
print(iris.shape)
# species is a categorical variable
# The unique method lists the species names in the Series
print(iris.species.unique())


# Count the number of observations per species
# This uses the value_counts method of the Series
print(iris.species.value_counts())

# Select the petal data, i.e. the columns petal_length and petal_width
# Method 1: use [[ ]]
petal = iris[["petal_length","petal_width"]]
print(petal.head())
# Method 2: use .loc[ ]
petal = iris.loc[:,["petal_length","petal_width"]]
print(petal.head())
# Method 3: use .iloc[ ]
petal = iris.iloc[:,2:4]
print(petal.head())

# Select the rows with index 5 to 10
# Method 1: use []
print(iris[5:11])
# Method 2: use .iloc[]
print(iris.iloc[5:11,:])

# Select the rows whose species is Iris-versicolor
versicolor = iris[iris.species == "Iris-versicolor"]
print(versicolor.head())


####
# Visualizing the data
####
# Scatter plot
import matplotlib.pyplot as plt
# Start with a scatter plot: petal length on the x axis, petal width on the y axis
# What do we observe?
iris.plot(kind = "scatter", x="petal_length", y="petal_width")
# plt.show()

# Use boolean indexing to get the data of each of the three species
setosa = iris[iris.species == "Iris-setosa"]
versicolor = iris[iris.species == "Iris-versicolor"]
virginica = iris[iris.species == "Iris-virginica"]

ax = setosa.plot(kind="scatter", x="petal_length", y="petal_width", color="Red", label="setosa", figsize=(10,6))
versicolor.plot(kind="scatter", x="petal_length", y="petal_width", color="Green", ax=ax, label="versicolor")
virginica.plot(kind="scatter", x="petal_length", y="petal_width", color="Orange", ax=ax, label="virginica")
plt.show()

# Box plot
# Use mean() to get the mean petal width
print(iris.petal_width.mean())
# Use median() to get the median petal width
print(iris.petal_width.median())
# describe summarizes the numeric variables
print(iris.describe())


# Draw a box plot of petal width
# A box plot shows the median, the quartiles, the maximum and the minimum of the data
iris.petal_width.plot(kind="box")
# plt.show()

# Draw one box plot of petal width per species
iris[["petal_width","species"]].boxplot(grid=False,by="species",figsize=(10,6))
# plt.show()

setosa.describe()

# For every species, what are the minimum and the mean of each attribute (sepal and petal length and width)? (Hint: use min and mean.)
print(iris.groupby(["species"]).agg(["min","mean"]))

# Count, per species, the observations with sepal length (sepal_length) greater than 6 cm.
# Method 1
print(iris[iris["sepal_length"]> 6].groupby("species").size())
# Method 2
def more_len(group,length=6):
    return len(group[group["sepal_length"] > length])
print(iris.groupby(["species"]).apply(more_len,6))

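Missing-value handling and pivot tables

Missing-value handling: the fillna() method in pandas

pandas uses NaN (not a number) to represent missing data. The main tools for handling it are:

dropna: drop the NaN data

fillna: fill in a default value

isnull: return an object of booleans showing which values are NaN and which are not

notnull: the negation of isnull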
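The four methods listed above can be sketched on a toy frame. The column names and values here are made up for illustration and are not the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "cabin": ["C85", None, None],
})

print(df.isnull())     # True where a value is missing
print(df.notnull())    # the opposite mask
print(df.dropna())     # keeps only rows without any missing value

# Fill each column with its own default: the median age and a placeholder string.
filled = df.fillna({"age": df["age"].median(), "cabin": "unknown"})
print(filled)
```

Passing a dictionary to `fillna` lets each column get a different fill value in one call.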

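Pivot tables: the pivot_table function in pandas

We use a case study on the Titanic data to illustrate both topics
Missing-value handling:

In real data some variables often have missing values.

Here, Cabin has more than 70% missing values, so we can consider dropping the variable altogether. -- delete a whole column

An important variable such as Age has about 20% missing values; we can consider filling them with the median. -- fill in the missing values

We generally do not recommend dropping rows that contain missing values, because the other, non-missing variables may carry useful information. -- delete rows with missing values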

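# Load the commonly used packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Read the data
titanic_df = pd.read_csv("data/titanic.csv")

# Look at the first five rows
print(titanic_df.head())

# Descriptive statistics
# describe shows the distribution of selected variables
# Survived is a 0-1 variable, so its mean is the fraction of passengers who survived; this trick is very handy
print(titanic_df[["Survived","Age","SibSp","Parch"]].describe())

# Use include=[object] to summarize the categorical variables
# (np.object was removed in NumPy 1.24; the built-in object works the same way)
# count: number of non-missing values
# unique: number of distinct values
# top: most frequent value
# freq: number of times the most frequent value occurs

print(titanic_df.describe(include=[object]))

# How are passengers distributed across the classes?
# Method 1: value_counts
# Look at the distribution across classes
# First class: 24%; second class: 21%; third class: 55%
# value_counts counts frequencies, len() gives the number of rows
print(titanic_df.Pclass.value_counts() / len(titanic_df))
# There are 891 passengers in total
# Age has 714 non-missing values and Cabin only 204. Handling missing values is covered below
titanic_df.info()

# Method 2: groupby
# sort_values sorts the result
print((titanic_df.groupby("Pclass").agg("size")/len(titanic_df)).sort_values(ascending=False))

# Fill in the missing values in the age data
# Simplest option: fill with the median age of all passengers
# Before doing so, look at the statistics of the Age column
print(titanic_df.Age.describe())

# Reload the original data
titanic_df=pd.read_csv("data/titanic.csv")

# Compute the median age of all passengers
age_median1 = titanic_df.Age.median()

# Fill the missing values with fillna and assign the result back to the column
titanic_df["Age"] = titanic_df["Age"].fillna(age_median1)
# Look at the statistics of the Age column
print(titanic_df.Age.describe())
#print(titanic_df.info())

# Take sex into account: fill with the median age of male and female passengers separately
# Reload the original data
titanic_df=pd.read_csv("data/titanic.csv")
# Group by sex and compute the median age; the result is a Series indexed by Sex
age_median2 = titanic_df.groupby("Sex").Age.median()
# Make Sex the index
titanic_df.set_index("Sex",inplace=True)
# Fill the missing values with fillna; values are matched on the index
titanic_df["Age"] = titanic_df["Age"].fillna(age_median2)
# Reset the index, i.e. turn Sex back into a column
titanic_df.reset_index(inplace=True)
# Look at the statistics of the Age column
print(titanic_df.Age.describe())

# Take both sex and passenger class into account

# Reload the original data
titanic_df=pd.read_csv("data/titanic.csv")
# Group by class and sex and compute the median age; the result is a Series indexed by Pclass, Sex
age_median3 = titanic_df.groupby(["Pclass", "Sex"]).Age.median()
# Make Pclass and Sex the index; inplace=True modifies titanic_df directly
titanic_df.set_index(["Pclass","Sex"], inplace=True)
print(titanic_df)

# Fill the missing values with fillna; values are matched on the index
titanic_df["Age"] = titanic_df["Age"].fillna(age_median3)
# Reset the index, i.e. turn Pclass and Sex back into columns
titanic_df.reset_index(inplace=True)

# Look at the statistics of the Age column
print(titanic_df.Age.describe())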

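Discretizing continuous variables

Discretizing a continuous variable is a common technique in modeling

Discretization means splitting the range of a variable into several smaller intervals and representing all observations that fall into the same interval by the same symbol

Take age as an example: the minimum is 0.42 (an infant) and the maximum is 80. If we want to create a variable with five levels, we can use the cut or the qcut function

cut splits the age range into 5 intervals of equal width, whereas qcut chooses the interval boundaries so that every interval contains the same number of observations (five equal-sized groups). The demonstration here uses cut.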
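The difference between the two functions shows up clearly on a small sample. The ten ages below are hypothetical, picked so that they span the same 0.42 to 80 range mentioned above.

```python
import pandas as pd

# Hypothetical ages spanning 0.42 to 80.
ages = pd.Series([0.42, 5, 18, 22, 25, 30, 38, 45, 62, 80])

equal_width = pd.cut(ages, 5)    # five intervals of (almost) equal width
equal_count = pd.qcut(ages, 5)   # five intervals holding the same number of observations

# Number of observations per interval, listed in interval order.
print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

With cut the interval widths match but the counts differ; with qcut every interval holds two of the ten ages while the widths vary.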



###
# Which factors determine the probability of survival?
###

# Passenger class and survival probability
# Compute the survival probability of each class
# Method 1: the classic group / aggregate / compute pattern
# Note: Survived is a 0-1 variable, so its mean is the fraction of survivors
print(titanic_df[["Pclass", "Survived"]].groupby("Pclass").mean()
    .sort_values(by="Survived", ascending=False))

# Method 2: the pivot_table function does the same job (new in this lesson)
# values: the variable that is aggregated; here we apply the mean to it
# index: the variable used for grouping
# aggfunc: the function to apply
print(titanic_df.pivot_table(values="Survived", index="Pclass", aggfunc="mean"))

# Bar chart of class against survival probability
# sns.barplot draws the bars; the y axis shows the point estimate of the mean of Survived
# (seaborn 0.12 and later prefer errorbar=None over ci=None)
#sns.barplot(data=titanic_df,x="Pclass",y="Survived",ci=None)
# plt.show()

#####
# Sex and survival probability
#####
# Method 1: groupby
print(titanic_df[["Sex", "Survived"]].groupby("Sex").mean()
    .sort_values(by="Survived", ascending=False))
# Method 2: pivot_table
print(titanic_df.pivot_table(values="Survived",index="Sex",aggfunc="mean"))

# Bar chart
#sns.barplot(data=titanic_df,x="Sex",y="Survived",ci=None)
#plt.show()


#####
# Class and sex together, against survival probability
#####
# Method 1: groupby
print(titanic_df[["Pclass","Sex", "Survived"]].groupby(["Pclass", "Sex"]).mean())

# Method 2: pivot_table
print(titanic_df.pivot_table(values="Survived", index=["Pclass", "Sex"], aggfunc="mean"))

# Method 3: pivot_table
# columns names a second categorical variable, which is laid out across the columns instead of down the rows; hence the name columns
print(titanic_df.pivot_table(values="Survived",index="Pclass",columns="Sex",aggfunc="mean"))

# Bar chart with sns.barplot
#sns.barplot(data=titanic_df,x="Pclass",y="Survived",hue="Sex",ci=None)
# plt.show()

# Line chart with sns.pointplot
sns.pointplot(data=titanic_df,x="Pclass",y="Survived",hue="Sex",ci=None)
#plt.show()

####
# Age and survival
####
# Unlike the categorical variables class and sex above, age is a continuous variable

# Histograms of the age distribution for survivors and victims
# FacetGrid().map() from seaborn quickly produces good-looking figures
# col="Survived" puts the plots for survivors and victims side by side in one row
# density=True replaces the normed=True argument, which was removed from matplotlib
g = sns.FacetGrid(titanic_df,col="Survived")
g.map(plt.hist,"Age",bins=20,density=True)
# plt.show()


###
# Discretizing a continuous variable
###
# We use the cut function
# Every interval has the same width, about 16 years

titanic_df["AgeBand"] = pd.cut(titanic_df["Age"],5)
print(titanic_df.head())

# Number of passengers in each age interval
# Method 1: value_counts(); sort=False leaves the result unsorted
print(titanic_df.AgeBand.value_counts(sort=False))

# Method 2: pivot_table
print(titanic_df.pivot_table(values="Survived",index="AgeBand",aggfunc="count"))

# Survival rate of each age interval
print(titanic_df.pivot_table(values="Survived",index="AgeBand",aggfunc="mean"))
sns.barplot(data=titanic_df,x="AgeBand",y="Survived",ci=None)
plt.xticks(rotation=60)
plt.show()


####
# Age and sex against survival probability
####
# Survival probability of men and women in each age interval
print(titanic_df.pivot_table(values="Survived",index="AgeBand", columns="Sex", aggfunc="mean"))

sns.pointplot(data=titanic_df, x="AgeBand", y="Survived", hue="Sex", ci=None)
plt.xticks(rotation=60)

plt.show()

####
# Age, class and sex against survival probability
####
print(titanic_df.pivot_table(values="Survived",index="AgeBand", columns=["Sex", "Pclass"], aggfunc="mean"))



# Recap: sns.pointplot of class and sex against survival probability
sns.pointplot(data=titanic_df, x="Pclass", y="Survived", hue="Sex", ci=None)

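Artificial neural networks

https://keras.io

Machine learning: feature engineering

What exactly is feature engineering?

Case study: demand for shared bikes
Feature engineering

Data and features set the upper limit of what machine learning can achieve; a good model merely approaches that limit

Our goal is to extract as much useful information as possible from the raw data. Raw data often cannot be used directly as model variables.

Feature engineering is the process of using domain knowledge about the data to create features that let machine learning algorithms perform at their best.

Handling date variables

Take datetime as an example: this feature contains two important pieces of information, the date and the time of day. From the date we can further derive the corresponding month and day of the week.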

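# Which factors determine the number of rentals?
# Import the data analysis packages
import numpy as np
import pandas as pd

# Import the plotting packages
import matplotlib.pyplot as plt
import seaborn as sns

# Import the tools for handling date and time variables
import calendar
from datetime import datetime

# Read the data
BikeData = pd.read_csv("data/bike.csv")


#####
# Check the size of the data
# Look at the first / last few rows
# Check the data types and missing values
####
# Step 1: check the size of the data

print(BikeData.shape)

# Step 2: look at the first 10 rows
print(BikeData.head(10))


# Step 3: check the data types and missing values
# Most variables are integers; temperature and wind speed are floats
# datetime has type object; we process it further below
# There are no missing values!
BikeData.info()


####
# Handling the date variable
####

# Take one element of datetime (index 1) as an example. It is a string, so we can take it apart with split
# Date + timestamp is a very common data format
ex = BikeData.datetime[1]
print(ex)

print(type(ex))

# Use split to take the string apart
ex.split()

# Get the date part
ex.split()[0]

# First get the date: define a function that uses split to separate date + timestamp into date and time
def get_date(x):
    return(x.split()[0])

# Use the pandas apply method to run get_date on the datetime column
BikeData["date"] = BikeData.datetime.apply(get_date)

print(BikeData.head())

# Derive the rental hour (24-hour clock)
# To get the hour we have to split further
print(ex.split()[1])
# ":" is the separator
print(ex.split()[1].split(":")[0])

# Wrap the above in a function get_hour and apply it to the datetime feature
def get_hour(x):
    return (x.split()[1].split(":")[0])
# Use apply to get the hour for the whole column
BikeData["hour"] = BikeData.datetime.apply(get_hour)

print(BikeData.head())

####
# Derive the day of the week for each date
####
# First look at day_name from calendar, which lists Monday through Sunday
print(calendar.day_name[:])

# Get the date as a string
dateString = ex.split()[0]

# Use the strptime function of datetime to convert the string into a datetime object
# Note that datetime here is the class imported from the datetime module, not the column name in our DataFrame
# "%Y-%m-%d" says that the input is ordered year, month, day; other data may use month, day, year
print(dateString)
dateDT = datetime.strptime(dateString,"%Y-%m-%d")
print(dateDT)
print(type(dateDT))

# Then use the weekday method to get the day of the week for the date
# It is an integer from 0 to 6: Monday is 0 and Sunday is 6
week_day = dateDT.weekday()

print(week_day)
# Map the weekday number to its name
print(calendar.day_name[week_day])


# Now combine the steps above into one function that returns the weekday
def get_weekday(dateString):
    week_day = datetime.strptime(dateString,"%Y-%m-%d").weekday()
    return (calendar.day_name[week_day])

# Use apply to get the weekday for the whole date column
BikeData["weekday"] = BikeData.date.apply(get_weekday)

print(BikeData.head())


####
# Derive the month for each date
####

# Following the same pattern, we can extract the month of each date
# Note: month is an attribute, not a function, so no parentheses are needed

def get_month(dateString):
    return (datetime.strptime(dateString,"%Y-%m-%d").month)
# Use apply to get the month for the whole date column
BikeData["month"] = BikeData.date.apply(get_month)
print(BikeData.head())

####
# Data visualization example
####

# Box plot of the number of rentals, and box plots of rentals by hour of day (24 hours)
# Set the figure size
fig = plt.figure(figsize=(18,5))

# Add the first subplot
# Box plot of the number of rentals
ax1 = fig.add_subplot(121)
sns.boxplot(data=BikeData,y="count")
ax1.set(ylabel="Count",title="Box Plot On Count")


# Add the second subplot
# Box plots of the number of rentals by hour
# Business insight: how does the number of rentals change over the day?
ax2 = fig.add_subplot(122)
sns.boxplot(data=BikeData,y="count",x="hour")
ax2.set(xlabel="Hour",ylabel="Count",title="Box Plot On Count Across Hours")
plt.show()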
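The date features derived above with split and strptime can also be obtained through pandas' own datetime support. This is an alternative sketch, not the method the article uses; the three timestamps are made up, and only the "YYYY-MM-DD HH:MM:SS" layout is taken from the walkthrough above.

```python
import pandas as pd

# Hypothetical timestamps in the same layout as the datetime column above.
bike = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-01-01 13:00:00", "2012-07-04 08:00:00"],
})

# Parse the strings once, then read the parts through the .dt accessor.
ts = pd.to_datetime(bike["datetime"], format="%Y-%m-%d %H:%M:%S")

bike["date"] = ts.dt.strftime("%Y-%m-%d")
bike["hour"] = ts.dt.hour          # integers here, whereas get_hour returns strings
bike["weekday"] = ts.dt.day_name()
bike["month"] = ts.dt.month

print(bike)
```

Parsing the column once is usually faster than calling strptime row by row with apply, and the hour comes out as a number that sorts correctly on a plot axis.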

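Machine learning

Machine learning is a branch of artificial intelligence whose goal is to solve problems by using algorithms to build models from existing data (learning).

Machine learning is an interdisciplinary field that draws on probability and statistics, optimization, computer programming and more.

It is used very widely: from predicting credit card default risk and the five-year survival probability of cancer patients to self-driving cars, machine learning is involved.

It receives a lot of attention: in decision analysis, people increasingly use a quantitative approach to judge how good a decision is.

Supervised learning:

Supervised learning: learn a function from a given training data set, so that when new data arrives the function can be used to predict the result. The training data for supervised learning must contain both inputs and outputs, in other words features and targets.

Supervised learning can be further divided into two main kinds of problems: prediction and classification. House price prediction is a typical prediction problem, because the target, the house price, is a continuous variable. Credit card default prediction is a typical classification problem, because the target, whether the customer defaults, is a categorical variable.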
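To make "learn a function from training data, then predict for new data" concrete, here is a deliberately tiny prediction sketch. The data is invented (price generated as 2 * area + 10), and the model is a plain least-squares line fitted with `np.polyfit`, standing in for whatever model a real project would use.

```python
import numpy as np

# Hypothetical training data: floor area -> price, generated from price = 2 * area + 10.
area = np.array([50.0, 80.0, 100.0, 120.0])    # features (input)
price = 2.0 * area + 10.0                      # targets (output)

# "Learning" here is a least-squares fit of a straight line to the training pairs.
slope, intercept = np.polyfit(area, price, deg=1)

# Prediction for a new input that was not in the training set.
new_area = 90.0
predicted_price = slope * new_area + intercept
print(round(float(predicted_price), 2))
```

A classification problem has the same shape, except that the target is a category (default / no default) rather than a number.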

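Unsupervised learning

Unsupervised learning: the training set has no human-annotated results. We look for patterns in the input data itself.

Examples of unsupervised learning include image clustering, grouping articles by topic, gene sequence analysis, and dimensionality reduction of high-dimensional data.

Case study: house prices in the Boston area
Note that the Boston housing data was one of the toy datasets in scikit-learn and could be loaded directly with datasets.load_boston(). That function was deprecated in scikit-learn 1.0 and removed in 1.2, so it is only available in older versions.

Learning resources

Machine learning tutorials and notes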
https://www.datacamp.com/
http://matplotlib.org/2.1.0/g...
https://www.kesci.com/
https://keras.io

Competitions

https://www.kaggle.com/
Tianchi big data competitions vs. Kaggle vs. DataCastle: which is better?
Tianchi beginner competitions

