

Crawler Performance: NodeJS vs Python

EastWoodYang / 3362 reads

Abstract: The results show that these threads did not actually run concurrently; they executed one after another, defeating the purpose of multithreading. In summary, based on my own knowledge and this experiment, my conclusion is: with multithreading Python's download speed can keep up, but at parsing web pages it is not as fast; after all, JS was born for the web, and a complex crawler can hardly do everything with string search.

前言

I had long heard how good Node.js's asynchronous strategy is, how great its I/O is... in short, all kinds of praise. Today I decided to pit Node.js against Python. A project that shows off an async strategy and I/O strength is, I think, none other than a crawler, so let's settle this with a crawler project.

The crawler project

Zhongchou.com's list of active crowdfunding projects, http://www.zhongchou.com/brow..., is our example site: we crawl every project currently raising funds, collect the URL of each project's detail page, and save the URLs to a txt file.

Hands-on comparison: Python (naive version)
# -*- coding:utf-8 -*-
"""
Created on 20160827
@author: qiukang
"""
import requests,time
from BeautifulSoup import BeautifulSoup    # HTML

# request headers
headers = {
   "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
   "Accept-Encoding":"gzip, deflate, sdch",
   "Accept-Language":"zh-CN,zh;q=0.8",
   "Connection":"keep-alive",
   "Host":"www.zhongchou.com",
   "Upgrade-Insecure-Requests":"1",
   "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36"
}

# collect the list of project detail-page URLs
def getItems(allpage):
    no = 0
    items = open("pystandard.txt","a")
    for page in range(allpage):
        if page==0:
            url = "http://www.zhongchou.com/browse/di"
        else:
            url = "http://www.zhongchou.com/browse/di-p"+str(page+1)
        # print url #①
        r1 = requests.get(url,headers=headers)
        html = r1.text.encode("utf8")
        soup = BeautifulSoup(html)
        lists = soup.findAll(attrs={"class":"ssCardItem"})
        for i in range(len(lists)):
            href = lists[i].a["href"]
            items.write(href+"\n")
            no +=1
    items.close()
    return no
    
if __name__ == "__main__":
    start = time.clock()
    allpage = 30
    no = getItems(allpage)
    end = time.clock()
    print("it takes %s Seconds to get %s items "%(end-start,no))

Results of 5 runs:

 it takes 48.1727159614 Seconds to get 720 items 
 it takes 45.3397999415 Seconds to get 720 items  
 it takes 44.4811429862 Seconds to get 720 items 
 it takes 44.4619293082 Seconds to get 720 items
 it takes 46.669706593 Seconds to get 720 items 
Python, multithreaded version
# -*- coding:utf-8 -*-
"""
Created on 20160827
@author: qiukang
"""
import requests,time,threading
from BeautifulSoup import BeautifulSoup    # HTML

# request headers
headers = {
   "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
   "Accept-Encoding":"gzip, deflate, sdch",
   "Accept-Language":"zh-CN,zh;q=0.8",
   "Connection":"keep-alive",
   "Host":"www.zhongchou.com",
   "Upgrade-Insecure-Requests":"1",
   "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36"
}

items = open("pymulti.txt","a")
no = 0
lock = threading.Lock()

# collect the list of project detail-page URLs
def getItems(urllist):
    # print urllist  #①
    global items,no,lock
    for url in urllist:
        r1 = requests.get(url,headers=headers)
        html = r1.text.encode("utf8")
        soup = BeautifulSoup(html)
        lists = soup.findAll(attrs={"class":"ssCardItem"})
        for i in range(len(lists)):
            href = lists[i].a["href"]
            lock.acquire()
            items.write(href+"\n")
            no +=1
            # print no
            lock.release()
    
if __name__ == "__main__":
    start = time.clock()
    allpage = 30
    allthread = 30
    per = (int)(allpage/allthread)
    urllist = []
    ths = []
    for page in range(allpage):
        if page==0:
            url = "http://www.zhongchou.com/browse/di"
        else:
            url = "http://www.zhongchou.com/browse/di-p"+str(page+1)
        urllist.append(url)
    for i in range(allthread):
        # print urllist[i*(per):(i+1)*(per)]
        th = threading.Thread(target = getItems,args= (urllist[i*(per):(i+1)*(per)],))
        th.start()
        th.join()
    items.close()
    end = time.clock()
    print("it takes %s Seconds to get %s items "%(end-start,no))
    

Results of 5 runs:

it takes 45.5222291114 Seconds to get 720 items 
it takes 46.7097831417 Seconds to get 720 items
it takes 45.5334646156 Seconds to get 720 items 
it takes 48.0242797553 Seconds to get 720 items
it takes 44.804855018 Seconds to get 720 items  

This multithreaded version shows no advantage. Toggling the comment at #① reveals that this so-called multithreading actually ran as a single thread.

Python improved: single-threaded

First, let's improve the HTML-parsing step. Analysis shows that

lists = soup.findAll("a",attrs={"class":"siteCardICH3"})

is better than

lists = soup.findAll(attrs={"class":"ssCardItem"})

because it finds the a tags directly instead of first finding the div and then the a inside it.
Running the experiment 5 times after this change shows an improvement:

it takes 41.0018861912 Seconds to get 720 items 
it takes 42.0260390497 Seconds to get 720 items
it takes 42.249635988 Seconds to get 720 items 
it takes 41.295524133 Seconds to get 720 items 
it takes 42.9022894154 Seconds to get 720 items 
Multithreaded

Change getItems(urllist) to getItems(urllist,thno), and bracket the function with print thno," begin at",time.clock() at its start and print thno," end at",time.clock() at its end. The result:

0  begin at 0.00100631078628
0  end at 1.28625832936
1  begin at 1.28703230691
1  end at 2.61739476075
2  begin at 2.61801291642
2  end at 3.92514717937
3  begin at 3.9255829208
3  end at 5.38870235361
4  begin at 5.38921134066
4  end at 6.670658786
5  begin at 6.67125734731
5  end at 8.01520989534
6  begin at 8.01566383155
6  end at 9.42006780585
7  begin at 9.42053340537
7  end at 11.0386755513
8  begin at 11.0391565464
8  end at 12.421359168
9  begin at 12.4218294329
9  end at 13.9932716671
10  begin at 13.9939957256
10  end at 15.3535799145
11  begin at 15.3540870354
11  end at 16.6968289314
12  begin at 16.6972665389
12  end at 17.9798803157
13  begin at 17.9804714125
13  end at 19.326706238
14  begin at 19.3271438455
14  end at 20.8744308886
15  begin at 20.8751017624
15  end at 22.5306500245
16  begin at 22.5311450156
16  end at 23.7781693541
17  begin at 23.7787245279
17  end at 25.1775114499
18  begin at 25.178350742
18  end at 26.5497330734
19  begin at 26.5501776789
19  end at 27.970799259
20  begin at 27.9712727895
20  end at 29.4595075375
21  begin at 29.4599959972
21  end at 30.9507299602
22  begin at 30.9513989679
22  end at 32.2762763982
23  begin at 32.2767182045
23  end at 33.6476256057
24  begin at 33.648137392
24  end at 35.1100517711
25  begin at 35.1104907783
25  end at 36.462657099
26  begin at 36.4632234696
26  end at 37.7908515759
27  begin at 37.7912845182
27  end at 39.4359928956
28  begin at 39.436448698
28  end at 40.9955021593
29  begin at 40.9960871912
29  end at 42.6425665264
it takes 42.6435882327 Seconds to get 720 items 

These threads clearly did not run concurrently but executed one after another, so the point of multithreading was lost. Where was the problem? It turned out that in my loop the two lines

th.start()
th.join()

were adjacent, so each new thread waited for the previous one to finish before starting. Change the loop to:

for i in range(allthread):
    # print urllist[i*(per):(i+1)*(per)]
    th = threading.Thread(target = getItems,args= (urllist[i*(per):(i+1)*(per)],i))
    ths.append(th)
for th in ths:
    th.start()
for th in ths:
    th.join()

Result:

0  begin at 0.0010814225325
1  begin at 0.00135201143191
2  begin at 0.00191744892518
3  begin at 0.0021311208492
4  begin at 0.00247495536449
5  begin at 0.0027334144167
6  begin at 0.00320601192551
7  begin at 0.00379011072218
8  begin at 0.00425431064445
9  begin at 0.00511692939449
10  begin at 0.0132038052264
11  begin at 0.0165926979253
12  begin at 0.0170886220634
13  begin at 0.0174665134574
14  begin at 0.018348726576
15  begin at 0.0189780790334
16  begin at 0.0201896641572
17  begin at 0.0220576606283
18  begin at 0.0231484138125
19  begin at 0.0238804034387
20  begin at 0.0273901280772
21  begin at 0.0300363009005
22  begin at 0.0362878375422
23  begin at 0.0395512329756
24  begin at 0.0431556637289
25  begin at 0.0459581249682
26  begin at 0.0482254733323
27  begin at 0.0535430117384
28  begin at 0.0584971212607
29  begin at 0.0598136762161
16  end at 65.2657542222
24  end at 66.2951247811
21  end at 66.3849747583
4  end at 66.6230160119
5  end at 67.5501632164
29  end at 67.7516992283
23  end at 68.6985322418
7  end at 69.1060433231
22  end at 69.2743398214
2  end at 69.5523713152
14  end at 69.6454986837
15  end at 69.8333400981
12  end at 69.9508018062
10  end at 70.2860348602
26  end at 70.3670659719
13  end at 70.3847232972
27  end at 70.3941635841
11  end at 70.5132838156
1  end at 70.7272351926
0  end at 70.9115253609
6  end at 71.0876563409
8  end at 71.1124805398
25  end at 71.1145248855
3  end at 71.4606034226
19  end at 71.6103622486
18  end at 71.6674453096
20  end at 71.725601862
17  end at 71.7778992318
9  end at 71.7847479301
28  end at 71.7921004837
it takes 71.7931912368 Seconds to get 720 items 
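The fix boils down to: start every thread first, then join them all. A self-contained sketch of that pattern with dummy workers (the names and the 0.2 s sleep are illustrative, standing in for the real page fetches):

```python
import threading
import time

results = []
lock = threading.Lock()

def worker(thno):
    # simulate one thread's share of the downloads
    time.sleep(0.2)
    with lock:
        results.append(thno)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
t0 = time.time()
for th in threads:
    th.start()   # start every thread first...
for th in threads:
    th.join()    # ...then wait for all of them
elapsed = time.time() - t0

print(sorted(results))  # all five workers ran
print(elapsed < 0.6)    # roughly 0.2 s total, not 5 * 0.2 s
```

Calling join() immediately after start() inside the same loop degrades this to sequential execution, which is exactly what the begin/end timestamps above showed.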
Reflection

The version above does run concurrently, yet it takes far longer than the single-threaded one... I haven't found the cause; could it be that BeautifulSoup does not support multithreading? Advice welcome. To test this idea, I will drop BeautifulSoup and search the raw string directly. Start with the single-threaded version:

# -*- coding:utf-8 -*-
"""
Created on 20160827
@author: qiukang
"""
import requests,time
from BeautifulSoup import BeautifulSoup    # HTML

# request headers
headers = {
   "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
   "Accept-Encoding":"gzip, deflate, sdch",
   "Accept-Language":"zh-CN,zh;q=0.8",
   "Connection":"keep-alive",
   "Host":"www.zhongchou.com",
   "Upgrade-Insecure-Requests":"1",
   "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36"
}

# collect the list of project detail-page URLs
def getItems(allpage):
    no = 0
    data = set()
    for page in range(allpage):
        if page==0:
            url = "http://www.zhongchou.com/browse/di"
        else:
            url = "http://www.zhongchou.com/browse/di-p"+str(page+1)
        # print url #①
        r1 = requests.get(url,headers=headers)
        html = r1.text.encode("utf8")
        start = 5000    
        while  True:     
            index = html.find("deal-show", start)   
            if index == -1:     
                break    
            # print "http://www.zhongchou.com/deal-show/"+html[index+10:index+19]
            data.add("http://www.zhongchou.com/deal-show/"+html[index+10:index+19]+"\n")
            start = index  + 1000 
    items = open("pystandard.txt","a")
    items.write("".join(data))
    items.close()
    return len(data)
    
if __name__ == "__main__":
    start = time.clock()
    allpage = 30
    no = getItems(allpage)
    end = time.clock()
    print("it takes %s Seconds to get %s items "%(end-start,no))

Results of 3 runs:

it takes 11.6800132309 Seconds to get 720 items
it takes 11.3621804427 Seconds to get 720 items
it takes 11.6811991567 Seconds to get 720 items  
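The find-based scan can be sanity-checked on a tiny, made-up HTML snippet (the 9-digit deal-show id format matches what the slice in the code above extracts; the scan here starts at 0 instead of 5000 since there is no page header to skip):

```python
def extract_ids(html, start=0):
    """Collect deal-show URLs by raw string search, as in the crawler above."""
    data = set()
    while True:
        index = html.find("deal-show", start)
        if index == -1:
            break
        # "deal-show/" is 10 chars; the project id is the next 9 chars
        data.add("http://www.zhongchou.com/deal-show/" + html[index + 10:index + 19])
        start = index + 1000  # skip ahead past this card, as the crawler does
    return data

sample = ('<a href="/deal-show/123456789">p1</a>' + ' ' * 1000 +
          '<a href="/deal-show/987654321">p2</a>')
print(sorted(extract_ids(sample)))
```

The set also deduplicates for free, which is why the crawler switched from a list-and-counter to a set here.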

Then modify the multithreaded version:

# -*- coding:utf-8 -*-
"""
Created on 20160827
@author: qiukang
"""
import requests,time,threading
from BeautifulSoup import BeautifulSoup    # HTML

# request headers
header = {
   "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
   "Accept-Encoding":"gzip, deflate, sdch",
   "Accept-Language":"zh-CN,zh;q=0.8",
   "Connection":"keep-alive",
   "Host":"www.zhongchou.com",
   "Upgrade-Insecure-Requests":"1",
   "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36"
}

data = set()
no = 0
lock = threading.Lock()

# collect the list of project detail-page URLs
def getItems(urllist,thno):
    # print urllist
    # print thno," begin at",time.clock()
    global no,lock,data
    for url in urllist:
        r1 = requests.get(url,headers=header)
        html = r1.text.encode("utf8")
        start = 5000    
        while  True:     
            index = html.find("deal-show", start)   
            if index == -1:     
                break
            lock.acquire()
            data.add("http://www.zhongchou.com/deal-show/"+html[index+10:index+19]+"\n")
            start = index  + 1000 
            lock.release()
        
    # print thno," end at",time.clock()
    
if __name__ == "__main__":
    start = time.clock()
    allpage = 30  # number of pages
    allthread = 10 # number of threads
    per = (int)(allpage/allthread)
    urllist = []
    ths = []
    for page in range(allpage):
        if page==0:
            url = "http://www.zhongchou.com/browse/di"
        else:
            url = "http://www.zhongchou.com/browse/di-p"+str(page+1)
        urllist.append(url)
    for i in range(allthread):
        # print urllist[i*(per):(i+1)*(per)]
        low = i*allpage/allthread  # note how the slice bounds are computed
        high = (i+1)*allpage/allthread
        # print low," ",high
        th = threading.Thread(target = getItems,args= (urllist[low:high],i))
        ths.append(th)
    for th in ths:
        th.start()
    for th in ths:
        th.join()
    items = open("pymulti.txt","a")
    items.write("".join(data))
    items.close()
    end = time.clock()
    print("it takes %s Seconds to get %s items "%(end-start,len(data)))

Results of 3 runs:

it takes 1.4781525123 Seconds to get 720 items 
it takes 1.44905954029 Seconds to get 720 items
it takes 1.49297891786 Seconds to get 720 items

Multithreading really is several times faster than the single thread here. And for a simple crawling task, the built-in string methods are much faster than parsing the HTML with BeautifulSoup.
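As an aside, the hand-rolled slicing of urllist across threads can be replaced by the standard library's thread pool (concurrent.futures, Python 3). A sketch in the same spirit, where fetch_page is a stand-in for the real requests.get call:

```python
import concurrent.futures
import time

def fetch_page(url):
    # stand-in for requests.get(url, headers=headers)
    time.sleep(0.1)
    return url + " fetched"

urls = ["http://www.zhongchou.com/browse/di-p%d" % (i + 1) for i in range(10)]
t0 = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    # map schedules all fetches concurrently and preserves input order
    pages = list(pool.map(fetch_page, urls))
elapsed = time.time() - t0

print(len(pages))     # 10 pages fetched
print(elapsed < 0.5)  # concurrently, not 10 * 0.1 s
```

The pool handles the start/join choreography, so the serialization bug from earlier cannot happen.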

NodeJs
// npm install request -g  # this seemed not to work; cd into the code directory and run: npm install --save request
// npm install cheerio -g  # likewise: npm install --save cheerio

var request = require("request");
var cheerio = require("cheerio");
var fs = require("fs");

var t1 = new Date().getTime();
var allpage = 30;
var urllist = new Array()  
var urldata = "";
var mark = 0;
var no = 0;
for (var i = 0; i < allpage; i++) {
    if (i === 0) {
        urllist[i] = "http://www.zhongchou.com/browse/di";
    } else {
        urllist[i] = "http://www.zhongchou.com/browse/di-p" + (i + 1).toString();
    }
    request(urllist[i], getItems);
}

// parse one listing page and collect the detail-page links
function getItems(error, response, body) {
    if (error) throw error;
    var $ = cheerio.load(body);
    var href = $("a.siteCardICH3");
    for (var i = href.length - 1; i >= 0; i--) {
        // console.log(href[i].attribs["href"]);
        urldata += (href[i].attribs["href"] + "\n");
        no += 1;
    }
    mark += 1;
    if (mark == allpage) {
        // console.log(urldata);
        fs.writeFile("./nodestandard.txt", urldata, function (err) {
            if (err) throw err;
        });
        var t2 = new Date().getTime();
        console.log("it takes " + ((t2 - t1) / 1000).toString() + " Seconds to get " + no.toString() + " items");
    }
}

Results of 5 runs:

it takes 3.949 Seconds to get 720 items
it takes 3.642 Seconds to get 720 items
it takes 3.641 Seconds to get 720 items
it takes 3.938 Seconds to get 720 items
it takes 3.783 Seconds to get 720 items

So with the same parse-the-HTML approach, Node.js beats Python hands down. What about string search?

var request = require("request");
var cheerio = require("cheerio");
var fs = require("fs");

var t1 = new Date().getTime();
var allpage = 30;
var urllist = new Array()  ;
var urldata = new Array();
var mark = 0;
var no = 0;
for (var i = 0; i < allpage; i++) {
    if (i === 0) {
        urllist[i] = "http://www.zhongchou.com/browse/di";
    } else {
        urllist[i] = "http://www.zhongchou.com/browse/di-p" + (i + 1).toString();
    }
    request(urllist[i], getItems);
}

// scan the raw HTML for "deal-show" instead of parsing it
function getItems(error, response, body) {
    if (error) throw error;
    var start = 5000;
    while (true) {
        var index = body.indexOf("deal-show", start);
        if (index == -1) break;
        urldata.push("http://www.zhongchou.com/deal-show/" + body.substring(index + 10, index + 19));
        start = index + 1000;
    }
    mark += 1;
    if (mark == allpage) {
        // assumed output filename
        fs.writeFile("./nodesearch.txt", urldata.join("\n"), function (err) {
            if (err) throw err;
        });
        var t2 = new Date().getTime();
        console.log("it takes " + ((t2 - t1) / 1000).toString() + " Seconds to get " + urldata.length.toString() + " items");
    }
}

實驗5次的結果:

it takes 3.695 Seconds to get 720 items
it takes 3.781 Seconds to get 720 items
it takes 3.94 Seconds to get 720 items
it takes 3.705 Seconds to get 720 items
it takes 3.601 Seconds to get 720 items

About the same time as the parsing version.

Summary

Based on my own knowledge and this experiment, my conclusion is: with multithreading, Python's download speed can keep up with Node.js, but at parsing web pages Python is not as fast as Node.js; after all, JS was born for the web, and a complex crawler can hardly do everything with string search.

Addendum (2016-09-13)

A commenter pointed out my mistake with time.clock(); thank you. I switched to time.time() in the multithreaded, string-search version, and after 3 more runs it still beats the Node.js version, averaging 2.3 s. Perhaps our network environments differ? I will try from another classroom... As for whether this misleads anyone, readers will try it themselves and draw their own conclusions.

Python indeed has async (Twisted), and Node.js indeed has multiprocessing (child_process). A truly exhaustive performance comparison requires deeper knowledge of both languages, which I only half understand at the moment; I will make time to study them and attempt that comparison (not a comparison of programming styles, but simply of what each language can squeeze out on the same machine and the same network. A long road ahead.)
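On the Python async side, the standard library has since offered asyncio (Python 3.4+, asyncio.run since 3.7). A minimal sketch of the pattern with simulated downloads; asyncio.sleep stands in for the network, and a real crawler would pair this with an async HTTP client such as aiohttp:

```python
import asyncio
import time

async def fetch(url):
    # simulated network round-trip
    await asyncio.sleep(0.2)
    return url

async def main():
    urls = ["http://www.zhongchou.com/browse/di-p%d" % (i + 1) for i in range(5)]
    # schedule all five "downloads" concurrently on a single thread
    return await asyncio.gather(*(fetch(u) for u in urls))

t0 = time.time()
pages = asyncio.run(main())
elapsed = time.time() - t0

print(len(pages))     # 5
print(elapsed < 0.6)  # roughly 0.2 s, not 5 * 0.2 s
```

Like Node's callbacks, this interleaves the waits on one thread instead of spawning OS threads.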

As for the parsing method: I used Python's bundled parser here. The official docs say lxml is indeed faster than the built-in one, but even after switching, multithreading still showed no advantage, so I remain puzzled: does BeautifulSoup not support multithreading? I found no documentation on this; advice welcome. Also, from BeautifulSoup import BeautifulSoup is indeed much slower than from bs4 import BeautifulSoup; that is a BeautifulSoup version issue. Thanks to the commenter who pointed it out.
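One plausible answer to that open question: the slowdown is less about BeautifulSoup than about CPython's GIL. Parsing is CPU-bound, and with the GIL only one thread executes Python bytecode at a time, so CPU-bound work gains nothing from threads, while network downloads (which release the GIL during I/O) do. A minimal sketch; the absolute timings will vary by machine:

```python
import threading
import time

def count_down(n, out, i):
    # CPU-bound busy loop; no I/O that would release the GIL
    while n > 0:
        n -= 1
    out[i] = True

N = 2_000_000

t0 = time.time()
done = [False, False]
count_down(N, done, 0)
count_down(N, done, 1)
sequential = time.time() - t0

t0 = time.time()
done2 = [False, False]
ths = [threading.Thread(target=count_down, args=(N, done2, i)) for i in range(2)]
for th in ths:
    th.start()
for th in ths:
    th.join()
threaded = time.time() - t0

print(all(done) and all(done2))
print("sequential %.2fs, threaded %.2fs" % (sequential, threaded))
# on CPython with the GIL, the threaded run is typically no faster (often slower)
```

This fits the earlier results: threads sped up the download-heavy string-search crawler dramatically, but not the parse-heavy BeautifulSoup one.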


