A first look at the requests-html library + fixing an undocumented bug: I/O error : encoder er

mozillazg, published 2021-09-07 09:59 / 2435 views

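Summary: Target site analysis. The site scraped in this post is described as a directory of world-famous websites. Because the single-threaded code is so short, nowhere near the day's quota of code, it is then converted to a multithreaded version.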

This post is example 30 of "Crawler 120 Examples". It introduces a new crawling framework, requests-html. Its author is also the author of requests, so it is a safe bet that it is pleasant to use.

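Background

Install the requests-html module with pip install requests-html. The official manual is at https://requests-html.kennethreitz.org/. There is no official Chinese translation, but a Chinese version of the manual did turn up while searching; it is provided at the end of this post.

First, the official description of the library: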

Full JavaScript support! (Full JS support. The manual even highlights this point; beginners can ignore it for now.)
CSS Selectors (a.k.a jQuery-style, thanks to PyQuery). (Built on the pyquery library, so CSS selectors are supported.)
XPath Selectors, for the faint at heart. (XPath selectors are supported.)
Mocked user-agent (like a real web browser). (Mocks the UA string, which is a nice touch.)
Automatic following of redirects. (Redirects are followed automatically.)
Connection–pooling and cookie persistence. (Connection pooling and persistent cookies.)
The Requests experience you know and love, with magical parsing abilities. (Well, this last point is left for you to work out yourself.)

Only Python 3.6 is supported. In practice, versions above 3.6 work as well.

Basic usage of the library looks like this:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://python.org/")
print(r)

First import the HTMLSession class from the requests_html library, instantiate it, and call its get method to send the request. The result r is a response object, and the built-in parsers can then be used to parse its data.

Because the library works on an html object, it is worth checking which methods and attributes that object provides.

Use the dir function to list them.

print(dir(r.html))
# Output:
# ["__aiter__", "__anext__", "__class__", "__delattr__", "__dict__", "__dir__", "__doc__", "__eq__", "__format__", "__ge__",
# "__getattribute__", "__gt__", "__hash__", "__init__", "__init_subclass__", "__iter__", "__le__", "__lt__", "__module__", "__ne__",
# "__new__", "__next__", "__reduce__", "__reduce_ex__", "__repr__", "__setattr__", "__sizeof__", "__str__", "__subclasshook__",
# "__weakref__", "_async_render", "_encoding", "_html", "_lxml", "_make_absolute", "_pq", "absolute_links", "add_next_symbol",
# "arender", "base_url", "default_encoding", "element", "encoding", "find", "full_text", "html", "links", "lxml", "next",
# "next_symbol", "page", "pq", "raw_html", "render", "search", "search_all", "session", "skip_anchors", "text", "url", "xpath"]

dir only outputs an overview of the names. For the details of any one of them, use the help function.
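As a small illustration of narrowing the dir output and then reading the documentation for a single name, here is a sketch that uses only the standard library. The helper name public_names is hypothetical, and str is used as a stand-in object so the snippet runs without network access; the same calls should apply to r.html.

```python
import inspect


def public_names(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


# str is only a stand-in here; pass r.html to inspect the requests-html object.
names = public_names(str)
print(names[:5])

# inspect.getdoc returns the cleaned-up docstring of a single attribute,
# the same text that help() builds its output from.
print(inspect.getdoc(str.split))
```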

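Methods of the html object include:

find: takes a CSS selector and returns a list of elements;
xpath: takes an XPath expression and returns a list of elements;
search: searches the page for the template passed in and returns the first match;
search_all: same as above, but returns all matches.

Attributes of the html object include:

links: all links found on the page;
absolute_links: all links on the page, as absolute URLs;
base_url: the base URL of the page;
html, raw_html, text: the page as HTML, the raw unparsed page content, and all text extracted from the page.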

With this groundwork in place, writing Python crawlers becomes much easier. The requests-html library will be covered over 3 to 4 examples; here is the first.

Target site analysis

The target site for this post is http://www.world68.com/top.asp?t=5star&page=1, which is described as "全球名站" (world-famous sites).

Before sending a request to the data source, it occurred to me that the user-agent could be changed dynamically. Reading the library source shows that it simply uses the fake_useragent library for this and does nothing magical, so this feature is optional.

DEFAULT_USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8"

def user_agent(style=None) -> _UserAgent:
    """Returns an apparently legit user-agent, if not requested one of a specific
    style. Defaults to a Chrome-style User-Agent.
    """
    global useragent
    if (not useragent) and style:
        useragent = UserAgent()

    return useragent[style] if style else DEFAULT_USER_AGENT

The rest is fairly simple. The pagination rule is:

http://www.world68.com/top.asp?t=5star&page=1
http://www.world68.com/top.asp?t=5star&page=2
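The pagination rule can be sketched as a small helper that builds every page URL up front. The names BASE_URL and page_urls are hypothetical and introduced only for illustration.

```python
BASE_URL = "http://www.world68.com/top.asp?t=5star&page={page}"


def page_urls(total_pages):
    """Build the list of page URLs for pages 1..total_pages (inclusive)."""
    return [BASE_URL.format(page=page) for page in range(1, total_pages + 1)]


for url in page_urls(3):
    print(url)
```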

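The total number of pages is shown at the bottom of the page, so the program can simply ask the user to enter it, using the input function.

Only the site name and site address need to be stored. With that settled, start coding.

Coding time

First implement the basic requests-html logic in a single thread. Note how lightweight the code below is.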

from requests_html import HTMLSession

session = HTMLSession()
page_size = int(input("Enter the total number of pages: "))

for page in range(1, page_size + 1):
    world = session.get(f"http://www.world68.com/top.asp?t=5star&page={page}")
    world.encoding = "gb2312"
    # world.html.encoding = "gb2312"
    # print(world.text)
    print("Scraping", world.url)
    title_a = world.html.find("dl>dt>a")
    for item in title_a:
        name = item.text
        url = item.attrs["href"]
        # Append one "name,url" line per site.
        with open("webs.txt", "a+", encoding="utf-8") as f:
            f.write(f"{name},{url}\n")

Key points of the code above:

world.encoding sets the encoding used to parse the page;
world.html.find("dl>dt>a") uses a CSS selector to find all site title elements;
item.text extracts the title text;
item.attrs["href"] reads the element's attribute, that is, the site address.
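To make concrete what the dl>dt>a selector matches, here is an offline sketch that uses the standard library's html.parser instead of requests-html. It is a simplified stand-in, not how requests-html works internally: the sample markup and the class name DtLinkParser are hypothetical, and the real page is only assumed to have a similar structure.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the real page structure is only assumed to be similar.
SAMPLE = """
<dl><dt><a href="http://www.example.com/">Example</a></dt><dd>description</dd></dl>
<dl><dt><a href="http://www.example.org/">Example Org</a></dt></dl>
<p><a href="/ignored">not inside dl, dt</a></p>
"""


class DtLinkParser(HTMLParser):
    """Collect (text, href) for <a> tags that are direct children of dl > dt."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, outermost first
        self.current = None  # [text_parts, href] while inside a matching <a>
        self.results = []

    def handle_starttag(self, tag, attrs):
        # Match only when the two innermost open tags are dl, then dt.
        if tag == "a" and self.stack[-2:] == ["dl", "dt"]:
            self.current = [[], dict(attrs).get("href")]
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            text, href = self.current
            self.results.append(("".join(text).strip(), href))
            self.current = None
        # Pop back to the matching open tag (tolerates simple unclosed tags).
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.current is not None:
            self.current[0].append(data)


parser = DtLinkParser()
parser.feed(SAMPLE)
for name, url in parser.results:
    print(name, url)
```

The third link is skipped because its parent is a p element, which mirrors the child combinator (>) in the selector.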

The run collects 3519 sites. They are not provided here, since running the code for about a minute is enough to get them.
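One caveat with plain "name,url" lines is that a site name may itself contain a comma. A possible alternative, sketched here with the standard csv module and hypothetical sample records, quotes such fields automatically. The helper name write_records is an assumption made for illustration.

```python
import csv
import io

# Hypothetical records standing in for the scraped (name, url) pairs.
records = [
    ("Example", "http://www.example.com/"),
    ("Name, with comma", "http://www.example.org/"),
]


def write_records(fileobj, rows):
    """Write (name, url) rows as CSV; fields containing commas are quoted."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerows(rows)


buffer = io.StringIO()
write_records(buffer, records)
print(buffer.getvalue())
```

With a real file, open it with newline="" and encoding="utf-8" and pass the file object in place of the StringIO buffer.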

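Because the single-threaded crawler is so short, nowhere near today's quota of code, let's convert it to a multithreaded version while we are at it.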

import fcntl  # Unix-only; used for an advisory lock on the output file
import threading
import time

import requests_html


class MyThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        global page, lock, page_size
        while True:
            # Claim the next page number under the lock so no two threads share one.
            lock.acquire(True)
            if page >= page_size:
                lock.release()
                break
            page += 1
            current_page = page  # copy before releasing; page may change afterwards
            lock.release()

            requests_html.DEFAULT_ENCODING = "gb18030"
            session = requests_html.HTMLSession()
            print("Scraping page {}".format(current_page), "*" * 50)
            page_url = f"http://www.world68.com/top.asp?t=5star&page={current_page}"
            try:
                world = session.get(page_url, timeout=10)
                print("Scraping", world.url)
                # print(world.html)
                title_a = world.html.find("dl>dt>a")
                print(title_a)
                my_str = ""
                for item in title_a:
                    name = item.text
                    url = item.attrs["href"]
                    my_str += f"{name},{url}\n"
                with open("thread_webs.txt", "a+", encoding="utf-8") as f:
                    fcntl.flock(f.fileno(), fcntl.LOCK_EX)  # lock the file; released on close
                    f.write(my_str)
            except Exception as e:
                print(e, page_url)


if "__main__" == __name__:
    page_size = int(input("Enter the total number of pages: "))
    page = 0
    thread_list = []
    # Record the start time
    start = time.perf_counter()
    lock = threading.Lock()
    for i in range(1, 5):
        t = MyThread()
        thread_list.append(t)
    for t in thread_list:
        t.start()
    for t in thread_list:
        t.join()
    # Compute the elapsed time
    elapsed = (time.perf_counter() - start)
    print("Finished, total time:", elapsed)
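As an alternative to the shared page counter guarded by a lock, the page numbers can be handed out through queue.Queue. The sketch below is not the approach used in this post; fetch_page is a hypothetical stand-in that returns fake data so the example runs offline, and worker and crawl are likewise names chosen for illustration.

```python
import queue
import threading


def fetch_page(page):
    """Hypothetical stand-in for the real request; returns fake 'name,url' lines."""
    return [f"site-{page},http://example.com/{page}"]


def worker(pages, results, results_lock):
    while True:
        try:
            page = pages.get_nowait()  # each page number is handed out exactly once
        except queue.Empty:
            return
        lines = fetch_page(page)
        with results_lock:  # serialize access to the shared list
            results.extend(lines)


def crawl(total_pages, thread_count=4):
    pages = queue.Queue()
    for page in range(1, total_pages + 1):
        pages.put(page)
    results, results_lock = [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(pages, results, results_lock))
        for _ in range(thread_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(sorted(crawl(5)))
```

Because the queue is filled before any thread starts, an empty queue reliably means there is no work left, so no separate end-of-work check is needed.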

Once the multithreaded version was actually run, a fairly serious encoding problem showed up, with the following errors:

encoding error : input conversion failed due to input error, bytes 0x81 0xE3 0xD3 0xAA
encoding error : input conversion failed due to input error, bytes 0x81 0xE3 0xD3 0xAA
encoding error : input conversion failed due to input error, bytes 0x81 0xE3 0xD3 0xAA
I/O error : encoder error

The error did not occur in the single-threaded version, but it started to appear as soon as the code ran with multiple threads. I found no solution for it online, so the only option was to patch the requests-html source myself.

Open the requests_html.py file and change the code around line 417 as follows:

def __init__(self, *, session: Union["HTMLSession", "AsyncHTMLSession"] = None, url: str = DEFAULT_URL, html: _HTML, default_encoding: str = DEFAULT_ENCODING, async_: bool = False) -> None:
    # Modified section
    # Convert incoming unicode HTML into bytes.
    # if isinstance(html, str):
    html = html.decode(DEFAULT_ENCODING, "replace")

    super(HTML, self).__init__(
        # Convert unicode HTML to bytes.
        element=PyQuery(html)("html") or PyQuery(f"<html>{html}</html>")("html"),
        html=html,
        url=url,
        default_encoding=default_encoding
    )

The line if isinstance(html, str): checks whether html is a str, but testing showed that html was actually of type bytes, so no conversion took place. The check is therefore removed, and the bytes are decoded directly with the "replace" error handler.

In addition, printing world.html.encoding showed that the page's encoding is gb18030, not GB2312, so the default encoding is set with the following line:

requests_html.DEFAULT_ENCODING = "gb18030"

After these changes, the code runs normally and the data is collected correctly.
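The difference between the two encodings can be checked directly with Python's built-in codecs, using the four bytes from the error message. This is only a sketch of the decoding behaviour; it does not reproduce the libxml2 error itself, and the variable names are chosen for illustration.

```python
# The four bytes reported in the error message.
data = bytes([0x81, 0xE3, 0xD3, 0xAA])

# gb2312 has no lead byte 0x81, so strict decoding fails.
try:
    data.decode("gb2312")
    gb2312_ok = True
except UnicodeDecodeError:
    gb2312_ok = False

# gb18030 is a superset that covers this byte range, so decoding succeeds.
text = data.decode("gb18030")

# errors="replace" never raises; undecodable bytes become U+FFFD.
lossy = data.decode("gb2312", "replace")

print(gb2312_ok, len(text), "\ufffd" in lossy)
```

This is consistent with both parts of the fix: gb18030 can represent the offending bytes, and the "replace" handler keeps any remaining bad bytes from aborting the parse.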

This example also adds a measurement of the total run time, as follows:

# Record the start time
start = time.perf_counter()
# ... the code being timed ...
# Compute the elapsed time
elapsed = (time.perf_counter() - start)
print("Finished, total time:", elapsed)
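The same measurement can be wrapped in a reusable context manager. This is a sketch; the name timed and the report dictionary are hypothetical, and time.sleep stands in for the crawl.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, report):
    """Measure the wall-clock time of the with-block and store it in report[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        report[label] = time.perf_counter() - start


report = {}
with timed("sleep", report):
    time.sleep(0.05)  # stand-in for the crawl

print(f"Finished, total time: {report['sleep']:.3f}s")
```

The finally clause records the elapsed time even if the timed block raises an exception.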

[Screenshot: output of running the complete code]

Bookmark time

Code repository: https://codechina.csdn.net/hihell/python120. Go give it a follow or a Star.

The data has not been fully collected; if you want it, leave a comment.

Today is day 212 / 365 of continuous writing.
Feel free to follow, like, comment, and bookmark.

More posts

Python Crawler 100 Examples tutorial index (completed; being reviewed and updated, currently 110+ posts)


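Copyright belongs to the author. Do not repost without permission. If this article violates any rules, you may contact an administrator to have it removed.

When reposting, please cite this article's address: http://specialneedsforspecialkids.com/yun/119409.html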
