[Python Automation] Batch File Downloads with selenium

wzyplus, published 2021-09-28 09:36 / 2262 views


The Python Automation column is goal-oriented: it aims to simplify or automate everyday work tasks by applying Python to real problems, in the hope of sparking readers' interest in this scripting language. Before starting the hands-on automation material, readers are advised to gain some familiarity with Python basics and with Python web scraping. The blog has a dedicated column for each.

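Previous articles in this series:

[Python Basics] Handling Dynamic HTML with Selenium and PhantomJS
[Python Basics] Machine Vision and Image Recognition with Tesseract
[Python Automation] selenium: CAPTCHA Recognition
[Python Automation] selenium: Automating Online Course Study
[Python Automation] selenium: Batch File Downloads (this article)
[Python in Practice] Automating the Daily Health Report During the Pandemic
[Python in Practice] Academic Administration System: Grade and Timetable Query Interface Design, with Course Grabbing and Monitoring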

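File downloading is one of the most common requirements in both web scraping and automation. The author previously demonstrated it in practice in the article "Academic Administration System: Grade and Timetable Query Interface Design, with Course Grabbing and Monitoring", where it was the first step of recognizing graphical CAPTCHAs. This article gives a more systematic overview and summary of file downloading.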

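In general, a file download can be implemented in two ways: first, by sending an HTTP request and saving the response yourself; second, by driving a browser with selenium.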

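The first approach was shown in the article "Academic Administration System: Grade and Timetable Query Interface Design, with Course Grabbing and Monitoring". The basic logic is to construct a GET request, send it, and save the returned content.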

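import requests

# The host part of the URL was manually redacted by the author.
url = "<redacted>/Image.aspx"

def get_pic():
    # Request headers for the captcha image
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0",
        "cookie": "varPartNewsManage.aspx=10",
    }
    re_pic = requests.get(url, headers=headers)
    response = re_pic.content
    # No base name is given here, so the image is saved as ".png" in this folder.
    file = "C://Users//john//Desktop//1//" + ".png"
    # The with-block closes the file even if the write fails.
    with open(file, "wb") as play_file:
        play_file.write(response)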
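The same send-and-save logic can also be sketched with only the standard library, which makes it easy to try without installing anything. This is a minimal sketch under stated assumptions: the helper name `fetch_to_file` and its parameters are illustrative and not part of the original script, and `urllib.request` is swapped in for `requests`. It streams the response body to disk instead of holding it all in memory.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_to_file(url, dest, headers=None, timeout=10):
    """Request `url` and stream the response body into `dest`.

    Hypothetical helper for illustration; returns the number of bytes saved.
    """
    request = urllib.request.Request(url, headers=headers or {})
    dest = Path(dest)
    # Create the target folder if it does not exist yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx HTTP responses,
    # so an error page is not silently saved as if it were the file.
    with urllib.request.urlopen(request, timeout=timeout) as response, \
            open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest.stat().st_size
```

A call would pass the image URL, a target path, and the same headers dictionary used in `get_pic`.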

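The second approach, letting selenium do the clicking for you, is also common in practice for batch downloads. The rest of this article uses audio from courses on Chaoxing Xuexitong, the bane of every online-course student, as an example of how to batch-download course files with selenium.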

Required modules:

import requests
from datetime import datetime
from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException, WebDriverException

The basic steps of a file download:

Access the target site
Obtain the download sources
Specify the storage path
Perform the download

1. Accessing the target site

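How to access a target site is covered in the earlier Python Automation article "[Python Automation] Login and Recognition". The example site in this article, Chaoxing Xuexitong, is reached through a teacher's share link, so no login is required. The URL has the following format: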

http://apps.wh.chaoxing.com/screen/vclass/view/xxxxx-xxxxx-xxxxx-xxxxx-xxxxxxxxxx

All that is needed, therefore, is a simple webdriver call to open the target site:

chrome_options = Options()
# chrome_options.add_argument("--headless")
# chrome_options.add_argument("--disable-gpu")
# Options are passed with the `options` keyword (`chrome_options` is the deprecated spelling).
browser = webdriver.Chrome(options=chrome_options)
url_list = [
    "http://apps.wh.chaoxing.com/screen/vclass/view/xxxxx-xxxxx-xxxxx-xxxxx-xxxxxxxxxx1",
    "http://apps.wh.chaoxing.com/screen/vclass/view/xxxxx-xxxxx-xxxxx-xxxxx-xxxxxxxxxx2",
    "http://apps.wh.chaoxing.com/screen/vclass/view/xxxxx-xxxxx-xxxxx-xxxxx-xxxxxxxxxx3",
]
browser.get(url_list[0])
# browser.maximize_window()
# Explicit wait: up to 10 seconds, polling every 0.5 seconds
wait = WebDriverWait(browser, 10, 0.5)

2. Obtaining the download sources

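First, locate the audio element with the browser's developer tools. Once the direct links to the audio files on a single page can be read, the same lookup can be repeated page by page to collect the direct links for the whole site.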

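# Wait up to 10 seconds until at least one <audio> element is present.
link = WebDriverWait(browser, 10).until(
    lambda x: x.find_elements(By.XPATH, "//audio")
)
# `list` shadows the built-in type; the download step below reads this variable.
list = []
for i in link:
    # The src attribute holds the direct link to the audio file.
    list.append(i.get_attribute("src"))
# print(list)
# print(type(list))
browser.quit()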
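The snippet above reads only the first entry of `url_list`. To traverse every page, the same lookup could be wrapped in a loop. The sketch below is one hedged way to do that, not the author's code: `collect_audio_links` is a hypothetical helper that accepts any object exposing selenium's `get` and `find_elements` methods, passes the locator strategy as the plain string "xpath" (the value of `By.XPATH`), and uses a simple polling loop as a stand-in for `WebDriverWait`. It also skips empty `src` values and duplicate links.

```python
import time


def collect_audio_links(browser, urls, timeout=10.0, poll=0.5):
    """Visit each page and gather the `src` of every <audio> element.

    Hypothetical helper: `browser` is assumed to behave like a selenium
    WebDriver (it needs `get` and `find_elements`).
    """
    links = []
    seen = set()
    for url in urls:
        browser.get(url)
        deadline = time.monotonic() + timeout
        elements = browser.find_elements("xpath", "//audio")
        # Poll until at least one <audio> element shows up or time runs out.
        while not elements and time.monotonic() < deadline:
            time.sleep(poll)
            elements = browser.find_elements("xpath", "//audio")
        for element in elements:
            src = element.get_attribute("src")
            # Skip elements without a usable src and links already collected.
            if src and src not in seen:
                seen.add(src)
                links.append(src)
    return links
```

With a real driver this would be called as `collect_audio_links(browser, url_list)` before `browser.quit()`. A page that never shows an audio element simply contributes no links once the timeout expires.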

3. Specifying the storage path and downloading

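z = []
for i in list:
    # Timestamp prefix for the file name, in the form HH-MM-SS----
    time = datetime.now().strftime("%H-%M-%S----")
    data = requests.get(i, stream=True)
    with open("C://Users//john//Desktop//1//" + time + i[-8:-5] + ".mp3", "wb") as f:
        # Write the response to disk in 512-byte chunks
        for j in data.iter_content(chunk_size=512):
            f.write(j)
    # Count the file only after it has been written completely
    z.append(time)
    print(i + " written!")
print("Total: {}, downloaded: {}".format(len(list), len(z)))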
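The file name above is built from a second-resolution timestamp plus a three-character slice near the end of the link. That assumes a fixed URL shape, and two files finished within the same second could end up with the same name. A possible alternative, sketched under the assumption that the last path segment of each link is a meaningful name, derives the file name from the URL itself. `filename_from_url` and `unique_path` are hypothetical helpers, not part of the original script.

```python
import os
from pathlib import Path
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="download.mp3"):
    """Return the last path segment of `url`, ignoring query and fragment."""
    name = os.path.basename(unquote(urlsplit(url).path))
    return name or default


def unique_path(directory, name):
    """Return a path inside `directory` that does not exist yet.

    If `name` is taken, "_1", "_2", ... is appended before the extension.
    """
    directory = Path(directory)
    candidate = directory / name
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        candidate = directory / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate
```

In the download loop, the target could then be computed as `unique_path(folder, filename_from_url(link))`, where `folder` and `link` stand for the storage directory and the current direct link.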