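Summary: PhantomJS is a server-side WebKit with a JavaScript API; in other words, a headless browser. PhantomJS is not a Python program: download the build for your operating system from the official site and install it. The get method waits until the page has fully loaded before the program continues, but it cannot help with Ajax. The post also shows how to set the PhantomJS User-Agent and how to list all available desired_capabilities.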
selenium:https://github.com/SeleniumHQ...
Current version: 3.0.1
A browser automation framework and ecosystem
phantomjs:http://phantomjs.org/
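PhantomJS is a server-side WebKit with a JavaScript API; in other words, a headless browser. It supports various web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG.
Most page scraping can be handled with urllib, but once JavaScript and Ajax rendering are involved, urlopen is of no use at all, so a simulated browser becomes necessary. There are many ways to do that; this post uses selenium2 + phantomjs.
selenium2 supports all mainstream browsers as well as headless browsers such as phantomjs.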
Installation:
pip install selenium
phantomjs is not a Python program; download the build for your operating system from the official site and install it.
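Since phantomjs is installed separately from pip, it can help to check where the executable lives before handing a path to Selenium. The following is only a minimal sketch using the standard library; the function name `find_executable` and its `explicit_path` parameter are made up for illustration and are not part of Selenium or PhantomJS.

```python
import os
import shutil


def find_executable(name="phantomjs", explicit_path=None):
    """Return a usable path to the executable, or None if it cannot be found."""
    # An explicit path (hypothetical parameter) wins if it points at a real file.
    if explicit_path and os.path.isfile(explicit_path):
        return explicit_path
    # Otherwise fall back to searching the directories listed in PATH.
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable()
    if path is None:
        print("phantomjs not found; download it from http://phantomjs.org/")
    else:
        print("using", path)
```

The returned path could then be passed as `executable_path` to `webdriver.PhantomJS`, as in the examples in this post.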
from selenium import webdriver
import time

# Adjust executable_path to wherever phantomjs is installed locally
driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe")
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
# Crude wait: give the Ajax call a few seconds to finish
time.sleep(3)
print(driver.find_element_by_id("content").text)
driver.close()

from selenium import webdriver

driver = webdriver.PhantomJS(executable_path=r"C:\Users\Gentlyguitar\Desktop\phantomjs-1.9.7-windows\phantomjs.exe")
driver.set_window_size(1120, 550)
driver.get("http://duckduckgo.com/")
# Type the query into the search box, then click the search button
driver.find_element_by_id("search_form_input_homepage").send_keys("Nirvana")
driver.find_element_by_id("search_button_homepage").click()
print(driver.current_url)
driver.close()
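The get method waits until the page has fully loaded before the program continues, but it can do nothing about content that Ajax fills in afterwards.
send_keys fills in an input form field.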
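Because get only covers the initial page load, Ajax content has to be waited for separately. A fixed time.sleep works but is either too long or too short; the usual idea is to poll for a condition with a timeout. The helper below is a hand-written sketch of that idea in plain Python (the name `wait_until` and its parameters are assumptions for illustration, not a Selenium API).

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)
```

With a driver in hand, this could be used roughly as `wait_until(lambda: driver.find_elements_by_id("loadedButton"))`, since an empty list is falsy until the element appears.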
# Wait for the page to finish rendering
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe", desired_capabilities=dcap)
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
try:
    # Wait up to 10 seconds for the element with id "loadedButton" to appear
    element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "loadedButton")))
finally:
    print(driver.find_element_by_id("content").text)
    driver.close()

Handling JavaScript redirects
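One simple way to notice a JavaScript redirect is to poll the driver's current_url until it differs from the URL that was requested. This is a sketch of that alternative, not the approach this post uses; `wait_for_url_change` is a hypothetical helper, and it only relies on the object exposing a `current_url` attribute.

```python
import time


def wait_for_url_change(driver, old_url, timeout=10.0, interval=0.5):
    """Poll driver.current_url until it differs from old_url.

    Returns the new URL, or None if nothing changed before the timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = driver.current_url
        if current != old_url:
            return current
        time.sleep(interval)
    return None
```

This does not catch redirects that reload the same URL, and the browser may normalize the requested URL slightly, so comparing against the value of `driver.current_url` read right after `get` is probably safer than comparing against the literal string.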
# Handle JavaScript redirects
from selenium import webdriver
import time
from selenium.webdriver.remote.webelement import WebElement
from selenium.common.exceptions import StaleElementReferenceException

def waitForLoad(driver):
    elem = driver.find_element_by_tag_name("html")
    count = 0
    while True:
        count += 1
        if count > 20:
            print("Timing out after 10 seconds and returning")
            return
        time.sleep(.5)
        try:
            elem == driver.find_element_by_tag_name("html")
        # A StaleElementReferenceException means elem no longer exists,
        # which in turn means the page has redirected.
        except StaleElementReferenceException:
            return

driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe")
driver.get("http://pythonscraping.com/pages/javascript/redirectDemo1.html")
waitForLoad(driver)
print(driver.page_source)

Setting the PhantomJS User-Agent
Some sites' web servers restrict the User-Agent and may refuse requests from one they do not recognize.
To set the PhantomJS user-agent, set the "phantomjs.page.settings.userAgent" desired capability.
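The capability is just a dictionary key with the prefix "phantomjs.page.settings.". As a rough sketch, a small helper can build such a dictionary without touching the original; `with_page_settings` is a made-up name, and the `base` dict here is only a stand-in for `dict(DesiredCapabilities.PHANTOMJS)`, whose real contents may differ.

```python
def with_page_settings(base_caps, **page_settings):
    """Return a copy of base_caps with PhantomJS page settings added."""
    caps = dict(base_caps)  # copy so the caller's dict is left untouched
    for name, value in page_settings.items():
        caps["phantomjs.page.settings." + name] = value
    return caps


# Stand-in for dict(DesiredCapabilities.PHANTOMJS); the real values may differ.
base = {"browserName": "phantomjs"}
caps = with_page_settings(base, userAgent="Mozilla/5.0 (example)")
print(caps["phantomjs.page.settings.userAgent"])
```

The resulting dictionary would be passed as `desired_capabilities` when creating the driver.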
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
driver = webdriver.PhantomJS(executable_path="./phantomjs.exe", desired_capabilities=dcap)
driver.get("http://dianping.com/")
cap_dict = driver.desired_capabilities
# List all available desired_capabilities
for key in cap_dict:
    print("%s: %s" % (key, cap_dict[key]))
print(driver.current_url)
driver.quit()

Demo
github
# pip install selenium
# install phantomjs
from selenium import webdriver
import time
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe", desired_capabilities=dcap)
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)
print(driver.find_element_by_id("content").text)
driver.close()

# Set the PhantomJS User-Agent
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
driver = webdriver.PhantomJS(executable_path="./phantomjs.exe", desired_capabilities=dcap)
driver.get("http://dianping.com/")
cap_dict = driver.desired_capabilities
# List all available desired_capabilities
for key in cap_dict:
    print("%s: %s" % (key, cap_dict[key]))
print(driver.current_url)
driver.quit()

# Wait for the page to finish rendering
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe")
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
try:
    element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "loadedButton")))
finally:
    print(driver.find_element_by_id("content").text)
    driver.close()

# Handle JavaScript redirects
from selenium import webdriver
import time
from selenium.webdriver.remote.webelement import WebElement
from selenium.common.exceptions import StaleElementReferenceException

def waitForLoad(driver):
    elem = driver.find_element_by_tag_name("html")
    count = 0
    while True:
        count += 1
        if count > 20:
            print("Timing out after 10 seconds and returning")
            return
        time.sleep(.5)
        try:
            elem == driver.find_element_by_tag_name("html")
        except StaleElementReferenceException:
            return

driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe")
driver.get("http://pythonscraping.com/pages/javascript/redirectDemo1.html")
waitForLoad(driver)
print(driver.page_source)

##################################################################################
# Simulate drag and drop
from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver import ActionChains

driver = webdriver.PhantomJS(executable_path="phantomjs/bin/phantomjs")
driver.get("http://pythonscraping.com/pages/javascript/draggableDemo.html")
print(driver.find_element_by_id("message").text)
element = driver.find_element_by_id("draggable")
target = driver.find_element_by_id("div2")
actions = ActionChains(driver)
actions.drag_and_drop(element, target).perform()
print(driver.find_element_by_id("message").text)

##################################################################################
# Take a screenshot
driver.get_screenshot_as_file("tmp/pythonscraping.png")

##################################################################################
# Log in to Zhihu, then automatically click "More" at the bottom of the page to load more content
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver import ActionChains
import time
import sys

driver = webdriver.PhantomJS(executable_path=r"C:\Users\Gentlyguitar\Desktop\phantomjs-1.9.7-windows\phantomjs.exe")
driver.get("http://www.zhihu.com/#signin")
#driver.find_element_by_name("email").send_keys("your email")
driver.find_element_by_xpath('//input[@name="password"]').send_keys("your password")
#driver.find_element_by_xpath('//input[@name="password"]').send_keys(Keys.RETURN)
time.sleep(2)
driver.get_screenshot_as_file("show.png")
#driver.find_element_by_xpath('//button[@class="sign-button"]').click()
driver.find_element_by_xpath('//form[@class="zu-side-login-box"]').submit()

try:
    # Wait for the page to finish loading
    dr = WebDriverWait(driver, 5)
    dr.until(lambda the_driver: the_driver.find_element_by_xpath('//a[@class="zu-top-nav-userinfo "]').is_displayed())
except Exception:
    print("Login failed")
    sys.exit(0)
driver.get_screenshot_as_file("show.png")
#user = driver.find_element_by_class_name("zu-top-nav-userinfo ")
#webdriver.ActionChains(driver).move_to_element(user).perform()  # move the mouse to my username
loadmore = driver.find_element_by_xpath('//a[@id="zh-load-more"]')
actions = ActionChains(driver)
actions.move_to_element(loadmore)
actions.click(loadmore)
actions.perform()
time.sleep(2)
driver.get_screenshot_as_file("show.png")
print(driver.current_url)
print(driver.page_source)
driver.quit()
##################################################################################
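The screenshot call in the demo writes to "tmp/pythonscraping.png", which presumably only succeeds if the tmp directory already exists. A minimal sketch of creating the target directory first, using only the standard library (`ensure_parent_dir` is a hypothetical helper name, and the temporary directory in the demo is just for illustration):

```python
import os
import tempfile


def ensure_parent_dir(path):
    """Create the directory that will contain `path`, if it is missing."""
    parent = os.path.dirname(os.path.abspath(path))
    os.makedirs(parent, exist_ok=True)
    return parent


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than the working directory.
    root = tempfile.mkdtemp()
    print(ensure_parent_dir(os.path.join(root, "tmp", "pythonscraping.png")))
```

Calling `ensure_parent_dir("tmp/pythonscraping.png")` before `driver.get_screenshot_as_file("tmp/pythonscraping.png")` would be one way to combine the two.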