
Python_selenium_phantomjs: dynamic scraping

zacklee / 3576 reads


selenium:https://github.com/SeleniumHQ...
Current version: 3.0.1
A browser automation framework and ecosystem

phantomjs:http://phantomjs.org/
A WebKit with a server-side JavaScript API; in other words, a headless browser (a browser with no UI). It supports various web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG.

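Most page scraping can be done with urllib, but urlopen is useless once JavaScript and Ajax rendering are involved, so you have to simulate a browser. There are many ways to do that; this article uses selenium2 + phantomjs.
selenium2 supports all mainstream browsers as well as headless browsers such as phantomjs.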
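To make the limitation concrete, here is a minimal sketch using only the standard library. `RAW_HTML` is a hypothetical page (not one of the demo pages in this article) whose visible text is replaced by a script two seconds after loading. A plain HTML parser, like urlopen, only ever sees the placeholder, because nothing executes the script.

```python
from html.parser import HTMLParser

# Hypothetical page source: the visible text is filled in by a script at runtime.
RAW_HTML = """
<html><body>
<div id="content">Loading...</div>
<script>
  setTimeout(function () {
    document.getElementById("content").innerHTML = "Loaded by Ajax";
  }, 2000);
</script>
</body></html>
"""


class ContentText(HTMLParser):
    """Collect the text inside the element whose id is "content"."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == "content":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def static_content(html):
    """Return the text of #content as it appears in the raw source."""
    parser = ContentText()
    parser.feed(html)
    return "".join(parser.chunks).strip()


print(static_content(RAW_HTML))  # prints: Loading...
```

A real browser engine such as PhantomJS runs the script, so the same element would eventually contain the Ajax-loaded text.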
Installation:

pip install selenium

phantomjs is not a Python program; download the build for your operating system from the official site and install it.
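Before handing a path to Selenium, it can help to check that the executable really is where you think it is. The following is a small sketch under stated assumptions: `find_phantomjs` is a hypothetical helper name, and it assumes the binary is either at a path you supply or somewhere on `PATH`.

```python
import os
import shutil


def find_phantomjs(explicit_path=None):
    """Return a path to the phantomjs executable, or None if it cannot be found."""
    # Prefer a path supplied by the caller, if it points at an existing file.
    if explicit_path and os.path.isfile(explicit_path):
        return explicit_path
    # Otherwise search the directories listed in PATH.
    return shutil.which("phantomjs")


path = find_phantomjs()
print(path if path else "phantomjs not found; pass executable_path explicitly")
```

The returned value can then be passed as `executable_path` when creating the driver.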

from selenium import webdriver
import time

# Raw string, so the backslashes in the Windows path are not treated as escapes
driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe")
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)  # crude fixed wait for the Ajax content to arrive
print(driver.find_element_by_id("content").text)
driver.close()

from selenium import webdriver

driver = webdriver.PhantomJS(executable_path=r"C:\Users\Gentlyguitar\Desktop\phantomjs-1.9.7-windows\phantomjs.exe")
driver.set_window_size(1120, 550)
driver.get("http://duckduckgo.com/")
# Type the query into the search box, then click the search button
driver.find_element_by_id("search_form_input_homepage").send_keys("Nirvana")
driver.find_element_by_id("search_button_homepage").click()
print(driver.current_url)
driver.close()

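The get method blocks until the page has fully loaded and only then lets the program continue, but it cannot wait for content loaded by Ajax.
send_keys fills in an input form field.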

Waiting for the page to finish rendering

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe", desired_capabilities=dcap)
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
try:
    # Wait up to 10 seconds for the element with id "loadedButton" to appear in the DOM
    element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "loadedButton")))
finally:
    print(driver.find_element_by_id("content").text)
    driver.close()
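
Conceptually, `WebDriverWait(driver, 10).until(...)` keeps calling the condition (every 0.5 seconds by default) until it returns a truthy value or the timeout elapses. The sketch below imitates that loop with the standard library only, so it runs without a browser. It is not Selenium's implementation: `wait_until`, `WaitTimeout`, and `button_present` are hypothetical names, and `LookupError` stands in for Selenium's "element not found" exception.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll=0.5, ignored=(LookupError,)):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except ignored:
            pass  # treat "not there yet" errors as a falsy result and keep polling
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Stand-in for an Ajax page: the "element" only exists from the third poll onward.
calls = {"n": 0}


def button_present():
    calls["n"] += 1
    if calls["n"] < 3:
        raise LookupError("element not in DOM yet")
    return "loadedButton"


print(wait_until(button_present, timeout=2.0, poll=0.01))  # prints: loadedButton
```

Compared with a fixed `time.sleep(3)`, polling returns as soon as the content is ready and fails loudly when it never arrives.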
Handling JavaScript redirects

from selenium import webdriver
import time
from selenium.common.exceptions import StaleElementReferenceException

def waitForLoad(driver):
    # Remember the <html> element of the page that is currently loaded
    elem = driver.find_element_by_tag_name("html")
    count = 0
    while True:
        count += 1
        # 20 polls of 0.5 seconds each gives a 10 second limit
        if count > 20:
            print("Timing out after 10 seconds and returning")
            return
        time.sleep(.5)
        try:
            elem == driver.find_element_by_tag_name("html")
        # A StaleElementReferenceException means elem no longer exists, so the page has redirected.
        except StaleElementReferenceException:
            return

driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe")
driver.get("http://pythonscraping.com/pages/javascript/redirectDemo1.html")
waitForLoad(driver)
print(driver.page_source)
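
Another signal that a JavaScript redirect has happened is a change in `driver.current_url`. The sketch below polls for that instead of checking for a stale element, so it is a different technique from the one above and only helps when the redirect actually changes the URL. `wait_for_url_change` and `FakeDriver` are hypothetical names; the fake stands in for a real driver so the sketch runs without a browser.

```python
import time


def wait_for_url_change(driver, old_url, timeout=10.0, poll=0.5):
    """Return True once driver.current_url differs from old_url, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if driver.current_url != old_url:
            return True
        time.sleep(poll)
    # One last check after the deadline has passed
    return driver.current_url != old_url


class FakeDriver:
    """Test double: returns the queued URLs in order, then repeats the last one."""

    def __init__(self, urls):
        self._urls = list(urls)

    @property
    def current_url(self):
        return self._urls.pop(0) if len(self._urls) > 1 else self._urls[0]


fake = FakeDriver(["http://example.com/a", "http://example.com/a", "http://example.com/b"])
print(wait_for_url_change(fake, "http://example.com/a", timeout=1.0, poll=0.01))  # prints: True
```

With a real driver you would record `driver.current_url` right after `driver.get(...)` and pass it as `old_url`.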
Setting the PhantomJS User-Agent

Some web servers restrict access by User-Agent and may reject requests from a User-Agent they do not recognize.
To set the PhantomJS user agent, set the "phantomjs.page.settings.userAgent" desired capability.

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"


driver = webdriver.PhantomJS(executable_path="./phantomjs.exe", desired_capabilities=dcap)
driver.get("http://dianping.com/")
cap_dict = driver.desired_capabilities  # list all available desired_capabilities properties
for key in cap_dict:
    print("%s: %s" % (key, cap_dict[key]))
print(driver.current_url)
driver.quit()
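
The `dict(...)` call in the code above matters: it makes a copy, so adding the user agent does not modify the shared default capabilities. Here is a minimal sketch of that pattern without Selenium. `BASE_CAPS` is a stand-in whose values are assumed for illustration, and `caps_with_user_agent` is a hypothetical helper name.

```python
# Stand-in for selenium's DesiredCapabilities.PHANTOMJS (values assumed for illustration).
BASE_CAPS = {"browserName": "phantomjs", "platform": "ANY", "javascriptEnabled": True}

UA_KEY = "phantomjs.page.settings.userAgent"


def caps_with_user_agent(base, user_agent):
    """Return a copy of base with the PhantomJS user-agent setting added."""
    caps = dict(base)  # shallow copy, like dict(DesiredCapabilities.PHANTOMJS)
    caps[UA_KEY] = user_agent
    return caps


caps = caps_with_user_agent(BASE_CAPS, "Mozilla/5.0 (Windows NT 10.0; WOW64)")
print(UA_KEY in caps, UA_KEY in BASE_CAPS)  # prints: True False
```

Because the base dictionary is left untouched, each driver can be created with its own user agent.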
Demo

github

#pip install selenium
# install phantomjs

from selenium import webdriver
import time
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
 
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"

driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe", desired_capabilities=dcap)
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)
print(driver.find_element_by_id("content").text)
driver.close()

# Set the PhantomJS User-Agent
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
 
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"

 
driver = webdriver.PhantomJS(executable_path="./phantomjs.exe", desired_capabilities=dcap)
driver.get("http://dianping.com/")

cap_dict = driver.desired_capabilities  # list all available desired_capabilities properties
for key in cap_dict:
    print("%s: %s" % (key, cap_dict[key]))
print(driver.current_url)
driver.quit()

# Wait for the page to finish rendering
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe")
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
try:
    element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "loadedButton")))
finally:
    print(driver.find_element_by_id("content").text)
    driver.close()

# Handle JavaScript redirects
from selenium import webdriver
import time
from selenium.webdriver.remote.webelement import WebElement
from selenium.common.exceptions import StaleElementReferenceException

def waitForLoad(driver):
    elem = driver.find_element_by_tag_name("html")
    count = 0
    while True:
        count += 1
        if count > 20:
            print("Timing out after 10 seconds and returning")
            return
        time.sleep(.5)
        try:
            elem == driver.find_element_by_tag_name("html")
        except StaleElementReferenceException:
            return

driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe")
driver.get("http://pythonscraping.com/pages/javascript/redirectDemo1.html")
waitForLoad(driver)
print(driver.page_source)
##################################################################################
# Simulate drag and drop
from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver import ActionChains

driver = webdriver.PhantomJS(executable_path="phantomjs/bin/phantomjs")
driver.get("http://pythonscraping.com/pages/javascript/draggableDemo.html")

print(driver.find_element_by_id("message").text)

element = driver.find_element_by_id("draggable")
target = driver.find_element_by_id("div2")
actions = ActionChains(driver)
actions.drag_and_drop(element, target).perform()

print(driver.find_element_by_id("message").text)
##################################################################################
# Take a screenshot
driver.get_screenshot_as_file("tmp/pythonscraping.png")

####
##################################################################################
# Log in to Zhihu, then automatically click "More" at the bottom of the page to load more content
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver import ActionChains
import time
import sys

driver = webdriver.PhantomJS(executable_path=r"C:\Users\Gentlyguitar\Desktop\phantomjs-1.9.7-windows\phantomjs.exe")
driver.get("http://www.zhihu.com/#signin")
#driver.find_element_by_name("email").send_keys("your email")
driver.find_element_by_xpath("//input[@name='password']").send_keys("your password")
#driver.find_element_by_xpath("//input[@name='password']").send_keys(Keys.RETURN)
time.sleep(2)
driver.get_screenshot_as_file("show.png")
#driver.find_element_by_xpath("//button[@class='sign-button']").click()
driver.find_element_by_xpath("//form[@class='zu-side-login-box']").submit()

try:
    # Wait for the page to finish loading
    dr = WebDriverWait(driver, 5)
    dr.until(lambda the_driver: the_driver.find_element_by_xpath("//a[@class='zu-top-nav-userinfo ']").is_displayed())
except Exception:
    print("Login failed")
    sys.exit(0)
driver.get_screenshot_as_file("show.png")
#user=driver.find_element_by_class_name("zu-top-nav-userinfo ")
#webdriver.ActionChains(driver).move_to_element(user).perform() # move the mouse over my user name
loadmore = driver.find_element_by_xpath("//a[@id='zh-load-more']")
actions = ActionChains(driver)
actions.move_to_element(loadmore)
actions.click(loadmore)
actions.perform()
time.sleep(2)
driver.get_screenshot_as_file("show.png")
print(driver.current_url)
print(driver.page_source)
driver.quit()
##################################################################################

