4. Web crawling: downloading images with Scrapy tag selectors, and matching tags with regular expressions

KitorinZero posted on 2019-07-31 10:33 / 3444 views

Tag selector objects

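HtmlXPathSelector() creates a tag selector object; its argument is the HTML response object passed to the callback. This is the older Scrapy API and is deprecated in later Scrapy releases in favor of Selector, which is covered further down.
Required import: from scrapy.selector import HtmlXPathSelector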

select() is the tag selector method of HtmlXPathSelector. It takes a selector rule (an XPath expression) and returns a list whose elements are tag objects.

extract() returns the content matched by the selector, as a list whose elements are strings.

Selector rules

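    //x  searches for the given tag at any depth below, e.g. //div finds all div tags
    /x  searches for the given tag exactly one level down
    /@x  selects the given attribute; it is chained onto a path, e.g. /@id, /@src
    [@class="class name"]  selects tags whose attribute equals the given value; it can be chained, here finding tags whose class equals the given name
    /text()  gets the text content of a tag
    [x]  picks one element out of a set by index; XPath indexes start at 1, not 0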

Getting the specified tag objects

# -*- coding: utf-8 -*-
import scrapy                                      # import the crawler module
from scrapy.selector import HtmlXPathSelector      # import HtmlXPathSelector (older Scrapy API)


class AdcSpider(scrapy.Spider):
    name = "adc"                                   # spider name
    allowed_domains = ["www.shaimn.com"]
    start_urls = ["http://www.shaimn.com/xinggan/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)          # create the selector from the response object

        # all li tags directly under every div whose class equals showlist
        items = hxs.select('//div[@class="showlist"]/li')
        print(items)                               # list of tag objects

Looping over each li tag to get its child tags, attributes, and text

# -*- coding: utf-8 -*-
import scrapy                                      # import the crawler module
from scrapy.selector import HtmlXPathSelector      # import HtmlXPathSelector (older Scrapy API)


class AdcSpider(scrapy.Spider):
    name = "adc"                                   # spider name
    allowed_domains = ["www.shaimn.com"]
    start_urls = ["http://www.shaimn.com/xinggan/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)          # create the selector from the response object

        # all li tags directly under every div whose class equals showlist
        items = hxs.select('//div[@class="showlist"]/li')
        # print(items)                             # list of tag objects
        for i in range(1, len(items) + 1):         # one pass per li; XPath indexes start at 1
            # alt attribute of the img tag inside the i-th li
            title = hxs.select('//div[@class="showlist"]/li[%d]//img/@alt' % i).extract()
            # src attribute of the img tag inside the i-th li
            src = hxs.select('//div[@class="showlist"]/li[%d]//img/@src' % i).extract()
            if title and src:
                print(title, src)                  # lists of content strings

Downloading the images to local disk

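urlretrieve() saves a file locally: argument 1 is the src of the file to save, argument 2 is the save path.
urlretrieve is a function in the request module of urllib; import it with from urllib import request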

# -*- coding: utf-8 -*-
import os
import scrapy                                      # import the crawler module
from scrapy.selector import HtmlXPathSelector      # import HtmlXPathSelector (older Scrapy API)
from urllib import request                         # import the request module


class AdcSpider(scrapy.Spider):
    name = "adc"                                   # spider name
    allowed_domains = ["www.shaimn.com"]
    start_urls = ["http://www.shaimn.com/xinggan/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)          # create the selector from the response object
        img_dir = os.path.join(os.getcwd(), "img")
        os.makedirs(img_dir, exist_ok=True)        # make sure the target folder exists

        # all li tags directly under every div whose class equals showlist
        items = hxs.select('//div[@class="showlist"]/li')
        for i in range(1, len(items) + 1):         # one pass per li; XPath indexes start at 1
            title = hxs.select('//div[@class="showlist"]/li[%d]//img/@alt' % i).extract()
            src = hxs.select('//div[@class="showlist"]/li[%d]//img/@src' % i).extract()
            if title and src:
                # print(title[0], src[0])          # index 0 gives the string content
                file_path = os.path.join(img_dir, title[0] + ".jpg")   # build the save path
                request.urlretrieve(src[0], file_path)                 # arg 1: src, arg 2: save path
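
The download step can be pulled out into a small helper and exercised without network access. The following is a sketch under stated assumptions: save_image is a hypothetical helper name, the file name clean-up rule is an addition that is not in the original code, and a local file:// URL stands in for a real image URL.

```python
import os
import re
import tempfile
from pathlib import Path
from urllib import request


def save_image(src, title, img_dir):
    """Download src into img_dir as <title>.jpg and return the saved path."""
    os.makedirs(img_dir, exist_ok=True)                # create the folder if it is missing
    safe_title = re.sub(r'[\\/:*?"<>|]', "_", title)   # replace characters unsafe in file names
    file_path = os.path.join(img_dir, safe_title + ".jpg")
    request.urlretrieve(src, file_path)                # arg 1: source URL, arg 2: destination path
    return file_path


# Offline demonstration: a temporary local file plays the role of the remote image.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.jpg"
    source.write_bytes(b"fake image bytes")
    saved = save_image(source.as_uri(), "a/b: demo", os.path.join(tmp, "img"))
    saved_name = os.path.basename(saved)
    saved_bytes = Path(saved).read_bytes()

print(saved_name)
```

Cleaning the title matters because alt text scraped from a page may contain characters such as / or : that are not valid in file names on every platform.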

xpath() is the tag selector method of the Selector class; its argument is the selection rule [recommended]

The selector rules are the same as above.

Selector() creates a selector object; it takes the HTML response object.
Required import: from scrapy.selector import Selector

# -*- coding: utf-8 -*-
import scrapy                                      # import the crawler module
from scrapy.selector import Selector


class AdcSpider(scrapy.Spider):
    name = "adc"                                   # spider name
    allowed_domains = ["www.shaimn.com"]
    start_urls = ["http://www.shaimn.com/xinggan/"]

    def parse(self, response):
        items = Selector(response=response).xpath('//div[@class="showlist"]/li').extract()
        # print(items)                             # list of the matched li markup strings
        for i in range(1, len(items) + 1):         # XPath indexes start at 1
            title = Selector(response=response).xpath('//div[@class="showlist"]/li[%d]//img/@alt' % i).extract()
            src = Selector(response=response).xpath('//div[@class="showlist"]/li[%d]//img/@src' % i).extract()
            print(title, src)

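Using regular expressions

Regular expressions fill the gap when selector rules alone cannot express the filtering you need.

There are two ways to use them:

    1. Run a regex over the results that the selector rule has already filtered

    2. Apply a regex inside the selector rule itself

1. Run a regex over the results filtered by the selector rule, using the regex to extract the final content

Append .re("regex") at the end of the selector call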

# -*- coding: utf-8 -*-
import scrapy                                      # import the crawler module
from scrapy.selector import Selector


class AdcSpider(scrapy.Spider):
    name = "adc"                                   # spider name
    allowed_domains = ["www.shaimn.com"]
    start_urls = ["http://www.shaimn.com/xinggan/"]

    def parse(self, response):
        # markup of the first img tag under the showlist div
        items = Selector(response=response).xpath('//div[@class="showlist"]/li//img')[0].extract()
        print(items)
        # run a regex over that same tag and capture the value of its alt attribute
        items2 = Selector(response=response).xpath('//div[@class="showlist"]/li//img')[0].re(r'alt="(\w+)"')
        print(items2)

# Output (items2):
# ["人體藝術mmSunny前凸后翹性感誘惑寫真"]

2. Apply a regex inside the selector rule

[re:regex rule], for example [re:test(@class, "showlist")]

# -*- coding: utf-8 -*-
import scrapy                                      # import the crawler module
from scrapy.selector import Selector


class AdcSpider(scrapy.Spider):
    name = "adc"                                   # spider name
    allowed_domains = ["www.shaimn.com"]
    start_urls = ["http://www.shaimn.com/xinggan/"]

    def parse(self, response):
        items = Selector(response=response).xpath('//div').extract()
        # print(items)                             # markup of every div
        # regex inside the rule: div tags whose class attribute matches the pattern showlist
        items2 = Selector(response=response).xpath('//div[re:test(@class, "showlist")]').extract()
        print(items2)

[Reposted from: http://www.leiqiankun.com/?id=47]


