Summary of Image Scraping Methods in Python

edagarli, published 2019-07-25 12:06 / 2051 views

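Summary: When an item reaches the pipeline, the URLs in its URL field are scheduled for download by Scrapy's scheduler and downloader (which means the scheduler and downloader middlewares are reused), at a higher priority, so they are processed before other pages are scraped. The result field will contain a list of dicts with information about the downloaded files, such as the download path, the source URL (taken from the URL field), and the image checksum.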

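1. The most common ways to download images

For downloading images, the first tools that come to mind are the urllib library and the requests library. Both approaches are shown below.

1.1 urllib

Use urllib.request.urlretrieve, which downloads an image given its URL and the name of the file to save it as.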

"""
Signature: request.urlretrieve(url, filename=None, reporthook=None, data=None)
Docstring:
Retrieve a URL into a temporary location on disk.

Requires a URL argument. If a filename is passed, it is used as
the temporary file location. The reporthook argument should be
a callable that accepts a block number, a read size, and the
total file size of the URL target. The data argument should be
valid URL encoded data.

If a filename is passed and the URL points to a local resource,
the result is a copy from local file to new file.

Returns a tuple containing the path to the newly created
data file as well as the resulting HTTPMessage object.
File:      ~/anaconda/lib/python3.6/urllib/request.py
Type:      function
"""

The filename parameter specifies the local path to save to (if it is omitted, urllib creates a temporary file to hold the data).

The reporthook parameter is a callback that is triggered once when the connection to the server is established and again after each block of data has been transferred; it can be used to display the current download progress.

The data parameter is the data to POST to the server. The function returns a two-element tuple (filename, headers), where filename is the local path the data was saved to and headers holds the server's response headers.

Usage example:

from urllib import request
request.urlretrieve("https://img3.doubanio.com/view/photo/photo/public/p454345512.jpg", "kids.jpg")
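
The reporthook callback described above could be sketched as follows. This is only an illustration: the function names percent_done, show_progress and download_with_progress are hypothetical, and the only urllib call used is request.urlretrieve with its documented reporthook argument (block number, block size, total size).

```python
from urllib import request


def percent_done(block_num, block_size, total_size):
    """Return progress as a percentage, or None when the total size is unknown."""
    if total_size <= 0:
        # urlretrieve passes -1 when the server sends no Content-Length
        return None
    downloaded = block_num * block_size
    # The last block is usually partial, so cap the value at 100
    return min(100.0, downloaded * 100.0 / total_size)


def show_progress(block_num, block_size, total_size):
    """reporthook callback: print the progress after each transferred block."""
    pct = percent_done(block_num, block_size, total_size)
    if pct is None:
        print("downloaded about %d bytes" % (block_num * block_size))
    else:
        print("%.1f%%" % pct)


def download_with_progress(url, filename):
    """Download url to filename, reporting progress through show_progress."""
    return request.urlretrieve(url, filename, reporthook=show_progress)
```

Calling download_with_progress(url, "kids.jpg") with the image URL from the usage example should print a percentage after each block, provided the server reports the file size.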

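However, this may well return a 403 (Forbidden) error (for example, with http://www.qnong.com.cn/uploa...). Stack Overflow explains the cause: This website is blocking the user-agent used by urllib, so you need to change it in your request.

Adding a User-Agent to urlretrieve is somewhat cumbersome; it can be done as follows:

from urllib import request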

opener = request.build_opener()
headers = ("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0")
opener.addheaders = [headers]
request.install_opener(opener)
request.urlretrieve("http://www.qnong.com.cn/uploadfile/2016/0416/20160416101815887.jpg", "./dog.jpg")
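
As an alternative to installing a global opener, the header can be attached to a single request. The sketch below swaps urlretrieve for urllib.request.Request plus urlopen, so it is a different technique from the one above; the helper names build_request and download are hypothetical.

```python
import shutil
from urllib import request

# Same User-Agent string as in the opener example above
USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) "
              "Gecko/20100101 Firefox/53.0")


def build_request(url, user_agent=USER_AGENT):
    """Create a Request object that carries a browser-like User-Agent."""
    return request.Request(url, headers={"User-Agent": user_agent})


def download(url, filename):
    """Fetch url with the custom header and stream the body into filename."""
    with request.urlopen(build_request(url)) as resp, open(filename, "wb") as out:
        shutil.copyfileobj(resp, out)
    return filename
```

This keeps the header local to one request instead of changing the behaviour of every later urllib call in the process.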

1.2 requests

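Use requests.get() to fetch the image, with the stream parameter set to True so that the body is downloaded in chunks rather than loaded into memory all at once.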

import requests

req = requests.get("http://www.qnong.com.cn/uploadfile/2016/0416/20160416101815887.jpg", stream=True)

with open("dog.jpg", "wb") as wr:
    for chunk in req.iter_content(chunk_size=1024):
        if chunk:
            wr.write(chunk)
            wr.flush()

Adding a User-Agent with requests is also easy: just pass it through the headers parameter.
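
A minimal sketch of passing a User-Agent through the headers parameter, assuming the requests library is installed. The function name save_image and the timeout value are illustrative choices, not part of the original example.

```python
import requests

# Browser-like header; the string is the one used in the urllib example
HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) "
                   "Gecko/20100101 Firefox/53.0")
}


def save_image(url, filename, headers=HEADERS, chunk_size=1024):
    """Stream the image at url into filename, sending custom headers."""
    resp = requests.get(url, headers=headers, stream=True, timeout=10)
    resp.raise_for_status()  # turn a 403 or 404 into an exception
    with open(filename, "wb") as fh:
        for chunk in resp.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty keep-alive chunks
                fh.write(chunk)
    return filename
```

With raise_for_status() a blocked request fails loudly instead of silently writing an error page to disk under an image name.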

2. Methods supported by Scrapy

2.1 ImagesPipeline

Scrapy ships with ImagesPipeline and FilesPipeline for downloading images and files. The simplest way to use ImagesPipeline requires only configuration in settings.

# settings.py
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 500
}

IMAGES_STORE = "pictures"  # directory where images are stored
IMAGES_MIN_HEIGHT = 400  # filter out images smaller than 600x400
IMAGES_MIN_WIDTH = 600

# items.py
import scrapy

class PictureItem(scrapy.Item):
    image_urls = scrapy.Field()

# myspider.py
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


from ..items import PictureItem

class PicSpider(CrawlSpider):
    name = "pic"
    allowed_domains = ["qnong.com.cn"]
    start_urls = ["http://www.qnong.com.cn/"]

    rules = (
        Rule(LinkExtractor(allow=r".*?", restrict_xpaths=("//a[@href]")), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        for img_url in response.xpath("//img/@src").extract():
            item = PictureItem()
            item["image_urls"] = [response.urljoin(img_url)]
            yield item

2.2 Custom Pipeline

By default, when ImagesPipeline downloads an image, the file is saved under a name derived from the SHA1 hash of the image URL.

For example:
Image URL: http://www.example.com/image.jpg
SHA1 result: 3afec3b4765f8f0a07b78f98c07b83f013567a0a
Resulting file name: 3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg
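
The naming rule above can be approximated with the standard hashlib module. This is a sketch of the idea rather than Scrapy's own code: the helper name default_image_path is hypothetical, and the "full/" prefix is taken from the custom pipeline shown later in this section.

```python
import hashlib


def default_image_path(url):
    """Build 'full/<SHA1 hex digest of the URL>.jpg' for an image URL."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/%s.jpg" % digest


path = default_image_path("http://www.example.com/image.jpg")
print(path)  # a 40-character hex name under full/, ending in .jpg
```

Because the name depends only on the URL, the same image URL always maps to the same file, but the name says nothing about the original file name.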

To use custom image file names, override the file_path method of ImagesPipeline. See https://doc.scrapy.org/en/lat... for reference.

# settings.py
ITEM_PIPELINES = {
    "qnong.pipelines.MyImagesPipeline": 500,
}

# items.py
import scrapy

class PictureItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()

# myspider.py
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


from ..items import PictureItem

class PicSpider(CrawlSpider):
    name = "pic"
    allowed_domains = ["qnong.com.cn"]
    start_urls = ["http://www.qnong.com.cn/"]

    rules = (
        Rule(LinkExtractor(allow=r".*?", restrict_xpaths=("//a[@href]")), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        for img_url in response.xpath("//img/@src").extract():
            item = PictureItem()
            item["image_urls"] = [response.urljoin(img_url)]
            yield item

# pipelines.py
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
import scrapy

class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for img_url in item["image_urls"]:
            yield scrapy.Request(img_url)

    def item_completed(self, results, item, info):
        image_paths = [x["path"] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item["image_paths"] = image_paths
        return item

    def file_path(self, request, response=None, info=None):
        image_guid = request.url.split("/")[-1]
        return "full/%s" % (image_guid)
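
One thing to watch in file_path above: request.url.split("/")[-1] keeps any query string in the file name. A possible variation, sketched here with hypothetical helper names and a made-up query string, parses the URL first and uses only the last path segment.

```python
import posixpath
from urllib.parse import urlparse


def image_name_from_url(url):
    """Return the last path segment of url, ignoring query and fragment."""
    return posixpath.basename(urlparse(url).path)


def custom_file_path(url):
    """Mirror the 'full/<name>' layout used by MyImagesPipeline.file_path."""
    return "full/%s" % image_name_from_url(url)


# The "?size=big" query string is invented for illustration
url = "http://www.qnong.com.cn/uploadfile/2016/0416/20160416101815887.jpg?size=big"
print(custom_file_path(url))
```

Note that two different URLs ending in the same file name would still overwrite each other, which is the trade-off of dropping the SHA1-based names.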

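2.3 How FilesPipeline and ImagesPipeline work

FilesPipeline

In a spider, you scrape an item and put the URLs of the files you want into its file_urls field.

The item is returned from the spider and enters the item pipeline.

When the item reaches FilesPipeline, the URLs in the file_urls field are scheduled for download by Scrapy's scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, so they are processed before other pages are scraped. The item remains "locked" at that pipeline stage until the files have finished downloading (or have failed for some reason).

When the files have been downloaded, another field (files) is populated with the results. This field contains a list of dicts with information about each downloaded file, such as the download path, the original scraped URL (taken from the file_urls field), and the file checksum. The files in the files list keep the same order as in the original file_urls field. If a file fails to download, an error is logged and the file does not appear in the files field.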
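
A minimal settings sketch for enabling this workflow. The storage directory "downloads" is a hypothetical choice; ITEM_PIPELINES, FILES_STORE and the pipeline path scrapy.pipelines.files.FilesPipeline are Scrapy's own names. The item class would also need file_urls and files fields, in the same way PictureItem declares image_urls.

```python
# settings.py (sketch)
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}

FILES_STORE = "downloads"  # hypothetical directory for downloaded files
```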

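ImagesPipeline

In a spider, you scrape an item and put the URLs of its images into the image_urls field.

The item is returned from the spider and enters the item pipeline.

When the item reaches ImagesPipeline, the URLs in the image_urls field are scheduled for download by Scrapy's scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, so they are processed before other pages are scraped. The item remains "locked" at that pipeline stage until the images have finished downloading (or have failed for some reason).

When the images have been downloaded, another field (images) is populated with the results. This field contains a list of dicts with information about each downloaded image, such as the download path, the original scraped URL (taken from the image_urls field), and the image checksum. The images in the images list keep the same order as in the original image_urls field. If an image fails to download, an error is logged and the image does not appear in the images field.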
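
The shape of the images field can be illustrated without running a crawl. In the sketch below the sample data is entirely made up, a plain Exception stands in for the failure object Scrapy would pass, and successful_images is a hypothetical helper that mirrors the filtering done in item_completed in section 2.2.

```python
def successful_images(results):
    """Keep the info dicts of successful downloads, preserving their order."""
    return [info for ok, info in results if ok]


# Hypothetical (success, info) pairs for three image URLs; the second failed
sample_results = [
    (True, {"url": "http://www.example.com/a.jpg",
            "path": "full/a.jpg",
            "checksum": "hypothetical-checksum-a"}),
    (False, Exception("download failed")),
    (True, {"url": "http://www.example.com/c.jpg",
            "path": "full/c.jpg",
            "checksum": "hypothetical-checksum-c"}),
]

images = successful_images(sample_results)
print([image["path"] for image in images])  # ['full/a.jpg', 'full/c.jpg']
```

The failed download simply leaves no entry, while the remaining entries stay in the order of the original URLs.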

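Besides downloading images, Scrapy can also generate thumbnails of specified sizes.
Pillow is used to generate the thumbnails and to normalize images to JPEG/RGB format, so you need to install this library in order to use the images pipeline.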
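
Thumbnail generation is switched on through the IMAGES_THUMBS setting. The sketch below assumes Pillow is installed; the size names "small" and "big" and their dimensions are example values, not requirements.

```python
# settings.py (sketch)
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 500,
}

IMAGES_STORE = "pictures"  # same storage directory as in section 2.1

# Each entry maps a thumbnail name to a (width, height) bound
IMAGES_THUMBS = {
    "small": (50, 50),
    "big": (270, 270),
}
```

With this configuration each downloaded image should also be saved under thumbs/small/ and thumbs/big/ inside the storage directory, next to the full-size copy in full/.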
