Using Python Scrapy to crawl data shown on a map embedded in a web page

Bryan, published 2019-07-31 10:20 / 1983 views

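Summary: Take this page as an example. I want the distribution of logistics parks across the major cities of China, together with the details of each park. The page embeds a map, and the logistics information for a city only appears after you click that city on the map individually.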

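I recently needed to crawl the official website of a logistics company. At first glance the site was mostly static pages and looked easy to scrape, unlike news or e-commerce sites with their complex structure, strict anti-crawling measures, and heavy use of AJAX. I was quietly pleased, until a closer look showed that not every page was an ordinary static page.
Take the URL below as an example. I want the distribution of logistics parks across the major cities of China, together with the details of each park.
The page embeds a map, and the logistics information for a city only appears after you click that city on the map individually.
https://www.glprop.com.cn/our...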

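My first guess was that the data came from AJAX requests, but capturing the traffic in Chrome showed none. Then I looked at the page source
and found that all the city information sits inside a single script block.
[Screenshot of the page source]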

The details of each park are stored in a variable of the form parks = {xx}.

So everything is right there in the page: fetch the source, match it with regular expressions, and get to work.
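
The idea can be sketched in a few lines. This is only an illustration under stated assumptions: the SAMPLE_HTML string and the helper name extract_js_var are made up for the example, and the real page is far larger. Only the variable names cities and parks come from the page described above.

```python
import json
import re

# Hypothetical stand-in for the page source; not the site's real markup.
SAMPLE_HTML = """
<script>
var cities = [{"id": "1", "name": "Shanghai"}, {"id": "2", "name": "Beijing"}];
var parks = {"1": {"101": {"name": "Park A", "city_id": "1"}}};
</script>
"""


def extract_js_var(html, var_name):
    """Return the JSON value assigned to `var <var_name> = ...;` in html."""
    # `.` does not match newlines, so this assumes the whole assignment
    # sits on a single line, as it does in the sample above.
    pattern = r"var\s+%s\s*=\s*(.*);" % re.escape(var_name)
    match = re.search(pattern, html)
    if match is None:
        raise ValueError("variable %r not found" % var_name)
    return json.loads(match.group(1))


cities = extract_js_var(SAMPLE_HTML, "cities")
parks = extract_js_var(SAMPLE_HTML, "parks")
print(cities[0]["name"])
print(parks["1"]["101"]["name"])
```

This works only when the embedded value is valid JSON. A JavaScript literal with unquoted keys or trailing commas would need a more tolerant parser.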
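items:

# items.py
import scrapy


# GLP (普洛斯): one news entry
class PuluosiNewsItem(scrapy.Item):
    newstitle = scrapy.Field()  # news title
    newtiems = scrapy.Field()   # publication time (field name kept as in the original)
    newslink = scrapy.Field()   # absolute URL of the news entry


# GLP (普洛斯): one logistics park (asset)
class PuluosiItem(scrapy.Item):
    assetstitle = scrapy.Field()    # park name
    assetaddress = scrapy.Field()   # city the park belongs to
    assetgaikuang = scrapy.Field()  # overview
    assetpeople = scrapy.Field()    # other information taken from the description
    asseturl = scrapy.Field()       # URL of the city detail page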

pipelines:

# pipelines.py
from openpyxl import Workbook

from news.items import PuluosiNewsItem, PuluosiItem


class PuluosiNewsPipeline(object):
    def __init__(self):
        # Workbook for the news entries
        self.wb = Workbook()
        self.ws = self.wb.active
        # Header row
        self.ws.append(["GLP news title", "Publication time", "News URL"])
        # Workbook for the logistics parks (assets)
        self.wb2 = Workbook()
        self.ws2 = self.wb2.active
        self.ws2.append(["Asset title", "Asset address", "Asset overview", "Other information", "URL"])

    def process_item(self, item, spider):
        if isinstance(item, PuluosiNewsItem):
            # Collect the fields of the item into one row
            line = [item["newstitle"], item["newtiems"], item["newslink"]]
            self.ws.append(line)
            # Saving after every item is simple but rewrites the whole file each time
            self.wb.save("PuluosiNews.xlsx")
        elif isinstance(item, PuluosiItem):
            line = [item["assetstitle"], item["assetaddress"], item["assetgaikuang"],
                    item["assetpeople"], item["asseturl"]]
            self.ws2.append(line)
            self.wb2.save("PuluosiAsset.xlsx")
        return item
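
Scrapy only calls a pipeline that is enabled in the project settings, a step the listing above does not show. A minimal sketch, assuming the project package is called news (as the spider's `from news.items import ...` suggests) and that the class lives in news/pipelines.py; the priority 300 is an arbitrary choice:

```python
# settings.py (fragment)
# Assumption: PuluosiNewsPipeline is defined in news/pipelines.py.
ITEM_PIPELINES = {
    # Values are priorities in the 0-1000 range; lower numbers run first.
    "news.pipelines.PuluosiNewsPipeline": 300,
}
```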

spider:

# -*- coding: utf-8 -*-
import json
import re

import scrapy
from scrapy.linkextractors import LinkExtractor

from news.items import PuluosiNewsItem, PuluosiItem

class PuluosiSpider(scrapy.Spider):
    name = "puluosi"
    allowed_domains = ["glprop.com.cn"]
    # start_urls = ["https://www.glprop.com.cn/press-releases.html"]

    def start_requests(self):
        yield scrapy.Request("https://www.glprop.com.cn/press-releases.html", self.parse1)
        yield scrapy.Request("https://www.glprop.com.cn/in-the-news.html", self.parse2)
        yield scrapy.Request("https://www.glprop.com.cn/proposed-privatization.html", self.parse3)
        yield scrapy.Request("https://www.glprop.com.cn/our-network/network-detail.html", self.parse4)

    def parse1(self, response):
        print("Spider started: puluosi")
        # Skip the first <tr>, which is the table header
        rows = response.xpath("//tbody/tr")[1:]
        for node in rows:
            # Create a fresh item per row so yielded items do not share state
            item = PuluosiNewsItem()
            item["newstitle"] = node.xpath(".//a/text()").extract()[0].strip()
            print(item["newstitle"])
            item["newtiems"] = node.xpath(".//td/text()").extract()[0].strip()
            print(item["newtiems"])
            # urljoin builds an absolute URL, needed when the href is a relative path.
            # Query the current row (node), not the whole row list.
            item["newslink"] = response.urljoin(node.xpath(".//a/@href").extract()[0])
            yield item
        # Check whether the news list for the current year has a next page.
        # "下一" matches the "next page" link text in either Chinese script.
        next_url_tmp = response.xpath(
            '//div[@class="page"]/a[contains(text(),"下一")]/@href'
        ).extract_first()
        if next_url_tmp:
            yield scrapy.Request(response.urljoin(next_url_tmp), callback=self.parse1)
        else:
            print("No next page on the current page")
        # Follow every link in the timeList block as well
        for url1 in response.xpath('//ul[@class="timeList"]/li/a/@href').extract():
            if url1:
                yield scrapy.Request(response.urljoin(url1), callback=self.parse1)

    def parse2(self, response):
        # Skip the first <tr>, which is the table header
        rows = response.xpath("//tbody/tr")[1:]
        for node in rows:
            # Create a fresh item per row so yielded items do not share state
            item = PuluosiNewsItem()
            item["newstitle"] = node.xpath(".//a/text()").extract()[0].strip()
            print(item["newstitle"])
            item["newtiems"] = node.xpath(".//td/text()").extract()[0].strip()
            print(item["newtiems"])
            # urljoin builds an absolute URL, needed when the href is a relative path.
            # Query the current row (node), not the whole row list.
            item["newslink"] = response.urljoin(node.xpath(".//a/@href").extract()[0])
            print(item["newslink"])
            yield item
        # Check whether the news list for the current year has a next page.
        # "下一" matches the "next page" link text in either Chinese script.
        next_url_tmp = response.xpath(
            '//div[@class="page"]/a[contains(text(),"下一")]/@href'
        ).extract_first()
        if next_url_tmp:
            yield scrapy.Request(response.urljoin(next_url_tmp), callback=self.parse2)
        else:
            print("No next page on the current page")
        # Follow every link in the timeList block as well
        for url1 in response.xpath('//ul[@class="timeList"]/li/a/@href').extract():
            if url1:
                yield scrapy.Request(response.urljoin(url1), callback=self.parse2)

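    def parse3(self, response):
        # As in the original, this page drops the last <tr> rather than the first
        rows = response.xpath("//tbody/tr")[:-1]
        for node in rows:
            # Create a fresh item per row so yielded items do not share state
            item = PuluosiNewsItem()
            item["newstitle"] = node.xpath(".//a/text()").extract()[0].strip()
            print(item["newstitle"])
            item["newtiems"] = node.xpath(".//td/text()").extract()[0].strip()
            print(item["newtiems"])
            # urljoin builds an absolute URL, needed when the href is a relative path.
            # Query the current row (node), not the whole row list.
            item["newslink"] = response.urljoin(node.xpath(".//a/@href").extract()[0])
            print(item["newslink"])
            yield item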

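    # NOTE: the original listing defines parse4 twice. This first version followed
    # every city link and handed it to a parse5 callback that is never defined.
    # Python keeps only the later definition of a method, so this one never ran;
    # it is kept here commented out for reference.
    # def parse4(self, response):
    #     link = LinkExtractor(restrict_xpaths='//div[@class="net_pop1"]//div[@class="city"]')
    #     # Collect the links of all cities
    #     for i in link.extract_links(response):
    #         yield scrapy.Request(url=i.url, callback=self.parse5)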

    def parse4(self, response):
        # The cities are embedded in the page as JavaScript: var cities = [...];
        citycode = re.findall("var cities =(.*);", response.text)
        citycodejson = json.loads("".join(citycode))
        # Build a dict that maps each city id to its name
        dictcity = {}
        for i in citycodejson:
            dictcity[i["id"]] = i["name"]
        # The park details are embedded the same way: var parks = {...};
        detail = re.findall("var parks =(.*);", response.text)
        jsonBody = json.loads("".join(detail))
        # parks is a two-level dict; flatten it into a list of park records
        parks = []  # renamed from `list`, which shadowed the built-in
        for key1 in jsonBody:
            for key2 in jsonBody[key1]:
                parks.append(jsonBody[key1][key2])
        for node in parks:
            # Create a fresh item per park so yielded items do not share state
            item = PuluosiItem()
            item["assetaddress"] = dictcity[node["city_id"]]
            item["assetstitle"] = node["name"]
            # Remove all whitespace. The two replace() calls in the source listing
            # look identical, so the exact characters they removed are unknown.
            item["assetgaikuang"] = "".join(node["detail_single"].split())
            # Strip HTML tags from the description, then remove spaces
            item["assetpeople"] = re.sub(r"<.*?>", "", node["description"].strip()).replace(" ", "")
            item["asseturl"] = "https://www.glprop.com.cn/network-city-detail.html?city=" + item["assetaddress"]
            yield item
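
The flattening step in the last callback can be tried outside Scrapy. The sketch below uses invented sample data shaped like the cities and parks variables. The key names name, city_id, detail_single, and description are the ones the spider reads; the values and the helpers city_names, strip_tags, and rows are hypothetical.

```python
import re

# Hypothetical sample data; the real page holds many more entries.
cities = [{"id": "1", "name": "Shanghai"}, {"id": "2", "name": "Beijing"}]
parks = {
    "1": {"101": {"name": "Park A", "city_id": "1",
                  "detail_single": " 10 warehouses ",
                  "description": "<p>Contact: <b>Mr. Li</b></p>"}},
    "2": {"201": {"name": "Park B", "city_id": "2",
                  "detail_single": "5 warehouses",
                  "description": "<div>Contact: Ms. Wang</div>"}},
}

# Map each city id to its name for quick lookup.
city_names = {city["id"]: city["name"] for city in cities}


def strip_tags(html):
    """Remove anything that looks like an HTML tag (crude, regex-based)."""
    return re.sub(r"<.*?>", "", html).strip()


rows = []
for parks_in_group in parks.values():      # outer level of the nested dict
    for park in parks_in_group.values():   # inner level: one record per park
        rows.append({
            "title": park["name"],
            "city": city_names[park["city_id"]],
            "overview": park["detail_single"].strip(),
            "contact": strip_tags(park["description"]),
        })

for row in rows:
    print(row)
```

A regular expression is enough for short fragments like these. For descriptions with nested or malformed markup, a real HTML parser would be the safer choice.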

While I was at it, I also crawled the news pages of the site.
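
The news callbacks rely on response.urljoin to turn the href values in the tables into absolute links. For ordinary pages this resolves URLs in broadly the same way as the standard library's urljoin with the page URL as the base. A small sketch with made-up href values (only the base URL is taken from the spider):

```python
from urllib.parse import urljoin

# One of the start URLs used by the spider.
BASE = "https://www.glprop.com.cn/press-releases.html"

# Hypothetical href values of the kinds that can appear in a page.
root_relative = urljoin(BASE, "/press-releases/2018.html")
page_relative = urljoin(BASE, "page2.html")
already_absolute = urljoin(BASE, "https://www.glprop.com.cn/in-the-news.html")

print(root_relative)     # resolved against the site root
print(page_relative)     # resolved against the directory of the current page
print(already_absolute)  # absolute URLs pass through unchanged
```

Plain string concatenation with the site root only handles the first of these three cases correctly.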

