Scraping Zhihu Questions with Scrapy

xiao7cn, posted 2019-07-24 18:26 / 2520 views

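Summary: Starting from the questions under the "如何評價" ("How would you rate ...") topic, the spider crawls each question's related questions in a loop. For every question it collects the title, follower count, answer count, and similar data. To log in, the code sets the username and password, reads the _xsrf value, downloads and opens the captcha image, asks the user to type the captcha, and then submits the login form.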

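Start by scraping the questions listed under the "如何評價 X" ("How would you rate X") topic, then follow the related questions of each one and repeat.

For each question, scrape the title, the number of followers, the number of answers, and similar data.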
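
The crawl described above is a graph traversal: questions are nodes and "related question" links are edges. The sketch below shows that idea in plain Python, without Scrapy. The `related` dictionary and the `crawl_order` function are hypothetical stand-ins for illustration; in the real spider the links come from the downloaded pages, and Scrapy's default duplicate filter plays the role of the `seen` set.

```python
from collections import deque


def crawl_order(start_questions, related, limit=100):
    """Return question ids in the order a breadth-first crawl visits them."""
    seen = set(start_questions)      # ids that are already queued or visited
    queue = deque(start_questions)   # ids waiting to be visited
    visited = []
    while queue and len(visited) < limit:
        qid = queue.popleft()
        visited.append(qid)
        # Queue every related question that has not been seen yet
        for next_qid in related.get(qid, []):
            if next_qid not in seen:
                seen.add(next_qid)
                queue.append(next_qid)
    return visited


# Hypothetical link graph: question id -> ids of its related questions
related = {"1": ["2", "3"], "2": ["1", "4"], "3": ["4"], "4": []}
order = crawl_order(["1"], related)
print(order)
```

The `limit` argument is only there to keep the sketch bounded; without some stopping rule, a crawl that keeps following related questions can run for a very long time.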

zhihuTopicSpider.py

# -*- coding: utf-8 -*-

import scrapy
import os
import time
import re
import json

from ..items import zhihuQuestionItem

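# mode 1: tencent (go through the proxy)   2: free (no proxy)
mode = 2
# None means "no proxy"; an empty string is not a valid proxy URL
proxy = "https://web-proxy.oa.com:8080" if mode == 1 else None

# Set your username and password
email = "youremail"
password = "yourpassword"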


class zhihu_topicSpider(scrapy.Spider):
    name = "zhihu_topicSpider"
    zhihu_url = "https://www.zhihu.com"
    login_url = "https://www.zhihu.com/login/email"
    topic = "https://www.zhihu.com/topic"
    domain = "https://www.zhihu.com"

    # Set the request headers
    headers_dict = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Connection": "keep-alive",
        "Host": "www.zhihu.com",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36"
    }

    def start_requests(self):
        yield scrapy.Request(
            url=self.zhihu_url,
            headers=self.headers_dict,
            meta={
                "proxy": proxy,
                "cookiejar": 1
            },
            callback=self.request_captcha
        )

    def request_captcha(self, response):
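        # Get the _xsrf value from the hidden form field
        _xsrf = response.css('input[name="_xsrf"]::attr(value)').extract()[0]
        # Build the captcha URL (the millisecond timestamp acts as a cache buster)
        captcha_url = "http://www.zhihu.com/captcha.gif?r=" + str(time.time() * 1000)
        # Request the captcha image, carrying _xsrf along in meta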
        yield scrapy.Request(
            url=captcha_url,
            headers=self.headers_dict,
            meta={
                "proxy": proxy,
                "cookiejar": response.meta["cookiejar"],
                "_xsrf": _xsrf
            },
            callback=self.download_captcha
        )

    def download_captcha(self, response):
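        # Save the captcha image to disk
        with open("captcha.gif", "wb") as fp:
            fp.write(response.body)
        # Open the captcha image (the "open" command is macOS-specific)
        os.system("open captcha.gif")
        # Read the captcha typed in by the user
        # (input() is Python 3; on Python 2 use raw_input() instead)
        captcha = input("Please enter the captcha: ")
        # Submit the account, password, and captcha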
        yield scrapy.FormRequest(
            url=self.login_url,
            headers=self.headers_dict,
            formdata={
                "email": email,
                "password": password,
                "_xsrf": response.meta["_xsrf"],
                "remember_me": "true",
                "captcha": captcha
            },
            meta={
                "proxy": proxy,
                "cookiejar": response.meta["cookiejar"],
            },
            callback=self.request_zhihu
        )

    def request_zhihu(self, response):
        """
            Now logged in; request the topic page on www.zhihu.com
        """
        yield scrapy.Request(url=self.topic + "/19760570",
                             headers=self.headers_dict,
                             meta={
                                 "proxy": proxy,
                                 "cookiejar": response.meta["cookiejar"],
                             },
                             callback=self.get_topic_question,
                             dont_filter=True)

    def get_topic_question(self, response):
        # with open("topic.html", "wb") as fp:
        #     fp.write(response.body)
        # Collect the question URLs listed under the topic
        question_urls = response.css(".question_link[target=_blank]::attr(href)").extract()
        # Keep every third link (indices 2, 5, 8, ...), as the original loop did
        temp = question_urls[2::3]
        for url in temp:
            yield scrapy.Request(url = self.zhihu_url+url,
                    headers = self.headers_dict,
                    meta = {
                        "proxy": proxy,
                        "cookiejar": response.meta["cookiejar"],
                    },
                    callback = self.parse_question_data)

    def parse_question_data(self, response):
        item = zhihuQuestionItem()
        item["qid"] = re.search(r"\d+", response.url).group()
        item["title"] = response.css(".zm-item-title::text").extract()[0].strip()
        item["answers_num"] = response.css("h3::attr(data-num)").extract()[0]
        question_nums = response.css(".zm-side-section-inner .zg-gray-normal strong::text").extract()
        item["followers_num"] = question_nums[0]
        item["visitsCount"] = question_nums[1]
        item["topic_views"] = question_nums[2]
        topic_tags = response.css(".zm-item-tag::text").extract()
        # Keep at most three tags and pad missing ones with "-",
        # so all three fields are always set (even when no tag is found)
        tags = [tag.strip() for tag in topic_tags[:3]]
        tags += ["-"] * (3 - len(tags))
        item["topic_tag0"], item["topic_tag1"], item["topic_tag2"] = tags
        # print type(item["title"])
        question_links = response.css(".question_link::attr(href)").extract()
        yield item
        for url in question_links:
            yield scrapy.Request(url = self.zhihu_url+url,
                    headers = self.headers_dict,
                    meta = {
                        "proxy": proxy,
                        "cookiejar": response.meta["cookiejar"],
                    },
                    callback = self.parse_question_data)
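
Two of the parsing steps in `parse_question_data` can be tried out without running the spider: pulling the question id out of the URL, and padding the tag list to exactly three entries. The helpers below are a minimal sketch, not part of the original spider. `extract_qid` and `pad_tags` are hypothetical names, and the `/question/<digits>` URL shape is an assumption based on the spider taking the first run of digits in the URL.

```python
import re


def extract_qid(url):
    """Return the numeric question id from a question URL, or None."""
    # Assumes question pages look like .../question/<digits>
    match = re.search(r"/question/(\d+)", url)
    return match.group(1) if match else None


def pad_tags(tags, size=3, filler="-"):
    """Strip the tags, keep at most `size`, and pad the rest with `filler`."""
    cleaned = [tag.strip() for tag in tags[:size]]
    return cleaned + [filler] * (size - len(cleaned))


print(extract_qid("https://www.zhihu.com/question/12345678"))
print(pad_tags([" python ", "scrapy"]))
```

Anchoring the pattern on `/question/` means a topic URL such as `https://www.zhihu.com/topic/19760570` returns `None` instead of being mistaken for a question id.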

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

# import json
import MySQLdb

# class JsonDumpPipeline(object):
#     def process_item(self, item, spider):
#         with open("d.json", "a") as fp:
#             fp.write(json.dumps(dict(item), ensure_ascii = False).encode("utf-8") + "\n")



class MySQLPipeline(object):
    print("\n" * 8)  # blank lines as a visual separator in the console output
    # The %s placeholders are filled in by cursor.execute(), which escapes
    # quotes and other special characters in the values
    sql_questions = (
            "INSERT INTO questions("
            "qid, title, answers_num, followers_num, visitsCount, topic_views, topic_tag0, topic_tag1, topic_tag2) "
            "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)")
    count = 0

    def open_spider(self, spider):
        host = "localhost"
        user = "root"
        password = "wangqi"
        dbname = "zh"
        self.conn = MySQLdb.connect(host, user, password, dbname)
        self.cursor = self.conn.cursor()
        self.conn.set_character_set("utf8")
        self.cursor.execute("SET NAMES utf8;")
        self.cursor.execute("SET CHARACTER SET utf8;")
        self.cursor.execute("SET character_set_connection=utf8;")
        print("\n\nMYSQL DB CURSOR INIT SUCCESS!!\n\n")
        sql = (
            "CREATE TABLE IF NOT EXISTS questions ("
                "qid VARCHAR (100) NOT NULL,"
                "title varchar(100),"
                "answers_num INT(11),"
                "followers_num INT(11) NOT NULL,"
                "visitsCount INT(11),"
                "topic_views INT(11),"
                "topic_tag0 VARCHAR (600),"
                "topic_tag1 VARCHAR (600),"
                "topic_tag2 VARCHAR (600),"
                "PRIMARY KEY (qid)"
            ")")
        self.cursor.execute(sql)
        print("\n\nTABLES ARE READY!\n\n")

    def process_item(self, item, spider):
        params = (item["qid"], item["title"], item["answers_num"], item["followers_num"],
                  item["visitsCount"], item["topic_views"], item["topic_tag0"], item["topic_tag1"], item["topic_tag2"])
        # Pass the values separately instead of formatting them into the SQL string
        self.cursor.execute(self.sql_questions, params)
        # Commit once every 10 items
        if self.count % 10 == 0:
            self.conn.commit()
        self.count += 1
        print(item["qid"] + " DATA COLLECTED!")
        # Return the item so that any later pipeline still receives it
        return item

    def close_spider(self, spider):
        # Commit whatever is left from the last batch, then release the connection
        self.conn.commit()
        self.cursor.close()
        self.conn.close()
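
Question titles often contain quotes, which breaks SQL that is assembled with string formatting. Passing the values as parameters lets the database driver do the escaping. The sketch below demonstrates this with the standard-library `sqlite3` module so it runs without a MySQL server. This is a substitution for illustration only: the pipeline above uses MySQLdb, whose placeholder is `%s` rather than `?`. The `save_question` helper and the reduced three-column table are hypothetical.

```python
import sqlite3

# In-memory database with a reduced version of the questions table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE questions ("
    "qid TEXT PRIMARY KEY, title TEXT, answers_num INTEGER)"
)


def save_question(conn, item):
    """Insert one item, letting the driver escape the values."""
    conn.execute(
        "INSERT INTO questions (qid, title, answers_num) VALUES (?, ?, ?)",
        (item["qid"], item["title"], item["answers_num"]),
    )
    conn.commit()


# A title with both kinds of quotes is stored unchanged
save_question(conn, {"qid": "1", "title": "What's \"Scrapy\"?", "answers_num": 12})
rows = conn.execute("SELECT qid, title, answers_num FROM questions").fetchall()
print(rows)
```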

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Field


class zhihuQuestionItem(scrapy.Item):
    qid = Field()
    title = Field()
    followers_num = Field()
    answers_num = Field()
    visitsCount = Field()
    topic_views = Field()
    topic_tag0 = Field()
    topic_tag1 = Field()
    topic_tag2 = Field()
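
For the MySQL pipeline to receive items, it has to be listed in the project's settings.py, as the comment at the top of pipelines.py points out. The fragment below is a minimal sketch; the package name `zhihu` is an assumption, since the post does not say what the Scrapy project is called.

```python
# settings.py (fragment); "zhihu" is an assumed project package name
BOT_NAME = "zhihu"

# Enable the pipeline; the number (0-1000) sets the order among pipelines
ITEM_PIPELINES = {
    "zhihu.pipelines.MySQLPipeline": 300,
}

# Cookies must stay enabled, because the login relies on the cookiejar
COOKIES_ENABLED = True
```

With that in place, the spider can be started from the project directory with `scrapy crawl zhihu_topicSpider`, which uses the `name` attribute defined in the spider class.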

