2016-04-10
Scrapy crawler: collecting Zhihu user data

Installing the Scrapy framework. How to install Python and Scrapy itself is not covered here; please search online for instructions.
Initialization. Once Scrapy is installed, run: scrapy startproject myspider
You will then see a myspider folder with the following directory structure:
scrapy.cfg
myspider/
    items.py
    pipelines.py
    settings.py
    __init__.py
    spiders/
        __init__.py
Writing the spider. Create users.py under the spiders directory:
# -*- coding: utf-8 -*-
import scrapy
import os
import time

from myspider.items import UserItem
from myspider.myconfig import UsersConfig  # crawler configuration


class UsersSpider(scrapy.Spider):
    name = 'users'
    domain = 'https://www.zhihu.com'
    login_url = 'https://www.zhihu.com/login/email'
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        'Host': 'www.zhihu.com',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36'
    }

    def __init__(self, url=None):
        super(UsersSpider, self).__init__()
        self.user_url = url

    def start_requests(self):
        # fetch the home page first to obtain cookies and the _xsrf token
        yield scrapy.Request(
            url=self.domain,
            headers=self.headers,
            meta={
                'proxy': UsersConfig['proxy'],
                'cookiejar': 1
            },
            callback=self.request_captcha
        )

    def request_captcha(self, response):
        # extract the _xsrf token from the page
        _xsrf = response.css('input[name="_xsrf"]::attr(value)').extract()[0]
        # build the captcha URL
        captcha_url = 'http://www.zhihu.com/captcha.gif?r=' + str(time.time() * 1000)
        # download the captcha image
        yield scrapy.Request(
            url=captcha_url,
            headers=self.headers,
            meta={
                'proxy': UsersConfig['proxy'],
                'cookiejar': response.meta['cookiejar'],
                '_xsrf': _xsrf
            },
            callback=self.download_captcha
        )

    def download_captcha(self, response):
        # save the captcha image to disk
        with open('captcha.gif', 'wb') as fp:
            fp.write(response.body)
        # open the image with the system's default viewer
        os.system('start captcha.gif')
        # read the captcha from the terminal
        print 'Please enter captcha: '
        captcha = raw_input()
        # submit the login form
        yield scrapy.FormRequest(
            url=self.login_url,
            headers=self.headers,
            formdata={
                'email': UsersConfig['email'],
                'password': UsersConfig['password'],
                '_xsrf': response.meta['_xsrf'],
                'remember_me': 'true',
                'captcha': captcha
            },
            meta={
                'proxy': UsersConfig['proxy'],
                'cookiejar': response.meta['cookiejar']
            },
            callback=self.request_zhihu
        )

    def request_zhihu(self, response):
        # queue the seed user's profile page plus their followee/follower lists
        for path, callback in [('/about', self.user_item),
                               ('/followees', self.user_start),
                               ('/followers', self.user_start)]:
            yield scrapy.Request(
                url=self.user_url + path,
                headers=self.headers,
                meta={
                    'proxy': UsersConfig['proxy'],
                    'cookiejar': response.meta['cookiejar'],
                    'from': {'sign': 'else', 'data': {}}
                },
                callback=callback,
                dont_filter=True
            )

    def user_start(self, response):
        sel_root = response.xpath('//h2[@class="zm-list-content-title"]')
        # skip if the followee/follower list is empty
        if len(sel_root):
            for sel in sel_root:
                people_url = sel.xpath('a/@href').extract()[0]
                # for each listed user, crawl their profile and both lists
                for path, callback in [('/about', self.user_item),
                                       ('/followees', self.user_start),
                                       ('/followers', self.user_start)]:
                    yield scrapy.Request(
                        url=people_url + path,
                        headers=self.headers,
                        meta={
                            'proxy': UsersConfig['proxy'],
                            'cookiejar': response.meta['cookiejar'],
                            'from': {'sign': 'else', 'data': {}}
                        },
                        callback=callback,
                        dont_filter=True
                    )

    def user_item(self, response):
        def value(lst):
            # return the first extracted value, or '' if nothing matched
            return lst[0] if len(lst) else ''

        sel = response.xpath('//div[@class="zm-profile-header ProfileCard"]')
        item = UserItem()
        item['url'] = response.url[:-6]  # strip the trailing '/about'
        item['name'] = sel.xpath('//a[@class="name"]/text()').extract()[0].encode('utf-8')
        item['bio'] = value(sel.xpath('//span[@class="bio"]/@title').extract()).encode('utf-8')
        item['location'] = value(sel.xpath('//span[contains(@class, "location")]/@title').extract()).encode('utf-8')
        item['business'] = value(sel.xpath('//span[contains(@class, "business")]/@title').extract()).encode('utf-8')
        item['gender'] = 0 if sel.xpath('//i[contains(@class, "icon-profile-female")]') else 1
        item['avatar'] = value(sel.xpath('//img[@class="Avatar Avatar--l"]/@src').extract())
        item['education'] = value(sel.xpath('//span[contains(@class, "education")]/@title').extract()).encode('utf-8')
        item['major'] = value(sel.xpath('//span[contains(@class, "education-extra")]/@title').extract()).encode('utf-8')
        item['employment'] = value(sel.xpath('//span[contains(@class, "employment")]/@title').extract()).encode('utf-8')
        item['position'] = value(sel.xpath('//span[contains(@class, "position")]/@title').extract()).encode('utf-8')
        item['content'] = value(sel.xpath('//span[@class="content"]/text()').extract()).strip().encode('utf-8')
        item['ask'] = int(sel.xpath('//div[contains(@class, "profile-navbar")]/a[2]/span[@class="num"]/text()').extract()[0])
        item['answer'] = int(sel.xpath('//div[contains(@class, "profile-navbar")]/a[3]/span[@class="num"]/text()').extract()[0])
        item['agree'] = int(sel.xpath('//span[@class="zm-profile-header-user-agree"]/strong/text()').extract()[0])
        item['thanks'] = int(sel.xpath('//span[@class="zm-profile-header-user-thanks"]/strong/text()').extract()[0])
        yield item

Adding the crawler configuration file
Create myconfig.py under the myspider directory and fill in your own settings where indicated:
# -*- coding: utf-8 -*-

UsersConfig = {
    # proxy server (leave empty to use none)
    'proxy': '',
    # Zhihu login email and password
    'email': 'your email',
    'password': 'your password',
}

DbConfig = {
    # database connection settings
    'user': 'db user',
    'passwd': 'db password',
    'db': 'db name',
    'host': 'db host',
}

Modifying items.py
# -*- coding: utf-8 -*-
import scrapy


class UserItem(scrapy.Item):
    # define the fields for your item here, like:
    url = scrapy.Field()
    name = scrapy.Field()
    bio = scrapy.Field()
    location = scrapy.Field()
    business = scrapy.Field()
    gender = scrapy.Field()
    avatar = scrapy.Field()
    education = scrapy.Field()
    major = scrapy.Field()
    employment = scrapy.Field()
    position = scrapy.Field()
    content = scrapy.Field()
    ask = scrapy.Field()
    answer = scrapy.Field()
    agree = scrapy.Field()
    thanks = scrapy.Field()

Storing the user data in MySQL
Modify pipelines.py:
# -*- coding: utf-8 -*-
import MySQLdb
import datetime

from myspider.myconfig import DbConfig


class UserPipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect(user=DbConfig['user'], passwd=DbConfig['passwd'],
                                    db=DbConfig['db'], host=DbConfig['host'],
                                    charset='utf8', use_unicode=True)
        self.cursor = self.conn.cursor()
        # uncomment to empty the table before a fresh crawl
        # self.cursor.execute('truncate table users;')
        # self.conn.commit()

    def process_item(self, item, spider):
        curTime = datetime.datetime.now()
        try:
            self.cursor.execute(
                '''INSERT IGNORE INTO users
                   (url, name, bio, location, business, gender, avatar, education,
                    major, employment, position, content, ask, answer, agree, thanks, create_at)
                   VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)''',
                (
                    item['url'], item['name'], item['bio'], item['location'],
                    item['business'], item['gender'], item['avatar'], item['education'],
                    item['major'], item['employment'], item['position'], item['content'],
                    item['ask'], item['answer'], item['agree'], item['thanks'],
                    curTime
                )
            )
            self.conn.commit()
        except MySQLdb.Error, e:
            print 'Error %d: %s' % (e.args[0], e.args[1])
        return item

Modifying settings.py
Find ITEM_PIPELINES and change it to:
ITEM_PIPELINES = {
    'myspider.pipelines.UserPipeline': 300,
}
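The value 300 is an order key: when several item pipelines are enabled, Scrapy passes each item through them in ascending order of this number (conventionally 0 to 1000). A small illustration of how the values are interpreted, using a second, purely hypothetical pipeline entry:

```python
# Two enabled pipelines; "CleanupPipeline" is hypothetical and is only
# here to show how the order values decide execution order.
ITEM_PIPELINES = {
    "myspider.pipelines.UserPipeline": 300,
    "myspider.pipelines.CleanupPipeline": 100,
}

# Scrapy runs items through pipelines sorted by their order value,
# so CleanupPipeline (100) would run before UserPipeline (300).
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

With only UserPipeline enabled, as in this project, the exact number does not matter; it only becomes relevant once more pipelines are added.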
Then add the following at the end to limit the crawl depth:
DEPTH_LIMIT = 10

Crawling the Zhihu user data
Make sure MySQL is running, open a terminal in the project root, and run:

scrapy crawl users -a url=https://www.zhihu.com/people/
Here the URL identifies the seed user; from that user, the crawler fans out through their followees and followers.
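Note that every request in users.py is issued with dont_filter=True, which bypasses Scrapy's built-in duplicate filter, so the same profile can be scheduled many times and only DEPTH_LIMIT caps the fan-out. If that becomes a problem, one possible guard is a seen set kept by the spider; the helper below is a hypothetical sketch, not part of the original code:

```python
# Hypothetical in-spider guard: remember which profile URLs have already
# been scheduled so the followee/follower fan-out does not revisit them.
seen_profiles = set()

def should_schedule(people_url):
    """Return True the first time a profile URL is seen, False afterwards."""
    if people_url in seen_profiles:
        return False
    seen_profiles.add(people_url)
    return True

print(should_schedule("https://www.zhihu.com/people/example-user"))  # first visit
print(should_schedule("https://www.zhihu.com/people/example-user"))  # duplicate
```

Calling should_schedule(people_url) before each yield in user_start would then skip already-visited users, at the cost of never re-crawling updated profiles.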
The captcha image will be downloaded next; if it does not open automatically, open captcha.gif in the project root yourself and type the characters into the terminal.
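Before the first run, the users table that UserPipeline inserts into must already exist in the configured database. The original post does not show the schema, so the DDL below is a reconstruction: the column names come from the INSERT statement, but every column type, the id key, and the UNIQUE constraint are assumptions.

```python
# Reconstructed DDL for the `users` table. Column names mirror the INSERT
# in pipelines.py; types, the id column, and the unique key are guesses.
CREATE_USERS_SQL = """
CREATE TABLE IF NOT EXISTS users (
    id INT AUTO_INCREMENT PRIMARY KEY,
    url VARCHAR(255) NOT NULL UNIQUE,
    name VARCHAR(64),
    bio VARCHAR(255),
    location VARCHAR(64),
    business VARCHAR(64),
    gender TINYINT,
    avatar VARCHAR(255),
    education VARCHAR(64),
    major VARCHAR(64),
    employment VARCHAR(64),
    position VARCHAR(64),
    content VARCHAR(255),
    ask INT,
    answer INT,
    agree INT,
    thanks INT,
    create_at DATETIME
) DEFAULT CHARSET=utf8;
"""

# The INSERT IGNORE in pipelines.py supplies exactly these 17 values; a
# UNIQUE url column is what lets IGNORE skip users already stored.
INSERT_COLUMNS = ("url", "name", "bio", "location", "business", "gender",
                  "avatar", "education", "major", "employment", "position",
                  "content", "ask", "answer", "agree", "thanks", "create_at")
assert len(INSERT_COLUMNS) == 17
assert all(col in CREATE_USERS_SQL for col in INSERT_COLUMNS)
```

Executing this SQL once, for example via the mysql command-line client, should be enough before starting scrapy crawl.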
The full source code can be found on github.