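What are the top bloggers learning? A Python crawler analyzes the favorites folders of top CSDN bloggers: learn alongside them and you could be the next one!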

Yang_River, published 2021-09-06 15:02 / 1190 reads


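Preface

The computer industry moves so fast that a few days without studying can leave you behind. For programmers, the most important thing is therefore to keep up with industry trends and learn new technologies. Often, though, we do not know what to learn: if a new technology never becomes widely used and stays niche, learning it does little for study or work. That is when we want to know what the top bloggers are learning, since following them makes detours much less likely. So let's look at what the top bloggers on CSDN usually add to their favorites and follow in their footsteps!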

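Program overview

The program crawls CSDN for the public favorites folders of the site's top-ranked bloggers and writes them to a csv file. The collected data is then used to analyze what these bloggers are learning, and the result is shown as a visualization.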

Data crawling

The requests library is used to request the pages, and BeautifulSoup4 and the json library are used to parse the responses.

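Getting the CSDN overall author ranking

First, we need the bloggers on the CSDN ranking and their basic information. Because the data is loaded dynamically (for more on dynamic loading, see the post 《渣男，你為什么有這么多小姐姐的照片？因為我Python爬蟲學的好啊！》), the JSON request can be found in the Network tab of the browser's developer tools: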

Look at the request URLs:

https://blog.csdn.net/phoenix/web/blog/all-rank?page=0&pageSize=20
https://blog.csdn.net/phoenix/web/blog/all-rank?page=1&pageSize=20
...

Each request for the JSON data returns 20 records, so to get the top 100 bloggers the requests are constructed as follows:

    import requests
    from bs4 import BeautifulSoup

    # headers is the request-header dict defined in the complete code
    url_rank_pattern = "https://blog.csdn.net/phoenix/web/blog/all-rank?page={}&pageSize=20"
    for i in range(5):
        url = url_rank_pattern.format(i)
        # Request the page and declare its encoding
        response = requests.get(url=url, headers=headers)
        response.encoding = "utf-8"
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

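Once the JSON data has been received, it is parsed with the json module (the re module would also work; use whichever you prefer) to get the user information. Only userName is needed here, so only userName is parsed; other fields can be extracted as required: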

    userNames = []
    information = json.loads(str(soup))
    for j in information["data"]["allRankListItem"]:
        # Get the user id
        userNames.append(j["userName"])
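Because the endpoint already returns JSON, the HTML parser is not strictly needed for this step. As a minimal sketch, assuming the response keeps the `data.allRankListItem` shape used above, the user names could be read from the decoded payload directly. The function name `extract_usernames` and the sample payload are illustrative and not part of the original program.

```python
import json


def extract_usernames(payload):
    """Return the userName of every entry in a decoded all-rank response."""
    items = payload.get("data", {}).get("allRankListItem", [])
    return [item["userName"] for item in items if "userName" in item]


# Hypothetical sample with the same shape as the response parsed above.
sample_text = '{"data": {"allRankListItem": [{"userName": "user_a"}, {"userName": "user_b"}]}}'
sample_usernames = extract_usernames(json.loads(sample_text))
```

With requests, `response.json()` should give the same dictionary as `json.loads(response.text)`, so either can feed this function.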

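Getting the list of favorites folders

With the userName values in hand, we look at a blogger's home page to see how the list of favorites folders is requested. This article uses the author's own home page as the example (a bit of self-promotion). The analysis is similar to the previous step: switch to the "Favorites" tab on the home page and again use the Network tab of the developer tools:

Look at the URL that requests the folder list: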

https://blog.csdn.net/community/home-api/v1/get-favorites-created-list?page=1&size=20&noMore=false&blogUsername=LOVEmy134611

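The userName obtained in the previous step is used here: replacing the value of blogUsername gives the folder list of each blogger on the ranking. Likewise, when a blogger has more than 20 folders, changing the page value retrieves the complete list: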

    collections = "https://blog.csdn.net/community/home-api/v1/get-favorites-created-list?page=1&size=20&noMore=false&blogUsername={}"
    for userName in userNames:
        url = collections.format(userName)
        # Request the page and declare its encoding
        response = requests.get(url=url, headers=headers)
        response.encoding = "utf-8"
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
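The query string can also be assembled from its parts instead of a format string, which keeps the parameters readable and escapes unusual user names. The sketch below uses the standard library's `urlencode`; the helper name `build_favorites_list_url` and the constant name are hypothetical, while the endpoint and parameter names are the ones observed above.

```python
from urllib.parse import urlencode

FAVORITES_LIST_API = "https://blog.csdn.net/community/home-api/v1/get-favorites-created-list"


def build_favorites_list_url(user_name, page=1, size=20):
    """Build the folder-list URL for one blogger and one page."""
    query = urlencode({
        "page": page,
        "size": size,
        "noMore": "false",
        "blogUsername": user_name,
    })
    return FAVORITES_LIST_API + "?" + query


example_url = build_favorites_list_url("LOVEmy134611")
```

Calling it with `page=2` gives the URL for the second page of folders.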

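Once the JSON data has been received, it is parsed with the json module to get the folder information. Only the folder id is needed here, so only id is parsed; other fields can be extracted as required (for example, the number of followers, to find the most popular folders):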

    file_id_list = []
    information = json.loads(str(soup))
    # Total number of favorites folders
    collection_number = information["data"]["total"]
    # Collect the folder ids
    for j in information["data"]["list"]:
        file_id_list.append(j["id"])
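The total returned with the first page tells us how many further pages to request. A minimal sketch of that arithmetic, assuming 20 folders per page as in the request above (the helper name `page_count` is hypothetical):

```python
import math


def page_count(total, page_size=20):
    """Number of pages needed to cover `total` records; at least one page is always requested."""
    return max(1, math.ceil(total / page_size))


# Pages to request for a blogger with 45 folders.
pages_for_45 = list(range(1, page_count(45) + 1))
```

The same helper applies to the items inside a folder by passing `page_size=200`.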

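You may ask: CSDN currently has both an old and a new home page, so are the requests the same? They are not: in the browser, the old version uses a different endpoint. However, the new-version request also works for those users, so there is no need to distinguish between the two, and the same holds when fetching the favorites themselves.

Getting the favorites data

Finally, clicking a folder's expand button shows its contents, and the request can again be analyzed in the Network tab of the developer tools:

Look at the URL that requests a folder's contents: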

https://blog.csdn.net/community/home-api/v1/get-favorites-item-list?blogUsername=LOVEmy134611&folderId=9406232&page=1&pageSize=200

With the userName and the folder id just obtained, we can construct the request that fetches the items in a folder:

    file_url = "https://blog.csdn.net/community/home-api/v1/get-favorites-item-list?blogUsername={}&folderId={}&page=1&pageSize=200"
    for file_id in file_id_list:
        url = file_url.format(userName, file_id)
        # Request the page and declare its encoding
        response = requests.get(url=url, headers=headers)
        response.encoding = "utf-8"
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

Finally, parse the response with the re module:

    user = user_dict[userName]
    user = preprocess(user)
    # Title
    title_list = analysis(r'"title":"(.*?)",', str(soup))
    # Link
    url_list = analysis(r'"url":"(.*?)"', str(soup))
    # Author
    nickname_list = analysis(r'"nickname":"(.*?)",', str(soup))
    # Date the item was added to favorites
    date_list = analysis(r'"dateTime":"(.*?)",', str(soup))
    for i in range(len(title_list)):
        title = preprocess(title_list[i])
        url = preprocess(url_list[i])
        nickname = preprocess(nickname_list[i])
        date = preprocess(date_list[i])
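Four separate regular expressions produce four independent lists, so one item with a missing field could shift the others out of alignment. An alternative sketch reads each item as a whole from the decoded JSON. It assumes the items sit under `data.list` and carry the `title`, `url`, `nickname` and `dateTime` keys matched by the patterns above; that layout is an assumption, not something confirmed by the article, and `extract_favorites` is a hypothetical name.

```python
def extract_favorites(payload):
    """Return (title, url, nickname, dateTime) tuples for complete items only."""
    keys = ("title", "url", "nickname", "dateTime")
    rows = []
    # Assumed layout: {"data": {"list": [ {...}, {...} ]}}
    for item in payload.get("data", {}).get("list", []):
        row = tuple(item.get(key, "") for key in keys)
        if all(row):  # skip items with any empty or missing field
            rows.append(row)
    return rows


# Hypothetical sample payload for illustration.
sample_payload = {"data": {"list": [
    {"title": "Post A", "url": "https://example.com/a", "nickname": "alice", "dateTime": "2021-09-01"},
    {"title": "Post B", "url": "", "nickname": "bob", "dateTime": "2021-09-02"},
]}}
sample_rows = extract_favorites(sample_payload)
```

Here the second sample item is dropped because its url is empty, mirroring the `if title and url and nickname and date` check in the complete program.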

Complete crawler code

    import time
    import requests
    from bs4 import BeautifulSoup
    import os
    import json
    import re
    import csv

    if not os.path.exists("col_infor.csv"):
        # Create the csv file that stores the data
        file = open("col_infor.csv", "w", encoding="utf-8-sig", newline="")
        csv_head = csv.writer(file)
        # Header row ("anthor" is kept as spelled; the analysis script reads this column name)
        header = ["userName", "title", "url", "anthor", "date"]
        csv_head.writerow(header)
        file.close()

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
    }

    def preprocess(string):
        # Replace commas so that a field cannot break the csv row
        return string.replace(",", " ")

    url_rank_pattern = "https://blog.csdn.net/phoenix/web/blog/all-rank?page={}&pageSize=20"
    userNames = []
    user_dict = {}
    for i in range(5):
        url = url_rank_pattern.format(i)
        # Request the page and declare its encoding
        response = requests.get(url=url, headers=headers)
        response.encoding = "utf-8"
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        information = json.loads(str(soup))
        for j in information["data"]["allRankListItem"]:
            # Get the user id and nickname
            userNames.append(j["userName"])
            user_dict[j["userName"]] = j["nickName"]

    def get_col_list(page, userName):
        collections = "https://blog.csdn.net/community/home-api/v1/get-favorites-created-list?page={}&size=20&noMore=false&blogUsername={}"
        url = collections.format(page, userName)
        # Request the page and declare its encoding
        response = requests.get(url=url, headers=headers)
        response.encoding = "utf-8"
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        information = json.loads(str(soup))
        return information

    def analysis(item, results):
        pattern = re.compile(item, re.I | re.M)
        result_list = pattern.findall(results)
        return result_list

    def get_col(userName, file_id, col_page):
        file_url = "https://blog.csdn.net/community/home-api/v1/get-favorites-item-list?blogUsername={}&folderId={}&page={}&pageSize=200"
        url = file_url.format(userName, file_id, col_page)
        # Request the page and declare its encoding
        response = requests.get(url=url, headers=headers)
        response.encoding = "utf-8"
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # Parse this response so the caller reads the item total of this folder
        # (the original returned the global folder-list data by mistake;
        # this assumes the item-list response also carries data.total)
        information = json.loads(str(soup))
        user = user_dict[userName]
        user = preprocess(user)
        # Title
        title_list = analysis(r'"title":"(.*?)",', str(soup))
        # Link
        url_list = analysis(r'"url":"(.*?)"', str(soup))
        # Author
        nickname_list = analysis(r'"nickname":"(.*?)",', str(soup))
        # Date the item was added to favorites
        date_list = analysis(r'"dateTime":"(.*?)",', str(soup))
        for i in range(len(title_list)):
            title = preprocess(title_list[i])
            url = preprocess(url_list[i])
            nickname = preprocess(nickname_list[i])
            date = preprocess(date_list[i])
            if title and url and nickname and date:
                with open("col_infor.csv", "a+", encoding="utf-8-sig") as f:
                    f.write(user + "," + title + "," + url + "," + nickname + "," + date + "\n")
        return information

    for userName in userNames:
        page = 1
        file_id_list = []
        information = get_col_list(page, userName)
        # Total number of favorites folders
        collection_number = information["data"]["total"]
        # Collect the folder ids
        for j in information["data"]["list"]:
            file_id_list.append(j["id"])
        while collection_number > 20:
            page = page + 1
            collection_number = collection_number - 20
            information = get_col_list(page, userName)
            # Collect the folder ids
            for j in information["data"]["list"]:
                file_id_list.append(j["id"])
        collection_number = 0
        # Fetch the items of every folder
        for file_id in file_id_list:
            col_page = 1
            information = get_col(userName, file_id, col_page)
            number_col = information["data"]["total"]
            while number_col > 200:
                col_page = col_page + 1
                number_col = number_col - 200
                get_col(userName, file_id, col_page)
        number_col = 0
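The program above writes each row by joining strings and replaces commas inside fields so they cannot split a column. As an alternative sketch, the standard `csv.writer` quotes such fields automatically, so titles keep their commas. The helper name `write_rows` and the sample row are illustrative only.

```python
import csv
import io


def write_rows(file_obj, rows):
    """Write rows with csv.writer, which quotes fields that contain commas."""
    writer = csv.writer(file_obj)
    for row in rows:
        writer.writerow(row)


# Demonstrate on an in-memory buffer instead of col_infor.csv.
buffer = io.StringIO()
write_rows(buffer, [("user", "Title, with comma", "https://example.com/a", "author", "2021-09-06")])
written_text = buffer.getvalue()

# Reading the text back recovers the original field, comma included.
read_back = list(csv.reader(io.StringIO(written_text)))
```

When writing to a real file with `csv.writer`, the file should be opened with `newline=""`, as the header-writing code in the program already does.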

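Crawl results

A sample of the crawled results:

Data analysis and visualization

Finally, the wordcloud library is used to draw a word cloud of the bloggers' favorites.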

    from os import path
    from PIL import Image
    import jieba
    from wordcloud import WordCloud, STOPWORDS
    import pandas as pd
    import numpy as np

    df = pd.read_csv("col_infor.csv", encoding="utf-8-sig", usecols=["userName", "title", "url", "anthor", "date"])
    # Join all titles into one text, skipping empty cells
    place_array = df["title"].dropna().astype(str).values
    place_list = "，".join(place_array)

    # Directory of the current file
    d = path.dirname(__file__)
    with open(path.join(d, "text.txt"), "w", encoding="utf-8") as f:
        f.write(place_list)

    # Read the whole text.
    with open(path.join(d, "text.txt"), encoding="utf-8") as f:
        file = f.read()

    # Word segmentation
    # Stop words
    stopwords = ["的", "與", "和", "建議", "收藏", "使用", "了", "實現", "我", "中", "你", "在", "之"]
    text_split = jieba.cut(file)  # segmentation result before stop words are removed
    # Segmentation result with stop words removed
    text_split_no = []
    for word in text_split:
        if word not in stopwords:
            text_split_no.append(word)
    text = " ".join(text_split_no)

    # Background image used as the mask
    picture_mask = np.array(Image.open(path.join(d, "path.jpg")))
    wc_stopwords = set(STOPWORDS)
    wc_stopwords.add("said")
    wc = WordCloud(
        # Set the font by giving its path
        font_path=r"C:/Windows/Fonts/simsun.ttc",
        # font_path=r"/usr/share/fonts/wps-office/simsun.ttc",
        background_color="white",
        max_words=2000,
        mask=picture_mask,
        stopwords=wc_stopwords)
    # Generate the word cloud
    wc.generate(text)
    # Save the image
    wc.to_file(path.join(d, "result.jpg"))
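A word cloud shows relative weight but not exact numbers. As a complementary sketch, the segmented words can be counted with `collections.Counter` to list the most frequent terms. The tokens are assumed to come from a segmenter such as `jieba.cut`; the helper name `top_words` and the sample tokens are hypothetical.

```python
from collections import Counter


def top_words(tokens, stopwords, n=10):
    """Return the n most frequent tokens, ignoring blanks and stop words."""
    cleaned = (token.strip() for token in tokens)
    counts = Counter(token for token in cleaned if token and token not in stopwords)
    return counts.most_common(n)


# Hypothetical tokens standing in for the output of a segmenter.
sample_tokens = ["Python", "crawler", "the", "Python", " ", "data", "crawler", "Python"]
sample_top = top_words(sample_tokens, {"the"}, n=2)
```

The counts from `Counter` could also be passed as a dictionary to WordCloud's `generate_from_frequencies` instead of generating the cloud from joined text.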


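Copyright belongs to the author; do not repost without permission. If this article violates any rules, you can contact the administrator to have it removed.

When reposting, please cite this article's address: http://specialneedsforspecialkids.com/yun/119310.html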



