Python Crawler: A Python Job Market Analysis Report

william, published 2019-07-30 17:41 / 1150 reads


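The previous two posts scraped Qiushibaike and Meizitu and covered the basics of Requests and Beautiful Soup. Both of them, however, filtered the information we needed out of static HTML pages. This post looks at how to fetch the results returned by an Ajax request.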

Follow the official account 【智能制造社区】 for more original content on smart manufacturing and programming.

Python Crawler Basics (2): Scraping Meizitu
Python Crawler Basics (1): Scraping Qiushibaike

This post uses Lagou (拉勾网) as the example to show how to retrieve the content of an Ajax request.

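Goals of this post

Fetch the Ajax request and parse the required fields from the JSON

Save the data to Excel

Save the data to MySQL for easier analysis

Simple analysis

Average salary for Python jobs in five cities

Education requirements for Python jobs

Industry distribution of Python jobs

Company size distribution for Python jobs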
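
As a rough illustration of the salary analysis listed above, the sketch below averages the midpoint of each salary range per city. It assumes salary strings look like "15k-30k"; the exact format returned by the site is not shown in this post. The helper names `parse_salary_k` and `average_salary_by_city` and the sample rows are made up for the example.

```python
import re
from collections import defaultdict

# Assumed salary format: "<low>k-<high>k", e.g. "15k-30k" (case-insensitive).
SALARY_RE = re.compile(r"(\d+)\s*k\s*-\s*(\d+)\s*k", re.IGNORECASE)


def parse_salary_k(text):
    """Return the midpoint of a salary range in thousands, or None if unparseable."""
    match = SALARY_RE.search(text)
    if match is None:
        return None
    low, high = int(match.group(1)), int(match.group(2))
    return (low + high) / 2


def average_salary_by_city(rows):
    """rows: iterable of (city, salary_text). Returns {city: mean midpoint}."""
    totals = defaultdict(lambda: [0.0, 0])  # city -> [sum, count]
    for city, salary_text in rows:
        midpoint = parse_salary_k(salary_text)
        if midpoint is None:
            continue  # skip rows whose salary does not match the assumed format
        totals[city][0] += midpoint
        totals[city][1] += 1
    return {city: total / count for city, (total, count) in totals.items()}


# Made-up sample rows, not real scraped data; "无" stands for a missing salary.
sample = [("北京", "15k-30k"), ("北京", "20k-40k"), ("上海", "10k-20k"), ("上海", "无")]
averages = average_salary_by_city(sample)
```

Using the midpoint is only one possible convention; taking the lower bound of each range would give a more conservative figure.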

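Inspecting the page structure

Enter Python as the search keyword, leave the other filters at their defaults, and click search to list all Python jobs. Then open the browser's developer console and switch to the Network tab, where the request to positionAjax.json shows up.

Judging from the response, this request carries exactly the content we need, so later we can request this address directly. In the response, the individual job postings sit under result.

At this point we know where to request the data and where to read the results. But the result list holds only the 15 entries of the first page. How do we get the other pages?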

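Analyzing the request parameters

Opening the Params tab shows that three form fields are submitted. Clearly kd is the search keyword and pn is the current page number; first can be left at its default and ignored. What remains is to construct the requests and download the data for 30 pages.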
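
To make the three form fields concrete, here is a minimal sketch that builds one payload per page and shows the URL-encoded body an HTTP client would send. The helper name `build_form` is hypothetical, and `first` is fixed to "false" to match the `get_json` function later in this post.

```python
from urllib.parse import urlencode


def build_form(page, keyword):
    """Form fields for one results page; `first` is kept at a fixed value."""
    return {"first": "false", "pn": page, "kd": keyword}


# Payloads for pages 1..30 of a "python" search.
forms = [build_form(page, "python") for page in range(1, 31)]

# urlencode shows the form body for the first page.
first_body = urlencode(forms[0])
print(first_body)
```

Only `pn` changes from one request to the next, which is why the crawler further down can loop over page numbers and reuse everything else.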

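Constructing the request and parsing the data

Constructing the request is simple, and we again use the requests library. First build the form data, data = {"first": "false", "pn": page, "kd": lang_name}, then POST it to the URL with requests and parse the returned JSON. Because Lagou restricts crawlers fairly strictly, we need to send all the header fields the browser sends and lengthen the interval between requests. I set the interval to 10-20 s later on, after which the data comes back normally.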

import requests

def get_json(url, page, lang_name):
    """Return the list of job rows on one results page."""
    # Headers copied from the browser request.
    # Content-Length is omitted: requests computes it from the form body.
    headers = {
        "Host": "www.lagou.com",
        "Connection": "keep-alive",
        "Origin": "https://www.lagou.com",
        "X-Anit-Forge-Code": "0",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "X-Requested-With": "XMLHttpRequest",
        "X-Anit-Forge-Token": "None",
        "Referer": "https://www.lagou.com/jobs/list_python?city=%E5%85%A8%E5%9B%BD&cl=false&fromSearch=true&labelWords=&suginput=",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7"
    }
    data = {"first": "false", "pn": page, "kd": lang_name}
    result = requests.post(url, data=data, headers=headers).json()
    list_con = result["content"]["positionResult"]["result"]
    info_list = []
    for i in list_con:
        info = []
        # "无" (none) is stored when a field is missing
        info.append(i.get("companyShortName", "无"))
        info.append(i.get("companyFullName", "无"))
        info.append(i.get("industryField", "无"))
        info.append(i.get("companySize", "无"))
        info.append(i.get("salary", "无"))
        info.append(i.get("city", "无"))
        info.append(i.get("education", "无"))
        info_list.append(info)
    return info_list
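
If the site throttles a request, the JSON may not contain the content → positionResult → result path, and the direct indexing above would then raise a KeyError. That response shape is an assumption here, not something shown in this post. A minimal sketch of a more tolerant extractor, with the hypothetical names `FIELDS` and `extract_rows`:

```python
FIELDS = ("companyShortName", "companyFullName", "industryField",
          "companySize", "salary", "city", "education")


def extract_rows(payload, default="无"):
    """Turn a decoded JSON payload into rows; return [] if the path is missing."""
    node = payload
    for key in ("content", "positionResult", "result"):
        if not isinstance(node, dict) or key not in node:
            return []
        node = node[key]
    if not isinstance(node, list):
        return []
    return [[job.get(field, default) for field in FIELDS] for job in node]


# Hand-written payloads for illustration only.
ok = {"content": {"positionResult": {"result": [{"city": "北京", "salary": "15k-30k"}]}}}
blocked = {"success": False, "msg": "too many requests"}

print(extract_rows(ok))
print(extract_rows(blocked))
```

Returning an empty list lets the caller decide whether to retry, wait longer, or move on, instead of crashing halfway through a crawl.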

Fetching all the data

Now that we know how to parse the data, all that is left is to request every page in turn. We write a function that requests all 30 pages for each city.

import random
import time

from openpyxl import Workbook

# get_conn() and insert() are defined in the complete code below.

def main():
    lang_name = "python"
    wb = Workbook()
    ws1 = wb.active
    ws1.title = lang_name
    conn = get_conn()
    for i in ["北京", "上海", "广州", "深圳", "杭州"]:
        page = 1
        url = "https://www.lagou.com/jobs/positionAjax.json?city={}&needAddtionalResult=false".format(i)
        while page < 31:
            info = get_json(url, page, lang_name)
            page += 1
            # wait 10-20 s between requests to avoid being blocked
            time.sleep(random.randint(10, 20))
            for row in info:
                insert(conn, tuple(row))
                ws1.append(row)
    conn.close()
    wb.save("{}职位信息.xlsx".format(lang_name))

if __name__ == "__main__":
    main()
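
The loop above always requests 30 pages per city. As a variation, the sketch below stops as soon as a page comes back empty. The function name `crawl_city` and its parameters are hypothetical, and the fetcher and the sleep function are passed in so the loop can be tried without network access or real waiting.

```python
import random
import time


def crawl_city(fetch_page, max_pages=30, delay_range=(10, 20), sleep=time.sleep):
    """Collect rows page by page; stop early when a page comes back empty.

    fetch_page: callable taking a page number and returning a list of rows.
    sleep: injectable so the loop can be exercised without real waiting.
    """
    rows = []
    for page in range(1, max_pages + 1):
        page_rows = fetch_page(page)
        if not page_rows:
            break  # no more results for this city
        rows.extend(page_rows)
        sleep(random.randint(*delay_range))  # random pause between requests
    return rows


# Stand-in fetcher: pretends only 3 pages exist, 2 rows each.
def fake_fetch(page):
    return [[f"company-{page}-{n}"] for n in range(2)] if page <= 3 else []


collected = crawl_city(fake_fetch, sleep=lambda seconds: None)
print(len(collected))
```

With the real crawler, the fetcher could be something like `lambda page: get_json(url, page, lang_name)`.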

Complete code

import random
import time

import requests
from openpyxl import Workbook
import pymysql.cursors


def get_conn():
    """Open the database connection."""
    conn = pymysql.connect(host="localhost",
                           user="root",
                           password="root",
                           db="python",
                           charset="utf8mb4",
                           cursorclass=pymysql.cursors.DictCursor)
    return conn


def insert(conn, info):
    """Write one row to the database."""
    with conn.cursor() as cursor:
        sql = "INSERT INTO `python` (`shortname`, `fullname`, `industryfield`, `companySize`, `salary`, `city`, `education`) VALUES (%s, %s, %s, %s, %s, %s, %s)"
        cursor.execute(sql, info)
    conn.commit()


def get_json(url, page, lang_name):
    """Return the list of job rows on the current page."""
    # Headers copied from the browser request.
    # Content-Length is omitted: requests computes it from the form body.
    headers = {
        "Host": "www.lagou.com",
        "Connection": "keep-alive",
        "Origin": "https://www.lagou.com",
        "X-Anit-Forge-Code": "0",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "X-Requested-With": "XMLHttpRequest",
        "X-Anit-Forge-Token": "None",
        "Referer": "https://www.lagou.com/jobs/list_python?city=%E5%85%A8%E5%9B%BD&cl=false&fromSearch=true&labelWords=&suginput=",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7"
    }
    data = {"first": "false", "pn": page, "kd": lang_name}
    result = requests.post(url, data=data, headers=headers).json()
    list_con = result["content"]["positionResult"]["result"]
    info_list = []
    for i in list_con:
        info = []
        # "无" (none) is stored when a field is missing
        info.append(i.get("companyShortName", "无"))  # company short name
        info.append(i.get("companyFullName", "无"))  # company full name
        info.append(i.get("industryField", "无"))  # industry
        info.append(i.get("companySize", "无"))  # company size
        info.append(i.get("salary", "无"))  # salary
        info.append(i.get("city", "无"))  # city
        info.append(i.get("education", "无"))  # education requirement
        info_list.append(info)
    return info_list  # one list per job posting


def main():
    lang_name = "python"
    wb = Workbook()  # create the Excel workbook
    ws1 = wb.active
    ws1.title = lang_name
    conn = get_conn()  # open the database connection; comment out if not using MySQL
    for i in ["北京", "上海", "广州", "深圳", "杭州"]:  # five cities
        page = 1
        url = "https://www.lagou.com/jobs/positionAjax.json?city={}&needAddtionalResult=false".format(i)
        while page < 31:  # 30 pages per city
            info = get_json(url, page, lang_name)
            page += 1
            time.sleep(random.randint(10, 20))  # wait 10-20 s between requests
            for row in info:
                insert(conn, tuple(row))  # write to MySQL; comment out if not using MySQL
                ws1.append(row)
    conn.close()  # close the database connection; comment out if not using MySQL
    wb.save("{}职位信息.xlsx".format(lang_name))

if __name__ == "__main__":
    main()
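
The `insert` function assumes that a table named `python` already exists in the MySQL database; its schema is not given in this post. The sketch below shows one possible definition. The column types, the `id` column, and the helper name `create_table` are guesses for illustration, while the seven data column names are taken from the INSERT statement above.

```python
# Assumed schema: only the seven data column names come from the INSERT statement.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS `python` (
    `id` INT AUTO_INCREMENT PRIMARY KEY,
    `shortname` VARCHAR(255),
    `fullname` VARCHAR(255),
    `industryfield` VARCHAR(255),
    `companySize` VARCHAR(64),
    `salary` VARCHAR(64),
    `city` VARCHAR(64),
    `education` VARCHAR(64)
) DEFAULT CHARSET = utf8mb4
"""


def create_table(conn):
    """Create the table if needed; conn is a connection such as get_conn() returns."""
    with conn.cursor() as cursor:
        cursor.execute(CREATE_TABLE_SQL)
    conn.commit()
```

Calling `create_table(conn)` once, right after `get_conn()` in `main`, would let the script run against an empty database.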

GitHub repository: https://github.com/injetlee/Python/tree/master/%E7%88%AC%E8%99%AB%E9%9B%86%E5%90%88

If you would like the job data collected by this crawler, follow the official account 【智能制造社区】 and send the message "python岗位".