Python Web Scraping in Practice (1): Using requests and BeautifulSoup

jokester, published 2019-07-30 15:10 / 1807 views

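Python basics

I previously wrote "Python 3 極簡教程.pdf" (a minimal Python 3 tutorial), aimed at readers who already have some programming background and want a quick start. After working through that series you should be able to write an API endpoint on your own and build small projects without trouble.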

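requests

requests is a Python HTTP client library, comparable to Retrofit on Android. Its features include Keep-Alive and connection pooling, cookie persistence, automatic content decompression, HTTP proxies, SSL verification, connection timeouts, and sessions. At the time of writing it supported both Python 2 and Python 3. GitHub: https://github.com/requests/r... .

Installation

Mac:

pip3 install requests

Windows:

pip install requests

Sending requests

The HTTP methods used here are GET, POST, PUT, and DELETE.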

import requests

# GET request
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all")

# POST request
response = requests.post("http://127.0.0.1:1024/developer/api/v1.0/insert")

# PUT request
response = requests.put("http://127.0.0.1:1024/developer/api/v1.0/update")

# DELETE request
response = requests.delete("http://127.0.0.1:1024/developer/api/v1.0/delete")

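Each call returns a Response object, which wraps the data the server sends back in its HTTP response. Its main elements include the status code, reason phrase, response headers, response URL, response encoding, and response body.

# Status code
print(response.status_code)

# Response URL
print(response.url)

# Reason phrase
print(response.reason)

# Response body parsed as JSON
print(response.json())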

Custom request headers

To add HTTP headers to a request, pass a dict to the headers keyword argument.

header = {"Application-Id": "19869a66c6",
          "Content-Type": "application/json"
          }
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all/", headers=header)
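When the same headers are needed on every request, a Session can carry them. The following is a minimal sketch rather than part of the original project: it reuses the Application-Id value from above and only prepares a request (nothing is sent), so it runs without a server.

```python
import requests

# Assumption: the same headers apply to every call to this API.
session = requests.Session()
session.headers.update({
    "Application-Id": "19869a66c6",
    "Content-Type": "application/json",
})

# Prepare (but do not send) a request to inspect the merged headers.
request = requests.Request("GET", "http://127.0.0.1:1024/developer/api/v1.0/all")
prepared = session.prepare_request(request)

print(prepared.headers["Application-Id"])
print(prepared.headers["Content-Type"])
```

Calling session.get(...) instead would send the request with the same merged headers.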

Building query parameters

To pass data in the URL's query string, for example http://127.0.0.1:1024/developer/api/v1.0/all?key1=value1&key2=value2 , pass a dict of strings to the params keyword argument and Requests builds the query string for you.

payload = {"key1": "value1", "key2": "value2"}
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", params=payload)

A list can also be passed as a value:

payload = {"key1": "value1", "key2": ["value2", "value3"]}
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", params=payload)

# Response URL
print(response.url)  # prints: http://127.0.0.1:1024/developer/api/v1.0/all?key1=value1&key2=value2&key2=value3
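To see how params is encoded without a running server, the request can be prepared instead of sent. This is a small sketch using the same payload; Request and prepare() are part of requests, and nothing else is assumed.

```python
import requests

payload = {"key1": "value1", "key2": ["value2", "value3"]}

# Build the request without sending it, then inspect the final URL.
prepared = requests.Request(
    "GET",
    "http://127.0.0.1:1024/developer/api/v1.0/all",
    params=payload,
).prepare()

print(prepared.url)
```

Each item of the list becomes its own key2=... pair in the query string.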

Sending data with POST

If the server expects form-encoded data, pass it with the data keyword argument.

payload = {"key1": "value1", "key2": "value2"}
response = requests.post("http://127.0.0.1:1024/developer/api/v1.0/insert", data=payload)

If the server expects a JSON body, use the json keyword argument instead. The value can be passed as a dict, and Requests serializes it to JSON.

obj = {
    "article_title": "小公務員之死2"
}
response = requests.post("http://127.0.0.1:1024/developer/api/v1.0/insert", json=obj)
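The difference between data and json is easiest to see in the prepared request. The sketch below, which sends nothing over the network, prints the Content-Type header and body that each keyword argument produces.

```python
import json

import requests

url = "http://127.0.0.1:1024/developer/api/v1.0/insert"

# data= produces a form-encoded body.
form_request = requests.Request(
    "POST", url, data={"key1": "value1", "key2": "value2"}
).prepare()

# json= serializes the dict and sets the JSON content type.
obj = {"article_title": "小公務員之死2"}
json_request = requests.Request("POST", url, json=obj).prepare()

print(form_request.headers["Content-Type"], form_request.body)
print(json_request.headers["Content-Type"], json.loads(json_request.body))
```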

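Response content

Requests automatically decodes content from the server, and most Unicode character sets are decoded seamlessly. After a request is made, Requests makes an educated guess about the response encoding based on the HTTP headers.

# Response content
# Body decoded to str (text is a property, not a method)
# print(response.text)
# Body parsed as JSON
print(response.json())
# Body as raw bytes (content is a property as well)
# print(response.content)
# Raw response stream; requires stream=True on the initial request
# response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", stream=True)
# print(response.raw)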

Timeouts

Unless you explicitly set a timeout, requests never times out on its own. If the server does not respond, the whole program stays blocked and cannot do any other work.

response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", timeout=5)  # in seconds
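A timeout only helps if the resulting exception is handled. The helper below is one possible sketch, not the author's code: fetch_json is a hypothetical name, and the (connect, read) timeout values are illustrative.

```python
import requests


def fetch_json(url, timeout=(3, 5)):
    """Return the decoded JSON body, or None if the request fails.

    timeout is (connect seconds, read seconds); the values are illustrative.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx statuses into an exception
        return response.json()
    except requests.exceptions.RequestException as exc:
        # Covers timeouts, refused connections, invalid URLs, and HTTP errors.
        print("request failed: %s" % exc)
        return None
    except ValueError as exc:
        # The body was not valid JSON.
        print("invalid JSON: %s" % exc)
        return None
```

A caller can then treat None as "no data this time" and try again on the next run.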

Proxy settings

If you hit a site too frequently, the server is likely to block you. requests can route traffic through proxies.

# Proxies
proxies = {
    "http": "http://127.0.0.1:1024",
    "https": "http://127.0.0.1:4000",
}
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", proxies=proxies)

BeautifulSoup

BeautifulSoup is a Python HTML parsing library, comparable to jsoup in Java.

Installation

BeautifulSoup 3 is no longer developed, so use BeautifulSoup 4.

Mac:

pip3 install beautifulsoup4

Windows:

pip install beautifulsoup4

Installing a parser

I use html5lib, which is implemented in pure Python.

Mac:

pip3 install html5lib

Windows:

pip install html5lib

Basic usage

BeautifulSoup turns a complex HTML document into a tree in which every node is a Python object.

Parsing

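from bs4 import BeautifulSoup

def get_html_data():
    # Sample document; the GitHub link's href is not given here, so it is omitted.
    html_doc = """
    <html>
    <head>
    <title>WuXiaolong</title>
    </head>
    <body>
    <p>分享 Android 技術，也關注 Python 等熱門技術。</p>
    <p>寫博客的初衷：總結經驗，記錄自己的成長。</p>
    <p>你必須足夠的努力，才能看起來毫不費力！專注！精致！</p>
    <p><a href="http://wuxiaolong.me/">WuXiaolong's blog</a></p>
    <p class="WeChat">公眾號：吳小龍同學</p>
    <p><a>GitHub</a></p>
    </body>
    </html>
    """
    soup = BeautifulSoup(html_doc, "html5lib")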

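Tags

tag = soup.head
print(tag)  # the whole <head> element, including <title>WuXiaolong</title>
print(tag.name)  # head
print(tag.title)  # <title>WuXiaolong</title>
print(soup.p)  # <p>分享 Android 技術，也關注 Python 等熱門技術。</p>
print(soup.a["href"])  # href attribute of the first <a> tag: http://wuxiaolong.me/

Note: when several tags match, attribute access returns only the first one, as with the p tag here.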

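Searching

print(soup.find("p"))  # <p>分享 Android 技術，也關注 Python 等熱門技術。</p>

find also returns only the first matching tag, and returns None when nothing matches. To target a specific element, such as the WeChat official account paragraph here, filter on an attribute such as class:

# class is a reserved word in Python, so the keyword argument is spelled class_.
print(soup.find("p", class_="WeChat"))
# prints the <p class="WeChat"> element (公眾號)

Find all p tags:

for p in soup.find_all("p"):
    print(p.string)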
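Because find returns None when nothing matches, it is worth guarding before using the result. The sketch below uses hypothetical sample markup and Python's built-in "html.parser" instead of html5lib, so it needs no extra parser. It also uses get_text rather than .string, since .string is None for a tag with more than one child.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on the document above.
html_doc = """
<html><head><title>WuXiaolong</title></head>
<body>
<p>first paragraph</p>
<p class="WeChat">second paragraph</p>
<p><a href="http://wuxiaolong.me/">blog</a></p>
</body></html>
"""

# "html.parser" ships with Python; the article itself uses html5lib.
soup = BeautifulSoup(html_doc, "html.parser")

# find returns None when nothing matches, so guard before using the result.
missing = soup.find("p", class_="NoSuchClass")
wechat = soup.find("p", class_="WeChat")
wechat_text = wechat.get_text(strip=True) if wechat is not None else ""

# find_all returns every match; get_text also works for tags with children.
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

print(missing)
print(wechat_text)
print(paragraphs)
```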

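Hands-on project

A while ago a user reported that my personal app had stopped working. I no longer maintain the app, but I should at least keep it running. Most people know its data is scraped (see "手把手教你做個人app"). One advantage of scraped data is that I don't have to manage the data myself. The drawback is that when the source site goes down or its HTML structure changes, my parsing fails and the app has no data. This report made me consider crawling the site's data into my own store, and since I was teaching myself Python it was a good exercise, so I got started. My first plan was to scrape with Python, insert into a local MySQL database, write my own API with Flask, call it from Android with Retrofit, and then insert into bmob with the bmob SDK. That was too laborious and did not seem workable. I then learned that bmob provides a RESTful API, which solved the main problem: the Python scraper can insert the data directly. This article demonstrates inserting into a local database; with bmob you would call its RESTful API to insert the data instead.

Choosing a site

The demo site is https://meiriyiwen.com/random . Every request returns a different article, which is exactly what I need: request the page on a schedule, parse out the data I want, and insert it into the database.

Creating the database

I created it directly with Navicat Premium; the command line works too.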

Creating the table

Create the article table with pymysql. The table needs the fields id, article_title, article_author, and article_content. The code is below and only needs to be run once.

import pymysql


def create_table():
    # Open the connection
    db = pymysql.connect(host="localhost",
                         user="root",
                         password="root",
                         database="python3learn")
    # SQL statement that creates the article table
    sql = """create table if not exists article (
    id int NOT NULL AUTO_INCREMENT, 
    article_title text,
    article_author text,
    article_content text,
    PRIMARY KEY (`id`)
    )"""
    # Create a cursor object with cursor()
    cursor = db.cursor()
    try:
        # Execute the SQL statement
        cursor.execute(sql)
        # Commit the transaction
        db.commit()
        print("create table success")
    except Exception as e:  # Roll back if an error occurs
        db.rollback()
        print(e)

    finally:
        # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()


if __name__ == "__main__":
    create_table()

Parsing the site

First request the page with requests, then use BeautifulSoup to pull out the nodes you need.

import requests
from bs4 import BeautifulSoup


def get_html_data():
    # GET request; the timeout keeps a stalled server from blocking forever
    response = requests.get("https://meiriyiwen.com/random", timeout=10)

    soup = BeautifulSoup(response.content, "html5lib")
    article = soup.find("div", id="article_show")
    article_title = article.h1.string
    print("article_title=%s" % article_title)
    article_author = article.find("p", class_="article_author").string
    print("article_author=%s" % article_author)
    article_contents = article.find("div", class_="article_text").find_all("p")
    # Concatenate the <p> elements (tags included) into one string
    article_content = ""
    for content in article_contents:
        article_content = article_content + str(content)
    # Print once, after the loop, rather than on every iteration
    print("article_content=%s" % article_content)
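As noted earlier, parsing breaks when the site's HTML changes. One way to soften that is to check each lookup for None before using it. The sketch below is an assumption-laden variant, not the author's code: parse_article is a hypothetical helper, the sample page is made up to mirror the selectors above, and the built-in "html.parser" is used so it runs without html5lib.

```python
from bs4 import BeautifulSoup


def parse_article(html):
    """Extract title, author, and content, or return None if the layout differs."""
    soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser
    article = soup.find("div", id="article_show")
    if article is None or article.h1 is None:
        return None
    author = article.find("p", class_="article_author")
    text = article.find("div", class_="article_text")
    if author is None or text is None:
        return None
    return {
        "title": article.h1.get_text(strip=True),
        "author": author.get_text(strip=True),
        "content": "".join(str(p) for p in text.find_all("p")),
    }


# Hypothetical page that mirrors the selectors used above.
sample = """
<div id="article_show">
  <h1>Sample title</h1>
  <p class="article_author"><span>Sample author</span></p>
  <div class="article_text"><p>One.</p><p>Two.</p></div>
</div>
"""
print(parse_article(sample))
print(parse_article("<html><body>layout changed</body></html>"))
```

Returning None lets the caller skip the insert instead of crashing with an AttributeError.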

Inserting into the database

There is a simple filter here. It assumes article titles on this site are unique, so a row is not inserted if one with the same title already exists.

import pymysql


def insert_table(article_title, article_author, article_content):
    # Open the connection
    db = pymysql.connect(host="localhost",
                         user="root",
                         password="root",
                         database="python3learn",
                         charset="utf8")
    # Query for an existing title, then insert
    query_sql = "select * from article where article_title=%s"
    sql = "insert into article (article_title,article_author,article_content) values (%s, %s, %s)"
    # Create a cursor object with cursor()
    cursor = db.cursor()
    try:
        query_value = (article_title,)
        # Execute the query
        cursor.execute(query_sql, query_value)
        results = cursor.fetchall()
        if len(results) == 0:
            value = (article_title, article_author, article_content)
            cursor.execute(sql, value)
            # Commit the transaction
            db.commit()
            print("--------------《%s》 insert table success-------------" % article_title)
            return True
        else:
            print("--------------《%s》 already exists-------------" % article_title)
            return False

    except Exception as e:  # Roll back if an error occurs
        db.rollback()
        print(e)

    finally:  # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()
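An alternative to querying before every insert is to let the database enforce uniqueness. The sketch below swaps in Python's built-in sqlite3 with an in-memory database so it runs without a MySQL server. It is not the author's method, and insert_article is a hypothetical name. In MySQL the analogous tools would be a unique index and INSERT IGNORE, which are not shown or tested here.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for MySQL here.
db = sqlite3.connect(":memory:")
db.execute(
    """create table if not exists article (
    id integer primary key autoincrement,
    article_title text unique,
    article_author text,
    article_content text
    )"""
)


def insert_article(title, author, content):
    """Insert the row unless the title exists; return True if a row was added."""
    with db:  # commits on success, rolls back on an exception
        cursor = db.execute(
            "insert or ignore into article "
            "(article_title, article_author, article_content) values (?, ?, ?)",
            (title, author, content),
        )
    return cursor.rowcount == 1


print(insert_article("title A", "author", "<p>text</p>"))
print(insert_article("title A", "author", "<p>text</p>"))
```

The unique constraint also closes the gap between the select and the insert, where two runs could otherwise both see "no such title".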

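Scheduling

I set up a timer so the scraper runs again after a fixed interval.

import sched
import time


# Initialize the scheduler class from the sched module.
# The first argument is a function that returns the current time; the second is a function that blocks until the scheduled time arrives.
schedule = sched.scheduler(time.time, time.sleep)


# Function triggered on every scheduled run
def print_time(inc):
    # to do something
    print("to do something")
    schedule.enter(inc, 0, print_time, (inc,))


# Default interval: 60 s
def start(inc=60):
    # enter takes four arguments: the delay, the priority (used to order events scheduled for the same time),
    # the function to call, and the arguments for that function (as a tuple)
    schedule.enter(0, 0, print_time, (inc,))
    schedule.run()


if __name__ == "__main__":
    # Print every 5 s
    start(5)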
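With this pattern, an exception raised inside the scheduled function propagates out of schedule.run() and ends the loop. The sketch below shows one way to keep later runs alive by catching the error inside the callback. run_periodically and flaky_job are hypothetical names, and the run count is capped only so the example terminates.

```python
import sched
import time

scheduler = sched.scheduler(time.monotonic, time.sleep)


def run_periodically(job, interval, times):
    """Call job every `interval` seconds, `times` times in total."""

    def tick(remaining):
        try:
            job()
        except Exception as exc:
            # Log and carry on so the next run is still scheduled.
            print("job failed: %s" % exc)
        if remaining > 1:
            scheduler.enter(interval, 0, tick, (remaining - 1,))

    scheduler.enter(0, 0, tick, (times,))
    scheduler.run()  # returns once no events remain


calls = []


def flaky_job():
    calls.append(len(calls))
    if len(calls) == 2:
        raise RuntimeError("simulated crawl failure")


run_periodically(flaky_job, interval=0.01, times=3)
print(len(calls))
```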

Complete code

import pymysql
import requests
from bs4 import BeautifulSoup
import sched
import time


def create_table():
    # Open the connection
    db = pymysql.connect(host="localhost",
                         user="root",
                         password="root",
                         database="python3learn")
    # SQL statement that creates the article table
    sql = """create table if not exists article (
    id int NOT NULL AUTO_INCREMENT, 
    article_title text,
    article_author text,
    article_content text,
    PRIMARY KEY (`id`)
    )"""
    # Create a cursor object with cursor()
    cursor = db.cursor()
    try:
        # Execute the SQL statement
        cursor.execute(sql)
        # Commit the transaction
        db.commit()
        print("create table success")
    except Exception as e:  # Roll back if an error occurs
        db.rollback()
        print(e)

    finally:
        # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()


def insert_table(article_title, article_author, article_content):
    # Open the connection
    db = pymysql.connect(host="localhost",
                         user="root",
                         password="root",
                         database="python3learn",
                         charset="utf8")
    # Query for an existing title, then insert
    query_sql = "select * from article where article_title=%s"
    sql = "insert into article (article_title,article_author,article_content) values (%s, %s, %s)"
    # Create a cursor object with cursor()
    cursor = db.cursor()
    try:
        query_value = (article_title,)
        # Execute the query
        cursor.execute(query_sql, query_value)
        results = cursor.fetchall()
        if len(results) == 0:
            value = (article_title, article_author, article_content)
            cursor.execute(sql, value)
            # Commit the transaction
            db.commit()
            print("--------------《%s》 insert table success-------------" % article_title)
            return True
        else:
            print("--------------《%s》 already exists-------------" % article_title)
            return False

    except Exception as e:  # Roll back if an error occurs
        db.rollback()
        print(e)

    finally:  # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()


def get_html_data():
    # GET request; the timeout keeps a stalled server from blocking forever
    response = requests.get("https://meiriyiwen.com/random", timeout=10)

    soup = BeautifulSoup(response.content, "html5lib")
    article = soup.find("div", id="article_show")
    article_title = article.h1.string
    print("article_title=%s" % article_title)
    article_author = article.find("p", class_="article_author").string
    print("article_author=%s" % article_author)
    article_contents = article.find("div", class_="article_text").find_all("p")
    # Concatenate the <p> elements (tags included) into one string
    article_content = ""
    for content in article_contents:
        article_content = article_content + str(content)
    # Print once, after the loop, rather than on every iteration
    print("article_content=%s" % article_content)

    # Insert into the database
    insert_table(article_title, article_author, article_content)


# Initialize the scheduler class from the sched module.
# The first argument is a function that returns the current time; the second is a function that blocks until the scheduled time arrives.
schedule = sched.scheduler(time.time, time.sleep)


# Function triggered on every scheduled run
def print_time(inc):
    get_html_data()
    schedule.enter(inc, 0, print_time, (inc,))


# Default interval: 60 s
def start(inc=60):
    # enter takes four arguments: the delay, the priority (used to order events scheduled for the same time),
    # the function to call, and the arguments for that function (as a tuple)
    schedule.enter(0, 0, print_time, (inc,))
    schedule.run()


if __name__ == "__main__":
    # Scrape every 5 minutes
    start(60*5)

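Open question: this only scrapes a single article. How should I handle a list of articles where each item links to a detail page? Presumably I first fetch the list, then loop over it, parsing each detail page and inserting it into the database. I have not yet worked out the best approach, so I will leave it for a later article.

Finally

Python is purely a hobby for me, but I still want to put what I learn to use, otherwise I will forget it quickly. Look out for my next Python article.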
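One possible shape for the list-then-detail approach is sketched below. It is only a starting point under stated assumptions: the list markup, the "article_list" class, the base URL, and the extract_detail_urls helper are all hypothetical, and the built-in "html.parser" is used so the example runs offline.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_detail_urls(list_html, base_url):
    """Return the absolute, de-duplicated article URLs found on a list page."""
    soup = BeautifulSoup(list_html, "html.parser")  # assumption: built-in parser
    # Assumption: the list lives in an element with class "article_list".
    container = soup.find(class_="article_list")
    if container is None:
        return []
    urls = []
    for link in container.find_all("a", href=True):
        url = urljoin(base_url, link["href"])  # resolve relative links
        if url not in urls:
            urls.append(url)
    return urls


# Hypothetical list page and base URL, for illustration only.
list_html = """
<ul class="article_list">
  <li><a href="/article/1">First</a></li>
  <li><a href="/article/2">Second</a></li>
  <li><a href="/article/1">First again</a></li>
</ul>
"""
for url in extract_detail_urls(list_html, "https://example.com/list"):
    print(url)  # each URL would then be fetched and parsed like a single article
```

Each returned URL could then go through the same request, parse, and insert steps used for the single-article case.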
