Maintaining a Dynamic Proxy Pool with Redis + Flask

vibiu, posted 2019-07-30 18:37 / 2865 views

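Summary: Crawlers often get their IP addresses banned, and the most effective countermeasure is to use proxies. Many websites have dedicated anti-crawler measures such as IP bans, and a pool of free proxies that is checked and maintained on a schedule can still supply a number of usable proxies.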

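Goal

Crawlers often get their IP addresses banned, and the most effective countermeasure is to route requests through proxy IPs. Proxy IPs can be bought from commercial platforms, but they are fairly expensive. Many proxy-listing sites also publish free proxy IPs, so we can crawl those and serve them to our crawlers through a web API.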

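Why use a proxy pool?

Many websites have dedicated anti-crawler measures, so a crawler may run into problems such as IP bans.

A large number of free proxies are published on the internet, and this resource is worth putting to use.

Free proxies are unreliable individually, but checking and maintaining them on a schedule still yields a number of usable ones.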

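What should a proxy pool do?

Crawl multiple sites and check proxies asynchronously

Filter on a schedule and keep the pool up to date

Expose an API so proxies are easy to retrieve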

Proxy pool architecture

Implementing the proxy pool

The complete project code is hosted on GitHub: https://github.com/panjings/p...

The project structure is as follows:

We start the analysis from the program's entry point, run.py:

from proxypool.api import app
from proxypool.schedule import Schedule

def main():
    s = Schedule()
    # Start the scheduler (spawns the background processes)
    s.run()
    # Start the Flask API
    app.run()

if __name__ == "__main__":
    main()

As run.py shows, the program first starts a scheduler and then starts the API.

The scheduler, schedule.py:

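import time
from multiprocessing import Process

from .db import RedisClient
# ValidityTester, PoolAdder and the constants VALID_CHECK_CYCLE,
# POOL_LOWER_THRESHOLD, POOL_UPPER_THRESHOLD and POOL_LEN_CHECK_CYCLE
# are defined elsewhere in the project and must be imported as well.


class Schedule(object):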
    @staticmethod
    def valid_proxy(cycle=VALID_CHECK_CYCLE):
        """
        Periodically take half of the proxies stored in Redis and re-test them
        """
        conn = RedisClient()
        tester = ValidityTester()
        while True:
            print("Refreshing ip")
            count = int(0.5 * conn.queue_len)
            if count == 0:
                print("Waiting for adding")
                time.sleep(cycle)
                continue
            raw_proxies = conn.get(count)
            tester.set_raw_proxies(raw_proxies)
            tester.test()
            time.sleep(cycle)

    @staticmethod
    def check_pool(lower_threshold=POOL_LOWER_THRESHOLD,
                   upper_threshold=POOL_UPPER_THRESHOLD,
                   cycle=POOL_LEN_CHECK_CYCLE):
        """
        If the number of proxies falls below lower_threshold, add more proxies
        """
        conn = RedisClient()
        adder = PoolAdder(upper_threshold)
        while True:
            if conn.queue_len < lower_threshold:
                adder.add_to_queue()
            time.sleep(cycle)

    def run(self):
        print("Ip processing running")
        valid_process = Process(target=Schedule.valid_proxy)
        check_process = Process(target=Schedule.check_pool)
        valid_process.start()
        check_process.start()

Schedule first defines valid_proxy(), which checks whether the stored proxies are still usable; the test_single_proxy() method of the ValidityTester class is the key to the asynchronous checking.
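The article does not show test_single_proxy(), so the following is only a minimal sketch of what asynchronous checking can look like, written with the standard asyncio module. The names check_proxy, filter_valid and fake_probe are hypothetical, and fake_probe stands in for a real HTTP request sent through the proxy; the project's own tester may be built differently.

```python
import asyncio


async def check_proxy(proxy, probe, timeout=5.0):
    """Return the proxy if `probe` succeeds within `timeout` seconds, else None."""
    try:
        ok = await asyncio.wait_for(probe(proxy), timeout)
    except Exception:
        # A timeout or a connection error both mean the proxy is unusable.
        return None
    return proxy if ok else None


async def filter_valid(proxies, probe, limit=50, timeout=5.0):
    """Check all proxies concurrently, running at most `limit` probes at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(proxy):
        async with semaphore:
            return await check_proxy(proxy, probe, timeout)

    # gather() returns results in the same order as its inputs.
    results = await asyncio.gather(*(guarded(p) for p in proxies))
    return [p for p in results if p is not None]


async def fake_probe(proxy):
    """Hypothetical probe: pretends that proxies on port 0 are dead."""
    await asyncio.sleep(0.01)
    return not proxy.endswith(":0")


if __name__ == "__main__":
    candidates = ["192.0.2.1:8080", "192.0.2.2:0", "192.0.2.3:3128"]
    print(asyncio.run(filter_valid(candidates, fake_probe)))
```

Because every probe spends most of its time waiting on the network, running them concurrently means a batch takes roughly as long as its slowest probe (bounded by the timeout) rather than the sum of all of them.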
Next, check_pool() takes three parameters: the lower and upper bounds on the pool size, and a check interval. The add_to_queue() method of PoolAdder uses FreeProxyGetter, a class that crawls proxy IPs from websites and is defined in getter.py.
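The threshold logic can be illustrated with a small sketch. Everything here is an assumption made for illustration: crawl_site_a, crawl_site_b, ListPool, refill and check_once are hypothetical names, the crawlers return hard-coded documentation addresses instead of scraping real sites, and the validation step that a real adder would run before storing a proxy is left out.

```python
def crawl_site_a():
    """Hypothetical crawler; a real one would scrape a free-proxy site."""
    yield from ["192.0.2.1:80", "192.0.2.2:8080"]


def crawl_site_b():
    """Second hypothetical crawler; its first entry duplicates one from site A."""
    yield from ["192.0.2.2:8080", "192.0.2.3:3128", "192.0.2.4:8000"]


class ListPool:
    """In-memory stand-in for the Redis-backed queue."""

    def __init__(self):
        self.items = []

    def put(self, proxy):
        self.items.append(proxy)

    @property
    def queue_len(self):
        return len(self.items)


def refill(pool, crawlers, upper_threshold):
    """Add crawled proxies until the pool reaches upper_threshold.

    Returns the number of proxies added in this run.
    """
    seen = set()
    added = 0
    for crawl in crawlers:
        for proxy in crawl():
            if pool.queue_len >= upper_threshold:
                return added
            if proxy in seen:
                continue  # skip duplicates found on several sites
            seen.add(proxy)
            pool.put(proxy)
            added += 1
    return added


def check_once(pool, crawlers, lower_threshold, upper_threshold):
    """One pass of the check loop: refill only when the pool runs low."""
    if pool.queue_len < lower_threshold:
        return refill(pool, crawlers, upper_threshold)
    return 0
```

Using two bounds instead of one keeps the crawlers idle most of the time: they run only when the pool drops below the lower bound, and they stop as soon as it reaches the upper bound.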

The API, api.py:

from flask import Flask, g

from .db import RedisClient

__all__ = ["app"]

app = Flask(__name__)


def get_conn():
    """
    Opens a new redis connection if there is none yet for the
    current application context.
    """
    if not hasattr(g, "redis_client"):
        g.redis_client = RedisClient()
    return g.redis_client


@app.route("/")
def index():
    return "Welcome to Proxy Pool System"


@app.route("/get")
def get_proxy():
    """
    Get a proxy
    """
    conn = get_conn()
    return conn.pop()


@app.route("/count")
def get_counts():
    """
    Get the count of proxies
    """
    conn = get_conn()
    return str(conn.queue_len)


if __name__ == "__main__":
    app.run()

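As you can see, api.py uses Flask's route decorators to define the endpoints: / returns a welcome message, /get returns one proxy, and /count returns the number of proxies in the pool.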
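From the crawler's side, using the pool comes down to one HTTP request to /get. The sketch below uses only the standard library and assumes the API is running on Flask's default address, http://127.0.0.1:5000; the helper names fetch_proxy, as_proxies and open_via_proxy are hypothetical and not part of the project.

```python
from urllib.request import ProxyHandler, build_opener


def fetch_proxy(base_url="http://127.0.0.1:5000", timeout=5):
    """Ask the pool's /get endpoint for one proxy in 'host:port' form."""
    # An empty ProxyHandler makes this request go straight to the pool,
    # ignoring any proxy configured in the environment.
    direct = build_opener(ProxyHandler({}))
    with direct.open(f"{base_url}/get", timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def as_proxies(proxy):
    """Build the scheme-to-URL mapping that ProxyHandler expects."""
    return {"http": f"http://{proxy}", "https": f"http://{proxy}"}


def open_via_proxy(url, proxy, timeout=10):
    """Open `url` through the given proxy."""
    opener = build_opener(ProxyHandler(as_proxies(proxy)))
    return opener.open(url, timeout=timeout)
```

A crawler would call fetch_proxy() before a request, pass the result to open_via_proxy(), and fetch a new proxy whenever a request fails. The same mapping returned by as_proxies() can also be passed as the proxies argument of the requests library.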

For the full implementation, see the GitHub repository.
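Both the scheduler and the API above rely on RedisClient, whose code is not shown in this article. As a rough guide, a class offering the same four members (get, put, pop, queue_len) could be built on a Redis list as sketched below. This is an assumption about the design, not the project's actual code: the class name ProxyQueue, the key name "proxies" and the injected client argument are all hypothetical. The client is expected to provide redis-py's list commands lrange, ltrim, rpush, rpop and llen, for example an instance of redis.Redis.

```python
class ProxyQueue:
    """Queue of proxies kept in a single Redis list (illustrative sketch)."""

    def __init__(self, client, key="proxies"):
        # `client` is assumed to be a redis.Redis instance or a compatible object.
        self._db = client
        self._key = key

    def get(self, count=1):
        """Remove and return up to `count` proxies from the left (oldest) end."""
        if count <= 0:
            return []
        proxies = self._db.lrange(self._key, 0, count - 1)
        self._db.ltrim(self._key, count, -1)
        return proxies

    def put(self, proxy):
        """Append a proxy to the right (newest) end."""
        self._db.rpush(self._key, proxy)

    def pop(self):
        """Remove and return the newest proxy as a string."""
        proxy = self._db.rpop(self._key)
        if proxy is None:
            raise LookupError("proxy pool is empty")
        # redis-py returns bytes unless decode_responses=True is set.
        return proxy.decode("utf-8") if isinstance(proxy, bytes) else proxy

    @property
    def queue_len(self):
        """Number of proxies currently stored."""
        return self._db.llen(self._key)
```

In this sketch the checker takes proxies from the left end and anything stored with put() goes on the right end, so pop() hands out the most recently added proxy. Note that lrange followed by ltrim is not atomic, so two processes calling get() at the same moment could receive overlapping batches.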