WebMagic Crawler Framework Source Code Analysis: Downloader

Posted by 104828720 on 2019-08-14 17:55 / 2647 views


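Downloader is the component that requests a URL and obtains the response (HTML, JSON, JSONP and so on). It also takes care of POST redirects, HTTPS verification, IP proxies, and retrying failed requests.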

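Interface: Downloader. Declares the download method, which returns a Page, and the setThread method, which sets the number of threads used for requests.
Abstract class: AbstractDownloader. Defines an overloaded download method that returns Html, the status methods onSuccess and onError, and addToCycleRetry, which decides whether a request should be retried.
Implementation class: HttpClientDownloader. Downloads pages through HttpClient.
Helper class: HttpClientGenerator. Creates HttpClient instances.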

1. AbstractDownloader

public Html download(String url, String charset) {
        Page page = download(new Request(url), Site.me().setCharset(charset).toTask());
        return (Html) page.getHtml();
    }

The download logic here is simple: it just calls the download method implemented by the subclass.

protected Page addToCycleRetry(Request request, Site site) {
        Page page = new Page();
        Object cycleTriedTimesObject = request.getExtra(Request.CYCLE_TRIED_TIMES);
        if (cycleTriedTimesObject == null) {
            page.addTargetRequest(request.setPriority(0).putExtra(Request.CYCLE_TRIED_TIMES, 1));
        } else {
            int cycleTriedTimes = (Integer) cycleTriedTimesObject;
            cycleTriedTimes++;
            if (cycleTriedTimes >= site.getCycleRetryTimes()) {
                return null;
            }
            page.addTargetRequest(request.setPriority(0).putExtra(Request.CYCLE_TRIED_TIMES, cycleTriedTimes));
        }
        page.setNeedCycleRetry(true);
        return page;
    }

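Retry logic: the method first checks whether CYCLE_TRIED_TIMES is null. If it is, this is the first cycle retry, so the counter is set to 1 and the request is re-queued. If it is not null, the counter is incremented and compared with the maximum configured in Site (getCycleRetryTimes()); once the maximum is reached the method returns null and the request is dropped. Whenever the request is re-queued, the needCycleRetry flag is set to mark the page as needing a retry. We mentioned this in the Spider analysis; here is the Spider snippet again to reinforce the point: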

// for cycle retry
        if (page.isNeedCycleRetry()) {
            extractAndAddRequests(page, true);
            sleep(site.getRetrySleepTime());
            return;
        }
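
To make the counter bookkeeping easier to follow, here is a minimal standalone sketch of the same branches. It is not WebMagic code: the class name, the shouldRequeue method and the key string are assumptions made for illustration, and a plain Map stands in for the Request extras.

```java
import java.util.HashMap;
import java.util.Map;

/** Standalone sketch of the cycle-retry counter; names are hypothetical, not WebMagic's. */
class CycleRetrySketch {
    // Hypothetical key; stands in for Request.CYCLE_TRIED_TIMES.
    static final String CYCLE_TRIED_TIMES = "cycleTriedTimes";

    /**
     * Returns true when the request should be re-queued, false when the retry
     * budget is used up (the case where addToCycleRetry returns null).
     */
    static boolean shouldRequeue(Map<String, Object> extras, int cycleRetryTimes) {
        Object tried = extras.get(CYCLE_TRIED_TIMES);
        if (tried == null) {
            // First failure: start counting at 1 and re-queue.
            extras.put(CYCLE_TRIED_TIMES, 1);
            return true;
        }
        int next = (Integer) tried + 1;
        if (next >= cycleRetryTimes) {
            // Limit reached: give up on this request.
            return false;
        }
        extras.put(CYCLE_TRIED_TIMES, next);
        return true;
    }

    public static void main(String[] args) {
        Map<String, Object> extras = new HashMap<>();
        int requeues = 0;
        while (shouldRequeue(extras, 3)) {
            requeues++;
        }
        System.out.println("re-queued " + requeues + " times with a limit of 3");
    }
}
```

One thing the sketch makes visible: because the comparison uses >=, a limit of 3 yields two re-queues after the initial attempt.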

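2. HttpClientDownloader
Extends AbstractDownloader and is responsible for downloading pages through HttpClient.
Instance variables:
httpClients: a Map that stores the HttpClient instance created for each site domain so that it can be reused.

httpClientGenerator: the HttpClientGenerator instance used to create HttpClient instances.

Main methods:
a. Getting an HttpClient instance.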

private CloseableHttpClient getHttpClient(Site site, Proxy proxy) {
        if (site == null) {
            return httpClientGenerator.getClient(null, proxy);
        }
        String domain = site.getDomain();
        CloseableHttpClient httpClient = httpClients.get(domain);
        if (httpClient == null) {
            synchronized (this) {
                httpClient = httpClients.get(domain);
                if (httpClient == null) {
                    httpClient = httpClientGenerator.getClient(site, proxy);
                    httpClients.put(domain, httpClient);
                }
            }
        }
        return httpClient;
    }

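The idea is to get the domain from Site and look it up in the httpClients map. If an HttpClient instance already exists for that domain it is reused; otherwise httpClientGenerator creates a new one, which is put into the map and returned.
Note that, to keep this thread-safe, the code uses double-checked locking.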
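
For comparison, the same "one instance per domain" idea can be sketched with ConcurrentHashMap.computeIfAbsent, which applies the factory at most once per key. This is an alternative technique, not what WebMagic does; PerDomainCache and its factory parameter are hypothetical names, and a String stands in for the HttpClient.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical per-domain cache; V stands in for CloseableHttpClient. */
class PerDomainCache<V> {
    private final Map<String, V> cache = new ConcurrentHashMap<>();
    private final Function<String, V> factory;

    PerDomainCache(Function<String, V> factory) {
        this.factory = factory;
    }

    /** Returns the cached value for the domain, creating it on first use. */
    V get(String domain) {
        return cache.computeIfAbsent(domain, factory);
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger created = new AtomicInteger();
        PerDomainCache<String> clients =
                new PerDomainCache<>(d -> "client-" + d + "-" + created.incrementAndGet());
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> clients.get("example.com"));
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        // All eight threads share the single instance created for example.com.
        System.out.println("instances created: " + created.get());
    }
}
```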

b. The download method:

public Page download(Request request, Task task) {
    Site site = null;
    if (task != null) {
        site = task.getSite();
    }
    Set<Integer> acceptStatCode;
    String charset = null;
    Map<String, String> headers = null;
    if (site != null) {
        acceptStatCode = site.getAcceptStatCode();
        charset = site.getCharset();
        headers = site.getHeaders();
    } else {
        acceptStatCode = WMCollections.newHashSet(200);
    }
    logger.info("downloading page {}", request.getUrl());
    CloseableHttpResponse httpResponse = null;
    int statusCode=0;
    try {
        HttpHost proxyHost = null;
        Proxy proxy = null; //TODO
        if (site.getHttpProxyPool() != null && site.getHttpProxyPool().isEnable()) {
            proxy = site.getHttpProxyFromPool();
            proxyHost = proxy.getHttpHost();
        } else if(site.getHttpProxy()!= null){
            proxyHost = site.getHttpProxy();
        }
        
        HttpUriRequest httpUriRequest = getHttpUriRequest(request, site, headers, proxyHost);
        httpResponse = getHttpClient(site, proxy).execute(httpUriRequest);
        statusCode = httpResponse.getStatusLine().getStatusCode();
        request.putExtra(Request.STATUS_CODE, statusCode);
        if (statusAccept(acceptStatCode, statusCode)) {
            Page page = handleResponse(request, charset, httpResponse, task);
            onSuccess(request);
            return page;
        } else {
            logger.warn("get page {} error, status code {} ",request.getUrl(),statusCode);
            return null;
        }
    } catch (IOException e) {
        logger.warn("download page {} error", request.getUrl(), e);
        if (site.getCycleRetryTimes() > 0) {
            return addToCycleRetry(request, site);
        }
        onError(request);
        return null;
    } finally {
        request.putExtra(Request.STATUS_CODE, statusCode);
        if (site.getHttpProxyPool()!=null && site.getHttpProxyPool().isEnable()) {
            site.returnHttpProxyToPool((HttpHost) request.getExtra(Request.PROXY), (Integer) request
                    .getExtra(Request.STATUS_CODE));
        }
        try {
            if (httpResponse != null) {
                //ensure the connection is released back to pool
                EntityUtils.consume(httpResponse.getEntity());
            }
        } catch (IOException e) {
            logger.warn("close response fail", e);
        }
    }
}

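Note that the Task parameter here is in fact the Spider instance.
First, the charset, the request headers and the accepted response status codes are read from site.
Next comes the proxy setup: the code checks whether site has a proxy pool and whether the pool is enabled. If so, it takes a proxy host from the pool; otherwise it checks whether a proxy host was set directly on site.
Then it builds the HttpUriRequest (the interface implemented by HttpGet and HttpPost), executes the request, checks the status code, and converts the response into a Page object to return. Along the way it calls the status methods onSuccess and onError, but both are empty implementations (probably because Spider already handles these states through its Listeners).
If an IOException is thrown and cycleRetryTimes is greater than 0, addToCycleRetry is called to decide whether to retry.
If the Page returned here is null, Spider does not call the PageProcessor, so inside a PageProcessor there is no need to worry about Page being null.
Finally, the finally block releases resources: it returns the proxy to the pool and releases the HttpClient connection (EntityUtils.consume(httpResponse.getEntity());).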

c. How the HttpUriRequest is built

protected HttpUriRequest getHttpUriRequest(Request request, Site site, Map<String, String> headers, HttpHost proxy) {
        RequestBuilder requestBuilder = selectRequestMethod(request).setUri(request.getUrl());
        if (headers != null) {
            for (Map.Entry<String, String> headerEntry : headers.entrySet()) {
                requestBuilder.addHeader(headerEntry.getKey(), headerEntry.getValue());
            }
        }
        RequestConfig.Builder requestConfigBuilder = RequestConfig.custom()
                .setConnectionRequestTimeout(site.getTimeOut())
                .setSocketTimeout(site.getTimeOut())
                .setConnectTimeout(site.getTimeOut())
                .setCookieSpec(CookieSpecs.BEST_MATCH);
        if (proxy !=null) {
            requestConfigBuilder.setProxy(proxy);
            request.putExtra(Request.PROXY, proxy);
        }
        requestBuilder.setConfig(requestConfigBuilder.build());
        return requestBuilder.build();
    }

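It first calls selectRequestMethod to get the appropriate RequestBuilder (GET or POST, for example) and sets the request parameters at the same time. It then uses the HttpClient API to set the request headers, the timeouts, the proxy and so on.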

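About the changes to selectRequestMethod: the author has merged and modified a PR, so setting POST parameters is expected to become much simpler after WebMagic 0.6.2 (not yet released at the time of writing).
Previously, setting POST parameters required
request.putExtra("nameValuePair", NameValuePair[]); the NameValuePair[] had to be filled by adding BasicNameValuePair objects one by one, and a UrlEncodedFormEntity was needed as well. The whole procedure was tedious: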

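List<NameValuePair> formparams = new ArrayList<NameValuePair>();
formparams.add(new BasicNameValuePair("channelCode", "0008"));
formparams.add(new BasicNameValuePair("pageIndex", i + "")); // i is the caller's page counter
formparams.add(new BasicNameValuePair("pageSize", "15"));
formparams.add(new BasicNameValuePair("sitewebName", "廣東省"));
// Pass a typed NameValuePair[]; a plain toArray() would return Object[]
request.putExtra("nameValuePair", formparams.toArray(new NameValuePair[formparams.size()]));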

Afterwards, this is all we need:

request.putParam("sitewebName", "廣東省");
request.putParam("xxx", "xxx");
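
Either way, the name/value pairs end up as a URL-encoded form body. As a rough illustration of what that encoding produces, here is a JDK-only sketch; FormBodySketch and encode are hypothetical names, and the code only approximates what UrlEncodedFormEntity does rather than reproducing it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that builds an application/x-www-form-urlencoded body. */
class FormBodySketch {
    /** Joins the parameters as name=value pairs separated by '&', URL-encoding both sides. */
    static String encode(Map<String, String> params) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        return body.toString();
    }

    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("channelCode", "0008");
        params.put("pageSize", "15");
        System.out.println(encode(params)); // prints channelCode=0008&pageSize=15
    }
}
```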

d. How the downloaded content is converted into a Page object:

protected Page handleResponse(Request request, String charset, HttpResponse httpResponse, Task task) throws IOException {
        String content = getContent(charset, httpResponse);
        Page page = new Page();
        page.setRawText(content);
        page.setUrl(new PlainText(request.getUrl()));
        page.setRequest(request);
        page.setStatusCode(httpResponse.getStatusLine().getStatusCode());
        return page;
    }

There is not much to say about this method, except that it calls getContent.

protected String getContent(String charset, HttpResponse httpResponse) throws IOException {
    if (charset == null) {
        byte[] contentBytes = IOUtils.toByteArray(httpResponse.getEntity().getContent());
        String htmlCharset = getHtmlCharset(httpResponse, contentBytes);
        if (htmlCharset != null) {
            return new String(contentBytes, htmlCharset);
        } else {
            logger.warn("Charset autodetect failed, use {} as charset. Please specify charset in Site.setCharset()", Charset.defaultCharset());
            return new String(contentBytes);
        }
    } else {
        return IOUtils.toString(httpResponse.getEntity().getContent(), charset);
    }
}

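The getContent method first checks whether a charset is set (this is configured in Site). If so, it uses IOUtils from Apache Commons to convert the response content into a string in that encoding; otherwise it tries to detect the encoding of the response content automatically.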

protected String getHtmlCharset(HttpResponse httpResponse, byte[] contentBytes) throws IOException {
    return CharsetUtils.detectCharset(httpResponse.getEntity().getContentType().getValue(), contentBytes);
}

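getHtmlCharset calls CharsetUtils to detect the character encoding. The idea is to first check whether httpResponse.getEntity().getContentType().getValue() contains something like charset=utf-8.
If it does not, the content is parsed with Jsoup and the meta tags are extracted, with the HTML 4.01 form (<meta http-equiv="Content-Type" content="text/html; charset=...">) and the HTML5 form (<meta charset="...">) handled separately.
Of course, if the server returns incomplete HTML (without a head), or something that is not HTML at all (JSON, for example), detection fails and the JVM default charset is used.
So, whenever possible, set the charset on Site by hand.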
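
The detection order described above (header first, then meta tags, then give up) can be sketched without any dependencies. Note the substitution: this sketch uses regular expressions where WebMagic uses Jsoup, and CharsetSniffSketch and detect are hypothetical names, so treat it as an approximation of the idea rather than the library's code.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical charset sniffer: Content-Type header first, then <meta> tags. */
class CharsetSniffSketch {
    // Matches charset=utf-8 as well as charset="utf-8".
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META =
            Pattern.compile("<meta[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Returns the detected charset name, or null when nothing could be found. */
    static String detect(String contentType, byte[] body) {
        String fromHeader = find(contentType);
        if (fromHeader != null) {
            return fromHeader;
        }
        // ISO-8859-1 maps every byte to one char, so the ASCII markup survives decoding.
        int length = Math.min(body.length, 4096);
        String head = new String(body, 0, length, StandardCharsets.ISO_8859_1);
        Matcher meta = META.matcher(head);
        while (meta.find()) {
            String fromMeta = find(meta.group());
            if (fromMeta != null) {
                return fromMeta;
            }
        }
        return null; // the caller would fall back to a default here
    }

    private static String find(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] html = "<html><head><meta charset=\"gbk\"></head></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect("text/html", html));
        System.out.println(detect("application/json", "{\"a\":1}".getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

The second call in main shows the failure case from the text: a JSON body has no header charset and no meta tag, so nothing is detected.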

3. HttpClientGenerator
Creates HttpClient instances; it is essentially a factory.

public HttpClientGenerator() {
        Registry<ConnectionSocketFactory> reg = RegistryBuilder.<ConnectionSocketFactory>create()
                .register("http", PlainConnectionSocketFactory.INSTANCE)
                .register("https", buildSSLConnectionSocketFactory())
                .build();
        connectionManager = new PoolingHttpClientConnectionManager(reg);
        connectionManager.setDefaultMaxPerRoute(100);
    }

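The constructor mainly registers the socket factories for http and https. For https we need to supply a custom factory that skips validation of untrusted certificates (in other words, it trusts every certificate). Before WebMagic 0.6, requests failed on untrusted certificates; a PR addressing this was merged later, and the current strategy is to skip certificate validation and trust all certificates (which is what a crawler wants anyway: we are here for the data, not for the security).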

private CloseableHttpClient generateClient(Site site, Proxy proxy) {
    CredentialsProvider credsProvider = null;
    HttpClientBuilder httpClientBuilder = HttpClients.custom();
    
    if(proxy!=null && StringUtils.isNotBlank(proxy.getUser()) && StringUtils.isNotBlank(proxy.getPassword()))
    {
        credsProvider= new BasicCredentialsProvider();
        credsProvider.setCredentials(
                new AuthScope(proxy.getHttpHost().getAddress().getHostAddress(), proxy.getHttpHost().getPort()),
                new UsernamePasswordCredentials(proxy.getUser(), proxy.getPassword()));
        httpClientBuilder.setDefaultCredentialsProvider(credsProvider);
    }

    if(site!=null&&site.getHttpProxy()!=null&&site.getUsernamePasswordCredentials()!=null){
        credsProvider = new BasicCredentialsProvider();
        credsProvider.setCredentials(
                new AuthScope(site.getHttpProxy()),// scope the credentials apply to
                site.getUsernamePasswordCredentials());// username and password
        httpClientBuilder.setDefaultCredentialsProvider(credsProvider);
    }
    
    httpClientBuilder.setConnectionManager(connectionManager);
    if (site != null && site.getUserAgent() != null) {
        httpClientBuilder.setUserAgent(site.getUserAgent());
    } else {
        httpClientBuilder.setUserAgent("");
    }
    if (site == null || site.isUseGzip()) {
        httpClientBuilder.addInterceptorFirst(new HttpRequestInterceptor() {

            public void process(
                    final HttpRequest request,
                    final HttpContext context) throws HttpException, IOException {
                if (!request.containsHeader("Accept-Encoding")) {
                    request.addHeader("Accept-Encoding", "gzip");
                }
            }
        });
    }
    // fix for the post/redirect/post 302 redirect problem
    httpClientBuilder.setRedirectStrategy(new CustomRedirectStrategy());
    
    SocketConfig socketConfig = SocketConfig.custom().setSoTimeout(site.getTimeOut()).setSoKeepAlive(true).setTcpNoDelay(true).build();
    httpClientBuilder.setDefaultSocketConfig(socketConfig);
    connectionManager.setDefaultSocketConfig(socketConfig);
    if (site != null) {
        httpClientBuilder.setRetryHandler(new DefaultHttpRequestRetryHandler(site.getRetryTimes(), true));
    }
    generateCookie(httpClientBuilder, site);
    return httpClientBuilder.build();
}

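The first part sets up the proxy and the proxy's username and password.
Two points deserve attention here:
1. The post/redirect/post 302 redirect problem: this is solved by installing a custom redirect strategy class. (This was broken before version 0.6; a PR was merged after 0.6.)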

httpClientBuilder.setRedirectStrategy(new CustomRedirectStrategy());

CustomRedirectStrategy extends HttpClient's built-in LaxRedirectStrategy (which supports redirects for GET, POST, HEAD and DELETE requests) and adds special handling for POST. For a POST request the code is:

HttpRequestWrapper httpRequestWrapper = (HttpRequestWrapper) request;
httpRequestWrapper.setURI(uri);
httpRequestWrapper.removeHeaders("Content-Length");

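As you can see, for a POST the original request object is reused: its URI is set to the redirect target and the headers the new request does not need are removed. Reusing the request object means the original POST parameters are carried along on a post/redirect/post 302 redirect, so they are not lost.
Without this, the default implementation is: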

if (status == HttpStatus.SC_TEMPORARY_REDIRECT) {
                return RequestBuilder.copy(request).setUri(uri).build();
            } else {
                return new HttpGet(uri);
            }

SC_TEMPORARY_REDIRECT is status code 307, so by default the parameters are carried over only on a 307 redirect.
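
The difference between the two strategies comes down to which HTTP method the follow-up request uses. The sketch below models only that decision with plain strings; RedirectMethodSketch and its two methods are hypothetical, and the "default" branch is a simplification based on the snippet quoted above rather than a copy of HttpClient's code.

```java
/** Hypothetical model of the redirect decision: which method does the follow-up request use? */
class RedirectMethodSketch {
    static final int SC_TEMPORARY_REDIRECT = 307;

    /** Simplified default behaviour: only a 307 keeps a non-GET/HEAD method (and its body). */
    static String defaultMethod(String method, int status) {
        if ("GET".equalsIgnoreCase(method) || "HEAD".equalsIgnoreCase(method)) {
            return method.toUpperCase();
        }
        return status == SC_TEMPORARY_REDIRECT ? method.toUpperCase() : "GET";
    }

    /** Behaviour described for CustomRedirectStrategy: a POST stays a POST on any redirect. */
    static String customMethod(String method, int status) {
        if ("POST".equalsIgnoreCase(method)) {
            return "POST";
        }
        return defaultMethod(method, status);
    }

    public static void main(String[] args) {
        System.out.println("default, POST + 302 -> " + defaultMethod("POST", 302)); // GET
        System.out.println("default, POST + 307 -> " + defaultMethod("POST", 307)); // POST
        System.out.println("custom,  POST + 302 -> " + customMethod("POST", 302));  // POST
    }
}
```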

2. HttpClient retries: this is done by installing the default retry handler and passing it the retry count (the retryTimes configured in Site).

httpClientBuilder.setRetryHandler(new DefaultHttpRequestRetryHandler(site.getRetryTimes(), true));

After that comes the cookie configuration.

private void generateCookie(HttpClientBuilder httpClientBuilder, Site site) {
    CookieStore cookieStore = new BasicCookieStore();
    for (Map.Entry<String, String> cookieEntry : site.getCookies().entrySet()) {
        BasicClientCookie cookie = new BasicClientCookie(cookieEntry.getKey(), cookieEntry.getValue());
        cookie.setDomain(site.getDomain());
        cookieStore.addCookie(cookie);
    }
    for (Map.Entry<String, Map<String, String>> domainEntry : site.getAllCookies().entrySet()) {
        for (Map.Entry<String, String> cookieEntry : domainEntry.getValue().entrySet()) {
            BasicClientCookie cookie = new BasicClientCookie(cookieEntry.getKey(), cookieEntry.getValue());
            cookie.setDomain(domainEntry.getKey());
            cookieStore.addCookie(cookie);
        }
    }
    httpClientBuilder.setDefaultCookieStore(cookieStore);
}

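It first creates a CookieStore instance, adds the cookies from Site to it, and configures it on the httpClientBuilder. Every request executed by this HttpClient instance will then use this cookieStore. Keeping a login session alive, for example, can be done by configuring cookies in Site.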
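
The two loops in generateCookie follow a simple pattern: cookies without an explicit domain get the site's default domain, and cookies grouped by domain keep their own. Here is a dependency-free sketch of that pattern using java.net.HttpCookie in place of HttpClient's BasicClientCookie; CookieSeedSketch and seed are hypothetical names.

```java
import java.net.HttpCookie;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of seeding a cookie list from default and per-domain cookies. */
class CookieSeedSketch {
    static List<HttpCookie> seed(String defaultDomain,
                                 Map<String, String> defaultCookies,
                                 Map<String, Map<String, String>> cookiesByDomain) {
        List<HttpCookie> store = new ArrayList<>();
        // Cookies without an explicit domain are bound to the site's own domain.
        for (Map.Entry<String, String> e : defaultCookies.entrySet()) {
            store.add(cookie(e.getKey(), e.getValue(), defaultDomain));
        }
        // Cookies grouped by domain keep the domain they were registered under.
        for (Map.Entry<String, Map<String, String>> d : cookiesByDomain.entrySet()) {
            for (Map.Entry<String, String> e : d.getValue().entrySet()) {
                store.add(cookie(e.getKey(), e.getValue(), d.getKey()));
            }
        }
        return store;
    }

    private static HttpCookie cookie(String name, String value, String domain) {
        HttpCookie c = new HttpCookie(name, value);
        c.setDomain(domain);
        return c;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = new LinkedHashMap<>();
        defaults.put("sessionId", "abc123");
        Map<String, Map<String, String>> byDomain = new LinkedHashMap<>();
        Map<String, String> apiCookies = new LinkedHashMap<>();
        apiCookies.put("token", "xyz");
        byDomain.put("api.example.com", apiCookies);
        for (HttpCookie c : seed("example.com", defaults, byDomain)) {
            System.out.println(c.getDomain() + " " + c.getName() + "=" + c.getValue());
        }
    }
}
```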

4. About the Page object:
A Page represents the result of one request, or roughly a page (a bit of a stretch when the response is JSON).

public Html getHtml() {
        if (html == null) {
            html = new Html(UrlUtils.fixAllRelativeHrefs(rawText, request.getUrl()));
        }
        return html;
    }

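In the Html obtained this way, links in the original page that do not include a domain are automatically converted into full links starting with http[s].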
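
The conversion of a single link can be illustrated with java.net.URI.resolve. This is only a sketch of the idea for one href: AbsoluteLinkSketch and absolutize are hypothetical names, UrlUtils.fixAllRelativeHrefs works on the whole page text, and URI.create rejects characters such as spaces that real pages sometimes contain.

```java
import java.net.URI;

/** Hypothetical helper that resolves one href against the URL of the page it came from. */
class AbsoluteLinkSketch {
    static String absolutize(String href, String pageUrl) {
        // resolve() leaves absolute URLs untouched and completes relative ones.
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "https://example.com/list/index.html";
        System.out.println(absolutize("/a/b.html", page));         // https://example.com/a/b.html
        System.out.println(absolutize("detail?id=1", page));       // https://example.com/list/detail?id=1
        System.out.println(absolutize("http://other.org/x", page)); // http://other.org/x
    }
}
```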

That wraps up the analysis of Downloader. I will add more later; the topic of the next post is still to be decided.

