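It is more practical to just paste the code and explain it.
The webmagic-selenium module feels somewhat underwhelming, but parts of it are still worth borrowing. Drawing on it, I wrote my own SeleniumDownloader, as follows: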

    import org.openqa.selenium.By;
    import org.openqa.selenium.Cookie;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Request;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Task;
    import us.codecraft.webmagic.downloader.Downloader;
    import us.codecraft.webmagic.selector.Html;
    import us.codecraft.webmagic.selector.PlainText;
    import us.codecraft.webmagic.utils.UrlUtils;

    import java.util.Map;

    /**
     * @author taojw
     */
    public class SeleniumDownloader implements Downloader {
        private static final Logger log = LoggerFactory.getLogger(SeleniumDownloader.class);
        private int sleepTime = 3000; // 3s
        private SeleniumAction action = null;
        private WebDriverPool webDriverPool = new WebDriverPool();

        public SeleniumDownloader() {
        }

        public SeleniumDownloader(int sleepTime, WebDriverPool pool) {
            this(sleepTime, pool, null);
        }

        public SeleniumDownloader(int sleepTime, WebDriverPool pool, SeleniumAction action) {
            this.sleepTime = sleepTime;
            this.action = action;
            if (pool != null) {
                webDriverPool = pool;
            }
        }

        public SeleniumDownloader setSleepTime(int sleepTime) {
            this.sleepTime = sleepTime;
            return this;
        }

        public void setOperator(SeleniumAction action) {
            this.action = action;
        }

        @Override
        public Page download(Request request, Task task) {
            WebDriver webDriver;
            try {
                webDriver = webDriverPool.get();
            } catch (InterruptedException e) {
                log.warn("interrupted", e);
                return null;
            }
            log.info("downloading page " + request.getUrl());
            Page page = new Page();
            try {
                webDriver.get(request.getUrl());
                Thread.sleep(sleepTime);
            } catch (InterruptedException e) {
                e.printStackTrace();
            } catch (Exception e) {
                webDriverPool.close(webDriver);
                page.setSkip(true);
                return page;
            }
            // WindowUtil.changeWindow(webDriver);
            WebDriver.Options manage = webDriver.manage();
            Site site = task.getSite();
            if (site.getCookies() != null) {
                for (Map.Entry<String, String> cookieEntry : site.getCookies().entrySet()) {
                    Cookie cookie = new Cookie(cookieEntry.getKey(), cookieEntry.getValue());
                    manage.addCookie(cookie);
                }
            }
            manage.window().maximize();
            // Global hook, configured on the downloader.
            if (action != null) {
                action.execute(webDriver);
            }
            // Per-request hook, passed through the request's "action" extra.
            SeleniumAction reqAction = (SeleniumAction) request.getExtra("action");
            if (reqAction != null) {
                reqAction.execute(webDriver);
            }
            WebElement webElement = webDriver.findElement(By.xpath("/html"));
            String content = webElement.getAttribute("outerHTML");
            page.setRawText(content);
            page.setHtml(new Html(UrlUtils.fixAllRelativeHrefs(content, webDriver.getCurrentUrl())));
            page.setUrl(new PlainText(webDriver.getCurrentUrl()));
            page.setRequest(request);
            webDriverPool.returnToPool(webDriver);
            return page;
        }

        @Override
        public void setThread(int thread) {
        }
    }

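Features:
Supports attaching a SeleniumAction hook when calling Spider.setDownloader, so that selenium operations common to every request can be customized. This adds flexibility.
Supports adding an "action" extra to each request whose value is a SeleniumAction object, so that custom selenium operations can be run for an individual request. This adds flexibility as well.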

    import org.openqa.selenium.WebDriver;

    /**
     * @author taojw
     */
    public interface SeleniumAction {
        void execute(WebDriver driver);
    }

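To make the order of the two hooks concrete, here is a minimal sketch that uses only the JDK. `DriverAction` and `HookedDownloader` are hypothetical stand-ins for SeleniumAction and SeleniumDownloader, and the type parameter `D` takes the place of WebDriver so the example runs without Selenium on the classpath. It is an illustration of the pattern, not the downloader itself.

```java
import java.util.Map;

/** Hypothetical stand-in for SeleniumAction; D plays the role of WebDriver. */
interface DriverAction<D> {
    void execute(D driver);
}

/** Hypothetical stand-in for the hook-dispatch part of SeleniumDownloader. */
class HookedDownloader<D> {
    private final DriverAction<D> globalAction; // may be null

    HookedDownloader(DriverAction<D> globalAction) {
        this.globalAction = globalAction;
    }

    /** Runs the downloader-wide hook first, then the per-request hook if one is present. */
    @SuppressWarnings("unchecked")
    void runHooks(D driver, Map<String, Object> requestExtras) {
        if (globalAction != null) {
            globalAction.execute(driver);
        }
        Object perRequest = requestExtras.get("action");
        if (perRequest instanceof DriverAction) {
            ((DriverAction<D>) perRequest).execute(driver);
        }
    }
}
```

With this shape, behaviour shared by all pages (for example resizing the window) can live in the downloader-wide hook, while a single request can still carry its own extra step.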
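WebDriverPool implementation: note that WebDriver instances are pooled to keep performance acceptable.
This is also based on webmagic-selenium, with some modifications.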

    import com.fh.util.FileUtil;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.phantomjs.PhantomJSDriver;
    import org.openqa.selenium.phantomjs.PhantomJSDriverService;
    import org.openqa.selenium.remote.DesiredCapabilities;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    import java.util.concurrent.BlockingDeque;
    import java.util.concurrent.LinkedBlockingDeque;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    /**
     * @author taojw
     */
    public class WebDriverPool {
        private Logger logger = LoggerFactory.getLogger(getClass());
        private int CAPACITY = 5;
        // Number of drivers created so far and not yet closed.
        private AtomicInteger refCount = new AtomicInteger(0);
        private static final String DRIVER_PHANTOMJS = "phantomjs";
        /**
         * store webDrivers available
         */
        private BlockingDeque<WebDriver> innerQueue = new LinkedBlockingDeque<WebDriver>(CAPACITY);
        private static String PHANTOMJS_PATH;
        private static DesiredCapabilities caps = DesiredCapabilities.phantomjs();

        static {
            PHANTOMJS_PATH = FileUtil.getCommonProp("phantomjs.path");
            caps.setJavascriptEnabled(true);
            caps.setCapability(
                    PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY,
                    PHANTOMJS_PATH);
            caps.setCapability("takesScreenshot", true);
            caps.setCapability(
                    PhantomJSDriverService.PHANTOMJS_PAGE_CUSTOMHEADERS_PREFIX + "User-Agent",
                    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36");
            caps.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS, "--load-images=no");
        }

        public WebDriverPool() {
        }

        public WebDriverPool(int poolsize) {
            this.CAPACITY = poolsize;
            innerQueue = new LinkedBlockingDeque<WebDriver>(poolsize);
        }

        public WebDriver get() throws InterruptedException {
            // Reuse an idle driver if one is available.
            WebDriver poll = innerQueue.poll();
            if (poll != null) {
                return poll;
            }
            // Otherwise create a new driver, as long as the pool is below capacity.
            if (refCount.get() < CAPACITY) {
                synchronized (innerQueue) {
                    if (refCount.get() < CAPACITY) {
                        WebDriver mDriver = new PhantomJSDriver(caps);
                        // Tentative workaround for https://github.com/ariya/phantomjs/issues/11526
                        mDriver.manage().timeouts().pageLoadTimeout(60, TimeUnit.SECONDS);
                        // mDriver.manage().window().setSize(new Dimension(1366, 768));
                        innerQueue.add(mDriver);
                        refCount.incrementAndGet();
                    }
                }
            }
            // Blocks until a driver is returned when the pool is exhausted.
            return innerQueue.take();
        }

        public void returnToPool(WebDriver webDriver) {
            // webDriver.quit();
            // webDriver=null;
            innerQueue.add(webDriver);
        }

        public void close(WebDriver webDriver) {
            refCount.decrementAndGet();
            webDriver.close();
            webDriver.quit();
            webDriver = null;
        }

        public void shutdown() {
            try {
                for (WebDriver driver : innerQueue) {
                    close(driver);
                }
                innerQueue.clear();
            } catch (Exception e) {
                // e.printStackTrace();
                logger.warn("failed to shut down WebDriverPool", e);
            }
        }
    }

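After the changes:
Only PhantomJS is supported as the browser driver.
Added phantomjs-related configuration.
Changed the logic that controls the queue size.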
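The size-control idea (reuse an idle instance if there is one, create a new one only while under capacity, otherwise block until one is returned) can be sketched with the JDK alone. `LazyPool` below is a hypothetical generic class, with `T` standing in for WebDriver and a `Supplier` standing in for `new PhantomJSDriver(caps)`. Note that this sketch reserves a slot with compare-and-set rather than the synchronized double check used in WebDriverPool, and it wraps InterruptedException in an unchecked exception; both are choices made for the example only.

```java
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical bounded pool that creates its objects lazily. */
class LazyPool<T> {
    private final int capacity;
    private final Supplier<T> factory;
    private final AtomicInteger created = new AtomicInteger(0);
    private final BlockingDeque<T> idle;

    LazyPool(int capacity, Supplier<T> factory) {
        this.capacity = capacity;
        this.factory = factory;
        this.idle = new LinkedBlockingDeque<>(capacity);
    }

    T get() {
        // 1. Reuse an idle object when possible.
        T free = idle.poll();
        if (free != null) {
            return free;
        }
        // 2. Below capacity: reserve a slot atomically, then create a new object.
        int n;
        while ((n = created.get()) < capacity) {
            if (created.compareAndSet(n, n + 1)) {
                try {
                    return factory.get();
                } catch (RuntimeException e) {
                    created.decrementAndGet(); // release the reserved slot
                    throw e;
                }
            }
        }
        // 3. At capacity: wait until another thread gives an object back.
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a pooled object", e);
        }
    }

    void giveBack(T item) {
        idle.add(item);
    }

    int createdCount() {
        return created.get();
    }
}
```

Because at most `capacity` objects ever exist, `giveBack` can never overflow the bounded deque, which is the same invariant the refCount check protects in the pool above.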
WindowUtil
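Note that loadAll is implemented in a rather neat way. On pages that load content as you scroll, jumping straight to the bottom can leave the middle part unloaded, so you would otherwise have to scroll slowly through each page. loadAll instead reads the page's scrollable height, resizes the browser window to that height, and refreshes; after the refresh all of the content is loaded.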

    import org.apache.commons.io.FileUtils;
    import org.openqa.selenium.*;

    import java.io.File;
    import java.io.IOException;
    import java.util.concurrent.TimeUnit;

    /**
     * @author taojw
     */
    public class WindowUtil {
        /**
         * Scrolls the window.
         * @param driver
         * @param height
         */
        public static void scroll(WebDriver driver, int height) {
            ((JavascriptExecutor) driver).executeScript("window.scrollTo(0," + height + " );");
        }

        /**
         * Resizes the window to fit the page. This takes some time,
         * so wait for a reasonable period afterwards.
         * @param driver
         */
        public static void loadAll(WebDriver driver) {
            Dimension od = driver.manage().window().getSize();
            int width = driver.manage().window().getSize().width;
            // Tentative workaround for https://github.com/ariya/phantomjs/issues/11526
            driver.manage().timeouts().pageLoadTimeout(60, TimeUnit.SECONDS);
            long height = (Long) ((JavascriptExecutor) driver).executeScript("return document.body.scrollHeight;");
            driver.manage().window().setSize(new Dimension(width, (int) height));
            driver.navigate().refresh();
        }

        public static void taskScreenShot(WebDriver driver, File saveFile) {
            File src = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
            try {
                FileUtils.copyFile(src, saveFile);
            } catch (IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }

        public static void changeWindow(WebDriver driver) {
            // Get the handle of the current page.
            String handle = driver.getWindowHandle();
            // Go through all page handles and switchTo() any that is not the current one.
            for (String handles : driver.getWindowHandles()) {
                if (handles.equals(handle))
                    continue;
                driver.switchTo().window(handles);
            }
        }
    }

That concludes the extensions to the crawler framework.
Hands-on part
Scraping Taobao shop information

    /**
     * Shop sales information
     *
     * @author taojw
     */
    @Scope("prototype")
    @Component
    public class TaoBaoShopInfoProcessor implements PageProcessor {
        private static final Logger log = LoggerFactory
                .getLogger(TaoBaoShopInfoProcessor.class);
        @Autowired
        private TaoBaoShopInfoService service;
        private Site site = Site
                .me()
                .setCharset("UTF-8")
                .setCycleRetryTimes(3)
                .setSleepTime(3 * 1000)
                .addHeader("Connection", "keep-alive")
                .addHeader("Cache-Control", "max-age=0")
                .addHeader("User-Agent",
                        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0");
        private AtomicBoolean isPageAdd = new AtomicBoolean(false);
        private static AtomicBoolean running = new AtomicBoolean(false);
        private WebDriverPool pool = new WebDriverPool();

        @Override
        public Site getSite() {
            return this.site;
        }

        @Override
        public void process(Page page) {
            if (islistPage(page)) {
                List<String> urls = page.getHtml()
                        .$("dl.item a.J_TGoldData", "href").all();
                List<String> targetUrls = new ArrayList<String>();
                for (String url : urls) {
                    targetUrls.add(url.trim());
                }
                page.addTargetRequests(targetUrls);
                if (isPageAdd.compareAndSet(false, true)) {
                    // Pagination
                    String pageinfo = page.getHtml()
                            .$(".pagination .page-info", "text").get();
                    int pageCount = Integer.valueOf(pageinfo.split("/")[1]);
                    String cururl = page.getUrl().get();
                    // Only crawl the first 5 pages
                    if (pageCount > 5) {
                        pageCount = 5;
                    }
                    for (int i = 1; i < pageCount; i++) {
                        String tmp = cururl + "&pageNo=" + (i + 1);
                        page.addTargetRequest(tmp);
                    }
                }
                return;
            }
            // Product page
            String curUrl = page.getUrl().get();
            boolean isTaoBao = curUrl.startsWith("https://item.taobao.com");
            boolean isTmall = curUrl.startsWith("https://detail.tmall.com");
            // split() takes a regex, so the "?" has to be escaped
            String tmpspm = curUrl.split("\\?")[1].split("&")[0];
            // spm code
            String spm = tmpspm.split("=")[1];
            // Shop URL
            String shopUrl = "";
            // Product name
            String name = "";
            // Price
            double price = 0;
            // Number of sales in the last 30 days
            int sellCount = 0;
            // Total sales value
            double allPrice = 0;
            if (isTaoBao) {
                shopUrl = page.getHtml()
                        .xpath("//div[@class=\"tb-shop-name\"]/dl/dd/strong/a/@href")
                        .get();
                shopUrl = shopUrl.split("\\?")[0];
                name = page.getHtml().xpath("//*[@id=\"J_Title\"]/h3/text()")
                        .get();
                try {
                    price = Double.valueOf(page.getHtml()
                            .$("#J_PromoPriceNum", "text").get().split("-")[0].trim());
                } catch (Exception e) {
                    // No promotional price: fall back to the regular price
                    price = Double.valueOf(page.getHtml()
                            .$("#J_StrPrice .tb-rmb-num", "text").get().split("-")[0].trim());
                }
                sellCount = Integer.valueOf(page.getHtml()
                        .$("#J_SellCounter", "text").get());
                allPrice = Double.valueOf(price) * Double.valueOf(sellCount);
            } else if (isTmall) {
                shopUrl = page.getHtml()
                        .xpath("//*[@id=\"side-shop-info\"]/div/h3/div/a/@href")
                        .get();
                shopUrl = shopUrl.split("\\?")[0];
                name = page.getHtml().$(".tb-detail-hd h1", "text")
                        .get().trim();
                price = Double.valueOf(page.getHtml()
                        .$(".tm-price", "text").get().split("-")[0].trim());
                sellCount = Integer.valueOf(page.getHtml()
                        .$(".tm-count", "text").get().trim());
                allPrice = Double.valueOf(price) * Double.valueOf(sellCount);
            }
            // Collection date
            // Timestamp recordDate=new Timestamp(new Date().getTime());
            String recordDate = DateUtil.formatDate(new Date(), "yyyy-MM-dd");
            log.debug(shopUrl + ":" + spm + ":" + name + ":" + price + ":"
                    + sellCount + ":" + allPrice + ":" + recordDate);
            PageData pd = new PageData();
            pd.put("id", UUID.randomUUID().toString());
            pd.put("shopUrl", shopUrl);
            pd.put("spm", spm);
            pd.put("name", name);
            pd.put("price", price);
            pd.put("sellCount", sellCount);
            pd.put("allPrice", allPrice);
            pd.put("recordDate", recordDate);
            service.saveData(pd);
        }

        // A page without the promo-price element is treated as a list page.
        private boolean islistPage(Page page) {
            String tmp = page.getHtml().$("#J_PromoPrice").get();
            if (StringUtils.isBlank(tmp)) {
                return true;
            }
            return false;
        }

        public void start() {
            if (running.compareAndSet(false, true)) {
                try {
                    service.emptyTable();
                    List<String> urls = service.getShopUrl();
                    if (urls == null) {
                        log.error("Failed to get shop URLs; aborting crawl");
                    }
                    String[] urlStrs = null;
                    int size = 50;
                    // int size=urls.size();
                    if (urls.size() ...
                    // (the remainder of this listing is cut off in the original post)

Scraping Maoyan box-office data
Maoyan renders its box-office figures with an obfuscated icon font, and the code point mapped to each digit changes on every load. So this time selenium is used to load the page, take a screenshot, and crop out each number. Because Maoyan's box-office figures are very regular, a trained model for Google's Tesseract-OCR is then used to recognize the cropped number images.
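The "crop out each number" step can be sketched with the JDK alone. This is a hedged alternative to the Thumbnailator-based ImageUtil, not a replacement for it: `CropSketch` is a hypothetical class, and the rectangle is assumed to come from wherever the element sits in the screenshot (with Selenium that would typically be `WebElement.getLocation()` and `WebElement.getSize()`).

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper: copies a rectangular region out of a screenshot. */
class CropSketch {
    static BufferedImage crop(BufferedImage src, int x, int y, int width, int height) {
        if (x < 0 || y < 0 || width <= 0 || height <= 0
                || x + width > src.getWidth() || y + height > src.getHeight()) {
            throw new IllegalArgumentException("region lies outside the source image");
        }
        // Copy pixel by pixel so the result does not share data with the source image.
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int dy = 0; dy < height; dy++) {
            for (int dx = 0; dx < width; dx++) {
                out.setRGB(dx, dy, src.getRGB(x + dx, y + dy));
            }
        }
        return out;
    }
}
```

The cropped image can then be written to disk with `javax.imageio.ImageIO.write(image, "png", file)` and handed to the OCR step.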
ImageUtil handles the cropping.

    import net.coobird.thumbnailator.Thumbnails;
    import net.coobird.thumbnailator.geometry.Position;
    import net.coobird.thumbnailator.geometry.Size;

    import java.io.IOException;

    /**
     * @author taojw
     */
    public class ImageUtil {
        public static void crop(String srcfile, String destfile, ImageRegion region) {
            // Crop at the given coordinates
            try {
                Thumbnails.of(srcfile)
                        .sourceRegion(region.x, region.y, region.width, region.height)
                        .size(region.width, region.height).outputQuality(1.0)
                        //.keepAspectRatio(false) // do not keep the aspect ratio
                        .toFile(destfile);
            } catch (IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }

        public static void main(String[] args) {
            crop("D:data111.png", "D:data1112.png", new ImageRegion(66, 264, 422, 426));
        }
    }

    /**
     * @author taojw
     */
    public class ImageRegion {
        public int x;
        public int y;
        public int width;
        public int height;

        public ImageRegion(int x, int y, int width, int height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

TesseractOcrUtil calls the tesseract process and returns the recognition result.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.UUID;

    import org.apache.commons.io.FileUtils;
    import org.apache.commons.io.IOUtils;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    import com.fh.util.FileUtil;

    /**
     * @author taojw
     */
    public class TesseractOcrUtil {
        private static final Logger log = LoggerFactory
                .getLogger(TesseractOcrUtil.class);
        private static final String tessPath;
        private static final String basePath;

        static {
            tessPath = FileUtil.getCommonProp("tesseract.path");
            basePath = new File(tessPath).getParentFile().getAbsolutePath();
        }

        public static String getByLangNum(String imagePath) {
            return get(imagePath, "num");
        }

        public static String getByLangChi(String imagePath) {
            return get(imagePath, "chi_sim");
        }

        public static String getByLangEng(String imagePath) {
            return get(imagePath, "eng");
        }

        public static String get(String imagePath, String lang) {
            String outName = UUID.randomUUID().toString();
            String outPath = basePath + File.separator + outName + ".txt";
            // String cmd = tessPath + " " + imagePath + " " + outName + " -l " + lang;
            ProcessBuilder pb = new ProcessBuilder();
            pb.directory(new File(basePath));
            pb.command(tessPath, imagePath, outName, "-l", lang);
            pb.redirectErrorStream(true);
            Process process = null;
            String errormsg = "";
            String res = null;
            try {
                process = pb.start();
                // tesseract.exe 1.jpg 1 -l chi_sim
                int excode = process.waitFor();
                if (excode == 0) {
                    // Only the first line of the output file is read.
                    BufferedReader in = new BufferedReader(new InputStreamReader(
                            new FileInputStream(outPath), "UTF-8"));
                    res = in.readLine();
                    IOUtils.closeQuietly(in);
                } else {
                    switch (excode) {
                    case 1:
                        errormsg = "Errors accessing files. There may be spaces in your image's filename.";
                        break;
                    case 29:
                        errormsg = "Cannot recognize the image or its selected region.";
                        break;
                    case 31:
                        errormsg = "Unsupported image format.";
                        break;
                    default:
                        errormsg = "Errors occurred.";
                    }
                    log.error("when ocr picture " + imagePath
                            + " an error occurred. " + errormsg);
                }
            } catch (IOException e) {
                e.printStackTrace();
                log.warn("ocr process hit an io error", e);
            } catch (InterruptedException e) {
                e.printStackTrace();
                log.warn("ocr process was interrupted unexpectedly!", e);
            } finally {
                // Both the input image and the output text file are deleted afterwards.
                FileUtils.deleteQuietly(new File(imagePath));
                FileUtils.deleteQuietly(new File(outPath));
            }
            if (res != null) {
                res = res.trim();
            }
            return res;
        }
    }

    /**
     * @author taojw
     */
    public class MaoyanTest implements PageProcessor {
        private static Site site = Site.me().setCharset("UTF-8").setUserAgent(
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31");

        @Override
        public Site getSite() {
            return site;
        }

        @Override
        public void process(Page page) {
        }

        public void start() {
            Spider cnSpider = Spider.create(this).setDownloader(new SeleniumDownloader(5000, null, new TestAction()))
                    // .addUrl("https://shop34068488.taobao.com/?spm=a230r.7195193.1997079397.2.JLFlPa")
                    // .addUrl("http://piaofang.maoyan.com/company/cinema?date=2017-01-18&webCityId=288&cityTier=0&page=1&cityName=%E6%8F%AD%E9%98%B3");
                    .addUrl("http://piaofang.maoyan.com/company/cinema?date=2017-01-18&webCityId=84&cityTier=0&page=1&cityName=%E4%BF%9D%E5%AE%9A");
            // .addPipeline(new JsonFilePipeline("D:datawebmagicfile.json"))
            // SpiderMonitor.instance().register(cnSpider);
            cnSpider.run();
        }

        public static void main(String[] args) {
            new MaoyanTest().start();
        }

        private class TestAction implements SeleniumAction {
            @Override
            public void execute(WebDriver driver) {
                WindowUtil.loadAll(driver);
                try {
                    Thread.sleep(5000);
                    // WebDriverWait wait = new WebDriverWait(driver, 10);
                    // wait.until(ExpectedConditions.presenceOfElementLocated(By.id("J_PromoPriceNum")));
                    File src = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
                    String srcfile = "D:data" + UUID.randomUUID().toString() + ".png";
                    FileUtils.copyFile(src, new File(srcfile));
                    List<WebElement> movielist = driver.findElements(By.xpath("//*[@id=\"cinema-tbody\"]/tr"));
                    // movielist.remove(0);
                    for (int i = 1; i ...
                    // (the remainder of this listing is cut off in the original post)

Reference links:
selenium article series: http://www.cnblogs.com/TankXi...
selenium api: http://seleniumhq.github.io/s...
tesseract-ocr sample training: http://blog.csdn.net/firehood...
selenium multi-window switching: http://blog.csdn.net/meyoung0...