Best Practices for Reading Files with POI

Posted by bingchen on 2019-08-16 10:53

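Summary: A tool I built recently reads the Word and Excel files on a computer. Reading very large files often ended in an out-of-memory error. These are the things I explored and discovered while reading files with POI; I hope they help.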

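POI is a well-known Apache library for reading and writing Microsoft Office documents. Many people have used it to export reports, create Word documents, or read such files, and it really does make those tasks much easier. A tool I built recently reads the Word and Excel files on a computer. Below I go through the pitfalls I ran into in both areas: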

Word

For Word files, all I need is the body text of the document. So a method like this can be written to read doc or docx files:

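    // Needs: org.apache.poi.hwpf.extractor.WordExtractor,
    //        org.apache.poi.xwpf.usermodel.XWPFDocument,
    //        org.apache.poi.xwpf.extractor.XWPFWordExtractor
    // First attempt: the parser is chosen from the file extension.
    private static String readDoc(String filePath, InputStream is) throws IOException {
        String text = "";
        try {
            if (filePath.endsWith(".doc")) {
                WordExtractor ex = new WordExtractor(is);
                text = ex.getText();
                ex.close();
            } else if (filePath.endsWith(".docx")) {
                XWPFDocument doc = new XWPFDocument(is);
                XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
                text = extractor.getText();
                extractor.close();
            }
        } catch (Exception e) {
            logger.error(filePath, e);
        } finally {
            // Close the stream exactly once, whether or not parsing succeeded.
            if (is != null) {
                is.close();
            }
        }
        return text;
    }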

In theory this code should work for most doc and docx files. But I ran into a strange problem: when reading certain doc files, my code often threw this exception:

org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents.

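What does this exception mean? Put simply, the file you opened is not really a doc file, and you should read it with the docx code path instead. Yet the file we opened clearly has a .doc extension!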

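In fact doc and docx are fundamentally different: doc is an OLE2 format, whereas docx is OOXML. If you open a docx file with an archive tool, you will find folders such as _rels, docProps and word inside.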

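Essentially a docx file is a zip file that contains a number of XML files. So even when a docx file is small on disk, the XML inside it can be quite large, which is why reading some docx files that do not look big still consumes a lot of memory.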
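To make the size mismatch concrete, here is a minimal JDK-only sketch (no POI involved). It builds a made-up ZIP in memory whose entry names imitate a docx package, then compares the archive size with the inflated size of each entry. The class name DocxZipDemo, the helper names and the repeated XML content are all hypothetical and exist only for illustration; a real docx has more parts.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class DocxZipDemo {

    // Builds a tiny ZIP in memory that mimics the layout of a docx package.
    // The entry names follow the usual docx layout; the XML content is made up.
    static byte[] buildFakeDocx(int paragraphs) throws IOException {
        StringBuilder xml = new StringBuilder("<w:document><w:body>");
        for (int i = 0; i < paragraphs; i++) {
            xml.append("<w:p><w:r><w:t>hello world</w:t></w:r></w:p>");
        }
        xml.append("</w:body></w:document>");

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("[Content_Types].xml"));
            zip.write("<Types/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.toString().getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Returns entry name -> uncompressed size, measured by inflating each entry.
    static Map<String, Long> uncompressedSizes(byte[] zipBytes) throws IOException {
        Map<String, Long> sizes = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            byte[] buffer = new byte[8192];
            while ((entry = zip.getNextEntry()) != null) {
                long total = 0;
                int n;
                while ((n = zip.read(buffer)) != -1) {
                    total += n;
                }
                sizes.put(entry.getName(), total);
            }
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        byte[] docx = buildFakeDocx(100_000);
        System.out.println("archive size: " + docx.length + " bytes");
        uncompressedSizes(docx).forEach((name, size) ->
                System.out.println(name + " -> " + size + " bytes uncompressed"));
    }
}
```

Repetitive XML compresses extremely well, so the archive stays tiny while the inflated document part is far larger. A parser that builds an object model from that XML then needs more memory again, which is consistent with the behaviour described above.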

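I then opened the problematic doc file with an archive tool and, sure enough, its contents had exactly the layout of a docx package, so in essence it can be treated as a docx file. It was probably saved in some compatibility mode, which led to this nasty surprise. So judging whether a file is doc or docx by its extension is no longer reliable.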

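Honestly, I don't think this is a rare problem, yet I found next to nothing about it on Google. The Stack Overflow question "how to know whether a file is .docx or .doc format from Apache POI" has an example that uses ZipInputStream to check whether the file is a docx: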

boolean isZip = new ZipInputStream( fileStream ).getNextEntry() != null;

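But I don't consider this a good approach, because I have to construct a ZipInputStream just for the check. The operation also seems to affect the underlying InputStream, so reading a genuine doc file afterwards runs into trouble. Alternatively you could use a File object to test whether the file is a zip. That is not a good option either: I also need to read doc and docx files that sit inside archives, so my input has to be an InputStream. I went back and forth with people on Stack Overflow for quite a while, and at times I doubted whether my question was being understood at all, but in the end someone came up with a solution that delighted me: FileMagic. It is a feature newly added in POI 3.17: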

public enum FileMagic {
    /** OLE2 / BIFF8+ stream used for Office 97 and higher documents */
    OLE2(HeaderBlockConstants._signature),
    /** OOXML / ZIP stream */
    OOXML(OOXML_FILE_HEADER),
    /** XML file */
    XML(RAW_XML_FILE_HEADER),
    /** BIFF2 raw stream - for Excel 2 */
    BIFF2(new byte[]{
            0x09, 0x00, // sid=0x0009
            0x04, 0x00, // size=0x0004
            0x00, 0x00, // unused
            0x70, 0x00  // 0x70 = multiple values
    }),
    /** BIFF3 raw stream - for Excel 3 */
    BIFF3(new byte[]{
            0x09, 0x02, // sid=0x0209
            0x06, 0x00, // size=0x0006
            0x00, 0x00, // unused
            0x70, 0x00  // 0x70 = multiple values
    }),
    /** BIFF4 raw stream - for Excel 4 */
    BIFF4(new byte[]{
            0x09, 0x04, // sid=0x0409
            0x06, 0x00, // size=0x0006
            0x00, 0x00, // unused
            0x70, 0x00  // 0x70 = multiple values
    },new byte[]{
            0x09, 0x04, // sid=0x0409
            0x06, 0x00, // size=0x0006
            0x00, 0x00, // unused
            0x00, 0x01
    }),
    /** Old MS Write raw stream */
    MSWRITE(
            new byte[]{0x31, (byte)0xbe, 0x00, 0x00 },
            new byte[]{0x32, (byte)0xbe, 0x00, 0x00 }),
    /** RTF document */
    RTF("{\\rtf"),
    /** PDF document */
    PDF("%PDF"),
    // keep UNKNOWN always as last enum!
    /** UNKNOWN magic */
    UNKNOWN(new byte[0]);

    final byte[][] magic;

    FileMagic(long magic) {
        this.magic = new byte[1][8];
        LittleEndian.putLong(this.magic[0], 0, magic);
    }

    FileMagic(byte[]... magic) {
        this.magic = magic;
    }

    FileMagic(String magic) {
        this(magic.getBytes(LocaleUtil.CHARSET_1252));
    }

    public static FileMagic valueOf(byte[] magic) {
        for (FileMagic fm : values()) {
            int i=0;
            boolean found = true;
            for (byte[] ma : fm.magic) {
                for (byte m : ma) {
                    byte d = magic[i++];
                    if (!(d == m || (m == 0x70 && (d == 0x10 || d == 0x20 || d == 0x40)))) {
                        found = false;
                        break;
                    }
                }
                if (found) {
                    return fm;
                }
            }
        }
        return UNKNOWN;
    }

    /**
     * Get the file magic of the supplied InputStream (which MUST
     *  support mark and reset).
     *
     * If unsure if your InputStream does support mark / reset,
     *  use {@link #prepareToCheckMagic(InputStream)} to wrap it and make
     *  sure to always use that, and not the original!
     *
     * Even if this method returns {@link FileMagic#UNKNOWN} it could potentially mean,
     *  that the ZIP stream has leading junk bytes
     *
     * @param inp An InputStream which supports either mark/reset
     */
    public static FileMagic valueOf(InputStream inp) throws IOException {
        if (!inp.markSupported()) {
            throw new IOException("getFileMagic() only operates on streams which support mark(int)");
        }

        // Grab the first 8 bytes
        byte[] data = IOUtils.peekFirst8Bytes(inp);

        return FileMagic.valueOf(data);
    }


    /**
     * Checks if an {@link InputStream} can be reseted (i.e. used for checking the header magic) and wraps it if not
     *
     * @param stream stream to be checked for wrapping
     * @return a mark enabled stream
     */
    public static InputStream prepareToCheckMagic(InputStream stream) {
        if (stream.markSupported()) {
            return stream;
        }
        // we used to process the data via a PushbackInputStream, but user code could provide a too small one
        // so we use a BufferedInputStream instead now
        return new BufferedInputStream(stream);
    }
}

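That is the core of the code: it decides the file type from the first 8 bytes of the InputStream, which is without doubt the most elegant solution. Early on I had in fact wondered whether the first few bytes of an archive carry a distinct signature, the so-called magic number. Because FileMagic's dependencies are compatible with version 3.16, I only had to add this one class to my project. The correct way to read a Word file is therefore: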

    private static String readDoc(String filePath, InputStream is) throws IOException {
        String text = "";
        // Wrap the stream if needed so its header can be peeked and then rewound.
        is = FileMagic.prepareToCheckMagic(is);
        try {
            // Decide by content, not by file extension.
            FileMagic magic = FileMagic.valueOf(is);
            if (magic == FileMagic.OLE2) {
                WordExtractor ex = new WordExtractor(is);
                text = ex.getText();
                ex.close();
            } else if (magic == FileMagic.OOXML) {
                XWPFDocument doc = new XWPFDocument(is);
                XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
                text = extractor.getText();
                extractor.close();
            }
        } catch (Exception e) {
            logger.error("for file " + filePath, e);
        } finally {
            if (is != null) {
                is.close();
            }
        }
        return text;
    }
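To show the idea behind the header check without any POI dependency, here is a minimal JDK-only sketch. It is not POI's implementation: the class MagicSniffDemo, the Kind enum and the helper names are hypothetical, and only two signatures are covered (the 8-byte OLE2 signature and the 4-byte ZIP local file header "PK\003\004"). The point it illustrates is peeking at the header with mark/reset so that the stream is still positioned at byte 0 for the real parser.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class MagicSniffDemo {

    enum Kind { OLE2, ZIP, UNKNOWN }

    // OLE2 compound document signature (legacy .doc/.xls/.ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature (docx/xlsx are ZIP packages).
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    // Wraps the stream if it cannot rewind, the same idea as prepareToCheckMagic.
    static InputStream markable(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    // Peeks at up to 8 bytes and rewinds, so the caller can still parse from byte 0.
    static Kind sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("stream must support mark/reset");
        }
        byte[] head = new byte[8];
        int read = 0;
        in.mark(head.length);
        try {
            int n;
            while (read < head.length
                    && (n = in.read(head, read, head.length - read)) != -1) {
                read += n;
            }
        } finally {
            in.reset();
        }
        if (startsWith(head, read, OLE2_MAGIC)) return Kind.OLE2;
        if (startsWith(head, read, ZIP_MAGIC)) return Kind.ZIP;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int length, byte[] prefix) {
        if (length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] fakeDocx = { 0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00, 0x08 };
        InputStream in = markable(new ByteArrayInputStream(fakeDocx));
        System.out.println(sniff(in));   // prints ZIP
        System.out.println(in.read());   // prints 80 (0x50): the stream was rewound
    }
}
```

A ZIP signature only says the file is a ZIP container, not that it is a valid docx, so the parser's own exceptions still need to be handled afterwards.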

Excel

For Excel I won't compare my earlier attempts with the current one; here is simply what I now consider the best approach:

    @SuppressWarnings("deprecation" )
    private static String readExcel(String filePath, InputStream inp) throws Exception {
        Workbook wb;
        StringBuilder sb = new StringBuilder();
        try {
            if (filePath.endsWith(".xls")) {
                wb = new HSSFWorkbook(inp);
            } else {
                wb = StreamingReader.builder()
                        .rowCacheSize(1000)    // number of rows to keep in memory (defaults to 10)
                        .bufferSize(4096)     // buffer size to use when reading InputStream to file (defaults to 1024)
                        .open(inp);            // InputStream or File for XLSX file (required)
            }
            sb = readSheet(wb, sb, filePath.endsWith(".xls"));
            wb.close();
        } catch (OLE2NotOfficeXmlFileException e) {
            logger.error(filePath, e);
        } finally {
            if (inp != null) {
                inp.close();
            }
        }
        return sb.toString();
    }

    private static String readExcelByFile(String filepath, File file) {
        Workbook wb;
        StringBuilder sb = new StringBuilder();
        try {
            if (filepath.endsWith(".xls")) {
                wb = WorkbookFactory.create(file);
            } else {
                wb = StreamingReader.builder()
                        .rowCacheSize(1000)    // number of rows to keep in memory (defaults to 10)
                        .bufferSize(4096)     // buffer size to use when reading InputStream to file (defaults to 1024)
                        .open(file);            // InputStream or File for XLSX file (required)
            }
            sb = readSheet(wb, sb, filepath.endsWith(".xls"));
            wb.close();
        } catch (Exception e) {
            logger.error(filepath, e);
        }
        return sb.toString();
    }

    private static StringBuilder readSheet(Workbook wb, StringBuilder sb, boolean isXls) throws Exception {
        // One formatter is enough; it renders numeric cells as display text.
        DataFormatter formatter = new DataFormatter();
        for (Sheet sheet : wb) {
            for (Row r : sheet) {
                for (Cell cell : r) {
                    // Only string and numeric cells are kept; everything else is skipped.
                    if (cell.getCellType() == Cell.CELL_TYPE_STRING) {
                        sb.append(cell.getStringCellValue());
                        sb.append(" ");
                    } else if (cell.getCellType() == Cell.CELL_TYPE_NUMERIC) {
                        if (isXls) {
                            // HSSF numeric cells cannot be read with getStringCellValue().
                            sb.append(formatter.formatCellValue(cell));
                        } else {
                            // Streaming-reader cells are read as strings here.
                            sb.append(cell.getStringCellValue());
                        }
                        sb.append(" ");
                    }
                }
            }
        }
        return sb;
    }

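For Excel reading, the biggest problem my tool faced was running out of memory; very large Excel files regularly triggered an out-of-memory error. Eventually I found an excellent tool, excel-streaming-reader, which reads xlsx files as a stream, so a very large file is processed in small pieces rather than loaded all at once.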

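Another optimization: wherever a File object is available, I read from the File rather than from an InputStream, because reading from an InputStream requires loading the whole thing into memory, which is very memory-hungry.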

Finally, one small trick is to use cell.getCellType to cut down the amount of data, since I only need the string content of text and numeric cells.

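Those are my explorations and findings from reading files with POI; I hope they are useful to you. The examples above come from a tool of mine called everywhere (it helps you run full-text content searches on your computer). Take a look if you are interested; stars and PRs are welcome.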
