
How to open a huge Excel file efficiently

I have a 150MB single-sheet Excel file that takes about 7 minutes to open on a very powerful machine using the following:

# using python
import xlrd
wb = xlrd.open_workbook(file)
sh = wb.sheet_by_index(0)

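Is there any way to open the Excel file more quickly? I'm open to even very outlandish suggestions (such as Hadoop, Spark, C, Java, etc.). Ideally I'm looking for a way to open the file in under 30 seconds, if that's not a pipe dream. Also, the example above uses Python, but it doesn't have to be Python.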


Note: this is an Excel file from a client. It cannot be converted into any other format before we receive it. It is not our file.


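Update: an answer with a working code example that opens the following 200MB Excel file in under 30 seconds will be awarded the bounty: https://drive.google.com/file/d/0B_CXvCTOo7_2VW9id2VXRWZrbzQ/view?usp=sharing The file should contain a string (col 1), a date (col 9), and a number (col 11).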

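Most programming languages that work with Office products go through some middle layer, and that is usually where the bottleneck is. A good example is using PIAs/Interop or the Open XML SDK.

One way to get at the data at a lower level (bypassing the middle layer) is to use a driver.

> 150MB single-sheet Excel file that takes about 7 minutes.

The best I could do was a 130MB file in 135 seconds, roughly 3 times faster: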

Stopwatch sw = new Stopwatch();
sw.Start();

DataSet excelDataSet = new DataSet();

string filePath = @"c:\temp\BigBook.xlsx";

// For .xlsx files we use Provider=Microsoft.ACE.OLEDB.12.0; for .xls we'd use Microsoft.Jet.OLEDB.4.0 with Extended Properties=\"Excel 8.0;HDR=YES;\"
string connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source='" + filePath + "';Extended Properties=\"Excel 12.0;HDR=YES;\"";

using (OleDbConnection conn = new OleDbConnection(connectionString))
{
    conn.Open();
    OleDbDataAdapter objDA = new System.Data.OleDb.OleDbDataAdapter
    ("select * from [Sheet1$]", conn);
    objDA.Fill(excelDataSet);
    //dataGridView1.DataSource = excelDataSet.Tables[0];
}
sw.Stop();
Debug.Print("Load XLSX tool: " + sw.ElapsedMilliseconds + " millisecs. Records = "  + excelDataSet.Tables[0].Rows.Count);


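Win 7 x64, Intel i5, 2.3 GHz, 8 GB RAM, 250 GB SSD.

If I may also recommend a hardware solution: if you are using a standard hard disk drive, try an SSD.

Note: I couldn't download your example Excel spreadsheet because I'm behind a corporate firewall.

PS. See MSDN - Fastest Way to import xlsx files with 200 MB of Data; the consensus there is that OleDB is the fastest.

PS 2. Here is how to do it with Python: http://code.activestate.com/recipes/440661-read-tabular-data-from-excel-spreadsheets-the-fast/

I managed to read the file in about 30 seconds using .NET Core and the Open XML SDK.

The following example returns a list of objects containing all rows and cells with matching types; it supports date, numeric, and text cells. The project is available here: https://github.com/xferaa/BigSpreadSheetExample/ (it works on Windows, Linux, and Mac OS, and does not require Excel or any Excel component to be installed).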

public List<List<object>> ParseSpreadSheet()
{
    List<List<object>> rows = new List<List<object>>();

    using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filePath, false))
    {
        WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
        WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();

        OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);

        Dictionary<int, string> sharedStringCache = new Dictionary<int, string>();

        int i = 0;
        foreach (var el in workbookPart.SharedStringTablePart.SharedStringTable.ChildElements)
        {
            sharedStringCache.Add(i++, el.InnerText);
        }

        while (reader.Read())
        {
            if(reader.ElementType == typeof(Row))
            {
                reader.ReadFirstChild();

                List<object> cells = new List<object>();

                do
                {
                    if (reader.ElementType == typeof(Cell))
                    {
                        Cell c = (Cell)reader.LoadCurrentElement();

                        if (c == null || c.DataType == null || !c.DataType.HasValue)
                            continue;

                        object value;

                        switch(c.DataType.Value)
                        {
                            case CellValues.Boolean:
                                value = bool.Parse(c.CellValue.InnerText);
                                break;
                            case CellValues.Date:
                                value = DateTime.Parse(c.CellValue.InnerText);
                                break;
                            case CellValues.Number:
                                value = double.Parse(c.CellValue.InnerText);
                                break;
                            case CellValues.InlineString:
                            case CellValues.String:
                                value = c.CellValue.InnerText;
                                break;
                            case CellValues.SharedString:
                                value = sharedStringCache[int.Parse(c.CellValue.InnerText)];
                                break;
                            default:
                                continue;
                        }

                        if (value != null)
                            cells.Add(value);
                    }

                } while (reader.ReadNextSibling());

                if (cells.Any())
                    rows.Add(cells);
            }
        }
    }

    return rows;
}

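I ran the program on a three-year-old laptop with an SSD, 8 GB of RAM, and an Intel Core i7-4710 CPU @ 2.50GHz (two cores) on Windows 10 64-bit.

Note that although opening and parsing the whole file into strings takes less than 30 seconds, building typed objects as in the example from my last edit pushes the time up to almost 50 seconds on my laptop. On your server running Linux you will probably get closer to 30 seconds.

The trick is to use the SAX approach, as explained here: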

https://msdn.microsoft.com/en-us/library/office/gg575571.aspx
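For readers following along in Python, the same SAX idea (react to elements as they stream past instead of building the whole document tree) can be sketched with the standard library's `xml.sax`. This is only an illustration, not the code from the answer above: the `CellHandler` class, the `read_rows` helper, and the toy XML are assumptions made for the example, and the sample omits the namespace that real worksheet XML carries.

```python
import io
import xml.sax

# Toy worksheet fragment (an assumption for this sketch, not real client data).
SAMPLE = b"""<worksheet><sheetData>
<row r="1"><c r="A1"><v>1.5</v></c><c r="B1"><v>2</v></c></row>
<row r="2"><c r="A2"><v>42</v></c></row>
</sheetData></worksheet>"""


class CellHandler(xml.sax.ContentHandler):
    """Collects the text of every <v> element, one list per <row>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_value = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "row":
            self.rows.append([])
        elif name == "v":
            self._in_value = True
            self._buffer = []

    def characters(self, content):
        # The parser may deliver one text node in several chunks, so buffer them.
        if self._in_value:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "v":
            self.rows[-1].append("".join(self._buffer))
            self._in_value = False


def read_rows(stream):
    """Parse a worksheet stream and return the cell values grouped by row."""
    handler = CellHandler()
    xml.sax.parse(stream, handler)
    return handler.rows


rows = read_rows(io.BytesIO(SAMPLE))
print(rows)
```

Only the current element's state is held while parsing; the `rows` list grows because this sketch keeps everything, and a real consumer could process each row as it completes instead.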

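Well, if your Excel file is always going to be as simple as your example ( https://drive.google.com/file/d/0B_CXvCTOo7_2UVZxbnpRaEVnaFk/view?usp=sharing ), you can try opening the file as a zip file and reading each XML part directly:

Hardware: Intel i5 4460, 12 GB RAM, Samsung EVO PRO SSD.

If you have a lot of RAM: this code needs a lot of memory, but it takes only 20~25 seconds. (You need the JVM parameter -Xmx7g.)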

package com.devsaki.opensimpleexcel;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipFile;

public class Multithread {

    public static final char CHAR_END = (char) -1;

    public static void main(String[] args) throws IOException, ExecutionException, InterruptedException {
        String excelFile = "C:/Downloads/BigSpreadsheetAllTypes.xlsx";
        ZipFile zipFile = new ZipFile(excelFile);
        long init = System.currentTimeMillis();
        ExecutorService executor = Executors.newFixedThreadPool(4);
        char[] sheet1 = readEntry(zipFile, "xl/worksheets/sheet1.xml").toCharArray();
        Future<Object[][]> futureSheet1 = executor.submit(() -> processSheet1(new CharReader(sheet1), executor));
        char[] sharedString = readEntry(zipFile, "xl/sharedStrings.xml").toCharArray();
        Future<String[]> futureWords = executor.submit(() -> processSharedStrings(new CharReader(sharedString)));

        Object[][] sheet = futureSheet1.get();
        String[] words = futureWords.get();

        executor.shutdown();

        long end = System.currentTimeMillis();
        System.out.println("only read: " + (end - init) / 1000);

        // Do something with the data: save it as CSV
        init = System.currentTimeMillis();
        try (PrintWriter writer = new PrintWriter(excelFile + ".csv", "UTF-8");) {
            for (Object[] rows : sheet) {
                for (Object cell : rows) {
                    if (cell != null) {
                        if (cell instanceof Integer) {
                            writer.append(words[(Integer) cell]);
                        } else if (cell instanceof String) {
                            writer.append(toDate(Double.parseDouble(cell.toString())));
                        } else {
                            writer.append(cell.toString()); //Probably a number
                        }
                    }
                    writer.append(";");
                }
                writer.append("\n");
            }
        }
        end = System.currentTimeMillis();
        System.out.println("Main saving to csv: " + (end - init) / 1000);
    }

    private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_DATE_TIME;
    private static final LocalDateTime INIT_DATE = LocalDateTime.parse("1900-01-01T00:00:00+00:00", formatter).plusDays(-2);

    // Excel stores dates as day counts from 1900-01-01, so each number read must be added to that base date
    public static String toDate(double s) {
        return formatter.format(INIT_DATE.plusSeconds((long) ((s*24*3600))));
    }

    public static String readEntry(ZipFile zipFile, String entry) throws IOException {
        System.out.println("Initialing readEntry " + entry);
        long init = System.currentTimeMillis();
        String result = null;

        try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
            br.readLine();
            result = br.readLine();
        }

        long end = System.currentTimeMillis();
        System.out.println("readEntry '" + entry + "': " + (end - init) / 1000);
        return result;
    }


    public static String[] processSharedStrings(CharReader br) throws IOException {
        System.out.println("Initialing processSharedStrings");
        long init = System.currentTimeMillis();
        String[] words = null;
        char[] wordCount = "Count=\"".toCharArray();
        char[] token = "<t>".toCharArray();
        String uniqueCount = extractNextValue(br, wordCount, '"');
        words = new String[Integer.parseInt(uniqueCount)];
        String nextWord;
        int currentIndex = 0;
        while ((nextWord = extractNextValue(br, token, '<')) != null) {
            words[currentIndex++] = nextWord;
            br.skip(11); //you can skip at least 11 chars "/t></si><si>"
        }
        long end = System.currentTimeMillis();
        System.out.println("SharedStrings: " + (end - init) / 1000);
        return words;
    }


    public static Object[][] processSheet1(CharReader br, ExecutorService executorService) throws IOException, ExecutionException, InterruptedException {
        System.out.println("Initialing processSheet1");
        long init = System.currentTimeMillis();
        char[] dimensionToken = "dimension ref=\"".toCharArray();
        String dimension = extractNextValue(br, dimensionToken, '"');
        int[] sizes = extractSizeFromDimention(dimension.split(":")[1]);
        br.skip(30); //Between dimension and next tag c exists more or less 30 chars
        Object[][] result = new Object[sizes[0]][sizes[1]];

        int parallelProcess = 8;
        int currentIndex = br.currentIndex;
        CharReader[] charReaders = new CharReader[parallelProcess];
        int totalChars = Math.round(br.chars.length / parallelProcess);
        for (int i = 0; i < parallelProcess; i++) {
            int endIndex = currentIndex + totalChars;
            charReaders[i] = new CharReader(br.chars, currentIndex, endIndex, i);
            currentIndex = endIndex;
        }
        Future[] futures = new Future[parallelProcess];
        for (int i = charReaders.length - 1; i >= 0; i--) {
            final int j = i;
            futures[i] = executorService.submit(() -> inParallelProcess(charReaders[j], j == 0 ? null : charReaders[j - 1], result));
        }
        for (Future future : futures) {
            future.get();
        }

        long end = System.currentTimeMillis();
        System.out.println("Sheet1: " + (end - init) / 1000);
        return result;
    }

    public static void inParallelProcess(CharReader br, CharReader back, Object[][] result) {
        System.out.println("Initialing inParallelProcess : " + br.identifier);

        char[] tokenOpenC = "<c r=\"".toCharArray();
        char[] tokenOpenV = "<v>".toCharArray();

        char[] tokenAttributS = " s=\"".toCharArray();
        char[] tokenAttributT = " t=\"".toCharArray();

        String v;
        int firstCurrentIndex = br.currentIndex;
        boolean first = true;

        while ((v = extractNextValue(br, tokenOpenC, '"')) != null) {
            if (first && back != null) {
                int sum = br.currentIndex - firstCurrentIndex - tokenOpenC.length - v.length() - 1;
                first = false;
                System.out.println("Adding to : " + back.identifier + " From : " + br.identifier);
                back.plusLength(sum);
            }
            int[] indexes = extractSizeFromDimention(v);

            int s = foundNextTokens(br, '>', tokenAttributS, tokenAttributT);
            char type = 's'; //3 types: number (n), string (s) and date (d)
            if (s == 0) { // Token S = number or date
                char read = br.read();
                if (read == '1') {
                    type = 'n';
                } else {
                    type = 'd';
                }
            } else if (s == -1) {
                type = 'n';
            }
            String c = extractNextValue(br, tokenOpenV, '<');
            Object value = null;
            switch (type) {
                case 'n':
                    value = Double.parseDouble(c);
                    break;
                case 's':
                    try {
                        value = Integer.parseInt(c);
                    } catch (Exception ex) {
                        System.out.println("Identifier Error : " + br.identifier);
                    }
                    break;
                case 'd':
                    value = c.toString();
                    break;
            }
            result[indexes[0] - 1][indexes[1] - 1] = value;
            br.skip(7); ///v></c>
        }
    }

    static class CharReader {
        char[] chars;
        int currentIndex;
        int length;

        int identifier;

        public CharReader(char[] chars) {
            this.chars = chars;
            this.length = chars.length;
        }

        public CharReader(char[] chars, int currentIndex, int length, int identifier) {
            this.chars = chars;
            this.currentIndex = currentIndex;
            if (length > chars.length) {
                this.length = chars.length;
            } else {
                this.length = length;
            }
            this.identifier = identifier;
        }

        public void plusLength(int n) {
            if (this.length + n <= chars.length) {
                this.length += n;
            }
        }

        public char read() {
            if (currentIndex >= length) {
                return CHAR_END;
            }
            return chars[currentIndex++];
        }

        public void skip(int n) {
            currentIndex += n;
        }
    }


    public static int[] extractSizeFromDimention(String dimention) {
        StringBuilder sb = new StringBuilder();
        int columns = 0;
        int rows = 0;
        for (char c : dimention.toCharArray()) {
            if (columns == 0) {
                if (Character.isDigit(c)) {
                    columns = convertExcelIndex(sb.toString());
                    sb = new StringBuilder();
                }
            }
            sb.append(c);
        }
        rows = Integer.parseInt(sb.toString());
        return new int[]{rows, columns};
    }

    public static int foundNextTokens(CharReader br, char until, char[]... tokens) {
        char character;
        int[] indexes = new int[tokens.length];
        while ((character = br.read()) != CHAR_END) {
            if (character == until) {
                break;
            }
            for (int i = 0; i < indexes.length; i++) {
                if (tokens[i][indexes[i]] == character) {
                    indexes[i]++;
                    if (indexes[i] == tokens[i].length) {
                        return i;
                    }
                } else {
                    indexes[i] = 0;
                }
            }
        }

        return -1;
    }

    public static String extractNextValue(CharReader br, char[] token, char until) {
        char character;
        StringBuilder sb = new StringBuilder();
        int index = 0;

        while ((character = br.read()) != CHAR_END) {
            if (index == token.length) {
                if (character == until) {
                    return sb.toString();
                } else {
                    sb.append(character);
                }
            } else {
                if (token[index] == character) {
                    index++;
                } else {
                    index = 0;
                }
            }
        }
        return null;
    }

    public static int convertExcelIndex(String index) {
        int result = 0;
        for (char c : index.toCharArray()) {
            result = result * 26 + ((int) c - (int) 'A' + 1);
        }
        return result;
    }
}
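The `toDate` helper above relies on how Excel stores dates: as a count of days (with the time of day as a fraction) from a base date, which the Java code models as 1900-01-01 minus two days. A minimal Python sketch of the same conversion follows; the function name `excel_serial_to_datetime` is made up for this example, and it assumes the workbook uses the 1900 date system and that the serial falls after February 1900.

```python
from datetime import datetime, timedelta

# 1900-01-01 minus two days, the same base date the Java INIT_DATE uses.
# Assumption: 1900 date system, serial numbers after February 1900.
EXCEL_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_datetime(serial):
    """Convert an Excel serial number (days plus fractional time) to a datetime."""
    return EXCEL_EPOCH + timedelta(seconds=round(serial * 86400))


print(excel_serial_to_datetime(42736.5))
```

Rounding to whole seconds avoids floating-point noise in the fractional day; drop the `round` if sub-second precision matters.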

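Old answer (does not need the Xms7g parameter, so it uses less memory): opening and reading the example file (200MB) takes about 35 seconds with an HDD; with an SSD it takes a bit less (30 seconds).

The code is here: https://github.com/csaki/OpenSimpleExcelFast.git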

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipFile;

public class Launcher {

    public static final char CHAR_END = (char) -1;

    public static void main(String[] args) throws IOException, ExecutionException, InterruptedException {
        long init = System.currentTimeMillis();
        String excelFile = "D:/Downloads/BigSpreadsheet.xlsx";
        ZipFile zipFile = new ZipFile(excelFile);

        ExecutorService executor = Executors.newFixedThreadPool(4);
        Future<String[]> futureWords = executor.submit(() -> processSharedStrings(zipFile));
        Future<Object[][]> futureSheet1 = executor.submit(() -> processSheet1(zipFile));
        String[] words = futureWords.get();
        Object[][] sheet1 = futureSheet1.get();
        executor.shutdown();

        long end = System.currentTimeMillis();
        System.out.println("Main only open and read: " + (end - init) / 1000);


        // Do something with the data: save it as CSV
        init = System.currentTimeMillis();
        try (PrintWriter writer = new PrintWriter(excelFile + ".csv", "UTF-8");) {
            for (Object[] rows : sheet1) {
                for (Object cell : rows) {
                    if (cell != null) {
                        if (cell instanceof Integer) {
                            writer.append(words[(Integer) cell]);
                        } else if (cell instanceof String) {
                            writer.append(toDate(Double.parseDouble(cell.toString())));
                        } else {
                            writer.append(cell.toString()); //Probably a number
                        }
                    }
                    writer.append(";");
                }
                writer.append("\n");
            }
        }
        end = System.currentTimeMillis();
        System.out.println("Main saving to csv: " + (end - init) / 1000);
    }

    private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_DATE_TIME;
    private static final LocalDateTime INIT_DATE = LocalDateTime.parse("1900-01-01T00:00:00+00:00", formatter).plusDays(-2);

    // Excel stores dates as day counts from 1900-01-01, so each number read must be added to that base date
    public static String toDate(double s) {
        return formatter.format(INIT_DATE.plusSeconds((long) ((s*24*3600))));
    }

    public static Object[][] processSheet1(ZipFile zipFile) throws IOException {
        String entry = "xl/worksheets/sheet1.xml";
        Object[][] result = null;
        char[] dimensionToken = "dimension ref=\"".toCharArray();
        char[] tokenOpenC = "<c r=\"".toCharArray();
        char[] tokenOpenV = "<v>".toCharArray();

        char[] tokenAttributS = " s=\"".toCharArray();
        char[] tokenAttributT = " t=\"".toCharArray();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
            String dimension = extractNextValue(br, dimensionToken, '"');
            int[] sizes = extractSizeFromDimention(dimension.split(":")[1]);
            br.skip(30); //Between dimension and next tag c exists more or less 30 chars
            result = new Object[sizes[0]][sizes[1]];
            String v;
            while ((v = extractNextValue(br, tokenOpenC, '"')) != null) {
                int[] indexes = extractSizeFromDimention(v);

                int s = foundNextTokens(br, '>', tokenAttributS, tokenAttributT);
                char type = 's'; //3 types: number (n), string (s) and date (d)
                if (s == 0) { // Token S = number or date
                    char read = (char) br.read();
                    if (read == '1') {
                        type = 'n';
                    } else {
                        type = 'd';
                    }
                } else if (s == -1) {
                    type = 'n';
                }
                String c = extractNextValue(br, tokenOpenV, '<');
                Object value = null;
                switch (type) {
                    case 'n':
                        value = Double.parseDouble(c);
                        break;
                    case 's':
                        value = Integer.parseInt(c);
                        break;
                    case 'd':
                        value = c.toString();
                        break;
                }
                result[indexes[0] - 1][indexes[1] - 1] = value;
                br.skip(7); ///v></c>
            }
        }
        return result;
    }

    public static int[] extractSizeFromDimention(String dimention) {
        StringBuilder sb = new StringBuilder();
        int columns = 0;
        int rows = 0;
        for (char c : dimention.toCharArray()) {
            if (columns == 0) {
                if (Character.isDigit(c)) {
                    columns = convertExcelIndex(sb.toString());
                    sb = new StringBuilder();
                }
            }
            sb.append(c);
        }
        rows = Integer.parseInt(sb.toString());
        return new int[]{rows, columns};
    }

    public static String[] processSharedStrings(ZipFile zipFile) throws IOException {
        String entry = "xl/sharedStrings.xml";
        String[] words = null;
        char[] wordCount = "Count=\"".toCharArray();
        char[] token = "<t>".toCharArray();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
            String uniqueCount = extractNextValue(br, wordCount, '"');
            words = new String[Integer.parseInt(uniqueCount)];
            String nextWord;
            int currentIndex = 0;
            while ((nextWord = extractNextValue(br, token, '<')) != null) {
                words[currentIndex++] = nextWord;
                br.skip(11); //you can skip at least 11 chars "/t></si><si>"
            }
        }
        return words;
    }

    public static int foundNextTokens(BufferedReader br, char until, char[]... tokens) throws IOException {
        char character;
        int[] indexes = new int[tokens.length];
        while ((character = (char) br.read()) != CHAR_END) {
            if (character == until) {
                break;
            }
            for (int i = 0; i < indexes.length; i++) {
                if (tokens[i][indexes[i]] == character) {
                    indexes[i]++;
                    if (indexes[i] == tokens[i].length) {
                        return i;
                    }
                } else {
                    indexes[i] = 0;
                }
            }
        }

        return -1;
    }

    public static String extractNextValue(BufferedReader br, char[] token, char until) throws IOException {
        char character;
        StringBuilder sb = new StringBuilder();
        int index = 0;

        while ((character = (char) br.read()) != CHAR_END) {
            if (index == token.length) {
                if (character == until) {
                    return sb.toString();
                } else {
                    sb.append(character);
                }
            } else {
                if (token[index] == character) {
                    index++;
                } else {
                    index = 0;
                }
            }
        }
        return null;
    }

    public static int convertExcelIndex(String index) {
        int result = 0;
        for (char c : index.toCharArray()) {
            result = result * 26 + ((int) c - (int) 'A' + 1);
        }
        return result;
    }

}
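Both Java versions lean on `convertExcelIndex` and `extractSizeFromDimention` to turn a cell reference such as `K5` into array indexes. The logic is a base-26 conversion of the column letters followed by a plain integer parse of the row digits. A rough Python equivalent, with function names chosen here for illustration only:

```python
def column_index(letters):
    """Convert Excel column letters to a 1-based index: A -> 1, Z -> 26, AA -> 27."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def split_cell_reference(ref):
    """Split a reference such as 'AB12' into (row, column), both 1-based."""
    letters = ref.rstrip("0123456789")  # everything before the trailing digits
    digits = ref[len(letters):]
    return int(digits), column_index(letters)


print(split_cell_reference("AB12"))
```

Like the Java helper, this returns the row first and the column second, so subtracting 1 from each gives zero-based positions in a result grid.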

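Python's Pandas library can be used to hold and process your data, but using it to load the .xlsx file directly, for example with read_excel(), will be quite slow.

One approach is to use Python to automate Excel itself into converting the file to CSV, and then use Pandas to load the resulting CSV file with read_csv(). This gives a good speed-up, but it will not get you under 30 seconds: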

import win32com.client as win32        
import pandas as pd    
from datetime import datetime    

print ("Starting")
start = datetime.now()

# Use Excel to load the xlsx file and save it in csv format
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Open(r'c:\full path\BigSpreadsheet.xlsx')
excel.DisplayAlerts = False
wb.DoNotPromptForConvert = True
wb.CheckCompatibility = False

print('Saving')
wb.SaveAs(r'c:\full path\temp.csv', FileFormat=6, ConflictResolution=2) 
excel.Application.Quit()

# Use Pandas to load the resulting CSV file
print('Loading CSV')
df = pd.read_csv(r'c:\full path\temp.csv', dtype=str)

print(df.shape)
print("Done", datetime.now() - start)

Column types
The types of the columns can be specified by passing dtype, converters and parse_dates:

df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[8], infer_datetime_format=True)

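You should also specify infer_datetime_format=True, as this will greatly speed up the date conversion.

> infer_datetime_format : boolean, default False
>
> If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.

If the dates are in the form DD/MM/YYYY, also add dayfirst=True.

Selective columns
If you actually only need to work on columns 1, 9 and 11, you can reduce resource usage further by specifying usecols=[0, 8, 10], as follows: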

df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[1], dayfirst=True, infer_datetime_format=True, usecols=[0, 8, 10])

The resulting dataframe will then contain only those 3 columns of data.
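The options discussed above can be tried on a small in-memory CSV before pointing them at the real export. The sketch below is illustrative only: the column names and sample rows are invented, it selects columns by name rather than by position, and it leaves out infer_datetime_format because recent pandas releases deprecate that argument.

```python
import io

import pandas as pd

# Toy CSV standing in for the exported file; names and values are made up.
CSV_TEXT = """name,unused,date,amount
alice,x,01/03/2017,10
bob,y,15/03/2017,20
"""

df = pd.read_csv(
    io.StringIO(CSV_TEXT),
    usecols=["name", "date", "amount"],      # skip columns that are not needed
    dtype={"name": str, "amount": "int64"},  # fix the types up front
    parse_dates=["date"],
    dayfirst=True,                           # dates are DD/MM/YYYY
)

print(df.dtypes)
print(df)
```

Selecting columns by name sidesteps the question of whether positional indexes refer to the original file or to the reduced column set.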

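RAM drive
Using a RAM drive to store the temporary CSV file would further speed up the load time.

Note: this assumes you are using a Windows PC with Excel installed.

I'm using a Dell Precision T1700 workstation, and with C# I was able to open the file and read its contents in about 24 seconds, using nothing more than standard code to open a workbook through the interop services. Here is my code, which uses a reference to the Microsoft Excel 15.0 Object Library.

My using statements: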

using System.Runtime.InteropServices;
using Excel = Microsoft.Office.Interop.Excel;

Code to open and read the workbook:

public partial class MainWindow : Window {
    public MainWindow() {
        InitializeComponent();

        Excel.Application xlApp;
        Excel.Workbook wb;
        Excel.Worksheet ws;

        xlApp = new Excel.Application();
        xlApp.Visible = false;
        xlApp.ScreenUpdating = false;

        wb = xlApp.Workbooks.Open(@"Desired Path of workbook\Copy of BigSpreadsheet.xlsx");

        ws = wb.Sheets["Sheet1"];

        //string rng = ws.get_Range("A1").Value;
        MessageBox.Show(ws.get_Range("A1").Value);

        Marshal.FinalReleaseComObject(ws);

        wb.Close();
        Marshal.FinalReleaseComObject(wb);

        xlApp.Quit();
        Marshal.FinalReleaseComObject(xlApp);

        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
}

I have created a sample Java program that loads the file in ~40 seconds on my laptop (Intel i7, 4 cores, 16 GB RAM).

https://github.com/skadyan/largefile

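The program uses the Apache POI library to load the .xlsx file through the XSSF SAX API.

An implementation of the callback interface com.stackoverlfow.largefile.RecordHandler can be used to process the data loaded from Excel. The interface defines a single method that takes three arguments:

  • sheetname: String, the Excel sheet name
  • row number: int, the row number of the data
  • data map: Map, from Excel cell reference to the Excel-formatted cell value

The class com.stackoverlfow.largefile.Main demonstrates one basic implementation of this interface, which just prints the row number to the console.

Update

The woodstox parser seems to perform better than the standard SAXReader. (The code in the repo has been updated.)

Also, to meet the desired performance requirement, you may consider re-implementing org.apache.poi...XSSFSheetXMLHandler. Your own implementation can handle string/text values in a more optimized way and can skip unnecessary text formatting operations.

It looks like this is hardly achievable in Python at all. If we unpack the sheet data file, just passing it through a C-based iterative SAX parser (using lxml, a very fast wrapper over libxml2) already takes the whole 30 seconds: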

from __future__ import print_function

from lxml import etree
import time


start_ts = time.time()

for data in etree.iterparse(open('xl/worksheets/sheet1.xml'), events=('start',), 
                            collect_ids=False, resolve_entities=False,
                            huge_tree=True):
    pass

print(time.time() - start_ts)

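Sample output: 27.2134890556

By the way, Excel itself needs about 40 seconds to load the workbook.

The C# and OLE solution still has some bottlenecks, so I tested it with C++ and ADO.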

_bstr_t connStr(makeConnStr(excelFile, header).c_str());

TESTHR(pRec.CreateInstance(__uuidof(Recordset)));       
TESTHR(pRec->Open(sqlSelectSheet(connStr, sheetIndex).c_str(), connStr, adOpenStatic, adLockOptimistic, adCmdText));

while(!pRec->adoEOF)
{
    for(long i = 0; i < pRec->Fields->GetCount(); ++i)
    {   
        _variant_t v = pRec->Fields->GetItem(i)->Value;
        if(v.vt == VT_R8)
            num[i] = v.dblVal;
        if(v.vt == VT_BSTR)
            str[i] = v.bstrVal;          
        ++cellCount;
    }                                    
    pRec->MoveNext();
}

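On an i5-4460 machine with an HDD, I found that 5 million cells in xls take 1.5 s, while the same data in xlsx takes 2.829 s. So it is possible to process your data in under 30 s.

If you really need to get under 30 s, use a RAM drive to reduce file I/O. It will improve your process significantly. I could not download your data to test it, so please let me know the result.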

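Another approach that should greatly improve the load/operation time is a RAMDrive.

Create a RAMDrive with enough space for your file, plus 10%...20% extra space...
Copy the file to the RAMDrive...
Load the file from there... Depending on your drive and filesystem, the speed improvement should be huge...

My favorite is the IMDisk toolkit
( https://sourceforge.net/projects/imdisk-toolkit/ ); it has a powerful command line, so you can script everything...

I also recommend SoftPerfect ramdisk
( http://www.majorgeeks.com/files/details/softperfect_ram_disk.html )

But this also depends on your OS...

I would like to know more about the system on which you are opening the file... Anyway:

Look in your system for a Windows update called
"Office File Validation Add-In ..."

If you have it, uninstall it.
The file should load much more quickly,
especially if it is loaded from a share.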

Have you tried loading the worksheet on demand, which has been available since xlrd version 0.7.1?

To do so, pass on_demand=True to open_workbook():

xlrd.open_workbook(filename=None, logfile=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>, verbosity=0, use_mmap=1, file_contents=None, encoding_override=None, formatting_info=False, on_demand=False, ragged_rows=False)


Other potential Python solutions I found for reading an xlsx file:

  • Read the raw XML in 'xl/sharedStrings.xml' and 'xl/worksheets/sheet1.xml'
  • Try the openpyxl library's read-only mode, which claims to be optimized for memory usage on large files as well.

     from openpyxl import load_workbook

     wb = load_workbook(filename='large_file.xlsx', read_only=True)
     ws = wb['big_data']

     for row in ws.rows:
         for cell in row:
             print(cell.value)

  • If you are running on Windows, you could use PyWin32 and 'Excel.Application'

     import time
     import win32com.client as win32

     def excel():
         xl = win32.gencache.EnsureDispatch('Excel.Application')
         ss = xl.Workbooks.Add()
         ...
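The first option in the list, reading the raw XML parts, can be sketched with only the standard library: open the .xlsx as a zip archive, load the shared-string table, then stream the sheet row by row. This is a minimal sketch under stated assumptions rather than a complete reader: `load_shared_strings`, `iter_rows` and `build_demo_workbook` are names invented here, values come back as raw strings (numbers and dates are not converted), and the demo archive holds only the two parts being read, so it is not a file Excel would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def load_shared_strings(archive):
    """Return the shared-string table as a list, or [] if the part is absent."""
    if "xl/sharedStrings.xml" not in archive.namelist():
        return []
    strings = []
    with archive.open("xl/sharedStrings.xml") as part:
        for _, elem in ET.iterparse(part, events=("end",)):
            if elem.tag == NS + "si":
                # An <si> entry may hold several <t> runs; join them.
                strings.append("".join(t.text or "" for t in elem.iter(NS + "t")))
                elem.clear()
    return strings


def iter_rows(archive, sheet="xl/worksheets/sheet1.xml"):
    """Yield each row as a list of raw values, resolving shared strings."""
    shared = load_shared_strings(archive)
    with archive.open(sheet) as part:
        for _, elem in ET.iterparse(part, events=("end",)):
            if elem.tag != NS + "row":
                continue
            values = []
            for cell in elem.iter(NS + "c"):
                v = cell.find(NS + "v")
                if v is None:
                    continue
                if cell.get("t") == "s":   # t="s" marks a shared-string index
                    values.append(shared[int(v.text)])
                else:
                    values.append(v.text)  # numbers and dates stay as strings
            yield values
            elem.clear()  # release the row's children once consumed


def build_demo_workbook():
    """Build a tiny in-memory archive containing just the two parts read above."""
    main = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
    shared = f'<sst xmlns="{main}"><si><t>hello</t></si><si><t>world</t></si></sst>'
    sheet = (
        f'<worksheet xmlns="{main}"><sheetData>'
        '<row r="1"><c r="A1" t="s"><v>0</v></c><c r="B1"><v>3.5</v></c></row>'
        '<row r="2"><c r="A2" t="s"><v>1</v></c><c r="B2"><v>7</v></c></row>'
        "</sheetData></worksheet>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("xl/sharedStrings.xml", shared)
        zf.writestr("xl/worksheets/sheet1.xml", sheet)
    buf.seek(0)
    return buf


with zipfile.ZipFile(build_demo_workbook()) as archive:
    rows = list(iter_rows(archive))
print(rows)
```

To try it on a real file, replace `build_demo_workbook()` with the path to the workbook; the sheet part name may differ from `sheet1.xml` in other workbooks.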

