
Xlrd very slow opening Excel file

I have an Excel file that I think is quite large for an Excel file (200 MB). It has around 20 sheets full of information.

My question is whether it is normal for the following simple action to take almost 5 minutes to execute. I am wondering if I am doing it the correct way.

import xlrd

def processExcel(excelFile):
    excelData = xlrd.open_workbook(excelFile)
    sheets = excelData.sheet_names()
    print(sheets)

As you can see, in the first step I am just trying to get the sheet names, and even that simple thing takes 5 minutes. Is that possible?

Yes, it's absolutely possible. That is indeed a lot of data to be in an Excel file. By default, xlrd loads the entire workbook into memory. If your workbook is a .xls file, you can use the on_demand parameter to only open worksheets as they are needed:

import xlrd

def processExcel(excelFile):
    excelData = xlrd.open_workbook(excelFile, on_demand=True)
    sheets = excelData.sheet_names()
    print(sheets)

If you are trying to open a .xlsx file, the on_demand parameter has no effect.
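To show what "only open worksheets as they are needed" can look like in practice, here is a minimal sketch for a .xls workbook. The helper names sheet_row_counts and process_excel_on_demand are made up for this illustration. The xlrd calls used (sheet_names, sheet_by_name, unload_sheet, release_resources) are part of xlrd's Book API. The sketch assumes you only need something small from each sheet, here the row count.

```python
def sheet_row_counts(book):
    """Return {sheet_name: row_count}, holding one parsed sheet at a time.

    `book` is expected to be an xlrd Book opened with on_demand=True
    (hypothetical helper, not part of xlrd itself).
    """
    counts = {}
    for name in book.sheet_names():
        sheet = book.sheet_by_name(name)  # the sheet is parsed here, not at open time
        counts[name] = sheet.nrows
        book.unload_sheet(name)           # drop the parsed sheet before loading the next
    return counts


def process_excel_on_demand(path):
    """Open a .xls file lazily and collect per-sheet row counts."""
    import xlrd  # third-party; imported here so the helper above has no hard dependency

    book = xlrd.open_workbook(path, on_demand=True)
    try:
        return sheet_row_counts(book)
    finally:
        book.release_resources()  # release the file resources held by the workbook
```

Calling process_excel_on_demand('MyBigFile.xls') should keep at most one sheet's parsed data in memory at a time. How much time this saves depends on the file.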

Update

If you are using Python 3 and reading a .xlsx file, you can try sxl. This is a library that only reads things into memory as needed, so just opening the workbook to retrieve the worksheet names is very quick. Also, if you just need the first few rows of a worksheet, it can get those rather quickly as well.
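To illustrate why retrieving just the sheet names from a .xlsx file can be cheap, here is a standard-library sketch. It is not how sxl is implemented, only an independent illustration. A .xlsx file is a zip archive, and the sheet names are listed in a small XML part, separate from the sheet data. The function name xlsx_sheet_names is made up. The sketch assumes the workbook part is at the conventional path xl/workbook.xml and uses the standard (non-strict) SpreadsheetML namespace.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main SpreadsheetML schema (assumed; "strict" OOXML files differ).
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def xlsx_sheet_names(path):
    """Return the sheet names of a .xlsx file without reading any sheet data."""
    with zipfile.ZipFile(path) as archive:
        # Only the small workbook part is decompressed; the (large) worksheet
        # parts under xl/worksheets/ are never touched.
        with archive.open("xl/workbook.xml") as workbook_xml:
            root = ET.parse(workbook_xml).getroot()
    return [sheet.get("name") for sheet in root.iter("{%s}sheet" % MAIN_NS)]
```

For example, xlsx_sheet_names('MyBigFile.xlsx') returns the names in workbook order, and the cost should not grow with the amount of cell data.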

If you need to read all the data with sxl, you have to iterate over all the rows, which could be even slower than xlrd, but at least it will only use as much memory as you need. For example, the following code keeps only one row in memory at any given time:

from sxl import Workbook

wb = Workbook('MyBigFile.xlsx')
ws = wb.sheets[1]
for row in ws.rows:
    print(row)
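
The same row-at-a-time iteration also lets you stop early when only the first few rows are needed. Here is a small sketch using itertools.islice from the standard library. The helper name first_rows is made up, and it assumes only that the rows object can be iterated lazily, as ws.rows is in the loop above.

```python
from itertools import islice


def first_rows(rows, n):
    """Return at most n rows from any row iterable, without consuming the rest."""
    return list(islice(rows, n))

# Usage with the worksheet from the snippet above (assumes `ws` exists):
#   header_and_sample = first_rows(ws.rows, 5)
```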

However, if you need random access to all the rows to do your processing, you'll have to keep them all in memory:

from sxl import Workbook

wb = Workbook('MyBigFile.xlsx')
ws = wb.sheets[1]
all_rows = list(ws.rows)

In this case, all_rows keeps the entire sheet in memory. If your workbook has multiple sheets, this may still be more efficient than xlrd. But if you need your whole workbook in memory, then you might as well stick to xlrd.


 