I am currently using pyExcelerator to read Excel files, but it is extremely slow. Since I regularly need to open files larger than 100 MB, loading a single file takes me more than twenty minutes.
The functionality I need is:
And the code I am using now is:
import pyExcelerator
from collections import defaultdict
from types import StringType, UnicodeType

book = pyExcelerator.parse_xls(filepath)
parsed_dictionary = defaultdict(lambda: '', book[0][1])
number_of_columns = 44
result_list = []
number_of_rows = 500000
for i in range(0, number_of_rows):
    ok = False
    result_list.append([])
    for h in range(0, number_of_columns):
        item = parsed_dictionary[i,h]
        if type(item) is StringType or type(item) is UnicodeType:
            item = item.replace("\t","").strip()
        result_list[i].append(item)
        if item != '':
            ok = True
    if not ok:  # stop at the first completely empty row
        break
Any suggestions?
pyExcelerator appears not to be maintained. To write xls files, use xlwt, which is a fork of pyExcelerator with bug fixes and many enhancements. The (very basic) xls reading capability of pyExcelerator was eradicated from xlwt. To read xls files, use xlrd.
If it's taking 20 minutes to load a 100MB xls file, you must be using one or more of: a slow computer, a computer with very little available memory, or an older version of Python.
Neither pyExcelerator nor xlrd read password-protected files.
Here's a link that covers xlrd and xlwt.
Disclaimer: I'm the author of xlrd and maintainer of xlwt.
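For reference, here is a minimal sketch (not from the original answer) of reading the same data with xlrd; the file path is a placeholder, the string cleanup mirrors the code in the question, and on_demand=True simply asks xlrd not to load every sheet up front:

import xlrd

filepath = "big_file.xls"  # placeholder path; substitute your own file

# on_demand=True delays loading worksheets until they are requested
book = xlrd.open_workbook(filepath, on_demand=True)
sheet = book.sheet_by_index(0)

result_list = []
for i in range(sheet.nrows):
    row = []
    for h in range(sheet.ncols):
        item = sheet.cell_value(i, h)
        if isinstance(item, basestring):
            item = item.replace("\t", "").strip()
        row.append(item)
    result_list.append(row)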
You could try pre-allocating the list to its full size in a single statement instead of appending one item at a time (one large memory allocation should be faster than many small ones):
book = pyExcelerator.parse_xls(filepath)
parsed_dictionary = defaultdict(lambda: '', book[0][1])
number_of_columns = 44
number_of_rows = 500000
result_list = [[] for _ in range(number_of_rows)]  # pre-allocate the outer list
for i in range(0, number_of_rows):
    ok = False
    #result_list.append([])  # no longer needed; the row lists already exist
    for h in range(0, number_of_columns):
        item = parsed_dictionary[i,h]
        if type(item) is StringType or type(item) is UnicodeType:
            item = item.replace("\t","").strip()
        result_list[i].append(item)
        if item != '':
            ok = True
    if not ok:
        break
If doing this gives an appreciable performance increase, you could also try preallocating each row with the number of columns and then assigning values by index rather than appending one value at a time. Here's a snippet that creates a 10x10 two-dimensional list in a single statement, with every element initialized to 0:
L = [[0] * 10 for i in range(10)]
So folded into your code, it might work something like this:
book = pyExcelerator.parse_xls(filepath)
parsed_dictionary = defaultdict(lambda: '', book[0][1])
number_of_columns = 44
number_of_rows = 500000
# pre-allocate the full rows x columns grid up front
result_list = [[''] * number_of_columns for x in range(number_of_rows)]
for i in range(0, number_of_rows):
    ok = False
    for h in range(0, number_of_columns):
        item = parsed_dictionary[i,h]
        if type(item) is StringType or type(item) is UnicodeType:
            item = item.replace("\t","").strip()
        result_list[i][h] = item  # assign by index instead of appending
        if item != '':
            ok = True
    if not ok:
        break
Unrelated to your question: if you're trying to check that none of the columns is an empty string, set ok = True initially and use ok = ok and item != '' in the inner loop instead. Also, you can just use isinstance(item, basestring) to test whether a variable is a string.
Revised version
for i in range(0, number_of_rows):
    ok = True
    result_list.append([])
    for h in range(0, number_of_columns):
        item = parsed_dictionary[i,h]
        if isinstance(item, basestring):
            item = item.replace("\t","").strip()
        result_list[i].append(item)
        ok = ok and item != ''
    if not ok:
        break
I built a library recently that may be of interest: https://github.com/ktr/sxl. It essentially tries to "stream" Excel files the way Python streams normal files, and is therefore very fast when you only need a subset of the data (especially if it is near the beginning of the file).
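A minimal sketch of how it is meant to be used, based on the project README (the exact API may differ by version; the file path and sheet index are placeholders):

from sxl import Workbook

wb = Workbook("big_file.xlsx")  # placeholder path
ws = wb.sheets[1]               # sheets can be addressed by position or by name
for row in ws.rows:             # rows are yielded lazily rather than loading the whole file
    print(row)                  # stand-in for your own row handling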