Optimize add row in xls file with xlwt

I have a big problem with a large xls file. When my app adds a new stats record (a new row at the end of the file), it takes a very long time (about one minute). If I replace the file with an empty xls file, it is much faster (1-2 seconds). So I'm trying to optimize this if possible.

I use something like:

def add_stats_record():
    # Add record
    lock = LockFile(STATS_FILE)
    with lock:
        # Open for read
        rb = open_workbook(STATS_FILE, formatting_info=True)
        sheet_records = rb.sheet_by_index(0)

        # record_id
        START_ROW = sheet_records.nrows
        try:
            record_id = int(sheet_records.cell(START_ROW - 1, 0).value) + 1
        except:
            record_id = 1

        # Open for write
        wb = copy(rb)
        sheet_records = wb.get_sheet(0)

        # Set normal style
        style_normal = xlwt.XFStyle()
        normal_font = xlwt.Font()
        style_normal.font = normal_font

        # Prepare some data here
        ........................
        # then:

        for i, col in enumerate(SHEET_RECORDS_COLS):
            sheet_records.write(START_ROW, i, possible_values.get(col[0], ''),
                                style_normal)

        wb.save(STATS_FILE)

Do you see anything here to improve? Or can you give me a better idea or example of how to do this?

This is probably not the answer you want to hear, but there is hardly anything to optimize.

import xlwt, xlrd
from xlutils.copy import copy as copy
from time import time

def add_stats_record():
    #Open for read
    start_time = time()
    rb = xlrd.open_workbook(STATS_FILE, formatting_info=True)
    sheet_records_original = rb.sheet_by_index(0)
    print('Elapsed time for opening:            %.2f' % (time()-start_time))
    #Record_id
    start_time = time()
    START_ROW = sheet_records_original.nrows
    SHEET_RECORDS_COLS = sheet_records_original.ncols
    try:
        record_id = int(sheet_records_original.cell(START_ROW - 1, 0).value) + 1
    except:
        record_id = 1
    print('Elapsed time for record ID:          %.2f' % (time()-start_time))
    #Open for write
    start_time = time()
    wb = copy(rb)
    sheet_records = wb.get_sheet(0)
    print('Elapsed time for write:              %.2f' % (time()-start_time))
    #Set normal style
    style_normal = xlwt.XFStyle()
    normal_font = xlwt.Font()
    style_normal.font = normal_font

    #Read all the data and get some stats
    start_time = time()
    max_col = {}
    for col_idx in range(0,16):
        max_value = 0
        for row_idx in range(START_ROW):
            if sheet_records_original.cell(row_idx, col_idx).value:
                val = float(sheet_records_original.cell(row_idx, col_idx).value)
                if val > max_value:
                    max_value = val  # remember the running maximum
                    max_col[col_idx] = str(row_idx) + ';' + str(col_idx)

    text_cells = [[0 for x in range(15)] for y in range(START_ROW)]
    for col_idx in range(16,31):
        max_value = 0
        for row_idx in range(START_ROW):
            if sheet_records_original.cell(row_idx, col_idx).value:
                val = str(sheet_records_original.cell(row_idx, col_idx).value).replace('text', '').count(str(col_idx))
                if val > max_value:
                    max_value = val  # remember the running maximum
                    max_col[col_idx] = str(row_idx) + ';' + str(col_idx)
    print('Elapsed time for reading data/stats: %.2f' % (time()-start_time))
    #Write the stats row
    start_time = time()
    for i in range(SHEET_RECORDS_COLS):
        sheet_records.write(START_ROW, i, max_col[i], style_normal)

    start_time = time()
    wb.save(STATS_FILE)
    print('Elapsed time for writing:            %.2f' % (time()-start_time))    

if __name__ == '__main__':
    STATS_FILE = 'output.xls'
    start_time2 = time()
    add_stats_record()
    print ('Total time:                         %.2f' % (time() - start_time2))

Elapsed time for opening:            2.43
Elapsed time for record ID:          0.00
Elapsed time for write:              7.62
Elapsed time for reading data/stats: 2.35
Elapsed time for writing:            3.33
Total time:                         15.75

These results make it pretty clear that there is hardly any room for improvement in your code. Opening, copying and saving make up the bulk of the time, but they are just simple calls to xlrd/xlwt.

Using on_demand=True in open_workbook doesn't help either.

Using openpyxl doesn't improve performance either.
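The per-stage measurements above repeat the same start_time/print pattern for every step. A minimal sketch of a reusable timer, assuming Python 3 and only the standard library, is shown here; the names timed, timings and report are hypothetical helpers, not part of xlrd or xlwt.

```python
from contextlib import contextmanager
from time import perf_counter

# Hypothetical container: stage name -> elapsed seconds.
timings = {}

@contextmanager
def timed(stage):
    """Record the wall-clock time spent inside the with-block under `stage`."""
    start = perf_counter()
    try:
        yield
    finally:
        timings[stage] = perf_counter() - start

def report():
    """Print one line per recorded stage, in the order the stages ran."""
    for stage, seconds in timings.items():
        print('Elapsed time for %-22s %.2f' % (stage + ':', seconds))
```

Each stage of the benchmark could then be wrapped in its own block, for example `with timed('opening'):` around the open_workbook call and `with timed('copy'):` around copy(rb), followed by a single report() at the end.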

from openpyxl import load_workbook
from time import time

#Load workbook
start_time = time()
wb = load_workbook('output.xlsx')
print('Elapsed time for loading workbook: %.2f' % (time()-start_time))

#Read all data
start_time = time()
ws = wb.active
cell_range1 = ws['A1':'P20001']
cell_range2 = ws['Q1':'AF20001']
print('Elapsed time for reading workbook: %.2f' % (time()-start_time))

#Save to a new workbook
start_time = time()
wb.save("output_tmp.xlsx")
print('Elapsed time for saving workbook:  %.2f' % (time()-start_time))

Elapsed time for loading workbook: 22.35
Elapsed time for reading workbook: 0.00
Elapsed time for saving workbook:  21.11

Test environment: Ubuntu 14.04 (virtual machine), Python 2.7 64-bit, regular hard disk. Native Windows 10 gives similar results; Python 3 performs worse in loading but better in writing.


Random data was generated using Pandas and NumPy:

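import pandas as pd
import numpy as np

STATS_FILE = 'output.xls'  # same file name the benchmark scripts read

#just random numbers
df = pd.DataFrame(np.random.rand(20000,30), columns=range(0,30))
#convert half the columns to text
for i in range(15,30):
    df[i] = 'text' + df[i].astype(str)
#note: writing .xls relies on the xlwt engine and on writer.save(),
#both of which were removed in pandas 2.0
writer = pd.ExcelWriter(STATS_FILE)
df.to_excel(writer,'Sheet1')
writer.save()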

After some fiddling with multiprocessing I found a slightly improved solution. Since the copy operation was the most time-consuming step and sharing a workbook between processes made performance worse, I took a different approach. Both processes read the original workbook: one reads the data, calculates the statistics and writes them to a file (tmp.txt); the other copies the workbook, waits for the statistics file to appear and then writes its contents to the newly copied workbook.

Difference: 12% less time needed in total (n=3 runs for both scripts). Not great, but I cannot think of another way of doing it, except for not using Excel files.

xls_copy.py
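To illustrate the "not using Excel files" option: a minimal sketch, assuming the stats can live in a plain CSV file and that Python 3 is available. The function name append_stats_row is hypothetical. Opening in append mode adds the row at the end without reading or rewriting the existing rows, which is the step that dominates the xls timings above.

```python
import csv

def append_stats_row(path, row):
    """Append one record to a CSV file (hypothetical helper, not from the original code)."""
    # Mode 'a' positions the write at the end of the file, so existing rows
    # are neither parsed nor rewritten. newline='' is what the csv module
    # documentation recommends for files passed to csv.writer.
    with open(path, 'a', newline='') as f:
        csv.writer(f).writerow(row)
```

The record id lookup and the file lock from the question would still be needed around this call; they are left out here to keep the sketch short.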

def xls_copy(STATS_FILE, START_ROW, style_normal):
    from xlutils.copy import copy as copy
    from time import sleep, time
    from os import stat
    from xlrd import open_workbook
    print('started 2nd thread')
    start_time = time()
    rb = open_workbook(STATS_FILE, formatting_info=True)
    wb = copy(rb)
    sheet_records = wb.get_sheet(0)
    print('2: Elapsed time for xls_copy:         %.2f' % (time()-start_time))

    counter = 0
    filesize = stat('tmp.txt').st_size

    while filesize == 0 and counter < 10**5:
        sleep(0.01)
        filesize = stat('tmp.txt').st_size
        counter +=1
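    with open('tmp.txt', 'r') as f:
        for line in f:
            # each line is "<column>;<row>;<column>"; split only on the first ';'
            # so the full "row;col" value is written, as in the single-process script
            col, value = line.rstrip('\n').split(';', 1)
            sheet_records.write(START_ROW, int(col), value, style_normal)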

    start_time = time()
    wb.save('tmp_' + STATS_FILE)
    print('2: Elapsed time for writing:          %.2f' % (time()-start_time))    

xlsx_multi.py

from xls_copy import xls_copy
import xlwt, xlrd
from time import time
from multiprocessing import Process

def add_stats_record():

    #Open for read
    start_time = time()
    rb = xlrd.open_workbook(STATS_FILE, formatting_info=True)
    sheet_records_original = rb.sheet_by_index(0)
    print('Elapsed time for opening:            %.2f' % (time()-start_time))
    #Record_id
    start_time = time()
    START_ROW = sheet_records_original.nrows
    f = open('tmp.txt', 'w')
    f.close()
    #Set normal style
    style_normal = xlwt.XFStyle()
    normal_font = xlwt.Font()
    style_normal.font = normal_font

    #start 2nd thread
    p = Process(target=xls_copy, args=(STATS_FILE, START_ROW, style_normal,))
    p.start()
    print('continuing with 1st thread')
    SHEET_RECORDS_COLS = sheet_records_original.ncols
    try:
        record_id = int(sheet_records_original.cell(START_ROW - 1, 0).value) + 1
    except:
        record_id = 1
    print('Elapsed time for record ID:          %.2f' % (time()-start_time))

    #Read all the data and get some stats
    start_time = time()
    max_col = {}
    for col_idx in range(0,16):
        max_value = 0
        for row_idx in range(START_ROW):
            if sheet_records_original.cell(row_idx, col_idx).value:
                val = float(sheet_records_original.cell(row_idx, col_idx).value)
                if val > max_value:
                    max_value = val  # remember the running maximum
                    max_col[col_idx] = str(row_idx) + ';' + str(col_idx)

    text_cells = [[0 for x in range(15)] for y in range(START_ROW)]
    for col_idx in range(16,31):
        max_value = 0
        for row_idx in range(START_ROW):
            if sheet_records_original.cell(row_idx, col_idx).value:
                val = str(sheet_records_original.cell(row_idx, col_idx).value).replace('text', '').count(str(col_idx))
                if val > max_value:
                    max_value = val  # remember the running maximum
                    max_col[col_idx] = str(row_idx) + ';' + str(col_idx)
    #write statistics to a temp file
    with open('tmp.txt', 'w') as f:
        for k in max_col:
            f.write(str(k) + ';' + max_col[k] + str('\n'))
    print('Elapsed time for reading data/stats: %.2f' % (time()-start_time))
    p.join()
if __name__ == '__main__':

    done = False
    wb = None
    STATS_FILE = 'output.xls'
    start_time2 = time()
    add_stats_record()
    print ('Total time:                          %.2f' % (time() - start_time2))
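
One caveat with the tmp.txt hand-off in the two scripts above: the copying process starts reading as soon as the file size is non-zero, so it could in principle see a partially written file. A possible variation, sketched here under the assumption of Python 3.3+ (for os.replace) and with hypothetical helper names publish_stats and wait_for_stats, writes to a side file first and renames it into place. It also swaps the semicolon-separated lines for JSON, which is a change from the original format.

```python
import json
import os
from time import sleep, time

def publish_stats(path, stats):
    """Write stats to a side file, then rename it so readers never see a partial file."""
    part_path = path + '.part'
    with open(part_path, 'w') as f:
        json.dump(stats, f)
    # os.replace renames in one step; on POSIX this is atomic when both
    # paths are on the same filesystem.
    os.replace(part_path, path)

def wait_for_stats(path, timeout=60.0, poll=0.01):
    """Poll until `path` exists, then return its content as {column index: value}."""
    deadline = time() + timeout
    while not os.path.exists(path):
        if time() > deadline:
            raise TimeoutError('no statistics file after %.1f s' % timeout)
        sleep(poll)
    with open(path) as f:
        # JSON object keys are always strings, so convert them back to int.
        return {int(k): v for k, v in json.load(f).items()}
```

With this variant the main script would not create an empty tmp.txt up front; instead any stale tmp.txt from a previous run has to be deleted before the second process is started.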
