
Writing an Excel file from a .h5 file: performance

I am saving some data from a .h5 file to an excel file.

I am using openpyxl for this. I may not be doing it the best way, but it seems to take too long for a fairly small .h5 file.

Do you have any recommendations?

I am currently looking at XlsxWriter, but is it really good enough?

Here is the simple code I am using:

from openpyxl import Workbook
from tables import *
import os
import time

def saveExcel(pyTableName):

    t1 = time.time()

    wb_write = Workbook()
    wsh_write = wb_write.active

    r = 2
    with openFile(pyTableName, 'r') as f:
        tab = f.getNode('/absoluteData')
        for row in tab.iterrows():
            wsh_write.cell(row=r, column=1).value = row['sheet']
            wsh_write.cell(row=r, column=2).value = str(row['IDnum'])+','+str(row['name'])
            wsh_write.cell(row=r, column=3).value = row['line']
            wsh_write.cell(row=r, column=4).value = row['is_1']
            wsh_write.cell(row=r, column=5).value = row['is_0']
            wsh_write.cell(row=r, column=6).value = row['is_unknown']
            wsh_write.cell(row=r, column=7).value = row['is_ok']
            r+=1

        wb_write.save(os.path.join(os.getcwd(),'Results.xlsx'))
        print "SAVED in: ", time.time() - t1

And some performance data after running this code:

For a pyTable with 235200 rows x 17 cols it needed 152.976000071 secs

Both openpyxl and xlsxwriter are suitable for the task. xlsxwriter is probably the fastest for just writing files, but openpyxl also has a write_only mode for this kind of task, which is very fast if you also have lxml installed. If you don't have lxml installed yet, installing it should give a considerable speedup.

There are several limiting factors:

  • converting from the source objects to Python to XML (in this case probably h5, numpy, Python and XML)
  • the fact that xlsx doesn't support streaming

In openpyxl we've tried to simplify the API so that you can simply append rows to a worksheet without worrying too much about coordinates.

Your modified code might look something like this:

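wb = Workbook(write_only=True)
ws = wb.create_sheet("Sheet1")
for row in tab.iterrows():
    # In write-only mode append() takes a list or tuple, one value per column
    ws.append([row['sheet'], '%s,%s' % (row['IDnum'], row['name']), row['line']])
wb.save('Results.xlsx')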
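For reference, here is a fuller sketch of the write-only approach that runs on its own. It assumes openpyxl is installed. The helper name `save_rows_write_only` and the sample data are hypothetical: a plain list of dicts stands in for `tab.iterrows()`, and the column names are taken from the question.

```python
import os
import tempfile

from openpyxl import Workbook


def save_rows_write_only(rows, path):
    """Stream rows into an .xlsx file using openpyxl's write-only mode."""
    wb = Workbook(write_only=True)
    ws = wb.create_sheet("Sheet1")
    count = 0
    for row in rows:
        # append() takes one list per worksheet row; no cell coordinates needed.
        ws.append([
            row['sheet'],
            '%s,%s' % (row['IDnum'], row['name']),
            row['line'],
            row['is_1'],
            row['is_0'],
            row['is_unknown'],
            row['is_ok'],
        ])
        count += 1
    wb.save(path)
    return count


# Hypothetical sample data standing in for the rows of the PyTables table.
sample_rows = [
    {'sheet': 1, 'IDnum': 10 + i, 'name': 'n%d' % i, 'line': i,
     'is_1': 1, 'is_0': 0, 'is_unknown': 0, 'is_ok': 1}
    for i in range(5)
]
out_path = os.path.join(tempfile.mkdtemp(), 'Results.xlsx')
written = save_rows_write_only(sample_rows, out_path)
```

With real PyTables rows under Python 3, string columns will probably come back as bytes and may need decoding before being appended.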

If you do wish to follow the CSV route then it's probably best to use h5dump and define a data source in Excel which might also allow you to choose the columns the way you want.

You can simply write to CSV and load that into Excel. Here's the rough code:

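import os
import numpy as np

with openFile(pyTableName, 'r') as f:
    tab = f.getNode('/absoluteData')
    outpath = os.path.join(os.getcwd(), 'Results.csv')
    # read() loads the table into a NumPy structured array; mixed column types may also need a fmt argument
    np.savetxt(outpath, tab.read(), delimiter=',')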

That is, you should be able to write the entire CSV using NumPy (or Pandas if you want fancier options, perhaps), without any slow Python loops.
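As a minimal, self-contained sketch of that idea, the snippet below writes a structured array to CSV with a single np.savetxt call. The array is hypothetical sample data standing in for what reading the table might return, and it contains only integer columns; string columns would need their own format specifiers.

```python
import os
import tempfile

import numpy as np

# Hypothetical structured array standing in for the table contents.
data = np.array(
    [(1, 101, 7, 1, 0), (1, 102, 8, 0, 1), (2, 103, 9, 1, 0)],
    dtype=[('sheet', 'i4'), ('IDnum', 'i4'), ('line', 'i4'),
           ('is_1', 'i4'), ('is_0', 'i4')],
)

out_path = os.path.join(tempfile.mkdtemp(), 'Results.csv')
np.savetxt(
    out_path,
    data,
    delimiter=',',
    fmt=['%d'] * len(data.dtype.names),  # one format specifier per field
    header=','.join(data.dtype.names),   # column names as the first line
    comments='',                         # suppress the default '# ' header prefix
)

# Read the file back to inspect the result.
with open(out_path) as fh:
    lines = fh.read().splitlines()
```

For a structured array, savetxt needs one format specifier per field, which is why fmt is given as a list here.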

You can also consider Pandas' to_excel method: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_excel.html
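A rough sketch of the Pandas route follows, assuming pandas plus an Excel writer engine such as openpyxl are installed. The structured array is hypothetical sample data in place of the real table, and the `id_name` column name is made up to mirror the combined IDnum/name column from the question.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical structured array standing in for the table contents.
records = np.array(
    [(1, 101, 'a', 7), (1, 102, 'b', 8), (2, 103, 'c', 9)],
    dtype=[('sheet', 'i4'), ('IDnum', 'i4'), ('name', 'U8'), ('line', 'i4')],
)

df = pd.DataFrame.from_records(records)
# Build the combined "IDnum,name" column in one vectorised step.
df['id_name'] = df['IDnum'].astype(str) + ',' + df['name']

out_path = os.path.join(tempfile.mkdtemp(), 'Results.xlsx')
# index=False keeps the DataFrame index out of the spreadsheet.
df[['sheet', 'id_name', 'line']].to_excel(out_path, index=False)
```

Whether this is faster than writing the rows directly depends on the engine pandas uses underneath, so it is worth timing on the real data.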
