
How to import multiple Excel sheets into pandas with multiprocessing?

I am trying to use multiprocessing on a 12-core machine to read an Excel file – a 60MB file with 15 sheets of 10,000 rows each. Importing all the sheets with pandas.read_excel and no parallelisation still takes about 33 seconds.

If I use pool.map() it works, but it takes longer than the non-parallel version: 150 seconds vs 33!

If I use pool.map_async() it takes 36 seconds, but I can't seem to access (and therefore cannot check) the output!

My questions are:

  • What am I doing wrong? Both pool.map and pool.map_async take roughly the same time even if I set nrows=10 in the read_single_sheet function; it takes the same time whether it reads 10 rows or 10,000 – how is that possible?
  • How do I get the results of pool.map_async()? I have tried output = [p.get() for p in dataframes], but it doesn't work:

MapResult object is not iterable

  • Is this more of an IO-bound than a CPU-bound problem? Still, why does pool.map take so long?

Reading the same data from CSV (each Excel sheet saved to a separate CSV) takes 2 seconds on my machine. However, CSV is not really a good option for what I need to do. I often have 10 to 20 mid-sized tabs; converting them manually may often take longer than waiting for pandas to read them, plus if I receive updated versions I have to do the manual conversion again.

I know I could use a VBA script in Excel to automatically save each sheet to CSV, but data types are most often inferred correctly when reading from Excel – not so with CSV, especially for dates (mine are never ISO yyyy-mm-dd): I'd have to identify the date fields, specify the format, etc. – just reading from Excel would often be faster. Especially because these tasks tend to be one-offs: I import the data once, maybe two or three times if I receive an update, store it in SQL, and then all my Python scripts read from SQL.
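For illustration, the extra per-file work that CSV would entail might look like this (the 'date' column name and day-first format are hypothetical):

df = pd.read_csv('sheet1.csv', parse_dates=['date'], dayfirst=True)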

The code I am using to read the file is:

import numpy as np
import pandas as pd
import time
import multiprocessing
from multiprocessing import Pool

def parallel_read():
    pool = Pool(num_cores)
    # reads 1 row only, to retrieve column names and sheet names
    mydic = pd.read_excel('excel_write_example.xlsx', nrows=1, sheet_name=None)
    sheets = []
    for d in mydic:
        sheets.append(d)
    dataframes = pool.map(read_single_sheet, sheets)
    return dataframes

def parallel_read_async():
    pool = Pool(num_cores)
    # reads 1 row only, to retrieve column names and sheet names
    mydic = pd.read_excel('excel_write_example.xlsx', nrows=1, sheet_name=None)
    sheets = []
    for d in mydic:
        sheets.append(d)
    dataframes = pool.map_async(read_single_sheet, sheets)
    output = None
    # this below doesn't work - can't understand why
    output = [p.get() for p in dataframes]
    return output

def read_single_sheet(sheet):
    out = pd.read_excel('excel_write_example.xlsx', sheet_name=sheet)
    return out

num_cores = multiprocessing.cpu_count() 

if __name__ == '__main__':
    start = time.time()
    out_p = parallel_read()
    time_par = time.time() - start

    out_as = parallel_read_async()
    time_as = time.time() - start - time_par

The code I used to create the Excel file is:

import numpy as np
import pandas as pd

sheets = 15
rows = int(10e3)

writer = pd.ExcelWriter('excel_write_example.xlsx')

def create_data(sheets, rows):
    df = {} # dictionary of dataframes
    for i in range(sheets):
        df[i] = pd.DataFrame(data=np.random.rand(rows, 30))
        df[i]['a'] = 'some long random text'
        df[i]['b'] = 'some more random text'
        df[i]['c'] = 'yet more text'
    return df

def data_to_excel(df, writer):
    for d in df:
        df[d].to_excel(writer, sheet_name=str(d), index=False)
    writer.close()

df = create_data(sheets, rows)
data_to_excel(df, writer)

I am posting this as an answer because, while it doesn't answer the question of how to do it in Python, it still provides a feasible alternative that speeds up the reading time materially, so it may be of interest to any Python user; additionally, it relies only on open-source software, and requires the user to learn only a couple of commands in R.

My solution is: do it in R!

I posted about it here, which also shows my (very minimal) code; basically, on the same file, R's readxl took 5.6 seconds. To recap:

  • Python from xlsx: 33 seconds
  • Python from CSV: ca. 2 seconds
  • R from xlsx: 5.6 seconds

The link also has an answer which shows that parallelising can speed up the process even more.

I believe the key difference is that pandas.read_csv relies on C code, while pandas.read_excel relies on more Python code. R's readxl is probably based on C. It might be possible to use a C parser to import xlsx files into Python, but AFAIK no such parser is available as of now.

It is a feasible solution because, after importing into R, you can easily export to a format which retains all the information on data types and which Python can read from (SQL, parquet, etc.). Not everyone will have a SQL server available, but formats like parquet or sqlite don't require any additional software.
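For example, the Python side of that round trip is a one-liner; a minimal sketch, assuming R wrote a file called mydata.parquet (pd.read_parquet needs pyarrow or fastparquet installed):

import pandas as pd

df = pd.read_parquet('mydata.parquet')  # dtypes, including dates, survive the trip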

So the changes to the workflow are minimal: the initial data loading, which, at least in my case, tends to be a one-off, happens in R, and everything else continues to be in Python.

I also noticed that exporting the same sheets to SQL is much faster with R and DBI::dbWriteTable than with pandas (4.25 sec vs 18.4 sec).
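For reference, the pandas side of that comparison is DataFrame.to_sql; a minimal sketch, assuming a SQLite target and a dict mapping sheet names to DataFrames (both hypothetical here):

import sqlalchemy

engine = sqlalchemy.create_engine('sqlite:///mydata.db')
for name, df in dataframes.items():
    df.to_sql(name, engine, if_exists='replace', index=False)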

A couple of things are going on here:

  • The 36 seconds that parallel_read_async seems to take is in fact entirely taken up by the call to pd.read_excel('excel_write_example.xlsx', nrows=1, sheet_name=None). map_async returns immediately, handing you the MapResult object, and you immediately cause an exception by trying to iterate over it – so in this version essentially no work is done by the read_single_sheet function (see the sketch after this list for how to collect the results).
  • Furthermore, pd.read_excel with sheet_name=None takes exactly as long as with sheet_name='1' etc., so in your parallel_read function each process is doing the work of parsing every row of every sheet. That is why it takes so much longer.
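To answer the second bullet: pool.map_async returns a single MapResult object; calling .get() on it once blocks until the workers finish and returns the whole list of results. A minimal sketch against the code above:

result = pool.map_async(read_single_sheet, sheets)
# MapResult is not iterable - call .get() once; it blocks until all
# workers are done and returns the list of DataFrames, in input order
output = result.get()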

And now that I've written that out, I remember that my company ran into this same problem, and we ended up implementing our own xlsx parser because of it. There's simply no way with xlrd - which pandas uses - to open an xlsx file without parsing it completely.
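To give a flavour of why a purpose-built parser can win: an xlsx file is just a zip archive, so, for example, the sheet names can be listed without parsing any cell data. A rough sketch, not a robust parser (names containing XML escapes would need unescaping):

import re
import zipfile

# sheet names live in xl/workbook.xml as <sheet name="..."/> entries
with zipfile.ZipFile('excel_write_example.xlsx') as z:
    workbook_xml = z.read('xl/workbook.xml').decode('utf-8')
sheet_names = re.findall(r'<sheet[^>]*\sname="([^"]*)"', workbook_xml)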

If you have the option to produce (or receive?) xls files instead, those should be much quicker to work with. Besides that, the export-to-csv option may be your best bet if the speed of the non-parallel processing is unacceptable.

Here is an outline of how you can bypass the file lock and achieve concurrency with few changes to your code:

import io
import time
import multiprocessing
from functools import partial
from multiprocessing import Pool

import pandas as pd
import xlrd

def read_sheet(data, sheetname):
    # each worker wraps the raw bytes in its own in-memory buffer,
    # so no process ever touches the file on disk
    return pd.read_excel(io.BytesIO(data), sheet_name=sheetname)

if __name__ == '__main__':
    start = time.time()
    with open("myfile.xls", "rb") as f:  # you fill in this
        data = f.read()  # read the whole file into memory once
    sheetnames = xlrd.open_workbook(file_contents=data).sheet_names()
    target = partial(read_sheet, data)
    num_processes = multiprocessing.cpu_count()
    with Pool(num_processes) as p:
        dfs = p.map(target, sheetnames)
    print('elapsed:', time.time() - start)
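Note that functools.partial bakes the in-memory bytes into the worker target, so multiprocessing pickles the file contents out to the workers instead of having every process contend for the file on disk; for very large files that copying has a cost of its own, but it keeps the workers fully independent.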
