
Fastest way to read .xlsx file with Python

I'm trying to read data from a .xlsx file into a MySQL database using Python.

Here's my code:

import openpyxl
import MySQLdb

wb = openpyxl.load_workbook(filename="file", read_only=True)
ws = wb['My Worksheet']

conn = MySQLdb.connect()
cursor = conn.cursor()

cursor.execute("SET autocommit = 0")

for row in ws.iter_rows(row_offset=1):   # skip the header row
    sql_row = ...                        # build the tuple of values I need from the row
    cursor.execute("INSERT sql_row")     # insert that row into the table

conn.commit()

Unfortunately, openpyxl's ws.iter_rows() is painfully slow. I've tried similar methods using the xlrd and pandas modules. Still slow. Any thoughts?

You really need to benchmark your code and provide information about the size of the worksheet and the time taken to process it.
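A minimal timing sketch (my own illustration, not from the original question) to isolate where the time actually goes, assuming the wb and ws objects from the question:

import time

start = time.perf_counter()
rows = list(ws.iter_rows(values_only=True))   # just the read, no SQL work
print(f"read {len(rows)} rows in {time.perf_counter() - start:.2f}s")

Timing the read and the insert separately tells you which side is worth optimising.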

openpyxl's read-only mode is essentially a memory optimisation which avoids loading the whole worksheet into memory. When it comes to parsing Excel worksheets, most of the work involved is converting XML to Python, and there are limits to this.

However, two optimisations do spring to mind:

  • keep your SQL statement outside the loop
  • use executemany to pass lots of rows at once to the driver

These can be combined in something like:

INSERT_SQL = "INSERT INTO mytable (name, age…) VALUES (%s, %s, …)"
c.executemany(INSERT_SQL, ws.values)   # c is the cursor; ws.values yields each row as a tuple

If you only want a subset of the rows then look at using itertools.islice.
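A minimal sketch of that, assuming you want to skip the header row and load only the next 10,000 data rows (the limits here are made up for illustration):

from itertools import islice

rows = islice(ws.values, 1, 10001)   # skip the header, take the next 10,000 rows
c.executemany(INSERT_SQL, rows)      # INSERT_SQL and c as in the snippet above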

This should be faster than your current code but you shouldn't expect miracles.

When it comes to pure performance, xlrd is a little faster than openpyxl when reading worksheets because it has a smaller memory footprint, largely related to being a read-only library. But it always loads a whole workbook into memory, which might not be what you want.
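For reference, a rough xlrd sketch of the same read (an assumption on my part: xlrd 1.x, since xlrd 2.0 dropped .xlsx support):

import xlrd

book = xlrd.open_workbook("file")            # the whole workbook is loaded into memory here
sheet = book.sheet_by_name("My Worksheet")
rows = (sheet.row_values(i) for i in range(1, sheet.nrows))   # skip the header row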

For reading, try http://github.com/AndyStricker/FastXLSX. It claims to use expat for event-based parsing and a stream-based zip reader, so only the shared string table has to be kept in memory.
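To illustrate the idea (a conceptual sketch, not FastXLSX's actual API): an .xlsx file is a zip of XML parts, so a sheet can be streamed and parsed event by event while only the shared-string table stays resident:

import zipfile
import xml.etree.ElementTree as ET

NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

with zipfile.ZipFile("file.xlsx") as zf:
    with zf.open("xl/worksheets/sheet1.xml") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == NS + "row":
                # for string cells these values are indices into xl/sharedStrings.xml,
                # which you would load separately and look up here
                values = [c.findtext(NS + "v") for c in elem.findall(NS + "c")]
                # ... handle one row ...
                elem.clear()   # discard the parsed element so memory use stays flat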

If it is still too slow, you can try compiling it with Nuitka. I used to get a 25% speed boost from a Nuitka-compiled library.

For writing, try http://github.com/kz26/PyExecelerate

For MySQL, try the Cython-enabled CyMySQL (http://github.com/nakagami/CyMySQL) for bulk insertion; in my experience it noticeably boosted insertion speed compared to pymysql. The difference I saw only applies to massive bulk insertion in a tight loop. Try different, larger batch sizes to get the best speed possible.
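A batching sketch along those lines, assuming CyMySQL's pymysql-compatible DB-API (the connection parameters and batch size are placeholders):

import cymysql
from itertools import islice

conn = cymysql.connect(host="localhost", user="me", passwd="secret", db="mydb")
cur = conn.cursor()

def batches(rows, size=5000):
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

for chunk in batches(ws.values, size=5000):   # tune the batch size for your data
    cur.executemany(INSERT_SQL, chunk)
conn.commit()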

There is a Rust library, calamine, with Python bindings for it (python-calamine). It gives a 10x-20x speedup on reads.

from python_calamine import get_sheet_data

recs: list[list] = get_sheet_data("myfile.xlsx", sheet=0)

If you want to turn it into a pd.DataFrame:

import pandas as pd

df = pd.DataFrame.from_records(recs)
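As a side note (my addition, not from the answer above): recent pandas versions can use calamine directly as a read_excel engine, which skips the intermediate list of records; this assumes pandas >= 2.2 with python-calamine installed:

df = pd.read_excel("myfile.xlsx", sheet_name=0, engine="calamine")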


 