How to improve reading and transforming 300 XML files with Python
I have 300 XML files in my folder.
Each one gives me ~1,657 rows when I transform it to a dataframe.
The code below is taking too much time.
Using R, I did this in ~200 seconds.
The function function_from_xml_pddataframe(xmlfile) generates the df_xml dataframe.
What am I doing wrong?
How can I improve this process?
import os
all_dfs = pd.DataFrame()
for file in tqdm("/data"):
    if file.endswith(".xml"):
        function_from_xml_pddataframe(xmlfile)
        df_created = df_xml
        list_of_dataframes.append(df_created)
all_dfs = pd.concat(list_of_dataframes)
Without a reproducible example it's hard to be sure. But even replacing your file loop and function_from_xml_pddataframe, your code won't run as it is: list_of_dataframes needs to be defined before you can append to it. If you were to define it as list_of_dataframes = pd.DataFrame(), your code would run slowly and then error at the last line, all_dfs = pd.concat(list_of_dataframes). Assuming you wanted something like this (which runs slowly, then errors at the end):
import pandas as pd
from scipy.stats import uniform

list_of_dfs = pd.DataFrame()
for i in range(10):
    list_of_dfs.append(pd.DataFrame({'a': uniform.rvs(0, 1, 1000), 'b': uniform.rvs(0, 1, 1000)}))
# this line will cause an error
pd.concat(list_of_dfs)
It looks like you're setting up all_dfs (or list_of_dataframes, as it should be) as a pandas dataframe. I think what you're trying to do here is set it up as a list. Try changing line 2 to list_of_dataframes = []:
import pandas as pd
from scipy.stats import uniform

list_of_dataframes = []
for i in range(10):
    list_of_dataframes.append(pd.DataFrame({'a': uniform.rvs(0, 1, 1000), 'b': uniform.rvs(0, 1, 1000)}))
pd.concat(list_of_dataframes)
The second version is ~10x faster.
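The exact speedup will vary with pandas version and data size. Here's a minimal, self-contained benchmark of the two patterns; it uses pd.concat inside the loop to simulate growing a dataframe row-block by row-block, since DataFrame.append was removed in pandas 2.0:

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
chunks = [pd.DataFrame({"a": rng.random(1000), "b": rng.random(1000)})
          for _ in range(100)]

# Slow pattern: grow one dataframe inside the loop.
# Every iteration copies all rows accumulated so far (quadratic copying).
start = time.perf_counter()
grown = chunks[0].copy()
for chunk in chunks[1:]:
    grown = pd.concat([grown, chunk], ignore_index=True)
slow = time.perf_counter() - start

# Fast pattern: keep the dataframes in a list, concatenate once at the end.
start = time.perf_counter()
combined = pd.concat(chunks, ignore_index=True)
fast = time.perf_counter() - start

print(f"loop concat: {slow:.4f}s, single concat: {fast:.4f}s")
```

Both patterns produce the same combined dataframe; only the amount of copying differs.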
TL;DR
Make sure you store your separate pandas dataframes in a list; appending to a dataframe is slow.
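Applied to the original question, the loop might look like the sketch below. It's self-contained: it first writes a few toy XML files to a temporary folder, and xml_to_dataframe is a hypothetical stand-in for function_from_xml_pddataframe, assumed here to *return* a dataframe rather than set a global df_xml:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

import pandas as pd

# Create a few toy XML files so the sketch runs on its own
# (the question has 300 real files in "/data").
tmpdir = tempfile.mkdtemp()
for i in range(3):
    root = ET.Element("rows")
    for j in range(5):
        row = ET.SubElement(root, "row")
        ET.SubElement(row, "a").text = str(i)
        ET.SubElement(row, "b").text = str(j)
    ET.ElementTree(root).write(os.path.join(tmpdir, f"file_{i}.xml"))

def xml_to_dataframe(path):
    """Hypothetical stand-in for function_from_xml_pddataframe:
    parse one XML file and return a dataframe."""
    root = ET.parse(path).getroot()
    records = [{child.tag: child.text for child in row} for row in root]
    return pd.DataFrame(records)

# The corrected pattern: collect frames in a list, concatenate once.
list_of_dataframes = []
for file in os.listdir(tmpdir):   # iterate over the directory listing --
    if file.endswith(".xml"):     # note tqdm("/data") iterates the *string*
        list_of_dataframes.append(xml_to_dataframe(os.path.join(tmpdir, file)))

all_dfs = pd.concat(list_of_dataframes, ignore_index=True)
print(all_dfs.shape)  # 3 files x 5 rows each -> (15, 2)
```

One more thing to check in the original code: for file in tqdm("/data") iterates over the characters of the string "/data", not over the files in that folder; you need os.listdir (or pathlib.Path.glob) as above.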