How to improve reading and transforming 300 XML files with Python
I have 300 XML files in my folder.
Each one gives me ~1,657 rows when I transform it to a dataframe.
The code below is taking too much time.
Using R, I did this in ~200 seconds.
The function function_from_xml_pddataframe(xmlfile) generates the df_xml dataframe.
What am I doing wrong?
How can I improve this process?
import os
all_dfs = pd.DataFrame()
for file in tqdm("/data"):
    if file.endswith(".xml"):
        function_from_xml_pddataframe(xmlfile)
        df_created = df_xml
        list_of_dataframes.append(df_created)
all_dfs = pd.concat(list_of_dataframes)
Without a reproducible example it's hard to be sure. But even replacing your file loop and function_from_xml_pddataframe, your code won't run as it is: list_of_dataframes needs to be defined before you can append to it. If you were to define it as list_of_dataframes = pd.DataFrame(), your code would run slowly and then error at the last line, all_dfs = pd.concat(list_of_dataframes). Assuming you wanted something like this (which runs slowly, then errors at the end):
import pandas as pd
from scipy.stats import uniform

list_of_dfs = pd.DataFrame()
for i in range(10):
    list_of_dfs.append(pd.DataFrame({'a': uniform.rvs(0, 1, 1000), 'b': uniform.rvs(0, 1, 1000)}))
# this line will cause an error
pd.concat(list_of_dfs)
It looks like you're setting up all_dfs (or list_of_dataframes, as it should be) as a pandas dataframe. I think what you're trying to do here is set it up as a list. Try changing line 2 to list_of_dataframes = []:
import pandas as pd
from scipy.stats import uniform

list_of_dataframes = []
for i in range(10):
    list_of_dataframes.append(pd.DataFrame({'a': uniform.rvs(0, 1, 1000), 'b': uniform.rvs(0, 1, 1000)}))
pd.concat(list_of_dataframes)
The second version is ~10x faster.
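The exact speedup will vary with pandas version and data size. Here's a minimal, self-contained benchmark of the two patterns; it uses pd.concat inside the loop to simulate growing a dataframe row-block by row-block, since DataFrame.append was removed in pandas 2.0:

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
chunks = [pd.DataFrame({"a": rng.random(1000), "b": rng.random(1000)})
          for _ in range(100)]

# Slow pattern: grow one dataframe inside the loop.
# Every iteration copies all rows accumulated so far (quadratic copying).
start = time.perf_counter()
grown = chunks[0].copy()
for chunk in chunks[1:]:
    grown = pd.concat([grown, chunk], ignore_index=True)
slow = time.perf_counter() - start

# Fast pattern: keep the dataframes in a list, concatenate once at the end.
start = time.perf_counter()
combined = pd.concat(chunks, ignore_index=True)
fast = time.perf_counter() - start

print(f"loop concat: {slow:.4f}s, single concat: {fast:.4f}s")
```

Both patterns produce the same combined dataframe; only the amount of copying differs.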
TL;DR
Make sure you store your separate pandas dataframes in a list; appending to a dataframe is slow.
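Applied to the original question, the loop might look like the sketch below. It's self-contained: it first writes a few toy XML files to a temporary folder, and xml_to_dataframe is a hypothetical stand-in for function_from_xml_pddataframe, assumed here to *return* a dataframe rather than set a global df_xml:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

import pandas as pd

# Create a few toy XML files so the sketch runs on its own
# (the question has 300 real files in "/data").
tmpdir = tempfile.mkdtemp()
for i in range(3):
    root = ET.Element("rows")
    for j in range(5):
        row = ET.SubElement(root, "row")
        ET.SubElement(row, "a").text = str(i)
        ET.SubElement(row, "b").text = str(j)
    ET.ElementTree(root).write(os.path.join(tmpdir, f"file_{i}.xml"))

def xml_to_dataframe(path):
    """Hypothetical stand-in for function_from_xml_pddataframe:
    parse one XML file and return a dataframe."""
    root = ET.parse(path).getroot()
    records = [{child.tag: child.text for child in row} for row in root]
    return pd.DataFrame(records)

# The corrected pattern: collect frames in a list, concatenate once.
list_of_dataframes = []
for file in os.listdir(tmpdir):   # iterate over the directory listing --
    if file.endswith(".xml"):     # note tqdm("/data") iterates the *string*
        list_of_dataframes.append(xml_to_dataframe(os.path.join(tmpdir, file)))

all_dfs = pd.concat(list_of_dataframes, ignore_index=True)
print(all_dfs.shape)  # 3 files x 5 rows each -> (15, 2)
```

One more thing to check in the original code: for file in tqdm("/data") iterates over the characters of the string "/data", not over the files in that folder; you need os.listdir (or pathlib.Path.glob) as above.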