
How to improve reading and transform 300 xml files with python

I have 300 XML files in a folder.

Each one gives me ~1,657 rows when I transform it to a dataframe.

The code below is taking too much time.

Using R, I did the same task in ~200 seconds.

The function function_from_xml_pddataframe(xmlfile) generates the df_xml dataframe.

What am I doing wrong?

How can I improve this process?

import os
all_dfs = pd.DataFrame()

for file in tqdm("/data"):
    if file.endswith(".xml"):
        function_from_xml_pddataframe(xmlfile)
        df_created = df_xml
        list_of_dataframes.append(df_created)

all_dfs = pd.concat(list_of_dataframes)

Without a reproducible example it's hard to be sure. Even after replacing your file loop and function_from_xml_pddataframe, your code won't run as it is:

  • list_of_dataframes needs to be defined before you can append to it.
  • If you meant line 2 to read list_of_dataframes = pd.DataFrame(), your code would run slowly and then error at the last line, all_dfs = pd.concat(list_of_dataframes).

Assuming you wanted something like this (which is slow, but runs):

import pandas as pd
from scipy.stats import uniform

# note: DataFrame.append was removed in pandas 2.0, so running this
# demo as written requires pandas < 2.0
list_of_dfs = pd.DataFrame()

for i in range(10):
    list_of_dfs.append(pd.DataFrame({'a': uniform.rvs(0, 1, 1000), 'b': uniform.rvs(0, 1, 1000)}))

# this line will cause an error
pd.concat(list_of_dfs)

It looks like you're setting up all_dfs (or list_of_dataframes, as it should be) as a pandas dataframe, when what you're actually trying to do is set it up as a list. Try changing line 2 to list_of_dataframes = []:

import pandas as pd
from scipy.stats import uniform

list_of_dataframes = []
for i in range(10):
    list_of_dataframes.append(pd.DataFrame({'a': uniform.rvs(0, 1, 1000), 'b': uniform.rvs(0, 1, 1000)}))

pd.concat(list_of_dataframes)

The second version is ~10x faster.
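The general point — growing a dataframe piece by piece is much slower than collecting pieces in a list and concatenating once — can be checked with a quick timeit sketch. This is a minimal sketch, not the original benchmark: since DataFrame.append is gone in pandas 2.0, the slow variant below uses pd.concat inside the loop as the growth idiom, and the sizes are arbitrary, chosen only to keep the comparison fast.

```python
import timeit

import pandas as pd


def grow_dataframe(n: int) -> pd.DataFrame:
    # Anti-pattern: concatenating inside the loop copies all
    # accumulated rows on every iteration.
    df = pd.DataFrame()
    for i in range(n):
        df = pd.concat([df, pd.DataFrame({"a": [i]})], ignore_index=True)
    return df


def concat_once(n: int) -> pd.DataFrame:
    # Recommended: collect the pieces in a list, concatenate a single time.
    pieces = [pd.DataFrame({"a": [i]}) for i in range(n)]
    return pd.concat(pieces, ignore_index=True)


slow = timeit.timeit(lambda: grow_dataframe(200), number=3)
fast = timeit.timeit(lambda: concat_once(200), number=3)
```

Both functions produce identical dataframes; only the cost profile differs, and the gap widens as n grows because the loop version does O(n^2) row copies.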

TL;DR
Make sure you store your separate pandas dataframes in a list — appending to a dataframe is slow.
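Applied to the original 300-file task, the list-then-concat pattern might look like the sketch below. This is an assumption-laden sketch, not the asker's code: pandas.read_xml (pandas >= 1.3) stands in for the unknown function_from_xml_pddataframe, and it only works if each file is a flat, row-oriented XML document.

```python
from pathlib import Path

import pandas as pd


def load_xml_folder(folder: str) -> pd.DataFrame:
    """Parse every .xml file in `folder` and concatenate the results once."""
    frames = [
        # parser="etree" uses the stdlib parser, avoiding the lxml dependency
        pd.read_xml(path, parser="etree")
        for path in sorted(Path(folder).glob("*.xml"))
    ]
    return pd.concat(frames, ignore_index=True)
```

If parsing itself is the bottleneck rather than the dataframe accumulation, profiling one file's parse time would tell you whether the fix above is enough.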
