Python Pandas: appending DataFrames

Question

I have a case where I am adding UUID columns to .csv files. At the same time, I am checking the source files and comparing them to the processed ones: if a source file contains additional lines, I plan to append those new lines to the destination file. I want to append rather than overwrite the file because the UUIDs of previously processed lines must stay the same.

So, for the append case, I check whether the row count is the same for the source and destination files. If it is not, I create a new dataframe with the data from the source file, starting at the row number that equals the row count of the destination file.

At that point, I try to append the newly created dataframe to the destination dataframe, but it keeps failing. I receive the following error:

 > RuntimeWarning: '<' not supported between instances of 'int' and 'str', sort order is undefined for incomparable objects
 >   result = result.union(other)

The code that I am using is below:

import os, uuid
import pandas as pd


def process_files():
    source_dir = "C:\\Projects\\test\\raw"
    destination_dir = "C:\\Projects\\test\\processed"

    for file_name in os.listdir(source_dir):
        if file_name.endswith((".csv", ".new")):
            df_source = pd.read_csv(source_dir + "/" + file_name, sep=";")

            if os.path.isfile(destination_dir + "/" + file_name):
                df_destination = pd.read_csv(destination_dir + "/" + file_name, sep=",", header=None)

                if df_source.shape[0] != (df_destination.shape[0]):
                    df_newlines = pd.read_csv(source_dir + "/" + file_name, sep=";", skiprows=df_destination.shape[0], header=None)
                    df_newlines.insert(0, "uu_id", pd.Series([uuid.uuid4() for i in range(len(df_newlines))]))
                    df_destination.append(df_newlines, ignore_index=True)
                    df_destination.to_csv(destination_dir + "/" + file_name, sep=",", header=False, mode="w", index=False)
                else:
                    continue
            else:
                df_source.insert(0,"uu_id", pd.Series([uuid.uuid4() for i in range(len(df_source))]))
                df_source.to_csv(destination_dir + "/" + file_name, sep=",", header=False, mode="w", index=False)
        else:
            continue


process_files()

I have checked the dtypes of both dataframes, and they match column by column. I have also forced the columns to be renamed to the same strings, but that does not do the trick. Any idea what I am doing wrong with append? (If I comment out the append line, the script runs without issues.)

Thank you and best regards, Bostjan

Answer 1

Disclaimer: Due to a lack of reputation points, I am not allowed to comment.

append does not work in place: it returns a new dataframe and leaves the original unchanged. Hence, I would suggest assigning the result back:

df_destination = df_destination.append(df_newlines, ignore_index=True)

Hope that's it.
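For readers on newer pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the same "assign the result back" idea is usually written with pd.concat. Below is a minimal sketch, not the asker's actual fix. The helper name append_new_rows and the toy data are made up for illustration, and it assumes both frames have the same number of columns. Relabelling the new rows with the destination's column labels is also an assumption on my part; the warning in the question is probably caused by mixing the integer labels produced by header=None with the string label "uu_id".

```python
import uuid

import pandas as pd


def append_new_rows(df_destination, df_newlines):
    """Return a NEW frame with df_newlines placed below df_destination.

    Hypothetical helper: assumes both frames have the same number of columns.
    """
    df_newlines = df_newlines.copy()
    # Align the labels so pandas does not have to union int and str labels.
    df_newlines.columns = df_destination.columns
    # pd.concat, like the old append, returns a new object.
    return pd.concat([df_destination, df_newlines], ignore_index=True)


# Toy data standing in for the processed file and the new source rows.
df_destination = pd.DataFrame([[str(uuid.uuid4()), "a", 1]])
df_newlines = pd.DataFrame([["b", 2], ["c", 3]])
df_newlines.insert(0, "uu_id", [str(uuid.uuid4()) for _ in range(len(df_newlines))])

# The result must be assigned back, otherwise it is discarded.
df_destination = append_new_rows(df_destination, df_newlines)
```

The first row keeps its original UUID, and only the two new rows receive fresh ones.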

Apart from that, I suggest using os.walk and fnmatch to browse the files.
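A rough sketch of what that could look like. The function name find_files and the default patterns are my own choices, mirroring the ".csv" and ".new" extensions in the question. Note that os.walk also descends into subdirectories, which os.listdir in the original code does not.

```python
import fnmatch
import os


def find_files(root_dir, patterns=("*.csv", "*.new")):
    """Yield paths under root_dir whose file name matches any of the patterns.

    Hypothetical helper; the patterns mirror the extensions used in the question.
    """
    for dir_path, _dir_names, file_names in os.walk(root_dir):
        for file_name in file_names:
            if any(fnmatch.fnmatch(file_name, pattern) for pattern in patterns):
                # os.path.join avoids mixing "\\" and "/" separators by hand.
                yield os.path.join(dir_path, file_name)
```

Usage would be something like `for path in find_files(source_dir): ...` in place of the os.listdir loop.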

Answer 1: 1 accepted, 2017-12-14 15:50:55