Python Pandas append dataframe
I have a case where I am adding UUID columns to .csv files. At the same time, I am checking the source files and comparing them to the processed ones. If a source file contains additional lines, I plan to append those new lines to the destination file. I want to append rather than overwrite the file because the UUIDs of previously processed lines must stay the same.
To detect new lines, I check whether the row count is the same for the source and destination files. If it is not, I create a new dataframe from the source file, starting at the row number that equals the row count of the destination file.
At that point, I try to append the newly created dataframe to the destination dataframe, but it keeps failing. I receive the following error:
> RuntimeWarning: '<' not supported between instances of 'int' and 'str', sort order is undefined for incomparable objects
> result = result.union(other)
The code that I am using is below:
import os, uuid
import pandas as pd

def process_files():
    source_dir = "C:\\Projects\\test\\raw"
    destination_dir = "C:\\Projects\\test\\processed"
    for file_name in os.listdir(source_dir):
        if file_name.endswith((".csv", ".new")):
            df_source = pd.read_csv(source_dir + "/" + file_name, sep=";")
            if os.path.isfile(destination_dir + "/" + file_name):
                df_destination = pd.read_csv(destination_dir + "/" + file_name, sep=",", header=None)
                if df_source.shape[0] != (df_destination.shape[0]):
                    df_newlines = pd.read_csv(source_dir + "/" + file_name, sep=";", skiprows=df_destination.shape[0], header=None)
                    df_newlines.insert(0, "uu_id", pd.Series([uuid.uuid4() for i in range(len(df_newlines))]))
                    df_destination.append(df_newlines, ignore_index=True)
                    df_destination.to_csv(destination_dir + "/" + file_name, sep=",", header=False, mode="w", index=False)
                else:
                    continue
            else:
                df_source.insert(0,"uu_id", pd.Series([uuid.uuid4() for i in range(len(df_source))]))
                df_source.to_csv(destination_dir + "/" + file_name, sep=",", header=False, mode="w", index=False)
        else:
            continue

process_files()
I have checked the dtypes of both dataframes, and they match column by column. I have also renamed the columns so that both dataframes use the same strings, but that does not do the trick. Any idea what I am doing wrong with append? (Commenting out the append line lets the script run without issues.)
Thank you and best regards, Bostjan
Disclaimer: Due to a lack of reputation points, I am not allowed to comment.
Normally, append does not work in place: it returns a new dataframe and leaves the original unchanged. Hence, I would suggest writing
df_destination = df_destination.append(df_newlines, ignore_index=True)
Hope that's it.
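As a side note, DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same idea is usually written with pd.concat, which also returns a new dataframe. The sketch below is only an illustration under assumptions: the two small frames are hypothetical stand-ins for df_destination and df_newlines, and the positional relabelling of columns is my guess at one way to avoid mixing integer labels (from header=None) with the string label "uu_id", which looks like a plausible source of the warning in the question.

```python
import uuid

import pandas as pd

# Hypothetical stand-in for the destination file read with header=None:
# its column labels are the integers 0, 1, 2.
df_destination = pd.DataFrame([["id-1", "a", 1], ["id-2", "b", 2]])

# Hypothetical stand-in for the new source rows, also read with header=None
# (integer labels 0, 1) ...
df_newlines = pd.DataFrame([["c", 3]])
# ... which then get a string-labelled column inserted at position 0,
# so the labels become ["uu_id", 0, 1].
df_newlines.insert(0, "uu_id", [str(uuid.uuid4()) for _ in range(len(df_newlines))])

# Assumption: both frames hold the same fields in the same order, so the
# new rows can simply take over the destination's column labels.
df_newlines.columns = df_destination.columns

# concat does not modify its inputs; the result has to be assigned.
df_combined = pd.concat([df_destination, df_newlines], ignore_index=True)
```

Here df_destination still has two rows after the call, while df_combined has three, which is the same behaviour that makes the bare df_destination.append(...) line in the question a no-op.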
Apart from that, I suggest using os.walk and fnmatch to browse the files.
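A minimal sketch of that suggestion follows. The helper name find_files and its patterns parameter are hypothetical, chosen only for illustration. Note that os.walk also descends into subdirectories, which os.listdir in the question does not, so this is not a drop-in replacement if only the top-level folder should be scanned.

```python
import fnmatch
import os


def find_files(root_dir, patterns=("*.csv", "*.new")):
    """Yield paths under root_dir whose file name matches any of the patterns."""
    # os.walk visits root_dir and every directory below it.
    for dir_path, _dir_names, file_names in os.walk(root_dir):
        for file_name in file_names:
            # fnmatch.fnmatch applies shell-style wildcards to the bare name.
            if any(fnmatch.fnmatch(file_name, pattern) for pattern in patterns):
                yield os.path.join(dir_path, file_name)
```

It could then replace the listdir loop with something like for path in find_files(source_dir): ..., using os.path.basename(path) wherever the bare file name is needed.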