I need help concatenating one CSV file and one pandas DataFrame without duplicates
My code currently looks like this:
df1 = pd.DataFrame(statsTableList)
df2 = pd.read_csv('StatTracker.csv')
result = pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
I get an error and I'm not sure why.
The goal of my program is to pull data from an API and then write it all to a file for analysis. df1 is, let's say, the first 100 games, written to the CSV file as the first version. df2 is me reading back those first 100 games the second time around and comparing them with df1 (new data, the next 100 games) to check for duplicates and delete them.
The part that is not working is the drop_duplicates part. It gives me an "unhashable list" error; I assume that's because the two DataFrames are built from lists of dictionaries. The goal is to pull 100 games of data and then pull the next 50, but if I pull game number 100 again, to drop that one, add only 101-150, and then write it all to my CSV file. Then, if I run it again, to pull 150-200 but drop 150 if it's a duplicate, and so on.
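For context, an "unhashable type: 'list'" error from drop_duplicates usually means that at least one cell holds a Python list, which pandas cannot hash when looking for repeated rows. Whether that is the case for statsTableList is an assumption; the column names and data below (game_id, players) are hypothetical stand-ins. A minimal sketch that reproduces the error and works around it by converting list cells to tuples:

```python
import pandas as pd


def make_hashable(value):
    """Recursively turn lists into tuples so pandas can hash the cell."""
    if isinstance(value, list):
        return tuple(make_hashable(v) for v in value)
    return value


# Hypothetical stand-in for statsTableList: the "players" cells are lists.
stats_table_list = [
    {"game_id": 100, "players": ["a", "b"]},
    {"game_id": 101, "players": ["c", "d"]},
]
# Hypothetical stand-in for the games that were saved earlier.
previous_games = [
    {"game_id": 99, "players": ["e", "f"]},
    {"game_id": 100, "players": ["a", "b"]},
]

df1 = pd.DataFrame(stats_table_list)
df2 = pd.DataFrame(previous_games)
combined = pd.concat([df2, df1], ignore_index=True)

try:
    combined.drop_duplicates()
    error_message = None
except TypeError as exc:
    # Expected when a cell contains a list.
    error_message = str(exc)

# Convert every list cell to a tuple, then deduplicate as usual.
hashable = combined.apply(lambda col: col.map(make_hashable))
result = hashable.drop_duplicates().reset_index(drop=True)
print(result)
```

Note that a list written to a CSV file is read back as a plain string, so rows loaded from StatTracker.csv may not compare equal to freshly built rows even after this conversion.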
Based on your explanation, you can use this one-liner to find the rows of df1 that are not already in df2:
df_diff = df1[~df1.apply(tuple,1)\
.isin(df2.apply(tuple,1))]
This code checks whether each row of df1 exists in the other DataFrame. To do the comparison, it converts each row to a tuple (the tuple conversion is applied along axis 1, the row axis).
This solution is indeed slow because it compares each row in df1 to all rows in df2, so it has a time complexity of roughly n^2.
If you want a more optimised version, try pandas' built-in compare method. Note that compare requires both DataFrames to have identical row and column labels:
df1.compare(df2)
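Another option, if each game carries a unique identifier, is to filter on that key alone instead of comparing whole rows. This is only a sketch: the game_id and score columns are assumptions, and an in-memory buffer stands in for StatTracker.csv so that the example runs on its own.

```python
import io

import pandas as pd

# In-memory stand-in for StatTracker.csv (hypothetical columns).
existing_csv = io.StringIO("game_id,score\n99,10\n100,12\n")
df2 = pd.read_csv(existing_csv)

# Newly pulled games; game 100 overlaps with what is already saved.
df1 = pd.DataFrame(
    [
        {"game_id": 100, "score": 12},
        {"game_id": 101, "score": 7},
        {"game_id": 102, "score": 9},
    ]
)

# Keep only the games whose key is not in the saved data.
new_rows = df1[~df1["game_id"].isin(df2["game_id"])]

# Saved games first, then the genuinely new ones.
result = pd.concat([df2, new_rows], ignore_index=True)
print(result)
```

Because only the key column is compared, list-valued cells elsewhere in the row never need to be hashed. With a real file, new_rows could be appended using to_csv with mode="a" and header=False.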