[英]How can I compare two different csv file using python?
I want to make the code that compare two csv files!我想制作比较两个csv文件的代码!
import pandas as pd
import numpy as np
df = pd.read_csv("E:\Dupfile.csv")
df1 = pd.read_csv("E:\file.csv")
df['Correct'] = None
def Result(x):
if ....:
return int(1)
else:
return int(0)
df.loc[:,"Correct"]=df.apply(Result,axis=1)
print(df["Correct"])
df.to_csv("E:\file.csv")
print(df.head(20))
For example, file.csv format seems like below:例如,file.csv 格式如下所示:
round date first second third fourth fifth sixth
0 1 2021.04 1 14 15 24 40 41
1 2 2021.04 2 9 10 16 35 37
2 3 2021.04 4 15 24 35 36 40
3 4 2021.03 10 11 20 21 25 41
4 5 2021.03 4 9 23 26 29 33
5 6 2021.03 1 9 26 28 30 41
Dupfile.csv seems like below: Dupfile.csv 如下所示:
round date first second third fourth fifth sixth
0 1 2021.04 1 14 15 24 40 41
0 1 2021.04 1 2 3 4 5 6
1 2 2021.04 2 9 10 16 35 37
1 2 2021.04 1 2 3 4 5 6
2 3 2021.04 4 15 24 35 36 40
2 3 2021.04 1 2 3 4 5 6
3 4 2021.03 10 11 20 21 25 41
3 4 2021.03 1 2 3 4 5 6
4 5 2021.03 4 9 23 26 29 33
4 5 2021.03 1 2 3 4 5 6
it has one more same round, but value is different.它还有一个相同的回合,但价值不同。
check the file's round value with Dupfile's round and if the first to sixth value is equal, make the another "Correct" column in Dupfile and put 1. If not correct, put 0 to the "Correct" Column.使用 Dupfile 的轮次检查文件的轮次值,如果第一个到第六个值相等,则在 Dupfile 中创建另一个“正确”列并放入 1。如果不正确,将 0 放入“正确”列。
I tried to compare two different csv file but, I don't know how to do it.我试图比较两个不同的 csv 文件,但我不知道该怎么做。 Can someone help me?
有人能帮我吗?
my expectation answer:我的期望答案:
round date first second third fourth fifth sixth Correct
0 1 2021.04 1 14 15 24 40 41 1
0 1 2021.04 1 2 3 4 5 6 0
1 2 2021.04 2 9 10 16 35 37 1
1 2 2021.04 1 2 3 4 5 6 0
2 3 2021.04 4 15 24 35 36 40 1
2 3 2021.04 1 2 3 4 5 6 0
3 4 2021.03 10 11 20 21 25 41 1
3 4 2021.03 1 2 3 4 5 6 0
4 5 2021.03 4 9 23 26 29 33 1
4 5 2021.03 1 2 3 4 5 6 0
If you use pandas
module, it will be better to gain the methods that provide in the module.如果你使用
pandas
模块,最好能获得模块中提供的方法。 I suggest you, try to use merge
for comparing 2 different DataFrames.我建议您尝试使用
merge
来比较 2 个不同的数据帧。 I rewrite your code as follows.我将您的代码重写如下。
import pandas as pd
df = pd.read_csv("E:\Dupfile.csv")
df1 = pd.read_csv("E:\file.csv")
df1['Correct'] = 1
df = df.merge(
df1,
how='left',
on=['round',
'date',
'first',
'second',
'third',
'fourth',
'fifth',
'sixth']).fillna(0)
print(df)
print(df['Correct'])
df.to_csv("E:\file.csv")
print(df.head(20))
The merge
method tries to match the columns in df
and df1
with the same names that exist in on
array.所述
merge
方法试图在列相匹配df
和df1
与存在于相同的名称on
阵列。 When you select left
for how
argument, no values on the left side of merging ( df
) would be removed (Left Join).当您为
how
参数选择left
,合并( df
)左侧的任何值都不会被删除(Left Join)。 In another way, the correct
column that we create in file.csv
appends to Dupfil.csv
data, and non-match is assigned as nan
value.换句话说,我们在
file.csv
创建的correct
列附加到Dupfil.csv
数据,并且不匹配被分配为nan
值。 The fillna(0)
method helps us to replace nan
values with 0. fillna(0)
方法帮助我们用 0 替换nan
值。
pandas.DataFrame.merge API reference pandas.DataFrame.merge API 参考
You can do it with pure pandas using df.merge
.您可以使用
df.merge
对纯熊猫进行df.merge
。
Check out the example:查看示例:
import pandas as pd
# file.csv
file_df = pd.DataFrame(
columns=["round", "date", "first", "second", "third", "fourth", "fifth", "sixth"],
data=[
("1", "2021.04", "1", "14", "15", "24", "40", "41"),
("2", "2021.04", "2", "9", "10", "16", "35", "37"),
("3", "2021.04", "4", "15", "24", "35", "36", "40"),
("4", "2021.03", "10", "11", "20", "21", "25", "41"),
("5", "2021.03", "4", "9", "23", "26", "29", "33"),
("6", "2021.03", "1", "9", "26", "28", "30", "41"),
],
)
# adding control column (we already know that those are the right values)
file_df["correct"] = 1
# Dupfile.csv
dup_file_df = pd.DataFrame(
columns=["round", "date", "first", "second", "third", "fourth", "fifth", "sixth"],
data=[
("1", "2021.04", "1", "14", "15", "24", "40", "41"),
("1", "2021.04", "1", "2", "3", "4", "5", "6"),
("2", "2021.04", "2", "9", "10", "16", "35", "37"),
("2", "2021.04", "1", "2", "3", "4", "5", "6"),
("3", "2021.04", "4", "15", "24", "35", "36", "40"),
("3", "2021.04", "1", "2", "3", "4", "5", "6"),
("4", "2021.03", "10", "11", "20", "21", "25", "41"),
("4", "2021.03", "1", "2", "3", "4", "5", "6"),
("5", "2021.03", "4", "9", "23", "26", "29", "33"),
("5", "2021.03", "1", "2", "3", "4", "5", "6"),
],
)
# We extract the column names to use in the merging process
cols = [x for x in dup_file_df.columns]
# We merge the 2 dataframes.
# The data frames are to match on every column (round, date and first to sixth).
# The "correct" column will be populated only if all the columns are matching
merged = dup_file_df.merge(file_df, how="outer", left_on=cols, right_on=cols)
# We put "0" where correct is None and cast to integer (it was float)
merged["correct"] = merged["correct"].fillna(0).astype(int)
# Done!
print(merged)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.