How can I compare two different CSV files using Python?

I want to write code that compares two CSV files.

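import pandas as pd
import numpy as np

# Raw strings keep "\f" in the path from being read as a form-feed escape
df = pd.read_csv(r"E:\Dupfile.csv")
df1 = pd.read_csv(r"E:\file.csv")

df['Correct'] = None

def Result(x):
    # The condition is left open: this is the part I don't know how to write
    if ....:
        return int(1)
    else:
        return int(0)


df.loc[:, "Correct"] = df.apply(Result, axis=1)

print(df["Correct"])

df.to_csv(r"E:\file.csv")
print(df.head(20))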

For example, file.csv looks like this:

     round    date  first  second  third  fourth  fifth  sixth  
0     1  2021.04      1      14     15      24     40     41     
1     2  2021.04      2       9     10      16     35     37      
2     3  2021.04      4      15     24      35     36     40      
3     4  2021.03     10      11     20      21     25     41     
4     5  2021.03      4       9     23      26     29     33     
5     6  2021.03      1       9     26      28     30     41     

Dupfile.csv looks like this:

    round    date  first  second  third  fourth  fifth  sixth  
0     1  2021.04      1      14     15      24     40     41  
0     1  2021.04      1       2      3       4      5      6    
1     2  2021.04      2       9     10      16     35     37   
1     2  2021.04      1       2      3       4      5      6      
2     3  2021.04      4      15     24      35     36     40    
2     3  2021.04      1       2      3       4      5      6     
3     4  2021.03     10      11     20      21     25     41  
3     4  2021.03      1       2      3       4      5      6     
4     5  2021.03      4       9     23      26     29     33  
4     5  2021.03      1       2      3       4      5      6   

In Dupfile.csv each round appears one extra time, with different values.

For each row of Dupfile.csv, I want to look up the same round in file.csv and compare the first to sixth values. If they are all equal, put 1 in a new "Correct" column in Dupfile; if not, put 0 in the "Correct" column.

I tried to compare the two CSV files, but I don't know how to do it. Can someone help me?

My expected output:

    round    date  first  second  third  fourth  fifth  sixth Correct
0     1  2021.04      1      14     15      24     40     41    1
0     1  2021.04      1       2      3       4      5      6    0
1     2  2021.04      2       9     10      16     35     37    1
1     2  2021.04      1       2      3       4      5      6    0  
2     3  2021.04      4      15     24      35     36     40    1
2     3  2021.04      1       2      3       4      5      6    0 
3     4  2021.03     10      11     20      21     25     41    1
3     4  2021.03      1       2      3       4      5      6    0 
4     5  2021.03      4       9     23      26     29     33    1
4     5  2021.03      1       2      3       4      5      6    0

If you are using pandas, it is better to rely on the methods the module already provides. I suggest using merge to compare the two DataFrames. I rewrote your code as follows.

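import pandas as pd

# Raw strings keep "\f" in the path from being read as a form-feed escape
df = pd.read_csv(r"E:\Dupfile.csv")
df1 = pd.read_csv(r"E:\file.csv")

# Every row of file.csv is a known-correct row
df1['Correct'] = 1

# Left join on all shared columns; Dupfile rows without a match get NaN, then 0
df = df.merge(
        df1,
        how='left',
        on=['round',
            'date',
            'first',
            'second',
            'third',
            'fourth',
            'fifth',
            'sixth']).fillna(0)

# fillna leaves the column as float (1.0 / 0.0), so cast it back to int
df['Correct'] = df['Correct'].astype(int)

print(df)

print(df['Correct'])

# Note: as in the original code, this overwrites file.csv
df.to_csv(r"E:\file.csv")
print(df.head(20))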

How does it work?

The merge method matches the rows of df and df1 on the columns listed in the on argument. With how='left', no row on the left side of the merge (df) is removed (a left join). In other words, the Correct column created on the file.csv data is attached to the Dupfile.csv data, and rows without a match get NaN. The fillna(0) method then replaces those NaN values with 0.
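To make the intermediate step visible, here is a minimal sketch on two made-up frames. The names `left`, `right`, `key` and `value` are illustrative only and do not come from the question. It shows the NaN that the left join produces for the unmatched row before fillna(0) replaces it.

```python
import pandas as pd

# Hypothetical stand-ins for Dupfile.csv (left) and file.csv (right)
left = pd.DataFrame({"key": [1, 1, 2], "value": [10, 99, 20]})
right = pd.DataFrame({"key": [1, 2], "value": [10, 20]})

# Flag every reference row as correct
right["Correct"] = 1

# Left join on all shared columns: every row of `left` is kept, in order
joined = left.merge(right, how="left", on=["key", "value"])

# The row (1, 99) has no partner in `right`, so its flag is NaN at this point
unmatched_before = int(joined["Correct"].isna().sum())

# Replace NaN with 0 and restore the integer dtype
joined["Correct"] = joined["Correct"].fillna(0).astype(int)
print(joined)
```

The column is float right after the merge because NaN cannot be stored in a plain integer column, which is why the cast back to int comes after fillna(0).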

pandas.DataFrame.merge API reference

You can do it with pure pandas using df.merge.

Check out the example:

import pandas as pd


# file.csv
file_df = pd.DataFrame(
    columns=["round", "date", "first", "second", "third", "fourth", "fifth", "sixth"],
    data=[
        ("1", "2021.04", "1", "14", "15", "24", "40", "41"),
        ("2", "2021.04", "2", "9", "10", "16", "35", "37"),
        ("3", "2021.04", "4", "15", "24", "35", "36", "40"),
        ("4", "2021.03", "10", "11", "20", "21", "25", "41"),
        ("5", "2021.03", "4", "9", "23", "26", "29", "33"),
        ("6", "2021.03", "1", "9", "26", "28", "30", "41"),
    ],
)

# adding control column (we already know that those are the right values)
file_df["correct"] = 1

# Dupfile.csv
dup_file_df = pd.DataFrame(
    columns=["round", "date", "first", "second", "third", "fourth", "fifth", "sixth"],
    data=[
        ("1", "2021.04", "1", "14", "15", "24", "40", "41"),
        ("1", "2021.04", "1", "2", "3", "4", "5", "6"),
        ("2", "2021.04", "2", "9", "10", "16", "35", "37"),
        ("2", "2021.04", "1", "2", "3", "4", "5", "6"),
        ("3", "2021.04", "4", "15", "24", "35", "36", "40"),
        ("3", "2021.04", "1", "2", "3", "4", "5", "6"),
        ("4", "2021.03", "10", "11", "20", "21", "25", "41"),
        ("4", "2021.03", "1", "2", "3", "4", "5", "6"),
        ("5", "2021.03", "4", "9", "23", "26", "29", "33"),
        ("5", "2021.03", "1", "2", "3", "4", "5", "6"),
    ],
)

# We extract the column names to use in the merging process
cols = list(dup_file_df.columns)

# We merge the 2 dataframes.
# The data frames must match on every column (round, date and first to sixth).
# A left join keeps exactly the rows of Dupfile.csv, in their original order
# (an outer join would also add the round 6 row that exists only in file.csv).
# The "correct" column will be populated only if all the columns are matching
merged = dup_file_df.merge(file_df, how="left", on=cols)

# We put 0 where correct is NaN and cast to integer (it was float)
merged["correct"] = merged["correct"].fillna(0).astype(int)

# Done!
print(merged)
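A variant of the same idea, sketched under some assumptions: instead of adding a control column to the reference data, merge can report where each row came from through its indicator=True option, which adds a `_merge` column. The CSV text below is a shortened copy of the question's data, read through io.StringIO in place of the real files. Reading with dtype=str is my own choice here, so that values such as 2021.04 are compared as text and not as floats.

```python
import io
import pandas as pd

# Shortened copies of file.csv and Dupfile.csv from the question
FILE_CSV = """round,date,first,second,third,fourth,fifth,sixth
1,2021.04,1,14,15,24,40,41
2,2021.04,2,9,10,16,35,37
"""
DUP_CSV = """round,date,first,second,third,fourth,fifth,sixth
1,2021.04,1,14,15,24,40,41
1,2021.04,1,2,3,4,5,6
2,2021.04,2,9,10,16,35,37
2,2021.04,1,2,3,4,5,6
"""

# dtype=str: compare the cells as text (an assumption, see the lead-in)
file_df = pd.read_csv(io.StringIO(FILE_CSV), dtype=str)
dup_df = pd.read_csv(io.StringIO(DUP_CSV), dtype=str)

cols = list(dup_df.columns)

# drop_duplicates guards against a repeated reference row multiplying matches
merged = dup_df.merge(
    file_df.drop_duplicates(), how="left", on=cols, indicator=True
)

# "_merge" is "both" when the row exists in file.csv, "left_only" otherwise
merged["Correct"] = (merged["_merge"] == "both").astype(int)
merged = merged.drop(columns="_merge")

# index=False keeps the row index out of the written CSV
print(merged.to_csv(index=False))
```

With real files, the two io.StringIO(...) arguments would be replaced by the file paths, and to_csv would be given a path to write to.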
