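In Python, how to compare two CSV files based on the values in one column and output the records from the first file that have no match in the second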

Question

I'm pretty new to Python and coding in general. I've searched through several CSV comparison questions and answers and couldn't find anything that helped with this specific comparison problem.

I have two files that contain network asset info. Some devices have multiple IP addresses in one file and only one address in the other. The two files also don't use the same uppercase or lowercase format. I'm interested in their hostname values.

(The files don't have headers.)

file 1:

HOSTNAME1,10.0.0.1
HOSTNAME2,10.0.0.2
HOSTNAME3,10.19.0.3
hostname4,10.19.0.4,10.19.17.31,10.19.17.32,10.19.17.33,10.19.17.34
hostname5,10.19.0.40,10.19.17.51,10.19.17.52,10.19.17.53,10.19.17.54
hostname6,10.19.0.55,10.19.17.56,10.19.17.57,10.19.17.58,10.19.17.59

file 2:

HOSTNAME4,10.19.0.4
HOSTNAME5,10.19.0.40
HOSTNAME6,10.19.0.55
hostname7,192.168.0.1
hostname8,192.168.0.2
hostname9,192.168.0.3

I'd like to compare these files based on hostname (column 0) and output to a third file the rows in file1 that are NOT in file2, ignoring case and ignoring whether a host has multiple IPs in file1 or file2.

Desired output:

file3:

HOSTNAME1,10.0.0.1
HOSTNAME2,10.0.0.2
HOSTNAME3,10.19.0.3

I tried the simple comm command below in bash to see if it would generate the desired result and had no luck, so I decided to try this in Python.

comm -23 --nocheck-order file1.csv file2.csv > file3.csv

Here's what I've tried in Python:

with open('file1.csv', 'r') as f1, open('file2.csv', 'r') as f2:
    fileone = f1.readlines()
    filetwo = f2.readlines()

with open('file3.csv', 'w') as outFile:
    for line in fileone:
        if line not in filetwo:
            outFile.write(line)

The problem is that rows are not treated as matching when the IP lists don't match exactly. Even if two rows share a hostname in the first column, a row that has multiple IPs in one file isn't counted as a match.

I'm not sure whether my code above ignores case, and it seems to be trying to match the entire string from a row rather than doing a "contains" check.

I'm willing to try the pandas package if that makes more sense for this kind of comparison.

Answer 1

Your own code is not too far from what you need to do.
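To see why the original attempt writes every row, here is a small sketch using two rows from each of the question's files. The variable names whole_line_misses, hosts_two and hostname_misses are made up for illustration. It shows that `in` on a list of lines compares whole strings exactly and case-sensitively, while comparing only the upper-cased first field behaves as intended.

```python
# Sample rows from the question, with the trailing newline that readlines() keeps.
fileone = ["HOSTNAME1,10.0.0.1\n", "hostname4,10.19.0.4,10.19.17.31\n"]
filetwo = ["HOSTNAME4,10.19.0.4\n", "hostname7,192.168.0.1\n"]

# Whole-line membership: exact, case-sensitive string comparison,
# so the hostname4 row is not recognised as being in filetwo.
whole_line_misses = [line for line in fileone if line not in filetwo]

# Hostname-only membership: compare the upper-cased first field.
hosts_two = {line.split(',')[0].strip().upper() for line in filetwo}
hostname_misses = [line for line in fileone
                   if line.split(',')[0].strip().upper() not in hosts_two]

print(len(whole_line_misses))  # 2: both rows count as "not in file 2"
print(hostname_misses)         # only the HOSTNAME1 row remains
```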

Step 1: Create a set from the hostnames in file2.csv. The hostnames are converted to uppercase so that the comparison ignores case.

with open('file2.csv') as check_file:
    check_set = set([row.split(',')[0].strip().upper() for row in check_file])

Step 2: Iterate through the lines of file1.csv and write out each line whose hostname is not in the set.

with open('file1.csv', 'r') as in_file, open('file3.csv', 'w') as out_file:
    for line in in_file:
        if line.split(',')[0].strip().upper() not in check_set:
            out_file.write(line)

Contents of the generated file3.csv:

HOSTNAME1,10.0.0.1
HOSTNAME2,10.0.0.2
HOSTNAME3,10.19.0.3
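The two steps can also be wrapped into one reusable function. The following is only a sketch under a few assumptions: the function name write_unmatched_rows and its key_column parameter are hypothetical, the standard csv module is swapped in for str.split, and str.casefold() replaces upper() for the case-insensitive comparison. Rows that are blank or too short to have the key column are skipped.

```python
import csv

def write_unmatched_rows(first_path, second_path, out_path, key_column=0):
    """Write rows of first_path whose key column has no match in second_path.

    Returns the number of rows written. Matching ignores case and
    surrounding whitespace; all other columns are ignored.
    """
    # Step 1: collect the normalised keys from the second file.
    with open(second_path, newline='') as second:
        known = {row[key_column].strip().casefold()
                 for row in csv.reader(second) if len(row) > key_column}

    # Step 2: copy over rows of the first file whose key is not in that set.
    written = 0
    with open(first_path, newline='') as first, \
         open(out_path, 'w', newline='') as out:
        writer = csv.writer(out, lineterminator='\n')
        for row in csv.reader(first):
            if len(row) > key_column and \
                    row[key_column].strip().casefold() not in known:
                writer.writerow(row)
                written += 1
    return written
```

Calling write_unmatched_rows('file1.csv', 'file2.csv', 'file3.csv') on the files from the question should leave only the HOSTNAME1, HOSTNAME2 and HOSTNAME3 rows in file3.csv.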

Answer 2

Since you are interested in using pandas, I would suggest this.

Use read_csv to read the CSV files and merge to join them and identify the mismatches. For this, the number of columns in both files should be the same (or use names to define the columns). That said, if you are fine with comparing only the first column, you can try this.

import pandas as pd

#Read the 2 csv files and take only the first column
file1_df = pd.read_csv('filename1.csv',usecols=[0],names=['Name'])
file2_df = pd.read_csv('filename2.csv',usecols=[0],names=['Name'])

#Converting both the files first column to uppercase to make it case insensitive
file1_df['Name'] = file1_df['Name'].str.upper()
file2_df['Name'] = file2_df['Name'].str.upper()

#Merging both the Dataframe using left join
comparison_result = pd.merge(file1_df,file2_df,on='Name',how='left',indicator=True)

#Filtering only the rows that are available in left(file1)
comparison_result = comparison_result.loc[comparison_result['_merge'] == 'left_only']

print(comparison_result)

As I said, since the number of columns differs between the two CSV files (when split on commas), I'm reading only the first column. Hence the output will also have only one column, as shown below.

HOSTNAME1
HOSTNAME2
HOSTNAME3

Answer 3

You need to compare the first column only. Try something like the following:

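# fileone and filetwo are the lists returned by readlines() in the question
filetwo = [val.split(',')[0].strip().lower() for val in filetwo]
for line in fileone:
    if line.split(',')[0].strip().lower() not in filetwo:
        # each line already ends with a newline, so don't add another
        print(line, end='')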

3 solutions

Solution 1
2 Accepted 2020-08-24 13:13:49

Solution 2
0 2020-08-24 14:37:40

Solution 3
-1 2020-08-24 12:03:29