In Python, how to compare two csv files based on values in one column and output records from the first file that do not match the second

I'm pretty new to Python and to coding in general. I've searched through several CSV comparison questions and answers and couldn't find anything that helped with this specific comparison problem.

I have two files that contain network asset info. Some devices have multiple IP addresses in one file and only one address in the other. The files also don't use consistent uppercase or lowercase for hostnames. I'm interested in their hostname values.

(The files don't have headers.)

File 1:

HOSTNAME1,10.0.0.1
HOSTNAME2,10.0.0.2
HOSTNAME3,10.19.0.3
hostname4,10.19.0.4,10.19.17.31,10.19.17.32,10.19.17.33,10.19.17.34
hostname5,10.19.0.40,10.19.17.51,10.19.17.52,10.19.17.53,10.19.17.54
hostname6,10.19.0.55,10.19.17.56,10.19.17.57,10.19.17.58,10.19.17.59

File 2:

HOSTNAME4,10.19.0.4
HOSTNAME5,10.19.0.40
HOSTNAME6,10.19.0.55
hostname7,192.168.0.1
hostname8,192.168.0.2
hostname9,192.168.0.3

I'd like to compare these files based on hostname (column 0) and write a third file containing the rows in file1 that are NOT in file2, ignoring case and ignoring whether a host has multiple IPs in file1 or file2.

Desired output:

file3:

HOSTNAME1,10.0.0.1
HOSTNAME2,10.0.0.2
HOSTNAME3,10.19.0.3

I tried a simple comm command in bash to see if I could generate the desired result and had no luck, so I decided to try this in Python:

comm -23 --nocheck-order file1.csv file2.csv > file3.csv

Here's what I've tried in Python:

with open('file1.csv', 'r') as f1, open('file2.csv', 'r') as f2:
    fileone = f1.readlines()
    filetwo = f2.readlines()

with open('file3.csv', 'w') as outFile:
    for line in fileone:
        if line not in filetwo:
            outFile.write(line)

The problem is that rows are only treated as matching when the whole line, including the IP list, matches exactly. Even if two rows share a hostname in the first column, the match isn't counted when the row has multiple IPs in one of the files.

I'm not sure whether my code above ignores case, and it seems to be trying to match the entire string from a row rather than doing a "contains" check.

I'm willing to try the pandas package if that makes more sense for this kind of comparison.

Your own code is not too far from what you need.

Step 1: Create a set from the list of hostnames in file2.csv. Here the hostnames are converted to uppercase.

with open('file2.csv') as check_file:
    check_set = set([row.split(',')[0].strip().upper() for row in check_file])

Step 2: Iterate through the lines of file1.csv and check whether the hostname is in the set.

with open('file1.csv', 'r') as in_file, open('file3.csv', 'w') as out_file:
    for line in in_file:
        if line.split(',')[0].strip().upper() not in check_set:
            out_file.write(line)

Contents of the generated file file3.csv:

HOSTNAME1,10.0.0.1
HOSTNAME2,10.0.0.2
HOSTNAME3,10.19.0.3
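
The two steps above can also be folded into one small helper that works on any iterable of lines, which makes the comparison easy to try without touching the disk. This is only a sketch: the names hostname_key and unmatched_rows are made up for illustration, and skipping blank lines is an added assumption that the original steps do not make.

```python
def hostname_key(line):
    """Return the first comma-separated field, trimmed and upper-cased."""
    return line.split(',', 1)[0].strip().upper()


def unmatched_rows(first_lines, second_lines):
    """Return the lines of first_lines whose hostname is absent from second_lines."""
    # Set of normalised hostnames from the second file (blank lines skipped).
    known = {hostname_key(line) for line in second_lines if line.strip()}
    return [line for line in first_lines
            if line.strip() and hostname_key(line) not in known]


# Sample data copied from the question (no header rows).
file1 = [
    "HOSTNAME1,10.0.0.1\n",
    "HOSTNAME2,10.0.0.2\n",
    "HOSTNAME3,10.19.0.3\n",
    "hostname4,10.19.0.4,10.19.17.31,10.19.17.32,10.19.17.33,10.19.17.34\n",
    "hostname5,10.19.0.40,10.19.17.51,10.19.17.52,10.19.17.53,10.19.17.54\n",
    "hostname6,10.19.0.55,10.19.17.56,10.19.17.57,10.19.17.58,10.19.17.59\n",
]
file2 = [
    "HOSTNAME4,10.19.0.4\n",
    "HOSTNAME5,10.19.0.40\n",
    "HOSTNAME6,10.19.0.55\n",
    "hostname7,192.168.0.1\n",
    "hostname8,192.168.0.2\n",
    "hostname9,192.168.0.3\n",
]

result = unmatched_rows(file1, file2)
print("".join(result), end="")
```

Because an open file object iterates over its lines, the same function should accept the handles from open('file1.csv') and open('file2.csv') directly, and the returned list can be passed to writelines() on the output file.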

Since you are interested in using Pandas, I would suggest this.

Use read_csv to read the CSV files and merge to join them and identify the mismatches. For this, the number of columns in both files should be the same (or use names to define the columns). That said, if you are fine with comparing only the first column, you can try this.

import pandas as pd

# Read the two CSV files and keep only the first column
file1_df = pd.read_csv('file1.csv', usecols=[0], names=['Name'])
file2_df = pd.read_csv('file2.csv', usecols=[0], names=['Name'])

#Converting both the files first column to uppercase to make it case insensitive
file1_df['Name'] = file1_df['Name'].str.upper()
file2_df['Name'] = file2_df['Name'].str.upper()

#Merging both the Dataframe using left join
comparison_result = pd.merge(file1_df,file2_df,on='Name',how='left',indicator=True)

#Filtering only the rows that are available in left(file1)
comparison_result = comparison_result.loc[comparison_result['_merge'] == 'left_only']

print(comparison_result)

As I said, since the number of columns differs between the two CSV files (when split on commas), I'm reading only the first column. Hence the output will also have only one column, as shown below.

HOSTNAME1
HOSTNAME2
HOSTNAME3
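
The pandas version above keeps only the hostname column, while the question asks for the full rows of file1. One possible way to keep them, sketched here under assumptions, is to load each raw line into a single column, derive the uppercase hostname from it, and filter with isin instead of merge. The function name unmatched_rows_pandas and the column name row are hypothetical; swapping merge for isin is a substitution on my part, not the approach of the answer above.

```python
import pandas as pd


def unmatched_rows_pandas(first_lines, second_lines):
    """Return rows of first_lines whose hostname does not occur in second_lines."""
    # Keep every raw line intact in a single 'row' column.
    first = pd.DataFrame({"row": [line.rstrip("\n") for line in first_lines]})
    second = pd.DataFrame({"row": [line.rstrip("\n") for line in second_lines]})
    for frame in (first, second):
        # Hostname = text before the first comma, trimmed and upper-cased.
        frame["Name"] = (
            frame["row"].str.split(",", n=1).str[0].str.strip().str.upper()
        )
    # Keep rows of the first frame whose hostname is not in the second frame.
    keep = ~first["Name"].isin(second["Name"])
    return first.loc[keep, "row"].tolist()


# Shortened sample based on the data in the question.
sample1 = [
    "HOSTNAME1,10.0.0.1\n",
    "hostname4,10.19.0.4,10.19.17.31,10.19.17.32\n",
]
sample2 = ["HOSTNAME4,10.19.0.4\n", "hostname7,192.168.0.1\n"]

rows = unmatched_rows_pandas(sample1, sample2)
print("\n".join(rows))
```

The lines can come from readlines() on the two files, and the returned strings can then be written to file3.csv one per line.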

You need to compare only the first column. Try something like the following:

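# fileone and filetwo are the lists returned by readlines() in the question's code
filetwo = [val.split(',')[0].strip().lower() for val in filetwo]
for line in fileone:
    if line.split(',')[0].strip().lower() not in filetwo:
        print(line, end='')  # each line already ends with a newline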
