How to remove all duplicated rows from 2 CSV files with pandas?

I have two CSV files. Both have the same structure, with columns like ip, cve. I need to remove all rows that are present in both files and keep only the unique rows (a left anti join). I thought this could be done with a left join, but it doesn't work. Is there an easier way to solve this problem?

    import pandas as pd

    patrol = pd.read_csv('parse_results_MaxPatrol.csv')
    nessus = pd.read_csv('parse_result_nessus_new.csv')
    nessus_filtered = nessus.merge(patrol, how='left', left_on=[0], right_on=[0])

This code throws the following traceback:

File "C:/Users/username/Desktop/pandas/parser.py", line 6, in <module>
    nessus_filtered = nessus.merge(patrol, how='left', left_on=[0], right_on=[0])
  File "C:\Python37\lib\site-packages\pandas\core\frame.py", line 6868, in merge
    copy=copy, indicator=indicator, validate=validate)
  File "C:\Python37\lib\site-packages\pandas\core\reshape\merge.py", line 47, in merge
    validate=validate)
  File "C:\Python37\lib\site-packages\pandas\core\reshape\merge.py", line 529, in __init__
    self.join_names) = self._get_merge_keys()
  File "C:\Python37\lib\site-packages\pandas\core\reshape\merge.py", line 833, in _get_merge_keys
    right._get_label_or_level_values(rk))
  File "C:\Python37\lib\site-packages\pandas\core\generic.py", line 1706, in _get_label_or_level_values
    raise KeyError(key)
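
The KeyError in this traceback is consistent with `left_on=[0]` being treated as a column label rather than a column position: `pd.read_csv` takes its column labels from the first line of the file by default, so a label `0` would not exist. A minimal sketch of that behaviour follows. The inline CSV text and the `ip`/`cve` header names are assumptions standing in for the real files, which are not shown in the question.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV files; header names are assumed.
csv_text = "ip,cve\n10.0.0.1,CVE-2019-0001\n10.0.0.2,CVE-2019-0002\n"
nessus = pd.read_csv(io.StringIO(csv_text))
patrol = pd.read_csv(io.StringIO(csv_text))

# Column labels come from the header line, so they are strings here.
print(list(nessus.columns))
# The integer 0 is not one of those labels.
print(0 in nessus.columns)

# Merging on the actual column labels works.
merged = nessus.merge(patrol, how='left', on=['ip', 'cve'])
print(merged)
```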

You can see how to do this from the sample code below:

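import pandas as pd

# Load both files; the column names come from each file's header row
data_a = pd.read_csv('./a.csv')
data_b = pd.read_csv('./b.csv')
print('Data A')
print(data_a)
print('\nData B')
print(data_b)

# Stack the two frames, then keep only the first copy of each repeated row
data_c = pd.concat([data_a, data_b]).drop_duplicates(keep='first')
print('\nData C - Final dataset')
print(data_c)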

It reads two sample .csv files (a.csv and b.csv) that have the same structure (id and name columns) and share a few duplicate rows. It concatenates them, drops the duplicates, and keeps the first occurrence of each row, so a row present in both files appears once in the result.

Data A
   id   name
0   1   Jhon
1   2   Kane
2   3    Leo
3   4  Brack

Data B
   id   name
0   2   Kane
1   4  Brack
2   5  Peter
3   6    Tom

Data C - Final dataset
   id   name
0   1   Jhon
1   2   Kane
2   3    Leo
3   4  Brack
2   5  Peter
3   6    Tom
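
For the left anti join the question asks about, where rows present in both files are removed entirely, one possible approach is a left merge with `indicator=True`, keeping only the rows marked `left_only`. The sketch below is not part of the original answer; it rebuilds the sample data inline (instead of reading a.csv and b.csv) so that it runs on its own, and the variable names `only_in_a` and `unique_rows` are illustrative.

```python
import pandas as pd

# Same sample data as above, built inline instead of read from CSV files.
data_a = pd.DataFrame({'id': [1, 2, 3, 4],
                       'name': ['Jhon', 'Kane', 'Leo', 'Brack']})
data_b = pd.DataFrame({'id': [2, 4, 5, 6],
                       'name': ['Kane', 'Brack', 'Peter', 'Tom']})

# Left join on all shared columns; indicator=True adds a '_merge' column
# that records whether each row matched ('both') or not ('left_only').
merged = data_a.merge(data_b, how='left', on=['id', 'name'], indicator=True)

# Left anti join: keep rows of data_a that have no match in data_b.
only_in_a = merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(only_in_a)

# Variant: rows unique to either file, dropping every copy of a shared row.
unique_rows = pd.concat([data_a, data_b]).drop_duplicates(keep=False)
print(unique_rows)
```

Note that `drop_duplicates(keep=False)` also removes rows that are repeated within a single file, so the merge-based version is the safer choice if either file may contain internal duplicates.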

Hope this helps you solve your problem.
