
Compare multiple CSV files by row and delete files not needed

I am comparing multiple CSV files against a master file by the values of a selected column, and I want to keep only the file that has the most matches with the master file.

The code I have so far gives me the result for each file, but I don't know how to compare the files against each other and keep only the one with the highest sum at the end.

I know how to delete files via os.remove() and so on, but I need help with selecting the maximum value.

import glob
import os

import pandas as pd

data0 = pd.read_csv('input_path/master_file.csv', sep=',')

csv_files = glob.glob(r'path_to_files_in_comparison\**\*.csv', recursive=True)

for df in csv_files:
    df_base = os.path.basename(df)
    input_dir = os.path.dirname(df)
    data1 = pd.read_csv(df, sep=',')
    # stack master and candidate, keep only the comparison column
    comp1 = pd.concat([data0, data1])[['values']]
    # rows marked duplicated() appear in both frames
    cnt1 = comp1.loc[comp1.duplicated()]
    match1 = cnt1.count(axis=1)
    total = match1.sum()  # renamed: the original `sum = str(...)` shadowed the built-in sum()
    print('Matches between ' + df_base + ' & ' + input_dir + ': ' + str(total))

The print gives me (paths and directories names appear correct):

Matches between ... & ...: 332215
Matches between ... & ...: 273239

I had the idea to try it via sub-lists, but did not get anywhere.
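For reference, the concat + duplicated() counting used in the loop above can be illustrated on toy data (the column name 'values' is from the question; the numbers are made up):

```python
import pandas as pd

# Toy data: both frames share the values 2 and 4
master = pd.DataFrame({'values': [1, 2, 3, 4]})
other = pd.DataFrame({'values': [2, 4, 9]})

combined = pd.concat([master, other])[['values']]
# duplicated() marks a row True only if an identical row appeared earlier,
# so each marked row is a value of `other` that also exists in `master`
dups = combined.loc[combined.duplicated()]
n_matches = len(dups)
# n_matches is 2 here (one row each for the shared values 2 and 4)
```

Note this counts duplicate *rows*, so repeated values inside either frame also inflate the count.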

You could write a function to calculate the "match score" for each file, and use that function as the key argument for the max function:

def match_score(csv_file):
    # count rows in csv_file whose 'values' also appear in the master frame data0
    data1 = pd.read_csv(csv_file, sep=",")
    comp1 = pd.concat([data0, data1])[['values']]
    cnt1 = comp1.loc[comp1.duplicated()]
    return cnt1.count(axis=1).sum()

Then,

csv_files = glob.glob(r'path_to_files_in_comparison\**\*.csv', recursive=True)
max_match_file = max(csv_files, key=match_score)
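To see how the key argument works, here is a toy run where hypothetical file names and hard-coded scores stand in for the real match_score results:

```python
# Hypothetical file names and scores, standing in for real match_score() results
fake_scores = {'a.csv': 3, 'b.csv': 9, 'c.csv': 5}

def match_score(csv_file):
    return fake_scores[csv_file]

csv_files = list(fake_scores)
# max() calls match_score once per file and returns the file, not the score
max_match_file = max(csv_files, key=match_score)
# max_match_file is 'b.csv'
```

If several files tie for the highest score, max() returns the first one encountered.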

You can simplify your code a lot using pathlib.

Addressing your question: you can store the duplicates sum in a dictionary and, after comparing all files, choose the one with the most matches. Something like this:

import pandas as pd
from pathlib import Path

main_file = Path('/main/path/main.csv')
main_df = pd.read_csv(main_file)

other_path = Path('/other/path/')
other_files = other_path.rglob('*.csv')

matches_per_file = {}

for other_file in other_files:
    other_df = pd.read_csv(other_file)
    merged_df = pd.concat([main_df, other_df])[['values']]
    dups = merged_df.loc[merged_df.duplicated()]
    dups_sum = dups.count(axis=1).sum()  # number of duplicated rows, i.e. matches
    matches_per_file[other_file] = dups_sum
    print(f'Matches between {other_file} and {main_file}: {dups_sum}')

# find the file with most matches
most_matches = max(matches_per_file, key=matches_per_file.get)

The code above populates matches_per_file with filename: matches pairs. That makes it easy to find the maximum match count and the corresponding filename, and then decide which files to keep and which to delete. The variable most_matches is set to that filename.
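One way to act on that dictionary is to delete everything except the winner. A minimal sketch, using placeholder files in a temporary directory and made-up match counts in place of the real comparison loop:

```python
import os
import tempfile

# Placeholder files with made-up match counts, standing in for the real loop's results
tmpdir = tempfile.mkdtemp()
matches_per_file = {}
for name, score in [('a.csv', 5), ('b.csv', 12), ('c.csv', 7)]:
    path = os.path.join(tmpdir, name)
    open(path, 'w').close()  # create an empty stand-in file
    matches_per_file[path] = score

# the key with the largest value, i.e. the file with the most matches
most_matches = max(matches_per_file, key=matches_per_file.get)

# keep only that file, delete the rest
for path in matches_per_file:
    if path != most_matches:
        os.remove(path)
# only b.csv remains in tmpdir
```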

Use the code snippet as a starting point, since I don't have the data files to test it properly.

Thank you for your support. I built a solution using lists and sub-lists. I added the following to my code and it works. Probably not the nicest solution, but it gives me a chance to improve my Python skills.

        liste1.append(df)
        liste2.append(summe)
        liste_overall = list(zip(liste1, liste2))

    # pick the (filename, sum) pair with the highest sum
    max_liste = max(liste_overall, key=lambda sublist: sublist[1])

for df2 in liste_overall:
    print(df2)
    if df2[1] == max_liste[1]:  # compare counts with ==, not `in` (substring matching)
        print("Maximum duplicated values, keep file!")
    else:
        print("Not maximum duplicated, file is removed!")
        os.remove(df2[0])
