Compare multiple CSV files by row and delete files not needed

I am comparing multiple CSV files against a master file by the values of a selected column, and want to keep only the file that has the most matches with the master file.

The code I have so far gives me the results for each file, but I don't know how to compare the files against each other and keep only the one with the highest match count at the end.

I know how to delete files via os.remove() and so on, but I need help with selecting the maximum value.

import glob
import os

import pandas as pd

data0 = pd.read_csv('input_path/master_file.csv', sep=',')

csv_files = glob.glob(r'path_to_files_in_comparison\**\*.csv', recursive=True)

for df in csv_files:
    df_base = os.path.basename(df)
    input_dir = os.path.dirname(df)
    data1 = pd.read_csv(df, sep=',')
    comp1 = pd.concat([data0, data1])[['values']]
    cnt1 = comp1.loc[comp1.duplicated()]
    match1 = cnt1.count(axis=1)
    # don't name this `sum` -- that shadows the builtin and crashes on the next iteration
    total = str(match1.sum())
    print('Matches between ' + df_base + ' & ' + input_dir + ': ' + total)

The print gives me (paths and directory names appear correct):

Matches between ... & ...: 332215
Matches between ... & ...: 273239
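As a toy sanity check of what that concat/duplicated count actually measures, here is a minimal example with made-up values:

```python
import pandas as pd

# Made-up data illustrating the matching logic: values 2 and 4 appear
# in both frames, so two rows are flagged as duplicates.
master = pd.DataFrame({'values': [1, 2, 3, 4]})
other = pd.DataFrame({'values': [2, 4, 5]})

merged = pd.concat([master, other])[['values']]
dups = merged.loc[merged.duplicated()]
print(len(dups))  # → 2
```

Note that `duplicated()` also flags repeats within a single frame, so this count equals the number of cross-file matches only when each file's `values` column is itself duplicate-free.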

I had the idea to try it via sub-lists, but did not get anywhere.

You could write a function to calculate the "match score" for each file, and use that function as the key argument for the max function:

def match_score(csv_file):
    # data0 (the master DataFrame) is read once, outside this function
    data1 = pd.read_csv(csv_file, sep=",")
    comp1 = pd.concat([data0, data1])[['values']]
    cnt1 = comp1.loc[comp1.duplicated()]
    match1 = cnt1.count(axis=1)
    return match1.sum()

Then,

csv_files = glob.glob(fr'path_to_files_in_comparison\**\*.csv', recursive=True)
max_match_file = max(csv_files, key=match_score)
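Since the question also asks about deleting the losing files, here is a self-contained sketch of that final step, with throwaway temp files and made-up scores standing in for the real `match_score`:

```python
import os
import tempfile

# Stand-in scores instead of the real match_score() -- invented for the demo.
tmpdir = tempfile.mkdtemp()
scores = {}
for name, score in [('a.csv', 10), ('b.csv', 42), ('c.csv', 7)]:
    path = os.path.join(tmpdir, name)
    open(path, 'w').close()
    scores[path] = score

best = max(scores, key=scores.get)

# Destructive step: remove everything except the best-scoring file.
for path in scores:
    if path != best:
        os.remove(path)

print(os.listdir(tmpdir))  # → ['b.csv']
```

Test this pattern on copies of your data first, since the removals are irreversible.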

You can simplify your code a lot using pathlib.

Addressing your question, you can store the duplicate counts in a dictionary and, after comparing all the files, choose the one with the most matches. Something like this:

import pandas as pd
from pathlib import Path

main_file = Path('/main/path/main.csv')
main_df = pd.read_csv(main_file)

other_path = Path('/other/path/')
other_files = other_path.rglob('*.csv')

matches_per_file = {}

for other_file in other_files:
    other_df = pd.read_csv(other_file)
    merged_df = pd.concat([main_df, other_df])[['values']]
    dups = merged_df.loc[merged_df.duplicated()]
    dups_sum = sum(dups.count(axis=1))
    matches_per_file[other_file] = dups_sum
    print(f'Matches between {other_file} and {main_file}: {dups_sum}')

# find the file with most matches
most_matches = max(matches_per_file, key=matches_per_file.get)

The code above populates matches_per_file with filename : matches pairs. That makes it easy to find the maximum match count and the corresponding filename, and then decide which files you will keep and which ones you will delete. The variable most_matches will be set to that filename.
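For the deletion step this leaves open, here is a hedged, self-contained sketch, with temp files and invented counts standing in for the real matches_per_file:

```python
import tempfile
from pathlib import Path

# Invented match counts keyed by Path objects, standing in for the real dict.
tmpdir = Path(tempfile.mkdtemp())
matches_per_file = {}
for name, count in [('x.csv', 332215), ('y.csv', 273239)]:
    p = tmpdir / name
    p.touch()
    matches_per_file[p] = count

most_matches = max(matches_per_file, key=matches_per_file.get)

# Keep the winner; Path.unlink() is pathlib's equivalent of os.remove().
for path in matches_per_file:
    if path != most_matches:
        path.unlink()

print(sorted(p.name for p in tmpdir.iterdir()))  # → ['x.csv']
```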

Use the code snippet as a starting point, since I don't have the data files to test it properly.

Thank you for your support. I have built a solution using a list and sub-lists. I added the following to my code and it works. Probably not the nicest solution, but it's my chance to improve my Python skills.

    # inside the comparison loop: collect (file, match_count) pairs;
    # liste1 = [] and liste2 = [] are initialised before the loop,
    # and summe should stay an int -- comparing strings sorts lexicographically
        liste1.append(df)
        liste2.append(summe)

liste_overall = list(zip(liste1, liste2))
max_liste = max(liste_overall, key=lambda pair: pair[1])

for df2 in liste_overall:
    print(df2)
    if df2[1] == max_liste[1]:  # compare counts for equality, not containment
        print("Maximum duplicated values, keep file!")
    else:
        print("Not maximum duplicated, file is removed!")
        os.remove(df2[0])
