Compare multiple CSV files by row and delete files not needed
I am comparing multiple CSV files against a master file by the values of a selected column, and want to keep only the file that has the most matches with the master file.
The code I have so far gives me the results for each file, but I don't know how to compare the files among themselves and keep only the one with the highest match count at the end.
I know how to delete files via os.remove() and so on, but I need help with selecting the maximum value.
import glob
import os
import pandas as pd

data0 = pd.read_csv('input_path/master_file.csv', sep=',')
csv_files = glob.glob(r'path_to_files_in_comparison\**\*.csv', recursive=True)
for df in csv_files:
    df_base = os.path.basename(df)
    input_dir = os.path.dirname(df)
    data1 = pd.read_csv(df, sep=',')
    comp1 = pd.concat([data0, data1])[['values']]
    cnt1 = comp1.loc[comp1.duplicated()]
    match1 = cnt1.count(axis=1)
    total = match1.sum()  # renamed: assigning to "sum" shadows the built-in and breaks the next iteration
    print('Matches between ' + df_base + ' & ' + input_dir + ': ' + str(total))
The print gives me (paths and directory names appear correct):
Matches between ... & ...: 332215
Matches between ... & ...: 273239
I had the idea to try it via sub-lists, but just did not get anywhere.
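For context, the concat/duplicated counting used in the code above behaves like this on toy data (frames made up for illustration):

```python
import pandas as pd

# Two tiny frames standing in for the master file and one compared file
master = pd.DataFrame({'values': [1, 2, 3, 4]})
other = pd.DataFrame({'values': [2, 3, 9]})

# Rows flagged as duplicated are the values 'other' shares with 'master'
comp = pd.concat([master, other])[['values']]
dups = comp.loc[comp.duplicated()]
print(len(dups))  # 2 (values 2 and 3 occur in both frames)
```

Note that duplicated() also flags repeats within a single frame, so values repeated inside one file will inflate the count.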
You could write a function to calculate the "match score" for each file, and use that function as the key argument for the max function:
def match_score(csv_file):
    data1 = pd.read_csv(csv_file, sep=",")
    comp1 = pd.concat([data0, data1])[['values']]
    cnt1 = comp1.loc[comp1.duplicated()]
    match1 = cnt1.count(axis=1)
    return match1.sum()
Then,
csv_files = glob.glob(r'path_to_files_in_comparison\**\*.csv', recursive=True)
max_match_file = max(csv_files, key=match_score)
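As a toy illustration of the max-with-key pattern (the score function below is a stand-in for illustration, not the real match_score):

```python
def score(name):
    # Stand-in scoring function: just the length of the filename
    return len(name)

files = ['a.csv', 'bb.csv', 'c.csv']
# max() calls score() on each element and returns the element
# with the largest score, not the score itself
best = max(files, key=score)
print(best)  # bb.csv
```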
You can simplify your code a lot using pathlib. Addressing your question, you can store the duplicates sum in a dictionary and, after comparing all files, choose the one with the most matches. Something like this:
import pandas as pd
from pathlib import Path

main_file = Path('/main/path/main.csv')
main_df = pd.read_csv(main_file)

other_path = Path('/other/path/')
other_files = other_path.rglob('*.csv')

matches_per_file = {}
for other_file in other_files:
    other_df = pd.read_csv(other_file)
    merged_df = pd.concat([main_df, other_df])[['values']]
    dups = merged_df.loc[merged_df.duplicated()]
    dups_sum = dups.count(axis=1).sum()
    matches_per_file[other_file] = dups_sum
    print(f'Matches between {other_file} and {main_file}: {dups_sum}')
# find the file with most matches
most_matches = max(matches_per_file, key=matches_per_file.get)
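On a toy dictionary (counts made up, borrowing the numbers printed in the question), max with dict.get resolves to the key with the largest value:

```python
# Hypothetical match counts keyed by filename
matches_per_file = {'fileA.csv': 273239, 'fileB.csv': 332215}

# Iterating a dict yields its keys; dict.get maps each key to its value,
# so max() returns the key whose value is largest
most_matches = max(matches_per_file, key=matches_per_file.get)
print(most_matches)  # fileB.csv
```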
The code above will populate matches_per_file with filename: matches pairs. That will make it easy for you to find the max(matches) and the corresponding filename, and then decide which files you will keep and which ones you will delete. The variable most_matches will be set with that filename.
Use the code snippet as a starting point, since I don't have the data files to test it properly.
Thank you for your support. I have built a solution using a list and sub-lists. I added the following to my code and it works. Probably not the nicest solution, but it's a chance for me to improve my Python skills.
liste1.append(df)
liste2.append(summe)
liste_overall = list(zip(liste1, liste2))
max_liste = max(liste_overall, key=lambda sublist: sublist[1])
for df2 in liste_overall:
    print(df2)
    if df2[1] == max_liste[1]:  # "==" instead of "in": substring matching would wrongly keep e.g. "33" inside "332215"
        print("Maximum duplicated values, keep file!")
    else:
        print("Not maximum duplicated, file is removed!")
        os.remove(df2[0])