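Combining values from Similar Strings in CSV File
I have a CSV file full of transactions, with the vendor name in one column and the transaction amount in another. The goal is to find the vendor with the largest transaction total. That part is simple enough, and I have the following code: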
import csv

with open('Transactions.csv') as Vendor_Data:
    file_reader = csv.reader(Vendor_Data, delimiter=',')
    vendor_dict = {}  # vendor name -> [transaction count, total amount]
    next(file_reader)
    for row in file_reader:
        if row[3] not in vendor_dict:
            vendor_dict[row[3]] = [0, 0]
        # count every transaction, including the first one seen for a vendor
        vendor_dict[row[3]][0] += 1
        vendor_dict[row[3]][1] += round(float(row[1]), 2)
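The problem is that many entries spell the same vendor slightly differently (for example, two variants of the Delta Air Lines name). What is the best way to detect these similar names (for example, with Fuzzywuzzy) while looping over the CSV file and merging the transaction counts and amounts?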
import csv
from fuzzywuzzy import fuzz

with open('Transactions.csv') as Vendor_Data:
    file_reader = csv.reader(Vendor_Data, delimiter=',')
    vendor_dict = {}
    next(file_reader)  # skipping a header?
    for row in file_reader:
        # we can't use the dictionary directly (e.g. "key in vendor_dict")
        # because we want to do a similarity search.
        csv_name = row[3]
        for vendor_name, vendor_values in vendor_dict.items():
            # this is *a* way to do it. You may want to use different scores
            # or even a different comparison
            if fuzz.token_set_ratio(csv_name, vendor_name) > 80:
                vendor_values[0] += 1
                vendor_values[1] += round(float(row[1]), 2)
                break
        else:
            # we didn't find anything similar enough, so create an entry
            # (for/else: this runs only when the loop ended without a break)
            vendor_dict[csv_name] = [1, round(float(row[1]), 2)]
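The loop above stops once the dictionary is built, so here is a minimal sketch of the whole flow, down to picking the vendor with the largest total. It rests on several assumptions: the standard-library difflib.SequenceMatcher stands in for fuzzywuzzy (its scores will not match fuzz.token_set_ratio), the rows are made-up in-memory data rather than Transactions.csv, and the helper names similarity and merge_vendors are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score, ignoring case (difflib stand-in)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_vendors(rows, threshold=80):
    """Group (name, amount) pairs under the first similar name seen.

    Returns {canonical name: [transaction count, total amount]}.
    """
    totals = {}
    for name, amount in rows:
        for known_name, values in totals.items():
            if similarity(name, known_name) > threshold:
                values[0] += 1
                values[1] += amount
                break
        else:
            # no existing vendor is similar enough, so start a new entry
            totals[name] = [1, amount]
    return totals


# Made-up sample data, not taken from the question's file
sample_rows = [
    ("Delta Air Lines", 120.50),
    ("Starbucks", 4.75),
    ("Delta Airlines", 310.00),
    ("Starbuck's", 5.25),
    ("Delta Air Lines", 89.50),
]

vendor_totals = merge_vendors(sample_rows)
# vendor whose merged entries add up to the largest amount
top_vendor = max(vendor_totals, key=lambda name: vendor_totals[name][1])
```

Note that the first spelling encountered becomes the canonical key, so the result depends on row order, and each new name is compared against every known vendor, which can get slow with many distinct vendors.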
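Read the CSV file with pandas, then add a new column holding the fuzzywuzzy percentage match. Pick a threshold that decides which percentage should count as the same string, filter with the isin() method, and sum the values of the transaction amount column. Loop this over the whole DataFrame and you will get the result you want.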
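The steps above can be sketched roughly as follows. This is only one reading of the answer, under stated assumptions: pandas is installed, the column names vendor and amount are hypothetical, the data is made up, and difflib.SequenceMatcher again stands in for a fuzzywuzzy score.

```python
from difflib import SequenceMatcher

import pandas as pd

# Made-up data; the column names "vendor" and "amount" are assumptions
df = pd.DataFrame({
    "vendor": ["Delta Air Lines", "Starbucks", "Delta Airlines", "Starbuck's"],
    "amount": [120.50, 4.75, 310.00, 5.25],
})


def score(a, b):
    """0-100 similarity via difflib, standing in for a fuzzywuzzy ratio."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


THRESHOLD = 80  # scores above this count as the same vendor
results = {}    # reference name -> (transaction count, total amount)
remaining = df.copy()
while not remaining.empty:
    reference = remaining["vendor"].iloc[0]
    # new column: percentage match of each remaining name against the reference
    remaining["match"] = remaining["vendor"].apply(lambda name: score(name, reference))
    similar_names = remaining.loc[remaining["match"] > THRESHOLD, "vendor"].unique()
    # isin() selects every row whose name passed the threshold
    same_vendor = remaining["vendor"].isin(similar_names)
    group = remaining[same_vendor]
    results[reference] = (len(group), group["amount"].sum())
    # drop the rows just merged and repeat on what is left
    remaining = remaining[~same_vendor].copy()
```

Each pass takes the first unprocessed name as the reference, so the loop always removes at least one row and terminates.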