
How to compare each element in a list with each element in another list?

I want to compare a list of extracted promo codes with a list of correct promo codes.

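If a promo code in the extracted list has no exact match in the correct_promo_code list, the code contains an error. To recover the right code from the correct_promo_codes list, I need to find the promo code with the smallest edit distance (Levenshtein distance) to the code being compared (the one from extracted_list).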

Code so far:

import csv

# newline="" is what the csv module expects for files opened in text mode (Python 3)
with open("all_correct_promo.csv", newline="") as file1:
    correctPromoList = list(csv.reader(file1))  # each row is a list of fields

with open("all_extracted_promo.csv", newline="") as file2:
    extractedPromoList = list(csv.reader(file2))

incorrectPromo = []
count = 0  # number of extracted codes that have an exact match
for extracted in extractedPromoList:
    if extracted not in correctPromoList:
        incorrectPromo.append(extracted)
    else:
        count += 1

for promos in incorrectPromo:
    print(promos)

According to the nltk documentation:

nltk.metrics.distance.edit_distance(s1, s2, transpositions=False)

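Calculate the Levenshtein edit distance between two strings. The edit distance is the number of characters that need to be substituted, inserted, or deleted to transform s1 into s2. For example, transforming "rain" to "shine" requires three steps, consisting of two substitutions and one insertion: "rain" -> "sain" -> "shin" -> "shine". These operations could have been done in other orders, but at least three steps are needed.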
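To make the definition above concrete, here is a minimal pure-Python sketch of the textbook dynamic-programming computation of Levenshtein distance. It is not nltk's implementation, and the function name `levenshtein` is only illustrative; it ignores transpositions and charges a cost of 1 for every operation.

```python
def levenshtein(s1, s2):
    """Count the single-character insertions, deletions, or substitutions
    needed to turn s1 into s2 (illustrative helper, not nltk's code)."""
    # previous[j] holds the distance between the processed prefix of s1 and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("rain", "shine"))  # 3, as in the documentation's example
```

Only two rows of the table are kept at a time, so memory use grows with the length of s2 rather than with the product of both lengths.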

As for your code, I think a few changes to the second half will capture the edit distance:

from nltk.metrics import distance  # slow to load

extractedPromoList = ['abc', 'acd', 'abd']  # dummy stand-in for the CSV of extracted promo codes
correctPromoList = ['abc', 'aba', 'xbz', 'abz', 'abx']  # dummy stand-in for the CSV of real promo codes

def find_min_edit(str_, list_):
    distances = {}
    min_dist = float('inf')  # larger than any possible distance
    for correct_promo in list_:
        # Compare against the str_ argument, not the global loop variable.
        # transpositions is passed by keyword because newer nltk releases
        # take substitution_cost as the third positional argument.
        dist = distance.edit_distance(str_, correct_promo, transpositions=True)
        distances[correct_promo] = dist  # store each score for the real promo codes
        if dist < min_dist:
            min_dist = dist  # track the minimum distance seen so far
    # return a comma-separated string of all real promo codes at the minimum distance
    return ','.join(code for code, d in distances.items() if d == min_dist)

incorrectPromo = {}
count = 0
for extracted in extractedPromoList:
    print('Computing %dth promo code...' % count)
    # comma-separated string of the real promo codes nearest to the extracted one
    incorrectPromo[extracted] = find_min_edit(extracted, correctPromoList)
    count += 1
print(incorrectPromo)

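Output (from a Python 2 run; the order of the dictionary keys and of tied codes may differ on other Python versions):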

Computing 0th promo code...
Computing 1th promo code...
Computing 2th promo code...
{'abc': 'abc', 'abd': 'abx,aba,abz,abc', 'acd': 'abx,aba,abz,abc'}
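The output above also lists 'abc', which already matches exactly. If only the codes without an exact match need a suggestion, as the question describes, the two steps could be combined roughly as follows. This is a sketch under assumptions: `suggest_corrections` and the small `edit_distance` helper are hypothetical names, the helper is plain Levenshtein without transpositions (so it is not identical to the nltk call above), and the list of correct codes is assumed to be non-empty.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def edit_distance(a, b):
    """Plain Levenshtein distance via memoized recursion (illustrative helper)."""
    if not a or not b:
        return len(a) + len(b)  # insert or delete everything that remains
    if a[0] == b[0]:
        return edit_distance(a[1:], b[1:])
    return 1 + min(
        edit_distance(a[1:], b),      # delete a[0]
        edit_distance(a, b[1:]),      # insert b[0]
        edit_distance(a[1:], b[1:]),  # substitute a[0] with b[0]
    )


def suggest_corrections(extracted_codes, correct_codes, distance_fn=edit_distance):
    """Map each extracted code that has no exact match to its nearest correct codes.

    Assumes correct_codes is non-empty; ties are all returned, in input order.
    """
    valid = set(correct_codes)  # fast exact-match lookup
    suggestions = {}
    for code in extracted_codes:
        if code in valid:
            continue  # exact match, nothing to correct
        scored = [(distance_fn(code, candidate), candidate) for candidate in correct_codes]
        best = min(dist for dist, _ in scored)
        suggestions[code] = [candidate for dist, candidate in scored if dist == best]
    return suggestions


extracted = ['abc', 'acd', 'abd']
correct = ['abc', 'aba', 'xbz', 'abz', 'abx']
suggestions = suggest_corrections(extracted, correct)
print(suggestions)
```

Because `distance_fn` is a parameter, a wrapper around nltk's `edit_distance` can be passed in instead of the helper. Returning a list of tied candidates, rather than a joined string, leaves the choice among them to the caller.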

