How to Wildcard or Regex on Multiple Strings

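I have a list of SKU names, and I need to resolve the abbreviations in them into words.

The abbreviations vary in length (2-5 characters), but their letters appear in the same order as in the actual word.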

A few examples:

SKU name: "235 DSKTP 10LB" ----> "Desktop"

SKU name: "222840 MSE 2oz" ----> "Mouse"

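Other notes:

  1. The SKU names are not all uppercase, but I know that is easy to change with the .upper() method.
  2. The list of words I need to match is long (100+ words), so would building a list of words to match against the pattern be the most efficient approach?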

I have played around with regular expressions, but to no avail.

Is there a regular expression along the lines of d?e?s?k?t?o?p?
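The pattern asked about here can be generated from each full word instead of being written by hand. Below is a minimal sketch, assuming an abbreviation only ever drops letters and never reorders or adds them. The helper names `optional_letter_pattern` and `find_word` are made up for illustration.

```python
import re

def optional_letter_pattern(word):
    # "Desktop" -> d?e?s?k?t?o?p? (every letter optional, order preserved).
    body = "".join(re.escape(ch) + "?" for ch in word.lower())
    return re.compile(body, re.IGNORECASE)

def find_word(abbr, words):
    # Return the first word whose optional-letter pattern consumes the
    # whole abbreviation, or None when nothing fits.
    if not abbr:
        return None  # an empty string would trivially match every pattern
    for word in words:
        if optional_letter_pattern(word).fullmatch(abbr):
            return word
    return None

words = ["Desktop", "Mouse", "Wine"]
print(find_word("DSKTP", words))
print(find_word("mse", words))
print(find_word("XXX", words))
```

With 100+ words it is probably worth compiling the patterns once and reusing them. A short abbreviation can also fit more than one word, so returning the first hit is a simplification.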

import re
from collections import OrderedDict

data = '''
235 DSKTP 10LB
222840 MSE 2oz
1234 WNE 1L
12345 XXX 23L
RND PTT GNCH 16 OZ 007349012845
FRN SHL CNCH 7.05 OZ 007473418910
TWST CLNT 16 OZ 00733544
'''

words = ['Desktop',
'Mouse',
'Tree',
'Wine',
'Gnocchi',
'Shells',
'Cellentani']

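def compare(sku_abbr, full_word):
    # Letters that occur in only one of the two strings (symmetric difference).
    only_in_one = set(sku_abbr) ^ set(full_word)
    # Keep the letters of full_word that also occur in sku_abbr.
    s = ''.join(c for c in full_word if c not in only_in_one)
    # Drop repeated letters while preserving first-seen order.
    s = ''.join(OrderedDict.fromkeys(s))
    return s == sku_abbr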

for full_sku in data.splitlines():
    if not full_sku:
        continue
    for sku_abbr in re.findall(r'([A-Z]{3,})', full_sku):
        should_break = False
        for w in words:
            if compare(sku_abbr.upper(), w.upper()):
                print(full_sku, w)
                should_break = True
                break
        if should_break:
            break
    else:
        print(full_sku, '* NOT FOUND *')

Prints:

235 DSKTP 10LB Desktop
222840 MSE 2oz Mouse
1234 WNE 1L Wine
12345 XXX 23L * NOT FOUND *
RND PTT GNCH 16 OZ 007349012845 Gnocchi
FRN SHL CNCH 7.05 OZ 007473418910 Shells
TWST CLNT 16 OZ 00733544 Cellentani

You can create a dictionary that maps the abbreviations to the actual words:

import re
names = ["235 DSKTP 10LB", "222840 MSE 2oz"]
abbrs = {'DSKTP':'Desktop', 'MSE':'Mouse'}
matched = [re.findall(r'(?<=\s)[a-zA-Z]+(?=\s)', i) for i in names]
result = ['N/A' if not i else abbrs.get(i[0], i[0]) for i in matched]

Output:

['Desktop', 'Mouse']

Look into the Levenshtein distance: it measures the "similarity of texts".

Source of the Levenshtein implementation: https://en.wikibooks.org/wiki/Algorithm_Implementation

def levenshtein(s1, s2):
    # source: https://en.wikibooks.org/wiki/Algorithm_Implementation
    #         /Strings/Levenshtein_distance#Python
    if len(s1) < len(s2):
        return levenshtein(s2, s1)

    # len(s1) >= len(s2)
    if len(s2) == 0:
        return len(s1)

    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    return previous_row[-1]

Applied to your problem:

skus = ["235 DSKTP 10LB","222840 MSE 2oz"]
full = ["Desktop", "Mouse", "potkseD"]

# go over all skus
for sku in skus:
    name = sku.split()[1].lower()       # extract name
    dist = []
    for f in full:                      # calculate all levenshtein dists to full names
                                        # you could shorten this by only using those
                                        # where the 1st character is identical
        dist.append((levenshtein(name, f.lower()), name, f))

    print(dist)

    # get the minimal distance (beware if the same distance occurs more than once)
    print( min( (p for p in dist), key = lambda x:x[0]) )

Output:

# distances 
[(2, 'dsktp', 'Desktop'), (5, 'dsktp', 'Mouse'), (6, 'dsktp', 'potkseD')]

# minimal one
(2, 'dsktp', 'Desktop')

# distances
[(6, 'mse', 'Desktop'), (2, 'mse', 'Mouse'), (5, 'mse', 'potkseD')]

# minimal one
(2, 'mse', 'Mouse')
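The comments in the loop above hint at two refinements: comparing only against words that share the first letter, and watching out for equal distances. A rough sketch of both follows, assuming the first letter of an abbreviation is always kept. `best_matches` is a hypothetical helper name, and `levenshtein` is repeated only so that the block runs on its own.

```python
def levenshtein(s1, s2):
    # Same row-by-row computation as the function shown earlier.
    if len(s1) < len(s2):
        return levenshtein(s2, s1)
    if len(s2) == 0:
        return len(s1)
    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]

def best_matches(abbr, words):
    # Returns (smallest distance, every word at that distance); ties are
    # kept rather than silently resolved.
    abbr = abbr.lower()
    if not abbr:
        return None, []
    # Assumption: an abbreviation keeps the first letter of its word.
    candidates = [w for w in words if w and w[0].lower() == abbr[0]]
    if not candidates:
        return None, []
    scored = [(levenshtein(abbr, w.lower()), w) for w in candidates]
    best = min(d for d, _ in scored)
    return best, [w for d, w in scored if d == best]

print(best_matches("DSKTP", ["Desktop", "Mouse", "potkseD"]))
print(best_matches("mse", ["Mouse", "Moose"]))
```

When the returned list holds more than one word, the abbreviation is ambiguous under this measure and probably needs a manual decision.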

If you have a fixed mapping, sit down and create the mapping dictionary by hand; then you are golden until new SKUs are introduced.
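One way to notice when new SKUs show up is to report the tokens the dictionary does not cover yet. The sketch below is only illustrative: `ABBREVIATIONS` and `resolve` are hypothetical names, and it assumes an abbreviation is a purely alphabetic, whitespace-separated token.

```python
# Hypothetical hand-maintained mapping; extend it as new SKUs appear.
ABBREVIATIONS = {"DSKTP": "Desktop", "MSE": "Mouse"}

def resolve(sku_name, mapping=ABBREVIATIONS):
    # Purely alphabetic tokens only, so "235" and "10LB" are ignored.
    tokens = [t for t in sku_name.upper().split() if t.isalpha()]
    # First token with a known expansion, or None when there is none.
    word = next((mapping[t] for t in tokens if t in mapping), None)
    # Tokens still missing from the mapping.
    unknown = [t for t in tokens if t not in mapping]
    return word, unknown

print(resolve("235 DSKTP 10LB"))
print(resolve("222840 mse 2oz"))
print(resolve("1234 WNE 1L"))
```

A non-empty second element is the cue to add another entry to the dictionary.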
