
How to do an exact match in a paragraph using a list of strings in Python

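I have a list of strings that contain version numbers, and I want to find exact matches for those strings in a paragraph. Example: products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]

paragraph = "Troubleshooting steps for productA v4.1.5 documents"

In this case, if I use a filter like the following: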

products = ["productA v4.1", "productA v4.1.5", "product A v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 documents"
def checkIfProdExist(x):
  if paragraph.find(x) != -1:
    return True
  else:
    return False
results = filter(checkIfProdExist, products)
print(list(results))

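The output of the code above is ['productA v4.1', 'productA v4.1.5'].

How can I find only "productA v4.1.5" in the paragraph and get its index?

You want the longest match, so you should start matching with the longest string first: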

products = ["productA v4.1", "productA v4.1.5", "product A v4.1.5 ver"]
productsSorted = sorted(products, key=len, reverse=True)
paragraph = "Troubleshooting steps for productA v4.1.5 documents"


def checkIfProdExist(x):
    if paragraph.find(x) != -1:
        return True
    else:
        return False


def checkIfProdExistAndExit(prods):
    # stop immediately after the first match!
    for x in prods:
        if paragraph.find(x) != -1:
            return x


results = filter(checkIfProdExist, productsSorted)
print(list(results)[0])
results = checkIfProdExistAndExit(productsSorted)
print(results)

Out:

productA v4.1.5
productA v4.1.5
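
Since the question also asks for the index, here is a minimal sketch of the same longest-first idea that returns the position together with the match. The helper name find_longest_product is hypothetical, not part of the answer above, and the product list uses the spelling from the question text.

```python
products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 documents"


def find_longest_product(text, candidates):
    """Return (index, product) for the longest candidate found in text, or None."""
    # Try the longest strings first so a short prefix cannot win
    for candidate in sorted(candidates, key=len, reverse=True):
        index = text.find(candidate)
        if index != -1:
            return index, candidate
    return None


print(find_longest_product(paragraph, products))  # (26, 'productA v4.1.5')
```

This is still plain substring matching, so a paragraph containing "productA v4.1.55" would also report "productA v4.1.5".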

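It sounds like you basically want the start and the end of the match to be either the edge of the paragraph or a transition to a whitespace character (the end of a "word", but sadly the regex definition of a word character doesn't include ., so you can't use \b-based tests).

The simplest approach here is to split the line on whitespace and check whether your string's words appear in the resulting list (using some variation of finding a sublist in a list):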
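
To illustrate why \b is not enough, the short sketch below compares a \b-based pattern with one that requires whitespace or the string edge on both sides. The lookaround form (?<!\S) ... (?!\S) is an assumed equivalent way of writing that whitespace test, not code taken from the answer.

```python
import re

paragraph = "Troubleshooting steps for productA v4.1.5 documents"
needle = "productA v4.1"

# \b sees a word boundary between "1" and ".", so the shorter version still matches
word_boundary = re.search(rf"\b{re.escape(needle)}\b", paragraph)

# Requiring whitespace (or the string edge) on both sides rejects it
whitespace_boundary = re.search(rf"(?<!\S){re.escape(needle)}(?!\S)", paragraph)

print(word_boundary is not None)        # True
print(whitespace_boundary is not None)  # False
```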

def list_contains_sublist(haystack, needle):
    firstn, *restn = needle  # Extracted up front for efficiency
    for i, x in enumerate(haystack, 1):
        if x == firstn and haystack[i:i+len(restn)] == restn:
            return True
    return False

para_words = paragraph.split()
def checkIfProdExist(x):
    return list_contains_sublist(para_words, x.split())
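
As a quick check, the sketch below repeats the sublist helper so it runs on its own and applies it to the data from the question. The names matches and para_words are only for illustration.

```python
def list_contains_sublist(haystack, needle):
    first, *rest = needle
    # enumerate from 1 so that i already points at the element after `word`
    for i, word in enumerate(haystack, 1):
        if word == first and haystack[i:i + len(rest)] == rest:
            return True
    return False


products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 documents"
para_words = paragraph.split()

matches = [p for p in products if list_contains_sublist(para_words, p.split())]
print(matches)  # ['productA v4.1.5']
```

Because the comparison is word by word, a product followed directly by punctuation (for example "productA v4.1.5." at the end of a sentence) would not match.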

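If you also want the index, or need exact whitespace matching, it gets trickier (.split() doesn't preserve runs of whitespace, so you can't reconstruct the index, and if you search the whole string for the index you might get the wrong one when the substring appears twice but only the second occurrence meets your requirements). At that point, I'd probably just go with a regex: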

import re

def checkIfProdExist(x):
    m = re.search(fr'(^|\s){re.escape(x)}(?=\s|$)', paragraph)
    if m:
        return m.end(1)  # After the matched space, if any
    return -1  # Or omit return for implicit None, or raise an exception, or whatever

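Note that, as written, this won't work with your filter (it returns 0, which is falsy, if the paragraph starts with the substring, and -1, which is truthy, on failure). You could have it return None on failure and a tuple of indices on success, so it works both for boolean tests and when you need the index, e.g. (using the walrus operator, available in Python 3.8+, just for fun):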

def checkIfProdExist(x):
    if m := re.search(fr'(?:^|\s)({re.escape(x)})(?=\s|$)', paragraph):
        return m.span(1)  # We're capturing match directly to get end of match easily, so we stop capturing leading space and just use span of capture
    # Implicitly returns falsy None on failure
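
A possible way to use this with the data from the question is sketched below. The function is restated with the paragraph as an explicit parameter and without the walrus operator so that it runs on older Python versions too. The name find_product_span is hypothetical.

```python
import re

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 documents"


def find_product_span(product, text):
    """Return (start, end) of a whitespace-delimited match, or None."""
    m = re.search(rf"(?:^|\s)({re.escape(product)})(?=\s|$)", text)
    return m.span(1) if m else None


# A non-empty tuple is truthy even when it starts at 0, so this works as a filter
found = [p for p in products if find_product_span(p, paragraph)]
print(found)                                             # ['productA v4.1.5']
print(find_product_span("productA v4.1.5", paragraph))   # (26, 41)
```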

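I solved my use case by sorting the product list in reverse order of length and removing the occurrences of each matched product from the paragraph. My code is below. It may or may not be the right approach, but it serves my purpose. It works even when the product list is empty and when the paragraph contains many matching strings from the product list. Thanks for all the research and help!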

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]

# Sort by length in descending order so that longer strings come first
products = sorted(products, key=len, reverse=True)

paragraph = "Troubleshooting steps for productA v4.1.5 ver documents also has steps for productA v4.1 document "


def checkIfProdExist(x):
  if paragraph.find(x) != -1:
    return True
  else:
    return False

# Keep only the products that occur somewhere in the paragraph
prodResults = list(filter(checkIfProdExist, products))

print(prodResults)
# At this state Result is  = ['productA v4.1.5 ver', 'productA v4.1.5', 'productA v4.1']

finalResult = []

# Loop through the matched strings (longest first)
for prd in prodResults:
  if paragraph.find(prd) != -1:
    # Record the index of the first occurrence of this product
    finalResult.append({"index":str(paragraph.find(prd)),"value":prd})

    # Remove all occurrences of the matched string so that shorter strings cannot match inside it,
    # i.e. removing "productA v4.1.5 ver" leaves no chance to match "productA v4.1.5" or "productA v4.1" there.
    # Note: indexes recorded after a removal refer to the shortened paragraph, not the original one.
    paragraph = paragraph.replace(prd,"")

print(finalResult)
# Final result is [{'index': '26', 'value': 'productA v4.1.5 ver'}, {'index': '56', 'value': 'productA v4.1'}]
# If the paragraph is "Troubleshooting steps for productA v4.1.5 documents" then the result is [{'index': '26', 'value': 'productA v4.1.5'}]
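
One thing to be aware of: because matched text is deleted from the paragraph, the index 56 reported for "productA v4.1" is a position in the shortened paragraph, not in the original one. If positions in the original text are needed, one possible alternative (a sketch, not the poster's method) is to build a single regex alternation with the longest products first and scan the untouched paragraph. The function name find_products and the whitespace lookarounds are assumptions made for this example.

```python
import re

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = ("Troubleshooting steps for productA v4.1.5 ver documents "
             "also has steps for productA v4.1 document ")


def find_products(text, candidates):
    """Return every whitespace-delimited product match with its index in text."""
    if not candidates:
        return []  # an empty alternation would match empty strings
    # Longest alternatives first, so the regex engine tries them before shorter prefixes
    ordered = sorted(candidates, key=len, reverse=True)
    alternation = "|".join(re.escape(c) for c in ordered)
    pattern = re.compile(rf"(?<!\S)(?:{alternation})(?!\S)")
    return [{"index": m.start(), "value": m.group()} for m in pattern.finditer(text)]


print(find_products(paragraph, products))
# [{'index': 26, 'value': 'productA v4.1.5 ver'}, {'index': 75, 'value': 'productA v4.1'}]
```

Unlike the code above, this returns the indexes as integers and reports every occurrence, not just the first one per product.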

