How to do an exact match in a paragraph using a list of strings in Python

I have a list of strings that contain version numbers, and I want to find exact matches for those strings in a paragraph. Example: products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]

paragraph = "Troubleshooting steps for productA v4.1.5 documents"

In this case, if I use a filter like the one below:

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 documents"
def checkIfProdExist(x):
  if paragraph.find(x) != -1:
    return True
  else:
    return False
results = filter(checkIfProdExist, products)
print(list(results))

The output of the code above is ['productA v4.1', 'productA v4.1.5']

How can I find only "productA v4.1.5" in the paragraph and get its index?

You want the longest match, so you should start matching with the longest string first:

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
productsSorted = sorted(products, key=len, reverse=True)
paragraph = "Troubleshooting steps for productA v4.1.5 documents"


def checkIfProdExist(x):
    if paragraph.find(x) != -1:
        return True
    else:
        return False


def checkIfProdExistAndExit(prods):
    # stop immediately after the first match!
    for x in prods:
        if paragraph.find(x) != -1:
            return x


results = filter(checkIfProdExist, productsSorted)
print(list(results)[0])
results = checkIfProdExistAndExit(productsSorted)
print(results)

Output:

productA v4.1.5
productA v4.1.5
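
Since the question also asks for the index, the longest-first idea can be sketched as a single helper that returns both the matched string and its position. This is only a minimal sketch: the name find_longest_product and its parameters are hypothetical, and it is still a plain substring search, so it would also accept "productA v4.1.5" inside a longer token such as "productA v4.1.55".

```python
products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 documents"


def find_longest_product(text, candidates):
    """Return (candidate, index) for the longest candidate found in text, or None.

    Hypothetical helper: tries candidates from longest to shortest and stops
    at the first one that occurs as a substring of text.
    """
    for candidate in sorted(candidates, key=len, reverse=True):
        index = text.find(candidate)
        if index != -1:
            return candidate, index
    return None


print(find_longest_product(paragraph, products))  # ('productA v4.1.5', 26)
```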

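It sounds like you basically want the start and end of the match to be either an end of the paragraph or a transition to a whitespace character (the end of a "word", but sadly the regex definition of a word character does not include ".", so you can't use a \b-based test).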
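
To illustrate why \b is not enough here, the sketch below compares a \b-based pattern with one that requires whitespace (or the start or end of the string) on both sides. The lookarounds (?<!\S) and (?!\S) are one possible way to express that requirement and are an assumption of this example, not something taken from the question.

```python
import re

paragraph = "Troubleshooting steps for productA v4.1.5 documents"
product = "productA v4.1"

# \b sees a word boundary between "1" and ".", so the shorter name still matches.
word_boundary = re.search(rf"\b{re.escape(product)}\b", paragraph)

# Requiring whitespace or a string edge on both sides rejects the partial match,
# because the character after "v4.1" is "." rather than whitespace.
whitespace_bounded = re.search(rf"(?<!\S){re.escape(product)}(?!\S)", paragraph)

print(word_boundary is not None)       # True
print(whitespace_bounded is not None)  # False
```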

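The simplest approach here is to split the line on whitespace and check whether your string's words appear in the resulting list (using some variation of finding a sublist in a list):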

def list_contains_sublist(haystack, needle):
    firstn, *restn = needle  # Extracted up front for efficiency
    for i, x in enumerate(haystack, 1):
        if x == firstn and haystack[i:i+len(restn)] == restn:
            return True
    return False

para_words = paragraph.split()
def checkIfProdExist(x):
    return list_contains_sublist(para_words, x.split())
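
As a rough illustration of how the word-list approach behaves on the question's data, here is a variant that returns the position of the first matching word instead of a boolean. The name find_sublist is hypothetical, and the value it returns is a word index into the split list, not a character index into the paragraph.

```python
def find_sublist(haystack, needle):
    """Return the index in haystack where needle starts, or -1 if absent."""
    n = len(needle)
    if n == 0:
        return -1  # Assumption: an empty needle counts as "not found".
    for start in range(len(haystack) - n + 1):
        if haystack[start:start + n] == needle:
            return start
    return -1


products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 documents"
para_words = paragraph.split()

# Map each product to the word position where it starts (-1 means no exact match).
matches = {p: find_sublist(para_words, p.split()) for p in products}
print(matches)
# {'productA v4.1': -1, 'productA v4.1.5': 3, 'productA v4.1.5 ver': -1}
```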

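If you also want the index, or need exact whitespace matching, it gets trickier (.split() does not preserve runs of whitespace, so you can't reconstruct the index, and if you search the whole string for the index you may get the wrong one when the substring occurs twice but only the second occurrence meets your requirements). At that point, I'd probably just go with a regex: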

import re

def checkIfProdExist(x):
    m = re.search(fr'(^|\s){re.escape(x)}(?=\s|$)', paragraph)
    if m:
        return m.end(1)  # After the matched space, if any
    return -1  # Or omit return for implicit None, or raise an exception, or whatever

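Note that, as written, this won't work with your filter (if the paragraph starts with the substring it returns 0, which is falsy, while the -1 returned on failure is truthy). You could have it return None on failure and a tuple of the indices on success, so that it works both as a boolean test and when the indices are needed, e.g. (using the walrus operator, Python 3.8+, for fun):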

def checkIfProdExist(x):
    if m := re.search(fr'(?:^|\s)({re.escape(x)})(?=\s|$)', paragraph):
        return m.span(1)  # We're capturing match directly to get end of match easily, so we stop capturing leading space and just use span of capture
    # Implicitly returns falsy None on failure
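
A usage sketch may help show why returning a span tuple works for both purposes. The version below passes the text as a parameter instead of reading a global; the name find_product_span is hypothetical. A non-empty tuple such as (0, 15) is truthy even when the match starts at index 0, which is what makes it safe to use as a filter condition.

```python
import re


def find_product_span(product, text):
    """Return (start, end) of a whitespace-delimited match of product, or None."""
    pattern = rf"(?:^|\s)({re.escape(product)})(?=\s|$)"
    if m := re.search(pattern, text):  # Walrus operator needs Python 3.8+
        return m.span(1)
    return None


products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 documents"

# Used as a boolean test: only the exact product survives.
found = [p for p in products if find_product_span(p, paragraph)]
print(found)  # ['productA v4.1.5']

# Used for indices: the span slices the match back out of the paragraph.
start, end = find_product_span("productA v4.1.5", paragraph)
print(start, end, paragraph[start:end])  # 26 41 productA v4.1.5
```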

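I solved my use case by sorting the product list longest-first and removing each matched product's occurrences from the paragraph. My code is below. It may or may not be the right approach, but it served my purpose. It works even when the product list is empty and when the paragraph contains many strings that match the product list. Thanks for all the research and help!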

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]

# Sort by length in descending order so that longer strings come first
products = sorted(products, key=len, reverse=True)

paragraph = "Troubleshooting steps for productA v4.1.5 ver documents also has steps for productA v4.1 document "


def checkIfProdExist(x):
  if paragraph.find(x) != -1:
    return True
  else:
    return False

#filter all matched strings
prodResults = list(filter(checkIfProdExist, products))

print(prodResults)
# At this stage the result is ['productA v4.1.5 ver', 'productA v4.1.5', 'productA v4.1']

finalResult = []

# Loop through the matched strings, longest first
for prd in prodResults:
  index = paragraph.find(prd)
  if index != -1:
    # Record the index of the first occurrence of this string
    finalResult.append({"index": str(index), "value": prd})

    # Remove every occurrence of the matched string so that shorter strings cannot match inside it, i.e. removing "productA v4.1.5 ver" leaves no chance to match "productA v4.1.5" or "productA v4.1" at the same place
    paragraph = paragraph.replace(prd, "")

print(finalResult)
# Final result is [{'index': '26', 'value': 'productA v4.1.5 ver'}, {'index': '56', 'value': 'productA v4.1'}]
# Note: each index is measured in the paragraph as it stands after earlier removals, so 56 here corresponds to 75 in the original paragraph
# If the paragraph is "Troubleshooting steps for productA v4.1.5 documents", the result is [{'index': '26', 'value': 'productA v4.1.5'}]
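
If indices relative to the original paragraph are needed, one possible alternative (a sketch, not the code from this answer) is to combine the longest-first ordering with a single regular expression and leave the paragraph untouched. The name find_products is hypothetical. Unlike the code above, it reports every occurrence, returns integer indices, and also requires whitespace or a string edge around each match.

```python
import re


def find_products(text, candidates):
    """Return [{'index': int, 'value': str}, ...] for whitespace-delimited matches.

    Alternatives in a regex are tried left to right, so putting the longest
    candidates first makes the longest product name win at each position.
    """
    if not candidates:
        return []  # An empty alternation would match empty strings, so bail out.
    ordered = sorted(candidates, key=len, reverse=True)
    alternation = "|".join(re.escape(c) for c in ordered)
    pattern = re.compile(rf"(?<!\S)(?:{alternation})(?!\S)")
    return [{"index": m.start(), "value": m.group()} for m in pattern.finditer(text)]


products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 ver documents also has steps for productA v4.1 document "

results = find_products(paragraph, products)
print(results)
# [{'index': 26, 'value': 'productA v4.1.5 ver'}, {'index': 75, 'value': 'productA v4.1'}]
```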
