How to do an exact match in a paragraph using a list of strings in Python

I have a list of strings, each containing a version number, and I would like to find exact matches for them in a paragraph. Example:

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]

paragraph = "Troubleshooting steps for productA v4.1.5 documents"

In this case, if I use a filter like the following:

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 documents"
def checkIfProdExist(x):
  if paragraph.find(x) != -1:
    return True
  else:
    return False
results = filter(checkIfProdExist, products)
print(list(results))

The output of the above code is ['productA v4.1', 'productA v4.1.5'].

How can I match only 'productA v4.1.5' in the paragraph and get its index?

You want to find the longest match, so you should try the longest strings first:

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
productsSorted = sorted(products, key=len, reverse=True)
paragraph = "Troubleshooting steps for productA v4.1.5 documents"


def checkIfProdExist(x):
    if paragraph.find(x) != -1:
        return True
    else:
        return False


def checkIfProdExistAndExit(prods):
    # stop immediately after the first match!
    for x in prods:
        if paragraph.find(x) != -1:
            return x


results = filter(checkIfProdExist, productsSorted)
print(list(results)[0])
results = checkIfProdExistAndExit(productsSorted)
print(results)

Out:

productA v4.1.5
productA v4.1.5
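
Since the question also asks for the index, the longest-first idea can be extended to return the position along with the match. The following is a minimal sketch; the helper name `find_longest_match` is hypothetical and not part of the original answer. It is still plain substring matching, so "productA v4.1.5" would also be found inside a longer token such as "productA v4.1.55".

```python
def find_longest_match(paragraph, products):
    """Return (index, product) for the longest product found, or None."""
    # Try longer strings first so "productA v4.1.5" wins over "productA v4.1".
    for product in sorted(products, key=len, reverse=True):
        index = paragraph.find(product)
        if index != -1:
            return index, product
    return None


products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 documents"
print(find_longest_match(paragraph, products))  # (26, 'productA v4.1.5')
```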

It sounds like you want the match to begin and end either at the start or end of the paragraph or at a transition to a whitespace character, i.e. at the edge of a "word". Sadly, the regex definition of a word character excludes characters like . , so you can't use tests based on \b .
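
To illustrate why \b is not enough here, the short sketch below compares a \b-based pattern with one that requires whitespace or the string edges around the match. The lookaround form `(?<!\S)...(?!\S)` is one way to express that condition and is an assumption of this example, not necessarily the pattern the answer goes on to use.

```python
import re

paragraph = "Troubleshooting steps for productA v4.1.5 documents"

# "\b" sees a boundary between "1" (a word character) and "." (a non-word
# character), so the shorter version still matches inside "v4.1.5".
with_word_boundary = re.search(r"\bproductA v4\.1\b", paragraph)
print(with_word_boundary is not None)  # True

# Requiring whitespace or the string edges on both sides rejects it.
with_whitespace = re.search(r"(?<!\S)productA v4\.1(?!\S)", paragraph)
print(with_whitespace is not None)  # False
```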

The simplest approach is to split the paragraph on whitespace and check whether the words of your string occur consecutively in the resulting list (using some variation on finding a sublist in a list):

def list_contains_sublist(haystack, needle):
    firstn, *restn = needle  # Extracted up front for efficiency
    for i, x in enumerate(haystack, 1):
        if x == firstn and haystack[i:i+len(restn)] == restn:
            return True
    return False

para_words = paragraph.split()
def checkIfProdExist(x):
    return list_contains_sublist(para_words, x.split())
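
As a self-contained illustration of the word-list approach, the sketch below uses a slightly different, hypothetical helper, `find_sublist`, which returns the word position instead of a boolean. Treating an empty needle as "not found" is an assumption made for this example.

```python
def find_sublist(haystack, needle):
    """Return the index in haystack where needle starts, or -1."""
    n = len(needle)
    if n == 0:
        return -1  # assumption: an empty needle counts as "not found"
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 documents"
para_words = paragraph.split()

# Only products whose words appear consecutively as whole words are kept.
matches = [p for p in products if find_sublist(para_words, p.split()) != -1]
print(matches)  # ['productA v4.1.5']
```

Note that the returned position is a word index (3 for "productA v4.1.5" here), not a character offset into the paragraph.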

If you want the index too, or need precise whitespace matching, it's trickier: .split() doesn't preserve runs of whitespace, so you can't reconstruct the index, and searching the whole string for the substring may give the wrong index if the substring occurs twice but only the second occurrence meets your requirements. At that point, I'd probably just go with a regex:

import re

def checkIfProdExist(x):
    m = re.search(fr'(^|\s){re.escape(x)}(?=\s|$)', paragraph)
    if m:
        return m.end(1)  # After the matched space, if any
    return -1  # Or omit return for implicit None, or raise an exception, or whatever

Note that as written, this won't work with your filter: if the paragraph begins with the substring, it returns 0 , which is falsy, while the -1 returned on failure is truthy. You might instead have it return None on failure and a tuple of the indices on success, so it works for both boolean and index-demanding cases, e.g. (using the walrus operator, Python 3.8+, for fun):

def checkIfProdExist(x):
    if m := re.search(fr'(?:^|\s)({re.escape(x)})(?=\s|$)', paragraph):
        return m.span(1)  # We're capturing match directly to get end of match easily, so we stop capturing leading space and just use span of capture
    # Implicitly returns falsy None on failure
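
A possible way to use this with the data from the question is sketched below. The function is renamed to the hypothetical `find_product` and takes the paragraph as a parameter so the example is self-contained. Like the version above, it requires Python 3.8+.

```python
import re


def find_product(paragraph, product):
    """Return (start, end) of a whitespace-delimited match, or None."""
    pattern = fr"(?:^|\s)({re.escape(product)})(?=\s|$)"
    if m := re.search(pattern, paragraph):
        return m.span(1)
    return None


products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 documents"

# A span tuple is truthy even when it starts at 0, so it is safe to test.
found = {p: span for p in products if (span := find_product(paragraph, p))}
print(found)  # {'productA v4.1.5': (26, 41)}
```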

I solved my use case by sorting the products list longest first and removing each matched product's occurrences from the paragraph before trying the shorter ones. The code is below. It may or may not be the right approach, but it served my purpose. It works even when the products list has any number of products and the paragraph contains many matching strings from the list. I appreciate all of your research and help!

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]

# Sort longest first so that longer strings are tried before their prefixes
products = sorted(products, key=len, reverse=True)

paragraph = "Troubleshooting steps for productA v4.1.5 ver documents also has steps for productA v4.1 document "


def checkIfProdExist(x):
  if paragraph.find(x) != -1:
    return True
  else:
    return False

#filter all matched strings
prodResults = list(filter(checkIfProdExist, products))

print(prodResults)
# At this point the result is ['productA v4.1.5 ver', 'productA v4.1.5', 'productA v4.1']

finalResult = []

# Loop through the matched strings, longest first
for prd in prodResults:
  if paragraph.find(prd) != -1:
    # Record the index of the first occurrence of this match
    finalResult.append({"index":str(paragraph.find(prd)),"value":prd})

    # Then remove every occurrence of the matched string so that shorter strings cannot match inside it, e.g. removing "productA v4.1.5 ver" prevents "productA v4.1.5" and "productA v4.1" from matching at the same place
    paragraph = paragraph.replace(prd,"")
    
print(finalResult)
# Final result: [{'index': '26', 'value': 'productA v4.1.5 ver'}, {'index': '56', 'value': 'productA v4.1'}]
# If the paragraph is "Troubleshooting steps for productA v4.1.5 documents", the result is [{'index': '26', 'value': 'productA v4.1.5'}]
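
One thing to be aware of with the approach above: because matched text is removed from the paragraph as the loop runs, later indices refer to the shortened paragraph. The 56 reported for 'productA v4.1' corresponds to offset 75 in the original text. If offsets into the original paragraph are needed, one alternative is a single regex alternation with the longest products first. This is a sketch under that assumption, not the method used above, and it stores indices as integers rather than strings.

```python
import re

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = ("Troubleshooting steps for productA v4.1.5 ver documents "
             "also has steps for productA v4.1 document ")

# Longest alternatives first, so the regex engine tries them first at each position.
alternatives = "|".join(
    re.escape(p) for p in sorted(products, key=len, reverse=True)
)
# Require whitespace or the string edges on both sides of a match.
pattern = re.compile(fr"(?<!\S)(?:{alternatives})(?!\S)")

# finditer yields non-overlapping matches with offsets into the original text.
final_result = [
    {"index": m.start(), "value": m.group()} for m in pattern.finditer(paragraph)
]
print(final_result)
# [{'index': 26, 'value': 'productA v4.1.5 ver'}, {'index': 75, 'value': 'productA v4.1'}]
```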
