
How to do an exact match in a paragraph using a list of strings in Python

I have a list of strings that include version numbers, and I would like to find exact matches for these strings in a paragraph. Example: products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]

paragraph = "Troubleshooting steps for productA v4.1.5 documents"

In this case, if I use filter as follows:

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 documents"
def checkIfProdExist(x):
  if paragraph.find(x) != -1:
    return True
  else:
    return False
results = filter(checkIfProdExist, products)
print(list(results))

The output of the above code is ['productA v4.1', 'productA v4.1.5'].

How can I make it match only 'productA v4.1.5' in the paragraph and get its index?

You want to find the longest match, so you should start matching using the longest string first:

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
productsSorted = sorted(products, key=len, reverse=True)
paragraph = "Troubleshooting steps for productA v4.1.5 documents"


def checkIfProdExist(x):
    if paragraph.find(x) != -1:
        return True
    else:
        return False


def checkIfProdExistAndExit(prods):
    # stop immediately after the first match!
    for x in prods:
        if paragraph.find(x) != -1:
            return x


results = filter(checkIfProdExist, productsSorted)
print(list(results)[0])
results = checkIfProdExistAndExit(productsSorted)
print(results)

Out:

productA v4.1.5
productA v4.1.5
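
Since the question also asks for the index, here is a minimal sketch of the same longest-first idea that returns the position along with the matched string. The function name find_longest_match is just an illustration, not something from the code above, and it still does plain substring matching.

```python
def find_longest_match(paragraph, products):
    """Return (index, product) for the longest product found, or None."""
    # Try the longest candidates first so a shorter prefix cannot win
    for product in sorted(products, key=len, reverse=True):
        index = paragraph.find(product)
        if index != -1:
            return index, product
    return None


products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 documents"
print(find_longest_match(paragraph, products))  # (26, 'productA v4.1.5')
```

Returning None when nothing matches avoids the IndexError that list(results)[0] would raise on an empty result.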

Sounds like you basically want each match to be bounded on both sides by either a space character or the start/end of the paragraph (the edges of a "word", though sadly, the regex definition of a word character excludes stuff like ., so you can't use tests based on \b).
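
To see the \b problem concretely, here is a small check using the paragraph from the question. The variable names are only for illustration.

```python
import re

paragraph = "Troubleshooting steps for productA v4.1.5 documents"

# "." is not a word character, so \b sees a boundary between "1" and ".",
# and the shorter product still matches inside "productA v4.1.5"
pattern = r"\b" + re.escape("productA v4.1") + r"\b"
m = re.search(pattern, paragraph)
print(m.group(0), m.span())  # productA v4.1 (26, 39)
```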

The simplest approach here is to split the paragraph on whitespace and check whether the words of your string occur as a contiguous run in the resulting list (using some variation on finding a sublist in a list):

def list_contains_sublist(haystack, needle):
    firstn, *restn = needle  # Extracted up front for efficiency
    for i, x in enumerate(haystack, 1):
        if x == firstn and haystack[i:i+len(restn)] == restn:
            return True
    return False

para_words = paragraph.split()
def checkIfProdExist(x):
    return list_contains_sublist(para_words, x.split())
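
If a word position is enough, the same idea can report where the run of words starts. This is a rough sketch with a hypothetical helper, find_word_run, which is a variation on the function above rather than part of it. Note that it gives a word index, not a character index.

```python
def find_word_run(words, target):
    """Return the index in `words` where the run `target` starts, or -1."""
    n = len(target)
    if n == 0:
        return -1
    for i in range(len(words) - n + 1):
        # Compare a slice of the same length against the target words
        if words[i:i + n] == target:
            return i
    return -1


products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 documents"
para_words = paragraph.split()

positions = {p: find_word_run(para_words, p.split()) for p in products}
print(positions)
# {'productA v4.1': -1, 'productA v4.1.5': 3, 'productA v4.1.5 ver': -1}
```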

If you want the index too, or need precise whitespace matching, it's trickier: .split() won't preserve runs of whitespace, so you can't reconstruct the index from the word list, and searching the whole string can give you the wrong index if the substring occurs twice but only the second occurrence meets your requirements. At that point, I'd probably just go with a regex:

import re

def checkIfProdExist(x):
    m = re.search(fr'(^|\s){re.escape(x)}(?=\s|$)', paragraph)
    if m:
        return m.end(1)  # After the matched space, if any
    return -1  # Or omit return for implicit None, or raise an exception, or whatever

Note that as written, this won't work with your filter: if the paragraph begins with the substring, it returns 0, which is falsy, and the -1 it returns on failure is truthy. You might have it return None on failure and a tuple of the indices on success so it works both for boolean and index-demanding cases, e.g. (demonstrating walrus use for 3.8+ for fun):

def checkIfProdExist(x):
    if m := re.search(fr'(?:^|\s)({re.escape(x)})(?=\s|$)', paragraph):
        # Capture the match itself (not the leading space) so that
        # span(1) gives both the start and end of the match directly
        return m.span(1)
    # Implicitly returns falsy None on failure
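
For completeness, here is a self-contained sketch of how that version might be used on the example from the question. find_product_span is a hypothetical name, and it takes the paragraph as a parameter instead of reading a global. It avoids the walrus operator so it also runs on older Python 3 versions.

```python
import re


def find_product_span(paragraph, product):
    """Return (start, end) of a whitespace-bounded match, or None."""
    m = re.search(rf"(?:^|\s)({re.escape(product)})(?=\s|$)", paragraph)
    return m.span(1) if m else None


products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 documents"

spans = {p: find_product_span(paragraph, p) for p in products}
found = {p: s for p, s in spans.items() if s is not None}
print(found)  # {'productA v4.1.5': (26, 41)}
```

A match at the very start of the paragraph comes back as a tuple such as (0, 13), which is truthy, so the result works as a filter predicate too.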

I solved my use case by sorting the products list longest first and stripping the occurrences of each matched product from the paragraph. My code is below. It may or may not be the right approach, but it served my purpose. It works even when the products list contains any number of products and the paragraph contains many matching strings from the list. I appreciate all of your research and help!

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]

# Sort longest first so that longer strings are checked before shorter ones
products = sorted(products, key=len, reverse=True)

paragraph = "Troubleshooting steps for productA v4.1.5 ver documents also has steps for productA v4.1 document "


def checkIfProdExist(x):
  if paragraph.find(x) != -1:
    return True
  else:
    return False

# Filter all matched strings
prodResults = list(filter(checkIfProdExist, products))

print(prodResults)
# At this stage the result is ['productA v4.1.5 ver', 'productA v4.1.5', 'productA v4.1']

finalResult = []

# Loop through the matched strings
for prd in prodResults:
  if paragraph.find(prd) != -1:
    # Record the index of the first occurrence of this match
    finalResult.append({"index":str(paragraph.find(prd)),"value":prd})

    # Once the index is recorded, remove all occurrences of the matched string
    # so that shorter strings cannot match inside it, i.e. removing
    # 'productA v4.1.5 ver' prevents 'productA v4.1.5' and 'productA v4.1'
    # from matching at that position
    paragraph = paragraph.replace(prd,"")

print(finalResult)
# Final result is [{'index': '26', 'value': 'productA v4.1.5 ver'}, {'index': '56', 'value': 'productA v4.1'}]
# If the paragraph is "Troubleshooting steps for productA v4.1.5 documents", the result is [{'index': '26', 'value': 'productA v4.1.5'}]
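
One thing to be aware of with the approach above: because matched text is deleted as it goes, later indices refer to the shortened paragraph (56 for 'productA v4.1' here, whereas it sits at 75 in the original text). If positions in the original paragraph are needed, a possible variation is to leave the paragraph untouched and remember which spans are already taken. This is only a sketch. find_products is a hypothetical name, it reports every non-overlapping occurrence rather than just the first, and it still does plain substring matching.

```python
def find_products(paragraph, products):
    """Return sorted (index, product) pairs, preferring longer products."""
    claimed = []  # (start, end) spans already taken by longer products
    results = []
    for product in sorted(products, key=len, reverse=True):
        start = paragraph.find(product)
        while start != -1:
            end = start + len(product)
            # Accept the occurrence only if it overlaps no claimed span
            if all(end <= s or start >= e for s, e in claimed):
                claimed.append((start, end))
                results.append((start, product))
            start = paragraph.find(product, start + 1)
    return sorted(results)


products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 ver documents also has steps for productA v4.1 document "
print(find_products(paragraph, products))
# [(26, 'productA v4.1.5 ver'), (75, 'productA v4.1')]
```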
