How to add an if condition in re.sub in Python

Question

I am using the following code to replace any string from words that occurs in the given sentences with words[0].

import re
sentences = ['industrial text minings', 'i love advanced data minings and text mining']

words = ["data mining", "advanced data mining", "data minings", "text mining"]


start_terms = sorted(words, key=lambda x: len(x), reverse=True)
start_re = "|".join(re.escape(item) for item in start_terms)

results = []

for sentence in sentences:
    for terms in words:
        if terms in sentence:
            result = re.sub(start_re, words[0], sentence)
            results.append(result)
            break

print(results)

My expected output is as follows:

['industrial text minings', 'i love data mining and data mining']

However, what I am getting is:

['industrial data minings', 'i love data mining and data mining']

In the first sentence, "text minings" is not in words. However, the words list does contain "text mining", so the condition "text mining" in "industrial text minings" evaluates to True. After the replacement, "text mining" becomes "data mining" and the trailing 's' stays where it was. I want to avoid such situations.

Therefore, I am wondering if there is a way to use an if condition in re.sub to check whether the next character is a space. If it is a space, do the replacement; otherwise, leave the text alone.

I am also happy with other solutions that could resolve my issue.

Answer 1

I modified your code a bit:

# Using Python 3.6.1
import re
sentences = ['industrial text minings and data minings and data', 'i love advanced data mining and text mining as data mining has become a trend']
words = ["data mining", "advanced data mining", "data minings", "text mining", "data", 'text']

# Sort by length
start_terms = sorted(words, key=len, reverse=True)

results = []

# Loop through sentences
for sentence in sentences:
    # Loop through sorted words to replace
    result = sentence
    for term in start_terms:
        # Use exact word matching
        exact_regex = r'\b' + re.escape(term) + r'\b'
        # Replace matches with blank space (to avoid priority conflicts)
        result = re.sub(exact_regex, " ", result)
    # Replace inserted blank spaces with "data mining"
    blank_regex = r'^\s(?=\s)|(?<=\s)\s$|(?<=\s)\s(?=\s)'
    result = re.sub(blank_regex, words[0] , result)
    results.append(result)
# Print sentences
print(results)

Output:

['industrial data mining minings and data mining and data mining', 'i love data mining and data mining as data mining has become a trend']

The regex can be a bit confusing, so here's a quick breakdown:

\bword\b matches exact phrases/words, since \b is a word boundary.

^\s(?=\s) matches a space at the beginning of the string that is followed by another space.

(?<=\s)\s$ matches a space at the end of the string that is preceded by another space.

(?<=\s)\s(?=\s) matches a space with a space on both sides.

For more info on positive lookbehinds (?<=...) and positive lookaheads (?=...), see a regex tutorial.
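To see what each of the three alternatives in blank_regex picks out, here is a small sketch that runs it on a few toy strings. The "X" marker and the sample inputs are made up for illustration and are not part of the original code.

```python
import re

# Same pattern as in the answer above
blank_regex = r'^\s(?=\s)|(?<=\s)\s$|(?<=\s)\s(?=\s)'

# Three spaces in the middle: only the space with a space on both sides matches
middle = re.sub(blank_regex, "X", "a   b")

# Two leading spaces: the first one is at the start and followed by a space
leading = re.sub(blank_regex, "X", "  b")

# Two trailing spaces: the last one is at the end and preceded by a space
trailing = re.sub(blank_regex, "X", "a  ")

# A single ordinary space between two words is left untouched
single = re.sub(blank_regex, "X", "a b")

print([middle, leading, trailing, single])
# ['a X b', 'X b', 'a X', 'a b']
```

This is why the first pass replaces each matched term with a single space: the removed term leaves a run of spaces behind, and the second pass swaps the inserted space for "data mining" without touching normal word separators.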

Answer 2

You can use a word boundary \b to surround your whole regex:

start_re = "\\b(?:" + "|".join(re.escape(item) for item in start_terms) + ")\\b"

Your regex will become something like:

\b(?:data mining|advanced data mining|data minings|text mining)\b

(?:...) denotes a non-capturing group.
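Putting this together with the data from the question, a minimal sketch could look like the following. It drops the inner "if terms in sentence" loop from the question, on the assumption that re.sub leaving a sentence unchanged when nothing matches is acceptable.

```python
import re

sentences = ['industrial text minings', 'i love advanced data minings and text mining']
words = ["data mining", "advanced data mining", "data minings", "text mining"]

# Longest terms first, so a longer phrase is tried before its shorter prefix
start_terms = sorted(words, key=len, reverse=True)
alternation = "|".join(re.escape(item) for item in start_terms)

# \b on both sides: a term matches only when it is not part of a longer word
start_re = re.compile(r"\b(?:" + alternation + r")\b")

# Sentences without any match come back unchanged
results = [start_re.sub(words[0], sentence) for sentence in sentences]
print(results)
# ['industrial text minings', 'i love advanced data mining and data mining']
```

Note that "advanced" survives in the second sentence: "advanced data minings" is not in words, so only "data minings" is matched there. If that phrase should also collapse to "data mining", it would need to be added to the words list.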

solution1
2 2019-01-30 07:34:02

solution2
1 ACCEPTED 2019-01-30 06:41:54
