How to add an if condition in Python's re.sub

Question

I am using the following code to replace any of the strings in words with words[0] in the given sentences.

import re
sentences = ['industrial text minings', 'i love advanced data minings and text mining']

words = ["data mining", "advanced data mining", "data minings", "text mining"]


start_terms = sorted(words, key=lambda x: len(x), reverse=True)
start_re = "|".join(re.escape(item) for item in start_terms)

results = []

for sentence in sentences:
    for terms in words:
        if terms in sentence:
            result = re.sub(start_re, words[0], sentence)
            results.append(result)
            break

print(results)

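My expected output is:

['industrial text minings', 'i love data mining and data mining']

However, what I get is:

['industrial data minings', 'i love data mining and data mining']

In the first sentence, text minings is not in words. However, the words list contains "text mining", so the condition "text mining" in "industrial text minings" evaluates to True. After the replacement, "text mining" becomes "data mining" and the trailing "s" stays in place. I want to avoid this.

Therefore, I am wondering whether there is a way to use an if condition in re.sub to check whether the next character is a space. If it is a space, do the replacement; otherwise, leave the text alone.

I am also happy with other solutions that could resolve my issue.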
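For reference, re.sub also accepts a function as its replacement argument, which is the most literal way to put an if condition inside the substitution. The following is only a minimal sketch: the helper name replace_if_standalone is hypothetical, and it assumes that "followed by a space or the end of the string" is the intended condition.

```python
import re

sentences = ['industrial text minings', 'i love advanced data minings and text mining']
words = ["data mining", "advanced data mining", "data minings", "text mining"]

# Longest terms first, so longer phrases are tried before their prefixes.
start_terms = sorted(words, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(item) for item in start_terms))

def replace_if_standalone(match):
    """Hypothetical helper: replace only if a space or the end of the string follows."""
    text = match.string
    end = match.end()
    if end == len(text) or text[end] == " ":
        return words[0]
    # Condition not met: put the matched text back unchanged.
    return match.group(0)

results = [pattern.sub(replace_if_standalone, sentence) for sentence in sentences]
print(results)
```

Note that a match that is put back unchanged is still consumed, so "data minings" inside "advanced data minings" is not replaced by this sketch. Expressing the condition in the pattern itself (a lookahead or a word boundary) lets the regex engine try the other alternatives instead.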

Answer 1

I modified your code a bit:

# Using Python 3.6.1
import re
sentences = ['industrial text minings and data minings and data', 'i love advanced data mining and text mining as data mining has become a trend']
words = ["data mining", "advanced data mining", "data minings", "text mining", "data", 'text']

# Sort by length
start_terms = sorted(words, key=len, reverse=True)

results = []

# Loop through sentences
for sentence in sentences:
    # Loop through sorted words to replace
    result = sentence
    for term in start_terms:
        # Use exact word matching
        exact_regex = r'\b' + re.escape(term) + r'\b'
        # Replace matches with blank space (to avoid priority conflicts)
        result = re.sub(exact_regex, " ", result)
    # Replace inserted blank spaces with "data mining"
    blank_regex = r'^\s(?=\s)|(?<=\s)\s$|(?<=\s)\s(?=\s)'
    result = re.sub(blank_regex, words[0], result)
    results.append(result)
# Print sentences
print(results)

Output:

['industrial data mining minings and data mining and data mining', 'i love data mining and data mining as data mining has become a trend']

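The regular expressions can be a bit confusing, so here is a quick breakdown:

\bword\b matches the exact phrase or word, because \b is a word boundary.

^\s(?=\s) matches a space at the start of the string that is followed by another space.

(?<=\s)\s$ matches a space at the end of the string that is preceded by another space.

(?<=\s)\s(?=\s) matches a space that has whitespace on both sides.

For more information on positive lookbehind (?<=...) and positive lookahead (?=...), see this Regex tutorial.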
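To see each of the three alternatives in isolation, the sketch below runs them on a small made-up string and marks every replaced space with an underscore. The sample string and the underscore marker are assumptions chosen purely for illustration.

```python
import re

# Made-up input: two leading spaces, three spaces in the middle, two trailing spaces.
text = "  a   b  "

leading = re.sub(r'^\s(?=\s)', "_", text)       # first space, followed by whitespace
trailing = re.sub(r'(?<=\s)\s$', "_", text)     # last space, preceded by whitespace
middle = re.sub(r'(?<=\s)\s(?=\s)', "_", text)  # space with whitespace on both sides

# The full pattern from the answer combines the three alternatives.
blank_regex = r'^\s(?=\s)|(?<=\s)\s$|(?<=\s)\s(?=\s)'
combined = re.sub(blank_regex, "_", text)

for label, value in [("leading", leading), ("trailing", trailing),
                     ("middle", middle), ("combined", combined)]:
    print(label, repr(value))
```

Only the space that was inserted in place of a removed term is replaced; the ordinary separator spaces next to it are kept, which is why the surrounding words stay correctly spaced.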

Answer 2

You can surround the whole regular expression with word boundaries \b:

start_re = "\\b(?:" + "|".join(re.escape(item) for item in start_terms) + ")\\b"

Your regular expression will become:

\b(?:data mining|advanced data mining|data minings|text mining)\b

(?:) denotes a non-capturing group.
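Put together with the data from the question, the word-boundary approach might look like the following minimal sketch. It assumes the same sentences and words as in the question and sorts the terms longest first, as the question's code does.

```python
import re

sentences = ['industrial text minings', 'i love advanced data minings and text mining']
words = ["data mining", "advanced data mining", "data minings", "text mining"]

# Longest terms first, so longer phrases are tried before their prefixes.
start_terms = sorted(words, key=len, reverse=True)

# Wrap the alternation in a non-capturing group with \b on both sides.
start_re = r"\b(?:" + "|".join(re.escape(item) for item in start_terms) + r")\b"

results = [re.sub(start_re, words[0], sentence) for sentence in sentences]
print(results)
```

Here 'industrial text minings' is left untouched, because "text mining" is followed by "s" rather than by a word boundary. In the second sentence the word "advanced" stays, since "advanced data minings" is not in words and only "data minings" matches as a whole term.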

Answer 1
2 2019-01-30 07:34:02

Answer 2
1 accepted 2019-01-30 06:41:54
