Python regex: find the letters of the word "document" in order within a given string and replace them with the empty string

Question

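If the word "document" can be produced by deleting characters from a given string (that is, "document" is a subsequence of the string), the letters spelling "document" are removed from the string. If letters can then be deleted from the resulting string to leave "document", the letters spelling "document" are removed from that string as well. This continues until no letters can be deleted to leave "document", at which point the final string is returned.

For example, if the string is: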

documdocumententer
     ^^^^^^^^

"document" can be formed by deleting "docum" at the beginning and "enter" at the end, so the "document" in the middle is removed, leaving

documenter
^^^^^^^^

The process is then repeated, leaving

er

Since "er" does not contain "document", "er" is returned.

Similarly, if the string is:

adbocucdmefgnhtj
 ^ ^^^  ^^  ^ ^

the letters spelling "document" are removed, leaving:

abcdfghj

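This string is returned because it does not contain "document".

Examples

doconeument is converted to one
documdocumentent is converted to the empty string
documentone is converted to one
pydocdbument is converted to pydb
documentdocument is converted to the empty string

How can I get the desired string from a given string (only for the specific word "document")?

I tried this with a Python for loop, but I don't know how to do it using only a regular expression. My code is below: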

import re
def fun1(text):
    print('original string:', text)
    pattern = r"((d|D).*o.*c.*u.*m.*e.*n.*t){1,}"
    result = re.sub(pattern, '', text)
    if len(result) == len(text):
        print('return original string because it does not contain "document" word forward direction:')
        return text

    # if word is containing "document" in forward direction

    temp = []   # for storing letter and its index

    # find each letter and index in "document" word
    search_str = 'document'
    for index in range(len(search_str)):
        # if it is a last letter in "document" that is t
        if index == len(search_str)-1:
            current_letter = search_str[index]
            pattern = r'.*n.*t'

        else:
            next_letter = search_str[index + 1]
            current_letter = search_str[index]
            pattern = rf".*{current_letter}.*{next_letter}"

        result = re.match(pattern, text)
        a, b = result.span()
        if temp:
            # value of last dict in temp list
            val = list(temp[-1].values())[0]
            current_letter = val + text[val:].index(current_letter)
        else:
            # first time when temp list is empty
            current_letter = text[a:b].rindex(current_letter)

        temp.append({search_str[index]: current_letter})

    # now using temp list we remove "document" word at specific index
    text = list(text)

    # create a list of indices in descending order to remove from text
    remove_index_list = [list(i.values())[0] for i in temp]
    remove_index_list.sort(reverse=True)

    for j in remove_index_list:
        text.pop(j)

    final_txt = ''.join(text)
    # to check if text containing or not one more "document" word
    pattern = r"((d|D).*o.*c.*u.*m.*e.*n.*t){1,}"
    result = re.findall(pattern, final_txt)
    if result:
        print('The word again containing "document" in it')
        final_txt = fun1(final_txt)
    return final_txt
print('final_output:', fun1('doconeument'))

Answer 1

I have a solution that uses a regular expression and recursion:

from re import compile

candidates = ["doconeument", "documdocumentent",  "documentone",
              "pydocdbument", "documentdocument", "hansi"]
word = "document"

def strip_word(word, candidate):
    regex = compile("^(.*)" + "(.*)".join(word) + "(.*)$")
    match = regex.match(candidate)
    if not match:
        return candidate
    return strip_word(word, "".join(match.groups()))

for cand in candidates:
    print(f"'{cand}' -> '{strip_word(word, cand)}'")

Edit: corrected the code (the first two lines of the function had been left outside it).
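The same idea can also be written as a loop instead of recursion. The following is only a sketch: the name `strip_word_iter` is hypothetical, and the use of `re.escape` is an added assumption so that a `word` containing regex metacharacters would still be treated literally. The pattern is built once rather than on every call.

```python
import re


def strip_word_iter(word, candidate):
    """Repeatedly remove the letters of `word` (in order) from `candidate`."""
    # Builds ^(.*)d(.*)o(.*)c ... n(.*)t(.*)$ for word == "document".
    # re.escape is an assumption added here; it is a no-op for plain letters.
    regex = re.compile("^(.*)" + "(.*)".join(map(re.escape, word)) + "(.*)$")
    while True:
        match = regex.match(candidate)
        if match is None:
            # The letters of `word` no longer appear in order: done.
            return candidate
        # Keep only what the (.*) groups captured, dropping the 8 letters.
        candidate = "".join(match.groups())


if __name__ == "__main__":
    for cand in ["doconeument", "documdocumentent", "documentone",
                 "pydocdbument", "documentdocument", "hansi"]:
        print(f"'{cand}' -> '{strip_word_iter('document', cand)}'")
```

A loop avoids any concern about recursion depth on very long inputs, at the cost of being slightly less compact than the recursive version.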

Answer 2

If the given string does not match the regular expression:

r'^([a-z]*)d([a-z]*)o([a-z]*)c([a-z]*)u([a-z]*)m([a-z]*)e([a-z]*)n([a-z]*)t([a-z]*)$'

the string is returned. If the regular expression does match the string, the string:

"\1\2\3\4\5\6\7\8\9"

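is formed, and an attempt is made to match that string against the regular expression. This process is repeated until there is no match, at which point the last string tested is returned. Note that each string generated this way is 8 characters shorter than the previous one.

Demo, step 1

Demo, step 2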

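If the regular expression matches the string, capture group 1 contains the substring before the "d" of "document", capture group 2 contains the substring between the "d" and the "o", and so on, with capture group 9 containing the substring after the "t". Some or all of these substrings may be empty.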
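To make the capture groups concrete, here is a small illustration using one of the strings from the question. The names `PATTERN` and `match` are chosen only for this sketch.

```python
import re

# The regular expression from this answer, compiled once.
PATTERN = re.compile(
    r'^([a-z]*)d([a-z]*)o([a-z]*)c([a-z]*)u([a-z]*)m([a-z]*)e([a-z]*)n([a-z]*)t([a-z]*)$'
)

match = PATTERN.match("pydocdbument")

# Nine groups: text before "d", between "d" and "o", ..., after "t".
print(match.groups())
# ('py', '', '', 'db', '', '', '', '', '')

# Concatenating the groups gives the string with "document" removed.
print("".join(match.groups()))
# pydb
```

Here only groups 1 and 4 are non-empty: "py" precedes the "d", and "db" sits between the "c" and the "u".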

I'll leave it to the OP to write the Python code needed to implement this algorithm.
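For readers who want a starting point, one possible way to express the algorithm is sketched below. The function name `remove_document` is hypothetical, and using `subn` to detect whether a substitution happened is just one choice among several. Because the pattern uses `[a-z]*`, a string containing uppercase letters, digits, or other characters never matches and is returned unchanged.

```python
import re

# The regular expression and replacement string described in this answer.
PATTERN = re.compile(
    r'^([a-z]*)d([a-z]*)o([a-z]*)c([a-z]*)u([a-z]*)m([a-z]*)e([a-z]*)n([a-z]*)t([a-z]*)$'
)
REPLACEMENT = r'\1\2\3\4\5\6\7\8\9'


def remove_document(text):
    """Strip the letters of "document" from `text` until no match remains."""
    while True:
        # subn returns the new string and the number of substitutions made.
        # The pattern is anchored, so the count is either 0 or 1.
        text, count = PATTERN.subn(REPLACEMENT, text)
        if count == 0:
            return text


if __name__ == "__main__":
    for s in ["doconeument", "documdocumentent", "documentone",
              "pydocdbument", "documentdocument"]:
        print(repr(s), "->", repr(remove_document(s)))
```

Each successful pass shortens the string by exactly 8 characters, so the loop is guaranteed to terminate.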

Answer 1
8 2020-05-30 20:58:20

Answer 2
0 2020-05-31 13:24:35