Problem splitting a string with a list of separators using a regular expression

Question

I am using this function to split a text into words and separators, keeping both in the result:

import re 

def split_text_in_words(phrase_text, separators=[" "]):
  separator_regex = """({0})""".format("""|""".join(separators))
  return [f for f in re.split(separator_regex,phrase_text) if len(f) > 0]

I call it like this:

>>> split_text_in_words('Mary & his family has a?nice.house at #157, at the beach? Of course! it is great. I owe her 40$ so I plan to pay my debt weekly at 3% interest :) "no comment"', separators=[' ', '\?', '\*', '\.', ',', ';', ':', "'", '"', '-', '\?', '!', '#', '\$', '%', '^', '&'])
['Mary', ' ', '&', ' ', 'his', ' ', 'family', ' ', 'has', ' ', 'a', '?', 'nice', '.', 'house', ' ', 'at', ' ', '#', '157', ',', ' ', 'at', ' ', 'the', ' ', 'beach', '?', ' ', 'Of', ' ', 'course', '!', ' ', 'it', ' ', 'is', ' ', 'great', '.', ' ', 'I', ' ', 'owe', ' ', 'her', ' ', '40', '$', ' ', 'so', ' ', 'I', ' ', 'plan', ' ', 'to', ' ', 'pay', ' ', 'my', ' ', 'debt', ' ', 'weekly', ' ', 'at', ' ', '3', '%', ' ', 'interest', ' ', ':', ')', ' ', '"', 'no', ' ', 'comment', '"']

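So far this looks good, and it is exactly what I want. However, once I add parentheses to the separator list and the text happens to start with a parenthesis, the split does not kick in for that first character: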

>>> split_text_in_words('(as if it was not aware) Mary & his family has a?nice beach* house at #157, at the beach? Of course! it is great. I owe her 40$ so I plan to pay my debt weekly at 3% interest :) "no comment"', separators=[' ', '\?', '\*', '\.', ',', ';', ':', "'", '"', '-', '\?', '!', '#', '\$', '%', '^', '&', '\*', '\(', '\)'])
['(as', ' ', 'if', ' ', 'it', ' ', 'was', ' ', 'not', ' ', 'aware', ')', ' ', 'Mary', ' ', '&', ' ', 'his', ' ', 'family', ' ', 'has', ' ', 'a', '?', 'nice', ' ', 'beach', '*', ' ', 'house', ' ', 'at', ' ', '#', '157', ',', ' ', 'at', ' ', 'the', ' ', 'beach', '?', ' ', 'Of', ' ', 'course', '!', ' ', 'it', ' ', 'is', ' ', 'great', '.', ' ', 'I', ' ', 'owe', ' ', 'her', ' ', '40', '$', ' ', 'so', ' ', 'I', ' ', 'plan', ' ', 'to', ' ', 'pay', ' ', 'my', ' ', 'debt', ' ', 'weekly', ' ', 'at', ' ', '3', '%', ' ', 'interest', ' ', ':', ')', ' ', '"', 'no', ' ', 'comment', '"']

The first parenthesis stays attached to the word. I can work around this by adding a space at the beginning:

>>> split_text_in_words(' (as if it was not aware) Mary & his family has a?nice beach* house at #157, at the beach? Of course! it is great. I owe her 40$ so I plan to pay my debt weekly at 3% interest :) "no comment"', separators=[' ', '\?', '\*', '\.', ',', ';', ':', "'", '"', '-', '\?', '!', '#', '\$', '%', '^', '&', '\*', '\(', '\)'])
[' ', '(', 'as', ' ', 'if', ' ', 'it', ' ', 'was', ' ', 'not', ' ', 'aware', ')', ' ', 'Mary', ' ', '&', ' ', 'his', ' ', 'family', ' ', 'has', ' ', 'a', '?', 'nice', ' ', 'beach', '*', ' ', 'house', ' ', 'at', ' ', '#', '157', ',', ' ', 'at', ' ', 'the', ' ', 'beach', '?', ' ', 'Of', ' ', 'course', '!', ' ', 'it', ' ', 'is', ' ', 'great', '.', ' ', 'I', ' ', 'owe', ' ', 'her', ' ', '40', '$', ' ', 'so', ' ', 'I', ' ', 'plan', ' ', 'to', ' ', 'pay', ' ', 'my', ' ', 'debt', ' ', 'weekly', ' ', 'at', ' ', '3', '%', ' ', 'interest', ' ', ':', ')', ' ', '"', 'no', ' ', 'comment', '"']

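But I am worried about why this happens, and whether the strategy of adding a space at the start (which really is a hack) guarantees that it will not fail in other, more subtle cases.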

Why does this happen, and will the hack/fix of adding a space at the beginning work in general?

Answer 1

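The problem is that the unescaped ^ in the separators becomes part of the split regex. ^ is a special regex metacharacter: it is the start-of-string anchor, not a literal caret.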

You have to escape it like this:

separators=[' ', '\?', '\*', '\.', ',', ';', ':', "'", '"', '-', '\?', '!', '#', '\$', '%', '\^', '&', '\*', '\(', '\)']
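
To make the difference concrete, here is a minimal sketch (not part of the original answer) that compares the unescaped and the escaped caret using only `re.search`. The sample strings are made up for illustration.

```python
import re

# Unescaped, ^ is an anchor: it matches the empty string at position 0.
anchor = re.search("^", "abc")
print(anchor.span())  # (0, 0)

# Escaped, it matches a literal caret character wherever it occurs.
literal = re.search(r"\^", "a^b")
print(literal.span())  # (1, 2)

# With no caret in the text, the escaped pattern finds nothing.
missing = re.search(r"\^", "abc")
print(missing)  # None
```

Because the unescaped form matches an empty string at the very start of the text, it interferes with whatever separator happens to sit at position 0.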

Answer 2

^ marks the beginning of the string, so it must be escaped in the separator list: '\\^'

A more convenient and safer approach is to escape the separators inside the function rather than in the argument:

separator_regex = """({0})""".format("""|""".join(map(re.escape, separators)))
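
Put together with the function from the question, that line might look like the sketch below. It assumes callers now pass plain, unescaped separators, and it swaps the list default for a tuple, which is a small change of my own. The sample text is shortened for illustration.

```python
import re

def split_text_in_words(phrase_text, separators=(" ",)):
    """Split text into words and separators, keeping the separators.

    Assumes every separator is a plain string; re.escape does the escaping.
    """
    separator_regex = "({0})".format("|".join(map(re.escape, separators)))
    return [f for f in re.split(separator_regex, phrase_text) if len(f) > 0]

result = split_text_in_words("(as if) a^b", separators=[" ", "(", ")", "^"])
print(result)
# ['(', 'as', ' ', 'if', ')', ' ', 'a', '^', 'b']
```

Note that with this version an already-escaped separator such as '\?' would be escaped a second time and would then match a backslash followed by a question mark, so the separators should be passed unescaped.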

Answer 3

The problem is the unescaped ^. You should probably escape all of the punctuation, for example:

split_text_in_words(
    '(as if it was not aware) Mary & his family',
    separators=["\\" + c for c in " ?*.,;:'\"-!#$%^&()"]
)
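
One caveat with prefixing every character with a backslash: it is only safe for punctuation and whitespace. The sketch below illustrates this. The helper name `backslash_all` is hypothetical and the sample strings are made up.

```python
import re

def backslash_all(chars):
    """Prefix every character with a backslash (hypothetical helper)."""
    return ["\\" + c for c in chars]

# For punctuation, the backslash turns each character into a literal.
pattern = "({0})".format("|".join(backslash_all("?^(")))
parts = re.split(pattern, "(a^b?")
print(parts)  # ['', '(', 'a', '^', 'b', '?', '']

# For a letter, the backslash changes the meaning instead:
# "\d" is the digit class, not a literal "d".
digits = re.findall(backslash_all("d")[0], "d1")
print(digits)  # ['1']

# re.escape leaves alphanumeric characters untouched.
print(re.escape("d"))  # d
```

If the separator list could ever contain letters or digits, `re.escape` is therefore the safer choice.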

You could even do this inside the function:

import re 

def split_text_in_words(phrase_text, separators=[" "]):
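    # Add a backslash before every non-alphanumeric character that is not
    # already escaped. The negative lookbehind leaves existing escapes such
    # as '\?' alone, and backslashes themselves are never prefixed.
    inter = "|".join(
        re.sub(r"(?<!\\)([^A-Za-z0-9\\])", r"\\\1", sep) for sep in separators
    )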

    separator_regex = "({0})".format(inter)
    return [f for f in re.split(separator_regex, phrase_text) if len(f) > 0]

Answer 1
1 2018-11-16 21:17:55

Answer 2
1 2018-11-16 21:18:11

Answer 3
1 accepted 2018-11-16 21:39:21
