
Python regex: tokenizing English contractions

I am trying to parse strings in such a way as to separate out all word components, even those that have been contracted. For example, the tokenization of "shouldn't" would be ["should", "n't"].

The nltk module does not seem to be up to the task, however, as:

"I wouldn't've done that."

tokenizes as:

['I', "wouldn't", "'ve", 'done', 'that', '.']

where the desired tokenization of "wouldn't've" is: ['would', "n't", "'ve"]

After examining common English contractions, I am trying to write a regex to do the job, but I am having a hard time figuring out how to match "'ve" only once. For example, the following tokens can all terminate a contraction:

n't, 've, 'd, 'll, 's, 'm, 're

But the token "'ve" can also follow other contractions such as:

'd've, n't've, and (conceivably) 'll've

At the moment, I am trying to wrangle this regex:

\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b

However, this pattern also matches the badly formed:

"wouldn't've've"

It seems the problem is that the third apostrophe qualifies as a word boundary, so that the final "'ve" token matches the whole regex.
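This boundary behavior is easy to confirm: an apostrophe is a non-word character, so `\b` matches on either side of it, and a pattern like `\b've` finds both trailing "'ve" tokens in the malformed string (a small illustrative check, not part of the original question):

```python
import re

# \b matches between a word character and a non-word character, and the
# apostrophe is a non-word character, so a word boundary exists right
# before each "'ve".
matches = re.findall(r"\b've", "wouldn't've've")
print(matches)  # both "'ve" tokens match: ["'ve", "'ve"]
```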

I have been unable to think of a way to differentiate a word boundary from an apostrophe and, failing that, I am open to advice on alternative strategies.

Also, I am curious whether there is any way to include the word-boundary special character in a character class. According to the Python documentation, \b in a character class matches a backspace, and there doesn't seem to be a way around this.
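That documented behavior can be verified directly (a quick illustrative check):

```python
import re

# Inside a character class, \b loses its word-boundary meaning and
# matches the literal backspace character (U+0008) instead.
assert re.search(r"[\b]", "a\bb") is not None  # "\b" in the string is chr(8)
assert re.search(r"[\b]", "a b") is None       # no backspace present
```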

EDIT:

Here's the output:

>>> import re
>>> pattern = re.compile(r"\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b")
>>> matches = pattern.findall("She'll wish she hadn't've done that.")
>>> print(matches)
[("'ll", '', ''), ("n't", "'ve", ''), ('', '', "'ve")]

I can't figure out the third match. In particular, I just realized that if the third apostrophe were matching the leading \b, then I don't know what would be matching the character class [a-zA-Z]+.

(?<!['"\w])(['"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))(?(1)\1|(?!\1))(?!['"\w])

EDIT: \2 is the match, \3 is the first group, \4 is the second group, and \5 is the third group.
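A quick demonstration of this pattern (an illustrative session, not from the original answer): well-formed contractions match, while the malformed "wouldn't've've" is rejected, because the trailing lookahead (?!['"\w]) forbids a further apostrophe after the contraction:

```python
import re

# The conditional group (?(1)\1|(?!\1)) requires a closing quote only if an
# opening quote was captured, and (?!['"\w]) rejects a trailing apostrophe
# or letter after the contraction ending.
pattern = re.compile(
    r"""(?<!['"\w])(['"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))(?(1)\1|(?!\1))(?!['"\w])"""
)

print([m.group(2) for m in pattern.finditer("She'll wish she hadn't've done that.")])
print(pattern.search("wouldn't've've"))  # None: the malformed form is rejected
```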

You can use the following combined regex:

import re

# Capture each contraction suffix so re.split keeps it as a token;
# whitespace (\s) is left uncaptured so it is dropped from the result.
patterns_list = [r'\s', r"(n't)", r"('m)", r"('ll)", r"('ve)", r"('s)", r"('re)", r"('d)"]
pattern = re.compile('|'.join(patterns_list))
s = "I wouldn't've done that."

print([i for i in pattern.split(s) if i])

Result:

['I', 'would', "n't", "'ve", 'done', 'that.']

You can use this regex to tokenize the text:

(?:(?!.')\w)+|\w?'\w+|[^\s\w]

Usage:

>>> re.findall(r"(?:(?!.')\w)+|\w?'\w+|[^\s\w]", "I wouldn't've done that.")
['I', 'would', "n't", "'ve", 'done', 'that', '.']
>>> import nltk
>>> nltk.word_tokenize("I wouldn't've done that.")
['I', "wouldn't", "'ve", 'done', 'that', '.']

so:

>>> from itertools import chain
>>> [nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]
[['I'], ['would', "n't"], ["'ve"], ['done'], ['that'], ['.']]
>>> list(chain(*[nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]))
['I', 'would', "n't", "'ve", 'done', 'that', '.']

Here's a simple one:

text = "I wouldn't've done that."

# Expand contractions by plain string replacement; the irregular "won't"
# must be handled first, before the generic "n't" rule would turn it into
# "wo not". Note the limitations: possessive "'s" is expanded to " is ",
# and compound forms like "n't've" are only partially expanded.
text = ' ' + text.lower() + ' '
text = text.replace(" won't ", ' will not ').replace("n't ", ' not ') \
    .replace("'s ", ' is ').replace("'m ", ' am ') \
    .replace("'ll ", ' will ').replace("'d ", ' would ') \
    .replace("'re ", ' are ').replace("'ve ", ' have ')
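The same table-driven idea can be written as a dictionary-style list and a loop, which makes the replacement order explicit and lets compound forms like "wouldn't've" expand fully (a hypothetical variant, not from the answer above):

```python
# Ordered expansion table: the irregular "won't" must come before the
# generic "n't" rule, and "n't" before the vowel suffixes, so compound
# forms such as "wouldn't've" expand fully.
EXPANSIONS = [
    ("won't", "will not"), ("n't", " not"), ("'ve", " have"),
    ("'ll", " will"), ("'re", " are"), ("'d", " would"),
    ("'m", " am"), ("'s", " is"),
]

def expand(text: str) -> str:
    text = text.lower()
    for contraction, full in EXPANSIONS:
        text = text.replace(contraction, full)
    return text

print(expand("I wouldn't've done that."))  # i would not have done that.
```

Dropping the trailing-space anchors means possessive "'s" is also expanded to " is"; whether that trade-off is acceptable depends on the application.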
