Splitting a sentence in Python using regular expressions

Question

I'm trying to split a sentence into words, punctuation marks, and numbers. However, my code doesn't produce the output I expect. How can I fix it?

This is my input text (in a text file):

 "I 2changed to ask then, said that mildes't of men2,

And my code outputs this:

['"', 'I', '2', 'changed', 'to', 'ask', 'then', ',', 'said', 'that', "mildes't", 'of', 'men2']

However, the expected output is:

 ['"', 'I', '2', 'changed', 'to', 'ask', 'then', ',', 'said', 'that', "mildes't", 'of', 'men','2']

Here's my code:

import re
newlist = []
f = open("Inputfile2.txt",'r')
out = f.readlines()
for line in out:
    word = line.strip('\n')
    f.close()
    lst = re.compile(r"\d|\w+[\w']+|\w|[^\w\s]").findall(word)
print(lst)

Answer 1

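In regular expressions, '\w' matches any word character: letters, digits, and the underscore, i.e. [a-zA-Z0-9_] (for Python 3 str patterns it also matches Unicode word characters). That is why 'men2' comes out as a single token: the digit is matched as part of the word.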
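The behaviour can be checked directly with the pattern from the question. This is only a small sketch; the variable names are made up for illustration.

```python
import re

# The question's original pattern, unchanged.
original = re.compile(r"\d|\w+[\w']+|\w|[^\w\s]")

# \w covers letters, digits and the underscore.
word_chars = re.findall(r"\w", "a2_")

# At 'm' the \d branch fails, so \w+[\w']+ consumes the whole of 'men2'.
trailing_digit = original.findall("men2,")

# At '2' the \d branch is tried first and wins, so the digit is split off.
leading_digit = original.findall("2changed")

print(word_chars)      # ['a', '2', '_']
print(trailing_digit)  # ['men2', ',']
print(leading_digit)   # ['2', 'changed']
```

So a leading digit is separated only because the '\d' alternative comes first; a digit inside or at the end of a word never reaches that alternative.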

Also, the first part of your regular expression should be '\d+' so that it matches a run of one or more digits rather than a single digit.

The second and third parts of your regular expression, '\w+[\w']+|\w', can be merged into a single part by changing the second '+' to '*', giving '\w+[\w']*'. To keep digits out of word tokens, use [a-zA-Z] in place of '\w' in that part, so that 'men2' is split into 'men' and '2'.
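As a quick sanity check of the merge, the two forms can be compared on the sample sentence. This is a sketch with illustrative variable names, and it deliberately leaves out the digit and punctuation alternatives.

```python
import re

sample = "\"I 2changed to ask then, said that mildes't of men2,"

# Two alternatives from the question: words of 2+ characters, or one character.
two_branches = re.findall(r"\w+[\w']+|\w", sample)

# Merged form: one word character followed by zero or more word characters or apostrophes.
one_branch = re.findall(r"\w+[\w']*", sample)

print(two_branches == one_branch)  # True
print(one_branch[1])               # 2changed
```

The merged form still uses '\w', so on its own it keeps digits attached to words ('2changed'); a letters-only class is what separates them.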

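import re

# Compile once: digit runs | letters with optional apostrophes | any other non-space symbol
token_re = re.compile(r"\d+|[a-zA-Z]+[a-zA-Z']*|[^\w\s]")

with open('Inputfile2.txt', 'r') as f:
    for line in f:
        word = line.strip('\n')
        lst = token_re.findall(word)
        print(lst)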

This gives:

['"', 'I', '2', 'changed', 'to', 'ask', 'then', ',', 'said', 'that', "mildes't", 'of', 'men', '2', ',']

Note that your expected output is incorrect: it is missing the final ',' that ends the input line.
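The difference from the expected output can be seen without the input file by running the pattern on the sentence as a string literal. This is a minimal sketch; TOKEN_RE and the other names are illustrative.

```python
import re

TOKEN_RE = re.compile(r"\d+|[a-zA-Z]+[a-zA-Z']*|[^\w\s]")

sentence = "\"I 2changed to ask then, said that mildes't of men2,"

# The expected output as written in the question.
expected_in_question = ['"', 'I', '2', 'changed', 'to', 'ask', 'then', ',',
                        'said', 'that', "mildes't", 'of', 'men', '2']

tokens = TOKEN_RE.findall(sentence)
n = len(expected_in_question)

print(tokens[:n] == expected_in_question)  # True
print(tokens[n:])                          # [',']
```

The tokens agree with the question's expected list up to 'men', '2'; the only extra token is the trailing comma.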

Answer 1 (2), 2014-10-22 13:50:29
