Splitting a text file into words using a regular expression in Python

Question

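Brand new to Python!!! I was given a text file, https://en.wikipedia.org/wiki/Character_mask , and I need to split it into individual words (anything other than a letter counts as a separator). I tried using a regular expression, but I can't seem to get it to split correctly. Here is my code so far; can anyone help me fix this regex?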

import re 
file = open("charactermask.txt", "r")
text = file.read()
message = print(re.split(',.-\d\c\s',text))
print (message)
file.close()

Answer 1

You can use re.findall with the following regex pattern to find all words that are more than one character long.

Change:

message = print(re.split(',.-\d\c\s',text))

to:

message = re.findall(r'[A-Za-z]{2,}', text)
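
To illustrate how the suggested pattern behaves, here is a minimal sketch that runs it on an inline string. The sample text is made up for illustration and is not taken from the asker's file.

```python
import re

# Hypothetical sample text standing in for the contents of charactermask.txt.
sample = "A character mask is a data structure, e.g. a 32-bit value."

# [A-Za-z]{2,} matches runs of two or more ASCII letters,
# so single letters, digits, and punctuation are all skipped.
words = re.findall(r'[A-Za-z]{2,}', sample)
print(words)
# ['character', 'mask', 'is', 'data', 'structure', 'bit', 'value']
```

Note that one-letter words such as "A" and "a" are dropped; if they should be kept, [A-Za-z]+ would be the pattern to try instead.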

Answer 2

If you are just looking for simple tokenization of the words in a text string, you can use .split, and it works like a charm! For example:

mystring = "My favorite color is blue"
mystring.split()
['My', 'favorite', 'color', 'is', 'blue']
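
One caveat worth seeing in practice: with no arguments, split() separates on whitespace only, so punctuation stays attached to the neighbouring word. A small sketch with a made-up sentence:

```python
# Hypothetical sample sentence that contains punctuation.
mystring = "Blue, green; and red-orange."

# split() with no arguments splits on runs of whitespace only.
tokens = mystring.split()
print(tokens)
# ['Blue,', 'green;', 'and', 'red-orange.']
```

If every non-letter character has to act as a separator, as the question asks, plain split() is probably not enough on its own.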

Answer 3

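If you just want to split the text, SmashGuy's answer should do the job. Using a regular expression seems like overkill. Also, your regex pattern does not appear to do what you describe. You may want to test the pattern until it is right before plugging it into your Python script. Try https://regex101.com/

Here is what your pattern currently does: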

, matches the character , literally (case sensitive)
. matches any character (except for line terminators)
- matches the character - literally (case sensitive)
\d matches a digit (equal to [0-9])
\c matches the character c literally (case sensitive)
\s matches any whitespace character (equal to [\r\n\t\f\v ])

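I'm not sure whether you really meant to match any one of these characters, as in [,.-], and your impression of the \c token may also be wrong, since it has no special meaning in Python's regex flavor.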
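
For the splitting approach the asker attempted, a sketch using a negated character class might look like the following. It assumes that a "word" means a run of ASCII letters; the helper name split_words and the sample string are hypothetical.

```python
import re

def split_words(text):
    """Split text on every run of non-letter characters (hypothetical helper)."""
    # [^A-Za-z]+ matches one or more characters that are not ASCII letters.
    parts = re.split(r'[^A-Za-z]+', text)
    # re.split yields empty strings when the text starts or ends
    # with a separator, so filter those out.
    return [p for p in parts if p]

print(split_words("Bit-masks, e.g. 0xFF, select bits!"))
# ['Bit', 'masks', 'e', 'g', 'xFF', 'select', 'bits']
```

To apply this to the file from the question, the text returned by file.read() could be passed to split_words. Also note that in recent Python versions an unknown escape such as \c in a pattern is rejected with re.error rather than treated as a literal c, so the original pattern would likely fail to compile at all.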

Answer details

Answer 1: score 2, accepted, 2018-09-26 05:37:02
Answer 2: score 1, 2018-09-26 05:40:28
Answer 3: score 1, 2018-09-26 05:56:23
