Splitting a text file into words using regular expressions in Python

Question

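Brand new to Python! I was given a text file (https://en.wikipedia.org/wiki/Character_mask) and I need to split it into individual words (more than one letter long, separated by any other character). I tried using a regular expression, but I can't get it to split correctly. Here is my code so far; can anyone help me fix this regex?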

import re 
file = open("charactermask.txt", "r")
text = file.read()
message = print(re.split(',.-\d\c\s',text))
print (message)
file.close()

Answer 1

You can use re.findall with the following regex pattern to find all words that are longer than one character.

Change:

message = print(re.split(',.-\d\c\s',text))

to:

message = re.findall(r'[A-Za-z]{2,}', text)
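
A minimal sketch of how this pattern could fit into a complete script. The helper names `find_words` and `words_from_file` are hypothetical, not part of the original answer. Note that the pattern matches ASCII letters only, so a hyphenated token is returned as separate pieces and one-letter pieces are dropped.

```python
import re

# Two or more consecutive ASCII letters; anything else acts as a separator.
WORD_PATTERN = re.compile(r'[A-Za-z]{2,}')

def find_words(text):
    """Return every run of two or more ASCII letters in text."""
    return WORD_PATTERN.findall(text)

def words_from_file(path):
    """Read a text file and return its words (hypothetical helper)."""
    # 'with' closes the file automatically, even if reading fails.
    with open(path, "r") as file:
        return find_words(file.read())

print(find_words("I saw a red-fox, twice!"))
# ['saw', 'red', 'fox', 'twice']
```

For the file in the question, this would be called as `words_from_file("charactermask.txt")`.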

Answer 2

If you are looking for simple tokenization of the words in a text string, you can use .split, and it works like a charm! For example:

mystring = "My favorite color is blue"
mystring.split()
['My', 'favorite', 'color', 'is', 'blue']
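
One caveat worth checking against the question's requirements: `str.split()` with no arguments splits on whitespace only, so punctuation stays attached to the words and one-letter words are kept. A small sketch of that behaviour and one possible cleanup, using a made-up sample sentence:

```python
import string

sample = "Blue, I think, is a color."

# Whitespace-only splitting leaves punctuation attached to the tokens.
tokens = sample.split()
print(tokens)
# ['Blue,', 'I', 'think,', 'is', 'a', 'color.']

# Strip leading/trailing punctuation, then drop one-character tokens.
cleaned = [token.strip(string.punctuation) for token in tokens]
words = [word for word in cleaned if len(word) > 1]
print(words)
# ['Blue', 'think', 'is', 'color']
```

This cleanup only trims punctuation at the edges of each token; a token such as `red-fox` would stay in one piece, which may or may not match what the question asks for.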

Answer 3

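If you just want to split the text, SmashGuy's answer should do the job; using a regex seems like overkill. Also, your regex pattern does not seem to do what you describe. You may want to test the pattern until it is right before putting it into your Python script. Try https://regex101.com/

Here is what your pattern currently does: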

, matches the character , literally (case sensitive)
. matches any character (except for line terminators)
- matches the character - literally (case sensitive)
\d matches a digit (equal to [0-9])
\c matches the character c literally (case sensitive)
\s matches any whitespace character (equal to [\r\n\t\f\v ])

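I'm not sure, but you may have meant to use a character class such as [,.-], which matches any one of those characters. You may also have the wrong impression of the \c token, since it has no special meaning in Python's regex flavor (Python 3.6 and later reject an unknown escape such as \c with a re.error rather than matching a literal c).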
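
As a rough illustration of the character-class idea, here is a sketch that splits on runs of separator characters with re.split. The sample text and the exact set of separators are assumptions made for the example, not something taken from the question's file.

```python
import re

text = "Masks, e.g. 0xFF-1, hide bits."

# On Python 3.6+, the question's pattern fails to compile because \c is
# an unknown escape; the error is caught here only to show the message.
try:
    re.compile(r',.-\d\c\s')
except re.error as exc:
    print("invalid pattern:", exc)

# Character class: match any ONE of comma, dot, hyphen, digit or whitespace.
# The trailing + treats a run of separators as a single split point.
pieces = re.split(r'[,.\-\d\s]+', text)
print(pieces)
# ['Masks', 'e', 'g', 'xFF', 'hide', 'bits', '']

# re.split can leave empty strings at the edges and keeps one-letter
# pieces, so filter by length to keep words of two or more characters.
words = [piece for piece in pieces if len(piece) > 1]
print(words)
# ['Masks', 'xFF', 'hide', 'bits']
```

Listing separators explicitly means any character left out of the class (for example `!`) stays inside the words; a negated class such as `[^A-Za-z]+` splits on everything that is not an ASCII letter instead.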

3 answers

Answer 1
2, accepted, 2018-09-26 05:37:02

Answer 2
1, 2018-09-26 05:40:28

Answer 3
1, 2018-09-26 05:56:23
