
Splitting a text file into words using regex in Python

Brand new to Python! I'm given a text file (https://en.wikipedia.org/wiki/Character_mask) and I need to split it into single words (runs of more than one letter, separated by one or more of any other character). I've tried using regex but can't seem to split it right without an error. Here is the code I have so far. Can anyone help me fix this regular expression?

import re 
file = open("charactermask.txt", "r")
text = file.read()
message = print(re.split(',.-\d\c\s',text))
print (message)
file.close()

You can use re.findall with the following regex pattern instead to find all words that are more than 1 character long.

Change:

message = print(re.split(',.-\d\c\s',text))

to:

message = re.findall(r'[A-Za-z]{2,}', text)
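
As a rough end-to-end sketch of this approach (the helper names find_words and words_from_file and the sample sentence are made up for illustration; the file name is the one from the question):

```python
import re

# Runs of two or more ASCII letters, the same pattern as above.
WORD_PATTERN = re.compile(r'[A-Za-z]{2,}')

def find_words(text):
    """Return every run of 2+ ASCII letters found in text."""
    return WORD_PATTERN.findall(text)

def words_from_file(path):
    """Hypothetical helper: read a whole file and extract its words."""
    # 'with' closes the file automatically, even if reading fails.
    with open(path, "r") as handle:
        return find_words(handle.read())

sample = "A mask, in computing: it hides 2 of the user's bits."
print(find_words(sample))
# To process the file from the question instead:
# print(words_from_file("charactermask.txt"))
```

Note that this pattern only knows about unaccented ASCII letters, so a word like "user's" comes back as "user" and the lone "s" is dropped.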

If you are just looking for simple word tokens from a text string, you can use .split(); it will work like a charm! For example:

mystring = "My favorite color is blue"
mystring.split()
['My', 'favorite', 'color', 'is', 'blue']
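
One thing to check against the original requirements: with no arguments, .split() only splits on whitespace, so punctuation stays attached to the words and single-letter tokens are kept. A small sketch with a made-up sentence:

```python
sentence = "I like blue, red and (sometimes) green."

# Splits on runs of whitespace only; punctuation is left in place.
tokens = sentence.split()
print(tokens)

# Dropping one-character tokens has to be done as a separate step.
long_tokens = [token for token in tokens if len(token) > 1]
print(long_tokens)
```

If the text contains punctuation that should not end up in the words, a regex-based approach may still be the better fit.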

If you're just trying to split the text, then SmashGuy's answer should get the job done; using regex seems like overkill. Additionally, your regex pattern doesn't quite do what you described. You might want to test your pattern until you get it right before plugging it into your Python script. Try https://regex101.com/

Here's what your pattern does right now, token by token (the tokens must match one after another, in this order):

, matches the character , literally (case sensitive)
. matches any character (except for line terminators)
- matches the character - literally (case sensitive)
\d matches a digit (equal to [0-9])
\c matches the character c literally (case sensitive)
\s matches any whitespace character (equal to [\r\n\t\f\v ])

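I'm not sure whether you actually meant a character class such as [,.-], which matches any one of these characters. You may also have the wrong impression of the \c token: it has no special meaning in Python's flavor of regex, and recent Python 3 versions reject unknown letter escapes like this with a "bad escape" error, which may well be the error you are seeing.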
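
To illustrate the difference, here is a minimal sketch. It assumes a reasonably recent Python 3, and the sample text and variable names are made up. It splits on a character class instead of a fixed sequence, then shows what happens when the original pattern is compiled:

```python
import re

text = "one,two.three-four 5 six\nx seven"

# Character class: split on runs of commas, periods, hyphens,
# digits or whitespace (the hyphen is escaped inside the class).
parts = re.split(r'[,.\-\d\s]+', text)

# Keep only the pieces that are longer than one character.
words = [part for part in parts if len(part) > 1]
print(words)

# The original pattern contains \c, which is not a known escape.
bad_escape_message = None
try:
    re.compile(r',.-\d\c\s')
except re.error as exc:
    bad_escape_message = str(exc)
    print("re.error:", bad_escape_message)
```

Unlike the [A-Za-z]{2,} approach, this keeps anything that is not listed in the class, so other punctuation would have to be added to the class by hand.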
