
How to split string by space and treat special characters as a separate word in Python?

Assume I have a string,

"I want that one, it is great."

I want to split this string into

["I", "want", "that", "one", ",", "it", "is", "great", "."]

Special characters such as ",.:;" (and possibly others) should be treated as separate words.

Is there any easy way to do this with Python 2.7?

Update

For an example such as "I don't.", the result should be ["I", "don", "'", "t", "."]. Ideally it would also work with non-English punctuation such as ؛ and others.

See here for a similar question. The answer there applies to you as well:

import re
print re.split(r'(\W)', "I want that one, it is great.")
print re.split(r'(\W)', "I don't.")

You can remove the spaces and empty strings returned by re.split using a filter:

s = "I want that one, it is great."
print filter(lambda _: _ not in [' ', ''], re.split(r'(\W)', s))
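
On Python 3, print is a function and filter returns a lazy iterator rather than a list, so the snippet above needs small changes there. A minimal sketch of the same idea that runs on both Python 2.7 and Python 3; the function name split_tokens is only illustrative and not part of the original answer:

```python
import re

def split_tokens(s):
    # A capturing group in re.split keeps the separators in the result.
    parts = re.split(r'(\W)', s)
    # Drop the single spaces and the empty strings that re.split leaves behind.
    return [p for p in parts if p not in (' ', '')]

print(split_tokens("I want that one, it is great."))
print(split_tokens("I don't."))
```

Like the filter version, this only discards single spaces; tabs or newlines would still show up as tokens.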

You can use a regex and a simple list comprehension to do this. The regex separates the words from the punctuation, and the list comprehension drops the whitespace-only pieces.

import re
s = "I want that one, it is great. Don't do it."
new_s = [c.strip() for c in re.split(r'(\W+)', s) if c.strip() != '']
print new_s

The output of new_s will be:

['I', 'want', 'that', 'one', ',', 'it', 'is', 'great', '.', 'Don', "'", 't', 'do', 'it', '.']

In [70]: re.findall(r"[^,.:;' ]+|[,.:;']", "I want that one, it is great.")
Out[70]: ['I', 'want', 'that', 'one', ',', 'it', 'is', 'great', '.']

In [76]: re.findall(r"[^,.:;' ]+|[,.:;']", "I don't.")
Out[76]: ['I', 'don', "'", 't', '.']

The regex [^,.:;' ]+|[,.:;'] matches either one or more characters other than , . : ; ' or a literal space, or a single one of the literal characters , . : ; '.
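
If the set of separator characters needs to change, the same pattern can be built from a string of punctuation. A minimal sketch, where split_on and its punct parameter are hypothetical names rather than anything from the original answer; re.escape is used so that characters such as - or ] stay literal inside the character class:

```python
import re

def split_on(text, punct=",.:;'"):
    # Escape the characters so they are safe inside [...] (e.g. '-', ']', '^').
    cls = re.escape(punct)
    # Same shape as the pattern above: a run of "other" characters,
    # or exactly one of the punctuation characters.
    pattern = "[^{0} ]+|[{0}]".format(cls)
    return re.findall(pattern, text)

print(split_on("I don't."))
print(split_on("a-b]c", punct="-]"))
```

The sketch assumes punct is non-empty; an empty string would produce an invalid character class.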


Or, using the third-party regex module, you could easily expand this to include all punctuation and symbols by using the [:punct:] character class:

In [77]: import regex

In Python2:

In [4]: regex.findall(ur"[^[:punct:] ]+|[[:punct:]]", u"""A \N{ARABIC SEMICOLON} B""")
Out[4]: [u'A', u'\u061b', u'B']

In [6]: regex.findall(ur"[^[:punct:] ]+|[[:punct:]]", u"""He said, "I don't!" """)
Out[6]: [u'He', u'said', u',', u'"', u'I', u'don', u"'", u't', u'!', u'"']

In Python3:

In [105]: regex.findall(r"[^[:punct:] ]+|[[:punct:]]", """A \N{ARABIC SEMICOLON} B""")
Out[105]: ['A', '؛', 'B']

In [83]: regex.findall(r"[^[:punct:] ]+|[[:punct:]]", """He said, "I don't!" """)
Out[83]: ['He', 'said', ',', '"', 'I', 'don', "'", 't', '!', '"']

Note that you must pass a unicode string as the second argument to regex.findall if you want [:punct:] to match Unicode punctuation or symbols.

In Python2:

import regex
print(regex.findall(r"[^[:punct:] ]+|[[:punct:]]", 'help؛'))
print(regex.findall(ur"[^[:punct:] ]+|[[:punct:]]", u'help؛'))

prints

['help\xd8\x9b']
[u'help', u'\u061b']
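
When installing the third-party regex module is not practical, the standard re module can get close. This is a sketch under the assumption that "punctuation" means any character that is neither a word character (\w) nor whitespace; unlike [:punct:], it treats the underscore as part of a word. The function name tokenize is illustrative only:

```python
import re

def tokenize(text):
    # \w+ grabs a run of word characters; [^\w\s] grabs a single character
    # that is neither a word character nor whitespace.
    # re.UNICODE makes \w and \s Unicode-aware on Python 2; on Python 3 it is
    # already the default for str patterns.
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

print(tokenize(u"A \u061b B"))
print(tokenize(u"He said, \"I don't!\" "))
```

As with regex.findall, the input should be a unicode string on Python 2, otherwise multi-byte characters are seen as separate bytes.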

I don't know of any functions that can do this but you could use a for loop.

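Something like this:

def split_words(stringName):
    words = []
    word = ""
    for ch in stringName:
        if ch.isalnum():
            # still inside a word, keep collecting characters
            word += ch
        else:
            # a space or special character ends the current word
            if word:
                words.append(word)
                word = ""
            # special characters become words of their own; whitespace is dropped
            if not ch.isspace():
                words.append(ch)
    # do not lose a word that runs to the end of the string
    if word:
        words.append(word)
    return words

Hope this works... sorry if it is not the best way.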
