For example, the string is: hello %$ world %^& let me ^@ love && you
The expected result is hello in one variable and the remaining words in other variables, e.g. a="hello", b="world", and so on.
Use a regular expression, like this:
import re
a = "hello %$ world %^& let me ^@ love && you"
print(re.findall(r'\w+',a))
You can use regular expressions to retrieve the words from the string:
import re
my_string = "hello %$ world %^& let me ^@ love && you"
re.findall(r'\w+\b', my_string)
# ['hello', 'world', 'let', 'me', 'love', 'you']
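Since the question asks for the words in separate variables, the list returned by re.findall can be unpacked. This is only a minimal sketch: the names a, b and rest are illustrative, and the starred unpacking assumes the string contains at least two words.

```python
import re

my_string = "hello %$ world %^& let me ^@ love && you"

# Collect every run of word characters (letters, digits, underscore).
words = re.findall(r"\w+", my_string)

# Unpack the first two words into named variables;
# the starred name collects whatever is left as a list.
a, b, *rest = words

print(a)     # hello
print(b)     # world
print(rest)  # ['let', 'me', 'love', 'you']
```

If the number of words is not known in advance, keeping them in the list (or a dict) is usually easier than creating one variable per word.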
For more about regular expressions, see the Regular Expression HOWTO.
As asked in the comments, here is a regexp that retrieves groups of words separated by special characters:
my_string = "hello world #$$ i love you #$@^ welcome to world"
re.findall(r'(\w+[\s\w]*)\b', my_string)
# ['hello world', 'i love you', 'welcome to world']
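An alternative way to get the same groups is to split on the special characters instead of matching the words. The sketch below assumes that a "special character" means anything that is neither a word character nor whitespace; the names parts and groups are illustrative.

```python
import re

my_string = "hello world #$$ i love you #$@^ welcome to world"

# Split wherever one or more characters are neither word characters nor whitespace.
parts = re.split(r"[^\w\s]+", my_string)

# Trim the surrounding spaces and drop any empty pieces.
groups = [p.strip() for p in parts if p.strip()]

print(groups)  # ['hello world', 'i love you', 'welcome to world']
```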
The basic answer would be a regexp. I would recommend looking into the tokenizers from NLTK: they encompass research on the topic and give you the flexibility to switch to something more sophisticated later on. Guess what? It offers a regexp-based tokenizer too!
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'([A-Za-z0-9 ]+)')
corpus = tokenizer.tokenize("hello %$ world %^& let me ^@ love && you")
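Note that the character class in that pattern includes a space, so matches can span several words and carry leading or trailing spaces. The sketch below applies the same character class with the standard library's re.findall (not NLTK itself, so it only illustrates what the pattern matches) and strips the tokens afterwards; the names raw and tokens are illustrative.

```python
import re

text = "hello %$ world %^& let me ^@ love && you"

# Same character class as above: letters, digits and spaces.
raw = re.findall(r"[A-Za-z0-9 ]+", text)
print(raw)     # ['hello ', ' world ', ' let me ', ' love ', ' you']

# Strip surrounding spaces and drop tokens that were only whitespace.
tokens = [t.strip() for t in raw if t.strip()]
print(tokens)  # ['hello', 'world', 'let me', 'love', 'you']
```

Dropping the space from the character class gives one token per word instead of one token per run of words.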