
How do I split a string on different delimiters while keeping some of those delimiters in the output? (Tokenize a string)

More specifically, I want to split a string on any non-alphanumeric character, but when the delimiter is not whitespace I want to keep it. That is, for the input:

my_string = 'Hey, I\'m 9/11 7-11'

I want to get:

['Hey' , ',' , 'I' , "'" , 'm', '9' , '/' , '11', '7' , '-' , '11']

With no whitespace as a list element.

I have tried the following:

re.split('([/\'\-_,.;])|\s', my_string)

But this outputs:

['Hey', ',', '', None, 'I', "'", 'm', None, '9', '/', '11', None, '7', '-', '11']

How do I solve this without 'unnecessary' iterations?

I also have some trouble escaping the backslash character, since '\\\\' does not seem to work. Any ideas on how to solve this as well?

Thanks a lot.

You may use

import re
my_string = "Hey, I'm 9/11 7-11"
print(re.findall(r'\w+|[^\w\s]', my_string))
# => ['Hey', ',', 'I', "'", 'm', '9', '/', '11', '7', '-', '11']

See the Python demo.

The \w+|[^\w\s] regex matches either one or more word characters (letters, digits, or _) or a single character that is neither a word character nor whitespace.
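As a minimal sketch of how the two alternatives behave, the snippet below wraps the same pattern in a compiled regex. The names `TOKEN_RE` and `tokenize` are hypothetical and not part of the original answer. Note that `_` counts as a word character, so it stays inside a token instead of being split out, and repeated punctuation yields one token per character.

```python
import re

# Hypothetical names for illustration; the pattern is the one from the answer.
TOKEN_RE = re.compile(r'\w+|[^\w\s]')

def tokenize(text):
    """Return runs of word characters and single non-word, non-space characters."""
    return TOKEN_RE.findall(text)

print(tokenize("Hey, I'm 9/11 7-11"))
# ['Hey', ',', 'I', "'", 'm', '9', '/', '11', '7', '-', '11']

# '_' is a word character, and each '-' becomes its own token.
print(tokenize("snake_case; a--b"))
# ['snake_case', ';', 'a', '-', '-', 'b']
```

Because whitespace matches neither alternative, it is skipped and never appears in the result, which avoids the None and '' entries produced by re.split with an optional capturing group.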

BTW, to match a backslash with a regex, you need to use \\ in a raw string literal (r'\\') or 4 backslashes in a regular one ('\\\\'). It is recommended to use raw string literals to define a regex pattern in Python.
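The short check below illustrates the point about escaping. The sample string and variable names are made up for illustration. It shows that the raw and regular literals produce the same two-character pattern, and that the tokenizing pattern already treats a backslash as a single-character token.

```python
import re

sample = 'a\\b'  # three characters: 'a', one backslash, 'b'

raw_pattern = r'\\'     # raw literal: the regex engine receives two backslashes
plain_pattern = '\\\\'  # regular literal: four typed, two reach the regex engine

# Both literals spell the same pattern, which matches one literal backslash.
print(raw_pattern == plain_pattern)  # True

backslash_hits = re.findall(raw_pattern, sample)
print(backslash_hits)  # ['\\']  (one match, a single backslash)

# [^\w\s] also matches a backslash, so the tokenizer needs no special case.
tokens = re.findall(r'\w+|[^\w\s]', sample)
print(tokens)  # ['a', '\\', 'b']
```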


 