
Regex joining words split by whitespace and hyphen

My string is quite messy and looks something like this:

s="I'm hope-less and can -not solve this pro- blem on my own. Wo - uld you help me?"

I'd like to join the words that are broken by a hyphen (and sometimes whitespace) and collect all the words in one list. Desired output:

result = ["I'm", 'hopeless', 'and', 'cannot', 'solve', 'this', 'problem', 'on', 'my', 'own', '.', 'Would', 'you', 'help', 'me', '?']

I tried a lot of different variations, but nothing worked.

import re

rgx = re.compile(r"([\w][\w'][\w\-]*\w)")
s = "My string'"
rgx.findall(s)

Here's one way:

[re.sub(r'\s*-\s*', '', i) for i in re.split(r'(?<!-)\s(?!-)', s)]

# ["I'm", 'hopeless', 'and', 'cannot', 'solve', 'this', 'problem', 'on', 'my', 'own.', 'Would', 'you', 'help', 'me?']

Two operations here:

  1. Split the text on whitespace that has no hyphen directly before or after it, using a negative lookbehind and a negative lookahead.

  2. In each of the resulting pieces, replace every hyphen, together with any whitespace in front of or behind it, with the empty string.
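To make the two operations easier to follow, here is a minimal sketch that runs them as separate statements on the string from the question. The variable names `chunks` and `words` are only illustrative and do not appear in the original answer.

```python
import re

s = "I'm hope-less and can -not solve this pro- blem on my own. Wo - uld you help me?"

# Step 1: split on single whitespace characters that have no hyphen
# immediately before or after them.
chunks = re.split(r'(?<!-)\s(?!-)', s)
# At this point broken words are still intact pieces,
# e.g. 'can -not', 'pro- blem' and 'Wo - uld'.

# Step 2: inside each piece, delete the hyphen plus any whitespace around it.
words = [re.sub(r'\s*-\s*', '', chunk) for chunk in chunks]
print(words)
```

Keeping the steps apart like this makes it possible to inspect `chunks` and confirm that the lookarounds protected the whitespace next to a hyphen before the hyphens are removed.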

You can see the first operation's demo here: https://regex101.com/r/ayHPvY/2

And the second: https://regex101.com/r/ayHPvY/1

Edit: To get the . and ? separated as well, use this instead:

[re.sub(r'\s*-\s*','', i) for i in re.split(r"(?<!-)\s(?!-)|([^\w\s'-]+)", s) if i]

# ["I'm", 'hopeless', 'and', 'cannot', 'solve', 'this', 'problem', 'on', 'my', 'own', '.', 'Would', 'you', 'help', 'me', '?']

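The trick is to also split on any run of characters that are not word characters, whitespace, hyphens or apostrophes, and to capture that run so it is kept in the result. The if i is necessary because the split returns None wherever the capturing group did not take part in the match, as well as some empty strings.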
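As a small illustration of why the filter is needed, the sketch below applies the same split pattern to a short fragment and prints the unfiltered result. The fragment and the names `raw` and `filtered` are made up for this example.

```python
import re

pattern = r"(?<!-)\s(?!-)|([^\w\s'-]+)"

# When the whitespace alternative matches, the capturing group did not
# participate, so re.split() inserts None at that position. Punctuation
# directly followed by whitespace also leaves an empty string in between.
raw = re.split(pattern, "my own. Would")
print(raw)       # ['my', None, 'own', '.', '', None, 'Would']

# A truthiness test drops both None and '' in one pass.
filtered = [item for item in raw if item]
print(filtered)  # ['my', 'own', '.', 'Would']
```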

A quick, non-regex way to do it would be

''.join(map(lambda s: s.strip(), s.split('-'))).split()

That is: split on hyphens, strip the surrounding whitespace from each part, join the parts back into one string, and split on whitespace. However, this doesn't separate the dot or the question mark.
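The same chain can be written out one step at a time, which may make the intermediate values easier to check. This is only a sketch; the names `parts`, `stripped`, `joined` and `words` are illustrative.

```python
s = "I'm hope-less and can -not solve this pro- blem on my own. Wo - uld you help me?"

parts = s.split('-')                    # cut the text at every hyphen
stripped = [p.strip() for p in parts]   # drop whitespace next to each cut
joined = ''.join(stripped)              # glue the pieces back together
words = joined.split()                  # split on runs of whitespace

print(joined)
print(words)
```

Note that this treats every hyphen as a break artifact, so a genuinely hyphenated word would be merged as well, and punctuation stays attached to the preceding word ('own.' and 'me?').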

How about this:

>>> s
"I'm hope-less and can -not solve this pro- blem on my own. Wo - uld you help me?"
>>> list(map(lambda x:re.sub(' *- *','',x), filter(lambda x:x, re.split(r'(?<!-) +(?!-)|([.?])',s))))
["I'm", 'hopeless', 'and', 'cannot', 'solve', 'this', 'problem', 'on', 'my', 'own', '.', 'Would', 'you', 'help', 'me', '?']

The above uses a plain space ' ', but using \s is better:

list(map(lambda x: re.sub(r'\s*-\s*', '', x), filter(lambda x: x, re.split(r'(?<!-)\s+(?!-)|([.?])', s))))

(?<!-)\s+(?!-) means whitespace that has no - directly before or after it.
[.?] means a single . or ?.

re.split(r'(?<!-)\s+(?!-)|([.?])',s) will split the string accordingly, but the result will contain some None and empty string '' items:

["I'm", None, 'hope-less', None, 'and', None, 'can -not', None, 'solve', None, 'this', None, 'pro- blem', None, 'on', None, 'my', None, 'own', '.', '', None, 'Wo - uld', None, 'you', None, 'help', None, 'me', '?', '']

This result is fed directly to filter to remove the None and '' items, and then to map to remove the whitespace and - inside each word.
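If the one-liner is needed in more than one place, it could be wrapped in a small function with precompiled patterns. The following is a sketch under that assumption; the function name `join_hyphenated_words` and the module-level names `_SPLIT` and `_HYPHEN` are hypothetical and not part of the original answer.

```python
import re

# Same patterns as in the answer above, compiled once.
_SPLIT = re.compile(r'(?<!-)\s+(?!-)|([.?])')
_HYPHEN = re.compile(r'\s*-\s*')


def join_hyphenated_words(text):
    """Split text into words, rejoining words broken by a hyphen."""
    # filter(None, ...) drops both the None and the '' items from the split.
    tokens = filter(None, _SPLIT.split(text))
    # Remove each hyphen together with the whitespace around it.
    return [_HYPHEN.sub('', token) for token in tokens]


s = "I'm hope-less and can -not solve this pro- blem on my own. Wo - uld you help me?"
print(join_hyphenated_words(s))
```

Passing None as the first argument to filter keeps only truthy items, which has the same effect here as the lambda x: x used above.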
