
Regular expression that matches some symbols but excludes others

I have a paragraph, and I want to use a regular expression to extract all the words in it.

a bdag agasg it's the cookies for dogs',don't you think so? the word 'wow' in english means.you hey b 097  dag final

I have tried several regexes with re.findall(regX, str) and found one that matches most of the words.

regX = "[ ,\.\?]?([a-z]+'?[a-z]?)[ ,\.\?]?"

['a', 'bdag', 'agasg', "it's", 'the', 'cookies', 'for', "dogs'", "don't", 'you', 'think', 'so', 'the', 'word', "wow'", 'in', 'english', 'means', 'you', 'hey', 'b', 'dag', 'final']

All are good except **wow'**.
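The attempt described above can be reproduced with a short script. This is only a sketch: the variable name `text` is an assumption, and the paragraph and pattern are copied from the question.

```python
import re

# Sample paragraph from the question (note the two spaces after "097").
text = ("a bdag agasg it's the cookies for dogs',don't you think so? "
        "the word 'wow' in english means.you hey b 097  dag final")

# Pattern from the question: optional separator, the word, optional separator.
regX = r"[ ,\.\?]?([a-z]+'?[a-z]?)[ ,\.\?]?"

# With a single capturing group, findall returns the text of that group.
words = re.findall(regX, text)
print(words)
```

The opening quote of 'wow' is skipped because no part of the pattern can match it there, while the closing quote is picked up by the optional `'?` inside the group, which is why the result contains wow' rather than wow.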

I wonder whether a regular expression can express the logic "it can be a comma, space, period, etc., but it can't be an apostrophe".

Can someone advise?

Try:

[ ,\.\?']?([a-z]*('\w)?)[\' ,\.\?]? 

This adds a second capturing group, so you need to select only group 1 from each match.
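One way to select only group 1 in Python is sketched below. Using `re.finditer` and filtering out empty captures is an assumption on my part, not something the answer specifies; the filter is needed because `[a-z]*` can match nothing, so the pattern also produces empty matches (for example at the digits).

```python
import re

# Sample paragraph from the question (note the two spaces after "097").
text = ("a bdag agasg it's the cookies for dogs',don't you think so? "
        "the word 'wow' in english means.you hey b 097  dag final")

# Pattern from the answer: the apostrophe is allowed as a separator on both
# sides, and an inner apostrophe is kept only when a word character follows it.
pattern = re.compile(r"[ ,\.\?']?([a-z]*('\w)?)[\' ,\.\?]?")

# Keep group 1 of every match and drop the empty captures.
words = [m.group(1) for m in pattern.finditer(text) if m.group(1)]
print(words)
```

With this pattern 'wow' comes back as wow, and it's and don't are kept intact. The trailing apostrophe of dogs' is also treated as a separator, so that word comes back as dogs.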

I didn't fully understand what you want the output to be, but try this:

[ ,\.\?]?(["']?[a-z]+["']?[a-z]?)[ ,\.\?]?

This regex lets you capture the ' and " characters that appear within the text.
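A minimal sketch of how this approach behaves on the sample paragraph. It assumes the optional quote is written as the explicit class `["']` with a plain `?`, which should run on any Python 3 version; the variable names are hypothetical.

```python
import re

# Sample paragraph from the question (note the two spaces after "097").
text = ("a bdag agasg it's the cookies for dogs',don't you think so? "
        "the word 'wow' in english means.you hey b 097  dag final")

# Assumed form of the pattern: an optional quote before the word and an
# optional quote inside or after it, all captured in group 1.
pattern = r"""[ ,\.\?]?(["']?[a-z]+["']?[a-z]?)[ ,\.\?]?"""

# With a single capturing group, findall returns the text of that group.
words = re.findall(pattern, text)
print(words)
```

Here the quotes stay attached to the word, so 'wow' is returned with both of its apostrophes instead of having them stripped.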

If this is still not what you wanted, please let me know so I can update my answer.
