
Trying to regex all capitalized words EXCEPT those immediately following a period in Python

I'm trying to have a bot crawl through text and collect, with a high degree of accuracy, all proper nouns and phrases: anything capitalized in the middle of a sentence, where words capitalized in succession count as part of the same phrase (and the same list entry).

So far I have:

tag_string = re.findall(r'([a-zA-Z]+)\s([A-Z][a-z]*)\s([a-zA-Z]+)', in_string)

This has trouble with proper nouns immediately preceding periods, and it also captures the surrounding lowercase words.

And I also have:

#tag_string = re.findall(r'([a-zA-Z]+)\s([A-Z][a-z]*)(\s([a-zA-Z]+)|\.)', in_string)

This captures even more surrounding lowercase words but is less susceptible to the preceding-period issue. I've been at this for hours. Does anyone see what I'm doing wrong?
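To make the two symptoms concrete, here is a small sketch that runs both patterns on two made-up sentences (the sample strings are assumptions, not from the original post). It suggests why neighbors get captured: the patterns require a word on each side of the capitalized word, and those neighbors are consumed by the match.

```python
import re

first_pattern = r'([a-zA-Z]+)\s([A-Z][a-z]*)\s([a-zA-Z]+)'
second_pattern = r'([a-zA-Z]+)\s([A-Z][a-z]*)(\s([a-zA-Z]+)|\.)'

# Hypothetical sample sentences, chosen only for illustration.
sample_a = 'We met Alice in Paris. Then we left.'
sample_b = 'We saw Paris. Then we left.'

# 'in' is consumed as the word after 'Alice', so it cannot also serve as
# the word before 'Paris', and 'Paris' is followed by a period anyway.
first_a = re.findall(first_pattern, sample_a)
print(first_a)   # [('met', 'Alice', 'in')]

# The first pattern needs whitespace after the capitalized word,
# so a proper noun right before a period is missed entirely.
first_b = re.findall(first_pattern, sample_b)
print(first_b)   # []

# The second pattern accepts a period instead, but findall now returns
# four groups per match, including the lowercase neighbors.
second_b = re.findall(second_pattern, sample_b)
print(second_b)  # [('saw', 'Paris', '.', '')]
```

Because `re.findall` returns every capture group, any neighbor you capture in order to test it ends up in the result; matching the neighbor without a capturing group, or not matching it at all, avoids that.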

One option would be to match every capitalized word, making sure the match also includes any period that precedes it. Then you can filter out all the matches that contain a period.

Something like this: \.? *[A-Z][a-z]*

Then you can filter out the offending matches.

import re

# Match an optional period, optional spaces, then one capitalized word.
out = re.findall(r'\.? *[A-Z][a-z]*', 'This is a sentence. This is Another sentence.   And this is a anoth.er Hello')
# Drop the matches that start with a period (words that begin a new sentence).
outFil = [x for x in out if x[0] != '.']
print(out)
print(outFil)

['This', '. This', ' Another', '.   And', ' Hello']

['This', ' Another', ' Hello']
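The same match-then-filter idea can be pushed a little further to get closer to what the question asks for. The following is only a sketch under some assumptions: the function name `proper_phrases` and the sample sentence are hypothetical, consecutive capitalized words are assumed to be separated by single spaces, and the very first word of the text is treated as a sentence start and skipped.

```python
import re

# Optional period, optional whitespace, then a run of capitalized words
# separated by single spaces. \b keeps a match from starting mid-word.
PHRASE = re.compile(r'(\.?)\s*\b([A-Z][a-z]*(?: [A-Z][a-z]*)*)')

def proper_phrases(text):
    """Return capitalized phrases that do not start a sentence."""
    phrases = []
    for match in PHRASE.finditer(text):
        follows_period = match.group(1) == '.'
        # Nothing but whitespace before the phrase: it opens the text.
        starts_text = text[:match.start(2)].strip() == ''
        if not (follows_period or starts_text):
            phrases.append(match.group(2))
    return phrases

sample = 'This is a sentence. This is Another sentence.   And then New York said Hello'
print(proper_phrases(sample))  # ['Another', 'New York', 'Hello']
```

Capturing the phrase in its own group means the results carry no leading spaces. One known limitation of this heuristic: a proper noun that happens to open a sentence is dropped along with any capitalized words that directly follow it, and the pronoun "I" is picked up as if it were a proper noun.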
