Find Uppercase Words with Regex

I have a string that looks like this:

df = '''
ACCP ACLL ADER ADERW AEAC AEACW AHI AIRTP AKO/A AKO/B ALIT AMHCU ANDAU APOPW AUGZ AUUD AUUDW 
AVDG AVDR AYTUP BBRX BCAC BCACU BCACW BCTX BCTXW BF/A BF/B BIO/B BRK/A BRK/B BRLIU BRPAU BWL/A 
CCZ CFCV CMCTP CMPX CNNB CNTX COMSW CPTAG CPTI CRD/A CRD/B CRTDW DDI DECZ DEFN DFH DRMT DSOC EAC 
EACPW: No data found, symbol may be delisted- ECC : No data found, symbol may be delisted- ECOM :
No data found, symbol may be delisted
'''

I need to extract all the symbols from this string with a regex, producing a result like this:

result = 'ACCP ACLL ADER ADERW AEAC AEACW AHI AIRTP AKO/A AKO/B ALIT AMHCU ANDAU APOPW AUGZ AUUD AUUDW 
AVDG AVDR AYTUP BBRX BCAC BCACU BCACW BCTX BCTXW BF/A BF/B BIO/B BRK/A BRK/B BRLIU BRPAU BWL/A 
CCZ CFCV CMCTP CMPX CNNB CNTX COMSW CPTAG CPTI CRD/A CRD/B CRTDW DDI DECZ DEFN DFH DRMT DSOC EAC 
EACPW ECC ECOM'

I have already tried to get all words starting with two uppercase letters with this:

"\b[A-Z]{2}\b"

And also this one:

"\b[A-Z]+[A-Z\/]+\b"

The last one works, but only on the first word of the string, so maybe the issue is that it does not account for the spaces between words. Either way, neither pattern worked in this case.

What regex pattern is needed here?

All you need is a simple list comprehension.

For example:

df = '''
ACCP ACLL ADER ADERW AEAC AEACW AHI AIRTP AKO/A AKO/B ALIT AMHCU ANDAU APOPW AUGZ AUUD AUUDW 
AVDG AVDR AYTUP BBRX BCAC BCACU BCACW BCTX BCTXW BF/A BF/B BIO/B BRK/A BRK/B BRLIU BRPAU BWL/A 
CCZ CFCV CMCTP CMPX CNNB CNTX COMSW CPTAG CPTI CRD/A CRD/B CRTDW DDI DECZ DEFN DFH DRMT DSOC EAC 
EACPW: No data found, symbol may be delisted- ECC : No data found, symbol may be delisted- ECOM :
No data found, symbol may be delisted
'''

print([w for w in df.split() if w.isupper() and len(w) > 2])

Output:

['ACCP', 'ACLL', 'ADER', 'ADERW', 'AEAC', 'AEACW', 'AHI', 'AIRTP', 'AKO/A', 'AKO/B', 'ALIT', 'AMHCU', 'ANDAU', 'APOPW', 'AUGZ', 'AUUD', 'AUUDW', 'AVDG', 'AVDR', 'AYTUP', 'BBRX', 'BCAC', 'BCACU', 'BCACW', 'BCTX', 'BCTXW', 'BF/A', 'BF/B', 'BIO/B', 'BRK/A', 'BRK/B', 'BRLIU', 'BRPAU', 'BWL/A', 'CCZ', 'CFCV', 'CMCTP', 'CMPX', 'CNNB', 'CNTX', 'COMSW', 'CPTAG', 'CPTI', 'CRD/A', 'CRD/B', 'CRTDW', 'DDI', 'DECZ', 'DEFN', 'DFH', 'DRMT', 'DSOC', 'EAC', 'EACPW:', 'ECC', 'ECOM']
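Note that 'EACPW:' keeps its trailing colon in the output above, because str.isupper() only looks at the letters in a string and ignores punctuation. A minimal sketch of one way to drop it, assuming a colon is the only punctuation that can trail a symbol; the shortened sample string and the name symbols are illustrative and not part of the original answer:

```python
# Shortened sample in the same shape as the string above (illustrative only).
sample = "AHI AKO/A EAC EACPW: No data found, symbol may be delisted- ECC : No data found"

# Same filter as above, but strip a trailing ':' from each token that passes.
symbols = [w.rstrip(":") for w in sample.split() if w.isupper() and len(w) > 2]
print(symbols)
```

Keep in mind that the len(w) > 2 condition would also discard any one- or two-letter symbol, should the data ever contain one.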

Admittedly, you might want to refine this, e.g. by using a set, but it seemingly gets all the ticker symbols:

import string

# Keep only the words made up entirely of uppercase ASCII letters and '/'.
ticker = [
    word for word in df.split()
    if all(char in string.ascii_uppercase + '/' for char in word)
]
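The refinement with a set mentioned above might look like the sketch below. It also illustrates a limitation of this kind of filter: a token such as 'EACPW:' is rejected outright, because ':' is not an allowed character, so the second list strips a trailing colon before testing. The names ALLOWED and is_ticker and the shortened sample string are hypothetical.

```python
import string

# Allowed characters: uppercase ASCII letters plus '/', as in the filter above.
ALLOWED = set(string.ascii_uppercase) | {"/"}

def is_ticker(word):
    """Return True if word is non-empty and uses only allowed characters."""
    return bool(word) and set(word) <= ALLOWED

# Shortened sample in the same shape as the question's string (illustrative only).
sample = "AHI AKO/A EACPW: No data found- ECC : No data"

# Strict filter: 'EACPW:' is dropped because of its colon.
strict = [w for w in sample.split() if is_ticker(w)]

# Lenient filter: strip a trailing ':' first, then test what is left.
lenient = [w.rstrip(":") for w in sample.split() if is_ticker(w.rstrip(":"))]

print(strict)
print(lenient)
```

The non-empty check matters in the lenient version: a lone ':' becomes an empty string after stripping, and an empty set is a subset of every set.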

Another option is to use your second pattern \b[A-Z]+[A-Z\/]+\b with re.findall and then join the parts together.

import re

df = '''
ACCP ACLL ADER ADERW AEAC AEACW AHI AIRTP AKO/A AKO/B ALIT AMHCU ANDAU APOPW AUGZ AUUD AUUDW 
AVDG AVDR AYTUP BBRX BCAC BCACU BCACW BCTX BCTXW BF/A BF/B BIO/B BRK/A BRK/B BRLIU BRPAU BWL/A 
CCZ CFCV CMCTP CMPX CNNB CNTX COMSW CPTAG CPTI CRD/A CRD/B CRTDW DDI DECZ DEFN DFH DRMT DSOC EAC 
EACPW: No data found, symbol may be delisted- ECC : No data found, symbol may be delisted- ECOM :
No data found, symbol may be delisted
'''

result = ' '.join(re.findall(r"\b[A-Z]+[A-Z\/]+\b", df))
print(result)

Output:

ACCP ACLL ADER ADERW AEAC AEACW AHI AIRTP AKO/A AKO/B ALIT AMHCU ANDAU APOPW AUGZ AUUD AUUDW AVDG AVDR AYTUP BBRX BCAC BCACU BCACW BCTX BCTXW BF/A BF/B BIO/B BRK/A BRK/B BRLIU BRPAU BWL/A CCZ CFCV CMCTP CMPX CNNB CNTX COMSW CPTAG CPTI CRD/A CRD/B CRTDW DDI DECZ DEFN DFH DRMT DSOC EAC EACPW ECC ECOM
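The question reports that this pattern matched only the first word. The call used there is not shown, so the following is only a guess: re.search and re.match return at most one match, whereas re.findall returns every non-overlapping match. The sketch below contrasts the two on a shortened, illustrative sample, and also shows why the first attempt, \b[A-Z]{2}\b, can only find runs of exactly two uppercase letters.

```python
import re

# Shortened sample in the same shape as the question's string (illustrative only).
sample = "ACCP BF/A EAC EACPW: No data found- ECC :"
pattern = r"\b[A-Z]+[A-Z\/]+\b"

# re.search stops at the first match it finds.
first_only = re.search(pattern, sample).group()

# re.findall scans the whole string and collects every match.
all_matches = re.findall(pattern, sample)

# The first attempt requires a word boundary on both sides of exactly two letters.
two_letters = re.findall(r"\b[A-Z]{2}\b", sample)

print(first_only)
print(all_matches)
print(two_letters)
```

Because the pattern is [A-Z]+ followed by [A-Z\/]+, it needs at least two characters, so a single-letter symbol would not be matched.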
