简体   繁体   中英

Find Uppercase Words with Regex

I have a string that looks like this:

df = '''
ACCP ACLL ADER ADERW AEAC AEACW AHI AIRTP AKO/A AKO/B ALIT AMHCU ANDAU APOPW AUGZ AUUD AUUDW 
AVDG AVDR AYTUP BBRX BCAC BCACU BCACW BCTX BCTXW BF/A BF/B BIO/B BRK/A BRK/B BRLIU BRPAU BWL/A 
CCZ CFCV CMCTP CMPX CNNB CNTX COMSW CPTAG CPTI CRD/A CRD/B CRTDW DDI DECZ DEFN DFH DRMT DSOC EAC 
EACPW: No data found, symbol may be delisted- ECC : No data found, symbol may be delisted- ECOM :
No data found, symbol may be delisted
'''

And I need to extract from this string all the symbols with regex, getting something like this as result:

result = 'ACCP ACLL ADER ADERW AEAC AEACW AHI AIRTP AKO/A AKO/B ALIT AMHCU ANDAU APOPW AUGZ AUUD AUUDW 
AVDG AVDR AYTUP BBRX BCAC BCACU BCACW BCTX BCTXW BF/A BF/B BIO/B BRK/A BRK/B BRLIU BRPAU BWL/A 
CCZ CFCV CMCTP CMPX CNNB CNTX COMSW CPTAG CPTI CRD/A CRD/B CRTDW DDI DECZ DEFN DFH DRMT DSOC EAC 
EACPW ECC ECOM'

I have already tried to get all words starting with two uppercase letters with this:

"\b[A-Z]{2}\b"

And also this one:

"\b[A-Z]+[A-Z\/]+\b"

The last one works fine but only on the initial word of the string, so maybe is an issue with not taking into account the spaces between words, anyway, none worked in this case:

What would be the regex pattern needed in this case?

All you need is a simple list comprehension.

For example:

df = '''
ACCP ACLL ADER ADERW AEAC AEACW AHI AIRTP AKO/A AKO/B ALIT AMHCU ANDAU APOPW AUGZ AUUD AUUDW 
AVDG AVDR AYTUP BBRX BCAC BCACU BCACW BCTX BCTXW BF/A BF/B BIO/B BRK/A BRK/B BRLIU BRPAU BWL/A 
CCZ CFCV CMCTP CMPX CNNB CNTX COMSW CPTAG CPTI CRD/A CRD/B CRTDW DDI DECZ DEFN DFH DRMT DSOC EAC 
EACPW: No data found, symbol may be delisted- ECC : No data found, symbol may be delisted- ECOM :
No data found, symbol may be delisted
'''

print([w for w in df.split() if w.isupper() and len(w) > 2])

Output:

['ACCP', 'ACLL', 'ADER', 'ADERW', 'AEAC', 'AEACW', 'AHI', 'AIRTP', 'AKO/A', 'AKO/B', 'ALIT', 'AMHCU', 'ANDAU', 'APOPW', 'AUGZ', 'AUUD', 'AUUDW', 'AVDG', 'AVDR', 'AYTUP', 'BBRX', 'BCAC', 'BCACU', 'BCACW', 'BCTX', 'BCTXW', 'BF/A', 'BF/B', 'BIO/B', 'BRK/A', 'BRK/B', 'BRLIU', 'BRPAU', 'BWL/A', 'CCZ', 'CFCV', 'CMCTP', 'CMPX', 'CNNB', 'CNTX', 'COMSW', 'CPTAG', 'CPTI', 'CRD/A', 'CRD/B', 'CRTDW', 'DDI', 'DECZ', 'DEFN', 'DFH', 'DRMT', 'DSOC', 'EAC', 'EACPW:', 'ECC', 'ECOM']

Admittedly, you might want to refine this, eg by using a set , but it seemingly gets all the ticker symbols:

import string 

ticker = [
    word for word in df.split() if \
    all(char in string.ascii_uppercase + '/' for char in word)
]

Another option is to use your second pattern \b[AZ]+[AZ\/]+\b with re.findall and then join the parts together.

import re

df = '''
ACCP ACLL ADER ADERW AEAC AEACW AHI AIRTP AKO/A AKO/B ALIT AMHCU ANDAU APOPW AUGZ AUUD AUUDW 
AVDG AVDR AYTUP BBRX BCAC BCACU BCACW BCTX BCTXW BF/A BF/B BIO/B BRK/A BRK/B BRLIU BRPAU BWL/A 
CCZ CFCV CMCTP CMPX CNNB CNTX COMSW CPTAG CPTI CRD/A CRD/B CRTDW DDI DECZ DEFN DFH DRMT DSOC EAC 
EACPW: No data found, symbol may be delisted- ECC : No data found, symbol may be delisted- ECOM :
No data found, symbol may be delisted
'''

result = ' '.join(re.findall(r"\b[A-Z]+[A-Z\/]+\b", df))
print(result)

Output

ACCP ACLL ADER ADERW AEAC AEACW AHI AIRTP AKO/A AKO/B ALIT AMHCU ANDAU APOPW AUGZ AUUD AUUDW AVDG AVDR AYTUP BBRX BCAC BCACU BCACW BCTX BCTXW BF/A BF/B BIO/B BRK/A BRK/B BRLIU BRPAU BWL/A CCZ CFCV CMCTP CMPX CNNB CNTX COMSW CPTAG CPTI CRD/A CRD/B CRTDW DDI DECZ DEFN DFH DRMT DSOC EAC EACPW ECC ECOM

Python demo

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM