I have a string that looks like this:
df = '''
ACCP ACLL ADER ADERW AEAC AEACW AHI AIRTP AKO/A AKO/B ALIT AMHCU ANDAU APOPW AUGZ AUUD AUUDW
AVDG AVDR AYTUP BBRX BCAC BCACU BCACW BCTX BCTXW BF/A BF/B BIO/B BRK/A BRK/B BRLIU BRPAU BWL/A
CCZ CFCV CMCTP CMPX CNNB CNTX COMSW CPTAG CPTI CRD/A CRD/B CRTDW DDI DECZ DEFN DFH DRMT DSOC EAC
EACPW: No data found, symbol may be delisted- ECC : No data found, symbol may be delisted- ECOM :
No data found, symbol may be delisted
'''
I need to extract all the symbols from this string with a regex, getting something like this as the result:
result = 'ACCP ACLL ADER ADERW AEAC AEACW AHI AIRTP AKO/A AKO/B ALIT AMHCU ANDAU APOPW AUGZ AUUD AUUDW
AVDG AVDR AYTUP BBRX BCAC BCACU BCACW BCTX BCTXW BF/A BF/B BIO/B BRK/A BRK/B BRLIU BRPAU BWL/A
CCZ CFCV CMCTP CMPX CNNB CNTX COMSW CPTAG CPTI CRD/A CRD/B CRTDW DDI DECZ DEFN DFH DRMT DSOC EAC
EACPW ECC ECOM'
I have already tried to get all words starting with two uppercase letters with this:
"\b[A-Z]{2}\b"
And also this one:
"\b[A-Z]+[A-Z\/]+\b"
The last one works, but only on the first word of the string, so maybe the issue is that it does not account for the spaces between words. In any case, neither pattern worked here.
What would be the regex pattern needed in this case?
All you need is a simple list comprehension.
For example:
df = '''
ACCP ACLL ADER ADERW AEAC AEACW AHI AIRTP AKO/A AKO/B ALIT AMHCU ANDAU APOPW AUGZ AUUD AUUDW
AVDG AVDR AYTUP BBRX BCAC BCACU BCACW BCTX BCTXW BF/A BF/B BIO/B BRK/A BRK/B BRLIU BRPAU BWL/A
CCZ CFCV CMCTP CMPX CNNB CNTX COMSW CPTAG CPTI CRD/A CRD/B CRTDW DDI DECZ DEFN DFH DRMT DSOC EAC
EACPW: No data found, symbol may be delisted- ECC : No data found, symbol may be delisted- ECOM :
No data found, symbol may be delisted
'''
print([w for w in df.split() if w.isupper() and len(w) > 2])
Output:
['ACCP', 'ACLL', 'ADER', 'ADERW', 'AEAC', 'AEACW', 'AHI', 'AIRTP', 'AKO/A', 'AKO/B', 'ALIT', 'AMHCU', 'ANDAU', 'APOPW', 'AUGZ', 'AUUD', 'AUUDW', 'AVDG', 'AVDR', 'AYTUP', 'BBRX', 'BCAC', 'BCACU', 'BCACW', 'BCTX', 'BCTXW', 'BF/A', 'BF/B', 'BIO/B', 'BRK/A', 'BRK/B', 'BRLIU', 'BRPAU', 'BWL/A', 'CCZ', 'CFCV', 'CMCTP', 'CMPX', 'CNNB', 'CNTX', 'COMSW', 'CPTAG', 'CPTI', 'CRD/A', 'CRD/B', 'CRTDW', 'DDI', 'DECZ', 'DEFN', 'DFH', 'DRMT', 'DSOC', 'EAC', 'EACPW:', 'ECC', 'ECOM']
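Note that the output above still contains 'EACPW:' with its trailing colon, because str.isupper() ignores non-letter characters. A minimal sketch of one way to handle that, assuming the only punctuation glued to a symbol is a trailing colon (the function name extract_symbols and the short sample string are made up for illustration):

```python
def extract_symbols(text):
    # Strip a trailing colon first, so 'EACPW:' becomes 'EACPW'
    # and a lone ':' becomes an empty string that isupper() rejects.
    words = (w.rstrip(":") for w in text.split())
    # Same filter as before: all-uppercase words longer than two characters.
    return [w for w in words if w.isupper() and len(w) > 2]


sample = "DSOC EAC EACPW: No data found, symbol may be delisted- ECC : No data found"
print(extract_symbols(sample))
```

Keep in mind that the len(w) > 2 check also drops any genuine two-letter symbols, so adjust it if your data contains them.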
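Admittedly, you might want to refine this, e.g. by using a set, but it gets every ticker symbol that stands alone as a whitespace-separated word (a word with punctuation attached, such as EACPW:, is skipped):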
import string
# Keep only the words made up entirely of uppercase ASCII letters and '/'
ticker = [
    word for word in df.split()
    if all(char in string.ascii_uppercase + '/' for char in word)
]
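The refinement with a set could look something like the sketch below. It is only an illustration: the names ALLOWED and extract_tickers are hypothetical, and stripping a trailing colon is an assumption about how the error messages are formatted.

```python
import string

# Build the set of allowed characters once instead of on every comparison.
ALLOWED = frozenset(string.ascii_uppercase + "/")


def extract_tickers(text):
    tickers = []
    for word in text.split():
        # Assumption: a colon may be glued to the end of a symbol ('EACPW:').
        word = word.rstrip(":")
        # Skip empty leftovers (a lone ':') and anything with other characters.
        if word and all(ch in ALLOWED for ch in word):
            tickers.append(word)
    return tickers


sample = "BRK/A EAC EACPW: No data found, symbol may be delisted- ECC : No"
print(extract_tickers(sample))
```

This version has no length check, so a single uppercase letter or a bare '/' would also pass the filter. Add a minimum length if that matters for your data.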
Another option is to use your second pattern \b[A-Z]+[A-Z\/]+\b with re.findall and then join the parts together.
import re
df = '''
ACCP ACLL ADER ADERW AEAC AEACW AHI AIRTP AKO/A AKO/B ALIT AMHCU ANDAU APOPW AUGZ AUUD AUUDW
AVDG AVDR AYTUP BBRX BCAC BCACU BCACW BCTX BCTXW BF/A BF/B BIO/B BRK/A BRK/B BRLIU BRPAU BWL/A
CCZ CFCV CMCTP CMPX CNNB CNTX COMSW CPTAG CPTI CRD/A CRD/B CRTDW DDI DECZ DEFN DFH DRMT DSOC EAC
EACPW: No data found, symbol may be delisted- ECC : No data found, symbol may be delisted- ECOM :
No data found, symbol may be delisted
'''
result = ' '.join(re.findall(r"\b[A-Z]+[A-Z\/]+\b", df))
print(result)
Output:
ACCP ACLL ADER ADERW AEAC AEACW AHI AIRTP AKO/A AKO/B ALIT AMHCU ANDAU APOPW AUGZ AUUD AUUDW AVDG AVDR AYTUP BBRX BCAC BCACU BCACW BCTX BCTXW BF/A BF/B BIO/B BRK/A BRK/B BRLIU BRPAU BWL/A CCZ CFCV CMCTP CMPX CNNB CNTX COMSW CPTAG CPTI CRD/A CRD/B CRTDW DDI DECZ DEFN DFH DRMT DSOC EAC EACPW ECC ECOM
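As for why the attempts in the question fell short: \b[A-Z]{2}\b only matches words that are exactly two uppercase letters long, and if the second pattern was run with re.search or re.match (a guess, since the question does not show the call), you only get the first match back. The sketch below shows both effects on a shortened sample string, plus a slightly stricter pattern that is purely a suggestion and not part of the original answer.

```python
import re

sample = "ACCP AKO/A BF/A EACPW: No data found, symbol may be delisted- ECC :"

# Exactly two uppercase letters between word boundaries: only 'BF' qualifies.
two_letters = re.findall(r"\b[A-Z]{2}\b", sample)

# re.search stops at the first match, which looks like "only the first word".
first_only = re.search(r"\b[A-Z]+[A-Z\/]+\b", sample).group()

# Suggested variant: two or more letters, optionally followed by '/LETTERS'.
strict = re.findall(r"\b[A-Z]{2,}(?:/[A-Z]+)?\b", sample)

print(two_letters)
print(first_only)
print(" ".join(strict))
```

The stricter pattern allows at most one slash-separated suffix, so it will not match a bare run of slashes after the letters.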