
Pandas Regex: Separate name from string that starts with word or start of string, and ends in certain words

I have a pandas series that contains rows of share names amongst other details:

Netflix DIVIDEND
Apple Inc (All Sessions) COMM
Intel Corporation CONS
Correction Netflix Section 31 Fee

I'm trying to use a regex to retrieve the stock name, which I did with this lookahead:

transactions_df["Share Name"] = transactions_df["MarketName"].str.extract(r"(^.*?(?=DIVIDEND|\(All|CONS|COMM|Section))")

The only thing I'm having trouble with is the row "Correction Netflix Section 31 Fee", where my regex returns the share name as "Correction Netflix". I don't want the word "Correction".

I need my regular expression to check for either the start of the string, OR the word "Correction ".

I tried a few things, such as an OR (|) with the start-of-string anchor ^. I also tried a lookbehind to check for ^ or Correction, but the error says lookbehinds need to be fixed-width.

r"((^|Correction ).*?(?=DIVIDEND|\(All|CONS|COMM|Section))"

gives an error: ValueError: Wrong number of items passed 2, placement implies 1. I'm new to regex, so I don't really know what this means.

You could make the "Correction" prefix optional and, instead of lookarounds, use a capture group for the name followed by a plain match of the terminating word:

^(?:Correction\s*)?(\S.*?)\s*(?:\([^()]*\)|DIVIDEND|All|CONS|COMM|Section)
  • ^ Start of string
  • (?:Correction\s*)? Optionally match the word Correction followed by 0+ whitespace chars (non-capturing, so it is not part of the result)
  • (\S.*?)\s* Capture in group 1, matching a non-whitespace char followed by as few chars as possible, then match (not capture) 0+ whitespace chars
  • (?: Non-capturing group for the alternation |
    • \([^()]*\) Match from ( to the closing )
    • | Or
    • DIVIDEND|All|CONS|COMM|Section Match any of the words
  • ) Close group
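
The breakdown above can be checked without pandas, using only the standard `re` module. This is a minimal sketch over the four sample rows from the question; the helper name `extract_name` is made up for illustration.

```python
import re

# Same pattern as above: optional "Correction" prefix, lazy capture of the
# name, then one of the terminating alternatives.
PATTERN = re.compile(
    r"^(?:Correction\s*)?(\S.*?)\s*(?:\([^()]*\)|DIVIDEND|All|CONS|COMM|Section)"
)

def extract_name(text):
    """Return the captured share name, or None if the pattern does not match."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

samples = [
    "Netflix DIVIDEND",
    "Apple Inc (All Sessions) COMM",
    "Intel Corporation CONS",
    "Correction Netflix Section 31 Fee",
]

names = [extract_name(s) for s in samples]
print(names)
# ['Netflix', 'Apple Inc', 'Intel Corporation', 'Netflix']
```

Because the trailing `\s*` sits outside group 1, the captured name comes back without the space that precedes the terminating word.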

Regex demo

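import pandas as pd

data = ["Netflix DIVIDEND", "Apple Inc (All Sessions) COMM", "Intel Corporation CONS", "Correction Netflix Section 31 Fee"]
# Optional "Correction" prefix, then capture the name lazily up to a terminating keyword
pattern = r"^(?:Correction\s*)?(\S.*?)\s*(?:\([^()]*\)|DIVIDEND|All|CONS|COMM|Section)"
transactions_df = pd.DataFrame(data, columns=["MarketName"])
# The pattern has exactly one capture group, so str.extract yields a single column
transactions_df["Share Name"] = transactions_df["MarketName"].str.extract(pattern)
print(transactions_df)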

Output

                          MarketName         Share Name
0                   Netflix DIVIDEND            Netflix
1      Apple Inc (All Sessions) COMM          Apple Inc
2             Intel Corporation CONS  Intel Corporation
3  Correction Netflix Section 31 Fee            Netflix
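
As for the ValueError in the question: `str.extract` returns one column per capture group, so a pattern with two capturing groups produces a two-column result that cannot be assigned to a single column. The sketch below illustrates this on a shortened two-row series (an assumption made for brevity), alongside a non-capturing variant of the question's lookahead pattern. The `.str.strip()` call is there because the lookahead version leaves a trailing space in the capture.

```python
import pandas as pd

s = pd.Series(["Netflix DIVIDEND", "Correction Netflix Section 31 Fee"])

# Pattern from the question: the outer (...) and the inner (^|Correction )
# are both capturing groups, so extract returns two columns.
two_groups = s.str.extract(r"((^|Correction ).*?(?=DIVIDEND|\(All|CONS|COMM|Section))")
print(two_groups.shape)  # (2, 2)

# Making the prefix non-capturing with (?:...) leaves a single group;
# expand=False returns a Series instead of a one-column DataFrame.
one_group = s.str.extract(
    r"^(?:Correction )?(.*?)(?=DIVIDEND|\(All|CONS|COMM|Section)",
    expand=False,
).str.strip()
print(one_group.tolist())  # ['Netflix', 'Netflix']
```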
