Pandas regex: separate a name from a string that starts with a word or the start of the string, and ends with certain words

Question

I have a pandas Series that contains rows of share names amongst other details:

Netflix DIVIDEND
Apple Inc (All Sessions) COMM
Intel Corporation CONS
Correction Netflix Section 31 Fee

I'm trying to use a regex to retrieve the stock name, which I did with this lookahead:

transactions_df["Share Name"] = transactions_df["MarketName"].str.extract(r"(^.*?(?=DIVIDEND|\(All|CONS|COMM|Section))")

The only thing I'm having trouble with is the row Correction Netflix Section 31 Fee, where my regex returns the share name as Correction Netflix. I don't want the word "Correction".

I need my regular expression to check for either the start of the string, OR the word "Correction ".

I tried a few things, such as an OR | with the start-of-string character ^. I also tried a lookbehind to check for ^ or Correction, but the error says lookbehinds need to be constant length.

r"((^|Correction ).*?(?=DIVIDEND|\(All|CONS|COMM|Section))"

gives an error: ValueError: Wrong number of items passed 2, placement implies 1. I'm new to regex so I don't really know what this means.

Answer 1

You could use an optional part, and instead of lookarounds use a capture group with a match:

^(?:Correction\s*)?(\S.*?)\s*(?:\([^()]*\)|DIVIDEND|All|CONS|COMM|Section)

^ Start of string
(?:Correction\s*)? Optionally match Correction followed by 0+ whitespace chars (not captured)
(\S.*?)\s* Capture in group 1, matching a non-whitespace char followed by as few chars as possible, then match (not capture) 0+ whitespace chars
(?: Non-capture group for the alternation
- \([^()]*\) Match from ( till )
- | Or
- DIVIDEND|All|CONS|COMM|Section Match any of the words
) Close group
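
The breakdown above can also be written into the pattern itself. The following is a minimal sketch using only the standard `re` module with `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments. The names `SHARE_NAME` and `share_name` are illustrative and not part of the original answer.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part carries a comment.
SHARE_NAME = re.compile(
    r"""
    ^                     # start of string
    (?:Correction\s*)?    # optional leading Correction, not captured
    (\S.*?)               # group 1: the share name, as few chars as possible
    \s*                   # whitespace before the terminator, not captured
    (?:                   # alternation of terminators that end the name
        \([^()]*\)        #   a parenthesised part, from ( till )
      | DIVIDEND | All | CONS | COMM | Section
    )
    """,
    re.VERBOSE,
)


def share_name(market_name):
    """Return group 1 (the share name), or None if no terminator is found."""
    match = SHARE_NAME.search(market_name)
    return match.group(1) if match else None


rows = [
    "Netflix DIVIDEND",
    "Apple Inc (All Sessions) COMM",
    "Intel Corporation CONS",
    "Correction Netflix Section 31 Fee",
]
for row in rows:
    print(row, "->", share_name(row))
```

Rows without any of the terminator words give `None` here, which corresponds to `NaN` in the column produced by `str.extract`.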

Regex demo

import pandas as pd

data = ["Netflix DIVIDEND", "Apple Inc (All Sessions) COMM", "Intel Corporation CONS", "Correction Netflix Section 31 Fee"]
pattern = r"^(?:Correction\s*)?(\S.*?)\s*(?:\([^()]*\)|DIVIDEND|All|CONS|COMM|Section)"
transactions_df = pd.DataFrame(data, columns=["MarketName"])
# The pattern has a single capture group, so str.extract yields a single column
transactions_df["Share Name"] = transactions_df["MarketName"].str.extract(pattern)
print(transactions_df)

Output

                          MarketName         Share Name
0                   Netflix DIVIDEND            Netflix
1      Apple Inc (All Sessions) COMM          Apple Inc
2             Intel Corporation CONS  Intel Corporation
3  Correction Netflix Section 31 Fee            Netflix
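
As for the ValueError in the question: `str.extract` returns one column per capture group, so a pattern with two capture groups produces two columns, and those cannot be assigned to the single "Share Name" column. A small check with the standard `re` module shows the group counts. The variable names below are illustrative only.

```python
import re

# Lookahead shared by both patterns from the question (a lookahead is not a capture group).
terminators = r"(?=DIVIDEND|\(All|CONS|COMM|Section)"

# The attempt from the question: an outer group plus an inner (^|Correction ) group.
attempt = re.compile(r"((^|Correction ).*?" + terminators + ")")
# The original pattern from the question, with one capture group.
original = re.compile(r"(^.*?" + terminators + ")")
# Making the inner group non-capturing brings the count back to one.
non_capturing = re.compile(r"((?:^|Correction ).*?" + terminators + ")")

print(attempt.groups)        # 2
print(original.groups)       # 1
print(non_capturing.groups)  # 1

# The non-capturing variant still keeps "Correction": at position 0 the ^ branch matches first.
print(repr(non_capturing.search("Correction Netflix Section 31 Fee").group(1)))
```

Reducing the group count removes the error but not the unwanted word, which is why the answer's pattern consumes `Correction` in an optional non-capturing prefix before the capture group starts.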

Answer 1
2 · accepted · 2021-04-29 14:33:31
