Pandas Regex: Separate name from string that starts with word or start of string, and ends in certain words
I have a pandas Series that contains rows of share names amongst other details:
Netflix DIVIDEND
Apple Inc (All Sessions) COMM
Intel Corporation CONS
Correction Netflix Section 31 Fee
I'm trying to use a regex to retrieve the stock name, which I did with this lookahead:
transactions_df["Share Name"] = transactions_df["MarketName"].str.extract(r"(^.*?(?=DIVIDEND|\(All|CONS|COMM|Section))")
The only thing I'm having trouble with is the row Correction Netflix Section 31 Fee, where my regex is getting the share name as Correction Netflix. I don't want the word "Correction".
I need my regular expression to check for either the start of the string, OR the word "Correction ".
I tried a few things, such as an OR | with the start-of-string character ^. I also tried a lookbehind to check for ^ or Correction, but the error says lookbehinds need to be constant length.
r"((^|Correction ).*?(?=DIVIDEND|\(All|CONS|COMM|Section))"
gives an error:
ValueError: Wrong number of items passed 2, placement implies 1
I'm new to regex, so I don't really know what this means.
You could use an optional part, and instead of lookarounds use a capture group with a match:
^(?:Correction\s*)?(\S.*?)\s*(?:\([^()]*\)|DIVIDEND|All|CONS|COMM|Section)
^  Start of string
(?:Correction\s*)?  Optionally match Correction followed by 0+ whitespace chars (not captured)
(\S.*?)\s*  Capture in group 1, matching a non-whitespace char and then as few chars as possible, and match (not capture) 0+ whitespace chars
(?:  Non-capture group for the alternation |
\([^()]*\)  Match from ( till )
|  Or
DIVIDEND|All|CONS|COMM|Section  Match any of the words
)  Close group
import pandas as pd

data = ["Netflix DIVIDEND", "Apple Inc (All Sessions) COMM", "Intel Corporation CONS", "Correction Netflix Section 31 Fee"]
pattern = r"^(?:Correction\s*)?(\S.*?)\s*(?:\([^()]*\)|DIVIDEND|All|CONS|COMM|Section)"
transactions_df = pd.DataFrame(data, columns = ['MarketName'])
transactions_df["Share Name"] = transactions_df["MarketName"].str.extract(pattern)
print(transactions_df)
Output
0 Netflix DIVIDEND Netflix
1 Apple Inc (All Sessions) COMM Apple Inc
2 Intel Corporation CONS Intel Corporation
3 Correction Netflix Section 31 Fee Netflix
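As for the ValueError in the question: str.extract returns one column per capture group, and the attempted pattern has two of them (the outer group plus the inner (^|Correction ) group), so the result has two columns and cannot be assigned to a single "Share Name" column. The exact error text may differ between pandas versions. A minimal sketch of the difference, using a small hypothetical Series s rather than the real transactions_df:

```python
import pandas as pd

# Hypothetical sample data, a subset of the rows from the question
s = pd.Series(["Netflix DIVIDEND", "Correction Netflix Section 31 Fee"])

# The pattern from the question: two capture groups -> two result columns
two_groups = r"((^|Correction ).*?(?=DIVIDEND|\(All|CONS|COMM|Section))"
result = s.str.extract(two_groups)
print(result.shape)

# The pattern from the answer: the prefix is non-capturing, so only one group remains
one_group = r"^(?:Correction\s*)?(\S.*?)\s*(?:\([^()]*\)|DIVIDEND|All|CONS|COMM|Section)"
names = s.str.extract(one_group, expand=False)  # expand=False returns a Series
print(names.tolist())
```

Making a group non-capturing with (?:...) keeps it out of the extracted columns, which is why the pattern in the answer can be assigned directly to one column.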