
pandas/regex: Remove everything after a hyphen or parenthesis (inclusive) in each comma-separated string of a pandas dataframe column

I have a dataframe with one column in which each value holds several strings separated by commas. From each of these strings I want to remove everything after the hyphen (including the hyphen). In some cases there is no hyphen but there is an opening parenthesis, and I want to remove everything from that point as well. The string after the comma should be kept and cleaned in the same way. How can I do this? You can see this case in the last row.

import pandas as pd

dd = pd.DataFrame()
dd['sin'] = ['U147(BCM), U35(BCM)','P01-00(ECM), P02-00(ECM)', 'P3-00(ECM), P032-00(ECM)','P034-00(ECM)', 'P23F5(PCM), P04-00(ECM)']

Expected output

dd['sin']
# output 
U147 U35
P01 P02
P3 P032
P034
P23F5 P04

I want to keep only the part of each string before the hyphen, parenthesis, or any other special character.

The following code seems to reproduce your desired result:

dd['sin'] = dd['sin'].str.split(", ")
dd = dd.explode('sin').reset_index()
dd['sin'] = dd['sin'].str.replace(r'\W.*', '', regex=True)

Which gives dd['sin'] as:

0     U147
1      U35
2      P01
3      P02
4       P3
5     P032
6     P034
7    P23F5
8      P04
Name: sin, dtype: object

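The call of .reset_index() in the second line is optional. explode keeps the original row label as the index of each piece, so without the call the index repeats (0, 0, 1, 1, ...). With it, those labels are moved into an index column and the rows are renumbered from 0, as in the output above.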
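The output above has one code per row, while the question asks for one space-separated string per original row. A minimal sketch of one way to get there, assuming the default integer index: skip .reset_index() and group the cleaned pieces by the row label that explode preserves. The names parts and result are only for illustration.

```python
import pandas as pd

dd = pd.DataFrame({
    'sin': ['U147(BCM), U35(BCM)', 'P01-00(ECM), P02-00(ECM)',
            'P3-00(ECM), P032-00(ECM)', 'P034-00(ECM)',
            'P23F5(PCM), P04-00(ECM)']
})

# Split each row into its codes; explode keeps the original row label as the index.
parts = dd['sin'].str.split(', ').explode()
# Drop everything from the first non-word character onward.
parts = parts.str.replace(r'\W.*', '', regex=True)
# Join the pieces that share an original row label back into one string.
dd['sin'] = parts.groupby(level=0).agg(' '.join)

result = dd['sin'].tolist()
print(dd)
```

Because the grouping relies on the index labels, this only works as written if the original index is unique.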

You can use the following regex:

r"-\d{2}|\([EBP]CM\)|\s"


Here is the code:

import pandas as pd

sin = ['U147(BCM), U35(BCM)','P01-00(ECM), P02-00(ECM)', 'P3-00(ECM), P032-00(ECM)','P034-00(ECM)', 'P23F5(PCM), P04-00(ECM)']

dd = pd.DataFrame()
dd['sin'] = sin
dd['sin'] = dd['sin'].str.replace(r'-\d{2}|\([EBP]CM\)|\s', '', regex=True)
print(dd)

OUTPUT:

         sin
0   U147,U35
1    P01,P02
2    P3,P032
3       P034
4  P23F5,P04
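To see what each alternative in this pattern removes, here is a small check using Python's re module on one of the sample values. The second input, 'P05-123(TCM)', is a made-up value that does not appear in the question. It is included only to illustrate that the pattern is tied to the sample data (a two-digit suffix and the module names ECM, BCM, PCM).

```python
import re

pattern = re.compile(r'-\d{2}|\([EBP]CM\)|\s')

sample = 'P01-00(ECM), P02-00(ECM)'
# Every substring the pattern matches, in order of appearance.
removed = pattern.findall(sample)
# The string with all of those matches deleted.
cleaned = pattern.sub('', sample)
print(removed, cleaned)

# Hypothetical input with a three-digit suffix and a different module name:
# only "-12" matches, so the result is not fully cleaned.
other = pattern.sub('', 'P05-123(TCM)')
print(other)
```

If the real data can contain other suffixes or module names, a more general pattern such as the \W.* one used in the other answer may be safer.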



EDIT

Or use this line to replace the comma with a space:

dd['sin'] = dd['sin'].str.replace(r'-\d{2}|\([EBP]CM\)|\s', '', regex=True).str.replace(',',' ')

OUTPUT:

         sin
0   U147 U35
1    P01 P02
2    P3 P032
3       P034
4  P23F5 P04
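As an alternative sketch (not taken from either answer above), the same result can be reached by extracting the wanted part instead of deleting the unwanted part. This assumes every code starts with a run of word characters, located either at the beginning of the string or right after a comma. The names codes and result are only for illustration.

```python
import pandas as pd

dd = pd.DataFrame({
    'sin': ['U147(BCM), U35(BCM)', 'P01-00(ECM), P02-00(ECM)',
            'P3-00(ECM), P032-00(ECM)', 'P034-00(ECM)',
            'P23F5(PCM), P04-00(ECM)']
})

# Capture the run of word characters at the start of the string
# or directly after a comma (with optional whitespace).
codes = dd['sin'].str.findall(r'(?:^|,\s*)(\w+)')
# Each cell is now a list of codes; join each list with a space.
dd['sin'] = codes.str.join(' ')

result = dd['sin'].tolist()
print(dd)
```

This version does not depend on the suffix being two digits or on specific module names, since anything after the first non-word character of each piece is simply never captured.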
