Pandas: how can I extract by regex from a column into multiple rows?

I have the following data:

ID  content                                                   date
1   2429(sach:MySpezialItem:16.59)                            2022-04-12
2   2429(sach:item 13:18.59)(sach:this and that costs:16.59)  2022-06-12

And I want to achieve the following:

ID  number  price  date
1   2429           2022-04-12
1           16.59  2022-04-12
2   2429           2022-06-12
2           18.59  2022-06-12
2           16.59  2022-06-12

What I tried:

df['sach'] = df['content'].str.split(r'\(sach:.*\)').explode('content')
df['content'] = df['content'].str.replace(r'\(sach:.*\)','', regex=True)

You can use a single regex with str.extractall:

regex = r'(?P<number>\d+)\(|:(?P<price>\d+(?:\.\d+)?)\)'

df = df.join(df.pop('content').str.extractall(regex).droplevel(1))
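To see why this works, here is a minimal self-contained sketch. The DataFrame construction is my assumption, rebuilt by hand from the sample data in the question. The regex has two alternatives: digits immediately followed by `(` are captured as `number`, and digits between a `:` and a `)` are captured as `price`. str.extractall returns one row per match, indexed by (original row, match number); dropping the second level lets join align the matches with the original rows.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed construction)
df = pd.DataFrame({
    "ID": [1, 2],
    "content": [
        "2429(sach:MySpezialItem:16.59)",
        "2429(sach:item 13:18.59)(sach:this and that costs:16.59)",
    ],
    "date": ["2022-04-12", "2022-06-12"],
})

regex = r'(?P<number>\d+)\(|:(?P<price>\d+(?:\.\d+)?)\)'

# One row per match; the index has two levels: (original row, match number)
matches = df["content"].str.extractall(regex)

# Drop the match-number level so the index lines up with df again
flat = matches.droplevel(1)

# Each original row is repeated once per match
result = df.drop(columns="content").join(flat)
print(result)
```

Since each match hits only one of the two alternatives, every row has a value in either `number` or `price` and NaN in the other.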

NB. if you want a new DataFrame, don't pop:

df2 = (df.drop(columns='content')
         .join(df['content'].str.extractall(regex).droplevel(1))
       )

Output:

   ID        date number  price
0   1  2022-04-12   2429    NaN
0   1  2022-04-12    NaN  16.59
1   2  2022-06-12   2429    NaN
1   2  2022-06-12    NaN  18.59
1   2  2022-06-12    NaN  16.59
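Note that str.extractall returns the captured text as strings, so `number` and `price` are not numeric yet. If you need numbers, one possible follow-up is pd.to_numeric. The sketch below rebuilds the output above by hand (the `out` name and its construction are only for illustration):

```python
import pandas as pd

nan = float("nan")

# Hand-built copy of the output above, with the captures still as strings
out = pd.DataFrame(
    {
        "ID": [1, 1, 2, 2, 2],
        "date": ["2022-04-12", "2022-04-12", "2022-06-12", "2022-06-12", "2022-06-12"],
        "number": ["2429", nan, "2429", nan, nan],
        "price": [nan, "16.59", nan, "18.59", "16.59"],
    },
    index=[0, 0, 1, 1, 1],
)

# Convert both captured columns from strings to floats; NaN stays NaN
out[["number", "price"]] = out[["number", "price"]].apply(pd.to_numeric)
print(out.dtypes)
```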

