Extract data from a pandas dataframe strings column and generate new columns based on content in it
I have a pandas column which has data like this:
**Title **: New_ind
**Body **: Detection_error
*respo_URL **: www.github.com
**respo_status **: {color}
import pandas as pd

data = {'sl no': [661, 662],
        'key': ['3484', '3483'],
        'id': [13592349, 13592490],
        'Sum': ['[E-1]', '[E-1]'],
        'Desc': [
            "**Title **: New_ind\n\n**Body **: Detection_error\n\n*respo_URL **: www.github.com\n\n**respo_status **: {yellow}",
            "**Title **: New_ind2\n\n**Body **: import_error\n\n*respo_URL **: \n\n**respo_status **: {green}"]}
df = pd.DataFrame(data)
I need to generate new columns where Title, Body, respo_URL, etc. are the column names and everything after the colon is the value in the corresponding cell. Note that the items in the column are not dictionaries.
There are various ways to do that with regex, but I found this approach with str methods to be the clearest:
desc_df = df["Desc"].str.split("\n\n", expand=True)
for col in desc_df.columns:
desc_df[col] = desc_df[col].str.split(":").str[1].str.strip()
colnames = "Title", "Body", "respo_URL", "respo_status"
desc_df = desc_df.rename(columns=dict(enumerate(colnames)))
df = pd.concat([df.drop(columns="Desc"), desc_df], axis=1)
First split the column Desc at \n\n and expand the result into a dataframe desc_df. Then split each of the new columns at the colon, take the right side, and strip whitespace. Finally rename the columns and concatenate the initial dataframe, without the Desc column, with desc_df.

Result for the sample:
   sl no   key        id    Sum     Title             Body       respo_URL  \
0    661  3484  13592349  [E-1]   New_ind  Detection_error  www.github.com
1    662  3483  13592490  [E-1]  New_ind2     import_error

  respo_status
0     {yellow}
1      {green}
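One caveat with splitting at the colon: a value that itself contains a colon (for example a URL starting with https://) would be cut off, because only the piece right after the first colon is kept. The following is a minimal sketch, assuming such values can occur, that limits the split to the first colon with n=1. The helper name split_desc and the one-row sample are made up for illustration and are not part of the original question.

```python
import pandas as pd

def split_desc(df, colnames=("Title", "Body", "respo_URL", "respo_status")):
    """Split the Desc column into one column per field (hypothetical helper)."""
    # One column per "label: value" block; blocks are separated by a blank line.
    desc_df = df["Desc"].str.split("\n\n", expand=True)
    for col in desc_df.columns:
        # n=1 splits only at the first colon, so colons inside the value survive.
        desc_df[col] = desc_df[col].str.split(":", n=1).str[1].str.strip()
    # The positional columns 0..3 get the expected field names.
    desc_df = desc_df.rename(columns=dict(enumerate(colnames)))
    return pd.concat([df.drop(columns="Desc"), desc_df], axis=1)

# Illustrative sample whose URL contains a colon (assumed, not from the question).
sample = pd.DataFrame({
    "id": [1],
    "Desc": ["**Title **: New_ind\n\n**Body **: Detection_error\n\n"
             "*respo_URL **: https://www.github.com\n\n**respo_status **: {yellow}"],
})
result = split_desc(sample)
print(result.loc[0, "respo_URL"])
```

Like the version above, this still relies on every row having the same fields in the same order, since the column names are assigned by position.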
The following regex version worked for the sample, but I think it's not as robust as the other one:
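# Raw f-string so that \* reaches the regex engine unchanged
# (a plain f-string triggers an invalid-escape warning).
pattern = "\n\n".join(
    rf"\*+{col} \*+: (?P<{col}>[^\n]*)"
    for col in ("Title", "Body", "respo_URL", "respo_status")
)
desc_df = df["Desc"].str.extract(pattern)
df = pd.concat([df.drop(columns="Desc"), desc_df], axis=1)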
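Part of what makes the pattern above fragile is that it hard-codes the field names and their order. As a sketch of one possible alternative (not the approach from the answer), str.extractall can pull every label and value pair and the labels can then be pivoted into columns. The regular expression below assumes each label is a single word wrapped in asterisks, as in the sample data, and the sample frame is trimmed to two columns for brevity.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [13592349, 13592490],
    "Desc": [
        "**Title **: New_ind\n\n**Body **: Detection_error\n\n"
        "*respo_URL **: www.github.com\n\n**respo_status **: {yellow}",
        "**Title **: New_ind2\n\n**Body **: import_error\n\n"
        "*respo_URL **: \n\n**respo_status **: {green}",
    ],
})

# One match per "label: value" line; the label sits between runs of asterisks.
pairs = df["Desc"].str.extractall(r"\*+(?P<field>\w+) *\*+: *(?P<value>[^\n]*)")

# Drop the per-row match counter, then turn the labels into column names.
desc_df = (
    pairs.reset_index(level="match", drop=True)
    .set_index("field", append=True)["value"]
    .unstack("field")
)
df = pd.concat([df.drop(columns="Desc"), desc_df], axis=1)
print(df)
```

With this variant the new columns come out in sorted label order rather than the order in the text, and a field with no value (the second respo_URL here) ends up as a missing value instead of an empty string.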