Extract data from a pandas dataframe strings column and generate new columns based on content in it

I have a pandas column which has data like this:

**Title **: New_ind

**Body **: Detection_error

*respo_URL **: www.github.com

**respo_status **: {color}

import pandas as pd

# Sample frame: each "Desc" cell holds several "name: value" fields separated by blank lines
data = {'sl no': [661, 662],
        'key': ['3484', '3483'],
        'id': [13592349, 13592490],
        'Sum': ['[E-1]', '[E-1]'],
        'Desc': [
            "**Title **: New_ind\n\n**Body **: Detection_error\n\n*respo_URL **: www.github.com\n\n**respo_status **: {yellow}",
            "**Title **: New_ind2\n\n**Body **: import_error\n\n*respo_URL **: \n\n**respo_status **: {green}"]}

df = pd.DataFrame(data)

I need to generate new columns where Title, Body, respo_URL, etc. are the column names and everything after the colon is the value in the corresponding cell. Note that the items in the column are strings, not dictionaries.

There are various ways to do that with regex, but I found this version using str methods to be the clearest:

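# One column per field: split each Desc cell at the blank lines
desc_df = df["Desc"].str.split("\n\n", expand=True)
for col in desc_df.columns:
    # n=1 splits at the first colon only, so a value that itself contains ":" stays intact
    desc_df[col] = desc_df[col].str.split(":", n=1).str[1].str.strip()
# Rename the positional columns 0..3 to the field names
colnames = "Title", "Body", "respo_URL", "respo_status"
desc_df = desc_df.rename(columns=dict(enumerate(colnames)))
# Swap the original Desc column for the extracted columns
df = pd.concat([df.drop(columns="Desc"), desc_df], axis=1)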
  • First split column Desc at \n\n and expand the result into a dataframe desc_df.
  • Then split each new column at :, take the right side, and strip whitespace.
  • Finally change the column names and concat the initial dataframe (without the Desc column) with desc_df.

Result for the sample:

   sl no   key        id    Sum     Title             Body       respo_URL  \
0    661  3484  13592349  [E-1]   New_ind  Detection_error  www.github.com   
1    662  3483  13592490  [E-1]  New_ind2     import_error                   

  respo_status  
0     {yellow}  
1      {green}
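
To see what the str-based steps do to one cell, here is a minimal plain-Python sketch of the same idea. The helper name split_desc is hypothetical and not part of the answer; it assumes the fields are separated by blank lines and that the first colon in each chunk ends the field name.

```python
def split_desc(text):
    """Return the field values of a single Desc cell, in order of appearance."""
    values = []
    for part in text.split("\n\n"):        # one chunk per field
        _, _, right = part.partition(":")  # keep what follows the first colon
        values.append(right.strip())       # drop surrounding whitespace
    return values


row0 = ("**Title **: New_ind\n\n**Body **: Detection_error\n\n"
        "*respo_URL **: www.github.com\n\n**respo_status **: {yellow}")
row1 = ("**Title **: New_ind2\n\n**Body **: import_error\n\n"
        "*respo_URL **: \n\n**respo_status **: {green}")

print(split_desc(row0))  # ['New_ind', 'Detection_error', 'www.github.com', '{yellow}']
print(split_desc(row1))  # ['New_ind2', 'import_error', '', '{green}']
```

The empty respo_URL in the second row comes out as an empty string, which matches the blank cell in the result table.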

The following regex version worked for the sample, but I think it's not as robust as the other one:

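pattern = "\n\n".join(
    # raw f-string, so \* reaches the regex engine as an escaped asterisk
    rf"\*+{col} \*+: (?P<{col}>[^\n]*)"
    for col in ("Title", "Body", "respo_URL", "respo_status")
)
# Each named group becomes a column of desc_df
desc_df = df["Desc"].str.extract(pattern)
df = pd.concat([df.drop(columns="Desc"), desc_df], axis=1)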
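
One reason the pattern above may be fragile is that it hard-codes both the field names and their order. As a rough alternative sketch (not the answer's method), the names can be read from the text itself with the standard re module. The FIELD pattern and the parse_fields helper are hypothetical, and they assume every field looks like asterisks, a name made of word characters, asterisks, a colon, then the value up to the end of the line.

```python
import re

# asterisks, field name, asterisks, colon, then the value up to the end of the line
FIELD = re.compile(r"\*+\s*(\w+)\s*\*+\s*:[ \t]*([^\n]*)")


def parse_fields(text):
    """Map each field name found in a Desc cell to its stripped value."""
    return {name: value.strip() for name, value in FIELD.findall(text)}


row1 = ("**Title **: New_ind2\n\n**Body **: import_error\n\n"
        "*respo_URL **: \n\n**respo_status **: {green}")

print(parse_fields(row1))
# {'Title': 'New_ind2', 'Body': 'import_error', 'respo_URL': '', 'respo_status': '{green}'}
```

With pandas, something like pd.DataFrame(df["Desc"].map(parse_fields).tolist(), index=df.index) should turn these dictionaries into columns, with missing fields showing up as NaN.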


