Pandas dataframe: select list items in a column, then transform the item strings

Question

One of the columns I'm importing into my dataframe is structured as a list. I need to pick out certain values from that list, transform each value, and add it to one of two new columns in the dataframe. Before:

Name	Listed_Items
Tom	["dr_md_coca_cola", "dr_od_water", "potatoes", "grass", "ot_other_stuff"]
Steve	["dr_od_orange_juice", "potatoes", "grass", "ot_other_stuff", "dr_md_pepsi"]
Phil	["dr_md_dr_pepper", "potatoes", "grass", "dr_od_coffee", "ot_other_stuff"]

From what I've read, I can turn the column into a list:

df["listed_items"] = df["listed_items"].apply(eval)

But then I can't see how to find the list item that starts with dr_md, extract it, remove the leading dr_md, replace any underscores, capitalize the first letter, and add the result to a new MD column in the row. Then the same again for dr_od. In each row, exactly one item in the list starts with dr_md and exactly one starts with dr_od. Desired output:

Name	MD	OD
Tom	Coca Cola	Water
Steve	Pepsi	Orange Juice
Phil	Dr Pepper	Coffee

Answer 1

Use pivot_table:

df = df.explode('Listed_Items')
df = df[df.Listed_Items.str.contains('dr_')]

df['Type'] = df['Listed_Items'].str.contains('dr_md').map({True: 'MD', 
                                                           False: 'OD'})

df.pivot_table(values='Listed_Items', 
               columns='Type', 
               index='Name',
               aggfunc='first')

Type                MD                  OD
Name                                      
Phil   dr_md_dr_pepper        dr_od_coffee
Steve      dr_md_pepsi  dr_od_orange_juice
Tom    dr_md_coca_cola         dr_od_water

From here it's just a matter of beautifying your dataset as you wish.
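One possible way to finish that cleanup, sketched on the question's sample data: strip the dr_md_ / dr_od_ prefix, replace underscores with spaces, and title-case the result. The helper name prettify and the variable names long_df, wide and result are hypothetical, and the regular expression assumes only these two prefixes occur.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["Tom", "Steve", "Phil"],
    "Listed_Items": [
        ["dr_md_coca_cola", "dr_od_water", "potatoes", "grass", "ot_other_stuff"],
        ["dr_od_orange_juice", "potatoes", "grass", "ot_other_stuff", "dr_md_pepsi"],
        ["dr_md_dr_pepper", "potatoes", "grass", "dr_od_coffee", "ot_other_stuff"],
    ],
})

# One row per list item, keeping only items that start with dr_md_ or dr_od_.
long_df = df.explode("Listed_Items")
long_df = long_df[long_df["Listed_Items"].str.contains(r"^dr_(?:md|od)_")].copy()

# Characters 3-4 of the item are "md" or "od"; upper-case them for the column label.
long_df["Type"] = long_df["Listed_Items"].str[3:5].str.upper()

wide = long_df.pivot_table(values="Listed_Items", columns="Type",
                           index="Name", aggfunc="first")


def prettify(col: pd.Series) -> pd.Series:
    """Drop the prefix, turn underscores into spaces, capitalize each word."""
    return (col.str.replace(r"^dr_(?:md|od)_", "", regex=True)
               .str.replace("_", " ", regex=False)
               .str.title())


result = wide.apply(prettify).reset_index()
result.columns.name = None  # drop the leftover "Type" label on the columns
print(result)
```

Anchoring the pattern with ^ avoids picking up items that merely contain dr_ somewhere in the middle.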

Answer 2

What you need to do is write a function that does the processing and pass it into apply (or, in this case, map). Alternatively, you could expand your list column into multiple columns and process them afterwards, but that only works if your lists are always in the same order (see "panda expand columns with list into multiple columns"). Because you only have one input column, you can use map instead of apply.

def process_dr_md(l: list):
    for s in l:
        if s.startswith("dr_md_"):
            # You can process your string further here
            return s[6:]  # slice the matching item (not the list) to drop "dr_md_"

def process_dr_od(l: list):
    for s in l:
        if s.startswith("dr_od_"):
            # You can process your string further here
            return s[6:]  # slice the matching item (not the list) to drop "dr_od_"

df["listed_items"] = df["listed_items"].map(eval)
df["MD"] = df["listed_items"].map(process_dr_md)
df["OD"] = df["listed_items"].map(process_dr_od)
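As a rough sketch of what "process your string further" could look like, the two functions can be folded into one that takes the prefix as a parameter and also handles the underscore replacement and capitalization. The name extract_item is hypothetical, the sample rows are shortened from the question, and ast.literal_eval is swapped in for eval here because it only parses Python literals rather than executing arbitrary code.

```python
import ast
from typing import List, Optional

import pandas as pd


def extract_item(items: List[str], prefix: str) -> Optional[str]:
    """Return the first item starting with `prefix`, cleaned up, or None if absent."""
    for s in items:
        if s.startswith(prefix):
            # Drop the prefix, replace underscores, capitalize each word.
            return s[len(prefix):].replace("_", " ").title()
    return None


# Shortened sample: the list column arrives as strings, as in the question.
df = pd.DataFrame({
    "Name": ["Tom", "Steve"],
    "listed_items": [
        '["dr_md_coca_cola", "dr_od_water", "potatoes"]',
        '["dr_od_orange_juice", "grass", "dr_md_pepsi"]',
    ],
})

df["listed_items"] = df["listed_items"].map(ast.literal_eval)
df["MD"] = df["listed_items"].map(lambda items: extract_item(items, "dr_md_"))
df["OD"] = df["listed_items"].map(lambda items: extract_item(items, "dr_od_"))
print(df[["Name", "MD", "OD"]])
```

Using len(prefix) instead of a hard-coded 6 keeps the slice correct if a prefix of a different length is added later.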

I hope that gets you on your way!

Answer 3

I took a slightly different approach from the previous answers. Given a df of the form:

    Name    Items
0   Tom [dr_md_coca_cola, dr_od_water, potatoes, grass...
1   Steve   [dr_od_orange_juice, potatoes, grass, ot_other...
2   Phil    [dr_md_dr_pepper, potatoes, grass, dr_od_coffe...

and making the following assumptions:

only one item in a list matches the target mask
the target mask always appears at the start of the entry string

I created the following function to parse the list:

import re
def parse_Items(tgt_mask: str, itmList: list) -> str:
    p = re.compile(tgt_mask)
    for itm in itmList:
        if p.match(itm):
            return itm[p.search(itm).span()[1]:].replace('_', ' ')

Then you can modify your original dataframe with the following:

df['MD'] = [parse_Items('dr_md_', x) for x in df['Items'].to_list()]
df['OD'] = [parse_Items('dr_od_', x) for x in df['Items'].to_list()]
df.pop('Items')

This produces the following:

    Name    MD          OD
0   Tom     coca cola   water
1   Steve   pepsi       orange juice
2   Phil    dr pepper   coffee
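The output above is still lower-case, whereas the question asks for capitalized names. A minimal sketch of one way to close that gap, assuming the MD and OD columns already hold the cleaned strings shown above:

```python
import pandas as pd

# The result table from above, rebuilt by hand for illustration.
df = pd.DataFrame({
    "Name": ["Tom", "Steve", "Phil"],
    "MD": ["coca cola", "pepsi", "dr pepper"],
    "OD": ["water", "orange juice", "coffee"],
})

# str.title() capitalizes every word; str.capitalize() would only
# capitalize the first letter of the whole string ("Coca cola").
for col in ["MD", "OD"]:
    df[col] = df[col].str.title()

print(df)
```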

Answer 4

I would normalize the data before putting it into a dataframe:

import pandas as pd
from typing import Dict, List, Tuple


def clean_stuff(text: str):
    clean_text = text[6:].replace('_', ' ')
    return " ".join([
        word.capitalize()
        for word in clean_text.split(" ")
    ])


def get_md_od(stuffs: List[str]) -> Tuple[str, str]:
    md_od = [s for s in stuffs if s.startswith(('dr_md', 'dr_od'))]
    md_od = sorted(md_od)
    print(md_od)

    return clean_stuff(md_od[0]), clean_stuff(md_od[1])


dirty_stuffs = [{'Name': 'Tom',
                 'Listed_Items': ["dr_md_coca_cola",
                                  "dr_od_water",
                                  "potatoes",
                                  "grass",
                                  "ot_other_stuff"]},
                {'Name': 'Tom',
                 'Listed_Items': ["dr_md_coca_cola",
                                  "dr_od_water",
                                  "potatoes",
                                  "grass",
                                  "ot_other_stuff"]}
                ]

normalized_stuff: List[Dict[str, str]] = []
for stuff in dirty_stuffs:
    md, od = get_md_od(stuff['Listed_Items'])
    normalized_stuff.append({
        'Name': stuff['Name'],
        'MD': md,
        'OD': od,
    })

df = pd.DataFrame(normalized_stuff)
print(df)
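Note that get_md_od relies on sorting to put the dr_md item before the dr_od item, and indexing md_od[1] raises an IndexError if either item is missing. A possible variant that looks items up by prefix instead is sketched below; the names PREFIXES and get_md_od_safe are hypothetical and not part of the answer above.

```python
from typing import Dict, List, Optional

# Output column label -> prefix to look for (assumed fixed, as in the question).
PREFIXES = {"MD": "dr_md_", "OD": "dr_od_"}


def get_md_od_safe(stuffs: List[str]) -> Dict[str, Optional[str]]:
    """Map each label to its cleaned item, or None when no item has that prefix."""
    found: Dict[str, Optional[str]] = {label: None for label in PREFIXES}
    for s in stuffs:
        for label, prefix in PREFIXES.items():
            # Keep only the first match per label.
            if found[label] is None and s.startswith(prefix):
                found[label] = s[len(prefix):].replace("_", " ").title()
    return found


row = get_md_od_safe(["dr_od_coffee", "potatoes", "dr_md_dr_pepper"])
partial = get_md_od_safe(["potatoes", "dr_od_water"])
print(row)
print(partial)
```

The returned dictionary can be merged with the Name field and appended to normalized_stuff in the same way as the md, od tuple.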


4 solutions

Solution 1
1 2022-06-06 14:44:50

Solution 2
1 Accepted 2022-06-06 14:44:58

Solution 3
0 2022-06-06 15:14:04

Solution 4
0 2022-06-06 15:25:30
