Pandas dataframe: select list items in a column, then transform string on the items

One of the columns I'm importing into my dataframe is structured as a list. I need to pick out certain values from said list, transform each value and add it to one of two new columns in the dataframe. Before:

Name    Listed_Items
Tom     ["dr_md_coca_cola", "dr_od_water", "potatoes", "grass", "ot_other_stuff"]
Steve   ["dr_od_orange_juice", "potatoes", "grass", "ot_other_stuff", "dr_md_pepsi"]
Phil    ["dr_md_dr_pepper", "potatoes", "grass", "dr_od_coffee", "ot_other_stuff"]

From what I've read, I can turn the column into a list:

df["listed_items"] = df["listed_items"].apply(eval)
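As a side note, eval executes whatever code the string contains, so for data read from a file ast.literal_eval may be a safer choice, since it only accepts Python literals. A minimal sketch, assuming the column holds the lists as strings (the sample frame below is made up to mirror the question):

```python
import ast

import pandas as pd

# Hypothetical sample mirroring the question: each list is stored as a string.
df = pd.DataFrame({
    "Name": ["Tom", "Steve"],
    "Listed_Items": [
        '["dr_md_coca_cola", "dr_od_water", "potatoes"]',
        '["dr_od_orange_juice", "potatoes", "dr_md_pepsi"]',
    ],
})

# literal_eval parses plain literals only and raises an error on anything else.
df["Listed_Items"] = df["Listed_Items"].apply(ast.literal_eval)
```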

But then I cannot see how to find any list items that start with dr_md, extract the item, remove the starting dr_md, replace any underscores, capitalize the first letter and add that to a new MD column in the row. Then the same again for dr_od. In each row there is only one item in the list that starts with dr_md and only one that starts with dr_od. Desired output:

Name    MD          OD
Tom     Coca Cola   Water
Steve   Pepsi       Orange Juice
Phil    Dr Pepper   Coffee

Use pivot_table:

df = df.explode('Listed_Items')
df = df[df.Listed_Items.str.contains('dr_')]

df['Type'] = df['Listed_Items'].str.contains('dr_md').map({True: 'MD', 
                                                           False: 'OD'})

df.pivot_table(values='Listed_Items', 
               columns='Type', 
               index='Name',
               aggfunc='first')

Type                MD                  OD
Name                                      
Phil   dr_md_dr_pepper        dr_od_coffee
Steve      dr_md_pepsi  dr_od_orange_juice
Tom    dr_md_coca_cola         dr_od_water

From here it's just a matter of beautifying your dataset as you wish.
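One possible way to do that beautifying step is sketched below with pandas string methods. It assumes the pivoted result has been stored in a variable; the name wide and the hand-built frame are stand-ins for the pivot_table output shown above, not part of the original answer.

```python
import pandas as pd

# Hypothetical stand-in for the pivot_table result shown above.
wide = pd.DataFrame(
    {
        "MD": ["dr_md_dr_pepper", "dr_md_pepsi", "dr_md_coca_cola"],
        "OD": ["dr_od_coffee", "dr_od_orange_juice", "dr_od_water"],
    },
    index=pd.Index(["Phil", "Steve", "Tom"], name="Name"),
)

for col in ["MD", "OD"]:
    wide[col] = (
        wide[col]
        .str.replace(r"^dr_(?:md|od)_", "", regex=True)  # drop the leading prefix only
        .str.replace("_", " ", regex=False)              # underscores to spaces
        .str.title()                                     # capitalize each word
    )

# Turn the Name index back into an ordinary column.
wide = wide.reset_index()
```

Note that str.title() capitalizes every word, which matches the desired output in the question ("Coca Cola", "Orange Juice"); str.capitalize() would only capitalize the first letter of the whole string.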

What you need to do is write a function that does the processing for you and that you can pass into apply (or in this case, map). Alternatively, you could expand your list column into multiple columns and then process them afterwards, but that will only work if your lists are always in the same order (see "panda expand columns with list into multiple columns"). Because you only have one input column, you can use map instead of apply.

def process_dr_md(l: list):
    for s in l:
        if s.startswith("dr_md_"):
            # You can process your string further here
            return s[6:]  # slice the matching string, not the list

def process_dr_od(l: list):
    for s in l:
        if s.startswith("dr_od_"):
            # You can process your string further here
            return s[6:]  # slice the matching string, not the list

df["listed_items"] = df["listed_items"].map(eval)
df["MD"] = df["listed_items"].map(process_dr_md)
df["OD"] = df["listed_items"].map(process_dr_od)

I hope that gets you on your way!
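For the "process your string further" step, one option is to fold both functions into a single helper that takes the prefix as a parameter. This is only a sketch: extract_item and the sample frame are hypothetical names and data chosen to mirror the question, and the lists are assumed to be parsed already.

```python
from typing import List, Optional

import pandas as pd


def extract_item(items: List[str], prefix: str) -> Optional[str]:
    """Return the first item starting with `prefix`, cleaned up, or None."""
    for s in items:
        if s.startswith(prefix):
            # Drop the prefix, turn underscores into spaces, capitalize each word.
            return s[len(prefix):].replace("_", " ").title()
    return None


# Hypothetical sample mirroring the question, with the lists already parsed.
df = pd.DataFrame({
    "Name": ["Tom", "Steve", "Phil"],
    "listed_items": [
        ["dr_md_coca_cola", "dr_od_water", "potatoes", "grass", "ot_other_stuff"],
        ["dr_od_orange_juice", "potatoes", "grass", "ot_other_stuff", "dr_md_pepsi"],
        ["dr_md_dr_pepper", "potatoes", "grass", "dr_od_coffee", "ot_other_stuff"],
    ],
})

df["MD"] = df["listed_items"].map(lambda items: extract_item(items, "dr_md_"))
df["OD"] = df["listed_items"].map(lambda items: extract_item(items, "dr_od_"))
result = df[["Name", "MD", "OD"]]
```

Rows with no matching item would get None in the new column, which pandas displays as a missing value.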

I took a slightly different approach from the previous answers. Given a df of the form:

    Name    Items
0   Tom [dr_md_coca_cola, dr_od_water, potatoes, grass...
1   Steve   [dr_od_orange_juice, potatoes, grass, ot_other...
2   Phil    [dr_md_dr_pepper, potatoes, grass, dr_od_coffe...  

and making the following assumptions:

  1. only one item in a list matches the target mask
  2. the target mask always appears at the start of the entry string

I created the following function to parse the list:

import re
def parse_Items(tgt_mask: str, itmList: list) -> str:
    p = re.compile(tgt_mask)
    for itm in itmList:
        if p.match(itm):
            return itm[p.search(itm).span()[1]:].replace('_', ' ')  

Then you can modify your original data frame with the following:

df['MD'] = [parse_Items('dr_md_', x) for x in df['Items'].to_list()]
df['OD'] = [parse_Items('dr_od_', x) for x in df['Items'].to_list()]
df.pop('Items')  

This produces the following:

    Name    MD          OD
0   Tom     coca cola   water
1   Steve   pepsi       orange juice
2   Phil    dr pepper   coffee

I would normalize the data before putting it into a dataframe:

import pandas as pd
from typing import Dict, List, Tuple


def clean_stuff(text: str):
    clean_text = text[6:].replace('_', ' ')
    return " ".join([
        word.capitalize()
        for word in clean_text.split(" ")
    ])


def get_md_od(stuffs: List[str]) -> Tuple[str, str]:
    md_od = [s for s in stuffs if s.startswith(('dr_md', 'dr_od'))]
    md_od = sorted(md_od)
    print(md_od)

    return clean_stuff(md_od[0]), clean_stuff(md_od[1])


dirty_stuffs = [{'Name': 'Tom',
                 'Listed_Items': ["dr_md_coca_cola",
                                  "dr_od_water",
                                  "potatoes",
                                  "grass",
                                  "ot_other_stuff"]},
                {'Name': 'Tom',
                 'Listed_Items': ["dr_md_coca_cola",
                                  "dr_od_water",
                                  "potatoes",
                                  "grass",
                                  "ot_other_stuff"]}
                ]

normalized_stuff: List[Dict[str, str]] = []
for stuff in dirty_stuffs:
    md, od = get_md_od(stuff['Listed_Items'])
    normalized_stuff.append({
        'Name': stuff['Name'],
        'MD': md,
        'OD': od,
    })

df = pd.DataFrame(normalized_stuff)
print(df)
