Pandas dataframe: select list items in a column, then transform string on the items
One of the columns I'm importing into my dataframe is structured as a list. I need to pick out certain values from that list, transform each value, and add it to one of two new columns in the dataframe.
Before:
Name | Listed_Items |
---|---|
Tom | ["dr_md_coca_cola", "dr_od_water", "potatoes", "grass", "ot_other_stuff"] |
Steve | ["dr_od_orange_juice", "potatoes", "grass", "ot_other_stuff", "dr_md_pepsi"] |
Phil | ["dr_md_dr_pepper", "potatoes", "grass", "dr_od_coffee", "ot_other_stuff"] |
From what I've read, I can turn the column into a list:
df["listed_items"] = df["listed_items"].apply(eval)
But then I cannot see how to find the list item that starts with dr_md, extract it, remove the leading dr_md, replace any underscores with spaces, capitalize the first letter of each word, and add the result to a new MD column in the row. Then the same again for dr_od. In each row, only one item in the list starts with dr_md and only one starts with dr_od.
Desired output:
Name | MD | OD |
---|---|---|
Tom | Coca Cola | Water |
Steve | Pepsi | Orange Juice |
Phil | Dr Pepper | Coffee |
Use pivot_table:
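# One row per list item
df = df.explode('Listed_Items')
# Keep only the drink items; .copy() avoids a SettingWithCopyWarning on the next assignment
df = df[df.Listed_Items.str.startswith('dr_')].copy()
# Label each remaining item as MD or OD
df['Type'] = df['Listed_Items'].str.startswith('dr_md').map({True: 'MD',
                                                             False: 'OD'})
# One row per name, one column per type
df.pivot_table(values='Listed_Items',
               columns='Type',
               index='Name',
               aggfunc='first')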
Type                MD                  OD
Name
Phil   dr_md_dr_pepper        dr_od_coffee
Steve      dr_md_pepsi  dr_od_orange_juice
Tom    dr_md_coca_cola         dr_od_water
From here it's just a matter of beautifying your dataset as you wish.
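One way to do that beautifying, sketched under the assumption that the list column already holds real Python lists: extract the type and the drink name with a regular expression before pivoting, so the pivoted cells are already cleaned. The sample frame, the `Value` column, and the names `exploded`, `parts`, and `result` are illustrative and not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Name": ["Tom", "Steve", "Phil"],
    "Listed_Items": [
        ["dr_md_coca_cola", "dr_od_water", "potatoes", "grass", "ot_other_stuff"],
        ["dr_od_orange_juice", "potatoes", "grass", "ot_other_stuff", "dr_md_pepsi"],
        ["dr_md_dr_pepper", "potatoes", "grass", "dr_od_coffee", "ot_other_stuff"],
    ],
})

# One row per list item, with a fresh index so rows can be aligned safely
exploded = df.explode("Listed_Items").reset_index(drop=True)

# Capture the type (md/od) and the rest of the name; other items yield NaN
parts = exploded["Listed_Items"].str.extract(r"^dr_(?P<Type>md|od)_(?P<Value>.+)$")
exploded["Type"] = parts["Type"].str.upper()
exploded["Value"] = parts["Value"].str.replace("_", " ", regex=False).str.title()

# Drop the non-drink items, then pivot to one row per name
exploded = exploded.dropna(subset=["Type"])
result = exploded.pivot_table(values="Value", columns="Type",
                              index="Name", aggfunc="first")
print(result)
```

Doing the clean-up on the long (exploded) frame means each string operation runs once over a single column, rather than once per pivoted column.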
What you need to do is write a function that does the processing for you, which you can then pass into apply (or, in this case, map). Alternatively, you could expand your list column into multiple columns and process them afterwards, but that will only work if your lists are always in the same order (see "panda expand columns with list into multiple columns"). Because you only have one input column, you can use map instead of apply.
def process_dr_md(l: list):
    for s in l:
        if s.startswith("dr_md_"):
            # You can process your string further here
            return s[6:]  # strip the "dr_md_" prefix from the matching item

def process_dr_od(l: list):
    for s in l:
        if s.startswith("dr_od_"):
            # You can process your string further here
            return s[6:]  # strip the "dr_od_" prefix from the matching item

# Parse the string representation of each list into a real list
df["listed_items"] = df["listed_items"].map(eval)
df["MD"] = df["listed_items"].map(process_dr_md)
df["OD"] = df["listed_items"].map(process_dr_od)
I hope that gets you on your way!
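As a fuller sketch of the same idea, the two functions can be folded into one helper that takes the prefix as a parameter and also does the clean-up the question asks for. This assumes the column holds strings that look like Python list literals; `ast.literal_eval` is used in place of `eval` because it only parses literals and does not execute arbitrary code. The helper name `extract_drink` and the sample data are illustrative.

```python
import ast
import pandas as pd

def extract_drink(items, prefix):
    """Return the cleaned item that starts with `prefix`, or None if absent."""
    for item in items:
        if item.startswith(prefix):
            # Drop the prefix, turn underscores into spaces, title-case each word
            return item[len(prefix):].replace("_", " ").title()
    return None

# Sample data from the question, stored as strings as if read from a file
df = pd.DataFrame({
    "Name": ["Tom", "Steve", "Phil"],
    "listed_items": [
        '["dr_md_coca_cola", "dr_od_water", "potatoes", "grass", "ot_other_stuff"]',
        '["dr_od_orange_juice", "potatoes", "grass", "ot_other_stuff", "dr_md_pepsi"]',
        '["dr_md_dr_pepper", "potatoes", "grass", "dr_od_coffee", "ot_other_stuff"]',
    ],
})

# Parse each string into a real list
df["listed_items"] = df["listed_items"].map(ast.literal_eval)

# One pass per target column
df["MD"] = df["listed_items"].map(lambda items: extract_drink(items, "dr_md_"))
df["OD"] = df["listed_items"].map(lambda items: extract_drink(items, "dr_od_"))

result = df[["Name", "MD", "OD"]]
print(result)
```

Using `len(prefix)` instead of a hard-coded 6 keeps the slice correct if a prefix of a different length is passed in.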
I took a slightly different approach from the previous answers. Given a df of the form:
    Name                                              Items
0    Tom  [dr_md_coca_cola, dr_od_water, potatoes, grass...
1  Steve  [dr_od_orange_juice, potatoes, grass, ot_other...
2   Phil  [dr_md_dr_pepper, potatoes, grass, dr_od_coffe...
and making the following assumptions:
I created the following function to parse the list:
import re

def parse_Items(tgt_mask: str, itmList: list) -> str:
    p = re.compile(tgt_mask)
    for itm in itmList:
        m = p.match(itm)  # matches only at the start of the item
        if m:
            # Drop the matched prefix and replace underscores with spaces
            return itm[m.end():].replace('_', ' ')
Then you can modify your original dataframe with the following:
df['MD'] = [parse_Items('dr_md_', x) for x in df['Items'].to_list()]
df['OD'] = [parse_Items('dr_od_', x) for x in df['Items'].to_list()]
df.pop('Items')
This produces the following:
    Name         MD            OD
0    Tom  coca cola         water
1  Steve      pepsi  orange juice
2   Phil  dr pepper        coffee
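The output above is lower case, whereas the desired output in the question capitalizes each word. A possible follow-up step, assuming the MD and OD columns contain only strings, is to apply `Series.str.title` to both columns. The frame below simply re-creates the result shown above for illustration.

```python
import pandas as pd

# Re-creation of the result shown above
df = pd.DataFrame({
    "Name": ["Tom", "Steve", "Phil"],
    "MD": ["coca cola", "pepsi", "dr pepper"],
    "OD": ["water", "orange juice", "coffee"],
})

# Capitalize the first letter of every word in both new columns
for col in ["MD", "OD"]:
    df[col] = df[col].str.title()

print(df)
```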
I would normalize the data before putting it into a dataframe:
import pandas as pd
from typing import Dict, List, Tuple

def clean_stuff(text: str) -> str:
    # Drop the 6-character prefix ("dr_md_" or "dr_od_"), then title-case each word
    clean_text = text[6:].replace('_', ' ')
    return " ".join([
        word.capitalize()
        for word in clean_text.split(" ")
    ])

def get_md_od(stuffs: List[str]) -> Tuple[str, str]:
    # Assumes exactly one dr_md item and one dr_od item per list
    md_od = [s for s in stuffs if s.startswith(('dr_md', 'dr_od'))]
    # Sorting puts the dr_md item before the dr_od item
    md_od = sorted(md_od)
    return clean_stuff(md_od[0]), clean_stuff(md_od[1])

dirty_stuffs = [{'Name': 'Tom',
                 'Listed_Items': ["dr_md_coca_cola",
                                  "dr_od_water",
                                  "potatoes",
                                  "grass",
                                  "ot_other_stuff"]},
                {'Name': 'Tom',
                 'Listed_Items': ["dr_md_coca_cola",
                                  "dr_od_water",
                                  "potatoes",
                                  "grass",
                                  "ot_other_stuff"]}
                ]

# Build clean records first, then create the dataframe from them
normalized_stuff: List[Dict[str, str]] = []
for stuff in dirty_stuffs:
    md, od = get_md_od(stuff['Listed_Items'])
    normalized_stuff.append({
        'Name': stuff['Name'],
        'MD': md,
        'OD': od,
    })

df = pd.DataFrame(normalized_stuff)
print(df)