Pandas, populate dataframe columns for each row from array in column
I have a large Pandas DataFrame with the following structure:
import pandas as pd

data = {'id': [3, 5, 9, 12],
        'names': ["{name1,name2,name3}", "{name1,name3}", "{name1,name2}", "{name2,name1,name3}"],
        'values': ["{N,Y,N}", "{N,N}", "{Y,N}", "{N,Y,Y}"]
        }
df = pd.DataFrame(data)
df
Note that the names are not always in the same order, and not every name is included for each id; however, the order of the values does correspond to the order of the names in each row.
I would like to turn this table into the following structure as efficiently as possible:
data = {'id': [3, 5, 9, 12],
        'name1': ["N", "N", "Y", "Y"],
        'name2': ["Y", " ", "N", "N"],
        'name3': ["N", "N", " ", "Y"],
        }
df = pd.DataFrame(data)
df
Currently I am accomplishing this with the following subroutine, where I essentially go through the df row by row, create lists of the names and values, and then add those values to new columns. This works correctly, but it is very slow (estimated at ~14 hrs) since my df is large (~2e5 rows) and each row or id can have up to 194 names, i.e. "{name1, name2, ..., name193, name194}".
def add_name_cols(df, title_col, value_col):
    nRows = len(df)
    for index, row in df.iterrows():  # parse rows and replace characters
        title_spl = [i for i in row[title_col].replace('{', '').replace('}', '').split(',')]
        value_spl = [i for i in row[value_col].replace('{', '').replace('}', '').split(',')]
        i = 0
        for t in title_spl:  # add value in correct column for this row
            print('Progress rows: {0:2.2f}%, Progress columns: {1:2.2f}%'.format(float(index)/float(nRows)*100, float(i)/float(194)*100), end='\r')
            df.loc[index, t] = value_spl[i]
            i += 1
    return df

df_new = add_name_cols(df, 'names', 'values')
df_new
Is there a way to accomplish this manipulation using more of Pandas' built-in methods that would expedite this process?
Use string methods and the dict constructor inside a list comprehension.
(i) Convert df[['names','values']] to a list of lists.
(ii) Iterate over each pair, i.e. each row of df, and use str.strip and str.split to turn it into a pair of lists; unpack that pair into zip and pass the result to the dict constructor.
(iii) Pass the resulting list of dictionaries to pd.DataFrame.
temp = pd.DataFrame(
    [dict(zip(*[x.strip('{}').split(',') for x in pair]))
     for pair in df[['names','values']].to_numpy().tolist()]
).fillna('')
df[temp.columns] = temp
df = df.drop(['names','values'], axis=1)
Output:
   id name1 name2 name3
0   3     N     Y     N
1   5     N           N
2   9     Y     N      
3  12     Y     N     Y
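For readability, steps (i) to (iii) can also be unpacked into a small helper. This is only a sketch of the same idea, not a different method: the function name expand_name_columns and its parameters are hypothetical, and stripping whitespace around each name and value is an added assumption to cope with strings such as "{name1, name2}".

```python
import pandas as pd


def expand_name_columns(df, names_col="names", values_col="values"):
    """Hypothetical helper: return a copy of df with one column per name."""
    records = []
    # Steps (i) and (ii): walk both string columns in lockstep, one dict per row.
    for names, values in zip(df[names_col], df[values_col]):
        # Assumption: tolerate spaces after commas by stripping each item.
        keys = [k.strip() for k in names.strip("{}").split(",")]
        vals = [v.strip() for v in values.strip("{}").split(",")]
        records.append(dict(zip(keys, vals)))
    # Step (iii): build all new columns at once; names missing from a row
    # come out as NaN and are replaced by empty strings.
    wide = pd.DataFrame(records, index=df.index).fillna("")
    # Drop the raw string columns and attach the new ones by index.
    return df.drop(columns=[names_col, values_col]).join(wide)


df = pd.DataFrame({
    "id": [3, 5, 9, 12],
    "names": ["{name1,name2,name3}", "{name1,name3}", "{name1,name2}", "{name2,name1,name3}"],
    "values": ["{N,Y,N}", "{N,N}", "{Y,N}", "{N,Y,Y}"],
})
result = expand_name_columns(df)
print(result)
```

The Python loop still visits every row, but it only builds plain dictionaries; the DataFrame is constructed once at the end instead of being modified cell by cell with df.loc, which is where the original subroutine most likely loses its time.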