I have a large Pandas Dataframe with the following structure:
import pandas as pd

data = {'id': [3, 5, 9, 12],
        'names': ["{name1,name2,name3}", "{name1,name3}", "{name1,name2}", "{name2,name1,name3}"],
        'values': ["{N,Y,N}", "{N,N}", "{Y,N}", "{N,Y,Y}"]
        }
df = pd.DataFrame(data)
df
Note that the names are not always in the same order, and not every name is present for each id; however, the order of the values does correspond to the order of the names within each row.
I would like to turn this table into the following structure as efficiently as possible:
data = {'id': [3, 5, 9, 12],
        'name1': ["N", "N", "Y", "Y"],
        'name2': ["Y", " ", "N", "N"],
        'name3': ["N", "N", " ", "Y"],
        }
df = pd.DataFrame(data)
df
Currently I am accomplishing this with the following subroutine, where I essentially go through the df row by row, create lists of the names and values, and then add those values to new columns. This works correctly, but it is very slow (estimated at ~14 hrs) since my df is large (~2e5 rows) and each row or id can have up to 194 names, i.e. "{name1, name2, ..., name193, name194}".
def add_name_cols(df, title_col, value_col):
    nRows = len(df)
    for index,row in df.iterrows(): # parse rows and replace characters
        title_spl = [ i for i in row[title_col].replace('{','').replace('}','').split(',') ]
        value_spl = [ i for i in row[value_col].replace('{','').replace('}','').split(',') ]
        i = 0
        for t in title_spl: # add value in correct column for this row
            print('Progress rows: {0:2.2f}%, Progress columns: {1:2.2f}%'.format(float(index)/float(nRows)*100, float(i)/float(194)*100), end='\r')
            df.loc[index,t] = value_spl[i]
            i += 1
    return df

df_new = add_name_cols(df, 'names', 'values')
df_new
Is there a way to accomplish this manipulation using more of Pandas' built-in methods that would expedite this process?
Use string methods and the dict constructor inside a list comprehension:
(i) Convert df[['names','values']] to a list of lists.
(ii) Iterate over each pair (i.e. each row of df), and use str.strip and str.split to create a pair of lists; unpack the pair into zip and pass the result to the dict constructor.
(iii) Pass the resulting list of dictionaries to pd.DataFrame.
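As a small illustration of step (ii) on a single row, the sketch below uses the names/values pair copied from the row with id 12 in the question. The variable names are only for illustration.

```python
# One [names, values] pair, taken from the row with id 12 in the question
pair = ["{name2,name1,name3}", "{N,Y,Y}"]

# Strip the surrounding braces and split on commas to get two parallel lists
names, values = [x.strip('{}').split(',') for x in pair]

# zip pairs each name with the value in the same position;
# dict turns those pairs into a name -> value mapping
row_dict = dict(zip(names, values))
print(row_dict)  # {'name2': 'N', 'name1': 'Y', 'name3': 'Y'}
```

Because each row becomes a dictionary keyed by name, the order of the names within a row no longer matters once the dictionaries are passed to pd.DataFrame.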
temp = pd.DataFrame([
    dict(zip(*[x.strip('{}').split(',') for x in pair]))
    for pair in df[['names', 'values']].to_numpy().tolist()
]).fillna('')
df[temp.columns] = temp
df = df.drop(['names', 'values'], axis=1)
Output:
   id name1 name2 name3
0   3     N     Y     N
1   5     N           N
2   9     Y     N
3  12     Y     N     Y
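The assignment df[temp.columns] = temp aligns on the index, so it relies on df having the default 0..n-1 index. If the frame has been filtered or re-indexed, one possible variation is to build the wide frame on df's own index and join it back. This is a minimal sketch: the helper name expand_names and its parameters are hypothetical, not part of pandas, and the custom index in the example is made up to show the alignment.

```python
import pandas as pd

def expand_names(df, names_col='names', values_col='values', fill=''):
    """Hypothetical helper: spread brace-delimited name/value strings into columns."""
    # Build one {name: value} dict per row from the two string columns
    records = [
        dict(zip(n.strip('{}').split(','), v.strip('{}').split(',')))
        for n, v in zip(df[names_col], df[values_col])
    ]
    # Reuse the original index so the join aligns rows even if it is not 0..n-1
    wide = pd.DataFrame(records, index=df.index).fillna(fill)
    return df.drop(columns=[names_col, values_col]).join(wide)

# Same data as in the question, but with a non-default index (an assumption for the demo)
df = pd.DataFrame(
    {'id': [3, 5, 9, 12],
     'names': ["{name1,name2,name3}", "{name1,name3}", "{name1,name2}", "{name2,name1,name3}"],
     'values': ["{N,Y,N}", "{N,N}", "{Y,N}", "{N,Y,Y}"]},
    index=[10, 20, 30, 40],
)
out = expand_names(df)
print(out)
```

The parsing still happens in a plain Python list comprehension, as in the answer above; only the final assembly differs.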