Say I have a pandas DataFrame as below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID':['B', 'A', 'A', 'C', 'A']})
# ID
# 0 B
# 1 A
# 2 A
# 3 C
# 4 A
And I have the following function:
def func(id):
    # returns a list of three [columnName, value] pairs (each a random integer from 0 to 4)
    # example: [[1,0], [3,4], [0,2]]
    res = [np.random.randint(0, 5, size=2) for i in range(3)]
    return res
I want to iterate over each value in the ID column, execute func on that value (which returns pairs of ['columnName', 'value']), and then, for each pair returned by func, create a new column and assign the value to that specific row. This is a sample output:
  ID    1    3    0    2    4
0  B  3.0  4.0  NaN  3.0  NaN
1  A  0.0  4.0  4.0  NaN  4.0
2  A  1.0  1.0  1.0  0.0  NaN
3  C  1.0  1.0  NaN  NaN  NaN
4  A  1.0  4.0  0.0  2.0  0.0
To clarify this, suppose this array was generated by func for the first row:
[[1, 3], [3, 4], [2, 3]]
As you can see in the sample output, columns with names "1,3,2" were created with values "3,4,3" respectively.
  ID    1    3    0    2    4
0  B  3.0  4.0  NaN  3.0  NaN
My current code is:
for id in df['ID']:
    for rowIndex in df.index[df['ID'] == id]:
        for columnName, value in func(id):
            df.loc[rowIndex, columnName] = value
This works just fine! The only problem is that it is not efficient at all. When I look up the rowIndex values for a specific ID (say 'A'), that ID may appear multiple times in the column, so the loop iterates over all of its rows each time and later overwrites them again. Is there a better way to write this?
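To make the redundancy concrete, here is a small sketch that counts how often func is called by the nested loop. The wrapper counted_func and its fixed pair values are hypothetical stand-ins used only for illustration; they are not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['B', 'A', 'A', 'C', 'A']})

calls = []  # records every ID the function is called with

def counted_func(id):
    # hypothetical stand-in for func: fixed pairs, but it logs each call
    calls.append(id)
    return [[1, 0], [3, 4], [0, 2]]

out = df.copy()
for id in out['ID']:
    for rowIndex in out.index[out['ID'] == id]:
        for columnName, value in counted_func(id):
            out.loc[rowIndex, columnName] = value

n_calls = len(calls)             # calls made by the nested loop
n_unique = out['ID'].nunique()   # calls that would be enough
print(n_calls, n_unique)
```

With the sample frame the loop makes 11 calls for 3 distinct IDs, because each of the three 'A' rows triggers a pass over all three 'A' rows.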
This is just a simplified version of my project, so id is not used by func here. Also, in the real project func doesn't return random values for a specific ID; it returns the same values for a given ID.
One idea is to work with groups, but the result then has to be reshaped with Series.unstack before DataFrame.join:
df1 = df.join(df.groupby('ID')['ID'].apply(lambda x: dict(func(x))).unstack(), 'ID')
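The reshape step may be easier to follow in isolation. The sketch below skips the groupby and starts from a hand-written Series whose index holds (ID, columnName) pairs, which is roughly the shape the one-liner relies on before unstack; the values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['B', 'A', 'A', 'C', 'A']})

# hypothetical per-ID results, keyed by (ID, columnName)
s = pd.Series(
    [3, 4, 0, 1, 2],
    index=pd.MultiIndex.from_tuples(
        [('A', 1), ('A', 3), ('B', 1), ('C', 0), ('C', 3)]
    ),
)

wide = s.unstack()            # one row per ID, one column per columnName
df1 = df.join(wide, on='ID')  # match df's ID column against wide's index
print(df1)
```

Passing on='ID' is what lets every row with the same ID pick up the same values; without it, join would try to align on the row index instead.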
To improve performance, call func only once per unique value in a list comprehension, then use DataFrame.join:
u = df['ID'].unique()
df1 = df.join(pd.DataFrame([dict(func(x)) for x in u], index=u), on='ID')
print (df1)
  ID    4    0    3    1    2
0  B  1.0  4.0  0.0  NaN  NaN
1  A  2.0  NaN  NaN  1.0  NaN
2  A  2.0  NaN  NaN  1.0  NaN
3  C  NaN  2.0  1.0  NaN  0.0
4  A  2.0  NaN  NaN  1.0  NaN
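Because func is random, the output above changes from run to run. A deterministic variant makes the behaviour easier to check; fixed_func and its lookup table are hypothetical stand-ins, assuming the real function returns the same pairs for a given ID.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['B', 'A', 'A', 'C', 'A']})

def fixed_func(id):
    # hypothetical deterministic stand-in for func
    table = {
        'A': [[1, 0], [3, 4]],
        'B': [[0, 2], [1, 1]],
        'C': [[2, 5]],
    }
    return table[id]

u = df['ID'].unique()  # one entry per distinct ID, in order of first appearance
lookup = pd.DataFrame([dict(fixed_func(x)) for x in u], index=u)
df1 = df.join(lookup, on='ID')
print(df1)
```

Here the function runs three times (once per distinct ID), and all 'A' rows receive identical values through the join.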
You can use Series.map, build a DataFrame from the result, and then join:
df.join(pd.DataFrame(df['ID'].map(lambda x: dict(func(x))).tolist()))
Sample Run:
np.random.seed(0)
def func(id):
    # returns a list of three [columnName, value] pairs (each a random integer from 0 to 4)
    # example: [[1,0], [3,4], [0,2]]
    res = [np.random.randint(0, 5, size=2) for i in range(3)]
    return res
print(df.join(pd.DataFrame(df['ID'].map(lambda x: dict(func(x))).tolist())))
  ID    4    3    0    2    1
0  B  0.0  1.0  NaN  NaN  NaN
1  A  0.0  2.0  4.0  NaN  NaN
2  A  NaN  NaN  1.0  1.0  0.0
3  C  NaN  0.0  NaN  NaN  4.0
4  A  NaN  3.0  1.0  3.0  NaN
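Two properties of the Series.map variant are worth checking before adopting it; the sketch below is an illustration under stated assumptions, not part of the answer. First, the function is called once per row rather than once per unique ID, which is why the three 'A' rows in the sample run differ. Second, the intermediate DataFrame gets a default 0..n-1 index, so the join only lines up when df has that same index; passing index=df.index is one way to make the alignment explicit. The stand-in logging_func and the custom index are hypothetical.

```python
import pandas as pd

# a non-default index, to show why the alignment matters
df = pd.DataFrame({'ID': ['B', 'A', 'A', 'C', 'A']},
                  index=[10, 11, 12, 13, 14])

calls = []  # records every ID the function is called with

def logging_func(id):
    # hypothetical stand-in for func: the value for column 1 is the call number
    calls.append(id)
    return [[1, len(calls)], [3, 0]]

mapped = df['ID'].map(lambda x: dict(logging_func(x)))
extra = pd.DataFrame(mapped.tolist(), index=df.index)  # reuse df's index
df1 = df.join(extra)
print(df1)
```

If the function is expensive or must give the same result per ID, mapping over the unique IDs first is likely the safer choice.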