Python Pandas DataFrame Create Columns with specific value on iteration of every row

Say I have a pandas DataFrame as below:

import pandas as pd
import numpy as np

df = pd.DataFrame({'ID':['B', 'A', 'A', 'C', 'A']})

#   ID
# 0  B
# 1  A
# 2  A
# 3  C
# 4  A

And I have the following function:

def func(id):
    # returns a list of three [columnName, value] pairs
    # (both entries are random integers from 0 to 4; randint's upper bound 5 is exclusive)
    # example: [[1,0], [3,4], [0,2]]
    res = [np.random.randint(0,5, size=2) for i in range(3)]
    return res

I want to iterate over each value in the ID column, call func on that value (which returns ['columnName', 'value'] pairs), and then, for each pair returned by func, create a new column named columnName if it does not exist yet and assign value to it in that specific row. This is a sample output:

  ID    1    3    0    2    4
0  B  3.0  4.0  NaN  3.0  NaN
1  A  0.0  4.0  4.0  NaN  4.0
2  A  1.0  1.0  1.0  0.0  NaN
3  C  1.0  1.0  NaN  NaN  NaN
4  A  1.0  4.0  0.0  2.0  0.0

To clarify this, suppose this array was generated by func for the first row:

[[1, 3], [3, 4], [2, 3]]

As you can see in the sample output, columns named 1, 3 and 2 were created with the values 3, 4 and 3, respectively:

  ID    1    3    0    2    4
0  B  3.0  4.0  NaN  3.0  NaN

What I've done so far:

for id in df['ID']:
    for rowIndex in df.index[df['ID'] == id]:
        for columnName, value in func(id):
            df.loc[rowIndex, columnName] = value

This works, but it is not efficient at all. For each ID I look up every rowIndex with that ID, so an ID that appears several times in the column (say 'A') has all of its rows processed again each time it comes up, and their values are overwritten. Is there a better way to write this?

Note:

This is just a simplified version of my project, which is why id is not used by func here. Also, the real func does not return random values: it returns the same values every time it is called with a given ID.
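
To make the redundancy in the loop above concrete, here is a minimal sketch that counts how often func gets invoked. counting_func is a hypothetical stand-in (it is not part of the original code): it returns fixed pairs and records each call.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['B', 'A', 'A', 'C', 'A']})

calls = []  # one entry per invocation of the stand-in function

def counting_func(id_value):
    # Hypothetical stand-in for func: fixed pairs, plus call bookkeeping.
    calls.append(id_value)
    return [[1, 0], [3, 4], [0, 2]]

# Same loop structure as in the question.
for id_value in df['ID']:
    for row_index in df.index[df['ID'] == id_value]:
        for column_name, value in counting_func(id_value):
            df.loc[row_index, column_name] = value

loop_calls = len(calls)
unique_count = df['ID'].nunique()
print(loop_calls, unique_count)
```

For this five-row frame the loop makes 11 calls (1 for 'B', 3 for each of the three 'A' rows, 1 for 'C'), although only 3 distinct IDs exist.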

A first idea is to work with groups, but the result has to be reshaped with Series.unstack before DataFrame.join:

df1 = df.join(df.groupby('ID')['ID'].apply(lambda x: dict(func(x))).unstack(), 'ID')

To improve performance, call func only on the unique values in a list comprehension before DataFrame.join:

u = df['ID'].unique()
df1 = df.join(pd.DataFrame([dict(func(x)) for x in u], index=u), on='ID')
print(df1)

  ID    4    0    3    1    2
0  B  1.0  4.0  0.0  NaN  NaN
1  A  2.0  NaN  NaN  1.0  NaN
2  A  2.0  NaN  NaN  1.0  NaN
3  C  NaN  2.0  1.0  NaN  0.0
4  A  2.0  NaN  NaN  1.0  NaN
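
Because func is random here, the output above differs from run to run. As a reproducible illustration of the same unique-then-join idea, the sketch below swaps in a deterministic lookup. PAIRS and fixed_func are hypothetical names invented for this example, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['B', 'A', 'A', 'C', 'A']})

# Hypothetical fixed [columnName, value] pairs per ID (assumption for the demo).
PAIRS = {
    'A': [[1, 0], [3, 4], [0, 2]],
    'B': [[1, 3], [3, 4], [2, 3]],
    'C': [[4, 1], [0, 2]],
}

def fixed_func(id_value):
    # Deterministic stand-in for func.
    return PAIRS[id_value]

u = df['ID'].unique()
# One row per distinct ID; columns are the union of all returned column names.
lookup = pd.DataFrame([dict(fixed_func(x)) for x in u], index=u)
# Align the lookup rows to df through the ID column.
df1 = df.join(lookup, on='ID')
print(df1)
```

Every row with the same ID receives identical values, and a column that an ID never produced stays NaN for that row.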

You can use Series.map, build a DataFrame from the result, and then join:

df.join(pd.DataFrame(df['ID'].map(lambda x: dict(func(x))).tolist()))

Sample Run:

np.random.seed(0)

def func(id):
    # returns a list of three [columnName, value] pairs
    # (both entries are random integers from 0 to 4; randint's upper bound 5 is exclusive)
    # example: [[1,0], [3,4], [0,2]]
    res = [np.random.randint(0,5, size=2) for i in range(3)]
    return res

print(df.join(pd.DataFrame(df['ID'].map(lambda x: dict(func(x))).tolist())))

  ID    4    3    0    2    1
0  B  0.0  1.0  NaN  NaN  NaN
1  A  0.0  2.0  4.0  NaN  NaN
2  A  NaN  NaN  1.0  1.0  0.0
3  C  NaN  0.0  NaN  NaN  4.0
4  A  NaN  3.0  1.0  3.0  NaN
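
Note that Series.map calls func once per row, which is why the three 'A' rows above differ. If func is deterministic and expensive, one possible variation (a sketch, not part of the original answer) is to memoize it with functools.lru_cache so that each distinct ID is computed only once. cached_func and its return values are hypothetical; passing index=df.index is an added precaution so the join still aligns when df does not have the default 0..n-1 index.

```python
from functools import lru_cache

import pandas as pd

df = pd.DataFrame({'ID': ['B', 'A', 'A', 'C', 'A']})

calls = []  # records which IDs actually reach the function body

@lru_cache(maxsize=None)
def cached_func(id_value):
    # Hypothetical deterministic stand-in for func; arguments must be hashable.
    calls.append(id_value)
    return ((1, len(id_value)), (3, ord(id_value)))

# map still visits every row, but repeated IDs are served from the cache.
records = df['ID'].map(lambda x: dict(cached_func(x))).tolist()
result = df.join(pd.DataFrame(records, index=df.index))
print(result)
print(calls)
```

Here the function body runs only for 'B', 'A' and 'C', in that order, even though the column has five rows.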
