I have the following data frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
A B C D
0 foo one 0.478183 -1.267588
1 bar one 0.555985 -2.143590
2 foo two -1.592865 1.251546
3 bar three 0.174138 -0.708198
4 foo two 0.302215 -0.219041
5 bar two -0.034550 -0.965414
6 foo one 1.310828 -0.388601
7 foo three 0.357659 -1.610443
I'm trying to add another column holding a min-max normalized version of column C, computed separately within each group of A:
normed = df.groupby('A').apply(lambda x: (x['C']-min(x['C']))/(max(x['C'])-min(x['C'])))
A
bar 1 0.000000
3 0.033396
5 1.000000
foo 0 1.000000
2 0.413716
4 0.000000
6 0.441061
7 0.357787
Finally, I want to join this result back to df (following advice from a similar question):
df.join(normed, on='A', rsuffix='_normed')
However, I get an error:
ValueError: len(left_on) must equal the number of levels in the index of "right"
How can I add the normed result back to the DataFrame df?
You get this error because normed has a two-level MultiIndex, while on='A' supplies only one join key. The first level holds the two group keys from A; the second level is the original index.
normed.index
Out[35]:
MultiIndex(levels=[['bar', 'foo'], [0, 1, 2, 3, 4, 5, 6, 7]],
labels=[[0, 0, 0, 1, 1, 1, 1, 1], [1, 3, 5, 0, 2, 4, 6, 7]],
names=['A', None])
You probably want to join on the original index, so you must drop the first level of the new index
normed.index = normed.index.droplevel()
before joining:
df.join(normed, rsuffix='_normed')
The simplest way is to apply reset_index to normed:
normed = df.groupby('A').apply(lambda x: (x['C']-min(x['C']))/(max(x['C'])-min(x['C'])))
normed = normed.reset_index(level=0, drop=True)
And now simply add normed as a column to df:
df['normed'] = normed
Actually, there is a very easy solution. When groupby is doing a one-for-one operation (rather than a reduction), you can use transform, and the indexing is already taken care of for you:
df['c_normed'] = df.groupby('A')['C'].transform(lambda x: (x - min(x)) / (max(x) - min(x)))
Also note that the code is a bit cleaner if you use df.groupby('A')['C'], because then you can just use x instead of x['C'] inside the lambda. Also, in this case x['C'] works with apply but not with transform (I am not sure why; most likely transform hands the function one column at a time as a Series, so there is no 'C' column to look up).
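As a rough check of the transform approach, the sketch below compares the lambda version with a variant that uses the built-in 'min' and 'max' transforms, which broadcast each group's statistic back onto its rows. The fixed C values are copied from the question's printout, and the names grouped, by_lambda, gmin and gmax are only illustrative.

```python
import numpy as np
import pandas as pd

# Fixed values taken from the question's printout, so results are repeatable
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'C': [0.478183, 0.555985, -1.592865, 0.174138,
          0.302215, -0.034550, 1.310828, 0.357659],
})

grouped = df.groupby('A')['C']

# Lambda version: x is each group's C values as a Series
by_lambda = grouped.transform(lambda x: (x - x.min()) / (x.max() - x.min()))

# Built-in version: each row receives its own group's min and max
gmin = grouped.transform('min')
gmax = grouped.transform('max')
df['c_normed'] = (df['C'] - gmin) / (gmax - gmin)

# Both results keep df's original index, so they can be compared row by row
print(np.allclose(by_lambda, df['c_normed']))
print(df)
```

One caveat with either variant: a group whose values are all identical has max equal to min, so the division is 0/0 and the result for that group is NaN.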
What you can do is the following:
# Get (index, value) tuples for each group
# (list(...) is needed on Python 3, where zip returns an iterator)
foo = list(zip(normed['foo'].index, normed['foo'].values))
bar = list(zip(normed['bar'].index, normed['bar'].values))
# Merge the two lists
foo.extend(bar)  # merged list is now contained in foo
# Sort the list by the original row index
new_list = sorted(foo, key=lambda x: x[0])
# Create new column in dataframe (assigned by position, so df must be in index order)
index, values = zip(*new_list)  # unzip
df['New_column'] = values
Output
Out[85]:
A B C D New_column
0 foo one 0.039683 -0.041559 0.638594
1 bar one -0.090650 -2.316097 0.000000
2 foo two 0.024210 0.616764 0.629815
3 bar three 0.142740 0.156198 0.450339
4 foo two -1.085916 -0.432832 0.000000
5 bar two 0.427604 -1.154850 1.000000
6 foo one -0.156424 0.037188 0.527335
7 foo three 0.676706 -1.336921 1.000000
NB: There may be a cleverer way to do this.
You first have to get rid of the first level of the MultiIndex created by groupby (i.e. 'foo' and 'bar').
Adding the following code should work:
normed = normed.reset_index(level=0)
del normed['A']
normed.rename(columns={'C':'C_normed'}, inplace=True)
pd.concat([df, normed], axis=1)
Result:
A B C D C_normed
0 foo one 1.697923 0.656727 1.000000
1 bar one -0.626052 -0.466088 0.000000
2 foo two -0.501440 1.080408 0.000000
3 bar three 0.731791 -1.531915 1.000000
4 foo two -0.202666 0.275042 0.135846
5 bar two -0.340455 -0.737039 0.210332
6 foo one 0.506664 1.049853 0.458362
7 foo three -0.358317 -0.598262 0.065075