I have a Pandas DataFrame with a large number of unique values. I would like to group these values with a more general column. By doing so I expect to add hierarchies to my data and thus make analysis easier.
One thing that worked was to copy the column and replace the values as follows:
data.loc[data['new_col'].str.contains('string0|string1'), 'new_col']\
= 'substitution'
However, I am trying to find a way to reproduce this easily without adding a condition for each entry.
I also tried other methods, without success.
I would appreciate advice on how to approach this.
import pandas as pd
# My DataFrame looks similar to this:
>>> df = pd.DataFrame({'A': ['a', 'w', 'c', 'd', 'z']})
# The dictionary where I store the generalization:
>>> subs = {'g1': ['a', 'b', 'c', 'd'],
... 'g2': ['w', 'x', 'y', 'z']}
# Desired result:
>>> df
   A   H
0  a  g1
1  w  g2
2  c  g1
3  d  g1
4  z  g2
Create a new dict by swapping each key with the values in its list, so that every list element points to its group. Then map df.A through the swapped dict.
swap_dict = {x: k for k, v in d.items() for x in v}
Out[1054]:
{'a': 's1',
'b': 's1',
'c': 's1',
'd': 's1',
'w': 's2',
'x': 's2',
'y': 's2',
'z': 's2'}
df['H'] = df.A.map(swap_dict)
Out[1058]:
   A   H
0  a  s1
1  w  s2
2  c  s1
3  d  s1
4  z  s2
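As a self-contained sketch of the two steps above, here is the same idea run against the subs dict from the question, so the labels come out as g1 and g2. The extra row 'q' and the fallback label 'other' are assumptions added for illustration: map() yields NaN for any value missing from the swapped dict, and fillna() is one possible way to label those rows.

```python
import pandas as pd

# 'q' is a hypothetical value that belongs to no group (not in the original data)
df = pd.DataFrame({'A': ['a', 'w', 'c', 'd', 'z', 'q']})

subs = {'g1': ['a', 'b', 'c', 'd'],
        'g2': ['w', 'x', 'y', 'z']}

# Invert the dict: every list element becomes a key pointing at its group label
swap_dict = {value: group for group, values in subs.items() for value in values}

# map() leaves NaN where a value is not in swap_dict; 'other' is an assumed fallback label
df['H'] = df['A'].map(swap_dict).fillna('other')

print(df)
```

If unmatched rows should stay empty instead, dropping the fillna() call keeps them as NaN.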
Note: I directly use the keys of your dict as the values of H instead of g1, g2, ..., because I think they are enough to identify each group of values. If you still want g1, g2, ..., it is easy to accomplish. Just let me know.
I also named your dict d in my code.