I have the following DataFrame in pandas:
df = pd.DataFrame({'a' : ['hello', 'world', 'great', 'hello'], 'b' : ['world', None, 'hello', 'world'], 'c' : [None, 'hello', 'great', None]})
I would like to count, for each row, how many times each unique value of column 'a' occurs across all columns (including 'a' itself), and store the results in new columns named after those values, such as 'hello_count', 'world_count', and so on. The end result would look something like this:
df = pd.DataFrame({'a' : ['hello', 'world', 'great', 'hello'], 'b' : ['world', None, 'hello', 'world'], 'c' : [None, 'hello', 'great', None], 'hello_count' : [1,1,1,1], 'world_count' : [1,1,0,1], 'great_count' : [0,0,2,0]})
I tried:
df['a', 'b', 'a'].groupby('a').agg(['count'])
but that did not work. Any help is much appreciated.
Let's use pd.get_dummies and groupby (df1 here is the DataFrame from the question):
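# Strip the 2-character prefix ('a_', 'b_', 'c_') from each dummy column,
# then sum the columns that end up sharing a name.
(df1.assign(**pd.get_dummies(df1)
                .pipe(lambda x: x.groupby(x.columns.str[2:], axis=1)
                                 .sum())))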
Output:
       a      b      c  great  hello  world
0  hello  world   None      0      1      1
1  world   None  hello      0      1      1
2  great  hello  great      2      1      0
3  hello  world   None      0      1      1
Here is the above solution in steps.
df_gd = pd.get_dummies(df1)
print(df_gd)
   a_great  a_hello  a_world  b_hello  b_world  c_great  c_hello
0        0        1        0        0        1        0        0
1        0        0        1        0        0        0        1
2        1        0        0        1        0        1        0
3        0        1        0        0        1        0        0
df_gb = df_gd.groupby(df_gd.columns.str[2:], axis=1).sum()
print(df_gb)
   great  hello  world
0      0      1      1
1      0      1      1
2      2      1      0
3      0      1      1
df_out = df1.join(df_gb)
print(df_out)
Output:
       a      b      c  great  hello  world
0  hello  world   None      0      1      1
1  world   None  hello      0      1      1
2  great  hello  great      2      1      0
3  hello  world   None      0      1      1
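Note that groupby(..., axis=1) is deprecated in recent pandas releases (2.1 and later). A minimal sketch of the same get_dummies idea that avoids it is shown below: it transposes the dummy frame, groups on the index, and transposes back. The names dummies and counts are illustrative, and the reindex step is an addition that restricts the result to the values found in column 'a' and appends the '_count' suffix the question asks for.

```python
import pandas as pd

df1 = pd.DataFrame({
    'a': ['hello', 'world', 'great', 'hello'],
    'b': ['world', None, 'hello', 'world'],
    'c': [None, 'hello', 'great', None],
})

# One-hot encode with an empty prefix so each dummy column is named
# after the value itself (duplicate column names are expected here).
dummies = pd.get_dummies(df1, prefix='', prefix_sep='')

# Sum the same-named columns: transpose, group on the index, transpose back.
counts = dummies.T.groupby(level=0).sum().T.astype(int)

# Keep only the values that occur in column 'a', in order of first appearance.
counts = counts.reindex(columns=df1['a'].unique(), fill_value=0)

df_out = df1.join(counts.add_suffix('_count'))
print(df_out)
```

The reindex call matters when other columns contain values that never appear in 'a'; without it, those values would get count columns too.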
Using df.apply in a loop simplifies the job. For each unique value in column 'a', every row is checked for how many of its elements equal that value:
for ss in df.a.unique():
    # Count how many cells in each row equal the current value
    df[ss+"_count"] = df.apply(lambda row: sum(map(lambda x: x==ss, row)), axis=1)
print(df)
Output:
       a      b      c  hello_count  world_count  great_count
0  hello  world   None            1            1            0
1  world   None  hello            1            1            0
2  great  hello  great            1            0            2
3  hello  world   None            1            1            0
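A row-wise apply calls a Python function once per row, which can be slow on large frames. As a possible alternative, here is a sketch that does the same comparison column-wise with DataFrame.eq and sums across each row. The cols list is an assumption: it fixes the columns to scan up front so that the newly added count columns are not included in later comparisons.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['hello', 'world', 'great', 'hello'],
    'b': ['world', None, 'hello', 'world'],
    'c': [None, 'hello', 'great', None],
})

# Columns to search; fixed before the loop so new columns are ignored.
cols = ['a', 'b', 'c']

for value in df['a'].unique():
    # eq() yields a boolean frame (None never equals a string);
    # summing along axis=1 counts the matches in each row.
    df[f'{value}_count'] = df[cols].eq(value).sum(axis=1)

print(df)
```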
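You can create a dictionary, d_unique = {}, and store the number of unique values of each column in it, keyed by column name. Note that this gives the number of distinct values per column, not per-row occurrence counts. Consider a DataFrame named data_rnr: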
d_unique = {}
for col in data_rnr.columns:
    print(data_rnr[col].name)
    print(len(data_rnr[col].unique()))
    # unique() also counts a missing value (None/NaN) as one distinct entry
    d_unique[data_rnr[col].name] = len(data_rnr[col].unique())
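The same dictionary can probably be built without an explicit loop using DataFrame.nunique. The sketch below assumes data_rnr holds the sample data from the question (the original answer does not say what it contains). Passing dropna=False mirrors len(unique()), which counts a missing value as one distinct entry; the default dropna=True excludes it.

```python
import pandas as pd

# Assumption: data_rnr is the sample frame from the question.
data_rnr = pd.DataFrame({
    'a': ['hello', 'world', 'great', 'hello'],
    'b': ['world', None, 'hello', 'world'],
    'c': [None, 'hello', 'great', None],
})

# Distinct values per column, counting a missing value as one entry.
d_unique = data_rnr.nunique(dropna=False).to_dict()

# Distinct values per column, ignoring missing values.
d_unique_no_na = data_rnr.nunique().to_dict()

print(d_unique)
print(d_unique_no_na)
```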