
Count occurrences of unique values in a pandas dataframe across multiple columns

I have the following dataframe in pandas

import pandas as pd

df = pd.DataFrame({'a': ['hello', 'world', 'great', 'hello'],
                   'b': ['world', None, 'hello', 'world'],
                   'c': [None, 'hello', 'great', None]})

I would like to count the occurrences of each unique value in column 'a' across all the other columns (and column 'a' itself), and save the counts in new dataframe columns named after the values in column 'a', such as 'hello_count', 'world_count', and so on. Hence the end result would be something like

df = pd.DataFrame({'a': ['hello', 'world', 'great', 'hello'],
                   'b': ['world', None, 'hello', 'world'],
                   'c': [None, 'hello', 'great', None],
                   'hello_count': [1, 1, 1, 1],
                   'world_count': [1, 1, 0, 1],
                   'great_count': [0, 0, 2, 0]})

I tried

df[['a', 'b', 'c']].groupby('a').agg(['count'])

but that did not work. Any help is really appreciated.

Let's use pd.get_dummies and groupby:

(df.assign(**pd.get_dummies(df)
               .pipe(lambda x: x.groupby(x.columns.str[2:], axis=1)
                                .sum())))

Output:

       a      b      c  great  hello  world
0  hello  world   None      0      1      1
1  world   None  hello      0      1      1
2  great  hello  great      2      1      0
3  hello  world   None      0      1      1
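Note that newer pandas versions deprecate `DataFrame.groupby(axis=1)`. A sketch of the same idea that avoids it by transposing, grouping on the (duplicated) index, and transposing back, assuming the question's `df`:

```python
import pandas as pd

df = pd.DataFrame({'a': ['hello', 'world', 'great', 'hello'],
                   'b': ['world', None, 'hello', 'world'],
                   'c': [None, 'hello', 'great', None]})

# One-hot encode, then strip the 'a_'/'b_'/'c_' prefixes so columns
# that encode the same value share a name.
dummies = pd.get_dummies(df)
dummies.columns = dummies.columns.str[2:]

# Transpose, group the duplicate names on the index, sum, transpose back.
counts = dummies.T.groupby(level=0).sum().T

out = df.join(counts)
print(out)
```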

Here is the above solution in steps.

Step 1: pd.get_dummies

df_gd = pd.get_dummies(df)
print(df_gd)

   a_great  a_hello  a_world  b_hello  b_world  c_great  c_hello
0        0        1        0        0        1        0        0
1        0        0        1        0        0        0        1
2        1        0        0        1        0        1        0
3        0        1        0        0        1        0        0

Step 2: groupby the column names, ignoring the first two characters (the 'a_'/'b_'/'c_' prefixes)

df_gb = df_gd.groupby(df_gd.columns.str[2:], axis=1).sum()
print(df_gb)

   great  hello  world
0      0      1      1
1      0      1      1
2      2      1      0
3      0      1      1

Step 3: Join back to original dataframe

df_out = df.join(df_gb)
print(df_out)

Output:

       a      b      c  great  hello  world
0  hello  world   None      0      1      1
1  world   None  hello      0      1      1
2  great  hello  great      2      1      0
3  hello  world   None      0      1      1
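To get the exact column names asked for in the question (`hello_count`, `world_count`, ...), the counts can be suffixed with `add_suffix` before joining. A self-contained sketch (the counts are rebuilt here with a transpose instead of the deprecated `groupby(axis=1)`):

```python
import pandas as pd

df = pd.DataFrame({'a': ['hello', 'world', 'great', 'hello'],
                   'b': ['world', None, 'hello', 'world'],
                   'c': [None, 'hello', 'great', None]})

dummies = pd.get_dummies(df)
dummies.columns = dummies.columns.str[2:]
counts = dummies.T.groupby(level=0).sum().T

# add_suffix renames 'hello' -> 'hello_count' and so on.
df_out = df.join(counts.add_suffix('_count'))
print(df_out.columns.tolist())
# ['a', 'b', 'c', 'great_count', 'hello_count', 'world_count']
```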

Using df.apply in a loop keeps the job simple. Each row is tested for how many of its elements equal the target string:

for ss in df.a.unique():
    df[ss + "_count"] = df.apply(lambda row: sum(map(lambda x: x == ss, row)), axis=1)

print(df)

Output:

       a      b      c  hello_count  world_count  great_count
0  hello  world   None            1            1            0
1  world   None  hello            1            1            0
2  great  hello  great            1            0            2
3  hello  world   None            1            1            0
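The loop above can also be vectorized: comparing the whole frame to the string with `eq` and summing across columns gives the same per-row counts without `apply`. A sketch, assuming the question's `df`:

```python
import pandas as pd

df = pd.DataFrame({'a': ['hello', 'world', 'great', 'hello'],
                   'b': ['world', None, 'hello', 'world'],
                   'c': [None, 'hello', 'great', None]})

for ss in df['a'].unique():
    # df.eq(ss) builds a boolean frame; summing along axis=1
    # counts how many cells in each row equal ss. Restricting to
    # the original columns keeps the new *_count columns out of
    # later comparisons.
    df[ss + '_count'] = df[['a', 'b', 'c']].eq(ss).sum(axis=1)

print(df)
```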

You can create a dictionary d_unique = {} and store each column's unique-value count in it as a key-value pair; here the dataframe is named data_rnr:

d_unique = {}
for col in data_rnr.columns:
    print(data_rnr[col].name)
    print(len(data_rnr[col].unique()))
    d_unique[data_rnr[col].name] = len(data_rnr[col].unique())
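For reference, the same per-column unique counts can be read off directly from `DataFrame.nunique`; note that `nunique` excludes missing values by default, so `dropna=False` is needed to match `len(unique())`, which counts `None` as a value. A sketch using the question's `df`:

```python
import pandas as pd

df = pd.DataFrame({'a': ['hello', 'world', 'great', 'hello'],
                   'b': ['world', None, 'hello', 'world'],
                   'c': [None, 'hello', 'great', None]})

# Equivalent of the d_unique loop: distinct values per column,
# with dropna=False so None counts as its own value.
d_unique = df.nunique(dropna=False).to_dict()
print(d_unique)
# {'a': 3, 'b': 3, 'c': 3}
```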
