
Count occurrences of unique values in a pandas dataframe across multiple columns

I have the following dataframe in pandas:

df = pd.DataFrame({'a' : ['hello', 'world', 'great', 'hello'], 'b' : ['world', None, 'hello', 'world'], 'c' : [None, 'hello', 'great', None]})

I would like to count the occurrences of each unique value of column 'a' across all columns (including column 'a' itself), and save the counts into new dataframe columns named after those values, such as 'hello_count', 'world_count' and so on. Hence the end result would be something like:

 df = pd.DataFrame({'a' : ['hello', 'world', 'great', 'hello'], 'b' : ['world', None, 'hello', 'world'], 'c' : [None, 'hello', 'great', None], 'hello_count' : [1,1,1,1], 'world_count' : [1,1,0,1], 'great_count' : [0,0,2,0]})

I tried

df['a', 'b', 'a'].groupby('a').agg(['count'])

but that did not work. Any help is really appreciated.

Let's use pd.get_dummies and groupby:

(df1.assign(**pd.get_dummies(df1)
                .pipe(lambda x: x.groupby(x.columns.str[2:], axis=1)
                .sum())))

Output:

       a      b      c  great  hello  world
0  hello  world   None      0      1      1
1  world   None  hello      0      1      1
2  great  hello  great      2      1      0
3  hello  world   None      0      1      1

Here is the above solution in steps.

Step 1: pd.get_dummies

df_gd = pd.get_dummies(df1)
print(df_gd)

   a_great  a_hello  a_world  b_hello  b_world  c_great  c_hello
0        0        1        0        0        1        0        0
1        0        0        1        0        0        0        1
2        1        0        0        1        0        1        0
3        0        1        0        0        1        0        0

Step 2: group by column names, ignoring the first two characters (the 'a_', 'b_', 'c_' prefixes)

df_gb = df_gd.groupby(df_gd.columns.str[2:], axis=1).sum()
print(df_gb)

   great  hello  world
0      0      1      1
1      0      1      1
2      2      1      0
3      0      1      1

Step 3: Join back to the original dataframe

df_out = df1.join(df_gb)
print(df_out)

Output:

       a      b      c  great  hello  world
0  hello  world   None      0      1      1
1  world   None  hello      0      1      1
2  great  hello  great      2      1      0
3  hello  world   None      0      1      1
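On newer pandas releases the steps above may need small adjustments: as far as I know, groupby(..., axis=1) is deprecated from pandas 2.1 onwards, and pd.get_dummies returns boolean columns by default from pandas 2.0 onwards. Below is a minimal sketch of the same idea under those assumptions. Passing prefix='' and prefix_sep='' (so the dummy columns are named after the values themselves instead of being sliced with str[2:]) and adding the '_count' suffix are choices made for this sketch, not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({
    'a': ['hello', 'world', 'great', 'hello'],
    'b': ['world', None, 'hello', 'world'],
    'c': [None, 'hello', 'great', None],
})

# One indicator column per (column, value) pair, labelled with the value only,
# so the same value coming from different source columns shares a label.
dummies = pd.get_dummies(df1, prefix='', prefix_sep='', dtype=int)

# Sum the indicator columns that share a label.
# Transposing first avoids the deprecated groupby(axis=1).
counts = dummies.T.groupby(level=0).sum().T

# Attach the counts to the original frame with the naming asked for in the question.
df_out = df1.join(counts.add_suffix('_count'))
print(df_out)
```

Because the labels no longer depend on a fixed two-character prefix, this variant should also work when the source columns have longer names.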

Using df.apply in a loop simplifies the job. For each unique value in column 'a', every row is tested for how many of its elements equal that string:

for ss in df.a.unique():
    df[ss+"_count"] = df.apply(lambda row: sum(map(lambda x: x==ss, row)), axis=1)

print(df)

Output:

       a      b      c  hello_count  world_count  great_count
0  hello  world   None            1            1            0
1  world   None  hello            1            1            0
2  great  hello  great            1            0            2
3  hello  world   None            1            1            0
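The same per-row count can probably be written without apply, by comparing the whole frame against the value and summing the matches across each row. This is only a sketch; the value_cols list is a name introduced here so that the newly added count columns are not included in later comparisons.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['hello', 'world', 'great', 'hello'],
    'b': ['world', None, 'hello', 'world'],
    'c': [None, 'hello', 'great', None],
})

# Fix the columns to search up front (hypothetical helper name),
# so the count columns added in the loop are never scanned.
value_cols = ['a', 'b', 'c']

for ss in df['a'].unique():
    # eq() yields a boolean frame; summing across axis=1 counts matches per row.
    df[ss + '_count'] = df[value_cols].eq(ss).sum(axis=1)

print(df)
```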

You can create a dictionary d_unique = {} and store the number of unique values of each column in it, keyed by column name. Consider a dataframe named data_rnr:

d_unique={}
for col in data_rnr.columns:
    print(data_rnr[col].name)
    print(len(data_rnr[col].unique()))
    d_unique[data_rnr[col].name]=len(data_rnr[col].unique())
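If only the dictionary is needed, DataFrame.nunique should give the same numbers in one call. A small sketch follows; the contents of data_rnr here are made up for illustration, and dropna=False is passed so that missing values are counted as a value, matching len(unique()).

```python
import pandas as pd

# Hypothetical sample data standing in for data_rnr.
data_rnr = pd.DataFrame({
    'x': [1, 1, 2],
    'y': ['p', None, 'p'],
})

# Number of distinct values per column, with missing values counted as one value.
d_unique = data_rnr.nunique(dropna=False).to_dict()
print(d_unique)
```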
