Calculating the percentage of similar values in a pandas DataFrame

Question

I have a dataframe df with two columns: Script (containing text) and Speaker.

Script  Speaker
aze     Speaker 1 
art     Speaker 2
ghb     Speaker 3
jka     Speaker 1
tyc     Speaker 1
avv     Speaker 2 
bhj     Speaker 1

And I have the following list: L = ['a','b','c']
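To reproduce the examples, the sample data can be rebuilt by hand. This is only a sketch: the values are copied from the table above, but the way the dataframe is constructed here is an assumption.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "Script": ["aze", "art", "ghb", "jka", "tyc", "avv", "bhj"],
    "Speaker": ["Speaker 1", "Speaker 2", "Speaker 3", "Speaker 1",
                "Speaker 1", "Speaker 2", "Speaker 1"],
})

# The list of substrings to look for in each Script value.
L = ["a", "b", "c"]
```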

With the following code,

df2 = (df.set_index('Speaker')['Script'].str.findall('|'.join(L))
         .str.join('|')
         .str.get_dummies()
         .groupby(level=0).sum())  # sum(level=0) was removed in pandas 2.0
print(df2)

I obtain this dataframe df2:

Speaker     a    b    c
Speaker 1   2    1    1
Speaker 2   2    0    0
Speaker 3   0    1    0

What can I add to my code so that each row of df2 is expressed as a percentage of that speaker's total, to get the following dataframe df3:

Speaker     a    b    c
Speaker 1   50%  25%   25%
Speaker 2  100%    0   0
Speaker 3   0   100%   0

Answer 1

You could divide each row by its own sum (the sum along axis 1), then cast to string and add %:

out = (df.set_index('Speaker')['Script'].str.findall('|'.join(L))
         .str.join('|')
         .str.get_dummies()
         .groupby(level=0).sum())  # sum(level=0) was removed in pandas 2.0

# divide each row by its row total, then format as percent strings
out.div(out.sum(axis=1), axis=0).mul(100).astype(int).astype(str).add('%')

              a     b    c
Speaker                   
Speaker 1   50%   25%  25%
Speaker 2  100%    0%   0%
Speaker 3    0%  100%   0%
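The direction of the division matters: each row has to be divided by its own total, not by a column total. A minimal sketch, assuming the counts from df2 are typed in by hand (row_totals, shares and pct are illustrative names, not part of the original code):

```python
import pandas as pd

# Counts per speaker, copied from df2 in the question.
out = pd.DataFrame(
    {"a": [2, 2, 0], "b": [1, 0, 1], "c": [1, 0, 0]},
    index=pd.Index(["Speaker 1", "Speaker 2", "Speaker 3"], name="Speaker"),
)

row_totals = out.sum(axis=1)          # one total per speaker
shares = out.div(row_totals, axis=0)  # align the totals with the row index

# Scale to 0-100, truncate to int, and append the percent sign.
pct = shares.mul(100).astype(int).astype(str).add("%")
print(pct)
```

Passing axis=0 to div tells pandas to match row_totals against the row labels, so no manual reshaping of the totals is needed.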

Answer 2

Starting from your original dataframe, if you want percentages rather than the grouped sum of dummies, you can rewrite the whole script as follows:

m = df.set_index('Speaker')['Script'].str.findall('|'.join(L))  # list of matches per row
m = m.explode().reset_index()  # one row per match, with Speaker and Script columns
final = pd.crosstab(m['Speaker'], m['Script'], normalize='index').mul(100)  # row-wise percentages

Script         a      b     c
Speaker                      
Speaker 1   50.0   25.0  25.0
Speaker 2  100.0    0.0   0.0
Speaker 3    0.0  100.0   0.0

If you don't want the percentages, just use:

pd.crosstab(m['Speaker'],m['Script'])

Script     a  b  c
Speaker           
Speaker 1  2  1  1
Speaker 2  2  0  0
Speaker 3  0  1  0

Note: this requires pandas 0.25 or later (explode was added in that version).

Answer 3

(df.set_index('Speaker')['Script'].str.extractall(f'({"|".join(L)})')
   .groupby('Speaker')[0].value_counts(normalize=True)
   .unstack(fill_value=0)
)

Output:

0            a     b     c
Speaker                   
Speaker 1  0.5  0.25  0.25
Speaker 2  1.0  0.00  0.00
Speaker 3  0.0  1.00  0.00
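This result holds fractions between 0 and 1 rather than the percent strings shown in the question. One way to convert it, sketched here with the fractions typed in by hand (frac and pct are illustrative names):

```python
import pandas as pd

# Fractions copied from the output above.
frac = pd.DataFrame(
    {"a": [0.5, 1.0, 0.0], "b": [0.25, 0.0, 1.0], "c": [0.25, 0.0, 0.0]},
    index=pd.Index(["Speaker 1", "Speaker 2", "Speaker 3"], name="Speaker"),
)

# Scale to 0-100, round to whole numbers, then append the percent sign.
pct = frac.mul(100).round().astype(int).astype(str) + "%"
print(pct)
```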

Answer 4

Given the example, you can try the following line of code:

df = df.div(df.sum(axis=1), axis=0).mul(100).astype(int)

With the data you provide:

import pandas as pd
data = {'a':[2,2,0],'b':[1,0,1],'c':[1,0,0]}
df = pd.DataFrame(data)
# divide each row by its row total, scale to 0-100 and truncate to int
df = df.div(df.sum(axis=1), axis=0).mul(100).astype(int)
print(df)

Output:

     a    b   c
0   50   25  25
1  100    0   0
2    0  100   0

Or, if you wish to add the '%' symbol:

df = df.div(df.sum(axis=1), axis=0).mul(100).astype(int).astype(str) + '%'

Output:

      a     b    c
0   50%   25%  25%
1  100%    0%   0%
2    0%  100%   0%
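One caveat with astype(int): it truncates instead of rounding, which makes no difference for the sample data but does once the shares are not whole percentages. A small sketch with hypothetical counts (not from the question) that shows the difference:

```python
import pandas as pd

# Hypothetical counts, chosen so the shares are not whole percentages.
counts = pd.DataFrame({"a": [1, 2], "b": [2, 1]})
pct = counts.div(counts.sum(axis=1), axis=0).mul(100)

truncated = pct.astype(int)        # drops the fractional part
rounded = pct.round().astype(int)  # rounds to the nearest integer first

print(truncated)
print(rounded)
```

With truncation the two columns of a row can add up to 99 rather than 100, so calling round() before the cast is usually the safer choice when the numbers are meant for display.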

Answer 1
8 (accepted) 2019-12-27 15:42:22

Answer 2
5 2019-12-27 15:52:20

Answer 3
3 2019-12-27 15:52:19

Answer 4
2 2019-12-27 15:41:42
