Calculate percentage of similar values in pandas dataframe
I have one dataframe df, with two columns: Script (with text) and Speaker.
Script Speaker
aze Speaker 1
art Speaker 2
ghb Speaker 3
jka Speaker 1
tyc Speaker 1
avv Speaker 2
bhj Speaker 1
And I have the following list: L = ['a','b','c']
With the following code,
df2 = (df.set_index('Speaker')['Script'].str.findall('|'.join(L))
         .str.join('|')
         .str.get_dummies()
         .groupby(level=0).sum())  # sum(level=0) was removed in pandas 2.0
print(df2)
I obtain this dataframe df2:
Speaker a b c
Speaker 1 2 1 1
Speaker 2 2 0 0
Speaker 3 0 1 0
Which line can I add to my code to turn each row of df2 into percentages of all the matches for that speaker, so that I get the following dataframe df3:
Speaker a b c
Speaker 1 50% 25% 25%
Speaker 2 100% 0 0
Speaker 3 0 100% 0
You could divide each row by its sum (the sum along axis 1), then cast to string and add %:
out = (df.set_index('Speaker')['Script'].str.findall('|'.join(L))
         .str.join('|')
         .str.get_dummies()
         .groupby(level=0).sum())
# divide each row by its total, then format as a percentage string
out.div(out.sum(axis=1), axis=0).mul(100).astype(int).astype(str).add('%')
a b c
Speaker
Speaker 1 50% 25% 25%
Speaker 2 100% 0% 0%
Speaker 3 0% 100% 0%
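For reference, here is a self-contained sketch of this row-normalisation approach, rebuilding the question's data from scratch. The names counts, row_totals and pct are illustrative only, and the use of groupby(level=0).sum() and DataFrame.div(..., axis=0) is an assumption about what works on recent pandas releases rather than something taken from the original answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'Script': ['aze', 'art', 'ghb', 'jka', 'tyc', 'avv', 'bhj'],
    'Speaker': ['Speaker 1', 'Speaker 2', 'Speaker 3', 'Speaker 1',
                'Speaker 1', 'Speaker 2', 'Speaker 1'],
})
L = ['a', 'b', 'c']

# One indicator column per letter, summed per speaker
counts = (df.set_index('Speaker')['Script']
            .str.findall('|'.join(L))
            .str.join('|')
            .str.get_dummies()
            .groupby(level=0).sum())

# Divide every row by its own total (axis=0 aligns on the row index)
row_totals = counts.sum(axis=1)
pct = counts.div(row_totals, axis=0).mul(100).astype(int).astype(str).add('%')
print(pct)
```

Dividing by the row totals (axis=1 sums) is what makes each speaker's row add up to 100%; dividing by column totals would instead give each letter's share across speakers.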
Starting from your original dataframe, if you want percentages rather than the grouped sum of dummies, you can rewrite the whole script as below:
m = df.set_index('Speaker')['Script'].str.findall('|'.join(L))  # list of matches per row
m = m.explode().reset_index()  # one row per match, with Speaker back as a column
final = pd.crosstab(m['Speaker'], m['Script'], normalize='index').mul(100)  # row percentages
Script a b c
Speaker
Speaker 1 50.0 25.0 25.0
Speaker 2 100.0 0.0 0.0
Speaker 3 0.0 100.0 0.0
If you don't want the percentage, just use:
pd.crosstab(m['Speaker'],m['Script'])
Script a b c
Speaker
Speaker 1 2 1 1
Speaker 2 2 0 0
Speaker 3 0 1 0
Note: this requires pandas 0.25 or later, since explode was added in that version.
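A self-contained sketch of the crosstab route is shown below, with one extra step that turns the float percentages into the '50%' style strings asked for in the question. The labels variable and the round-then-cast step are additions for illustration, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'Script': ['aze', 'art', 'ghb', 'jka', 'tyc', 'avv', 'bhj'],
    'Speaker': ['Speaker 1', 'Speaker 2', 'Speaker 3', 'Speaker 1',
                'Speaker 1', 'Speaker 2', 'Speaker 1'],
})
L = ['a', 'b', 'c']

# List of matches per row, then one row per individual match
m = df.set_index('Speaker')['Script'].str.findall('|'.join(L))
m = m.explode().reset_index()

# Row-normalised cross tabulation, scaled to percentages
final = pd.crosstab(m['Speaker'], m['Script'], normalize='index').mul(100)

# Optional: format as strings with a percent sign
labels = final.round().astype(int).astype(str) + '%'
print(labels)
```

Unlike the get_dummies version, this counts every occurrence of a letter within a script rather than only its presence; for the sample data the two give the same result.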
(df.set_index('Speaker')['Script'].str.extractall(f'({"|".join(L)})')
.groupby('Speaker')[0].value_counts(normalize=True)
.unstack(fill_value=0)
)
Output:
0 a b c
Speaker
Speaker 1 0.5 0.25 0.25
Speaker 2 1.0 0.00 0.00
Speaker 3 0.0 1.00 0.00
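The same extractall pipeline can be run end to end as in the sketch below, which rebuilds the question's data and scales the fractions to percentages. The names pattern, matches, share and percent are illustrative, and the final mul(100) is an addition to match the units used elsewhere on this page.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'Script': ['aze', 'art', 'ghb', 'jka', 'tyc', 'avv', 'bhj'],
    'Speaker': ['Speaker 1', 'Speaker 2', 'Speaker 3', 'Speaker 1',
                'Speaker 1', 'Speaker 2', 'Speaker 1'],
})
L = ['a', 'b', 'c']

# A single capture group matching any of the letters
pattern = f'({"|".join(L)})'

# One row per match, indexed by (Speaker, match number); the capture is column 0
matches = df.set_index('Speaker')['Script'].str.extractall(pattern)

# Share of each letter within each speaker's matches
share = (matches.groupby('Speaker')[0]
                .value_counts(normalize=True)
                .unstack(fill_value=0))

percent = share.mul(100)
print(percent)
```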
Given the example, you can try the following line of code:
df = df.div(df.sum(axis=1), axis=0).mul(100).astype(int)
With the data you provide:
import pandas as pd
data = {'a':[2,2,0],'b':[1,0,1],'c':[1,0,0]}
df = pd.DataFrame(data)
# divide each row by its total, scale to percent, truncate to int
df = df.div(df.sum(axis=1), axis=0).mul(100).astype(int)
print(df)
Output:
a b c
0 50 25 25
1 100 0 0
2 0 100 0
Or, if you wish to add the '%' symbol:
df = df.div(df.sum(axis=1), axis=0).mul(100).astype(int).astype(str) + '%'
Output:
a b c
0 50% 25% 25%
1 100% 0% 0%
2 0% 100% 0%
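One detail worth keeping in mind with astype(int): it truncates rather than rounds, so percentages that are not whole numbers lose their fractional part. The sketch below adds a hypothetical fourth row (counts 2, 1, 0), which is not in the question's data, to show the difference; calling round() before the cast is one possible way to avoid it.

```python
import pandas as pd

# First three rows are the question's counts; the fourth row is hypothetical
counts = pd.DataFrame({'a': [2, 2, 0, 2], 'b': [1, 0, 1, 1], 'c': [1, 0, 0, 0]})

# Row percentages as floats
pct = counts.div(counts.sum(axis=1), axis=0).mul(100)

truncated = pct.astype(int)         # 66.66... becomes 66
rounded = pct.round().astype(int)   # 66.66... becomes 67

print(truncated)
print(rounded)
```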