I have a dataframe df with two columns: Script (containing text) and Speaker.
Script Speaker
aze Speaker 1
art Speaker 2
ghb Speaker 3
jka Speaker 1
tyc Speaker 1
avv Speaker 2
bhj Speaker 1
And I have the following list: list = ['a','b','c']
My goal is to obtain a matrix/dataframe like this, counting only the items from my list.
Speaker a b c
Speaker 1 2 1 1
Speaker 2 2 0 0
Speaker 3 0 1 0
I tried the following :
r = '|'.join(list)
nb_df = df[df['Script'].str.contains(r, case = False)]
df_target = nb_df.groupby('Speaker')['Speaker'].count()
This gives me part of my target: I know how many times each speaker says any of the items from the list, but I can't distinguish the count for each individual item.
First, do not use list as a variable name, because it shadows the Python builtin.
Use crosstab with Series.str.extractall:
print (df)
Script Speaker
0 azc Speaker 1 <- sample data changed (azc instead of aze)
1 art Speaker 2
2 ghb Speaker 3
3 jka Speaker 1
4 tyc Speaker 1
5 avv Speaker 2
6 bhj Speaker 1
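L = ['a','b','c']
# one capture group that matches any item of L
pat = r'({})'.format('|'.join(L))
# one row per match, labelled with the Speaker it came from
df = df.set_index('Speaker')['Script'].str.extractall(pat)[0].reset_index(name='val')
# count the matches per Speaker and per item
df = pd.crosstab(df['Speaker'], df['val'])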
print (df)
val a b c
Speaker
Speaker 1 2 1 2
Speaker 2 2 0 0
Speaker 3 0 1 0
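For reference, here is a self-contained sketch of the crosstab approach on the changed sample data. The names matches and counts are introduced here for illustration only. The re.escape call and the final reindex are additions, assuming that list items might contain regex metacharacters or might never occur in any script.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "Script": ["azc", "art", "ghb", "jka", "tyc", "avv", "bhj"],
    "Speaker": ["Speaker 1", "Speaker 2", "Speaker 3", "Speaker 1",
                "Speaker 1", "Speaker 2", "Speaker 1"],
})
L = ["a", "b", "c"]

# Escape each item so it is matched literally, then wrap in one capture group.
pat = "({})".format("|".join(map(re.escape, L)))

# One row per match, labelled with the Speaker it came from.
matches = (df.set_index("Speaker")["Script"]
             .str.extractall(pat)[0]
             .reset_index(name="val"))

# Count matches per Speaker and item; reindex keeps a column for every
# item of L even if it never occurs.
counts = pd.crosstab(matches["Speaker"], matches["val"])
counts = counts.reindex(columns=L, fill_value=0)
print(counts)
```

Keeping the result in a separate variable leaves the original df available for further steps.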
If performance is not critical, use the three text functions Series.str.findall, Series.str.join and Series.str.get_dummies, then sum per index level:
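# df here is the original dataframe, not the crosstab result from above
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df = (df.set_index('Speaker')['Script'].str.findall('|'.join(L))
.str.join('|')
.str.get_dummies()
.groupby(level=0).sum())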
print (df)
a b c
Speaker
Speaker 1 2 1 2
Speaker 2 2 0 0
Speaker 3 0 1 0
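On recent pandas versions the per-level sum is written as groupby(level=0).sum(), because the level argument of sum was deprecated in pandas 1.3 and removed in 2.0. A minimal self-contained sketch of the same idea follows; the variable name result is an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Script": ["azc", "art", "ghb", "jka", "tyc", "avv", "bhj"],
    "Speaker": ["Speaker 1", "Speaker 2", "Speaker 3", "Speaker 1",
                "Speaker 1", "Speaker 2", "Speaker 1"],
})
L = ["a", "b", "c"]

result = (df.set_index("Speaker")["Script"]
            .str.findall("|".join(L))   # list of matched items per row
            .str.join("|")              # e.g. ['a', 'c'] -> 'a|c'
            .str.get_dummies()          # 0/1 indicator column per item
            .groupby(level=0).sum())    # add up the indicators per Speaker
print(result)
```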
You can use Series.str.findall() with str.join() and str.get_dummies(), followed by groupby().sum():
l = ['a','b','c']
final=(df['Script'].str.findall('|'.join(l)).str.join('|')
.str.get_dummies().groupby(df['Speaker']).sum())
a b c
Speaker
Speaker 1 2 1 1
Speaker 2 2 0 0
Speaker 3 0 1 0
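One caveat worth checking against your data: str.get_dummies produces 0/1 indicators, so an item that occurs several times within a single script is counted once for that row. If every occurrence should count, one possible alternative (not part of the answers above) is Series.str.count per item. The sketch below uses hypothetical data with a repeated letter to show the difference; the names per_row, occurrences and presence are illustrative.

```python
import re

import pandas as pd

# Hypothetical data: 'aab' contains the item 'a' twice.
df = pd.DataFrame({
    "Script": ["aab", "abc"],
    "Speaker": ["Speaker 1", "Speaker 1"],
})
items = ["a", "b", "c"]

# Count every occurrence of each item in each row, then sum per Speaker.
per_row = pd.DataFrame(
    {item: df["Script"].str.count(re.escape(item)) for item in items}
)
occurrences = per_row.groupby(df["Speaker"]).sum()

# The get_dummies variant only records whether an item is present in a row.
presence = (df["Script"].str.findall("|".join(items))
              .str.join("|")
              .str.get_dummies()
              .groupby(df["Speaker"])
              .sum())

print(occurrences)  # 'a' is counted 3 times for Speaker 1
print(presence)     # 'a' is counted 2 times for Speaker 1
```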