str.contains with a list of items, counted separately for each item

I have a dataframe df with two columns: Script (text) and Speaker.

Script  Speaker
aze     Speaker 1 
art     Speaker 2
ghb     Speaker 3
jka     Speaker 1
tyc     Speaker 1
avv     Speaker 2 
bhj     Speaker 1

And I have the following list: list = ['a','b','c']

My target is to obtain a matrix/dataframe like this, only with items from my list.

Speaker     a    b    c
Speaker 1   2    1    1
Speaker 2   2    0    0
Speaker 3   0    1    0

I tried the following :

r = '|'.join(list)

nb_df = df[df['Script'].str.contains(r, case = False)]
df_target = nb_df.groupby('Speaker')['Speaker'].count()

This gets me part of the way: I know how many rows per speaker contain any of the searched items, but I can't break that count down by item.

  1. How can I do this with a pandas function (if one exists)?
  2. How could I do it with a Python loop?

First, don't use list as a variable name, because it shadows the Python builtin of the same name.
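To illustrate why the name matters, here is a minimal sketch. The function name shadowing_demo is made up for this example; it only shows that once list is rebound, the builtin constructor is no longer reachable under that name in that scope.

```python
def shadowing_demo():
    # Rebinding the name hides the builtin inside this function.
    list = ['a', 'b', 'c']
    try:
        list('abc')          # tries to call the list object, not the builtin
    except TypeError as exc:
        return str(exc)


message = shadowing_demo()
print(message)               # the TypeError text raised inside the function
print(list('abc'))           # the builtin is untouched outside the function
```

Using a neutral name such as L or items avoids the problem entirely.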

Use crosstab with Series.str.extractall :

print (df)
  Script    Speaker
0    azc  Speaker 1   <- sample data changed (aze -> azc)
1    art  Speaker 2
2    ghb  Speaker 3
3    jka  Speaker 1
4    tyc  Speaker 1
5    avv  Speaker 2
6    bhj  Speaker 1

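import pandas as pd

L = ['a','b','c']
# one capture group that matches any item of the list
pat = r'({})'.format('|'.join(L))
# one row per match, keeping the Speaker of the row it came from
df = df.set_index('Speaker')['Script'].str.extractall(pat)[0].reset_index(name='val')

# count matches per Speaker and per matched item
df = pd.crosstab(df['Speaker'], df['val'])
print (df)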
val        a  b  c
Speaker           
Speaker 1  2  1  2
Speaker 2  2  0  0
Speaker 3  0  1  0
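One caveat: crosstab only creates rows and columns for values that actually matched, so an item that never occurs (or a speaker with no matches) would be missing from the result. The sketch below is one possible way to handle that, not part of the original answer. It assumes the modified sample data (azc), adds a hypothetical item 'd' that never matches, escapes the items with re.escape in case they contain regex metacharacters, and reindexes the result.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'Script': ['azc', 'art', 'ghb', 'jka', 'tyc', 'avv', 'bhj'],
    'Speaker': ['Speaker 1', 'Speaker 2', 'Speaker 3', 'Speaker 1',
                'Speaker 1', 'Speaker 2', 'Speaker 1'],
})
items = ['a', 'b', 'c', 'd']  # 'd' is added for illustration; it never matches

# Escape each item so characters like '.' or '?' are matched literally.
pat = '({})'.format('|'.join(re.escape(s) for s in items))

# One row per match, labelled with the speaker of the originating row.
matches = (df.set_index('Speaker')['Script']
             .str.extractall(pat)[0]
             .reset_index(name='val'))

# Count matches, then force every speaker and every item to be present.
counts = (pd.crosstab(matches['Speaker'], matches['val'])
            .reindex(index=df['Speaker'].unique(), columns=items, fill_value=0))
print(counts)
```

The original DataFrame is left untouched here (the intermediate results get their own names), so it can be reused by the other approaches.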

If performance is not critical, chain three text functions, Series.str.findall , Series.str.join and Series.str.get_dummies , then sum per index level:

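# df is the original DataFrame (Script, Speaker) here
df = (df.set_index('Speaker')['Script'].str.findall('|'.join(L))
        .str.join('|')
        .str.get_dummies()
        .groupby(level=0).sum())   # sum(level=0) was removed in pandas 2.0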
print (df)
           a  b  c
Speaker           
Speaker 1  2  1  2
Speaker 2  2  0  0
Speaker 3  0  1  0
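Note that str.get_dummies produces 0/1 indicators, so this variant counts rows that contain an item rather than occurrences of the item. On the sample data the two are the same because no letter repeats within a row. The sketch below uses made-up data ('aba', 'bc') to show the difference, and compares it with a per-item Series.str.count, which is one possible way to count every occurrence.

```python
import re

import pandas as pd

# Made-up data: 'a' occurs twice in the first row.
df = pd.DataFrame({
    'Script': ['aba', 'bc'],
    'Speaker': ['Speaker 1', 'Speaker 1'],
})
items = ['a', 'b', 'c']
pat = '|'.join(re.escape(s) for s in items)

# Indicator-based: each row contributes at most 1 per item.
flags = (df['Script'].str.findall(pat).str.join('|')
           .str.get_dummies()
           .groupby(df['Speaker']).sum())

# Occurrence-based: str.count counts every match inside each row.
occurrences = (pd.DataFrame({s: df['Script'].str.count(re.escape(s))
                             for s in items})
                 .groupby(df['Speaker']).sum())

print(flags)
print(occurrences)
```

Which of the two is wanted depends on whether "how many times" means rows or individual matches.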

You can use Series.str.findall() with str.join() and str.get_dummies() , then groupby().sum() :

l = ['a','b','c']
final=(df['Script'].str.findall('|'.join(l)).str.join('|')
  .str.get_dummies().groupby(df['Speaker']).sum())

           a  b  c
Speaker           
Speaker 1  2  1  1
Speaker 2  2  0  0
Speaker 3  0  1  0
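For the second part of the question (a plain Python loop), a minimal sketch could look like the following. It assumes the original sample data (aze) and uses the built-in str.count, lower-casing each script to mirror the case = False in the question; the names items, counts and result are chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Script': ['aze', 'art', 'ghb', 'jka', 'tyc', 'avv', 'bhj'],
    'Speaker': ['Speaker 1', 'Speaker 2', 'Speaker 3', 'Speaker 1',
                'Speaker 1', 'Speaker 2', 'Speaker 1'],
})
items = ['a', 'b', 'c']

counts = {}
for script, speaker in zip(df['Script'], df['Speaker']):
    # Start every speaker with a zero count for each item.
    per_speaker = counts.setdefault(speaker, dict.fromkeys(items, 0))
    for item in items:
        # str.count counts non-overlapping occurrences of the substring.
        per_speaker[item] += script.lower().count(item)

# Turn the nested dict into a Speaker x item table.
result = pd.DataFrame.from_dict(counts, orient='index')[items]
print(result)
```

This is slower than the vectorised versions on large frames, but it makes the counting rule explicit and is easy to adapt.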
