dataframe - find common phrases in column

Question

I have a df like this

df a
name     |  list_of_skills
---------------------------
brian    | ['coding in python', 'halo 3']
jon      | ['coding in python', 'running', 'sports']

Is there a way I can get some counts of the most common phrases like this:

'coding in python'    2
'halo 3'              1
'running'             1
'sports'              1

Just a simple phrase counter without iterating through each list and comparing it to all the other ones?

Answer 1

Use Series.explode with Series.value_counts :

s = df['list_of_skills'].explode().value_counts()
print (s)
coding in python    2
running             1
sports              1
halo 3              1
Name: list_of_skills, dtype: int64

If need DataFrame:

df1 = s.rename_axis('words').reset_index(name='counts')
print (df1)
              words  counts
0  coding in python       2
1           running       1
2            sports       1
3            halo 3       1

Answer 2

Use df.explode with Groupby.size :

In [254]: df = pd.DataFrame({'name':['brian', 'john'], 'list_of_skills':[['coding in python', 'halo 3'], ['coding in python', 'running', 'sports']]})

In [255]: df
Out[255]: 
    name                       list_of_skills
0  brian           [coding in python, halo 3]
1   john  [coding in python, running, sports]

In [258]: ans = df.explode('list_of_skills').groupby('list_of_skills').size()

In [259]: ans
Out[259]: 
list_of_skills
coding in python    2
halo 3              1
running             1
sports              1
dtype: int64

dataframe - find common phrases in column

Question

2 answers

solution1
5 ACCPTED 2021-03-23 05:55:16

solution2
1 2021-03-23 05:53:30

dataframe - find common phrases in column

Question

2 answers

solution1 5 ACCPTED 2021-03-23 05:55:16

solution2 1 2021-03-23 05:53:30

solution1
5 ACCPTED 2021-03-23 05:55:16

solution2
1 2021-03-23 05:53:30