
Function to make processing faster

I have a data set which looks something like this

df['pos_tag']
0        [(colin, NN), (eats, VB), (cake, NN)]
1        [(paris, NN), (kicks, VB), (ball, NN)]
2        [(jackson, NN), (watches, VB), (television, NN)]
3        [(joyce, NN), (drinks, VB), (water, NN)]
4        [(oscar, NN), (wins, VB), (award, NN)]

I want to write a function to count the occurrences of each part of speech:

def count_pos_tag(dfcol):
    values = []
    for row in dfcol:
        count = [0, 0]
        for token, tag in row:
            if tag.startswith('NN'):
                count[0] += 1
            elif tag.startswith('VB'):
                count[1] += 1
        values.append(count)
    return values

values = count_pos_tag(df['pos_tag'])

I have noticed that it takes quite some time because I am running it on a big data set. Is there a faster way to do this?

You need to rethink your data organization. pandas is meant for 2D arrays of simple data (i.e. scalars like int, str, datetime64[ns]), not complex objects like lists, tuples, dicts, or, in this case, a list of tuples.

Once we reshape the data into a simpler organization, all you need is groupby + value_counts to get the counts per part of speech per row of the original DataFrame. The key here is that the reshaped DataFrame has each word and pos split into its own cell, and its index is no longer unique but points back to the original index.

Sample Data

import pandas as pd
df = pd.DataFrame({'pos_tag': [[('colin', 'NN'), ('eats', 'VB'), ('cake', 'NN')],
                               [('paris', 'NN'), ('kicks', 'VB'), ('ball', 'NN')],
                               [('jackson', 'NN'), ('watches', 'VB'), ('television', 'NN')],
                               [('joyce', 'NN'), ('drinks', 'VB'), ('water', 'NN')],
                               [('oscar', 'NN'), ('wins', 'VB'), ('award', 'NN')]]})

Code

s = df['pos_tag'].explode()
df1 = pd.DataFrame(s.to_list(), index=s.index, columns=['word', 'pos'])
#         word pos
#0       colin  NN
#0        eats  VB
#0        cake  NN
#1       paris  NN
#...      ...   ..
#3       water  NN
#4       oscar  NN
#4        wins  VB
#4       award  NN
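
The index of df1 repeats for every token that came from the same row of the original DataFrame, so rows of df1 can still be tied back to rows of df. As a quick check, using the df1 built above:

df1.loc[0]
#     word pos
# 0  colin  NN
# 0   eats  VB
# 0   cake  NN

Counting the tags per original row is then a single grouped value_counts: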

df1.groupby(level=0).pos.value_counts()

   pos
0  NN     2
   VB     1
1  NN     2
   VB     1
2  NN     2
   VB     1
3  NN     2
   VB     1
4  NN     2
   VB     1
Name: pos, dtype: int64
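
If you want the output in the same shape as your original function (one [noun_count, verb_count] pair per row), the grouped counts can be pivoted into columns with unstack. A minimal sketch, assuming the df1 from above:

# pivot the pos level of the MultiIndex into columns; missing tags become 0
counts = df1.groupby(level=0).pos.value_counts().unstack(fill_value=0)
# pos  NN  VB
# 0     2   1
# 1     2   1
# 2     2   1
# 3     2   1
# 4     2   1

# same structure as the original function's return value
values = counts[['NN', 'VB']].to_numpy().tolist()
# [[2, 1], [2, 1], [2, 1], [2, 1], [2, 1]]

Either way, the heavy lifting is done by vectorized pandas operations instead of a Python-level loop over every row.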
