Removing rows from a pandas DataFrame if one of its cells contains a list of all-caps strings

Question

I am working with the conll2003 dataset (conll2003dataset in the code below). It contains sentences from articles from various news sources, along with part-of-speech tags, chunk IDs, and other annotations for each word in those sentences.

Some sentences are all caps. I simply want to remove those rows from the corresponding data frame. Here is what I tried:

import re

df_train = conll2003dataset['train'].to_pandas()
df_test = conll2003dataset['test'].to_pandas()

all_caps_regex = re.compile('^[^a-z]*$')

df_train.drop(df_train[all(map(all_caps_regex.search, df_train['tokens']))].index, inplace=True)
df_test.drop(df_test[all(map(all_caps_regex.search, df_test['tokens']))].index, inplace=True)

But I am getting the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-17-feda9c78b1c7> in <module>()
      9 all_caps_regex = re.compile('^[^a-z]*$')
     10 
---> 11 df_train.drop(df_train[all(map(all_caps_regex.search, df_train['tokens']))].index, inplace=True)
     12 df_test.drop(df_test[all(map(all_caps_regex.search, df_test['tokens']))].index, inplace=True)
     13 

TypeError: cannot use a string pattern on a bytes-like object

Where am I going wrong? How do I do this?

Here is the colab notebook illustrating the same.

Answer 1

The problem is that each element of the Series df_train['tokens'] is a list of tokens, so your code passes each whole list to the regex rather than the strings inside it, and re raises a TypeError. To apply the regex to the tokens themselves, you need to loop over the elements of each list, like so:

df_train[ [all(map(all_caps_regex.search, w)) for w in df_train['tokens']] ].index
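
As a minimal sketch of the line above, here is the same idea on a small made-up frame. The name df and its sentences are hypothetical stand-ins for df_train, not data from the actual dataset. The list comprehension yields one boolean per row, which can then be used to drop the matching rows:

```python
import re

import pandas as pd

# Hypothetical toy data standing in for df_train; each cell holds a list of tokens.
df = pd.DataFrame({
    "tokens": [
        ["EU", "rejects", "German", "call"],
        ["SOCCER", "-", "JAPAN", "WINS"],
        ["Peter", "Blackburn"],
    ]
})

all_caps_regex = re.compile(r"^[^a-z]*$")

# One boolean per row: True when no token in the row contains a lowercase letter.
mask = [all(map(all_caps_regex.search, tokens)) for tokens in df["tokens"]]

# Drop the all-caps rows by their index labels.
df_clean = df.drop(df[mask].index)
print(df_clean)
```

Note that the regex only rules out lowercase ASCII letters, so tokens made of digits or punctuation (such as "-") also count as "all caps" here.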

Having said that, since we are using pandas, it is more idiomatic to use pandas methods, which are usually more practical. We can map a function over each element of a Series using .map() or .apply():

df_train[ df_train['tokens'].apply(lambda l: all(all_caps_regex.search(w) for w in l)) ].index
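
To connect this back to the original goal of removing the rows, the following sketch builds the boolean Series with .apply() and passes the matching index labels to drop(). The frame df and its contents are invented for illustration, and is_all_caps is just a name chosen here:

```python
import re

import pandas as pd

# Hypothetical toy data; the real df_train comes from the dataset.
df = pd.DataFrame({
    "tokens": [
        ["CRICKET", "-", "ENGLAND", "V", "PAKISTAN"],
        ["He", "said", "so", "."],
        ["LONDON", "1996-08-30"],
    ]
})

all_caps_regex = re.compile(r"^[^a-z]*$")

# Boolean Series aligned with df: True for rows whose tokens are all caps.
is_all_caps = df["tokens"].apply(
    lambda tokens: all(all_caps_regex.search(w) for w in tokens)
)

# Remove the matching rows in place, as in the question.
df.drop(df[is_all_caps].index, inplace=True)
print(df)
```

An equivalent way to get the same result without inplace=True is to keep the complement of the mask, for example df = df[~is_all_caps].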

One last solution is to skip the regex entirely, since all we need to check is whether every token is in caps. To do that, we can check whether each token's .upper() is the same as the unmodified token:

df_train[ df_train['tokens'].apply(lambda l: all(w == w.upper() for w in l)) ].index
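
The sketch below tries the .upper() comparison on made-up rows, including two edge cases: a row with only digits and punctuation, and an empty token list. Both are treated as "all caps" by this check. The stricter variant is_all_caps_strict is an assumption about what you might want rather than part of the approach above. It additionally requires at least one token with an uppercase letter:

```python
import pandas as pd

# Hypothetical toy data, including a no-letters row and an empty row.
df = pd.DataFrame({
    "tokens": [
        ["JAPAN", "WINS"],
        ["Japan", "wins"],
        ["1996", "-"],
        [],
    ]
})


def is_all_caps(tokens):
    # True when upper-casing changes nothing, i.e. no token has a lowercase letter.
    return all(w == w.upper() for w in tokens)


def is_all_caps_strict(tokens):
    # Same check, but also require at least one token that contains uppercase letters.
    return any(w.isupper() for w in tokens) and is_all_caps(tokens)


mask = df["tokens"].apply(is_all_caps)
strict_mask = df["tokens"].apply(is_all_caps_strict)

# Keep only the rows that are not all caps, and renumber the index.
df_clean = df[~mask].reset_index(drop=True)
print(df_clean)
```

With the strict variant, only the first row would be removed, which may be preferable if rows without any letters should be kept.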

solution1
0 2021-10-28 21:52:20
