
How can I keep the rows of a pandas data frame that match a particular condition, using value_counts() on multiple columns?

I would like to get rid of the rows where a particular value occurs only once in a column, considering 3 columns in turn. That is, for each feature:

  • text: if value_counts() == 1 for a value, eliminate the rows holding it; in other words, keep only the rows whose value occurs more than once
  • next_word: apply the same rule, but to the already processed frame (i.e. after keeping only the rows whose 'text' value shows up more than once)
  • previous_word: apply the same rule again, to the frame already filtered on both 'text' and 'next_word'
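The intended three-step filtering can be sketched on a toy frame (the column names are taken from the question; the data is made up for illustration). Note that the counts are recomputed on the filtered frame at each step:

```python
import pandas as pd

df = pd.DataFrame({
    "text":          ["a", "a", "b", "c", "c", "c"],
    "next_word":     ["x", "x", "y", "z", "z", "w"],
    "previous_word": ["p", "p", "q", "r", "r", "r"],
})

# Step 1: keep rows whose 'text' value appears more than once
counts = df["text"].value_counts()
step1 = df[df["text"].isin(counts[counts > 1].index)]

# Step 2: recount on the *filtered* frame, then filter on 'next_word'
counts = step1["next_word"].value_counts()
step2 = step1[step1["next_word"].isin(counts[counts > 1].index)]

# Step 3: same again for 'previous_word'
counts = step2["previous_word"].value_counts()
step3 = step2[step2["previous_word"].isin(counts[counts > 1].index)]

print(step3)
```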

What I already tried is to build a data frame that keeps only the rows whose value in a particular column occurs more than once:

#text
text_counts = df_processed['text'].value_counts()
text_list = text_counts[text_counts > 1].index.tolist()
zip_data_text_removed = df_processed[df_processed['text'].isin(text_list)] 

If I show the value_counts() of this particular column, 'text', with zip_data_text_removed.text.value_counts():

I can check that I got a dataframe which contains only values that occur more than once, that is, 25470 unique values out of the 50539 initial unique values (which is correct). However, when I show the information about the dataframe:

<class 'pandas.core.frame.DataFrame'> Int64Index: 291442 entries, 0 to 316510

The index range clearly mismatches the number of entries.
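The same behaviour reproduces on a toy frame: boolean/isin indexing keeps the original row labels, so the highest label can exceed the number of surviving rows.

```python
import pandas as pd

df = pd.DataFrame({"text": ["a", "b", "a", "c", "a"]})

# Filtering keeps the rows' original index labels
kept = df[df["text"] == "a"]

print(len(kept))         # 3 rows survive...
print(kept.index.max())  # ...but the highest label is still 4
```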

I also want to apply the same methodology to the rest of the columns (now starting from this previously filtered data frame):

#Next
next_word_counts = df_processed['next_word'].value_counts()
next_word_list = next_word_counts[next_word_counts > 1].index.tolist()
zip_data_next_text_removed = zip_data_text_removed[zip_data_text_removed['next_word'].isin(next_word_list)]


#Previous
previous_word_counts = df_processed['previous_word'].value_counts()
previous_word_list = previous_word_counts[previous_word_counts > 1].index.tolist()
zip_data_prev_text_removed = zip_data_next_text_removed[zip_data_next_text_removed['previous_word'].isin(previous_word_list)]

However, when I show the value_counts() of "text", i.e. the first feature used:

zip_data_prev_text_removed.text.value_counts()

it also shows values with only one occurrence, which is weird. The info of the data frame is also confusing:

<class 'pandas.core.frame.DataFrame'> Int64Index: 247621 entries, 0 to 316509

Shouldn't the index run from 0 to 247620 for 247621 entries?
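For reference, reset_index(drop=True) is what renumbers the surviving rows from 0 (toy data for illustration):

```python
import pandas as pd

df = pd.DataFrame({"text": ["a", "b", "a"]})

# Without drop=True the old labels would be kept as a new column
kept = df[df["text"] == "a"].reset_index(drop=True)

print(kept.index.tolist())  # [0, 1]
```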

EDIT:

Now, I added reset_index(drop=True) as suggested by @janPansky:

#text
text_counts = df_processed['text'].value_counts()
text_list = text_counts[text_counts > 1].index.tolist()
zip_data_text_removed = df_processed[df_processed['text'].isin(text_list)]
zip_data_text_removed = zip_data_text_removed.reset_index(drop=True) 

#Next
next_word_counts = zip_data_text_removed['next_word'].value_counts()
next_word_list = next_word_counts[next_word_counts > 1].index.tolist()
zip_data_next_text_removed = zip_data_text_removed[zip_data_text_removed['next_word'].isin(next_word_list)]
zip_data_next_text_removed = zip_data_next_text_removed.reset_index(drop=True)
print(zip_data_next_text_removed.text.value_counts())

However, it still prints values whose value_counts() == 1.
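A small reproduction of the effect on made-up data: filtering on 'next_word' removes rows, which can push a 'text' value that previously passed the test back down to a single occurrence, unless the 'text' counts are recomputed after that step.

```python
import pandas as pd

df = pd.DataFrame({
    "text":      ["a", "a", "b", "b"],
    "next_word": ["x", "y", "x", "x"],
})

# 'text' counts are all > 1, so the first filter keeps every row
t = df["text"].value_counts()
step1 = df[df["text"].isin(t[t > 1].index)]

# 'next_word' counts: x appears 3 times, y once -> row 1 is dropped
n = step1["next_word"].value_counts()
step2 = step1[step1["next_word"].isin(n[n > 1].index)]

# After the second filter, 'a' survives only once in 'text'
print(step2["text"].value_counts())
```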

I'm still a little iffy on whether I'm understanding your problem correctly, but see if this different approach does what you need. I'm breaking it apart to make it understandable, but it could be done in an ugly one-liner as well.

counts_text = df_processed['text'].value_counts()
non_unique_text = df_processed['text'].apply(lambda text: counts_text[text]>1)

We're using the results of value_counts() as a dictionary of sorts here.

So now we have a boolean Series with one entry per row, stating whether the value in that row is non-unique. You can do the same for each of the other columns to build non_unique_nextword and non_unique_prevword, just by replacing all instances of text above with the corresponding column header.

Finally, we just use a logical AND to keep rows that have non-unique values in each of the three columns. Then we can get the final dataframe from the original by simple indexing:

df_nonunique = df_processed[non_unique_text & non_unique_nextword & non_unique_prevword]
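Putting the pieces together on a toy frame (column names from the question, data invented for the demo):

```python
import pandas as pd

df_processed = pd.DataFrame({
    "text":          ["a", "a", "b", "c", "c"],
    "next_word":     ["x", "x", "x", "y", "z"],
    "previous_word": ["p", "p", "q", "p", "p"],
})

def non_unique(column):
    # value_counts() acts as a value -> frequency lookup table
    counts = df_processed[column].value_counts()
    return df_processed[column].apply(lambda v: counts[v] > 1)

non_unique_text = non_unique("text")
non_unique_nextword = non_unique("next_word")
non_unique_prevword = non_unique("previous_word")

# Keep rows whose value is non-unique in all three columns at once
df_nonunique = df_processed[non_unique_text & non_unique_nextword & non_unique_prevword]
print(df_nonunique)
```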

Let me know if this is way off-base.
