
Using pandas value_counts() under defined condition

After a lot of errors, exceptions and high blood pressure, I finally came up with a solution that does what I need: basically, I need to count the column values that meet a specific condition.

So, let's say I got a list of strings just like

vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']

I want to count which values appear more than 2 times.

Consider that the column name of the dataframe based upon the list is 'veh'.

So, this piece of code works:

df['veh'].value_counts()[df['veh'].value_counts() > 2]

The question is: why does the [df['veh'].value_counts() > 2] part come right after the "()" of value_counts()? There is no "." or any other linking sign that could mean something.

If I use the code

df['veh'].value_counts() > 2

(which would be the logical syntax that my limited brain can abstract), it returns boolean values.

Can someone please help me understand the logic behind pandas?

I am pretty sure that pandas is awesome and the problem lies on this side of mine, but I really want to understand it. I've read a lot of material (documentation included), but could not find a solution to this gap of mine.

Thank you in advance!

The logic is that you can slice a series with a boolean series of the same size:

s[bool_series]

or equivalently

s.loc[bool_series]

This is also referred to as boolean indexing.

Now, your code is equivalent to:

s = df['veh'].value_counts()

bool_series = s > 2

And then either of the two forms above applies, e.g. s[bool_series], which is the same as s[s > 2].
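As a minimal sketch of the steps above, using the list from the question (the DataFrame construction is an assumption, since the question does not show how df was built):

```python
import pandas as pd

# Sample data from the question; building df this way is an assumption.
vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']
df = pd.DataFrame({'veh': vehicle})

s = df['veh'].value_counts()   # Series: index = distinct values, values = counts
bool_series = s > 2            # Series of booleans with the same index as s

# Both forms keep only the rows where bool_series is True.
filtered_brackets = s[bool_series]
filtered_loc = s.loc[bool_series]

print(filtered_brackets)       # only 'car' remains, with a count of 3
```

Storing the counts in s first also avoids calling value_counts() twice, which the one-liner from the question does.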

The following line of code

df['veh'].value_counts()

Returns a pandas Series whose index holds the distinct values (the keys) and whose values are the number of occurrences of each.

Whatever goes between the square brackets [] selects entries of a pandas Series by key. So

df['veh'].value_counts()['car']

Should return the number of occurrences of the word 'car' in column 'veh', that is, the value stored under the key 'car' in the series df['veh'].value_counts().

A pandas Series also accepts a list of keys as the index. So

df['veh'].value_counts()[['car','boat']]

Should return the number of occurrences of the words 'car' and 'boat', respectively.
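The two lookups above can be tried on the question's data. This is only a sketch; the way df is built here is an assumption:

```python
import pandas as pd

# Sample data from the question; building df this way is an assumption.
vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']
df = pd.DataFrame({'veh': vehicle})

counts = df['veh'].value_counts()

# A single key returns a single count.
car_count = counts['car']

# A list of keys returns a smaller Series, in the order of the list.
car_and_boat = counts[['car', 'boat']]

print(car_count)       # 3
print(car_and_boat)    # car -> 3, boat -> 1
```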

Furthermore, a Series accepts a list of booleans as the key, provided it has the same length as the Series. That is, it accepts a boolean mask.

When you write

df['veh'].value_counts() > 2

You compare each value in df['veh'].value_counts() with the number 2. This returns one boolean per value, that is, a boolean mask.

So you can use the boolean mask as a filter on the series you created. Thus

df['veh'].value_counts()[df['veh'].value_counts() > 2]

Returns the counts for the keys whose number of occurrences is greater than 2.
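To see what the mask looks like before it is used as a filter, here is a short sketch on the question's data (the construction of df is an assumption, and the names mask and frequent are only illustrative):

```python
import pandas as pd

# Sample data from the question; building df this way is an assumption.
vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']
df = pd.DataFrame({'veh': vehicle})

counts = df['veh'].value_counts()

# One boolean per key: True only where the count is greater than 2.
mask = counts > 2
print(mask)

# Using the mask as the key keeps only the True rows.
frequent = counts[mask]

# A plain Python list of booleans of the same length works the same way.
same_with_list = counts[mask.tolist()]

print(frequent)        # only 'car' remains, with a count of 3
```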
