How to count the number of rows containing both a value in a set of columns and another value in another column in a Pandas dataframe?

Question

# import packages, set nan
import pandas as pd
import numpy as np
nan = np.nan

The problem

I have a dataframe with observations as columns and measurements as rows. The results of the observations are A, B, C, D... . It also has a category column, which denotes the category of the measurement. Categories: a, b, c, d... . If a column contains a nan in a row, the observation was not made during that measurement (so nan is not an observation; it is the lack of one). An MRE:

data = {'observation0': ['A','A','A','A','B'],'observation1': ['B','B','B','C',nan], 'category': ['a', 'b', 'c','a','b']}
df = pd.DataFrame.from_dict(data)

df looks like this:

  observation0 observation1 category
0            A            B        a
1            A            B        b
2            A            B        c
3            A            C        a
4            B          NaN        b

I would like to count how many times each observational result (i.e. A, B, C, D...) is observed in each category of measurement (i.e. a, b, c, d...).

I would like to get:

obs_A_in_cat_a    2
obs_A_in_cat_b    1
obs_A_in_cat_c    1
obs_B_in_cat_a    1
obs_B_in_cat_b    2
obs_B_in_cat_c    1
obs_C_in_cat_a    1
obs_C_in_cat_b    0
obs_C_in_cat_c    0

Observation A appears in rows with index 0 and 3 (see above df) where the measurement category is a, so obs_A_in_cat_a is 2. Observation A appears only once (row index 1) in a measurement with category b, so obs_A_in_cat_b is 1, and so on.

My solution

First I gather the outcomes of the observations, taking care not to include nans:

observations = pd.unique(pd.concat([df[col] for col in df.columns if 'observation' in col]).dropna())

The different categories they belong to:

categories = pd.unique(df['category'])

Then I iterate through the observations and, for each of them, through the categories:

for observation in observations:
    for category in categories:
        df['obs_'+observation+'_in_cat_'+category]=\
        df.apply(lambda row: int(observation in [row[col]
                                                 for col in df.columns
                                                 if 'observation' in col]
                                 and row['category'] == category),axis=1)

The lambda function checks whether observation appears in each row and whether the measurement is in the category currently considered in the iteration. New columns are created with headers obs_OBSERVATION_in_cat_CATEGORY, where OBSERVATION is A, B, C, D... and CATEGORY is a, b, c, d... . If an observationX in a categoryY was made during a measurement, obs_OBSERVATIONX_in_cat_CATEGORYY is 1 in the row corresponding to that measurement; otherwise it is 0.
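As a rough illustration of the check the lambda performs, the same test for a single observation/category pair (A in a) can be written with boolean masks instead of a row-wise apply. This is only a sketch; the names obs_cols, has_A, in_cat_a and count_A_in_a are introduced here for illustration and are not part of the code above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'observation0': ['A', 'A', 'A', 'A', 'B'],
    'observation1': ['B', 'B', 'B', 'C', np.nan],
    'category': ['a', 'b', 'c', 'a', 'b'],
})

# Illustrative name: the observation columns, selected the same way as in the loop.
obs_cols = [col for col in df.columns if 'observation' in col]

# True where 'A' appears in at least one observation column of the row
# (comparing nan with 'A' yields False, so missing observations never match).
has_A = df[obs_cols].eq('A').any(axis=1)

# True where the measurement belongs to category 'a'.
in_cat_a = df['category'].eq('a')

# Number of rows that satisfy both conditions, i.e. obs_A_in_cat_a.
count_A_in_a = int((has_A & in_cat_a).sum())
print(count_A_in_a)
```

Repeating this for every pair would still loop over observations and categories, but each step works on whole columns rather than calling a Python function once per row.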

The resulting df (parts of it) looks like this:

Finally, I sum() the values of the newly created columns, selecting them with a conditional list comprehension:

df[[col for col in df.columns if '_in_cat_' in col]].sum()

This gives me the output I would like to get, shown above.

The question

This method seems to work, but it is too slow to be easily applicable in real life. How could I make it quicker? I am looking for something like:

how_many_times_each_observation_was_made_using_each_category_of_measurement(
df,
list_of_observation_columns,
category_column)

Answer 1

A solution using DataFrame.melt to reshape the data, GroupBy.size to count the values, and Series.reindex with a MultiIndex of all combinations to add 0 for the missing ones:

s = df.melt('category').groupby(['value','category']).size()
s = s.reindex(pd.MultiIndex.from_product(s.index.levels), fill_value=0)
print (s)
value  category
A      a           2
       b           1
       c           1
B      a           1
       b           2
       c           1
C      a           1
       b           0
       c           0
dtype: int64

Finally, the MultiIndex can be flattened with f-strings:

s.index = s.index.map(lambda x: f'obs_{x[0]}_in_cat_{x[1]}')   
print (s)
obs_A_in_cat_a    2
obs_A_in_cat_b    1
obs_A_in_cat_c    1
obs_B_in_cat_a    1
obs_B_in_cat_b    2
obs_B_in_cat_c    1
obs_C_in_cat_a    1
obs_C_in_cat_b    0
obs_C_in_cat_c    0
dtype: int64
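The question asks for a function that takes the dataframe, the list of observation columns and the category column. A minimal sketch along the lines of the melt / size / reindex steps above might look like this. The function name count_observations_per_category and its parameter names are made up here for illustration, and the sketch assumes the category column is not itself called 'value'.

```python
import numpy as np
import pandas as pd

def count_observations_per_category(df, observation_columns, category_column):
    """Count how often each observation value occurs per category (nan is ignored)."""
    # Long format: one row per (measurement, observation column) cell.
    long = df.melt(id_vars=category_column, value_vars=observation_columns)
    # groupby drops nan keys by default, so missing observations are not counted.
    counts = long.groupby(['value', category_column]).size()
    # Add the combinations that never occur, with a count of 0.
    full_index = pd.MultiIndex.from_product(counts.index.levels)
    counts = counts.reindex(full_index, fill_value=0)
    # Flatten the two index levels into labels such as obs_A_in_cat_a.
    counts.index = [f'obs_{obs}_in_cat_{cat}' for obs, cat in counts.index]
    return counts

df = pd.DataFrame({
    'observation0': ['A', 'A', 'A', 'A', 'B'],
    'observation1': ['B', 'B', 'B', 'C', np.nan],
    'category': ['a', 'b', 'c', 'a', 'b'],
})
result = count_observations_per_category(
    df, ['observation0', 'observation1'], 'category')
print(result)
```

Passing the observation columns explicitly through value_vars keeps any other columns in the dataframe out of the count.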

Answer 2

You could combine melt with crosstab to get your output:

s = df.melt("category")
s = pd.crosstab(s.value, s.category).stack()
s.index = [f"obs_{first}_in_cat_{last}" for first, last in s.index]

s

obs_A_in_cat_a    2
obs_A_in_cat_b    1
obs_A_in_cat_c    1
obs_B_in_cat_a    1
obs_B_in_cat_b    2
obs_B_in_cat_c    1
obs_C_in_cat_a    1
obs_C_in_cat_b    0
obs_C_in_cat_c    0
dtype: int64
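One caveat worth checking against your own data: the melt-based approaches count cells, so a row holding the same observation in two columns contributes twice, whereas the loop in the question counts such a row once. The example data has no such row, so the results agree there. If rows should be counted instead, a sketch under that assumption could drop repeats per row before tabulating. The dataframe df_dup below is invented purely to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 records 'A' in both observation columns.
df_dup = pd.DataFrame({
    'observation0': ['A', 'A', 'B'],
    'observation1': ['A', 'B', np.nan],
    'category': ['a', 'a', 'b'],
})

# Keep the original row label as a column ('index') so repeats can be found per row.
long = df_dup.reset_index().melt(id_vars=['index', 'category'])
long = long.dropna(subset=['value'])

# One entry per (row, observation value), however many columns hold that value.
long = long.drop_duplicates(subset=['index', 'value'])

row_counts = pd.crosstab(long['value'], long['category']).stack()
row_counts.index = [f'obs_{obs}_in_cat_{cat}' for obs, cat in row_counts.index]
print(row_counts)
```

Without the drop_duplicates step, obs_A_in_cat_a would come out as 3 for this data, because row 0 would be counted twice.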

Answer 3

You could do it in the following way:

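# Count (observation, category) pairs separately for each observation column.
dfT = []
for colName in ['observation0','observation1']:
    df1 = df.groupby([colName,'category'])['category'].count().to_frame()
    df1.columns = ['count']
    df1 = df1.reset_index()
    # Build the same label format as in the question.
    df1['label'] = 'obs_'+df1[colName]+'_in_cat_'+df1['category']
    df1 = df1.loc[:,['label','count']]
    dfT.append(df1)

dfT = pd.concat(dfT,axis=0).reset_index(drop=True)
# Add up labels that occur in more than one observation column.
# Note: combinations that never occur are absent rather than listed with 0.
dfT = dfT.groupby('label', as_index=False)['count'].sum()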

3 answers

solution1
5 ACCEPTED 2020-08-04 13:04:32

solution2
4 2020-08-04 13:11:46

solution3
1 2020-08-04 13:11:33
