# import packages, set nan
import pandas as pd
import numpy as np
nan = np.nan
I have a dataframe, with a certain number of observations as columns, measurements as rows. The results of the observations are A, B, C, D...
. It also has a category column, which denote the category of the measurement . Categories: a, b, c, d...
. If a column contains a nan
in a row, that means that the observation during that measurement has not been made (so nan
is not an observation
, it is lack of it). An MRE :
data = {'observation0': ['A','A','A','A','B'],'observation1': ['B','B','B','C',nan], 'category': ['a', 'b', 'c','a','b']}
df = pd.DataFrame.from_dict(data)
df
looks like this:
I would like to count how many times each observational result (ie A, B, C, D...
) is observed using each category of measurement (ie a, b, c, d...
).
I would like to get:
obs_A_in_cat_a 2
obs_A_in_cat_b 1
obs_A_in_cat_c 1
obs_B_in_cat_a 1
obs_B_in_cat_b 2
obs_B_in_cat_c 1
obs_C_in_cat_a 1
obs_C_in_cat_b 0
obs_C_in_cat_c 0
Observation A
appears in rows with index
0
and 3
(see above df) while the measurement category
is a
, so obs_A_in_cat_a
is 2
. Observation A
appears only once (row index
1
) in a measurement with category
: b
, so obs_A_in_cat_b
is 1
, and so on.
First I gather the outcomes of observations, taking care not to include nans :
observations = pd.unique(pd.concat([df[col] for col in df.columns if 'observation' in col]).dropna())
The different categories they belong to:
categories = pd.unique(df['category'])
Then, iterate through observations. If it is relying on this ,
for observation in observations:
for category in categories:
df['obs_'+observation+'_in_cat_'+category]=\
df.apply(lambda row: int(observation in [row[col]
for col in df.columns
if 'observation' in col]
and row['category'] == category),axis=1)
The lambda function checks if observation
appears in each row
, and that the measurement is in the category which is currently considered in the iteration. New columns are created, with headers obs_OBSERVATION_in_cat_CATEGORY, where OBSERVATION
is A, B, C, D...
, CATEGORY
is a, b, c, d...
If an observationX
in a categoryY
was made during a measurement, obs_OBSERVATIONX_in_cat_CATEGORYY
is 1
in the row corresponding to that measurement, otherwise it is 0
.
The resulting df
(parts of it) looks like this:
Finish using sum()
ming the values of the newly created columns, selecting those with a conditional list comprehension :
df[[col for col in df.columns if '_in_cat_' in col]].sum()
This gives me the output which I'd like to get, shown above. Whole notebook here .
This method seem to work, but it is too slow to be easily applicable in real life. How could I make it quicker? I am looking for something like:
how_many_times_each_observation_was_made_using_each_category_of_measurement(
df,
list_of_observation_columns,
category_column)
Solutuion with MultiIndex
with DataFrame.melt
, GroupBy.size
for count values, add 0
for missing combinations by Series.reindex
:
s = df.melt('category').groupby(['value','category']).size()
s = s.reindex(pd.MultiIndex.from_product(s.index.levels), fill_value=0)
print (s)
value category
A a 2
b 1
c 1
B a 1
b 2
c 1
C a 1
b 0
c 0
dtype: int64
Last is possible flatten it by f-string
s:
s.index = s.index.map(lambda x: f'obs_{x[0]}_in_cat_{x[1]}')
print (s)
obs_A_in_cat_a 2
obs_A_in_cat_b 1
obs_A_in_cat_c 1
obs_B_in_cat_a 1
obs_B_in_cat_b 2
obs_B_in_cat_c 1
obs_C_in_cat_a 1
obs_C_in_cat_b 0
obs_C_in_cat_c 0
dtype: int64
You could combine melt with crosstab to get your output:
s = df.melt("category")
s = pd.crosstab(s.value, s.category).stack()
s.index = [f"obs_{first}_in_cat_{last}" for first, last in s.index]
s
obs_A_in_cat_a 2
obs_A_in_cat_b 1
obs_A_in_cat_c 1
obs_B_in_cat_a 1
obs_B_in_cat_b 2
obs_B_in_cat_c 1
obs_C_in_cat_a 1
obs_C_in_cat_b 0
obs_C_in_cat_c 0
dtype: int64
You could do it in the following way:
dfT = []
for colName in ['observation0','observation1']:
df1 = df.groupby([colName,'category'])['category'].count().to_frame()
df1.columns = ['count']
df1 = df1.reset_index()
df1['label'] = 'obs_'+df1[colName]+'_cat_'+df1['category']
df1 = df1.loc[:,['label','count']]
dfT.append(df1)
dfT = pd.concat(dfT,axis=0).reset_index(drop=True)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.