I have many columns in a dataframe, and I want to compare the values in each column to a specific column. For example, say I want to, for every column in this dataframe, sum the cases where both the column value and the label are equal to 1:
col1 | col2 | col3 | ... | label
1 | 0 | 0 | ... | 1
0 | 0 | 1 | ... | 0
When I try to do this with something like df.apply(lambda x: x.label==1, axis=1), I can select the label column with x.label, but how do I select the column itself?
I can do this using a for loop that iterates over the column names, but am wondering if there's a more pandas-like way to do it without using a loop.
results = []
for col in df.columns:
    val = len(df[(df[col]==1) & (df.label==1)])
    results.append(val)
Just filter by label and sum what is left:
df.loc[df['label'] == 1].sum()
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(2, size=(10, 4)),
                  columns=['col1', 'col2', 'col3', 'label'])
print(df)
col1 col2 col3 label
0 0 0 1 1
1 1 1 0 0
2 1 1 0 0
3 0 0 0 0
4 0 0 1 0
5 0 0 0 1
6 1 0 1 1
7 0 1 1 0
8 0 0 0 0
9 0 0 0 0
results = []
for col in df.columns:
    val = len(df[(df[col]==1) & (df.label==1)])
    results.append(val)
results
[1, 0, 2, 3]
df.loc[df['label'] == 1].sum().tolist()
[1, 0, 2, 3]
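The apply-based attempt from the question can also be made to work. A minimal sketch, assuming the same 0/1 values as the printed example above (hard-coded here so the result is reproducible): with the default axis=0, the lambda receives each column as a Series, so "the column itself" is simply the lambda argument. The names label_is_one, per_column and filtered_sum are only illustrative.

```python
import pandas as pd

# Same values as the printed example above, hard-coded for reproducibility.
df = pd.DataFrame({
    'col1':  [0, 1, 1, 0, 0, 0, 1, 0, 0, 0],
    'col2':  [0, 1, 1, 0, 0, 0, 0, 1, 0, 0],
    'col3':  [1, 0, 0, 0, 1, 0, 1, 1, 0, 0],
    'label': [1, 0, 0, 0, 0, 1, 1, 0, 0, 0],
})

label_is_one = df['label'] == 1

# axis=0 (the default) passes each column to the lambda as a Series.
per_column = df.apply(lambda col: ((col == 1) & label_is_one).sum())

# For 0/1 data, filtering by label and summing gives the same counts.
filtered_sum = df.loc[label_is_one].sum()

print(per_column.tolist())
print(filtered_sum.tolist())
```

Both approaches avoid the explicit Python loop; the filter-and-sum version relies on the values being only 0 or 1, while the apply version counts matches explicitly.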
EDIT:
If the values are not all 0 or 1 but you still want to count the cases where both the column value and the label are equal to 1, filter by label first, then mask out every value that is not 1 (it becomes NaN, which sum() skips) and sum what is left:
df = pd.DataFrame(np.random.randint(3, size=(10, 4)),
                  columns=['col1', 'col2', 'col3', 'label'])
print(df)
col1 col2 col3 label
0 0 0 2 1
1 1 0 0 2
2 2 1 0 2
3 1 1 1 0
4 0 0 2 1
5 2 2 1 2
6 0 2 1 1
7 1 1 0 0
8 1 0 0 2
9 0 2 1 2
results = []
for col in df.columns:
    val = len(df[(df[col]==1) & (df.label==1)])
    results.append(val)
results
[0, 0, 1, 3]
df.loc[df['label'] == 1][df == 1].sum().fillna(0).tolist()
[0.0, 0.0, 1.0, 3.0]
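The floats in that output come from the NaN masking step. As a hedged alternative, here is a small sketch that compares to 1 first and then filters the rows, so the sum runs over booleans and stays an integer count. It assumes the same values as the printed 0/1/2 example above (hard-coded for reproducibility); the name counts is only illustrative.

```python
import pandas as pd

# Same values as the printed 0/1/2 example above.
df = pd.DataFrame({
    'col1':  [0, 1, 2, 1, 0, 2, 0, 1, 1, 0],
    'col2':  [0, 0, 1, 1, 0, 2, 2, 1, 0, 2],
    'col3':  [2, 0, 0, 1, 2, 1, 1, 0, 0, 1],
    'label': [1, 2, 2, 0, 1, 2, 1, 0, 2, 2],
})

# df.eq(1) is a boolean frame; keep only rows whose label is 1,
# then sum the True values per column.
counts = df.eq(1).loc[df['label'].eq(1)].sum()

print(counts.tolist())
```

Summing booleans avoids both the chained indexing and the fillna(0) call, since no NaN values are ever produced.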
You can use np.equal() to get a boolean array for element-wise equality. This works for any integer as well as other dtypes.
To illustrate:
df = pd.DataFrame(np.random.randint(2, size=(10, 4)), columns=['col1', 'col2', 'col3', 'label'])
col1 col2 col3 label
0 0 1 1 0
1 1 0 1 0
2 1 0 0 1
3 1 0 0 0
4 0 1 1 1
5 1 1 0 0
6 0 0 0 1
7 1 1 1 0
8 0 1 0 1
9 0 1 1 1
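Compare the label column to each other column:
# to_numpy() makes the (10, 1) label array broadcast across the three columns
# instead of relying on pandas to align the two frames by column name
comparison = np.equal(df[['col1', 'col2', 'col3']], df[['label']].to_numpy())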
col1 col2 col3
0 True False False
1 False True False
2 True False False
3 False True True
4 False True True
5 False False True
6 False False False
7 False False False
8 False True False
9 False True True
You can then sum the result to get the number of equal cases per column:
comparison.sum()
col1 2
col2 5
col3 4
dtype: int64
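Note that this counts every row where the column value equals the label, including rows where both are 0, which is broader than the "both equal to 1" case in the question. A minimal sketch of both counts using pandas' own DataFrame.eq, assuming the same values as the printed frame above (hard-coded for reproducibility); features, equal_counts and both_one_counts are illustrative names.

```python
import pandas as pd

# Same values as the printed frame above.
df = pd.DataFrame({
    'col1':  [0, 1, 1, 1, 0, 1, 0, 1, 0, 0],
    'col2':  [1, 0, 0, 0, 1, 1, 0, 1, 1, 1],
    'col3':  [1, 1, 0, 0, 1, 0, 0, 1, 0, 1],
    'label': [0, 0, 1, 0, 1, 0, 1, 0, 1, 1],
})

features = df.drop(columns='label')

# axis=0 matches the label Series against each column row by row.
equal_counts = features.eq(df['label'], axis=0).sum()

# Restrict to rows where the label is 1 to count only the "both are 1" cases.
both_one_counts = features.eq(1).loc[df['label'].eq(1)].sum()

print(equal_counts.tolist())
print(both_one_counts.tolist())
```

Which of the two counts is wanted depends on whether matching zeros should be included.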