Find the percentage of each unique value for every column in Pandas

Question

I know that to count each unique value of a column and turn the counts into percentages I can use:

df['name_of_the_column'].value_counts(normalize=True)*100

How can I do this for all the columns as a function, and then drop every column in which a single value accounts for more than 95% of all values? Note that the function should also count the NaN values.

Answer 1

You can try this:

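# Iterate over a copy of the column labels, since columns are deleted in the loop
for col in list(df.columns):
    # dropna=False counts NaN as a value; the result is sorted in descending order
    res = df[col].value_counts(normalize=True, dropna=False) * 100
    # res.iloc[0] is the share of the most frequent value
    if res.iloc[0] >= 95:
        del df[col]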
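
One detail worth checking in a loop like this is how value_counts treats NaN. By default it leaves NaN out, so the percentages are computed over the non-missing rows only, and an all-NaN column produces an empty result. The following is a minimal sketch of the difference; the series `s` and `all_nan` are made-up examples, not data from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical example: 18 zeros and 2 missing values.
s = pd.Series([0] * 18 + [np.nan, np.nan])

# Default (dropna=True): NaN rows are excluded, shares are over 18 values.
default_share = s.value_counts(normalize=True) * 100

# dropna=False: NaN is counted as its own value, shares are over all 20 rows.
full_share = s.value_counts(normalize=True, dropna=False) * 100

print(default_share)
print(full_share)

# An all-NaN column yields an empty result by default,
# so indexing it with .iloc[0] would raise an IndexError.
all_nan = pd.Series([np.nan] * 20)
empty_result = all_nan.value_counts(normalize=True)
nan_share = all_nan.value_counts(normalize=True, dropna=False)

print(len(empty_result))
print(nan_share.iloc[0])
```

With dropna=False, the zero in `s` accounts for about 90% of the rows instead of 100%, which changes whether the column crosses a 95% threshold.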

Answer 2

You can write a small wrapper around value_counts that returns False if any value is at or above some threshold, and True if the counts look good:

Sample Data

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "A": [1] * 20,                   # should NOT survive
    "B": [1, 0] * 10,                # should survive
    "C": [np.nan] * 20,              # should NOT survive
    "D": [1,2,3,4] * 5,              # should survive
    "E": [0] * 18 + [np.nan, np.nan] # should survive
})

print(df.head())

Implementation

def threshold_counts(s, threshold=0):
    counts = s.value_counts(normalize=True, dropna=False)
    if (counts >= threshold).any():
        return False
    return True

column_mask = df.apply(threshold_counts, threshold=0.95)
clean_df = df.loc[:, column_mask]

print(clean_df.head())
   B  D    E
0  1  1  0.0
1  0  2  0.0
2  1  3  0.0
3  0  4  0.0
4  1  1  0.0
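
Since the question asks for a single function, the same idea could be folded into one helper that returns the filtered frame directly. This is only a sketch: the name `drop_dominated_columns` is hypothetical, and it keeps the `>=` comparison used in the answers (use `>` on the share if "strictly above 95%" is what you need).

```python
import numpy as np
import pandas as pd


def drop_dominated_columns(df, threshold=0.95):
    """Return a copy of df without the columns in which a single value
    (NaN included) makes up at least `threshold` of the rows.

    Hypothetical helper name, not part of pandas.
    """
    # Share of the most frequent value in each column, NaN counted as a value.
    top_share = df.apply(
        lambda s: s.value_counts(normalize=True, dropna=False).max()
    )
    # Keep only the columns whose top share stays below the threshold.
    return df.loc[:, top_share < threshold].copy()


# Same sample data as in the answer above.
df = pd.DataFrame({
    "A": [1] * 20,
    "B": [1, 0] * 10,
    "C": [np.nan] * 20,
    "D": [1, 2, 3, 4] * 5,
    "E": [0] * 18 + [np.nan, np.nan],
})

clean = drop_dominated_columns(df, threshold=0.95)
print(list(clean.columns))
```

Returning a copy leaves the original DataFrame untouched, which makes it easy to try several thresholds on the same data.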

2 answers

Answer 1
3 2020-11-03 17:06:26

Answer 2
2 Accepted 2020-11-03 17:13:00