Drop columns of a DataFrame based on the count of values

Question

Hi, I am new to pandas and struggling with a manipulation. I have a dataframe df with a huge number of columns, and I only want to keep the columns that have a count of more than 5000 values.

I tried the loop below but it does not work. Is there an easy way to do this? Also, is there a function I could create to apply this to any dataframe where I want to keep only the columns with n values or more?

for column in df.columns: 
   if df[column].count() > 5000: 
      column = column
   else: 
      df[column].drop()

Thanks

Answer 1

We can use DataFrame.dropna, which has the argument thresh (the minimum number of non-NA values a row or column must have to be kept), for example:

import pandas as pd
import numpy as np

# example dataframe
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, np.nan],
    'C': [np.nan, np.nan, 6],
    'D': [np.nan, np.nan, np.nan]
})


   A    B    C   D
0  1  4.0  NaN NaN
1  2  5.0  NaN NaN
2  3  NaN  6.0 NaN

We set the threshold to 2; in your case it is 5000 (or 5001 if you need strictly more than 5000 values):

df.dropna(thresh=2, axis=1)

   A    B
0  1  4.0
1  2  5.0
2  3  NaN

Notice that columns C and D were dropped because they had fewer than 2 non-NA values.
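To cover the second part of the question (a reusable function for any dataframe), here is a minimal sketch that wraps the same dropna call. The function name keep_columns_with_min_count and its parameter names are made up for illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd


def keep_columns_with_min_count(frame: pd.DataFrame, n: int) -> pd.DataFrame:
    """Return a new frame with only the columns holding at least n non-NA values.

    Hypothetical helper: it only forwards to DataFrame.dropna with thresh=n.
    """
    return frame.dropna(axis=1, thresh=n)


# same example dataframe as above
example = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, np.nan],
    'C': [np.nan, np.nan, 6],
    'D': [np.nan, np.nan, np.nan],
})

kept = keep_columns_with_min_count(example, 2)
print(list(kept.columns))  # ['A', 'B']
```

The original dataframe is left untouched because dropna returns a new object, so assign the result if you want to keep it.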

Answer 2

Try this:

newdf = df.copy()  # iterate over a copy so that dropping from df is safe
for column in newdf.columns:
    if df[column].count() <= 5000:
        df = df.drop(column, axis=1)

or the equivalent:

newdf = df.copy()
for column in newdf.columns:
    if df[column].count() <= 5000:
        del df[column]  # del df.column would look for an attribute literally named "column"
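The same filter can also be written without a loop. This is only a sketch, assuming a small stand-in dataframe (example) and a threshold of 2 in place of 5000: DataFrame.count returns the number of non-NA values per column, and the resulting boolean Series can select columns through .loc.

```python
import numpy as np
import pandas as pd

# small stand-in dataframe; the real df and the 5000 threshold are replaced here
example = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, np.nan],
    'C': [np.nan, np.nan, 6],
    'D': [np.nan, np.nan, np.nan],
})

counts = example.count()  # non-NA values per column: A=3, B=2, C=1, D=0

# keep columns with strictly more than 2 non-NA values
strictly_more = example.loc[:, counts > 2]
print(list(strictly_more.columns))  # ['A']
```

Using > keeps columns strictly above the threshold, which matches the loop's <= condition; switch to >= if "n values or more" is what you need.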
