
Dropping columns on a dataframe based on their count of values

Hi, I am new to pandas and struggling with a manipulation. I have a dataframe df with a huge number of columns, and I only want to keep the columns that have a count above 5000 values.

I tried the loop below, but it does not work. Is there an easy way to do this? Also, is there a function I could create to apply this to any dataframe where I want to keep only the columns with n or more values?

for column in df.columns: 
   if df[column].count() > 5000: 
      column = column
   else: 
      df[column].drop()

Thanks

We can use DataFrame.dropna, which has a thresh argument (the minimum number of non-NA values required to keep a row or column), for example:

import pandas as pd
import numpy as np

# example dataframe
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, np.nan],
    'C': [np.nan, np.nan, 6],
    'D': [np.nan, np.nan, np.nan]
})


   A    B    C   D
0  1  4.0  NaN NaN
1  2  5.0  NaN NaN
2  3  NaN  6.0 NaN

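We set the threshold to 2; in your case it would be 5000. Note that thresh is the minimum number of non-NA values a column must have to be kept, so use 5001 if you strictly need more than 5000: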

df.dropna(thresh=2, axis=1)

   A    B
0  1  4.0
1  2  5.0
2  3  NaN

Notice that columns C and D were dropped because they had fewer than 2 non-NA values.
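To address the second part of the question (a reusable function for any dataframe), the dropna call above could be wrapped as in the sketch below. The function name keep_columns_with_min_count and its parameter names are made up for illustration; only DataFrame.dropna with axis and thresh comes from the answer.

```python
import numpy as np
import pandas as pd


def keep_columns_with_min_count(frame: pd.DataFrame, n: int) -> pd.DataFrame:
    """Return a new dataframe keeping only columns with at least n non-NA values.

    Hypothetical helper: a thin wrapper around DataFrame.dropna.
    """
    # thresh is the minimum number of non-NA values a column needs to be kept
    return frame.dropna(axis=1, thresh=n)


# same example dataframe as above
example = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, np.nan],
    'C': [np.nan, np.nan, 6],
    'D': [np.nan, np.nan, np.nan],
})

at_least_two = keep_columns_with_min_count(example, 2)
print(list(at_least_two.columns))  # ['A', 'B']
```

For the original problem this would be called as keep_columns_with_min_count(df, 5001) to keep columns with more than 5000 values. The input dataframe is not modified; a new one is returned.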

Try this:

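newdf = df.copy()
# iterate over the copy's columns so that dropping from df does not disturb the loop
for column in newdf.columns:
    # count() returns the number of non-NA values in the column
    if df[column].count() <= 5000:
        df = df.drop(column, axis=1)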

or the equivalent:

newdf = df.copy()
for column in newdf.columns:
    if df[column].count() <= 5000:
        del df[column]  # remove the column from df in place
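The same filter can also be written without an explicit loop, by building a boolean mask from the per-column counts. This is only a sketch on a small made-up dataframe, with a threshold n = 2 standing in for 5000; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# small stand-in dataframe; in the question this would be the large df
example = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, np.nan],
    'C': [np.nan, np.nan, 6],
    'D': [np.nan, np.nan, np.nan],
})

n = 2  # stand-in for 5000

# DataFrame.count() gives the number of non-NA values in each column
counts = example.count()

# keep only the columns whose count is strictly greater than n,
# which matches dropping every column where count() <= n
filtered = example.loc[:, counts > n]

print(list(filtered.columns))  # ['A']
```

Unlike the loops above, this leaves the original dataframe untouched and returns the filtered result as a new object.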
