Dropping columns on a dataframe based on their count of values
Hi, I am new to pandas and struggling with a manipulation. I have a dataframe df with a huge number of columns, and I only want to keep the columns that have a count of more than 5000 values.
I tried the loop below, but it does not work. Is there an easy way to do this? Also, is there a function I could create to apply this to any dataframe where I want to keep only the columns with n values or more?
for column in df.columns:
    if df[column].count() > 5000:
        column = column
    else:
        df[column].drop()
Thanks
We can use DataFrame.dropna, which has the argument thresh. For example:
import pandas as pd
import numpy as np
# example dataframe
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, np.nan],
    'C': [np.nan, np.nan, 6],
    'D': [np.nan, np.nan, np.nan]
})
   A    B    C   D
0  1  4.0  NaN NaN
1  2  5.0  NaN NaN
2  3  NaN  6.0 NaN
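We set the threshold to 2 here. In your case it would be 5000, or 5001 if you need strictly more than 5000, because thresh keeps the columns that have at least that many non-NA values: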
df.dropna(thresh=2, axis=1)
   A    B
0  1  4.0
1  2  5.0
2  3  NaN
Notice that columns C and D were dropped because they had fewer than 2 non-NA values.
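To cover the second part of the question, the same call can be wrapped in a small helper. This is only a minimal sketch: the function name keep_columns_with_min_count and its parameter n are made up for illustration and are not part of pandas. It assumes "n values or more" means at least n non-NA values per column.

```python
import numpy as np
import pandas as pd


def keep_columns_with_min_count(df, n):
    """Return a new dataframe keeping only columns with at least n non-NA values.

    Hypothetical helper name; it only wraps DataFrame.dropna.
    """
    # axis=1 drops columns; thresh=n requires at least n non-NA values to keep one
    return df.dropna(axis=1, thresh=n)


# same example dataframe as above
example = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, np.nan],
    'C': [np.nan, np.nan, 6],
    'D': [np.nan, np.nan, np.nan],
})

result = keep_columns_with_min_count(example, 2)
print(list(result.columns))
```

Since dropna returns a new dataframe by default, the original is left untouched, so you would assign the result back (df = keep_columns_with_min_count(df, 5000)) if you want to replace it.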
Try this:
newdf=df.copy()
for column in newdf.columns:
    if df[column].count() <= 5000:
        df=df.drop(column, axis=1)
or the equivalent:
newdf=df.copy()
for column in newdf.columns:
    if df[column].count() <= 5000:
        # use bracket access; df.column would look for an attribute literally named "column"
        del df[column]
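The loops above can also be written without explicit iteration by selecting columns with a boolean mask built from DataFrame.count. The snippet below is a sketch on a toy dataframe; the names small_df, min_count and filtered are placeholders, and min_count = 1 stands in for 5000.

```python
import numpy as np
import pandas as pd

# toy dataframe; small_df and min_count are illustrative names only
small_df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, np.nan],
    'C': [np.nan, np.nan, 6],
    'D': [np.nan, np.nan, np.nan],
})
min_count = 1  # stands in for 5000 in the question

# count() returns the number of non-NA values per column
counts = small_df.count()

# keep only the columns whose count is strictly greater than min_count
filtered = small_df.loc[:, counts > min_count]
print(list(filtered.columns))
```

This uses the same strict comparison as the loops (columns with a count of min_count or fewer are dropped); switch to >= if a column with exactly min_count values should be kept.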