Drop columns in Dataframe if more than 90% of the values in the column are 0's

I have a dataframe which looks like this: (the screenshot from the original post is not included here)

As you can see, the third and fourth columns have a lot of 0's. I need to drop these columns if more than 90% of their values are 0.

First of all, next time please give an example dataset, not an image or copy of one. It's best to give a minimal example that reproduces your problem (it's also a good way to investigate your problem). This df, for example, will do the trick:

import pandas as pd

df = pd.DataFrame.from_dict({
    'a':[1,0,0,0,0,0,0,0,0,0,0],
    'b':[1,1,1,0,1,0,0,0,0,0,0]})

Now, the previous answers help, but if you can avoid a loop, it's preferable. You can write something simpler and more concise that will do the trick:

df.drop(columns=df.columns[df.eq(0).mean()>0.9])

Let's go through it step by step:
df.eq(0) returns True or False in each cell, depending on whether that cell equals 0.
The .mean() method treats True as 1 and False as 0, so each column's mean is the fraction of zeros in that column, and comparing that mean to 0.9 is exactly the test you want.
Indexing df.columns[...] with that boolean result returns only the column labels where > 0.9 holds, and drop removes those columns. Note that drop returns a new DataFrame, so assign the result if you want to keep it.
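
To see the intermediate values, here is a minimal sketch that runs each step separately on the example df from above. The variable names is_zero, zero_share, to_drop and result are only illustrative and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'b': [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]})

is_zero = df.eq(0)                      # boolean DataFrame, True where the cell equals 0
zero_share = is_zero.mean()             # per-column fraction of zeros (True counts as 1)
to_drop = df.columns[zero_share > 0.9]  # labels of the columns above the threshold
result = df.drop(columns=to_drop)       # returns a new DataFrame; df itself is unchanged

print(zero_share)            # a is 10/11 (about 0.909), b is 7/11 (about 0.636)
print(list(result.columns))  # ['b']
```

Column a has 10 zeros out of 11 values, which is just above 90%, so it is dropped; column b has 7 zeros out of 11 and is kept.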

The following should do the trick for you:

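row_count = df.shape[0]
columns_to_drop = []

# Count the zeros in each column; .items() replaces .iteritems(), which was removed in pandas 2.0
for column, count in df.apply(lambda column: (column == 0).sum()).items():
    if count / row_count > 0.9:  # more than 90% zeros
        columns_to_drop.append(column)

# Assign the result of drop; combining assignment with inplace=True would set df to None
df = df.drop(columns_to_drop, axis=1)

# Alternative using value_counts
bad_col = []
for x in df.columns:
    # Share of each distinct value among the column's non-null entries
    shares = df[x].value_counts(normalize=True)
    # Look up the share of zeros specifically, not of the most common value
    if 0 in shares.index and shares.loc[0] > 0.9:
        bad_col.append(x)
df = df.drop(columns=bad_col)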

The explanation is inline in the code.

import numpy as np

# Suppose df is your DataFrame, then execute the following code.

# Keep only the float64 columns (integer columns are not selected by this filter);
# .copy() makes the in-place drop below act on an independent DataFrame
df_float = df.loc[:, df.dtypes == np.float64].copy()

for i in df_float.columns:
    # Fraction of rows in this column that are exactly zero
    if (df_float[i] == 0).sum() / len(df_float) > 0.9:
        df_float.drop(i, axis=1, inplace=True)  # delete the column

# Your results are stored in df_float
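
If the goal is to check every numeric column rather than only float64 ones, the dtype filter can be combined with the vectorized zero-share test. The following is a minimal sketch under that assumption; the function name drop_mostly_zero_columns, its threshold parameter, and the example data are hypothetical and do not come from the answers above.

```python
import pandas as pd

def drop_mostly_zero_columns(df, threshold=0.9):
    """Return a new DataFrame without numeric columns whose zero share exceeds threshold."""
    numeric = df.select_dtypes(include="number")  # integer and float columns
    zero_share = numeric.eq(0).mean()             # NaN is not equal to 0, so it counts as non-zero
    return df.drop(columns=zero_share[zero_share > threshold].index)

# Made-up data for illustration only
example = pd.DataFrame({
    "mostly_zero": [0.0] * 19 + [1.0],  # 95% zeros
    "half_zero": [0, 1] * 10,           # 50% zeros
    "label": ["x"] * 20,                # non-numeric, never examined
})
cleaned = drop_mostly_zero_columns(example)
print(list(cleaned.columns))  # ['half_zero', 'label']
```

Non-numeric columns are passed through untouched here, whereas the df_float approach above discards them from the result; which behavior is preferable depends on the data.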
