
df.drop if it exists

Below is a function that reads a file and drops the columns 'row_num', 'start_date', and 'end_date'.

The problem is that not every file has all of these columns, so the function raises an error.

My goal is to alter the code so that it removes each of these columns if it exists, but does not raise an error if a column does not exist.

def read_df(file):
    df = pd.read_csv(file, na_values=['', ' '])
    # Drop useless junk and fill empty values with zero 
    df = df.drop(['row_num','start_date','end_date','symbol'], axis=1).fillna(0)
    df=df[df!=0][:-1].dropna().append(df.iloc[-1])
    return df

Add the errors parameter to DataFrame.drop:

errors : {'ignore', 'raise'}, default 'raise'

If 'ignore', suppress error and only existing labels are dropped.

df = df.drop(['row_num','start_date','end_date','symbol'], axis=1, errors='ignore')

Sample:

df = pd.DataFrame({'row_num':[1,2], 'w':[3,4]})
df = df.drop(['row_num','start_date','end_date','symbol'], axis=1, errors='ignore')
print(df)
   w
0  3
1  4
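
To make the contrast concrete, here is a minimal sketch of both settings on the same toy frame. The names `wanted`, `strict_failed`, and `result` are illustrative and not from the question. With the default errors='raise', a missing label raises a KeyError; with errors='ignore', only the labels that exist are dropped.

```python
import pandas as pd

df = pd.DataFrame({'row_num': [1, 2], 'w': [3, 4]})
wanted = ['row_num', 'start_date', 'end_date', 'symbol']

# Default behaviour (errors='raise'): a missing label raises KeyError.
try:
    df.drop(wanted, axis=1)
    strict_failed = False
except KeyError:
    strict_failed = True

# errors='ignore': only the labels that actually exist are dropped.
result = df.drop(wanted, axis=1, errors='ignore')

print(strict_failed)
print(list(result.columns))
```

Note that drop returns a new frame in both cases, so `df` itself is unchanged unless the result is assigned back.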

In my tests, the following was at least as fast as any of the given answers:

candidates=['row_num','start_date','end_date','symbol']
df = df.drop([x for x in candidates if x in df.columns], axis=1)

It has the benefit of readability and (with a small tweak to the code) the ability to record exactly which columns existed and were dropped, and when.
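
One possible form of that small tweak is sketched below. The helper name `drop_existing` and its return shape are assumptions made for illustration; they are not part of pandas or of the original answer.

```python
import pandas as pd

def drop_existing(df, candidates):
    """Drop the candidate columns that exist and report which ones those were."""
    # Hypothetical helper: keep only the candidates present in this frame.
    present = [c for c in candidates if c in df.columns]
    return df.drop(present, axis=1), present

df = pd.DataFrame({'row_num': [1, 2], 'w': [3, 4]})
candidates = ['row_num', 'start_date', 'end_date', 'symbol']

cleaned, dropped = drop_existing(df, candidates)
print(dropped)
print(list(cleaned.columns))
```

The returned list could then be logged per file to keep a record of what each input actually contained.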

Some reasons this might be more desirable than the previous solutions:

  • Looping over the items and dropping each column individually if it exists works, but is quite slow (see benchmarks below).
  • jezrael's answer is very nice, but made me nervous at first (ignoring errors feels bad!). A closer look at the documentation suggests this is fine, though: it only ignores the error for labels that do not exist, not other errors that would be undesirable to ignore. My solution may be more readable, especially for those less familiar with optional kwargs in pandas.

Benchmark Results:

[Image: benchmark results plot]

Code for the benchmark tests (credit to an answer in this question for how to create this sort of benchmark):

import math
from simple_benchmark import benchmark
import pandas as pd

# setting up the toy df:
def df_creator(length):
    c1=list(range(0,10))
    c2=list('a,b,c,d,e'.split(','))
    c3=list(range(0,5))
    c4=[True,False]
    lists=[c1,c2,c3,c4]
    df=pd.DataFrame()
    count=0
    for x in lists:
        count+=1
        df['col'+str(count)]=x*math.floor(length/len(x))
    return df

# setting up benchmark test:
def list_comp(df,candidates=['col1','col2','col5','col8']):
    return df.drop([x for x in candidates if x in df.columns], axis=1)

def looper(df,candidates=['col1','col2','col5','col8']):
    out = df  # start from the input so the drops accumulate and out is always defined
    for col in candidates:
        if col in out.columns:
            out = out.drop(columns=col)
    return out

def ignore_error(df,candidates=['col1','col2','col5','col8']):
    return df.drop(candidates, axis=1, errors='ignore')

functions=[list_comp,looper,ignore_error]

args={n : df_creator(n) for n in [10,100,1000,10000,100000]}
argname='df_length'
b=benchmark(functions,args,argname)
b.plot()
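
If simple_benchmark is not installed, a rough cross-check is possible with the standard library's timeit module. The following is only a sketch under that assumption: the frame size, the repeat count of 100, and the names `same_result` and `timings` are arbitrary choices for illustration, and the absolute numbers will vary by machine and pandas version.

```python
import timeit
import pandas as pd

candidates = ['col1', 'col2', 'col5', 'col8']  # col5 and col8 do not exist
df = pd.DataFrame({'col1': range(1000), 'col2': range(1000),
                   'col3': range(1000), 'col4': range(1000)})

def list_comp(df):
    return df.drop([x for x in candidates if x in df.columns], axis=1)

def ignore_error(df):
    return df.drop(candidates, axis=1, errors='ignore')

# Both approaches should leave the same columns and values behind.
same_result = list_comp(df).equals(ignore_error(df))

# Time 100 calls of each function on the same frame.
timings = {f.__name__: timeit.timeit(lambda f=f: f(df), number=100)
           for f in (list_comp, ignore_error)}
print(same_result, timings)
```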

I just had to do this; here's what I did:

# Drop these columns if they exist
cols = ['Billing Address Street 1', 'Billing Address Street 2','Billing Company']
for col in cols:
    if col in df.columns:
        df = df.drop(columns=col, axis=1)

Might not be the best way, but it served its purpose.

x = ['row_num','start_date','end_date','symbol']

To check whether each column exists before dropping it, you can do:

for i in x:
    if i in df:
        df = df.drop(i, axis=1)  # drop only the column that was just checked
df = df.fillna(0)

or

for i in x:
    if i in df.columns:
        df = df.drop(i, axis=1)  # drop only the column that was just checked
df = df.fillna(0)

Just use pandas filter, the Pythonic way

Oddly, no answers use the pandas DataFrame filter method:

thisFilter = df.filter(drop_list)
df.drop(thisFilter, inplace=True, axis=1)

This builds a filtered frame containing only the drop_list columns that exist in df, then drops those columns from df in place along axis=1.

That is, it drops the columns that match drop_list and does not raise an error if they do not exist.
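
A self-contained version of the same idea is sketched below. It differs slightly from the snippet above: it passes the column index of the filtered frame explicitly and reassigns rather than using inplace=True. The toy data and the name `existing` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'row_num': [1, 2], 'w': [3, 4]})
drop_list = ['row_num', 'start_date', 'end_date', 'symbol']

# filter(items=...) keeps only the requested columns that exist,
# silently skipping labels that are absent from the frame.
existing = df.filter(items=drop_list).columns
df = df.drop(columns=existing)

print(list(existing))
print(list(df.columns))
```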
