I am working on a project that involves a lot of database filtering using Pandas. So I wrote the following function:
import sys

import numpy as np
import pandas as pd


def filterList(df, dropL, col, criteria, reason="", strCont=False, isIN=False,
               notEq=False, isEq=False, isNAN=False, isDup=False, useDropL=True,
               dropCol=False, dropColDropList=False, useDropReason=True):
    # make a mask
    if strCont:
        mask = df[col].str.contains(criteria)
    elif notEq:
        mask = df[col] != criteria
    elif isEq:
        mask = df[col] == criteria
    elif isNAN:
        mask = np.isnan(df[col])
    elif isIN:
        mask = df[col].isin(criteria)
    elif isDup:
        mask = df.duplicated(col, keep=False)
    else:
        print("you must specify how to make the mask")
        sys.exit()
    # fill the droplist
    if useDropL:
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        dropL = pd.concat([dropL, df[mask]]).fillna("")
        dropL.reset_index(drop=True, inplace=True)
        if useDropReason:
            dropL.loc[dropL["Reason Dropped"] == '', 'Reason Dropped'] = reason
        if dropColDropList:
            dropL.drop(col, axis='columns', inplace=True)
    # filter the list
    df_Filtered = df.drop(df[mask].index)
    df_Filtered.reset_index(drop=True, inplace=True)
    # special instructions
    if dropCol:
        df_Filtered.drop(col, axis='columns', inplace=True)
    return df_Filtered, dropL
It's set up so that I have to pass exactly one of the boolean variables as True to specify how the matching criteria should be compared to the given column. It also tracks the dropped items and fills in a reason why each item was dropped (for manual error checking later).
I would like to not have such a long declaration statement. I mean, it works, I just think it looks ugly.
So I figured that I could use **kwargs to capture all the bools, and then just look for the variable names in them, but everywhere I look to see how to do that is saying that this is the worst idea in the world.
The given reasons seem to revolve around not knowing what variables will be passed, and possible variable name collisions. But I'm the only one who will be writing or running this code, so I'm not worried about variable name collisions in this case.
Since the filtering criteria are mutually exclusive, you should just use a single parameter that specifies the filtering method, rather than lots of boolean parameters.
def filterList(df, dropL, col, criteria, filterType, reason="", useDropL=True,
               dropCol=False, dropColDropList=False, useDropReason=True):
    # filterType names the comparison; criteria is still the value to compare
    if filterType == "strCont":
        mask = df[col].str.contains(criteria)
    elif filterType == "notEq":
        mask = df[col] != criteria
    ...
    else:
        print("you must specify how to make the mask")
        sys.exit()
    ...
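As a minimal sketch of the single-parameter idea above, the if/elif chain could also be replaced by a lookup table. The names MASK_BUILDERS and make_mask are hypothetical (they are not part of the original function), and raising ValueError instead of calling sys.exit() is a swapped-in choice, not something the original code does.

```python
import pandas as pd

# Hypothetical lookup table: filter-type name -> function that builds a mask
# from a single column (a Series) and the criteria value.
MASK_BUILDERS = {
    "strCont": lambda s, criteria: s.str.contains(criteria),
    "notEq": lambda s, criteria: s != criteria,
    "isEq": lambda s, criteria: s == criteria,
    "isNAN": lambda s, criteria: s.isna(),
    "isIN": lambda s, criteria: s.isin(criteria),
}


def make_mask(df, col, filterType, criteria=None):
    """Return a boolean mask marking the rows of df that match the filter."""
    if filterType == "isDup":
        # Duplicate detection works on the frame, not on a single Series.
        return df.duplicated(col, keep=False)
    try:
        builder = MASK_BUILDERS[filterType]
    except KeyError:
        # Assumption: an exception is preferable to exiting the interpreter.
        raise ValueError(f"unknown filterType: {filterType!r}") from None
    return builder(df[col], criteria)


# Small made-up frame, only to show the call.
df = pd.DataFrame({"name": ["apple", "banana", "cherry"], "qty": [1, 2, 2]})
mask = make_mask(df, "qty", "isEq", 2)
kept = df[~mask].reset_index(drop=True)
```

With this layout, adding a new filter means adding one entry to the table rather than another parameter and another elif branch.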
Not addressing your specific use case, but there will be times when it's necessary to have a function that takes a whole lot of arguments, and those must be specific. Using kwargs isn't the worst idea in the world, but it would create two problems that you have to solve: you would have to pull each value out of the kwargs object yourself, and you would also have to handle which items are optional and which are required.
Having said that, readability is also a factor, and you are right to be concerned about the "ugliness" of a large declaration statement. To solve that, I think it would be smarter to just change the format of the declaration. Writing something like this is totally acceptable, and much more readable:
def filterList(df,
               dropL,
               col,
               criteria,
               reason="",
               strCont=False,
               isIN=False,
               notEq=False,
               isEq=False,
               isNAN=False,
               isDup=False,
               useDropL=True,
               dropCol=False,
               dropColDropList=False,
               useDropReason=True):
That format would even make it easier to add comments or type hints, if needed, to each variable.
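As a rough illustration of that last point, here is the same one-parameter-per-line layout with a comment and a type hint on each line. The hints are assumptions (Any stands in for the DataFrame type so the sketch has no dependencies), the per-line comments are my reading of the original function, and the bare * that makes the flags keyword-only is an optional extra that the original signature does not have.

```python
from typing import Any


def filterList(
    df: Any,                        # DataFrame to filter
    dropL: Any,                     # DataFrame that accumulates dropped rows
    col: str,                       # name of the column to test
    criteria: Any,                  # value(s) the column is compared against
    *,                              # assumption: flags below are keyword-only
    reason: str = "",               # text written to "Reason Dropped"
    strCont: bool = False,          # match rows whose string contains criteria
    isIN: bool = False,             # match rows whose value is in criteria
    notEq: bool = False,            # match rows not equal to criteria
    isEq: bool = False,             # match rows equal to criteria
    isNAN: bool = False,            # match rows that are NaN
    isDup: bool = False,            # match rows duplicated on col
    useDropL: bool = True,          # append matched rows to dropL
    dropCol: bool = False,          # drop col from the filtered result
    dropColDropList: bool = False,  # drop col from dropL
    useDropReason: bool = True,     # fill in the reason for new dropL rows
) -> Any:
    """Signature-only sketch; the body is deliberately left out."""
    ...
```

With the keyword-only marker, a call such as filterList(df, dropL, "qty", 2, isEq=True) still works, while passing a flag positionally raises a TypeError instead of silently landing in the wrong parameter.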