How to filter a pandas DataFrame using contains against a list of columns, if I don't know which columns are present?
I want to filter my DataFrame down to the rows where certain columns contain a known string. I know you can do something like this:
summ_proc = summ_proc[
summ_proc['data.process.name'].str.contains(indicator) |
summ_proc['data.win.eventdata.processName'].str.contains(indicator) |
summ_proc['data.win.eventdata.logonProcessName'].str.contains(indicator) |
summ_proc['syscheck.audit.process.name'].str.contains(indicator)
]
where I'm using the | operator to check against multiple columns. But there are cases where a certain column name isn't present, so 'data.process.name' might not be there every time.
I tried the following implementation:
summ_proc[summ_proc.apply(lambda x: summ_proc['data.process.name'].str.contains(indicator) if 'data.process.name' in summ_proc.columns else summ_proc)]
And that works. But I'm not sure how I can apply the OR operator to this lambda function. I want all the rows where either
data.process.name
or data.win.eventdata.processName
or data.win.eventdata.logonProcessName
or syscheck.audit.process.name
contains the indicator.
EDIT:
I tried the following approach, where I created individual frames and concatenated them all.
summ_proc1 = summ_proc[summ_proc.apply(lambda x: summ_proc['data.process.name'].str.contains(indicator) if 'data.process.name' in summ_proc.columns else summ_proc)]
summ_proc2 = summ_proc[summ_proc.apply(lambda x: summ_proc['data.win.eventdata.processName'].str.contains(indicator) if 'data.win.eventdata.processName' in summ_proc.columns else summ_proc)]
summ_proc3 = summ_proc[summ_proc.apply(lambda x: summ_proc['data.win.eventdata.logonProcessName'].str.contains(indicator) if 'data.win.eventdata.logonProcessName' in summ_proc.columns else summ_proc)]
frames = [summ_proc1, summ_proc2, summ_proc3]
result = pd.concat(frames)
This works, but I'm curious whether there's a better, more Pythonic approach. Or will this current method cause downstream issues?
Something like this should work:
import numpy as np
columns = ['data.process.name', 'data.win.eventdata.processName']
# filter columns that are in summ_proc
available_columns = [c for c in columns if c in summ_proc.columns]
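# list of Boolean Series, one per column, indicating whether each row contains indicator
# (na=False treats missing values as non-matches, so the masks stay purely Boolean)
ss = [summ_proc[c].str.contains(indicator, na=False) for c in available_columns]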
# reduce without '|' by using 'np.logical_or'
indexer = np.logical_or.reduce(ss)
result = summ_proc[indexer]
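For reference, here is the same idea as a self-contained sketch. The helper name `filter_any_contains` and the sample data are made up for illustration. It assumes the candidate columns hold strings (or missing values), that `indicator` is a literal substring rather than a regex (hence `regex=False`; drop that argument if you need pattern matching), and that an empty result is what you want when none of the columns exist.

```python
import numpy as np
import pandas as pd


def filter_any_contains(df, columns, indicator):
    """Return rows where any of the listed columns that exist contains indicator."""
    # Keep only the candidate columns that are actually present in this frame.
    available = [c for c in columns if c in df.columns]
    if not available:
        # Assumption: if no candidate column exists, nothing can match.
        return df.iloc[0:0]
    # One Boolean Series per column; na=False counts missing values as non-matches.
    masks = [df[c].str.contains(indicator, na=False, regex=False) for c in available]
    # Element-wise OR across all masks, without chaining '|' by hand.
    return df[np.logical_or.reduce(masks)]


# Hypothetical sample data: only two of the four candidate columns are present.
summ_proc = pd.DataFrame({
    "data.process.name": ["sshd", None, "cron"],
    "data.win.eventdata.processName": [None, "C:\\Windows\\lsass.exe", None],
    "other": ["a", "b", "c"],
})
columns = [
    "data.process.name",
    "data.win.eventdata.processName",
    "data.win.eventdata.logonProcessName",
    "syscheck.audit.process.name",
]
result = filter_any_contains(summ_proc, columns, "ss")
```

Because there is a single combined mask, each matching row comes back once, whereas concatenating per-column results repeats any row that matches in more than one column.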