Filter Pandas dataframe by column name on regex patterns using str.contains
I want to find columns in a dataframe that match a string pattern. Specifically, I want to do two things: first find the columns whose names contain "WORDABC", and then find the one column that is the "1" variant of that name (i.e. "WORDABC1"). To do this I have been using the pandas .str.contains method.
My problem is when the number has two digits, such as "11" or "13".
import pandas as pd

df = pd.DataFrame({'WORDABC1': {0: 1, 1: 2, 2: 3},
                   'WORDABC11': {0: 4, 1: 5, 2: 6},
                   'WORDABC8N123': {0: 7, 1: 8, 2: 9},
                   'WORDABC81N123': {0: 10, 1: 11, 2: 12},
                   'WORDABC9N123': {0: 13, 1: 14, 2: 15},
                   'WORDABC99N123': {0: 16, 1: 17, 2: 18}})
Trying to search for the column that contains "WORDABC1" gives two results, "WORDABC1" and "WORDABC11":
df[df.columns[df.columns.str.contains(pat = 'WORDABC1')]]
WORDABC1 WORDABC11
0 1 4
1 2 5
2 3 6
df[df.columns[df.columns.str.contains(pat = 'WORDABC1\\b')]]
WORDABC1
0 1
1 2
2 3
For the above example, that works for me. However, my problem occurs when there are more characters after the matched pattern.
df[df.columns[df.columns.str.contains(pat = 'WORDABC9')]]
WORDABC9N123 WORDABC99N123
0 13 16
1 14 17
2 15 18
df[df.columns[df.columns.str.contains(pat = 'WORDABC9\\b')]]
Empty DataFrame
Columns: []
Index: [0, 1, 2]
I only want the "WORDABC9N123" column, and I cannot just remove the other column. I have considered just using
df[df.columns[df.columns.str.contains(pat = 'WORDABC9')][0]]
to get the series I want, but that creates another issue.
I have also been using things such as
(df.columns.str.contains(pat = 'WORDABC1\\b')).sum()
to create truth statements, so the [0] approach above doesn't get me around the issue.
Is there a better method to use instead of str.contains? Or is my regex just incorrect?
Thank you!
Try .filter with the regex= parameter:
print(df.filter(regex=r"WORDABC9(?=[^\d]|$)"))
Prints:
WORDABC9N123
0 13
1 14
2 15
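The same lookahead idea can be wrapped so that it works for any prefix and number. This is only a sketch: the helper name columns_with_number is hypothetical and not part of pandas, and it assumes the only requirement is that the number is not followed by another digit.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "WORDABC1": [1, 2, 3],
    "WORDABC11": [4, 5, 6],
    "WORDABC8N123": [7, 8, 9],
    "WORDABC81N123": [10, 11, 12],
    "WORDABC9N123": [13, 14, 15],
    "WORDABC99N123": [16, 17, 18],
})


def columns_with_number(frame, prefix, number):
    """Hypothetical helper: keep columns whose name contains prefix+number
    where the number is followed by a non-digit or by the end of the name."""
    # re.escape guards against regex metacharacters in the prefix.
    pattern = re.escape(f"{prefix}{number}") + r"(?=\D|$)"
    return frame.filter(regex=pattern)


nine = columns_with_number(df, "WORDABC", 9)
one = columns_with_number(df, "WORDABC", 1)
print(list(nine.columns))
print(list(one.columns))
```

Because .filter returns a dataframe, the result can be used directly or checked with len(nine.columns) when only a yes/no answer is needed.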
pat = 'WORDABC1\\b' works when matching 'WORDABC1' because \\b matches word boundaries, and the end of a string after a word character is a word boundary.
If you want to match 'WORDABC9N123' but not 'WORDABC99N123', the similar pattern 'WORDABC9\\b' will not work because there is no word boundary after the 9 in either case: letters and digits are both word characters, so neither '9N' nor '99' contains a boundary.
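The word-boundary behaviour described above can be checked directly with Python's re module. This is a small illustration using the column names from the question; the variable names are made up for the example.

```python
import re

# \b matches only between a word character (letter, digit, underscore)
# and a non-word character, or at the start/end of the string.
cases = [
    (r"WORDABC1\b", "WORDABC1"),       # 1 is followed by the end of the string
    (r"WORDABC1\b", "WORDABC11"),      # 1 is followed by another digit
    (r"WORDABC9\b", "WORDABC9N123"),   # 9 is followed by a letter
    (r"WORDABC9\b", "WORDABC99N123"),  # 9 is followed by another digit
]

# Map each column name to whether its pattern was found in it.
matched = {name: re.search(pattern, name) is not None for pattern, name in cases}

for name, hit in matched.items():
    print(name, hit)
```

Only the first case matches, which is why the \\b pattern returns an empty dataframe for the WORDABC9 columns.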
I think you want to match WORDABC9 followed by a non-digit or by the end of the name, in which case you can try pat = 'WORDABC9(?:\\b|\\D)'. The alternation has to go in a group rather than in square brackets, because inside a character class \\b means a backspace character and | is a literal pipe. That will match either WORDABC9 or WORDABC9N..., but not WORDABC99N123.
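For the truth-statement use case from the question, the "not followed by another digit" idea can be checked with str.contains and .sum(). The sketch below assumes a slightly different spelling of the pattern, (?:\D|$), meaning "a non-digit or the end of the name"; the variable names are illustrative.

```python
import pandas as pd

columns = pd.Index([
    "WORDABC1", "WORDABC11",
    "WORDABC8N123", "WORDABC81N123",
    "WORDABC9N123", "WORDABC99N123",
])

# WORDABC9 followed by a non-digit, or by the end of the column name.
pattern = r"WORDABC9(?:\D|$)"

mask = columns.str.contains(pattern)  # one boolean per column name
matched = list(columns[mask])
count = int(mask.sum())               # number of matching columns

print(matched, count)

# The same shape of pattern also accepts a bare name with nothing after the digit.
bare_ok = bool(pd.Index(["WORDABC9"]).str.contains(pattern)[0])
print(bare_ok)
```

A count of exactly 1 can then serve as the truth test, without indexing the result with [0].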