Filtering a Pandas dataframe by column names that match a regex pattern using str.contains

Question

I want to find columns in a dataframe whose names match a string pattern. Specifically, there are two steps: first find the columns whose names contain "WORDABC", then find the one among them that carries the number "1" (i.e. "WORDABC1"). To do this I have been using the Pandas .str.contains function.

My problem arises when the number has two digits, such as "11" or "13".

import pandas as pd

df = pd.DataFrame({'WORDABC1': {0: 1, 1: 2, 2: 3},
 'WORDABC11': {0: 4, 1: 5, 2: 6},
 'WORDABC8N123': {0: 7, 1: 8, 2: 9},
 'WORDABC81N123': {0: 10, 1: 11, 2: 12},
 'WORDABC9N123': {0: 13, 1: 14, 2: 15},
 'WORDABC99N123': {0: 16, 1: 17, 2: 18}})

Searching for the column that contains "WORDABC1" gives two results, "WORDABC1" and "WORDABC11":

df[df.columns[df.columns.str.contains(pat = 'WORDABC1')]]

   WORDABC1  WORDABC11
0         1          4
1         2          5
2         3          6

df[df.columns[df.columns.str.contains(pat = 'WORDABC1\\b')]]

   WORDABC1
0         1
1         2
2         3

For the above example, adding \\b works for me. However, my problem occurs when more characters follow the pattern I am searching for.

df[df.columns[df.columns.str.contains(pat = 'WORDABC9')]]
   WORDABC9N123  WORDABC99N123
0            13             16
1            14             17
2            15             18

df[df.columns[df.columns.str.contains(pat = 'WORDABC9\\b')]]
Empty DataFrame
Columns: []
Index: [0, 1, 2]

I only want the "WORDABC9N123" column, and I cannot just remove the other column. I have considered using df[df.columns[df.columns.str.contains(pat = 'WORDABC9')][0]] to get the series I want, but that creates another issue.

I have also been using expressions such as (df.columns.str.contains(pat = 'WORDABC1\\b')).sum() to build truth tests, so the [0] indexing approach above does not solve my problem.

Is there a better method to use instead of str.contains? Or is my regex just incorrect? Thank you!

Answer 1

Try .filter with the regex= parameter:

print(df.filter(regex=r"WORDABC9(?=[^\d]|$)"))

Prints:

   WORDABC9N123
0            13
1            14
2            15
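If you need the boolean mask itself (for example to count matches with .sum(), as in the question), the same idea should also work with str.contains. The following is a minimal sketch, assuming the dataframe from the question rebuilt with plain lists; the names mask, selected and count are only illustrative, and the negative lookahead (?!\d) is a shorter equivalent of the lookahead above, not something the original answer used.

```python
import pandas as pd

# Same column names as in the question, rebuilt here so the snippet is self-contained.
df = pd.DataFrame({
    "WORDABC1": [1, 2, 3],
    "WORDABC11": [4, 5, 6],
    "WORDABC8N123": [7, 8, 9],
    "WORDABC81N123": [10, 11, 12],
    "WORDABC9N123": [13, 14, 15],
    "WORDABC99N123": [16, 17, 18],
})

# (?!\d) is a negative lookahead: the match must not be followed by a digit.
# It consumes no characters and has no capture group.
mask = df.columns.str.contains(r"WORDABC9(?!\d)")

selected = df.loc[:, mask]   # keep only the matching columns
count = int(mask.sum())      # number of matching column names

print(list(selected.columns))
print(count)
```

The same pattern with 1 in place of 9 selects only WORDABC1, so one pattern shape covers both cases from the question.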

Answer 2

pat = 'WORDABC1\\b' works when matching 'WORDABC1' because \\b matches word boundaries, and the end of a string that ends in a word character is a word boundary.

If you want to match 'WORDABC9N123' but not 'WORDABC99N123', the similar pattern 'WORDABC9\\b' will not work, because in both names the 9 is followed by another word character (N or 9), so there is no word boundary in either case.
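The word-boundary behaviour can be checked directly with Python's re module. This is a small sketch using a few of the column names from the question; the variable names are only illustrative.

```python
import re

names = ["WORDABC1", "WORDABC11", "WORDABC9N123", "WORDABC99N123"]

# \b matches only between a word character (letter, digit, underscore)
# and a non-word character or the edge of the string.
boundary_hits_1 = [n for n in names if re.search(r"WORDABC1\b", n)]
boundary_hits_9 = [n for n in names if re.search(r"WORDABC9\b", n)]

print(boundary_hits_1)  # only the name that ends right after the 1
print(boundary_hits_9)  # empty: the 9 is always followed by N or 9
```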

I think you want to match WORDABC9 followed by either a non-digit or the end of the name, in which case you can try pat = 'WORDABC9(?:\\D|$)'. (The alternation cannot be written as a character class, because inside [...] \\b means a backspace character rather than a word boundary.) That will match either WORDABC9 or WORDABC9N..., but not WORDABC99N123.
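As a quick check of the "non-digit or end of name" idea, here is a hedged sketch with Python's re module using an alternation, WORDABC9(?:\D|$). The names 'WORDABC9' and 'WORDABC9_x' are hypothetical extras added only to exercise the end-of-string and non-letter cases; they are not columns from the question.

```python
import re

# (?:...) is a non-capturing group; \D is any non-digit, $ is the end of the string.
# Inside [...] \b would mean a backspace character, so alternation is used instead.
pattern = re.compile(r"WORDABC9(?:\D|$)")

names = ["WORDABC9", "WORDABC9N123", "WORDABC99N123", "WORDABC9_x"]
matches = [n for n in names if pattern.search(n)]

print(matches)
```

Using a non-capturing group also keeps the pattern free of capture groups, which pandas str.contains warns about.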

Answer 1: score 6, accepted, 2021-08-20 21:31:12

Answer 2: score 1, 2021-08-20 21:32:50
