
Filter Pandas dataframe by column name on regex patterns using str.contains

I want to find columns in a dataframe that match a string pattern. Specifically, I want to do two things: first find the columns that contain "WORDABC", and then find the column that is the "1" variant of that name (i.e. "WORDABC1"). To do this I have been using the Pandas .str.contains function.

My problem is when there are two digits, such as "11" or "13".

import pandas as pd

df = pd.DataFrame({'WORDABC1': {0: 1, 1: 2, 2: 3},
 'WORDABC11': {0: 4, 1: 5, 2: 6},
 'WORDABC8N123': {0: 7, 1: 8, 2: 9},
 'WORDABC81N123': {0: 10, 1: 11, 2: 12},
 'WORDABC9N123': {0: 13, 1: 14, 2: 15},
 'WORDABC99N123': {0: 16, 1: 17, 2: 18}})

Searching for the column that contains "WORDABC1" gives two results, "WORDABC1" and "WORDABC11":

df[df.columns[df.columns.str.contains(pat = 'WORDABC1')]]

   WORDABC1  WORDABC11
0         1          4
1         2          5
2         3          6

Adding a word boundary, \\b, to the pattern narrows the result to the one column:

df[df.columns[df.columns.str.contains(pat = 'WORDABC1\\b')]]

   WORDABC1
0         1
1         2
2         3

For the above example, that works. However, my problem appears when there are more characters after the pattern I am looking for.

df[df.columns[df.columns.str.contains(pat = 'WORDABC9')]]

   WORDABC9N123  WORDABC99N123
0            13             16
1            14             17
2            15             18

df[df.columns[df.columns.str.contains(pat = 'WORDABC9\\b')]]

Empty DataFrame
Columns: []
Index: [0, 1, 2]

I only want the "WORDABC9N123" column, and I cannot just remove the other column. I have considered using df[df.columns[df.columns.str.contains(pat = 'WORDABC9')][0]] to get the series I want, but that creates another issue.

I have also been using expressions such as (df.columns.str.contains(pat = 'WORDABC1\\b')).sum() to create truth statements, so the [0] approach above doesn't get me past the issue.

Is there a better method to use instead of str.contains? Or is my regex just incorrect? Thank you!

Try .filter with the regex= parameter:

print(df.filter(regex=r"WORDABC9(?=[^\d]|$)"))

Prints:

   WORDABC9N123
0            13
1            14
2            15
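
The same idea can be wrapped so it works for any prefix. The following is only a sketch: `columns_for` is a hypothetical helper name (not part of pandas), and it uses a negative lookahead `(?!\d)`, which is an alternative way to say "not followed by a digit" and also accepts the end of the name.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "WORDABC1": [1, 2, 3],
    "WORDABC11": [4, 5, 6],
    "WORDABC8N123": [7, 8, 9],
    "WORDABC81N123": [10, 11, 12],
    "WORDABC9N123": [13, 14, 15],
    "WORDABC99N123": [16, 17, 18],
})


def columns_for(frame, prefix):
    """Hypothetical helper: keep columns whose name contains `prefix`
    when the prefix is not immediately followed by another digit."""
    # re.escape guards against regex metacharacters in the prefix.
    pattern = re.escape(prefix) + r"(?!\d)"
    # DataFrame.filter(regex=...) applies re.search to each column label.
    return frame.filter(regex=pattern)


print(columns_for(df, "WORDABC1").columns.tolist())
print(columns_for(df, "WORDABC9").columns.tolist())
```

With the sample frame above, each call should select a single column, so the two-digit names no longer leak into the result.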

pat = 'WORDABC1\\b' works when matching 'WORDABC1' because \\b matches word boundaries, and the end of a string after a word character is a word boundary.

If you want to match 'WORDABC9N123' but not 'WORDABC99N123', the similar pattern 'WORDABC9\\b' will not work because there is no word boundary after the 9 in either case: it is followed by another word character (N or 9).
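
To see the word-boundary behaviour in isolation, here is a small sketch using only the standard `re` module on the column names from the question. `boundary_hits` is a hypothetical helper name introduced just for this illustration.

```python
import re

names = ["WORDABC1", "WORDABC11", "WORDABC9N123", "WORDABC99N123"]


def boundary_hits(pattern, candidates):
    """Hypothetical helper: return the names in which `pattern` is found."""
    return [name for name in candidates if re.search(pattern, name)]


# \b needs a word character on one side and a non-word character
# (or the string edge) on the other.
hits_1 = boundary_hits(r"WORDABC1\b", names)
hits_9 = boundary_hits(r"WORDABC9\b", names)

print(hits_1)  # ['WORDABC1']
print(hits_9)  # []
```

Letters, digits and the underscore all count as word characters, which is why the 9 in 'WORDABC9N123' is not at a boundary.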

I think you want to match WORDABC9 followed by a non-digit, in which case you can try pat = 'WORDABC9(?:\\b|\\D)'. That will match either WORDABC9 or WORDABC9N..., but not WORDABC99N123. The alternation has to go in a group rather than in square brackets: inside a character class, \\b means a backspace character and | is a literal pipe.
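
If you would rather stay with str.contains, for example because you also count matches with .sum(), a sketch along these lines may help. It assumes the sample frame from the question and uses a negative lookahead `(?!\d)` as another way to express "WORDABC9 not followed by a digit"; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "WORDABC1": [1, 2, 3],
    "WORDABC11": [4, 5, 6],
    "WORDABC8N123": [7, 8, 9],
    "WORDABC81N123": [10, 11, 12],
    "WORDABC9N123": [13, 14, 15],
    "WORDABC99N123": [16, 17, 18],
})

# Boolean array with one entry per column name.
mask = df.columns.str.contains(r"WORDABC9(?!\d)")

# Number of matching columns, usable as a truth test.
n_matches = int(mask.sum())

# Select the matching columns by position with the same mask.
selected = df.loc[:, mask]

print(n_matches)
print(selected.columns.tolist())
```

Because the mask is computed once, the same array serves both the truth test and the column selection, so there is no need to index with [0].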
