Filtering pandas data frame columns with a regular expression, with exceptions

Question

I'm trying to subset (retrieve a set of rows from) a Python pandas data frame by using DataFrame.filter with a regex string to identify the columns of interest, and then subsetting on the values in those columns.

For example, this is my mock data frame:

id  status  status_drug_use  drugA  drugA_use     drugB  drugB_use
0   1       analgesic        0      None          1      hypertensive
1   0       analgesic        1      analgesic     1      hypertensive
2   0       analgesic        1      hypertensive  0      None
3   1       analgesic        0      None          1      analgesic
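
For anyone who wants to reproduce the example, the mock data frame can be built roughly like this. This is a sketch that assumes the first column is `id` and that the `None` entries are real missing values rather than the string "None".

```python
import pandas as pd

# Mock data frame from the question; None marks a missing drug use.
df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "status": [1, 0, 0, 1],
    "status_drug_use": ["analgesic"] * 4,
    "drugA": [0, 1, 1, 0],
    "drugA_use": [None, "analgesic", "hypertensive", None],
    "drugB": [1, 1, 0, 1],
    "drugB_use": ["hypertensive", "hypertensive", None, "analgesic"],
})

print(df)
```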

I would like all rows in which the value in drugA_use or drugB_use matches the value in status_drug_use. In the example, this would return two rows:

id  status  status_drug_use  drugA  drugA_use     drugB  drugB_use
1   0       analgesic        1      analgesic     1      hypertensive
3   1       analgesic        0      None          1      analgesic

There are a few column name conventions to stick with:

status_drug_use is always there.
The matching columns (drugA_use and drugB_use) always follow the template <ANYTHING>_use.
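
To illustrate these conventions, one possible regex for `DataFrame.filter` uses a negative lookahead so that every `<ANYTHING>_use` column is kept except `status_drug_use`. The pattern below is an assumption about what "with exceptions" should mean here, and the empty frame only carries the column names from the example.

```python
import pandas as pd

# Empty frame with the example's column names; only the labels matter here.
df = pd.DataFrame(columns=[
    "id", "status", "status_drug_use",
    "drugA", "drugA_use", "drugB", "drugB_use",
])

# (?!status_drug_use$) rejects the one column to exclude;
# .*_use$ keeps everything else that ends in _use.
use_cols = df.filter(regex=r"^(?!status_drug_use$).*_use$")

print(list(use_cols.columns))  # ['drugA_use', 'drugB_use']
```

`filter(regex=...)` applies `re.search` to each column label, so the `^` and `$` anchors are what make the pattern match whole names.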

Alteration: There is a second scenario, in which I would like to compare a user-defined string, e.g. analgesic, against the two columns drugA_use and drugB_use. This is different from using the content of status_drug_use.

Answer 1

Here's a way to do what you've asked:

df2 = df.assign(all_use=df.apply(
    lambda x: list(x[[col for col in df.columns if col.endswith('_use') and col != 'status_drug_use']]), 
    axis=1)).explode(
    'all_use').query('status_drug_use == all_use').drop_duplicates().drop(columns='all_use')

Input:

  id status status_drug_use drugA     drugA_use drugB     drugB_use
0  0      1       analgesic     0          None     1  hypertensive
1  1      0       analgesic     1     analgesic     1  hypertensive
2  2      0       analgesic     1  hypertensive     0          None
3  3      1       analgesic     0          None     1     analgesic

Output:

  id status status_drug_use drugA  drugA_use drugB     drugB_use
1  1      0       analgesic     1  analgesic     1  hypertensive
3  3      1       analgesic     0       None     1     analgesic

Explanation:

find the subset of all columns ending in _use (excluding status_drug_use)
add a column named all_use whose value for a given row is a list of that row's values in the columns ending in _use
use explode() to add rows so that each original row becomes multiple rows, one for each value in its all_use list
use query() to select only rows where status_drug_use matches the value in all_use
use drop_duplicates() to eliminate repeated rows in case an original row had multiple matches (for example, if both drugA_use and drugB_use contained "analgesic" and so did status_drug_use)
drop the column all_use as we no longer need it.
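
The steps above can also be written out one at a time, which makes the intermediate frames easier to inspect. This is a sketch of the same pipeline rather than a different method; the intermediate names (`use_cols`, `with_lists`, `exploded`, `matched`) are made up for illustration, and the per-row lists are built with `values.tolist()` instead of `apply`.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "status": [1, 0, 0, 1],
    "status_drug_use": ["analgesic"] * 4,
    "drugA": [0, 1, 1, 0],
    "drugA_use": [None, "analgesic", "hypertensive", None],
    "drugB": [1, 1, 0, 1],
    "drugB_use": ["hypertensive", "hypertensive", None, "analgesic"],
})

# Step 1: columns ending in _use, excluding status_drug_use.
use_cols = [c for c in df.columns
            if c.endswith("_use") and c != "status_drug_use"]

# Step 2: add all_use, holding one list of *_use values per row.
all_use = pd.Series(df[use_cols].values.tolist(), index=df.index)
with_lists = df.assign(all_use=all_use)

# Step 3: one row per value in all_use (the index labels repeat).
exploded = with_lists.explode("all_use")

# Step 4: keep rows where the exploded value equals status_drug_use.
matched = exploded.query("status_drug_use == all_use")

# Steps 5 and 6: collapse repeated matches, then drop the helper column.
df2 = matched.drop_duplicates().drop(columns="all_use")

print(df2)
```

Note that `drop_duplicates()` compares column values only, so it would also merge original rows that are identical in every column; the unique `id` column prevents that in this example.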

UPDATE: Addressing OP's question in a comment: 'Rather than using the values in column status_drug_use, how do I achieve the same output but by using a single user defined string, e.g. "analgesic"?'

You can do this by storing the user-defined query string in a variable (call it user_defined_str) and, inside query(), replacing the column name status_drug_use with the variable name prefixed by @: @user_defined_str (see the query() docs for more detail).

user_defined_str = 'analgesic'
df3 = df.assign(all_use=df.apply(
    lambda x: list(x[[col for col in df.columns if col.endswith('_use') and col != 'status_drug_use']]), 
    axis=1)).explode(
    'all_use').query('@user_defined_str == all_use').drop_duplicates().drop(columns='all_use')
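
As an alternative that is not part of the answer above, both scenarios can probably be handled without `explode()` by comparing the `_use` columns directly and building a boolean mask. The sketch below assumes the regex `^(?!status_drug_use$).*_use$` is an acceptable way to select the columns; the names `mask_str` and `mask_col` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "status": [1, 0, 0, 1],
    "status_drug_use": ["analgesic"] * 4,
    "drugA": [0, 1, 1, 0],
    "drugA_use": [None, "analgesic", "hypertensive", None],
    "drugB": [1, 1, 0, 1],
    "drugB_use": ["hypertensive", "hypertensive", None, "analgesic"],
})

# All <ANYTHING>_use columns except status_drug_use.
use_cols = df.filter(regex=r"^(?!status_drug_use$).*_use$")

# Scenario 2: compare against a single user-defined string.
user_defined_str = "analgesic"
mask_str = use_cols.eq(user_defined_str).any(axis=1)
df3 = df[mask_str]

# Scenario 1: compare each row against its own status_drug_use value.
mask_col = use_cols.eq(df["status_drug_use"], axis=0).any(axis=1)
df2 = df[mask_col]

print(df3)
```

Because each original row appears at most once in a boolean mask, no `drop_duplicates()` step is needed with this approach.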

Answer 1 (1 accepted), 2022-05-23 20:51:29
