Create lambda function to apply to select df columns
I have the following df:
id header1 header2 diabetes obesity hypertension/high blood pressure. . .
1 metabolism diabetes no no no
2 heart issue heart disease None None None
3 obesity diabetes yes no no
4 metabolism hypertension no no yes
5 heart issue heart disease no no yes
6 obesity diabetes yes yes no
7 obesity diabetes no no yes
I want to create a lambda function that goes through header1 and header2 and checks whether either cell matches a substring of one of the column names. Depending on whether the matching column holds yes, no, or null, it should return a flag value in a new column.
For each row: if the value in header1 or header2 matches a substring of a column name and that column contains yes, set the flag to 2. If any of the category columns contains yes but none of those columns matches header1 or header2, set the flag to 1. Otherwise, leave it blank.
Example:
Attempt:
cols = [x for x in df.columns if x not in ['header1', 'header2']]
df['flag'] = df.apply(lambda x: 2 if df['header1'] or df['header2'] in cols and cols == yes, 1 elif df['header1'] not in df['header2'] in cols and cols == yes, None else
Desired result:
id header1 header2 diabetes obesity hypertension/high blood pressure | flag
1 metabolism diabetes no no no None
2 heart issue heart disease None None None None
3 obesity diabetes yes no no 2
4 metabolism hypertension no no yes 2
5 heart issue heart disease no no yes 1
6 obesity diabetes yes yes no 2
7 obesity diabetes no no yes 1
Constructor
Please note that my actual df has a dynamic number of yes/no columns, but only two header columns.
import numpy as np
import pandas as pd

data = np.array([('metabolism', 'diabetes', 'no', 'no', 'no'),
                 ('heart issue', 'heart disease', None, None, None),
                 ('obesity', 'diabetes', 'yes', 'no', 'no'),
                 ('metabolism', 'hypertension', 'no', 'no', 'yes'),
                 ('heart issue', 'heart disease', 'no', 'no', 'yes'),
                 ('obesity', 'diabetes', 'yes', 'yes', 'no'),
                 ('obesity', 'diabetes', 'no', 'no', 'yes')])
df = pd.DataFrame(data, columns=['header1', 'header2', 'diabetes', 'obesity', 'hypertension/high blood pressure'])
# Every column that is not a header is a yes/no column
cols = [x for x in df.columns if x not in ['header1', 'header2']]
First, build an index of the disease columns and a matching index of disease names (the part of each column name before the "/", which is what lets "hypertension" match).
Then apply a function that first checks for "yes" answers and then searches the headers for the disease names of the columns answered "yes".
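headers = ['header1', 'header2']
# All non-header columns (Index.difference returns them sorted)
disease_cols = df.columns.difference(headers)
# Text before '/': 'hypertension/high blood pressure' -> 'hypertension'
disease_names = disease_cols.str.split('/').str[0]

def get_flag(row):
    # Boolean mask over disease_cols: True where the answer is 'yes'
    yes = row[disease_cols].eq('yes').to_numpy()
    if yes.any():
        # 2 if a header mentions a disease answered 'yes', otherwise 1
        pattern = '|'.join(disease_names[yes])
        return 2 if row[headers].str.contains(pattern).any() else 1
    return np.nan

df['flag'] = df.apply(get_flag, axis=1)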
Output:
header1 header2 diabetes obesity hypertension/high blood pressure flag
0 metabolism diabetes no no no NaN
1 heart issue heart disease no no no NaN
2 obesity diabetes yes no no 2.0
3 metabolism hypertension no no yes 2.0
4 heart issue heart disease no no yes 1.0
5 obesity diabetes yes yes no 2.0
6 obesity diabetes no no yes 1.0
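If the real column names may contain regex metacharacters (parentheses, "+", and so on), joining them into a pattern unescaped could misfire. Below is a minimal, self-contained sketch of the same idea with the keywords escaped. The function name add_flag and the assumptions that every non-header column is a yes/no column and that the keyword is the text before the first "/" are mine, not part of the question.

```python
import re

import numpy as np
import pandas as pd


def add_flag(df, headers=("header1", "header2")):
    """Return a copy of df with a 'flag' column holding 2, 1, or NaN."""
    headers = list(headers)
    # Assumption: every non-header column is a yes/no disease column.
    disease_cols = [c for c in df.columns if c not in headers]
    # Assumption: the keyword is the column name up to the first '/'.
    keywords = {c: c.split("/")[0].strip() for c in disease_cols}

    def get_flag(row):
        # Columns answered 'yes' in this row (None / 'no' compare unequal).
        yes_cols = [c for c in disease_cols if row[c] == "yes"]
        if not yes_cols:
            return np.nan
        # Escape keywords so they are matched literally.
        pattern = "|".join(re.escape(keywords[c]) for c in yes_cols)
        header_text = " ".join(str(row[h]) for h in headers)
        return 2 if re.search(pattern, header_text) else 1

    out = df.copy()
    out["flag"] = out.apply(get_flag, axis=1)
    return out


data = [
    ("metabolism", "diabetes", "no", "no", "no"),
    ("heart issue", "heart disease", None, None, None),
    ("obesity", "diabetes", "yes", "no", "no"),
    ("metabolism", "hypertension", "no", "no", "yes"),
    ("heart issue", "heart disease", "no", "no", "yes"),
    ("obesity", "diabetes", "yes", "yes", "no"),
    ("obesity", "diabetes", "no", "no", "yes"),
]
df = pd.DataFrame(
    data,
    columns=["header1", "header2", "diabetes", "obesity",
             "hypertension/high blood pressure"],
)
result = add_flag(df)
print(result)
```

Because the disease columns are discovered from df.columns, this works for any number of yes/no columns, as long as the two header columns are named up front.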