Iterating across multiple columns in a pandas DataFrame and slicing dynamically
TLDR: How can I iterate across all combinations of values of multiple columns in a pandas dataframe without specifying the columns or their values explicitly?
Long Version: I have a pandas dataframe that looks like this, only it has a lot more features (drug dose combinations) than are listed here. Instead of just 3 types of features, it could have something like 70:
> dosage_df
First Score  Last Score  A_dose  B_dose  C_dose
         22          28       1      40     130
         55          11       2      40     130
         15          72       3      40     130
         42          67       1      90     130
         90          74       2      90     130
         87          89       3      90     130
         14          43       1      40     700
         12          61       2      40     700
         41           5       3      40     700
Along with my data frame, I also have a Python dictionary with the relevant ranges for each feature. The keys are the feature names, and the values are lists of the different values each feature can take:
> dict_of_dose_ranges = {'A_dose': [1, 2, 3], 'B_dose': [40, 90], 'C_dose': [130,700]}
For my purposes, I need to generate a particular combination (say A_dose = 1, B_dose = 90, and C_dose = 700), take the matching slice out of my dataframe based on those settings, do the relevant calculations on that smaller subset, and save the results somewhere.
I need to do this for ALL possible combinations of ALL of my features (far more than the 3 shown here, and the set of features will change in the future).
In this case, I could easily pass the dictionary to scikit-learn's ParameterGrid and generate the options:
> from sklearn.grid_search import ParameterGrid
> all_options = list(ParameterGrid(dict_of_dose_ranges))
> all_options
and get:
[{'A_dose': 1, 'B_dose': 40, 'C_dose': 130},
{'A_dose': 1, 'B_dose': 40, 'C_dose': 700},
{'A_dose': 1, 'B_dose': 90, 'C_dose': 130},
{'A_dose': 1, 'B_dose': 90, 'C_dose': 700},
{'A_dose': 2, 'B_dose': 40, 'C_dose': 130},
{'A_dose': 2, 'B_dose': 40, 'C_dose': 700},
{'A_dose': 2, 'B_dose': 90, 'C_dose': 130},
{'A_dose': 2, 'B_dose': 90, 'C_dose': 700},
{'A_dose': 3, 'B_dose': 40, 'C_dose': 130},
{'A_dose': 3, 'B_dose': 40, 'C_dose': 700},
{'A_dose': 3, 'B_dose': 90, 'C_dose': 130},
{'A_dose': 3, 'B_dose': 90, 'C_dose': 700}]
This is where I run into problems:
Problem #1) I can now iterate across all_options, but I'm not sure how to SELECT rows out of my dosage_df for each of the dictionary options (i.e. {'A_dose': 1, 'B_dose': 40, 'C_dose': 130}) WITHOUT writing the conditions out explicitly.
In the past, I could do something like:
dosage_df[(dosage_df.A_dose == 1) & (dosage_df.B_dose == 40) & (dosage_df.C_dose == 130)]
   First Score  Last Score  A_dose  B_dose  C_dose
0           22          28       1      40     130
But now I'm not sure what to put inside the brackets to slice it dynamically:
dosage_df[?????]
Problem #2) When I pass in my full dictionary of features with their respective ranges, I get an error, apparently because there are too many combinations:
from sklearn.grid_search import ParameterGrid
all_options = list(ParameterGrid(dictionary_of_features_and_ranges))
all_options
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
<ipython-input-138-7b73d5e248f5> in <module>()
1 from sklearn.grid_search import ParameterGrid
----> 2 all_options = list(ParameterGrid(dictionary_of_features_and_ranges))
3 all_options
OverflowError: long int too large to convert to int
I tried a number of alternative approaches, including nested while loops, a tree / recursion method from here, and another recursion method from here, but none of them came together. Any help is much appreciated.
You can use itertools.product to generate all possible dosage combinations, and DataFrame.query to do the selection:
from itertools import product

for dosage_comb in product(*dict_of_dose_ranges.values()):
    # Pair each column name with its value in this combination
    dosage_items = zip(dict_of_dose_ranges.keys(), dosage_comb)
    # Build e.g. "A_dose == 1 & B_dose == 40 & C_dose == 130"
    query_str = ' & '.join('{} == {}'.format(*x) for x in dosage_items)
    sub_df = dosage_df.query(query_str)
    # Do Stuff...
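If you would rather not assemble a query string (for example because some column names contain spaces, as 'First Score' does), the same selection can be built as a boolean mask directly from each combination dictionary. This is only a minimal sketch: select_combination is a hypothetical helper name, the sample frame reproduces the rows shown in the question, and the mean of 'Last Score' stands in for whatever calculation you actually need.

```python
from itertools import product

import numpy as np
import pandas as pd

# Sample data copied from the question
dosage_df = pd.DataFrame({
    "First Score": [22, 55, 15, 42, 90, 87, 14, 12, 41],
    "Last Score": [28, 11, 72, 67, 74, 89, 43, 61, 5],
    "A_dose": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "B_dose": [40, 40, 40, 90, 90, 90, 40, 40, 40],
    "C_dose": [130, 130, 130, 130, 130, 130, 700, 700, 700],
})
dict_of_dose_ranges = {"A_dose": [1, 2, 3], "B_dose": [40, 90], "C_dose": [130, 700]}


def select_combination(df, combo):
    """Hypothetical helper: rows of df where every column in combo equals its value."""
    mask = np.ones(len(df), dtype=bool)
    for column, value in combo.items():
        mask &= (df[column] == value).to_numpy()
    return df[mask]


columns = list(dict_of_dose_ranges)
results = {}
# product() is lazy: combinations are produced one at a time, never as a full list
for values in product(*(dict_of_dose_ranges[c] for c in columns)):
    combo = dict(zip(columns, values))
    sub_df = select_combination(dosage_df, combo)
    if sub_df.empty:
        continue  # this combination does not occur in the data
    results[values] = sub_df["Last Score"].mean()  # placeholder calculation
```

Because product yields combinations lazily, nothing here builds the full list of options, which is what the line that raised the OverflowError in the question was doing. With around 70 features the number of combinations may still be far too large to loop over in practice.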
What about using the underlying numpy array and some boolean logic to build an array containing only the rows you want?
import numpy as np
import pandas as pd

dosage_df = pd.DataFrame((np.random.rand(40000, 10) * 100).astype(int))
dict_of_dose_ranges = {3: [10, 11, 12, 13, 15, 20], 4: [20, 22, 23, 24]}
# combined_doses is a boolean array that selects all rows matching the wanted combinations of doses
combined_doses = np.ones(dosage_df.shape[0], dtype=bool)
for item in dict_of_dose_ranges.items():
    # item[0] is the kind of dose
    # item[1] holds the wanted values for that kind of dose
    next_dose = np.zeros(dosage_df.shape[0], dtype=bool)
    # iterate over the wanted values
    for value in item[1]:
        # "logical or" together all rows matching any of the values
        next_dose |= (dosage_df[item[0]] == value).to_numpy()
    # "logical and" across all the kinds of dose
    combined_doses &= next_dose
print(dosage_df[combined_doses])
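The inner loop above amounts to a membership test, which pandas exposes as Series.isin. A possible vectorized variant, followed by a groupby that walks only the dose combinations actually present in the filtered rows, might look like this. The wanted_doses dictionary and the sample frame are assumptions made for illustration, and the mean of 'Last Score' is a placeholder calculation.

```python
import pandas as pd

# Sample data copied from the question
dosage_df = pd.DataFrame({
    "First Score": [22, 55, 15, 42, 90, 87, 14, 12, 41],
    "Last Score": [28, 11, 72, 67, 74, 89, 43, 61, 5],
    "A_dose": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "B_dose": [40, 40, 40, 90, 90, 90, 40, 40, 40],
    "C_dose": [130, 130, 130, 130, 130, 130, 700, 700, 700],
})

# Hypothetical restriction: keep A_dose in {1, 3} and B_dose == 40
wanted_doses = {"A_dose": [1, 3], "B_dose": [40]}

# One boolean column per dose type: True where the value is in the allowed list
allowed = pd.DataFrame({col: dosage_df[col].isin(vals) for col, vals in wanted_doses.items()})
combined_doses = allowed.all(axis=1)  # "logical and" across the dose types
filtered = dosage_df[combined_doses]

# groupby only yields combinations that actually occur in the data
dose_columns = ["A_dose", "B_dose", "C_dose"]
summary = {}
for combo, sub_df in filtered.groupby(dose_columns):
    summary[combo] = sub_df["Last Score"].mean()  # placeholder calculation
```

Grouping on the dose columns sidesteps enumerating every theoretical combination, so the loop runs at most once per distinct row pattern in the frame.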