Iterate through pandas dataframe, select row by condition, when condition true, select a number of other rows, only containing unique values
I have a large (1M+) dataframe, something like
Column A Column B Column C
0 'Aa' 'Ba' 14
1 'Ab' 'Bc' 24
2 'Ab' 'Ba' 24
...
So basically I have a list of string pairs and some number for each, where that number depends only on Column A. What I want to do is: when a row passes a condition on Column C, select N other rows that also pass the condition, containing only unique values.
For example, have the condition Column C > 15, and have N = 5, then for a row that passed the condition:
Column A Column B Column C
78 'Ae' 'Bf' 16
We would get a group of N rows such as:
Column A Column B Column C
78 'Ag' 'Br' 18
111 'Ah' 'Bg' 20
20 'An' 'Bd' 17
19 'Am' 'Bk' 18
301 'Aq' 'Bq' 32
My initial code is a mess. I tried randomly sampling rows until N was reached, checking each against the condition, and building a duplicate dictionary to check whether the values are unique. However, rolling random numbers over intervals several million long, over and over again, proved way too slow.
My second idea was to iterate forward from the condition-passing row, searching for other rows that pass the condition, and again checking them against a duplicate dictionary. This started to be more feasible, but it had the problem that the iteration had to be reset to the beginning of the df whenever the end was reached before N viable rows were found. It still felt quite slow. Like this:
in_data = []
for i in range(len(df)):
    A = df.iloc[i]['A']
    B = df.iloc[i]['B']
    if condition(A):
        in_data.append([A, B])
        dup_dict = {}
        dup_dict[A] = 1
        dup_dict[B] = 1
        j = i
        k = 1
        while j < len(df) and k != N:
            other_A = df.iloc[j]['A']
            other_B = df.iloc[j]['B']
            if (condition(other_A) and
                    other_A not in dup_dict and
                    other_B not in dup_dict):
                dup_dict[other_A] = 1
                dup_dict[other_B] = 1
                in_data.append([other_A, other_B])
                k += 1
            j += 1
            if j == len(df) and k != N:
                j = 0
return in_data
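One way to avoid repeated full scans of the dataframe is to pre-filter the candidate rows once, then start the search just after the triggering row and let `np.roll` handle the wrap-around instead of resetting `j` to 0 by hand. The sketch below is only an illustration of that idea, not the original code: the condition is hardcoded as `C > 15`, the data is made up, and `pick_group` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["Aa", "Ab", "Ac", "Ad", "Ae", "Af"],
    "B": ["Ba", "Bc", "Bd", "Be", "Bf", "Bg"],
    "C": [14, 24, 16, 12, 30, 18],
})
N = 2

mask = df["C"] > 15            # illustrative condition on Column C
candidates = df[mask]          # pre-filtered once, reused per trigger row

def pick_group(i, df, candidates, n):
    """For the row at index i, collect up to n candidate rows whose
    A and B values have not been seen yet, starting just after i and
    wrapping around to the start of the candidate list."""
    seen = {df.at[i, "A"], df.at[i, "B"]}
    # rotate the candidate order so the scan begins right after row i
    pos = candidates.index.searchsorted(i)
    order = np.roll(candidates.index.to_numpy(), -(pos + 1))
    group = []
    for j in order:
        a, b = df.at[j, "A"], df.at[j, "B"]
        if a not in seen and b not in seen:
            seen.update((a, b))
            group.append(j)
            if len(group) == n:
                break
    return df.loc[group]
```

For example, `pick_group(2, df, candidates, N)` scans candidates 4, 5, 1 in that order and returns the first two whose values are all distinct.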
My latest idea was to somehow implement it via apply(), but it started to become way too complicated, as I couldn't figure out how to properly index the df inside the apply() and iterate forward, plus how to do the reset trick.
So, there has to be a more streamlined solution for this. Oh, and the original dataframe is more like ~60M rows long, but it is split and distributed among the available CPU cores via multiprocessing, hence the smaller size per task.
Edit: the condition is dynamic, i.e. Column C is compared to a random number in each check, so it shouldn't be pre-masked.
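Since the threshold changes per check, the boolean mask cannot be hoisted out as a one-time pre-filter; it has to be rebuilt each time, as this minimal sketch shows (the threshold range 10-30 is a made-up placeholder):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["Aa", "Ab", "Ac"],
                   "B": ["Ba", "Bc", "Bd"],
                   "C": [14, 24, 16]})
rng = np.random.default_rng()

# The cut-off is random per check, so the mask is recomputed every time
# rather than being precomputed once for the whole run.
threshold = rng.integers(10, 30)      # hypothetical dynamic threshold
passing = df[df["C"] > threshold]
```

Recomputing a vectorized comparison like this is still far cheaper than row-by-row `iloc` lookups.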
Edit 2: some typos.
You are right, if I have this right:
import pandas as pd

N = 5
data = [
    ["Ag", "Br", 18],
    ["Ah", "Bg", 20],
    ["An", "Bd", 17],
    ["Am", "Bk", 18],
    ["Aq", "Bq", 32],
    ["Aq", "Aq", 16],
]
df = pd.DataFrame(data=data, columns=['A', 'B', 'C'])
temp_df = df[(df.C > 14) & (df.A != df.B)] # e.g. condition_on_c = 14
# get the first row to sample
initial_row_index = temp_df.sample(1, random_state=42).index.values[0]
output = temp_df[temp_df.index != initial_row_index].sample(N, replace=True)
# replace=True means sampling with replacement, so you may get duplicate rows
# (you definitely will if N > len(temp_df) - 1)
output = pd.concat([temp_df.loc[[initial_row_index]], output])
# if N = 5 we get
A B C
1 Ah Bg 20 # initial row
3 Am Bk 18
4 Aq Bq 32
2 An Bd 17
4 Aq Bq 32
4 Aq Bq 32
You can see that the index shown is the original index from the data frame you are sampling, so you can reset it.
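If duplicate rows are not acceptable (the question asks for unique values), one variation, assuming N is at most len(temp_df) - 1, is to sample without replacement and then reset the index. This is a sketch built on the same example data; `random_state` values are arbitrary:

```python
import pandas as pd

N = 3
data = [["Ag", "Br", 18], ["Ah", "Bg", 20], ["An", "Bd", 17],
        ["Am", "Bk", 18], ["Aq", "Bq", 32], ["Aq", "Aq", 16]]
df = pd.DataFrame(data=data, columns=["A", "B", "C"])
temp_df = df[(df.C > 14) & (df.A != df.B)]

initial_row_index = temp_df.sample(1, random_state=42).index.values[0]
# replace=False (the default) guarantees N distinct rows,
# but requires N <= len(temp_df) - 1
others = temp_df.drop(initial_row_index).sample(N, random_state=0)
output = pd.concat([temp_df.loc[[initial_row_index]], others])
output = output.reset_index(drop=True)   # fresh 0..N integer index
```

Note this guarantees distinct *rows*; it does not by itself enforce that all A and B *values* across the group are distinct, which is what the duplicate dictionary in the question handles.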