Iterate through pandas dataframe, select row by condition, when condition true, select a number of other rows, only containing unique values
I have a large (1M+) dataframe, something like
Column A Column B Column C
0 'Aa' 'Ba' 14
1 'Ab' 'Bc' 24
2 'Ab' 'Ba' 24
...
So basically I have a list of string pairs and some number for each, where that number depends only on Column A. What I want to do is: when a row passes a condition on Column C, select N other rows that also pass the condition, containing only unique values.
For example, have the condition Column C > 15, and have N = 5, then for a row that passed the condition:
Column A Column B Column C
78 'Ae' 'Bf' 16
We would get a group of N rows such as:
Column A Column B Column C
78 'Ag' 'Br' 18
111 'Ah' 'Bg' 20
20 'An' 'Bd' 17
19 'Am' 'Bk' 18
301 'Aq' 'Bq' 32
My initial code is a mess. I tried randomly sampling rows until N was reached, checking each against the condition, and building a duplicate dictionary to check whether the values are unique. However, rolling random numbers over intervals several million long, over and over again, proved way too slow.
My second idea was to iterate forward from the condition-passing row, searching for other rows that pass the condition, and again checking them against a duplicate dictionary. This started to be more feasible, but it had the problem that the iteration had to be reset to the beginning of the df whenever the end was reached before N viable rows were found. It still felt quite slow. Like this:
in_data = []
for i in range(len(df)):
    A = df.iloc[i]['A']
    B = df.iloc[i]['B']
    if condition(A):
        in_data.append([A, B])
        dup_dict = {}
        dup_dict[A] = 1
        dup_dict[B] = 1
        j = i
        k = 1
        while j < len(df) and k != N:
            other_A = df.iloc[j]['A']
            other_B = df.iloc[j]['B']
            if (condition(other_A) and
                    other_A not in dup_dict and
                    other_B not in dup_dict):
                dup_dict[other_A] = 1
                dup_dict[other_B] = 1
                in_data.append([other_A, other_B])
                k += 1
            j += 1
            if j == len(df) and k != N:
                j = 0
return in_data
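One way to avoid repeated full scans of the dataframe is to pre-filter the candidate rows once, then start the search just after the triggering row and let `np.roll` handle the wrap-around instead of resetting `j` to 0 by hand. The sketch below is only an illustration of that idea, not the original code: the condition is hardcoded as `C > 15`, the data is made up, and `pick_group` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["Aa", "Ab", "Ac", "Ad", "Ae", "Af"],
    "B": ["Ba", "Bc", "Bd", "Be", "Bf", "Bg"],
    "C": [14, 24, 16, 12, 30, 18],
})
N = 2

mask = df["C"] > 15            # illustrative condition on Column C
candidates = df[mask]          # pre-filtered once, reused per trigger row

def pick_group(i, df, candidates, n):
    """For the row at index i, collect up to n candidate rows whose
    A and B values have not been seen yet, starting just after i and
    wrapping around to the start of the candidate list."""
    seen = {df.at[i, "A"], df.at[i, "B"]}
    # rotate the candidate order so the scan begins right after row i
    pos = candidates.index.searchsorted(i)
    order = np.roll(candidates.index.to_numpy(), -(pos + 1))
    group = []
    for j in order:
        a, b = df.at[j, "A"], df.at[j, "B"]
        if a not in seen and b not in seen:
            seen.update((a, b))
            group.append(j)
            if len(group) == n:
                break
    return df.loc[group]
```

For example, `pick_group(2, df, candidates, N)` scans candidates 4, 5, 1 in that order and returns the first two whose values are all distinct.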
My latest idea was to somehow implement it via apply(), but it started to become way too complicated, as I couldn't figure out how to properly index the df inside the apply() and iterate forward, plus how to do the reset trick.
So, there has to be a more streamlined solution for this. Oh, and the original dataframe is more like ~60M rows long, but it is split and distributed among the available CPU cores via multiprocessing, hence the smaller size per task.
Edit: the condition is dynamic, i.e. Column C is compared to a random number in each check, so it shouldn't be pre-masked.
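Since the threshold changes per check, the boolean mask cannot be hoisted out as a one-time pre-filter; it has to be rebuilt each time, as this minimal sketch shows (the threshold range 10-30 is a made-up placeholder):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["Aa", "Ab", "Ac"],
                   "B": ["Ba", "Bc", "Bd"],
                   "C": [14, 24, 16]})
rng = np.random.default_rng()

# The cut-off is random per check, so the mask is recomputed every time
# rather than being precomputed once for the whole run.
threshold = rng.integers(10, 30)      # hypothetical dynamic threshold
passing = df[df["C"] > threshold]
```

Recomputing a vectorized comparison like this is still far cheaper than row-by-row `iloc` lookups.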
Edit 2: some typos.
You are right, if I have this right:
import pandas as pd

N = 5
data = [
    ["Ag", "Br", 18],
    ["Ah", "Bg", 20],
    ["An", "Bd", 17],
    ["Am", "Bk", 18],
    ["Aq", "Bq", 32],
    ["Aq", "Aq", 16],
]
df = pd.DataFrame(data=data, columns=['A', 'B', 'C'])
temp_df = df[(df.C > 14) & (df.A != df.B)] # e.g. condition_on_c = 14
# get the first row to sample
initial_row_index = temp_df.sample(1, random_state=42).index.values[0]
output = temp_df[temp_df.index != initial_row_index].sample(N, replace=True)
# replace=True means sampling with replacement, so you may get duplicate rows
# (you definitely will if N > len(temp_df) - 1)
output = pd.concat([temp_df.loc[[initial_row_index]], output])
# if N = 5 we get
A B C
1 Ah Bg 20 # initial row
3 Am Bk 18
4 Aq Bq 32
2 An Bd 17
4 Aq Bq 32
4 Aq Bq 32
You can see that the index shown is the original index from the data frame you are sampling, so you can reset it.
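If duplicate rows are not acceptable (the question asks for unique values), one variation, assuming N is at most len(temp_df) - 1, is to sample without replacement and then reset the index. This is a sketch built on the same example data; `random_state` values are arbitrary:

```python
import pandas as pd

N = 3
data = [["Ag", "Br", 18], ["Ah", "Bg", 20], ["An", "Bd", 17],
        ["Am", "Bk", 18], ["Aq", "Bq", 32], ["Aq", "Aq", 16]]
df = pd.DataFrame(data=data, columns=["A", "B", "C"])
temp_df = df[(df.C > 14) & (df.A != df.B)]

initial_row_index = temp_df.sample(1, random_state=42).index.values[0]
# replace=False (the default) guarantees N distinct rows,
# but requires N <= len(temp_df) - 1
others = temp_df.drop(initial_row_index).sample(N, random_state=0)
output = pd.concat([temp_df.loc[[initial_row_index]], others])
output = output.reset_index(drop=True)   # fresh 0..N integer index
```

Note this guarantees distinct *rows*; it does not by itself enforce that all A and B *values* across the group are distinct, which is what the duplicate dictionary in the question handles.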