
Iterate through a pandas dataframe, select rows by condition, and when the condition is true, sample a number of other rows containing only unique values

I have a large (1M+) dataframe, something like

   Column A   Column B   Column C
0      'Aa'       'Ba'         14
1      'Ab'       'Bc'         24
2      'Ab'       'Ba'         24
...

So basically I have a list of string pairs and some number for each, where that number depends only on Column A. What I want to do is:

  1. Iterate over the rows of the dataframe
  2. For each row, check Column C with a condition
  3. If condition passed, select that row
  4. Sample N other rows so that all in all we have N+1 rows for each condition-passed row
  5. BUT sample them in such a way that each N+1 group contains only rows where the condition passes as well, and no string from Column A or B repeats
  6. Duplicates across different N+1 groups don't matter, nor does it matter that the resulting list of N+1 groups will be much longer than the initial df. My task requires that all entries are processed and passed on in N+1 groups that have no duplicates.

For example, take the condition Column C > 15 and N = 5; then for a row that passed the condition:

    Column A    Column B   Column C
78       'Ae'        'Bf'        16

we could, for example, end up with this group of N rows:

   Column A    Column B   Column C
98       'Ag'        'Br'        18
111      'Ah'        'Bg'        20
20       'An'        'Bd'        17
19       'Am'        'Bk'        18
301      'Aq'        'Bq'        32

My initial code is a mess: I tried randomly sampling rows until N was reached, checking each sample against the condition, and building a duplicate dictionary to check whether its strings were unique. However, rolling random numbers over several-million-long intervals over and over again proved to be way too slow.
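
Roughly, that first idea amounted to something like this (a minimal reconstruction rather than my actual code; it assumes a condition() predicate like the one used further down):

    import random

    def random_group(df, i, N):
        # the condition-passed row at position i seeds the group
        row = df.iloc[i]
        dup_dict = {row['A']: 1, row['B']: 1}
        group = [[row['A'], row['B']]]
        # keep rolling random positions until N extra unique rows are found;
        # this never terminates if the df holds too few viable rows
        while len(group) < N + 1:
            other = df.iloc[random.randrange(len(df))]
            if (condition(other['A']) and
                    other['A'] not in dup_dict and
                    other['B'] not in dup_dict):
                dup_dict[other['A']] = 1
                dup_dict[other['B']] = 1
                group.append([other['A'], other['B']])
        return group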

My second idea was to iterate forward from the condition-passed row, searching for other rows that pass the condition, and once again checking them against a duplicate dictionary. This proved more feasible, but it had the problem that the iteration had to be reset to the beginning of the df when the end of the df was reached before N viable rows were found. It still felt quite slow. Like this:

    in_data = []

    for i in range(len(df)):

        A = df.iloc[i]['A']
        B = df.iloc[i]['B']

        if condition(A):

            in_data.append([A, B])
            dup_dict = {A: 1, B: 1}  # strings already used in this group
            j = i + 1                # scan forward from the next row
            k = 1                    # rows collected so far, incl. this one

            # collect N more rows so the group ends up with N + 1 in total
            while j != i and k != N + 1:

                if j == len(df):     # hit the end: reset to the beginning
                    j = 0
                    continue

                other_A = df.iloc[j]['A']
                other_B = df.iloc[j]['B']

                if (condition(other_A) and
                        other_A not in dup_dict and
                        other_B not in dup_dict):

                    dup_dict[other_A] = 1
                    dup_dict[other_B] = 1
                    in_data.append([other_A, other_B])
                    k += 1

                j += 1

    return in_data

My latest idea was to somehow implement it via apply(), but it started to become way too complicated, as I couldn't figure out how to properly index the df from inside apply() and iterate forward, plus how to do the reset trick there.
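
For what it's worth, I imagine it would have to look something like this sketch (untested; it assumes the same condition() as above and a unique index):

    def build_group(row, N=5):
        # scan forward from this row's position, wrapping around once,
        # until N extra unique rows are collected
        if not condition(row['A']):
            return None
        dup_dict = {row['A']: 1, row['B']: 1}
        group = [[row['A'], row['B']]]
        start = df.index.get_loc(row.name)  # positional index of this row
        j = (start + 1) % len(df)
        while j != start and len(group) < N + 1:
            other = df.iloc[j]
            if (condition(other['A']) and
                    other['A'] not in dup_dict and
                    other['B'] not in dup_dict):
                dup_dict[other['A']] = 1
                dup_dict[other['B']] = 1
                group.append([other['A'], other['B']])
            j = (j + 1) % len(df)  # the modulo does the reset trick
        return group

    groups = df.apply(build_group, axis=1).dropna()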

So, there has to be a more streamlined solution for this. Oh, and the original dataframe is more like ~60M rows long, but it is split and distributed among the available CPU cores via multiprocessing, hence the smaller size per task.

Edit: the condition is dynamic, i.e. Column C is compared to a random number in each check, so it shouldn't be pre-masked.
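
To make that concrete, the check is along these lines (the bounds are made up for illustration; since C depends only on Column A, a lookup built once per worker suffices):

    import random

    c_of_a = dict(zip(df['A'], df['C']))  # C depends only on A

    def condition(a):
        # the threshold is re-rolled on every single check, so a fixed
        # boolean mask over Column C can't be computed up front
        return c_of_a[a] > random.uniform(15, 25)  # illustrative bounds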

Edit 2: some typos.

If I have this right, given:

    import pandas as pd

    data = [
        ["Ag", "Br", 18],
        ["Ah", "Bg", 20],
        ["An", "Bd", 17],
        ["Am", "Bk", 18],
        ["Aq", "Bq", 32],
        ["Aq", "Aq", 16],  # A == B, filtered out below
    ]
    df = pd.DataFrame(data=data, columns=['A', 'B', 'C'])

    temp_df = df[(df.C > 14) & (df.A != df.B)]  # e.g. condition_on_c = 14

    # get the first row to sample
    N = 5
    initial_row_index = temp_df.sample(1, random_state=42).index.values[0]
    output = temp_df[temp_df.index != initial_row_index].sample(N, replace=True)
    # replace=True means sampling with replacement, so you may get duplicate
    # rows (certainly if N > len(temp_df) - 1)
    output = pd.concat([temp_df.loc[[initial_row_index]], output])

    # with N = 5 we get:
    #     A   B   C
    # 1  Ah  Bg  20   <- initial row
    # 3  Am  Bk  18
    # 4  Aq  Bq  32
    # 2  An  Bd  17
    # 4  Aq  Bq  32
    # 4  Aq  Bq  32

You can see that the index shown is the original index in the dataframe you are sampling from, so you can reset this index if needed.
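
Note that the sampled group above can still repeat Column A/B strings. If you also need the no-repeat rule from your question, one possible extension (a sketch, with the rejection strategy as my own assumption) is to draw candidates one at a time and reject any whose strings were already used:

    def sample_unique_group(temp_df, N):
        # seed the group with one condition-passing row
        group = temp_df.sample(1)
        seen = set(group.iloc[0][['A', 'B']])
        candidates = temp_df.drop(group.index)
        # draw without replacement, rejecting rows that reuse a string
        while len(group) < N + 1 and not candidates.empty:
            pick = candidates.sample(1)
            candidates = candidates.drop(pick.index)
            a, b = pick.iloc[0]['A'], pick.iloc[0]['B']
            if a not in seen and b not in seen:
                seen.update((a, b))
                group = pd.concat([group, pick])
        return group.reset_index(drop=True)

    # e.g. sample_unique_group(temp_df, N=5); fewer than N + 1 rows come
    # back when temp_df simply doesn't contain enough unique strings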
