I have a large (1M+ rows) dataframe, something like:
Column A Column B Column C
0 'Aa' 'Ba' 14
1 'Ab' 'Bc' 24
2 'Ab' 'Ba' 24
...
So basically I have a list of string pairs and a number for each, where that number depends only on Column A. What I want to do is: for a given row that passes some condition on Column C, collect N other rows that also pass the condition and whose Column A and Column B values have not been used yet.
For example, with the condition Column C > 15 and N = 5, for a row that passed the condition:
Column A Column B Column C
78 'Ae' 'Bf' 16
we would want a group of N rows such as:
Column A Column B Column C
78 'Ag' 'Br' 18
111 'Ah' 'Bg' 20
20 'An' 'Bd' 17
19 'Am' 'Bk' 18
301 'Aq' 'Bq' 32
My initial code was a mess. I tried randomly sampling rows until N was reached, checking each against the condition, and keeping a duplicate dictionary to ensure the values were unique. However, rolling random numbers over intervals several million entries long, over and over, proved far too slow.
My second idea was to iterate forward from the condition-passing row, searching for other rows that pass the condition, again checking them against a duplicate dictionary. This was more feasible, but it had the problem that the iteration had to wrap around to the beginning of the df when it reached the end without having found N viable rows. It still felt quite slow. Like this:
in_data = []
for i in range(len(df)):
    A = df.iloc[i]['A']
    B = df.iloc[i]['B']
    if condition(A):
        in_data.append([A, B])
        dup_dict = {A: 1, B: 1}
        j = i
        k = 1
        while k != N:
            other_A = df.iloc[j]['A']
            other_B = df.iloc[j]['B']
            if (condition(other_A) and
                    other_A not in dup_dict and
                    other_B not in dup_dict):
                dup_dict[other_A] = 1
                dup_dict[other_B] = 1
                in_data.append([other_A, other_B])
                k += 1
            j += 1
            if j == len(df) and k != N:
                j = 0  # wrap around to the start of the df
return in_data
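For comparison, here is a sketch of how the same idea can avoid the row-by-row iloc loop (an assumption on my part that the condition is a simple threshold on Column C; `pick_group`, `threshold`, and `seed` are names I made up). The condition is checked with one vectorized comparison, the passing rows are shuffled once, and then a single pass collects rows with unseen A/B values:

```python
import numpy as np
import pandas as pd

def pick_group(df, threshold, N, seed=None):
    """Sketch: pick up to N condition-passing rows whose A and B values
    are pairwise distinct. `threshold` stands in for the dynamic condition."""
    rng = np.random.default_rng(seed)
    passed = df[df["C"] > threshold]          # one vectorized check, no iloc loop
    # one shuffle replaces repeated random index rolls
    shuffled = passed.sample(frac=1, random_state=rng)
    seen, rows = set(), []
    for a, b in zip(shuffled["A"], shuffled["B"]):
        if a not in seen and b not in seen:
            seen.update((a, b))
            rows.append([a, b])
            if len(rows) == N:
                break
    return rows
```

If fewer than N viable rows exist, this returns what it found rather than looping forever, which the wrap-around version above can do.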
My latest idea was to somehow implement it via apply(), but that became far too complicated: I couldn't figure out how to properly index the df inside apply() and iterate forward, let alone how to do the wrap-around trick.
So, there has to be a more streamlined solution for this. Oh, and the original dataframe is more like ~60M rows, but it is split and distributed among the available CPU cores via multiprocessing, hence the smaller size per task.
Edit: the condition is dynamic, i.e. Column C is compared to a random number in each check, so the dataframe shouldn't be pre-masked.
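Even with a dynamic condition, recomputing the mask per check is cheap, since only the random threshold changes and the comparison itself stays a single vectorized pass over Column C. A minimal sketch (the uniform draw over Column C's range is my assumption, as the question doesn't say how the random number is generated):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["Aa", "Ab", "Ac"],
                   "B": ["Ba", "Bb", "Bc"],
                   "C": [14, 24, 16]})

rng = np.random.default_rng(42)
c = df["C"].to_numpy()
# per-check random threshold (assumed uniform over C's observed range)
threshold = rng.uniform(c.min(), c.max())
mask = c > threshold                    # O(n) vectorized comparison
candidate_idx = np.flatnonzero(mask)    # positions of condition-passing rows
```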
Edit 2: fixed some typos.
If I have this right, given:
data = [
["Ag", "Br", 18],
["Ah", "Bg", 20],
["An", "Bd", 17],
["Am", "Bk", 18],
["Aq", "Bq", 32],
"Aq", "Aq", 16],
]
df = pd.DataFrame(data=data, columns=['A', 'B', 'C'])
temp_df = df[(df.C > 14) & (df.A != df.B)] # e.g. condition_on_c = 14
# get the first row to sample
initial_row_index = temp_df.sample(1, random_state=42).index.values[0]
N = 5
output = temp_df[temp_df.index != initial_row_index].sample(N, replace=True)
# replace=True means sampling with replacement, so you may get duplicate rows
# (you definitely will if N > len(temp_df) - 1)
output = pd.concat([temp_df.loc[[initial_row_index]], output])
# if N = 5 we get
A B C
1 Ah Bg 20 # initial row
3 Am Bk 18
4 Aq Bq 32
2 An Bd 17
4 Aq Bq 32
4 Aq Bq 32
You can see the output keeps the original index from the dataframe you are sampling, so you can reset this index.