
best way to generate rows based on other rows in pandas for a big file

I have a csv with around 8 million rows, something like this:

a b c
0 2 3

and I want to generate new rows from it based on the second and third values, so that I get:

a b c
0 2 3
0 3 3
0 4 3
0 5 3

which is basically just iterating through every row (one row in this example) and creating a new row with a value of b+i, where i ranges from 0 to the value of c, including c itself. The c column is irrelevant after the rows have been generated. The problem is that the file has millions of rows, and expanding them can generate many more, so how can I do it efficiently? (Loops are too slow for that amount of data.) Thanks.
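
For reference, the kind of per-row loop I'm trying to avoid looks roughly like this (a sketch using the column names above; 'data.csv' is just a placeholder path):

import pandas as pd

df = pd.read_csv('data.csv')  # placeholder path for the 8-million-row csv

rows = []
for row in df.itertuples(index=False):
    # one new row per offset i = 0..c, adding the offset to b
    for i in range(row.c + 1):
        rows.append((row.a, row.b + i, row.c))
result = pd.DataFrame(rows, columns=['a', 'b', 'c'])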

You can reindex on the repeated index:

# repeat each row c+1 times (offsets 0..c), keeping the original index
out = df.loc[df.index.repeat(df['c'] + 1)]
# cumcount gives 0, 1, ..., c within each original row; add it to b
out['b'] += out.groupby(level=0).cumcount()
print(out)

Output (reset index if you want):

   a  b  c
0  0  2  3
0  0  3  3
0  0  4  3
0  0  5  3
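
For example, if you want a clean index and no longer need the c column afterwards, something like this should work on the out frame above:

out = out.reset_index(drop=True).drop(columns='c')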

Note that since you blow up your data by the values in the c column and you already have 8 million rows, the resulting dataframe can easily be too big on its own.
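
If the expanded result is too large to hold in memory, one option is to apply the same trick chunk by chunk and stream the result to disk. A rough sketch, assuming the input is 'data.csv' and a chunk size of one million rows (both placeholders):

import pandas as pd

for i, chunk in enumerate(pd.read_csv('data.csv', chunksize=1_000_000)):
    # same repeat/cumcount trick, applied per chunk
    out = chunk.loc[chunk.index.repeat(chunk['c'] + 1)].copy()
    out['b'] += out.groupby(level=0).cumcount()
    # append each expanded chunk to the output file; write the header only once
    out.drop(columns='c').to_csv('expanded.csv', mode='a', header=(i == 0), index=False)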
