
best way to generate rows based on other rows in pandas for a big file

I have a csv with around 8 million rows, something like this:

a b c
0 2 3

and I want to generate new rows from it based on the second and third values, so that I get:

a b c
0 2 3
0 3 3
0 4 3
0 5 3

which is basically just iterating through every row (one row in this example) and creating a new row with a value of b+i, where i ranges from 0 to the value of c, including c itself. The c column is irrelevant after the rows have been generated. The problem is that the file has millions of rows, and expanding them can generate many more, so how can I do it efficiently? (Loops are too slow for that amount of data.) Thanks.
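
For reference, the kind of per-row loop I'm trying to avoid looks roughly like this (a sketch using the column names above; 'data.csv' is just a placeholder path):

import pandas as pd

df = pd.read_csv('data.csv')  # placeholder path for the 8-million-row csv

rows = []
for row in df.itertuples(index=False):
    # one new row per offset i = 0..c, adding the offset to b
    for i in range(row.c + 1):
        rows.append((row.a, row.b + i, row.c))
result = pd.DataFrame(rows, columns=['a', 'b', 'c'])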

You can reindex on the repeated index:

# repeat each row c+1 times (offsets 0..c), keeping the original index
out = df.loc[df.index.repeat(df['c'] + 1)]
# cumcount gives 0, 1, ..., c within each original row; add it to b
out['b'] += out.groupby(level=0).cumcount()
print(out)

Output (reset index if you want):

   a  b  c
0  0  2  3
0  0  3  3
0  0  4  3
0  0  5  3
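
For example, if you want a clean index and no longer need the c column afterwards, something like this should work on the out frame above:

out = out.reset_index(drop=True).drop(columns='c')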

Note that since you blow up your data by the values in the c column and you already have 8 million rows, the resulting dataframe can easily be too big on its own.
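
If the expanded result is too large to hold in memory, one option is to apply the same trick chunk by chunk and stream the result to disk. A rough sketch, assuming the input is 'data.csv' and a chunk size of one million rows (both placeholders):

import pandas as pd

for i, chunk in enumerate(pd.read_csv('data.csv', chunksize=1_000_000)):
    # same repeat/cumcount trick, applied per chunk
    out = chunk.loc[chunk.index.repeat(chunk['c'] + 1)].copy()
    out['b'] += out.groupby(level=0).cumcount()
    # append each expanded chunk to the output file; write the header only once
    out.drop(columns='c').to_csv('expanded.csv', mode='a', header=(i == 0), index=False)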
