
How to multiply dataframe rows by an array based on an 'attribute' of the row?

I have a big pandas dataframe DF which takes up nearly all my RAM. Each of its index values is one of the following: 'c1', 'c2', 'c3'.

    0   1   2  ...  n
i            
c1  1   8  15
c2  2   9  16
c1  3  10  17
c3  4  11  18
c2  5  12  19
c1  6  13  20
c3  7  14  21
...

I also have a dictionary whose keys 'match' the dataframe's index labels. The values are lists/arrays of length n (equal to the number of dataframe columns).

d = {'c1':[10,10,10,...],'c2':[100,100,100,...],'c3':[1000,1000,1000,...]}

all(len(v) == n for v in d.values())  # True

What I want to do is a simple dataframe multiplication based on an 'attribute' of the row -- in this case its index, but it could be a value from a column. The result would look like this:

       0      1      2
i                     
c1    10     80    150
c2   200    900   1600
c1    30    100    170
c3  4000  11000  18000
c2   500   1200   1900
c1    60    130    200
c3  7000  14000  21000

I don't want to build a second dataframe DF2 just so I can do DF * DF2: I don't have enough RAM for it, and such a dataframe seems pointless to me, since it would just repeat the same arrays/lists over and over (please correct me if I'm wrong).
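The best workaround I can think of is to multiply in place, label by label, so that only one small vector is ever broadcast at a time -- a rough sketch, using the sample data at the end of this question:

for key, vec in d.items():
    # select all rows carrying this label and scale them by the label's vector;
    # assumes every key of d actually occurs in DF's index
    df.loc[key] *= vec

That works, but it loops in Python over the labels, and I'd prefer something more idiomatic.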

I thought about having a second dataframe like this:

       0     1     2
c1    10    10    10
c2   100   100   100
c3  1000  1000  1000

But when I do DF * DF2, the index gets regrouped and sorted, and I don't see a good way to avoid it. Result:

          0         1         2
c1      100       800      1500
c1      300      1000      1700
c1      600      1300      2000
c2    20000     90000    160000
c2    50000    120000    190000
c3  4000000  11000000  18000000
c3  7000000  14000000  21000000
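As far as I can tell, the reordering comes from pandas aligning the two frames on the duplicated index labels. Dropping down to a raw array does keep the original order, as in the sketch below, but it materializes exactly the repeated values I am trying to avoid:

# reindex repeats df2's rows in DF's row order; to_numpy() drops the index,
# so the multiplication is purely positional and nothing gets re-sorted
df * df2.reindex(df.index).to_numpy()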

If you think there's another way to approach this problem (different data structures, libraries, etc.), that would also be appreciated.

Code for the samples:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {0: [1, 2, 3, 4, 5, 6, 7],
     1: [8, 9, 10, 11, 12, 13, 14],
     2: [15, 16, 17, 18, 19, 20, 21]},
    index=pd.Index(['c1', 'c2', 'c1', 'c3', 'c2', 'c1', 'c3'], dtype='object', name='i'))
d = {'c1': [10, 10, 10], 'c2': [100, 100, 100], 'c3': [1000, 1000, 1000]}
df2 = pd.DataFrame.from_dict(d, orient='index')

The idea is to deduplicate the index with DataFrame.set_index (appending a positional level), multiply with DataFrame.mul on the first level, and finally remove the helper level with DataFrame.reset_index:

df1 = (df.set_index(np.arange(len(df)), append=True)  # append a positional level -> index becomes unique
         .mul(df2, level=0)                            # align df2 on the label level only
         .reset_index(level=1, drop=True))             # drop the helper level again
print(df1)
       0      1      2
i                     
c1    10     80    150
c2   200    900   1600
c1    30    100    170
c3  4000  11000  18000
c2   500   1200   1900
c1    60    130    200
c3  7000  14000  21000
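This preserves the original order because the appended np.arange level makes every row of the MultiIndex unique, so mul(df2, level=0) matches each row to the single corresponding row of df2 and nothing has to be regrouped or sorted. The helper level costs one machine integer per row, which is small next to a full repeated copy of the vectors in d.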

As an old dinosaur who had to work on systems where memory was scarce, I would do what we used to do when loading everything into memory was not an option: process the dataframe in chunks, saving to disk whatever does not need to stay in live memory.

So I would:

  • save the large dataframe to a csv file on disk and purge it from memory
  • open that csv file to read it by chunks
  • open a new csv file to save the result (with an empty copy of the dataframe)
  • then, chunk by chunk:
    • read a chunk
    • process it
    • append it to the new csv file

Certainly not as efficient as buying more memory, but robust and scalable.
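A minimal sketch of that loop, assuming hypothetical file names big.csv and result.csv and the dictionary d from the question:

import pandas as pd

d = {'c1': [10, 10, 10], 'c2': [100, 100, 100], 'c3': [1000, 1000, 1000]}

first = True
# stream the big frame from disk 100,000 rows at a time instead of loading it whole
for chunk in pd.read_csv('big.csv', index_col=0, chunksize=100_000):
    # multiply every row by the vector matching its index label, one label at a time
    for key in chunk.index.unique():
        chunk.loc[key] *= d[key]
    # append the processed chunk; write the header only for the first chunk
    chunk.to_csv('result.csv', mode='w' if first else 'a', header=first)
    first = False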
