
How to "multiply" dataframes with each other in Python?

I have two dataframes in Python/pandas that look as follows:

df1 =
[[01/01/2001, 01/04/2004, 12/12/2007],
[02/07/2002, NA, NA],
[04/08/2012, 02/11/2018, NA]]

df2 =
[[1, 3, 2],
[2, NA, NA],
[3, 1, NA]]

I would like to create a third dataframe that looks as follows:

df3 =
[[01/01/2001, 01/04/2004, 01/04/2004, 01/04/2004, 12/12/2007, 12/12/2007],
[02/07/2002, 02/07/2002, NA, NA, NA, NA],
[04/08/2012, 04/08/2012, 04/08/2012, 02/11/2018, NA, NA]]

In other words, the second df gives the number of times that I want to copy the corresponding value of the first df into the third one. For lack of a better word, I called this "multiplying" in the question, even though I realize that this is probably wrong.

Does someone know of a way to do this efficiently? My approach would be to work with loops and lists for each row, but I'm guessing that there should be a much more efficient way of doing this in Python. Many thanks for your help, and sorry again for probably using bad terminology here.

A fully vectorized solution does not follow from this logic, since each row of the result can end up with a different length, but we can still get some benefit from NumPy and Python's built-in list comprehensions.

LOGIC:
1. Use np.repeat, one of NumPy's array manipulation routines, to repeat the values along row i of df1. The repeats argument of np.repeat is the matching row of df2 (with NA filled in, see step 3).

np.repeat(df1.iloc[i,:], df2_u.iloc[i,:].astype('i4'))

2. The repeats argument must be of integer type, so each df2 row is converted with astype('i4'), which is the np.int32 datatype, inside the list comprehension.

df2_u.iloc[i,:].astype('i4')

3. Lastly, the NA entries need handling, because a missing count cannot be converted to an integer. Fill them with 0 so that the corresponding df1 values are repeated zero times, and store the result as df2_u:

df2_u = df2.fillna(0)
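
The three steps can be tried on a single row before applying them to the whole dataframe. This is only a minimal sketch: the names values, counts, repeats and row are made up for illustration, and the row is converted to a plain NumPy array first.

```python
import numpy as np
import pandas as pd

# One hypothetical row of values and its repeat counts (one count is missing).
values = pd.Series(['01/01/2001', '01/04/2004', '12/12/2007'])
counts = pd.Series([1, 3, np.nan])

# Step 3, then step 2: a missing count becomes 0, then everything is cast to int32.
repeats = counts.fillna(0).astype('i4')

# Step 1: repeat each value by its own count; a count of 0 drops the value.
row = np.repeat(values.to_numpy(), repeats.to_numpy())
print(list(row))  # ['01/01/2001', '01/04/2004', '01/04/2004', '01/04/2004']
```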

This generalizes because of how pandas builds a DataFrame from a list of lists: if the nested lists have unequal sizes, the shorter rows are padded with missing values so that every row has the same width.
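
The padding behaviour can be checked on its own with a small ragged list. The data here is made up purely for illustration:

```python
import pandas as pd

# Hypothetical ragged input: rows of length 3, 1 and 2.
ragged = [['a', 'b', 'c'], ['d'], ['e', 'f']]

padded = pd.DataFrame(ragged)

# Every row is widened to the longest row; the gaps count as missing.
print(padded.shape)                    # (3, 3)
print(int(padded.isna().sum().sum()))  # 3
```

Checking with isna() rather than comparing against np.nan is the safer choice here, since how the filler value is displayed may depend on the column dtype.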

CODE:

import pandas as pd
import numpy as np

df1 = pd.DataFrame([['01/01/2001', '01/04/2004', '12/12/2007'],
                    ['02/07/2002', np.nan, np.nan],
                    ['04/08/2012', '02/11/2018', np.nan]])

df2 = pd.DataFrame([[1, 3, 2], [2, np.nan, np.nan], [3, 1, np.nan]])

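# Replace missing counts with 0 so those cells are repeated zero times.
df2_u = df2.fillna(0)

# Repeat each row of df1 by the matching row of counts; rows of unequal
# length are padded with missing values by the DataFrame constructor.
df3 = pd.DataFrame([list(np.repeat(df1.iloc[i, :], df2_u.iloc[i, :].astype('i4')))
                    for i in range(df1.shape[0])])
print(df3)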

OUTPUT:

[['01/01/2001' '01/04/2004' '01/04/2004' '01/04/2004' '12/12/2007' '12/12/2007']
 ['02/07/2002' '02/07/2002' nan nan nan nan]
 ['04/08/2012' '04/08/2012' '04/08/2012' '02/11/2018' nan nan]]
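
If this is needed in more than one place, the same logic could be wrapped in a small helper. The following is only a sketch under that assumption: repeat_rows is a hypothetical name, not a pandas or NumPy function, and it works on the underlying NumPy arrays instead of iloc rows.

```python
import numpy as np
import pandas as pd


def repeat_rows(values: pd.DataFrame, counts: pd.DataFrame) -> pd.DataFrame:
    """Repeat each cell of `values` by the matching cell of `counts`, row by row."""
    # Missing counts become 0, so the matching cells are dropped.
    repeats = counts.fillna(0).astype(int).to_numpy()
    cells = values.to_numpy()
    rows = [np.repeat(cells[i], repeats[i]).tolist() for i in range(len(cells))]
    # Rows of unequal length are padded with missing values by the constructor.
    return pd.DataFrame(rows, index=values.index)


df1 = pd.DataFrame([['01/01/2001', '01/04/2004', '12/12/2007'],
                    ['02/07/2002', np.nan, np.nan],
                    ['04/08/2012', '02/11/2018', np.nan]])
df2 = pd.DataFrame([[1, 3, 2], [2, np.nan, np.nan], [3, 1, np.nan]])

df3 = repeat_rows(df1, df2)
print(df3.shape)  # (3, 6)
```

This is still a Python-level loop over rows, so it should not be expected to be much faster than the list comprehension above; it mainly keeps the fillna and integer conversion in one place.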
