I have two dataframes in Python/pandas that look as follows:
df1 =
[[01/01/2001, 01/04/2004, 12/12/2007],
[02/07/2002, NA, NA],
[04/08/2012, 02/11/2018, NA]]
df2 =
[[1, 3, 2],
[2, NA, NA],
[3, 1, NA]]
I would like to create a third dataframe that looks as follows:
df3 =
[[01/01/2001, 01/04/2004, 01/04/2004, 01/04/2004, 12/12/2007, 12/12/2007],
[02/07/2002, 02/07/2002, NA, NA, NA, NA],
[04/08/2012, 04/08/2012, 04/08/2012, 02/11/2018, NA, NA]]
In other words, the second df gives the number of times that I want to copy the corresponding value of the first df into the third one. For lack of a better word, I called this "multiplying" in the question, even though I realize that this is probably wrong.
Does someone know of a way to do this efficiently? My approach would be to work with loops and lists for each row, but I'm guessing that there should be a much more efficient way of doing this in Python. Many thanks for your help and sorry again for probably using bad terminology here.
A fully vectorized solution does not follow from this logic, but we can still take advantage of numpy and Python's built-in list comprehension.
LOGIC:
1. Use np.repeat, one of numpy's array manipulation routines, to repeat the values along each row of df1. The repeats argument of np.repeat is the corresponding row of df2 (after the NA handling described in step 3).
np.repeat(df1.iloc[i,:], df2_u.iloc[i,:].astype('i4'))
2. The repeats argument must be of integer type, so inside the list comprehension each df2 row is converted with astype('i4'), which is the np.int32 datatype.
df2_u.iloc[i,:].astype('i4')
3. Lastly, how to deal with the np.nan values: update df2 to df2_u, where every NA is filled with 0, so that the matching df1 cells are repeated zero times:
df2_u = df2.fillna(0)
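To make the three steps concrete, here is a minimal sketch that traces them on a single row (the second row of the question's data). The variable names values, counts, counts_filled and counts_int are only illustrative and do not appear in the solution itself.

```python
import numpy as np
import pandas as pd

# One row of values and its repeat counts (second row of the question's data).
values = pd.Series(['02/07/2002', np.nan, np.nan])
counts = pd.Series([2, np.nan, np.nan])

# Step 3: a missing count is treated as "repeat zero times".
counts_filled = counts.fillna(0)

# Step 2: the NaN entries made the counts float, so cast them to int32.
counts_int = counts_filled.astype('i4')

# Step 1: repeat each value by its own count.
repeated = np.repeat(values.to_numpy(), counts_int.to_numpy())
print(list(repeated))  # ['02/07/2002', '02/07/2002']
```

The two NaN cells of the row disappear here because their count is 0; the NA padding in the final result comes from building the DataFrame, not from np.repeat.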
Generalized solution: the logic works because passing a list of lists of unequal size to pd.DataFrame results in a DataFrame object whose shorter rows are broadcast to the length of the longest row, with every undefined value filled in as missing (np.nan).
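The padding behaviour can be checked in isolation. This is a small sketch with made-up data (rows and padded are illustrative names); it only counts the missing cells with isna(), since whether a padded cell is displayed as NaN or None can depend on the column dtype.

```python
import pandas as pd

# Three rows of unequal length: 3, 1 and 2 items.
rows = [['a', 'b', 'c'], ['d'], ['e', 'f']]

# pd.DataFrame pads the shorter rows up to the longest one.
padded = pd.DataFrame(rows)

print(padded.shape)                    # (3, 3)
print(int(padded.isna().sum().sum()))  # 3 padded (missing) cells
```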
CODE:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([['01/01/2001', '01/04/2004', '12/12/2007'],
['02/07/2002', np.nan, np.nan],
['04/08/2012', '02/11/2018', np.nan]])
df2 = pd.DataFrame([[1, 3, 2], [2, np.nan, np.nan], [3, 1, np.nan]])
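df1_sub = df1
# Missing counts become 0, so the matching df1 cells are repeated zero times.
df2_sub = df2.fillna(0)
# Repeat each row of df1 by the integer counts in the matching row of df2;
# pd.DataFrame then pads the shorter rows with missing values.
df3 = pd.DataFrame([list(np.repeat(df1_sub.iloc[i, :], df2_sub.iloc[i, :].astype('i4')))
                    for i in range(df1_sub.shape[0])])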
print(df3)
OUTPUT:
[['01/01/2001' '01/04/2004' '01/04/2004' '01/04/2004' '12/12/2007' '12/12/2007']
['02/07/2002' '02/07/2002' nan nan nan nan]
['04/08/2012' '04/08/2012' '04/08/2012' '02/11/2018' nan nan]]
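If you need this more than once, the same idea can be wrapped in a helper. This is only a sketch under the assumption that both frames have the same shape; repeat_by_counts is a hypothetical name, and it works on the underlying numpy arrays instead of iloc, which is a variation on the code above rather than a different algorithm.

```python
import numpy as np
import pandas as pd

def repeat_by_counts(values: pd.DataFrame, counts: pd.DataFrame) -> pd.DataFrame:
    """Repeat each cell of `values` by the matching cell of `counts`, row by row.

    Hypothetical helper; assumes both frames have the same shape.
    """
    # Missing counts mean "repeat zero times"; np.repeat needs integer counts.
    counts_int = counts.fillna(0).astype(int).to_numpy()
    vals = values.to_numpy()
    # One np.repeat call per row; rows may end up with different lengths.
    rows = [list(np.repeat(v, c)) for v, c in zip(vals, counts_int)]
    # pd.DataFrame pads the shorter rows with missing values.
    return pd.DataFrame(rows)

df1 = pd.DataFrame([['01/01/2001', '01/04/2004', '12/12/2007'],
                    ['02/07/2002', np.nan, np.nan],
                    ['04/08/2012', '02/11/2018', np.nan]])
df2 = pd.DataFrame([[1, 3, 2], [2, np.nan, np.nan], [3, 1, np.nan]])

df3 = repeat_by_counts(df1, df2)
print(df3.shape)  # (3, 6)
```

The loop over rows is still there, so this is no more vectorized than the original; it just avoids the repeated iloc lookups.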