
How to merge two split dataframes back together in the original order using Python

I have an input dataset in .csv format that I am trying to load into Python for some data analysis. A sample is given below:

(df)

cus_ID hrs   mins    col4   risk
 1      2      7      1      NA
 2      3      5      1      NA
 1      1      4      6      2
 7      8      9      1      1 
 12     13     2      34     NA
 4      5      6      1      7
 16     7      10     22     NA
 12     10     3      12     9

As you can see, column 5 (risk) has NA values. I filtered on these, so that all the rows that have NA values were removed from this dataframe and copied into a new dataframe. The two resulting dataframes are shown below:

Dataframe With NA Values (df1):

   cus_ID hrs   mins    col4   risk
    1      2      7      1      NA
    2      3      5      1      NA
    12     13     2      34     NA
    16     7      10     22     NA

DataFrame without NA Values (df2):

    cus_ID hrs   mins    col4   risk
     1      1      4      6      2
     7      8      9      1      1 
     4      5      6      1      7
     12     10     3      12     9

Here I have done some manipulations and updated the NaN values. I need to put the newly updated risk (column 5) values back in the same order as before. For example, if my NaN values are updated to 2.3, 3.5, 10, 4 (these values are in no particular order; they are generated as decimals or whole numbers), I want the updated rows of df1 to be combined with the dataframe without NA values (df2), and I need the resulting dataframe to be in the same order as my initial dataframe:

   cus_ID hrs   mins    col4   risk
    1      2      7      1      2.3(NA Value replaced)
    2      3      5      1      3.5(NA Value replaced)
    1      1      4      6      2
    7      8      9      1      1 
    12     13     2      34     10 (NA Value replaced)
    4      5      6      1      7
    16     7      10     22     4 (NA Value replaced)
    12     10     3      12     9

Note: I want these updated rows to be placed in the same order as in my initial dataframe. The main reason I am splitting is that I am using some manipulations to predict the NA values. I have provided only a basic sample of the dataframe; mine has thousands of records and many other attributes, and there are many NA values spread randomly through the risk column. I have computed a replacement for every NA value. Now I am looking for a way to replace the NA values in my initial dataset with these calculated values. Should I do some kind of concat, or should I compare df2 with my initial dataframe df and use some groupby options (on customer ID, hours, or other attributes) to replace the NA values? I want to implement this using Python pandas. Could someone help me with the code?

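You can use concat with sort_index. Filtering keeps the original index labels on both halves, so sorting by index after concatenating restores the original order: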

import pandas as pd

print(df)
   cus_ID  hrs  mins  col4  risk
0       1    2     7     1   NaN
1       2    3     5     1   NaN
2       1    1     4     6   2.0
3       7    8     9     1   1.0
4      12   13     2    34   NaN
5       4    5     6     1   7.0
6      16    7    10    22   NaN
7      12   10     3    12   9.0

df1 = df[df.risk.isnull()].copy()
print(df1)
   cus_ID  hrs  mins  col4  risk
0       1    2     7     1   NaN
1       2    3     5     1   NaN
4      12   13     2    34   NaN
6      16    7    10    22   NaN

df2 = df[df.risk.notnull()].copy()
print(df2)
   cus_ID  hrs  mins  col4  risk
2       1    1     4     6   2.0
3       7    8     9     1   1.0
5       4    5     6     1   7.0
7      12   10     3    12   9.0

# assign the new values to the risk column (one value per row of df1)
df1['risk'] = [2.3, 3.5, 10, 4]
print(df1)
   cus_ID  hrs  mins  col4  risk
0       1    2     7     1   2.3
1       2    3     5     1   3.5
4      12   13     2    34  10.0
6      16    7    10    22   4.0

# recombine the two halves and restore the original row order
print(pd.concat([df1, df2]).sort_index())
   cus_ID  hrs  mins  col4  risk
0       1    2     7     1   2.3
1       2    3     5     1   3.5
2       1    1     4     6   2.0
3       7    8     9     1   1.0
4      12   13     2    34  10.0
5       4    5     6     1   7.0
6      16    7    10    22   4.0
7      12   10     3    12   9.0
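
For reference, the steps above can be put together as one self-contained Python 3 script. This is a minimal sketch, assuming the sample data from the question and using the values 2.3, 3.5, 10, 4 as placeholders for the predicted risks. The names merged and merged_by_label are illustrative and not part of the original answer. The second variant reorders by the original index labels instead of sorting, which should also cover the case where the original index is not already in sorted order (it still has to be unique).

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "cus_ID": [1, 2, 1, 7, 12, 4, 16, 12],
    "hrs":    [2, 3, 1, 8, 13, 5, 7, 10],
    "mins":   [7, 5, 4, 9, 2, 6, 10, 3],
    "col4":   [1, 1, 6, 1, 34, 1, 22, 12],
    "risk":   [np.nan, np.nan, 2, 1, np.nan, 7, np.nan, 9],
})

# Boolean indexing keeps the original index labels on both halves.
df1 = df[df["risk"].isna()].copy()
df2 = df[df["risk"].notna()].copy()

# Placeholder values standing in for the predicted risks (assumption).
df1["risk"] = [2.3, 3.5, 10, 4]

# Stack the halves, then restore the original row order by sorting the index.
merged = pd.concat([df1, df2]).sort_index()

# Variant: reorder by the original index labels explicitly.
merged_by_label = pd.concat([df1, df2]).loc[df.index]

print(merged)
```

Both variants rely on the index labels surviving the split, so they would not work as written if either half had its index reset in between.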

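You can do this without splitting the dataframe, by assigning directly to the rows where the column is null. Here np.arange(3) stands in for the computed values; its length must match the number of null rows:

import numpy as np
import pandas as pd

df.loc[pd.isnull(df.col5), 'col5'] = np.arange(3)

This produces the result you are looking for: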

In [89]: df
Out[89]:
   col1  col2  col3  col4  col5
0     1     0     0     1     0
1     2     3     5     1     1
2     1     1     4     6     2
3     7     8     9     1     1
4    12    13     0    34     5
5     4     5     6     1     2
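
Applied to the risk column from the question, the same idea might look like the sketch below. It assumes a reduced version of the sample data and uses 2.3, 3.5, 10, 4 as placeholder predictions. The names mask, predicted, filled and filled_by_label are illustrative. The second variant attaches the index labels of the missing rows to the predictions and lets fillna align on them, which may be safer when the predictions are not guaranteed to be in row order.

```python
import numpy as np
import pandas as pd

# Reduced sample data from the question (two columns are enough here).
df = pd.DataFrame({
    "cus_ID": [1, 2, 1, 7, 12, 4, 16, 12],
    "risk":   [np.nan, np.nan, 2, 1, np.nan, 7, np.nan, 9],
})

# True for the rows whose risk is missing.
mask = df["risk"].isna()

# Placeholder predictions (assumption): one per missing row, in row order.
predicted = [2.3, 3.5, 10, 4]
if len(predicted) != mask.sum():
    raise ValueError("need exactly one predicted value per missing row")

# Variant 1: positional assignment into the masked rows of a copy.
filled = df.copy()
filled.loc[mask, "risk"] = predicted

# Variant 2: label the predictions with the index of the missing rows,
# then let fillna align them by index label.
predicted_series = pd.Series(predicted, index=df.index[mask])
filled_by_label = df.assign(risk=df["risk"].fillna(predicted_series))

print(filled)
```

Neither variant changes the row order, so no re-sorting step is needed afterwards.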
