简体   繁体   中英

Merge columns values within the same dataframe in pandas

Hello I have a dataframe such as

  >>> tab
   COL1                     COL2 COL3 COL4    COL5
0    G1        S_-__1Canis_lupus    A    B    SEQ1
1    G1          S_+__2Elpah_bis    C    D  SEQ4.1
2    G1       S_-__3Felis_cattus  NaN  NaN  SEQA.B
3    G1       S_-__4Felis_cattus  NaN  NaN  SEQA.B
4    G1  S-BICs_-__5Felis_cattus    E    F  SEQA.A
5    G1       S_+__6Felis_cattus  NaN  NaN  SEQA.A
6    G1       S_-__7Felis_cattus  NaN  NaN  SEQA.A
7    G1  S-BICs_-__8Felis_cattus    L    P  SEQA.B
8    G1       S_-__9Felis_cattus    K    L  SEQA.A
9    G2      S_+__10Felis_cattus    M    N  SEQA.A
10   G2       S_-__11Lupus_lupus  NaN  NaN    SEQ3

and the idea is within each COL1 groups to focus on value in COL2 that contain the pattern: -BICs

then fill all COL3 and COL4 values that are NaN with the same COL5 value as the one that contain the -BICs pattern

exemple:

in line4 S-BICs_-__5Felis_cattus has a -BICs pattern, its COL5 = SEQA.A

Within G1

S_-__3Felis_cattus and S_-__4Felis_cattus have NaN values in COL3 and COL4 and have the same COL5 value. Then I put the COL3 and COL4 values of S-BICs_-__5Felis_cattus :

   >>> tab
   COL1                     COL2 COL3 COL4    COL5
0    G1        S_-__1Canis_lupus    A    B    SEQ1
1    G1          S_+__2Elpah_bis    C    D  SEQ4.1
2    G1       S_-__3Felis_cattus  NaN  NaN  SEQA.B
3    G1       S_-__4Felis_cattus  NaN  NaN  SEQA.B
4    G1  S-BICs_-__5Felis_cattus    E    F  SEQA.A
5    G1       S_+__6Felis_cattus    E    F  SEQA.A
6    G1       S_-__7Felis_cattus    E    F  SEQA.A
7    G1  S-BICs_-__8Felis_cattus    L    P  SEQA.B
8    G1       S_-__9Felis_cattus    K    L  SEQA.A
9    G2      S_+__10Felis_cattus    M    N  SEQA.A
10   G2       S_-__11Lupus_lupus  NaN  NaN    SEQ3 

and same for the S-BICs_-__8Felis_cattus where NaN are transformed to L and P`

   >>> tab
   COL1                     COL2 COL3 COL4    COL5
0    G1        S_-__1Canis_lupus    A    B    SEQ1
1    G1          S_+__2Elpah_bis    C    D  SEQ4.1
2    G1       S_-__3Felis_cattus    L    P  SEQA.B
3    G1       S_-__4Felis_cattus    L    P  SEQA.B
4    G1  S-BICs_-__5Felis_cattus    E    F  SEQA.A
5    G1       S_+__6Felis_cattus    E    F  SEQA.A
6    G1       S_-__7Felis_cattus    E    F  SEQA.A
7    G1  S-BICs_-__8Felis_cattus    L    P  SEQA.B
8    G1       S_-__9Felis_cattus    K    L  SEQA.A
9    G2      S_+__10Felis_cattus    M    N  SEQA.A
10   G2       S_-__11Lupus_lupus  NaN  NaN    SEQ3 

You can do it with where the col2 str.contains the pattern to repalce all the rows that does not contains the pattern with nan. Then groupby.transform by col1 and col5 and get first (to get the non nan value if any). Finally, fillna the original data like:

tab[['COL3','COL4']] = (tab[['COL3','COL4']]
                           .fillna(tab[['COL3','COL4']]
                                      .where(tab['COL2'].str.contains('-BICs'))
                                      .groupby([tab['COL1'], tab['COL5']])
                                      .transform('first'))
                       )
print (tab)
   COL1                     COL2 COL3 COL4    COL5
0    G1        S_-__1Canis_lupus    A    B    SEQ1
1    G1          S_+__2Elpah_bis    C    D  SEQ4.1
2    G1       S_-__3Felis_cattus    L    P  SEQA.B
3    G1       S_-__4Felis_cattus    L    P  SEQA.B
4    G1  S-BICs_-__5Felis_cattus    E    F  SEQA.A
5    G1       S_+__6Felis_cattus    E    F  SEQA.A
6    G1       S_-__7Felis_cattus    E    F  SEQA.A
7    G1  S-BICs_-__8Felis_cattus    L    P  SEQA.B
8    G1       S_-__9Felis_cattus    K    L  SEQA.A
9    G2      S_+__10Felis_cattus    M    N  SEQA.A
10   G2       S_-__11Lupus_lupus  NaN  NaN    SEQ3

If I understood correctly, what about something like:

reference = tab.iloc[tab["COL2"].str.contains("-BICs"),:].rename(columns = {"COL2":"R_COL2","COL3":"R_COL3","COL4":"R_COL4"})
table = pd.merge(table,reference, how='left')

table.iat[table["COL3"].isnull(), 2] = table.iloc[table["COL3"].isnull(), 6]
table.iat[table["COL3"].isnull(), 3] = table.iloc[table["COL3"].isnull(), 7]

table = table[["COL1","COL2","COL3","COL4","COL5"]]

I didn't try it, but the idea would be to do something similar.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM