Merge column values within the same dataframe in pandas
Hello, I have a dataframe such as:
>>> tab
COL1 COL2 COL3 COL4 COL5
0 G1 S_-__1Canis_lupus A B SEQ1
1 G1 S_+__2Elpah_bis C D SEQ4.1
2 G1 S_-__3Felis_cattus NaN NaN SEQA.B
3 G1 S_-__4Felis_cattus NaN NaN SEQA.B
4 G1 S-BICs_-__5Felis_cattus E F SEQA.A
5 G1 S_+__6Felis_cattus NaN NaN SEQA.A
6 G1 S_-__7Felis_cattus NaN NaN SEQA.A
7 G1 S-BICs_-__8Felis_cattus L P SEQA.B
8 G1 S_-__9Felis_cattus K L SEQA.A
9 G2 S_+__10Felis_cattus M N SEQA.A
10 G2 S_-__11Lupus_lupus NaN NaN SEQ3
The idea is, within each COL1 group, to focus on the COL2 values that contain the pattern -BICs. Then, for every row of that group whose COL3 and COL4 values are NaN and whose COL5 value is the same as that of a -BICs row, fill COL3 and COL4 with the values from that -BICs row.
Example:
In row 4, S-BICs_-__5Felis_cattus has the -BICs pattern, and its COL5 is SEQA.A.
Within G1, S_+__6Felis_cattus and S_-__7Felis_cattus have NaN values in COL3 and COL4 and have the same COL5 value (SEQA.A), so I give them the COL3 and COL4 values of S-BICs_-__5Felis_cattus:
>>> tab
COL1 COL2 COL3 COL4 COL5
0 G1 S_-__1Canis_lupus A B SEQ1
1 G1 S_+__2Elpah_bis C D SEQ4.1
2 G1 S_-__3Felis_cattus NaN NaN SEQA.B
3 G1 S_-__4Felis_cattus NaN NaN SEQA.B
4 G1 S-BICs_-__5Felis_cattus E F SEQA.A
5 G1 S_+__6Felis_cattus E F SEQA.A
6 G1 S_-__7Felis_cattus E F SEQA.A
7 G1 S-BICs_-__8Felis_cattus L P SEQA.B
8 G1 S_-__9Felis_cattus K L SEQA.A
9 G2 S_+__10Felis_cattus M N SEQA.A
10 G2 S_-__11Lupus_lupus NaN NaN SEQ3
The same goes for S-BICs_-__8Felis_cattus (COL5 = SEQA.B), where the NaN values of S_-__3Felis_cattus and S_-__4Felis_cattus are replaced with L and P:
>>> tab
COL1 COL2 COL3 COL4 COL5
0 G1 S_-__1Canis_lupus A B SEQ1
1 G1 S_+__2Elpah_bis C D SEQ4.1
2 G1 S_-__3Felis_cattus L P SEQA.B
3 G1 S_-__4Felis_cattus L P SEQA.B
4 G1 S-BICs_-__5Felis_cattus E F SEQA.A
5 G1 S_+__6Felis_cattus E F SEQA.A
6 G1 S_-__7Felis_cattus E F SEQA.A
7 G1 S-BICs_-__8Felis_cattus L P SEQA.B
8 G1 S_-__9Felis_cattus K L SEQA.A
9 G2 S_+__10Felis_cattus M N SEQA.A
10 G2 S_-__11Lupus_lupus NaN NaN SEQ3
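For anyone who wants to reproduce the problem, the sample frame can be rebuilt roughly as follows. This is only a sketch: the values are copied from the printed tables above, and it assumes the NaN cells are real missing values (np.nan) rather than the string "NaN".

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; NaN cells are assumed to be real missing values
tab = pd.DataFrame({
    "COL1": ["G1"] * 9 + ["G2"] * 2,
    "COL2": [
        "S_-__1Canis_lupus", "S_+__2Elpah_bis", "S_-__3Felis_cattus",
        "S_-__4Felis_cattus", "S-BICs_-__5Felis_cattus", "S_+__6Felis_cattus",
        "S_-__7Felis_cattus", "S-BICs_-__8Felis_cattus", "S_-__9Felis_cattus",
        "S_+__10Felis_cattus", "S_-__11Lupus_lupus",
    ],
    "COL3": ["A", "C", np.nan, np.nan, "E", np.nan, np.nan, "L", "K", "M", np.nan],
    "COL4": ["B", "D", np.nan, np.nan, "F", np.nan, np.nan, "P", "L", "N", np.nan],
    "COL5": ["SEQ1", "SEQ4.1", "SEQA.B", "SEQA.B", "SEQA.A", "SEQA.A",
             "SEQA.A", "SEQA.B", "SEQA.A", "SEQA.A", "SEQ3"],
})

print(tab)
```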
You can do it with where: use str.contains on COL2 to replace the COL3 and COL4 values of every row that does not contain the pattern with NaN. Then use groupby.transform, grouping by COL1 and COL5, with first to get the non-NaN value of each group (if any). Finally, fillna the original data with the result, like:
tab[['COL3','COL4']] = (
    tab[['COL3','COL4']]
    .fillna(
        tab[['COL3','COL4']]
        .where(tab['COL2'].str.contains('-BICs'))
        .groupby([tab['COL1'], tab['COL5']])
        .transform('first')
    )
)
print(tab)
COL1 COL2 COL3 COL4 COL5
0 G1 S_-__1Canis_lupus A B SEQ1
1 G1 S_+__2Elpah_bis C D SEQ4.1
2 G1 S_-__3Felis_cattus L P SEQA.B
3 G1 S_-__4Felis_cattus L P SEQA.B
4 G1 S-BICs_-__5Felis_cattus E F SEQA.A
5 G1 S_+__6Felis_cattus E F SEQA.A
6 G1 S_-__7Felis_cattus E F SEQA.A
7 G1 S-BICs_-__8Felis_cattus L P SEQA.B
8 G1 S_-__9Felis_cattus K L SEQA.A
9 G2 S_+__10Felis_cattus M N SEQA.A
10 G2 S_-__11Lupus_lupus NaN NaN SEQ3
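To see what each part of that chained expression contributes, it can help to split it into intermediate variables. The following is a sketch under the assumption that the data matches the sample from the question; the names cols, is_ref, ref_only, ref_per_row and filled are illustrative and not part of the answer.

```python
import numpy as np
import pandas as pd

# Sample data from the question (NaN cells assumed to be real missing values)
tab = pd.DataFrame({
    "COL1": ["G1"] * 9 + ["G2"] * 2,
    "COL2": [
        "S_-__1Canis_lupus", "S_+__2Elpah_bis", "S_-__3Felis_cattus",
        "S_-__4Felis_cattus", "S-BICs_-__5Felis_cattus", "S_+__6Felis_cattus",
        "S_-__7Felis_cattus", "S-BICs_-__8Felis_cattus", "S_-__9Felis_cattus",
        "S_+__10Felis_cattus", "S_-__11Lupus_lupus",
    ],
    "COL3": ["A", "C", np.nan, np.nan, "E", np.nan, np.nan, "L", "K", "M", np.nan],
    "COL4": ["B", "D", np.nan, np.nan, "F", np.nan, np.nan, "P", "L", "N", np.nan],
    "COL5": ["SEQ1", "SEQ4.1", "SEQA.B", "SEQA.B", "SEQA.A", "SEQA.A",
             "SEQA.A", "SEQA.B", "SEQA.A", "SEQA.A", "SEQ3"],
})

cols = ["COL3", "COL4"]

# Boolean mask: True on the rows whose COL2 contains the pattern
is_ref = tab["COL2"].str.contains("-BICs")

# Step 1: keep COL3/COL4 only on the -BICs rows, NaN everywhere else
ref_only = tab[cols].where(is_ref)

# Step 2: within each (COL1, COL5) group, broadcast the first non-NaN
# reference value to every row; groups without a -BICs row stay NaN
ref_per_row = ref_only.groupby([tab["COL1"], tab["COL5"]]).transform("first")

# Step 3: fill only the missing cells of the original columns
filled = tab[cols].fillna(ref_per_row)

print(filled)
```

Because fillna only touches cells that are already missing, rows such as S_-__9Felis_cattus (K, L) keep their own values even though they share COL1 and COL5 with a -BICs row.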
If I understood correctly, what about something like:
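import pandas as pd
# Rows matching the pattern supply the reference values (assumes one such row per COL1/COL5 pair)
reference = tab.loc[tab["COL2"].str.contains("-BICs"), ["COL1", "COL5", "COL3", "COL4"]].rename(columns={"COL3": "R_COL3", "COL4": "R_COL4"})
# Left merge on COL1 and COL5 keeps every original row and attaches the reference values
table = pd.merge(tab, reference, how="left", on=["COL1", "COL5"])
# Fill only the missing values from the reference columns
table["COL3"] = table["COL3"].fillna(table["R_COL3"])
table["COL4"] = table["COL4"].fillna(table["R_COL4"])
table = table[["COL1", "COL2", "COL3", "COL4", "COL5"]]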
I didn't try it, but the idea would be to do something similar.
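One thing to watch with a merge-based approach: if two -BICs rows share the same COL1 and COL5, a left merge will repeat every matching row once per reference row. The sketch below uses a hypothetical three-row frame (demo, not the question's data) to show the effect, and one possible guard using drop_duplicates, which keeps the first reference row per key pair.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame: two -BICs rows share COL1="G1" and COL5="SEQA.A"
demo = pd.DataFrame({
    "COL1": ["G1", "G1", "G1"],
    "COL2": ["S-BICs_-__1x", "S-BICs_-__2x", "S_-__3x"],
    "COL3": ["E", "L", np.nan],
    "COL4": ["F", "P", np.nan],
    "COL5": ["SEQA.A", "SEQA.A", "SEQA.A"],
})

keys = ["COL1", "COL5"]
reference = (
    demo.loc[demo["COL2"].str.contains("-BICs"), keys + ["COL3", "COL4"]]
    .rename(columns={"COL3": "R_COL3", "COL4": "R_COL4"})
)

# Without a guard, each of the 3 rows matches both reference rows
naive = demo.merge(reference, how="left", on=keys)

# Keeping one reference row per key pair preserves the original row count
safe = demo.merge(reference.drop_duplicates(keys), how="left", on=keys)
safe["COL3"] = safe["COL3"].fillna(safe["R_COL3"])
safe["COL4"] = safe["COL4"].fillna(safe["R_COL4"])
safe = safe[["COL1", "COL2", "COL3", "COL4", "COL5"]]

print(len(naive), len(safe))
print(safe)
```

Which reference row should win when several share a key is a design decision the question does not settle; keeping the first one simply mirrors what transform('first') does in the groupby approach.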