Comparing different pandas dataframes by columns with tolerance variation
I have 3 pandas dataframes like these:
#0
A C G T uA uC uG uT cmA cmC cmG cmT
seq_1_0 47.0 47.0 54.0 52.0 100.978723 100.957447 100.370370 99.788462 5147.0 5144.0 5055.0 4968.0
seq_1_50 47.0 47.0 54.0 52.0 101.829787 101.680851 99.092593 99.692308 5279.0 5256.0 4864.0 4953.0
seq_2_0 47.0 47.0 54.0 52.0 100.978723 100.957447 100.370370 99.788462 5147.0 5144.0 5055.0 4968.0
seq_2_50 47.0 47.0 54.0 52.0 101.468085 101.425532 99.000000 100.346154 5223.0 5216.0 4850.0 5052.0
seq_3_0 47.0 47.0 54.0 52.0 100.212766 99.680851 100.870370 101.115385 5030.0 4952.0 5131.0 5169.0
seq_3_50 46.0 47.0 53.0 54.0 100.173913 100.978723 100.924528 99.944444 5026.0 5148.0 5139.0 4990.0
seq_4_0 45.0 47.0 54.0 54.0 99.044444 99.000000 101.407407 102.111111 4856.0 4851.0 5214.0 5323.0
seq_4_50 47.0 47.0 53.0 53.0 101.872340 104.382979 97.849057 98.490566 5285.0 5686.0 4684.0 4776.0
seq_5_0 54.0 34.0 37.0 75.0 90.462963 91.647059 90.756757 116.546667 3700.0 3848.0 3737.0 7915.0
seq_5_50 48.0 33.0 37.0 82.0 94.937500 113.636364 113.162162 92.756098 4277.0 7337.0 7245.0 3990.0
seq_6_0 60.0 50.0 48.0 42.0 98.500000 93.900000 106.125000 104.785714 4777.0 4139.0 5976.0 5752.0
seq_6_50 59.0 46.0 52.0 43.0 98.338983 98.826087 102.615385 102.697674 4754.0 4825.0 5402.0 5415.0
#1
A C G T uA uC uG uT cmA cmC cmG cmT
seq_1_0 47.0 47.0 54.0 52.0 100.978723 100.957447 100.370370 99.788462 5147.0 5144.0 5055.0 4968.0
seq_1_50 47.0 47.0 54.0 52.0 101.829787 101.680851 99.092593 99.692308 5279.0 5256.0 4864.0 4953.0
seq_2_0 47.0 47.0 54.0 52.0 100.978723 100.957447 100.370370 99.788462 5147.0 5144.0 5055.0 4968.0
seq_2_50 47.0 47.0 54.0 52.0 101.468085 101.425532 99.000000 100.346154 5223.0 5216.0 4850.0 5052.0
seq_3_0 47.0 47.0 54.0 52.0 100.212766 99.680851 100.870370 101.115385 5030.0 4952.0 5131.0 5169.0
seq_3_50 46.0 47.0 53.0 54.0 100.173913 100.978723 100.924528 99.944444 5026.0 5148.0 5139.0 4990.0
seq_4_0 45.0 47.0 54.0 54.0 99.044444 99.000000 101.407407 102.111111 4856.0 4851.0 5214.0 5323.0
seq_4_50 47.0 47.0 53.0 53.0 101.872340 104.382979 97.849057 98.490566 5285.0 5686.0 4684.0 4776.0
seq_5_0 54.0 34.0 37.0 75.0 90.462963 91.647059 90.756757 116.546667 3700.0 3848.0 3737.0 7915.0
seq_5_50 48.0 33.0 37.0 82.0 94.937500 113.636364 113.162162 92.756098 4277.0 7337.0 7245.0 3990.0
#2
A C G T uA uC uG uT cmA cmC cmG cmT
seq_1_0 48.0 48.0 53.0 51.0 100.291667 99.208333 101.943396 100.411765 5042.0 4882.0 5297.0 5062.0
seq_1_50 48.0 47.0 54.0 51.0 100.083333 101.680851 99.092593 101.294118 5012.0 5256.0 4864.0 5196.0
seq_2_0 47.0 47.0 54.0 52.0 100.978723 100.957447 100.370370 99.788462 5147.0 5144.0 5055.0 4968.0
seq_2_50 47.0 47.0 54.0 52.0 101.468085 101.425532 99.000000 100.346154 5223.0 5216.0 4850.0 5052.0
seq_3_0 50.0 47.0 53.0 50.0 98.980000 99.680851 101.490566 101.740000 4847.0 4952.0 5226.0 5265.0
seq_3_50 49.0 47.0 52.0 52.0 95.857143 100.978723 102.519231 102.423077 4403.0 5148.0 5387.0 5371.0
I want to compare all the columns of the first dataframe (#0) with the other 2 dataframes (#1 and #2), to identify which indexes have different column values (e.g. the indexes seq_6_0 and seq_6_50 are present in dataframe #0 and absent in the other two dataframes).
But I also want to set a tolerance for each column, so that values from different dataframes that fall within it are treated as equal. For example, the index seq_1_0 of dataframe #0 has these values:
A C G T uA uC uG uT cmA cmC cmG cmT
47.0 47.0 54.0 52.0 100.978723 100.957447 100.370370 99.788462 5147.0 5144.0 5055.0 4968.0
while the index seq_1_0 of dataframe #2 has:
A C G T uA uC uG uT cmA cmC cmG cmT
48.0 48.0 53.0 51.0 100.291667 99.208333 101.943396 100.411765 5042.0 4882.0 5297.0 5062.0
So I want to set a different tolerance for each column, e.g. for columns ["A","C","T","G"] I need a tolerance of 90% between compared values, but for the other columns I need a different percentage between compared values.
Is there any pandas function that I can use to do this?
Best,
Use np.isclose, which allows you to precisely control the absolute and relative tolerance of the comparison.
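As a small illustration of how the two tolerances behave (the numbers are chosen only for this sketch, some borrowed from the question's data): NumPy documents the test as |a - b| <= atol + rtol * |b|, so the second argument acts as the reference value and the comparison is not symmetric.

```python
import numpy as np

# Relative tolerance only: the test is |a - b| <= rtol * |b|, so b is the reference.
fwd = np.isclose(50.0, 100.0, rtol=0.9, atol=0)   # 50 <= 0.9 * 100 = 90
rev = np.isclose(100.0, 50.0, rtol=0.9, atol=0)   # 50 >  0.9 * 50  = 45

# Values from the question's A column: 47 vs 48 is far inside a 90% band.
a_col = np.isclose(47.0, 48.0, rtol=0.9, atol=0)

# Absolute tolerance only: the test is |a - b| <= atol.
abs_ok = np.isclose(100.978723, 100.978729, rtol=0, atol=1e-5)
abs_bad = np.isclose(100.978723, 100.291667, rtol=0, atol=1e-5)

print(fwd, rev, a_col, abs_ok, abs_bad)
```

Note that rtol=0.9 is a very loose criterion: a value is only flagged when it differs from the reference by more than 90% of the reference.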
I assume that you only want to compare rows whose labels exist in both dataframes; rows that exist in one but not the other are ignored. Also, since you use a relative criterion for A, C, G, T, compare(df0, df1) is not the same as compare(df1, df0): the function treats the second parameter as the reference value. This is consistent with how np.isclose works.
import numpy as np
import pandas as pd

def compare(dfa, dfb):
    # Base column names; the other groups are these with a 'u' or 'cm' prefix
    s = pd.Series(['A','C','G','T'])
    # Keep only the row labels present in both dataframes
    tmp = dfa.join(dfb, how='inner', lsuffix='_a', rsuffix='_b')

    # The A, C, G, T columns: |a - b| <= 0.9 * |b|, i.e. within 90% of dfb
    lhs = tmp[s + '_a'].values
    rhs = tmp[s + '_b'].values
    compare1 = np.isclose(lhs, rhs, atol=0, rtol=0.9)

    # The uA, uC, uG, uT columns: within 1e-5
    lhs = tmp['u' + s + '_a'].values
    rhs = tmp['u' + s + '_b'].values
    compare2 = np.isclose(lhs, rhs, atol=1e-5, rtol=0)

    # The cmA, cmC, cmG, cmT columns: within 1e-3
    lhs = tmp['cm' + s + '_a'].values
    rhs = tmp['cm' + s + '_b'].values
    compare3 = np.isclose(lhs, rhs, atol=1e-3, rtol=0)

    # Assemble the result: one boolean per (row, column), True if within tolerance
    data = np.concatenate([compare1, compare2, compare3], axis=1)
    cols = pd.concat([s, 'u'+s, 'cm'+s])
    result = pd.DataFrame(data, columns=cols, index=tmp.index)

    return result

compare(df0, df2)
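
To go from the boolean table to the list of row labels that differ, one option is to reduce across columns. The sketch below uses a small hand-made frame as a stand-in for the output of compare (the values are illustrative, not computed from the question's data):

```python
import pandas as pd

# Toy stand-in for the boolean table returned by compare();
# True means "within tolerance" for that row and column.
result = pd.DataFrame(
    {"A": [True, True, False], "uA": [True, False, False]},
    index=["seq_1_0", "seq_2_0", "seq_3_0"],
)

# A row counts as different if at least one of its columns is outside tolerance.
row_ok = result.all(axis=1)
differing = result.index[~row_ok].tolist()
print(differing)
```

Here seq_1_0 passes on every column, while seq_2_0 and seq_3_0 each fail on at least one, so only the latter two are reported.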
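For an easy visualization of the result:
def highlight_false(cell):
    # Leave cells within tolerance unstyled; mark the others in yellow
    return '' if cell else 'background-color: yellow'

result = compare(df0,df2)
# Note: pandas 2.1+ deprecates Styler.applymap in favor of Styler.map
result.style.applymap(highlight_false)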
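
Because the inner join drops rows that exist in only one dataframe, labels such as seq_6_0 never appear in the result. A minimal sketch for listing them separately, assuming the row labels are the dataframe index as in the question (the toy frames below carry only one column for brevity):

```python
import pandas as pd

# Toy frames; only the row labels matter for this check.
df0 = pd.DataFrame({"A": [47.0, 47.0, 60.0]},
                   index=["seq_1_0", "seq_2_0", "seq_6_0"])
df2 = pd.DataFrame({"A": [48.0, 47.0]},
                   index=["seq_1_0", "seq_2_0"])

# Labels present in one frame but absent from the other.
only_in_df0 = df0.index.difference(df2.index)
only_in_df2 = df2.index.difference(df0.index)
print(list(only_in_df0), list(only_in_df2))
```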