Compare different pandas dataframes by column with a tolerance variation

Question

I have 3 pandas dataframes like these:

#0
                 A     C     G     T          uA          uC          uG          uT     cmA     cmC     cmG     cmT
    seq_1_0   47.0  47.0  54.0  52.0  100.978723  100.957447  100.370370  99.788462   5147.0  5144.0  5055.0  4968.0
    seq_1_50  47.0  47.0  54.0  52.0  101.829787  101.680851  99.092593   99.692308   5279.0  5256.0  4864.0  4953.0
    seq_2_0   47.0  47.0  54.0  52.0  100.978723  100.957447  100.370370  99.788462   5147.0  5144.0  5055.0  4968.0
    seq_2_50  47.0  47.0  54.0  52.0  101.468085  101.425532  99.000000   100.346154  5223.0  5216.0  4850.0  5052.0
    seq_3_0   47.0  47.0  54.0  52.0  100.212766  99.680851   100.870370  101.115385  5030.0  4952.0  5131.0  5169.0
    seq_3_50  46.0  47.0  53.0  54.0  100.173913  100.978723  100.924528  99.944444   5026.0  5148.0  5139.0  4990.0
    seq_4_0   45.0  47.0  54.0  54.0  99.044444   99.000000   101.407407  102.111111  4856.0  4851.0  5214.0  5323.0
    seq_4_50  47.0  47.0  53.0  53.0  101.872340  104.382979  97.849057   98.490566   5285.0  5686.0  4684.0  4776.0
    seq_5_0   54.0  34.0  37.0  75.0  90.462963   91.647059   90.756757   116.546667  3700.0  3848.0  3737.0  7915.0
    seq_5_50  48.0  33.0  37.0  82.0  94.937500   113.636364  113.162162  92.756098   4277.0  7337.0  7245.0  3990.0
    seq_6_0   60.0  50.0  48.0  42.0  98.500000   93.900000   106.125000  104.785714  4777.0  4139.0  5976.0  5752.0
    seq_6_50  59.0  46.0  52.0  43.0  98.338983   98.826087   102.615385  102.697674  4754.0  4825.0  5402.0  5415.0
#1
                 A     C     G     T          uA          uC          uG          uT     cmA     cmC     cmG     cmT
    seq_1_0   47.0  47.0  54.0  52.0  100.978723  100.957447  100.370370  99.788462   5147.0  5144.0  5055.0  4968.0
    seq_1_50  47.0  47.0  54.0  52.0  101.829787  101.680851  99.092593   99.692308   5279.0  5256.0  4864.0  4953.0
    seq_2_0   47.0  47.0  54.0  52.0  100.978723  100.957447  100.370370  99.788462   5147.0  5144.0  5055.0  4968.0
    seq_2_50  47.0  47.0  54.0  52.0  101.468085  101.425532  99.000000   100.346154  5223.0  5216.0  4850.0  5052.0
    seq_3_0   47.0  47.0  54.0  52.0  100.212766  99.680851   100.870370  101.115385  5030.0  4952.0  5131.0  5169.0
    seq_3_50  46.0  47.0  53.0  54.0  100.173913  100.978723  100.924528  99.944444   5026.0  5148.0  5139.0  4990.0
    seq_4_0   45.0  47.0  54.0  54.0  99.044444   99.000000   101.407407  102.111111  4856.0  4851.0  5214.0  5323.0
    seq_4_50  47.0  47.0  53.0  53.0  101.872340  104.382979  97.849057   98.490566   5285.0  5686.0  4684.0  4776.0
    seq_5_0   54.0  34.0  37.0  75.0  90.462963   91.647059   90.756757   116.546667  3700.0  3848.0  3737.0  7915.0
    seq_5_50  48.0  33.0  37.0  82.0  94.937500   113.636364  113.162162  92.756098   4277.0  7337.0  7245.0  3990.0
#2
                 A     C     G     T          uA          uC          uG          uT     cmA     cmC     cmG     cmT
    seq_1_0   48.0  48.0  53.0  51.0  100.291667  99.208333   101.943396  100.411765  5042.0  4882.0  5297.0  5062.0
    seq_1_50  48.0  47.0  54.0  51.0  100.083333  101.680851  99.092593   101.294118  5012.0  5256.0  4864.0  5196.0
    seq_2_0   47.0  47.0  54.0  52.0  100.978723  100.957447  100.370370  99.788462   5147.0  5144.0  5055.0  4968.0
    seq_2_50  47.0  47.0  54.0  52.0  101.468085  101.425532  99.000000   100.346154  5223.0  5216.0  4850.0  5052.0
    seq_3_0   50.0  47.0  53.0  50.0  98.980000   99.680851   101.490566  101.740000  4847.0  4952.0  5226.0  5265.0
    seq_3_50  49.0  47.0  52.0  52.0  95.857143   100.978723  102.519231  102.423077  4403.0  5148.0  5387.0  5371.0

I want to compare all the columns of the first dataframe (#0) with the other two dataframes (#1 and #2) to identify which index labels have different column values (e.g. the labels seq_6_0 and seq_6_50 are present in dataframe #0 and absent from the other two dataframes).

But I also want to set a tolerance for each column, so that values from different dataframes count as equal when they fall within it, e.g.:

the index seq_1_0 of dataframe #0 has these values:

A     C     G     T          uA          uC          uG          uT     cmA     cmC     cmG     cmT
47.0  47.0  54.0  52.0  100.978723  100.957447  100.370370  99.788462   5147.0  5144.0  5055.0  4968.0

while the index seq_1_0 of dataframe #2 has:

A     C     G     T          uA          uC          uG          uT     cmA     cmC     cmG     cmT
48.0  48.0  53.0  51.0  100.291667  99.208333   101.943396  100.411765  5042.0  4882.0  5297.0  5062.0

So I want to set a different tolerance for each column, e.g. for columns ["A","C","T","G"] I need a tolerance of 90% between compared values, but for the other columns I need a different percentage between compared values.

Is there any pandas function that I can use to do this?

Best,

Answer 1

Use np.isclose, which allows you to precisely control the absolute and relative tolerance of the comparison.
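As a quick illustration of the two tolerance knobs, here is a minimal sketch. np.isclose(a, b) tests |a - b| <= atol + rtol * |b|, so the second argument acts as the reference. The numbers below are made up for the demonstration and are not taken from the question's data.

```python
import numpy as np

# Made-up values, chosen only to show the asymmetry of the relative test.
a, b = 100.0, 150.0

# Relative tolerance only: the allowed gap is 40% of the *second* argument.
forward = np.isclose(a, b, rtol=0.4, atol=0)   # 50 <= 0.4 * 150 -> True
backward = np.isclose(b, a, rtol=0.4, atol=0)  # 50 <= 0.4 * 100 -> False

# Absolute tolerance only: the allowed gap is a fixed amount.
absolute = np.isclose(5147.0, 5147.0005, rtol=0, atol=1e-3)  # True

print(forward, backward, absolute)
```

Swapping the arguments changes the outcome of the relative test, which is why the order of the dataframes matters whenever rtol is non-zero.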

I assume that you only want to compare rows with labels that exist in both dataframes. Rows that exist in one but not the other are ignored. Also, since you use a relative criterion for A, C, G, T, compare(df0,df1) is not the same as compare(df1,df0). The function treats the second parameter as the reference value, which is consistent with how np.isclose works.
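If you also want to list the labels that an inner join would silently drop (such as seq_6_0 and seq_6_50 in the question), one option is Index.difference. The sketch below uses tiny stand-in frames with a single column, not the full data from the question.

```python
import pandas as pd

# Hypothetical miniature frames standing in for df0 and df1.
df0 = pd.DataFrame(
    {"A": [47.0, 60.0, 59.0]},
    index=["seq_1_0", "seq_6_0", "seq_6_50"],
)
df1 = pd.DataFrame({"A": [47.0]}, index=["seq_1_0"])

# Labels present in one frame but missing from the other.
only_in_df0 = df0.index.difference(df1.index)
only_in_df1 = df1.index.difference(df0.index)

print(list(only_in_df0))  # ['seq_6_0', 'seq_6_50']
print(list(only_in_df1))  # []
```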

import numpy as np
import pandas as pd

def compare(dfa, dfb):
    s = pd.Series(['A','C','G','T'])
    tmp = dfa.join(dfb, how='inner', lsuffix='_a', rsuffix='_b')

    # The A, C, G, T columns: |a - b| <= 0.9 * |b|, i.e. within 90% of dfb
    lhs = tmp[s + '_a'].values
    rhs = tmp[s + '_b'].values
    compare1 = np.isclose(lhs, rhs, atol=0, rtol=0.9)

    # The uA, uC, uG, uT columns: within 1e-5
    lhs = tmp['u' + s + '_a'].values
    rhs = tmp['u' + s + '_b'].values
    compare2 = np.isclose(lhs, rhs, atol=1e-5, rtol=0)

    # The cmA, cmC, cmG, cmT columns: within 1e-3
    lhs = tmp['cm' + s + '_a'].values
    rhs = tmp['cm' + s + '_b'].values
    compare3 = np.isclose(lhs, rhs, atol=1e-3, rtol=0)

    # Assemble the result
    data = np.concatenate([compare1, compare2, compare3], axis=1)
    cols = pd.concat([s, 'u'+s, 'cm'+s])    
    result = pd.DataFrame(data, columns=cols, index=tmp.index)

    return result

compare(df0, df2)
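Since the question asks for a different tolerance per column, the same idea can be driven by a dictionary instead of hard-coded column groups. This is only a sketch: compare_with_tolerances, its parameters, and the tolerance values are hypothetical names and numbers, and it assumes index labels are unique. The sample values are the A and uA entries of seq_1_0 and seq_1_50 from dataframes #0 and #2.

```python
import numpy as np
import pandas as pd

def compare_with_tolerances(dfa, dfb, rtol=None, atol=None):
    """Hypothetical helper: per-column np.isclose on labels shared by both frames.

    Columns missing from rtol/atol fall back to a tolerance of 0 (exact match).
    dfb is the reference frame, as in np.isclose(a, b).
    """
    rtol = rtol or {}
    atol = atol or {}
    rows = dfa.index.intersection(dfb.index)
    cols = dfa.columns.intersection(dfb.columns)
    data = {
        col: np.isclose(
            dfa.loc[rows, col].to_numpy(),
            dfb.loc[rows, col].to_numpy(),
            rtol=rtol.get(col, 0.0),
            atol=atol.get(col, 0.0),
        )
        for col in cols
    }
    return pd.DataFrame(data, index=rows)

df0 = pd.DataFrame(
    {"A": [47.0, 47.0], "uA": [100.978723, 101.829787]},
    index=["seq_1_0", "seq_1_50"],
)
df2 = pd.DataFrame(
    {"A": [48.0, 48.0], "uA": [100.291667, 100.083333]},
    index=["seq_1_0", "seq_1_50"],
)

# Illustrative tolerances only: 5% for A, 1% for uA.
result = compare_with_tolerances(df0, df2, rtol={"A": 0.05, "uA": 0.01})
print(result)
```

With these illustrative tolerances, both A values pass, while uA passes for seq_1_0 and fails for seq_1_50.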

For an easy visualization of the result:

def highlight_false(cell):
    return '' if cell else 'background-color: yellow'

result = compare(df0,df2)
# On pandas >= 2.1, Styler.applymap is deprecated in favor of Styler.map
result.style.applymap(highlight_false)
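Outside a notebook, where the styled table is not rendered, the boolean result can be reduced to a plain list of labels that have at least one out-of-tolerance column. The frame below is a hypothetical stand-in shaped like the output of compare(), not real output.

```python
import pandas as pd

# Hypothetical boolean result, shaped like the output of compare().
result = pd.DataFrame(
    {"A": [True, True, True], "uA": [False, True, False]},
    index=["seq_1_0", "seq_2_0", "seq_3_0"],
)

# A row is fine only if every column is within tolerance.
row_ok = result.all(axis=1)
mismatched = result.index[~row_ok].tolist()

# For each mismatched label, collect the columns that failed.
failing_columns = {
    label: result.columns[~result.loc[label].to_numpy()].tolist()
    for label in mismatched
}

print(mismatched)       # ['seq_1_0', 'seq_3_0']
print(failing_columns)  # {'seq_1_0': ['uA'], 'seq_3_0': ['uA']}
```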

Solution 1
Score 4, accepted, 2020-01-14 14:59:44
