Comparing different pandas dataframes by columns with tolerance variation

Question

I have 3 pandas dataframes like these:

#0
                 A     C     G     T          uA          uC          uG          uT     cmA     cmC     cmG     cmT
    seq_1_0   47.0  47.0  54.0  52.0  100.978723  100.957447  100.370370  99.788462   5147.0  5144.0  5055.0  4968.0
    seq_1_50  47.0  47.0  54.0  52.0  101.829787  101.680851  99.092593   99.692308   5279.0  5256.0  4864.0  4953.0
    seq_2_0   47.0  47.0  54.0  52.0  100.978723  100.957447  100.370370  99.788462   5147.0  5144.0  5055.0  4968.0
    seq_2_50  47.0  47.0  54.0  52.0  101.468085  101.425532  99.000000   100.346154  5223.0  5216.0  4850.0  5052.0
    seq_3_0   47.0  47.0  54.0  52.0  100.212766  99.680851   100.870370  101.115385  5030.0  4952.0  5131.0  5169.0
    seq_3_50  46.0  47.0  53.0  54.0  100.173913  100.978723  100.924528  99.944444   5026.0  5148.0  5139.0  4990.0
    seq_4_0   45.0  47.0  54.0  54.0  99.044444   99.000000   101.407407  102.111111  4856.0  4851.0  5214.0  5323.0
    seq_4_50  47.0  47.0  53.0  53.0  101.872340  104.382979  97.849057   98.490566   5285.0  5686.0  4684.0  4776.0
    seq_5_0   54.0  34.0  37.0  75.0  90.462963   91.647059   90.756757   116.546667  3700.0  3848.0  3737.0  7915.0
    seq_5_50  48.0  33.0  37.0  82.0  94.937500   113.636364  113.162162  92.756098   4277.0  7337.0  7245.0  3990.0
    seq_6_0   60.0  50.0  48.0  42.0  98.500000   93.900000   106.125000  104.785714  4777.0  4139.0  5976.0  5752.0
    seq_6_50  59.0  46.0  52.0  43.0  98.338983   98.826087   102.615385  102.697674  4754.0  4825.0  5402.0  5415.0
#1
                 A     C     G     T          uA          uC          uG          uT     cmA     cmC     cmG     cmT
    seq_1_0   47.0  47.0  54.0  52.0  100.978723  100.957447  100.370370  99.788462   5147.0  5144.0  5055.0  4968.0
    seq_1_50  47.0  47.0  54.0  52.0  101.829787  101.680851  99.092593   99.692308   5279.0  5256.0  4864.0  4953.0
    seq_2_0   47.0  47.0  54.0  52.0  100.978723  100.957447  100.370370  99.788462   5147.0  5144.0  5055.0  4968.0
    seq_2_50  47.0  47.0  54.0  52.0  101.468085  101.425532  99.000000   100.346154  5223.0  5216.0  4850.0  5052.0
    seq_3_0   47.0  47.0  54.0  52.0  100.212766  99.680851   100.870370  101.115385  5030.0  4952.0  5131.0  5169.0
    seq_3_50  46.0  47.0  53.0  54.0  100.173913  100.978723  100.924528  99.944444   5026.0  5148.0  5139.0  4990.0
    seq_4_0   45.0  47.0  54.0  54.0  99.044444   99.000000   101.407407  102.111111  4856.0  4851.0  5214.0  5323.0
    seq_4_50  47.0  47.0  53.0  53.0  101.872340  104.382979  97.849057   98.490566   5285.0  5686.0  4684.0  4776.0
    seq_5_0   54.0  34.0  37.0  75.0  90.462963   91.647059   90.756757   116.546667  3700.0  3848.0  3737.0  7915.0
    seq_5_50  48.0  33.0  37.0  82.0  94.937500   113.636364  113.162162  92.756098   4277.0  7337.0  7245.0  3990.0
#2
                 A     C     G     T          uA          uC          uG          uT     cmA     cmC     cmG     cmT
    seq_1_0   48.0  48.0  53.0  51.0  100.291667  99.208333   101.943396  100.411765  5042.0  4882.0  5297.0  5062.0
    seq_1_50  48.0  47.0  54.0  51.0  100.083333  101.680851  99.092593   101.294118  5012.0  5256.0  4864.0  5196.0
    seq_2_0   47.0  47.0  54.0  52.0  100.978723  100.957447  100.370370  99.788462   5147.0  5144.0  5055.0  4968.0
    seq_2_50  47.0  47.0  54.0  52.0  101.468085  101.425532  99.000000   100.346154  5223.0  5216.0  4850.0  5052.0
    seq_3_0   50.0  47.0  53.0  50.0  98.980000   99.680851   101.490566  101.740000  4847.0  4952.0  5226.0  5265.0
    seq_3_50  49.0  47.0  52.0  52.0  95.857143   100.978723  102.519231  102.423077  4403.0  5148.0  5387.0  5371.0

I want to compare all the columns of the first dataframe (#0) with the other two dataframes (#1 and #2), to identify which index labels have different column values (e.g. the labels seq_6_0 and seq_6_50 are present in dataframe #0 but absent from the other two dataframes).

I also want to set a tolerance for each column, so that values from different dataframes count as equal when they are close enough. For example:

the row seq_1_0 of dataframe #0 has these values:

A     C     G     T          uA          uC          uG          uT     cmA     cmC     cmG     cmT
47.0  47.0  54.0  52.0  100.978723  100.957447  100.370370  99.788462   5147.0  5144.0  5055.0  4968.0

while the row seq_1_0 of dataframe #2 has:

A     C     G     T          uA          uC          uG          uT     cmA     cmC     cmG     cmT
48.0  48.0  53.0  51.0  100.291667  99.208333   101.943396  100.411765  5042.0  4882.0  5297.0  5062.0

So I want to set a different tolerance for each column. For example, for columns ["A","C","T","G"] I need a tolerance of 90% between the compared values, but for the other columns I need different percentages.

Is there a pandas function that I can use to do this?

Best,

Answer 1

Use np.isclose, which lets you control both the absolute and the relative tolerance of the comparison.

I assume that you only want to compare rows with labels that exist in both dataframes; rows that exist in one but not the other are ignored. Also, since you use a relative criterion for A, C, G, T, compare(df0, df1) is not the same as compare(df1, df0). The function treats its second parameter as the reference value, which is consistent with how np.isclose works: the relative tolerance is scaled by the second array.
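To make the asymmetry concrete, here is a minimal sketch with made-up numbers (50 and 100 are not taken from the question). It relies only on the documented np.isclose rule, |a - b| <= atol + rtol * |b|:

```python
import numpy as np

# np.isclose(a, b) tests |a - b| <= atol + rtol * |b|,
# so the relative tolerance is scaled by the second argument only.
forward = np.isclose(50.0, 100.0, rtol=0.9, atol=0)   # 50 <= 0.9 * 100 = 90
backward = np.isclose(100.0, 50.0, rtol=0.9, atol=0)  # 50 >  0.9 * 50  = 45

print(forward, backward)  # True False
```

Swapping the arguments changes the answer, so decide up front which dataframe is the reference.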

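import numpy as np
import pandas as pd

def compare(dfa, dfb):
    s = pd.Series(['A','C','G','T'])
    # Inner join keeps only the row labels present in both dataframes
    tmp = dfa.join(dfb, how='inner', lsuffix='_a', rsuffix='_b')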

    # The A, C, G, T columns: within 90% of dfb
    lhs = tmp[s + '_a'].values
    rhs = tmp[s + '_b'].values
    compare1 = np.isclose(lhs, rhs, atol=0, rtol=0.9)

    # The uA, uC, uG, uT columns: within 1e-5
    lhs = tmp['u' + s + '_a'].values
    rhs = tmp['u' + s + '_b'].values
    compare2 = np.isclose(lhs, rhs, atol=1e-5, rtol=0)

    # The cmA, cmC, cmG, cmT columns: within 1e-3
    lhs = tmp['cm' + s + '_a'].values
    rhs = tmp['cm' + s + '_b'].values
    compare3 = np.isclose(lhs, rhs, atol=1e-3, rtol=0)

    # Assemble the result
    data = np.concatenate([compare1, compare2, compare3], axis=1)
    cols = pd.concat([s, 'u'+s, 'cm'+s])    
    result = pd.DataFrame(data, columns=cols, index=tmp.index)

    return result

compare(df0, df2)
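If every column needs its own percentage, the three hard-coded groups can be replaced by a lookup table. The following is only a sketch: compare_rtol and rtol_by_column are hypothetical names, the tolerances and toy values are placeholders, and the check is written out by hand (same rule as np.isclose with atol=0, but without its special handling of infinities).

```python
import numpy as np
import pandas as pd

def compare_rtol(dfa, dfb, rtol_by_column):
    """Hypothetical helper: per-column relative comparison, dfb is the reference."""
    common = dfa.index.intersection(dfb.index)  # labels present in both frames
    cols = list(rtol_by_column)
    a = dfa.loc[common, cols].to_numpy(dtype=float)
    b = dfb.loc[common, cols].to_numpy(dtype=float)
    rtol = np.array([rtol_by_column[c] for c in cols], dtype=float)
    # One tolerance per column; rtol broadcasts across the rows
    close = np.abs(a - b) <= rtol * np.abs(b)
    return pd.DataFrame(close, index=common, columns=cols)

# Toy data; tolerances are placeholders, not values from the question
df_a = pd.DataFrame({'A': [47.0, 47.0], 'uA': [100.0, 101.8]},
                    index=['seq_1_0', 'seq_1_50'])
df_b = pd.DataFrame({'A': [48.0, 60.0], 'uA': [100.5, 100.0]},
                    index=['seq_1_0', 'seq_1_50'])
print(compare_rtol(df_a, df_b, {'A': 0.1, 'uA': 0.01}))
```

Here seq_1_0 passes in both columns and seq_1_50 fails in both, because 47 is more than 10% away from 60 and 101.8 is more than 1% away from 100.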

For an easy visualization of the result:

def highlight_false(cell):
    return '' if cell else 'background-color: yellow'

result = compare(df0, df2)
# Styler.applymap is deprecated since pandas 2.1; use result.style.map(highlight_false) there
result.style.applymap(highlight_false)
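To get the offending labels as plain lists rather than a styled table, something along these lines should work. It is a sketch on toy stand-ins: idx0, idx2 and the small result frame are made up for illustration. With the real data you would use df0.index, df2.index and the output of compare(df0, df2).

```python
import pandas as pd

# Toy stand-ins: the index labels of two frames and a boolean result
# shaped like the one compare() returns
idx0 = pd.Index(['seq_1_0', 'seq_2_0', 'seq_6_0'])
idx2 = pd.Index(['seq_1_0', 'seq_2_0'])
result = pd.DataFrame({'A': [True, True], 'uA': [False, True]},
                      index=['seq_1_0', 'seq_2_0'])

# Labels present in the first frame but absent from the second
missing = idx0.difference(idx2)

# Labels whose row has at least one False (some column outside tolerance)
differing = result.index[~result.all(axis=1)]

print(list(missing))    # ['seq_6_0']
print(list(differing))  # ['seq_1_0']
```

The first list covers the rows that the inner join silently drops, the second the rows that were compared and failed in at least one column.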

solution1
4 ACCEPTED 2020-01-14 14:59:44