
Python: how to find correlation between two values and remove noise?

I have two curves, A and B, that are highly correlated, as shown in the figure below, where C is the Pearson correlation between A and B.

The file containing the data can be downloaded here.

import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv('prova.csv')
A = df['A'].values
B = df['B'].values
C = pearsonr(A, B)[0]  # Pearson correlation coefficient between A and B

fig, ax = plt.subplots(1, 2, figsize=(20, 5))

# Left panel: A and B against sample index, each on its own y-axis
ax1 = ax[0]
ax2 = ax1.twinx()
ax1.plot(A, 'g-')
ax2.plot(B, 'b-')
ax1.set_ylabel('A', color='g', fontsize=20)
ax2.set_ylabel('B', color='b', fontsize=20)

# Right panel: scatter of B against A, with the correlation in the legend
ax3 = ax[1]
txt = 'C = %.2f' % C
ax3.scatter(A, B, label=txt)
ax3.set_xlabel('A', color='g', fontsize=20)
ax3.set_ylabel('B', color='b', fontsize=20)
ax3.legend(fontsize=16)

plt.show()

The values of the green curve (A) should be 0, but the signal is affected by B. I would like to find the relation between A and B so that the contribution of B can be cancelled out of A, but I am unsure how to proceed.

[Figure: data and correlation plot]

Clearly, A and B predict each other quite well. We can exploit this to combine A and B into a signal that stays close to 0. My method of choice is a least-squares fit.

We want to find parameters x and c that minimize the sum of squares of the residual A - x * B - c. This can be done using scipy.optimize.least_squares:

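import matplotlib.pyplot as plt
import pandas as pd
import scipy.optimize as opt


df = pd.read_csv('prova.csv')

def fit(x):
    # Residual of the linear model A ~ x[0] * B + x[1]
    return df['A'] - x[0] * df['B'] - x[1]


# Start from slope 0 and offset 0; least_squares minimizes the sum of squared residuals
result = opt.least_squares(fit, [0, 0])

# Plot the residual, i.e. A with the fitted contribution of B removed
fit(result.x).plot()
plt.show()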

This results in:

[Figure: result]

The residual is many orders of magnitude closer to zero than the original A.
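Because the model A - x * B - c is linear in x and c, the same least-squares problem also has a closed-form solution, so an iterative solver is not strictly required. The sketch below illustrates this with np.polyfit and with the equivalent covariance formula, and shows how the slope relates to the Pearson correlation C from the question. It runs on synthetic data, since prova.csv is not reproduced here; the slope 0.5, the offset 2.0 and the noise level are made-up values for illustration only.

```python
import numpy as np

# Synthetic stand-in for prova.csv (assumption: the real data is not available here).
rng = np.random.default_rng(0)
B = np.cumsum(rng.normal(size=500))            # slowly varying signal
true_slope, true_offset = 0.5, 2.0             # hypothetical values, for illustration only
A = true_slope * B + true_offset + rng.normal(scale=0.05, size=B.size)

# Degree-1 polynomial fit: A ~ slope * B + offset (ordinary least squares).
slope, offset = np.polyfit(B, A, 1)

# Equivalent closed form: slope = cov(A, B) / var(B), offset = mean(A) - slope * mean(B).
slope_cf = np.cov(B, A)[0, 1] / np.var(B, ddof=1)
offset_cf = A.mean() - slope_cf * B.mean()

# The slope is also the Pearson correlation rescaled by the two standard deviations.
C = np.corrcoef(A, B)[0, 1]
slope_from_C = C * A.std() / B.std()

# Residual: A with the fitted contribution of B removed.
residual = A - slope * B - offset

print("slope, offset:", slope, offset)
print("std of A:", A.std(), "std of residual:", residual.std())
```

On real data, replacing the synthetic A and B with df['A'].values and df['B'].values should give the same parameters as opt.least_squares up to numerical tolerance. How close the residual gets to zero depends on how linear the relation between A and B actually is.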
