I have a large pandas DataFrame with essentially the following structure:
df = pd.DataFrame(np.random.randint(0,100,size=(20, 20)), columns=list('ABCDEFGHIJKLMNOPQRST'))
Each row is an array of numbers, e.g.:
row_one = df.iloc[0, :].values
row_two = df.iloc[1, :].values
....
I would like to calculate the correlation coefficient (np.corrcoef) between all combinations of rows, e.g.:
np.corrcoef(row_one, row_one)[0][1]
np.corrcoef(row_one, row_two)[0][1]
np.corrcoef(row_one, row_three)[0][1]
....
np.corrcoef(row_two, row_one)[0][1]
np.corrcoef(row_two, row_two)[0][1]
np.corrcoef(row_two, row_three)[0][1]
...
In the end I want a DataFrame that holds the correlation coefficients (CC) for all combinations of rows. I can't figure out how to vectorize the code. My original DataFrame is very large, so I would be grateful for any advice on how to speed this up.
Thanks!
Pandas has a method for that already: corr. It works on the columns, so you just need to transpose your dataframe.
corr_matrix = df.T.corr()
It'll generate a correlation matrix in which you can look up the correlation coefficient between any two datasets (rows). For example, the coefficient for the 4th and 7th datasets is corr_matrix.iloc[3, 6] (or corr_matrix.iloc[6, 3], since the matrix is symmetric).
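A possible alternative for a large frame is to call np.corrcoef once on the underlying array, since it treats each row as a variable by default (rowvar=True). The following is only a minimal sketch, assuming the data is entirely numeric and has no missing values (unlike corr, np.corrcoef does not skip NaN). The names corr_np and corr_pd and the seeded random data are illustrative, not part of the question.

```python
import numpy as np
import pandas as pd

# Illustrative data with the same shape as in the question (seeded for repeatability).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(20, 20)),
                  columns=list('ABCDEFGHIJKLMNOPQRST'))

# np.corrcoef treats each row as a variable by default (rowvar=True),
# so no transpose is needed here.
corr_np = pd.DataFrame(np.corrcoef(df.to_numpy()),
                       index=df.index, columns=df.index)

# The pandas route: corr() correlates columns, hence the transpose.
corr_pd = df.T.corr()

# Both routes should agree up to floating-point error.
same = np.allclose(corr_np.to_numpy(), corr_pd.to_numpy())
print(same)
```

Which of the two is faster likely depends on the size and dtype of the data, so it is worth timing both on a sample of the real frame.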
The simplest way to do so is to use pandas' built-in method .corr(). Note, however, that it computes the correlation over columns by default:
Compute pairwise correlation of columns, excluding NA/null values
So you could do:
df.T.corr()
You can check the correlation of any pair of rows with:
row_one = df.iloc[0, :].values
row_two = df.iloc[1, :].values
np.corrcoef(row_one,row_two)
As a simple example:
df = pd.DataFrame(np.random.randint(0,10,size=(3, 3)), columns=list('ABC'))
df.T.corr()
          0         1         2
0  1.000000 -0.479317 -0.921551
1 -0.479317  1.000000  0.782467
2 -0.921551  0.782467  1.000000
Checking rows 0 and 1, for instance, you can see that the result is the same:
row_one = df.iloc[0, :].values
row_two = df.iloc[1, :].values
np.corrcoef(row_one,row_two)
array([[ 1.        , -0.47931716],
       [-0.47931716,  1.        ]])
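The same spot check can be extended to every pair of rows. This is a rough sketch only: the 5x8 shape, the seed and the names corr_matrix and max_diff are assumptions made for illustration, and the loop is meant for verification on a small frame, not as a replacement for df.T.corr().

```python
import itertools

import numpy as np
import pandas as pd

# Small illustrative frame (shape and seed are arbitrary choices).
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(0, 100, size=(5, 8)))

# Row-by-row correlation matrix via pandas.
corr_matrix = df.T.corr()

# Compare each off-diagonal entry with a direct np.corrcoef call on that pair of rows.
max_diff = 0.0
for i, j in itertools.combinations(range(len(df)), 2):
    pair = np.corrcoef(df.iloc[i].to_numpy(), df.iloc[j].to_numpy())[0, 1]
    max_diff = max(max_diff, abs(pair - corr_matrix.iloc[i, j]))

# Largest absolute disagreement; expected to be at floating-point noise level.
print(max_diff)
```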