
Vectorize code in a big pandas DataFrame, where each row should be treated as a NumPy array

I have a big pandas DataFrame with essentially the following structure:

df = pd.DataFrame(np.random.randint(0,100,size=(20, 20)), columns=list('ABCDEFGHIJKLMNOPQRST'))

Each of the rows is an array of numbers, e.g.:

row_one = df.iloc[0, :].values
row_two = df.iloc[1, :].values
....

I would like to calculate the correlation coefficient (np.corrcoef) between all combinations of rows, e.g.:

np.corrcoef(row_one, row_one)[0][1]
np.corrcoef(row_one, row_two)[0][1]
np.corrcoef(row_one, row_three)[0][1]
....
np.corrcoef(row_two, row_one)[0][1]
np.corrcoef(row_two, row_two)[0][1]
np.corrcoef(row_two, row_three)[0][1]
...

In the end I want a DataFrame that holds the correlation coefficients (CC) for all combinations of rows. I can't figure out how to vectorize the code. My original DataFrame is very large, so I would be grateful for any advice on how to speed this up.

Thanks!

pandas already has a method for that: corr. It works on columns, so you just need to transpose your DataFrame.

corr_matrix = df.T.corr()

This generates a correlation matrix holding the coefficient for every pair of rows. For example, the coefficient for the 4th and 7th rows is corr_matrix.iloc[3, 6] (or corr_matrix.iloc[6, 3], since the matrix is symmetric).
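For a purely numeric frame with no missing values, np.corrcoef can also take the whole 2-D array in one call, because it treats each row as a variable by default (rowvar=True). The following is only a minimal sketch, assuming the 20x20 random frame from the question; the fixed seed and the names corr_np and corr_pd are illustrative additions, not part of the original post.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded generator so the example is reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(20, 20)),
                  columns=list('ABCDEFGHIJKLMNOPQRST'))

# np.corrcoef treats each row of a 2-D array as one variable (rowvar=True),
# so a single call returns the full row-by-row correlation matrix.
corr_np = pd.DataFrame(np.corrcoef(df.to_numpy()),
                       index=df.index, columns=df.index)

# The pandas route from the answer, for comparison.
corr_pd = df.T.corr()

print(np.allclose(corr_np, corr_pd))
```

Unlike DataFrame.corr, np.corrcoef does not skip NaN values, so this variant is only interchangeable when the data is complete.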

The simplest way to do this is to use pandas' built-in method .corr(). Note, however, that it computes the correlation between columns. From the documentation:

Compute pairwise correlation of columns, excluding NA/null values

So you could do:

df.T.corr()

You can check the correlation for any pair of rows with:

row_one = df.iloc[0, :].values
row_two = df.iloc[1, :].values
np.corrcoef(row_one,row_two)

As a simple example:

df = pd.DataFrame(np.random.randint(0,10,size=(3, 3)), columns=list('ABC'))
df.T.corr()

          0         1         2
0  1.000000 -0.479317 -0.921551
1 -0.479317  1.000000  0.782467
2 -0.921551  0.782467  1.000000

Checking rows 0 and 1, for instance, you can see that the result is the same:

row_one = df.iloc[0, :].values
row_two = df.iloc[1, :].values
np.corrcoef(row_one,row_two)

array([[ 1.        , -0.47931716],
       [-0.47931716,  1.        ]])
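To see what the vectorized call does internally, the Pearson coefficient for every pair of rows can be written as one matrix product: center each row, scale it to unit length, and multiply the result by its own transpose. This is a rough sketch rather than how pandas actually implements corr; the helper name row_corr, the seed, and the 5x8 shape are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def row_corr(frame):
    """Pearson correlation between all pairs of rows (hypothetical helper)."""
    x = frame.to_numpy(dtype=float)
    x = x - x.mean(axis=1, keepdims=True)            # center each row
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # scale each row to unit length
    # Dot products of unit-length, centered rows are the correlation coefficients.
    return pd.DataFrame(x @ x.T, index=frame.index, columns=frame.index)

# Assumption: small seeded example data, 5 rows of 8 values each.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(5, 8)))

manual = row_corr(df)
print(np.allclose(manual, df.T.corr()))
```

A constant row has zero norm and would produce NaN here, and missing values are not handled, so df.T.corr() remains the safer default.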
