
Python Pandas Correlation Matrix

Given a DataFrame "df", I am trying to print a correlation matrix that shows only the upper triangle, so that duplicate correlation coefficients are not displayed. I want to output only the coefficients where the correlation is +/- 0.7 or stronger.

Command:

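import numpy as np

# Define correlation matrix (absolute values, so the sign is discarded)
cor_matrix = df.corr().abs()

# Upper triangle of correlation matrix (k=1 excludes the diagonal)
# np.bool was removed in NumPy 1.24; the builtin bool works in all versions
upper_tri = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k=1).astype(bool))
print(upper_tri)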

Output:

               price  bedrooms  bathrooms  sqft_living  sqft_lot    floors  \
price            NaN  0.318373   0.526308     0.701579  0.088422  0.256794   
bedrooms         NaN       NaN   0.532335     0.592354  0.029548  0.184539   
bathrooms        NaN       NaN        NaN     0.754604  0.087124  0.495209   
sqft_living      NaN       NaN        NaN          NaN  0.168363  0.355316   
sqft_lot         NaN       NaN        NaN          NaN       NaN  0.005682   
floors           NaN       NaN        NaN          NaN       NaN       NaN   
waterfront       NaN       NaN        NaN          NaN       NaN       NaN   
view             NaN       NaN        NaN          NaN       NaN       NaN   
condition        NaN       NaN        NaN          NaN       NaN       NaN   
grade            NaN       NaN        NaN          NaN       NaN       NaN   
sqft_above       NaN       NaN        NaN          NaN       NaN       NaN   
sqft_basement    NaN       NaN        NaN          NaN       NaN       NaN   
yr_built         NaN       NaN        NaN          NaN       NaN       NaN   
yr_renovated     NaN       NaN        NaN          NaN       NaN       NaN   
zipcode          NaN       NaN        NaN          NaN       NaN       NaN   
sqft_living15    NaN       NaN        NaN          NaN       NaN       NaN   
sqft_lot15       NaN       NaN        NaN          NaN       NaN       NaN   

               waterfront      view  condition     grade  sqft_above  \
price            0.266369  0.092607   0.036362  0.667434    0.605567   
bedrooms         0.004450  0.022115   0.024955  0.369402    0.492133   
bathrooms        0.068855  0.039679   0.123293  0.664220    0.684384   
sqft_living      0.107053  0.067329   0.059371  0.763833    0.875966   
sqft_lot         0.021173  0.008187   0.009154  0.111713    0.181152   
floors           0.023698  0.022721   0.263768  0.458183    0.523885   
waterfront            NaN  0.006540   0.016653  0.082775    0.072075   
view                  NaN       NaN   0.019697  0.048944    0.021839   
condition             NaN       NaN        NaN  0.144674    0.158214   
grade                 NaN       NaN        NaN       NaN    0.755923   
sqft_above            NaN       NaN        NaN       NaN         NaN   
sqft_basement         NaN       NaN        NaN       NaN         NaN   
yr_built              NaN       NaN        NaN       NaN         NaN   
yr_renovated          NaN       NaN        NaN       NaN         NaN   
zipcode               NaN       NaN        NaN       NaN         NaN   
sqft_living15         NaN       NaN        NaN       NaN         NaN   
sqft_lot15            NaN       NaN        NaN       NaN         NaN   

               sqft_basement  yr_built  yr_renovated   zipcode  sqft_living15  \
price               0.180230  0.054012      0.126092  0.053203       0.585379   
bedrooms            0.164041  0.161578      0.020543  0.159061       0.404490   
bathrooms           0.166565  0.504844      0.050453  0.205309       0.569378   
sqft_living         0.202806  0.319783      0.056751  0.199637       0.756901   
sqft_lot            0.034901  0.052165      0.009096  0.131311       0.145112   
floors              0.256560  0.489319      0.006260  0.059121       0.279885   
waterfront          0.037227  0.026161      0.093294  0.030285       0.086463   
view                0.087539  0.034053      0.033574  0.043251       0.076880   
condition           0.135577  0.361417      0.060139  0.003026       0.092824   
grade               0.051838  0.446963      0.014008  0.184862       0.713202   
sqft_above          0.210991  0.423898      0.023178  0.261190       0.731870   
sqft_basement            NaN  0.167902      0.049004  0.162968       0.043830   
yr_built                 NaN       NaN      0.225195  0.346869       0.326229   
yr_renovated             NaN       NaN           NaN  0.064335       0.002755   
zipcode                  NaN       NaN           NaN       NaN       0.279033   
sqft_living15            NaN       NaN           NaN       NaN            NaN   
sqft_lot15               NaN       NaN           NaN       NaN            NaN   

               sqft_lot15  
price            0.082447  
bedrooms         0.026450  
bathrooms        0.089010  
sqft_living      0.181697  
sqft_lot         0.728800  
floors           0.011269  
waterfront       0.030703  
view             0.009125  
condition        0.003406  
grade            0.119248  
sqft_above       0.194050  
sqft_basement    0.040733  
yr_built         0.070958  
yr_renovated     0.007920  
zipcode          0.147221  
sqft_living15    0.183192  
sqft_lot15            NaN  

Is there a way to print the correlation matrix with only the values of +/- 0.7 or higher?


Update: Output of df.iloc[:6,:6].to_dict()

{'date': {0: Timestamp('2014-10-13 00:00:00'),
  1: Timestamp('2014-12-09 00:00:00'),
  2: Timestamp('2015-02-25 00:00:00'),
  3: Timestamp('2014-12-09 00:00:00'),
  4: Timestamp('2015-02-18 00:00:00'),
  5: Timestamp('2014-05-12 00:00:00')},
 'price': {0: 221900.0,
  1: 538000.0,
  2: 180000.0,
  3: 604000.0,
  4: 510000.0,
  5: 1225000.0},
 'bedrooms': {0: 3.0, 1: 3.0, 2: 2.0, 3: 4.0, 4: 3.0, 5: 4.0},
 'bathrooms': {0: 1.0, 1: 2.25, 2: 1.0, 3: 3.0, 4: 2.0, 5: 4.5},
 'sqft_living': {0: 1180.0,
  1: 2570.0,
  2: 770.0,
  3: 1960.0,
  4: 1680.0,
  5: 5420.0},
 'sqft_lot': {0: 5650.0,
  1: 7242.0,
  2: 10000.0,
  3: 5000.0,
  4: 8080.0,
  5: 101930.0}}

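You can use a mask to hide the values lower than the threshold, and dropna to clear out the empty rows/columns (the output shown corresponds to cor_matrix holding the upper triangle, i.e. upper_tri from the question; on the full matrix the diagonal of 1.0 values would also pass the threshold):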

(cor_matrix
 .mask(cor_matrix.abs().lt(0.7))
 .dropna(how='all')
 .dropna(how='all', axis=1)
)

Output (I used only the columns/rows up to "floors" as an example):

           sqft_living
price         0.701579
bathrooms     0.754604
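
For reference, the question's upper-triangle step and the mask/dropna step above can be combined into one runnable sketch. The data is the six-row sample from the question's update (numeric columns only); the helper name strong_upper_triangle and its threshold parameter are made up for illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd

# First six rows of the numeric columns from the question's sample data.
df = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0, 1225000.0],
    "bedrooms": [3.0, 3.0, 2.0, 4.0, 3.0, 4.0],
    "bathrooms": [1.0, 2.25, 1.0, 3.0, 2.0, 4.5],
    "sqft_living": [1180.0, 2570.0, 770.0, 1960.0, 1680.0, 5420.0],
    "sqft_lot": [5650.0, 7242.0, 10000.0, 5000.0, 8080.0, 101930.0],
})


def strong_upper_triangle(frame, threshold=0.7):
    """Hypothetical helper: upper-triangle |r| >= threshold, empty rows/columns dropped."""
    cor_matrix = frame.corr().abs()
    # k=1 excludes the diagonal, so the self-correlations of 1.0 never appear.
    keep = np.triu(np.ones(cor_matrix.shape, dtype=bool), k=1)
    upper_tri = cor_matrix.where(keep)
    return (upper_tri
            .mask(upper_tri.lt(threshold))   # hide values below the threshold
            .dropna(how="all")               # drop rows that are entirely NaN
            .dropna(how="all", axis=1))      # drop columns that are entirely NaN


strong = strong_upper_triangle(df)
print(strong)
```

With only six rows the coefficients differ from those of the full dataset shown in the question, so the printed values will not match the output above.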

Another option, as 1D output:

m1 = np.triu(np.ones(cor_matrix.shape).astype(bool))
m2 = cor_matrix.abs().ge(0.7)

cor_matrix.where(m1&m2).stack()

Output:

price      sqft_living    0.701579
bathrooms  sqft_living    0.754604
dtype: float64
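
A self-contained variant of this 1D approach is sketched below, on made-up toy data (the columns x, y, z, w are invented for illustration). It assumes you start from the full correlation matrix, so it passes k=1 to np.triu to exclude the diagonal, and it appends dropna() as a safeguard, since whether stack() drops NaN entries by default may depend on the pandas version. The helper name strong_pairs is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up toy data: y is a multiple of x, z is strongly related to x, w is not.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],
    "z": [5.0, 3.0, 4.0, 1.0, 2.0],
    "w": [1.0, -1.0, 0.0, -1.0, 1.0],
})


def strong_pairs(frame, threshold=0.7):
    """Hypothetical helper: Series of |r| >= threshold, one entry per column pair."""
    cor_matrix = frame.corr().abs()
    # Upper triangle without the diagonal: each pair of columns appears once.
    m1 = np.triu(np.ones(cor_matrix.shape, dtype=bool), k=1)
    # Entries at or above the threshold.
    m2 = cor_matrix.ge(threshold)
    return (cor_matrix.where(m2 & m1)
            .stack()
            .dropna()                      # safeguard: remove any NaN entries left by stack()
            .sort_values(ascending=False)) # strongest pairs first


pairs = strong_pairs(toy)
print(pairs)
```

The result is a Series with a two-level index (row label, column label), so an individual pair can be looked up with pairs.loc[("x", "y")].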
