How to put F-statistic and P-value into a table?
How can I simplify this code into a for loop and create a table that shows the F-statistic and P-value for each feature?
print(scipystats.f_oneway(
    df_data.loc[df_data["SaleCondition"] == 'Normal'].SalePrice,
    df_data.loc[df_data["SaleCondition"] == 'Abnorml'].SalePrice,
    df_data.loc[df_data["SaleCondition"] == 'Partial'].SalePrice,
    df_data.loc[df_data["SaleCondition"] == 'AdjLand'].SalePrice,
    df_data.loc[df_data["SaleCondition"] == 'Alloca'].SalePrice,
    df_data.loc[df_data["SaleCondition"] == 'Family'].SalePrice))
>>> F_onewayResult(statistic=45.57842830969571, pvalue=7.988268404991176e-44)
print(scipystats.f_oneway(
    df_data.loc[df_data["Fence"] == 'MnPrv'].SalePrice,
    df_data.loc[df_data["Fence"] == 'GdWo'].SalePrice,
    df_data.loc[df_data["Fence"] == 'GdPrv'].SalePrice,
    df_data.loc[df_data["Fence"] == 'MnWw'].SalePrice))
>>> F_onewayResult(statistic=4.948158647146986, pvalue=0.002312645635631918)
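How can I create a table and extract the F-statistic and P-value as the values of the respective columns? And how can I sort it so that the variables with the highest F-statistic come first?
Edited - which result is more accurate?
Results from my method: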
F-statistics P-value
ExterQual 443.334831 1.439551e-204
KitchenQual 407.806352 3.032213e-192
BsmtQual 392.913506 9.610615e-186
GarageFinish 250.962467 1.199117e-93
MasVnrType 111.672380 4.793331e-65
Foundation 100.253851 5.791895e-91
CentralAir 98.305344 1.809506e-22
HeatingQC 88.394462 2.667062e-67
Neighborhood 71.784865 1.558600e-225
GarageType 71.522123 1.247154e-66
BsmtExposure 70.887984 1.022671e-42
BsmtFinType1 67.602175 1.807731e-63
SaleCondition 45.578428 7.988268e-44
MSZoning 43.840282 8.817634e-35
PavedDrive 42.024179 1.803569e-18
LotShape 40.132852 6.447524e-25
Alley 35.562060 4.899826e-08
SaleType 28.863054 5.039767e-42
FireplaceQu 24.398929 5.016300e-19
Electrical 23.067673 1.663249e-18
HouseStyle 19.595001 3.376777e-25
Exterior1st 18.611743 2.586089e-43
RoofStyle 17.805497 3.653523e-17
Exterior2nd 17.500840 4.842186e-43
BsmtCond 14.030600 5.136901e-09
BldgType 13.011077 2.056736e-10
LandContour 12.850188 2.742217e-08
GarageQual 9.570389 1.240803e-07
GarageCond 9.541161 1.309714e-07
ExterCond 8.798714 5.106681e-07
LotConfig 7.809954 3.163167e-06
RoofMatl 6.727305 7.231445e-08
Condition1 6.118017 8.904549e-08
Fence 4.948159 2.312646e-03
Heating 4.259819 7.534721e-04
Functional 4.057875 4.841697e-04
BsmtFinType2 2.702450 1.941009e-02
Street 2.459290 1.170486e-01
MiscFeature 2.157324 1.047276e-01
Condition2 2.073899 4.342566e-02
LandSlope 1.958817 1.413964e-01
PoolQC 1.627469 3.039853e-01
Utilities 0.298804 5.847168e-01
MSSubClass NaN NaN
MoSold NaN NaN
YrSold NaN NaN
Results from @kitman0804's method:
def anova(data, x, y):
    # Unique values in the column (NaN is not dropped here)
    x_val = data[x].unique()
    # Note: y is read from the global df_data, not from the data argument
    fstat = scipy.stats.f_oneway(*[df_data[y][data[x].isin([x_v])] for x_v in x_val])
    tbl = pd.DataFrame({'F-statistics': [fstat.statistic], 'P-value': [fstat.pvalue]})
    tbl.index = [x]
    return tbl

f2_table = pd.concat([anova(categorical_data, x, 'SalePrice') for x in categorical_data.columns])
F-statistics P-value
ExterQual 443.334831 1.439551e-204
KitchenQual 407.806352 3.032213e-192
BsmtQual 316.148635 8.158548e-196
GarageFinish 213.867028 6.228747e-115
FireplaceQu 121.075121 2.971217e-107
Foundation 100.253851 5.791895e-91
CentralAir 98.305344 1.809506e-22
HeatingQC 88.394462 2.667062e-67
MasVnrType 84.672201 1.054025e-64
GarageType 80.379992 6.117026e-87
Neighborhood 71.784865 1.558600e-225
BsmtFinType1 64.688200 2.386358e-71
BsmtExposure 63.939761 7.557758e-50
SaleCondition 45.578428 7.988268e-44
MSZoning 43.840282 8.817634e-35
PavedDrive 42.024179 1.803569e-18
LotShape 40.132852 6.447524e-25
MSSubClass 33.732076 8.662166e-79
SaleType 28.863054 5.039767e-42
GarageQual 25.776093 5.388762e-25
GarageCond 25.750153 5.711746e-25
BsmtCond 19.708139 8.195794e-16
HouseStyle 19.595001 3.376777e-25
Exterior1st 18.611743 2.586089e-43
Electrical 18.460192 8.226925e-18
RoofStyle 17.805497 3.653523e-17
Exterior2nd 17.500840 4.842186e-43
Alley 15.176614 2.996380e-07
Fence 13.433276 9.379977e-11
BldgType 13.011077 2.056736e-10
LandContour 12.850188 2.742217e-08
PoolQC 10.509853 7.700989e-07
ExterCond 8.798714 5.106681e-07
LotConfig 7.809954 3.163167e-06
BsmtFinType2 7.565378 5.225649e-08
RoofMatl 6.727305 7.231445e-08
Condition1 6.118017 8.904549e-08
Heating 4.259819 7.534721e-04
Functional 4.057875 4.841697e-04
MiscFeature 2.593622 3.500367e-02
Street 2.459290 1.170486e-01
Condition2 2.073899 4.342566e-02
LandSlope 1.958817 1.413964e-01
MoSold 0.957865 4.833523e-01
YrSold 0.645525 6.300888e-01
Utilities 0.298804 5.847168e-01
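The F-statistic and P-value are stored in the attributes statistic and pvalue of the <class 'scipy.stats.stats.F_onewayResult'> object that f_oneway returns.
You can simply extract those values and build the table from them. Here is a quick example.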
import scipy.stats
import pandas as pd
tillamook = [0.0571, 0.0813, 0.0831, 0.0976, 0.0817, 0.0859, 0.0735, 0.0659, 0.0923, 0.0836]
newport = [0.0873, 0.0662, 0.0672, 0.0819, 0.0749, 0.0649, 0.0835, 0.0725]
petersburg = [0.0974, 0.1352, 0.0817, 0.1016, 0.0968, 0.1064, 0.105]
magadan = [0.1033, 0.0915, 0.0781, 0.0685, 0.0677, 0.0697, 0.0764, 0.0689]
tvarminne = [0.0703, 0.1026, 0.0956, 0.0973, 0.1039, 0.1045]
fstat = scipy.stats.f_oneway(tillamook, newport, petersburg, magadan, tvarminne)
tbl = pd.DataFrame({'F-statistics': [fstat.statistic], 'P-value': [fstat.pvalue]})
tbl.index = ['OverallQual']
print(tbl)
# F-statistics P-value
# OverallQual 7.121019 0.000281
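As a small variation on the example above, the result can also be unpacked like a two-element tuple instead of being read attribute by attribute. This is only a sketch: the two samples and the names group_a, group_b, f_value and p_value are made up for illustration.

```python
import pandas as pd
import scipy.stats

# Two small made-up samples, purely for illustration.
group_a = [1.0, 2.0, 3.0]
group_b = [4.0, 5.0, 6.0]

# The result behaves like a (statistic, pvalue) tuple, so it can be unpacked.
f_value, p_value = scipy.stats.f_oneway(group_a, group_b)

# One-row table with the same column names as in the example above.
tbl = pd.DataFrame({'F-statistics': [f_value], 'P-value': [p_value]},
                   index=['example'])
print(tbl)
```

Both forms give the same numbers; attribute access is arguably easier to read, while unpacking is handy inside a loop.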
If you have several F-tests to run, you can use a function and a for loop. Here is an example:
df = pd.DataFrame({'a': [0,0,0,1,1,1,2,2,2], 'b': [0,1,1,0,0,1,1,0,0], 'outcome': [1,2,3,4,5,6,7,8,9]})

def anova(data, x, y, drop_nan=True):
    # Unique values in the column
    if drop_nan:
        x_val = data[x].dropna().unique()
    else:
        x_val = data[x].unique()
    # F-test
    fstat = scipy.stats.f_oneway(*[data[y][data[x].isin([x_v])] for x_v in x_val])
    # Tabulate the results
    tbl = pd.DataFrame({'F-statistics': [fstat.statistic], 'P-value': [fstat.pvalue]})
    tbl.index = ['{:}~{:}'.format(y, x)]
    return tbl

f_table = pd.concat([anova(df, x, 'outcome') for x in ['a', 'b']])
print(f_table)
#            F-statistics   P-value
# outcome~a     27.000000  0.001000
# outcome~b      0.216495  0.655852