Encoding categorical data to numerical
I'm using this Kaggle dataset, and I'm trying to convert the categorical values to numerical ones so that I can apply regression.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
Here's an example of what I have tried so far.
import pandas as pd

train_data = pd.read_csv('train.csv')
column_contents = []
for row in train_data['Street']:
    if type(row) not in (int, float):
        column_contents.append(row)
unique_contents = set(column_contents)
ds = {}
for i, j in enumerate(unique_contents):
    ds[j] = i
train_data['Street'] = train_data['Street'].replace(ds.keys(), list(map(str, ds.values())), regex=True)
Thereafter, I created the following function to apply it to all the columns of the df:
def calculation(df, column):
    column_contents = []
    for row in df[column]:
        if type(row) not in (int, float):
            column_contents.append(row)
    unique_contents = set(column_contents)
    ds = {}
    for i, j in enumerate(unique_contents):
        ds[j] = i
    df[column] = df[column].replace(ds.keys(), list(map(str, ds.values())), regex=True)
    return df[column]

for column in train_data:
    train_data[column] = calculation(train_data, column)
However, this function does not work, and I think it is wrong on many levels. Any help will be appreciated.
Also, I am aware that this can be done using other modules (numpy), but I'd rather do it this way to practice.
Your code is correct except for the regex=True argument to replace. Since you want to replace values that match the keys exactly, you should not use regex. NaNs also have to be handled separately: a NaN is a float, so the type check in the loop skips it and it never gets a code.
Also, the calculation method already replaces the column in the DataFrame, so you don't have to return it and assign it again.
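To see why regex=True matters, here is a minimal sketch on a small Series. The labels and the mapping are made up for illustration and are not taken from the question's code. With regex=True each key is treated as a regular expression and matched anywhere inside the cell, so a short label can rewrite part of a longer one, and a label containing regex metacharacters such as parentheses may not match itself at all.

```python
import pandas as pd

# Hypothetical category labels, chosen to show the two failure modes.
s = pd.Series(["Gd", "GdPrv", "C (all)"])
mapping = {"Gd": "0", "C (all)": "1"}

# regex=True: keys are regular expressions matched anywhere in the string.
# "Gd" also hits the start of "GdPrv", and "C (all)" is read as a pattern
# with a capture group, so it does not match the literal parentheses.
as_regex = s.replace(mapping, regex=True)

# Default (regex=False): a key must equal the whole cell value.
as_literal = s.replace(mapping)

print(as_regex.tolist())    # ['0', '0Prv', 'C (all)']
print(as_literal.tolist())  # ['0', 'GdPrv', '1']
```

Only the second call leaves "GdPrv" alone and encodes "C (all)", which is the behaviour the encoding loop relies on.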
import pandas as pd

train_data = pd.read_csv('train.csv')
# Replace all NaNs with -1
train_data = train_data.fillna(-1)

def calculation(df, column):
    # Collect every non-numeric value in the column
    column_contents = []
    for row in df[column]:
        if type(row) not in (int, float):
            column_contents.append(row)
    unique_contents = set(column_contents)
    # Map each distinct value to an integer code
    ds = {}
    for i, j in enumerate(unique_contents):
        ds[j] = i
    # Exact-match replacement (no regex), then convert the column to float
    df[column] = df[column].replace(list(ds.keys()), list(map(str, ds.values()))).astype(float)

for column in train_data:
    calculation(train_data, column)
print(train_data.dtypes)
Output:
Id               float64
MSSubClass       float64
MSZoning         float64
LotFrontage      float64
LotArea          float64
...
MoSold           float64
YrSold           float64
SaleType         float64
SaleCondition    float64
SalePrice        float64
Length: 81, dtype: object
As you can see, all the columns are converted to float.
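For comparison with the hand-written loop, the same encoding can be sketched with pandas' built-in pd.factorize. This is an alternative, not the approach used above. The helper name encode_categoricals and the sample frame are hypothetical. factorize assigns codes in order of first appearance and marks missing values with -1, which lines up with filling NaNs with -1.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def encode_categoricals(df):
    """Return a copy of df with every non-numeric column integer-coded."""
    out = df.copy()
    for column in out.columns:
        if not is_numeric_dtype(out[column]):
            # codes follow order of first appearance; missing values become -1
            codes, _ = pd.factorize(out[column])
            out[column] = codes
    # Numeric columns can still hold NaN, so fill those with -1 as well
    return out.fillna(-1).astype(float)

# Made-up sample rows; column names borrowed from the dataset for readability
sample = pd.DataFrame({
    "Street": ["Pave", "Grvl", np.nan, "Pave"],
    "LotArea": [8450, 9600, np.nan, 11250],
})
encoded = encode_categoricals(sample)
print(encoded)
```

Unlike the set-based loop, whose integer assignment depends on set iteration order, the codes here are stable from run to run for the same input.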