
Encoding categorical data to numerical

I'm using this Kaggle dataset, and I'm trying to convert the categorical values to numerical ones so that I can apply regression.

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

Here's an example of what I have tried so far.

import pandas as pd

train_data = pd.read_csv('train.csv')

column_contents = []
for row in train_data['Street']:
    if type(row) not in (int,float):
        column_contents.append(row)
        unique_contents = set(column_contents)

ds = {}
for i,j in enumerate(unique_contents):
    ds[j] = i

train_data['Street'] = train_data['Street'].replace(ds.keys(), list(map(str, ds.values())), regex=True)

Thereafter, I created the following function to apply it to all the columns of the df:

def calculation(df,column):
    column_contents = []
    for row in df[column]:
        if type(row) not in (int,float):
            column_contents.append(row)
            unique_contents = set(column_contents)

    ds = {}
    for i,j in enumerate(unique_contents):
        ds[j] = i

    df[column] = df[column].replace(ds.keys(), list(map(str, ds.values())), regex=True)

    return df[column]

for column in train_data:
    train_data[column] = calculation(train_data,column)

However, this function does not work, and I think it is wrong on many levels. Any help will be appreciated. I am also aware that this can be done using other modules (numpy), but I'd rather do it this way to practice.

You have coded it correctly except for using regex=True in replace. Since you want to replace each value that exactly matches a key with its numeric code, you should not use regex. NaNs also have to be handled separately.
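To illustrate why regex=True gets in the way, here is a minimal sketch on toy values (the Series below is made up for illustration, not read from train.csv). With regex=True, pandas treats the key as a pattern and substitutes it wherever it occurs inside a string, so a key that is a prefix of another value corrupts that value. An exact-match replace does not have this problem.

```python
import pandas as pd

# Toy values, assumed for illustration: one category is a prefix of the other
s = pd.Series(["Sawyer", "SawyerW"])

# regex=True: "Sawyer" is treated as a pattern and is also substituted inside "SawyerW"
as_regex = s.replace("Sawyer", "0", regex=True)

# Default (regex=False): only values equal to a key are replaced
exact = s.replace({"Sawyer": "0", "SawyerW": "1"})

print(as_regex.tolist())
print(exact.tolist())
```

In the regex version the second value keeps its trailing "W", so it can no longer be cast to float.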

Also, the calculation function already replaces the column in the dataframe, so you don't have to return it and assign it again.
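A minimal sketch of that point, using a made-up two-row DataFrame and a hypothetical helper named encode_in_place: assigning to df[column] inside a function changes the DataFrame object the caller passed in, so the calling code does not need to reassign anything.

```python
import pandas as pd

def encode_in_place(df, column, mapping):
    # Hypothetical helper: column assignment mutates the DataFrame that was passed in
    df[column] = df[column].map(mapping)

# Toy data, assumed for illustration
toy = pd.DataFrame({"Street": ["Pave", "Grvl"]})
encode_in_place(toy, "Street", {"Pave": 0, "Grvl": 1})  # return value is not used

print(toy["Street"].tolist())
```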

Code:

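import pandas as pd

train_data = pd.read_csv('train.csv')
# Replace all NaNs with -1
train_data = train_data.fillna(-1)

def calculation(df,column):
  # Collect the non-numeric (categorical) values in this column
  column_contents = []
  for row in df[column]:
    if type(row) not in (int,float):
      column_contents.append(row)

  # Map each distinct category to an integer code
  # (set iteration order is arbitrary, so the codes are too)
  unique_contents = set(column_contents)
  ds = {}
  for i,j in enumerate(unique_contents):
    ds[j] = i

  # Exact-match replacement (no regex), then cast the column to float
  df[column] = df[column].replace(list(ds.keys()), list(map(str, ds.values()))).astype(float)

# calculation modifies train_data in place, so nothing is reassigned here
for column in train_data:
  calculation(train_data,column)

print(train_data.dtypes)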

Output:

Id               float64
MSSubClass       float64
MSZoning         float64
LotFrontage      float64
LotArea          float64
                  ...   
MoSold           float64
YrSold           float64
SaleType         float64
SaleCondition    float64
SalePrice        float64
Length: 81, dtype: object

As you can see, all the columns are converted to float.
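For comparison, the same kind of encoding can be sketched with pandas' built-in pd.factorize instead of the hand-written loop. This is a swapped-in alternative, not the approach from the answer above. The toy DataFrame and the helper name encode_categoricals are assumptions for illustration. pd.factorize assigns integer codes in order of first appearance and uses -1 for missing values by default, so the categorical columns need no separate fillna step.

```python
import pandas as pd

def encode_categoricals(df):
    # Hypothetical helper: returns a copy with every non-numeric column integer-coded
    out = df.copy()
    for column in out.columns:
        if not pd.api.types.is_numeric_dtype(out[column]):
            # codes are integers; missing values get the sentinel -1
            codes, uniques = pd.factorize(out[column])
            out[column] = codes
    return out

# Toy data, assumed for illustration (not read from train.csv)
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", None, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})
encoded = encode_categoricals(toy)
print(encoded)
```

Numeric columns are left untouched in this sketch, so any NaNs in them would still need to be handled separately.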
