
Pandas for Python: Exception: Data must be 1-dimensional

Here's what I got from a tutorial:

# Data Preprocessing

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

# Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

# Encoding categorical data
# Encoding the Independent Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
# Encoding the Dependent Variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

This is the X matrix with encoded dummy variables:

1.000000000000000000e+00    0.000000000000000000e+00    0.000000000000000000e+00    4.400000000000000000e+01    7.200000000000000000e+04
0.000000000000000000e+00    0.000000000000000000e+00    1.000000000000000000e+00    2.700000000000000000e+01    4.800000000000000000e+04
0.000000000000000000e+00    1.000000000000000000e+00    0.000000000000000000e+00    3.000000000000000000e+01    5.400000000000000000e+04
0.000000000000000000e+00    0.000000000000000000e+00    1.000000000000000000e+00    3.800000000000000000e+01    6.100000000000000000e+04
0.000000000000000000e+00    1.000000000000000000e+00    0.000000000000000000e+00    4.000000000000000000e+01    6.377777777777778101e+04
1.000000000000000000e+00    0.000000000000000000e+00    0.000000000000000000e+00    3.500000000000000000e+01    5.800000000000000000e+04
0.000000000000000000e+00    0.000000000000000000e+00    1.000000000000000000e+00    3.877777777777777857e+01    5.200000000000000000e+04
1.000000000000000000e+00    0.000000000000000000e+00    0.000000000000000000e+00    4.800000000000000000e+01    7.900000000000000000e+04
0.000000000000000000e+00    1.000000000000000000e+00    0.000000000000000000e+00    5.000000000000000000e+01    8.300000000000000000e+04
1.000000000000000000e+00    0.000000000000000000e+00    0.000000000000000000e+00    3.700000000000000000e+01    6.700000000000000000e+04

The problem is that there are no column labels. I tried:

something = pd.get_dummies(X)

But I get the following exception:

Exception: Data must be 1-dimensional
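
This exception comes from pandas rather than sklearn: pd.get_dummies() expects a 1-D Series (or a DataFrame whose columns it can encode one by one), while X at this point is a 2-D NumPy array that pandas cannot coerce into a single Series. As a rough sketch of the more direct route, assuming the categorical column in Data.csv is named 'Country' (the name isn't shown in the question; it is inferred from the France/Germany/Spain labels that appear later in this answer), you could run get_dummies on the DataFrame before extracting .values:

# Sketch only: 'Country' is an assumed column name.
import pandas as pd

dataset = pd.read_csv('Data.csv')
predictors = dataset.iloc[:, :-1]          # still a DataFrame, so column labels survive
dummied = pd.get_dummies(predictors, columns=['Country'])

The rest of this answer keeps the sklearn encoders from the tutorial and simply re-attaches the column names afterwards.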

Most sklearn methods don't care about column names, as they're mainly concerned with the math behind the ML algorithms they implement. You can add column names back onto the OneHotEncoder output after fit_transform(), if you can figure out the label encoding ahead of time.

First, grab the column names of your predictors from the original dataset, excluding the first one (which we reserve for LabelEncoder):

X_cols = dataset.columns[1:-1]
X_cols
# Index(['Age', 'Salary'], dtype='object')

Now get the order of the encoded labels. In this particular case, it looks like LabelEncoder() organizes its integer mapping alphabetically:

labels = labelencoder_X.fit(X[:, 0]).classes_ 
labels
# ['France' 'Germany' 'Spain']

Combine these column names, and then add them to X when you convert to a DataFrame:

# X gets re-used, so make sure to define encoded_cols after this line
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
encoded_cols = np.append(labels, X_cols)
# ...
X = onehotencoder.fit_transform(X).toarray()
encoded_df = pd.DataFrame(X, columns=encoded_cols)

encoded_df
   France  Germany  Spain        Age        Salary
0     1.0      0.0    0.0  44.000000  72000.000000
1     0.0      0.0    1.0  27.000000  48000.000000
2     0.0      1.0    0.0  30.000000  54000.000000
3     0.0      0.0    1.0  38.000000  61000.000000
4     0.0      1.0    0.0  40.000000  63777.777778
5     1.0      0.0    0.0  35.000000  58000.000000
6     0.0      0.0    1.0  38.777778  52000.000000
7     1.0      0.0    0.0  48.000000  79000.000000
8     0.0      1.0    0.0  50.000000  83000.000000
9     1.0      0.0    0.0  37.000000  67000.000000

NB: For example data I'm using this dataset, which seems either very similar or identical to the one used by OP. Note how the output is identical to OP's X matrix.
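
One version-dependent caveat, offered as a hedged note rather than part of the original answer: Imputer and the categorical_features argument of OneHotEncoder were deprecated around scikit-learn 0.20 and removed in later releases. On current versions the same preprocessing is usually written with SimpleImputer plus ColumnTransformer, which also hands you column names directly via get_feature_names_out(). A minimal sketch, assuming the dataset's columns are named 'Country', 'Age' and 'Salary':

# Sketch for newer scikit-learn; the column names are assumptions based on the output above.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1]                      # keep it as a DataFrame

ct = ColumnTransformer(transformers=[
    ('country', OneHotEncoder(), ['Country']),                       # one-hot the categorical column
    ('numeric', SimpleImputer(strategy='mean'), ['Age', 'Salary']),  # mean-impute the numeric ones
])
X_encoded = ct.fit_transform(X)
encoded_df = pd.DataFrame(X_encoded, columns=ct.get_feature_names_out())

The resulting column names carry the transformer prefixes (e.g. something like country__Country_France), but the values should match the matrix above.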
