
Preserving the index when selecting a slice of a pandas dataframe

I am creating my training and test sets for use in a multiple linear regression model using sklearn.

My dataset contains 182 features and looks like the following:

id      feature1 feature2  ....  feature182 Target
D24352  145      8               7          1
G09340  10       24              0          0
E40988  6        42              8          1
H42093  238      234             2          1   
F32093  12       72              1          0

I then have the following code:

import pandas as pd

dataset = pd.read_csv('C:\\mylocation\\myfile.csv')
dataset0 = dataset.set_index('t1.id')
dataset2 = pd.get_dummies(dataset0)
y = dataset0.iloc[:, 31:32].values
dataset2.pop('Target')
X = dataset2.iloc[:, :180].values

Once I use dataframe.iloc, however, I lose my index (which I have set to be my IDs). I would like to keep it, as I currently have no way of telling which records in my results relate to which records in my original dataset when I do the following step:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

It looks like your data is stored as object type. You should convert it to float64 (assuming that all your data is numeric; otherwise, convert only the columns that you want to be numeric). Since it turns out your index is of type string, you need to set the dtype of your DataFrame after setting the index (and generating the dummies). Again assuming that the rest of your data is numeric:

import numpy as np
import pandas as pd

dataset = pd.read_csv('C:\\mylocation\\myfile.csv')
dataset0 = dataset.set_index('t1.id')
dataset2 = pd.get_dummies(dataset0)
dataset0 = dataset0.astype(np.float64)  # add this line to explicitly set the dtype
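
As a rough illustration of the two conversion options mentioned above (all columns at once, or only selected ones), here is a minimal sketch. The toy frame and its column names are assumptions standing in for the real CSV, and the values are deliberately stored as strings to mimic non-numeric storage.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real CSV; values are strings on purpose.
toy = pd.DataFrame(
    {
        "id": ["D24352", "G09340", "E40988"],
        "feature1": ["145", "10", "6"],
        "feature2": ["8", "24", "42"],
        "Target": ["1", "0", "1"],
    }
).set_index("id")

# Option 1: convert every column at once (only valid if all of them are numeric).
all_numeric = toy.astype(np.float64)

# Option 2: convert only selected columns and leave the others untouched.
partly_numeric = toy.astype({"feature1": np.float64, "feature2": np.float64})

print(all_numeric.dtypes)
print(partly_numeric.dtypes)
print(all_numeric.index.tolist())  # the string ids are still the index
```

In both cases the string ids stay in the index, because astype only touches the columns.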

Now you should be able to just leave out .values when slicing the DataFrame:

y = dataset0.iloc[:, 31:32]
dataset2.pop('Target')
X = dataset2.iloc[:, :180]

With .values you access the underlying numpy arrays of the DataFrame. These do not have an index column. Since sklearn is, in most cases, compatible with pandas, you can simply pass a pandas DataFrame to sklearn.
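
To make this concrete, here is a small sketch of how the ids can survive the split and be reattached to the predictions. The toy data, the column selection by name, and the name "prediction" are assumptions for illustration, not part of the original question.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the real frame would come from the CSV.
toy = pd.DataFrame(
    {
        "feature1": [145.0, 10.0, 6.0, 238.0, 12.0],
        "feature2": [8.0, 24.0, 42.0, 234.0, 72.0],
        "Target": [1.0, 0.0, 1.0, 1.0, 0.0],
    },
    index=pd.Index(["D24352", "G09340", "E40988", "H42093", "F32093"], name="id"),
)

X = toy[["feature1", "feature2"]]  # DataFrame slice: keeps the id index
y = toy["Target"]                  # Series: keeps the id index

# train_test_split returns pandas objects when given pandas objects,
# so each part carries the ids of its rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

# predict returns a plain numpy array, so reattach the ids from X_test.
y_pred = pd.Series(regressor.predict(X_test), index=X_test.index, name="prediction")
print(y_pred)
```

Each predicted value is then labelled with the id of the record it belongs to, which can be used to look the record up in the original dataset.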

If this does not work, you can still apply reset_index to your DataFrame. This will add the index as a new column, which you will have to drop when passing the training data to sklearn:

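dataset0.reset_index(inplace=True)
dataset2.reset_index(inplace=True)
# No .values here: the slices must stay DataFrames so that .drop works below.
# Column positions shift by one because the id column is now at position 0.
y = dataset0.iloc[:, 31:32]
dataset2.pop('Target')
X = dataset2.iloc[:, :180]

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
# reset_index names the new column after the index ('t1.id' here);
# it is only called 'index' when the index has no name.
regressor.fit(X_train.drop('t1.id', axis=1), y_train)

y_pred = regressor.predict(X_test.drop('t1.id', axis=1))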

In this case you will still have to change the slices [:, 31:32] and [:, :180] to the correct columns, so that the index column is included in the slice.
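
The following sketch shows how reset_index changes the column layout, which is why the positional slices need adjusting. The toy frames are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frame with a named index, similar to set_index on an id column.
toy = pd.DataFrame(
    {"feature1": [145, 10], "Target": [1, 0]},
    index=pd.Index(["D24352", "G09340"], name="id"),
)

flat = toy.reset_index()
named_columns = flat.columns.tolist()  # the former index becomes the first column

# With an unnamed index, the new column falls back to the name 'index'.
unnamed_columns = pd.DataFrame({"a": [1]}).reset_index().columns.tolist()

# Position 0 is now the id column, so every feature position shifts by one.
ids = flat.iloc[:, 0]
features = flat.iloc[:, 1:2]

print(named_columns)
print(unnamed_columns)
```

Selecting columns by name rather than by position would avoid having to recount the positions after the reset.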


 