简体   繁体   中英

Preserving the index when selecting a slice of a pandas dataframe

So I am creating my training and test sets for use in a Multiple Linear Regression model using sklearn.

my dataset contains 182 features looks like the following;

id      feature1 feature2  ....  feature182 Target
D24352  145      8               7          1
G09340  10       24              0          0
E40988  6        42              8          1
H42093  238      234             2          1   
F32093  12       72              1          0

I have then have the following code;

import pandas as pd

dataset = pd.read_csv('C:\\mylocation\\myfile.csv')
dataset0 = dataset.set_index('t1.id')
dataset2 = pd.get_dummies(dataset0)
y = dataset0.iloc[:, 31:32].values
dataset2.pop('Target')
X = dataset2.iloc[:, :180].values

Once I use dataframe.iloc however, I loose my indexes (which I have set to be my IDs). I would like to keep these as I currently have no way of telling which records in my results relate to which records in my original dataset when I do the following step;

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

It looks like your data is stored as object type. You should convert it to float64 (assuming that all your data is of numeric type. Else only convert those rows, that you want to have as numeric type). Since it turns out your index is of type string, you need to set the dtype of your dataframe after setting the index (and generating the dummies). Again assuming that the rest of your data is of numeric type:

dataset = pd.read_csv('C:\\mylocation\\myfile.csv')
dataset0 = dataset.set_index('t1.id')
dataset2 = pd.get_dummies(dataset0)
dataset0 = dataset0.astype(np.float64)  # add this line to explicitly set the dtype

Now you should be able to just leave out values when slicing the DataFrame:

y = dataset0.iloc[:, 31:32]
dataset2.pop('Target')
X = dataset2.iloc[:, :180]

With .values you access the underlying numpy arrays of the DataFrame. These do not have an index column. Since sklearn is, in most cases, compatible with pandas , you can simply pass a pandas DataFrame to sklearn.

If this does not work, you can still apply reset_index to your DataFrame. This will add the index as a new column, which you will have to drop when passing the training data to sklearn:

dataset0.reset_index(inplace=True)
dataset2.reset_index(inplace=True)
y = dataset0.iloc[:, 31:32].values
dataset2.pop('Target')
X = dataset2.iloc[:, :180].values

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 0)

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train.drop('index', axis=1), y_train.drop('index', axis=1))

y_pred = regressor.predict(X_test.drop('index', axis=1))

In this case you'll still have to change the slicing [:, 31:32] and [:, :180] to the correct columns, so that the index will be included in the slice.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM