So I am creating my training and test sets for use in a Multiple Linear Regression model using sklearn.
my dataset contains 182 features looks like the following;
id feature1 feature2 .... feature182 Target
D24352 145 8 7 1
G09340 10 24 0 0
E40988 6 42 8 1
H42093 238 234 2 1
F32093 12 72 1 0
I have then have the following code;
import pandas as pd
dataset = pd.read_csv('C:\\mylocation\\myfile.csv')
dataset0 = dataset.set_index('t1.id')
dataset2 = pd.get_dummies(dataset0)
y = dataset0.iloc[:, 31:32].values
dataset2.pop('Target')
X = dataset2.iloc[:, :180].values
Once I use dataframe.iloc
however, I loose my indexes (which I have set to be my IDs). I would like to keep these as I currently have no way of telling which records in my results relate to which records in my original dataset
when I do the following step;
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
It looks like your data is stored as object
type. You should convert it to float64 (assuming that all your data is of numeric type. Else only convert those rows, that you want to have as numeric type). Since it turns out your index is of type string, you need to set the dtype
of your dataframe after setting the index (and generating the dummies). Again assuming that the rest of your data is of numeric type:
dataset = pd.read_csv('C:\\mylocation\\myfile.csv')
dataset0 = dataset.set_index('t1.id')
dataset2 = pd.get_dummies(dataset0)
dataset0 = dataset0.astype(np.float64) # add this line to explicitly set the dtype
Now you should be able to just leave out values
when slicing the DataFrame:
y = dataset0.iloc[:, 31:32]
dataset2.pop('Target')
X = dataset2.iloc[:, :180]
With .values
you access the underlying numpy arrays of the DataFrame. These do not have an index column. Since sklearn
is, in most cases, compatible with pandas
, you can simply pass a pandas DataFrame to sklearn.
If this does not work, you can still apply reset_index to your DataFrame. This will add the index as a new column, which you will have to drop when passing the training data to sklearn:
dataset0.reset_index(inplace=True)
dataset2.reset_index(inplace=True)
y = dataset0.iloc[:, 31:32].values
dataset2.pop('Target')
X = dataset2.iloc[:, :180].values
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train.drop('index', axis=1), y_train.drop('index', axis=1))
y_pred = regressor.predict(X_test.drop('index', axis=1))
In this case you'll still have to change the slicing [:, 31:32]
and [:, :180]
to the correct columns, so that the index will be included in the slice.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.