How to use NumPy arrays in Scikit-learn

Question

For a machine learning project, I built a pandas DataFrame to use as input to scikit-learn:

  label                                             vector
0      0   1:0.02776011 2:-0.009072121 3:0.05915284 4:-0...
1      1   1:0.014463682 2:-0.00076486735 3:0.044999316 ...
2      1   1:0.010583069 2:-0.0072133583 3:0.03766079 4:...
3      0   1:0.02776011 2:-0.009072121 3:0.05915284 4:-0...
4      1   1:0.039645035 2:-0.039485127 3:0.0898234 4:-0...
..   ...                                                ...
95     0   1:-0.013014212 2:-0.008092734 3:0.050860845 4...
96     0   1:-0.038887568 2:-0.007960074 3:0.03387617 4:...
97     0   1:-0.038887568 2:-0.007960074 3:0.03387617 4:...
98     0   1:-0.038887568 2:-0.007960074 3:0.03387617 4:...
99     0   1:-0.038887568 2:-0.007960074 3:0.03387617 4:...

Here, label is the class label of each dataset record and vector is the feature vector of that record.

To pass the data frame to scikit-learn, I'm creating two separate arrays: one from the label column (y) and one from the vector column (X).

As suggested here, I create the X array like this:

X = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print(X.astype(float).to_numpy())
print(X)

This works, and the output is:

               1               2            3  ...            298            299           300
0     0.02776011    -0.009072121   0.05915284  ...  0.00035095372    -0.01569933  -0.010564591
1    0.014463682  -0.00076486735  0.044999316  ...   -0.008144852  -0.0066369134  -0.013060478
2    0.010583069   -0.0072133583   0.03766079  ...   0.0041615684    0.008569179  -0.008645372
3     0.02776011    -0.009072121   0.05915284  ...  0.00035095372    -0.01569933  -0.010564591
4    0.039645035    -0.039485127    0.0898234  ...   0.0046293125     0.01663368   0.010215017
..           ...             ...          ...  ...            ...            ...           ...
95  -0.013014212    -0.008092734  0.050860845  ...   0.0021799654   -0.011884902   0.016460473
96  -0.038887568    -0.007960074   0.03387617  ...   0.0057248613    0.026993237   0.025746094
97  -0.038887568    -0.007960074   0.03387617  ...   0.0057248613    0.026993237   0.025746094
98  -0.038887568    -0.007960074   0.03387617  ...   0.0057248613    0.026993237   0.025746094
99  -0.038887568    -0.007960074   0.03387617  ...   0.0057248613    0.026993237   0.025746094

[100 rows x 300 columns]

The 100 rows are my records and the 300 columns are the vector features.
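
For reference, a minimal sketch of the parsing step above on a made-up two-row sample (the name `sample` and its values are illustrative, not taken from the real file). One detail worth noting: `astype(float)` returns a new object rather than converting in place, so the result has to be assigned if the float matrix is wanted later.

```python
import pandas as pd

# Hypothetical sample in the same "index:value" format as the vector column
sample = pd.DataFrame({
    "label": ["0", "1"],
    "vector": ["1:0.5 2:-0.25 3:0.125", "1:1.5 2:2.5 3:-3.5"],
})

# One dict per row, e.g. {"1": "0.5", "2": "-0.25", ...}; values are still strings
parsed = pd.DataFrame(
    [dict(item.split(":") for item in row.split()) for row in sample["vector"]]
)

# astype() does not modify in place, so keep the converted array
X = parsed.astype(float).to_numpy()

print(X.shape)  # (2, 3)
print(X.dtype)  # float64
```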

To create the y array, as suggested here, I'm doing this instead:

y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1, 1)
print(y)

The output is:

[100 rows x 2 columns]
[[0]
 [1]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [...]
]

I get a NumPy array with the 100 records, but the output reports 2 columns instead of 1.

I think this is what causes the following warning. Is that right?

/Users/mac-pro/scikit_learn/lib/python3.7/site-packages/sklearn/utils/validation.py:760: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

If so, how can I get output shaped like the one I got for the X array?

If it helps, here is the full code:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from scipy.stats import sem


r_filenameTSV = 'TSV/A19784_test3886.tsv'

tsv_read = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])

df = pd.DataFrame(tsv_read)

# Split each line once: the first token is the label, the rest is the vector
df = pd.DataFrame(df.vector.str.split(' ', n=1).tolist(),
                  columns=['label', 'vector'])

print(df)


y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1, 1)
print(y)
#exit()

X = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print(X.astype(float).to_numpy())
print(X)
#exit()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=0)


clf = svm.SVC(kernel='rbf',
              C=100,
              gamma=0.001,
              )
scores = cross_val_score(clf, X, y, cv=10)

print ("K-Folds scores:")
print (scores) 

#Train the model using the training sets
clf.fit (X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

print ("Metrics and Scoring:")
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
print("F1:",metrics.f1_score(y_test, y_pred))

print ("Classification Report:")
print (metrics.classification_report(y_test, y_pred,labels=[0,1]))

Thanks again for your time.

Answer 1

As the warning says, you just need to change the shape of your y array.

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
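
To make the two shapes in that message concrete, here is a small sketch using made-up labels (the names `y_col` and `y_flat` are illustrative only). A column vector is a 2-D array with one column, while scikit-learn expects a 1-D array of length n_samples.

```python
import numpy as np

# Hypothetical labels, stored as a column vector like in the question
y_col = np.array([0, 1, 1, 0]).reshape(-1, 1)
print(y_col.shape)   # (4, 1): 2-D, one column

# ravel() flattens to the 1-D shape (n_samples,)
y_flat = y_col.ravel()
print(y_flat.shape)  # (4,)
```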

You therefore have two options; either of the following snippets will solve it.

Option 1:

y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1,1).ravel()
print(y.shape)
# Output
(8,)

Option 2:

y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1,)
print(y.shape)
# Output
(8,)
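
Both options first build a one-row DataFrame and then undo that shape. A shorter route, sketched here on a hypothetical three-row stand-in for `df`, is to convert the label column itself, since a pandas Series already converts to a 1-D array. This is an alternative to the two options above, not a correction of them.

```python
import pandas as pd

# Hypothetical stand-in for the question's df; labels are strings after the split
df = pd.DataFrame({"label": ["0", "1", "1"]})

# pd.DataFrame([df.label]) wraps the Series in a list, giving a single row
print(pd.DataFrame([df.label]).shape)  # (1, 3)

# Converting the column directly yields shape (n_samples,)
y = df["label"].astype(int).to_numpy()
print(y.shape)  # (3,)
```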

Hope this helps!
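
As a quick check, the following sketch fits the same kind of SVC on synthetic data (random features and alternating labels, purely illustrative) and counts the DataConversionWarning instances raised for each label shape. The helper `fit_warnings` is a hypothetical name introduced for this example; the exact number of warnings may vary between scikit-learn versions, so no output is shown.

```python
import warnings

import numpy as np
from sklearn.exceptions import DataConversionWarning
from sklearn.svm import SVC

# Synthetic data, for illustration only: 20 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.arange(20) % 2  # alternating 0/1 labels, shape (20,)


def fit_warnings(labels):
    """Fit an SVC and return the DataConversionWarnings raised while fitting."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        SVC(kernel="rbf", C=100, gamma=0.001).fit(X, labels)
    return [w for w in caught if issubclass(w.category, DataConversionWarning)]


print(len(fit_warnings(y.reshape(-1, 1))))  # column vector, shape (20, 1)
print(len(fit_warnings(y)))                 # 1-D array, shape (20,)
```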
