Splitting data into training and test sets fails with "Found input variables with inconsistent numbers of samples: [1000, 23486]"

Question

My project is to classify reviews as good or bad using NLP. I have imported the data and done the tokenisation and vectorisation using a bag-of-words model. Now I have to split the data into training and test sets, and I am getting an error saying "Found input variables with inconsistent numbers of samples: [1000, 23486]".

My file has a column called Review Text, and I want to classify the reviews as good or bad. I have attached the TSV file that I am using for this project. Please help me correct the error, and suggest any change in approach that I could make. I have attached the code here too.

My data file here

import numpy as np
import pandas as pd
import nltk
import matplotlib

dataset = pd.read_csv("C:/Users/a/Downloads/data.tsv", delimiter = "\t", quoting = 1)
dataset.head()

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Clean, stem, and collect the first 1000 reviews only
corpus = []
for i in range(0, 1000):
  review = re.sub('[^a-zA-Z]', ' ', str(dataset['Review Text'][i]))
  review = review.lower()
  review = review.split()
  ps = PorterStemmer()
  review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
  review = ' '.join(review)
  corpus.append(review)

# Bag-of-words features: one row per review in corpus
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
# Labels are taken from column 6 of every row in the file
y = dataset.iloc[:, 6].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Answer 1

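OK, the problem is that X and y must have the same number of samples. Here X has 1000 rows, because the loop processes only the first 1000 reviews, while y has 23486 entries, one for every row of the dataset.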

If you want to use just 1000 reviews, you can keep the same for loop and then, when selecting y, just do:

y = dataset.iloc[:1000, 6].values
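As a rough illustration of why this works, here is a minimal sketch that reproduces the error with placeholder arrays and then fixes it by slicing the labels. The arrays X and y_all are made-up stand-ins (zeros, with only 5 feature columns to keep it light); only their lengths are taken from the error message in the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 1000 processed reviews, 23486 labels in the file.
X = np.zeros((1000, 5))
y_all = np.zeros(23486)

# Splitting arrays of different lengths raises the ValueError from the question.
try:
    train_test_split(X, y_all, test_size=0.2, random_state=0)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Keep only the labels that belong to the reviews that were processed.
y = y_all[:X.shape[0]]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Slicing with X.shape[0] rather than a literal 1000 keeps the two lengths in step if the number of processed reviews changes later.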

Otherwise, if you want to use the whole dataset, you must edit the first line of the loop so that it iterates over every review instead of only the first 1000.
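A minimal sketch of that second option, under some loud assumptions: the reviews and labels lists are made-up stand-ins for dataset['Review Text'] and column 6, STOP_WORDS is a tiny hypothetical replacement for nltk's English stop-word list (so nothing has to be downloaded), and stemming is left out for brevity. The point is only the loop bound, which follows the length of the data.

```python
import re

# Hypothetical stop-word list; the question uses nltk's stopwords instead.
STOP_WORDS = {"the", "is", "a", "and", "it", "this", "i"}


def clean_review(text):
    """Keep letters only, lowercase, and drop stop words (no stemming here)."""
    words = re.sub("[^a-zA-Z]", " ", str(text)).lower().split()
    return " ".join(w for w in words if w not in STOP_WORDS)


# Made-up data standing in for the review column and the label column.
reviews = ["I love this dress!", "The fabric is cheap and it tore.", None]
labels = [1, 0, 0]

# Iterate over every review instead of a hard-coded range(0, 1000).
corpus = [clean_review(reviews[i]) for i in range(len(reviews))]

print(corpus)
print(len(corpus) == len(labels))
```

With the real data, the equivalent change would be range(len(dataset)) in the question's loop, so that corpus, and therefore X, ends up with as many rows as y.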

Answer 1 (accepted, 2019-12-07 18:35:35)
