

Splitting the data into training and test sets fails with "Found input variables with inconsistent numbers of samples: [1000, 23486]"

My project is to classify reviews as good or bad using NLP. I have imported the data and done the tokenisation and vectorisation using a bag-of-words model. Now I have to split the data into training and test sets, and I am getting an error saying "Found input variables with inconsistent numbers of samples: [1000, 23486]".

My file has a column called Review Text, and I want to classify the reviews as good or bad. I have attached the TSV file that I am using for this project. Please help me correct the error, and suggest any change in approach I could make. I have attached the code here too.

My data file here

import numpy as np
import pandas as pd
import nltk
import matplotlib

dataset = pd.read_csv("C:/Users/a/Downloads/data.tsv", delimiter = "\t", quoting = 1)
dataset.head()

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(0, 1000):
  review = re.sub('[^a-zA-Z]', ' ', str(dataset['Review Text'][i]))
  review = review.lower()
  review = review.split()
  ps = PorterStemmer()
  review = [ps.stem(word) for word in review if not word in 
  set(stopwords.words('english'))]
  review = ' '.join(review)
  corpus.append(review)

# Bag-of-words features (runs once, after the loop has finished)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 6].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

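OK, the problem is that X and y must have the same number of samples (rows). Here X has 1000 rows because the loop only processes the first 1000 reviews, while y takes all 23486 rows of the file.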
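A quick way to confirm this kind of mismatch is to compare the first dimension of each array before calling train_test_split. The sketch below uses placeholder arrays with the sizes from the error message, not the real data, and the name y_matched is only illustrative.

```python
import numpy as np

# Placeholder arrays with the sizes from the error message (not the real data).
X = np.zeros((1000, 1500))   # features built from the first 1000 reviews only
y = np.zeros(23486)          # labels taken from every row of the file

# The first dimension is the number of samples; these two differ.
print(X.shape[0], y.shape[0])

# Truncate y to the rows that were actually vectorised.
y_matched = y[:X.shape[0]]
same_length = X.shape[0] == y_matched.shape[0]
print(same_length)
```

Printing the two shapes right after building X and y usually shows which side of the pipeline dropped or skipped rows.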

If you want to use just 1000 reviews, you can keep the same for loop and then, when selecting y, take only the first 1000 rows:

y = dataset.iloc[:1000, 6].values

Otherwise, if you want to use the whole dataset, you must edit the first line of the loop so that its range covers every row rather than only the first 1000.
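As a rough sketch of the whole-dataset variant, the loop can iterate over the column itself, so the corpus always has one entry per row. The tiny DataFrame and its "label" column are made up for illustration and stand in for the real TSV file. To stay self-contained, the sketch also uses CountVectorizer's built-in stop_words="english" option in place of the NLTK stop-word list, skips stemming, and fills missing reviews with an empty string instead of calling str() on them.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real TSV file; the "label" column is invented.
dataset = pd.DataFrame({
    "Review Text": [
        "Love this dress, fits perfectly",
        "Terrible fabric, returned it",
        None,  # a missing review
        "Great colour and very comfortable",
        "Poor quality, seams came apart",
        "Nice top, would buy again",
    ],
    "label": [1, 0, 0, 1, 0, 1],
})

# Clean every row, so the corpus has exactly one entry per label.
corpus = []
for text in dataset["Review Text"].fillna(""):
    cleaned = re.sub("[^a-zA-Z]", " ", text).lower()
    corpus.append(" ".join(cleaned.split()))

# Built-in English stop-word list instead of the manual NLTK filtering.
cv = CountVectorizer(max_features=1500, stop_words="english")
X = cv.fit_transform(corpus).toarray()
y = dataset["label"].values

# Both inputs now have the same number of samples, so the split succeeds.
assert X.shape[0] == y.shape[0]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape[0], X_test.shape[0])
```

Iterating over the column rather than over a hard-coded range means the corpus length follows the file size automatically if the data changes.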
