Splitting the data into training and test sets raises 'Found input variables with inconsistent numbers of samples: [1000, 23486]'
My project is to classify reviews as good or bad using NLP. I have imported the data and done the tokenisation and vectorisation using a bag-of-words model. Now I have to split the data into training and test sets, and I am getting an error saying "Found input variables with inconsistent numbers of samples: [1000, 23486]".
My file has a column called Review Text, and I want to classify the reviews as good or bad. I have attached the TSV file that I am using for this project. Please help me correct the error, and suggest any changes in approach that I could make. I have attached the code here too.
import numpy as np
import pandas as pd
import nltk
import matplotlib
dataset = pd.read_csv("C:/Users/a/Downloads/data.tsv", delimiter = "\t", quoting = 1)
dataset.head()
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(0, 1000):
    review = re.sub('[^a-zA-Z]', ' ', str(dataset['Review Text'][i]))
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    review = ' '.join(review)
    corpus.append(review)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 6].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
OK, the problem is that X and y must have the same number of samples (rows). Here X is built from the first 1000 reviews only, while y takes its labels from all 23486 rows.
If you want to use just 1000 reviews, you can keep the same for loop and then, when selecting y, just do:
y = dataset.iloc[:1000, 6].values
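To see why the slice helps, here is a minimal sketch on toy arrays rather than the real data. The sizes (10 feature rows, 25 labels) and the names y_all and mismatch_error are made up for illustration; only train_test_split comes from the question's code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 10 "vectorised reviews" with 4 features each, but 25 labels.
X = np.zeros((10, 4))
y_all = np.arange(25) % 2

# With mismatched lengths, train_test_split refuses to split.
try:
    train_test_split(X, y_all, test_size=0.2, random_state=0)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
print(mismatch_error)

# Slicing the labels down to the rows actually used makes the lengths agree.
y = y_all[:X.shape[0]]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```

Checking X.shape[0] against len(y) before calling train_test_split is a cheap way to catch this kind of mismatch early.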
Otherwise, if you want to use the whole dataset, you must edit the first line of the loop so that it iterates over every row, for example with range(len(dataset)) instead of range(0, 1000).
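A rough sketch of the whole-dataset variant follows. It assumes a small made-up DataFrame in place of the real TSV file, a hypothetical label column named "label" (the question selects column index 6 instead), and a simplified clean helper that skips the stemming and stop-word removal so the example runs without NLTK data.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real TSV file.
dataset = pd.DataFrame({
    "Review Text": [
        "Love this dress!",
        "Poor quality, returned it.",
        "Fits well and looks great",
        "Too small; bad fabric",
        "Comfortable and pretty",
    ],
    "label": [1, 0, 1, 0, 1],  # assumed label column, not from the question
})


def clean(text):
    """Keep letters only, lowercase, and collapse whitespace."""
    return " ".join(re.sub("[^a-zA-Z]", " ", str(text)).lower().split())


# Iterate over every row instead of a hard-coded range(0, 1000).
corpus = [clean(text) for text in dataset["Review Text"]]

cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus).toarray()
y = dataset["label"].values

# X and y now come from the same rows, so the split succeeds.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X.shape[0], len(y), len(y_train), len(y_test))
```

Because the corpus is built from the column itself, its length always follows the dataset, and X and y stay aligned even if the file grows.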