Readlines function for an xlsx file does not work properly
The goal is sentiment classification. The steps are to open 3 xlsx files, read them, process them with gensim.doc2vec methods and classify with SGDClassifier. I am just trying to reproduce this code on doc2vec.
Python 2.7
with open('C:/doc2v/trainpos.xlsx','r') as infile:
    pos_reviews = infile.readlines()
with open('C:/doc2v/trainneg.xlsx','r') as infile:
    neg_reviews = infile.readlines()
with open('C:/doc2v/unsup.xlsx','r') as infile:
    unsup_reviews = infile.readlines()
But it turned out that the resulting lists are not what I expected:
print 'length of pos_reviews is %s' % len(pos_reviews)
>>> length of pos_reviews is 1
The files contain 18, 1221 and 2203 rows respectively. I expected the lists to have the same number of elements.
The next step is to concatenate all the sentences.
y = np.concatenate((np.ones(len(pos_reviews)), np.zeros(len(neg_reviews))))
x_train, x_test, y_train, y_test = train_test_split(np.concatenate((pos_reviews, neg_reviews)), y, test_size=0.2)
This leads to a situation where x_train and x_test are lists of sentences, as they should be, while
y_train = [0.]
y_test = [1.]
After this split every sentence gets a label:
def labelizeReviews(reviews, label_type):
    labelized = []
    for i,v in enumerate(reviews):
        label = '%s_%s'%(label_type,i)
        labelized.append(LabeledSentence(v, [label]))
    return labelized
x_train = labelizeReviews(x_train, 'TRAIN')
x_test = labelizeReviews(x_test, 'TEST')
unsup_reviews = labelizeReviews(unsup_reviews, 'UNSUP')
As written in the numpy documentation, the arrays should be equal in size. But when I reduce the bigger files to 18 lines, nothing changes. I searched the forum and no one seems to have a similar error. I have racked my brain over what went wrong and how to fix it. Thanks for your help!
Generally you can't read Microsoft Excel files as text files using methods like readlines or read. You should either convert the files to another format first (a good choice is .csv, which can be read with the csv module) or use a dedicated Python module such as pyexcel or openpyxl to read the .xlsx files directly.
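Both options can be sketched roughly as follows. This is only an illustration and rests on assumptions that are not stated in the question: each review sits in the first column of its own row, the data is on the active worksheet, and the CSV export is UTF-8. The helper names read_reviews_csv and read_reviews_xlsx are made up for this example. The code is written for Python 3; openpyxl is a third-party package and has to be installed separately.

```python
import csv


def read_reviews_csv(path):
    """Return the first column of every non-empty row of a CSV file.

    Assumes one review per row, stored in the first column.
    """
    # newline='' is what the csv module expects on Python 3;
    # on Python 2.7 open the file with mode 'rb' instead.
    with open(path, newline='', encoding='utf-8') as infile:
        return [row[0] for row in csv.reader(infile) if row]


def read_reviews_xlsx(path):
    """Return the first column of every row of the active worksheet.

    Assumes one review per row, stored in the first column.
    """
    # Imported here so the CSV helper works without openpyxl installed.
    from openpyxl import load_workbook

    workbook = load_workbook(path, read_only=True)
    sheet = workbook.active
    # values_only=True yields plain cell values instead of Cell objects.
    reviews = [row[0] for row in sheet.iter_rows(values_only=True)
               if row and row[0] is not None]
    workbook.close()
    return reviews
```

With something like pos_reviews = read_reviews_xlsx('C:/doc2v/trainpos.xlsx'), the list should hold one entry per filled row instead of a single element, so the label array built with np.ones and np.zeros ends up with the same length as the concatenated reviews.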