Bag-of-words model with python
I am trying to do sentiment analysis with python on a bunch of txt documents. So far I have done the preprocessing and extracted only the important words from the text, e.g. I deleted stop-words and punctuation. I also created a kind of bag-of-words counting the term frequency. The next step would be to implement a corresponding model.

I am not experienced in machine learning or text mining. I am also uncertain about the way I created the bag-of-words model. Could you please have a look at my code and tell me if I am on the right track? I would also like to know whether my previous steps are a good basis for a model, and how to build a good model on that basis in order to categorize my documents.

This is my code:
import spacy
import string
import os, sys
import re
import numpy as np
np.set_printoptions(threshold=sys.maxsize)
from collections import Counter

# Load English tokenizer, tagger, parser, NER and word vectors
nlp_en = spacy.load("en_core_web_sm")
nlp_de = spacy.load("de_core_news_sm")

path_train = "Sentiment/Train/"
path_test = "Sentiment/Test/"

text_train = []
text_test = []

# Process whole documents
for filename in os.listdir(path_train):
    text = open(os.path.join(path_train, filename), encoding="utf8", errors='ignore').read()
    text = text.replace("\ue004", "s").replace("\ue006", "y")
    text = re.sub(r'^http?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    text = "".join(filter(lambda x: x in string.printable, text))
    text = " ".join(text.split())
    text = re.sub('[A-Z]+', lambda m: m.group(0).lower(), text)
    if filename.startswith("de_"):
        text_train.append(nlp_de(text))
    else:
        text_train.append(nlp_en(text))

docsClean = []
for doc in text_train:
    #for token in doc:
    #    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)
    cleanWords = [token.lemma_ for token in doc if token.is_stop == False and token.is_punct == False and token.pos_ != "NUM"]
    docsClean.append(cleanWords)

print(docsClean)

for doc in docsClean:
    bag_vector = np.zeros(len(doc))
    for w in doc:
        for i, word in enumerate(doc):
            if word == w:
                bag_vector[i] += 1
    print(bag_vector)
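For reference, the counting loop above builds one vector per document whose length equals that document's own token count, so vectors from different documents are not comparable. A common alternative is to fix one shared vocabulary across all documents and give every document a vector of that length. A minimal sketch (the toy `docs_clean` lists are hypothetical stand-ins for my `docsClean`):

```python
from collections import Counter

import numpy as np

# Toy cleaned documents standing in for docsClean (hypothetical data).
docs_clean = [["good", "movie", "good"], ["bad", "plot", "movie"]]

# Build one shared vocabulary across all documents.
vocab = sorted({w for doc in docs_clean for w in doc})
index = {w: i for i, w in enumerate(vocab)}

# One row per document, one column per vocabulary word.
bag = np.zeros((len(docs_clean), len(vocab)), dtype=int)
for row, doc in enumerate(docs_clean):
    for word, count in Counter(doc).items():
        bag[row, index[word]] = count

print(vocab)  # ['bad', 'good', 'movie', 'plot']
print(bag)    # [[0 2 1 0]
              #  [1 0 1 1]]
```

With this layout every document maps to the same feature space, which is what a classifier would expect as input.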
You could try pandas and get_dummies.
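A sketch of how that could look (the token lists and the groupby-sum step are my own illustration, not from the question): one-hot encode every token with `pd.get_dummies`, then sum the rows per document to get term counts.

```python
import pandas as pd

# Hypothetical token lists standing in for the cleaned documents.
docs_clean = [["good", "movie", "good"], ["bad", "plot", "movie"]]

# One row per (document, token) pair.
rows = pd.DataFrame(
    [(doc_id, token) for doc_id, doc in enumerate(docs_clean) for token in doc],
    columns=["doc", "token"],
)

# One-hot encode the tokens, then sum per document to get term counts.
bow = pd.get_dummies(rows["token"]).groupby(rows["doc"]).sum()
print(bow)
```

The result is a DataFrame with one row per document and one column per distinct token, i.e. the same document-term matrix as above.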