
Bag-of-words model with python

I am trying to do sentiment analysis with Python on a bunch of txt documents. So far I have done the preprocessing and extracted only the important words from the text; e.g. I removed stop words and punctuation. I also created a kind of bag-of-words counting the term frequency. The next step would be to implement a corresponding model.

I am not experienced in machine learning or text mining. I am also uncertain about the way I created the bag-of-words model. Could you please have a look at my code and tell me if I am on the right track? I would also like to know whether my approach so far is a good basis for a model, and how to build a good model on top of it in order to categorize my documents.

This is my code:

import spacy
import string
import os,sys
import re
import numpy as np
np.set_printoptions(threshold=sys.maxsize)
from collections import Counter

# Load English tokenizer, tagger, parser, NER and word vectors
nlp_en = spacy.load("en_core_web_sm")
nlp_de = spacy.load("de_core_news_sm")
path_train = "Sentiment/Train/"
path_test = "Sentiment/Test/"
text_train = []
text_test = []


# Process whole documents
for filename in os.listdir(path_train):
    text = open(os.path.join(path_train, filename), encoding="utf8", errors='ignore').read()
    # Repair private-use glyphs left over from the source encoding
    text = text.replace("\ue004", "s").replace("\ue006", "y")
    # Strip URLs
    text = re.sub(r'https?:\/\/\S+', '', text)
    # Keep printable ASCII characters only and collapse whitespace
    text = "".join(filter(lambda x: x in string.printable, text))
    text = " ".join(text.split())
    # Lowercase everything
    text = text.lower()
    if filename.startswith("de_"):
        text_train.append(nlp_de(text))
    else:
        text_train.append(nlp_en(text))

docsClean = []
for doc in text_train:
    # Uncomment to inspect token attributes:
    # for token in doc:
    #     print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

    # Keep lemmas of tokens that are neither stop words, punctuation nor numbers
    cleanWords = [token.lemma_ for token in doc
                  if not token.is_stop and not token.is_punct and token.pos_ != "NUM"]
    docsClean.append(cleanWords)

print(docsClean)

for doc in docsClean:
    # One entry per token position: bag_vector[i] holds how often the
    # token at position i occurs in this document
    bag_vector = np.zeros(len(doc))
    for i, word in enumerate(doc):
        bag_vector[i] = doc.count(word)
    print(bag_vector)

This is what my bag-of-words model looks like: [screenshot of the printed bag-of-words vectors]
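For reference, one common next step is to count terms against a single vocabulary shared by all documents, so that every document maps to a vector of the same length (which is what a downstream classifier needs). A minimal sketch using scikit-learn's CountVectorizer, which is not part of the original code and assumes docsClean from above:

from sklearn.feature_extraction.text import CountVectorizer

# Re-join the cleaned lemmas into one string per document
corpus = [" ".join(words) for words in docsClean]

# Build a shared vocabulary and count term frequencies per document
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix of shape (n_docs, vocab_size)

print(vectorizer.get_feature_names_out())  # shared vocabulary (scikit-learn >= 1.0)
print(X.toarray())                         # one term-frequency row per document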

You could try using pandas' get_dummies.
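A minimal sketch of that idea, assuming the docsClean list from the question: get_dummies one-hot encodes each token, and summing the rows gives the term frequency per document.

import pandas as pd

doc = docsClean[0]                        # tokens of the first document
dummies = pd.get_dummies(pd.Series(doc))  # one indicator column per distinct token
counts = dummies.sum()                    # term frequency of each token in this document
print(counts)

As with the loop in the question, this counts terms within a single document; for a classifier you would still want one vocabulary shared across all documents, as in the CountVectorizer sketch above.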
