
Issue with tokenizing words with NLTK in Python. Returning lists of single letters instead of words

I am having some problems with my NLP Python program. I am trying to create a dataset of positive and negative tweets, but when I run the code it only returns what appear to be tokenized single letters. I am new to Python and NLP, so I apologize if this is basic or my explanation is poor. My code is below:

import csv
import random
import re
import string
import mysql.connector
from nltk import FreqDist, classify, NaiveBayesClassifier
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize


def remove_noise(tweet_tokens, stop_words=()):
    cleaned_tokens = []
    for token, tag in pos_tag(tweet_tokens):
        token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|' \
                  '(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', token)
        token = re.sub("(@[A-Za-z0-9_]+)", "", token)

        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        lemmatizer = WordNetLemmatizer()
        token = lemmatizer.lemmatize(token, pos)

        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    print(token)
    return cleaned_tokens


def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token


def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)


if __name__ == "__main__":

    with open('positive_tweets.csv') as csv_file:
        positive_tweets = csv.reader(csv_file, delimiter=',')
    with open('negative_tweets.csv') as csv_file:
        negative_tweets = csv.reader(csv_file, delimiter=',')

    stop_words = stopwords.words('english')

    positive_tweet_tokens = word_tokenize(positive_tweets)
    negative_tweet_tokens = word_tokenize(negative_tweets)

    positive_cleaned_tokens_list = []
    negative_cleaned_tokens_list = []

    for tokens in positive_tweet_tokens:
        positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

    for tokens in negative_tweet_tokens:
        negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

    all_pos_words = get_all_words(positive_cleaned_tokens_list)
    all_neg_words = get_all_words(negative_cleaned_tokens_list)

    freq_dist_pos = FreqDist(all_pos_words)
    freq_dist_neg = FreqDist(all_neg_words)
    print(freq_dist_pos.most_common(10))
    print(freq_dist_neg.most_common(10))

    positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
    negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)

    positive_dataset = [(tweet_dict, 'positive')
                        for tweet_dict in positive_tokens_for_model]

    negative_dataset = [(tweet_dict, 'negative')
                        for tweet_dict in negative_tokens_for_model]

    dataset = positive_dataset + negative_dataset

    random.shuffle(dataset)

    train_data = dataset[:7000]
    test_data = dataset[7000:]

    classifier = NaiveBayesClassifier.train(train_data)

    print("Accuracy is:", classify.accuracy(classifier, test_data))

A snippet of one of the CSV files, for reference:

    "tweetid","username","created_at","tweet","location","place","classification"
"1285666943073161216","MeFixerr","2020-07-21 20:04:20+00:00","Overwhelmed by all the calls, msgs and tweets. I apologize for getting lost without prior notice. Did not expect to be missed with such fervor. 
I am good & taking a break. Lots of love and dua's for everyone of you in #PTIFamily ❤","Pakistan, Quetta",,"positive"

Your tokens come from the file name ('positive_tweets.csv'), not from the data in the file. Add a print statement as shown below and you will see the problem.

positive_tweet_tokens = word_tokenize(positive_tweets)
negative_tweet_tokens = word_tokenize(negative_tweets)
print("tokens=", positive_tweet_tokens)  # add this line

Output from the complete script:

tokens= ['positive_tweets.csv']
v
v
[('e', 3), ('v', 2), ('p', 1), ('w', 1), ('c', 1)]
[('e', 4), ('v', 2), ('n', 1), ('g', 1), ('w', 1), ('c', 1)]
Accuracy is: 0
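
That one-element list is also where the single letters come from: the later loop binds tokens to the string 'positive_tweets.csv', and iterating a Python string yields individual characters, which pos_tag then tags one at a time. A minimal sketch of the failure mode (my illustration, not part of the original answer):

from nltk import pos_tag
from nltk.tokenize import word_tokenize

tokens = word_tokenize('positive_tweets.csv')
print(tokens)  # ['positive_tweets.csv'] -- the file name as a single token
for t in tokens:
    # t is a string; pos_tag iterates it character by character,
    # so every "token" it sees is a single letter
    print(pos_tag(t)[:3])  # first few (letter, tag) pairs, e.g. ('p', ...), ('o', ...)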

Regarding the second error, replace this:

with open('positive_tweets.csv') as csv_file:
    positive_tweets = csv.reader(csv_file, delimiter=',')
with open('negative_tweets.csv') as csv_file:
    negative_tweets = csv.reader(csv_file, delimiter=',')

with this:

positive_tweets = negative_tweets = ""

with open('positive_tweets.csv') as csv_file:
    positive_tweets_rdr = csv.reader(csv_file, delimiter=',')
    rows = list(positive_tweets_rdr)
    for lst in rows[1:]: positive_tweets += ' ' + lst[3]  # tweet column (skip header row)

with open('negative_tweets.csv') as csv_file:
    negative_tweets_rdr = csv.reader(csv_file, delimiter=',')
    rows = list(negative_tweets_rdr)
    for lst in rows[1:]: negative_tweets += ' ' + lst[3]  # tweet column (skip header row)
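
With that change, positive_tweets and negative_tweets are plain strings containing the tweet text, so the existing word_tokenize calls now produce word tokens instead of the file name. A quick sanity check (my own illustration):

positive_tweet_tokens = word_tokenize(positive_tweets)
print(positive_tweet_tokens[:5])
# for the sample row above, something like ['Overwhelmed', 'by', 'all', 'the', 'calls']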

There are several problems with the sample code you provided:

  • nltk's word_tokenize accepts a string, but you are passing it a lazy csv reader. You probably want to call word_tokenize on one field of each row of the CSV
  • Your with statements close the csv files before any data is read from them (see the snippet just below)
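
The second point is easy to trip over because nothing fails inside the with block itself: csv.reader is lazy, so the error only surfaces when you iterate it after the file has been closed. A minimal illustration (my addition, not part of the original answer):

with open('positive_tweets.csv') as csv_file:
    positive_tweets = csv.reader(csv_file, delimiter=',')

# The file is closed once the with block exits, so iterating now raises
# ValueError: I/O operation on closed file.
for row in positive_tweets:
    print(row)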

You want something like this (repeated for the negative tweets):

with open('positive_tweets.csv') as csv_file:
    positive_tweets = csv.reader(csv_file, delimiter=',')
    positive_tweet_tokens = [word_tokenize(t[3]) for t in positive_tweets]

PS Also make sure your CSV files are well formed. In the example above I naively slice out the 4th field of each row, which may not exist. You will need some error handling.
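
For example, a minimally defensive version of the comprehension above might skip the header row and guard against short rows (both guards are my additions, not part of the original answer):

with open('positive_tweets.csv') as csv_file:
    positive_tweets = csv.reader(csv_file, delimiter=',')
    next(positive_tweets, None)  # skip the header row
    positive_tweet_tokens = [word_tokenize(row[3])
                             for row in positive_tweets
                             if len(row) > 3]  # ignore rows without a tweet field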
