
Python: how to apply bag of words to tweets in a CSV file

Original post: 2017-12-03 20:26:59 | Tags: python, twitter, scikit-learn, nlp, nltk

I am currently working on Twitter data analysis and have been trying to apply the bag-of-words technique in Python, without any luck. So far I can stream tweets into a database with some preprocessing and then export them to a CSV file, but I am stuck on the next step: using bag of words as input for machine learning.

I've tried following https://www.kaggle.com/c/word2vec-nlp-tutorial#part-1-for-beginners-bag-of-words, but without success, and I haven't been able to work out how to approach this just from the scikit-learn or NLTK documentation. Can anyone recommend tutorials I can follow to implement bag of words with Python 3? Thanks for the help.

1 Answer

scikit-learn's CountVectorizer is a good place to start. You'll want to create a vocabulary of some fixed size (N unique words collected from your tweets) so that you can represent each tweet as a fixed-length vector, where each position in the vector corresponds to a particular word from your vocabulary and the value is the number of times that word appears in the tweet.

With pure Python this would be:

Create an array of tweet texts
Initialise an empty set representing your vocabulary

First pass through tweets

For each tweet, extract unique words
- Add these words to your vocabulary if they don't exist

Second pass through same tweets

For each tweet, extract its words (keep repeats so they can be counted)
- Create a vector filled with zeros of size N representing the tweet
- For each word, increment the count at that word's position in the vector
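
The two passes above can be sketched in plain Python roughly as follows. This is only an illustration: the whitespace tokenizer, the function names, and the sample tweets are assumptions made for the example, and real tweets would need more careful cleaning (URLs, mentions, punctuation).

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_vocabulary(tweets):
    """First pass: collect unique words and give each a fixed position."""
    words = set()
    for tweet in tweets:
        words.update(tokenize(tweet))
    # Sort so that the word-to-position mapping is reproducible.
    return {word: index for index, word in enumerate(sorted(words))}


def vectorize(tweet, vocabulary):
    """Second pass: count how often each vocabulary word occurs in one tweet."""
    vector = [0] * len(vocabulary)
    for word in tokenize(tweet):
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector


# Made-up sample tweets, standing in for the texts read from the CSV file.
tweets = ["good morning twitter", "good good vibes"]
vocabulary = build_vocabulary(tweets)
vectors = [vectorize(tweet, vocabulary) for tweet in tweets]
```

Each entry of vectors has length N (here 4), so the list can be passed to a classifier as a feature matrix.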

Instead of word frequencies, you could use 1 or 0 to mark whether a word is present or not. Try both and see what works for your data.
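
A presence/absence version might look like the sketch below. The function name, the hand-written vocabulary, and the sample text are hypothetical and exist only to keep the example self-contained.

```python
def presence_vector(tweet, vocabulary):
    """Return 1 for each vocabulary word that occurs in the tweet, else 0."""
    vector = [0] * len(vocabulary)
    # A set drops repeats, so a word seen twice still yields 1.
    for word in set(tweet.lower().split()):
        if word in vocabulary:
            vector[vocabulary[word]] = 1
    return vector


# Hypothetical vocabulary mapping each word to its position in the vector.
vocabulary = {"good": 0, "morning": 1, "twitter": 2, "vibes": 3}
binary = presence_vector("good good vibes", vocabulary)
```

If you end up using scikit-learn, CountVectorizer takes a binary=True argument that has the same effect.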

However, scikit-learn makes all of this much easier to do.
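
As a rough sketch of the scikit-learn route: the example below reads tweets from CSV data and vectorizes them with CountVectorizer. The in-memory CSV, the "text" column name, and the sample rows are assumptions for illustration; substitute your own file and column.

```python
import csv
import io

from sklearn.feature_extraction.text import CountVectorizer

# Stand-in for the exported file; the "text" column name is an assumption.
# For a real file, use open("tweets.csv", newline="", encoding="utf-8").
csv_data = io.StringIO(
    "id,text\n"
    "1,I love python\n"
    "2,python is great and python is fun\n"
)
tweets = [row["text"] for row in csv.DictReader(csv_data)]

# By default CountVectorizer lowercases the text and keeps only tokens
# of two or more characters, so "I" is dropped here.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(tweets)  # sparse matrix, one row per tweet

# vocabulary_ maps each word to its column index; list words in column order.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense_counts = counts.toarray().tolist()
```

The sparse matrix in counts can be passed straight to most scikit-learn estimators; converting to a dense list is only for inspection on small data.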

I also found this tutorial, which might help.