Python: how to apply bag of words to tweets in a CSV file

I am currently working on Twitter data analysis and have been trying to apply the bag-of-words technique in Python, without any luck. So far I can stream tweets into a database with some preprocessing and then export them to a CSV file, but I am stuck on the next step: using bag of words to prepare the data for machine learning.

I've tried following https://www.kaggle.com/c/word2vec-nlp-tutorial#part-1-for-beginners-bag-of-words, but I have had no success, and I haven't been able to work out how to approach this just by reading the scikit-learn or NLTK documentation. Can anyone recommend tutorials I can follow to implement bag of words with Python 3? Thanks for the help.

scikit-learn's CountVectorizer is a good place to start. You'll want to create a vocabulary of some fixed size (N unique words collected from your tweets) so that you can represent each tweet as a fixed-length vector, where each position in the vector represents a particular word from your vocabulary and the value is the number of times that word appears in the tweet.

With pure Python this would be:

  1. Create an array of tweet texts
  2. Initialise an empty set representing your vocabulary

First pass through tweets

  1. For each tweet, extract unique words
    • Add these words to your vocabulary if they don't exist

Second pass through same tweets

  1. For each tweet, extract its words (keep repeated words so that they can be counted)
    • Create a vector of size N filled with zeros to represent the tweet
    • For each word, increment the count at that word's position in the vector
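
The two passes above can be sketched in plain Python as follows. This is a minimal illustration, not a definitive implementation: the helper names (`tokenize`, `build_vocabulary`, `vectorize`), the whitespace tokenizer, and the sample tweets are all assumptions made for the example.

```python
def tokenize(text):
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()


def build_vocabulary(tweets):
    # First pass: collect every unique word, then give each a fixed position.
    vocabulary = set()
    for tweet in tweets:
        vocabulary.update(tokenize(tweet))
    return {word: index for index, word in enumerate(sorted(vocabulary))}


def vectorize(tweets, word_index):
    # Second pass: one zero-filled vector of size N per tweet,
    # incremented once for every occurrence of a known word.
    vectors = []
    for tweet in tweets:
        vector = [0] * len(word_index)
        for word in tokenize(tweet):
            if word in word_index:
                vector[word_index[word]] += 1
        vectors.append(vector)
    return vectors


# Made-up sample tweets standing in for the rows of the CSV export.
tweets = ["good morning twitter", "good good coffee", "coffee time"]
word_index = build_vocabulary(tweets)
vectors = vectorize(tweets, word_index)

print(word_index)
print(vectors)
```

Sorting the vocabulary before assigning positions keeps the column order stable between runs, which matters if the vectors are saved and reused later.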

You could use 1 or 0 to mark whether a word is present, instead of word frequencies. See what works best for your task.
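
As a small sketch of the presence/absence variant, count vectors can be turned into 1/0 vectors in one step. The `counts` values here are made up for illustration.

```python
# Made-up count vectors: one row per tweet, one column per vocabulary word.
counts = [[0, 1, 1, 0, 1], [1, 2, 0, 0, 0]]

# Replace every non-zero count with 1, leaving zeros untouched.
binary = [[1 if count > 0 else 0 for count in row] for row in counts]

print(binary)
```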

However, scikit-learn makes all of this much easier to do.
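
A minimal sketch of the scikit-learn route, assuming scikit-learn is installed and that the exported CSV has a column named `text` holding the tweet (the column name and the inline sample data are assumptions; with a real file you would open it instead of using `io.StringIO`).

```python
import csv
import io

from sklearn.feature_extraction.text import CountVectorizer

# Stand-in for the exported CSV file; the "text" column name is an assumption.
csv_data = io.StringIO(
    "id,text\n"
    "1,good morning twitter\n"
    "2,good good coffee\n"
    "3,coffee time\n"
)
tweets = [row["text"] for row in csv.DictReader(csv_data)]

# fit_transform builds the vocabulary and counts the words in one call.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)  # sparse matrix, one row per tweet

vocab = vectorizer.vocabulary_  # maps each word to its column index
counts = X.toarray()            # dense array, fine for small data

print(vocab)
print(counts)
```

`CountVectorizer` also accepts `binary=True` to record presence instead of counts, and `max_features` to cap the vocabulary size. Note that its default tokenizer lowercases the text and ignores single-character tokens.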

I found this tutorial which might help too.
