Python how to apply bag of words to tweets in csv file
I am currently working on Twitter data analysis and have been trying to apply the bag-of-words technique in Python, with no luck so far. At this point I can stream data into a database with some preprocessing and then export the tweets to a CSV file, but I am stuck on the next step: using bag of words so I can do machine learning.

I've tried following https://www.kaggle.com/c/word2vec-nlp-tutorial#part-1-for-beginners-bag-of-words, but I had no success, and I haven't been able to grasp how to approach this just by reading the scikit-learn or nltk documentation. Can anyone recommend tutorials I can follow to achieve bag of words with Python 3?

Thanks for the help.
scikit-learn's CountVectorizer is a good place to start. You'll want to create a vocabulary of some fixed size (N unique words collected from your tweets) so that you can represent each tweet as a fixed-length vector, where each position in the vector represents a particular word from your vocabulary and the value is the number of times that word appears in the tweet.
In pure Python this takes two passes:

First pass through the tweets: collect the vocabulary.

Second pass through the same tweets: count, for each tweet, how many times each vocabulary word occurs.
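A minimal sketch of those two passes (the sample tweets here are made up; in your case they would come from your CSV file):

```python
# Hypothetical sample tweets; in practice, load these from your CSV file.
tweets = [
    "i love python",
    "python is great and i love it",
]

# First pass: build a fixed vocabulary of all unique words across tweets.
vocabulary = sorted({word for tweet in tweets for word in tweet.split()})

# Second pass: represent each tweet as a fixed-length vector of word counts.
vectors = []
for tweet in tweets:
    words = tweet.split()
    vectors.append([words.count(term) for term in vocabulary])

print(vocabulary)  # ['and', 'great', 'i', 'is', 'it', 'love', 'python']
print(vectors[0])  # [0, 0, 1, 0, 0, 1, 1]
```

Real tweets would need more preprocessing than `split()` (lowercasing, stripping punctuation, handling mentions and URLs), but the two-pass structure stays the same.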
You could use 1 or 0 to mark a word as present or absent, instead of word frequencies. See what works better for your task.
However, scikit-learn makes all of this much easier to do.
I found this tutorial which might help too.