Count (and write) word frequency per line in a text file

Question

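First time posting on Stack - I've always found previous questions enough to solve my problems! The main issue I have is the logic... even a pseudocode answer would be great.

I'm using Python to read data from each line of a text file, in the following format: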

This is a tweet captured from the twitter api #hashtag http://url.com/site

Using nltk, I can tokenize line by line and then iterate with reader.sents() and so on:

reader = TaggedCorpusReader(filecorpus, r'.*\.txt', sent_tokenizer=LineTokenizer())

reader.sents()[:10]

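But I want to count the frequency of certain "hot words" (stored in an array or similar) on each line, and then write the counts back to a text file. If I use reader.words(), I can count the frequency of the hot words across the whole text, but I'm looking for the count per line (or "sentence" in this case).

Ideally, something like: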

hotwords = (['tweet'], ['twitter'])

for each line
     tokenize into words.
     for each word in line 
         if word is equal to hotword[1], hotword1 count ++
         if word is equal to hotword[2], hotword2 count ++
     at end of line, for each hotword[index]
         filewrite count,

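Also, don't worry about URLs getting broken up (using WordPunctTokenizer would strip the punctuation - that's not an issue).

Any useful pointers (including pseudocode or links to other similar code) would be great.

---- EDIT ------------------

Ended up with something like this: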

import nltk
from nltk.corpus.reader import TaggedCorpusReader
from nltk.tokenize import LineTokenizer
#from nltk.tokenize import WordPunctTokenizer
from collections import defaultdict

# Create reader and generate corpus from all txt files in dir.
filecorpus = 'Twitter/FINAL_RESULTS/tweetcorpus'
filereader = TaggedCorpusReader(filecorpus, r'.*\.csv', sent_tokenizer=LineTokenizer())
print "Reader accessible." 
print filereader.fileids()

#define hotwords
hotwords = ('cool','foo','bar')

tweetdict = []

for line in filereader.sents():
    # start a fresh counter for each line ("sentence")
    wordcounts = defaultdict(int)
    for word in line:
        if word in hotwords:
            wordcounts[word] += 1
    tweetdict.append(wordcounts)

The output is:

print tweetdict

[defaultdict(<type 'dict'>, {}),
 defaultdict(<type 'int'>, {'foo': 2, 'bar': 1, 'cool': 2}),
 defaultdict(<type 'int'>, {'cool': 1})]

Answer 1

from collections import Counter

hotwords = ('tweet', 'twitter')

lines = "a b c tweet d e f\ng h i j k   twitter\n\na"

c = Counter(lines.split())

for hotword in hotwords:
    print hotword, c[hotword]

This script needs Python 2.7 or later, since collections.Counter was added in 2.7.
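Note that Counter(lines.split()) tallies the whole text at once. For the per-line counts the question asks for, one possible variation is to build one Counter per line. This is only a sketch, not part of the original answer: the helper name count_hotwords_per_line is made up for illustration, and a plain whitespace split stands in for whatever tokenizer is actually used.

```python
from collections import Counter

hotwords = ('tweet', 'twitter')

def count_hotwords_per_line(text, hotwords):
    """Return one list of counts per line, ordered like `hotwords`."""
    rows = []
    for line in text.splitlines():
        # One Counter per line; whitespace splitting is an assumption
        c = Counter(line.split())
        # Counter returns 0 for missing keys, so absent hot words need no special case
        rows.append([c[w] for w in hotwords])
    return rows

lines = "a b c tweet d e f\ng h i j k   twitter\n\na"
rows = count_hotwords_per_line(lines, hotwords)
for row in rows:
    print(' '.join(str(n) for n in row))
```

Each printed row lines up with one input line, so empty lines still produce a row of zeros and the output stays aligned with the source file.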

Answer 2

defaultdict is your friend for this sort of thing.

from collections import defaultdict
for line in myfile:
    # tokenize (a plain whitespace split here; swap in your own tokenizer)
    words = line.split()
    word_counts = defaultdict(int)
    for word in words:
        if word in hotwords:
            word_counts[word] += 1
    print '\n'.join('%s: %s' % (k, v) for k, v in word_counts.items())
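To get the counts back into a text file, as the question asks, something along these lines might work. It is only a sketch: the function name write_hotword_counts, the comma-separated output layout, and the whitespace tokenization are all assumptions, not anything taken from the question's setup.

```python
import io
from collections import defaultdict

def write_hotword_counts(infile, outfile, hotwords):
    """Write one comma-separated row of hot-word counts per input line."""
    for line in infile:
        word_counts = defaultdict(int)
        for word in line.split():  # assumed tokenizer: whitespace split
            if word in hotwords:
                word_counts[word] += 1
        # .get() reads without inserting, so words never seen are written as 0
        row = [str(word_counts.get(w, 0)) for w in hotwords]
        outfile.write(u','.join(row) + u'\n')

# In-memory files stand in for real ones opened with open(...)
src = io.StringIO(u"cool foo bar foo cool\nnothing here\ncool\n")
dst = io.StringIO()
write_hotword_counts(src, dst, ('cool', 'foo', 'bar'))
print(dst.getvalue())
```

Writing the columns in the fixed order of hotwords keeps every row the same width, which makes the file easy to load later.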

Answer 3

Do you need to tokenize it at all? You can just call count() on each line for each word.

hotwords = {'tweet':[], 'twitter':[]}
for line in file_obj:
    for word in hotwords.keys():
        hotwords[word].append(line.count(word))
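One caveat worth keeping in mind: str.count() counts substring matches, so 'tweet' is also found inside 'retweet' or 'tweets'. If whole-word matches are wanted, a regular expression with \b word boundaries is one option. The sketch below is an illustration only, and count_whole_word is a made-up helper name.

```python
import re

def count_whole_word(line, word):
    """Count occurrences of `word` in `line` as a whole word only."""
    # re.escape guards against regex metacharacters in the hot word
    pattern = r'\b' + re.escape(word) + r'\b'
    return len(re.findall(pattern, line))

line = "tweet retweet tweets tweet"
print(line.count('tweet'))              # substring matches
print(count_whole_word(line, 'tweet'))  # whole-word matches
```

Whether the substring behaviour is a bug or a feature depends on the data: for hashtags or word stems it may be exactly what is wanted.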

3 answers

Answer 1
4 2011-04-08 13:37:23

Answer 2
1 accepted 2011-04-08 13:36:48

Answer 3
0 2011-04-08 13:25:29
