
Convert a list of repeating values to a dictionary of their frequency counts in Python

I have a list of strings with repeating values. I want to create a dictionary in which each key is a word and its value is that word's frequency count, and then write the words and their counts to a CSV file.

This is my approach so far:

#!/usr/bin/env python
# encoding: utf-8

# -*- coding: utf8 -*-
import csv
from nltk.tokenize import TweetTokenizer
import numpy as np

tknzr = TweetTokenizer()

#print tknzr.tokenize(s0)

with open("dispn.csv","r") as file1,\
     open("dispn_tokenized.csv","w") as file2,\
     open("dispn_tokenized_count.csv","w") as file3:

     mycsv = list(csv.reader(file1))

     words = []
     words_set = []
     tokenize_count = {}
     for row in mycsv:
         
         lst = tknzr.tokenize(row[2])
         for l in lst:
             file2.write("\""+str(row[2])+"\""+","+"\""+str(l.encode('utf-8'))+"\""+"\n")
             l = l.lower()
             words.append(l)
     words_set = list(set(words))
     print "len of words_set : " + str(len(words_set))
     for word in words_set:
        tokenize_count[word] = 1
        
     for word in words:
        tokenize_count[word] = tokenize_count[word]+1
        

   

     print "len of tokenized words_set : " + str(len(tokenize_count))

     #print "Tokenized_words count : "
     #print tokenize_count
     #print "================================================================="
                         
     i = 0
     for wrd in words_set:
       #i = i+1
       print "i : " +str(i)
       file3.write("\""+str(i)+"\""+","+"\""+str(wrd.encode('utf-8'))+"\""+","+"\""+str(tokenize_count[wrd])+"\""+"\n")

But in the CSV I still found some repeating values such as 1, 5, 4, 7, 9.

Some info of the approach:

  • dispn.csv contains the usernames, which I tokenize with the nltk module.
  • After tokenizing, I store the words in the list 'words' and write each username together with its words to a CSV file.
  • I create a set from 'words' to get the unique values and store it in 'words_set'.
  • I then build the dictionary 'tokenize_count', with each word as the key and its frequency count as the value, and write it to a CSV file.

Why am I getting only some of the numerical values repeated? Is there a better way to do this?

Use Counter from the collections module:

from collections import Counter

Calling Counter on a list of strings returns a dict-like object whose keys are the words and whose values are their frequencies.
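A minimal sketch of how that could look, assuming Python 3 (the question's code is Python 2) and a hypothetical token list standing in for the output of tknzr.tokenize. The helper names count_words and write_counts are illustrative, not part of the original code. Two things visible in the question's loop are worth noting: each count is seeded at 1 and then incremented once per occurrence, so every stored count is one higher than the true frequency, and i stays at 0 because its increment is commented out. Counter starts from zero and enumerate supplies the index, which sidesteps both.

```python
import csv
import io
from collections import Counter


def count_words(words):
    """Return a Counter mapping each lower-cased word to its frequency."""
    return Counter(word.lower() for word in words)


def write_counts(counts, out_file):
    """Write (index, word, count) rows to an already-open text file object."""
    # QUOTE_ALL mirrors the question's output, where every field is quoted.
    writer = csv.writer(out_file, quoting=csv.QUOTE_ALL)
    # most_common() yields (word, count) pairs, highest count first.
    for i, (word, count) in enumerate(counts.most_common(), start=1):
        writer.writerow([i, word, count])


# Hypothetical tokens; in the question these would come from the tokenizer.
tokens = ["Alice", "bob", "alice", "Bob", "alice", "carol"]
counts = count_words(tokens)

# An in-memory buffer keeps the sketch self-contained; a real file works the same way.
buffer = io.StringIO()
write_counts(counts, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, the buffer could be replaced with something like open("dispn_tokenized_count.csv", "w", newline="") in Python 3. Using csv.writer rather than manual string concatenation also takes care of quoting and escaping.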
