I have a list of strings with repeating values, and I want to build a dictionary where each key is a word and each value is that word's frequency count, then write the words and their counts to a CSV file.
The following has been my approach:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv
from nltk.tokenize import TweetTokenizer
import numpy as np

tknzr = TweetTokenizer()
#print tknzr.tokenize(s0)
with open("dispn.csv", "r") as file1, \
     open("dispn_tokenized.csv", "w") as file2, \
     open("dispn_tokenized_count.csv", "w") as file3:
    mycsv = list(csv.reader(file1))
    words = []
    words_set = []
    tokenize_count = {}
    for row in mycsv:
        lst = tknzr.tokenize(row[2])
        for l in lst:
            file2.write("\"" + str(row[2]) + "\"" + "," + "\"" + str(l.encode('utf-8')) + "\"" + "\n")
            l = l.lower()
            words.append(l)
    words_set = list(set(words))
    print "len of words_set : " + str(len(words_set))
    for word in words_set:
        tokenize_count[word] = 1
    for word in words:
        tokenize_count[word] = tokenize_count[word] + 1
    print "len of tokenized words_set : " + str(len(tokenize_count))
    #print "Tokenized_words count : "
    #print tokenize_count
    #print "================================================================="
    i = 0
    for wrd in words_set:
        #i = i+1
        print "i : " + str(i)
        file3.write("\"" + str(i) + "\"" + "," + "\"" + str(wrd.encode('utf-8')) + "\"" + "," + "\"" + str(tokenize_count[wrd]) + "\"" + "\n")
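Note that the counting loops above seed every key with 1 and then add 1 per occurrence, so each stored value ends up as frequency + 1. A minimal sketch of the same dict-based counting done in a single pass, using a hypothetical token list in place of the tokenizer output:

```python
# Hypothetical sample standing in for the tokenized words.
tokens = ["good", "movie", "good", "fun", "movie", "good"]

# Count in one pass: dict.get supplies 0 the first time a word is seen,
# so no separate pre-seeding loop over a deduplicated set is needed.
counts = {}
for t in tokens:
    t = t.lower()
    counts[t] = counts.get(t, 0) + 1

# counts == {"good": 3, "movie": 2, "fun": 1}
```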
But in the CSV I still found some repeated values, like 1, 5, 4, 7, 9.
Why am I getting only some of the numerical values repeated? Is there a better way to do this?
from collections import Counter
Counter can be called on a list of strings and returns a dict-like object whose keys are the words and whose values are their frequencies.
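A minimal sketch of this approach, assuming the tokens have already been collected into a list (the sample list and output filename here are placeholders):

```python
import csv
from collections import Counter

# Hypothetical sample standing in for the tokenized words.
tokens = ["good", "movie", "good", "fun", "movie", "good"]

# Counter does the frequency counting directly; lowercase first so
# "Good" and "good" are counted as the same word.
counts = Counter(t.lower() for t in tokens)
# counts["good"] == 3, counts["movie"] == 2, counts["fun"] == 1

# csv.writer handles quoting, so there is no need to build quoted
# strings by hand as in the original code.
with open("dispn_tokenized_count.csv", "w") as f:
    writer = csv.writer(f)
    for i, (word, count) in enumerate(counts.most_common()):
        writer.writerow([i, word, count])
```

Because Counter only ever holds one entry per distinct word, each word appears exactly once in the output; identical counts for different words are expected, since many words naturally share the same frequency.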