
10 most frequent words in a string in Python

I need to display the 10 most frequent words in a text file, from the most frequent to the least, along with the number of times each is used. I can't use a dictionary or the Counter function. So far I have this:

import urllib
cnt = 0
i=0
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
uniques = []
for line in txtFile:
    words = line.split()
    for word in words:
        if word not in uniques:
            uniques.append(word)
for word in words:
    while i<len(uniques):
        i+=1
        if word in uniques:
             cnt += 1
print cnt

Now I think I should look for every word in the array 'uniques' and see how many times it is repeated in this file and then add that to another array that counts the instance of each word. But this is where I am stuck. I don't know how to proceed.

Any help would be appreciated. Thank you
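The plan described in the question — a list of unique words plus a parallel list that counts each one — could be sketched like this. The sample string is a stand-in assumption; loading the actual file is omitted:

```python
# A minimal sketch of the parallel-lists plan described above.
# The text here is a placeholder; in practice it would come from the file.
text = "the cat sat on the mat the cat"

words = text.split()

# Collect the unique words, as in the question.
uniques = []
for word in words:
    if word not in uniques:
        uniques.append(word)

# counts[i] holds how many times uniques[i] appears in words.
counts = [0] * len(uniques)
for word in words:
    counts[uniques.index(word)] += 1

# Pair each count with its word, sort descending, and take the top 10.
pairs = sorted(zip(counts, uniques), reverse=True)
for count, word in pairs[:10]:
    print(word, count)
```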

The above problem can be easily solved using Python's collections module; below is the solution.

from collections import Counter

data_set = "Welcome to the world of Geeks " \
"This portal has been created to provide well written well " \
"thought and well explained solutions for selected questions " \
"If you like Geeks for Geeks and would like to contribute " \
"here is your chance You can write article and mail your article " \
"to contribute at geeksforgeeks org See your article appearing on " \
"the Geeks for Geeks main page and help thousands of other Geeks."

# split() returns list of all the words in the string
split_it = data_set.split()

# Pass the split_it list to an instance of the Counter class.
Counters_found = Counter(split_it)
# print(Counters_found)

# most_common(k) produces the k most frequently encountered
# values and their respective counts.
most_occur = Counters_found.most_common(4)
print(most_occur)

You're on the right track. Note that this algorithm is quite slow, because for each unique word it iterates over all of the words. A much faster approach without hashing would involve building a trie.

# The following assumes that we already have alice30.txt on disk.
# Start by splitting the file into lowercase words.
words = open('alice30.txt').read().lower().split()

# Get the set of unique words.
uniques = []
for word in words:
  if word not in uniques:
    uniques.append(word)

# Make a list of (count, unique) tuples.
counts = []
for unique in uniques:
  count = 0              # Initialize the count to zero.
  for word in words:     # Iterate over the words.
    if word == unique:   # Is this word equal to the current unique?
      count += 1         # If so, increment the count
  counts.append((count, unique))

counts.sort()            # Sorting the list puts the lowest counts first.
counts.reverse()         # Reverse it, putting the highest counts first.
# Print the ten words with the highest counts.
for i in range(min(10, len(counts))):
  count, word = counts[i]
  print('%s %d' % (word, count))
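For comparison, the trie mentioned above could be sketched with nested dicts, where each word is a path of characters and a sentinel key holds the count for words ending at that node. (A strictly hash-free trie would index children by character code; dicts just keep the sketch short.)

```python
# Sketch of a trie-based word counter. Each node is a dict whose keys are
# characters; the sentinel key '$' stores the count of words ending there.
def count_with_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = node.get('$', 0) + 1

    # Walk the trie, collecting (count, word) pairs.
    pairs = []
    def walk(node, prefix):
        for key, value in node.items():
            if key == '$':
                pairs.append((value, prefix))
            else:
                walk(value, prefix + key)
    walk(root, '')
    return sorted(pairs, reverse=True)

words = "the cat sat on the mat the cat".split()
result = count_with_trie(words)
for count, word in result[:10]:
    print(word, count)
```

Counting each word is then proportional to its length rather than to the number of unique words seen so far.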
from string import punctuation  # needed to strip the punctuation

from urllib.request import urlopen
txtFile = urlopen("http://textfiles.com/etext/FICTION/alice30.txt")

counter = {}

for line in txtFile:
    words = line.decode('latin-1').split()  # urlopen yields bytes; decode to text
    for word in words:
        k = word.strip(punctuation).lower()  # so "the"/"The", "you"/"You" are counted as one word
        # You still have contractions like I've, you're, Alice's;
        # you could map re to are, ve to have, etc.
        if "'" in k:
            ks = k.split("'")
        else:
            ks = [k]
        # Now the tally.
        for k in ks:
            counter[k] = counter.get(k, 0) + 1

# Sort the counter by the value, which holds the tally.
for word in sorted(counter, key=lambda k: counter[k], reverse=True)[:10]:
    print(word, "\t", counter[word])
from urllib.request import urlopen

txtFile = urlopen("http://textfiles.com/etext/FICTION/alice30.txt").read().decode('latin-1')
txtFile = txtFile.replace("\r", " ").replace("\n", " ")  # replace line breaks with spaces
txtFile = "".join(char for char in txtFile if char.isalnum() or char.isspace())  # drop everything that's not alphanumeric or whitespace

word_counter = {}
for word in txtFile.split(" "):  # split on every space
    if len(word) > 0:
        if word not in word_counter:  # if 'word' is not yet in word_counter, add it with value 1
            word_counter[word] = 1
        else:
            word_counter[word] += 1  # if 'word' is already in word_counter, increment it by 1

# Sort the dict by value, highest first, and take the top 10 items.
for i, word in enumerate(sorted(word_counter, key=word_counter.get, reverse=True)[:10]):
    print("%s: %s - %s" % (i + 1, word, word_counter[word]))

output:

1: the - 1432
2: and - 734
3: to - 703
4: a - 579
5: of - 501
6: she - 466
7: it - 440
8: said - 434
9: I - 371
10: in - 338

This method ensures that only alphanumeric characters and spaces end up in the counter, though that doesn't matter much here.

Personally, I'd make my own implementation of collections.Counter. I assume you know how that object works, but if not, I'll summarize:

import collections

text = "some words that are mostly different but are not all different not at all"

words = text.split()

resulting_count = collections.Counter(words)
# {'all': 2,
# 'are': 2,
# 'at': 1,
# 'but': 1,
# 'different': 2,
# 'mostly': 1,
# 'not': 2,
# 'some': 1,
# 'that': 1,
# 'words': 1}

We can certainly sort that by frequency using the key keyword argument of sorted, and return the first 10 items in that list. However, that doesn't help you much, because you don't have Counter implemented. I'll leave THAT part as an exercise for you, and show you how you might implement Counter as a function rather than an object.

def counter(iterable):
    d = {}
    for element in iterable:
        if element in d:
            d[element] += 1
        else:
            d[element] = 1
    return d

Not difficult, actually. Go through each element of an iterable. If that element is NOT in d, add it to d with a value of 1. If it IS in d, increment that value. It's more easily expressed by:

def counter(iterable):
    d = {}
    for element in iterable:
        # dict.get supplies 0 for unseen elements, so one line does the tally.
        d[element] = d.get(element, 0) + 1
    return d

Note that in your use case, you probably want to strip out the punctuation and possibly casefold the whole thing (so that someword gets counted the same as Someword rather than as two separate words). I'll leave that to you as well, but I will point out that str.strip takes an argument specifying which characters to strip, and string.punctuation contains all the punctuation you're likely to need.
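Putting those pieces together — the counter function above, str.strip with string.punctuation, and sorted with a key — might look like this (the sample text is just a placeholder):

```python
from string import punctuation

def counter(iterable):
    # Same idea as above: tally each element into a plain dict.
    d = {}
    for element in iterable:
        d[element] = d.get(element, 0) + 1
    return d

# Placeholder text standing in for the file contents.
text = "Some words, some MORE words; some punctuation!"
words = [w.strip(punctuation).lower() for w in text.split()]
counts = counter(words)

# Sort the words by their counts, highest first, and keep the top 10.
top10 = sorted(counts, key=counts.get, reverse=True)[:10]
for word in top10:
    print(word, counts[word])
```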

You can also do it with pandas DataFrames and get the result in a convenient form: a table of words and their frequencies, ordered.

import pandas as pn

def count_words(words_list):
    words_df = pn.DataFrame(words_list)
    words_df.columns = ["word"]
    words_df_unique = pn.DataFrame(pn.unique(words_list))
    words_df_unique.columns = ["unique"]
    words_df_unique["count"] = 0
    i = 0
    for word in pn.Series.tolist(words_df_unique.unique):
        words_df_unique.iloc[i, 1] = len(words_df.word[words_df.word == word])
        i += 1
    res = words_df_unique.sort_values('count', ascending=False)
    return res

To do the same operation on a pandas DataFrame, you can use Counter from collections:

from collections import Counter
cnt = Counter()
for text in df['text']:
    for word in text.split():
        cnt[word] += 1

# Find most common 10 words from the Pandas dataframe
cnt.most_common(10)
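As an aside, pandas can do the split-and-tally itself: str.split plus explode plus value_counts gives the same ranking in one chain. The small DataFrame below is a hypothetical stand-in for the df in the answer above, assuming a column named 'text':

```python
import pandas as pd

# Hypothetical stand-in for the df used above.
df = pd.DataFrame({'text': ['the cat sat', 'the cat ran', 'the dog sat']})

# Split each row into words, flatten to one word per row, then tally.
top10 = df['text'].str.split().explode().value_counts().head(10)
print(top10)
```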
