Find words that appear only once

I am retrieving only the unique words in a file (words that appear exactly once). Here is what I have so far. Is there a better way to achieve this in Python in terms of big-O complexity? I believe my current version is O(n²).

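def retHapax():
    file = open("myfile.txt")
    myMap = {}      # every word seen so far
    uniqueMap = {}  # words seen exactly once so far
    for i in file:
        myList = i.split(' ')
        for j in myList:
            j = j.rstrip()
            if j in myMap:
                # pop() with a default avoids a KeyError on a third occurrence
                uniqueMap.pop(j, None)
            else:
                myMap[j] = 1
                uniqueMap[j] = 1
    file.close()
    print(uniqueMap)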

If you want to find all unique words and treat foo the same as foo. (with a trailing period), you need to strip punctuation.

from collections import Counter
from string import punctuation

with open("myfile.txt") as f:
    word_counts = Counter(word.strip(punctuation) for line in f for word in line.split())

print([word for word, count in word_counts.items() if count == 1])

If you want to ignore case, you also need to use line.lower(). Getting an accurate set of unique words takes more than just splitting the lines on whitespace.
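As a rough sketch of what "more involved" could look like, the version below lowercases the text and pulls out words with a regular expression instead of splitting on whitespace. The function name hapax_words and the definition of a word (a run of ASCII letters, digits, or apostrophes) are assumptions made for illustration, not part of the original answer; real text may need a different pattern.

```python
import re
from collections import Counter


def hapax_words(text):
    """Return the words that occur exactly once in text, ignoring case."""
    # Assumption: a "word" is a run of ASCII letters, digits, or apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    # Sort so the result is deterministic.
    return sorted(word for word, count in counts.items() if count == 1)


print(hapax_words("Foo bar. foo baz, BAR qux"))  # ['baz', 'qux']
```

Here "Foo" and "foo" count as the same word, and so do "bar." and "BAR", so only baz and qux are left.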

I'd go with the collections.Counter approach, but if you only want to use sets, you could do it like this:

with open('myfile.txt') as input_file:
    all_words = set()
    dupes = set() 
    for word in (word for line in input_file for word in line.split()):
        if word in all_words:
            dupes.add(word)
        all_words.add(word)

    unique = all_words - dupes

Given an input of:

one two three
two three four
four five six

the value of unique is:

{'five', 'one', 'six'}

Try this to get the unique words in a file, using Counter:

from collections import Counter

with open("myfile.txt") as input_file:
    word_counts = Counter(word for line in input_file for word in line.split())

# List of unique words (words that appear exactly once)
unique_words = [word for word, count in word_counts.items() if count == 1]
print(unique_words)

You could slightly modify your logic so that a word is moved out of the unique set on its second occurrence (this example uses sets instead of dicts):

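words = set()         # words seen at least twice
unique_words = set()  # words seen exactly once so far
with open("myfile.txt") as f:
    for w in (word.strip() for line in f for word in line.split(' ')):
        if w in words:
            continue
        if w in unique_words:
            # Second occurrence: no longer unique
            unique_words.remove(w)
            words.add(w)
        else:
            unique_words.add(w)
print(unique_words)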
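For what it's worth, here is the same idea wrapped in a function so it can be tried without a file. This is only a sketch: the name words_seen_once and the in-memory sample list are made up for illustration, and it splits on any whitespace rather than on a single space. Since set membership tests, adds, and removes take constant time on average, a single pass like this should be roughly linear in the number of words.

```python
def words_seen_once(lines):
    """Return the set of words that occur exactly once across all lines."""
    seen_again = set()  # words seen at least twice
    seen_once = set()   # words seen exactly once so far
    for line in lines:
        for w in line.split():
            if w in seen_again:
                continue
            if w in seen_once:
                # Second occurrence: move the word out of the unique set
                seen_once.remove(w)
                seen_again.add(w)
            else:
                seen_once.add(w)
    return seen_once


sample = ["one two three", "two three four", "four five six"]
print(sorted(words_seen_once(sample)))  # ['five', 'one', 'six']
```

An open file can be passed in place of the list, since iterating over a file yields its lines.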
