
Finding least common elements in a list

I want to generate an ordered list of the least common words within a large body of text, with the least common word appearing first along with a value indicating how many times it appears in the text.

I scraped the text from some online journal articles, then simply assigned it and split:

article_one = """ large body of text """.split() 
=> ("large","body", "of", "text")

Seems like a regex would be appropriate for the next steps, but being new to programming I'm not well versed. If the best answer includes a regex, could someone point me to a good regex tutorial other than pydoc?

How about a shorter/simpler version with a defaultdict? Counter is nice but needs Python 2.7; this works from 2.5 and up :)

import collections

counter = collections.defaultdict(int)
article_one = """ large body of text """

for word in article_one.split():
    counter[word] += 1

# x[::-1] reverses each (word, count) pair, so the sort key is (count, word)
print sorted(counter.iteritems(), key=lambda x: x[::-1])

According to the documentation of the Counter class in the collections module:

c.most_common()[:-n-1:-1]       # n least common elements

So the code for the least common element in a list is:

from collections import Counter
Counter( mylist ).most_common()[:-2:-1]

And the two least common elements:

from collections import Counter
Counter( mylist ).most_common()[:-3:-1]
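As a quick sanity check, here is what those slices produce on a small made-up list (the words and counts are purely illustrative):

from collections import Counter

mylist = ['a', 'a', 'a', 'b', 'b', 'c']
c = Counter(mylist)

print(c.most_common())          # [('a', 3), ('b', 2), ('c', 1)]
print(c.most_common()[:-2:-1])  # [('c', 1)]            -> the least common element
print(c.most_common()[:-3:-1])  # [('c', 1), ('b', 2)]  -> the two least common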

This uses a slightly different approach, but it appears to suit your needs. It uses code from this answer.

#!/usr/bin/env python
import operator
import string

article_one = """A, a b, a b c, a b c d, a b c d efg.""".split()
wordbank = {}

for word in article_one:
    # Strip word of punctuation and capitalization
    word = word.lower().strip(string.punctuation)
    if word not in wordbank:
        # Create a new dict key if necessary
        wordbank[word] = 1
    else:
        # Otherwise, increment the existing key's value
        wordbank[word] += 1

# Sort dict by value
sortedwords = sorted(wordbank.iteritems(), key=operator.itemgetter(1))

for word in sortedwords:
    print word[1], word[0]

Outputs:

1 efg
2 d
3 c
4 b
5 a

Works in Python >= 2.4, and in Python 3+ if you parenthesize the print statement at the bottom and change iteritems to items.

A ready-made answer from the mothership (the official collections documentation).

>>> from collections import Counter   # example from the official documentation
>>> # Tally occurrences of words in a list
>>> cnt = Counter()
>>> for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
...     cnt[word] += 1
>>> cnt
Counter({'blue': 3, 'red': 2, 'green': 1})
## ^^^^--- from the standard documentation.

>>> # Find the ten most common words in Hamlet
>>> import re
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
 ('you', 554),  ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]
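The least-common counterpart of that last line is just the reversed slice from earlier, continuing the same session (a sketch; the actual words depend on your copy of hamlet.txt, so no output is shown):

>>> Counter(words).most_common()[:-11:-1]   # the ten least common words, rarest first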

>>> import heapq
>>> from operator import itemgetter
>>> def least_common(adict, n=None):
...     if n is None:
...         return sorted(adict.iteritems(), key=itemgetter(1), reverse=False)
...     return heapq.nsmallest(n, adict.iteritems(), key=itemgetter(1))

Obviously adapt to suit :D
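For example, feeding it the tally from the first snippet (a sketch, assuming Python 2 to match the iteritems call above):

>>> from collections import Counter
>>> cnt = Counter(['red', 'blue', 'red', 'green', 'blue', 'blue'])
>>> least_common(cnt, 2)
[('green', 1), ('red', 2)]
>>> least_common(cnt)
[('green', 1), ('red', 2), ('blue', 3)]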

If you need a fixed number of least-common words, e.g., the 10 least common, you probably want a solution using a counter dict and a heapq, as suggested by sotapme's answer (with WoLpH's suggestion) or WoLpH's answer:

import collections, heapq, operator
wordcounter = collections.Counter(article_one)
leastcommon = heapq.nsmallest(10, wordcounter.items(), key=operator.itemgetter(1))

However, if you need an unbounded number of them, e.g., all words with fewer than 5 appearances, which could be 6 in one run and 69105 in the next, you might be better off just sorting the list:

import collections, itertools, operator
wordcounter = collections.Counter(article_one)
allwords = sorted(wordcounter.items(), key=operator.itemgetter(1))
# takewhile is lazy; wrap it in list() if you need to iterate more than once
leastcommon = itertools.takewhile(lambda x: x[1] < 5, allwords)

Sorting takes longer than heapifying, but extracting the first M elements is a lot faster with a list than with a heap. Algorithmically, the difference is just some log N factors, so the constants are going to be important here. So the best thing to do is test.
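For reference, a minimal harness for that kind of test might look like the following. This is only a sketch, not the pastebin code: it assumes Python 3 and a corpus file named reut2.sgm in the working directory, and it times the "pop from a heap until the count reaches 5" strategy against the "sort everything and takewhile" strategy from above.

import collections, heapq, itertools, operator, timeit

words = open('reut2.sgm').read().lower().split()
wordcounter = collections.Counter(words)

def via_heap(maxcount=5):
    # heapify (count, word) pairs and pop until the counts reach the threshold
    heap = [(count, word) for word, count in wordcounter.items()]
    heapq.heapify(heap)
    result = []
    while heap and heap[0][0] < maxcount:
        result.append(heapq.heappop(heap))
    return result

def via_sort(maxcount=5):
    # sort everything by count, then take the leading run below the threshold
    allwords = sorted(wordcounter.items(), key=operator.itemgetter(1))
    return list(itertools.takewhile(lambda kv: kv[1] < maxcount, allwords))

print('heap:', timeit.timeit(via_heap, number=10))
print('sort:', timeit.timeit(via_sort, number=10))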

Taking my code at pastebin, and a file made by just doing cat reut2* >reut2.sgm on the Reuters-21578 corpus (without processing it to extract the text, so this is obviously not very good for serious work, but should be fine for benchmarking, because none of the SGML tags are going to be in the least common…):

$ python leastwords.py reut2.sgm # Apple 2.7.2 64-bit
heap: 32.5963380337
sort: 22.9287009239
$ python3 leastwords.py reut2.sgm # python.org 3.3.0 64-bit
heap: 32.47026552911848
sort: 25.855643508024514
$ pypy leastwords.py reut2.sgm # 1.9.0/2.7.2 64-bit
heap: 23.95291996
sort: 16.1843900681

I tried various ways to speed up each of them (including: takewhile around a genexp instead of a loop around yield in the heap version, popping optimistic batches with nsmallest and throwing away any excess, making a list and sorting in place, decorate-sort-undecorate instead of a key, partial instead of lambda, etc.), but none of them made more than a 5% improvement (and some made things significantly slower).

At any rate, these are closer than I expected, so I'd probably go with whichever one is simpler and more readable. But I think sort beats heap there, as well, so…

Once again: If you just need the N least common, for reasonable N, I'm willing to bet without even testing that the heap implementation will win.
