Finding the most common lists in a list of lists

Question

I have this code:

text = open("tags.txt", "r")
mylist = []
metalist = []

for line in text:
    mylist.append(line)

    if len(mylist) == 5:
        metalist.append(mylist)
        mylist.pop(0)

This opens a text file with one POS tag per line. It adds the first 5 POS tags to mylist, which is then appended to metalist. It then moves down one line and builds the next sequence of 5 POS tags. The text file has about 110k tags in total. I need to find the most common POS tag sequences in metalist. I tried using Counter from the collections module, but lists are not hashable. What is the best way to approach this?

Answer 1

As mentioned in one of the comments, you can use a tuple of tags instead of a list; tuples are hashable, so they work with the Counter class in the collections module. Here's how to do that using the list-based approach of the code in your question, along with a few optimizations, since you have to process a large number of POS tags:

from collections import Counter

GROUP_SIZE = 5
counter = Counter()
mylist = []

with open("tags.txt", "r") as tagfile:
    tags = (line.strip() for line in tagfile)
    try:
        # pre-fill the window with the first GROUP_SIZE-1 tags
        while len(mylist) < GROUP_SIZE-1:
            mylist.append(next(tags))
    except StopIteration:
        pass

    for tag in tags:   # main loop
        mylist.append(tag)                  # window now holds GROUP_SIZE tags
        counter.update((tuple(mylist),))
        mylist.pop(0)                       # drop the oldest tag

if len(counter) < 1:
    print('too few tags in file')
else:
    for tags, count in counter.most_common(10):  # top 10
        print('{}, count = {:,d}'.format(list(tags), count))
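
The tuple conversion matters because Counter is a dict subclass, so its keys must be hashable. A list is mutable and has no hash, while a tuple of strings does. A minimal illustration, using made-up tag windows rather than anything from tags.txt:

```python
from collections import Counter

# Made-up tag windows, for illustration only.
windows = [["DT", "NN", "VB"], ["DT", "NN", "VB"], ["NN", "VB", "DT"]]

try:
    Counter(windows)               # lists as keys are rejected
    list_error = None
except TypeError as exc:
    list_error = exc               # "unhashable type: 'list'"

# Converting each window to a tuple makes it usable as a key.
counts = Counter(tuple(w) for w in windows)
top_window, top_count = counts.most_common(1)[0]
```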

It would be even better to use a deque from the collections module instead of a list for what you're doing, because a deque has very efficient O(1) appends and pops at either end, whereas popping from the front of a list is O(n). They also use less memory.
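
As a rough sketch of that difference, the two windows below are advanced by one tag each; the tags are made up for illustration. Both end up with the same contents, but list.pop(0) has to shift every remaining element, while deque.popleft() does not:

```python
from collections import deque

tags = ["DT", "JJ", "NN", "VBZ", "RB", "IN"]   # made-up tags

window_list = tags[:5]
window_deque = deque(tags[:5])

# Slide both windows forward by one tag.
window_list.pop(0)           # O(n): shifts the remaining elements
window_list.append(tags[5])

window_deque.popleft()       # O(1): no shifting needed
window_deque.append(tags[5])
```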

In addition, since Python 2.6 deques support a maxlen parameter, which eliminates the need to explicitly pop() elements off the end once the desired size has been reached. Here's an even more efficient version based on that:

from collections import Counter, deque

GROUP_SIZE = 5
counter = Counter()
mydeque = deque(maxlen=GROUP_SIZE)

with open("tags.txt", "r") as tagfile:
    tags = (line.strip() for line in tagfile)
    try:
        # pre-fill the window with the first GROUP_SIZE-1 tags
        while len(mydeque) < GROUP_SIZE-1:
            mydeque.append(next(tags))
    except StopIteration:
        pass

    for tag in tags:   # main loop
        mydeque.append(tag)   # maxlen discards the oldest tag automatically
        counter.update((tuple(mydeque),))

if len(counter) < 1:
    print('too few tags in file')
else:
    for tags, count in counter.most_common(10):  # top 10
        print('{}, count = {:,d}'.format(list(tags), count))
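
For reuse, the same sliding-window count can be wrapped in a function that accepts any iterable of tags. This is only a sketch, assuming Python 3; the name count_windows and the sample tags are hypothetical and not part of the original code:

```python
from collections import Counter, deque
from itertools import islice


def count_windows(tags, size=5):
    """Count every run of `size` consecutive tags (hypothetical helper)."""
    tags = iter(tags)
    # Pre-fill with size-1 tags; maxlen drops the oldest tag on each append.
    window = deque(islice(tags, size - 1), maxlen=size)
    counter = Counter()
    for tag in tags:
        window.append(tag)
        counter[tuple(window)] += 1
    return counter


# Made-up tags, for illustration only.
sample = ["DT", "NN", "VBZ", "DT", "NN", "VBZ", "DT", "NN", "VBZ"]
counts = count_windows(sample, size=3)
best_window, best_count = counts.most_common(1)[0]
```

With the file from the question, the call would look like count_windows(line.strip() for line in tagfile).most_common(10) inside the with block. If the input has fewer than size tags, the returned Counter is simply empty.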

Answer 1: accepted, 2013-06-21 21:34:21
