
List the most common lists, from a list

I have this code:

text = open("tags.txt", "r")
mylist = []
metalist = []

for line in text:
    mylist.append(line)

    if len(mylist) == 5:
        metalist.append(mylist)
        mylist.pop(0)

The code opens a text file with one POS tag per line. It adds the first 5 POS tags to mylist, which is then added to metalist. It then moves down one line and builds the next sequence of 5 POS tags. The text file has about 110k tags in total. I need to find the most common POS tag sequences in metalist. I tried using Counter from the collections module, but lists are not hashable. What is the best way to approach this?

As mentioned in one of the comments, you can use a tuple of tags instead of a list; tuples are hashable, so they work with the Counter class in the collections module. Here's how to do that using the list-based approach from your question, along with a few optimizations, since you have to process a large number of POS tags:

from collections import Counter

GROUP_SIZE = 5
counter = Counter()
mylist = []

with open("tags.txt", "r") as tagfile:
    tags = (line.strip() for line in tagfile)
    # Pre-fill the window with the first GROUP_SIZE-1 tags.
    try:
        while len(mylist) < GROUP_SIZE-1:
            mylist.append(next(tags))
    except StopIteration:
        pass

    for tag in tags:   # main loop
        mylist.append(tag)            # window now holds GROUP_SIZE tags
        counter[tuple(mylist)] += 1   # tuples are hashable, lists are not
        mylist.pop(0)                 # drop the oldest tag

if len(counter) < 1:
    print('too few tags in file')
else:
    for tags, count in counter.most_common(10):  # top 10
        print('{}, count = {:,d}'.format(list(tags), count))
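The reason for the tuple conversion can be seen in a small sketch. The tag values below are made up for illustration and are not taken from the original file: a list cannot be used as a Counter key, while a tuple with the same contents can.

```python
from collections import Counter

# Hypothetical tag windows, for illustration only.
windows = [["DT", "NN", "VBZ"], ["DT", "NN", "VBZ"], ["NN", "VBZ", "DT"]]

try:
    Counter(windows)        # lists are mutable, so they are unhashable
    list_error = None
except TypeError as exc:
    list_error = exc        # the failure described in the question

# Converting each window to a tuple makes it usable as a dictionary key.
counts = Counter(tuple(w) for w in windows)
print(counts.most_common(1))
```

Converting back with list(tags) when printing, as the code above does, is purely cosmetic.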

However, it would be even better to use a deque from the collections module instead of a list for what you're doing: a deque has very efficient, O(1), appends and pops at either end, whereas list.pop(0) is O(n) because every remaining element has to be shifted down.

In addition, since Python 2.6 deques support a maxlen parameter. Once the deque is full, each append automatically discards the item at the opposite end, which eliminates the need to pop() elements explicitly, so here's an even more efficient version based on that:

from collections import Counter, deque

GROUP_SIZE = 5
counter = Counter()
mydeque = deque(maxlen=GROUP_SIZE)

with open("tags.txt", "r") as tagfile:
    tags = (line.strip() for line in tagfile)
    # Pre-fill the window with the first GROUP_SIZE-1 tags.
    try:
        while len(mydeque) < GROUP_SIZE-1:
            mydeque.append(next(tags))
    except StopIteration:
        pass

    for tag in tags:   # main loop
        mydeque.append(tag)            # oldest tag is dropped automatically
        counter[tuple(mydeque)] += 1

if len(counter) < 1:
    print('too few tags in file')
else:
    for tags, count in counter.most_common(10):  # top 10
        print('{}, count = {:,d}'.format(list(tags), count))
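The same sliding-window idea can also be wrapped in a small function, which makes it easy to try on a short in-memory sequence. This is only a sketch: the name count_windows and the sample tags are assumptions for illustration, not part of the original answer.

```python
from collections import Counter, deque

def count_windows(tags, size):
    """Count every run of `size` consecutive items in `tags`."""
    window = deque(maxlen=size)   # oldest item is dropped automatically
    counter = Counter()
    for tag in tags:
        window.append(tag)
        if len(window) == size:   # skip the partial windows at the start
            counter[tuple(window)] += 1
    return counter

# Hypothetical sample data, for illustration only.
sample = ["DT", "NN", "VBZ", "DT", "NN", "VBZ", "DT"]
result = count_windows(sample, 3)
for seq, count in result.most_common(2):
    print(list(seq), count)
```

For the real file, the generator of stripped lines could be passed in directly, for example count_windows((line.strip() for line in tagfile), 5) inside the with block. Checking the window length on every iteration replaces the separate pre-fill step, at the cost of one extra comparison per tag.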
