
How to perform this sort operation in Python

I am creating a module to analyse the frequencies of patterns of tokens and delimiters in a given text that has been split into sentences.

I have a class "SequencePattern" which identifies one element (token or delimiter) in a set of tokenised sentences. Each SequencePattern has a list attribute "occurrences" consisting of tuples ( n_sentence, n_element ) giving the positions where this particular element actually occurs. Class SequencePattern also has a class-level field, seq_patterns (a set), where all the individual SequencePattern instances are stored.
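For readers who want something runnable to experiment with, the class described above might look roughly like the sketch below. The constructor signature, the choice to register instances inside `__new__`, and the sample tokens are all assumptions made for illustration; the question does not show the real class.

```python
class SequencePattern(tuple):
    """Hypothetical reconstruction: a tuple of elements plus their positions."""

    # Class-level registry holding every instance created so far.
    seq_patterns = set()

    def __new__(cls, elements, occurrences=()):
        # tuple is immutable, so its contents must be set in __new__.
        self = super().__new__(cls, elements)
        # Assumed layout: a list of ( n_sentence, n_element ) tuples.
        self.occurrences = list(occurrences)
        cls.seq_patterns.add(self)
        return self


# Two single-element patterns with made-up positions.
the = SequencePattern(("the",), [(0, 0), (1, 3)])
comma = SequencePattern((",",), [(0, 4), (2, 1)])

print(len(SequencePattern.seq_patterns))  # 2
print(the.occurrences)                    # [(0, 0), (1, 3)]
```

Because instances hash and compare as ordinary tuples, two patterns with identical elements would collapse into one entry in the set, which suits a registry of distinct patterns.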

At this stage in the processing I only have single-element SequencePatterns, and have weeded out all those with fewer than 2 occurrences. SequencePattern is a subclass of tuple, though, and the idea is now to find the "two-element" SequencePatterns.

The next thing I need to do is to go through all the one-element SequencePatterns which remain after weeding, identifying spots where two (or more) occurrences are adjacent in the same sentence, i.e. where n_sentence is the same and n_element differs by 1.

So I need to do something along these lines:

occurrences_by_text_order = sorted( SequencePattern.seq_patterns.occurrences )

... but of course this doesn't work. I get:

AttributeError: 'set' object has no attribute 'occurrences'

Somehow I need to iterate over all the SequencePatterns in seq_patterns and then, for each one, do a "nested" iteration over all of its occurrences, and submit the resulting mass of tuples ( n_sentence, n_element ) to the sorted function.

I'm not an experienced Pythonista but I have a suspicion this is a job for a generator (?). Can anyone help?

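def get_occurrences():
    # Yield every ( n_sentence, n_element ) tuple from every stored pattern.
    for seq_patt in SequencePattern.seq_patterns:
        for occurrence in seq_patt.occurrences:
            yield occurrence

# Tuples compare element by element, so this sorts by sentence, then by position.
occurrences_by_text_order = sorted( get_occurrences() )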
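
The generator function above can also be written without a helper by flattening the per-pattern lists with `itertools.chain.from_iterable` from the standard library. The sketch below is only an illustration: it uses `SimpleNamespace` objects as stand-ins for SequencePattern instances, and the sample positions are invented.

```python
from itertools import chain
from types import SimpleNamespace

# Stand-ins for SequencePattern instances; only .occurrences matters here.
seq_patterns = [
    SimpleNamespace(occurrences=[(1, 3), (0, 0)]),
    SimpleNamespace(occurrences=[(0, 1), (2, 5)]),
]

# Flatten all the occurrence lists into one stream, then sort it.
occurrences_by_text_order = sorted(
    chain.from_iterable(patt.occurrences for patt in seq_patterns)
)

print(occurrences_by_text_order)
# [(0, 0), (0, 1), (1, 3), (2, 5)]
```

Either form feeds `sorted` lazily; the choice between them is mostly a matter of taste.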

The following then prints out a list of all the two-element sequences which may occur more than once (we now know that there is no possibility of two-element sequences with frequency > 1 occurring anywhere else):

prev_occurrence = None
for occurrence in sorted( occurrence for seq_patt in SequencePattern.seq_patterns for occurrence in seq_patt.occurrences ):
    if prev_occurrence and ( occurrence[ 0 ] == prev_occurrence[ 0 ] ) and ( occurrence[ 1 ] - prev_occurrence[ 1 ] == 1 ):  
        print( '# prev_occurrence %s occurrence: %s' % ( prev_occurrence, occurrence, ))
    prev_occurrence = occurrence
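
If the adjacent pairs are needed for further processing rather than just printed, the same check can be wrapped in a function that returns them. This is a minimal sketch under the assumption that each position in the text appears at most once across all patterns; the function name `adjacent_pairs` and the sample data are hypothetical.

```python
def adjacent_pairs(occurrences):
    """Return (earlier, later) pairs that sit side by side in one sentence."""
    ordered = sorted(occurrences)
    return [
        (prev, curr)
        # Pair each sorted occurrence with the one that follows it.
        for prev, curr in zip(ordered, ordered[1:])
        if prev[0] == curr[0] and curr[1] - prev[1] == 1
    ]


# Invented ( n_sentence, n_element ) positions, deliberately unsorted.
sample = [(0, 4), (1, 2), (0, 0), (0, 1), (1, 0), (2, 7), (0, 2)]

print(adjacent_pairs(sample))
# [((0, 0), (0, 1)), ((0, 1), (0, 2))]
```

Comparing only neighbours in the sorted list is enough, because no other tuple can sort between ( s, e ) and ( s, e + 1 ).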
