
Find pattern, count and print

I have a file on which I have to do two things. First, count the occurrences of a particular pattern; then, if the count is more than 5, print all the lines containing that pattern.

Input file:

  0-  0:        2257042_7       2930711_14  
  0-  1:        2257042_8       2930711_13    
  0-  2:        2257042_9       2930711_12    
  0-  3:        2257042_10      2930711_11  
  0-  4:        2257042_11      2930711_10  
  0-  5:        2257042_13      2930711_8   
  0-  6:        2257042_14      2930711_7   
  0-  7:        2257042_15      2930711_6   
  0-  8:        2257042_16      2930711_5  
  1-  0:        2258476_3       2994500_2  
  1-  1:        2258476_4       2994500_3          
  1-  2:        2258476_5       2994500_4  
  1-  3:        2258476_6       2994500_5  
  1-  4:        2258476_7       2994500_6       
  2-  0:        2259527_1       2921847_10       
  2-  1:        2259527_2       2921847_9   
  2-  2:        2259527_3       2921847_8                                   
  2-  3:        2259527_4       2921847_7   
  2-  4:        2259527_5       2921847_6   
  2-  5:        2259527_6       2921847_5  
 38-  0:        2323304_2       3043768_5
 38-  1:        2323304_3       3043768_6                                   
 38-  2:        2323304_4       3043768_7                                    
 38-  3:        2323304_5       3043768_8                                                                     
 38-  4:        2323304_6       3043768_9                                                                      
 38-  5:        2323304_7       3043768_10                                                                     
 38-  6:        2323304_8       3043768_11                                    
 39-  0:        2323953_1       3045012_9                                   
 39-  1:        2323953_2       3045012_8                                                                     
 39-  2:        2323953_3       3045012_7                                                                    
 39-  3:        2323953_4       3045012_6                                                                    
 39-  4:        2323953_7       3045012_3        
 39-  5:        2323953_8       3045012_2         
 40-  0:        2331568_2       3042876_8         
 40-  1:        2331568_3       3042876_7        
 40-  2:        2331568_4       3042876_6         
 40-  3:        2331568_5       3042876_5        
 40-  4:        2331568_6       3042876_4         
 40-  5:        2331568_9       3042876_2         
 40-  6:        2331568_10      3042876_1        

Expected output:

  0-  0:        2257042_7       2930711_14                                           
  0-  1:        2257042_8       2930711_13                                            
  0-  2:        2257042_9       2930711_12                                               
  0-  3:        2257042_10      2930711_11                                               
  0-  4:        2257042_11      2930711_10                                             
  0-  5:        2257042_13      2930711_8                                               
  0-  6:        2257042_14      2930711_7                                              
  0-  7:        2257042_15      2930711_6                                               
  0-  8:        2257042_16      2930711_5                                                
 38-  0:        2323304_2       3043768_5
 38-  1:        2323304_3       3043768_6                                              
 38-  2:        2323304_4       3043768_7                                                 
 38-  3:        2323304_5       3043768_8                                               
 38-  4:        2323304_6       3043768_9                                               
 38-  5:        2323304_7       3043768_10                                             
 38-  6:        2323304_8       3043768_11                                              
 40-  0:        2331568_2       3042876_8
 40-  1:        2331568_3       3042876_7                                              
 40-  2:        2331568_4       3042876_6                                               
 40-  3:        2331568_5       3042876_5                                              
 40-  4:        2331568_6       3042876_4                                                
 40-  5:        2331568_9       3042876_2                                             
 40-  6:        2331568_10      3042876_1                                               

I wrote the code below for this, but I don't know what is wrong with it: I am not getting the expected output. Code:

import sys                                                                                        
coll_file = open (sys.argv[1]).readlines()
old_pattern = ''

for lines in coll_file:        
           pattern_count = 0                                               
           split_line = lines.split('\t')               
           pattern    = split_line[0]                                                   
           if pattern == old_pattern:
                    pattern_count = pattern_count+1
                    if pattern_count > '5': 
                             print lines.strip()  
                             old_pattern = pattern
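  1. Comparing an int object with a str object is meaningless. In Python 2 an int always compares as less than a str, so the test is never true (Python 3 raises a TypeError instead):

     >>> 1 > '5'
     False
     >>> 10 > '5'
     False

  2. The following condition will never be met, because old_pattern is only assigned inside that same if block and so never changes from its initial value '':

     pattern == old_pattern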
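
Point 1 can be illustrated with a small sketch. The helper name over_threshold is hypothetical (it is not part of the question's code), and it assumes the threshold might arrive as a string, for instance from the command line, so it is converted to int before the comparison.

```python
def over_threshold(count, threshold):
    """Return True when count is strictly greater than threshold.

    threshold may be an int or a numeric string such as '5'; it is
    converted to int so that the comparison is numeric.
    """
    return count > int(threshold)


print(over_threshold(1, '5'))    # False
print(over_threshold(10, '5'))   # True
```

In the question's loop, writing the literal as an int (pattern_count > 5) has the same effect.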
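import csv
from collections import defaultdict

# Group the rows by their first tab-separated field.
d = defaultdict(list)

with open('lines.txt') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        if row:  # csv.reader yields an empty list for a blank line
            d[row[0]].append(row)

# Print every group that has more than 5 rows.
for k, v in d.items():
    if len(v) > 5:
        print(v)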
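
A plain dictionary in Python 2 does not preserve insertion order, so the groups above may come out in a different order than in the file. As a rough sketch of an order-preserving variant, assuming that lines with the same pattern are always adjacent (as in the sample input) and that the pattern is the first whitespace-separated token, itertools.groupby can be used. The names first_field and lines_in_large_groups and the shortened sample data are illustrative only and do not come from the original answers.

```python
from itertools import groupby


def first_field(line):
    """Return the first whitespace-separated token of a line, e.g. '38-'."""
    parts = line.split()
    return parts[0] if parts else ''


def lines_in_large_groups(lines, min_count=5):
    """Yield the lines of every run of consecutive lines that share the same
    first token, provided the run has more than min_count lines."""
    stripped = (line.rstrip() for line in lines)
    non_blank = (line for line in stripped if line)
    for _, group in groupby(non_blank, key=first_field):
        group = list(group)
        if len(group) > min_count:
            for line in group:
                yield line


# Shortened, made-up sample; min_count=2 keeps the demonstration small.
sample = [
    "  0-  0: a",
    "  0-  1: b",
    "  0-  2: c",
    "  1-  0: d",
    " 38-  0: e",
    " 38-  1: f",
    " 38-  2: g",
]

for line in lines_in_large_groups(sample, min_count=2):
    print(line)
```

With the default min_count=5 and an open file object passed as lines, this prints the groups that have more than 5 lines, in file order.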

You have to use a buffer, line_buffer, to store the lines of the current pattern. When the next pattern differs from the previous one and the count of buffered lines is greater than 5, print the buffer. After printing, reset the buffer.

If the pattern is equal to the old pattern, increment the count (count += 1); otherwise set the count to 1.

At the end, you still have to check the count once more: if it is greater than 5, print the buffer.

That's all.

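import sys

# Read every line of the file named on the command line.
coll_file = open(sys.argv[1]).readlines()
old_pattern = ''
line_buffer = []     # lines of the group currently being read
pattern_count = 0    # number of lines in that group

for lines in coll_file:
    lines = lines.rstrip('\n')
    split_line = lines.split()    # split on any whitespace, ignoring leading spaces
    if not split_line:
        continue                  # skip blank lines
    pattern = split_line[0]
    if pattern == old_pattern:
        pattern_count = pattern_count + 1
        line_buffer.append(lines)
    else:
        # A new group starts: print the previous one if it is large enough.
        if pattern_count > 5:
            print('\n'.join(line_buffer))
        old_pattern = pattern
        line_buffer = [lines]     # the new group begins with the current line
        pattern_count = 1

# The last group is still buffered when the loop ends.
if pattern_count > 5:
    print('\n'.join(line_buffer))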
