Find pattern, count and print
I have a file for which I have to do two things: first, count a particular pattern; then, if the count is more than 5, print all the lines containing that pattern.
Input file:
0- 0: 2257042_7 2930711_14
0- 1: 2257042_8 2930711_13
0- 2: 2257042_9 2930711_12
0- 3: 2257042_10 2930711_11
0- 4: 2257042_11 2930711_10
0- 5: 2257042_13 2930711_8
0- 6: 2257042_14 2930711_7
0- 7: 2257042_15 2930711_6
0- 8: 2257042_16 2930711_5
1- 0: 2258476_3 2994500_2
1- 1: 2258476_4 2994500_3
1- 2: 2258476_5 2994500_4
1- 3: 2258476_6 2994500_5
1- 4: 2258476_7 2994500_6
2- 0: 2259527_1 2921847_10
2- 1: 2259527_2 2921847_9
2- 2: 2259527_3 2921847_8
2- 3: 2259527_4 2921847_7
2- 4: 2259527_5 2921847_6
2- 5: 2259527_6 2921847_5
38- 0: 2323304_2 3043768_5
38- 1: 2323304_3 3043768_6
38- 2: 2323304_4 3043768_7
38- 3: 2323304_5 3043768_8
38- 4: 2323304_6 3043768_9
38- 5: 2323304_7 3043768_10
38- 6: 2323304_8 3043768_11
39- 0: 2323953_1 3045012_9
39- 1: 2323953_2 3045012_8
39- 2: 2323953_3 3045012_7
39- 3: 2323953_4 3045012_6
39- 4: 2323953_7 3045012_3
39- 5: 2323953_8 3045012_2
40- 0: 2331568_2 3042876_8
40- 1: 2331568_3 3042876_7
40- 2: 2331568_4 3042876_6
40- 3: 2331568_5 3042876_5
40- 4: 2331568_6 3042876_4
40- 5: 2331568_9 3042876_2
40- 6: 2331568_10 3042876_1
Expected output:
0- 0: 2257042_7 2930711_14
0- 1: 2257042_8 2930711_13
0- 2: 2257042_9 2930711_12
0- 3: 2257042_10 2930711_11
0- 4: 2257042_11 2930711_10
0- 5: 2257042_13 2930711_8
0- 6: 2257042_14 2930711_7
0- 7: 2257042_15 2930711_6
0- 8: 2257042_16 2930711_5
38- 0: 2323304_2 3043768_5
38- 1: 2323304_3 3043768_6
38- 2: 2323304_4 3043768_7
38- 3: 2323304_5 3043768_8
38- 4: 2323304_6 3043768_9
38- 5: 2323304_7 3043768_10
38- 6: 2323304_8 3043768_11
40- 0: 2331568_2 3042876_8
40- 1: 2331568_3 3042876_7
40- 2: 2331568_4 3042876_6
40- 3: 2331568_5 3042876_5
40- 4: 2331568_6 3042876_4
40- 5: 2331568_9 3042876_2
40- 6: 2331568_10 3042876_1
I wrote the code below for this, but I don't know what is wrong with it. I am not getting the expected output. Code:
import sys
coll_file = open (sys.argv[1]).readlines()
old_pattern = ''
for lines in coll_file:
    pattern_count = 0
    split_line = lines.split('\t')
    pattern = split_line[0]
    if pattern == old_pattern:
        pattern_count = pattern_count+1
        if pattern_count > '5':
            print lines.strip()
            old_pattern = pattern
Comparing an int object with a str object is meaningless. In Python 2 (CPython) an int always sorts before a str, so the comparison is False whatever the values are:
>>> 1 > '5'
False
>>> 10 > '5'
False
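For what it's worth, Python 3 refuses this comparison outright instead of quietly returning False. Here is a minimal sketch of that behaviour, assuming Python 3; the helper name `safe_greater` is made up for illustration and is not part of the original code.

```python
def safe_greater(count, threshold):
    """Return count > threshold, or None when the two types cannot be ordered."""
    try:
        return count > threshold
    except TypeError:
        # Python 3 raises TypeError for int > str instead of returning False.
        return None


print(safe_greater(10, '5'))       # int compared with str
print(safe_greater(10, int('5')))  # int compared with int
```

Either way, the fix for the question's code is the same: compare against the integer 5 (or convert the string with int() first).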
The following condition will never be met, because old_pattern never changes from its initial value:
pattern == old_pattern
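One way around this is to update the remembered key on every line, not only inside the branch that tests it. A rough sketch, assuming Python 3 and assuming the key is the first whitespace-separated field; `run_lengths` is a hypothetical helper name, not something from the question.

```python
def run_lengths(lines):
    """Return [key, count] pairs for runs of consecutive lines sharing a first field."""
    runs = []
    old_key = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key = fields[0]
        if key == old_key:
            runs[-1][1] += 1       # same key as the previous line: extend the run
        else:
            runs.append([key, 1])  # new key: start a new run
        old_key = key  # updated on every line, whether or not the key matched
    return runs


sample = [
    "0- 0: 2257042_7 2930711_14",
    "0- 1: 2257042_8 2930711_13",
    "1- 0: 2258476_3 2994500_2",
]
print(run_lengths(sample))  # [['0-', 2], ['1-', 1]]
```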
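import csv
from collections import defaultdict

# Group the rows by their first column.
d = defaultdict(list)
with open('lines.txt') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        d[row[0]].append(row)

# Print the rows of every group that has more than 5 rows.
for k, v in d.items():
    if len(v) > 5:
        for row in v:
            print('\t'.join(row))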
You have to use a buffer, line_buffer, to store the lines. If the next pattern is different from the previous one and the count of the previous lines is greater than 5, print the buffer. After printing, you have to reinitialize the buffer.
If the pattern is equal to the old pattern, count += 1; otherwise set the count to 1.
At the end, you still have to check the count once more; if it is greater than 5, print the buffer.
That's all.
import sys

# Read every line of the file named on the command line.
coll_file = open(sys.argv[1]).readlines()
old_pattern = ''
line_buffer = []
pattern_count = 0
for lines in coll_file:
    lines = lines.rstrip('\n')
    split_line = lines.split(' ')
    pattern = split_line[0]
    if pattern == old_pattern:
        pattern_count = pattern_count + 1
        line_buffer.append(lines)
    else:
        # A new pattern starts: print the previous group if it is large enough.
        if pattern_count > 5:
            print('\n'.join(line_buffer))
        old_pattern = pattern
        # Start the new buffer with the current line so it is not lost.
        line_buffer = [lines]
        pattern_count = 1
# Check the last group as well.
if pattern_count > 5:
    print('\n'.join(line_buffer))
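As an alternative to managing the buffer by hand, the same idea can be sketched with itertools.groupby from the standard library, which groups consecutive items that share a key. This is only a sketch under some assumptions: Python 3, the key is the first whitespace-separated field, and the names `first_field`, `filter_runs` and `min_count` are invented for illustration. The sample lines are made-up data, not taken from the question.

```python
from itertools import groupby


def first_field(line):
    """Key function: the first whitespace-separated field of a line."""
    return line.split()[0]


def filter_runs(lines, min_count=5):
    """Return the lines of every run of more than min_count consecutive lines sharing a key."""
    kept = []
    # Drop trailing newlines and skip blank lines before grouping.
    stripped = (line.rstrip("\n") for line in lines if line.strip())
    for _key, group in groupby(stripped, key=first_field):
        block = list(group)
        if len(block) > min_count:
            kept.extend(block)
    return kept


# Made-up sample: six lines with key 'a-' followed by five lines with key 'b-'.
sample = ["a- %d: x y" % i for i in range(6)] + ["b- %d: x y" % i for i in range(5)]
print("\n".join(filter_runs(sample)))  # prints only the six 'a-' lines
```

The threshold is a parameter on purpose. With min_count=5 the six-line groups 2- and 39- from the question would also be kept, although they are missing from the expected output shown there; min_count=6 appears to reproduce that output.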