
Sorting lists of lists to get unique IDs for the last column

I have this data saved in a file:

['5',60680,60854,'gene_id "ENS1"']
['5',59106,89211,'gene_id "ENS1"']
['5',58686,58765,'gene_id "ENS1"']
['5',80835,93381,'gene_id "ENS2"']
['5',55555,92223,'gene_id "ENS2"']
['5',73902,74276,'gene_id "ENS2"']

I need help with Python to produce output in which each item in the 4th column appears only once, paired with the minimum value of the second column and the maximum value of the third column among the rows that share that item. So I want my output to look like this:

['5',58686,89211,'gene_id "ENS1"']
['5',55555,93381,'gene_id "ENS2"']

Each item in the 4th column should appear only once. How can I also get rid of the [] around the data? Thank you.

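>>> from itertools import groupby
>>> # lst holds the parsed rows; groupby only merges consecutive rows with the same key
>>> for i, j in groupby(lst, key=lambda x: x[3]):
...     t = list(zip(*j))  # transpose the group's rows into columns
...     print(t[0][0], min(t[1]), max(t[2]), t[3][0])
...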
5 58686 89211 gene_id "ENS1"
5 55555 93381 gene_id "ENS2"

It is not clear what you mean by getting rid of the []; the brackets are just Python's syntax for lists.
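If the goal is simply to print or write each row without the list brackets, one option is to join the fields into a single string. This is a minimal sketch, assuming the rows are already parsed into Python lists; the names `rows` and `format_row` and the tab separator are illustrative choices, not part of the original post.

```python
# Hypothetical merged rows, shaped like the desired output in the question.
rows = [
    ['5', 58686, 89211, 'gene_id "ENS1"'],
    ['5', 55555, 93381, 'gene_id "ENS2"'],
]

def format_row(row, sep="\t"):
    """Join the fields of one row into a bracket-free string."""
    return sep.join(str(field) for field in row)

for row in rows:
    print(format_row(row))
```

The brackets only appear when a list object itself is printed; once the fields are converted to strings and joined, the output is plain text.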

import re
# Capture the start, the end and the gene id from each list-formatted line.
pat = re.compile(r"\['[^']+',([^,]+),([^,]+),'([^']+)'\]")

ch = '''
['5',60680,60854,'gene_id "ENS1"']
['5',59106,89211,'gene_id "ENS1"']
['5',58686,58765,'gene_id "ENS1"']
['5',80835,93381,'gene_id "ENS2"']
['5',55555,92223,'gene_id "ENS2"']
['5',73902,74276,'gene_id "ENS2"']'''

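li = pat.findall(ch)
print(li)

deekmin = {}  # smallest start seen per gene id
deekmax = {}  # largest end seen per gene id
for a, b, c in li:  # iterate over every row, including the first
    a, b = int(a), int(b)  # compare as numbers, not as strings
    if c in deekmin:
        if a < deekmin[c]:
            deekmin[c] = a
        if b > deekmax[c]:
            deekmax[c] = b
    else:
        deekmin[c] = a
        deekmax[c] = b

res = [(deekmin[c], deekmax[c], c) for c in deekmin]
print(res)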
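As an alternative to the regular expression, each line in the sample data happens to be a valid Python list literal, so `ast.literal_eval` can parse it directly and keep the first column as well. The following is a sketch under that assumption; the helper names `parse_rows` and `merge_by_gene` are hypothetical, and the lines are given inline rather than read from a file so the example is self-contained.

```python
import ast

def parse_rows(lines):
    """Parse each non-empty line as a Python list literal."""
    return [ast.literal_eval(line) for line in lines if line.strip()]

def merge_by_gene(rows):
    """Keep one row per gene id with the minimum start and the maximum end."""
    merged = {}  # gene id -> (first column, min start, max end)
    for chrom, start, end, gene in rows:
        if gene in merged:
            _, lo, hi = merged[gene]
            merged[gene] = (chrom, min(lo, start), max(hi, end))
        else:
            merged[gene] = (chrom, start, end)
    # Dicts keep insertion order in Python 3.7+, so genes appear in first-seen order.
    return [[chrom, lo, hi, gene] for gene, (chrom, lo, hi) in merged.items()]

lines = [
    "['5',60680,60854,'gene_id \"ENS1\"']",
    "['5',59106,89211,'gene_id \"ENS1\"']",
    "['5',58686,58765,'gene_id \"ENS1\"']",
    "['5',80835,93381,'gene_id \"ENS2\"']",
    "['5',55555,92223,'gene_id \"ENS2\"']",
    "['5',73902,74276,'gene_id \"ENS2\"']",
]

result = merge_by_gene(parse_rows(lines))
for row in result:
    print(*row)  # unpacking prints the fields without brackets
```

Because the dictionary is keyed on the gene id, this version does not depend on rows for the same gene being adjacent in the file.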
