
Python iteration: sorting through a .txt file to extract wanted data

I have a sample inputfile.txt:

chr1    34870071    34899867    pi-Fam168b.1    -
chr11   98724946    98764609    pi-Wipf2.1  +
chr11   105898192   105920636   pi-Dcaf7.1  +
chr11   120486441   120495268   pi-Mafg.1   -
chr12   3891106 3914443 pi-Dnmt3a.1 +
chr12   82815946    82882157    pi-Map3k9.1 -
chr13   23855536    23856215    pi-Hist1h1a.1   +
chr13   55206682    55236190    pi-Zfp346.1 +
chr1    95700553    95718679    pi-Ing5.1   +
chr13   55313417    55419685    pi-Nsd1.1   +
chr14   27852218    27920472    pi-Il17rd.1 +
chr14   65430438    65568699    pi-Hmbox1.1 -
chr1    120524521   120581739   pi-Tfcp2l1.1    +
chr15   81633147    81657289    pi-Tef.1    +
chr15   89331804    89390691    pi-Shank3.1 +
chr15   103021983   103070259   pi-Cbx5.1   -
chr16   16896549    16927451    pi-Ppm1f.1  +
chr16   17233679    17263523    pi-Hic2.1   +
chr16   17452059    17486929    pi-Crkl.1   +
chr16   24393531    24992661    pi-Lpp.1    +
chr16   43964878    43979143    pi-Zdhhc23.1    -
chr17   25098236    25152532    pi-Cramp1l.1    -
chr17   27993451    28036985    pi-Uhrf1bp1.1   +
chr17   83973363    84031786    pi-Kcng3.1  -
chr1    133904194   133928161   pi-Elk4.1   +
chr18   60844148    60908308    pi-Ndst1.1  -
chr19   10057193    10059582    pi-Fth1.1   +
chr19   44637337    44650762    pi-Hif1an.1 +
chr1    135027714   135036359   pi-Ppp1r15b.1   +
chr2    28677821    28695861    pi-Gtf3c4.1 -
chr1    136651241   136852527   pi-Ppp1r12b.1   -
chr2    154262219   154365092   pi-Cbfa2t2.1    +
chr2    156022393   156135687   pi-Phf20.1  +
chr3    51028854    51055547    pi-Ccrn4l.1 +
chr3    94985683    95021902    pi-Gabpb2.1 -
chr1    158488203   158579750   pi-Abl2.1   +
chr4    45411294    45421633    pi-Mcart1.1 -
chr4    56879897    56960355    pi-D730040F13Rik.1  -
chr4    59818521    59917612    pi-Snx30.1  +
chr4    107847846   107890527   pi-Zyg11a.1 -
chr4    107900359   107973695   pi-Zyg11b.1 -
chr4    132195002   132280676   pi-Eya3.1   +
chr4    134968222   134989706   pi-Rcan3.1  -
chr4    136025678   136110697   pi-Luzp1.1  +
chr1    162933052   162964958   pi-Zbtb37.1 -
chr5    38591490    38611628    pi-Zbtb49.1 -
chr5    67783388    67819359    pi-Bend4.1  -
chr5    114387108   114443767   pi-Ssh1.1   -
chr5    115592990   115608225   pi-Mlec.1   -
chr5    143628624   143656891   pi-Fbxl18.1 -
chr1    172123561   172145541   pi-Uhmk1.1  -
chr6    83312367    83391602    pi-Tet3.1   -
chr6    85419571    85434653    pi-Fbxo41.1 -
chr6    116288039   116359551   pi-March08.1    +
chr6    120786229   120842859   pi-Bcl2l13.1    +
chr7    71031236    71083761    pi-Klf13.1  -
chr7    107068766   107128968   pi-Rnf169.1 -
chr7    139903770   140044311   pi-Fam53b.1 -
chr8    72285224    72298794    pi-Zfp866.1 -
chr8    106872110   106919708   pi-Cmtm4.1  -
chr8    112250549   112261649   pi-Atxn1l.1 -
chr10   41901651    41911816    pi-Foxo3.1  -
chr8    119682164   119739895   pi-Gan.1    +
chr8    125406988   125566154   pi-Ankrd11.1    -
chr9    27148219    27165314    pi-Igsf9b.1 +
chr9    44100521    44113717    pi-Hinfp.1  -
chr9    61761092    61762348    pi-Rplp1.1  -
chr9    106590412   106691503   pi-Rad54l2.1    -
chr9    114416339   114473487   pi-Trim71.1 -
chr9    119311403   119351032   pi-Acvr2b.1 +
chr9    119354082   119373348   pi-Exog.1   +
chr10   82822985    82831579    pi-D10Wsu102e.1 +
chr10   126415753   126437016   pi-Ctdsp2.1 +
chr1    90159688    90174093    pi-Hjurp.1  -
chr11   60591039    60597792    pi-Smcr8.1  +
chr11   69209318    69210176    pi-Lsmd1.1  +
chr11   75345218    75391069    pi-Slc43a2.1    +
chr11   79474214    79511524    pi-Rab11fip4.1  +
chr11   95818479    95868022    pi-Igf2bp1.1    -
chr11   97223641    97259855    pi-Socs7.1  +
chr11   97524530    97546757    pi-Mllt6.1  +
chr1    120355721   120355843   1-qE2.3-2.1 -
chr2    120518324   120540873   2-qE5-4.1   +
chr7    82913927    82926993    7-qD2-40.1  -

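Column 1 = chromosome_number

Column 2 = start

Column 3 = end

Column 4 = gene_name

Column 5 = direction (+ or -)

1.) I need to extract the rows that share the same chromosome number (column 1), whose start sites (column 2) differ by at most 200 (200 or less), and that lie in opposite directions (one +, the other -).

Here is what I have so far; I can't tell where my mistake is: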

import csv
import itertools as it
f=open('inputfile.txt', 'r')

def getrecords(f):
    for line in open(f):
        yield line.strip().split()
key=lambda x: x[0]
for i, rec in it.groupby(sorted(getrecords('inputfile.txt'), key=key), key=key):
    for c0, c1 in it.combinations(rec, 2):
        if (c0[4]!= c1[4] and (abs(int(c0[1])-int(c1[1]))) < 200):
            print ("%s\t%s\t%s" % (c0[0], c0[1], c0[3]))
            print("%s\t%s\t%s" % (c1[0], c1[1], c1[3]))

Please note: this code runs but produces no output, even though I'm sure there should be some. I expect about 15 unique sequence lines. Expected output:

ChrX   start_number1            gene_name1
ChrX   start_number1+/-200      gene_name2
ChrY   start_number2            gene_name3
ChrY   start_number2+/-200      gene_name4

I would then go through these lines to remove duplicates.

No values in your example meet the specified criteria, so I added one line to inputfile.txt:

chr1    34870091    34899887    pi-Fam168b.1 +

I copied the first line of inputfile.txt, added 20 to the integers in the second and third columns, and flipped the strand to +.

First, you don't need to import csv, since you never use it. You should import groupby, product, and itemgetter, which I'll explain below.

from itertools import groupby,product
from operator import itemgetter

This block just parses inputfile.txt into a usable data structure (a list of dictionaries), where each record in the file becomes one dictionary in the sites list.

with open('/home/kevin/inputfile.txt', 'r') as f:  # should use with open(); text mode keeps the fields as str on Python 3
    sites = []  # list to hold each record as a dictionary
    for row in f:
        row = tuple(row.strip().split())
        d = {'chr': row[0], 'start': row[1], 'stop': row[2], 'gene_name': row[3], 'strand': row[4]}
        sites.append(d)
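
To make the shape of one parsed record concrete, here is a minimal sketch that parses two of the sample lines from an in-memory string rather than from the file. The `parse_sites` helper and the `FIELDS` tuple are hypothetical names used only for illustration, and unlike the block above this sketch converts `start` and `stop` to integers while parsing, which is an assumption about what is convenient later.

```python
import io

# Two lines copied from the sample inputfile.txt, held in memory for the demo.
SAMPLE = """\
chr1    34870071    34899867    pi-Fam168b.1    -
chr11   98724946    98764609    pi-Wipf2.1  +
"""

FIELDS = ('chr', 'start', 'stop', 'gene_name', 'strand')


def parse_sites(handle):
    """Return one dict per well-formed line (hypothetical helper)."""
    sites = []
    for line in handle:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip blank or malformed lines
        record = dict(zip(FIELDS, parts))
        record['start'] = int(record['start'])  # assumption: convert once, here
        record['stop'] = int(record['stop'])
        sites.append(record)
    return sites


sites_demo = parse_sites(io.StringIO(SAMPLE))
print(sites_demo[0])
```

Passing an open file object instead of `io.StringIO(SAMPLE)` should work the same way, since the helper only iterates over lines.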

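I sort first, using itemgetter to select the key. Then, when you groupby strand, the dictionaries can be separated into a list of all the plus strands and a list of all the minus strands: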

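plus = []
minus = []

sites.sort(key=itemgetter('strand'))  # groupby only merges adjacent items, so sort by strand first
for elmt, grp in groupby(sites, itemgetter('strand')):  # sites is our sorted list of dicts
    for item in grp:
        if elmt == '+':
            plus.append(item)
        else:
            minus.append(item)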
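
The reason sorting comes up at all is that `itertools.groupby` only groups consecutive items with equal keys. A small sketch on a made-up list of strand symbols (the list is illustrative, not taken from the file) shows the difference:

```python
from itertools import groupby

strands = ['+', '-', '+', '+', '-']  # illustrative input, not from the file

# Unsorted: equal keys that are not adjacent end up in separate groups.
unsorted_groups = [(key, len(list(grp))) for key, grp in groupby(strands)]

# Sorted: exactly one group per distinct key.
sorted_groups = [(key, len(list(grp))) for key, grp in groupby(sorted(strands))]

print(unsorted_groups)  # [('+', 1), ('-', 1), ('+', 2), ('-', 1)]
print(sorted_groups)    # [('+', 3), ('-', 2)]
```

Because the loop above appends every item to plus or minus according to its key, it ends up with the same two lists either way; the sort only matters if you rely on getting a single group per strand.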

Now you can use product to iterate over plus and minus. It behaves like a nested for loop, and inside it we compare the start positions:

for p, m in product(plus, minus):
    if p['chr'] == m['chr'] and abs(int(p['start']) - int(m['start'])) < 200:
        print("%s\t%s\t%s" % (p['chr'], p['start'], p['gene_name']))
        print("%s\t%s\t%s" % (m['chr'], m['start'], m['gene_name']))

This returns:

chr1    34870091    pi-Fam168b.1 #remember I artificially added this one
chr1    34870071    pi-Fam168b.1
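
As a variation on the steps above, the following self-contained sketch groups records by chromosome first, so product only pairs plus and minus records on the same chromosome, and then drops repeated rows, which is the follow-up step mentioned in the question. The `find_opposite_pairs` name and the `max_diff` parameter are hypothetical. The sketch uses `<=` to match the "200 or less" wording, whereas the code above uses a strict `<`, and it assumes the strand column only ever contains `+` or `-`.

```python
from collections import defaultdict
from itertools import product


def find_opposite_pairs(records, max_diff=200):
    """Yield (plus, minus) rows on the same chromosome whose start
    positions differ by at most max_diff (hypothetical helper)."""
    # Assumption: strand is always '+' or '-'; anything else raises KeyError.
    by_chrom = defaultdict(lambda: {'+': [], '-': []})
    for chrom, start, stop, name, strand in records:
        by_chrom[chrom][strand].append((chrom, int(start), name))
    for strands in by_chrom.values():
        for p, m in product(strands['+'], strands['-']):
            if abs(p[1] - m[1]) <= max_diff:
                yield p, m


records = [
    ('chr1', '34870071', '34899867', 'pi-Fam168b.1', '-'),
    ('chr1', '34870091', '34899887', 'pi-Fam168b.1', '+'),  # the artificially added line
    ('chr11', '98724946', '98764609', 'pi-Wipf2.1', '+'),
]

# Keep each row once, in the order it is first seen.
seen = set()
unique_rows = []
for pair in find_opposite_pairs(records):
    for row in pair:
        if row not in seen:
            seen.add(row)
            unique_rows.append(row)

for chrom, start, name in unique_rows:
    print("%s\t%d\t%s" % (chrom, start, name))
```

Grouping by chromosome before calling product avoids comparing every plus record against every minus record in the whole file, which may matter for larger inputs.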

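For reference, this type of task can be done more elegantly with the Python library pandas. Bedtools (written in C++, I believe) is designed specifically for .bed files, which is the format you are working with. HTH!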
