Python iteration: sorting through a .txt file to extract wanted data
I have a sample inputfile.txt:
chr1 34870071 34899867 pi-Fam168b.1 -
chr11 98724946 98764609 pi-Wipf2.1 +
chr11 105898192 105920636 pi-Dcaf7.1 +
chr11 120486441 120495268 pi-Mafg.1 -
chr12 3891106 3914443 pi-Dnmt3a.1 +
chr12 82815946 82882157 pi-Map3k9.1 -
chr13 23855536 23856215 pi-Hist1h1a.1 +
chr13 55206682 55236190 pi-Zfp346.1 +
chr1 95700553 95718679 pi-Ing5.1 +
chr13 55313417 55419685 pi-Nsd1.1 +
chr14 27852218 27920472 pi-Il17rd.1 +
chr14 65430438 65568699 pi-Hmbox1.1 -
chr1 120524521 120581739 pi-Tfcp2l1.1 +
chr15 81633147 81657289 pi-Tef.1 +
chr15 89331804 89390691 pi-Shank3.1 +
chr15 103021983 103070259 pi-Cbx5.1 -
chr16 16896549 16927451 pi-Ppm1f.1 +
chr16 17233679 17263523 pi-Hic2.1 +
chr16 17452059 17486929 pi-Crkl.1 +
chr16 24393531 24992661 pi-Lpp.1 +
chr16 43964878 43979143 pi-Zdhhc23.1 -
chr17 25098236 25152532 pi-Cramp1l.1 -
chr17 27993451 28036985 pi-Uhrf1bp1.1 +
chr17 83973363 84031786 pi-Kcng3.1 -
chr1 133904194 133928161 pi-Elk4.1 +
chr18 60844148 60908308 pi-Ndst1.1 -
chr19 10057193 10059582 pi-Fth1.1 +
chr19 44637337 44650762 pi-Hif1an.1 +
chr1 135027714 135036359 pi-Ppp1r15b.1 +
chr2 28677821 28695861 pi-Gtf3c4.1 -
chr1 136651241 136852527 pi-Ppp1r12b.1 -
chr2 154262219 154365092 pi-Cbfa2t2.1 +
chr2 156022393 156135687 pi-Phf20.1 +
chr3 51028854 51055547 pi-Ccrn4l.1 +
chr3 94985683 95021902 pi-Gabpb2.1 -
chr1 158488203 158579750 pi-Abl2.1 +
chr4 45411294 45421633 pi-Mcart1.1 -
chr4 56879897 56960355 pi-D730040F13Rik.1 -
chr4 59818521 59917612 pi-Snx30.1 +
chr4 107847846 107890527 pi-Zyg11a.1 -
chr4 107900359 107973695 pi-Zyg11b.1 -
chr4 132195002 132280676 pi-Eya3.1 +
chr4 134968222 134989706 pi-Rcan3.1 -
chr4 136025678 136110697 pi-Luzp1.1 +
chr1 162933052 162964958 pi-Zbtb37.1 -
chr5 38591490 38611628 pi-Zbtb49.1 -
chr5 67783388 67819359 pi-Bend4.1 -
chr5 114387108 114443767 pi-Ssh1.1 -
chr5 115592990 115608225 pi-Mlec.1 -
chr5 143628624 143656891 pi-Fbxl18.1 -
chr1 172123561 172145541 pi-Uhmk1.1 -
chr6 83312367 83391602 pi-Tet3.1 -
chr6 85419571 85434653 pi-Fbxo41.1 -
chr6 116288039 116359551 pi-March08.1 +
chr6 120786229 120842859 pi-Bcl2l13.1 +
chr7 71031236 71083761 pi-Klf13.1 -
chr7 107068766 107128968 pi-Rnf169.1 -
chr7 139903770 140044311 pi-Fam53b.1 -
chr8 72285224 72298794 pi-Zfp866.1 -
chr8 106872110 106919708 pi-Cmtm4.1 -
chr8 112250549 112261649 pi-Atxn1l.1 -
chr10 41901651 41911816 pi-Foxo3.1 -
chr8 119682164 119739895 pi-Gan.1 +
chr8 125406988 125566154 pi-Ankrd11.1 -
chr9 27148219 27165314 pi-Igsf9b.1 +
chr9 44100521 44113717 pi-Hinfp.1 -
chr9 61761092 61762348 pi-Rplp1.1 -
chr9 106590412 106691503 pi-Rad54l2.1 -
chr9 114416339 114473487 pi-Trim71.1 -
chr9 119311403 119351032 pi-Acvr2b.1 +
chr9 119354082 119373348 pi-Exog.1 +
chr10 82822985 82831579 pi-D10Wsu102e.1 +
chr10 126415753 126437016 pi-Ctdsp2.1 +
chr1 90159688 90174093 pi-Hjurp.1 -
chr11 60591039 60597792 pi-Smcr8.1 +
chr11 69209318 69210176 pi-Lsmd1.1 +
chr11 75345218 75391069 pi-Slc43a2.1 +
chr11 79474214 79511524 pi-Rab11fip4.1 +
chr11 95818479 95868022 pi-Igf2bp1.1 -
chr11 97223641 97259855 pi-Socs7.1 +
chr11 97524530 97546757 pi-Mllt6.1 +
chr1 120355721 120355843 1-qE2.3-2.1 -
chr2 120518324 120540873 2-qE5-4.1 +
chr7 82913927 82926993 7-qD2-40.1 -
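Column1 = chromosome_number
Column2 = start
Column3 = end
Column4 = gene_name
Column5 = direction (+ or -)
1.) I need to extract rows that have the same chromosome number (column 1) and opposite directions (one plus, one minus), and whose start sites (column 2) differ by at most 200 (200 or less).
Here is what I have so far; I can't figure out where my mistake is: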
import csv
import itertools as it

f=open('inputfile.txt', 'r')

def getrecords(f):
    for line in open(f):
        yield line.strip().split()

key=lambda x: x[0]

for i, rec in it.groupby(sorted(getrecords('inputfile.txt'), key=key), key=key):
    for c0, c1 in it.combinations(rec, 2):
        if (c0[4]!= c1[4] and (abs(int(c0[1])-int(c1[1]))) < 200):
            print ("%s\t%s\t%s" % (c0[0], c0[1], c0[3]))
            print("%s\t%s\t%s" % (c1[0], c1[1], c1[3]))
Please note: this code runs but gives no output, even though I am sure there should be some; I expect about 15 unique sequence rows. Expected output:
ChrX start_number1 gene_name1
ChrX start_number1+/-200 gene_name2
ChrY start_number2 gene_name3
ChrY start_number2+/-200 gene_name4
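I will then go through these rows to remove duplicates.
Your sample contains no values that meet the specified condition, so I added one line to inputfile.txt: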
chr1 34870091 34899887 pi-Fam168b.1 +
I copied the first line of inputfile.txt, added 20 to the integers in the second and third columns, and flipped the strand from - to +.
First, you don't need to import csv, since you never use it. You should import groupby, product, and itemgetter, which I explain below.
from itertools import groupby,product
from operator import itemgetter
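As a quick illustration of what these three helpers do, here is a minimal sketch on invented toy data (the `rows` list and the names `get_strand`, `unsorted_keys`, `sorted_keys`, and `pairs` are assumptions made for this example, not part of inputfile.txt):

```python
from itertools import groupby, product
from operator import itemgetter

rows = [('a', '+'), ('b', '-'), ('c', '+')]  # toy data, not from inputfile.txt

get_strand = itemgetter(1)  # behaves like lambda r: r[1]

# groupby only merges *consecutive* items that share a key ...
unsorted_keys = [key for key, _ in groupby(rows, get_strand)]

# ... so sorting by the same key first gives one group per strand.
sorted_keys = [key for key, _ in groupby(sorted(rows, key=get_strand), get_strand)]

# product yields every (x, y) combination, like two nested for loops.
pairs = list(product(['p1', 'p2'], ['m1']))

print(unsorted_keys, sorted_keys, pairs)
```

This is why a sort normally comes before groupby: without it, the same key can show up in several separate groups.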
This block just parses inputfile.txt into a usable data structure (a list of dictionaries), where each record in the file becomes one dictionary element in the sites list.
with open('/home/kevin/inputfile.txt', 'r') as f:  # text mode, so fields are str; with closes the file for us
    sites = []  # list to hold each record as a dictionary
    for row in f:
        row = tuple(row.strip().split())
        d = {'chr': row[0], 'start': row[1], 'stop': row[2], 'gene_name': row[3], 'strand': row[4]}
        sites.append(d)
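To try the parsing step without a file on disk, the same idea can be wrapped in a function and fed from an in-memory buffer. This is only a sketch: `parse_sites` and `sites_demo` are hypothetical names, and unlike the block above it converts the coordinates to int and skips lines with fewer than five fields, which is an assumption about how malformed lines should be handled.

```python
import io

def parse_sites(handle):
    """Turn whitespace-separated BED-like lines into a list of dicts."""
    sites = []
    for line in handle:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or incomplete lines (assumed policy)
        chrom, start, stop, gene_name, strand = fields[:5]
        sites.append({'chr': chrom, 'start': int(start), 'stop': int(stop),
                      'gene_name': gene_name, 'strand': strand})
    return sites

# Two rows taken from the example: the first file line and the added line.
sample = io.StringIO(
    "chr1 34870071 34899867 pi-Fam168b.1 -\n"
    "chr1 34870091 34899887 pi-Fam168b.1 +\n"
)
sites_demo = parse_sites(sample)
print(sites_demo[0])
```

Because any iterable of lines works as `handle`, the same function should also accept a file object opened in text mode.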
I sort by strand first using itemgetter; now, when you groupby strand, the dictionaries can be separated into a list of all plus strands and a list of all minus strands:
sites.sort(key=itemgetter('strand'))  # sort by strand so groupby sees each strand as one run
plus = []
minus = []
for elmt, grp in groupby(sites, itemgetter('strand')):  # sites is our sorted list of dicts
    for item in grp:
        if elmt == '+':
            plus.append(item)
        else:
            minus.append(item)
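For comparison, the same split can be sketched with two list comprehensions, which needs neither sorting nor groupby. The `sites_demo` list below is a hypothetical miniature of the sites list, and `plus_demo` and `minus_demo` are names invented for this example.

```python
# Hypothetical miniature of the sites list built earlier.
sites_demo = [
    {'chr': 'chr1', 'start': '34870071', 'gene_name': 'pi-Fam168b.1', 'strand': '-'},
    {'chr': 'chr1', 'start': '34870091', 'gene_name': 'pi-Fam168b.1', 'strand': '+'},
    {'chr': 'chr11', 'start': '98724946', 'gene_name': 'pi-Wipf2.1', 'strand': '+'},
]

plus_demo = [s for s in sites_demo if s['strand'] == '+']
minus_demo = [s for s in sites_demo if s['strand'] != '+']  # mirrors the else branch

print(len(plus_demo), len(minus_demo))
```

Both approaches keep the records in their original relative order within each strand.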
Now you can use product to iterate over plus and minus; it behaves like a nested for loop, and you compare the start positions:
for p, m in product(plus, minus):
    # use <= 200 instead if a difference of exactly 200 should also count
    if p['chr'] == m['chr'] and abs(int(p['start']) - int(m['start'])) < 200:
        print("%s\t%s\t%s" % (p['chr'], p['start'], p['gene_name']))
        print("%s\t%s\t%s" % (m['chr'], m['start'], m['gene_name']))
This returns:
chr1 34870091 pi-Fam168b.1 #remember I artificially added this one
chr1 34870071 pi-Fam168b.1
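On a larger file, comparing every plus record with every minus record across all chromosomes does a lot of unnecessary work, and the question also wants duplicate rows removed afterwards. The sketch below is one way to handle both, under stated assumptions: `close_opposite_pairs`, `records_demo`, and `unique_rows` are hypothetical names, the row marked as invented is not in inputfile.txt, and the test uses `<= limit` to follow the "200 or less" wording of the question.

```python
from collections import defaultdict
from itertools import product

def close_opposite_pairs(records, limit=200):
    """Yield (plus, minus) pairs on the same chromosome whose
    start positions differ by at most `limit`."""
    by_chrom = defaultdict(lambda: {'+': [], '-': []})
    for chrom, start, stop, name, strand in records:
        if strand in ('+', '-'):
            by_chrom[chrom][strand].append((chrom, int(start), name))
    for strands in by_chrom.values():
        # only records on the same chromosome are ever compared
        for p, m in product(strands['+'], strands['-']):
            if abs(p[1] - m[1]) <= limit:
                yield p, m

records_demo = [
    ('chr1', '34870071', '34899867', 'pi-Fam168b.1', '-'),
    ('chr1', '34870091', '34899887', 'pi-Fam168b.1', '+'),
    ('chr1', '34870150', '34899900', 'hypothetical-gene.1', '+'),  # invented row
    ('chr11', '98724946', '98764609', 'pi-Wipf2.1', '+'),
]

seen = set()
unique_rows = []
for pair in close_opposite_pairs(records_demo):
    for row in pair:
        if row not in seen:  # keep only the first occurrence of each row
            seen.add(row)
            unique_rows.append(row)

for chrom, start, name in unique_rows:
    print("%s\t%d\t%s" % (chrom, start, name))
```

Here the minus record matches two plus records but is printed only once, because the `seen` set filters repeated rows while preserving the order of first appearance.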
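For reference, this type of task can be implemented more elegantly with the Python library pandas. Bedtools (written in C++, I think) is designed specifically for .bed files, which is the format you are working with. HTH!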