Python: Read text file and split file into list variables, with each variable having 4 lines each
I have a text file (a FASTQ file). The file is in the following format:
1st line - ID
2nd Line - Sequence
3rd Line - something
4th Line - something else.
These 4 lines then repeat for every record.
For example:
1 @M9890393393
2 ATCTGTAAAA
3 +
4 FG%@ATAAAA
5 @M9890393394
6 ATGTCTATCC
7 +
8 AA%$$983089
What I am trying to do is split this file so that I can read it 4 lines at a time. Can I make a list, with each variable containing 4 lines? In the example above, I would have a list with 2 variables.
Using a generator, you can define a lazy reader that yields a list of 4 values each time. You can then either exhaust the generator or iterate over it lazily, as shown below.
import csv
from io import StringIO

# Sample data standing in for the file; the leading numbers are part of the data.
data = """1 @M9890393393
2 ATCTGTAAAA
3 +
4 FG%@ATAAAA
5 @M9890393394
6 ATGTCTATCC
7 +
8 AA%$$983089
"""

def gen():
    # replace StringIO(data) with open('file.csv', 'r')
    with StringIO(data) as fin:
        reader = csv.reader(fin, delimiter=' ', skipinitialspace=True)
        res = []
        for line in reader:
            res.append(line[1])  # keep the second field, dropping the line number
            if len(res) == 4:
                yield res
                res = []
Exhausting the generator:
lines = list(gen())
print(lines)
[['@M9890393393', 'ATCTGTAAAA', '+', 'FG%@ATAAAA'],
['@M9890393394', 'ATGTCTATCC', '+', 'AA%$$983089']]
Iterating the generator:
for line in gen():
    print(line)
['@M9890393393', 'ATCTGTAAAA', '+', 'FG%@ATAAAA']
['@M9890393394', 'ATGTCTATCC', '+', 'AA%$$983089']
Read all the lines into a list of individual lines, then use a list comprehension to group them into chunks of four lines:
with open('your_file') as f:
    lines = f.read().strip().split('\n')
four_lines = [lines[i:i+4] for i in range(0, len(lines), 4)]
which, with your example, gives four_lines as:
[
    [
        "1 @M9890393393",
        "2 ATCTGTAAAA",
        "3 +",
        "4 FG%@ATAAAA"
    ],
    [
        "5 @M9890393394",
        "6 ATGTCTATCC",
        "7 +",
        "8 AA%$$983089"
    ]
]
If you just want to chunk it up into 4's then you can use:
In []:
with open('your_file') as f:
    result = list(zip(*[map(str.strip, f)]*4))  # Assumes Py3+; use iter(map(...)) in Py2
result
Out[]:
[('@M9890393393', 'ATCTGTAAAA', '+', 'FG%@ATAAAA'),
('@M9890393394', 'ATGTCTATCC', '+', 'AA%$$983089')]
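The zip(*[iterator]*4) idiom works because the list holds four references to the same iterator, so each output tuple consumes four consecutive lines. A minimal sketch of that mechanism, assuming an in-memory sample in place of the file; the names `sample` and `chunks_of_four` are illustrative and not part of the answer:

```python
from io import StringIO

# Hypothetical in-memory stand-in for open('your_file'); line numbers omitted.
sample = StringIO("@M9890393393\nATCTGTAAAA\n+\nFG%@ATAAAA\n"
                  "@M9890393394\nATGTCTATCC\n+\nAA%$$983089\n")

def chunks_of_four(lines):
    """Group an iterable of lines into tuples of 4 (illustrative helper)."""
    it = iter(lines)  # one shared iterator
    # zip pulls from the same iterator four times per tuple,
    # so each tuple holds four consecutive items.
    return list(zip(it, it, it, it))

records = chunks_of_four(line.strip() for line in sample)
print(records)
```

Note that zip stops as soon as the iterator runs out, so a trailing group of fewer than 4 lines is dropped silently.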
Creating a separate variable for each of these generally doesn't make much sense, but a dict may be useful if the first line contains an ID you want to use:
In []:
with open('your_file') as f:
    result = {head: tail for head, *tail in zip(*[map(str.strip, f)]*4)}
result
Out[]:
{'@M9890393393': ['ATCTGTAAAA', '+', 'FG%@ATAAAA'],
'@M9890393394': ['ATGTCTATCC', '+', 'AA%$$983089']}
Sorry, I assumed the line numbers were added for the example rather than being part of the data set. You can replace the zip() call with the one below to remove the numbers (borrowed from @jpp's answer):
import csv
from operator import itemgetter
zip(*[map(itemgetter(1), csv.reader(f, delimiter=' ', skipinitialspace=True))]*4)
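Put together with the numbered sample from the question, this might look like the sketch below. It assumes an in-memory stand-in for the file; `numbered` and `fields` are illustrative names.

```python
import csv
from io import StringIO
from operator import itemgetter

# Hypothetical in-memory stand-in for the numbered file from the question.
numbered = StringIO("1 @M9890393393\n2 ATCTGTAAAA\n3 +\n4 FG%@ATAAAA\n"
                    "5 @M9890393394\n6 ATGTCTATCC\n7 +\n8 AA%$$983089\n")

reader = csv.reader(numbered, delimiter=' ', skipinitialspace=True)
# itemgetter(1) keeps the second field of each row, dropping the line number.
fields = map(itemgetter(1), reader)
# Four references to the same iterator: each tuple is one 4-line record.
result = {head: tail for head, *tail in zip(*[fields] * 4)}
print(result)
```

One caveat: quality strings in real FASTQ data can contain the `"` character, which csv.reader treats as a quote character by default. Passing quoting=csv.QUOTE_NONE is probably the safer choice there.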
The FASTQ format is easy to parse: you can start by checking for "@" at the beginning of a line. That is your sequence ID. You can then simply append the next 3 lines and start again. One "rare" problematic case may occur if the quality-score line also starts with "@". But even that case is easy to spot, since the quality-score line always comes after a "+" line.
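A minimal sketch of that approach, assuming unwrapped records of exactly 4 lines each and no leading line numbers; `read_fastq` is a hypothetical helper name. Because the lines after the ID are read by position, a quality line that starts with "@" is never mistaken for a new record.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (seq_id, sequence, separator, quality) tuples, one per record.

    Illustrative helper; assumes each record is exactly 4 unwrapped lines.
    """
    while True:
        header = handle.readline()
        if not header:                 # end of file
            return
        header = header.rstrip('\n')
        if not header:                 # skip blank lines between records
            continue
        if not header.startswith('@'):
            raise ValueError(f"expected '@' at start of record, got {header!r}")
        # The next 3 lines belong to this record, whatever they start with.
        sequence = handle.readline().rstrip('\n')
        separator = handle.readline().rstrip('\n')
        quality = handle.readline().rstrip('\n')
        if not separator.startswith('+'):
            raise ValueError(f"expected '+' line, got {separator!r}")
        yield header, sequence, separator, quality

# Sample in which the first quality line happens to start with '@'.
sample = ("@M9890393393\nATCTGTAAAA\n+\n@G%@ATAAAA\n"
          "@M9890393394\nATGTCTATCC\n+\nAA%$$983089\n")
records = list(read_fastq(StringIO(sample)))
print(records)
```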
You can use the readlines() method below to read in your file.
with open('your_file') as file:
    lines = file.readlines()  # a list with one string per line
Once you have read in your file, you can use a nested loop to complete the task.
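The answer does not spell out the nested loop; one possible version is sketched below, where `group_lines` is a hypothetical helper name and the hard-coded list stands in for the result of readlines().

```python
def group_lines(lines, size=4):
    """Group a list of lines into lists of `size` using nested loops."""
    groups = []
    # Outer loop: step through the start index of each group.
    for start in range(0, len(lines), size):
        group = []
        # Inner loop: collect the lines that belong to this group.
        for offset in range(size):
            if start + offset < len(lines):
                group.append(lines[start + offset].strip())
        groups.append(group)
    return groups

# Stand-in for the list returned by readlines() (line numbers omitted).
lines = ["@M9890393393\n", "ATCTGTAAAA\n", "+\n", "FG%@ATAAAA\n",
         "@M9890393394\n", "ATGTCTATCC\n", "+\n", "AA%$$983089\n"]
groups = group_lines(lines)
print(groups)
```

Unlike the zip-based approach, this version keeps a trailing group that has fewer than 4 lines.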