Python: Read text file and split file into list variables, with each variable having 4 lines each

I have a text file (a FASTQ file). The file is in the following format:

1st line - ID
2nd Line - Sequence
3rd Line - something
4th Line - something else.

These four lines then repeat.

E.g.:

1  @M9890393393
2 ATCTGTAAAA
3 +
4 FG%@ATAAAA
5  @M9890393394
6 ATGTCTATCC
7 +
8 AA%$$983089

What I am trying to do is split this file so that I can read it four lines at a time. Can I make a list in which each element contains four lines? In the example above, I would get a list with two elements.

Using a generator, you can define a lazy reader that yields a list of 4 values each time.

You can then either exhaust the generator or iterate over it lazily, as shown below.

import csv
from io import StringIO

mystr = StringIO("""1  @M9890393393
2 ATCTGTAAAA
3 +
4 FG%@ATAAAA
5  @M9890393394
6 ATGTCTATCC
7 +
8 AA%$$983089
""")

def gen():
    # replace mystr with open('file.csv', 'r')
    with mystr as fin:
        reader = csv.reader(fin, delimiter=' ',  skipinitialspace=True)
        res = []
        for line in reader:
            res.append(line[1])
            if len(res) == 4:
                yield res
                res = []

Exhausting the generator:

lines = list(gen())

print(lines)

[['@M9890393393', 'ATCTGTAAAA', '+', 'FG%@ATAAAA'],
 ['@M9890393394', 'ATGTCTATCC', '+', 'AA%$$983089']]

Iterating the generator:

for line in gen():
    print(line)

['@M9890393393', 'ATCTGTAAAA', '+', 'FG%@ATAAAA']
['@M9890393394', 'ATGTCTATCC', '+', 'AA%$$983089']

Read all the lines into a list of individual lines, then use a list comprehension to group them into chunks of four:

with open('your_file') as f:
    lines = f.read().strip().split('\n')

four_lines = [lines[i:i+4] for i in range(0,len(lines),4)]

which, with your example, gives four_lines as:

[
  [
    "1  @M9890393393",
    "2 ATCTGTAAAA",
    "3 +",
    "4 FG%@ATAAAA"
  ],
  [
    "5  @M9890393394",
    "6 ATGTCTATCC",
    "7 +",
    "8 AA%$$983089"
  ]
]

If you just want to chunk it up into groups of 4, then you can use:

In []:
with open('your_file') as f:
    result = list(zip(*[map(str.strip, f)]*4))   # Assumes Py3+; use iter(map(...)) in Py2
result

Out[]:
[('@M9890393393', 'ATCTGTAAAA', '+', 'FG%@ATAAAA'),
 ('@M9890393394', 'ATGTCTATCC', '+', 'AA%$$983089')]
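The idiom above works because the list holds four references to the same iterator, so each tuple built by zip() consumes four consecutive lines. A minimal sketch of that behaviour, using an in-memory list instead of a file (the sample data, without line numbers, and the names `sample` and `records` are assumptions for illustration):

```python
# Sample lines standing in for a file (illustrative data, no line numbers).
sample = ["@M9890393393", "ATCTGTAAAA", "+", "FG%@ATAAAA",
          "@M9890393394", "ATGTCTATCC", "+", "AA%$$983089",
          "@incomplete"]  # a trailing partial record

it = iter(sample)
# [it] * 4 is four references to the SAME iterator, so each tuple
# produced by zip() pulls four consecutive items from it.
records = list(zip(*[it] * 4))

print(records)
```

Note that zip() stops at the shortest input, so a trailing group with fewer than four lines (here "@incomplete") is silently dropped. If that matters for your data, itertools.zip_longest is one way to keep it.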

Creating a separate variable for each of these generally doesn't make much sense, but a dict may be useful if the first line contains an ID you want to use:

In []:
with open('your_file') as f:
    result = {head: tail for head, *tail in zip(*[map(str.strip, f)]*4)}
result
Out[]:
{'@M9890393393': ['ATCTGTAAAA', '+', 'FG%@ATAAAA'],
 '@M9890393394': ['ATGTCTATCC', '+', 'AA%$$983089']}

Sorry, I assumed the line numbers were added for the example rather than being part of the data set. You can replace the zip() call with the one below to remove the numbers (borrowed from @jpp's answer):

import csv
from operator import itemgetter

zip(*[map(itemgetter(1), csv.reader(f, delimiter=' ', skipinitialspace=True))]*4)
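Put together with the dict comprehension, this might look like the sketch below. It uses a StringIO object in place of the real file so that it runs on its own; the variable names `values` and `result` are only illustrative.

```python
import csv
from io import StringIO
from operator import itemgetter

# Stand-in for open('your_file'); the data includes the leading line numbers.
f = StringIO(
    "1  @M9890393393\n2 ATCTGTAAAA\n3 +\n4 FG%@ATAAAA\n"
    "5  @M9890393394\n6 ATGTCTATCC\n7 +\n8 AA%$$983089\n"
)

# Split each line on spaces and keep only the second field (drops the number).
reader = csv.reader(f, delimiter=' ', skipinitialspace=True)
values = map(itemgetter(1), reader)

# Group four values at a time; the first becomes the key, the rest the value.
result = {head: tail for head, *tail in zip(*[values] * 4)}
print(result)
```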

The FASTQ format is easy to parse: check for "@" at the beginning of a line. That is your sequence ID. You can then simply append the next 3 lines and start again. One rare problematic case occurs if the quality-score line also starts with "@". But even that case is easy to spot, since the quality-score line always comes after a "+" line.
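A rough sketch of that approach follows. It assumes strict four-line records with no leading line numbers; the function name `read_fastq` and the sample data are made up for illustration. Because the lines are read by position, a quality line that happens to start with "@" is never mistaken for a header.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (seq_id, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        # The next three lines belong to the same record.
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        yield header[1:], sequence, quality

# Illustrative data; the first quality line deliberately starts with "@".
sample = StringIO(
    "@M9890393393\nATCTGTAAAA\n+\n@G%@ATAAAA\n"
    "@M9890393394\nATGTCTATCC\n+\nAA%$$983089\n"
)
records = list(read_fastq(sample))
print(records)
```

For real data, replace the StringIO object with open('your_file').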

You can use the call below to read in your file.

with open('your_file') as f:
    lines = f.readlines()

Once you have read in your file, you can use a nested loop to complete the task.
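One possible shape for that nested loop is sketched below. The list `lines` here is sample data standing in for the result of readlines(), and the names `groups`, `start` and `offset` are illustrative only.

```python
# Sample data in the form readlines() returns it (trailing newlines included).
lines = [
    "1  @M9890393393\n", "2 ATCTGTAAAA\n", "3 +\n", "4 FG%@ATAAAA\n",
    "5  @M9890393394\n", "6 ATGTCTATCC\n", "7 +\n", "8 AA%$$983089\n",
]

groups = []
# Outer loop: step through the file four lines at a time.
for start in range(0, len(lines), 4):
    group = []
    # Inner loop: collect up to four lines for this record.
    for offset in range(4):
        if start + offset < len(lines):
            group.append(lines[start + offset].strip())
    groups.append(group)

print(groups)
```

Unlike the zip()-based approach, this keeps a final group even when it has fewer than four lines.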
