Separating a file by lines in Python

Question

I have a .fastq file (cannot use Biopython) that consists of multiple samples on different lines. The file contents look like this:

@sample1
ACGTC.....
+
IIIIDDDDDFF
@sample2
AGCGC....
+
IIIIIDFDFD
.
.
.
@sampleX
ACATAG
+
IIIIIDDDFFF

I want to take the file, separate out each individual sample (i.e. lines 1-4, 5-8 and so on until the end of the file), and write each one to a separate file (i.e. sample1.fastq contains the contents of sample 1, lines 1-4, and so on). Is this doable using loops in Python?

Answer 1

You can use defaultdict and a regex for this:

import re
from collections import defaultdict

# Get file contents
with open("test.fastq", "r") as f:
    content = f.read()

samples = defaultdict(list) # Make defaultdict of empty lists
identifier = ""

# Iterate through every line in file
for line in content.splitlines():
    # Find lines which start with @ (note: a quality line can also start with @)
    if re.match(r"^@.*", line):
        # Set identifier (header without the leading "@") for the following lines
        identifier = line[1:].strip()
    else:
        # Add the line to its identifier
        samples[identifier].append(line)

Now all you have to do is save the contents of this defaultdict into multiple files:

# Loop through all samples (and their contents)
for sample_name, sample_items in samples.items():
    # Create new file with the name of its sample_name.fastq
    # (You might want to change the naming)
    with open(f"{sample_name}.fastq", "w") as f:
        # Write each element of the sample_items to new line
        f.write("\n".join(sample_items))

It might be helpful to also include @sample_name at the beginning of each file (first line), but I'm not sure you want that, so I haven't added it.
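If you do want the header line kept, one option is to read the file four lines at a time and write each record out unchanged. This is only a sketch, not part of the answer above: the function name split_fastq and its out_dir parameter are made up for illustration, and it assumes every record is exactly four lines (no wrapped sequences) and that each header has a name after the @.

```python
from itertools import islice
from pathlib import Path


def split_fastq(path, out_dir="."):
    """Write each 4-line FASTQ record to <out_dir>/<sample name>.fastq.

    Hypothetical helper: assumes strict 4-line records.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(path) as handle:
        while True:
            # Take the next four lines: header, sequence, "+", quality
            record = list(islice(handle, 4))
            if not record or not record[0].strip():
                break  # end of file, or a trailing blank line
            # Drop only the leading "@" and use the first word as the name
            name = record[0].strip()[1:].split()[0]
            out_path = out_dir / f"{name}.fastq"
            with open(out_path, "w") as out:
                out.writelines(record)  # header line is kept as-is
            written.append(out_path)
    return written
```

Because records are taken by position rather than by looking for @, a quality line that happens to start with @ is not mistaken for a header. Calling split_fastq("test.fastq", "out") would create out/sample1.fastq, out/sample2.fastq and so on.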

Note that you can adjust the regex to match only @sample[number] instead of every line starting with @. If you want that, use re.match(r"^@sample\d+", line) instead.
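To see the difference between the two patterns, here is a small sketch. The three lines in the list are made-up examples: a header, a quality string that happens to start with @, and a sequence.

```python
import re

# Hypothetical lines: a header, a quality string starting with "@", a sequence
lines = ["@sample12", "@IIIIDDDFF", "ACGTC"]

any_at = re.compile(r"^@.*")              # any line starting with "@"
sample_only = re.compile(r"^@sample\d+")  # only "@sample" followed by digits

matches_any = [line for line in lines if any_at.match(line)]
matches_sample = [line for line in lines if sample_only.match(line)]

print(matches_any)     # ['@sample12', '@IIIIDDDFF']
print(matches_sample)  # ['@sample12']
```

The stricter pattern only helps if your headers really do follow the @sample[number] naming shown in the question.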
