How to print lines after a match with Python?

I have a file with several lines (I only show two of them):

UniRef90_A0A0K2VG56 UniRef90_A0A0P5UY87 
UniRef90_A0A095VQ09 UniRef90_A0A0C1UI80 UniRef90_A0A1M4ZSK2

and another file (I only show some lines of it):

>UniRef90_A0A095VQ09 - Cluster: LOW QUALITY PROTEIN: titin
MTTKAPTFTQPLQSVVALEGSAATFEAHISGSPVPEVSWYRDGQVLSAATLPGVQISFSD
GRAKLMIPAVAAGHSGRYTLQATNGSGQATSTAELLVTAETAPPNFSQRLQSTTARQGSQ
VRLDVRVTGIPTPVVKFYRDRAEIQSSPDFQILQEGDLYSLIIAEAYPEDSGTYSVNATN
>UniRef90_A0A0K2VG56 - Cluster: titin isoform X29
MATQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVISTSTLPGVQISFSD
GRAKLMIPAVTKANSGRYSLRATNGSGQATSTAELLVKAETAPPNFVQRLQSMTVRQGSQ
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGELYSLLIVEAYPEDSGTYSVNATN
SVGRATSTAELLVQGEEVVPAKKTKTIVSTAQISKSRETRIEKKIEAHFDARSIATVEMV
IDGAAGQELPHKTPPRIPLKPKSRSPTPPSIAAKAQLARQQSPSPIRHSPSPVRHVRAPT
>UniRef90_A0A0C1UI80 - Cluster: LOW QUALITY PROTEIN: lafev
GRAKLMIPAVTKANSGRYSLRATNGSGQATSTAELLVKAETAPPNFVQRLQSMTVRQGSQ
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGLARQQSPSPIRHSPSPVRHVRAPT
>UniRef90_A0A0P5UY87 - Cluster: titin isoform X4
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGELYSLLIVEAYPEDSGTYSVNATN
SVGRATSTAELLVQGEEVVPAKKTKTIVSTAQISKSRETRIEKKIEAHFDARSIATVEMV
GRAKLMIPAVAAGHSGRYTLQATNGSGQATSTAELLVTAETAPPNFSQRLQSTTARQGSQ
>UniRef90_A0A1M4ZSK2  - Cluster: titin isoform X54
SVGRATSTAELLVQGEEVVPAKKTKTIVSTSTAELLVTAETAPPNFSQRLQSTTARQGSQ
SVGRATSTAELLVQGEEVVPAKKTKTIVSTAQISKSRETRIEKKIEAHFDARSIATVEMV
IDGAAGQELPHKTPPRIPLKPKSRSPTPPSIAAKAQLARQQSPSPIRHSPSPVRHVRAPT

For each line of my first file, I need to match every UniRef90_XXXXXX ID against the UniRef90_XXXXXX IDs in the second file. Once a match is found, I need to retrieve the sequence (the letters ...TNGSGQATS.... are the sequences) that belongs to the corresponding ID.

For example, there are 2 UniRef90_XXXXXX IDs in the first row of the first file, so I would like to get an output like this:

>UniRef90_A0A0K2VG56 - Cluster: titin isoform X29
MATQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVISTSTLPGVQISFSD
GRAKLMIPAVTKANSGRYSLRATNGSGQATSTAELLVKAETAPPNFVQRLQSMTVRQGSQ
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGELYSLLIVEAYPEDSGTYSVNATN
SVGRATSTAELLVQGEEVVPAKKTKTIVSTAQISKSRETRIEKKIEAHFDARSIATVEMV
IDGAAGQELPHKTPPRIPLKPKSRSPTPPSIAAKAQLARQQSPSPIRHSPSPVRHVRAPT   ##first ID of the first line
>UniRef90_A0A0P5UY87 - Cluster: titin isoform X4
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGELYSLLIVEAYPEDSGTYSVNATN   
SVGRATSTAELLVQGEEVVPAKKTKTIVSTAQISKSRETRIEKKIEAHFDARSIATVEMV
GRAKLMIPAVAAGHSGRYTLQATNGSGQATSTAELLVTAETAPPNFSQRLQSTTARQGSQ   ##second ID of the first line

And I need to do that for each row of my first file.

So you seem to need to order the UniRef90_XXXXXX records according to the order of their IDs in the first file.

Here UniRef_ids.txt is your first file, UniRef_data.txt is your second file, and UniRef_data_ordered.txt is the output file.

I noticed each UniRef90_XXXXXX record appears to start with a > and continues, spanning a variable number of lines, until the next > or, I assume, the end of the file.
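To make that layout concrete, here is a small sketch that splits a made-up sample on `>`. The sample text and variable names are illustrative only, and it assumes `>` never appears inside a header or a sequence line.

```python
# Made-up sample in the same layout as the data file.
sample = (
    ">UniRef90_AAA - Cluster: example one\n"
    "MTTKAP\n"
    "GRAKLM\n"
    ">UniRef90_BBB - Cluster: example two\n"
    "VRLQVR\n"
)

chunks = sample.split(">")
# The text before the first ">" is empty, so the first chunk is "".
# Dropping blank chunks leaves one chunk per record.
records = [">" + chunk.rstrip() for chunk in chunks if chunk.strip()]

for record in records:
    # The first line is the header; the rest are sequence lines.
    header, *sequence_lines = record.splitlines()
    print(header, "->", len(sequence_lines), "sequence line(s)")
```

Note that the split produces an empty leading chunk, which is worth filtering out before building a lookup table from the records.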

I have only handled one exception: a UniRef90_XXXXXX ID that appears in your first file but not in your second. In that case the code merely prints a warning to your console (not to your file).
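If a console warning per ID is easy to miss, one option is to collect the missing IDs and report them once at the end. This is only a sketch: `lookup_records`, the tiny `records` dict, and the IDs in it are hypothetical names and data chosen for illustration.

```python
import sys

def lookup_records(wanted_ids, records):
    """Return (found_records, missing_ids), preserving the order of wanted_ids."""
    found, missing = [], []
    for uniref_id in wanted_ids:
        if uniref_id in records:
            found.append(records[uniref_id])
        else:
            missing.append(uniref_id)
    return found, missing

# Hypothetical lookup table with a single record.
records = {"UniRef90_AAA": ">UniRef90_AAA - Cluster: example\nMTTKAP"}
found, missing = lookup_records(["UniRef90_AAA", "UniRef90_ZZZ"], records)

if missing:
    # Send the summary to stderr so it never mixes with redirected output.
    print(f"{len(missing)} ID(s) not in data file: {', '.join(missing)}", file=sys.stderr)
```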

If the rest of your files are formatted differently, this might not work. Similarly, if your files are several gigabytes in size, my approach may not be appropriate, as I read the entire contents of your second file into memory.
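For data files too large to read in one go, one possible variation is to collect the wanted IDs first, then stream the data file line by line and keep only the records that are actually needed. This is a sketch, not a drop-in replacement: `iter_records` and `load_wanted` are hypothetical helper names, and `io.StringIO` stands in for the real files so the example runs on its own.

```python
import io

def iter_records(lines):
    """Yield (id, record_text) pairs from an iterable of lines, one record at a time."""
    current_id, current_lines = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current_id is not None:
                yield current_id, "\n".join(current_lines)
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else None
            current_lines = [line]
        elif current_id is not None and line:
            current_lines.append(line)
    # The last record ends at the end of the input.
    if current_id is not None:
        yield current_id, "\n".join(current_lines)

def load_wanted(data_lines, wanted_ids):
    """Keep only the records whose ID is in wanted_ids."""
    wanted = set(wanted_ids)
    return {rid: rec for rid, rec in iter_records(data_lines) if rid in wanted}

# Stand-ins for the ID file and the data file.
ids_file = io.StringIO("UniRef90_BBB UniRef90_AAA\n")
data_file = io.StringIO(
    ">UniRef90_AAA - x\nMTT\n>UniRef90_BBB - y\nVRL\nGRA\n>UniRef90_CCC - z\nSVG\n"
)

wanted_ids = [uid for line in ids_file for uid in line.split()]
kept = load_wanted(data_file, wanted_ids)
for uid in wanted_ids:
    print(kept[uid])
```

Memory use then scales with the records you asked for rather than with the whole data file, although the wanted records themselves are still held in a dict.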

# We first go through the second file, get all the UniRef90_XXXXXX IDs, and
# put their records (the header line plus the sequence lines) into a dict.
# A record can be accessed like so: uniref_dict["UniRef90_A0A0K2VG56"]
with open("UniRef_data.txt", "rt") as f:
    data = f.read()

uniref_dict = {}
for chunk in data.split(">"):
    # Skip empty chunks, such as the text before the first ">".
    if not chunk.strip():
        continue
    # The ID is the first whitespace-delimited token of the header line.
    uniref_id = chunk.split(None, 1)[0]
    uniref_dict[uniref_id] = f">{chunk.rstrip()}"

# Then we go through the first file, line by line, id by id, and write to
# a new file the corresponding record (again, including the UniRef90_XXXXXX
# header line, as per your output) and append the UniRef90_XXXXXX at the end.
with open("UniRef_ids.txt", "rt") as fin, open("UniRef_data_ordered.txt", "wt") as fout:
    for line in fin:
        # split() with no argument ignores repeated spaces and blank lines.
        for uniref_id in line.split():
            try:
                fout.write(f"{uniref_dict[uniref_id]} ##{uniref_id}\n")
            except KeyError:
                print(f"uniref_id '{uniref_id}' found in id file but not data file. Continuing...")

UniRef_data_ordered.txt:

>UniRef90_A0A0K2VG56 - Cluster: titin isoform X29
MATQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVISTSTLPGVQISFSD
GRAKLMIPAVTKANSGRYSLRATNGSGQATSTAELLVKAETAPPNFVQRLQSMTVRQGSQ
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGELYSLLIVEAYPEDSGTYSVNATN
SVGRATSTAELLVQGEEVVPAKKTKTIVSTAQISKSRETRIEKKIEAHFDARSIATVEMV
IDGAAGQELPHKTPPRIPLKPKSRSPTPPSIAAKAQLARQQSPSPIRHSPSPVRHVRAPT ##UniRef90_A0A0K2VG56
>UniRef90_A0A0P5UY87 - Cluster: titin isoform X4
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGELYSLLIVEAYPEDSGTYSVNATN
SVGRATSTAELLVQGEEVVPAKKTKTIVSTAQISKSRETRIEKKIEAHFDARSIATVEMV
GRAKLMIPAVAAGHSGRYTLQATNGSGQATSTAELLVTAETAPPNFSQRLQSTTARQGSQ ##UniRef90_A0A0P5UY87
>UniRef90_A0A095VQ09 - Cluster: LOW QUALITY PROTEIN: titin
MTTKAPTFTQPLQSVVALEGSAATFEAHISGSPVPEVSWYRDGQVLSAATLPGVQISFSD
GRAKLMIPAVAAGHSGRYTLQATNGSGQATSTAELLVTAETAPPNFSQRLQSTTARQGSQ
VRLDVRVTGIPTPVVKFYRDRAEIQSSPDFQILQEGDLYSLIIAEAYPEDSGTYSVNATN ##UniRef90_A0A095VQ09
>UniRef90_A0A0C1UI80 - Cluster: LOW QUALITY PROTEIN: lafev
GRAKLMIPAVTKANSGRYSLRATNGSGQATSTAELLVKAETAPPNFVQRLQSMTVRQGSQ
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGLARQQSPSPIRHSPSPVRHVRAPT ##UniRef90_A0A0C1UI80
>UniRef90_A0A1M4ZSK2  - Cluster: titin isoform X54
SVGRATSTAELLVQGEEVVPAKKTKTIVSTSTAELLVTAETAPPNFSQRLQSTTARQGSQ
SVGRATSTAELLVQGEEVVPAKKTKTIVSTAQISKSRETRIEKKIEAHFDARSIATVEMV
IDGAAGQELPHKTPPRIPLKPKSRSPTPPSIAAKAQLARQQSPSPIRHSPSPVRHVRAPT ##UniRef90_A0A1M4ZSK2


Is it possible to create separate files for each iteration of the loop? I mean, for each row of the first file, I would like to create a file with the IDs and the corresponding sequences.

Yes, that's possible. We just need to move the code that opens and writes the output file inside the for loop that goes over the rows of the first file, and give each file a unique name.

# We first go through the second file, get all the UniRef90_XXXXXX IDs, and
# put their records (the header line plus the sequence lines) into a dict.
# A record can be accessed like so: uniref_dict["UniRef90_A0A0K2VG56"]
with open("UniRef_data.txt", "rt") as f:
    data = f.read()

uniref_dict = {}
for chunk in data.split(">"):
    # Skip empty chunks, such as the text before the first ">".
    if not chunk.strip():
        continue
    # The ID is the first whitespace-delimited token of the header line.
    uniref_id = chunk.split(None, 1)[0]
    uniref_dict[uniref_id] = f">{chunk.rstrip()}"

# Then we go through the first file, line by line, and write to a new
# file the ids' corresponding records (again, including the
# UniRef90_XXXXXX header line, as per your output).
with open("UniRef_ids.txt", "rt") as fin:
    # Each iteration of this for loop is a new line of UniRef90_XXXXXX ids,
    # so we've moved the file writing code inside of this loop.
    # enumerate gives us a counter, i, that starts at 1 and increments by 1
    # after each iteration. We use this to give each file a unique name.
    for i, line in enumerate(fin, start=1):
        # split() with no argument ignores repeated spaces and trailing whitespace.
        uniref_ids = line.split()
        with open(f"UniRef_data_by_id_row_{i:03}.txt", "wt") as fout:
            for uniref_id in uniref_ids:
                try:
                    fout.write(uniref_dict[uniref_id] + "\n")
                except KeyError:
                    print(f"uniref_id '{uniref_id}' found in id file but not data file. Continuing...")

By the way, this is the code that generates our filenames:

f"UniRef_data_by_id_row_{i:03}.txt"

The f prefix tells Python it's an f-string. Python evaluates the expression inside each {} and inserts the result into the string. Before the : is the value, and after it is the format specifier. In this case, the format specifier 03 zero-pads i to a minimum width of 3, giving me filenames like:

UniRef_data_by_id_row_001.txt
UniRef_data_by_id_row_999.txt

That way, it's very easy to sort the files in your file manager.
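A quick sketch of why the padding helps with plain alphabetical sorting. The row numbers are arbitrary, and some file managers apply natural (numeric-aware) sorting anyway, so treat this as an illustration of string ordering only.

```python
row_numbers = (1, 2, 10, 100)

padded = [f"UniRef_data_by_id_row_{i:03}.txt" for i in row_numbers]
unpadded = [f"UniRef_data_by_id_row_{i}.txt" for i in row_numbers]

# Zero-padded names sort alphabetically in the same order as the numbers.
print(sorted(padded) == padded)      # True

# Without padding, row 2 sorts after rows 10 and 100.
print(sorted(unpadded))
```

Once the row count exceeds 999, a width of 3 is no longer enough to keep this ordering, so the width would need to match the largest row number you expect.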

You can name the files differently. For example, if you don't want underscores, and you want to pad the number with spaces instead of 0s:

f"UniRef Data Ordered by ID - Row {i: >4}.txt"
UniRef Data Ordered by ID - Row    1.txt
UniRef Data Ordered by ID - Row 9999.txt
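The two format specifiers used above can be tried in isolation. In this sketch the value 7 is an arbitrary example, and the built-in `format()` is shown only because it accepts the same specifier text as an f-string.

```python
i = 7

zero_padded = f"{i:03}"    # fill with zeros to a minimum width of 3
space_padded = f"{i: >4}"  # fill with spaces, right-align, minimum width 4

print(repr(zero_padded))   # '007'
print(repr(space_padded))  # '   7'

# format() takes the same specifier as the part after the colon.
print(format(i, "03") == zero_padded)

# The width is a minimum, so longer numbers are not truncated.
print(f"{1234:03}")        # 1234
```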
