How to modify a tsv-file column with Python

I have a GFF3 file (essentially a TSV file with 9 columns) and I'm trying to change the first column and write the result back to the same file.

The GFF3 file looks like this:

## GFF3 file
## replicon1
## replicon2
replicon_1  prokka  gene    0   15  .   @   .   ID=some_gene_1;
replicon_1  prokka  gene    40  61  .   @   .   ID=some_gene_1;
replicon_2  prokka  gene    8   32  .   @   .   ID=some_gene_2;
replicon_2  prokka  gene    70  98  .   @   .   ID=some_gene_2;

I wrote a few lines of code in which I choose the symbol to replace (e.g. "_") and the symbol to insert in its place (e.g. "@"):

import os
import re
import argparse
import pandas as pd

def myfunc() -> argparse.Namespace:
    ap = argparse.ArgumentParser()
    ap.add_argument("-f", "--file", help="path to file")
    ap.add_argument("-i", "--input_word", help="Symbol to delete")
    ap.add_argument("-o", "--output_word", help="Symbol to insert")
    return ap.parse_args()

args = myfunc()
my_file = args.file
in_char = args.input_word
out_char = args.output_word

with open(my_file, 'r+') as f:
    rawfl = f.read()
    # replaces every occurrence in the whole file, not just in the first column
    rawfl = re.sub(in_char, out_char, rawfl)
    f.seek(0)
    f.write(rawfl)

The output looks like this:

## GFF3 file
## replicon1
## replicon2
replicon@1  prokka  gene    0   15  .   @   .   ID=some@gene@1;
replicon@1  prokka  gene    40  61  .   @   .   ID=some@gene@1;
replicon@2  prokka  gene    8   32  .   @   .   ID=some@gene@2;
replicon@2  prokka  gene    70  98  .   @   .   ID=some@gene@2;

As you can see, every "_" has been changed to "@". I tried to modify the script using pandas in order to apply the change only to the first column (seqid, below):

with open(my_file, 'r+') as f:
    genomic_dataframe = pd.read_csv(f, sep="\t", names=['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes'])
    genid = genomic_dataframe.seqid
    genid = str(genid)  # this is used because re.sub expects strings, not dataframe
    genid = re.sub(in_char, out_char, genid)
    f.seek(0)
    f.write(genid)

I do not get the expected result. Instead, something like the seqid column (correctly modified) is written into the file, but it does not replace the original column.

What I'd like to obtain is something like this:

## GFF3 file
## replicon1
## replicon2
replicon@1  prokka  gene    0   15  .   @   .   ID=some_gene_1;
replicon@1  prokka  gene    40  61  .   @   .   ID=some_gene_1;
replicon@2  prokka  gene    8   32  .   @   .   ID=some_gene_2;
replicon@2  prokka  gene    70  98  .   @   .   ID=some_gene_2;

Here the "@" symbol appears only in the first column, while the "_" is kept in the 9th column.

Do you know how to fix this? Thank you all.

You can use re.sub with a pattern anchored at ^ (start of the string) and pass a lambda function as the replacement. For example:

import re

# change only first column:
r = re.compile(r"^(.*?)(?=\s)")

in_char = "_"
out_char = "@"

with open("input_file.txt", "r") as f_in, open("output_file.txt", "w") as f_out:
    for line in map(str.strip, f_in):
        # copy empty lines and lines starting with ## unchanged
        if not line or line.startswith("##"):
            print(line, file=f_out)
            continue

        line = r.sub(lambda g: g.group(1).replace(in_char, out_char), line)
        print(line, file=f_out)

This creates output_file.txt:

## GFF3 file
## replicon1
## replicon2
replicon@1  prokka  gene    0   15  .   @   .   ID=some_gene_1;
replicon@1  prokka  gene    40  61  .   @   .   ID=some_gene_1;
replicon@2  prokka  gene    8   32  .   @   .   ID=some_gene_2;
replicon@2  prokka  gene    70  98  .   @   .   ID=some_gene_2;
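The question asks for the original file to be overwritten, whereas the code above writes to a second file. One possible way to get there is to write to a temporary file and then swap it in. This is only a minimal sketch, assuming the columns are separated by whitespace as in the sample; the function name rewrite_first_column is hypothetical and not part of the original answer:

```python
import os
import re
import tempfile

# Same idea as above: match only the first column (text before the first whitespace).
FIRST_COLUMN = re.compile(r"^(.*?)(?=\s)")


def rewrite_first_column(path, in_char, out_char):
    """Replace in_char with out_char in the first column of every data line, in place."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then swap it in,
    # so the original is only replaced once the new content is complete.
    with open(path, "r") as f_in, tempfile.NamedTemporaryFile(
        "w", dir=directory, delete=False
    ) as f_out:
        for line in f_in:
            line = line.rstrip("\n")
            # Empty lines and "##" comment lines are copied unchanged.
            if line and not line.startswith("##"):
                line = FIRST_COLUMN.sub(
                    lambda m: m.group(1).replace(in_char, out_char), line
                )
            print(line, file=f_out)
    os.replace(f_out.name, path)
```

Calling rewrite_first_column(my_file, in_char, out_char) with the variables from the question would then update the file itself. Unlike re.sub(in_char, out_char, ...), str.replace treats in_char as literal text, so a symbol such as "." is not interpreted as a regular expression.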

If you only want to replace the first occurrence of _ with @ on each line, you can do it this way, without loading your file as a dataframe and without any third-party library such as pandas. This works for the sample because the first _ on every data line is in the first column.

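with open('f') as f:
    lines = [line.rstrip() for line in f]

for i, line in enumerate(lines):
    # Leave comments and empty lines untouched
    if not line or line.startswith('#'):
        continue
    # Replace only the first "_" and store the result back in the list
    lines[i] = line.replace('_', '@', 1)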

After this, lines contains:

## GFF3 file
## replicon1
## replicon2
replicon@1  prokka  gene    0   15  .   @   .   ID=some_gene_1;
replicon@1  prokka  gene    40  61  .   @   .   ID=some_gene_1;
replicon@2  prokka  gene    8   32  .   @   .   ID=some_gene_2;
replicon@2  prokka  gene    70  98  .   @   .   ID=some_gene_2;
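Note that a single-occurrence replace changes only the first underscore on a line, so a seqid such as my_replicon_1 would be converted only partly. A sketch that restricts the replacement to the first column instead, assuming the real file is tab-separated as a GFF3/TSV file normally is; the helper name replace_in_first_column and the sample seqid are hypothetical:

```python
def replace_in_first_column(line, old="_", new="@"):
    """Replace every `old` with `new` in the first tab-separated column only."""
    # Comment lines and empty lines are returned unchanged.
    if not line or line.startswith("#"):
        return line
    # partition splits at the first tab: (first column, "\t", rest of the line).
    first, sep, rest = line.partition("\t")
    return first.replace(old, new) + sep + rest


sample = [
    "## GFF3 file",
    "my_replicon_1\tprokka\tgene\t0\t15\t.\t@\t.\tID=some_gene_1;",
]
converted = [replace_in_first_column(line) for line in sample]
print(converted[1])
```

Because the line is split at the first tab, underscores in the attributes column are never touched, however many the seqid contains.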
