How to remove chr from a VCF file using bash

I need to remove 'chr' from my VCF file. This is what the file looks like:

#CHROM  POS  
chr1   10570
chr1   10574
chr1   10654

I want the following output:

#CHROM  POS  
   1   10570
   1   10574
   1   10654

I have tried several approaches, such as the following:

awk '{gsub(/^chr/,""); print}' your.vcf > no_chr.vcf
sed 's/^chr//'
sed 's:chr::g'
awk '{gsub(/\chr/, "")}1'
perl -pe  's/^chr//g'
sed '/^##/! s/chr//'

but none of them work. Any suggestions? Thank you!

For beginners, it is much better to use dedicated tools such as bcftools rather than generic Unix tools, because it is easy to end up corrupting your file.

echo "chr1 1" >> rename_chrs.txt
bcftools annotate --rename-chrs rename_chrs.txt in.vcf > out.vcf
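
The mapping file above only covers chr1. A minimal sketch for building a fuller mapping, assuming your contigs are named chr1 to chr22, chrX, chrY and chrM (that list is an assumption; adjust it to whatever your VCF header declares, and note that some references name the mitochondrial contig MT rather than M):

```shell
# Write one "old_name new_name" pair per line, the format --rename-chrs reads.
# The chromosome list below is an assumption; adapt it to your reference genome.
for c in $(seq 1 22) X Y M; do
    printf 'chr%s %s\n' "$c" "$c"
done > rename_chrs.txt

# Inspect the first few pairs before running bcftools.
head -n 3 rename_chrs.txt
```

The resulting rename_chrs.txt can be passed to bcftools annotate --rename-chrs in the same way as the one-line file.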

Replace chr with 3 spaces, which keeps the column alignment shown in the desired output:

sed 's/^chr/   /' your.vcf > no_chr.vcf

Use this Perl one-liner:

perl -i.bak -pe  's/^chr//' your.vcf

And if you want to remove every chr anywhere in the line:

perl -i.bak -pe  's/chr//g' your.vcf

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending the extension .bak to its name. If you want to skip writing a backup file, just use -i and omit the extension.

s/^chr// : Replace chr at the beginning of the string (here, the line) with an empty string. There is no need to use the g modifier (match the pattern repeatedly), since there is at most one replacement per line.


Complete example with input and output:

Create test input:

cat > your.vcf <<EOF
#CHROM  POS  
chr1   10570
chr1   10574
chr1   10654
EOF

Confirm using cat and hexdump that there are no special characters:

cat your.vcf

Prints:

#CHROM  POS  
chr1   10570
chr1   10574
chr1   10654

hexdump -C your.vcf

Prints:

00000000  23 43 48 52 4f 4d 20 20  50 4f 53 20 20 0a 63 68  |#CHROM  POS  .ch|
00000010  72 31 20 20 20 31 30 35  37 30 0a 63 68 72 31 20  |r1   10570.chr1 |
00000020  20 20 31 30 35 37 34 0a  63 68 72 31 20 20 20 31  |  10574.chr1   1|
00000030  30 36 35 34 0a                                    |0654.|
00000035

Remove chr:

perl -i.bak -pe  's/^chr//' your.vcf

Check the file:

cat your.vcf

Prints:

#CHROM  POS  
1   10570
1   10574
1   10654

Using sed:

$ sed -E '/^#/! {:a;s/[a-z]([0-9])?/ \1/;ta}' input_file
#CHROM  POS
   1   10570
   1   10574
   1   10654
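
Real VCF files usually also carry ##contig header lines that name each chromosome, and those names should stay consistent with the first column. A minimal sketch, assuming header lines of the form ##contig=<ID=chr1,...> (the file names and the contig length here are hypothetical test data), that strips chr in both places:

```shell
# Hypothetical test file with one contig header line and one record.
printf '##contig=<ID=chr1,length=248956422>\n#CHROM\tPOS\nchr1\t10570\n' > with_header.vcf

# First expression: strip a leading "chr" from data lines
# (header lines start with "#", so they never match ^chr).
# Second expression: strip "chr" from contig IDs in the header.
sed -E -e 's/^chr//' -e 's/^(##contig=<ID=)chr/\1/' with_header.vcf > with_header_no_chr.vcf

cat with_header_no_chr.vcf
```

Unlike the alignment-preserving command above, this variant leaves every other character on the line untouched, so lowercase letters in later columns are not affected.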

When editing VCF files with awk, I've found it easier to specify the column rather than run a regex over the whole line, since the first 8 columns of VCF files are fixed (#CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO).

Here's a solution with awk that substitutes "chr" with "" in column 1. (I used sub() here rather than gsub() since there's only one instance of "chr" to replace on each line.)

awk '{ sub("chr", "", $1); print }' your.vcf > no_chr.vcf

Note that this code can change your delimiter. By default, awk splits input fields on whitespace and joins output fields with a single space, so any line whose first field is modified is rebuilt with single spaces between fields.

Most VCF files I've worked with are tab-delimited. To use tab as the delimiter for both input and output, you need to set the input field separator (FS) and output field separator (OFS) at the beginning of your code.

Here's the same solution using tab as the field separator:

awk 'BEGIN { FS = OFS = "\t" } { sub("chr", "", $1); print }' your.vcf > no_chr.vcf
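
To check that the tabs survive the edit, here is a small sketch. The names sample.vcf and sample_no_chr.vcf are hypothetical, and the pattern is anchored with ^ (a small variation on the command above) so that only a leading chr is removed:

```shell
# Build a small tab-delimited test file (hypothetical name and contents).
printf '#CHROM\tPOS\nchr1\t10570\nchr1\t10574\n' > sample.vcf

# Strip a leading "chr" from column 1 only, using tab for input and output fields.
awk 'BEGIN { FS = OFS = "\t" } { sub(/^chr/, "", $1); print }' sample.vcf > sample_no_chr.vcf

# od -c prints tabs as \t, so the delimiter can be confirmed by eye.
od -c sample_no_chr.vcf
```

If the od output shows single spaces where \t was expected, the separators were not set as intended.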
