How to remove chr from a VCF file using bash

Question

I need to remove 'chr' from my VCF file. This is what the VCF file looks like:

#CHROM  POS  
chr1   10570
chr1   10574
chr1   10654

I want the following:

#CHROM  POS  
   1   10570
   1   10574
   1   10654

I have tried several ways like the following ones:

awk '{gsub(/^chr/,""); print}' your.vcf > no_chr.vcf
sed 's/^chr//'
sed 's:chr::g'
awk '{gsub(/\chr/, "")}1'
perl -pe  's/^chr//g'
sed '/^##/! s/chr//'

but they don't work... any suggestion? Thank you!

Answer 1

For beginners it is much better to use dedicated tools rather than unix tools. It's easy to end up messing up your file.

echo "chr1 1" >> rename_chrs.txt
bcftools annotate --rename-chrs rename_chrs.txt in.vcf > out.vcf
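
The mapping file above covers only chr1. A minimal sketch for building a fuller map follows. It assumes human chromosome names chr1 to chr22, chrX, chrY and chrM, and that the new names are the same labels without the prefix; both are assumptions, so adjust the list to match your reference. The bcftools call is guarded so the snippet still runs where bcftools or in.vcf is missing.

```shell
# Build a two-column "old new" map for bcftools annotate --rename-chrs.
# The chromosome list is an assumption (human autosomes plus X, Y, M).
map=rename_chrs.txt
: > "$map"                      # start from an empty file
for c in $(seq 1 22) X Y M; do
    echo "chr$c $c" >> "$map"
done

# Run bcftools only if it is installed and the input file exists.
if command -v bcftools >/dev/null 2>&1 && [ -f in.vcf ]; then
    bcftools annotate --rename-chrs "$map" in.vcf > out.vcf
fi
```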

Answer 2

Replace it with 3 spaces.

sed 's/^chr/   /' your.vcf > no_chr.vcf

Answer 3

Use this Perl one-liner:

perl -i.bak -pe  's/^chr//' your.vcf

And if you want to remove all chr anywhere in the line:

perl -i.bak -pe  's/chr//g' your.vcf

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak . If you want to skip writing a backup file, just use -i and skip the extension.

s/^chr// : Replace chr at the beginning of the string (here, the line) with an empty string. There is no need to use the g modifier (match the pattern repeatedly), since there is only one replacement per line.

Complete example with input and output:

Create test input:

cat > your.vcf <<EOF
#CHROM  POS  
chr1   10570
chr1   10574
chr1   10654
EOF

Confirm using cat and hexdump that there are no special characters:

cat your.vcf

Prints:

#CHROM  POS  
chr1   10570
chr1   10574
chr1   10654

hexdump -C your.vcf

Prints:

00000000  23 43 48 52 4f 4d 20 20  50 4f 53 20 20 0a 63 68  |#CHROM  POS  .ch|
00000010  72 31 20 20 20 31 30 35  37 30 0a 63 68 72 31 20  |r1   10570.chr1 |
00000020  20 20 31 30 35 37 34 0a  63 68 72 31 20 20 20 31  |  10574.chr1   1|
00000030  30 36 35 34 0a                                    |0654.|
00000035

Remove chr:

perl -i.bak -pe  's/^chr//' your.vcf

Check the file:

cat your.vcf

Prints:

#CHROM  POS  
1   10570
1   10574
1   10654

Answer 4

Using sed

$ sed -E '/^#/! {:a;s/[a-z]([0-9])?/ \1/;ta}' input_file
#CHROM  POS
   1   10570
   1   10574
   1   10654
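
The sed loop above replaces each lowercase letter on a non-header line with a space (keeping a digit that directly follows it), so it is not limited to the leading chr. The sketch below compares it with an anchored substitution that pads with three spaces. The rs123 ID column is a made-up value for illustration, and the loop is split across -e options here only so that it also parses on non-GNU sed.

```shell
printf '#CHROM\tPOS\tID\nchr1\t10570\trs123\n' > ids.vcf

# Loop form: every lowercase letter becomes a space, so "rs123" is hit too.
sed -E -e '/^#/!{' -e ':a' -e 's/[a-z]([0-9])?/ \1/' -e 'ta' -e '}' ids.vcf > loop.vcf

# Anchored form: only a leading "chr" is replaced by three spaces.
sed -e '/^#/!s/^chr/   /' ids.vcf > padded.vcf
```

For a file with only #CHROM and POS columns, as in the question, both forms give the same result; the difference shows up once later columns contain lowercase letters.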

Answer 5

When editing VCF files with awk I've found it easier to specify the column rather than using regex since the first 8 columns of VCF files are fixed (#CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO).

Here's a solution with awk that substitutes "chr" with "" in column 1. (I used sub() here rather than gsub() since there's only one instance of "chr" to replace for each line.)

awk '{ sub("chr", "", $1); print }' your.vcf > no_chr.vcf

Note that this code can change your delimiter. By default awk uses whitespace as the input field separator and a single space as the output field separator.
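
A small demonstration of that delimiter change, using a made-up tab-separated record:

```shell
# One tab-separated record; after $1 is modified, awk rebuilds the line
# with its default output field separator, a single space.
printf 'chr1\t10570\t.\n' | awk '{ sub("chr", "", $1); print }' > default_ofs.txt

# Show the result, then count the tabs left in it
# (tr -cd deletes everything except tabs).
cat default_ofs.txt
tr -cd '\t' < default_ofs.txt | wc -c
```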

Most VCF files I've worked with are tab-delimited. In order to use tab as the delimiter for both input and output, you need to specify the input field separator ( FS ) and output field separator ( OFS ) at the beginning of your code.

Here's the same solution using tab as the field separator:

awk 'BEGIN { FS = OFS = "\t" } { sub("chr", "", $1); print }' your.vcf > no_chr.vcf
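
One thing to keep in mind: sub("chr", "", $1) removes the first chr found anywhere in column 1, and on a ## header line without tabs column 1 is the whole line. A possible variant, as a sketch, anchors the pattern with ^ and handles header lines explicitly. Rewriting the ID in ##contig lines is an assumption about what is wanted, and the sample lines (including the chr_ref.fa path) are made up for illustration.

```shell
printf '##reference=file:///data/chr_ref.fa\n##contig=<ID=chr1>\n#CHROM\tPOS\nchr1\t10570\n' > hdr.vcf

awk 'BEGIN { FS = OFS = "\t" }
     /^##contig=<ID=chr/ { sub(/ID=chr/, "ID="); print; next }  # contig header: rename ID
     /^#/                { print; next }                        # other header lines: untouched
                         { sub(/^chr/, "", $1); print }         # records: strip leading chr
' hdr.vcf > hdr_no_chr.vcf
```

With this layout the ##reference line keeps its path unchanged, while the contig ID and the record's first column both become 1.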

5 solutions

Solution 1
4 2023-05-02 18:57:50

Solution 2
1 2023-05-02 15:09:39

Solution 3
1 2023-05-02 18:38:08

Solution 4
0 2023-05-02 15:41:44

Solution 5
0 2023-06-03 19:26:10