
AWK - replace with constant character in a specified number of random lines

I'm tasked with imputing masked genotypes, and I have to mask (hide) 2% of genotypes.

The file I do this in looks like this (genotype.dat):

M rs4911642
M rs9604821
M rs9605903
M rs5746647
M rs5747968
M rs5747999
M rs2070501
M rs11089263
M rs2096537

and to mask it, I simply change M to S2.

Yet I have to do this for 110 (2%) of 5505 lines, so my strategy of using a random number generator (generating 110 numbers between 1 and 5505) and then manually changing M to S2 on each corresponding line took almost an hour... (I know, not terribly sophisticated).

I thought about saving the numbers in a separate file (maskedlines.txt) and then telling awk to replace the first character on each of those lines with S2, but I could not find any adaptable example of how to do this.

Anyway, any suggestions on how to tackle this will be deeply appreciated.

awk 'NR==FNR{a[$1]=1;next;} a[FNR]{$1="S2"} 1' maskedlines.txt genotype.dat

How it works

In sum, we first read maskedlines.txt into an associative array a. This file is assumed to have one number per line, and the element of a indexed by that number is set to one. We then read genotype.dat. If a for that line number is one, we change the first field to S2 to mask it. The line, whether changed or not, is then printed.

In detail:

  • NR==FNR{a[$1]=1;next;}

    In awk, FNR is the number of records (lines) read so far from the current file and NR is the total number of lines read so far. So, when NR==FNR, we are reading the first file (maskedlines.txt). This file contains the line numbers of the lines in genotype.dat that are to be masked. For each of these line numbers, we set the corresponding element of a to 1. We then skip the rest of the commands and jump to the next line.

  • a[FNR]{$1="S2"}

    If we get here, we are working on the second file: genotype.dat. For each line in this file, we check to see if its line number, FNR, was mentioned in maskedlines.txt. If it was, we set the first field to S2 to mask this line.

  • 1

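    This is awk's cryptic shorthand for printing the current line. The pattern 1 is always true, and since no action is given, awk applies its default action, which is to print the line.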
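To see the three pieces working together, here is a small sketch on a toy input. The file names sample.dat and sample_lines.txt and the chosen line numbers 2 and 5 are made up for illustration; only the awk program itself comes from the answer above.

```shell
# Toy stand-ins for genotype.dat and maskedlines.txt (hypothetical names).
printf 'M rs4911642\nM rs9604821\nM rs9605903\nM rs5746647\nM rs5747968\n' > sample.dat
printf '2\n5\n' > sample_lines.txt

# Same program as above: remember the listed line numbers, then mask those lines.
masked=$(awk 'NR==FNR{a[$1]=1;next;} a[FNR]{$1="S2"} 1' sample_lines.txt sample.dat)
printf '%s\n' "$masked"
```

Lines 2 and 5 should come out as S2 rs9604821 and S2 rs5747968, with the other three lines unchanged. One thing to keep in mind: assigning to $1 makes awk rebuild the line using its output field separator (a single space by default), so tab-separated input would come out space-separated on the masked lines unless OFS is set accordingly.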

Here's one simple way, if you have shuf (it's in GNU coreutils, so if you have Linux, you almost certainly have it):

sed "$(printf '%ds/M/S2/;' $(shuf -n110 -i1-5505 | sort -n))" \
    genotype.dat > genotype.masked

A more sophisticated version wouldn't depend on knowing that you want 110 of 5505 lines masked; you can easily extract the line count with lines=$(wc -l < genotype.dat), and from there you can compute the percentage.
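A rough sketch of that more general version follows. The variable names pct, lines and count are my own, the input is a generated stand-in for genotype.dat, and the integer division rounds down, which is an assumption about how the 2% should be rounded. It relies on GNU shuf and wc, as the original command does.

```shell
# Build a 5505-line stand-in for genotype.dat (hypothetical file name).
awk 'BEGIN { for (i = 1; i <= 5505; i++) print "M rs" i }' > demo_genotype.dat

pct=2                                    # percentage of lines to mask
lines=$(wc -l < demo_genotype.dat)       # total number of lines in the file
count=$(( lines * pct / 100 ))           # integer division, rounds down

# Same sed approach as above, with the computed values substituted in.
sed "$(printf '%ds/M/S2/;' $(shuf -n"$count" -i1-"$lines" | sort -n))" \
    demo_genotype.dat > demo_genotype.masked

# Count how many lines were actually masked.
masked_count=$(grep -c '^S2 ' demo_genotype.masked)
echo "masked $masked_count of $lines lines"
```

With 5505 lines and pct=2, the arithmetic gives 110, matching the numbers in the question.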

shuf is used to produce a random sample of lines, usually from a file; the -i1-5505 option means to use the integers from 1 to 5505 instead, and -n110 means to produce a random sample of 110 (without repetition). I sorted that for efficiency before using printf to create a sed edit script.
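To make the printf step concrete, this sketch prints the sed script generated for three hand-picked line numbers (3, 17 and 42 are arbitrary values standing in for the shuf output). printf reuses its format string for each argument, so one edit command is emitted per line number.

```shell
# Arbitrary line numbers standing in for: shuf -n110 -i1-5505 | sort -n
script=$(printf '%ds/M/S2/;' 3 17 42)
printf '%s\n' "$script"
# prints: 3s/M/S2/;17s/M/S2/;42s/M/S2/;
```

Each Ns/M/S2/ command tells sed to replace the first M on line N with S2, which is why the whole string can be passed to sed as a single script.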
