AWK - replacing a constant character in a specified number of random lines

Question

I'm tasked with imputing masked genotypes, and I have to mask (hide) 2% of genotypes.

The file I do this in looks like this (genotype.dat):

M rs4911642
M rs9604821
M rs9605903
M rs5746647
M rs5747968
M rs5747999
M rs2070501
M rs11089263
M rs2096537

and to mask a line, I simply change M to S2.

However, I have to do this for 110 (2%) of 5505 lines. My strategy of using a random number generator (generating 110 numbers between 1 and 5505) and then manually changing M to S2 on the corresponding lines took almost an hour (I know, not terribly sophisticated).

I thought about saving the numbers in a separate file (maskedlines.txt) and then telling awk to replace the first character on each of those line numbers with S2, but I could not find any adaptable example of how to do this.

Anyway, any suggestions on how to tackle this would be deeply appreciated.

Answer 1

awk 'NR==FNR{a[$1]=1;next;} a[FNR]{$1="S2"} 1' maskedlines.txt genotype.dat

How it works

In sum, we first read maskedlines.txt into an associative array a. This file is assumed to have one number per line, and the element of a indexed by that number is set to one. We then read genotype.dat. If the element of a for the current line number is one, we change the first field to S2 to mask it. The line, whether changed or not, is then printed.

In detail:
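To make the summary concrete, here is a minimal sketch that runs the one-liner on the nine sample rows from the question. The choice of lines 2, 5 and 9 and the output name genotype.masked are assumptions made for illustration only.

```shell
# Work in a scratch directory so no real file is overwritten.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# The nine sample rows from the question, one "M <id>" pair per line.
printf 'M %s\n' rs4911642 rs9604821 rs9605903 rs5746647 rs5747968 \
    rs5747999 rs2070501 rs11089263 rs2096537 > genotype.dat

# Hypothetical selection: mask lines 2, 5 and 9.
printf '%s\n' 2 5 9 > maskedlines.txt

# The one-liner from this answer, with its output saved to a new file.
awk 'NR==FNR{a[$1]=1;next;} a[FNR]{$1="S2"} 1' maskedlines.txt genotype.dat > genotype.masked

cat genotype.masked
```

Lines 2, 5 and 9 of genotype.masked should now start with S2, while the other six still start with M.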

NR==FNR{a[$1]=1;next;}

In awk, FNR is the number of records (lines) read so far from the current file and NR is the total number of lines read so far. So, when NR==FNR, we are reading the first file (maskedlines.txt). This file contains the line numbers of the lines in genotype.dat that are to be masked. For each of these line numbers, we set the corresponding element of a to 1. The next statement then skips the rest of the commands and jumps to the next line.
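The difference between NR and FNR is easiest to see by printing both while awk reads two small files. This is only an illustrative sketch; the file names first.txt and second.txt and the variable which are made up for the demonstration.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 2 5   > "$tmpdir/first.txt"    # two lines
printf '%s\n' a b c > "$tmpdir/second.txt"   # three lines

# Label each record by whether NR==FNR still holds, then show both counters.
out=$(awk '{ which = (NR==FNR) ? "first" : "second"
             print which, "NR=" NR, "FNR=" FNR }' \
      "$tmpdir/first.txt" "$tmpdir/second.txt")
printf '%s\n' "$out"
# first NR=1 FNR=1
# first NR=2 FNR=2
# second NR=3 FNR=1
# second NR=4 FNR=2
# second NR=5 FNR=3
```

FNR restarts at 1 for the second file while NR keeps counting, so NR==FNR is true only during the first file. One caveat: if the first file is empty, NR==FNR is also true for the second file, so the idiom relies on maskedlines.txt containing at least one line.
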
a[FNR]{$1="S2"}

If we get here, we are working on the second file: genotype.dat. For each line in this file, we check whether its line number, FNR, was mentioned in maskedlines.txt. If it was, we set the first field to S2 to mask this line.

1

This is awk's cryptic shorthand for printing the current line: the pattern 1 is always true, and a pattern with no action defaults to printing the line.
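The command assumes that maskedlines.txt already exists. One possible way to create it, sketched here under the assumption that GNU shuf is available, is to let shuf draw the line numbers. The synthetic genotype.dat (with made-up rs IDs) and the output name genotype.masked are stand-ins for illustration.

```shell
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Synthetic stand-in for the real 5505-line file; the IDs are invented.
seq 1 5505 | sed 's/^/M rs/' > genotype.dat

# 110 distinct random line numbers between 1 and 5505, one per line.
shuf -n 110 -i 1-5505 > maskedlines.txt

awk 'NR==FNR{a[$1]=1;next;} a[FNR]{$1="S2"} 1' maskedlines.txt genotype.dat > genotype.masked

# Count the masked lines; this should report 110.
grep -c '^S2 ' genotype.masked
```

Keeping maskedlines.txt around has a practical benefit for imputation work: it records exactly which genotypes were hidden, so the imputed values can later be compared against the originals.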

Answer 2

Here's one simple way, if you have shuf (it's in GNU coreutils, so if you have Linux, you almost certainly have it):

sed "$(printf '%ds/M/S2/;' $(shuf -n110 -i1-5505 | sort -n))" \
    genotype.dat > genotype.masked

A more sophisticated version wouldn't depend on knowing that you want 110 of 5505 lines masked; you can easily extract the line count with lines=$(wc -l < genotype.dat), and from there you can compute the number of lines that corresponds to the percentage.

shuf is used to produce a random sample of lines, usually from a file; the -i1-5505 option means to use the integers from 1 to 5505 instead, and -n110 means to produce a random sample of 110 (without repetition). I sorted that for efficiency before using printf to create a sed edit script.
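A sketch of that more general version might look like the following. The variable names pct and count, the rounding rule, and the synthetic input file are assumptions; the substitution is also anchored as s/^M/S2/ here (a small deviation from the command above) so that only a leading M can be replaced.

```shell
tmpdir=$(mktemp -d)

# Synthetic stand-in for the real file; the rs IDs are invented.
seq 1 5505 | sed 's/^/M rs/' > "$tmpdir/genotype.dat"

pct=2                                   # percentage of lines to mask
lines=$(wc -l < "$tmpdir/genotype.dat") # total number of lines
# Integer arithmetic, rounded to the nearest whole line: 110.1 becomes 110.
count=$(( (lines * pct + 50) / 100 ))
echo "$lines lines, masking $count"

# $(shuf ...) is deliberately unquoted so each number becomes its own argument.
sed "$(printf '%ds/^M/S2/;' $(shuf -n"$count" -i1-"$lines" | sort -n))" \
    "$tmpdir/genotype.dat" > "$tmpdir/genotype.masked"
```

Whether to round to nearest, up, or down is a judgment call that depends on how strictly the 2% target has to be met.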
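To see what printf actually hands to sed, it helps to build the script from a few fixed line numbers instead of random ones. In this sketch, 2, 5 and 9 stand in for the sorted output of shuf, and the variable names script and masked are illustrative.

```shell
# printf reuses its format string once per argument, yielding one
# "<line>s/M/S2/;" command for every line number.
script=$(printf '%ds/M/S2/;' 2 5 9)
echo "$script"
# prints: 2s/M/S2/;5s/M/S2/;9s/M/S2/;

# Apply the generated script to the nine sample rows from the question.
masked=$(printf 'M %s\n' rs4911642 rs9604821 rs9605903 rs5746647 rs5747968 \
    rs5747999 rs2070501 rs11089263 rs2096537 | sed "$script")
printf '%s\n' "$masked"
```

Each generated command is restricted to one line number, so sed changes M to S2 only on lines 2, 5 and 9 and prints every other line untouched.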
