Fetch lines from a file based on line numbers specified in another file (preferably using awk)
I have a very large file (~5 million lines) containing numbers.
numbers.txt:
1
5
1
4
2
20
1
...
I have another file containing data (~1 million lines).
data.txt:
1.000000 -1.072000 -1.000000
2.000000 -1.213000 1.009900
-1.210000 -1.043000 1.000000
-1.000000 -1.000000 -1.000000
1.000000 1.000000 -0.999999
...
numbers.txt contains line numbers for the data.txt file. I need to output a file in which each line number from numbers.txt is replaced with the corresponding line from data.txt. So for the above example the output would look like:
1.000000 -1.072000 -1.000000
1.000000 1.000000 -0.999999
1.000000 -1.072000 -1.000000
-1.000000 -1.000000 -1.000000
2.000000 -1.213000 1.009900
...
I think awk would be the right way to go, but I'm unable to figure out how to do it.
There are two caveats: as the sample shows, the line numbers in numbers.txt are not sorted, and the same line number can occur more than once.
I did find this question, but it doesn't satisfy the caveats.
This is pretty much what Python's linecache module was built for:
#!/usr/bin/env python3
from linecache import getline

with open('numbers.txt') as lines:
    for line in lines:  # read each line number from the numbers file
        try:
            # fetch and print that line from the data file; getline()
            # keeps the trailing newline, so suppress print()'s own
            print(getline('data.txt', int(line)), end='')
        except ValueError:
            pass  # line did not contain a numeral, so ignore it
You can do this as a one-liner as well:
python3 -c 'import linecache, sys; sys.stdout.writelines(linecache.getline("data.txt", int(line)) for line in open("numbers.txt"))'
Only the data file has to be retained in memory, so the index file can be of an arbitrary size.
If your data file is 1 million lines of about 40 characters each, it should fit in about 40 MB, which is a breeze for the average PC.
Re-opening the data file to fetch one line at a time would be way slower, even with disk caching.
So I think you can safely go for a solution that reads the entire data file into memory.
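Spelled out in Python, that slurp-the-data-file approach takes only a few lines. This is a minimal sketch of the idea (the output.txt name is just an example, and out-of-range line numbers are not guarded against):
#!/usr/bin/env python3
# Load all of data.txt into memory once, then stream numbers.txt past it.
with open('data.txt') as f:
    data = f.readlines()  # one entry per line, trailing newline included

with open('numbers.txt') as numbers, open('output.txt', 'w') as out:
    for line in numbers:
        n = int(line)            # 1-based line number from numbers.txt
        out.write(data[n - 1])   # list indexing is 0-based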
Here is how I would do it in awk:
gawk 'NR==FNR { l[NR] = $0; next } { print l[$1] }' data.txt numbers.txt
While awk is reading the first file (data.txt), FNR equals NR, so each line is stored in the array l under its line number; for every line of numbers.txt that follows, the number in field 1 is used to look up and print the stored line.
With this input
data.txt
1 1.000000 -1.072000 -1.000000
2 2.000000 -1.213000 1.009900
3 -1.210000 -1.043000 1.000000
4 -1.000000 -1.000000 -1.000000
5 1.000000 1.000000 -0.999999
6 1.000000 -1.072000 -1.000000
7 2.000000 -1.213000 1.009900
8 -1.210000 -1.043000 1.000000
9 -1.000000 -1.000000 -1.000000
10 1.000000 1.000000 -0.999999
11 1.000000 -1.072000 -1.000000
12 2.000000 -1.213000 1.009900
13 -1.210000 -1.043000 1.000000
14 -1.000000 -1.000000 -1.000000
15 1.000000 1.000000 -0.999999
16 1.000000 -1.072000 -1.000000
17 2.000000 -1.213000 1.009900
18 -1.210000 -1.043000 1.000000
19 -1.000000 -1.000000 -1.000000
20 1.000000 1.000000 -0.999999
(I added an index in front of your sample data for testing.)
numbers.txt
1
5
1
4
2
20
1
it produces
1 1.000000 -1.072000 -1.000000
5 1.000000 1.000000 -0.999999
1 1.000000 -1.072000 -1.000000
4 -1.000000 -1.000000 -1.000000
2 2.000000 -1.213000 1.009900
20 1.000000 1.000000 -0.999999
1 1.000000 -1.072000 -1.000000
I used this PHP script to generate a test case:
<?php
$MAX_DATA  = 1000000;   // lines in data.txt
$MAX_INDEX = 5000000;   // lines in numbers.txt

// data.txt: "<line number> <50 shuffled digits>" on each line
$contents = "";
for ($i = 0; $i != $MAX_DATA; $i++) {
    $contents .= ($i + 1) . " " . str_shuffle("01234567890123456789012345678901234567890123456789") . "\n";
}
file_put_contents('data.txt', $contents);

// numbers.txt: random 1-based line numbers into data.txt
$contents = "";
for ($i = 0; $i != $MAX_INDEX; $i++) {
    $contents .= rand(1, $MAX_DATA) . "\n";
}
file_put_contents('numbers.txt', $contents);

echo "done.";
?>
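If PHP isn't at hand, the same generator is easy to reproduce; here is a rough Python sketch of the same logic, using the same file names and sizes as the script above (my equivalent, not the original script):
#!/usr/bin/env python3
import random

MAX_DATA = 1000000    # lines in data.txt
MAX_INDEX = 5000000   # lines in numbers.txt

# data.txt: "<line number> <50 shuffled digits>", as in the PHP script
digits = list("01234567890123456789012345678901234567890123456789")
with open('data.txt', 'w') as f:
    for i in range(1, MAX_DATA + 1):
        random.shuffle(digits)  # counterpart of PHP's str_shuffle()
        f.write("%d %s\n" % (i, ''.join(digits)))

# numbers.txt: random 1-based line numbers into data.txt
with open('numbers.txt', 'w') as f:
    for _ in range(MAX_INDEX):
        f.write("%d\n" % random.randint(1, MAX_DATA))

print("done.")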
With a random input of 1M data lines and 5M indexes, the awk script above took about 20 seconds to produce a result on my PC.
The data file was about 56 MB and the awk process consumed about 197 MB.
As one could have expected, the processing time is roughly proportional to the size of the index file for a given set of data.