Fetch lines from a file based on line numbers specified in another file (preferably using awk)
I have a very large file (~5 million lines) containing numbers.
numbers.txt:
1
5
1
4
2
20
1
...
I have another file containing data (~1 million lines).
data.txt:
1.000000 -1.072000 -1.000000
2.000000 -1.213000 1.009900
-1.210000 -1.043000 1.000000
-1.000000 -1.000000 -1.000000
1.000000 1.000000 -0.999999
...
numbers.txt contains line numbers for the data.txt file. I need to output a file in which each line number from numbers.txt is replaced with the corresponding line from data.txt. So for the above example the output would look like:
1.000000 -1.072000 -1.000000
1.000000 1.000000 -0.999999
1.000000 -1.072000 -1.000000
-1.000000 -1.000000 -1.000000
2.000000 -1.213000 1.009900
...
I think awk would be the right way to go, but I'm unable to figure out how to do it.
There are two caveats: as the sample shows, the line numbers in numbers.txt are not sorted, and the same line number can occur more than once.
I did find this question, but it doesn't satisfy the caveats.
This is pretty much what Python's linecache module was built for:
#!/usr/bin/env python3
from linecache import getline

with open('numbers.txt') as lines:
    for line in lines:  # read each line number from the numbers file
        try:
            # fetch and print that line from the data file; getline()
            # keeps the trailing newline, so suppress print()'s own
            print(getline('data.txt', int(line)), end='')
        except ValueError:
            pass  # line did not contain a numeral, so ignore it
You can do this as a one-liner as well:
python3 -c 'import linecache, sys; sys.stdout.writelines(linecache.getline("data.txt", int(line)) for line in open("numbers.txt"))'
Only the data file has to be retained in memory, so the index file can be of an arbitrary size.
If your data file is 1 million lines of about 40 characters each, it should fit in about 40 MB, which is a breeze for the average PC.
Re-opening the data file to fetch one line at a time would be way slower, even with disk caching.
So I think you can safely go for a solution that reads the entire data file into memory.
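Spelled out in Python, that slurp-the-data-file approach takes only a few lines. This is a minimal sketch of the idea (the output.txt name is just an example, and out-of-range line numbers are not guarded against):
#!/usr/bin/env python3
# Load all of data.txt into memory once, then stream numbers.txt past it.
with open('data.txt') as f:
    data = f.readlines()  # one entry per line, trailing newline included

with open('numbers.txt') as numbers, open('output.txt', 'w') as out:
    for line in numbers:
        n = int(line)            # 1-based line number from numbers.txt
        out.write(data[n - 1])   # list indexing is 0-based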
Here is how I would do it in awk:
gawk 'NR==FNR { l[NR] = $0; next } { print l[$1] }' data.txt numbers.txt
While awk is reading the first file (data.txt), FNR equals NR, so each line is stored in the array l under its line number; for every line of numbers.txt that follows, the number in field 1 is used to look up and print the stored line.
With this input
data.txt
1 1.000000 -1.072000 -1.000000
2 2.000000 -1.213000 1.009900
3 -1.210000 -1.043000 1.000000
4 -1.000000 -1.000000 -1.000000
5 1.000000 1.000000 -0.999999
6 1.000000 -1.072000 -1.000000
7 2.000000 -1.213000 1.009900
8 -1.210000 -1.043000 1.000000
9 -1.000000 -1.000000 -1.000000
10 1.000000 1.000000 -0.999999
11 1.000000 -1.072000 -1.000000
12 2.000000 -1.213000 1.009900
13 -1.210000 -1.043000 1.000000
14 -1.000000 -1.000000 -1.000000
15 1.000000 1.000000 -0.999999
16 1.000000 -1.072000 -1.000000
17 2.000000 -1.213000 1.009900
18 -1.210000 -1.043000 1.000000
19 -1.000000 -1.000000 -1.000000
20 1.000000 1.000000 -0.999999
(I added an index in front of your sample data for testing.)
numbers.txt
1
5
1
4
2
20
1
it produces
1 1.000000 -1.072000 -1.000000
5 1.000000 1.000000 -0.999999
1 1.000000 -1.072000 -1.000000
4 -1.000000 -1.000000 -1.000000
2 2.000000 -1.213000 1.009900
20 1.000000 1.000000 -0.999999
1 1.000000 -1.072000 -1.000000
I used this PHP script to generate a test case:
<?php
$MAX_DATA  = 1000000;   // lines in data.txt
$MAX_INDEX = 5000000;   // lines in numbers.txt

// data.txt: "<line number> <50 shuffled digits>" on each line
$contents = "";
for ($i = 0; $i != $MAX_DATA; $i++) {
    $contents .= ($i + 1) . " " . str_shuffle("01234567890123456789012345678901234567890123456789") . "\n";
}
file_put_contents('data.txt', $contents);

// numbers.txt: random 1-based line numbers into data.txt
$contents = "";
for ($i = 0; $i != $MAX_INDEX; $i++) {
    $contents .= rand(1, $MAX_DATA) . "\n";
}
file_put_contents('numbers.txt', $contents);

echo "done.";
?>
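If PHP isn't at hand, the same generator is easy to reproduce; here is a rough Python sketch of the same logic, using the same file names and sizes as the script above (my equivalent, not the original script):
#!/usr/bin/env python3
import random

MAX_DATA = 1000000    # lines in data.txt
MAX_INDEX = 5000000   # lines in numbers.txt

# data.txt: "<line number> <50 shuffled digits>", as in the PHP script
digits = list("01234567890123456789012345678901234567890123456789")
with open('data.txt', 'w') as f:
    for i in range(1, MAX_DATA + 1):
        random.shuffle(digits)  # counterpart of PHP's str_shuffle()
        f.write("%d %s\n" % (i, ''.join(digits)))

# numbers.txt: random 1-based line numbers into data.txt
with open('numbers.txt', 'w') as f:
    for _ in range(MAX_INDEX):
        f.write("%d\n" % random.randint(1, MAX_DATA))

print("done.")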
With a random input of 1M data lines and 5M indexes, the awk script above took about 20 seconds to produce a result on my PC.
The data file was about 56 MB and the awk process consumed about 197 MB.
As one could have expected, the processing time is roughly proportional to the size of the index file for a given set of data.