Fetch lines from a file based on line numbers specified in another file (preferably using awk)

I have a very large file (~5 million lines) containing numbers.

numbers.txt:

1
5
1
4
2
20
1
...

I have another file containing data (~1 million lines).

data.txt:

1.000000 -1.072000 -1.000000
2.000000 -1.213000 1.009900
-1.210000 -1.043000 1.000000
-1.000000 -1.000000 -1.000000
1.000000 1.000000 -0.999999
...

numbers.txt contains line numbers for data.txt. I need to output a file in which each entry of numbers.txt is replaced with the corresponding line from data.txt. So for the above example the output would look like:

1.000000 -1.072000 -1.000000
1.000000 1.000000 -0.999999
1.000000 -1.072000 -1.000000
-1.000000 -1.000000 -1.000000
2.000000 -1.213000 1.009900
...

I think awk would be the right way to go, but I'm unable to figure out how to do it.

There are two caveats:

  • The files are very large, so reading everything into memory is not an option.
  • The output has to retain the order of numbers.txt. Sorting is not an option.

I did find this question, but it doesn't satisfy the caveats.

This is pretty much what Python's linecache module was built for:

#!/usr/bin/env python3

from linecache import getline

with open('numbers.txt') as lines:
    for line in lines:  # Read each line from the line-number file
        try:
            # getline keeps the trailing newline, so suppress print's own
            print(getline('data.txt', int(line)), end='')
        except ValueError:
            pass  # line did not contain a number, so ignore it

You can do this as a one-liner as well:

python3 -c 'import linecache, sys; sys.stdout.writelines(linecache.getline("data.txt", int(line)) for line in open("numbers.txt"))'
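Note that linecache keeps every line of the data file in memory once it has been read. If that really is not acceptable, one alternative is to store only the byte offset of each line and seek to it on demand. The following is a minimal sketch of that idea, not a tuned implementation; the names `build_offsets` and `fetch_lines` are made up for illustration, and skipping blank or out-of-range entries is an assumption about the desired behaviour.

```python
from array import array


def build_offsets(data_path):
    """Return an array where offsets[i] is the byte offset of line i + 1."""
    offsets = array("Q")  # unsigned 64-bit integers, one per line
    position = 0
    with open(data_path, "rb") as data:
        for line in data:
            offsets.append(position)
            position += len(line)
    return offsets


def fetch_lines(data_path, numbers_path, out):
    """Write the requested lines to the binary stream out, in numbers_path order."""
    offsets = build_offsets(data_path)
    with open(data_path, "rb") as data, open(numbers_path) as numbers:
        for entry in numbers:
            entry = entry.strip()
            if not entry:
                continue  # assumption: blank lines are skipped
            number = int(entry)
            if not 1 <= number <= len(offsets):
                continue  # assumption: out-of-range line numbers are skipped
            data.seek(offsets[number - 1])
            out.write(data.readline())
```

Calling `fetch_lines("data.txt", "numbers.txt", sys.stdout.buffer)` would print the selected lines. The offset table costs roughly 8 bytes per data line, at the price of one seek and one read per lookup, so it is likely to be slower than an in-memory lookup.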

Only the data file has to be retained in memory, so the index file can be of an arbitrary size.

If your data file is 1 million lines of about 40 characters each, it should fit in about 40 MB, which is a breeze for your average PC.

Re-opening the data file to fetch one line at a time would be much slower, even with disk caching.

So I think you could safely go for a solution that reads the entire data file into memory.
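To make the idea concrete, here is a minimal Python sketch of the in-memory approach: the data file is loaded once into a list, while the index file is streamed line by line. The function name `fetch_lines_in_memory` is hypothetical, and ignoring blank or out-of-range entries is an assumption rather than part of the question.

```python
def fetch_lines_in_memory(data_path, numbers_path, out):
    """Load data_path into a list, then write one line per entry of numbers_path."""
    with open(data_path) as data:
        lines = data.readlines()  # whole data file in memory, newlines kept
    with open(numbers_path) as numbers:
        for entry in numbers:  # the index file is streamed, never held in full
            entry = entry.strip()
            if not entry:
                continue  # assumption: blank lines are skipped
            number = int(entry)
            if 1 <= number <= len(lines):  # assumption: bad numbers are skipped
                out.write(lines[number - 1])
```

Calling `fetch_lines_in_memory("data.txt", "numbers.txt", sys.stdout)` would print the result. Python strings carry per-object overhead, so expect the process to use noticeably more memory than the raw size of the data file.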

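Here is how I would do it in awk (single quotes for a POSIX shell, so that the shell does not expand $0 and $1; use double quotes in a Windows command prompt):

gawk 'NR == FNR { l[NR] = $0; next } { print l[$1] }' data.txt numbers.txt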

With this input

data.txt

1 1.000000 -1.072000 -1.000000
2 2.000000 -1.213000 1.009900
3 -1.210000 -1.043000 1.000000
4 -1.000000 -1.000000 -1.000000
5 1.000000 1.000000 -0.9999991.000000 -1.072000 -1.000000
6 2.000000 -1.213000 1.009900
7 -1.210000 -1.043000 1.000000
8 -1.000000 -1.000000 -1.000000
9 1.000000 1.000000 -0.9999991.000000 -1.072000 -1.000000
10 2.000000 -1.213000 1.009900
11 -1.210000 -1.043000 1.000000
12 -1.000000 -1.000000 -1.000000
13 1.000000 1.000000 -0.9999991.000000 -1.072000 -1.000000
14 2.000000 -1.213000 1.009900
15 -1.210000 -1.043000 1.000000
16 -1.000000 -1.000000 -1.000000
17 1.000000 1.000000 -0.9999991.000000 -1.072000 -1.000000
18 2.000000 -1.213000 1.009900
19 -1.210000 -1.043000 1.000000
20 -1.000000 -1.000000 -1.000000

(I added an index in front of your sample data for testing.)

numbers.txt

1
5
1
4
2
20
1

it produces

1 1.000000 -1.072000 -1.000000
5 1.000000 1.000000 -0.9999991.000000 -1.072000 -1.000000
1 1.000000 -1.072000 -1.000000
4 -1.000000 -1.000000 -1.000000
2 2.000000 -1.213000 1.009900
20 -1.000000 -1.000000 -1.000000
1 1.000000 -1.072000 -1.000000

Performance test

I used this PHP script to generate a test case:

<?php
$MAX_DATA  = 1000000;
$MAX_INDEX = 5000000;

$contents = "";
for ($i = 0 ; $i != $MAX_DATA ; $i++) $contents .= ($i+1) . " " . str_shuffle("01234567890123456789012345678901234567890123456789") . "\n";
file_put_contents ('data.txt', $contents);

$contents = "";
for ($i = 0 ; $i != $MAX_INDEX ; $i++) $contents .= rand(1, $MAX_DATA) . "\n";
file_put_contents ('numbers.txt', $contents);

echo "done.";
?>

With a random input of 1M data lines and 5M indexes, the awk script above took about 20 seconds to produce a result on my PC.
The data file was about 56 MB and the awk process consumed about 197 MB.

As one would expect, the processing time is roughly proportional to the size of the index file for a given set of data.
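A rough way to see this scaling for yourself is to time random lookups against a fixed in-memory data set while growing the number of indexes. The sketch below is a Python analogue, not a measurement of the awk script; `time_lookups` and the sizes used are made up for illustration, and the absolute timings will depend on the machine.

```python
import random
import time


def time_lookups(lines, index_count, seed=0):
    """Return the seconds taken to perform index_count random lookups in lines."""
    rng = random.Random(seed)
    numbers = [rng.randint(1, len(lines)) for _ in range(index_count)]
    start = time.perf_counter()
    for number in numbers:
        _ = lines[number - 1]  # same 1-based lookup as the awk array access
    return time.perf_counter() - start


lines = ["line %d\n" % i for i in range(1, 100001)]  # small stand-in for data.txt
for index_count in (100000, 200000, 400000):
    elapsed = time_lookups(lines, index_count)
    print(index_count, "lookups:", round(elapsed, 4), "s")
```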
