
Reading a file line-by-line with a timeout for lines that are taking too long?

I have a 1.2TB file that I am running some code against, but I constantly run into OutOfMemoryError exceptions. I ran the following two pieces of code against the file to see what was wrong:

import sys

# Print a running line count for the first 173,646,280 lines, then print
# the lines themselves; the last count printed before the hang marks the
# problem line.
with open(sys.argv[1]) as f:
    count = 1
    for line in f:
        if count > 173646280:
            print(line)
        else:
            print(count)
            count += 1

And this code:

#!/usr/bin/env perl
use strict;
use warnings;

$| = 1;    # autoflush STDOUT, so the last count printed before the hang is accurate

my $count = 1;
while (<>) {
    print "$count\n";
    $count++;
}

Both of them zoom along until they hit line 173,646,264, and then they just completely stop. Let me give some quick background on the file.

I created a file called groupBy.json. I then processed that file with some Java code to transform the JSON objects and created a file called groupBy_new.json. I put groupBy_new.json on s3, pulled it down on another server, and was doing some processing on it when I started getting OOM errors. I figured that maybe the file got corrupted when transferring to s3. I ran the above Python/Perl code on groupBy_new.json on both serverA (the server where it was originally created) and serverB (the server onto which I pulled the file from s3), and both halted at the same line. I then ran the above Python/Perl code on groupBy.json, the original file, and it also halted. I tried to recreate groupBy_new.json with the same code that I had originally used to create it, and ran into an OOM error.

So this is a really odd problem that is perplexing me. In short, I'd like to get rid of the line that is causing me problems. What I'm trying to do is read the file with a timeout on each line read: if a line cannot be read within 2 seconds or so, move on to the next line.
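One way to approximate that on a POSIX system is signal.alarm: arm a timer before each readline() and have the handler raise an exception, which on CPython 3.5+ (per PEP 475) makes the interrupted read fail instead of being silently retried. This is a minimal sketch under those assumptions; the 2-second limit, the 1 MiB skip chunk, and the skip_to_next_line helper are illustrative choices:

#!/usr/bin/env python3
import signal
import sys

class LineTimeout(Exception):
    pass

def _on_alarm(signum, frame):
    # Raising here makes the interrupted readline() fail with LineTimeout.
    raise LineTimeout

signal.signal(signal.SIGALRM, _on_alarm)

def skip_to_next_line(f, chunk_size=1 << 20):
    # Discard bytes in bounded chunks until just past the next newline;
    # each read is capped, so a monster line cannot exhaust memory.
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            return                          # EOF while skipping
        i = chunk.find(b'\n')
        if i >= 0:
            # Rewind so the next readline() starts right after the newline.
            f.seek(i - len(chunk) + 1, 1)
            return

with open(sys.argv[1], 'rb') as f:
    count = 1
    while True:
        signal.alarm(2)                     # give each line roughly 2 seconds
        try:
            line = f.readline()
        except LineTimeout:
            print("line %d timed out; skipping it" % count, file=sys.stderr)
            skip_to_next_line(f)
            count += 1
            continue
        finally:
            signal.alarm(0)                 # always disarm the timer
        if not line:                        # end of file
            break
        count += 1                          # process `line` here as needed

Any bytes of the oversized line that were consumed before the alarm fired are simply discarded, which is the point: the goal is to drop that line, not recover it.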

What you can do is count the number of lines up to the problem line and print the count as you go; make sure you flush the output (see https://perl.plover.com/FAQs/Buffering.html). Then write another program that copies that many lines to a different file, reads the input character by character (see http://perldoc.perl.org/functions/read.html) until it hits a "\n", thereby discarding the problem line, and then copies the rest of the file, either line by line or in chunks.
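Those links are for Perl, but the same two-step idea can be sketched compactly in Python; the file names, the line count, and the chunked skip (standing in for a strict character-by-character read, which would be slow at this scale) are illustrative assumptions:

#!/usr/bin/env python3

GOOD_LINES = 173646263                      # intact lines, per the counting script

with open('groupBy_new.json', 'rb') as src, \
     open('groupBy_fixed.json', 'wb') as dst:
    # 1. Copy the known-good prefix line by line.
    for _ in range(GOOD_LINES):
        dst.write(src.readline())

    # 2. Discard the problem line: read forward in bounded chunks until
    #    just past the next newline (or until EOF if the line never ends).
    while True:
        chunk = src.read(1 << 20)
        if not chunk:
            break
        i = chunk.find(b'\n')
        if i >= 0:
            src.seek(i - len(chunk) + 1, 1)
            break

    # 3. Copy everything after the bad line in chunks.
    while True:
        chunk = src.read(1 << 20)
        if not chunk:
            break
        dst.write(chunk)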
