Counting records separated by CR/LF (carriage return and newline) in Perl

I'm trying to create a simple script to read a text file that contains records of book titles. Each record is separated with a plain old double space (\r\n\r\n). I need to count how many records are in the file.

For example, here is the input file:

record 1
some text


record 2 
some text
...

I'm using a regex to check for carriage return and newline, but it fails to match. What am I doing wrong? I'm at my wits' end.

sub readInputFile {

    my $inputFile = $_[0]; #read first argument from the commandline as fileName

    open INPUTFILE, "+<", $inputFile or die $!;    #Open File

    my $singleLine;
    my @singleRecord;
    my $recordCounter = 0;

    while (<INPUTFILE>) {                    # loop through the input file line-by-line
        $singleLine = $_;
        push(@singleRecord, $singleLine);    # start adding each line to a record array

        if ($singleLine =~ m/\r\n/) {        # check for carriage return and new line
            $recordCounter += 1;
            createHashTable(@singleRecord);  # send record make a hash table
            @singleRecord = ();              # empty the current record to start a new record
        }

    }

    print "total records : $recordCounter \n";
    close(INPUTFILE);
}

It sounds like you are processing a Windows text file on Linux, in which case you want to open the file with the :crlf layer, which will convert all CRLF line endings to the standard Perl \n ending.

If you are reading Windows files on a Windows platform then the conversion is already done for you, and you won't find CRLF sequences in the data you have read. If you are reading a Linux file then there are no CR characters in there anyway.
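
To see what the layer changes, here is a minimal sketch that writes a CRLF-terminated sample to a temporary file and reads it back twice, once through :raw and once through :crlf. The sample text and the helper name read_with_layer are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: slurp a file through the given PerlIO layer.
sub read_with_layer {
    my ($layer, $path) = @_;
    open my $fh, "<$layer", $path or die "Cannot open $path: $!";
    local $/;                      # slurp mode: read the whole file at once
    my $data = <$fh>;
    close $fh;
    return $data;
}

# Made-up sample with Windows line endings, written byte for byte.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                      # keep any layer from rewriting the \r\n we print
print {$out} "record 1\r\nsome text\r\n\r\nrecord 2\r\nsome text\r\n";
close $out;

my $raw  = read_with_layer(':raw',  $path);
my $crlf = read_with_layer(':crlf', $path);

print "raw read contains CR:   ", ($raw  =~ /\r/ ? "yes" : "no"), "\n";
print ":crlf read contains CR: ", ($crlf =~ /\r/ ? "yes" : "no"), "\n";
```

With the layer in place, a pattern such as /\r\n/ has nothing to match, because only \n is left in the data.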

It also sounds like your records are separated by a blank line. Setting the built-in input record separator variable $/ to a null string will cause Perl to read a whole record at a time.

I believe this version of your subroutine is what you need. Note that people familiar with Perl will thank you for using lower-case letters and underscores for variable and subroutine names. Mixed case is conventionally reserved for package names.

You don't show create_hash_table so I can't tell what data it needs. I have chomped and split the record into lines, and passed a list of the lines in the record with the newlines removed. It would probably be better to pass the entire record as a single string and leave create_hash_table to process it as required.

sub read_input_file {

    my ($input_file) = @_;

    open my $fh, '<:crlf', $input_file or die $!;
    local $/ = '';

    my $record_counter = 0;

    while (my $record = <$fh>) {
        chomp $record;
        ++$record_counter;
        create_hash_table(split /\n/, $record);
    }
    close $fh;

    print "Total records : $record_counter\n";
}
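
Since create_hash_table is not shown, one way to try the approach is a self-contained variant that collects each record as an array of lines instead. This is only a sketch: read_records and the sample data are hypothetical stand-ins, not part of the original code.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for the subroutine above: instead of calling
# create_hash_table, return each record as an array ref of its lines.
sub read_records {
    my ($input_file) = @_;

    open my $fh, '<:crlf', $input_file or die "Cannot open $input_file: $!";
    local $/ = '';                      # paragraph mode: blank lines end a record

    my @records;
    while (my $record = <$fh>) {
        chomp $record;                  # in paragraph mode this strips every trailing newline
        push @records, [ split /\n/, $record ];
    }
    close $fh;
    return @records;
}

# Made-up sample data with Windows line endings, written byte for byte.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "record 1\r\nsome text\r\n\r\nrecord 2\r\nsome text\r\n";
close $out;

my @records = read_records($path);
print "Total records : ", scalar(@records), "\n";
for my $lines (@records) {
    print "First line: ", $lines->[0], "\n";
}
```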

You can do this more succinctly by changing Perl's record separator, which will make the loop return a record at a time instead of a line at a time.

E.g. after opening your file:

local $/ = "\r\n\r\n";
my $recordCounter = 0;
$recordCounter++ while(<INPUTFILE>);    

$/ holds Perl's global record separator, and scoping it with local allows you to override its value temporarily until the end of the enclosing block, at which point it automatically reverts to its previous value.
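
The scoping behaviour can be checked with a small sketch. The helper name separator_inside_block is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report the separator that is active inside a
# block that overrides $/ with local.
sub separator_inside_block {
    local $/ = "\r\n\r\n";         # the override lasts until this sub returns
    return $/;
}

my $before = $/;                   # "\n" unless something changed it earlier
my $inside = separator_inside_block();
my $after  = $/;                   # restored automatically on leaving the sub

print "overridden inside the block: ", ($inside eq "\r\n\r\n" ? "yes" : "no"), "\n";
print "restored after the block:    ", ($after  eq $before    ? "yes" : "no"), "\n";
```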

But it sounds like the file you're processing may actually have "\n\n" record separators, or even "\r\r". You'd need to set the record separator correctly for whatever file you're processing.
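
If you don't know in advance which convention a file uses, one possible approach is to inspect the raw data and build the separator from whatever line ending appears. The sketch below works on a string through an in-memory filehandle to stay self-contained. Both helper names are hypothetical, and it assumes the data uses a single line-ending style throughout.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the line ending from raw (untranslated) data.
sub guess_line_ending {
    my ($sample) = @_;
    return "\r\n" if $sample =~ /\r\n/;   # Windows
    return "\n"   if $sample =~ /\n/;     # Unix
    return "\r"   if $sample =~ /\r/;     # classic Mac OS
    return "\n";                          # nothing seen: fall back to Unix
}

# Hypothetical helper: count records separated by one blank line.
sub count_records_in_string {
    my ($data) = @_;
    my $eol = guess_line_ending($data);

    open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
    local $/ = $eol x 2;                  # e.g. "\r\n\r\n" for Windows data

    my $count = 0;
    while (defined(my $record = <$fh>)) {
        $count++;
    }
    close $fh;
    return $count;
}

print count_records_in_string("a\r\nb\r\n\r\nc\r\nd\r\n"), "\n";
print count_records_in_string("a\nb\n\nc\nd\n"), "\n";
```

For a real file, the same guess could be made from data read through the :raw layer, so that no layer translates the line endings first.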

If your files are not huge multi-gigabyte files, the easiest and safest way is to read the whole file and use the generic newline metacharacter \R.

This way, it also works if some file actually uses LF instead of CRLF (or even the old Mac standard CR).

Use it with split if you also need the actual records:

perl -ln -0777 -e 'my @records = split /\R\R/; print scalar(@records)' $Your_File

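Or, if you only want a count (note that this counts the blank-line separators, so a file whose last record is not followed by a blank line has one more record than the number printed):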

perl -ln -0777 -e 'my $count=()=/\R\R/g; print $count' $Your_File
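
The same idea can be written as a script. The following is a sketch under the assumption that the whole content is already in a string (read without any line-ending translation). The helper names are made up, and \R requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: split slurped content into records on a blank
# line, accepting CRLF, LF or CR line endings.
sub split_records {
    my ($content) = @_;
    return split /\R\R/, $content;
}

# Hypothetical helper: count the blank-line separators only.
sub count_separators {
    my ($content) = @_;
    my $count = () = $content =~ /\R\R/g;
    return $count;
}

# Made-up sample: two records, no blank line after the last one.
my $content = "record 1\r\nsome text\r\n\r\nrecord 2\r\nsome text\r\n";

my @records    = split_records($content);
my $separators = count_separators($content);

print "records:    ", scalar(@records), "\n";
print "separators: ", $separators, "\n";
```

On this sample the two numbers differ by one, which is why the split form is the safer choice when the file may or may not end with a blank line.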

For more details, see also my other answer to a similar question.
