How can I read multiple lines of a file into blocks in Perl?

I have a file which contains the text below.

#L_ENTRY    <s_slash_1>
#LEX        </>
#ROOT       </>
#POS        <sp>
#SUBCAT     <slash>
#S_LINK           <>
#BITS    <>
#WEIGHT      <0.1>
#SYNONYM     <0>

#L_ENTRY    <s_comma_1>
#LEX        <,>
#ROOT       <,>
#POS        <sp>
#SUBCAT     <comma>
#S_LINK           <>
#BITS    <>
#WEIGHT      <0.1>
#SYNONYM     <0>

#L_ENTRY    <s_tilde_1>
#LEX        <~>
#ROOT       <~>
#POS        <sp>
#SUBCAT     <tilde>
#S_LINK           <>
#BITS    <>
#WEIGHT      <0.1>
#SYNONYM     <0>

#L_ENTRY    <s_at_1>
#LEX        <@>
#ROOT       <@>
#POS        <sp>
#SUBCAT     <at>
#S_LINK           <>
#BITS    <>
#WEIGHT      <0.1>
#SYNONYM     <0>

I know how to read the lines into an array using Perl, but in this case I want an array whose elements are whole blocks, each beginning with #L_ENTRY and ending with #SYNONYM <0>.

Can anyone help?

If you set the input record separator variable, $/, to the empty string, Perl works in paragraph mode and returns one block at a time, where blocks are separated by one or more blank lines in the input data.

use strict;
use warnings 'all';

local $/ = '';


my $n;
while ( <DATA> ) {
    printf "Block %d:\n<<%s>>\n\n", ++$n, $_;
}

__DATA__
A
B
C
D
E
F

A
B
C
D
E
F

Output:

Block 1:
<<A
B
C
D
E
F

>>

Block 2:
<<A
B
C
D
E
F

>>
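The "one or more blank lines" part is what distinguishes paragraph mode from setting the separator to "\n\n". The sketch below is an illustration only: the sample text and the helper read_records are made up for the comparison, and the data is read from an in-memory filehandle rather than a real file.

```perl
use strict;
use warnings;

# Hypothetical sample: two blocks separated by three blank lines.
my $text = "A\nB\n\n\n\nC\nD\n";

# Read every record from $text using the given value for $/.
sub read_records {
    my ($separator) = @_;
    open my $fh, '<', \$text or die "in-memory open failed: $!";
    local $/ = $separator;
    my @records = <$fh>;
    return @records;
}

my @paragraph_mode = read_records('');      # paragraph mode
my @double_newline = read_records("\n\n");  # literal two-newline separator

printf "paragraph mode:    %d records\n", scalar @paragraph_mode;
printf "\"\\n\\n\" separator: %d records\n", scalar @double_newline;
```

As perlvar describes it, paragraph mode treats a run of consecutive blank lines as a single separator, so this input should give two records there, while the literal "\n\n" separator also returns a record made only of newlines.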

There are two ways to do it. Firstly, you can set the "input record separator" special variable, $/ (see perlvar for more). In short, you are telling Perl that a record does not have to end at a newline character. In your case, you could set it to the text of the #SYNONYM <0> line. The separator is a literal string, not a pattern, so its spacing must match the file exactly. Each read then returns everything up to and including the next occurrence of that text; if it does not occur again, you get whatever is left in the file. So, for input data that looks like this:

#L_ENTRY        <s_slash_1>
#LEX         </>
#ROOT        </>
#POS         <sp>
#SUBCAT      <slash>
#S_LINK            <>
#BITS     <>
#WEIGHT      <0.1>
#SYNONYM     <0>

#L_ENTRY        <s_comma_1>
#LEX         <,>
#ROOT        <,>
#POS         <sp>
#SUBCAT      <comma>
#S_LINK            <>
#BITS     <>
#WEIGHT      <0.1>
#SYNONYM     <0>

if you run this:

use v5.14;
use warnings;

my $filename = "data.txt" ;
open(my $fh, '<', $filename) or die "$filename: $!" ;
local $/ = "#SYNONYM     <0>\n" ;
my @chunks = <$fh> ;
say $chunks[0] ;
say '---' ;
say $chunks[1] ;

You get:

#L_ENTRY        <s_slash_1>
#LEX         </>
#ROOT        </>
#POS         <sp>
#SUBCAT      <slash>
#S_LINK            <>
#BITS     <>
#WEIGHT      <0.1>
#SYNONYM     <0>

---

#L_ENTRY        <s_comma_1>
#LEX         <,>
#ROOT        <,>
#POS         <sp>
#SUBCAT      <comma>
#S_LINK            <>
#BITS     <>
#WEIGHT      <0.1>
#SYNONYM     <0>

A couple of notes about this:

  1. Any extra data between your records is going to "get caught in the net" and end up at the start of the following record. That is why the second chunk above begins with a blank line.
  2. The record separator itself is still part of the data and is at the end of each record.
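Both points can be handled after reading, if needed. The following is a minimal sketch, assuming a shortened two-record sample held in a string (the names $text and $separator are illustrative only): chomp removes the current value of $/ from the end of each record, and a substitution drops the whitespace left over from between records.

```perl
use strict;
use warnings;

# Shortened, hypothetical sample with one blank line between the records.
my $text = join '',
    "#L_ENTRY    <s_slash_1>\n", "#SYNONYM     <0>\n",
    "\n",
    "#L_ENTRY    <s_comma_1>\n", "#SYNONYM     <0>\n";

my $separator = "#SYNONYM     <0>\n";

open my $fh, '<', \$text or die "in-memory open failed: $!";

my @records;
{
    local $/ = $separator;
    while ( my $record = <$fh> ) {
        chomp $record;            # strips the trailing separator (the value of $/)
        $record =~ s/\A\s+//;     # strips whitespace carried over from between records
        next unless length $record;
        push @records, $record;
    }
}

print "[$_]\n" for @records;
```

Whether to strip the separator depends on what you do next: leave out the chomp if the #SYNONYM line should stay part of each record.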

To get more control, it's better to process the data line by line and use regexes to switch between "capture" mode and "don't capture" mode:

use v5.14;
use warnings;

my $filename = "data.txt" ;
open(my $fh, '<', $filename) or die "$filename: $!" ;

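# Patterns for the first and last line of a record; /x allows the spacing.
my $found_start_token = qr/ \s* \#L_ENTRY \s* /x;
my $found_stop_token  = qr/ \s* \#SYNONYM \s+ \<0\> \s* \n /x;

my @chunks ;
my $chunk = '' ;
my $capture_mode = 0 ;

while ( <$fh> )  {
    # Start capturing at a line that contains #L_ENTRY.
    $capture_mode = 1 if /$found_start_token/ ;
    $chunk .= $_ if $capture_mode ;
    # A #SYNONYM <0> line completes the record: store it and reset.
    # Checking $capture_mode avoids storing an empty chunk for a stray stop line.
    if ($capture_mode && /$found_stop_token/) {
        push @chunks, $chunk ;
        $chunk = '' ;
        $capture_mode = 0 ;
    }
}
say $chunks[0] ;
say '---' ;
say $chunks[1] ;
exit 0 ;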

A couple of notes:

  1. The program works by string concatenation of the current line, $_, on to $chunk if we're in capture mode.
  2. Capture mode is turned on and off using regexes in "extended mode", /x. This allows adding whitespace to the regex for easier reading.
  3. Extra data between records will not appear in the chunks.
  4. It produces the same records as before, except that the blank line between records no longer appears at the start of the second chunk.
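The same on/off switching can also be written with Perl's range operator, which acts as a flip-flop in scalar context. This is an alternative technique, not the approach used in the answer above; the sketch assumes a shortened sample with made-up "junk" lines and reads it from an in-memory filehandle.

```perl
use strict;
use warnings;

# Shortened, hypothetical sample with extra lines outside the records.
my $text = <<'END';
junk before the first record
#L_ENTRY    <s_slash_1>
#LEX        </>
#SYNONYM     <0>
junk between records
#L_ENTRY    <s_comma_1>
#LEX        <,>
#SYNONYM     <0>
END

open my $fh, '<', \$text or die "in-memory open failed: $!";

my @chunks;
my $chunk = '';
while ( my $line = <$fh> ) {
    # In scalar context `..` turns true on a start line and stays true
    # through the next stop line; its value on that last line ends in "E0".
    my $in_record = ( $line =~ /^\s*#L_ENTRY\b/ ) .. ( $line =~ /^\s*#SYNONYM\s+<0>\s*$/ );
    next unless $in_record;

    $chunk .= $line;
    if ( $in_record =~ /E0\z/ ) {   # last line of the record
        push @chunks, $chunk;
        $chunk = '';
    }
}

print "$_---\n" for @chunks;
```

The explicit $capture_mode flag is easier to follow and to extend; the flip-flop is shorter but keeps its state hidden inside the operator.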

From this and your succeeding question, it looks like you have the answer but are unaware of it.

As long as your blocks are separated by at least one blank line, you can use Perl's paragraph mode, which will hand you back the text in blocks.

Here's another, different example that I hope you understand. I've created a file called test.txt that contains the data that you posted, and opened it in paragraph mode.

The output is from Data::Dump, which I've used only to demonstrate that the resulting array contains exactly the four strings that you asked for.

Please add a comment to this solution if you need any more explanation.

use strict;
use warnings 'all';
use autodie;

my $file = 'test.txt';

my @chunks = do {
    open my $fh, '<', $file;
    local $/ = '';
    <$fh>;
};

use Data::Dump;
dd \@chunks;

Output:

[
  "#L_ENTRY    <s_slash_1>\n#LEX        </>\n#ROOT       </>\n#POS        <sp>\n#SUBCAT     <slash>\n#S_LINK           <>\n#BITS    <>\n#WEIGHT      <0.1>\n#SYNONYM     <0>\n\n",
  "#L_ENTRY    <s_comma_1>\n#LEX        <,>\n#ROOT       <,>\n#POS        <sp>\n#SUBCAT     <comma>\n#S_LINK           <>\n#BITS    <>\n#WEIGHT      <0.1>\n#SYNONYM     <0>\n\n",
  "#L_ENTRY    <s_tilde_1>\n#LEX        <~>\n#ROOT       <~>\n#POS        <sp>\n#SUBCAT     <tilde>\n#S_LINK           <>\n#BITS    <>\n#WEIGHT      <0.1>\n#SYNONYM     <0>\n\n",
  "#L_ENTRY    <s_at_1>\n#LEX        <\@>\n#ROOT       <\@>\n#POS        <sp>\n#SUBCAT     <at>\n#S_LINK           <>\n#BITS    <>\n#WEIGHT      <0.1>\n#SYNONYM     <0>\n",
]
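Note that the blocks in the dump keep their trailing newlines. If those are unwanted, chomp can remove them, provided it runs while $/ is still the empty string. The following is a minimal sketch, assuming a shortened two-block sample in a string in place of test.txt:

```perl
use strict;
use warnings;

# Shortened, hypothetical stand-in for the contents of test.txt.
my $text = "#L_ENTRY    <s_slash_1>\n#SYNONYM     <0>\n\n"
         . "#L_ENTRY    <s_at_1>\n#SYNONYM     <0>\n";

my @chunks = do {
    open my $fh, '<', \$text or die "in-memory open failed: $!";
    local $/ = '';
    my @blocks = <$fh>;
    chomp @blocks;    # in paragraph mode, chomp removes all trailing newlines
    @blocks;
};

print "[$_]\n" for @chunks;
```

The chomp has to stay inside the do block: once the block exits, $/ is restored to its default and chomp would remove only a single newline.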
