简体   繁体   English

用perl regex用条件替换文本块

[英]Replace blocks of text with perl regex with conditional

I have a perl script that does some regex replacements on a text file, which I need to modify along the following lines: (a) I need to process the text as blocks of text, and then, depending on the presence/absence of one line different replacements need to be done. 我有一个perl脚本,它在文本文件上执行一些正则表达式替换,需要按照以下几行进行修改:(a)我需要将文本处理为文本块,然后再根据是否存在一个文本块进行处理。生产线需要进行不同的替换。 (b) I need to add text to the end of each block. (b)我需要在每个块的末尾添加文本。 (this transforms text from a transcription program to LaTeX code) (这会将文本从转录程序转换为LaTeX代码)

These are supposed to be two columns: 这些应该是两列:
To the left is how the input looks, to the right what it should become: 左边是输入的外观,右边是应变成的内容:

ORIGINAL INPUT               EXPECTED OUTCOME

# Single line blocks: label to be replaced and \xe added to en
txt@#Name  Text text text    \ex[exno=\spkr{Name}] \txt  Text text text 
                             \xe

nvb@#Name  Text text text    \ex[exno=\spkr{Name}] \nvb  Text text text 
                             \xe

# Multi-line blocks: labels to be replaced and \xe added to end
txt@#Name  Text text text    \ex[exno=\spkr{Name}] \txt  Text text text 
fte@#Name  Text text text    \freetr Text text text
                             \xe

txt@#Name  Text text text    \ex[exno=\spkr{Name}] \txt  Text text text 
SD (0.0)                     \silence{0.0}
                             \xe

txt@#Name  Text text text    \ex[exno=\spkr{Name}] \txt  Text text text 
tli@#Name  Text text text    \translit   Text text text
fte@#Name  Text text text    \freetr    Text text text
                             \xe

# Multi-line block that has the mrb@... line (must start with txt): 
txt@#Name  Text text text    \ex[exno=\spkr{Name}] \begingl \glpreamble  Text text text // 
mrb@#Name  Text text text    \gla Text text text //
gle@#Name  Text text text    \glb Text text text //
fte@#Name  Text text text    \glft Text text text //
SD (0.0)                     \endgl 
                             \silence{0.0}
                             \xe
# The tricky thing here is that (a) the labels get replaced differently, the txt line gets two commands, \begingl and \glpreamble, all lines have to end with  // and they end with \endgl and \xe.  In case there is an SD (silence duration) line then that needs to go between the \endgl and the \xe. (but not all have the SD). 



Blocks are separated by an extra blank line. 块之间用多余的空白行隔开。 The first line of each block begins with a label txt@... , nvb@... or event and may or may not be followed by subsequent lines starting with different labels. 每个块的第一行以txt@...nvb@...event标签开头,并且可能会或可能不会后面跟随以不同标签开头的后续行。 Each label needs to be replaced with something else, here accomplished through regexes like in the example below (plus some other replacements, this is just minimal for purpose of explanation). 每个标签都需要替换为其他标签,这里可以通过正则表达式来完成,如下面的示例所示(加上一些其他替换,出于解释目的,这只是最小的)。 And then I need to mark the end of each block. 然后,我需要标记每个块的结尾。

Furthermore, I need to have one conditional somewhere in there: If the block includes a line starting with an mrb@ label (like the sixth block above), different replacement patterns apply. 此外,我需要在其中某处有一个条件:如果该块包含以mrb @标签开头的行(如上面的第六个块),则将应用不同的替换模式。

The following script is what I have, but it processes everything line by line. 以下脚本是我所拥有的,但是它逐行处理所有内容。 I know perl can do block by block, which should then make it possible to do the modifications, but unfortunately my skills are way too rudimentary to figure it out by myself. 我知道perl可以一步一步地做,然后应该可以进行修改,但是不幸的是我的技能太基本了,无法自己解决。

#!/usr/bin/perl
use warnings;
use strict;

open my $fh_in, '<', $ARGV[0] or die "No input: $!";
open my $fh_out, '>', $ARGV[1] or die "No output: $!";

print $fh_out "\\begin{myenv}\n\n"; # begins group at beginning of file

while (<$fh_in>) 
{
    # general replacements for everything except if block includes a "mrb@" line:
    s/^txt@#(\S*)\s+(.*)/\\ex[exno=\\spkr{$1}] \\txt $2 /g; 
    s/^nvb@#(\S*)\s+(.*)/\\ex[exno=\\spkr{$1}] \\txt $2 /g;  
    s/^tli@#\S*\s+(.*)/\\translit $1 /g; 
    s/^fte@#\S*\s+(.*)/\\freetr $1 /g; 
    s/^SD\s*\((\d*)\.(\d*)\)/\\silence{\($1\.$2\)}/g; 

    # after each block I need to add "\\xe" 

    # replacements if block includes a "mrb@" line: 
    s/^txt@#(\S*)\s+(.*)/\\ex[exno=\\spkr{$1}] \\begingl \\glpreamble $2 \/\/ /g; 
    s/^mrb@#\S*\s+(.*)/\\gla $1 \/\/ /g; # 
    s/^gle@#\S*\s+(.*)/\\glb $1 \/\/ /g; # 
    s/^fte@#\S*\s+(.*)/\\glft $1 \/\/ /g; # 
    s/^tli@#\S*\s+(.*)/\\translit $1 \/\/ /g; #
    s/^fte@#\S*\s+(.*)/\\freetr $1 \/\/ /g; # 
    s/^SD\s*\((\d*)\.(\d*)\)/\\silence{\($1\.$2\)}/g;
    # after each block with a "mrb@" line I need to add "\\endgl" and "\\xe"
    # if there is a line starting with SD at the end of the block it needs to go between "\\endgl" and "\\xe"


    print $fh_out $_;    
} 

print $fh_out "\\end{myenv}"; # ends group

Any help much appreciated! 任何帮助,不胜感激!

The processing details are apparently complex; 处理细节显然很复杂; let's first clear up how to process blocks. 让我们首先弄清楚如何处理块。

One way is to go line-by-line and accumulate lines for a block, until you get to an empty line. 一种方法是逐行移动并累积一个块的行,直到到达空行。 Then you process your block and clear the buffer, and keep going. 然后,您处理块并清除缓冲区,然后继续进行。 For example 例如

use warnings;
use strict;
use feature 'say';

sub process_block {
    say "Block:"; say "\t$_" for @{$_[0]};
}

my $file = shift // die "Usage: $0 filename\n";  #/

open my $fh, '<', $file or die "Can't open $file: $!";

my @block;
while (<$fh>) {
    chomp;
    if (not /\S/) {
        if (@block) {                # the first empty line
            process_block(\@block);
            @block = (); 
        }
        next;
    }   

    push @block, $_; 
}
process_block(\@block) if @block;    # last block may have remained

The process_block call after the while loop doesn't fire for the shown sample, since there are empty lines before the end of the file so the last block gets processed inside the loop. while循环后的process_block调用不会为所示示例触发,因为文件末尾之前有空行,因此最后一个块在循环内得到处理。 But we need to ensure that the last block is processed when there are no empty lines at the end as well. 但是我们需要确保在末尾也没有空行时处理最后一个块。

Inside process_block you can now check whether @block contains mrb@#Name , apply other (apparently complex) conditions, run regex, and print processed lines. process_block内部,您现在可以检查@block是否包含mrb@#Name ,应用其他(显然复杂的)条件,运行正则表达式以及打印处理过的行。

Here is an example, following clarifications but still leaving out some details 这是一个示例,经过澄清,但仍然省略了一些细节

use List::Util qw(any);  # used to be in List::MoreUtils

sub process_block {
    my @block = @{ $_[0] };  # local copy, to not change @block in caller

    if ($block[0] =~ /^txt\@/ and any { /^mrb\@/ } @block) {
        for (@block) {
            s{^txt\@#(\S*)\s+(.*)}
             {\\ex[exno=\\spkr{$1}, exnoformat=X] \\begingl \\glpreamble $2 // }g;  #/
            s{^mrb\@#\S*\s+(.*)}{\\gla $1 // }g;
            # etc
        }   
        if ($block[-1] =~ /^\s*SD/) {
            my $SD_line = pop @block;
            push @block, '\endgl', $SD_line, '\xe';
        }
        else {
            push @block, '\endgl', '\xe';
        }
    }
    else {
        for (@block) {
            s/^txt\@#(\S*)\s+(.*)/\\ex[exno=\\spkr{$1}, exnoformat=X] \\txt $2 /g; 
            s/^tli\@#\S*\s+(.*)/\\translit $1 /g;
            # etc
        }
        push @block, '\xe';
    }
    say for @block;
    say "\n";        # two lines to separate blocks
}

A note on efficiency. 关于效率的说明。

This code processes each line in a block against all regex substitutions, to find the one that applies to it. 此代码针对所有正则表达式替换处理块中的每一行,以找到适用于它的行。 The distinguishing pattern comes right at the beginning so "wrong" lines fail right away but we still run the regex engine for all checks for each line. 区别模式是从一​​开始就出现的,因此“错误”的行会立即失败,但是我们仍然对每行的所有检查运行正则表达式引擎。

This may (or may not) be a problem with many regex or long blocks or if done often, and it can be optimized if it is slow. 对于许多正则表达式或较长的代码块,或者如果经常执行,这可能是(或可能不是)问题,如果速度较慢,则可以对其进行优化。 Since the list of substitutions is always the same we can build a hash with regex keyed by the distinguishing start of the pattern (as a dispatch table ). 由于替换列表始终相同,因此我们可以使用正则表达式构建散列,该正则表达式由模式的区别性开头(作为调度表 )作为键。 For example 例如

my %repl_non_mrb = ( 
    'txt@' => sub { s/^txt\@#(\S*)\s+(.*)/\\ex[exno=\\spkr{$1}, exnoformat=X] \\txt $2 /g }
    'tli@' => sub { s/^tli\@#\S*\s+(.*)/\\translit $1 /g },
    ...
);
my %repl_mrb = ( ... );

and then use it along the lines of 然后沿着

# For blocks without 'mrb@'
for (@block) {
    # Capture key: up to # for 'txt@' (etc), up to \s for 'SD'. Other cases?
    my ($key) = /^(.*?)(?:#|\s)/; 
    if ($key and exists $repl_non_mrb{$key}) {
        $repl_non_mrb{$key}->();                  # run the coderef
    }
    else { say "No processing key (?) for: $_" }  # some error?
}

This clearly needs more (careful) work, while there are also other ways to organize those regex. 显然,这需要更多(仔细)的工作,同时还有其他方式来组织这些正则表达式。 But an implementation of these (fixed) regex substitutions hashed by their distinguishing patterns will surely improve on the O(NM) complexity of always running all regex on each line. 但是,这些(固定的)正则表达式替换(通过其区别模式进行散列)的实现肯定会提高始终在每行上运行所有正则表达式的O(NM)复杂度。


Another way is what you inquire about 另一种方式是您查询的内容

I know perl can do block by block 我知道perl可以一步一步地做

what can be done by setting the $/ variable . 可以通过设置$/变量来完成。 It sets what is then used as the separator between input records. 它设置什么然后用作输入记录之间的分隔符。 If you set it to \\n\\n here you get a block served for each read, in a string 如果您将其设置为\\n\\n ,则会在字符串中为每次读取提供一个块

open my $fh, '<', $file or die "Can't open $file: $!";

PROCESS_FILE: { 
    local $/ = "\n\n";
    while (my $block = <$fh>) { 
        chomp $block;
        say "|$block|"; 
    }
};

I put this inside a block (named PROCESS_FILE just so) so that we can change $/ by using local . 我将其放在一个块中(这样命名为PROCESS_FILE ),以便我们可以使用local更改$/ Then its previous value is restored as the block is exited and files are again read normally. 然后,当退出该块并再次正常读取文件时,将恢复其先前的值。

However, I don't see a benefit of doing this here since you now have a block in a scalar variable, while what you need to do seems to be line oriented. 但是,我看不到这样做的好处,因为您现在在标量变量中有一个块,而您所需要做的似乎是面向行的。 So I'd recommend the first approach. 因此,我建议第一种方法。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM