Perl multiline regex

I have a file full of JSON-like objects to parse, similar to this one:

{
"_id" : ObjectId("523a58c1e4b09611f4c58a66"),
"_items" : [
    {
        "adGroupId" : NumberLong(1230610621),
        "keywordId" : NumberLong("5458816773")
    },
    {
        "adGroupId" : NumberLong(1230613681),
        "keywordId" : NumberLong("3204196588")
    },
    {
        "adGroupId" : NumberLong(1230613681),
        "keywordId" : NumberLong("4340421772")
    },
    {
        "adGroupId" : NumberLong(1230615571),
        "keywordId" : NumberLong("10525630645")
    },
    {
        "adGroupId" : NumberLong(1230617641),
        "keywordId" : NumberLong("4178290208")
    }
]}

I want to extract the numbers from inside NumberLong(). At first I needed just the keywordId, and managed to get it with:

cat listado.txt |& perl -ne 'print "$1," if /\"keywordId\" : NumberLong\(\"?(\d*)\"?\)/' > keywordIds.txt

This generated a comma-separated file with the numbers. I now also need the adGroupIds, so I'm trying the following regex, with no luck:

cat ./work/listado.txt |& perl -ne 'print "$1-$2," if /\"adGroupId\" : NumberLong\(\"?(\d*)\"?\),\s*\"keywordId\" : NumberLong\(\"?(\d*)\"?\)/m'

The regex matches, but I believe Perl is not doing multiline matching, even though I'm using /m.

Any ideas?

/m affects what ^ and $ match. You use neither, so /m has no effect.

You only read a single line at a time (that is what -n does), so you only match against a single line at a time. /m cannot possibly cause the regex to match against data that is still waiting to be read from a file handle it knows nothing about.

You could load the entire file into memory by using -0777 and loop over all matches instead of just grabbing the first.

This is pretty straightforward with just grep and sed:

grep adGroupId listado.txt | sed -E "s/[^0-9]+//g"
  1. Match lines with adGroupId in them
  2. Remove everything that isn't a digit
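
The command above prints only the adGroupId values. To get both numbers of each pair, one possible extension is sketched below. It assumes every adGroupId line is immediately followed by its keywordId line, as in the sample. The paste step is an addition for illustration, not part of the original answer, and the temporary file stands in for listado.txt:

```shell
# Hypothetical sample file mirroring the structure shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
{
"_id" : ObjectId("523a58c1e4b09611f4c58a66"),
"_items" : [
    {
        "adGroupId" : NumberLong(1230610621),
        "keywordId" : NumberLong("5458816773")
    },
    {
        "adGroupId" : NumberLong(1230613681),
        "keywordId" : NumberLong("3204196588")
    }
]}
EOF

# 1. Keep only the adGroupId and keywordId lines (the _id line is dropped).
# 2. Strip everything that is not a digit.
# 3. Join each consecutive pair of lines with a dash.
paired=$(grep -E '"(adGroupId|keywordId)"' "$sample" \
    | sed -E 's/[^0-9]+//g' \
    | paste -d '-' - -)

echo "$paired"
rm -f "$sample"
```

If a record is ever missing one of the two fields, the pairing drifts from that point on, so this only suits files as regular as the sample.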

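Depending on the exact structure of your data, you may be able to make use of line numbers:

while (<>) {
  # Named capture (Perl 5.10+): the number inside NumberLong(...), quoted or not.
  if ( /NumberLong\("?(?<nr>\d+)/ ) {
    # $. is the current line number; in the sample, adGroupId falls on odd lines.
    my $sep = $. % 2 ? "-" : "\n";
    print "$+{nr}$sep";
  }
}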

Or use a flag:

my $flag = 0;

while (<>) {
  if ( /NumberLong\("?(?<nr>\d+)/ ) {
    # The flag alternates between the first and second number of each pair.
    my $sep = $flag ? "\n" : "-";
    print "$+{nr}$sep";
    $flag = !$flag;
  }
}

Or with slurping:

use 5.010;
my $file;

{
  local $/;
  $file = <>;
}

while ($file =~ /adGroupId" : NumberLong\("?(?<first>\d+).+?keywordId" : NumberLong\("?(?<second>\d+)/gs ) {
  say "$+{first}-$+{second}";
}
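
A one-liner that handles the two kinds of line separately (single quotes keep the shell from expanding $1):

perl -ne 'print "$1-" if /adGroupId.+?(\d+)/; print "$1," if /keywordId.+?(\d+)/' listado.txt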

Take a look at File::MultilineGrep.

Excerpt from its description, lightly edited: Consider text files made of repeated structures. Each structure has a repeated start delimiter, an optional stop delimiter, and variable contents; that is, some or all of its fields are optional. The task is to select every whole structure that contains a specified pattern. This can be done with a multiline regular expression, but there is a performance issue: the processing time of a regular expression is not directly proportional to the number of structures, so as that number grows the regular expression might never finish. The processing time of the proposed function is directly proportional to the number of structures.
