
Regular expression: extract information from a utility's output with grep

The nconvert utility prints information about a file:

nconvert -info file.tiff

Output:

** NCONVERT v7.00 (c) 1991-2017 Pierre-E Gougelet (Apr 18 2017/09:49:26) **
        Version for Windows NT/9x/2000/Xp/Vista/7  (All rights reserved)
** This is freeware software (for non-commercial use)

Over...

file.tiff : Success
    Format               : TIFF
    Name                 : tiff
    Compression          : CCITT Group 4
    Width                : 3194
    Height               : 5056
    Components per pixel : 1
    Bits per component   : 1
    Depth                : 1
    # colors             : 2
    Color model          : RGB
    Bytes Per Plane      : 400
    Orientation          : Top Left
    Xdpi                 : 600
    Ydpi                 : 600
    Page(s)              : 30
    Info:
      Photometric Interpretation: White=0
      PhotometricInterpretation: 0
      PlanarConfiguration: 1
      SamplesPerPixel: 1
      Software: LIBFORMAT (c) Pierre-e Gougelet
    Metadata             : ( EXIF )

I need to extract numeric information using grep. For example, to get the number of pages, I use:

nconvert -info file.tiff | grep -oP "(?<=Page\(s\)).*$"

I get:

      : 30

But I need only the number 30.

The modification below does not give the desired result either:

nconvert -info efile.tiff | grep -oP "(?<=Page\(s\)\s+\:).*$"

How can I get the information after the colon?

You can use this grep:

nconvert -info efile.tiff | grep -oP 'Page\(s\)\h*:\h*\K\d+'
30

\K resets the start of the reported match, so the text matched before it is left out of the output.
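
To see the effect of \K in isolation, here is a minimal sketch. It assumes GNU grep built with PCRE support (the -P option). The sample_info function is a hypothetical stand-in for nconvert -info that replays a few lines of the output shown in the question.

```shell
# Hypothetical stand-in for `nconvert -info file.tiff`:
# replays a few lines copied from the output in the question.
sample_info() {
  printf '%s\n' \
    'file.tiff : Success' \
    '    Width                : 3194' \
    '    Height               : 5056' \
    '    Page(s)              : 30' \
    '    Info:'
}

# Without \K, -o prints the whole match, label and colon included.
sample_info | grep -oP 'Page\(s\)\h*:\h*\d+'

# With \K, only the part matched after \K is reported.
pages=$(sample_info | grep -oP 'Page\(s\)\h*:\h*\K\d+')
echo "pages=$pages"
```

Capturing the result in a variable, as in the last two lines, is one way to reuse the number later in a script.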

You can also use awk:

nconvert -info efile.tiff | awk -F '[: \t]*' '$2=="Page(s)"{print $3}'
30
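
If several values are needed at once, the awk approach can be extended. The following is only a sketch, not part of the original answer: it splits each line at the first colon, trims whitespace from both sides, and prints selected labels. The sample_info function is a hypothetical stand-in for nconvert -info, and the label=value output format is an arbitrary choice.

```shell
# Hypothetical stand-in for `nconvert -info file.tiff`.
sample_info() {
  printf '%s\n' \
    'file.tiff : Success' \
    '    Width                : 3194' \
    '    Height               : 5056' \
    '    Page(s)              : 30' \
    '    Info:'
}

fields=$(sample_info | awk -F ':' '
  {
    label = $1                               # text before the first colon
    value = substr($0, index($0, ":") + 1)   # text after the first colon
    gsub(/^[ \t]+|[ \t]+$/, "", label)       # trim leading/trailing blanks
    gsub(/^[ \t]+|[ \t]+$/, "", value)
  }
  label == "Width" || label == "Height" || label == "Page(s)" {
    print label "=" value
  }
')
printf '%s\n' "$fields"
```

Comparing the trimmed label with == avoids regex escaping for the parentheses in Page(s).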

You need to replace the positive lookbehind with the \K match-reset operator. PCRE lookbehinds cannot contain unbounded quantifiers such as \s+, whereas \K allows a variable-width pattern before the value you want to extract:

grep -oP 'Page\(s\)\s*:\s*\K.*'

Here,

  • Page\(s\) - matches the literal text Page(s)
  • \s*:\s* - matches a colon surrounded by zero or more whitespace characters
  • \K - omits the text matched so far from the reported match
  • .* - matches the rest of the line.
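
The difference between the two forms can be checked on a single line. This sketch assumes GNU grep with PCRE support. The line variable holds one line copied from the nconvert output in the question.

```shell
# One line copied from the nconvert output in the question.
line='    Page(s)              : 30'

# Lookbehind containing \s+: PCRE rejects the pattern because the
# lookbehind has no bounded length, so grep exits with a non-zero status.
if ! printf '%s\n' "$line" | grep -oP '(?<=Page\(s\)\s+:).*$' 2>/dev/null; then
  echo 'lookbehind version: pattern rejected'
fi

# \K version: the prefix may have any width, and only the text after \K is printed.
value=$(printf '%s\n' "$line" | grep -oP 'Page\(s\)\s*:\s*\K.*')
echo "value=$value"
```

Because \s*:\s* consumes the spaces on both sides of the colon before \K, the captured value has no leading whitespace.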

This sed command should also do it:

nconvert -info file.tiff |
sed -n '/^[[:space:]]*Page(s)/{s/^[^[:digit:]]*//;p;}'

Explanation

  • -n suppresses sed's default behavior of printing every input line, so only lines printed explicitly appear in the output.
  • /pattern/ selects the lines that match a pattern; the pattern is enclosed in two forward slashes.
  • /^pattern/ matches the pattern only at the beginning of a line.
  • /^[[:space:]]*Page(s)/ matches any amount of whitespace at the beginning of a line followed by Page(s). sed uses basic regular expressions by default, where unescaped parentheses are literal characters; writing \(s\) would create a group and match Pages instead.
  • When a line matches, sed runs the commands inside the curly braces {commands}.
  • The first command is a substitution, which has the form s/pattern/replacement/.
  • [] in a regex is a bracket expression that matches one character from a set or range, for example [A-Z] or [0-9].
  • Character classes are also available inside brackets: [[:digit:]] is equivalent to [0-9]. A ^ right after the opening [ negates the set. So s/^[^[:digit:]]*// deletes the leading run of non-digit characters; * means zero or more repetitions.
  • The p at the end prints the resulting line. Note that the s and p commands are separated by a semicolon.
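
The same extraction can also be written as a single substitution with a capture group. This is a sketch rather than part of the original answer, and it assumes a sed that supports -E (extended regular expressions), as GNU and BSD sed do. In extended syntax, plain parentheses group and \( \) are literal, the reverse of the basic syntax. The sample_info function is a hypothetical stand-in for nconvert -info.

```shell
# Hypothetical stand-in for `nconvert -info file.tiff`.
sample_info() {
  printf '%s\n' \
    'file.tiff : Success' \
    '    Width                : 3194' \
    '    Page(s)              : 30' \
    '    Info:'
}

# -n: print nothing by default; -E: extended regular expressions.
# The group ([0-9]+) captures the number, \1 replaces the whole line with it,
# and the trailing p prints only lines where the substitution succeeded.
pages=$(sample_info |
  sed -nE 's/^[[:space:]]*Page\(s\)[[:space:]]*:[[:space:]]*([0-9]+).*$/\1/p')
echo "pages=$pages"
```

Because the substitution itself selects the line, no separate /pattern/ address or braces are needed.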
