
Awk pattern matching

I want to print

userId = 1234
userid = 12345
timestamp = 88888888
js = abc

from my data

messssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
<input name="userId" value="1234" type="hidden"> messsssssssssssssssssss
<input name="userid" value="12345" type="hidden"> messssssssssssssssssss
<input name="timestamp" value="88888888" type="hidden"> messssssssssssss
<input name="js" value="abc" type="hidden"> messssssssssssssssssssssssss
messssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss

How can I do this with AWK (or anything else)? Assume that my data is stored in the "$info" variable (single-line data).

Edit: by "single-line data" I mean that all of the data is on one line, like this:

messss...<input name="userId" value="1234" type="hidden">messsss...<input ....>messssssss

So I can't use grep to extract the section I'm interested in.

I'm not sure I understand your "single line data" comment but if this is in a file, you can just do something like:

cat file |
    grep '^<input ' |
    sed 's/^<input name="//' |
    sed 's/" value="/ = /' |
    sed 's/".*$//'

Here's the cut'n'paste version:

cat file | grep '^<input ' | sed 's/^<input name="//' | sed 's/" value="/ = /' | sed 's/".*$//'

This turns:

messssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
<input name="userId" value="1234" type="hidden"> messsssssssssssssssssss
<input name="userid" value="12345" type="hidden"> messssssssssssssssssss
<input name="timestamp" value="88888888" type="hidden"> messssssssssssss
<input name="js" value="abc" type="hidden"> messssssssssssssssssssssssss
messssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss

quite happily into:

userId = 1234
userid = 12345
timestamp = 88888888
js = abc

The grep simply extracts the lines you want, while the sed commands respectively:

  • strip off up to the first quote.
  • replace the section between the name and value with an "=".
  • remove everything following the value closing quote (including that quote).
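
As a rough sketch, the grep and the three sed steps above can also be folded into a single sed process. The sample data is a shortened copy of the question's multi-line form, and the variable names `data` and `result` are made up for the illustration:

```shell
# Shortened, multi-line sample in the same shape as the question's data.
data='messsss
<input name="userId" value="1234" type="hidden"> messsss
<input name="js" value="abc" type="hidden"> messsss
messsss'

# -n suppresses automatic printing; the block only runs on lines that
# start with "<input ", applies the same three substitutions, then prints.
result=$(printf '%s\n' "$data" | sed -n '/^<input /{
s/^<input name="//
s/" value="/ = /
s/".*$//
p
}')

printf '%s\n' "$result"
```

Like the pipeline, this still assumes one tag per line with name directly followed by value, so it does not cover the single-line case from the edit.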

This part should probably be a comment on Pax's answer, but it got a bit long for that little box. I'm thinking 'single line data' means you don't have any newlines in your variable at all? Then this will work:

echo "$info" | sed -n -r '/<input/s/<input +name="([^"]+)" +value="([^"]+)"[^>]*>[^<]*/\1 = \2\n/gp'

Notes on interesting bits:

  • -n means don't print by default; we'll say when to print with that p at the end.

  • -r means extended regex

  • /<input/ at the beginning makes sure we don't even bother to work on lines that don't contain the desired pattern

  • That \n at the end is there to ensure all records end up on separate lines. Any original newlines will still be there, and the fastest way to get rid of them is to tack a '| grep .' on the end. You could use some sed magic, but you wouldn't be able to understand it thirty seconds after you typed it in.
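
One thing to watch for: the substitution starts matching at the first `<input`, so any text before that tag is left in place and ends up glued to the first output line. A minimal sketch that drops it first, assuming GNU sed (for `-r` and `\n` in the replacement) and a made-up single-line sample standing in for the real `$info`:

```shell
# Hypothetical single-line sample standing in for the real $info.
info='mess<input name="userId" value="1234" type="hidden">mess<input name="js" value="abc" type="hidden">mess'

# The first s command is an addition (not part of the one-liner above):
# it removes everything before the first "<". The second s command is the
# same substitution as above; grep . then drops the empty trailing line.
result=$(printf '%s\n' "$info" |
  sed -n -r '/<input/{
s/^[^<]*//
s/<input +name="([^"]+)" +value="([^"]+)"[^>]*>[^<]*/\1 = \2\n/g
p
}' | grep .)

printf '%s\n' "$result"
```

If the surrounding text contains other tags, `[^<]*` stops at them and some leftovers will still come through, so treat this as a starting point rather than a complete filter.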

I can think of ways to do this in awk, but this is really a job for sed (or perl!).

To handle a variable that contains multiple lines, you need to put the variable name in double quotes:

echo "$info"|sed 's/^\(<input\( \)name\(=\)"\([^"]*\)" value="\([^"]*\)"\)\?.*/\4\2\3\2\5/'

Using Perl:

cat file | perl -ne 'print($1 . " = " . $2 . "\n") if(/name="(.*?)".*value="(.*?)"/);'

IMO, parsing HTML should be done with a proper HTML/XML parser. For example, Ruby has an excellent package, Nokogiri, for parsing HTML/XML:

ruby -e '
    require "rubygems"
    require "nokogiri"
    doc = Nokogiri::HTML.parse(ARGF.read)
    doc.search("//input").each do |node|
        atts = node.attributes
        puts "%s = %s" % [atts["name"], atts["value"]]
    end
' mess.html

produces the output you're after

AWK:

BEGIN {
  # Use record separator "<", instead of "\n".
  RS = "<"
  first = 1
}

# Skip the first record, as that begins before the first tag
first {
  first = 0
  next
}

/^input[^>]*>/ { #/
  # make sure we don't match outside of the tag
  end = match($0,/>/)

  # locate the name attribute
  pos = match($0,/name="[^"]*"/)
  if (pos == 0 || pos > end) { next }
  name = substr($0,RSTART+6,RLENGTH-7)

  # locate the value attribute
  pos = match($0,/value="[^"]*"/)
  if (pos == 0 || pos > end) { next }
  value = substr($0,RSTART+7,RLENGTH-8)

  # print out the result
  print name " = " value
}
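
To try this against the single-line `$info` from the question, the script can be inlined into a shell pipeline. The version below is a condensed sketch of the same logic: it uses `NR == 1` in place of the `first` flag, and the sample data is made up:

```shell
# Hypothetical single-line sample standing in for the real $info.
info='mess<input name="userId" value="1234" type="hidden">mess<input name="js" value="abc" type="hidden">mess'

result=$(printf '%s\n' "$info" | awk '
BEGIN { RS = "<" }       # split records on "<" instead of newline
NR == 1 { next }         # skip the text before the first tag
/^input[^>]*>/ {
  end = match($0, />/)   # position where the tag closes
  pos = match($0, /name="[^"]*"/)
  if (pos == 0 || pos > end) next
  name = substr($0, RSTART + 6, RLENGTH - 7)
  pos = match($0, /value="[^"]*"/)
  if (pos == 0 || pos > end) next
  value = substr($0, RSTART + 7, RLENGTH - 8)
  print name " = " value
}')

printf '%s\n' "$result"
```

Because the record separator is `<`, it makes no difference whether the tags sit on one line or many.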

Tools like awk and sed can be used together with XMLStarlet and HTML Tidy to parse HTML.


 