Awk pattern matching

Question

I want to print

userId = 1234
userid = 12345
timestamp = 88888888
js = abc

from my data

messssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
<input name="userId" value="1234" type="hidden"> messsssssssssssssssssss
<input name="userid" value="12345" type="hidden"> messssssssssssssssssss
<input name="timestamp" value="88888888" type="hidden"> messssssssssssss
<input name="js" value="abc" type="hidden"> messssssssssssssssssssssssss
messssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss

How can I do this with AWK (or whatever)? Assume that my data is stored in the "$info" variable (single line data).

Edit: by "single line data" I mean that all the data is on one line, like this:

messss...<input name="userId" value="1234" type="hidden">messsss...<input ....>messssssss

So I can't use grep to extract the section I'm interested in.

Answer 1

I'm not sure I understand your "single line data" comment but if this is in a file, you can just do something like:

cat file |
    grep '^<input ' |
    sed 's/^<input name="//' |
    sed 's/" value="/ = /' |
    sed 's/".*$//'

Here's the cut'n'paste version:

cat file | grep '^<input ' | sed 's/^<input name="//' | sed 's/" value="/ = /' | sed 's/".*$//'

This turns:

messssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
<input name="userId" value="1234" type="hidden"> messsssssssssssssssssss
<input name="userid" value="12345" type="hidden"> messssssssssssssssssss
<input name="timestamp" value="88888888" type="hidden"> messssssssssssss
<input name="js" value="abc" type="hidden"> messssssssssssssssssssssssss
messssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss

quite happily into:

userId = 1234
userid = 12345
timestamp = 88888888
js = abc

The grep simply extracts the lines you want, while the sed commands respectively:

strip off up to the first quote.
replace the section between the name and value with an "=".
remove everything following the value closing quote (including that quote).
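
The three substitutions can also be handed to a single sed invocation with -e, which saves a couple of processes. A minimal sketch, assuming the same layout as above (each <input> tag starts its own line, with name before value); the function name extract_inputs is only an illustrative label, not part of the original answer:

```shell
# Hypothetical helper: reads HTML on stdin, prints "name = value" lines.
# Assumes each <input> tag starts its own line, with name before value.
extract_inputs() {
    grep '^<input ' |
        sed -e 's/^<input name="//' \
            -e 's/" value="/ = /' \
            -e 's/".*$//'
}

# Small sample in the same shape as the question's multi-line data.
printf '%s\n' \
    'messsss' \
    '<input name="userId" value="1234" type="hidden"> messsss' \
    '<input name="js" value="abc" type="hidden"> messsss' |
    extract_inputs
```

Like the pipeline above, this relies on the line-per-tag layout and will not help when everything sits on one line.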

Answer 2

This part should probably be a comment on Pax's answer, but it got a bit long for that little box. I'm thinking 'single line data' means you don't have any newlines in your variable at all? Then this will work:

echo "$info" | sed -n -r '/<input/s/<input +name="([^"]+)" +value="([^"]+)"[^>]*>[^<]*/\1 = \2\n/gp'

Notes on interesting bits:

-n means don't print by default; we say when to print with that p at the end.
-r means extended regex.
/<input/ at the beginning makes sure we don't even bother to work on lines that don't contain the desired pattern.
That \n at the end is there to ensure all records end up on separate lines. Any original newlines will still be there, and the fastest way to get rid of them is to tack a '| grep .' on the end. You could use some sed magic, but you wouldn't be able to understand it thirty seconds after you typed it in.
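
One thing to watch for with the single-line layout: a substitution only replaces what it matches, so any text in front of the first <input stays glued to the first result. A possible way around that, sketched here on the assumption that your grep supports -o (GNU and BSD grep do, but it is not in POSIX) and your sed supports -E, is to pull each tag onto its own line first and rewrite it afterwards. The function name tags_to_pairs and the sample value of info are made up for this example:

```shell
# Hypothetical helper: reads HTML on stdin, prints one "name = value" line per <input> tag.
# Assumes name comes before value inside the tag, as in the question's data.
tags_to_pairs() {
    # grep -o prints each matching tag on a line of its own;
    # sed -n ... p prints only the tags it managed to rewrite.
    grep -o '<input[^>]*>' |
        sed -n -E 's/<input +name="([^"]+)" +value="([^"]+)".*/\1 = \2/p'
}

info='messss<input name="userId" value="1234" type="hidden">messss<input name="js" value="abc" type="hidden">messss'
printf '%s\n' "$info" | tags_to_pairs
```

Because every tag is isolated before sed sees it, there is no surrounding text left to clean up and no need for the trailing '| grep .'.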

I can think of ways to do this in awk, but this is really a job for sed (or perl!).

Answer 3

To handle a variable that contains multiple lines, you need to put the variable name in double quotes:

echo "$info"|sed 's/^\(<input\( \)name\(=\)"\([^"]*\)" value="\([^"]*\)"\)\?.*/\4\2\3\2\5/'
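
The effect of the quotes is easy to check. In this small sketch (assuming a POSIX-style shell such as sh or bash; zsh does not split unquoted variables by default), the sample variable holds two lines, and only the quoted expansion passes both of them on to the next command. The variable names are illustrative:

```shell
# Sample two-line value standing in for the real $info.
info='<input name="userId" value="1234" type="hidden">
<input name="js" value="abc" type="hidden">'

# Unquoted: the shell splits the value on whitespace, so echo joins everything onto one line.
unquoted_lines=$(echo $info | awk 'END { print NR }')

# Quoted: the embedded newline survives and awk sees two lines.
quoted_lines=$(echo "$info" | awk 'END { print NR }')

echo "unquoted: $unquoted_lines line(s), quoted: $quoted_lines line(s)"
```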

Answer 4

Using Perl:

cat file | perl -ne 'print($1 . "=" . $2 . "\n") if(/name="(.*?)".*value="(.*?)"/);'

Answer 5

IMO, parsing HTML should be done with a proper HTML/XML parser. For example, Ruby has an excellent package, Nokogiri, for parsing HTML/XML:

ruby -e '
    require "rubygems"
    require "nokogiri"
    doc = Nokogiri::HTML.parse(ARGF.read)
    doc.search("//input").each do |node|
        atts = node.attributes
        puts "%s = %s" % [atts["name"], atts["value"]]
    end
' mess.html

produces the output you're after

Answer 6

AWK:

BEGIN {
  # Use record separator "<", instead of "\n".
  RS = "<"
  first = 1
}

# Skip the first record, as that begins before the first tag
first {
  first = 0
  next
}

/^input[^>]*>/ { #/
  # make sure we don't match outside of the tag
  end = match($0,/>/)

  # locate the name attribute
  pos = match($0,/name="[^"]*"/)
  if (pos == 0 || pos > end) { next }
  name = substr($0,RSTART+6,RLENGTH-7)

  # locate the value attribute
  pos = match($0,/value="[^"]*"/)
  if (pos == 0 || pos > end) { next }
  value = substr($0,RSTART+7,RLENGTH-8)

  # print out the result
  print name " = " value
}
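
For the single-line case from the question, the same idea can be run straight from the shell. This is a condensed sketch of the script above rather than a copy of it: it cuts each record down to the tag itself before matching, which stands in for the pos > end checks. The function name extract_inputs_awk and the sample value of info are assumptions made for the example:

```shell
# Hypothetical wrapper: reads HTML on stdin, prints "name = value" per <input> tag.
extract_inputs_awk() {
    awk '
        BEGIN { RS = "<" }      # split records at each "<" instead of at newlines
        NR == 1 { next }        # skip the text that precedes the first tag
        /^input[^>]*>/ {
            # keep only the tag itself, up to and including ">"
            tag = substr($0, 1, index($0, ">"))
            if (!match(tag, /name="[^"]*"/)) next
            name = substr(tag, RSTART + 6, RLENGTH - 7)
            if (!match(tag, /value="[^"]*"/)) next
            value = substr(tag, RSTART + 7, RLENGTH - 8)
            print name " = " value
        }
    '
}

info='messss<input name="userId" value="1234" type="hidden">messss<input name="timestamp" value="88888888" type="hidden">messss'
printf '%s\n' "$info" | extract_inputs_awk
```

Since the record separator is "<" rather than a newline, it makes no difference whether the tags are spread over several lines or packed into one.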

Answer 7

Tools such as awk and sed can be used together with XMLStarlet and HTML Tidy to parse HTML.

7 answers

solution1
4 2009-09-22 14:20:02

solution2
3 ACCEPTED 2009-09-22 14:54:48

solution3
2 2009-09-22 14:24:52

solution4
2 2009-09-22 15:34:02

solution5
1 2009-09-22 16:33:44

solution6
0 2009-09-22 23:15:59

solution7
0 2009-09-22 23:22:30
