
Parsing XML via Command Line

I have an XML file that I want to parse from a Bash script using xmlstarlet (or an alternative, if someone can give me an example).

The basic structure is this:

<character>
  <literal>恵</literal>
  <misc>
    <stroke_count>10</stroke_count>
  </misc>
  <reading_meaning>
    <rmgroup>
      <reading r_type="ja_on">ケイ</reading>
      <reading r_type="ja_on">エ</reading>
      <reading r_type="ja_kun">めぐ.む</reading>
      <reading r_type="ja_kun">めぐ.み</reading>
      <meaning>favor</meaning>
      <meaning>blessing</meaning>
      <meaning>grace</meaning>
      <meaning>kindness</meaning>
    </rmgroup>
  </reading_meaning>
</character>

There are some other fields as well, and the number of readings and meanings varies from entry to entry. Basically, I'd like to extract all of the readings, meanings, stroke count, etc. and generate an HTML table with Bash.

The file is also large, with many characters that need looking up, so I'd like a script that takes the character as $1 and uses it to look up the values based on the tag. Ideally it would be:

kanjilookup.sh 恵

And then generate an HTML table based on the content.

Thoughts? (I'd also be open to using another program, such as xpath.)

As @thatotherguy suggested, you'll probably want to do this with something like XSLT instead of Bash. You can parse XML with Bash, but it's probably going to get tricky pretty quickly.

Following @thatotherguy's suggestion, you could have an XSLT stylesheet that looks something like this:

<?xml version="1.0" encoding="iso-8859-1"?>
<!-- kanjilookup.xsl -->

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="character"/>
  <xsl:output method="html" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!--
  From https://stackoverflow.com/questions/9611569/xsl-how-do-you-capitalize-first-letter
  -->

  <xsl:variable name="vLower" select="'abcdefghijklmnopqrstuvwxyz'"/>
  <xsl:variable name="vUpper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>

  <xsl:template name="capitalize">
    <xsl:param name="string"/>

    <xsl:value-of select=
    "concat(translate(substring(
            $string, 1, 1), $vLower, $vUpper),
            substring($string, 2)
           )
    "/>
  </xsl:template>

  <xsl:template match="/">
    <xsl:if test="string-length($character) = 0 or not(//literal[. = $character])">
      <xsl:message terminate="yes">ERR: No input character given, or character not found.</xsl:message>
    </xsl:if>
    <xsl:apply-templates select="characters/character[literal[. = $character]]"/>
  </xsl:template>

  <xsl:template match="character">
    <xsl:text disable-output-escaping='yes'>&lt;!DOCTYPE html>
</xsl:text>

    <html>
      <head/>
      <body>
        <table>
          <tbody>
            <xsl:apply-templates/>
          </tbody>
        </table>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="literal">
    <caption>
      <xsl:value-of select="."/>
    </caption>
  </xsl:template>

  <xsl:template match="stroke_count">
    <tr>
      <td>
        <xsl:call-template name="capitalize">
          <xsl:with-param name="string" select="translate(local-name(), '_', ' ')"/>
        </xsl:call-template>
      </td>
      <td><xsl:value-of select="."/></td>
    </tr>
  </xsl:template>

  <xsl:template match="misc | reading_meaning | rmgroup">
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="reading | meaning">
    <tr>
      <td>
        <xsl:call-template name="capitalize">
          <xsl:with-param name="string" select="local-name()"/>
        </xsl:call-template>
        <xsl:apply-templates select="@r_type"/>
      </td>
      <td>
        <xsl:value-of select="."/>
      </td>
    </tr>
  </xsl:template>

  <xsl:template match="@r_type">
    <xsl:value-of select="concat(' ', '(', ., ')')"/>
  </xsl:template>
</xsl:stylesheet>

Let's say you have a file called characters.xml :

<characters>
  <character>
    <literal>恵</literal>
    <misc>
      <stroke_count>10</stroke_count>
    </misc>
    <reading_meaning>
      <rmgroup>
        <reading r_type="ja_on">ケイ</reading>
        <reading r_type="ja_on">エ</reading>
        <reading r_type="ja_kun">めぐ.む</reading>
        <reading r_type="ja_kun">めぐ.み</reading>
        <meaning>favor</meaning>
        <meaning>blessing</meaning>
        <meaning>grace</meaning>
        <meaning>kindness</meaning>
      </rmgroup>
    </reading_meaning>
  </character>
</characters>

You could run kanjilookup.xsl on it with XMLStarlet like this:

xml tr kanjilookup.xsl -s character=恵 characters.xml

That'll produce an HTML table that looks like this (after pretty-printing):

<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <table>
      <tbody>
        <caption>恵</caption>
        <tr>
          <td>Stroke count</td>
          <td>10</td>
        </tr>
        <tr>
          <td>Reading (ja_on)</td>
          <td>ケイ</td>
        </tr>
        <tr>
          <td>Reading (ja_on)</td>
          <td>エ</td>
        </tr>
        <tr>
          <td>Reading (ja_kun)</td>
          <td>めぐ.む</td>
        </tr>
        <tr>
          <td>Reading (ja_kun)</td>
          <td>めぐ.み</td>
        </tr>
        <tr>
          <td>Meaning</td>
          <td>favor</td>
        </tr>
        <tr>
          <td>Meaning</td>
          <td>blessing</td>
        </tr>
        <tr>
          <td>Meaning</td>
          <td>grace</td>
        </tr>
        <tr>
          <td>Meaning</td>
          <td>kindness</td>
        </tr>
      </tbody>
    </table>
  </body>
</html>

You'd have to modify the XSLT stylesheet to suit your needs, of course.
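
If neither XMLStarlet nor another XSLT processor is at hand, the same lookup can be sketched with Python's standard library. This is only a minimal sketch and not part of the original answer: the function names find_character, render_table, and lookup_html are made up for illustration, and the code assumes the element layout shown in characters.xml above. It streams the input with iterparse because the question mentions a large file, and it puts <caption> directly under <table> rather than inside <tbody>.

```python
import io
import xml.etree.ElementTree as ET
from html import escape


def find_character(source, literal):
    """Return the first <character> whose <literal> equals `literal`, or None.

    `source` is a filename or a binary file object. iterparse reads the
    document incrementally, so entries that do not match can be emptied
    as soon as they have been checked.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "character":
            if elem.findtext("literal") == literal:
                return elem
            elem.clear()  # drop the children of entries we do not need
    return None


def render_table(char):
    """Build an HTML table (as a string) from one <character> element."""
    rows = [("Stroke count", char.findtext("misc/stroke_count", default=""))]
    for reading in char.iter("reading"):
        rows.append((f"Reading ({reading.get('r_type', '')})", reading.text or ""))
    for meaning in char.iter("meaning"):
        rows.append(("Meaning", meaning.text or ""))

    caption = escape(char.findtext("literal", default=""))
    lines = ["<table>", f"  <caption>{caption}</caption>", "  <tbody>"]
    for label, value in rows:
        lines.append(f"    <tr><td>{escape(label)}</td><td>{escape(value)}</td></tr>")
    lines += ["  </tbody>", "</table>"]
    return "\n".join(lines)


def lookup_html(xml_bytes, literal):
    """Convenience wrapper for in-memory XML: returns the table or None."""
    char = find_character(io.BytesIO(xml_bytes), literal)
    return render_table(char) if char is not None else None
```

To read from disk instead, pass the filename, e.g. find_character("characters.xml", "恵"), and call render_table on the result when it is not None.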

Nowadays, with XQuery available, there is no reason to use XSLT anymore; XQuery is much nicer.

E.g., with my XQuery interpreter, Xidel, you can run the query directly, without an additional file, like this:

xidel --printed-node-format xml characters.xml -e "(character:='恵')[2]"  -e - <<<'xquery version "1.0";
(<title>{$character}</title>, 
for $char in //character[literal eq $character] return
  <table>
    <tbody>
      <caption>{$character}</caption>
      <tr>
        <td>Stroke count</td>
        <td>{$char/misc/stroke_count/text()}</td>
      </tr>
      { for $reading in $char//rmgroup/reading return 
        <tr>
          <td>Reading ({$reading/@r_type/data(.)})</td>
          <td>{$reading/text()}</td>
        </tr> } 
      { for $meaning in $char//rmgroup/meaning return 
         <tr>
           <td>Meaning</td>
           <td>{$meaning/text()}</td>
         </tr> } 
   </tbody>
  </table>
)
'

This creates a table similar to the one in the XSLT answer (but you need to prepend <?xml version="1.0" encoding="utf-8"?> to the characters.xml posted there).
