XSLT for Word Documents

I am working on a project where I need to send users Word documents that are generated by a script on a Linux machine. The documents are stored as .docx files and contain markers (e.g. ${Firstname}) that the script replaces.

I cannot use Word on this Linux machine. I can only use xsltproc, which supports XSLT 1.0 only, and that makes grouping much harder.

The script that I have written works fine for most Word-documents, but in some cases Word spreads out a single sentence, or even a word, across multiple <w:t> tags when there is no change in styling.

Because of this I'm trying to figure out a way to merge the text (<w:t>) of consecutive runs (<w:r>) when their styling is exactly the same.

Here is some sample input. Based on the comments below I have sanitised it a bit, but I'm not trying to hide that this is Word-generated XML.

  <w:body>
    <w:p>
      <w:r>
        <w:rPr>
          <w:rFonts w:ascii="Arial" w:eastAsia="Times New Roman" w:hAnsi="Arial" w:cs="Arial"/>
          <w:sz w:val="20"/>
          <w:szCs w:val="20"/>
        </w:rPr>
        <w:t>{if}${Dossier.Person.City.city}==”New York”{then}HOMECITY!{else}Far away{</w:t>
      </w:r>
      <w:proofErr w:type="spellStart"/>
      <w:r>
        <w:rPr>
          <w:rFonts w:ascii="Arial" w:eastAsia="Times New Roman" w:hAnsi="Arial" w:cs="Arial"/>
          <w:sz w:val="20"/>
          <w:szCs w:val="20"/>
        </w:rPr>
        <w:t>endif</w:t>
      </w:r>
      <w:proofErr w:type="spellEnd"/>
      <w:r>
        <w:rPr>
          <w:rFonts w:ascii="Arial" w:eastAsia="Times New Roman" w:hAnsi="Arial" w:cs="Arial"/>
          <w:sz w:val="20"/>
          <w:szCs w:val="20"/>
        </w:rPr>
        <w:t>}</w:t>
      </w:r>
    </w:p>
    <w:sectPr>
      <w:pgSz/>
      <w:pgMar w:top="1417" w:right="1417" w:bottom="1417" w:left="1417" w:header="708" w:footer="708" w:gutter="0"/>
      <w:cols w:space="708"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:body>

What I would like to achieve is this:

  • Remove all <w:proofErr /> elements. This I can do easily with my XSLT.

But then, I would basically like to do:

  • iterate over all <w:p> elements
  • if they contain consecutive runs (<w:r>) whose styling (<w:rPr>) is exactly the same, create just one run, with the styling once, and merge all the text (<w:t>)
  • keep everything else in the XML

So my desired end result in this case would be:

  <w:body>
    <w:p>
      <w:r>
        <w:rPr>
          <w:rFonts w:ascii="Arial" w:eastAsia="Times New Roman" w:hAnsi="Arial" w:cs="Arial"/>
          <w:sz w:val="20"/>
          <w:szCs w:val="20"/>
        </w:rPr>
        <w:t>{if}${Dossier.Person.City.city}==”New York”{then}HOMECITY!{else}Far away{endif}</w:t>
      </w:r>
    </w:p>
    <w:sectPr>
      <w:pgSz/>
      <w:pgMar w:top="1417" w:right="1417" w:bottom="1417" w:left="1417" w:header="708" w:footer="708" w:gutter="0"/>
      <w:cols w:space="708"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:body>

This is how far I have come. I don't know how to compare the exact values inside <w:rPr>, so my stylesheet simply takes the first <w:rPr> of the paragraph, and any style changes inside the paragraph are lost.

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">

    <xsl:output method="xml" encoding="utf-8" indent="yes"/>

    <!-- Identity template : copy all text nodes, elements and attributes -->   
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" />
        </xsl:copy>
    </xsl:template>

    <!-- Ignore w:proofErr nodes -->
    <xsl:template match="w:proofErr" />

    <!-- w:r nodes are processed in the for-each loop -->
    <xsl:template match="w:r"/>

    <xsl:template match="w:p">
      <xsl:element name="w:p">
        <xsl:apply-templates select="@*|node()"/>
        <xsl:element name="w:r">
          <xsl:copy-of select="w:r[1]/w:rPr"/> 
          <xsl:element name="w:t">
            <xsl:for-each select="w:r">
              <xsl:for-each select="w:t">
                <xsl:value-of select="."/>
              </xsl:for-each>
            </xsl:for-each>
          </xsl:element>
        </xsl:element>
      </xsl:element>
    </xsl:template>

</xsl:stylesheet>

I had tried to figure out various ways of de-duplication before I posted, but based on the kind comments I have looked again into Muenchian grouping. I still don't understand how I could use this here.

It is fine if several runs within a paragraph have exactly the same <w:rPr>, as long as they are separated by a run whose <w:rPr> differs. Only consecutive runs need to be merged.
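
As an illustration of this rule, consider a made-up paragraph (the bold and italic styling, the text, and the ${Lastname} marker are hypothetical and not taken from the real document). The first two runs are adjacent and share the same <w:rPr>, so they would be merged. The last run has the same styling again but follows an italic run, so it would stay separate:

```xml
<!-- Input (hypothetical) -->
<w:p>
  <w:r><w:rPr><w:b/></w:rPr><w:t>${First</w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t>name}</w:t></w:r>
  <w:r><w:rPr><w:i/></w:rPr><w:t xml:space="preserve"> and </w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t>${Lastname}</w:t></w:r>
</w:p>

<!-- Wanted output: runs 1 and 2 merged, runs 3 and 4 untouched -->
<w:p>
  <w:r><w:rPr><w:b/></w:rPr><w:t>${Firstname}</w:t></w:r>
  <w:r><w:rPr><w:i/></w:rPr><w:t xml:space="preserve"> and </w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t>${Lastname}</w:t></w:r>
</w:p>
```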

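You can do this with sibling recursion: start at a run, pull in the text of the following runs that have the same styling, and emit a single merged run. Something like this: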

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  
  <xsl:output method="xml" encoding="utf-8" indent="yes"/>
  
  <!-- Identity template : copy all text nodes, elements and attributes -->   
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>
  
  <!-- Ignore w:proofErr nodes -->
  <xsl:template match="w:proofErr" />
  
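  <!-- Serialise a w:rPr subtree to a string. XSLT 1.0 has no deep-equal(),
       and w:rPr = $rPr only compares string values, which are empty here
       because the formatting lives in attributes. -->
  <xsl:template match="*" mode="sig">
    <xsl:value-of select="concat('&lt;', name())"/>
    <xsl:for-each select="@*">
      <xsl:sort select="name()"/>
      <xsl:value-of select="concat(' ', name(), '=&quot;', ., '&quot;')"/>
    </xsl:for-each>
    <xsl:text>&gt;</xsl:text>
    <xsl:apply-templates select="*" mode="sig"/>
    <xsl:text>&lt;/&gt;</xsl:text>
  </xsl:template>

  <!-- A run starts a group unless the nearest preceding element sibling
       (w:proofErr ignored) is a run with the same style signature.
       Only w:rPr and w:t are carried over; other run content is dropped. -->
  <xsl:template match="w:r">
    <xsl:variable name="sig"><xsl:apply-templates select="w:rPr" mode="sig"/></xsl:variable>
    <xsl:variable name="prev" select="preceding-sibling::*[not(self::w:proofErr)][1][self::w:r]"/>
    <xsl:variable name="prevSig"><xsl:apply-templates select="$prev/w:rPr" mode="sig"/></xsl:variable>
    <xsl:if test="not($prev) or string($sig) != string($prevSig)">
      <xsl:copy>
        <xsl:apply-templates select="@*"/>
        <xsl:copy-of select="w:rPr"/>
        <w:t>
          <!-- keep leading/trailing spaces of the merged text -->
          <xsl:attribute name="xml:space">preserve</xsl:attribute>
          <xsl:apply-templates select="." mode="text">
            <xsl:with-param name="sig" select="string($sig)"/>
          </xsl:apply-templates>
        </w:t>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- Emit this run's text, then walk on to the next run while the style matches. -->
  <xsl:template match="w:r" mode="text">
    <xsl:param name="sig"/>
    <xsl:variable name="own"><xsl:apply-templates select="w:rPr" mode="sig"/></xsl:variable>
    <xsl:if test="string($own) = $sig">
      <xsl:for-each select="w:t"><xsl:value-of select="."/></xsl:for-each>
      <xsl:apply-templates select="following-sibling::*[not(self::w:proofErr)][1][self::w:r]" mode="text">
        <xsl:with-param name="sig" select="$sig"/>
      </xsl:apply-templates>
    </xsl:if>
  </xsl:template>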

</xsl:stylesheet>
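
To check what the stylesheet above did to a document, it can help to list the runs before and after the transformation. The following is only a minimal sketch for that purpose and not part of the answer's stylesheet; the file name list-runs.xsl used below is hypothetical. It prints one line per run with the paragraph number, the run number, and the run's text in brackets:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">

  <xsl:output method="text" encoding="utf-8"/>

  <!-- One output line per w:r that is a direct child of a w:p:
       paragraph number, run number, then the concatenated w:t text. -->
  <xsl:template match="/">
    <xsl:for-each select="//w:p">
      <xsl:variable name="p" select="position()"/>
      <xsl:for-each select="w:r">
        <xsl:value-of select="concat('p', $p, ' r', position(), ' [')"/>
        <xsl:for-each select="w:t"><xsl:value-of select="."/></xsl:for-each>
        <xsl:text>]&#10;</xsl:text>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Assuming the file is saved as list-runs.xsl, it can be run as xsltproc list-runs.xsl word/document.xml on the unzipped docx. A marker that Word has split across runs shows up as several short lines, and after merging it should appear on a single line.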
