
Use OpenXML to replace text in DOCX file - strange content

I'm trying to use the OpenXML SDK and the samples on Microsoft's pages to replace placeholders with real content in Word documents.

It used to work as described here, but after I edited the template file in Word and added headers and footers, it stopped working. I wondered why, and some debugging showed me this:

[Screenshot: debugger view of the texts list; image not available]

Which is the content of texts in this piece of code:

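using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
// DocumentFile: path to the .docx template; true opens it for editing.
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(DocumentFile, true))
{
    var texts = wordDoc.MainDocumentPart.Document.Body.Descendants<Text>().ToList();
}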

So what I see here is that the body of the document is "fragmented", even though in Word the content looks like this:

[Screenshot: the same content as displayed in Word; image not available]

Can somebody tell me how I can get around this?


I have been asked what I'm trying to achieve. Basically, I want to replace user-defined "placeholders" with real content, treating the Word document like a template. The placeholders can be anything. In my example above they look like {var:Template1}, but that's just something I'm playing with. A placeholder could be any word.

So for example if the document contains the following paragraph:

Do not use the name USER_NAME

The user should be able to replace the USER_NAME placeholder with the word admin for example, keeping the formatting intact. The result should be

Do not use the name admin

The problem I see with working at the paragraph level (concatenating the content and then replacing the content of the paragraph) is that I fear losing formatting that should be kept, as in

Do not use the name admin

Various things can fragment text runs. The most frequent causes are proofing markup (apparently the case here, where there are "squigglies"), rsid attributes (used to compare documents and track who edited what, when), and the "Go back" bookmark (_GoBack) that Word sets in the background. These become readily apparent if you view the underlying WordOpenXML of the document.xml "part" (using the Open XML SDK Productivity Tool, for example).
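As a rough way to see this fragmentation from code rather than from the Productivity Tool, the sketch below lists the direct children of a paragraph. The class and method names (RunInspector, ChildNames) are made up for illustration; only the Open XML SDK types are real.

```csharp
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper, not part of the Open XML SDK.
public static class RunInspector
{
    // Returns the local names of a paragraph's direct children, for example
    // "r" for a run, "proofErr" for proofing markup, "bookmarkStart" for a bookmark.
    public static List<string> ChildNames(Paragraph paragraph)
    {
        return paragraph.ChildElements.Select(child => child.LocalName).ToList();
    }
}
```

Several "r" entries, possibly with "proofErr" or "bookmarkStart" between them, where Word shows one unbroken word means the text has been split. The rsid values are attributes on the runs, so they only show up when you print paragraph.OuterXml.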

It usually helps to go one element level "higher". In this case, get the list of Paragraph descendants, and for each paragraph get all of its Text descendants and concatenate their InnerText.
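A minimal sketch of that paragraph-level read, assuming the Open XML SDK is referenced. ParagraphReader, ParagraphText and ReadBodyParagraphs are illustrative names, not SDK members.

```csharp
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper, not part of the Open XML SDK.
public static class ParagraphReader
{
    // Joins the text of every <w:t> inside one paragraph, so a placeholder
    // that Word split across several runs becomes one searchable string.
    public static string ParagraphText(Paragraph paragraph)
    {
        return string.Concat(paragraph.Descendants<Text>().Select(t => t.Text));
    }

    // Opens the document read-only and returns one string per body paragraph.
    public static List<string> ReadBodyParagraphs(string path)
    {
        using (var wordDoc = WordprocessingDocument.Open(path, false))
        {
            return wordDoc.MainDocumentPart.Document.Body
                .Descendants<Paragraph>()
                .Select(ParagraphText)
                .ToList();
        }
    }
}
```

This is enough to find a placeholder reliably. Writing the replacement back without flattening the runs is a separate step, because overwriting the whole paragraph with the joined string would discard per-run formatting.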

OpenXML is indeed fragmenting your text.

I created a library that does exactly this: it renders a Word template with the values from a JSON object.

From the documentation of docxtemplater:

Why you should use a library for this

Docx is a zipped format that contains some XML. If you want to build a simple "replace {tag} by value" system, it can already become complicated, because the {tag} is internally separated into <w:t>{</w:t><w:t>tag</w:t><w:t>}</w:t>. If you want to embed loops to iterate over an array, it becomes a real hassle.

The library basically does the following to keep formatting:

If the text is:

<w:t>Hello</w:t>
<w:t>{name</w:t>
<w:t>} !</w:t>
<w:t>How are you ?</w:t>

The result would be:

<w:t>Hello</w:t>
<w:t>John !</w:t>
<w:t>How are you ?</w:t>

You also have to replace the <w:t> tag with <w:t xml:space="preserve"> to ensure that spaces are not stripped out if there are any in your variables.
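The same idea can be sketched in C# with the Open XML SDK. This is not docxtemplater's code (that library is JavaScript), only an illustration of the merge described above under the assumption that the placeholder stays within one paragraph. PlaceholderReplacer and ReplaceInParagraph are hypothetical names.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper, not part of the Open XML SDK.
public static class PlaceholderReplacer
{
    // Replaces each occurrence of placeholder inside one paragraph, even when
    // it is split across several <w:t> elements. Returns the number of replacements.
    public static int ReplaceInParagraph(Paragraph paragraph, string placeholder, string replacement)
    {
        if (string.IsNullOrEmpty(placeholder))
            throw new ArgumentException("Placeholder must not be empty.", nameof(placeholder));

        int count = 0;
        int searchFrom = 0;
        while (true)
        {
            var texts = paragraph.Descendants<Text>().ToList();
            string joined = string.Concat(texts.Select(t => t.Text));
            int start = joined.IndexOf(placeholder, searchFrom, StringComparison.Ordinal);
            if (start < 0) return count;
            int end = start + placeholder.Length; // exclusive

            int offset = 0;
            bool inserted = false;
            foreach (var t in texts)
            {
                string original = t.Text;
                int tStart = offset;
                int tEnd = offset + original.Length;
                offset = tEnd;
                if (tEnd <= start || tStart >= end) continue; // no overlap with the match

                // Cut the matched characters out of this element; the first
                // overlapping element also receives the replacement text.
                int cutFrom = Math.Max(start, tStart) - tStart;
                int cutTo = Math.Min(end, tEnd) - tStart;
                t.Text = original.Substring(0, cutFrom)
                       + (inserted ? string.Empty : replacement)
                       + original.Substring(cutTo);
                t.Space = SpaceProcessingModeValues.Preserve; // xml:space="preserve"
                inserted = true;
            }

            count++;
            searchFrom = start + replacement.Length; // skip past the inserted text
        }
    }
}
```

Because only the Text contents change, every run keeps its own properties, and the replacement takes on the formatting of the run in which the placeholder starts. Runs that end up empty are left in place; removing them is optional cleanup.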
