How do I identify and NOT read in field codes in Docx4j?

Question

To get text from an object, currently I am using:

String someText = TextUtils.extractText(obj, stringWriter);

Where obj is usually a Run, but can really be anything. I am having an issue where I am reading in field codes such as:

 " PAGE   \* MERGEFORMAT "

when I really want to ignore it. Is there a way I can detect when a Text in a Run is a field code and ignore it?

Thanks

Answer 1

You could pre-process the fields before you run TextUtils.extractText.

One can imagine a little utility which you configure by saying, for each field-type, whether you wish to remove it entirely, or keep just the result (possibly updating it first).

docx4j doesn't include this right now, so below I sketch out what is involved.

Note that there are 2 types of fields: simple and complex; see further http://webapp.docx4java.org/OnlineDemo/ecma376/WordML/XML.html

There is code in docx4j for converting from simple to complex; see https://github.com/plutext/docx4j/blob/master/docx4j-core/src/main/java/org/docx4j/model/fields/FieldsPreprocessor.java

Once your fields are in the "complex" form, for example:

<w:r>
  <w:fldChar w:fldCharType="begin"/>
</w:r>

<w:r>
  <w:instrText xml:space="preserve"> DATE </w:instrText>
</w:r>

<w:r>
  <w:fldChar w:fldCharType="separate"/>
</w:r>

<w:r>
  <w:t>12/31/2005</w:t>
</w:r>

<w:r>
  <w:fldChar w:fldCharType="end"/>
</w:r>

You can remove them, keeping just the result (i.e. the runs between "separate" and "end") if you want it.
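To make the begin/separate/end pattern concrete, here is a minimal sketch that walks the raw WordprocessingML with the JDK's StAX parser instead of the docx4j object model. The class and method names (FieldCodeFilter, extractText, keepResults) are made up for illustration and are not part of docx4j. It assumes complex fields only, and that the field code lives in w:instrText while visible text lives in w:t.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not a docx4j class.
final class FieldCodeFilter {

    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /**
     * Returns the text of all w:t elements, never the w:instrText field codes.
     * If keepResults is false, field results (between "separate" and "end")
     * are dropped as well.
     */
    static String extractText(String xml, boolean keepResults) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, true);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));

        StringBuilder out = new StringBuilder();
        // One entry per open field; the entry becomes true once "separate" is seen.
        Deque<Boolean> openFields = new ArrayDeque<>();
        boolean inText = false;

        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && W_NS.equals(reader.getNamespaceURI())) {
                String name = reader.getLocalName();
                if (name.equals("fldChar")) {
                    String type = reader.getAttributeValue(W_NS, "fldCharType");
                    if ("begin".equals(type)) {
                        openFields.push(false);
                    } else if ("separate".equals(type) && !openFields.isEmpty()) {
                        openFields.pop();
                        openFields.push(true);
                    } else if ("end".equals(type) && !openFields.isEmpty()) {
                        openFields.pop();
                    }
                } else if (name.equals("t")) {
                    inText = true;
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && W_NS.equals(reader.getNamespaceURI())
                    && reader.getLocalName().equals("t")) {
                inText = false;
            } else if (event == XMLStreamConstants.CHARACTERS && inText) {
                boolean outsideFields = openFields.isEmpty();
                // Inside a field, text counts as a result only if every
                // enclosing field has already passed its "separate" marker.
                boolean inResult = !outsideFields && !openFields.contains(false);
                if (outsideFields || (keepResults && inResult)) {
                    out.append(reader.getText());
                }
            }
        }
        return out.toString();
    }
}
```

With the DATE example above, keepResults=true keeps "12/31/2005" and keepResults=false drops it; the " DATE " instruction is skipped either way. The same state machine should carry over to a traversal of docx4j's JAXB objects, but that mapping is left as an exercise here.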

The representation docx4j creates is actually a bit easier to work with than the example above; see https://github.com/plutext/docx4j/blob/master/docx4j-core/src/main/java/org/docx4j/model/fields/FieldRef.java

Note that there are quite a few different fields, see http://webapp.docx4java.org/OnlineDemo/ecma376/WordML/file_2.html

You'll want to know which ones are in your documents, and how you want to handle each of them. For example, you might wish to remove a PAGE field entirely, but for a MERGEFIELD you may want to keep the result. If you need to update it first, see https://github.com/plutext/docx4j/blob/master/docx4j-samples-docx4j/src/main/java/org/docx4j/samples/FieldsMailMerge.java
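A per-field-type decision like this can be sketched as a small lookup keyed on the first token of the field instruction. Everything here (FieldPolicies, Action, the default of keeping the result for unlisted field types) is a hypothetical illustration, not something docx4j provides.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not a docx4j class.
final class FieldPolicies {

    enum Action { REMOVE, KEEP_RESULT }

    // Example policy only; adjust to the fields found in your documents.
    private static final Map<String, Action> POLICY = Map.of(
            "PAGE", Action.REMOVE,
            "MERGEFIELD", Action.KEEP_RESULT);

    /** Extracts the field type, e.g. " PAGE   \* MERGEFORMAT " gives "PAGE". */
    static String fieldType(String instr) {
        String trimmed = instr.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        return trimmed.split("\\s+")[0].toUpperCase(Locale.ROOT);
    }

    /** Looks up the action; unlisted field types keep their result (an assumption). */
    static Action actionFor(String instr) {
        return POLICY.getOrDefault(fieldType(instr), Action.KEEP_RESULT);
    }
}
```

The instruction string passed in would be the concatenated w:instrText content of one field.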

Here is how just the result is kept in the MERGEFIELD case: https://github.com/plutext/docx4j/blob/master/docx4j-core/src/main/java/org/docx4j/model/fields/merge/MailMerger.java#L590

It's that easy because, at that point, the XML is in a known, predictable pattern.

For DOCPROPERTY and DOCVARIABLE field processing examples, see https://github.com/plutext/docx4j/blob/master/docx4j-samples-docx4j/src/main/java/org/docx4j/samples/FieldUpdaterExample.java
