Missing document text when using python-docx

Question

I am using python-docx 0.8.6 and Python 3.6 to perform a simple search/replace operation.

I'm having a problem where not all of the document's text appears when iterating over doc.paragraphs.

For debugging I have tried:

from docx import Document

doc = Document(input_file)  # input_file: path to the .docx file
fullText = []
for para in doc.paragraphs:
    fullText.append(para.text)
print('\n'.join(fullText))

This only seems to print out about half of the file's contents.

There are no tables or special formatting in the file. Is there any reason why so much of the document's contents cannot be read by python-docx?

Edit: the missing text is contained within a mail merge field, if that makes any difference.

Answer 1

The mail merge field does make a difference. Unfortunately, python-docx is not sophisticated enough to know which "container" elements hold displayable text and which do not. So it only reports paragraphs (and tables) that are at the "top" level.

This is also a limitation when it comes to revision marks, for example, which have two or more pieces of text of which only one appears, depending on the revision marks setting (show original, show latest after edits, etc.).

The only way around it with python-docx is to navigate the XML yourself, although some of the domain objects in python-docx, such as Paragraph, can be handy once you've gotten hold of the elements you want.
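As a rough illustration of navigating the XML yourself, here is a minimal sketch that bypasses python-docx and reads word/document.xml with the standard library only. It assumes the missing text lives in w:t elements nested below the paragraph (for example inside a w:fldSimple mail merge wrapper); the function name all_paragraph_text is made up for this example and is not part of any library.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def all_paragraph_text(docx_file):
    """Return the text of every w:p element, at any nesting depth.

    docx_file may be a path or a binary file-like object.
    (Hypothetical helper for illustration, not a python-docx API.)
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    texts = []
    for para in root.iter(W + "p"):
        # Gather w:t from all descendants, so runs wrapped in container
        # elements such as w:fldSimple are included as well.
        texts.append("".join(t.text or "" for t in para.iter(W + "t")))
    return texts
```

Calling print('\n'.join(all_paragraph_text(input_file))) should then show the field text too. Two caveats: a paragraph nested inside another paragraph (a text box, say) will have its text reported twice, once for each level, and deleted text under revision marks is stored in w:delText rather than w:t, so this sketch leaves it out.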
