Missing document text when using python-docx

I am using python-docx 0.8.6 and Python 3.6 to perform a simple search/replace operation.

I'm having a problem where not all of the document's text appears when iterating over doc.paragraphs.

For debugging I have tried:

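from docx import Document

doc = Document(input_file)  # input_file: path to the .docx being read
fullText = []
for para in doc.paragraphs:  # yields only the top-level body paragraphs
    fullText.append(para.text)
print('\n'.join(fullText))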

This only seems to print about half of the file's contents.

There are no tables or special formatting in the file. Is there any reason why so much of the document's contents cannot be read by python-docx?

Edit: the missing text is contained within a mail merge field, if that makes any difference.

The mail merge field does make a difference. Unfortunately, python-docx is not sophisticated enough to know which "container" elements hold displayable text and which do not. So it only reports paragraphs (and tables) that are at the "top" level.

The same limitation applies to revision marks, for example. A tracked change stores two or more pieces of text, of which only one is displayed, depending on the revision marks setting (show original, show latest after edits, etc.).

The only way around it with python-docx is to navigate the XML yourself, although some of the domain objects in python-docx, such as Paragraph, can be handy once you've gotten hold of the elements you want.

