Iterate through Table of Contents in docx using python-docx

I have a document with an auto-generated table of contents at the beginning, and I would like to parse that table of contents. Is this possible using python-docx? If I iterate through doc.paragraphs and read each paragraph's text, the text that is in the table of contents does not show up.

I tried the following: iterating through the paragraphs and checking whether paragraph.style.name is "toc 1", which tells me that I am in a ToC. But I am unable to get the actual text. I tried this:

if para.style.name == "toc 1":
    print(para.text)

But para.text is giving me an empty string. Why would this be the case?

Thanks

I believe you'll find that the actual generated contents of the TOC are "wrapped" in a non-paragraph element. python-docx won't get you there directly, as it only finds paragraphs that are direct children of the w:document/w:body element.

To get at these you'll need to go down to the lxml level, using python-docx to get you as close as possible. You can get to (and print) the body element with this:

from docx import Document

document = Document('my-doc.docx')
body_element = document._body._body
print(body_element.xml)  # this will be big if your document is

From there you can identify the specific XML location of the parts you want and use lxml/XPath to access them. Then you can wrap them in python-docx Paragraph objects for ready access:

from docx.text.paragraph import Paragraph

ps = body_element.xpath('./w:something/w:something_child/w:p')
paragraphs = [Paragraph(p, None) for p in ps]

This is not an exact recipe and will require some research on your part to work out what w:something etc. are, but if you want it badly enough to surmount those hurdles, this approach will work.
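As a possible starting point for that research: Word often wraps an automatically generated TOC in a structured document tag (w:sdt), with the entries sitting under w:sdtContent. That is an assumption about your file, not something confirmed in this thread, so check it against the printed body XML. The standard-library sketch below uses a made-up, heavily simplified fragment to show why a direct-child lookup misses such paragraphs while a path through the wrapper finds them.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical, heavily simplified body: one TOC entry wrapped in w:sdt,
# followed by an ordinary paragraph. Real documents contain far more markup.
BODY_XML = f"""
<w:body xmlns:w="{W}">
  <w:sdt>
    <w:sdtContent>
      <w:p><w:r><w:t>1 Introduction</w:t></w:r></w:p>
    </w:sdtContent>
  </w:sdt>
  <w:p><w:r><w:t>Body text</w:t></w:r></w:p>
</w:body>
"""


def paragraph_text(p):
    """Join the text of every w:t element found anywhere inside a w:p element."""
    return "".join(t.text or "" for t in p.iterfind(".//w:t", NS))


body = ET.fromstring(BODY_XML)

# Paragraphs that are direct children of w:body only.
direct = [paragraph_text(p) for p in body.findall("./w:p", NS)]

# Paragraphs nested inside the (assumed) w:sdt/w:sdtContent wrapper.
wrapped = [paragraph_text(p) for p in body.findall("./w:sdt/w:sdtContent/w:p", NS)]

print(direct)
print(wrapped)
```

If your document turns out to have this shape, the same path ('./w:sdt/w:sdtContent/w:p') is a candidate to pass to body_element.xpath() in place of the w:something placeholder.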

Once you get it working, posting your exact solution may be of help to others on search.

Since most of the solution is hidden in the comment section, and it took me a while to figure out exactly what the OP did and how scanny's answer changed what he was doing, I'll just post my solution here. It is only what is written in the comment section of scanny's answer. I don't fully understand how the code works, so if somebody wants to edit my answer, please feel free to do so.

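import docx

# open the docx file with python-docx (raw string so the backslashes are not treated as escapes)
document = docx.Document(r"path\to\file.docx")
# get the underlying XML body element
body_elements = document._body._body
# collect every <w:r> (run) element at any depth below the body
rs = body_elements.xpath('.//w:r')
# keep the text of runs whose character style is "Hyperlink" (TOC entries)
table_of_content = [r.text for r in rs if r.style == "Hyperlink"]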
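For readers who, like the author of this answer, want to see what the filter above is doing: the sketch below reproduces the same idea with only the standard library on a made-up XML fragment. It assumes the TOC runs sit inside a w:hyperlink element and carry a w:rStyle of "Hyperlink"; both are assumptions about the file, and the helper names run_style and run_text are hypothetical. Because './/w:r' matches runs at any depth, it reaches them regardless of what they are nested in.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical, simplified TOC paragraph: the runs are nested in w:hyperlink
# and each one carries the character style "Hyperlink".
BODY_XML = f"""
<w:body xmlns:w="{W}">
  <w:p>
    <w:hyperlink>
      <w:r><w:rPr><w:rStyle w:val="Hyperlink"/></w:rPr><w:t>1</w:t></w:r>
      <w:r><w:rPr><w:rStyle w:val="Hyperlink"/></w:rPr><w:t>Introduction</w:t></w:r>
    </w:hyperlink>
  </w:p>
  <w:p><w:r><w:t>Body text</w:t></w:r></w:p>
</w:body>
"""


def run_style(r):
    """Return the w:val of the run's w:rStyle element, or None if it has none."""
    r_style = r.find("./w:rPr/w:rStyle", NS)
    return None if r_style is None else r_style.get(f"{{{W}}}val")


def run_text(r):
    """Join the text of the w:t children of a run."""
    return "".join(t.text or "" for t in r.findall("./w:t", NS))


body = ET.fromstring(BODY_XML)

# Same shape as the list comprehension above: all runs, filtered by style.
table_of_content = [
    run_text(r) for r in body.iterfind(".//w:r", NS) if run_style(r) == "Hyperlink"
]
print(table_of_content)
```

One caveat with this design: the filter keys on the character style alone, so any ordinary hyperlink elsewhere in the document that uses the same style would be picked up as well.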

table_of_content will be a flat list in which each entry's numbering comes first as one item, followed by its title as the next item.
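If the list really does alternate strictly between numbering and title (an assumption; unnumbered headings or page-number runs would break it), the items can be grouped into pairs. The helper name pair_toc_items and the sample data below are hypothetical.

```python
def pair_toc_items(items):
    """Group a flat [number, title, number, title, ...] list into (number, title) tuples."""
    if len(items) % 2:
        raise ValueError("expected an even number of items (number/title pairs)")
    # Even positions hold the numbering, odd positions hold the titles.
    return list(zip(items[0::2], items[1::2]))


# Hypothetical sample standing in for the real table_of_content list.
sample = ["1", "Introduction", "1.1", "Scope"]
entries = pair_toc_items(sample)
print(entries)
```

Raising on an odd-length list is a deliberate choice here: it surfaces a broken alternation early instead of silently dropping the last item.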
