
How to extract heading numbers from a document using python-docx?

I'm using the python-docx library to extract data from a docx document, but I also want the heading number / paragraph number. I want to build a proofreading tool that needs this information, but I can find it neither in the text nor in the style of the paragraph. Is there some way to extract it? I can just loop through the tags of the same heading number, but what if the user didn't use proper heading tags while writing the document? Or what if they chose not to use the default Word convention of 1, 1.1, 1.1.1, a and used something of their own instead?

[Image: default convention]

Basically I want a way to extract these numbers: 2, 2.1, 2.2.1, (a). How can I do it?

I tried something similar, but for multilingual documents.

First, observe the headings (1, 2, 3 ...) and subheadings (2.1, 2.2 ...) and try to identify what they have in common. They might share some of the patterns below:

  1. Bold text
  2. Font and size
  3. Headings start with an integer (2) and subheadings with a decimal number (2.1)
  4. The delimiter ('\t' or a space) between the number and the text

Observe these things and try to frame the pattern. Using a regex, we can then extract what is required.
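One more thing worth observing is whether the numbers are in the paragraph text at all. When Word generates the numbering automatically, the visible "2.1" is not stored in the text; the paragraph only carries a `w:numPr` element with a list id (`w:numId`) and a level (`w:ilvl`), and Word computes the label from `word/numbering.xml`. The following is a minimal sketch, using only the standard library, that lists paragraphs carrying such an element. The function name `numbered_paragraphs` is hypothetical, treating a missing `w:ilvl` as level 0 is an assumption, and numbering inherited from a paragraph style is not followed.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def numbered_paragraphs(docx_file):
    """Yield (text, num_id, level) for paragraphs with a direct w:numPr.

    Hypothetical helper: it reads word/document.xml straight from the
    .docx zip and ignores numbering that comes from a paragraph style.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    for p in root.iter(f"{{{W}}}p"):
        num_pr = p.find("w:pPr/w:numPr", NS)
        if num_pr is None:
            continue  # no automatic numbering applied directly
        ilvl = num_pr.find("w:ilvl", NS)
        num_id = num_pr.find("w:numId", NS)
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        yield (
            text,
            num_id.get(f"{{{W}}}val") if num_id is not None else None,
            # Assumption: a missing w:ilvl is treated as the top level (0).
            int(ilvl.get(f"{{{W}}}val")) if ilvl is not None else 0,
        )
```

If this yields nothing for paragraphs that clearly show a number in Word, the numbers were either typed by hand (and a regex on the text is the right tool) or come from the style definition.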

Here is the regex, which will satisfy your case, even for multilingual text.

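# para.text is the text of one python-docx paragraph (see the full loop below)
headings = regex.search(r"\d+\.\t(\p{Lu}+([\s]+)?)+", para.text)
subHeadings = regex.search(r"\d+\.\d+\t\p{Lu}(\p{Ll}+)+", para.text)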

Python's built-in regex module (re) does not support Unicode property classes such as \p{Lu} and \p{Ll}. So use the third-party regex module (pip install regex) instead, especially if your text is multilingual.
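The two patterns above cover "N." and "N.N" prefixes. As a rough sketch of how the same idea could be stretched to the numbers in the question (2, 2.1, 2.2.1, (a)), the snippet below looks only at the number prefix, so it needs no Unicode property classes and gets by with the built-in `re`. The pattern and the name `split_heading` are illustrative assumptions, not something python-docx provides, and they only help when the number was typed into the paragraph text.

```python
import re

# Hypothetical pattern: a dotted decimal number (2, 2.1, 2.2.1, with an
# optional trailing dot) or a parenthesised letter such as (a), followed
# by a tab or spaces and then the heading text.
NUMBER_PREFIX = re.compile(
    r"^\s*(?P<num>\d+(?:\.\d+)*\.?|\([A-Za-z]\))[\t ]+(?P<title>\S.*)"
)


def split_heading(text):
    """Return (number, level, title), or None if there is no number prefix."""
    m = NUMBER_PREFIX.match(text)
    if m is None:
        return None
    num = m.group("num").rstrip(".")
    # For dotted numbers the level is the number of components;
    # "(a)" carries no depth of its own, so its level is left as None.
    level = None if num.startswith("(") else num.count(".") + 1
    return num, level, m.group("title")
```

A plain sentence such as "2020 was a good year" also matches, which is why the formatting clues from the list above (bold, font, size) are still worth combining with the regex.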

import regex
from docx import Document

doc = Document("<<Your doc file name here>>")

# Iterate through the paragraphs (in Word, everything is a paragraph,
# even the blank lines).
for index, para in enumerate(doc.paragraphs):
    # Skip the blank paragraphs
    if para.text:
        headings = regex.search(r"\d+\.\t(\p{Lu}+([\s]+)?)+", para.text, regex.UNICODE)
        subHeadings = regex.search(r"\d+\.\d+\t\p{Lu}(\p{Ll}+)+", para.text, regex.UNICODE)
        if headings:
            for run in para.runs:
                # Check for bold or italic at the run level.
                if run.bold:
                    print("Bold Heading :", headings.group(0))
                if run.italic:
                    print("Italic Heading :", headings.group(0))
        if subHeadings:
            for run in para.runs:
                # Check for bold or italic at the run level.
                if run.bold:
                    print("Bold subHeadings :", subHeadings.group(0))
                if run.italic:
                    print("Italic subHeadings :", subHeadings.group(0))

Note: Bold and italic are not always set at the run level. If run.bold or run.italic comes back as None, check the style and the paragraph level as well.
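A minimal sketch of that fallback, assuming the python-docx attributes `run.bold`, `run.style`, `para.style`, `style.font.bold` and `style.base_style`: python-docx reports bold as True, False, or None, where None means the value is inherited. The helper name `effective_bold` is hypothetical, and the lookup is a simplification; Word's full resolution rules (document defaults, for example) are more involved.

```python
def effective_bold(run, para):
    """Resolve bold for a run by falling back to its styles.

    Hypothetical helper, not part of python-docx. Order checked:
    the run itself, the run's character style, then the paragraph
    style, following each style's base_style chain.
    """
    if run.bold is not None:
        return run.bold
    for style in (run.style, para.style):
        while style is not None:
            if style.font.bold is not None:
                return style.font.bold
            style = style.base_style
    # Assumption: nothing set anywhere is treated as "not bold".
    return False
```

In the loop above, `if run.bold:` could then become `if effective_bold(run, para):`; the same shape works for italic with `font.italic`.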
