
Using textract to get text from a docx in Python, but the section numbers are dropped

I'm trying to get the section number in front of each paragraph. The odd thing is that when I use textract to extract text from some .docx files, the numbers are dropped. Is there a way to get these numbers back? For example: 1.Term. XXXXXXXXXXXXXXend

I only get 'Term. XXXXXXXXXXXXXXend' in the text. My guess is that when section numbers are entered with Word's numbering feature, they are ignored.

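import textract

# url holds the path to the .docx file
text = textract.process(url, extension='docx')  # returns bytes
strText = text.decode("utf8")
children = strText.split('\n\n')  # split into blocks on blank lines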

Thanks in advance.

Yes, your hypothesis is correct. The section numbers are not stored as text in the document; Word computes them when it displays the document.
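To see this for yourself, you can look inside the .docx, which is a zip archive containing word/document.xml. The sketch below uses only the standard library; the function name list_numbering_refs is made up for illustration. It assumes the numbering was applied directly to the paragraph (a w:numPr element); numbering that comes from a style definition would not show up here.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def list_numbering_refs(docx_file):
    """Return (text, numId, ilvl) for each paragraph.

    numId and ilvl are None when the paragraph carries no direct
    numbering reference. Hypothetical helper, not part of textract.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    rows = []
    for para in root.iter(f"{{{W}}}p"):
        # The visible text lives in w:t elements; no number appears there.
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))

        num_id = ilvl = None
        num_pr = para.find("w:pPr/w:numPr", NS)
        if num_pr is not None:
            num_id_el = num_pr.find("w:numId", NS)
            ilvl_el = num_pr.find("w:ilvl", NS)
            if num_id_el is not None:
                num_id = num_id_el.get(f"{{{W}}}val")
            if ilvl_el is not None:
                ilvl = ilvl_el.get(f"{{{W}}}val")
        rows.append((text, num_id, ilvl))
    return rows
```

A numbered paragraph shows up with a list id and a nesting level, but its text is still just 'Term. ...'. That is why a plain text extractor has nothing to print for the number.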

The only way to get them is to keep track of them yourself, based on the style of those paragraphs, such as 'Heading 1', 'Heading 2', and so on. Numbering can be assigned in other ways, which makes this more difficult, but it is often done with headings because that is so easy for the author.
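The bookkeeping described above could look roughly like this. It is a minimal sketch, not what Word does internally: it assumes plain decimal numbering starting at 1, style names of the form 'Heading N', and that you already have (style name, text) pairs for each paragraph (for example from python-docx, using paragraph.style.name and paragraph.text). The function name number_headings is hypothetical.

```python
import re


def number_headings(paragraphs):
    """Prefix heading paragraphs with a computed section number.

    `paragraphs` is an iterable of (style_name, text) pairs. Styles named
    'Heading 1' .. 'Heading 9' are numbered; everything else passes through.
    Assumes decimal numbering that starts at 1 on every level.
    """
    counters = []  # counters[i] is the current number at level i + 1
    numbered = []
    for style_name, text in paragraphs:
        match = re.fullmatch(r"Heading ([1-9])", style_name)
        if match is None:
            numbered.append(text)
            continue

        level = int(match.group(1))
        # A heading at this level resets all deeper levels.
        del counters[level:]
        # Pad skipped levels with 0 so the list has one slot per level.
        while len(counters) < level:
            counters.append(0)
        counters[level - 1] += 1

        label = ".".join(str(n) for n in counters)
        numbered.append(f"{label}. {text}")
    return numbered
```

For example, number_headings([("Heading 1", "Term"), ("Normal", "body"), ("Heading 2", "Scope")]) gives ['1. Term', 'body', '1.1. Scope']. If the author used custom number formats or restarted lists, the output will not match what Word shows.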

