textract drops the section numbers when extracting text from a docx in Python

Question

I'm trying to get the section number that precedes each paragraph. The weird thing is that when I use textract to extract text from some docx files, the numbers are missing. Is there a way to get these numbers back? Example: 1.Term. XXXXXXXXXXXXXXend

I only get 'Term. XXXXXXXXXXXXXXend' in the text. My guess is that when section numbers are entered with Word's numbering feature, they are ignored.

import textract

# url holds the location of the .docx file
text = textract.process(url, extension='docx')
strText = text.decode("utf8")
children = strText.split('\n\n')  # one entry per paragraph

Thanks in advance

Answer 1

Yes, your hypothesis is correct. The section numbers are not actually stored in the document text; Word computes them and displays them when it renders the document.

The only way to get them is to keep track of them yourself, based on the style of those paragraphs, which is often something like 'Heading 1', 'Heading 2', etc. Numbering can also be assigned in other ways, which makes this more difficult, but it's often done with heading styles since that's so easy for the author.
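
As a rough illustration of keeping track yourself, here is a minimal sketch. It assumes you already have each paragraph as a (style name, text) pair and that numbered sections use styles named 'Heading 1', 'Heading 2' and so on. The function name number_headings and the sample data are made up for this example. With python-docx, such pairs could be read from paragraph.style.name and paragraph.text for each paragraph in docx.Document(path).paragraphs. The sketch also assumes plain decimal numbering such as 1, 1.1, 1.2; the format Word actually displays depends on the document's numbering definitions.

```python
import re


def number_headings(paragraphs, prefix="Heading "):
    """Yield (number, text) for each (style_name, text) pair.

    Paragraphs whose style is not '<prefix><level>' get an empty number.
    The 'Heading N' naming is an assumption about the document.
    """
    counters = []  # counters[i] is the current count at level i + 1
    pattern = re.compile(re.escape(prefix) + r"(\d+)")
    for style_name, text in paragraphs:
        match = pattern.fullmatch(style_name)
        if match is None:
            yield "", text
            continue
        level = int(match.group(1))
        # Make sure there is a counter for every level up to this one.
        while len(counters) < level:
            counters.append(0)
        # A heading at this level resets all deeper levels.
        del counters[level:]
        counters[level - 1] += 1
        yield ".".join(str(n) for n in counters), text


# Hypothetical sample data standing in for paragraphs read from a docx.
sample = [
    ("Heading 1", "Term."),
    ("Normal", "Body text of the first section."),
    ("Heading 2", "Renewal."),
    ("Heading 2", "Notice."),
    ("Heading 1", "Fees."),
]

numbered = list(number_headings(sample))
for number, text in numbered:
    print(f"{number} {text}".strip())
```

If the document applies numbering through list formatting rather than heading styles, the style names will not be enough and this approach would need a different signal.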

solution1
1 2017-06-27 04:57:47
