Use textract to get text from a docx, but the numbers are ignored in Python

I'm trying to get the section number before each paragraph. The odd thing is that when I use textract to extract text from some docx files, the numbers are dropped. Is there a way to get these numbers back? For example, the document contains: 1.Term. XXXXXXXXXXXXXXend

I only get 'Term. XXXXXXXXXXXXXXend' in the text output. My guess is that when these section numbers are entered with Word's numbering feature, they are ignored.

import textract

# textract returns bytes; url is the path to the .docx file
text = textract.process(url, extension='docx')
strText = text.decode("utf8")
children = strText.split('\n\n')  # split on blank lines

Thanks in advance.

Yes, your hypothesis is correct. The section numbers are not stored as text in the document; Word computes and displays them when it renders the document.

The only way to get them is to keep track of them yourself, based on the style of those paragraphs, such as 'Heading 1', 'Heading 2', and so on. Numbers can be assigned in other ways, which makes this more difficult, but it is often done with headings because that is so easy for the author.
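As a rough illustration of keeping track yourself, here is a minimal sketch that reads the paragraphs straight from the docx archive using only the standard library and recomputes a number for each heading. It assumes the headings use the style ids 'Heading1', 'Heading2', and so on (the ids behind 'Heading 1' and 'Heading 2' in Word's default English template); localized or custom templates may use other ids. It does not handle numbers that come from Word's list numbering rather than heading styles. The function names are made up for this example.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def heading_level(style_id):
    """Return 1 for 'Heading1', 2 for 'Heading2', ...; None for anything else.

    Assumption: headings use the default English style ids.
    """
    m = re.fullmatch(r"Heading\s?(\d+)", style_id or "")
    return int(m.group(1)) if m else None


def read_paragraphs(path):
    """Yield (style_id, text) for each w:p element in word/document.xml."""
    with zipfile.ZipFile(path) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    for p in root.iter(W + "p"):
        style = p.find(f"{W}pPr/{W}pStyle")
        style_id = style.get(W + "val") if style is not None else None
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        yield style_id, text


def add_section_numbers(paragraphs):
    """Prefix heading paragraphs with a recomputed number such as '1.2.'."""
    counters = []  # counters[i] is the current number at heading level i + 1
    out = []
    for style_id, text in paragraphs:
        level = heading_level(style_id)
        if level is None:
            out.append(text)  # body text is passed through unchanged
            continue
        # Pad with zeros if a level was skipped, then reset deeper levels.
        while len(counters) < level:
            counters.append(0)
        del counters[level:]
        counters[level - 1] += 1
        out.append(".".join(map(str, counters)) + ". " + text)
    return out
```

Calling add_section_numbers(read_paragraphs(url)) would then give one string per paragraph, with headings prefixed by their recomputed numbers. The result only matches what Word displays if the document's numbering scheme really is plain decimal numbering tied to those heading styles.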
