Python-docx ignoring non-unicode Symbols like 'greater than or equal to'

When reading a Word docx that contains tables and text with symbols into Python with python-docx, the symbols all just get dropped. The symbols were all created with the normal Insert Symbol steps. The dialog says the character is from the font Symbol, character code 179, from Symbol (decimal).


Python-docx just shows it as ''. The same happens for the 'plus or minus' symbol to the left of it.

When reading the text from the paragraph (not the ones in a table) I use the following code:

import docx

def listText():
    test = docx.Document('Problem.docx')
    testp = test.paragraphs[0]  # The only paragraph in the test docx
    stringThatShouldHaveSymbol = testp.text

    print(stringThatShouldHaveSymbol)

    return stringThatShouldHaveSymbol

This returns only '' from a document that contains nothing but those symbols. If the document contains the symbol followed by 10, it just returns 10.

I also tried this XML approach, but even that returned ''.

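import docx
from xml.etree.ElementTree import XML

# Fully qualified tag name of a WordprocessingML text element (w:t)
WORD_NAMESPACE = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
TEXT = WORD_NAMESPACE + "t"

def get_accepted_text(p):
    """Return text of a paragraph after accepting all changes"""
    xml = p._p.xml
    if "w:del" in xml or "w:ins" in xml:
        tree = XML(xml)
        # Element.getiterator() was removed in Python 3.9; iter() replaces it
        runs = (node.text for node in tree.iter(TEXT) if node.text)
        return "".join(runs)
    else:
        return p.text

doc = docx.Document('Problem.docx')
for p in doc.paragraphs:
    print(p.text)
    print("---")
    print(get_accepted_text(p))
    print("=========")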

How can I extract the data from these documents? Is there a way to programmatically convert these symbols (decimal) to Unicode (hex)?

Try this:

  1. Click on the symbol drop-down and select (normal text).
  2. Now select your special symbol.

If you now read the docx file, you should get your symbol.

Not sure why the Symbol font doesn't work. In Arial, character code 179 is a superscript 3.
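If re-inserting every symbol by hand is not practical, the paragraph XML can be read directly. As far as I can tell, Word stores a character inserted from a symbol font as a w:sym element (with w:font and w:char attributes) rather than as w:t text, and python-docx's paragraph text does not include those elements. Code 179 also only means "greater than or equal to" in the Symbol font's own encoding; as a Unicode code point it is U+00B3, superscript three, so the code has to be translated through a font-specific table. The sketch below is one way to do that, not a python-docx feature: the names text_with_symbols and SYMBOL_FONT_TO_UNICODE are made up for illustration, the table is a small hand-written subset, and the sample XML is an assumption about what such a paragraph looks like.

```python
from xml.etree.ElementTree import XML

# Namespace prefix used by WordprocessingML elements and attributes
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Hypothetical, partial map from Symbol-font character codes to Unicode.
# Extend it with whatever symbols your documents actually use.
SYMBOL_FONT_TO_UNICODE = {
    0xA3: "\u2264",  # less than or equal to
    0xB1: "\u00B1",  # plus or minus
    0xB3: "\u2265",  # greater than or equal to (decimal 179)
    0xB9: "\u2260",  # not equal to
}

def text_with_symbols(paragraph_xml):
    """Join w:t text and translated w:sym characters in document order."""
    parts = []
    for node in XML(paragraph_xml).iter():
        if node.tag == W + "t" and node.text:
            parts.append(node.text)
        elif node.tag == W + "sym":
            code = int(node.get(W + "char", "0"), 16)
            # Assumption: the code may be stored shifted into the
            # private use area (F0B3 instead of 00B3), so undo the shift.
            if code >= 0xF000:
                code -= 0xF000
            if node.get(W + "font") == "Symbol" and code in SYMBOL_FONT_TO_UNICODE:
                parts.append(SYMBOL_FONT_TO_UNICODE[code])
            else:
                parts.append("\uFFFD")  # placeholder for unmapped symbols
    return "".join(parts)

# Assumed shape of a paragraph holding "plus or minus", then
# "greater than or equal to", then the text 10.
sample = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:r><w:sym w:font="Symbol" w:char="F0B1"/></w:r>'
    '<w:r><w:sym w:font="Symbol" w:char="F0B3"/></w:r>'
    '<w:r><w:t>10</w:t></w:r>'
    '</w:p>'
)
print(text_with_symbols(sample))
```

With python-docx, the same function could be called as text_with_symbols(p._p.xml) for each paragraph p, reusing the p._p.xml access from the question. Unmapped symbols come back as U+FFFD so that they are visible in the output instead of being silently dropped.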
