
python-docx ignores non-Unicode symbols such as 'greater than or equal to'

When I read a Word .docx file that contains tables and text with symbols such as

[image: symbol]

into Python with python-docx, the symbols are all dropped. They were all inserted through the normal Insert Symbol steps. The dialog reports the font as Symbol and the character code as 179, from "Symbol (decimal)".

[image: Insert Symbol dialog]

python-docx just returns '' for it. The same happens for the 'plus or minus' symbol to its left.

To read the text from a paragraph (one that is not inside a table), I use the following code:

import docx

def listText():
    # Open the test document and read its only paragraph
    test = docx.Document('Problem.docx')
    testp = test.paragraphs[0]
    stringThatShouldHaveSymbol = testp.text

    print(stringThatShouldHaveSymbol)

    return stringThatShouldHaveSymbol

This returns only '' for a document that contains nothing but those symbols. If the symbol is followed by 10, it returns just 10.

I also tried this XML approach, but even that returned ''.

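import docx
from xml.etree.ElementTree import XML

WORD_NAMESPACE = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
TEXT = WORD_NAMESPACE + "t"  # fully qualified tag of <w:t> text elements

def get_accepted_text(p):
    """Return text of a paragraph after accepting all changes"""
    xml = p._p.xml
    if "w:del" in xml or "w:ins" in xml:
        # Collect only <w:t> text; deleted text lives in <w:delText> and is skipped
        tree = XML(xml)
        runs = (node.text for node in tree.iter(TEXT) if node.text)
        return "".join(runs)
    else:
        return p.text

doc = docx.Document('Problem.docx')
for p in doc.paragraphs:
    print(p.text)
    print("---")
    print(get_accepted_text(p))
    print("=========")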

How can I extract the data from these documents? Is there a way to programmatically convert these symbol codes (decimal) to Unicode (hex)?

Try this:

  1. In the symbol dialog, click on the font drop-down and select (normal text)
  2. Now select your special symbol

If you now read the docx file, you should get your symbol.

I'm not sure why the Symbol font doesn't work. In Arial, character 179 is a superscript 3 (³).
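
If re-inserting the symbols by hand is not practical, the conversion can probably be done in code. As far as I can tell, Word stores a character inserted from the Symbol font as a `<w:sym w:font="Symbol" w:char="F0B3"/>` element rather than as `<w:t>` text, which would explain why `paragraph.text` comes back empty. Below is a minimal sketch under that assumption. The function name `paragraph_text_with_symbols` and the `SYMBOL_TO_UNICODE` table are my own (hypothetical, and the table is deliberately partial); it uses only the standard library and works on the paragraph XML string.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Partial, hand-picked map from Symbol-font character codes to Unicode.
# Extend it with whatever symbols your documents actually use.
SYMBOL_TO_UNICODE = {
    0xA3: "\u2264",  # less than or equal to
    0xB1: "\u00B1",  # plus or minus (code 177)
    0xB3: "\u2265",  # greater than or equal to (code 179)
    0xB9: "\u2260",  # not equal to
}

def paragraph_text_with_symbols(paragraph_xml, unknown="?"):
    """Return paragraph text, translating <w:sym> elements where possible."""
    root = ET.fromstring(paragraph_xml)
    parts = []
    for node in root.iter():
        if node.tag == W + "t" and node.text:
            parts.append(node.text)
        elif node.tag == W + "sym":
            code = int(node.get(W + "char", "0"), 16)
            # Word often stores the code in the private use area (F0xx);
            # the low byte is the character code within the font.
            if 0xF000 <= code <= 0xF0FF:
                code -= 0xF000
            if node.get(W + "font") == "Symbol" and code in SYMBOL_TO_UNICODE:
                parts.append(SYMBOL_TO_UNICODE[code])
            else:
                parts.append(unknown)  # font or code not in the table
    return "".join(parts)
```

With python-docx you would call it as `paragraph_text_with_symbols(p._p.xml)` for each paragraph `p`. Note that `_p` is a private attribute, so this may break in future versions. Symbols from other fonts such as Wingdings would need their own tables.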
