How to identify superscripts and/or subscripts in text with Python

Question

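I have a document from which I need to extract, in Python, the strings that are formatted as superscript or subscript. I explored the docx library, which lets me add superscripts and subscripts, but I would like to know how to extract such strings. I have searched Google but could not find a good solution.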

from docx import Document
document = Document()

p = document.add_paragraph('Normal text with ')

super_text = p.add_run('superscript text')
super_text.font.superscript = True

p.add_run(' and ')

sub_text = p.add_run('subscript text')
sub_text.font.subscript = True

document.save('test.docx')

Answer 1

You can first convert the docx file to XML, then use regular expressions to capture the superscript and subscript values.
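For context, a .docx file is a ZIP archive whose body is stored in word/document.xml, and superscript or subscript formatting applied directly to a run appears as a w:vertAlign element inside that run's w:rPr properties. The sketch below shows that structure on a hand-written fragment and locates the marker with the standard-library ElementTree parser instead of a regular expression. SAMPLE_XML and find_scripts are illustrative names, not part of any library, and formatting inherited from a style is not detected.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written fragment standing in for word/document.xml (illustrative only).
SAMPLE_XML = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    '<w:r><w:t xml:space="preserve">E = mc</w:t></w:r>'
    '<w:r><w:rPr><w:vertAlign w:val="superscript"/></w:rPr><w:t>2</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> and H</w:t></w:r>'
    '<w:r><w:rPr><w:vertAlign w:val="subscript"/></w:rPr><w:t>2</w:t></w:r>'
    '<w:r><w:t>O</w:t></w:r>'
    '</w:p></w:body></w:document>'
)


def find_scripts(xml_content):
    """Return the text of superscript and subscript runs found in the XML."""
    root = ET.fromstring(xml_content)
    found = {"superscript": [], "subscript": []}
    # Every w:r element is a run: a stretch of text with uniform formatting.
    for run in root.iter(f"{{{W}}}r"):
        align = run.find("w:rPr/w:vertAlign", NS)
        if align is None:
            continue  # run has no direct vertical-alignment formatting
        kind = align.get(f"{{{W}}}val")
        if kind in found:  # skips other values such as "baseline"
            text = "".join(t.text or "" for t in run.findall("w:t", NS))
            found[kind].append(text)
    return found


print(find_scripts(SAMPLE_XML))
# → {'superscript': ['2'], 'subscript': ['2']}
```

For a real file, the same function could be given the bytes returned by zipfile.ZipFile(path).read('word/document.xml').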

Here is an example of the regular-expression approach:

import re
import zipfile


def get_docx_xml(path):
    """Take the path of a docx file and return its main document XML as a string."""
    # A .docx file is a ZIP archive; the body lives in word/document.xml.
    with zipfile.ZipFile(path) as document:
        # read() returns bytes, so decode before matching with str patterns.
        return document.read('word/document.xml').decode('utf-8')


def get_superscript_subscript(xml_content):
    """Return a dictionary whose values are lists of superscript and subscript matches."""
    # Each match is a tuple: (scripted text, the word that follows it).
    # The patterns expect a <w:lang> element right after <w:vertAlign>,
    # so they depend on how the document was saved.
    superscript = re.findall(r'<w:vertAlign w:val="superscript"/><w:lang w:val="[\S\s]*?"/></w:rPr><w:t>([\S]+)</w:t></w:r>[\s\S]*?<w:t xml:space="preserve">([\s]*[\S]*)[\s\S]*?</w:t></w:r>', xml_content)
    subscript = re.findall(r'<w:vertAlign w:val="subscript"/><w:lang w:val="[\S\s]*?"/></w:rPr><w:t>([\S]+)</w:t></w:r>[\s\S]*?<w:t xml:space="preserve">([\s]*[\S]*)[\s\S]*?</w:t></w:r>', xml_content)
    return {"superscript": superscript, "subscript": subscript}


if __name__ == '__main__':
    # Replace the placeholder with the path to your docx file.
    xml_content = get_docx_xml('<docx_file_path>')
    superscripts_subscripts = get_superscript_subscript(xml_content)
    print(superscripts_subscripts)

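The output will look like this: a dictionary whose values are lists of tuples, where the first item of each tuple is the superscript/subscript and the second is the word that follows it.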

{'subscript': [('28', ')'), ('28', 'score'), ('28', 'are'), ('28', 'sum'), ('28', 'and'), ('28', 'score'), ('28', ')')], 'superscript': [('28', ')'), ('28', 'score'), ('28', 'are'), ('28', 'sum'), ('28', 'and'), ('28', 'score'), ('28', ')')]}
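Since the question already uses python-docx, the formatting can probably also be read back through the same run.font.superscript and run.font.subscript properties that were used to set it. The following is a minimal sketch under that assumption; scripts_from_document and scripts_from_file are hypothetical helper names, and only body paragraphs are visited (text inside tables, headers, or footers would need extra handling).

```python
def scripts_from_document(document):
    """Collect the text of superscript and subscript runs from a Document."""
    found = {"superscript": [], "subscript": []}
    for paragraph in document.paragraphs:
        for run in paragraph.runs:
            # The properties are tri-state: True, False, or None (inherited
            # from the style), so only an explicit True is counted here.
            if run.font.superscript:
                found["superscript"].append(run.text)
            elif run.font.subscript:
                found["subscript"].append(run.text)
    return found


def scripts_from_file(path):
    """Open a .docx file with python-docx and collect its scripted runs."""
    # Imported here so python-docx (pip install python-docx) is only
    # required when a file is actually read.
    from docx import Document
    return scripts_from_document(Document(path))
```

Calling scripts_from_file('test.docx') on the file saved in the question should return a dictionary with the text of the superscript run in one list and the text of the subscript run in the other.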

Answer 1
0 2018-03-20 08:25:45