Extract parenthesized acronyms and abbreviations based on letter count and length

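I realize this has already been addressed here (e.g., "Retrieve definition for parenthesized abbreviation, based on letter count"). However, I hope this question is different enough.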

I want to extract parenthesized acronyms and abbreviations from a given string.

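import re

def extract_acronyms_abbreviations(text):
    eaa = {}

    # Treat every parenthesized span as a candidate abbreviation.
    for match in re.finditer(r"\((.*?)\)", text):
        start_index = match.start()
        abbr = match.group(1)
        size = len(abbr)
        # Take as many preceding words as the abbreviation has characters.
        words = text[:start_index].split()[-size:]
        definition = " ".join(words)

        eaa[abbr] = definition

    return eaa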

But the function above treats both (such as Apple's Siri, Amazon's Alexa, or Google's Voice Assistant) and (MSWC) as acronyms. It treats everything inside parentheses as an acronym.

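In my case, I want to extract all abbreviations and acronyms where the acronym is in capital letters and the acronym inside the parentheses is fewer than 8 characters long. I also need to include one more word in the definition when there is an extra one (presumably a connecting word such as "and", as in the RIM example).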

Sample text:

text = """Today we are extremely excited to announce the initial release of the Multilingual Spoken Words Corpus (MSWC), a large and growing audio dataset of spoken words in 50 different languages. These languages are collectively spoken by over 5 billion people and for most languages, this is the first publicly available, free-to-use dataset for training voice interfaces. It is licensed under CC-BY 4.0 to inspire academic research and commercial work in keyword spotting, spoken term search, and other applications that can benefit people across the planet. Our ultimate goal is to make keyword spotting voice-based interfaces available for any keyword in any language.Voice-based interaction is already democratizing access to technology. For example, keyword spotting is a common application in many smart devices (such as Apple’s Siri, Amazon’s Alexa, or Google’s Voice Assistant).Resource and Information Management (RIM)"""

Output

extract_acronyms_abbreviations(text)
{'MSWC': 'Multilingual Spoken Words Corpus',
 'such as Apple’s Siri, Amazon’s Alexa, or Google’s Voice Assistant': 'It is licensed under CC-BY 4.0 to inspire academic research and commercial work in keyword spotting, spoken term search, and other applications that can benefit people across the planet. Our ultimate goal is to make keyword spotting voice-based interfaces available for any keyword in any language.Voice-based interaction is already democratizing access to technology. For example, keyword spotting is a common application in many smart devices',
'RIM': 'and Information Management'}

Desired output

{'MSWC': 'Multilingual Spoken Words Corpus',
'RIM': 'Resource and Information Management'}

You can change the regular expression to r"\(([A-Z]{1,7})\)". This matches only the uppercase letters A-Z and ensures the acronym is 1 to 7 characters long.
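The regex change alone filters out the long parenthetical, but it does not recover "Resource and Information Management" for RIM. Here is a minimal sketch of one way to combine both requirements. It assumes that connecting words such as "and" do not contribute a letter to the acronym; the CONNECTORS set, the function name extract_acronyms, and the shortened sample string are hypothetical and not part of the question.

```python
import re

# Assumption: these words never contribute a letter to an acronym.
CONNECTORS = {"and", "of", "the", "for", "in"}

def extract_acronyms(text):
    result = {}
    # Only uppercase acronyms of 1 to 7 letters inside parentheses.
    for match in re.finditer(r"\(([A-Z]{1,7})\)", text):
        abbr = match.group(1)
        # Tokenize on letters only, so "Assistant).Resource" yields two words.
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        picked = []
        needed = len(abbr)
        # Walk backwards until one non-connector word per acronym letter is found.
        for word in reversed(words):
            if needed == 0:
                break
            picked.append(word)
            if word.lower() not in CONNECTORS:
                needed -= 1
        result[abbr] = " ".join(reversed(picked))
    return result

sample = ("the Multilingual Spoken Words Corpus (MSWC) is used in smart devices "
          "(such as Apple's Siri).Resource and Information Management (RIM)")
print(extract_acronyms(sample))
```

On this shortened sample the sketch returns the two entries from the desired output. It only counts words and does not check that the initials actually spell the acronym, and the letters-only tokenizer splits hyphenated or apostrophized words, so treat it as a starting point.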

