Extract parenthesized acronyms and abbreviations based on letter count and length

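I realize this has already been addressed here (for example, retrieving the definition of a parenthesized abbreviation based on letter count). However, I hope this question is different.

I want to extract parenthesized acronyms and abbreviations from a given string.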

import re

def extract_acronyms_abbreviations(text):
    eaa = {}

    # Match every parenthesized span, whatever it contains
    for match in re.finditer(r"\((.*?)\)", text):
        start_index = match.start()
        abbr = match.group(1)
        size = len(abbr)
        # Take as many preceding words as there are characters in the match
        words = text[:start_index].split()[-size:]
        definition = " ".join(words)

        eaa[abbr] = definition

    return eaa

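However, the function above treats both (such as Apple's Siri, Amazon's Alexa, or Google's Voice Assistant) and (MSWC) as acronyms. It treats everything inside the parentheses as an acronym.

In my case, I want to extract all abbreviations and acronyms where the acronym is uppercase and the acronym inside the parentheses is fewer than 8 characters long. I also need to include one more word in the definition if there is one.

Example text: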

text = """Today we are extremely excited to announce the initial release of the Multilingual Spoken Words Corpus (MSWC), a large and growing audio dataset of spoken words in 50 different languages. These languages are collectively spoken by over 5 billion people and for most languages, this is the first publicly available, free-to-use dataset for training voice interfaces. It is licensed under CC-BY 4.0 to inspire academic research and commercial work in keyword spotting, spoken term search, and other applications that can benefit people across the planet. Our ultimate goal is to make keyword spotting voice-based interfaces available for any keyword in any language.Voice-based interaction is already democratizing access to technology. For example, keyword spotting is a common application in many smart devices (such as Apple’s Siri, Amazon’s Alexa, or Google’s Voice Assistant).Resource and Information Management (RIM)"""

Output

extract_acronyms_abbreviations(text)
{'MSWC': 'Multilingual Spoken Words Corpus',
 'such as Apple’s Siri, Amazon’s Alexa, or Google’s Voice Assistant': 'It is licensed under CC-BY 4.0 to inspire academic research and commercial work in keyword spotting, spoken term search, and other applications that can benefit people across the planet. Our ultimate goal is to make keyword spotting voice-based interfaces available for any keyword in any language.Voice-based interaction is already democratizing access to technology. For example, keyword spotting is a common application in many smart devices',
'RIM': 'and Information Management'}

Desired output

{'MSWC': 'Multilingual Spoken Words Corpus',
'RIM': 'Resource and Information Management'}

You can change the regular expression to r"\(([A-Z]{1,7})\)". This matches only uppercase letters A-Z and ensures the acronym is 1 to 7 characters long.
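The stricter pattern filters out the long parenthetical, but on its own it does not recover the extra word in a definition such as "Resource and Information Management". One possible way to handle that is sketched below. It is an assumption-laden sketch, not the asker's or answerer's code: the names `CONNECTIVES`, `find_definition`, and `extract_acronyms` are hypothetical, and so is the rule that a small connective word between matched words is kept without using up an acronym letter. Words are tokenized with a regular expression rather than `split()` so that a run like "Assistant).Resource" still yields "Resource".

```python
import re

# Hypothetical list of connectives that may appear inside a definition
# without contributing a letter to the acronym.
CONNECTIVES = {"and", "of", "for", "the", "in", "on", "to"}


def find_definition(words, abbr):
    """Walk backwards through words, matching acronym letters from the end."""
    letters = list(abbr.lower())
    picked = []
    i = len(words) - 1
    while letters and i >= 0:
        word = words[i]
        if word[0].lower() == letters[-1]:
            # This word supplies the current acronym letter.
            letters.pop()
            picked.append(word)
        elif picked and word.lower() in CONNECTIVES:
            # Keep the extra word, but do not consume a letter.
            picked.append(word)
        else:
            return None
        i -= 1
    if letters:
        return None  # ran out of words before matching every letter
    return " ".join(reversed(picked))


def extract_acronyms(text):
    """Map each parenthesized uppercase acronym (1-7 letters) to its definition."""
    result = {}
    for match in re.finditer(r"\(([A-Z]{1,7})\)", text):
        abbr = match.group(1)
        # Tokenize on letters so punctuation glued to a word is dropped.
        words = re.findall(r"[A-Za-z][A-Za-z'’-]*", text[:match.start()])
        definition = find_definition(words, abbr)
        if definition is not None:
            result[abbr] = definition
    return result


print(extract_acronyms("Resource and Information Management (RIM)"))
# {'RIM': 'Resource and Information Management'}
```

In this sketch, an acronym whose letters cannot be matched against the preceding words is skipped rather than guessed at. Whether that is the right behavior depends on the data.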
