简体   繁体   中英

How to extract acronyms and abbreviations from pandas dataframe?

I have a pandas dataframe, column with text data

Sample data

s = """The MLCommons Association, an open engineering consortium dedicated to improving machine learning for everyone, today announced the general availability of the People's Speech Dataset and the Multilingual Spoken Words Corpus (MSWC). These trail-blazing and permissively licensed datasets advance innovation in machine learning research and commercial applications. Also today, the MLCommons Association is issuing a call for participation in the new DataPerf benchmark suite, which measures and encourages innovation in data-centric AI."""
k = """The MLCommons Association is a firm proponent of Data-Centric AI (DCAI), the discipline of systematically engineering the data for AI systems by developing efficient software tools and engineering practices to make dataset creation and curation easier. Our open datasets and tools like DataPerf concretely support the DCAI movement and drive machine learning innovation."""
j = """The key global provider of sustainable packaging solutions has now taken a significant step towards reaching these ambitions by signing two 10-year virtual Power Purchase Agreements (VPPA) with global renewable energy developer BayWa r.e covering its operations in Europe. The agreements form the largest solar VPPA for the packaging industry in Europe, as well as the first major solar VPPA by a Finnish company."""
a = """Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP)."""

I want to extract all unique acronyms and abbreviations in that text column.

So far I have a function to extract all acronyms and abbreviations from a given text.

def extract_acronyms_abbreviations(text):
    eaa = {}
    for match in re.finditer(r"\((.*?)\)", text):
        start_index = match.start()
        abbr = match.group(1)
        size = len(abbr)
        words = text[:start_index].split()[-size:]
        definition = " ".join(words)

        eaa[abbr] = definition


    return eaa
extract_acronyms_abbreviations(a)
{'FHH': 'family health history', 'NP': 'nurse practitioner'}

I want to apply/extract all unique acronyms and abbreviations from the text column.

Desired output

{'MSWC': 'Multilingual Spoken Words Corpus',
'DCAI': 'proponent of Data-Centric AI',
'VPPA': 'virtual Power Purchase Agreements',
'NP': 'nurse practitioner',
'FHH': 'family health history'}

Lets say df['text'] contains the text data you want to work with.

df["acronyms"] = df.apply(extract_acronyms_abbreviations)
# It will create a new columns containing dictionary return by your function.

Now create a master dict like

master_dict = dict()
for d in df["acronyms"].values:
    master_dict.update(d)
print(master_dict)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM